Welcome to the Scorpio project! This repository contains tools for training triplet networks with contrastive learning on diverse DNA sequences. The tools support tasks such as promoter detection, phylogenomic analysis, and antimicrobial resistance (AMR) detection, and can use any hierarchical information to improve downstream analysis.
The GitHub Wiki also contains tutorials to help you learn how to use Scorpio tools with real data.
For training the gene-taxa model with full genes, the data is available on Zenodo (record 12175913). Download and extract the full archive with the commands below. The extracted data can then be used with the trainer to train and save the model:
wget https://zenodo.org/api/records/12175913/files-archive -O scorpio-gene-taxa.zip
unzip scorpio-gene-taxa.zip -d scorpio-gene-taxa
Two further Zenodo records are also available:
- ScorpioBigDynamic: https://zenodo.org/record/14176840
- ScorpioBigEmbed: https://zenodo.org/records/14176823
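The same download pattern can be reused for the other Zenodo records. The sketch below is only an illustration: zenodo_archive_url and fetch_zenodo_record are hypothetical helper names that are not part of Scorpio, the output file and directory names are arbitrary choices, and it assumes the ScorpioBigDynamic and ScorpioBigEmbed records expose the same files-archive endpoint as record 12175913.

```shell
#!/bin/sh
# Hypothetical helpers (not shipped with Scorpio).

# Build the files-archive URL for a Zenodo record ID, following the URL
# pattern used for record 12175913.
zenodo_archive_url() {
  printf 'https://zenodo.org/api/records/%s/files-archive\n' "$1"
}

# Download one record as a zip file and extract it into a directory.
# $1 = Zenodo record ID, $2 = name used for the zip file and the directory.
fetch_zenodo_record() {
  fetch_record_id="$1"
  fetch_name="$2"
  wget "$(zenodo_archive_url "$fetch_record_id")" -O "${fetch_name}.zip" &&
    unzip "${fetch_name}.zip" -d "${fetch_name}"
}

# Example calls, commented out so that sourcing this file downloads nothing:
# fetch_zenodo_record 14176840 ScorpioBigDynamic
# fetch_zenodo_record 14176823 ScorpioBigEmbed
```

Keeping the URL construction in its own function makes it easy to print and inspect the address before starting a large download.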
Our pre-trained model, MetaBERTa, a version of BigBird trained on gene sequences, is available on Hugging Face as MetaBERTa-BigBird-Gene.
You can set up the environment for scorpio using either a conda environment or a Docker image. Follow the instructions below for your preferred method:
- Create a conda environment named scorpio based on the environment file in the src directory:
    conda env create -f src/environment.yml -n scorpio
- Activate the conda environment:
    conda activate scorpio
- Run the setup script to add scorpio to your PATH:
    ./src/setup.sh
- Alternatively, download and run the Docker image:
    docker pull eesilab/scorpio
    docker run -it eesilab/scorpio
After following the steps for either method, your environment should be set up and ready to use the scorpio tool.
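To confirm that the setup worked, you can check that the scorpio command is visible on your PATH. This is a minimal sketch: check_tool is a hypothetical helper name, not part of Scorpio, and it only verifies that the command can be found, not that it runs correctly. For the Docker method, it assumes you run the check inside the container and that the image places scorpio on the PATH.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Scorpio): report whether a command
# can be found on the current PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1 found at $(command -v "$1")"
  else
    echo "$1 not found on PATH" >&2
    return 1
  fi
}

# Example call, commented out so that sourcing this file has no side effects:
# check_tool scorpio
```

If the conda method reports that scorpio is not found, check that the environment is activated and that ./src/setup.sh has been run in the current shell session.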
We encourage community involvement and welcome contributions to the Scorpio project: report issues, submit pull requests, or join discussions.
Please email us for any inquiries.
Maintainer: Saleh Refahi (sr3622 at drexel dot edu)
Owner: Gail Rosen (gailr26 at drexel dot edu)
Citation:
@article{refahi2025enhancing,
title={Enhancing nucleotide sequence representations in genomic analysis with contrastive optimization},
author={Refahi, Mohammadsaleh and Sokhansanj, Bahrad A and Mell, Joshua C and Brown, James R and Yoo, Hyunwoo and Hearne, Gavin and Rosen, Gail L},
journal={Communications Biology},
volume={8},
number={1},
pages={517},
year={2025},
publisher={Nature Publishing Group UK London}
}