Deploying GenPipes locally on your server
This document explains how to deploy GenPipes locally on a bare-metal or virtual server. To install GenPipes locally in a container instead, refer to the "GenPipes in a container" section of the deployment guide.
For other ways to deploy and access GenPipes, refer to the GenPipes Deployment Options page.
Step 1: Download latest GenPipes sources
Visit the GenPipes Download Page to get a copy of the latest stable release. To fetch the most recent version of GenPipes instead, use the following command:
git clone git@bitbucket.org:mugqic/genpipes.git
Step 2: Set up environment variables
Add the following line to your $HOME/.bash_profile to set MUGQIC_PIPELINES_HOME to the path of your local copy. For example:
export MUGQIC_PIPELINES_HOME=/path/to/your/local/genpipes
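If this setup is scripted or repeated, it helps to avoid appending the same export line twice. The following is a minimal sketch, not part of GenPipes: `add_line_once` is a hypothetical helper name, and the path in the usage comment is the same placeholder used above.

```shell
# add_line_once LINE FILE
# Hypothetical helper (not shipped with GenPipes): appends LINE to FILE
# only if FILE does not already contain exactly that line.
add_line_once() {
    touch "$2"
    # -q: quiet, -x: match the whole line, -F: treat LINE as a fixed string
    grep -qxF -e "$1" "$2" || printf '%s\n' "$1" >> "$2"
}

# Example usage (placeholder path, adjust to your local copy):
# add_line_once 'export MUGQIC_PIPELINES_HOME=/path/to/your/local/genpipes' "$HOME/.bash_profile"
```

Open a new login shell, or source the profile file, for the change to take effect.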
Step 3: Accessing software modules and genomes needed for GenPipes
GenPipes was formerly known as MUGQIC Pipelines. Genomic analyses run with these pipelines require reference genomes and software modules, and the modules must be loaded in your shell environment. To do so, set the environment variable MUGQIC_INSTALL_HOME in your $HOME/.bash_profile to the directory where you want to install those resources, as follows:
## MUGQIC genomes and modules
export MUGQIC_INSTALL_HOME=/path/to/your/local/mugqic_resources
module use $MUGQIC_INSTALL_HOME/modulefiles
Installing available modules
Software tools and associated modules must be installed in $MUGQIC_INSTALL_HOME/software/ and $MUGQIC_INSTALL_HOME/modulefiles/. Default software/module installation scripts are already available in $MUGQIC_PIPELINES_HOME/resources/modules/.
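A quick way to confirm that the two directories named above exist is sketched below. `check_install_home` is a hypothetical helper, not a GenPipes command, and it only tests for the presence of the directories, not their contents.

```shell
# check_install_home ROOT
# Hypothetical helper (not shipped with GenPipes): reports which of the
# expected subdirectories are missing under ROOT.
# Returns 0 if both exist, 1 otherwise.
check_install_home() {
    missing=0
    for sub in software modulefiles; do
        if [ ! -d "$1/$sub" ]; then
            echo "missing: $1/$sub" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example usage:
# check_install_home "$MUGQIC_INSTALL_HOME" && echo "software and modulefiles directories found"
```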
Install new modules
To install a new software tool and its associated module semi-automatically, start by copying the module installation template:
cp $MUGQIC_PIPELINES_HOME/resources/modules/MODULE_INSTALL_TEMPLATE.sh $MUGQIC_PIPELINES_HOME/resources/modules/<my_software>.sh
Follow the instructions in the file $MUGQIC_PIPELINES_HOME/resources/modules/<my_software>.sh and modify it accordingly. Then run the script with no arguments. By default, it downloads and extracts the remote software archive, builds the software and creates the associated module, all in $MUGQIC_INSTALL_HOME_DEV if it is set.
$MUGQIC_PIPELINES_HOME/resources/modules/<my_software>.sh
If the script completes without errors, you are ready to install the <my_software> module in production. Use the command:
$MUGQIC_PIPELINES_HOME/resources/modules/<my_software>.sh MUGQIC_INSTALL_HOME
Note
Note that there is no $ before MUGQIC_INSTALL_HOME in the argument above.
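The two runs described above (a trial run with no arguments, then a production run) can be chained so that the production install only happens when the trial succeeds. This is a sketch under assumptions: `install_module_two_phase` is a hypothetical wrapper, not part of GenPipes, and it assumes the install script exits with a non-zero status on failure.

```shell
# install_module_two_phase SCRIPT
# Hypothetical wrapper (not shipped with GenPipes).
install_module_two_phase() {
    # Phase 1: trial run with no arguments.
    if ! "$1"; then
        echo "trial install failed; skipping production install" >&2
        return 1
    fi
    # Phase 2: production run. The argument is the variable NAME,
    # written without a leading $.
    "$1" MUGQIC_INSTALL_HOME
}

# Example usage (replace <my_software> with your script name):
# install_module_two_phase "$MUGQIC_PIPELINES_HOME/resources/modules/<my_software>.sh"
```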
Next, you need to check if the module is successfully installed and available for use by executing the following command:
module avail 2>&1 | grep mugqic/<my_software>/<version>
This completes the software module setup for GenPipes execution. Next you need to make sure all required reference genomes are available in your local deployment. Refer to the next section if you wish to install additional genomes.
Installing genomes
Reference genomes and annotations must be installed in the following directory:
$MUGQIC_INSTALL_HOME/genomes/
Default genome installation scripts are already available locally in the following directory:
$MUGQIC_PIPELINES_HOME/resources/genomes/
To install all of the available genomes bundled with the GenPipes package, use the following script:
$MUGQIC_PIPELINES_HOME/resources/genomes/install_all_genomes.sh
All species related files are in the following directory:
$MUGQIC_INSTALL_HOME/genomes/species/<species_scientific_name>.<assembly>/
For example, the directory hierarchy for the Homo sapiens assembly GRCh37 is as follows:
$MUGQIC_INSTALL_HOME/genomes/species/Homo_sapiens.GRCh37/
├── annotations/
│ ├── gtf_tophat_index/
│ ├── Homo_sapiens.GRCh37.dbSNP142.vcf.gz
│ ├── Homo_sapiens.GRCh37.dbSNP142.vcf.gz.tbi
│ ├── Homo_sapiens.GRCh37.Ensembl75.geneid2Symbol.tsv
│ ├── Homo_sapiens.GRCh37.Ensembl75.genes.length.tsv
│ ├── Homo_sapiens.GRCh37.Ensembl75.genes.tsv
│ ├── Homo_sapiens.GRCh37.Ensembl75.GO.tsv
│ ├── Homo_sapiens.GRCh37.Ensembl75.gtf
│ ├── Homo_sapiens.GRCh37.Ensembl75.ncrna.fa
│ ├── Homo_sapiens.GRCh37.Ensembl75.rrna.fa
│ ├── Homo_sapiens.GRCh37.Ensembl75.transcript_id.gtf
│ ├── Homo_sapiens.GRCh37.Ensembl75.vcf.gz
│ ├── ncrna_bwa_index/
│ └── rrna_bwa_index/
├── downloads/
│ ├── ftp.1000genomes.ebi.ac.uk/
│ ├── ftp.ensembl.org/
│ └── ftp.ncbi.nih.gov/
├── genome/
│ ├── bowtie2_index/
│ ├── bwa_index/
│ ├── Homo_sapiens.GRCh37.dict
│ ├── Homo_sapiens.GRCh37.fa
│ ├── Homo_sapiens.GRCh37.fa.fai
│ └── star_index/
├── Homo_sapiens.GRCh37.ini
└── log/
The assembly name is the one used by the download source. For example, “GRCh37” is the name used by Ensembl.
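The naming convention above, <species_scientific_name>.<assembly> with underscores in place of spaces, can be illustrated with a small helper. `species_dir_name` is a hypothetical function written for this example only.

```shell
# species_dir_name "SCIENTIFIC NAME" ASSEMBLY
# Hypothetical helper (not shipped with GenPipes): builds the species
# directory name by replacing spaces with underscores and appending
# the assembly name after a dot.
species_dir_name() {
    printf '%s.%s\n' "$(printf '%s' "$1" | tr ' ' '_')" "$2"
}

species_dir_name "Homo sapiens" GRCh37   # prints Homo_sapiens.GRCh37
```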
Each species directory contains a “.ini” file such as:
<scientific_name>.<assembly>.ini
Among other things, this “.ini” file lists the assembly synonyms. For example, Homo_sapiens.GRCh37.ini lists “hg19” as a synonym; its contents are shown below:
[DEFAULT]
scientific_name=Homo_sapiens
common_name=Human
assembly=GRCh37
assembly_synonyms=hg19
source=Ensembl
version=75
dbsnp_version=142
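To look up a single value such as the assembly synonyms from the shell, a small reader along these lines may be enough. `ini_get` is a hypothetical helper, and it assumes plain key=value lines with no spaces around the equals sign, as in the file shown above. It ignores section headers.

```shell
# ini_get KEY FILE
# Hypothetical helper (not shipped with GenPipes): prints the value of the
# first "KEY=value" line in FILE. Prints nothing if KEY is absent.
ini_get() {
    awk -F= -v k="$1" '$1 == k { sub(/^[^=]*=/, ""); print; exit }' "$2"
}

# Example usage:
# ini_get assembly_synonyms "$MUGQIC_INSTALL_HOME/genomes/species/Homo_sapiens.GRCh37/Homo_sapiens.GRCh37.ini"
```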
Install a new Genome
New genomes and annotations can be installed semi-automatically from Ensembl (vertebrate species), Ensembl Genomes (other species) or UCSC (genome and indexes only; no annotations).
Example - how to set up genomes for Chimpanzee:
Retrieve the species scientific name on Ensembl Genomes or UCSC:
Pan troglodytes
Retrieve the assembly name:
Ensembl: “CHIMP2.1.4”
UCSC: “panTro4”
Retrieve the source version:
Ensembl: “78”
UCSC: UCSC does not provide version numbers. Use the date of panTro4.2bit instead, formatted as “YYYY-MM-DD”: “2012-01-09”
Next, copy the template file to a new file named after the scientific name and assembly.
cp $MUGQIC_PIPELINES_HOME/resources/genomes/GENOME_INSTALL_TEMPLATE.sh $MUGQIC_PIPELINES_HOME/resources/genomes/<scientific_name>.<assembly>.sh
For example, in case of Ensembl, use the following command:
cp $MUGQIC_PIPELINES_HOME/resources/genomes/GENOME_INSTALL_TEMPLATE.sh $MUGQIC_PIPELINES_HOME/resources/genomes/Pan_troglodytes.CHIMP2.1.4.sh
In case of genomes from UCSC, use the following command to copy the genome install instructions:
cp $MUGQIC_PIPELINES_HOME/resources/genomes/GENOME_INSTALL_TEMPLATE.sh $MUGQIC_PIPELINES_HOME/resources/genomes/Pan_troglodytes.panTro4.sh
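The copy commands above differ only in the target file name, so they can be wrapped in one function that also refuses to overwrite a script you have already edited. This is a sketch: `new_genome_script` is a hypothetical helper, not part of GenPipes, and it assumes MUGQIC_PIPELINES_HOME is set as described in Step 2.

```shell
# new_genome_script SCIENTIFIC_NAME ASSEMBLY
# Hypothetical helper (not shipped with GenPipes): copies the genome
# install template to <scientific_name>.<assembly>.sh and prints the
# new path. Returns 1 if the variable is unset or the target exists.
new_genome_script() {
    if [ -z "$MUGQIC_PIPELINES_HOME" ]; then
        echo "MUGQIC_PIPELINES_HOME is not set" >&2
        return 1
    fi
    genomes_dir="$MUGQIC_PIPELINES_HOME/resources/genomes"
    target="$genomes_dir/$1.$2.sh"
    if [ -e "$target" ]; then
        echo "refusing to overwrite existing $target" >&2
        return 1
    fi
    cp "$genomes_dir/GENOME_INSTALL_TEMPLATE.sh" "$target" && printf '%s\n' "$target"
}

# Example usage:
# new_genome_script Pan_troglodytes CHIMP2.1.4
# new_genome_script Pan_troglodytes panTro4
```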
Next, you need to modify the following file:
$MUGQIC_PIPELINES_HOME/resources/genomes/<scientific_name>.<assembly>.sh
Note that ASSEMBLY_SYNONYMS can be left empty, but if you know that two assemblies are identical apart from the “chr” sequence prefixes, document it there.
The example below shows the modifications for Ensembl:
SPECIES=Pan_troglodytes # With "_"; no space!
COMMON_NAME=Chimpanzee
ASSEMBLY=CHIMP2.1.4
ASSEMBLY_SYNONYMS=panTro4
SOURCE=Ensembl
VERSION=78
The example below shows the modifications for UCSC:
SPECIES=Pan_troglodytes # With "_"; no space!
COMMON_NAME=Chimpanzee
ASSEMBLY=panTro4
ASSEMBLY_SYNONYMS=CHIMP2.1.4
SOURCE=UCSC
VERSION=2012-01-09
Now you can run the following command to install the genome in $MUGQIC_INSTALL_HOME_DEV (by default). This will download and install genomes, indexes and, for Ensembl only, annotations (GTF, VCF, etc.).
bash $MUGQIC_PIPELINES_HOME/resources/genomes/<scientific_name>.<assembly>.sh
Admin-only: To install it in $MUGQIC_INSTALL_HOME, run the following command:
bash $MUGQIC_PIPELINES_HOME/resources/genomes/<scientific_name>.<assembly>.sh MUGQIC_INSTALL_HOME
Admin-only: If the new genome has been installed in $MUGQIC_INSTALL_HOME_DEV, you can deploy it to $MUGQIC_INSTALL_HOME with the following command:
rsync -vca --no-o --no-g --no-p --size-only -I -O --ignore-times $MUGQIC_INSTALL_HOME_DEV/genomes/species/<scientific_name>.<assembly> $MUGQIC_INSTALL_HOME/genomes/species/
Lastly, add the newly created “.ini” file to the genome configuration files so that it is available to subsequent pipeline runs, using the following command:
cp $MUGQIC_INSTALL_HOME/genomes/species/<scientific_name>.<assembly>/<scientific_name>.<assembly>.ini $MUGQIC_PIPELINES_HOME/resources/genomes/config/
Step 4: Validating GenPipes local deployment
You are now all set to use GenPipes pipelines. For each pipeline, you can get help about its usage through the help command:
$MUGQIC_PIPELINES_HOME/pipelines/<pipeline_name>/<pipeline_name>.py --help
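To see which values of <pipeline_name> are available in your copy, you can list the scripts that follow the pipelines/<pipeline_name>/<pipeline_name>.py layout used in the command above. `list_pipelines` is a hypothetical helper written for this example. It only inspects file names and does not run any pipeline.

```shell
# list_pipelines GENPIPES_ROOT
# Hypothetical helper (not shipped with GenPipes): prints the name of
# every pipeline whose script is named after its parent directory,
# i.e. pipelines/<pipeline_name>/<pipeline_name>.py.
list_pipelines() {
    for script in "$1"/pipelines/*/*.py; do
        [ -f "$script" ] || continue
        name=$(basename "$script" .py)
        parent=$(basename "$(dirname "$script")")
        if [ "$name" = "$parent" ]; then
            printf '%s\n' "$name"
        fi
    done
    return 0
}

# Example usage:
# list_pipelines "$MUGQIC_PIPELINES_HOME"
```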
Running pipelines requires other inputs such as a Configuration File, a Readset File and a Design File. For details on how to run individual pipelines, see Running GenPipes or the GenPipes User Guide.
Note
If you run into any issues, contact GenPipes Support or check the other communication channels to view the latest community discussions about using GenPipes.
Note
You may also want to check the latest GenPipes deployment and setup instructions listed in the GenPipes README.md file.