Nanopore CoVSeQ Pipeline

The GenPipes Nanopore CoVSeQ pipeline is built on the Nanopore ARTIC-Nanopolish protocol, which research groups worldwide have adopted to assist in epidemiological investigations. The protocol focuses mainly on the portable Oxford Nanopore MinION sequencer. However, the parts of the protocol that deal with the primer scheme and sample amplification can be generalized to other sequencing platforms.

Direct amplification of the virus with a tiled, multiplexed primer approach has proven highly sensitive. Unlike metagenomic approaches, it enables researchers to work directly from clinical samples. It has been widely used to analyze viral genome data generated during outbreaks, such as that of SARS-CoV-2, for information about relatedness to other viruses.

The GenPipes Nanopore CoVSeQ Sequencing Pipeline is based on the nCoV-2019 novel coronavirus bioinformatics protocol (ARTIC V4.1), which takes the output of the sequencing protocol and produces consensus genome sequences. It includes basecalling, demultiplexing, mapping, polishing and consensus generation.


Introduction

The Nanopore CoVSeQ pipeline is used to analyze long reads produced by the Oxford Nanopore Technologies (ONT) sequencers.

The SOP for Nanopore data is based on the ARTIC SARS-CoV2 protocol, Version 4 / 4.1 (V4.1), using nanopolish. The GenPipes Nanopore sequencing pipeline follows this protocol closely. The majority of the changes are technical adaptations that allow the protocol to run in a High Performance Computing (HPC) environment, where the use of Conda is not advisable.

Key steps in this pipeline include basecalling with Guppy, demultiplexing, read filtering and consensus sequencing. Basecalling with Guppy happens only if the `-t basecalling` option is selected.
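The effect of the protocol option on the step list can be pictured with a minimal sketch. The step names below are derived from the "Step Details" headings of this page, and the assumption that the default protocol simply omits the Guppy and pycoQC steps is made for illustration only; the `steps_for` helper is hypothetical and not part of GenPipes.

```python
# Step names mirror the "Step Details" headings; the split between the
# two protocols is an assumption made for this illustration.
BASECALLING_ONLY = ["guppy_basecall", "guppy_demultiplex", "pycoqc"]
COMMON_STEPS = [
    "host_reads_removal",
    "kraken_analysis",
    "artic_nanopolish",
    "wub_metrics",
    "covseq_metrics",
    "snpeff_annotate",
    "quast_consensus_metrics",
    "rename_consensus_header",
    "prepare_report",
]


def steps_for(protocol: str) -> list[str]:
    """Return the ordered step names for 'default' or 'basecalling'."""
    if protocol == "default":
        return list(COMMON_STEPS)
    if protocol == "basecalling":
        return BASECALLING_ONLY + COMMON_STEPS
    raise ValueError(f"unknown protocol: {protocol!r}")


# Print the numbered steps, as they would be addressed by a step range.
for number, name in enumerate(steps_for("basecalling"), start=1):
    print(number, name)
```

Under this assumption, the same step name has a different step number in each protocol, so a step range written for one protocol should be checked before it is reused with the other.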

If the basecalling protocol is selected through the `-t` command-line option, the Nanopore CoVSeQ pipeline performs basecalling with Guppy (GPU) and demultiplexing. The pipeline then performs de-hosting for all the samples, followed by the ARTIC-Nanopolish wrapper, which performs alignment to the SARS-CoV2 reference (using minimap2) and variant calling (using Nanopolish). Nanopolish performs signal-level analysis of Oxford Nanopore sequencing data. After Nanopolish processing, the pipeline generates the consensus through the artic_mask and bcftools consensus steps. Lastly, custom scripts and ncov_tools are run to report on quality metrics for the Nanopore CoVSeQ GenPipes Sequencing Pipeline.
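The idea behind the masking and consensus steps can be illustrated with a toy sketch. This is not the artic_mask or bcftools implementation: it handles substitutions only, and the `build_consensus` function and its `min_depth` threshold of 20 are assumptions chosen for illustration.

```python
def build_consensus(reference, depths, snvs, min_depth=20):
    """Toy consensus: apply substitutions, then mask low-depth positions.

    reference: reference sequence as a string
    depths:    read depth per position, same length as reference
    snvs:      dict mapping 0-based position -> alternate base
    min_depth: hypothetical threshold below which a base becomes 'N'
    """
    if len(depths) != len(reference):
        raise ValueError("depths must have one value per reference base")
    bases = list(reference)
    # Apply the called variants to the reference.
    for pos, alt in snvs.items():
        bases[pos] = alt
    # Mask positions without enough read support, even if a variant was called.
    for pos, depth in enumerate(depths):
        if depth < min_depth:
            bases[pos] = "N"
    return "".join(bases)


print(build_consensus("ACGTACGT", [30, 30, 5, 30, 30, 30, 30, 2], {1: "T", 2: "A"}))
# prints ATNTACGN
```

In the example, the variant at position 2 is overridden by the mask because only 5 reads cover it, which is why masking is applied after the variants.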

Details of structure and contents of the Nanopore readset file are available here.


Version

4.3.2

For the latest implementation and usage details, refer to the Nanopore Sequencing implementation README file.


Usage

nanopore_covseq.py [-h] [--help] [-c CONFIG [CONFIG ...]] [-s STEPS]
          [-o OUTPUT_DIR] [-j {pbs,batch,daemon,slurm}] [-f]
          [--no-json] [--report] [--clean]
          [-l {debug,info,warning,error,critical}] [--sanity-check]
          [-t {default,basecalling}]
          [--genpipes_file GENPIPES_FILE]
          [--container {wrapper, singularity} {<CONTAINER PATH>, <CONTAINER NAME>}]
          [-v]

Optional Arguments

-h                        show this help message and exit
--help                    show detailed description of pipeline and steps
-c CONFIG [CONFIG ...], --config CONFIG [CONFIG ...]

                          config INI-style list of files; config parameters
                          are overwritten based on files order
-s STEPS, --steps STEPS   step range e.g. '1-5', '3,6,7', '2,4-8'
-o OUTPUT_DIR, --output-dir OUTPUT_DIR

                          output directory (default: current)
-j {pbs,batch,daemon,slurm}, --job-scheduler {pbs,batch,daemon,slurm}

                          job scheduler type (default: slurm)
-f, --force               force creation of jobs even if up to date (default:
                          false)
--no-json                 do not create JSON file per analysed sample to track
                          the analysis status (default: false i.e. JSON file
                          will be created)
--report                  create 'pandoc' command to merge all job markdown
                          report files in the given step range into HTML, if
                          they exist; if --report is set, --job-scheduler,
                          --force, --clean options and job up-to-date status
                          are ignored (default: false)
--clean                   create 'rm' commands for all job removable files in
                          the given step range, if they exist; if --clean is
                          set, --job-scheduler, --force options and job up-to-
                          date status are ignored (default: false)
-l {debug,info,warning,error,critical}, --log {debug,info,warning,error,critical}

                          log level (default: info)
--sanity-check            run the pipeline in `sanity check mode` to verify
                          all the input files needed for the pipeline to run
                          are available on the system (default: false)
--container {wrapper, singularity} <IMAGE PATH>

                          run pipeline inside a container providing a container
                          image path or accessible singularity hub path
-v, --version             show the version information and exit
-g GENPIPES_FILE, --genpipes_file GENPIPES_FILE

                          Commands for running the pipeline are output to this
                          file pathname. The data specified to pipeline command
                          line is processed and pipeline run commands are
                          stored in GENPIPES_FILE, if this option is specified.
                          Otherwise, the output will be redirected to stdout.
                          This file can be used to actually "run the
                          GenPipes Pipeline".
-r READSETS, --readsets READSETS

                          readset file
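The step range syntax accepted by `-s` ('1-5', '3,6,7', '2,4-8') can be expanded into individual step numbers as in the following minimal sketch. The `parse_step_range` function is hypothetical and only illustrates the syntax; it is not the parser used by GenPipes.

```python
def parse_step_range(spec: str) -> list[int]:
    """Expand a step range such as '2,4-8' into a sorted list of step numbers."""
    steps = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            # A dash denotes an inclusive range of steps.
            start, end = (int(value) for value in part.split("-", 1))
            if start > end:
                raise ValueError(f"invalid range: {part!r}")
            steps.update(range(start, end + 1))
        else:
            steps.add(int(part))
    return sorted(steps)


print(parse_step_range("1-5"))    # prints [1, 2, 3, 4, 5]
print(parse_step_range("3,6,7"))  # prints [3, 6, 7]
print(parse_step_range("2,4-8"))  # prints [2, 4, 5, 6, 7, 8]
```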

Example Run

Use the following commands to execute the Nanopore CoVSeQ pipeline:

nanopore_covseq.py -c $MUGQIC_PIPELINES_HOME/pipelines/nanopore/nanopore.base.ini $MUGQIC_PIPELINES_HOME/pipelines/common_ini/beluga.ini $MUGQIC_PIPELINES_HOME/pipelines/nanopore_covseq/ARTIC_v4.1.ini -g nanopore_covseq_commands_mugqic.sh

bash nanopore_covseq_commands_mugqic.sh

Tip

Replace the beluga.ini file name in the command above with the appropriate clustername.ini file located in the $MUGQIC_PIPELINES_HOME/pipelines/common_ini folder, depending on the cluster where you are executing the pipeline. For example, narval.ini, cedar.ini or graham.ini.

Warning

When issuing the pipeline run command, use the `-g GENPIPES_FILE` option (see the example above) instead of the ` > GENPIPES_FILE` redirection that GenPipes has supported so far, shown below:

[genpipes_seq_pipeline].py -t mugqic -c $MUGQIC_PIPELINES_HOME/pipelines/[genpipes_seq_pipeline]/[genpipes_seq_pipeline].base.ini $MUGQIC_PIPELINES_HOME/pipelines/common_ini/beluga.ini -r readset.[genpipes_seq_pipeline].txt -s 1-6 > [genpipes_seq_pipeline]_commands_mugqic.sh

bash [genpipes_seq_pipeline]_commands_mugqic.sh

` > scriptfile` should be considered deprecated; the `-g scriptfile` option is recommended instead.

Redirecting commands to a script with `> genpipe_script.sh` is still supported for now, but this mechanism might be dropped in a future GenPipes release.

Warning

ARTIC v4 vs v4.1 selection

The Nanopore CoVSeQ pipeline uses the ARTIC v4 amplicon scheme by default. If ARTIC v4.1 is required, use the appropriate .ini file. For all other amplicon schemes, add the appropriate primer and amplicon BED files and use a custom .ini file for processing.
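Because config parameters are overwritten based on file order, a scheme-specific or custom .ini file only needs to contain the values it changes, and it should come last in the `-c` list. The following minimal sketch shows this layering with Python's standard `configparser` module. The section name, the key names and the `load_layered` helper are hypothetical, and GenPipes' own config loader may behave differently in details.

```python
import configparser


def load_layered(*ini_texts):
    """Parse INI texts in order; values in later texts override earlier ones."""
    config = configparser.ConfigParser()
    for text in ini_texts:
        config.read_string(text)
    return config


# Hypothetical base settings (section and keys are illustrative only).
BASE_INI = """
[artic_nanopolish]
primers_version = V4
threads = 8
"""

# Hypothetical custom file: it lists only the value that changes.
CUSTOM_INI = """
[artic_nanopolish]
primers_version = V4.1
"""

config = load_layered(BASE_INI, CUSTOM_INI)
print(config["artic_nanopolish"]["primers_version"])  # prints V4.1
print(config["artic_nanopolish"]["threads"])          # prints 8
```

Reversing the order of the two files would leave the base value in effect, which is why the order of the files matters.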

You can download the test dataset for this pipeline here. Nanopore CoVSeQ readset file structure and content details are available here.


Pipeline Schema

The figures below show the schema of the Nanopore CoVSeQ ARTIC SARS-CoV2 sequencing protocol for the default and basecalling protocol options. You can also refer to the latest pipeline implementation.

nanopore covseq (-t default) schema

Figure: Schema of Nanopore CoVSeQ (Default) Sequencing protocol

nanopore covseq (-t basecalling) schema

Figure: Schema of Nanopore CoVSeQ (Basecalling) Sequencing protocol


Step Details

The following steps are part of the GenPipes Nanopore CoVSeQ genomic analysis pipeline:

Guppy Basecall

This step uses the Oxford Nanopore basecaller, Guppy, to basecall raw FAST5 files and produce FASTQ files. The basecalling model dna_r9.4.1_450bps_hac.cfg is used by default.

Guppy Demultiplex

This step uses the Oxford Nanopore basecaller, Guppy, to demultiplex FASTQ files based on their barcode. The barcode arrangement barcode_arrs_nb96.cfg is used by default.

Note

In the Guppy Demultiplex call, the `--require_barcodes_both_ends` parameter is set by default.

pycoQC

In this step, pycoQC is used to produce an interactive quality report based on the summary file and alignment outputs. pycoQC relies on the sequencing_summary.txt file generated by Guppy. If needed, it can also generate a summary file from basecalled FAST5 files. pycoQC computes metrics and generates interactive QC plots for Oxford Nanopore Technologies sequencing data.

Host Reads Removal

This step uses a mapping approach with a hybrid GRCh38 + SARS-CoV2 genome. Reads that map to the human genome are removed from the analysis, and a “de-hosted” FASTQ file is produced.
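The filtering logic of this step can be sketched as follows. The sketch assumes that the best alignment of each read has already been determined and that the viral contig in the hybrid reference is named after the SARS-CoV-2 reference accession MN908947.3; the `dehost` function and its input layout are hypothetical.

```python
# Assumption: the viral contig in the hybrid reference carries this name.
VIRAL_CONTIG = "MN908947.3"


def dehost(read_ids, best_hits):
    """Return the read IDs that do not map to a human contig.

    read_ids:  iterable of read identifiers, in input order
    best_hits: dict mapping read ID -> contig name of its best alignment,
               or None when the read is unmapped
    """
    kept = []
    for read_id in read_ids:
        contig = best_hits.get(read_id)
        # In a hybrid reference, any contig other than the viral one is human.
        if contig is None or contig == VIRAL_CONTIG:
            kept.append(read_id)
    return kept


hits = {"read1": "chr1", "read2": "MN908947.3", "read3": None, "read4": "chrX"}
print(dehost(["read1", "read2", "read3", "read4"], hits))
# prints ['read2', 'read3']
```

Unmapped reads are kept in this sketch because only reads that map to the human genome are removed.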

Kraken Analysis

Kraken is a taxonomic sequence classifier that assigns taxonomic labels to short DNA reads. It does this by examining the k-mers within a read and querying a database with those k-mers.

In this step, Kraken2 is used to produce a report on the raw data, which can be used to detect additional host contamination.
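The k-mer lookup idea can be illustrated with a toy sketch. It is a simplification that assigns a read by majority vote over exact k-mer matches; it does not reproduce Kraken's taxonomy-aware algorithm or database format. The `kmers` and `classify` functions and the tiny database are hypothetical.

```python
from collections import Counter


def kmers(sequence, k):
    """Return all overlapping k-mers of a sequence, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def classify(read, kmer_db, k):
    """Label a read with the taxon matched by most of its k-mers."""
    votes = Counter(kmer_db[kmer] for kmer in kmers(read, k) if kmer in kmer_db)
    if not votes:
        return "unclassified"
    return votes.most_common(1)[0][0]


# Toy database mapping k-mers to labels (illustrative values only).
toy_db = {"ACGT": "virus", "CGTA": "virus", "TTTT": "human"}

print(kmers("ACGTA", 4))               # prints ['ACGT', 'CGTA']
print(classify("ACGTA", toy_db, k=4))  # prints virus
print(classify("GGGGG", toy_db, k=4))  # prints unclassified
```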

ARTIC Nanopolish

The ARTIC Nanopolish pipeline is used to produce consensus sequences and VCFs. Since Nanopolish is used, this step requires both FAST5 and FASTQ files.

Wub Metrics

The Wub package is used to calculate alignment metrics in this pipeline step.

CoVSeQ Metrics

Using all the metrics calculated so far, a table is produced that summarizes the metrics for each individual sample.

SnpEff Annotate

The VCF produced by the ARTIC Nanopolish step is annotated using SnpEff.

Quast Consensus Metrics

Consensus metrics are calculated using the tool QUAST.

Rename Consensus Header

A final consensus sequence is produced, with the appropriate header and naming convention based on genome completeness.
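One way to picture a completeness-based header is the sketch below. It defines completeness as the fraction of bases that are not `N`; the 0.9 threshold, the `pass`/`flag` labels and the header layout are assumptions made for illustration and are not the naming convention used by the pipeline.

```python
def genome_completeness(consensus: str) -> float:
    """Return the fraction of consensus bases that are not 'N'."""
    sequence = consensus.upper()
    if not sequence:
        return 0.0
    return (len(sequence) - sequence.count("N")) / len(sequence)


def consensus_header(sample: str, consensus: str, pass_threshold: float = 0.9) -> str:
    """Build a FASTA header with a status label (hypothetical layout)."""
    completeness = genome_completeness(consensus)
    status = "pass" if completeness >= pass_threshold else "flag"
    return f">{sample}|{status}"


print(consensus_header("sample1", "ACGTACGTAC"))  # prints >sample1|pass
print(consensus_header("sample2", "ACNNNNNNNN"))  # prints >sample2|flag
```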

Prepare Report

Using the ncov-tools package and additional R scripts, final reports are produced for all samples in the run. The reports include basic QC plots and a preliminary lineage assignment from ncov-tools.


More information

For the latest implementation and usage details, refer to the Nanopore CoVSeQ Pipeline implementation README.md.