Nanopore Pipeline

Structural variants (SVs) are genomic alterations including insertions, deletions, duplications, inversions, and translocations. They account for approximately 1% of the differences among human genomes and play a significant role in phenotypic variation and disease susceptibility.

Nanopore sequencing technology generates long sequence reads, which enable more accurate SV identification at both the sequencing and data analysis stages. Several aligners and SV callers have been developed specifically to leverage long-read sequencing data, and assembly-based approaches can also be used for SV identification. The Minimap2 aligner offers high speed and relatively balanced performance for calling both insertions and deletions.

Nanopore sequencing technology is commercialized by Oxford Nanopore Technologies (ONT).


Introduction

The Nanopore pipeline is used to analyze long reads produced by Oxford Nanopore Technologies (ONT) sequencers. Currently, the pipeline uses Minimap2 to align reads to the reference genome. It also produces a QC report that includes an interactive dashboard for each readset, built from the basecalling summary file and the alignment. As an additional QC check, randomly selected reads are aligned to the NCBI nucleotide database and the species of the top hits are reported.

Once the QC and alignments have been produced, Picard is used to merge readsets coming from the same sample. Finally, SVIM is used to detect structural variants (SVs), including deletions, insertions, and translocations. For a full summary of the types of SVs detected, please consult the SVIM documentation.

The SV calls produced by SVIM are saved as VCFs for each sample, which can then be used in downstream analyses. No filtering is performed on the SV calls.
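Since no filtering is applied by the pipeline, any downstream filtering is left to the user. As a minimal sketch, assuming bcftools is installed and using a placeholder file name and an illustrative threshold, calls could be filtered on the quality score SVIM assigns to each variant:

# Keep only SV calls with a SVIM quality score of at least 10
# (file names and threshold are illustrative, not pipeline defaults)
bcftools view -i 'QUAL>=10' sampleA.svim.vcf > sampleA.svim.q10.vcf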

This pipeline currently does not perform basecalling and requires both FASTQ files and a sequencing_summary file produced by an ONT-supported basecaller (we recommend Guppy). Additionally, the testing and development of the pipeline were focused on genomics applications, and functionality has not been tested for transcriptomic or epigenomic datasets. Beyond the QC dashboards for each readset, there is currently no reporting step implemented in this pipeline.

For more information on using ONT data for structural variant detection, as well as an alternative approach, please consult the Oxford Nanopore Technologies SV Pipeline GitHub repository.

For information on the structure and contents of the Nanopore readset file, please consult Nanopore Readsets details.


Version

4.5.0

For the latest implementation and usage details, refer to the Nanopore Sequencing implementation README file.


Usage

nanopore.py [-h] [--help] [-c CONFIG [CONFIG ...]]
            [-s STEPS] [-o OUTPUT_DIR]
            [-j {pbs,batch,daemon,slurm}] [-f]
            [--no-json] [--report] [--clean]
            [-l {debug,info,warning,error,critical}]
            [--sanity-check]
            [--container {wrapper, singularity} <IMAGE PATH>]
            [--genpipes_file GENPIPES_FILE]
            [-r READSETS] [-v]

Optional Arguments

-r READSETS, --readsets READSETS

                          readset file
-h                        show this help message and exit
--help                    show detailed description of pipeline and steps
-c CONFIG [CONFIG ...], --config CONFIG [CONFIG ...]

                          config INI-style list of files; config parameters
                          are overwritten based on files order
-s STEPS, --steps STEPS   step range e.g. '1-5', '3,6,7', '2,4-8'
-o OUTPUT_DIR, --output-dir OUTPUT_DIR

                          output directory (default: current)
-j {pbs,batch,daemon,slurm}, --job-scheduler {pbs,batch,daemon,slurm}

                          job scheduler type (default: slurm)
-f, --force               force creation of jobs even if up to date (default:
                          false)
--no-json                 do not create JSON file per analysed sample to track
                          the analysis status (default: false i.e. JSON file
                          will be created)
--report                  create 'pandoc' command to merge all job markdown
                          report files in the given step range into HTML, if
                          they exist; if --report is set, --job-scheduler,
                          --force, --clean options and job up-to-date status
                          are ignored (default: false)
--clean                   create 'rm' commands for all job removable files in
                          the given step range, if they exist; if --clean is
                          set, --job-scheduler, --force options and job up-to-
                          date status are ignored (default: false)

                          Note: Do not use the -g option with --clean; use '>' to
                          redirect the output of the --clean command.
-l {debug,info,warning,error,critical}, --log {debug,info,warning,error,critical}

                          log level (default: info)
--sanity-check            run the pipeline in `sanity check mode` to verify
                          all the input files needed for the pipeline to run
                          are available on the system (default: false)
--container {wrapper, singularity} <IMAGE PATH>

                          run the pipeline inside a container, providing a
                          container image path or an accessible Singularity
                          hub path
-v, --version             show the version information and exit
-g GENPIPES_FILE, --genpipes_file GENPIPES_FILE

                          Commands for running the pipeline are output to this
                          file pathname. The data specified on the pipeline
                          command line is processed and the pipeline run
                          commands are stored in GENPIPES_FILE, if this option
                          is specified. Otherwise, the output is redirected to
                          stdout. This file can then be used to actually "run
                          the GenPipes Pipeline".

                          Note: Do not use the -g option with --clean. Use '>' to
                          redirect the output to a file when using --clean.

Example Run

Use the following commands to execute the Nanopore sequencing pipeline:

nanopore.py <add options - info not available in the README file> -g nanopore_cmd.sh

bash nanopore_cmd.sh
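
For illustration only, a fuller invocation might look like the following; the ini file paths, readset file name, and step range are assumptions based on the general GenPipes pattern shown in the warning below, not values documented in the Nanopore README:

nanopore.py -c $MUGQIC_PIPELINES_HOME/pipelines/nanopore/nanopore.base.ini $MUGQIC_PIPELINES_HOME/pipelines/common_ini/beluga.ini -r readset.nanopore.txt -s 1-5 -g nanopore_cmd.sh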

Tip

Replace the beluga.ini file name in the command above with the appropriate clustername.ini file located in the $MUGQIC_PIPELINES_HOME/pipelines/common_ini folder, depending upon the cluster where you are executing the pipeline, e.g., narval.ini, cedar.ini, or graham.ini.

Warning

While issuing the pipeline run command, use the `-g GENPIPES_FILE` option (see the example above) instead of the ` > GENPIPES_FILE` redirection that GenPipes has supported so far, as shown below:

[genpipes_seq_pipeline].py -t mugqic -c $MUGQIC_PIPELINES_HOME/pipelines/[genpipes_seq_pipeline]/[genpipes_seq_pipeline].base.ini $MUGQIC_PIPELINES_HOME/pipelines/common_ini/beluga.ini -r readset.[genpipes_seq_pipeline].txt -s 1-6 > [genpipes_seq_pipeline]_commands_mugqic.sh

bash [genpipes_seq_pipeline]_commands_mugqic.sh

` > scriptfile` should be considered deprecated; the `-g scriptfile` option is recommended instead.

Please note that redirecting commands to a script with `> genpipe_script.sh` is still supported for now, but this mechanism might be dropped in a future GenPipes release.

Note

Nanopore Readset Format

Use the following readset file format for the Nanopore and Nanopore CoV-Seq pipelines:

  • Sample: must contain letters A-Z, numbers 0-9, hyphens (-) or underscores (_) only; mandatory;

  • Readset: a unique readset name with the same allowed characters as above; mandatory;

  • Run: a unique ONT run name, which usually has a structure similar to PAE0000_a1b2c3d;

  • Flowcell: code of the type of Flowcell used, for example: the code for PromethION Flow Cell (R9.4) is FLO-PRO002;

  • Library: code of the type of library preparation kit used, for example: the code for the Ligation Sequencing Kit is SQK-LSK109;

  • Summary: path to the sequencing_summary.txt file outputted by the ONT basecaller; mandatory;

  • FASTQ: path to the fastq_pass directory that is usually created by the basecaller; mandatory;

  • FAST5: path to the directory containing the raw FAST5 files, before basecalling.

Example:

Sample   Readset   Run               Flowcell    Library     Summary                                  FASTQ                        FAST5
sampleA  readset1  PAE00001_abcd123  FLO-PRO002  SQK-LSK109  path/to/readset1_sequencing_summary.txt  path/to/readset1/fastq_pass  path/to/readset1/fast5_pass
sampleA  readset2  PAE00002_abcd456  FLO-PRO002  SQK-LSK109  path/to/readset2_sequencing_summary.txt  path/to/readset2/fastq_pass  path/to/readset2/fast5_pass
sampleA  readset3  PAE00003_abcd789  FLO-PRO002  SQK-LSK109  path/to/readset3_sequencing_summary.txt  path/to/readset3/fastq_pass  path/to/readset3/fast5_pass
sampleA  readset4  PAE00004_abcd246  FLO-PRO002  SQK-LSK109  path/to/readset4_sequencing_summary.txt  path/to/readset4/fastq_pass  path/to/readset4/fast5_pass

You can download the test dataset for this pipeline here.


Pipeline Schema

The following figure shows the schema for Nanopore sequencing pipeline:

nanopore schema

Figure: Schema of Nanopore Sequencing protocol


Pipeline Steps

The following steps are part of the Nanopore sequencing pipeline:

Nanopore Sequencing Steps

BlastQC

Minimap2 Align

pycoQC

Picard Merge SAM Files

Structural Variant Identification using Mapped Long Reads


Step Details

The following are the various steps that are part of the GenPipes Nanopore genomic analysis pipeline:

BlastQC

In this step, the Blast-QC utility is used for sequence alignment and identification. It performs a basic QC test by aligning 1000 bp of randomly selected reads to the NCBI nucleotide database in order to detect potential contamination.
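
As a rough illustration of this kind of check (not the exact command the pipeline runs), a subset of reads in FASTA format could be queried against a local copy of the NCBI nt database with BLAST; the file names below are placeholders:

# Align a random subset of reads against the nt database and keep the best hit per read
# (subset_reads.fasta and the local nt database are illustrative placeholders)
blastn -query subset_reads.fasta -db nt \
       -outfmt "6 qseqid sseqid pident length evalue stitle" \
       -max_target_seqs 1 -out subset_reads.blast.tsv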

Minimap2 Align

The Minimap2 Align program is a fast, general-purpose sequence alignment program that maps DNA and long mRNA sequences against a large reference database. It can be used for Nanopore sequencing, mapping ~1 kb genomic reads at an error rate of ~15% (e.g., PacBio or Oxford Nanopore genomic reads), among other uses.

In this step, Minimap2 is used to align the FASTQ reads that passed the minimum QC threshold to the provided reference genome. By default, reads are aligned to the human reference genome (GRCh38).
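
A minimal sketch of this kind of alignment, assuming minimap2 and samtools are available and using placeholder file names (the pipeline's actual parameters may differ):

# Align ONT reads to the reference with the map-ont preset, then sort and index the output
minimap2 -ax map-ont GRCh38.fa readset1.fastq.gz | \
    samtools sort -o readset1.sorted.bam -
samtools index readset1.sorted.bam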

pycoQC

In this step, the pycoQC software is used to produce an interactive quality report based on the summary file and the alignment outputs. PycoQC relies on the sequencing_summary.txt file generated by Guppy; if needed, it can also generate a summary file from basecalled FAST5 files. PycoQC computes metrics and generates interactive QC plots for Oxford Nanopore Technologies sequencing data.
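
For example, a standalone pycoQC run over one readset might look like the following (file names are placeholders; the options used by the pipeline may differ):

# Build an interactive HTML QC report from the basecalling summary and the alignment
pycoQC -f readset1_sequencing_summary.txt -a readset1.sorted.bam -o readset1_pycoQC.html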

Picard Merge SAM Files

BAM readset files are merged into one file per sample in this step. Using the aligned and sorted BAM files produced by the Minimap2 Align step, the merge is performed with Picard.
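
A sketch of this merge using the Picard command-line tool directly (BAM file names are placeholders):

# Merge all readset-level BAMs belonging to sampleA into a single sample-level BAM
java -jar picard.jar MergeSamFiles \
    I=readset1.sorted.bam I=readset2.sorted.bam \
    I=readset3.sorted.bam I=readset4.sorted.bam \
    O=sampleA.merged.bam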

Structural Variant Identification using Mapped Long Reads

In this step, SVIM (Structural Variant Identification using Mapped Long Reads) is used to perform structural variant (SV) analysis on each sample.
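
For illustration, calling SVs on a merged sample BAM with SVIM's alignment mode might look like this (the output directory and file names are placeholders):

# Call structural variants from the merged, sorted alignment;
# SVIM writes its results to variants.vcf inside the given working directory
svim alignment sampleA_svim/ sampleA.merged.bam GRCh38.fa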


More Information