ChIP Sequencing Pipeline

Version 6.1.0

New Try GenPipes Wizard

Danger

When using the mouse genome, please note that the annotation files for GRCm38 do not work with the Homer analysis. Use mm10, instead of the GRCm38 program.

Command
user@machine:~$ genpipes chipseq [options] [--genpipes_file GENPIPES_FILE.sh]
user@machine:~$ bash GENPIPES_FILE.sh
Options
-t {chipseq, atacseq},
--type {chipseq, atacseq}

                            Type of pipeline (default=chipseq)
-d DESIGN, --design DESIGN

                              design file
-r READSETS, --readsets READSETS

                      readset file
-h                        show this help message and exit
--help                    show detailed description of pipeline and steps
-c CONFIG [CONFIG ...],
--config CONFIG [CONFIG ...]

                        config INI-style list of files; config parameters
                        are overwritten based on files order
-s STEPS, --steps STEPS   step range e.g. '1-5', '3,6,7', '2,4-8'
-o OUTPUT_DIR, --output-dir OUTPUT_DIR

                        output directory (default: current)
-j {pbs,batch,daemon,slurm},
--job-scheduler {pbs,batch,daemon,slurm}

                        job scheduler type (default: slurm)
-f, --force               force creation of jobs even if up to date (default:
                          false)

                          Take the mem input in the ini file and force to have a
                          minimum of mem_per_cpu by correcting the number of cpu
                          (default: None)
--force_mem_per_cpu       FORCE_MEM_PER_CPU

                          Take the mem input in the ini file and force to have a
                          minimum of mem_per_cpu by correcting the number of cpu
                          (default: None)
--json-pt                 create JSON file for project_tracking database
                          ingestion (default: false i.e. JSON file will NOT be
                          created)
- -report                   create 'pandoc' command to merge all job markdown
                            report files in the given step range into HTML, if
                            they exist; if --report is set, --job-scheduler,
                            --force, --clean options and job up-to-date status
                            are ignored (default: false)
--clean                   create 'rm' commands for all job removable files in
                          the given step range, if they exist; if --clean is
                          set, --job-scheduler, --force options and job up-to-
                          date status are ignored (default: false)

                          Note: Do not use -g option with --clean, use '>' to redirect
                          the output of the --clean command option
-l {debug,info,warning,error,critical},
--log {debug,info,warning,error,critical}

                        log level (default: info)
--sanity-check            run the pipeline in `sanity check mode` to verify
                          all the input files needed for the pipeline to run
                          are available on the system (default: false)
--container {wrapper, singularity} <IMAGE PATH>

                        run pipeline inside a container providing a container
                        image path or accessible singularity hub path
--wrap [WRAP]             path to the GenPipes cvmfs wrapper script (default:
                          genpipes/ressources/container/bin/container_wrapper.sh,
                          a convenience option for using GenPipes in a container)
-v, --version             show the version information and exit
-g GENPIPES_FILE, --genpipes_file GENPIPES_FILE

                          Commands for running the pipeline are output to this
                          file pathname. The data specified to pipeline command
                          line is processed and pipeline run commands are
                          stored in GENPIPES_FILE, if this option is specified
                          . Otherwise, the output will be redirected to stdout
                          . This file can be used to actually "run the
                          GenPipes Pipeline"

Important

Do not use -g option with -clean.

Use ‘>’ to redirect the output of the genpipes command to a file when using -clean option.

Example
user@machine:~$ genpipes chipseq -c $GENPIPES_INIS/chipseq/chipseq.base.ini \
                                    $GENPIPES_INIS/common_ini/rorqual.ini \
                                 -r readsets.chipseq.txt \
                                 -d design.chipseq.txt \
                                 -s 1-19 \
                                 -g chipseqCommands.sh

user@machine:~$ bash chipseqCommands.sh

Tip

Depending upon the cluster where you are executing the pipeline, substitute the file name rorqual.ini in the command with the appropriate <DRAC server cluster name>.ini file located in the $GENPIPES_INIS/common_ini folder.

For e.g., rorqual.ini, fir.ini, or narval.ini.

The commands will be sent to the job queue and you will be notified once each step is done. If everything runs smoothly, you should get MUGQICexitStatus:0 or Exit_status=0. If that is not the case, then an error has occurred after which the pipeline usually aborts. To examine the errors, check the content of the job_output folder.

Caution

It is recommended that you use the -g GENPIPES_CMD.sh option instead of redirecting the output of the pipeline command to a file via > GENPIPES_CMD.sh.

user@machine:~$ genpipes [pipeline] [options] -g genpipes_cmd.sh

user@machine:~$ bash genpipes_cmd.sh

user@machine:~$ genpipes [pipeline] [options] > genpipes_cmd.sh

user@machine:~$ bash genpies_cmd.sh

The > scriptfile method is supported but will be deprecated in a future GenPipes release.

Test Dataset

You can download the test dataset for this pipeline here.

Test Datasets
ChipSeq
ChIP-Seq schema

Figure: Schema of ChIP Sequencing protocol

dada2 ampseq
ATACSeq
ChIP-Seq schema atacseq option

Figure: Schema of ChIP Sequencing -t atacseq protocol

dada2 ampseq
ChIP-Seq
ATAC-Seq

Picard Sam to Fastq

If FASTQ files are not already specified in the Readset file, then this step converts SAM/BAM files from the input Readset into FASTQ format. Otherwise, it does nothing.

Trimmomatic

Raw reads quality trimming and removing of Illumina adapters is performed using Trimmomatic Process. If an adapter FASTA file is specified in the config file (section ‘trimmomatic’, param ‘adapter_fasta’), it is used first. Else, ‘Adapter1’ and ‘Adapter2’ columns from the readset file are used to create an adapter FASTA file, given then to Trimmomatic. For PAIRED_END readsets, readset adapters are reversed-complemented and swapped, to match Trimmomatic Palindrome strategy. For SINGLE_END readsets, only Adapter1 is used and left unchanged.

If available, trimmomatic step in Hi-C analysis takes FASTQ files from the readset file as input. Otherwise, it uses the FASTQ output file generated from the previous Picard Sam to Fastq step conversion of the BAM files.

Merge Trimmomatic Stats

The trim statistics per Readset file are merged at this step.

Mapping BWA Mem Sambamba

This step takes as input files trimmed FASTQ files, if available. Otherwise it takes FASTQ files from the readset. If readset is not supplied then it uses FASTQ output files from the previous Picard Sam to Fastq conversion of BAM files.

SAMbamba Merge BAM

BAM readset files are merged into one file per sample. Merge is done using Sambamba.

If available, the aligned and sorted BAM output files from previous Mapping BWA Mem Sambamba step are used as input. Otherwise, BAM files from the readset file is used as input.

SAMbamba Mark Duplicates

Mark duplicates. Aligned reads per sample are duplicates if they have the same 5’ alignment positions (for both mates in the case of paired-end reads). All but the best pair (based on alignment score) will be marked as a duplicate in the BAM file. Marking duplicates is done using SAMbamba.

SAMbamba View Filter

Filter unique reads by mapping quality using SAMbamba.

Metrics

The number of raw/filtered and aligned reads per sample are computed at this stage.

Homer Make Tag Directory

The Homer Tag directories, used to check for quality metrics, are computed at this step.

QC Metrics

Sequencing quality metrics as tag count, tag auto-correlation, sequence bias and GC bias are generated.

Deeptools QC

Fingerplot quality control answers the question “Did my ChIP work?” for ChIP-seq sample analysis. The tool processes indexed BAM files, creating a profile of cumulative read coverage. Reads in a specified window (bin) are counted, sorted, and plotted as a cumulative sum.

The Correlation Matrix tool analyzes and visualizes sample correlations using multiBamSummary or multiBigwigSummary output. It supports Pearson or Spearman methods to calculate correlation coefficients.

Homer Make UCSC files

Wiggle Track Format files are generated from the aligned reads using Homer. The resulting files can be loaded in browsers like IGV or UCSC.

MACS2 call peak

Peaks are called using the MACS2 software. Different calling strategies are used for narrow and broad peaks. The mfold parameter used in the model building step is estimated from a peak enrichment diagnosis run. The estimated mfold lower bound is 10 and the estimated upper bound can vary between 15 and 100.

The default mfold parameter of MACS2 is [10,30].

MACS2 ATAC-seq call peak

Assay for Transposon Accessible Chromatin (ATAC-seq) analysis enables measurement of chromatin structure modifications (nucleosome free regions) on gene regulation. This step involves calling peaks using the MACS2 software. Different calling strategies are used for narrow and broad peaks. The mfold parameter used in the model building step is estimated from a peak enrichment diagnosis run. The estimated mfold lower bound is 10 and the estimated upper bound can vary between 15 and 100.

The default mfold parameter of MACS2 is [10,30].

Homer annotate peaks

The peaks called previously are annotated with HOMER tool using RefSeq annotations for the reference genome. Gene ontology and genome ontology analysis are also performed at this stage.

Homer find motifs genome

In this step, De novo and known motif analysis per design are performed using HOMER.

Annotation Graphs

This step focuses on peak location statistics. The following peak location statistics are generated per design: proportions of the genomic locations of the peaks. The locations are: Gene (exon or intron), Proximal ([0;2] kb upstream of a transcription start site), Distal ([2;10] kb upstream of a transcription start site), 5d ([10;100] kb upstream of a transcription start site), Gene desert (>= 100 kb upstream or downstream of a transcription start site), Other (anything not included in the above categories); The distribution of peaks found within exons and introns; The distribution of peak distance relative to the transcription start sites (TSS); the Location of peaks per design.

Run SPP

This step runs spp to estimate NSC and RSC ENCODE metrics. For more information - see quality enrichment of ChIP sequence data, phantompeakqualtools.

Differential Binding

Differential Binding step is meant for processing DNA data enriched for genomic loci, including ChIP- seq data enriched for sites where specific protein binding occurs, or histone marks are enriched. It uses DiffBind package that helps in identifying sites that are differentially bound between sample groups.

GenPipes Chipseq pipeline performs differential binding based on the provided treatments and controls as per the particular comparison specified in the design file. The differential analysis results are generated separately for each specified comparison in the Design File, with correctly specified treatments (2) and controls (1) samples.

Samples with ‘0’ (zero) are ignored during the comparison. For details regarding how to specify sample group membership in the design file, refer to Design File Format details.

For comparison, at least two samples for each group must be specified. If two samples per group are not specified, the differential binding step will be skipped during the pipeline run.

Warning

Incorrect Design File Format

If the specified design file does not follow the specified Design File Format for ChIPseq pipeline, the differential binding step will be skipped during the pipeline run.

The output of differential analysis containing differentially bound peaks are saved as a TSV. In addition, for each comparison done using DiffBind, an html report is also generated for QC differential analysis.

Note

Limitation of Differential Binding

  1. Differential Binding analysis currently supports pair-wise (two groups) comparisons only. If you have a more complex experimental design, you may manually conduct the analysis.

  2. The Chip Sequencing protocol expects an input for the differential binding step. If pipeline users want to run this protocol without an input, they should skip the Differential Binding step and run it themselves locally.

IHEC Metrics

This step generates IHEC’s standard metrics.

MultiQC Report

A quality control report for all samples is generated. For more detailed information see MultiQC documentation.

Bedtools Blacklist Filter

Remove reads in blacklist regions from BAM with Bedtools intersect, if blacklist file is supplied, otherwise, do nothing.

GATK Haplotype Caller

Use GATK Haplotype caller for SNPs and small indels.

Merge & Call Individual GVCF

Merges the GVCFs of haplotype caller and also generates a per sample VCF containing genotypes.

CRAM Output

Generate long term storage version of the final alignment files in CRAM format. Using this function will include the original final BAM file into the removable file list.

Chromatin Immunoprecipitation (ChIP) sequencing technique is used for mapping DNA-protein interactions. It is a powerful method for identifying genome-wide DNA binding sites for transcription factors and other proteins.

The ChIP-Seq workflow is based on the ENCODE Project workflow. The pipeline starts by trimming adapters and low quality bases and mapping the reads (single end or paired end ) to a reference genome using Burrows-Wheeler Aligner (BWA). Reads are filtered by mapping quality and duplicate reads are marked. It creates tag directories using Homer routines. Peaks are called using Model based Analysis for Chip Sequencing (MACS2) and annotated using Homer. Binding motifs are also identified using Homer. Metrics are calculated based on IHEC requirements.

The ChIP-Seq pipeline can also be used for assay for transposase-accessible chromatin using sequencing ATAC-Seq samples. Use ‘ATAC’ protocol option to run the pipeline using ATAC-Seq.

See Schema tab for the pipeline workflow. Check the README.md file for implementation details.

ChIP-Seq-HL

Figure: Schematic representation of major methods used to detect functional elements in DNA (Source PLOS)

References

Input requirements

The Chip Sequencing protocol expects an input for the Differential Binding step. If pipeline users want to run this protocol without an input, they should skip the differential binding step and run it themselves locally.

Chip Sequencing Readset File

Please make sure you use the special ChIPSeq Pipeline Readset file format and not the general readset file format.

Chip Sequencing Design File Format

Note

ChIPSeq Pipeline Design File Format

The ChIPSeq Pipeline has two protocols: atac-seq and chip-seq. Each of these protocols requires a specific design file.

ChIPseq Protocol Format

Sample        MarkName        EW22_EW3_vs_EW7_TC71
EW22          H3K27ac         1
EW3           H3K27ac         1
EW7           H3K27ac         2
TC71          H3K27ac         2

ATACseq Protocol Format

Important

Note that the MarkName value for ATACseq protocol should be atac unlike the ChIPseq protocol.

Sample        MarkName        EW22_EW3_vs_EW7_TC71
EW22          atac            1
EW3           atac            1
EW7           atac            2
TC71          atac            2

where, the numbers specify the sample group membership for this contrast:

'0' or '': the sample does not belong to any group;
'1': the sample belongs to the control group;
'2': the sample belongs to the treatment test case group.

The design file only accepts 1 for control, 2 for treatment and 0 for other samples that do not need to compare.

Warning

Incorrect Design File Format

Please note that the first and second column in the design file must be sample name and histone mark/binding protein respectively.

If the user specifies any value other than the permitted ones above in the design file, the pipeline will fail.