ChIP Sequencing Pipeline

Version 6.1.0

Danger

When using the mouse genome, please note that the annotation files for GRCm38 do not work with the Homer analysis. Use mm10, instead of the GRCm38 program.

Usage

Schema

Steps

ChIP-Seq

Picard Sam to Fastq

Trimmomatic

Merge Trimmomatic Stats

Mapping BWA Mem Sambamba

SAMbamba Merge BAM

SAMbamba Mark Duplicates

SAMbamba View Filter

Bedtools Blacklist Filter

Metrics

Homer Make Tag Directory

QC Metrics

Deeptools QC

Homer Make UCSC file

MACS2 call peak

Homer annotate peaks

Homer find motifs genome

Annotation Graphs

Run SPP

Differential Binding

IHEC Metrics

MultiQC Report

CRAM Output

GATK Haplotype Caller

Merge and Call Individual GVCF

ATAC-Seq

Picard Sam to Fastq

Trimmomatic

Merge Trimmomatic Stats

Mapping BWA Mem Sambamba

SAMbamba Merge BAM

SAMbamba Mark Duplicates

SAMbamba View Filter

Bedtools Blacklist Filter

Metrics

Homer Make Tag Directory

QC Metrics

Homer Make UCSC file

MACS2 ATAC-seq call peak

homer_annotate_peaks|

Homer find motifs genome

Annotation Graphs

Run SPP

Differential Binding

IHEC Metrics

MultiQC Report

CRAM Output

GATK Haplotype Caller

Merge and Call Individual GVCF

Picard Sam to Fastq

If FASTQ files are not already specified in the Readset file, then this step converts SAM/BAM files from the input Readset into FASTQ format. Otherwise, it does nothing.

Trimmomatic

Raw reads quality trimming and removing of Illumina adapters is performed using Trimmomatic Process. If an adapter FASTA file is specified in the config file (section ‘trimmomatic’, param ‘adapter_fasta’), it is used first. Else, ‘Adapter1’ and ‘Adapter2’ columns from the readset file are used to create an adapter FASTA file, given then to Trimmomatic. For PAIRED_END readsets, readset adapters are reversed-complemented and swapped, to match Trimmomatic Palindrome strategy. For SINGLE_END readsets, only Adapter1 is used and left unchanged.

If available, trimmomatic step in Hi-C analysis takes FASTQ files from the readset file as input. Otherwise, it uses the FASTQ output file generated from the previous Picard Sam to Fastq step conversion of the BAM files.

Merge Trimmomatic Stats

The trim statistics per Readset file are merged at this step.

Mapping BWA Mem Sambamba

This step takes as input files trimmed FASTQ files, if available. Otherwise it takes FASTQ files from the readset. If readset is not supplied then it uses FASTQ output files from the previous Picard Sam to Fastq conversion of BAM files.

SAMbamba Merge BAM

BAM readset files are merged into one file per sample. Merge is done using Sambamba.

If available, the aligned and sorted BAM output files from previous Mapping BWA Mem Sambamba step are used as input. Otherwise, BAM files from the readset file is used as input.

SAMbamba Mark Duplicates

Mark duplicates. Aligned reads per sample are duplicates if they have the same 5’ alignment positions (for both mates in the case of paired-end reads). All but the best pair (based on alignment score) will be marked as a duplicate in the BAM file. Marking duplicates is done using SAMbamba.

SAMbamba View Filter

Filter unique reads by mapping quality using SAMbamba.

Metrics

The number of raw/filtered and aligned reads per sample are computed at this stage.

Homer Make Tag Directory

The Homer Tag directories, used to check for quality metrics, are computed at this step.

QC Metrics

Sequencing quality metrics as tag count, tag auto-correlation, sequence bias and GC bias are generated.

Deeptools QC

Fingerplot quality control answers the question “Did my ChIP work?” for ChIP-seq sample analysis. The tool processes indexed BAM files, creating a profile of cumulative read coverage. Reads in a specified window (bin) are counted, sorted, and plotted as a cumulative sum.

The Correlation Matrix tool analyzes and visualizes sample correlations using multiBamSummary or multiBigwigSummary output. It supports Pearson or Spearman methods to calculate correlation coefficients.

Homer Make UCSC files

Wiggle Track Format files are generated from the aligned reads using Homer. The resulting files can be loaded in browsers like IGV or UCSC.

MACS2 call peak

Peaks are called using the MACS2 software. Different calling strategies are used for narrow and broad peaks. The mfold parameter used in the model building step is estimated from a peak enrichment diagnosis run. The estimated mfold lower bound is 10 and the estimated upper bound can vary between 15 and 100.

The default mfold parameter of MACS2 is [10,30].

MACS2 ATAC-seq call peak

Assay for Transposon Accessible Chromatin (ATAC-seq) analysis enables measurement of chromatin structure modifications (nucleosome free regions) on gene regulation. This step involves calling peaks using the MACS2 software. Different calling strategies are used for narrow and broad peaks. The mfold parameter used in the model building step is estimated from a peak enrichment diagnosis run. The estimated mfold lower bound is 10 and the estimated upper bound can vary between 15 and 100.

The default mfold parameter of MACS2 is [10,30].

Homer annotate peaks

The peaks called previously are annotated with HOMER tool using RefSeq annotations for the reference genome. Gene ontology and genome ontology analysis are also performed at this stage.

Homer find motifs genome

In this step, De novo and known motif analysis per design are performed using HOMER.

Annotation Graphs

This step focuses on peak location statistics. The following peak location statistics are generated per design: proportions of the genomic locations of the peaks. The locations are: Gene (exon or intron), Proximal ([0;2] kb upstream of a transcription start site), Distal ([2;10] kb upstream of a transcription start site), 5d ([10;100] kb upstream of a transcription start site), Gene desert (>= 100 kb upstream or downstream of a transcription start site), Other (anything not included in the above categories); The distribution of peaks found within exons and introns; The distribution of peak distance relative to the transcription start sites (TSS); the Location of peaks per design.

Run SPP

This step runs spp to estimate NSC and RSC ENCODE metrics. For more information - see quality enrichment of ChIP sequence data, phantompeakqualtools.

Differential Binding

Differential Binding step is meant for processing DNA data enriched for genomic loci, including ChIP- seq data enriched for sites where specific protein binding occurs, or histone marks are enriched. It uses DiffBind package that helps in identifying sites that are differentially bound between sample groups.

GenPipes Chipseq pipeline performs differential binding based on the provided treatments and controls as per the particular comparison specified in the design file. The differential analysis results are generated separately for each specified comparison in the Design File, with correctly specified treatments (2) and controls (1) samples.

Samples with ‘0’ (zero) are ignored during the comparison. For details regarding how to specify sample group membership in the design file, refer to Design File Format details.

For comparison, at least two samples for each group must be specified. If two samples per group are not specified, the differential binding step will be skipped during the pipeline run.

Warning

Incorrect Design File Format

If the specified design file does not follow the specified Design File Format for ChIPseq pipeline, the differential binding step will be skipped during the pipeline run.

The output of differential analysis containing differentially bound peaks are saved as a TSV. In addition, for each comparison done using DiffBind, an html report is also generated for QC differential analysis.

Note

Limitation of Differential Binding

Differential Binding analysis currently supports pair-wise (two groups) comparisons only. If you have a more complex experimental design, you may manually conduct the analysis.
The Chip Sequencing protocol expects an input for the differential binding step. If pipeline users want to run this protocol without an input, they should skip the Differential Binding step and run it themselves locally.

IHEC Metrics

This step generates IHEC’s standard metrics.

MultiQC Report

A quality control report for all samples is generated. For more detailed information see MultiQC documentation.

Bedtools Blacklist Filter

Remove reads in blacklist regions from BAM with Bedtools intersect, if blacklist file is supplied, otherwise, do nothing.

GATK Haplotype Caller

Use GATK Haplotype caller for SNPs and small indels.

Merge & Call Individual GVCF

Merges the GVCFs of haplotype caller and also generates a per sample VCF containing genotypes.

CRAM Output

Generate long term storage version of the final alignment files in CRAM format. Using this function will include the original final BAM file into the removable file list.

About

Chromatin Immunoprecipitation (ChIP) sequencing technique is used for mapping DNA-protein interactions. It is a powerful method for identifying genome-wide DNA binding sites for transcription factors and other proteins.

The ChIP-Seq workflow is based on the ENCODE Project workflow. The pipeline starts by trimming adapters and low quality bases and mapping the reads (single end or paired end ) to a reference genome using Burrows-Wheeler Aligner (BWA). Reads are filtered by mapping quality and duplicate reads are marked. It creates tag directories using Homer routines. Peaks are called using Model based Analysis for Chip Sequencing (MACS2) and annotated using Homer. Binding motifs are also identified using Homer. Metrics are calculated based on IHEC requirements.

The ChIP-Seq pipeline can also be used for assay for transposase-accessible chromatin using sequencing ATAC-Seq samples. Use ‘ATAC’ protocol option to run the pipeline using ATAC-Seq.

See Schema tab for the pipeline workflow. Check the README.md file for implementation details.

ChIP-Seq-HL — Figure: Schematic representation of major methods used to detect functional elements in DNA (Source PLOS)

References