Chromatin Immunoprecipitation Sequencing Analysis Report
Contract Information | Contract Content |
---|---|
Contract ID | H201SC21103272 |
Batch ID | H201SC21103272 |
Species and Version | Homo_sapiens |
Report Time | 2021-12-13 |
I.Background
Chromatin Immunoprecipitation Sequencing (ChIP-Seq) provides genome-wide profiling of DNA targets for histone modifications, transcription factors, and other DNA-associated proteins. It combines the selectivity of chromatin immuno-precipitation (ChIP) for recovering specific protein-DNA complexes, with the power of next-generation sequencing (NGS) for high-throughput sequencing of the recovered DNA. Additionally, because the protein-DNA complexes are recovered from living cells, binding sites can be compared in different cell types and tissues, or under different conditions. Applications range from transcriptional regulation to developmental pathways to disease mechanisms and beyond.
Novogene experienced scientists offers comprehensive analysis solutions to meet your project needs by ensuring data and analysis quality that is publication ready.
II. Experimental Procedure
From DNA samples to final data report, each step of sample QC, library preparation and sequencing may have an impact on the quality and quantity of data output, and the quality of data will directly affect the results of subsequent information analysis. Therefore, obtaining high-quality data is the premise to ensure that bioinformatics analysis is correct, comprehensive and credible. In order to ensure the accuracy and reliability of sequencing data from the beginning, Novogene strictly checks every experimental step and fundamentally ensures the output of high-quality data. The workflow is as follows:
Project Workflow
1 DNA sample test
The DNA samples are immunoprecipitated and purified on the end of client. Novogene detection of these IPed DNA samples mainly includes two steps:
(1) Qubit accurately quantifies DNA concentration, based on which the total amount was calculated.
(2) Agarose gel electrophoresis analyses of DNA fragments size distribution and detects any contamination.
2 Library Preparation for Sequencing
After the DNA samples quality control:
(1) The DNA fragments are repaired, dA-tailed.
(2) The DNA fragments with A tail are ligated to sequencing adaptors.
(3) The final DNA library is obtained by size selection and PCR amplification.
The workflow of library construction is as follows:
Library Construction Workflow
3 Quality control of Library
After the construction of the library, the initial quantification was done with Qubit 2.0, and the library is diluted to 1ng/l. Then the insertion size of the library is detected with NGS3K. If meeting the expectation, the accurate concentration of the library is quantified by Q-PCR (library effective concentration > 2nM) to ensure the accurate molar amount that will pooled for sequencing.
4 Sequencing
After library quality control, sequencing is performed for different libraries according to the concentration and the demand of data amount on Illumina NovaSeq platform. The basic principle of sequencing is sequencing by synthesis. Four kinds of fluorescent labeled dNTP, DNA polymerase and adapter primers are added to the flow cell for amplification. When each sequence cluster extends the complementary chain, each fluorescent labeled dNTP releases the corresponding fluorescence. The sequencer captures the fluorescent signal and converts the light signal into the sequencing peak through computer software to obtain the sequence information of the fragment to be detected. The sequencing process is shown in the following figure.
The sequencing principle and process
III. Results
The bioinformatic analysis workflow of raw sequencing data is as follows:
1 Data quality control
The original raw data from Illumina platform are transformed to Sequenced Reads, known as Raw Data or RAW Reads, by base calling of CASAVA. Raw data are recorded in a FASTQ file, which contains sequencing reads and corresponding sequencing quality. Every read in FASTQ format is stored in four lines, as indicated below:
@HWI-ST1276:71:C1162ACXX:1:1101:1208:2458 1:N:0:CGATGT
NAAGAACACGTTCGGTCACCTCAGCACACTTGTGAATGTCATGGGATCCAT
+
#55???BBBBB?BA@DEEFFCFFHHFFCFFHHHHHHHFAE0ECFFD/AEHH
Line 1 begins with a '@' character and is followed by the Illumina Sequence Identifiers and an optional description.
Identifier | Meaning |
---|---|
HWI-ST1276 | Instrument – unique identifier of the sequencer |
71 | Run number – Run number on instrument |
C1162ACXX | Flowcell ID - ID of flowcell |
1 | LaneNumber - positive integer |
1101 | TileNumber - positive integer |
1208 | X - x coordinate of the spot. Integer which can be negative |
2458 | Y - y coordinate of the spot. Integer which can be negative |
1 | ReadNumber - 1 for single reads, 1 or 2 for paired ends |
N | whether it is filtered - NB: Y if the reads is filtered out, not in the delivered fastq files, otherwise N |
0 | Control number - 0 when none of the control bits are on. Otherwise it is an even number |
CGATGT | Illumina index sequences |
Line 2 is the raw sequence reads.
Line 3 begins with a '+' character and is optionally followed by the same sequence identifiers and descriptions as in Line 1.
Line 4 encodes the quality values for the sequence in Line 2 and must contain the same number of characters as the bases in the sequence (Cock, 2009.).
The details of Illumina Sequence Identifiers are as follows:
(1) HWI-ST1276:71 HWI-ST1276, Instrument - unique identifier of the sequencer; 71, run number - Run number on instrument.
(2) C1162ACXX:1:1101:1208:2458 refers to the coordinate of read on C1162ACXX (Flowcell ID) flowcell, line 1, 1101 tile is(x=1208, y=2458).
(3) 1:N:0:CGATGT the first number is 1 or 2, 1 refers to a single reads or the first read of paired ends, 2 refers to the second of paired ends; the second letter refers to whether reads is adjusted(Y means yes, N means no); the third number represent the number of Control Bits in sequence; six bases on the fourth place is Illumina index sequence.
Sequencing process itself has the possibility of machine errors. Sequencing error rate distribution can reflect the quality of sequencing data. The quality value of each base in sequence information is stored in the FASTQ file. Quality value of the reads represented by Qphred and the “e” represents the sequence error rate, Qphred=-10log10(e). The relationship between sequencing error rate (e) and sequencing base quality value (Qphred) is as following(Base Quality and Phred score relationship with the Illumina CASAVA v1.8 software).
Phred Score | Error Rate | Correct Rate | Q-score |
---|---|---|---|
10 | 1/10 | 90% | Q10 |
20 | 1/100 | 99% | Q20 |
30 | 1/1000 | 99.9% | Q30 |
40 | 1/10000 | 99.99% | Q40 |
(1) Due to the gradual consumption of reagents during sequencing, the sequencing error rate increases with the length of reads, which is a common feature of the Illumina high-throughput sequencing platform (Erlich Y, Mitra PP et al. 2008; Jiang L, Schlesinger F Et al.2011.).
(2) The first six bases have a relatively high error rate due to the incomplete binding of random hexamers used in priming cDNA synthesis (Jiang et al.). In general, a single base error rate should be lower than 1%.
1.1 FastQC of Raw_reads
The massive raw data obtained by high-through sequencing technology must be qualified to guarantee the downstream bioinformatic data analysis. The popular tool for the quality control is software FastQC. Novogene uses it to get the basic quality of the raw reads, that is shown in the following (Fig 1-1):
Figure 1-1 raw reads of FastQC
Top-left: horizontal axis represents position in read (bp), vertical axis represents quality scores. Different colors represent different quality range;
Top-middle: horizontal axis represents position in read (bp), vertical axis represents base percentage;
Top-right: the distribution of read length;
Bottom-left: GC distribution over all sequences. Red line is the actual GC content distribution, blue line is the theoretical distribution;
Bottom-middle: N-ratio across all bases;
Bottom-right: x-axis is the sequence duplication level. y-axis is the percent of the deduplicated reads.
1.2 Raw Data Trimming
For the raw data which pass the quality control, primer mismatch may result in nucleotidic composition bias at first several positions of the reads, which could lead to wrong bases insertion during sequencing. Because of the small fragment size of ChIP-Seq, there are often some adapter-appended reads. Trimming off the adapter sequences and low-quality bases is necessary. To ensure the quality of data analysis, raw data need to be filtered to get clean data, and all the follow-up analysis are based on these clean data.
The procedure for data trimming is in the following:
(1) Discard the reads with low quality (proportion of low quality bases larger than 50%);
(2) Discard the reads with N ratio (unsure base) larger than 15%;
(3) Discard the reads with adaptor at the 5’-end;
(4) Discard the reads without adaptor and inserted fragment at the 3’-end;
(5) Trim the adapter sequence at the 3’-end;
(6) Discard the reads whose length are less than 18nt after trimming.
Table 1-1 Summary of raw data quality control
Sample | Raw_reads | Low_quality | Degeneratives | Empty | Too_short | Trimmed | Untrimmed | Clean_reads | Clean_rate |
---|---|---|---|---|---|---|---|---|---|
si_Ctrl | 23999858 | 161 | 0 | 0 | 2 | 4585486 | 19414209 | 23999695 | 100.00% |
si_Ctrl_in | 24730333 | 165 | 0 | 0 | 4 | 4069556 | 20660608 | 24730164 | 100.00% |
si_SIRT3 | 20653181 | 118 | 0 | 0 | 0 | 3370916 | 17282147 | 20653063 | 100.00% |
si_SIRT3_in | 21192357 | 180 | 0 | 0 | 0 | 4196923 | 16995254 | 21192177 | 100.00% |
Raw_reads: reads from the base-calling. Click on the number to check the result of FastQC;
Low_quality: reads with mean quality lower than 20 before trimming;
Degeneratives: reads with at least 15% N before trimming;
Empty: reads with all bases from adapter (s);
Too_short: reads shorter than 18nt that are discarded after trimming;
Trimmed: reads with at least 18nt that are kept after trimming;
Untrimmed: reads that are kept untrimmed;
Clean_reads: kept reads after trimming, including both trimmed and untrimmed. Click on the number to check the result of FastQC;
Clean_rate: the ratio of Clean_reads to Raw_reads.
1.3 FastQC of Clean_reads
Quality control of the clean reads after trimming is in the following:
Figure 1-2 FastQC of Clean_reads
Top-left: horizontal axis represents position in read (bp), vertical axis represents quality scores, different colors represent different quality range.
Top-middle: horizontal axis represents position in read (bp), vertical axis represents base percentage.
Top-right: the distribution of read length.
Bottom-left: mean GC content (%). Red line is the actual GC content distribution, blue line is the theoretical distribution.
Bottom-middle: N-ratio across all bases.
Bottom-right: x-axis is the sequence duplication level. y-axis is the percent of the deduplicated reads.
2 Mapping
2.1 Summary of mapping
The common tools for mapping are Bowtie, BWA, MAQ, TOPhat, etc. We choose proper softwares and parameters according to different genome characters to do the genome mapping analysis for the filtered reads. Considering the small fragment size of ChIP-Seq, and the percentage of the unique sequence in the total sequence is the most important information, thus, we can map the reads to the reference genome using BWA much more accurately (Li, H. and R. Durbin, 2009). The summary of mapping is in the following table:
Table 2-1 Summary of mapping
Sample | Reads | Clean_reads | Mapped | Unique_mapped | Dup_Unique_mapped |
---|---|---|---|---|---|
si_Ctrl_in | pair | 24730164 | 24111435(97.50%) | 22429357(93.02%) | 3681605(16.41%) |
si_Ctrl_in | read1 | 24730164 | 24730042(100.00%) | 22747703(91.98%) | 3714065(16.33%) |
si_Ctrl_in | read2 | 24730164 | 24730031(100.00%) | 22661608(91.64%) | 3710183(16.37%) |
si_Ctrl | pair | 23999695 | 23298743(97.08%) | 21588053(92.66%) | 3232562(14.97%) |
si_Ctrl | read1 | 23999695 | 23999631(100.00%) | 21912150(91.30%) | 3256982(14.86%) |
si_Ctrl | read2 | 23999695 | 23999591(100.00%) | 21761380(90.67%) | 3250350(14.94%) |
si_SIRT3_in | pair | 21192177 | 20572640(97.08%) | 19151868(93.09%) | 2747511(14.35%) |
si_SIRT3_in | read1 | 21192177 | 21192030(100.00%) | 19485015(91.95%) | 2771941(14.23%) |
si_SIRT3_in | read2 | 21192177 | 21191945(100.00%) | 19336585(91.24%) | 2766201(14.31%) |
si_SIRT3 | pair | 20653063 | 20135824(97.50%) | 18631410(92.53%) | 3372619(18.10%) |
si_SIRT3 | read1 | 20653063 | 20652957(100.00%) | 18842382(91.23%) | 3389344(17.99%) |
si_SIRT3 | read2 | 20653063 | 20652920(100.00%) | 18741639(90.75%) | 3384869(18.06%) |
Unique_mapped: reads with MAPQ (Li and Ruan et al., 2008) no lower than 13; can be interpreted as the chance of non-accurate mapping (same score for the random mapping) is 0.05.
Duplicates: the reads mapped to the exact same position of the genome;
Mapped is relatively to Clean_reads;
Unique_mapped is relatively to Mapped;
Dup_Unique_mapped is relatively to Unique_mapped.
2.2 MAPQ
The most important thing during ChIP-Seq analysis is the percentage of the unique sequence in the total sequence number. Duplicates are labeled using SAMBLASTER (Faust and Hall, 2014) and mapping quality value is calculated by MAPQ. Proper quality value is chosen as the only threshold for mapping. Here we choose 13 as the threshold, which refer to that the mapping chance of the accordingly non-unique region is only 0.05. Only one read pair is kept for the duplicates in the followed peak calling.
Figure 2-1 Distribution of MAPQ
Horizontal axis is MAPQ, vertical axis is the reads count.
2.3 Genome-wide distribution of the mapped reads
Summary of the density of total mapped reads in different chromosomes (plus and minus) is shown in the followed figure. With 5k slide window size, the medium of the number of reads mapped to each base within window is calculated, and converted to log2. The longer the whole chromosome is, the more reads are mapped (Marquez et al. 2012). From the correlation of the number of the mapped reads and the length of the chromosome in the figure, we can see the correlation of the total number of the reads and the length of the chromosome much easier.
Figure 2-2 Genomewide distribution of the mapped reads
Horizontal axis represents the postition of the chromosome, vertical axis represents the number of the reads mapped to 1000nt window size. Here is the unique mapping and deduplication results.
2.4 Distribution of the reads mapped to the gene
Since the binding sites of transcription factor and histone protein are important for gene regulation, analysis of relative mapping position distribution can help us predict the protein function. Each gene and its 2kb upstream and 2kb downstream are divided into 100 equal parts. Reads density are calculated in each part by the mapped reads and the ratio of the reads to total reads.
Figure 2-3 Distribution of the reads mapped to gene
Horizontal axis: relative position of the gene. Vertical axis:reads density.
2.5 Sample correlation detection
Biological replicate is necessary for every experiment, same for high-throughput technology (Hansen et al.). Biological replicate mainly has two applications: One is to prove that the biological experiment can be replicated and there is no large variance. The other one is to make sure that following differential gene analysis can get reliable results. The correlation among samples is an important index to see whether the experiment design is reliable and whether the sampling is right. The correlation coefficient is much closer to 1, the similarity of the reads distribution pattern among samples is much higher.
Figure 2-4 Spearman test among samples
Heat-map of sample correlation test, correlation coefficient among samples (Spearman correlation coefficient). The darkness of the color represents how large is the correlation coefficient.
2.6 Visualization of pileup signal
We provide the visualization results of genome wide reads mapping in bam format. IGV (Integrative Genomics Viewer) browser is recommended to view the bam file. IGV browser has the following characters: (1) can reveal single or multiple mapping positions in the genome in different scales, including the distribution of the reads in different chromosomes, and the distribution of annotated exons, introns, splicing junctions and inter-gene region; (2) can reveal reads abundance in different region under different scales; (3) can reveal the annotation information of the gene and alternative splicing isoforms; (4) can reveal other annotation information; (5) can download annotation information from remote and local server. Please check the IGV manual for detail instruction (IGVQuickStart.pdf).
Figure 2-5 Visualization of pileup signal by IGV(demo)
3 Fragment size prediction
3.1 Summary of fragment size
For a specific binding site, there is a significant reads enrichment in the binding site. For single-end sequencing, we use MACS2 software to predict the frag_sizes of IP experiment. The whole genome is scaned using certain window size and the enrichment level of the reads in each window is calculated. Then proper windows (eg.1000) is extracted as the samples to build the enrichment model to predict the length of frag_sizes. For double-end sequencing, we use RSeQC software to predict the frag_sizes for the mapping results. The predicted frag_sizes are used for later peak calling.
Table 3-1 Summary of Fragment size
Sample | frag_sizes_length | infor |
---|---|---|
si_Ctrl | 303 | Calculated_from_BAM |
si_Ctrl_in | 281 | Calculated_from_BAM |
si_SIRT3 | 299 | Calculated_from_BAM |
si_SIRT3_in | 272 | Calculated_from_BAM |
Sample: sample name
frag_sizes_length: frag_sizes length
infor: default parameter
3.2 K distribution of fragment size
Length distribution is analyzed by kernel density method and shown below:
Figure 3-1 frag_size distribution Figure
Single end:
left: horizontal axis is the distance to the center of peak model. vertical axis is the percentage of the reads chosen for modeling.
Red color represents plus strand, blue color represents minus strand.
Right: horizontal axis is the distance to the middle of the peak model. vertical axis is the correlation between plus strand and minus strand. The distance from the red dash line to the middle is the predicted frag_sizes.
Double end:
horizontal axis is the length of the predicted length of frag_sizes, vertical axis is the value of kernel density.
4 Strand cross correlation
4.1 Summary of strand cross correlation
As we know that the sequenced reads will be approximately evenly distributed on the plus and minus strands. By calculating the correlation of plus and minus strand (SCC), we can get the distance between two strands. Through the SCC analysis of IP and input data, we can not only obtain the correlation coefficient between plus and minus strands, but also test the effect of IP experiment.
Table 4-1 Summary of strand cross correlation
Sample | Median_read_length | Predicted_fragment_length | CC_min | CC_read_length | CC_fragment_length | NSC | RSC |
---|---|---|---|---|---|---|---|
si_Ctrl_in | 150 | 235 | 0.2487 | 0.2658 | 0.2703 | 1.0870 | 1.2676 |
si_Ctrl | 150 | 245 | 0.2268 | 0.2426 | 0.2462 | 1.0855 | 1.2287 |
si_SIRT3_in | 150 | 230 | 0.2112 | 0.2361 | 0.2333 | 1.1045 | 0.8870 |
si_SIRT3 | 150 | 225 | 0.2015 | 0.2122 | 0.2222 | 1.1025 | 1.9312 |
Sample: sample name
Median_read_length: mean value of reads length
Predicted_fragment_length: predicted length of fragment sizes
CC_min: the lowest SCC
CC_read_length: the SCC of the longest reads
CC_fragment_lengt: fragment sizes relative SCC
NSC: normalized strand coefficient
RSC: relative strand correlation
4.2 Plots of strand cross correlation
The normalized SCC curve for all the samples is shown below:
Figure 4-1 Plots of strand cross correlation
Horizontal axis is the length of strand shift. Vertical axis is the value of normalized strand cross correlation. Different color represents different samples.
4.3 SCC distribution between experimental groups
The ratio of the CC (Cross Correlation) value of the same experimental group (including IP and Input) to the lowest CC value of the whole SCC curve (NSC) should be no less than 1.05. Besides, there is another shadow peak in the reads which is resulted from the sequencing read length. RSC value refers to the ratio of two differences: the difference between the CC value of fragment length and the lowest CC value, and the difference between CC value in the shadow peak and the lowest CC value.
Figure 4-2 SCC curve
Horizontal axis is the lag between plus and minus tags during Pearson Correlation Coefficient calculation. Vertical axis is the Pearson Correlation Coefficient.
5 Peak Calling
The annotation of transcription factor binding sites, histone binding sites is the important information for understanding the regulation mechanism and function. By mapping to the reference genome we can obtain the information of protein-DNA binding sites directly. By making use of MACS2 software (Yong Zhang, Tao Liu et al., 2008) (threshold q value = 0.05) to finish the peak calling, we can calculate the number of peaks, the peak width and its distribution, and find the peak related genes.
5.1 Summary of peak calling
Table 5-1 Summary of peak calling
Experiment | IP | Input/Mock | Fragment_length | Count_of_narrow_peak | FRiP | Count_of_summits |
---|---|---|---|---|---|---|
CTRL | si_Ctrl | si_Ctrl_in | 303(preset) | 2221 | 0.3425% | 2225 |
KD | si_SIRT3 | si_SIRT3_in | 299(preset) | 2291 | 0.4882% | 2299 |
Experiment: experimental group name(one ChIP Experiment includes an IP and a control, eg. Input or Mock or no control);
EIP:experiment name after chip handling;
EInput/Mock: control group;
EFragment_length: frag sizes length
ECount_of peak: the number of peak (narrow). If would like to test broad peak, need annotate in the information collection form;
EFRiP: the ratio of the numer of the reads in the peak to the total reads, which can test the effect of IP experiment;
ECount_of_summits: number of summits. Some peaks can have multiple summits due to close position.
5.2 Genome wide distribution of peaks
The summary of genome wide distribution of peaks is shown in the figure below. From the number of the peaks mapping to the chromosome and its distribution can reflect the distribution of the protein binding sites. When the number of the chromosomes with peaks is larger than 30, only 15 chromosomes are shown.
Figure 5-1 Genome wide distribution of peaks
Horizontal axis is the coordinate of the peaks in the chromosome. Vertical is the chromosomes. Every blue bar represents a peak.
5.3 Distribution of peak width
The peak width represents the length of the DNA that is bound by the studied protein. The number of peaks of different peak width is calculated, as shown below.
Figure 5-2 Distribution of peak width
Horizontal axis is the width of the peak (nt), vertical axis is the number of the corresponding peaks.
5.4 Distribution of fold enrichment
The fold enrichment value is the digital display of the peak signal during peak calling, which can be called signal value. The larger the value, the more reads are enriched to this peak. The peak number of different fold changes is calculated. The distribution of fold enrichment is shown below:
Figure 5-3 Distribution of fold enrichment
Horizontal axis represents the enrichment fold change of the peak. Vertical axis represents the number of the peaks.
5.5 Distribution of q values
The significance of the peak is the indication of the credibility of the peak, so the q value for eah peak is calculated. The peak number of differnet q values is calculated and the distribution is shown below:
Figure 5-4 Distribution of q values
Horizontal axis represents -log10 q value of the peak; Vertical axis represents the number of the peaks.
5.6 Count of summits in peaks
The number of summits in each peak is analysized, which infers the peak type in the IP experiment.
Figure 5-5 Count of summits in peaks
Horizontal axis represents the number of summit in each peak. Vertical axis is the corresponding number of the peak.
5.7 Summits distribution
Each peak is divided to 100bp windows and the summits in each window of all the peaks are counted as shown below:
Figure 5-6 Percentile position of summits in peaks
Horizontal axis is the position of sumits, vertical axis is the count of the summits.
6 Motif analysis
The binding of protein, such as transcription factor, histone etc. to DNA is not random, but sequence preference. MEME(Timothy L. Bailey and Charles Elkan, 1994)and Dreme (Timothy L. Bailey, 2011) softwares is used to detect significant motif sequence in the peak. Tomtom (Shobhit Gupta, JA Stamatoyannopolous, 2007) software is used to annotate the motif by mapping it to the annotated Motif database. We use sequence logo to show the base bias in different position in the binding sites in long Motif (8~30) (Fig 6-1) and short Motif (~8) (Fig 6-2). (Note: because of binding site specification, one motif sequence can be only shown in one section (<= 8 or >= 9), and parts of the figure below may have empty results).
6.1 Motif searching
Figure 6-1 long conservative sequence
Logos are listed in order. The motif in the right is the reverse complementary sequence of the left one.
Figure 6-2 short conservative sequence
Logos are listed in order. The motif in the right is the reverse complementary sequence of the left one.
6.2 MEME motif annotation
Tomtom is used to annotate the motif sequences detected by MEME. (Note: since the conservative sequence in the binding site is short, motif can be only in one region (<= 8), resulting in no result for meme test).
Table 6-1 MEME detection Motif calculation
#Query_ID | Target_ID | Optimal_offset | p-value | E-value | q-value | Overlap | Query_consensus | Target_consensus | Orientation |
---|---|---|---|---|---|---|---|---|---|
3 | M6159_1.02 | -12 | 0.000693386 | 0.508946 | 0.560319 | 16 | TAAAAACTAGACAGAAGCATTCTCAGAA | AAAAGCTTTCTAGGAA | - |
3 | M2283_1.02 | -1 | 0.000763378 | 0.560319 | 0.560319 | 15 | TAAAAACTAGACAGAAGCATTCTCAGAA | CAAAAGTAAACAAAG | + |
4 | M5320_1.02 | -16 | 0.000202653 | 0.148747 | 0.297494 | 14 | GGACCGCTTTGAGGCCTTCGTTGGAAACGG | TTCGTTGTATGCGGG | - |
4 | M6499_1.02 | -18 | 0.000908398 | 0.666764 | 0.455529 | 9 | GGACCGCTTTGAGGCCTTCGTTGGAAACGG | CGTGGGAAA | + |
4 | M5882_1.02 | -7 | 0.000930918 | 0.683294 | 0.455529 | 20 | GGACCGCTTTGAGGCCTTCGTTGGAAACGG | TTTCACACCTAGGTGTGAAA | - |
6 | M6336_1.02 | -10 | 0.000141746 | 0.104042 | 0.207847 | 16 | CCAGCAACTTGGGAGGCTGAGGCAGG | GGGGGGGGGAGGGAGGG | + |
#Query ID: detected motif;
Target ID: known motif ID in the database;
Optimal offset: the number of lag bases;
p-value:probability of MCMC;
E-value:false positive probability;
q-value: FDR value;
Overlap: overlapping base pair between two sequences;
Query consensus: detected motif sequence;
Target consensus: motif sequence in the target database;
Orientation: plus or minus strand for the target sequence.
The sequence logo below shows the comparison results between MEME detected motif and known motif. The result is in the following:
Figure 6-3 MEME motif annotation
6.3 DREME motif annotation
DREME is a motif discovery algorithm specifically designed to find the short, core DNA-binding motifs of eukaryotic TFs. (Note: Since the conservative sequence in the binding sites is long, the motif could only show up in one region (<= 8), which could show empty result with Dreme detection)
Table 6-2 Summary of motif detected by Dreme
#Query_ID | Target_ID | Optimal_offset | p-value | E-value | q-value | Overlap | Query_consensus | Target_consensus | Orientation |
---|---|---|---|---|---|---|---|---|---|
TCTCAGWA | M6496_1.02 | 2 | 0.00133367 | 0.978915 | 1 | 8 | TCTCAGAA | TTTCCCAGAAAAG | + |
CACTTSCA | M2306_1.02 | 2 | 0.000977487 | 0.717475 | 0.519417 | 8 | CACTTGCA | TTCATTTGCATAT | + |
CYACAAAA | M6459_1.02 | 2 | 0.000484201 | 0.355404 | 0.710808 | 8 | CTACAAAA | AACCACAAACCCCA | + |
CGCTTTS | M6506_1.02 | 1 | 0.000166878 | 0.122488 | 0.163243 | 7 | CGCTTTG | GCGCTTTGTTCT | + |
CGCTTTS | M6476_1.02 | 0 | 0.000222401 | 0.163243 | 0.163243 | 7 | CGCTTTG | CGCTTTGTTCTC | - |
CGCTTTS | M6314_1.02 | 5 | 0.000837945 | 0.615052 | 0.410035 | 7 | CGCTTTG | AGTTTCGCTTTC | - |
#Query ID: detected motif ID;
Target ID: known motif ID in the database;
Optimal offset: lag base number;
p-value: probability of MCMC;
E-value: false positive probability;
q-value: FDR value;
Overlap: overlapping base between two sequences;
Query consensus: detected motif base sequence;
Target consensus: motif base sequence in the target database;
Orientation: plus or minus tag for the targeted sequence.
The sequence logo below shows the comparison result between detected motif with Dreme and known motif:
Figure 6-4 Dreme motif annotation
7 Peak annotation
7.1 Peak-TSS distance
Peak-TSS distance distribution can be used to predict the protein binding sites. On the one hand, the effect of IP assay can be detected according to the protein binding sites; on the other hand, the regulation mode or function of the protein can be predicted through the characteristics of protein binding. TSS (transcription start site) of every peak related gene is detected by PeakAnnotator (Salmon-Divon and Dvinge et al., 2010). Peak-TSS distance distribution is analyzed according to the distance between the peak and TSS.
Figure 7-1 Distribution of Peak-TSS Distance
Horizontal axis represents the distance from peak to TSS. Vertical axis represents the number of the peak.
7.2 Distribution of peaks in functional regions
The figure below shows the distribution of peaks in different functional regions.
Figure 7-2 Peak distribution in functional gene region
Distribution of peak in different functional area. Horizontal axis represents different functional area, vertical axis represents the ratio of the peak in the functional region to the total peaks. The number on the top of the functional region represents peak number.
U2000 means 2000bp in the upstream region, D2000 means 2000bp in the downstream region;
CDSu2K and CDSd2K means upstream and downstream 2kb of CDS;
TSS100, Stop100, start100 means 100bp centered with TSS,TTS, Start-codon and stop-codon.
7.3 GO enrichment analysis
Gene Ontology (GO, http://www.geneontology.org/) is a major bioinformatics initiative to unify the representation of gene and gene product attributes across all species. More specifically, the project aims to: 1) maintain and develop its controlled vocabulary of gene and gene product attributes; 2) annotate genes and gene products, and assimilate and disseminate annotation data; and 3 provide tools for easy access to all aspects of the data provided by the project, and to enable functional interpretation of experimental data using the GO, for example via enrichment analysis. GO analysis covers three aspects: Molecular Function, Biological Process and Cellular Component.
Results of GO enrichment analysis of the peak related gene are showed below:
Table 7-1 peak related gene GO enrichment
GO | Description | Term_type | Overrepresented_pValue | Corrected_pValue | Gene_item | Gene_list | Bg_item | Bg_list | genes |
---|---|---|---|---|---|---|---|---|---|
GO:0007268 | synaptic transmission | biological_process | 1.6132e-07 | 0.0013689 | 32 | 317 | 713 | 19782 | ...... |
GO:0050804 | modulation of synaptic transmission | biological_process | 1.895e-07 | 0.0013689 | 24 | 317 | 439 | 19782 | ...... |
GO:0051960 | regulation of nervous system development | biological_process | 4.3787e-07 | 0.0018768 | 37 | 317 | 940 | 19782 | ...... |
GO:0007416 | synapse assembly | biological_process | 5.1969e-07 | 0.0018768 | 14 | 317 | 167 | 19782 | ...... |
GO:0008045 | motor neuron axon guidance | biological_process | 0.00018923 | 1 | 5 | 306 | 35 | 19782 | ...... |
GO:0045720 | negative regulation of integrin biosynthetic process | biological_process | 0.000238 | 1 | 2 | 306 | 2 | 19782 | ...... |
GO_accession: the unique GO ID;
Description: Function description;
Term_type: Including cellular_component, biological_process and molecular_function;
Over_represented_pValue: significance of enrichment analysis;
Corrected_pValue: P-Value after correction, normally, with padj< 0.05 are enriched;
Gene_with_peak_item: the number of peak related genes with this GO term;
Gene_with_peak_list: the number of peak related gene;
Bg_item: the number of background gene with this GO term;
Bg_list: the number of background genes;
Genes: Annotated peak related gene ID (not show in this table);
Choose the first five results for each experiment.
Below is the GO enrichment bar of peak overlapping genes. It is directly reflect the distribution of the number of the peak overlapping genes that are enriched in the biological process, cellular component and molecular function.
Figure 7-3 GO enrichment
Bar: horizontal axis is the number of the peak overlapping gene. Different color is used to differentiate biological process, cellular component and molecular function.
DAG figure: every note represents a GO project. Rectangular represents top 10 enrichment GO. Darkness represents the enrichment level. The darker of the color, the higher of the enrichment.
The name of the project and the significance (padj) are shown in every note.
7.4 KEGG enrichment analysis
Different genes coordinate to each other to realize the function in organism. Kyoto Encyclopedia of Genes and Genomes (KEGG, http://www.kegg.jp/) is a collection of manually curated databases, containing resources onf genomic, biological-pathway and disease information (Kanehisa,2008). Pathway enrichment analysis identifies significantly enriched metabolic pathways or signal transduction pathways associated with peak overlapping genes, comparing the whole genome background.
Table 7-2 KEGG enrichment list
Term | Database | ID | Input number | Background number | P-Value | Corrected P-Value | Input | Hyperlink |
---|---|---|---|---|---|---|---|---|
Parkinson's disease | KEGG PATHWAY | hsa05012 | 28 | 142 | 1.26169066225e-15 | 1.61496404767e-13 | ...... | ...... |
Oxidative phosphorylation | KEGG PATHWAY | hsa00190 | 25 | 133 | 1.08519681653e-13 | 9.26034616775e-12 | ...... | ...... |
Metabolic pathways | KEGG PATHWAY | hsa01100 | 73 | 1243 | 6.00951341499e-11 | 3.84608858559e-09 | ...... | ...... |
Cardiac muscle contraction | KEGG PATHWAY | hsa04260 | 17 | 78 | 1.25842785933e-10 | 6.44315063977e-09 | ...... | ...... |
Ribosome | KEGG PATHWAY | hsa03010 | 65 | 138 | 6.92547597193e-55 | 1.64133780535e-52 | ...... | ...... |
Parkinson's disease | KEGG PATHWAY | hsa05012 | 22 | 142 | 5.05587974625e-11 | 5.99121749931e-09 | ...... | ...... |
(1) Term: Description information of KEGG pathway.
(2) Database: KEGG PATHWAY.
(3) ID: The only pathway identifier in the KEGG database.
(4) Input number: the number of peak overlapping gene in the corresponding pathway.
(5) Background number: The number of the genes in the corresponding pathway.
(6) P-value: statistical significant level of enrichment analysis.
(7) Corrected p-value: statistical significant level after correction. Normally, if the corrected P-value < 0.05 this pathway is enriched.
(8) Input: The peak overlapping genes in the corresponding pathway. Because of too much genes, here we ignore it.
(9) Hyperlink: see result file.
Peak overlapping gene KEGG enrichment scatter plot (Figure 3.24) is the visualization of KEGG enrichment analysis. In this figure, the extent of KEGG enrichment is measured by Rich factor, qvalue and the number of the genes that are enriched in this pathway. The top 20 significantly enriched terms in the KEGG enrichment analysis are displayed below. If the enriched pathway is less than 20, all the terms will be displayed.
Figure 7-4 KEGG enrichment scatter
Vertical axis represents pathway name, horizontal axis represents Rich factor, the size of the dot represents the numbers of the overlapping peak genes in the pathway, and the color of the dots corresponds to different qvalue range. The size of the dots represents the number of the genes whose peaks are overlapping in this pathway.
8 Differential analysis
Only when the number of groups is not less than two can the difference analysis be performed between groups.
FoldEnrich is used for differential analysis of peaks in different experimental groups (the ratio of RPM value in group A to group B). Differerntial binding sites are analyzed in different groups by finding differential peak when the ratio of FoldEnrich is larger than 2. We can get the differential binding sites related genes to do the follow-up annotation and enrichment analysis.
8.1 Summary of differential peaks
The Differential Expression Venn diagram illustrates the differential peaks of the compared groups. The sum of all the numbers in the circle represents the total number in the compared groups, and the overlapping part is the common peak number between two compared groups., as shown in the figure below.
Figure 8-1-1 Venn diagram of differential peak in comparable groups
Different colors represent different compared groups (A or B). The sum of the number in the pie is the peak number in this compared group, and the overlapping part is the number of common genes among comparison groups.
Figure 8-1-2 Venn diagram of differential peak in comparable gruops
Since some compared groups (eg. A or B) may include multiple experimental groups (eg. A1, A2, A3), here show the number of peak among experimental groups within one compared group (eg. A1&A2, A1&A3). Different colors represent different experimental groups (A1 or A2). The sum of the number in the pie is the peak number of this experimental group. Single color is group specific peak number. Overlapping part is the common peak number between two experimental groups.
8.2 Enrichment level analysis among different samples
RPM value of the peak in different samples (the ratio of 1M reads that enriched to the peak in a single sample) is used to do clustering analysis to determine the enrichment pattern of the same peak in different samples or the enrichment difference of different peaks in the same sample. At the same time, Enrichment comparison between IP and Input in the group can show the peak enrichment in IP experiment. Hierarchical clustering is done based on RPM value in different samples. Different colors represent different clustering information. The more close between the IP sample enrichment, the more similar function or biological process they have.
Figure 8-2 Enrichment analysis among samples
Top is the clustering among different samples in comparable groups.
Bottom is the correlation among different samples in comparable groups.
8.3 Reads density distribution among comparable groups for the annotated peaks
RPM is the number of reads enriched in peak area per million reads. Boxplot is used to show reads enrichment in the annotated peak area in different samples.
Figure 8-3 Boxplot of RPM value among samples in different groups
8.4 Enrichment level of different experimental peak analysis
FoldEnrich value (the Ratio of RPM value in IP to RPM in input) is used to do the cluster analysis of the peaks from different experimental group. It is used to determine the enrichment pattern of the same peak in different experimental groups, or the enrichment difference of different peaks in the same experimental group. At the same time, according to the comparison of peak enrichment among different experimental groups, we can show the different enrichment of peak between experimental groups, so as to find the different enrichment peak, which is the differential protein binding sites. Hierarchical clustering analysis is done based on the FoldEnrich value from different experimental groups. Different colors represent different clustering information.
Figure 8-4 Peak enrichment analysis among different experimental groups
Top is the clustering analysis of peaks from different experimental groups.
Bottom is the correlation among different experimental groups.
8.5 Reads density distribution among different experimental groups for the annotated peaks
Boxplot is used to show the different FoldEnrich (the ratio of RPM of IP to RPM of Input) i in the peak annotation region in different samples.
Figure 8-5 Reads density distribution in peak annotation region from different experiment was shown by boxplot.
8.6 GO enrichment analysis of differential peak related genes
GO enrichment analysis for the differential peak related genes is shown below:
Figure 8-6 GO enrichment analysis for the differential peak genes
8.7 KEGG enrichment analysis of differential peak related genes
KEGG enrichment analysis for the differential peak related genes is shown below:
Figure 8-7 KEGG enrichment analysis for differential peak genes
IV.Background
1 Experiment and library construction
ChIP-Seq is the efficient way to study protein-DNA interaction, transcription factor binding sites and histone modification in the whole genome (Landt and Marinov et al., 2012). The pipeline of ChIP-Seq and library construction is shown in the following figure. We provide services for every step started from library construction for the moment.
Normally, if you already know some target sequences, you can test the efficiency of the ChIP assay by qPCR before library construction. After sample quality control, library construction and quality control, we proceed it to the follow-up sequencing.
ChIP-seq library construction pipeline
2 Software list
Analysis | Software(version) | Parameters | Remarks |
---|---|---|---|
Trimming | skewer | 0.1.126 | - |
QC | fastqc | - | - |
Mapping | BWA | 0.7.12 | - |
Peak calling | MACS2 | 2.1.0 | - |
Motif prediction | meme | 4.10.2 | - |
Peak annotation | PeakAnnotator_Cpp | 1.4 | - |
Diff analysis | diffbind | - | ip/input>2 |
GO enrichment | Goseq/topGOBioconductor | -2.13 | - |
KEGG enrichment | KOBAS | 3.0 | - |
3 References
Andrews S. (2010). FastQC: a quality control tool for high throughput sequence data. Available online at: http://www.bioinformatics.babraham.ac.uk/projects/fastqc
Ashburner, M. and C. A. Ball, et al. (2000). "Gene ontology: tool for the unification of biology. The Gene Ontology Consortium." Nat Genet 25 (1): 25-9.
Bailey, T. L. and N. Williams, et al. (2006). "MEME: discovering and analyzing DNA and protein sequence motifs." Nucleic Acids Res 34 (Web Server issue): W369-73.
Bailey, T. L. and M. Boden, et al. (2009). "MEME SUITE: tools for motif discovery and searching." Nucleic Acids Res 37 (Web Server issue): W202-8.
Bailey, T. and P. Krajewski, et al. (2013). "Practical guidelines for the comprehensive analysis of ChIP-seq data." PLoS Comput Biol 9 (11): e1003326.
Faust, G. G. and I. M. Hall (2014). "SAMBLASTER: fast duplicate marking and structural variant read extraction." Bioinformatics 30 (17): 2503-5.
Jiang, H. and R. Lei, et al. (2014). "Skewer: a fast and accurate adapter trimmer for next-generation sequencing paired-end reads." BMC Bioinformatics 15: 182.
Kanehisa, M. and S. Goto (2000). "KEGG: kyoto encyclopedia of genes and genomes." Nucleic Acids Res 28 (1): 27-30.
Kanehisa M., M. Araki, et al. (2008). KEGG for linking genomes to life and the environment. Nucleic acids research.(KEGG)
Kent, W. J. and A. S. Zweig, et al. (2010). "BigWig and BigBed: enabling browsing of large distributed datasets." Bioinformatics 26 (17): 2204-7. Available online at: http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/
Kharchenko, P. V. and M. Y. Tolstorukov, et al. (2008). "Design and analysis of ChIP-seq experiments for DNA-binding proteins." Nat Biotechnol 26 (12): 1351-9. Available online at: http://compbio.med.harvard.edu/Supplements/ChIP-seq/
Li, H. and R. Durbin (2009). "Fast and accurate short read alignment with Burrows-Wheeler transform." Bioinformatics 25 (14): 1754-60.
Li, H. and J. Ruan, et al. (2008). "Mapping short DNA sequencing reads and calling variants using mapping quality scores." Genome Res 18 (11): 1851-8.
Li, H. and B. Handsaker, et al. (2009). "The Sequence Alignment/Map format and SAMtools." Bioinformatics 25 (16): 2078-9.
Landt, S. G. and G. K. Marinov, et al. (2012). "ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia." Genome Res 22 (9): 1813-31.
Mao, X. and T. Cai, et al. (2005). "Automated genome annotation and pathway identification using the KEGG Orthology (KO) as a controlled vocabulary." Bioinformatics 21 (19): 3787-93.
Nicol, J. W. and G. A. Helt, et al. (2009). "The Integrated Genome Browser: free software for distribution and exploration of genome-scale datasets." Bioinformatics 25 (20): 2730-1. Available online at: http://bioviz.org/igb/index.html
Peter J.Park(2009). ChIP-seq: advantages and challenges of a maturing technology. Nature Reviews Genetics 10, 669-679.
Ramirez, F. and F. Dundar, et al. (2014). "deepTools: a flexible platform for exploring deep-sequencing data." Nucleic Acids Res 42 (Web Server issue): W187-91.
R Core Team (2015). R: A Language and Environment for Statistical Computing. Available online at: https://www.r-project.org/
Shirley Pepke, Barbara Wold and Ali Mortazavi (2009).Computation for ChIP-seq and RNA-seq studies. Nature methods, VOL.6 NO.11s
Salmon-Divon, M. and H. Dvinge, et al. (2010). "PeakAnalyzer: genome-wide annotation of chromatin binding and modification loci." BMC Bioinformatics 11: 415.
Shobhit Gupta, JA Stamatoyannopolous, Timothy Bailey and William Stafford Noble, "Quantifying similarity between motifs", Genome Biology, 8(2):R24, 2007.
Yong Zhang,Tao Liu et al. (2008). Model-based Analysis of ChIP-Seq (MACS). Genome Biology, 9:R137
Young, M. D. and M. J. Wakefield, et al. (2010). "Gene ontology analysis for RNA-seq: accounting for selection bias." Genome Biol 11 (2): R14. Available online at: https://bioconductor.org/packages/release/bioc/html/goseq.html
Thorvaldsdottir, H. and J. T. Robinson, et al. (2013). "Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration." Brief Bioinform 14 (2): 178-92. Available online at: https://www.broadinstitute.org/igv/
Timothy L. Bailey and Charles Elkan, "Fitting a mixture model by expectation maximization to discover motifs in biopolymers", Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, pp. 28-36, AAAI Press, Menlo Park, California, 1994.
Timothy L. Bailey, "DREME: Motif discovery in transcription factor ChIP-seq data", Bioinformatics, 27(12):1653-1659, 2011.
Young M D, Wakefield M J, Smyth G K, et al. (2010).Gene ontology analysis for RNA-seq: accounting for selection bias. Genome Biology, doi:10.1186/gb-2010-11-2-r14. (GOseq)
Kanehisa, M., M. Araki, et al. (2008). KEGG for linking genomes to life and the environment. Nucleic acids research.(KEGG)