Chromatin Immunoprecipitation Sequencing Analysis Report


Contract InformationContract Content
Contract IDH201SC21103272
Batch IDH201SC21103272
Species and VersionHomo_sapiens
Report Time2021-12-13

Novogene Co.,Ltd.



I.Background

Chromatin Immunoprecipitation Sequencing (ChIP-Seq) provides genome-wide profiling of DNA targets for histone modifications, transcription factors, and other DNA-associated proteins. It combines the selectivity of chromatin immuno-precipitation (ChIP) for recovering specific protein-DNA complexes, with the power of next-generation sequencing (NGS) for high-throughput sequencing of the recovered DNA. Additionally, because the protein-DNA complexes are recovered from living cells, binding sites can be compared in different cell types and tissues, or under different conditions. Applications range from transcriptional regulation to developmental pathways to disease mechanisms and beyond.

Novogene experienced scientists offers comprehensive analysis solutions to meet your project needs by ensuring data and analysis quality that is publication ready.



Novogene Co.,Ltd.



II. Experimental Procedure

From DNA samples to final data report, each step of sample QC, library preparation and sequencing may have an impact on the quality and quantity of data output, and the quality of data will directly affect the results of subsequent information analysis. Therefore, obtaining high-quality data is the premise to ensure that bioinformatics analysis is correct, comprehensive and credible. In order to ensure the accuracy and reliability of sequencing data from the beginning, Novogene strictly checks every experimental step and fundamentally ensures the output of high-quality data. The workflow is as follows:



Project Workflow

1 DNA sample test

The DNA samples are immunoprecipitated and purified on the end of client. Novogene detection of these IPed DNA samples mainly includes two steps:

(1) Qubit accurately quantifies DNA concentration, based on which the total amount was calculated.

(2) Agarose gel electrophoresis analyses of DNA fragments size distribution and detects any contamination.

2 Library Preparation for Sequencing

After the DNA samples quality control:

(1) The DNA fragments are repaired, dA-tailed.

(2) The DNA fragments with A tail are ligated to sequencing adaptors.

(3) The final DNA library is obtained by size selection and PCR amplification.

The workflow of library construction is as follows:

Library Construction Workflow

3 Quality control of Library

After the construction of the library, the initial quantification was done with Qubit 2.0, and the library is diluted to 1ng/l. Then the insertion size of the library is detected with NGS3K. If meeting the expectation, the accurate concentration of the library is quantified by Q-PCR (library effective concentration > 2nM) to ensure the accurate molar amount that will pooled for sequencing.

4 Sequencing

After library quality control, sequencing is performed for different libraries according to the concentration and the demand of data amount on Illumina NovaSeq platform. The basic principle of sequencing is sequencing by synthesis. Four kinds of fluorescent labeled dNTP, DNA polymerase and adapter primers are added to the flow cell for amplification. When each sequence cluster extends the complementary chain, each fluorescent labeled dNTP releases the corresponding fluorescence. The sequencer captures the fluorescent signal and converts the light signal into the sequencing peak through computer software to obtain the sequence information of the fragment to be detected. The sequencing process is shown in the following figure.

The sequencing principle and process




Novogene Co.,Ltd.


III. Results

The bioinformatic analysis workflow of raw sequencing data is as follows:

1 Data quality control

The original raw data from Illumina platform are transformed to Sequenced Reads, known as Raw Data or RAW Reads, by base calling of CASAVA. Raw data are recorded in a FASTQ file, which contains sequencing reads and corresponding sequencing quality. Every read in FASTQ format is stored in four lines, as indicated below:

@HWI-ST1276:71:C1162ACXX:1:1101:1208:2458 1:N:0:CGATGT   
NAAGAACACGTTCGGTCACCTCAGCACACTTGTGAATGTCATGGGATCCAT
+
#55???BBBBB?BA@DEEFFCFFHHFFCFFHHHHHHHFAE0ECFFD/AEHH

Line 1 begins with a '@' character and is followed by the Illumina Sequence Identifiers and an optional description.

Identifier Meaning
HWI-ST1276 Instrument – unique identifier of the sequencer
71 Run number – Run number on instrument
C1162ACXX Flowcell ID - ID of flowcell
1 LaneNumber - positive integer
1101 TileNumber - positive integer
1208 X - x coordinate of the spot. Integer which can be negative
2458 Y - y coordinate of the spot. Integer which can be negative
1 ReadNumber - 1 for single reads, 1 or 2 for paired ends
N whether it is filtered - NB: Y if the reads is filtered out, not in the delivered fastq files, otherwise N
0 Control number - 0 when none of the control bits are on. Otherwise it is an even number
CGATGT Illumina index sequences

Line 2 is the raw sequence reads.

Line 3 begins with a '+' character and is optionally followed by the same sequence identifiers and descriptions as in Line 1.

Line 4 encodes the quality values for the sequence in Line 2 and must contain the same number of characters as the bases in the sequence (Cock, 2009.).

The details of Illumina Sequence Identifiers are as follows:

(1) HWI-ST1276:71 HWI-ST1276, Instrument - unique identifier of the sequencer; 71, run number - Run number on instrument.

(2) C1162ACXX:1:1101:1208:2458 refers to the coordinate of read on C1162ACXX (Flowcell ID) flowcell, line 1, 1101 tile is(x=1208, y=2458).

(3) 1:N:0:CGATGT the first number is 1 or 2, 1 refers to a single reads or the first read of paired ends, 2 refers to the second of paired ends; the second letter refers to whether reads is adjusted(Y means yes, N means no); the third number represent the number of Control Bits in sequence; six bases on the fourth place is Illumina index sequence.

Sequencing process itself has the possibility of machine errors. Sequencing error rate distribution can reflect the quality of sequencing data. The quality value of each base in sequence information is stored in the FASTQ file. Quality value of the reads represented by Qphred and the “e” represents the sequence error rate, Qphred=-10log10(e). The relationship between sequencing error rate (e) and sequencing base quality value (Qphred) is as following(Base Quality and Phred score relationship with the Illumina CASAVA v1.8 software).

Phred ScoreError RateCorrect RateQ-score
101/1090%Q10
201/10099%Q20
301/100099.9%Q30
401/1000099.99%Q40

(1) Due to the gradual consumption of reagents during sequencing, the sequencing error rate increases with the length of reads, which is a common feature of the Illumina high-throughput sequencing platform (Erlich Y, Mitra PP et al. 2008; Jiang L, Schlesinger F Et al.2011.).

(2) The first six bases have a relatively high error rate due to the incomplete binding of random hexamers used in priming cDNA synthesis (Jiang et al.). In general, a single base error rate should be lower than 1%.

1.1 FastQC of Raw_reads

The massive raw data obtained by high-through sequencing technology must be qualified to guarantee the downstream bioinformatic data analysis. The popular tool for the quality control is software FastQC. Novogene uses it to get the basic quality of the raw reads, that is shown in the following (Fig 1-1):

Figure 1-1 raw reads of FastQC

Top-left: horizontal axis represents position in read (bp), vertical axis represents quality scores. Different colors represent different quality range;

Top-middle: horizontal axis represents position in read (bp), vertical axis represents base percentage;

Top-right: the distribution of read length;

Bottom-left: GC distribution over all sequences. Red line is the actual GC content distribution, blue line is the theoretical distribution;

Bottom-middle: N-ratio across all bases;

Bottom-right: x-axis is the sequence duplication level. y-axis is the percent of the deduplicated reads.

Novogene Co.,Ltd.


1.2 Raw Data Trimming

For the raw data which pass the quality control, primer mismatch may result in nucleotidic composition bias at first several positions of the reads, which could lead to wrong bases insertion during sequencing. Because of the small fragment size of ChIP-Seq, there are often some adapter-appended reads. Trimming off the adapter sequences and low-quality bases is necessary. To ensure the quality of data analysis, raw data need to be filtered to get clean data, and all the follow-up analysis are based on these clean data.

The procedure for data trimming is in the following:

(1) Discard the reads with low quality (proportion of low quality bases larger than 50%);

(2) Discard the reads with N ratio (unsure base) larger than 15%;

(3) Discard the reads with adaptor at the 5’-end;

(4) Discard the reads without adaptor and inserted fragment at the 3’-end;

(5) Trim the adapter sequence at the 3’-end;

(6) Discard the reads whose length are less than 18nt after trimming.

Table 1-1 Summary of raw data quality control

Sample Raw_reads Low_quality Degeneratives Empty Too_short Trimmed Untrimmed Clean_reads Clean_rate
si_Ctrl 23999858 161 0 0 2 4585486 19414209 23999695 100.00%
si_Ctrl_in 24730333 165 0 0 4 4069556 20660608 24730164 100.00%
si_SIRT3 20653181 118 0 0 0 3370916 17282147 20653063 100.00%
si_SIRT3_in 21192357 180 0 0 0 4196923 16995254 21192177 100.00%

Raw_reads: reads from the base-calling. Click on the number to check the result of FastQC;

Low_quality: reads with mean quality lower than 20 before trimming;

Degeneratives: reads with at least 15% N before trimming;

Empty: reads with all bases from adapter (s);

Too_short: reads shorter than 18nt that are discarded after trimming;

Trimmed: reads with at least 18nt that are kept after trimming;

Untrimmed: reads that are kept untrimmed;

Clean_reads: kept reads after trimming, including both trimmed and untrimmed. Click on the number to check the result of FastQC;

Clean_rate: the ratio of Clean_reads to Raw_reads.

Novogene Co.,Ltd.



1.3 FastQC of Clean_reads

Quality control of the clean reads after trimming is in the following:

Figure 1-2 FastQC of Clean_reads

Top-left: horizontal axis represents position in read (bp), vertical axis represents quality scores, different colors represent different quality range.

Top-middle: horizontal axis represents position in read (bp), vertical axis represents base percentage.

Top-right: the distribution of read length.

Bottom-left: mean GC content (%). Red line is the actual GC content distribution, blue line is the theoretical distribution.

Bottom-middle: N-ratio across all bases.

Bottom-right: x-axis is the sequence duplication level. y-axis is the percent of the deduplicated reads.

Novogene Co.,Ltd.



2 Mapping

2.1 Summary of mapping

The common tools for mapping are Bowtie, BWA, MAQ, TOPhat, etc. We choose proper softwares and parameters according to different genome characters to do the genome mapping analysis for the filtered reads. Considering the small fragment size of ChIP-Seq, and the percentage of the unique sequence in the total sequence is the most important information, thus, we can map the reads to the reference genome using BWA much more accurately (Li, H. and R. Durbin, 2009). The summary of mapping is in the following table:

Table 2-1 Summary of mapping

Sample Reads Clean_reads Mapped Unique_mapped Dup_Unique_mapped
si_Ctrl_in pair 24730164 24111435(97.50%) 22429357(93.02%) 3681605(16.41%)
si_Ctrl_in read1 24730164 24730042(100.00%) 22747703(91.98%) 3714065(16.33%)
si_Ctrl_in read2 24730164 24730031(100.00%) 22661608(91.64%) 3710183(16.37%)
si_Ctrl pair 23999695 23298743(97.08%) 21588053(92.66%) 3232562(14.97%)
si_Ctrl read1 23999695 23999631(100.00%) 21912150(91.30%) 3256982(14.86%)
si_Ctrl read2 23999695 23999591(100.00%) 21761380(90.67%) 3250350(14.94%)
si_SIRT3_in pair 21192177 20572640(97.08%) 19151868(93.09%) 2747511(14.35%)
si_SIRT3_in read1 21192177 21192030(100.00%) 19485015(91.95%) 2771941(14.23%)
si_SIRT3_in read2 21192177 21191945(100.00%) 19336585(91.24%) 2766201(14.31%)
si_SIRT3 pair 20653063 20135824(97.50%) 18631410(92.53%) 3372619(18.10%)
si_SIRT3 read1 20653063 20652957(100.00%) 18842382(91.23%) 3389344(17.99%)
si_SIRT3 read2 20653063 20652920(100.00%) 18741639(90.75%) 3384869(18.06%)

Unique_mapped: reads with MAPQ (Li and Ruan et al., 2008) no lower than 13; can be interpreted as the chance of non-accurate mapping (same score for the random mapping) is 0.05.

Duplicates: the reads mapped to the exact same position of the genome;

Mapped is relatively to Clean_reads;

Unique_mapped is relatively to Mapped;

Dup_Unique_mapped is relatively to Unique_mapped.

2.2 MAPQ

The most important thing during ChIP-Seq analysis is the percentage of the unique sequence in the total sequence number. Duplicates are labeled using SAMBLASTER (Faust and Hall, 2014) and mapping quality value is calculated by MAPQ. Proper quality value is chosen as the only threshold for mapping. Here we choose 13 as the threshold, which refer to that the mapping chance of the accordingly non-unique region is only 0.05. Only one read pair is kept for the duplicates in the followed peak calling.

Figure 2-1 Distribution of MAPQ

Horizontal axis is MAPQ, vertical axis is the reads count.

Novogene Co.,Ltd.



2.3 Genome-wide distribution of the mapped reads

Summary of the density of total mapped reads in different chromosomes (plus and minus) is shown in the followed figure. With 5k slide window size, the medium of the number of reads mapped to each base within window is calculated, and converted to log2. The longer the whole chromosome is, the more reads are mapped (Marquez et al. 2012). From the correlation of the number of the mapped reads and the length of the chromosome in the figure, we can see the correlation of the total number of the reads and the length of the chromosome much easier.

Figure 2-2 Genomewide distribution of the mapped reads

Horizontal axis represents the postition of the chromosome, vertical axis represents the number of the reads mapped to 1000nt window size. Here is the unique mapping and deduplication results.

2.4 Distribution of the reads mapped to the gene

Since the binding sites of transcription factor and histone protein are important for gene regulation, analysis of relative mapping position distribution can help us predict the protein function. Each gene and its 2kb upstream and 2kb downstream are divided into 100 equal parts. Reads density are calculated in each part by the mapped reads and the ratio of the reads to total reads.

Figure 2-3 Distribution of the reads mapped to gene

Horizontal axis: relative position of the gene. Vertical axis:reads density.

Novogene Co.,Ltd.



2.5 Sample correlation detection

Biological replicate is necessary for every experiment, same for high-throughput technology (Hansen et al.). Biological replicate mainly has two applications: One is to prove that the biological experiment can be replicated and there is no large variance. The other one is to make sure that following differential gene analysis can get reliable results. The correlation among samples is an important index to see whether the experiment design is reliable and whether the sampling is right. The correlation coefficient is much closer to 1, the similarity of the reads distribution pattern among samples is much higher.

Figure 2-4 Spearman test among samples

Heat-map of sample correlation test, correlation coefficient among samples (Spearman correlation coefficient). The darkness of the color represents how large is the correlation coefficient.

2.6 Visualization of pileup signal

We provide the visualization results of genome wide reads mapping in bam format. IGV (Integrative Genomics Viewer) browser is recommended to view the bam file. IGV browser has the following characters: (1) can reveal single or multiple mapping positions in the genome in different scales, including the distribution of the reads in different chromosomes, and the distribution of annotated exons, introns, splicing junctions and inter-gene region; (2) can reveal reads abundance in different region under different scales; (3) can reveal the annotation information of the gene and alternative splicing isoforms; (4) can reveal other annotation information; (5) can download annotation information from remote and local server. Please check the IGV manual for detail instruction (IGVQuickStart.pdf).

Figure 2-5 Visualization of pileup signal by IGV(demo)

Novogene Co.,Ltd.



3 Fragment size prediction

3.1 Summary of fragment size

For a specific binding site, there is a significant reads enrichment in the binding site. For single-end sequencing, we use MACS2 software to predict the frag_sizes of IP experiment. The whole genome is scaned using certain window size and the enrichment level of the reads in each window is calculated. Then proper windows (eg.1000) is extracted as the samples to build the enrichment model to predict the length of frag_sizes. For double-end sequencing, we use RSeQC software to predict the frag_sizes for the mapping results. The predicted frag_sizes are used for later peak calling.

Table 3-1 Summary of Fragment size

Sample frag_sizes_length infor
si_Ctrl 303 Calculated_from_BAM
si_Ctrl_in 281 Calculated_from_BAM
si_SIRT3 299 Calculated_from_BAM
si_SIRT3_in 272 Calculated_from_BAM

Sample: sample name

frag_sizes_length: frag_sizes length

infor: default parameter

3.2 K distribution of fragment size

Length distribution is analyzed by kernel density method and shown below:

Figure 3-1 frag_size distribution Figure

Single end:

left: horizontal axis is the distance to the center of peak model. vertical axis is the percentage of the reads chosen for modeling.

Red color represents plus strand, blue color represents minus strand.

Right: horizontal axis is the distance to the middle of the peak model. vertical axis is the correlation between plus strand and minus strand. The distance from the red dash line to the middle is the predicted frag_sizes.

Double end:

horizontal axis is the length of the predicted length of frag_sizes, vertical axis is the value of kernel density.


Novogene Co.,Ltd.



4 Strand cross correlation

4.1 Summary of strand cross correlation

As we know that the sequenced reads will be approximately evenly distributed on the plus and minus strands. By calculating the correlation of plus and minus strand (SCC), we can get the distance between two strands. Through the SCC analysis of IP and input data, we can not only obtain the correlation coefficient between plus and minus strands, but also test the effect of IP experiment.



Table 4-1 Summary of strand cross correlation

Sample Median_read_length Predicted_fragment_length CC_min CC_read_length CC_fragment_length NSC RSC
si_Ctrl_in 150 235 0.2487 0.2658 0.2703 1.0870 1.2676
si_Ctrl 150 245 0.2268 0.2426 0.2462 1.0855 1.2287
si_SIRT3_in 150 230 0.2112 0.2361 0.2333 1.1045 0.8870
si_SIRT3 150 225 0.2015 0.2122 0.2222 1.1025 1.9312

Sample: sample name

Median_read_length: mean value of reads length

Predicted_fragment_length: predicted length of fragment sizes

CC_min: the lowest SCC

CC_read_length: the SCC of the longest reads

CC_fragment_lengt: fragment sizes relative SCC

NSC: normalized strand coefficient

RSC: relative strand correlation

4.2 Plots of strand cross correlation

The normalized SCC curve for all the samples is shown below:

Figure 4-1 Plots of strand cross correlation

Horizontal axis is the length of strand shift. Vertical axis is the value of normalized strand cross correlation. Different color represents different samples.


Novogene Co.,Ltd.



4.3 SCC distribution between experimental groups

The ratio of the CC (Cross Correlation) value of the same experimental group (including IP and Input) to the lowest CC value of the whole SCC curve (NSC) should be no less than 1.05. Besides, there is another shadow peak in the reads which is resulted from the sequencing read length. RSC value refers to the ratio of two differences: the difference between the CC value of fragment length and the lowest CC value, and the difference between CC value in the shadow peak and the lowest CC value.

Figure 4-2 SCC curve

Horizontal axis is the lag between plus and minus tags during Pearson Correlation Coefficient calculation. Vertical axis is the Pearson Correlation Coefficient.

Novogene Co.,Ltd.



5 Peak Calling

The annotation of transcription factor binding sites, histone binding sites is the important information for understanding the regulation mechanism and function. By mapping to the reference genome we can obtain the information of protein-DNA binding sites directly. By making use of MACS2 software (Yong Zhang, Tao Liu et al., 2008) (threshold q value = 0.05) to finish the peak calling, we can calculate the number of peaks, the peak width and its distribution, and find the peak related genes.

5.1 Summary of peak calling

Table 5-1 Summary of peak calling

Experiment IP Input/Mock Fragment_length Count_of_narrow_peak FRiP Count_of_summits
CTRL si_Ctrl si_Ctrl_in 303(preset) 2221 0.3425% 2225
KD si_SIRT3 si_SIRT3_in 299(preset) 2291 0.4882% 2299

Experiment: experimental group name(one ChIP Experiment includes an IP and a control, eg. Input or Mock or no control);

EIP:experiment name after chip handling;

EInput/Mock: control group;

EFragment_length: frag sizes length

ECount_of peak: the number of peak (narrow). If would like to test broad peak, need annotate in the information collection form;

EFRiP: the ratio of the numer of the reads in the peak to the total reads, which can test the effect of IP experiment;

ECount_of_summits: number of summits. Some peaks can have multiple summits due to close position.

Novogene Co.,Ltd.



5.2 Genome wide distribution of peaks

The summary of genome wide distribution of peaks is shown in the figure below. From the number of the peaks mapping to the chromosome and its distribution can reflect the distribution of the protein binding sites. When the number of the chromosomes with peaks is larger than 30, only 15 chromosomes are shown.

Figure 5-1 Genome wide distribution of peaks

Horizontal axis is the coordinate of the peaks in the chromosome. Vertical is the chromosomes. Every blue bar represents a peak.

5.3 Distribution of peak width

The peak width represents the length of the DNA that is bound by the studied protein. The number of peaks of different peak width is calculated, as shown below.

Figure 5-2 Distribution of peak width

Horizontal axis is the width of the peak (nt), vertical axis is the number of the corresponding peaks.

Novogene Co.,Ltd.



5.4 Distribution of fold enrichment

The fold enrichment value is the digital display of the peak signal during peak calling, which can be called signal value. The larger the value, the more reads are enriched to this peak. The peak number of different fold changes is calculated. The distribution of fold enrichment is shown below:

Figure 5-3 Distribution of fold enrichment

Horizontal axis represents the enrichment fold change of the peak. Vertical axis represents the number of the peaks.

5.5 Distribution of q values

The significance of the peak is the indication of the credibility of the peak, so the q value for eah peak is calculated. The peak number of differnet q values is calculated and the distribution is shown below:

Figure 5-4 Distribution of q values

Horizontal axis represents -log10 q value of the peak; Vertical axis represents the number of the peaks.

Novogene Co.,Ltd.



5.6 Count of summits in peaks

The number of summits in each peak is analysized, which infers the peak type in the IP experiment.

Figure 5-5 Count of summits in peaks

Horizontal axis represents the number of summit in each peak. Vertical axis is the corresponding number of the peak.

5.7 Summits distribution

Each peak is divided to 100bp windows and the summits in each window of all the peaks are counted as shown below:

Figure 5-6 Percentile position of summits in peaks

Horizontal axis is the position of sumits, vertical axis is the count of the summits.

Novogene Co.,Ltd.



6 Motif analysis

The binding of protein, such as transcription factor, histone etc. to DNA is not random, but sequence preference. MEME(Timothy L. Bailey and Charles Elkan, 1994)and Dreme (Timothy L. Bailey, 2011) softwares is used to detect significant motif sequence in the peak. Tomtom (Shobhit Gupta, JA Stamatoyannopolous, 2007) software is used to annotate the motif by mapping it to the annotated Motif database. We use sequence logo to show the base bias in different position in the binding sites in long Motif (8~30) (Fig 6-1) and short Motif (~8) (Fig 6-2). (Note: because of binding site specification, one motif sequence can be only shown in one section (<= 8 or >= 9), and parts of the figure below may have empty results).

6.1 Motif searching

Figure 6-1 long conservative sequence

Logos are listed in order. The motif in the right is the reverse complementary sequence of the left one.

Figure 6-2 short conservative sequence

Logos are listed in order. The motif in the right is the reverse complementary sequence of the left one.

Novogene Co.,Ltd.



6.2 MEME motif annotation

Tomtom is used to annotate the motif sequences detected by MEME. (Note: since the conservative sequence in the binding site is short, motif can be only in one region (<= 8), resulting in no result for meme test).

Table 6-1 MEME detection Motif calculation

#Query_ID Target_ID Optimal_offset p-value E-value q-value Overlap Query_consensus Target_consensus Orientation
3 M6159_1.02 -12 0.000693386 0.508946 0.560319 16 TAAAAACTAGACAGAAGCATTCTCAGAA AAAAGCTTTCTAGGAA -
3 M2283_1.02 -1 0.000763378 0.560319 0.560319 15 TAAAAACTAGACAGAAGCATTCTCAGAA CAAAAGTAAACAAAG +
4 M5320_1.02 -16 0.000202653 0.148747 0.297494 14 GGACCGCTTTGAGGCCTTCGTTGGAAACGG TTCGTTGTATGCGGG -
4 M6499_1.02 -18 0.000908398 0.666764 0.455529 9 GGACCGCTTTGAGGCCTTCGTTGGAAACGG CGTGGGAAA +
4 M5882_1.02 -7 0.000930918 0.683294 0.455529 20 GGACCGCTTTGAGGCCTTCGTTGGAAACGG TTTCACACCTAGGTGTGAAA -
6 M6336_1.02 -10 0.000141746 0.104042 0.207847 16 CCAGCAACTTGGGAGGCTGAGGCAGG GGGGGGGGGAGGGAGGG +

#Query ID: detected motif;

Target ID: known motif ID in the database;

Optimal offset: the number of lag bases;

p-value:probability of MCMC;

E-value:false positive probability;

q-value: FDR value;

Overlap: overlapping base pair between two sequences;

Query consensus: detected motif sequence;

Target consensus: motif sequence in the target database;

Orientation: plus or minus strand for the target sequence.


The sequence logo below shows the comparison results between MEME detected motif and known motif. The result is in the following:

Figure 6-3 MEME motif annotation

Novogene Co.,Ltd.



6.3 DREME motif annotation

DREME is a motif discovery algorithm specifically designed to find the short, core DNA-binding motifs of eukaryotic TFs. (Note: Since the conservative sequence in the binding sites is long, the motif could only show up in one region (<= 8), which could show empty result with Dreme detection)

Table 6-2 Summary of motif detected by Dreme

#Query_ID Target_ID Optimal_offset p-value E-value q-value Overlap Query_consensus Target_consensus Orientation
TCTCAGWA M6496_1.02 2 0.00133367 0.978915 1 8 TCTCAGAA TTTCCCAGAAAAG +
CACTTSCA M2306_1.02 2 0.000977487 0.717475 0.519417 8 CACTTGCA TTCATTTGCATAT +
CYACAAAA M6459_1.02 2 0.000484201 0.355404 0.710808 8 CTACAAAA AACCACAAACCCCA +
CGCTTTS M6506_1.02 1 0.000166878 0.122488 0.163243 7 CGCTTTG GCGCTTTGTTCT +
CGCTTTS M6476_1.02 0 0.000222401 0.163243 0.163243 7 CGCTTTG CGCTTTGTTCTC -
CGCTTTS M6314_1.02 5 0.000837945 0.615052 0.410035 7 CGCTTTG AGTTTCGCTTTC -

#Query ID: detected motif ID;

Target ID: known motif ID in the database;

Optimal offset: lag base number;

p-value: probability of MCMC;

E-value: false positive probability;

q-value: FDR value;

Overlap: overlapping base between two sequences;

Query consensus: detected motif base sequence;

Target consensus: motif base sequence in the target database;

Orientation: plus or minus tag for the targeted sequence.


The sequence logo below shows the comparison result between detected motif with Dreme and known motif:

Figure 6-4 Dreme motif annotation

Novogene Co.,Ltd.



7 Peak annotation

7.1 Peak-TSS distance

Peak-TSS distance distribution can be used to predict the protein binding sites. On the one hand, the effect of IP assay can be detected according to the protein binding sites; on the other hand, the regulation mode or function of the protein can be predicted through the characteristics of protein binding. TSS (transcription start site) of every peak related gene is detected by PeakAnnotator (Salmon-Divon and Dvinge et al., 2010). Peak-TSS distance distribution is analyzed according to the distance between the peak and TSS.

Figure 7-1 Distribution of Peak-TSS Distance

Horizontal axis represents the distance from peak to TSS. Vertical axis represents the number of the peak.

7.2 Distribution of peaks in functional regions

The figure below shows the distribution of peaks in different functional regions.

Figure 7-2 Peak distribution in functional gene region

Distribution of peak in different functional area. Horizontal axis represents different functional area, vertical axis represents the ratio of the peak in the functional region to the total peaks. The number on the top of the functional region represents peak number.

U2000 means 2000bp in the upstream region, D2000 means 2000bp in the downstream region;

CDSu2K and CDSd2K means upstream and downstream 2kb of CDS;

TSS100, Stop100, start100 means 100bp centered with TSS,TTS, Start-codon and stop-codon.

Novogene Co.,Ltd.



7.3 GO enrichment analysis

Gene Ontology (GO, http://www.geneontology.org/) is a major bioinformatics initiative to unify the representation of gene and gene product attributes across all species. More specifically, the project aims to: 1) maintain and develop its controlled vocabulary of gene and gene product attributes; 2) annotate genes and gene products, and assimilate and disseminate annotation data; and 3 provide tools for easy access to all aspects of the data provided by the project, and to enable functional interpretation of experimental data using the GO, for example via enrichment analysis. GO analysis covers three aspects: Molecular Function, Biological Process and Cellular Component.

Results of GO enrichment analysis of the peak related gene are showed below:

Table 7-1 peak related gene GO enrichment

GO Description Term_type Overrepresented_pValue Corrected_pValue Gene_item Gene_list Bg_item Bg_list genes
GO:0007268 synaptic transmission biological_process 1.6132e-07 0.0013689 32 317 713 19782 ......
GO:0050804 modulation of synaptic transmission biological_process 1.895e-07 0.0013689 24 317 439 19782 ......
GO:0051960 regulation of nervous system development biological_process 4.3787e-07 0.0018768 37 317 940 19782 ......
GO:0007416 synapse assembly biological_process 5.1969e-07 0.0018768 14 317 167 19782 ......
GO:0008045 motor neuron axon guidance biological_process 0.00018923 1 5 306 35 19782 ......
GO:0045720 negative regulation of integrin biosynthetic process biological_process 0.000238 1 2 306 2 19782 ......

GO_accession: the unique GO ID;

Description: Function description;

Term_type: Including cellular_component, biological_process and molecular_function;

Over_represented_pValue: significance of enrichment analysis;

Corrected_pValue: P-Value after correction, normally, with padj< 0.05 are enriched;

Gene_with_peak_item: the number of peak related genes with this GO term;

Gene_with_peak_list: the number of peak related gene;

Bg_item: the number of background gene with this GO term;

Bg_list: the number of background genes;

Genes: Annotated peak related gene ID (not show in this table);

Choose the first five results for each experiment.

Novogene Co.,Ltd.


Below is the GO enrichment bar of peak overlapping genes. It is directly reflect the distribution of the number of the peak overlapping genes that are enriched in the biological process, cellular component and molecular function.

Figure 7-3 GO enrichment

Bar: horizontal axis is the number of the peak overlapping gene. Different color is used to differentiate biological process, cellular component and molecular function.

DAG figure: every note represents a GO project. Rectangular represents top 10 enrichment GO. Darkness represents the enrichment level. The darker of the color, the higher of the enrichment.

The name of the project and the significance (padj) are shown in every note.

Novogene Co.,Ltd.


7.4 KEGG enrichment analysis

Different genes coordinate to each other to realize the function in organism. Kyoto Encyclopedia of Genes and Genomes (KEGG, http://www.kegg.jp/) is a collection of manually curated databases, containing resources onf genomic, biological-pathway and disease information (Kanehisa,2008). Pathway enrichment analysis identifies significantly enriched metabolic pathways or signal transduction pathways associated with peak overlapping genes, comparing the whole genome background.

Table 7-2 KEGG enrichment list

Term Database ID Input number Background number P-Value Corrected P-Value Input Hyperlink
Parkinson's disease KEGG PATHWAY hsa05012 28 142 1.26169066225e-15 1.61496404767e-13 ...... ......
Oxidative phosphorylation KEGG PATHWAY hsa00190 25 133 1.08519681653e-13 9.26034616775e-12 ...... ......
Metabolic pathways KEGG PATHWAY hsa01100 73 1243 6.00951341499e-11 3.84608858559e-09 ...... ......
Cardiac muscle contraction KEGG PATHWAY hsa04260 17 78 1.25842785933e-10 6.44315063977e-09 ...... ......
Ribosome KEGG PATHWAY hsa03010 65 138 6.92547597193e-55 1.64133780535e-52 ...... ......
Parkinson's disease KEGG PATHWAY hsa05012 22 142 5.05587974625e-11 5.99121749931e-09 ...... ......

(1) Term: Description information of KEGG pathway.

(2) Database: KEGG PATHWAY.

(3) ID: The only pathway identifier in the KEGG database.

(4) Input number: the number of peak overlapping gene in the corresponding pathway.

(5) Background number: The number of the genes in the corresponding pathway.

(6) P-value: statistical significant level of enrichment analysis.

(7) Corrected p-value: statistical significant level after correction. Normally, if the corrected P-value < 0.05 this pathway is enriched.

(8) Input: The peak overlapping genes in the corresponding pathway. Because of too much genes, here we ignore it.

(9) Hyperlink: see result file.

Peak overlapping gene KEGG enrichment scatter plot (Figure 3.24) is the visualization of KEGG enrichment analysis. In this figure, the extent of KEGG enrichment is measured by Rich factor, qvalue and the number of the genes that are enriched in this pathway. The top 20 significantly enriched terms in the KEGG enrichment analysis are displayed below. If the enriched pathway is less than 20, all the terms will be displayed.

Figure 7-4 KEGG enrichment scatter

Vertical axis represents pathway name, horizontal axis represents Rich factor, the size of the dot represents the numbers of the overlapping peak genes in the pathway, and the color of the dots corresponds to different qvalue range. The size of the dots represents the number of the genes whose peaks are overlapping in this pathway.

Novogene Co.,Ltd.



8 Differential analysis

Only when the number of groups is not less than two can the difference analysis be performed between groups.

FoldEnrich is used for differential analysis of peaks in different experimental groups (the ratio of RPM value in group A to group B). Differerntial binding sites are analyzed in different groups by finding differential peak when the ratio of FoldEnrich is larger than 2. We can get the differential binding sites related genes to do the follow-up annotation and enrichment analysis.

8.1 Summary of differential peaks

The Differential Expression Venn diagram illustrates the differential peaks of the compared groups. The sum of all the numbers in the circle represents the total number in the compared groups, and the overlapping part is the common peak number between two compared groups., as shown in the figure below.

Figure 8-1-1 Venn diagram of differential peak in comparable groups

Different colors represent different compared groups (A or B). The sum of the number in the pie is the peak number in this compared group, and the overlapping part is the number of common genes among comparison groups.

Figure 8-1-2 Venn diagram of differential peak in comparable gruops

Since some compared groups (eg. A or B) may include multiple experimental groups (eg. A1, A2, A3), here show the number of peak among experimental groups within one compared group (eg. A1&A2, A1&A3). Different colors represent different experimental groups (A1 or A2). The sum of the number in the pie is the peak number of this experimental group. Single color is group specific peak number. Overlapping part is the common peak number between two experimental groups.

Novogene Co.,Ltd.



8.2 Enrichment level analysis among different samples

RPM value of the peak in different samples (the ratio of 1M reads that enriched to the peak in a single sample) is used to do clustering analysis to determine the enrichment pattern of the same peak in different samples or the enrichment difference of different peaks in the same sample. At the same time, Enrichment comparison between IP and Input in the group can show the peak enrichment in IP experiment. Hierarchical clustering is done based on RPM value in different samples. Different colors represent different clustering information. The more close between the IP sample enrichment, the more similar function or biological process they have.

Figure 8-2 Enrichment analysis among samples

Top is the clustering among different samples in comparable groups.

Bottom is the correlation among different samples in comparable groups.

8.3 Reads density distribution among comparable groups for the annotated peaks

RPM is the number of reads enriched in peak area per million reads. Boxplot is used to show reads enrichment in the annotated peak area in different samples.

Figure 8-3 Boxplot of RPM value among samples in different groups

Novogene Co.,Ltd.



8.4 Enrichment level of different experimental peak analysis

FoldEnrich value (the Ratio of RPM value in IP to RPM in input) is used to do the cluster analysis of the peaks from different experimental group. It is used to determine the enrichment pattern of the same peak in different experimental groups, or the enrichment difference of different peaks in the same experimental group. At the same time, according to the comparison of peak enrichment among different experimental groups, we can show the different enrichment of peak between experimental groups, so as to find the different enrichment peak, which is the differential protein binding sites. Hierarchical clustering analysis is done based on the FoldEnrich value from different experimental groups. Different colors represent different clustering information.

Figure 8-4 Peak enrichment analysis among different experimental groups

Top is the clustering analysis of peaks from different experimental groups.

Bottom is the correlation among different experimental groups.

8.5 Reads density distribution among different experimental groups for the annotated peaks

Boxplot is used to show the different FoldEnrich (the ratio of RPM of IP to RPM of Input) i in the peak annotation region in different samples.

Figure 8-5 Reads density distribution in peak annotation region from different experiment was shown by boxplot.


Novogene Co.,Ltd.



8.6 GO enrichment analysis of differential peak related genes

GO enrichment analysis for the differential peak related genes is shown below:

Figure 8-6 GO enrichment analysis for the differential peak genes

8.7 KEGG enrichment analysis of differential peak related genes

KEGG enrichment analysis for the differential peak related genes is shown below:

Figure 8-7 KEGG enrichment analysis for differential peak genes

Novogene Co.,Ltd.



IV.Background

1 Experiment and library construction

ChIP-Seq is the efficient way to study protein-DNA interaction, transcription factor binding sites and histone modification in the whole genome (Landt and Marinov et al., 2012). The pipeline of ChIP-Seq and library construction is shown in the following figure. We provide services for every step started from library construction for the moment.

Normally, if you already know some target sequences, you can test the efficiency of the ChIP assay by qPCR before library construction. After sample quality control, library construction and quality control, we proceed it to the follow-up sequencing.



ChIP-seq library construction pipeline

Novogene Co.,Ltd.



2 Software list

AnalysisSoftware(version)ParametersRemarks
Trimmingskewer0.1.126-
QCfastqc--
MappingBWA0.7.12-
Peak callingMACS22.1.0-
Motif predictionmeme4.10.2-
Peak annotationPeakAnnotator_Cpp1.4-
Diff analysisdiffbind-ip/input>2
GO enrichmentGoseq/topGO
Bioconductor
-
2.13
-
KEGG enrichmentKOBAS3.0-


Novogene Co.,Ltd.



3 References

Andrews S. (2010). FastQC: a quality control tool for high throughput sequence data. Available online at: http://www.bioinformatics.babraham.ac.uk/projects/fastqc

Ashburner, M. and C. A. Ball, et al. (2000). "Gene ontology: tool for the unification of biology. The Gene Ontology Consortium." Nat Genet 25 (1): 25-9.

Bailey, T. L. and N. Williams, et al. (2006). "MEME: discovering and analyzing DNA and protein sequence motifs." Nucleic Acids Res 34 (Web Server issue): W369-73.

Bailey, T. L. and M. Boden, et al. (2009). "MEME SUITE: tools for motif discovery and searching." Nucleic Acids Res 37 (Web Server issue): W202-8.

Bailey, T. and P. Krajewski, et al. (2013). "Practical guidelines for the comprehensive analysis of ChIP-seq data." PLoS Comput Biol 9 (11): e1003326.

Faust, G. G. and I. M. Hall (2014). "SAMBLASTER: fast duplicate marking and structural variant read extraction." Bioinformatics 30 (17): 2503-5.

Jiang, H. and R. Lei, et al. (2014). "Skewer: a fast and accurate adapter trimmer for next-generation sequencing paired-end reads." BMC Bioinformatics 15: 182.

Kanehisa, M. and S. Goto (2000). "KEGG: kyoto encyclopedia of genes and genomes." Nucleic Acids Res 28 (1): 27-30.

Kanehisa M., M. Araki, et al. (2008). KEGG for linking genomes to life and the environment. Nucleic acids research.(KEGG)

Kent, W. J. and A. S. Zweig, et al. (2010). "BigWig and BigBed: enabling browsing of large distributed datasets." Bioinformatics 26 (17): 2204-7. Available online at: http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/

Kharchenko, P. V. and M. Y. Tolstorukov, et al. (2008). "Design and analysis of ChIP-seq experiments for DNA-binding proteins." Nat Biotechnol 26 (12): 1351-9. Available online at: http://compbio.med.harvard.edu/Supplements/ChIP-seq/

Li, H. and R. Durbin (2009). "Fast and accurate short read alignment with Burrows-Wheeler transform." Bioinformatics 25 (14): 1754-60.

Li, H. and J. Ruan, et al. (2008). "Mapping short DNA sequencing reads and calling variants using mapping quality scores." Genome Res 18 (11): 1851-8.

Li, H. and B. Handsaker, et al. (2009). "The Sequence Alignment/Map format and SAMtools." Bioinformatics 25 (16): 2078-9.

Landt, S. G. and G. K. Marinov, et al. (2012). "ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia." Genome Res 22 (9): 1813-31.

Mao, X. and T. Cai, et al. (2005). "Automated genome annotation and pathway identification using the KEGG Orthology (KO) as a controlled vocabulary." Bioinformatics 21 (19): 3787-93.

Nicol, J. W. and G. A. Helt, et al. (2009). "The Integrated Genome Browser: free software for distribution and exploration of genome-scale datasets." Bioinformatics 25 (20): 2730-1. Available online at: http://bioviz.org/igb/index.html

Peter J.Park(2009). ChIP-seq: advantages and challenges of a maturing technology. Nature Reviews Genetics 10, 669-679.

Ramirez, F. and F. Dundar, et al. (2014). "deepTools: a flexible platform for exploring deep-sequencing data." Nucleic Acids Res 42 (Web Server issue): W187-91.

R Core Team (2015). R: A Language and Environment for Statistical Computing. Available online at: https://www.r-project.org/

Shirley Pepke, Barbara Wold and Ali Mortazavi (2009).Computation for ChIP-seq and RNA-seq studies. Nature methods, VOL.6 NO.11s

Salmon-Divon, M. and H. Dvinge, et al. (2010). "PeakAnalyzer: genome-wide annotation of chromatin binding and modification loci." BMC Bioinformatics 11: 415.

Shobhit Gupta, JA Stamatoyannopolous, Timothy Bailey and William Stafford Noble, "Quantifying similarity between motifs", Genome Biology, 8(2):R24, 2007.

Yong Zhang,Tao Liu et al. (2008). Model-based Analysis of ChIP-Seq (MACS). Genome Biology, 9:R137

Young, M. D. and M. J. Wakefield, et al. (2010). "Gene ontology analysis for RNA-seq: accounting for selection bias." Genome Biol 11 (2): R14. Available online at: https://bioconductor.org/packages/release/bioc/html/goseq.html

Thorvaldsdottir, H. and J. T. Robinson, et al. (2013). "Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration." Brief Bioinform 14 (2): 178-92. Available online at: https://www.broadinstitute.org/igv/

Timothy L. Bailey and Charles Elkan, "Fitting a mixture model by expectation maximization to discover motifs in biopolymers", Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, pp. 28-36, AAAI Press, Menlo Park, California, 1994.

Timothy L. Bailey, "DREME: Motif discovery in transcription factor ChIP-seq data", Bioinformatics, 27(12):1653-1659, 2011.

Young M D, Wakefield M J, Smyth G K, et al. (2010).Gene ontology analysis for RNA-seq: accounting for selection bias. Genome Biology, doi:10.1186/gb-2010-11-2-r14. (GOseq)

Kanehisa, M., M. Araki, et al. (2008). KEGG for linking genomes to life and the environment. Nucleic acids research.(KEGG)