Evidence-ranked motif identification
S. Georgiev, A. Boyle, K. Jayasurya, X. Ding, S. Mukherjee, U. Ohler (2010). Evidence-ranked motif identification. Genome Biology.
This website provides the cERMIT executable (Linux) and a brief description of the ChIP-seq analysis pipeline described in "Evidence-ranked motif identification".
The computational identification of functional sequence motifs has been a challenging problem in computational biology. Traditionally, regulatory motif finding has been phrased as the de novo identification of short DNA or RNA sequences enriched in a small subset of regulatory regions (e.g. promoters or untranslated regions of transcripts). However, the increasing availability of genome-wide data sets that directly or indirectly reflect gene regulatory interactions has allowed for an alternative problem definition: identify enriched sequence motifs, given quantitative experimental evidence for each regulatory region in a genome-wide set. We propose the (conserved) Evidence-Ranked Motif Identification Tool, cERMIT, which implements an efficient enumerative strategy for identifying cis-regulatory elements to address this reformulation of the motif finding problem. cERMIT operates on a set of putative regulatory regions and their corresponding evidence, for example sequence peaks from ChIP-seq experiments. Candidates for co-regulated sets of regions are defined by the presence of a shared degenerate k-mer motif. cERMIT identifies the motifs that correspond to the candidate sets with strong aggregate binding evidence. [full text]
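To make the idea of a candidate set concrete, the following is a minimal sketch and not cERMIT's implementation. It assumes degenerate k-mers are written in the IUPAC nucleotide alphabet, searches the forward strand only, and uses a plain mean of the evidence values as a stand-in for the aggregate statistic defined in the paper. The function names and the toy regions are hypothetical.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def motif_to_regex(motif):
    """Compile a degenerate k-mer (IUPAC alphabet) into a regular expression."""
    return re.compile("".join(IUPAC[base] for base in motif.upper()))

def candidate_set(motif, regions):
    """Return the names of regions whose sequence contains the motif.

    `regions` maps a region name to a (sequence, evidence) pair.
    Only the forward strand is searched, to keep the sketch short.
    """
    pattern = motif_to_regex(motif)
    return [name for name, (seq, _) in regions.items()
            if pattern.search(seq.upper())]

def mean_evidence(motif, regions):
    """Average evidence over the candidate set (an assumed stand-in statistic)."""
    hits = candidate_set(motif, regions)
    if not hits:
        return 0.0
    return sum(regions[name][1] for name in hits) / len(hits)

# Toy input: three regions with made-up sequences and evidence values.
regions = {
    "peak1": ("TTGACGTCATT", 8.0),
    "peak2": ("CCTGACGCAAG", 6.0),
    "peak3": ("AAAAAAAAAAA", 0.5),
}
print(candidate_set("TGACGYMA", regions))
print(mean_evidence("TGACGYMA", regions))
```

An enumerative search would repeat this scoring over many candidate motifs and keep the highest-ranking ones; how cERMIT enumerates and ranks them is described in the paper.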
Pipeline for analysis of deep sequencing data (ChIP-seq, DNaseI-seq)
Generally speaking, there are three main steps in the analysis of ChIP-seq data.
1. Read alignment. Sequence reads are aligned against the reference genome.
2. Peak calling. Genomic regions significantly enriched in aligned reads (typically 100bp-1000bp in size) are identified for further study.
3. Motif analysis. Peaks are further analyzed to infer the binding affinity and target regions of the trans-acting element under study.
cERMIT provides an approach to the motif analysis of the sequence peaks inferred in step 2, taking advantage of the quantitative binding evidence provided by the number of reads aligned to each peak.
Alignment of short sequence reads
Align ChIP-seq reads using MAQ (Li et al., Genome Research 2008), retaining the reads that align against four or fewer locations. To avoid single-base pile-ups of sequences, remove all sequence locations where, within a 30bp window, there are more than 10 sequences of which 70% map to a single base location. Trim locations with multiple identical sequences to a maximum of 5 sequences.
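The read filters above can be sketched as follows. This is an illustrative reading of the rules, not the code used in the pipeline: it assumes reads are already parsed into (chromosome, position, number of alignments) tuples, that the 30bp window is centered on the candidate position, and that only the dominant base location of a pile-up is removed. The function name and parameter names are hypothetical.

```python
from collections import Counter

def filter_reads(reads, max_hits=4, window=30, min_reads=10, frac=0.7, max_dup=5):
    """Filter aligned reads given as (chrom, position, n_alignments) tuples.

    Returns a Counter mapping (chrom, position) to the retained read count.
    """
    # 1. Keep reads that align to at most `max_hits` genomic locations.
    counts = Counter((c, p) for c, p, n in reads if n <= max_hits)

    # 2. Flag single-base pile-ups: positions holding at least `frac` of the
    #    reads in a surrounding window that contains more than `min_reads` reads.
    half = window // 2
    pileups = set()
    for (chrom, pos), here in counts.items():
        in_window = sum(
            counts.get((chrom, q), 0) for q in range(pos - half, pos + half)
        )
        if in_window > min_reads and here / in_window >= frac:
            pileups.add((chrom, pos))
    for key in pileups:
        del counts[key]

    # 3. Cap identical reads at `max_dup` per location.
    return Counter({key: min(n, max_dup) for key, n in counts.items()})

# Toy input: a pile-up at chr1:100, a neighbor at chr1:110, a duplicated
# location at chr1:500, and one read that aligns to 9 locations.
reads = (
    [("chr1", 100, 1)] * 12
    + [("chr1", 110, 1)] * 2
    + [("chr1", 500, 1)] * 7
    + [("chr1", 505, 9)]
)
kept = filter_reads(reads)
print(sorted(kept.items()))
```

In the toy input, chr1:100 holds 12 of the 14 reads in its window and is dropped, chr1:500 is capped at 5 reads, and the multiply-aligned read at chr1:505 is discarded in the first step.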
1. ChIP-seq
a) Identify discrete ChIP peaks using the kernel density estimation (KDE) procedure implemented in Fseq (Boyle et al., Bioinformatics 2008).
b) Assign binding score = maximum KDE value across all locations within the peak.
- Discard regions with binding scores greater than 10, as those are most likely to be pile-ups within repeat regions.
- Extend/trim peaks (proportional to the distance from the maximum KDE score location) to fall within the range 100-1000bp.
For DNaseI-seq data, peak regions are defined similarly to the ChIP-seq case. A detailed description of the procedure is included in .
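The ChIP-seq scoring and resizing rules can be sketched as below. The code assumes a peak is available as a start coordinate plus one KDE value per base, and it interprets "proportional to the distance from the maximum KDE score location" as scaling both flanks around the summit by the same factor. That interpretation, the function name, and the parameter names are assumptions made for illustration.

```python
def score_and_resize(start, kde, max_score=10.0, min_len=100, max_len=1000):
    """Score one peak and rescale it around its summit.

    `start` is the genomic coordinate of the first base and `kde` holds one
    density value per base. Returns (new_start, new_end, score) as a
    half-open interval, or None if the peak is discarded.
    """
    score = max(kde)
    if score > max_score:
        return None  # likely a pile-up within a repeat region

    length = len(kde)
    summit = start + kde.index(score)  # first position with the maximum value
    end = start + length
    target = min(max(length, min_len), max_len)
    if target == length:
        return start, end, score

    # Scale both flanks by the same factor, so each side changes in
    # proportion to its distance from the summit.
    factor = target / length
    new_start = summit - round((summit - start) * factor)
    new_end = new_start + target
    return new_start, new_end, score

# Toy peak: 50bp long, summit 10bp from the left edge.
kde = [1.0] * 10 + [3.0] + [1.0] * 39
print(score_and_resize(1000, kde))
```

The sketch does not clip the result to chromosome boundaries, so an extended peak near the start of a chromosome would need an extra check.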
Processing of Fseq peaks
1. Define the space of putative sequence regions to be used as input to cERMIT
Recent high-throughput sequencing technologies coupled with DNaseI Hypersensitive Sites (DHS) assays have clearly demonstrated that regions of open chromatin tend to be highly enriched in functional DNA elements (Boyle et al., Cell 2008). Hence, we define the set of putative regulatory regions to be the DHS peaks assayed in the same experimental conditions and call this the "DNaseI" approach. Ideally, we would use DHS data combined with the factor-specific binding evidence (e.g. ChIP-seq) derived from the same cell type. When DHS data are unavailable, we propose an alternative strategy, the "ensemble" approach, which relies on the assumption that ChIP-seq peaks generally tend to fall within open chromatin regions, irrespective of the specific assay. Hence, the combined set of the top ChIP-seq peaks from an ensemble of unrelated ChIP-seq datasets would provide a useful proxy for open chromatin.
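As an illustration of the "ensemble" approach, the sketch below pools the highest-scoring peaks from several ChIP-seq datasets. The text does not state how many top peaks are taken from each dataset, so the `top_n` parameter, its default value, and the function name are hypothetical; overlapping pooled regions are left for the merging step described next.

```python
def ensemble_regions(datasets, top_n=1000):
    """Pool the `top_n` highest-scoring peaks from each ChIP-seq dataset.

    `datasets` maps a dataset name to a list of (chrom, start, end, score)
    peaks. Returns the pooled (chrom, start, end) intervals sorted by
    position, with exact duplicates removed. Scores are dropped because only
    the locations serve as a proxy for open chromatin.
    """
    pooled = set()
    for peaks in datasets.values():
        best = sorted(peaks, key=lambda p: p[3], reverse=True)[:top_n]
        pooled.update((chrom, start, end) for chrom, start, end, _ in best)
    return sorted(pooled)

# Toy input: two unrelated datasets with made-up peaks.
datasets = {
    "factorA": [("chr1", 100, 300, 5.0), ("chr1", 900, 1100, 1.0)],
    "factorB": [("chr1", 100, 300, 2.0), ("chr2", 50, 250, 4.0)],
}
print(ensemble_regions(datasets, top_n=1))
```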
2. Assign binding scores based on ChIP-seq data
Each putative regulatory region is assigned the binding score of the overlapping ChIP-seq peak. If there is no overlapping ChIP-seq peak, the region is assigned a score of 0. Whenever two putative regulatory regions overlap, they are merged and the merged region is assigned the binding score of the longer of the two original regions.
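A minimal sketch of score assignment and merging follows. It is not the Ruby implementation referenced in the note: intervals are assumed to be half-open, a region overlapped by several peaks takes the strongest score, and a chain of overlapping regions keeps the score of its longest original member. These choices and all function names are assumptions.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end

def assign_scores(regions, peaks):
    """Give each (chrom, start, end) region the score of an overlapping peak, else 0."""
    scored = []
    for chrom, start, end in regions:
        hits = [s for c, ps, pe, s in peaks
                if c == chrom and overlaps(start, end, ps, pe)]
        # Assumption: if several peaks overlap, keep the strongest one.
        scored.append((chrom, start, end, max(hits, default=0.0)))
    return scored

def merge_overlapping(scored):
    """Merge overlapping regions, keeping the score of the longest original region."""
    merged = []
    for chrom, start, end, score in sorted(scored):
        if merged and merged[-1][0] == chrom and start < merged[-1][2]:
            _, m_start, m_end, m_score, m_len = merged[-1]
            if end - start > m_len:
                m_score, m_len = score, end - start
            merged[-1] = (chrom, m_start, max(m_end, end), m_score, m_len)
        else:
            merged.append((chrom, start, end, score, end - start))
    # Drop the bookkeeping length before returning.
    return [m[:4] for m in merged]

# Toy input: two overlapping regions and one isolated region without a peak.
regions = [("chr1", 100, 400), ("chr1", 350, 450), ("chr1", 1000, 1200)]
peaks = [("chr1", 150, 250, 7.5), ("chr1", 420, 440, 3.0)]
scored = assign_scores(regions, peaks)
print(merge_overlapping(scored))
```

In the toy input the first two regions merge into chr1:100-450 and keep the score 7.5 of the longer original region, while the isolated region receives a score of 0.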
Note: The processing steps described above have been implemented in a suite of Ruby scripts available here. All input parameters should be specified in a parameters file (see example) which is passed as an argument inside 'do_processing.sh'. Upon running 'do_processing.sh' all necessary input is created and cERMIT can be run from the command line.
The motif analysis step is implemented in cERMITv1.0. Please refer to the provided Readme file for detailed instructions on how to run cERMIT and interpret the generated output.
cERMITv1.0 + sample datasets
This archive includes a binary executable, compiled for Linux, as well as sample ChIP-chip, ChIP-seq, and microRNA datasets.
This archive includes the Ruby pre-processing scripts implementing the pipeline for analysis of deep sequencing data, as well as the human and mouse ChIP-seq and DNase-seq peaks (called by Fseq) used as input in the paper. For these datasets to be analyzed by cERMIT, the user must supply a compressed version of the corresponding genomic data in .2bit format and specify it inside the input parameters file (see example).
This archive includes a binary executable, compiled for Linux, as well as the latest version of the pipeline which implements the pre-processing of the input (in Ruby).
Additional tables with comprehensive prediction results on the yeast ChIP-chip datasets with known literature binding motifs as well as novel predictions can be found here.
Li H, Ruan J, Durbin R: Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Research 2008, 18:1851-1858.
Boyle AP, Guinney J, Crawford G, Furey T: F-Seq: a feature density estimator for high-throughput sequence tags. Bioinformatics 2008, 24:2537-2538.
Boyle AP, Davis S, Shulha H, Meltzer P, Margulies E, Weng Z, Furey T, Crawford G: High-resolution mapping and characterization of open chromatin across the genome. Cell 2008, 132:311-322.
Last updated: 08/31/2009