Transcription initiation and pervasive transcription.
Understanding how RNA polymerase II finds the right target location to produce transcripts of protein-coding genes and regulatory RNAs has been a long-standing focus of the lab. Our work here ranges from genomics to data analysis to computational modeling.
Recently, we worked with the labs of Chris Glass and Jim Kadonaga at UCSD to analyze nascent transcript initiation data obtained via the 5'GROseq protocol (Duttke, Lacadie et al., Mol Cell 2015). This allowed us to precisely dissect transcription initiation events in the forward and reverse directions at human promoters, and to identify sequence and chromatin features that discriminate between unidirectional and divergent transcription.
Earlier, we teamed up with Jun Zhu's lab to generate a high-resolution map of transcription initiation with a dedicated deep sequencing protocol, paired-end analysis of transcription, or PEAT (Ni, Corcoran et al., Nat Methods 2010). Our first study focused on transcription in Drosophila, and showed that transcription initiation comes in different guises, i.e. in different initiation patterns, that are conserved between flies and mammals despite strong differences in promoter sequence features.
Our early computational model of transcription start sites was called McPromoter. The fly version was improved in 2006 to specifically model different core promoter architectures. Batch fly predictions are provided in GFF format for release 4 and release 5. (Both were run with a threshold of 0.03; the number of predictions is higher for release 5 because of extra predictions in unassembled contigs, and because the minimal distance between predicted start sites was reduced from 1,000 nt to 100 nt, which increases the number of predictions for possible alternative TSSs.)
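To see why a smaller minimal distance yields more predictions at the same threshold, here is a minimal sketch in Python. It is not McPromoter's actual post-processing: the function name `select_tss`, the greedy best-score-first selection, the assumption that higher scores mean more confident predictions, and the toy positions and scores are all hypothetical and chosen only for illustration.

```python
def select_tss(predictions, threshold=0.03, min_distance=100):
    """Pick predicted start sites from (position, score) pairs.

    Assumptions (not taken from McPromoter itself): higher scores are
    better, and candidates are accepted greedily, best score first, as
    long as they lie at least `min_distance` nt from every site already
    accepted.
    """
    # Drop candidates below the score threshold, then rank by score.
    candidates = sorted(
        (p for p in predictions if p[1] >= threshold),
        key=lambda p: (-p[1], p[0]),
    )
    kept = []
    for pos, score in candidates:
        if all(abs(pos - kept_pos) >= min_distance for kept_pos, _ in kept):
            kept.append((pos, score))
    return sorted(kept)


# Toy candidates on a single sequence (made-up numbers).
toy = [(100, 0.90), (350, 0.40), (900, 0.05), (1500, 0.70), (1550, 0.02)]

strict = select_tss(toy, threshold=0.03, min_distance=1000)
relaxed = select_tss(toy, threshold=0.03, min_distance=100)
print(len(strict), len(relaxed))
```

With the 1,000 nt spacing, nearby candidates that could represent alternative start sites are suppressed by a stronger neighbor; with 100 nt they survive, so the prediction count goes up even though the threshold is unchanged.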
An earlier vertebrate McPromoter version was often used as a benchmark to evaluate newly developed systems but was retired after 15 years of hard work. However, we do encourage you to use S-Peaker (2009) to predict mammalian transcription start sites. It was our first system to make use of Cap Analysis of Gene Expression (CAGE) high-throughput data reflecting precise TSS locations, and it strongly outperforms McPromoter, especially in terms of resolution. S-Peaker was developed by Molly Megraw, in collaboration with Artemis Hatzigeorgiou and others. The code is available; please make use of this system, especially when benchmarking predictions.
Predicting enhancers and cell-type specific expression
Using ENCODE data, we have shown that sequence features in regions of open chromatin (as inferred from DNase-seq) help predict different patterns of cell-type specific expression (Natarajan et al., Genome Research 2012). The data for the analysis in this paper can be found here.
We also use DNase-seq or related data sets such as ATAC-seq to identify functional binding sites based on sequence and "footprints", i.e. patterns of chromatin accessibility that indicate binding. We have implemented this basic idea in a footprint mixture model.
Binding sites are most frequently represented as position weight matrices, which do not model dependencies between nucleotides. OMiMa, developed by Weichun Huang, was an early motif modeling approach with flexible higher-order dependencies not restricted to neighboring nucleotides.
Our system for motif finding is cERMIT, a fast, string-based motif finder built on suffix arrays. It integrates high-throughput genomics data such as ChIP-seq and CLIP-seq, optionally with conservation, to identify enriched motifs in genome-wide assays delivering information on possibly tens of thousands of sequences. This is joint work with Sayan Mukherjee at Duke.
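As a small illustration of what "no dependencies between nucleotides" means in practice, the sketch below scores a site with a position weight matrix in Python. The four-column matrix, the uniform background of 0.25, and the function name `pwm_log_odds` are hypothetical; this is not OMiMa or any other lab software, only the standard log-odds scoring idea.

```python
import math

BACKGROUND = 0.25  # assumed uniform background frequency per base

# Hypothetical 4-column motif; each column is a probability distribution.
PWM = [
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
]


def pwm_log_odds(site, pwm=PWM, background=BACKGROUND):
    """Sum of per-position log-odds; each column contributes independently."""
    if len(site) != len(pwm):
        raise ValueError("site length must match the number of PWM columns")
    return sum(math.log2(col[base] / background) for base, col in zip(site, pwm))


# The highest-scoring site simply takes the most probable base per column.
best_site = "".join(max(col, key=col.get) for col in PWM)

# Independence in action: changing the first base costs the same amount
# no matter which base sits at the third position.
cost_in_context_1 = pwm_log_odds("TATA") - pwm_log_odds("GATA")
cost_in_context_2 = pwm_log_odds("TACA") - pwm_log_odds("GACA")
print(best_site, math.isclose(cost_in_context_1, cost_in_context_2))
```

Because the score is a plain sum over columns, a matrix like this cannot express that a base at one position is favored only when a particular base occurs at another; models with higher-order dependencies are meant to capture exactly that.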
With Alex Hartemink's group, we worked on motif finding in a Gibbs sampler setting. The Priority system makes use of location-based priors to improve the efficiency of motif finding.
Bill Majoros has been developing a number of open source programs to identify genes in eukaryotic genomes. In his latest work, RSVP, ab initio hidden Markov models are augmented with RNA-seq data to predict protein-coding genes. Also check out his page and his book, which provides a great introduction to the kind of (sequence) modeling the lab is involved in.
Alignment and evolution.
Weichun Huang, the first postdoc in the lab, developed a program called ACANA (ACcurate ANchoring Alignment) for fast heuristic pairwise alignment of biological sequences at both the local and global level. He used these ideas in a simulator of cis-regulatory sequence evolution, allowing for turnover events of transcription factor binding sites.