Explicit DNase sequence bias modeling enables high-resolution transcription factor footprint detection
G. Yardımcı, C. Frank, G. Crawford, U. Ohler (2014). Explicit DNase sequence bias modeling enables high-resolution transcription factor footprint detection. Nucleic acids research.
DNaseI footprinting is an established assay for identifying transcription factor (TF)-DNA interactions with single base pair resolution. High throughput DNase-seq assays have recently been used to detect in vivo DNase footprints across the genome. Multiple computational approaches have been developed to identify DNase-seq footprints as predictors of TF binding. However, recent studies have pointed to a substantial cleavage bias of DNase and its negative impact on predictive performance of footprinting. To assess the potential for using DNaseI to identify individual binding sites, we performed DNase-seq on deproteinized genomic DNA and determined sequence cleavage bias. This allowed us to build bias corrected and TF specific footprint models. The predictive performance of these models demonstrated that predicted footprints corresponded to high confidence TF-DNA interactions. Here we present our mixture modeling framework to train multinomial based footprint models and assign footprint likelihood scores to each candidate binding site.
R version '2.15.' or higher.
R libraries ‘mixtools’, ‘gtools’ and ‘GenomicRanges’.
Perl. On windows based systems, we used activeperl installation.
Package Source for Linux Users can be downloaded from: FootprintMixture.tgz
Download and install the required R libraries mixtools, gtools and GenomicRanges
Extract the contents of tarball into same directory.
To run the model, MixtureModel.r script needs to loaded to memory by
Tarball comes with example data and example script that runs the model on example data. To train the model, three types of data are needed:
The coordinates of candidate binding sites in bed format (bed.data.txt).
The number of DNase-seq reads that map to each coordinate surrounding the candidate binding sites. Typically the model uses a 25bp windows that surround the candidate binding site (upstream and downstream). This data can be stored as a matrix (a tab separated file), similar to example file (cut.data.txt).
The coordinates of ChIP-seq peaks for identifying known binding sites. A bed file in narrowpeak format may be used, similar to example file(peak.data.txt). In the absence of known ChIP-seq peaks, coordinates of DNase hypersensitive sites may be used since these regions tend to be enriched for binding sites.
The mixture model may be trained by using calling the function MultMMixture_Full
MultMMixture_Full(TF_Bed=b, Cuts=c, peakbed=p, Plot=T, PadLen=25, k=2, ReturnPar=T, Fixed=T, Background="Seq", FastaName="fasta.fa");
First two arguments correspond to names of cell type and transcription factor name. The data for bed file containing coordinates of candidate binding sites is passed to function by TF_Bed argument. Matrix containing the number of DNase-seq reads is passed by Cuts argument and coordinates of ChIP-seq peaks can be passed by peakbed argument.
PadLen argument corresponds to the width of the window that surrounds the binding site on both sides (upstream and downstream). The default value of 25bp was used which may be changed.
Argument ‘k’ corresponds to number of mixture components. This value is normally set to two for training background and footprint components. It may be set to higher values to detect any meaningful variation in the footprint model (explained in the manuscript).
Argument ‘Background’ determines the initialization of background component of mixture model. This component may be initialized to be uniform (Background=”Flat”) or the profile that’s estimated from DNase-seq sequence bias (Background=”Seq”). If background model is estimated from sequence bias, DNA sequence surrounding each candidate binding site supplied in TF_Bed argument needs to be supplied with FastaName argument.
DNA sequences surrounding the candidate binding sites need to be 3bp wider than the PadLen argument because our model uses 6-mer based model to estimate sequence bias. In the tarball, example file (fasta.fa) is provided to estimate the background model. For a quicker implementation, we used RebuildSignal.pl perl script to estimate sequence bias background signal.
Argument ‘Fixed’ ( possible values: TRUE or FALSE) is used to determine whether background component is kept fixed during training or it’s also reestimated by expectation maximization algorithm.
After model is trained, model parameters and footprint likelihood values are returned by the MultMMixture_Full function as a list. FLR values are stored in vector named ‘llr’ and model parameters are returned in matrix ‘par’.
Note: In some operating systems the command “shell” in the script MixtureModel.r might need to be replaced with the command “system”.