MicroMUMMIE is a specific model, implemented within the MUMMIE framework, for predicting micro-RNA binding sites using PAR-CLIP data. Thus, while MUMMIE can be used for many different bioinformatic modeling tasks, microMUMMIE is a specific model for a specific task. This page explains how to install microMUMMIE and use it.
1. Installation

Install Base Packages

Please begin by installing PARalyzer and MUMMIE (you will also need twoBitToFa, which must be in your PATH). PARalyzer uses Bowtie to align PAR-CLIP reads to the genome and then constructs a smooth signal curve that can be used for peak-calling to find (roughly) where an RNA-binding protein (such as Argonaute) binds. Rather than using a peak-caller, however, microMUMMIE uses a hidden Markov model (HMM) to find the most probable miRNA seed match near the PARalyzer peak. This allows microMUMMIE to weigh multiple forms of evidence (e.g., T-to-C conversion rates, evolutionary conservation, RNA sequence) when making a prediction. The base MUMMIE package also includes the microMUMMIE scripts and models.

Compile Source Code or Download Binaries

The distributed MUMMIE Linux binaries may work on your system; if they don't, you'll need to compile the source code. Please refer to the download page for compilation instructions.

Optional: Download Example Dataset

The following dataset can be downloaded for testing purposes: sample inputs

Set Environment Variables

As noted in the compilation instructions on the download page, you must add several MUMMIE directories to your path. Please be sure you perform all of these steps; these are the commands for csh and tcsh (bash is similar):
    setenv MUMMIE /path/to/mummie
    setenv PERL5LIB ${PERL5LIB}:$MUMMIE/perl
    rehash

2. Running

Run Bowtie and PARalyzer
The first task is to run Bowtie and PARalyzer on your raw PAR-CLIP reads. Please see the PARalyzer instructions for recommended Bowtie settings and other information. One of the outputs of the PARalyzer pipeline will be a "distribution" file (you may need to edit the PARalyzer parameter file to enable the generation of this optional output file). This distribution file provides the T-to-C conversion profile used by MUMMIE, but that profile must be combined with evolutionary conservation information and RNA sequence in order for MUMMIE to make predictions.

Run MicroMUMMIE

The microMUMMIE.pl script performs all of the required steps, from data preprocessing to model building to prediction. However, there are situations in which you will want to modify the script or perform these steps individually in order to obtain the results you desire. For example, microMUMMIE.pl uses the twoBitToFa utility to extract genomic sequence from a 2-bit genome file, whereas you may have your genomic sequence in another format (such as FASTA) and may already have specific genomic sections extracted (such as 3' UTRs). Note also that the script generates temporary files that will be overwritten each time the script is executed (i.e., do not try to run two copies of the script simultaneously in the same directory). Thus, we will first describe how to run the microMUMMIE.pl script, and then how to modify it.

You can run the microMUMMIE.pl script on the UNIX command line as follows:

    microMUMMIE.pl mature.txt genome.2bit paralyzer-output-dir library-name out.gff 1 coordinatefile

For example, if you cd into the sample data directory, the following command should work:

    microMUMMIE.pl mature.txt genome.2bit . D1 out.gff 1 UTRs.txt

The parameters are as follows:

3. Interpreting the Output

Output Format

The output will be in a GFF file (out.gff), which consists of 1-based coordinates of predicted miRNA targets and their posterior probability scores. Note that the script actually generates several sets of predictions made at different sensitivities and specificities; out.gff contains only one of these prediction sets, parameterized to have medium sensitivity and medium specificity. Additional prediction sets at other parameterizations are available in the files named predictions-varNNN.gff, where higher values of NNN correspond to higher specificity/SNR/accuracy, and lower sensitivity.

Here is a sample line from microMUMMIE's output:

    chr9 hsa-let-7d 8mer-A1 74298636 74298643 0.665 - . seq=CTACCTCA;sens=0.62;SNR=2.24;

This line can be interpreted as follows. On chromosome 9 (chr9), occupying 1-based coordinate interval 74298636-74298643 on the antisense strand, is a seed match to miRNA let-7d. The corresponding DNA sequence for this RNA target site is CTACCTCA. The posterior probability of this site under the microMUMMIE model is 0.665, the estimated sensitivity is 62%, and the estimated signal-to-noise ratio (SNR) is 2.24 (these latter two statistics are interpolated from previously performed shuffling experiments). Finally, the type of seed match is 8mer-A1, which means that the match is 8 nt long, but the 3'-most residue is an A even if the miRNA seed residue at this position is not a U.
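
As a minimal sketch of reading such a line programmatically, the following Python splits the sample line into its nine fields and then checks that the reported seq value is what an 8mer-A1 site for let-7d would look like. The names PredictedSite, parse_prediction, and site_8mer_a1 are illustrative and not part of MUMMIE. The sketch assumes that fields are whitespace-separated as displayed above, that seq is already given in target-site orientation, and that the mature let-7d sequence is AGAGGUAGUAGGUUGCAUAGUU (please check this against your own mature.txt).

```python
from dataclasses import dataclass

# Watson-Crick complements for RNA bases.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


@dataclass
class PredictedSite:
    chrom: str
    mirna: str
    match_type: str
    start: int        # 1-based, inclusive
    end: int          # 1-based, inclusive
    posterior: float
    strand: str
    attrs: dict       # e.g. {"seq": ..., "sens": ..., "SNR": ...}


def parse_prediction(line: str) -> PredictedSite:
    """Parse one prediction line (assumes whitespace-separated fields)."""
    chrom, mirna, match_type, start, end, score, strand, _frame, attr_text = (
        line.split(None, 8)
    )
    # Attributes are key=value pairs separated (and terminated) by ';'.
    attrs = dict(
        item.split("=", 1) for item in attr_text.strip().split(";") if item
    )
    return PredictedSite(chrom, mirna, match_type, int(start), int(end),
                         float(score), strand, attrs)


def site_8mer_a1(mature_rna: str) -> str:
    """DNA sequence of the 8mer-A1 target site for a mature miRNA (5'->3')."""
    seed = mature_rna.upper().replace("T", "U")[1:8]   # miRNA nucleotides 2-8
    paired = "".join(COMPLEMENT[base] for base in reversed(seed))
    return (paired + "A").replace("U", "T")            # A at the site's 3' end


SAMPLE = ("chr9 hsa-let-7d 8mer-A1 74298636 74298643 0.665 - . "
          "seq=CTACCTCA;sens=0.62;SNR=2.24;")
LET_7D = "AGAGGUAGUAGGUUGCAUAGUU"  # assumed mature sequence; verify locally

site = parse_prediction(SAMPLE)
print(site.end - site.start + 1)                  # site length in nt
print(site_8mer_a1(LET_7D) == site.attrs["seq"])  # does seq fit an 8mer-A1 site?
```

Under the assumed let-7d sequence, the first miRNA residue is an A rather than a U, which illustrates why the site type is reported as 8mer-A1: the final A of the site is fixed regardless of the miRNA residue opposite it.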

Scores and Postprocessing

In its default operation, microMUMMIE performs posterior decoding, which means that multiple sites may be predicted for each PAR-CLIP cluster, and the scores assigned to individual sites are posterior probabilities. The posterior probability of a site is the probability of the HMM going through the foreground states for a site, irrespective of what other states are visited outside this putative site. One implication of this fact is that predicted sites that partially overlap will be forced to share probability, since different states are mutually exclusive at a given site in the HMM. However, for different types of seed matches (e.g., 6mer, 7mer, 8mer), the probabilities of each of these types of matches will be appropriately summed for any given miRNA, so that, for example, a 6mer match inside a 7mer match will not subtract from the 7mer score.

Sensitivity, Specificity, and Signal-to-Noise Ratio

MicroMUMMIE can be parameterized to run at different sensitivities or specificities. The single parameter to microMUMMIE, called the peak emission variance (PEV), controls the tradeoff between sensitivity and specificity. Higher PEV produces higher specificity, so that the predictions you obtain should be more confident. Lower PEV produces higher sensitivity, so that you will receive more predictions, though not all predictions will be of the highest confidence. The individual output files named predictions-varNNN.gff contain predictions at different PEV values; the single default output file is simply a copy of one of these files selected to have medium sensitivity and medium specificity, but you can opt for greater sensitivity by choosing a lower PEV or greater specificity by choosing a higher PEV.

When running microMUMMIE on a new data set, it is not feasible to assess the actual sensitivity, specificity, or signal-to-noise ratio (SNR, which generally correlates with specificity) without extensive simulation experiments.
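
Because the per-PEV prediction files follow the predictions-varNNN.gff naming pattern, they can be ordered from most sensitive to most specific before one is chosen. The following Python sketch does this for a list of file names. Here order_by_pev is a hypothetical helper (not part of MUMMIE), the file names in the usage lines are made up for illustration, and the code assumes that the NNN portion of each name is a plain non-negative number.

```python
import re
from pathlib import Path

# Assumption: NNN is a non-negative integer or decimal number.
_PATTERN = re.compile(r"^predictions-var(\d+(?:\.\d+)?)\.gff$")


def order_by_pev(filenames):
    """Return (pev, filename) pairs sorted from lowest to highest PEV.

    Names that do not match the predictions-varNNN.gff pattern are skipped.
    """
    pairs = []
    for name in filenames:
        match = _PATTERN.match(Path(name).name)
        if match:
            pairs.append((float(match.group(1)), name))
    return sorted(pairs)


# Hypothetical file names, for illustration only.
names = ["predictions-var0.5.gff", "out.gff",
         "predictions-var0.01.gff", "predictions-var2.gff"]
ordered = order_by_pev(names)
most_sensitive = ordered[0][1]   # lowest PEV: more predictions
most_specific = ordered[-1][1]   # highest PEV: more confident predictions
print(most_sensitive, most_specific)
```

In practice the list of names could come from a directory listing of the run directory, for example `Path(".").glob("predictions-var*.gff")`.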
To provide a rough indication of the sensitivity and specificity trends, estimates of these values are inferred by interpolating from the following table, which was generated from large-scale simulations:
Specificity correlates very closely with SNR, so increasing SNR increases specificity. Please note that scores for individual site predictions, which are posterior probabilities, are not comparable between microMUMMIE runs under different parameterizations, because these probabilities are conditional on the model, and the model differs when parameters are changed. Thus, we recommend that you use the PEV setting to select the overall desired level of sensitivity and specificity (SNR), and only then take into consideration the posterior probabilities of individual sites when filtering results.

4. Using Conservation

Note that the default microMUMMIE.pl script does not utilize evolutionary conservation evidence. We now describe how to modify the script to include this evidence. There are a number of programs that can be used to compute measures of sequence conservation. We have tried both PhastCons (by Adam Siepel) and the branch-length-score (BLS) script included in the TargetScan package (by Bartel et al.). We have obtained superior results with the latter, and recommend its use for now.