MicroMUMMIE is a model, implemented within the MUMMIE framework, for predicting microRNA (miRNA) binding sites using PAR-CLIP data. MUMMIE itself can be used for many different bioinformatic modeling tasks, whereas microMUMMIE is a specific model for this one task. This page explains how to install and use microMUMMIE.
Install Base Packages
Please begin by installing PARalyzer and MUMMIE (you will also need twoBitToFa, which must be in your PATH). PARalyzer uses Bowtie to align PAR-CLIP reads to the genome and then constructs a smooth signal curve that can be used for peak-calling to find (roughly) where an RNA-binding protein (such as Argonaute) binds. However, rather than using a peak-caller, microMUMMIE instead uses a hidden Markov model (HMM) to find the most probable miRNA seed match near the PARalyzer peak. This allows microMUMMIE to weigh multiple forms of evidence (e.g., T-to-C conversion rates, evolutionary conservation, RNA sequence) in making the most informed prediction. The base MUMMIE package also includes the microMUMMIE scripts and models.
Compile Source Code or Download Binaries
The distributed MUMMIE Linux binaries may work on your system; if they don't, you'll need to compile the source code. Please refer to the download page for compilation instructions.
Optional: Download Example Dataset
The following dataset can be downloaded for testing purposes: sample inputs
Set Environment Variables
As noted in the compilation instructions on the download page, you must add several MUMMIE directories to your PATH. Please be sure to perform all of these steps; these are the commands for csh and tcsh (bash is similar):
Run Bowtie and PARalyzer
The first task is to run Bowtie and PARalyzer on your raw PAR-CLIP reads. Please see the PARalyzer instructions for recommended Bowtie settings and other information. One of the outputs of the PARalyzer pipeline will be a "distribution" file (you may need to edit the PARalyzer parameter file to enable the generation of this optional output file). This distribution file provides the T-to-C conversion profile used by MUMMIE, but that profile must be combined with evolutionary conservation information and RNA sequence in order for MUMMIE to make predictions.
The microMUMMIE.pl script performs all of the required steps, from data preprocessing to model building to prediction. However, there are situations in which you will want to modify the script or perform these steps individually in order to obtain the results you desire. For example, microMUMMIE.pl uses the twoBitToFa utility to extract genomic sequence from a 2-bit genome file, whereas you may have your genomic sequence in another format (such as FASTA) and may already have specific genomic sections extracted (such as 3' UTRs). Note also that the script generates temporary files that will be overwritten each time the script is executed (i.e., do not try to run two copies of the script simultaneously in the same directory). Thus, we will first describe how to run the microMUMMIE.pl script, and then how to modify it.
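Because the temporary files are overwritten on each run, one way to run several analyses side by side is to give each run its own working directory. The following is a minimal sketch of that idea, not part of the MUMMIE distribution: the helper name run_isolated is hypothetical, and the command list is whatever invocation you would normally type. Since the command runs from a new directory, any input files should be given as absolute paths.

```python
import subprocess
import tempfile
from pathlib import Path


def run_isolated(command, parent_dir="."):
    """Run `command` (a list of strings) inside a freshly created directory.

    Each call gets its own directory, so temporary files written to the
    current working directory by one run cannot clobber those of another.
    Returns the directory so the caller can collect the output files.
    """
    # mkdtemp creates a uniquely named directory that did not exist before.
    workdir = Path(tempfile.mkdtemp(prefix="micromummie-", dir=parent_dir)).resolve()
    # check=True raises CalledProcessError if the command exits with an error.
    subprocess.run(command, cwd=workdir, check=True)
    return workdir
```

Each returned directory then holds that run's outputs and temporary files, and can be deleted once the results have been copied elsewhere.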
You can run the microMUMMIE.pl script on the UNIX command line as follows:
For example, if you
The parameters are as follows:
The output will be in a GFF file (out.gff), which lists the 1-based coordinates of predicted miRNA targets together with their posterior probability scores. Note that the script actually generates several sets of predictions made at different sensitivities and specificities; out.gff contains only one of these prediction sets, parameterized to have medium sensitivity and medium specificity. Additional prediction sets at other parameterizations are available in the files named predictions-varNNN.gff, where higher values of NNN correspond to higher specificity/SNR/accuracy and lower sensitivity.
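To see which prediction sets a run produced, the predictions-varNNN.gff files can be listed in order of NNN. The sketch below assumes that NNN is a plain decimal number (as in the "var0.5" file name mentioned elsewhere on this page) and that the files sit in a single directory; the function names are hypothetical.

```python
import re
from pathlib import Path

# Assumption: NNN is a non-negative decimal number such as 0.5 or 1.
PEV_PATTERN = re.compile(r"^predictions-var(\d+(?:\.\d+)?)\.gff$")


def pev_from_name(filename):
    """Return NNN as a float, or None if the name does not match."""
    match = PEV_PATTERN.match(filename)
    return float(match.group(1)) if match else None


def prediction_files_by_pev(directory="."):
    """List (NNN, path) pairs sorted from most sensitive to most specific."""
    found = []
    for path in Path(directory).iterdir():
        pev = pev_from_name(path.name)
        if pev is not None:
            found.append((pev, path))
    # Lower NNN means higher sensitivity; higher NNN means higher specificity.
    return sorted(found, key=lambda pair: pair[0])
```

The first entry of the returned list is then the most sensitive prediction set and the last entry the most specific one.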
Here is a sample line from microMUMMIE's output:
This line can be interpreted as follows. On chromosome 9 (chr9), occupying 1-based coordinate interval 74298636-74298643 on the antisense strand is a seed match to miRNA let-7d. The corresponding DNA sequence for this RNA target site is CTACCTCA. The posterior probability of this site under the microMUMMIE model is 0.665, the estimated sensitivity is 62%, and the estimated signal-to-noise ratio (SNR) is 2.24 (these latter two statistics are interpolated from previously performed shuffling experiments). Finally, the type of seed match is 8mer-A1, which means that the match is 8 nt long, but the 3'-most residue is an A even if the miRNA seed residue at this position is not a U.
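The 8mer-A1 rule described above can be sketched in a few lines. This is only an illustration, not microMUMMIE's own code: it assumes that the seed occupies miRNA positions 2-8, and that the first eight nucleotides of let-7d are AGAGGUAG (a sequence not given on this page). The function names are hypothetical.

```python
# Complement table that maps RNA or DNA bases to the DNA complement.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "U": "A", "T": "A"}


def reverse_complement_dna(seq):
    """Reverse-complement an RNA or DNA string, returning DNA letters."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def site_8mer_a1(mirna):
    """DNA sequence of the 8mer-A1 target site for a miRNA (5'->3').

    Seven residues pair with miRNA positions 2-8; the 3'-most residue
    of the site is an A regardless of miRNA position 1.
    """
    seed = mirna[1:8]  # positions 2-8 in 1-based numbering
    return reverse_complement_dna(seed) + "A"


# Assumed 5' end of let-7d; only positions 2-8 affect the result.
print(site_8mer_a1("AGAGGUAG"))  # CTACCTCA
```

Under these assumptions the sketch reproduces the site sequence CTACCTCA from the example, and changing position 1 of the miRNA leaves the result unchanged, which is the point of the A1 convention.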
Scores and Postprocessing
In its default operation, microMUMMIE performs posterior decoding, which means that multiple sites may be predicted for each PAR-CLIP cluster, and the scores assigned to individual sites are posterior probabilities. The posterior probability of a site is the probability of the HMM going through the foreground states for a site, irrespective of what other states are visited outside this putative site. One implication of this fact is that predicted sites that partially overlap will be forced to share probability, since different states are mutually exclusive at a given site in the HMM. However, for different types of seed matches (e.g., 6mer, 7mer, 8mer), the probabilities of each of these types of matches will be appropriately summed for any given miRNA, so that, for example, a 6mer match inside a 7mer match will not subtract from the 7mer score.
Sensitivity, Specificity, and Signal-to-Noise Ratio
MicroMUMMIE can be parameterized to run at different sensitivities or specificities. The single parameter to microMUMMIE, called the peak emission variance (PEV), controls the tradeoff between sensitivity and specificity. A higher PEV produces higher specificity, so that the predictions you obtain should be more confident. A lower PEV produces higher sensitivity, so that you will receive more predictions, though not all predictions will be of the highest confidence. The individual output files named predictions-varNNN.gff contain predictions at different PEV values; the single default output file is simply a copy of one of these files selected to have medium sensitivity and medium specificity, but you can opt for greater sensitivity by choosing a lower PEV or greater specificity by choosing a higher PEV.
When running microMUMMIE on a new data set, it is not feasible to assess the actual sensitivity, specificity, or signal-to-noise ratio (SNR, which generally correlates with specificity) without extensive simulation experiments. Thus, to provide a rough indication of the sensitivity and specificity trends, estimates of these values are inferred by interpolating from the following table, which was generated from large-scale simulations:
Specificity correlates very closely with SNR, so increasing SNR increases specificity.
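The kind of table lookup described above can be illustrated with a small interpolation routine. This is a sketch under stated assumptions: it uses piecewise-linear interpolation and clamps values outside the table, which may differ from what microMUMMIE actually does, and the numbers in the usage example are toy values, not entries from the table.

```python
from bisect import bisect_left


def interpolate(x, xs, ys):
    """Piecewise-linear interpolation of y at x.

    xs must be sorted in increasing order, with ys[i] the tabulated value
    at xs[i]. Queries outside the table are clamped to the end values.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with >= 2 points")
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_left(xs, x)  # first index with xs[i] >= x, so 1 <= i < len(xs)
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


# Toy values only: a query halfway between two rows returns the midpoint.
print(interpolate(0.5, [0.0, 1.0], [10.0, 20.0]))  # 15.0
```

The same routine can be called once per column (for example, once for sensitivity and once for SNR), with xs holding whatever quantity the table is indexed by.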
Please note that scores for individual site predictions, which are posterior probabilities, are not comparable between microMUMMIE runs under different parameterizations, because these probabilities are conditional on the model, and the model differs when parameters are changed. Thus, we recommend that you first use the PEV setting to select the overall desired level of sensitivity and specificity (SNR), and only then take the posterior probabilities of individual sites into consideration when filtering results.
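Once a single prediction file has been chosen, filtering its sites by posterior probability might look like the sketch below. It assumes the conventional tab-separated GFF layout in which the sixth column holds the score; check this against your own output before relying on it. The function name and the example lines in the comments are hypothetical.

```python
def filter_gff_lines(lines, min_score):
    """Keep GFF lines whose score column is at least min_score.

    Assumes tab-separated GFF with the score in column 6. Blank lines,
    comment lines, and lines without a numeric score are skipped.
    """
    kept = []
    for line in lines:
        stripped = line.rstrip("\n")
        if not stripped.strip() or stripped.startswith("#"):
            continue
        fields = stripped.split("\t")
        if len(fields) < 6:
            continue
        try:
            score = float(fields[5])
        except ValueError:
            continue  # e.g. "." used as a placeholder score
        if score >= min_score:
            kept.append(stripped)
    return kept
```

For example, reading one predictions-varNNN.gff file with open(...) and passing the file object together with a threshold of 0.5 returns only the sites scoring at least 0.5 within that one parameterization.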
Evolutionary conservation evidence can be obtained from PhastCons (by Adam Siepel) or from the branch-length-score (BLS) script included in the TargetScan package (by Bartel et al.). We have obtained superior results with the latter, and recommend its use for now.
To add the TargetScan track to a FASTB file, you need to do the following:
Whatever program you use to evaluate conservation evidence, there are two steps involved in incorporating these scores into the microMUMMIE prediction. First, you need to add the scores to the FASTB files that MUMMIE uses as input; see the MUMMIE documentation for instructions on adding tracks to a FASTB file. Second, you need to modify the microMUMMIE.pl script so that it includes a conservation track in the model. This can be done simply by changing the line near the top of the script to read "$WANT_CONSERVATION=1". With regard to the first step (adding the conservation track), you'll also want to modify the microMUMMIE.pl script so as not to overwrite your modified FASTB files; this can be done by locating the section of the script labeled "PREPARING INPUT FILES" and commenting out the three system calls.
The script can also generate so-called "bulge" predictions, in which certain residues in the miRNA seed are not matched in the mRNA. To obtain bulge predictions, you can edit the microMUMMIE.pl script and change the line near the top that reads "$WANT_BULGE=0" to instead read "$WANT_BULGE=1". The bulge predictions will be written into files of the form bulgeIII-predictions-var0.5.gff; the III is the bulge type, which can be I, II, or III, and the 0.5 is a parameter to the model that controls the tradeoff between sensitivity and specificity.
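Editing a flag such as $WANT_BULGE by hand is straightforward, but the change can also be scripted when several variants of the script are needed. The sketch below operates on the script's text; it assumes the flag is assigned an integer on a line of its own near the top (optionally preceded by "my" and with optional spaces around "="), which may not match the actual script exactly. The function name set_flag is hypothetical.

```python
import re


def set_flag(script_text, flag, value):
    """Return script_text with the first `$flag = <digits>` set to value.

    Raises ValueError if no such assignment is found, so a silent no-op
    cannot be mistaken for a successful edit.
    """
    pattern = re.compile(
        r"^(\s*(?:my\s+)?\$" + re.escape(flag) + r"\s*=\s*)\d+",
        re.MULTILINE,
    )
    # A function replacement avoids ambiguity between group 1 and the digit.
    new_text, count = pattern.subn(
        lambda m: m.group(1) + str(int(value)), script_text, count=1
    )
    if count == 0:
        raise ValueError(f"flag ${flag} not found")
    return new_text
```

It is safest to write the result to a copy of microMUMMIE.pl rather than overwriting the distributed script.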
The stats.pl script provides useful statistics regarding MUMMIE predictions: