Home
Documentation
Download
Examples

microMUMMIE


MicroMUMMIE is a specific model, implemented within the MUMMIE framework, for predicting micro-RNA binding sites using PAR-CLIP data.  Thus, while MUMMIE can be used for many different bioinformatic modeling tasks, microMUMMIE is a specific model for a specific task. This page explains how to install microMUMMIE and use it.

Skip to a section:
  1. Installation
  2. Running
  3. Interpreting the Output
  4. Using Conservation
  5. Other Options


1. Installation

Install Base Packages

Please begin by installing
PARalyzer and MUMMIE (you will also need twoBitToFa, which must be in your PATH).  PARalyzer uses Bowtie to align PAR-CLIP reads to the genome and then constructs a smooth signal curve that can be used for peak-calling to find (roughly) where an RNA-binding protein (such as Argonaute) binds.  However, rather than using a peak-caller, microMUMMIE instead uses a hidden Markov model (HMM) to find the most probable miRNA seed match near the PARalyzer peak.  This allows microMUMMIE to weigh multiple forms of evidence (e.g., T-to-C conversion rates, evolutionary conservation, RNA sequence) in making the most informed prediction.  The base MUMMIE package also includes the microMUMMIE scripts and models.

Compile Source Code or Download Binaries

The distributed MUMMIE linux binaries may work on your system; if they don't, you'll need to compile the source code.  Please refer to the download page for compilation instructions.

Optional: Download Example Dataset

The following dataset can be downloaded for testing purposes: sample inputs

Set Environment Variables

As noted in the compilation instructions on the download page, you must add several MUMMIE directories to your path.  Please be sure you perform all of these steps; these are the commands for csh and tsch (bash is similar):

setenv MUMMIE /path/to/mummie
setenv PATH ${PATH}:$MUMMIE/perl:$MUMMIE/bin:$MUMMIE
setenv PERLLIB ${PERLLIB}:$MUMMIE/perl
setenv PERL5LIB ${PERL5LIB}:$MUMMIE/perl
rehash


2. Running

Run Bowtie and PARalyzer

The first task is to run Bowtie and PARalyzer on your raw PAR-CLIP reads.  Please see the PARalyzer instructions for recommended Bowtie settings and other information.  One of the outputs of the PARalyzer pipeline will be a "distribution" file (you may need to edit the PARalyzer parameter file to enable the generation of this optional output file).  This distribution file provides the T-to-C conversion profile used by MUMMIE, but that profile must be combined with evolutionary conservation information and RNA sequence in order for MUMMIE to make predictions.



Run MicroMUMMIE

The microMUMMIE.pl script performs all of the required steps, from data preprocessing to model building to prediction.  However, there are situations in which you will want to modify the script or perform these steps individually in order to obtain the results you desire.  For example, microMUMMIE.pl uses the twoBitToFa utility to extract genomic sequence from a 2-bit genome file, whereas you may have your genomic sequence in another format (such as FASTA) and may already have specific genomic sections extracted (such as 3' UTRs).  Note also that the script generates temporary files that will be overwritten each time the script is executed (i.e., do not try to run two copies of the script simultaneously in the same directory).  Thus, we will first describe how to run the microMUMMIE.pl script, and then how to modify it.


You can run the microMUMMIE.pl script on the UNIX command line as follows:

microMUMMIE.pl mature.txt genome.2bit paralyzer-output-dir library-name out.gff 1 coordinatefile

For example, if you cd into the sample data directory, the following command should work:

microMUMMIE.pl mature.txt genome.2bit . D1 out.gff 1 UTRs.txt

The parameters are as follows:
  • mature.txt : This is a 2-column text file giving the name of the miRNA in the first column and the mature miRNA sequence in the second column.  The sequence should be roughly 20-24 nt long; it is the result of processing by Dicer and Drosha.
  • genome.2bit : This is the complete genome in 2bit format.  Any organism can be used.
  • paralyzer-output-dir : Path to output directory from Paralyzer.
  • library-name : The name of library -- i.e., the initials XX in the file name XX.distribution/clusters/groups.csv
  • out.gff : This is the file that the output of microMUMMIE.pl will be written into.
  • 1 : This is an advanced option and should be set to 1 in most cases.  (1=posterior decoding, 0=Viterbi decoding; posterior decoding will generally produce more predictions than Viterbi).
  • coordinatefile : File containing coordinates of genomic sequences to search -- i.e., 3'UTR,  CDS, etc. Columns should be in this order: Chromosome, Start, End, GeneID, Strand, Transcript ID.  Separate columns with tabs.

3. Interpreting the Output

Output Format

The output will be in a gff file (out.gff), which consists of 1-based coordinates of predicted miRNA targets and their posterior probability scores.  Note that the script actually generates several sets of predictions made at different sensitivities and specificities; out.gff contains only one of these prediction sets, parameterized to have medium sensitivity and medium specificity.  Additional prediction sets at other parameterizations are available in the files named predictions-varNNN.gff, where higher values of NNN correspond to higher specificity/SNR/accuracy, and lower sensitivity.

Here is a sample line from microMUMMIE's output:

chr9 hsa-let-7d 8mer-A1 74298636 74298643 0.665 - .  seq=CTACCTCA;sens=0.62;SNR=2.24;

This line can be interpreted as follows.  On chromosome 9 (chr9), occupying 1-based coordinate interval 74298636-74298643 on the antisense strand is a seed match to miRNA let-7d.  The corresponding DNA sequence for this RNA target site is CTACCTCA.  The posterior probability of this site under the microMUMMIE model is 0.665, the estimated sensitivity is 62%, and the estimated signal-to-noise ratio (SNR) is 2.24 (these latter two statistics are interpolated from previously performed shuffling experiments).  Finally, the type of seed match is 8mer-A1, which means that the match is 8 nt long, but the 3'-most residue is an A even if the miRNA seed residue at this position is not a U.

Scores and Postprocessing

In its default operation, microMUMMIE performs posterior decoding, which means that multiple sites may be predicted for each PAR-CLIP cluster, and the scores assigned to individual sites are posterior probabilities.  The posterior probability of a site is the probability of the HMM going through the foreground states for a site, irrespective of what other states are visited outside this putative site.  One implication of this fact is that predicted sites that partially overlap will be forced to share probability, since different states are mutually exclusive at a given site in the HMM.  However, for different types of seed matches (e.g., 6mer, 7mer, 8mer), the probabilities of each of these types of matches will be appropriately summed for any given miRNA, so that, for example, a 6mer match inside a 7mer match will not subtract from the 7mer score.

Sensitivity, Specificity, and Signal-to-Noise Ratio

MicroMUMMIE can be parameterized to run at different sensitivities or specificities.  The single parameter to microMUMMIE, called the peak emission variance (PEV), controls the tradeoff between sensitivity and specificity.  Higher PEV produces higher specificity, so that the predictions you obtain should be more confident.  Lowever PEV produces higher sensitivity, so that you will receive more predictions, though not all predictions will be of the highest confidence.  The individual output files named predictions-varNNN.gff contain predictions at different PEV values; the single default output file is simply a copy of one of these files selected to have medium sensitivity and medium specificity, but you can opt for greater sensitivity by choosing a lower PEV or greater specificity by choosing a higher PEV.

When running microMUMMIE on a new data set, it is not feasible to assess the actual sensitivity, specificity, or signal-to-noise ratio (SNR, which generally correlates with specificity) without extensive simulation experiments.  Thus, to provide a rough indication of the sensitivity and specificity trends, estimates of these values are inferred by interpolating from the following table, which was generated via large-scale simulation results:

PEV
sensitivity
SNR
0.5
0.12
15.7
0.25
0.17
12.04
0.2
0.2
9.95
0.15
0.27
7.07
0.1
0.42
5.09
0.01
0.62
2.24


 
Specificity correlates very closely with SNR, so increasing SNR increases specificity.

Please note that scores for individual site predictions, which are posterior probabilities, are not comparable between microMUMMIE runs under different parameterizations, because these probabilities are conditional on the model, and the model differs when parameters are changed.  Thus, we recommend that you use the PEV setting to select the overall desired level of sensitivity and specificity (SNR), and only then to take into consideration the posterior probabilities of individual sites when filtering results.


4. Using Conservation

Note that the default microMUMMIE.pl script does not utilize evolutionary conservation evidence.  We now describe how to modify the script to include this evidence.  There are a number of programs that can be used to compute measures of sequence conservation.  We have tried both PhastCons (by Adam Siepel) and the branch-length-score (BLS) script included in the TargetScan package (by Bartel et al.).  We have obtained superior results with the latter, and recommend its use for now.

In order to add the TargetScan track to Fastb file you need to do the following:

  • Generage maf file : This is a multiple sequence aligment file for target transcripts. As we need aligment for exons only, it is important to slice the aligment for parts we need. We recommned using UCSC Kent tool mafFrag for this purpose.

  • mafFrag hg19 (=database) multiz23way (=track) Example.bed (=coordinate file) Output.maf (=file we need for Targetscan)

  • Run par2fastb.pl Genome2.bit Paralyzer_output_directory Name_of_output_directory Coordinatefile Name_of_Library (For details read description above)
  • Run assemble-transcripts.pl Input_Directory Output_Directory. Input_Directory  comes from step above and Output_Directory will have assembled transcripts. Note: ABC_2_0.fastb denotes isoform 3 and exon 1 (using 0-based indices).
  • Run TargetScan scripts Download the TargetScan scripts to calculate Context Score and Branchlength score: TargetScan. You will need the maf file to run targetscan scripts.
  • Add Conservation Score Once you have the TargetScan score, add it to Fastb files. We provide our own custom script for the purpose.

  • add-targetscan-tracks.pl Targetscanscore_file_gff Input_Directory (=Directory with Assembled Transcript) Output_Directory (=Directory with Fastb files having TargetScan score) Score_Type (=Context or Branch Length)

  • Run microMUMMIE wrapper (microMUMMIE_targetscan.pl) We provide a customized perl script for running the pipeline with TargetScan scores. Keep in mind that the Fastb files are already prepared and the directory containing those file should be given while running this script.


  • Whatever program you use to evaluate conservation evidence, there are two steps invoved in incorporating these scores in the microMUMMIE prediction.  First, you need to add the scores to the FASTB files that MUMMIE uses as input; see the MUMMIE documentation for instructions on adding tracks to a FASTB file.  Second, you need to modify the microMUMMIE.pl script so that it includes a conservation track in the model.  This can be done simply by changing the line near the top of the script to read "$WANT_CONSERVATION=1".  With regards to the first step (adding the conservation track), you'll also want to modify the microMUMMIE.pl script so as not to overwrite your modified FASTB files; this can be done by locating the section of the script labeled "PREPARING INPUT FILES" and commenting out the three system calls.

    5. Other Options

    Bulge Predictions

    The script can also generate so-called "bulge" predictions, in which certain residues in the miRNA seed are not matched in the mRNA.  To obtain bulge predictions, you can edit the microMUMMIE.pl script and change the line near the top that reads "$WANT_BULGE=0" to instead read "$WANT_BULGE=1".  The bulge predictions will be written into files of the form bulgeIII-predictions-var0.5.gff; the III is the bulge type, which can be I, II, or III, and the 0.5 is a parameter to the model that controls the tradeoff between sensitivity and specificity.

    Statistics

    The stats.pl script provides useful statistics regarding MUMMIE predictions:

    perl stats.pl --help


    contact