Data Preparation Guidelines

MUMMIE operates on multivariate sequential data stored in text files called FASTB files.  Each "track" (variate) begins with a defline, just as in a FASTA file.  For alphabetic data (like DNA), the defline consists of a ">" symbol followed by the track name, optionally followed by any number of key-value attributes:

>dna /chr=X /begin=29384 /end=29498 /strand=+

This defline would be followed by DNA sequence, just as in a normal FASTA file:

>dna /chr=X /begin=29384 /end=29498 /strand=+

For continuous tracks, the defline instead begins with a "%" symbol, but is otherwise formatted just like the defline of an alphabetic track:

%conservation /chr=X /begin=29384 /end=29498 /strand=+

Following this defline would be the continuous values for the track, one value per line:

%conservation /chr=X /begin=29384 /end=29498 /strand=+

A FASTB file may have any number of tracks (at least one), but all of the tracks in a single file must have the same length (because although they correspond to different variates, they jointly represent a single sequence).  In the current version of MUMMIE, every FASTB file must contain at least one continuous track; this requirement will be relaxed in a future version of the software.
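The constraints above can be checked mechanically before running any MUMMIE program.  The following Python sketch is not part of MUMMIE: parse_fastb and check_fastb are hypothetical helper names, and the sketch assumes that blank lines are ignored and that an alphabetic track may span several lines, as in FASTA.  The values in the example are made up for illustration.

```python
def parse_fastb(text):
    """Parse FASTB text into a list of track dicts (name, kind, attrs, data)."""
    tracks = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines carry no data
        if line[0] in ">%":
            fields = line[1:].split()
            if not fields:
                raise ValueError("defline without a track name: %r" % raw)
            attrs = {}
            for field in fields[1:]:
                # attributes look like /key=value
                if field.startswith("/") and "=" in field:
                    key, value = field[1:].split("=", 1)
                    attrs[key] = value
            kind = "discrete" if line[0] == ">" else "continuous"
            tracks.append({"name": fields[0], "kind": kind,
                           "attrs": attrs, "data": []})
        elif not tracks:
            raise ValueError("data found before the first defline")
        elif tracks[-1]["kind"] == "discrete":
            tracks[-1]["data"].extend(line)  # one symbol per position
        else:
            tracks[-1]["data"].append(float(line))  # one value per line
    return tracks


def check_fastb(tracks):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    if not tracks:
        problems.append("file contains no tracks")
        return problems
    lengths = {t["name"]: len(t["data"]) for t in tracks}
    if len(set(lengths.values())) > 1:
        problems.append("track lengths differ: %r" % lengths)
    if not any(t["kind"] == "continuous" for t in tracks):
        problems.append("no continuous track found")
    return problems


# Illustrative data only, not taken from a real genome.
example = """\
>dna /chr=X /begin=1 /end=4 /strand=+
ACGT
%conservation /chr=X /begin=1 /end=4 /strand=+
0.1
0.9
0.5
0.3
"""
print(check_fastb(parse_fastb(example)))  # prints []
```

Returning a list of problems rather than raising on the first one makes it easier to report every issue in a file at once.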

You will generally have multiple sequences you wish to analyze.  These must be in separate FASTB files, but they can be grouped into a single directory; most MUMMIE programs have a command-line option (usually -d) telling them to operate on all of the files in a directory rather than a single file.
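As an illustration of the one-file-per-sequence layout, the Python sketch below writes several sequences into a single directory.  It is not a MUMMIE tool: format_track and write_fastb_dir are hypothetical names, the ".fastb" file extension and the number formatting are assumptions, and the data values are made up.

```python
import tempfile
from pathlib import Path


def format_track(name, data, attrs=None):
    """Render one track; a str is written as alphabetic, anything else as continuous."""
    sigil = ">" if isinstance(data, str) else "%"
    defline = sigil + name + "".join(
        " /%s=%s" % (key, value) for key, value in (attrs or {}).items())
    # Alphabetic data goes on one line; continuous data is one value per line.
    body = [data] if isinstance(data, str) else [repr(float(v)) for v in data]
    return "\n".join([defline] + body)


def write_fastb_dir(sequences, out_dir):
    """Write one file per sequence; sequences maps a file stem to a list of
    (name, data, attrs) tracks.  Returns the paths written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for stem, tracks in sequences.items():
        # All tracks of one sequence must have the same length.
        if len({len(data) for _, data, _ in tracks}) != 1:
            raise ValueError("tracks of %r differ in length" % stem)
        path = out / (stem + ".fastb")  # assumption: .fastb extension
        text = "\n".join(format_track(n, d, a) for n, d, a in tracks) + "\n"
        path.write_text(text)
        paths.append(path)
    return paths


# Illustrative data only.
with tempfile.TemporaryDirectory() as tmp:
    written = write_fastb_dir({
        "seq1": [("dna", "ACGT", {"chr": "X"}),
                 ("conservation", [0.1, 0.9, 0.5, 0.3], {"chr": "X"})],
        "seq2": [("dna", "GG", None), ("conservation", [0.7, 0.2], None)],
    }, tmp)
    print([p.name for p in written])  # prints ['seq1.fastb', 'seq2.fastb']
```

The resulting directory is the kind of input that the directory option described above is meant to consume.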

In addition to the FASTB files for your multivariate sequences, there are a few configuration files that are required by some of the programs in MUMMIE.  The first is the schema file, which specifies the track names and their data types:

dna : discrete ACGT
conservation : continuous

The first field is the track name, which must match the identifier that occurs on the deflines in the FASTB files.  The second field is the data type (discrete or continuous).  Note that discrete really means alphabetic; if your data consists of integers, use continuous.  For alphabetic data, the keyword discrete must be followed by the alphabet (e.g., ACGT for DNA).  The MUMMIE script get-schema.pl is useful for automatically inferring a schema file from a directory of FASTB files.
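To make the schema format concrete, here is a small Python sketch that reads a schema of the form shown above and checks an alphabetic sequence against its declared alphabet.  It is independent of get-schema.pl and of MUMMIE itself: parse_schema and unexpected_symbols are hypothetical names, and the sketch assumes that the alphabet is written as a single token with one character per symbol.

```python
def parse_schema(text):
    """Map each track name to ("discrete", alphabet) or ("continuous", None)."""
    schema = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        name, sep, spec = line.partition(":")
        name, parts = name.strip(), spec.split()
        if not sep or not name or not parts:
            raise ValueError("malformed schema line: %r" % raw)
        if parts[0] == "discrete":
            if len(parts) != 2:
                raise ValueError("discrete track needs an alphabet: %r" % raw)
            schema[name] = ("discrete", parts[1])
        elif parts[0] == "continuous":
            if len(parts) != 1:
                raise ValueError("unexpected text after 'continuous': %r" % raw)
            schema[name] = ("continuous", None)
        else:
            raise ValueError("unknown data type: %r" % parts[0])
    return schema


def unexpected_symbols(schema, track_name, sequence):
    """Return the sorted symbols in sequence that the schema's alphabet lacks."""
    kind, alphabet = schema[track_name]
    if kind != "discrete":
        raise ValueError("%r is not a discrete track" % track_name)
    return sorted(set(sequence) - set(alphabet))


schema = parse_schema("dna : discrete ACGT\nconservation : continuous\n")
print(unexpected_symbols(schema, "dna", "ACGTNNAC"))  # prints ['N']
```

A check like this catches symbols such as N that appear in the data but were left out of the declared alphabet.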

Note that in the current version of MUMMIE all continuous tracks are modeled jointly via a mixture of multivariate Gaussian distributions.  Although the Gaussian mixture can approximate many other distributions well (given enough mixture components), it is often useful to transform distinctly non-Gaussian data (for example, by taking logarithms of strongly right-skewed values) so that its distribution is closer to Gaussian.
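One common way to make a right-skewed, non-negative track look more Gaussian is a log transform.  The Python sketch below is only an illustration of that idea, not something MUMMIE prescribes: log_transform and skewness are hypothetical names, the pseudocount of 1.0 is an arbitrary choice that keeps zeros finite, and the example values are made up.

```python
import math


def log_transform(values, pseudocount=1.0):
    """Return log(v + pseudocount) for each value; values must be non-negative."""
    if any(v < 0 for v in values):
        raise ValueError("log transform expects non-negative values")
    return [math.log(v + pseudocount) for v in values]


def skewness(values):
    """Population skewness (third standardized moment); 0 for symmetric data."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        return 0.0  # all values identical
    return m3 / m2 ** 1.5


# Illustrative right-skewed values with one large outlier.
raw = [0, 1, 1, 2, 3, 5, 8, 100]
transformed = log_transform(raw)
# The transformed skewness should be noticeably smaller than the raw one.
print(skewness(raw), skewness(transformed))
```

Whether a log, a square root, or some other transform is appropriate depends on the data; comparing a summary such as skewness before and after is one simple way to judge the effect.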

Once you've formatted your data into FASTB files, there are several programs in MUMMIE that you can use to inspect your data and check its integrity.  The script fastb-to-xgraph.pl renders a FASTB file graphically using the open-source software xgraph, which is included in the MUMMIE distribution.  Before running fastb-to-xgraph.pl, you must compile xgraph and ensure that it is in your shell's path.  Other useful commands are fastb-stats, subset-fastb, and smooth-fastb, which are documented in the "Commands" section of this manual.