RNA-seq Variant Prediction
RSVP - RNA-seq Variant Prediction
RSVP is a software package for prediction of alternative isoforms of protein-coding genes, based on both genomic DNA evidence and aligned RNA-seq reads. The method is based on the use of ORF graphs, which are more general than the splice graphs used in traditional transcript assembly.
A full description of the method is given in the accompanying publication:
Majoros H., Lebeck N., Ohler U., Li S. (2014) Improved transcript isoform discovery using ORF graphs. Bioinformatics doi:10.1093/bioinformatics/btu160.
The source code can be downloaded as a linux tarball here. Model files for A. thaliana can be downloaded here; model files for human can be downloaded here.
You can also download: Arabidopsis training data, Arabidopsis test data, Arabidopsis lincRNA analysis, human training data, human test data.
After downloading the tarball, execute the following command in UNIX:
tar xvfz rsvp.tgz
This will unpack the archive. Linux binaries are included; if these do not match your machine's architecture you will need to compile the source code. This can be accomplished by changing into the source directory (cd package/c++) and using the UNIX command: make all. This will compile the programs in the c++ directory. You must also compile the programs in the c++/genezilla subdirectory: simply cd into that subdirectory and again type make all.
Please also add the environment variable PERLLIB to your environment; it should be set to the subdirectory package/perl. In addition, set the GENEZILLA environment variable to point to package/c++/genezilla.
Instructions for Running
RSVP should be run on individual loci; it has not been tested on entire chromosome sequences. Thus, you should first segment the genome into transcribed loci, using either existing annotations or using some tool for segmenting RNA-seq pileup data into individual loci. Once you have that segmentation, you must generate a GFF file giving the coordinates of the segments; call this file genes.gff. You can then generate a "chunks" directory containing FASTA files of each of the genes using this script:
package/perl/make-test-set-modified.pl genes.gff genome.fasta 100000 50 TAG,TGA,TAA
The file genome.fasta should be a multi-FASTA file of chromosome sequences in which the first identifier on the defline is the chromosome name and matches the first field of the records in the genes.gff file. The above command will create a directory called chunks and populate it with FASTA files.
You will also need to align the RNA-seq reads to the genome; we recommend using Bowtie and TopHat 2. You will then need to generate pileup files, using SAMtools. This will generate a BAM file. You then need to run the following script to convert the BAM file:
samtools mpileup sorted.bam | simplify-pileup.pl | gzip > pileup.txt.gz
Given the pileup.txt.gz file, you can then generate individual pileup files for each chromosome:
package/c++/compress-pileup pileup.txt.gz pileups
TopHat also emits junctions into a file called juctions.bed; you can use the following script to convert that file into the format in which we need it:
python package/python/BED2Intron.py tophat/junctions.bed > junctions.bed.intron
Given this file, you can then extract chromosome-specific junction files:
package/c++/compress-junctions junctions.bed.intron pileups
Finally, you need to extract individual pileup and junction files for each "chunk":
package/perl/get-chunk-pileups.pl chunks pileups package
Now the chunks directory has everything needed for running RSVP. You can run RSVP on the whole set of chunks using this command:
package/perl/run-genezilla.pl chunks parms.iso 0.99 0.01 10 package 1 10 . 0
This requires the parameter file parms.iso; we provide a parameter file for Arabidopsis thaliana, but for other organisms you will have to retrain GeneZilla from scratch. The other parameters in the command above are: DNA weight, RNA weight, number of isoforms per gene to predict, the package directory, whether to enforce strict RNA support (0=false, 1=true), the queue size for the N-best algorithm, the maximum number of chunks to process (or dot for no maximum), and whether to infer N automatically.
The output will consist of individual *.nbest files, one for each gene; these are GFF files. The score field of the GFF file contains the exon score, which is normalized to a per-nucleotide value and is thus comparable between exons and genes. The above command also generates ORF graphs, but to save disk space these are automatically deleted after they are used for prediction; to keep the graphs, simply comment out the lines in the script that delete the graph files. The first graph file (*.graph) contains only DNA scores, and the second graph file (*.graph2) contains scores that combine DNA and RNA evidence.
For organisms other than Arabidopsis and human, you will need to retrain the underlying gene finder, Genezilla. Extensive documentation showing how to retrain GeneZilla can be found at www.genezilla.org.