PARalyzer v1.5
****************************************************************
IMPORTANT
To run PARalyzer v1.5 you must provide the memory parameter as a command-line argument. For example, to use 2GB of RAM:
   ./PARalyzer 2G default.ini

PARalyzer now accepts SAM/BAM files in addition to bowtie output. Either "SAM_FILE" or "BOWTIE_FILE" must be specified in the INI file.

For a SAM file:
SAM_FILE=#filepath/filename#=COLLAPSED
   use this form if reads were collapsed before alignment and you want to incorporate the read copy number
SAM_FILE=#filepath/filename#
   use this form if reads were not collapsed before alignment or you do not want to incorporate the read copy number
   #filepath/filename# = Location and name of a SAM file to be analyzed
   note: multiple SAM alignment files can be used as input; just create a new line with a new 'SAM_FILE=' parameter

PARalyzer v1.0
****************************************************************
Requirements:
   Required:
      4G RAM (for smaller datasets)
      JAVA version '1.6.0'
      UCSC .2bit version of the genome(s) against which the reads were aligned (http://genome.ucsc.edu/FAQ/FAQformat.html#format7)
         <- next build will do away with this requirement (but use more memory)
      Bowtie alignment file (http://bowtie-bio.sourceforge.net/news.shtml)
   Recommended:
      12G RAM for larger datasets
      FASTX toolkit (http://hannonlab.cshl.edu/fastx_toolkit/) <- recommended pre-processing package; created by the Hannon lab at CSHL
         note: for a speedier alignment and PARalyzer processing, use the fastx_collapser (see the '=COLLAPSED' option below if you use this pre-processing tool)
****************************************************************
****************************************************************
USAGE:
Running PARalyzer
   >PARalyzer sample.ini
Recommended BOWTIE parameters:
   >bowtie GENOME_INDEX -v 2 -m 10 --best --strata -f INPUT_FASTA_FILE OUTPUT_FILE
   note: if you use the multi-thread option (-p), you will need to re-sort the output so that mapped locations for the same read are on consecutive
   lines; using the -p option may result in a mixed output
****************************************************************
****************************************************************
SETTING UP THE .ini FILE

*****Required Options:
BANDWIDTH=#integer#
   #integer# = Size of bandwidth for KDE calculation (default 3)
CONVERSION=#character from#>#character to#
   #character from# = Character representing the modified ribonucleotide (default 'T')
   #character to# = Character representing what the modified ribonucleotide is read as by reverse transcriptase (default 'C')
   note: only 1 conversion is possible at this time; in the future we may implement the ability to have 2 modified ribonucleotides
MINIMUM_READ_COUNT_PER_GROUP=#integer#
   #integer# = Minimum number of reads required to call a group (default 10)
MINIMUM_READ_COUNT_PER_CLUSTER=#integer#
   #integer# = Minimum number of reads required to call a cluster (default 1)
MINIMUM_READ_COUNT_FOR_KDE=#integer#
   #integer# = Minimum read depth at a location to make a KDE estimate (default 1) => (recommended: 5)
MINIMUM_CLUSTER_SIZE=#integer#
   #integer# = Minimum length required for a cluster to be reported (default 1)
MINIMUM_CONVERSION_LOCATIONS_FOR_CLUSTER=#integer#
   #integer# = Minimum number of separate locations with a reported conversion for a cluster to be reported (default 1) => (recommended: 2)
   note: setting this to 0 will cause errors; if you are looking for sites that may have no conversions, I recommend analyzing the 'groups' output file (see below)
MINIMUM_CONVERSION_COUNT_FOR_CLUSTER=#integer#
   #integer# = Minimum number of conversion events within a region to report a cluster (default 1)
   note: setting this to 0 will cause errors; if you are looking for sites that may have no conversions, I recommend analyzing the 'groups' output file (see below)
MINIMUM_READ_COUNT_FOR_CLUSTER_INCLUSION=#integer#
   #integer# = Minimum read depth for a location to be included within a cluster (default 1)
MINIMUM_READ_LENGTH=#integer#
   #integer# = Minimum length of mapped read to be included in the analysis (default 1)
MAXIMUM_NUMBER_OF_NON_CONVERSION_MISMATCHES=#integer#
   #integer# = Maximum number of non-conversion mismatches allowed for a mapped read to be included in the analysis (default 5)
BOWTIE_FILE=#filepath/filename#
   #filepath/filename# = Location and name of a bowtie output file to be analyzed
   note: multiple BOWTIE alignment files can be used as input; just create a new line with a new 'BOWTIE_FILE=' parameter
GENOME_2BIT_FILE=#filepath/filename#
   #filepath/filename# = Location of the UCSC .2bit file of the genome against which the reads were aligned
OUTPUT_CLUSTERS_FILE=#filepath/filename#
   #filepath/filename# = Location and name of the resulting clusters file

*****1 of the following*****
EXTEND_BY_READ
HAFNER_APPROACH
ADDITIONAL_NUCLEOTIDES_BEYOND_SIGNAL=#integer#
****************************
EXTEND_BY_READ
   Including this line means that the cluster will be extended beyond the signal to the end of any read that falls within the cluster and contains a conversion, or until the minimum read depth (MINIMUM_READ_COUNT_FOR_CLUSTER_INCLUSION parameter) is no longer met
HAFNER_APPROACH
   Identifies the location with the largest number of conversion events and extends the cluster up to (parameter ADDITIONAL_NUCLEOTIDES_BEYOND_SIGNAL) nt in each direction from that point, or until the minimum read depth (MINIMUM_READ_COUNT_FOR_CLUSTER_INCLUSION parameter) is no longer met
ADDITIONAL_NUCLEOTIDES_BEYOND_SIGNAL=#integer#
   #integer# = The maximum number of nucleotides to extend beyond the positive signal in each direction (default 0)
   The cluster is defined as the region where the conversion KDE is above the background KDE, and is then extended by up to #integer# nucleotides, or until the minimum read depth (MINIMUM_READ_COUNT_FOR_CLUSTER_INCLUSION parameter) is no longer met

*****Other Options:
FILTER_FILE=#filepath/filename#=#flag#
   #filepath/filename# = Location of the UCSC .bed file
      (http://genome.ucsc.edu/FAQ/FAQformat.html#format1) containing genomic coordinates for regions you would like to filter
   #flag# = Text that will be added to clusters / groups if they overlap one of these regions
   note: multiple filter files may be used; just add additional lines. If multiple filter files are used and a cluster overlaps multiple regions, only one will be reported
OUTPUT_GROUPS_FILE=#filepath/filename#
   #filepath/filename# = Location and name of the resulting groups file; contains the information of the groups prior to cluster generation
OUTPUT_DISTRIBUTIONS_FILE=#filepath/filename#
   #filepath/filename# = Location and name of the resulting distributions file; contains the signal KDE, background KDE, read count & conversion % for all locations within each group
BOWTIE_FILE=#filepath/filename#=COLLAPSED
   #filepath/filename# = Location and name of a bowtie output file to be analyzed
   note: adding the '=COLLAPSED' flag means that the FASTA file aligned to the genome(s) was first collapsed by the 'fastx_collapser' program from the FASTX toolkit
SPECIAL_CHROMOSOME=#chromosome#=#filepath/filename#
   note: this is to be used if not all of the chromosomes that you aligned to are in the Genome.2bit file
   #chromosome# = chromosome name (e.g. chrX)
   #filepath/filename# = Location of the UCSC .2bit file that contains this particular chromosome
FIND_MIRNA_SEEDMATCHES=#filepath/filename#
   #filepath/filename# = Location of the file that contains mature miRNA names and sequences
MAXIMUM_SEED_MATCH_LENGTH=#integer#
   #integer# = Maximum length of seed match; must be greater than or equal to 6 (not recommended to go above 12)
   note: this will search all clusters for sites that match all seeds of #integer#-1m/A through 6mer
   note: this still needs more work to include all different seed-match types
OUTPUT_MIRNA_TARGETS_FILE=#filepath/filename#
   #filepath/filename# = Location and filename of a file displaying all miRNA-cluster targets
****************************************************************
Understanding the output files:

OUTPUT_CLUSTERS_FILE=#filepath/filename#
   This will generate a comma separated file containing the information about the resulting clusters
      Chromosome = chromosome on which the cluster resides
      Strand = orientation in which the cluster resides
      ClusterStart = beginning coordinate on the chromosome of the cluster
      ClusterEnd = ending coordinate on the chromosome of the cluster
      ClusterID = unique ID for the cluster
      ClusterSequence = sequence of the cluster
      ReadCount = number of reads that overlap the cluster by at least 1 nucleotide
      ModeLocation = coordinate of the location with the highest signal / (signal + background) value
      ModeScore = score of the highest signal / (signal + background) value
      ConversionLocationCount = number of unique locations where at least 1 conversion occurred
      ConversionEventCount = total number of conversions that occurred within the cluster
      NonConversionEventCount = total number of possible conversion events that did not occur
      FilterType = if a FILTER_FILE parameter was used, the #flag# text of the region within that file that the cluster overlaps; if it did not map to any FILTER_FILE locations, 'NA' is reported

OUTPUT_GROUPS_FILE=#filepath/filename#
   This will generate a comma
   separated file containing the information about the resulting groups
      Chromosome = chromosome on which the group resides
      Strand = orientation in which the group resides
      GroupStart = beginning coordinate on the chromosome of the group
      GroupEnd = ending coordinate on the chromosome of the group
      GroupID = unique ID for the group
      ReadCount = number of reads within the group
      FilterType = if a FILTER_FILE parameter was used, the #flag# text of the region within that file that the group overlaps; if it did not map to any FILTER_FILE locations, 'NA' is reported

OUTPUT_DISTRIBUTIONS_FILE=#filepath/filename#
   This contains the signal KDE, background KDE, read count & conversion % for all locations within each group
   note: the data will be in blocks of four lines for each group
   note: groups on the reverse strand do not need to be reversed; the values always correspond to nucleotides from GroupStart to GroupEnd, regardless of Strand
      First Column = Chromosome = chromosome on which the group resides
      Second Column = Strand = orientation in which the group resides
      Third Column = GroupStart = beginning coordinate on the chromosome of the group
      Fourth Column = GroupEnd = ending coordinate on the chromosome of the group
      Fifth Column = GroupID = unique ID for the group
      Sixth Column = Information = reports if the current line contains the Signal, Background, Conversion Percent, or ReadCount
         note: all nucleotides that do not have any possibility of having a conversion event are given a value of -1
      All Subsequent Columns = the values for each nucleotide from GroupStart until GroupEnd

OUTPUT_MIRNA_TARGETS_FILE=#filepath/filename#
   This is a comma separated file that contains all found seed-matches for miRNAs within the resulting clusters
      Chromosome = chromosome on which the seed-match resides
      Strand = orientation in which the seed-match resides
      SiteStart = beginning coordinate on the chromosome of the seed-match
      SiteEnd = ending coordinate on the chromosome of the seed-match
      SeedSequence = seed-match
      sequence
      SeedType = seed-match type (e.g. 7mer1A)
      ClusterID = unique ID for the cluster in which the seed-match resides
      ClusterSequence = sequence of the cluster in which the seed-match resides
      miRNAs = all miRNA names that matched with the given seed-match
      FilterType = if a FILTER_FILE parameter was used, the #flag# text of the region within that file that the seed-match overlaps; if it did not map to any FILTER_FILE locations, 'NA' is reported
****************************************************************
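
EXAMPLE .ini FILE (illustrative sketch)
The fragment below shows how the required options described above might be combined into one file. Numeric values are the defaults or recommended values listed in this README. All file paths and file names are hypothetical placeholders and must be replaced with your own. The fragment contains no comment lines because this README does not document a comment syntax for the .ini file. EXTEND_BY_READ is chosen here only as one of the three cluster-extension options.

```
BANDWIDTH=3
CONVERSION=T>C
MINIMUM_READ_COUNT_PER_GROUP=10
MINIMUM_READ_COUNT_PER_CLUSTER=1
MINIMUM_READ_COUNT_FOR_KDE=5
MINIMUM_CLUSTER_SIZE=1
MINIMUM_CONVERSION_LOCATIONS_FOR_CLUSTER=2
MINIMUM_CONVERSION_COUNT_FOR_CLUSTER=1
MINIMUM_READ_COUNT_FOR_CLUSTER_INCLUSION=1
MINIMUM_READ_LENGTH=1
MAXIMUM_NUMBER_OF_NON_CONVERSION_MISMATCHES=5
BOWTIE_FILE=/path/to/sample.bowtie=COLLAPSED
GENOME_2BIT_FILE=/path/to/genome.2bit
OUTPUT_CLUSTERS_FILE=/path/to/sample.clusters.csv
OUTPUT_GROUPS_FILE=/path/to/sample.groups.csv
OUTPUT_DISTRIBUTIONS_FILE=/path/to/sample.distributions.csv
EXTEND_BY_READ
```

With PARalyzer v1.5 such a file would be passed after the memory parameter, for example: ./PARalyzer 2G sample.ini
****************************************************************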
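
READING THE DISTRIBUTIONS FILE (illustrative sketch)
The column layout described for OUTPUT_DISTRIBUTIONS_FILE can be read with a few lines of Python. This is a minimal sketch and not part of PARalyzer. It assumes the file is comma separated like the other output files (the README does not state this explicitly), and it skips any line whose third column is not an integer in case a header row is present. The function name read_distributions and the sample data are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def read_distributions(handle, delimiter=","):
    """Collect the per-nucleotide rows of a distributions file by GroupID.

    Assumption: fields are comma separated (not confirmed by the README).
    """
    groups = defaultdict(dict)
    for row in csv.reader(handle, delimiter=delimiter):
        # Skip blank or truncated lines and a possible header row.
        if len(row) < 7 or not row[2].isdigit():
            continue
        chrom, strand, start, end, group_id, info = row[:6]
        entry = groups[group_id]
        entry["chromosome"] = chrom
        entry["strand"] = strand
        entry["start"] = int(start)
        entry["end"] = int(end)
        # One list per Information label found in the sixth column.
        # A value of -1 marks a nucleotide where no conversion is possible.
        entry.setdefault("tracks", {})[info] = [float(v) for v in row[6:]]
    return dict(groups)


# Made-up two-line excerpt, for illustration only.
example = io.StringIO(
    "chr1,+,100,102,G1,Signal,0.5,-1,0.25\n"
    "chr1,+,100,102,G1,Background,0.1,-1,0.05\n"
)
parsed = read_distributions(example)
print(parsed["G1"]["tracks"]["Signal"])
```

Because values always run from GroupStart to GroupEnd regardless of Strand, the i-th value in each list corresponds to coordinate GroupStart + i.
****************************************************************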