Additional Supplementary Material
A Transcription Factor Affinity Based Code for Mammalian Transcription Initiation
All Single-Peak locations (TSSs from the set of 2399 CAGE Tag Clusters from which all training and test sets were derived):
FASTA file (gzipped) containing the sequence from 5 kb upstream to 5 kb downstream of each TSS
Training Sets (CAGE Tag Cluster IDs):
Test Sets (CAGE Tag Cluster IDs):
Each archive directory (tar-gzip) contains three subdirectories, TSS_set, IGC_set, and CDS_set, holding the positive data, the negative intergenic data, and the negative CDS data, respectively. Each subdirectory contains 10 FASTA files representing the 10 parts. The sequence for each example spans (-250, +50) relative to the example location. Note that because TSS examples and their corresponding upstream intergenic examples must be extracted from mm5 (the genome build of the original CAGE Tag mappings), sequences will occasionally contain "N" characters at nucleotides that had not yet been identified in that build.
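As a rough illustration of working with these files, the following Python sketch parses FASTA records, counts the "N" characters in a sequence, and cuts a (-250, +50) window out of a longer sequence. The function names and the indexing convention (a 300-nucleotide window covering the 250 bases before a 0-based TSS index and the 50 bases starting at it) are assumptions made for this example; the exact convention used to build the archives is not stated here.

```python
from typing import Iterable, Iterator, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def count_unknown(seq: str) -> int:
    """Count unidentified nucleotides ("N" or "n") in a sequence."""
    return seq.upper().count("N")


def window(seq: str, tss_index: int, upstream: int = 250, downstream: int = 50) -> str:
    """Return seq[tss_index - upstream : tss_index + downstream].

    Assumed convention (hypothetical): tss_index is 0-based and the window
    holds `upstream` bases before it plus `downstream` bases starting at it.
    """
    start = tss_index - upstream
    end = tss_index + downstream
    if start < 0 or end > len(seq):
        raise ValueError("window extends beyond the sequence")
    return seq[start:end]
```

A gzipped FASTA file can be passed to `parse_fasta` by opening it in text mode with `gzip.open(path, "rt")` from the standard library. Under the same assumed convention, a 10 kb sequence centered on a TSS would use `tss_index=5000`.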
A descriptor file containing the Tag Cluster IDs, the genomic location of the highest TSS in each cluster, and other detailed information can be downloaded at http://fantom31p.gsc.riken.jp/cage_analysis/export/mm5/tss_summary.tsv.bz2.
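Once downloaded, a bzip2-compressed, tab-separated file of this kind can be inspected with the Python standard library alone. The sketch below is only an illustration: the helper name `read_tsv_bz2` is hypothetical, and no assumption is made about the column names or whether the first row is a header, since the layout of the descriptor file is not described here.

```python
import bz2
import csv
from typing import List


def read_tsv_bz2(path: str, max_rows: int = 5) -> List[List[str]]:
    """Return up to max_rows rows from a bzip2-compressed TSV file.

    Each row is a list of strings; the column layout is not interpreted.
    """
    rows: List[List[str]] = []
    with bz2.open(path, mode="rt", newline="") as handle:
        # QUOTE_NONE treats quote characters as ordinary field content.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(rows) >= max_rows:
                break
            rows.append(row)
    return rows
```

For example, `read_tsv_bz2("tss_summary.tsv.bz2")` would return the first few rows of a local copy of the file, which is usually enough to see which columns hold the Tag Cluster IDs and genomic locations.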
UCSC Custom Tracks for all Annotation-Supported Model Test Set Scans:
UCSC Custom Tracks for all pri-miRNA Scans:
UCSC Custom Tracks for non-genic miRNA regions scans from the Marson Data Set (Marson et al., 2008):
Clicking on the UCSC Custom Track listed in each category will automatically open the track for viewing in a new UCSC Genome Browser window. Genome build and other browser settings are appropriately pre-selected when each link is clicked (mm5 for Test Set Scans, hg18/mm9 for pri-miRNA Scans, mm8/mm9 for Marson Data Set miRNA Region Scans).