Computational Regulatory Genomics

OHLER LABORATORY

Are you sure you want to delete: << Reproducible footprinting >> ?
<< YES >> << NO >>

Reproducible footprinting

Requirements

R version '2.15.' or higher
R libraries ‘mixtools’, ‘gtools’ and ‘GenomicRanges’
Perl
SAMtools version 1.1
BEDTools version 2.17.0

Download

Linux users can download the pipeline from here.

Installation

Extract the contents of the downloaded archive:
tar -xvzf footprint-pipeline-1.0.1.tar.gz

Change into the pipeline's directory and configure:
cd footprint-pipeline-1.0.1
./configure --prefix=/tmp/my-test

Note: If the necessary executables are not in the path, or their versions are incompatible with the pipeline (see the requirements section above), the full paths of the correct executables can be specified for the configuration. See below an example containing all BEDTools executables used by the pipeline, alongside SAMtools:
./configure COVERAGEBED=/path/to/coverageBed FASTAFROMBED=/path/to/fastaFromBed BAMTOBED=/path/to/bamToBed BEDTOBAM=/path/to/bedToBam BEDTOOLS=/path/to/bedtools SAMTOOLS=/path/to/samtools --prefix=/tmp/my-test

Make, run the test and install:
make
make check
make install

Note: This will install the pipeline to /tmp/my-test directory. This can be changed by supplying the desired directory to the --prefix argument in the configuration step.

Usage

The main executable find_footprints.sh can be found in /bin, under the specified installation directory.
Running "bash find_footprints.sh" will list the expected inputs.

To run the pipeline, find_footprints.sh is executed in the following way:

bash find_footprints.sh bam_file chrom_sizes motif_coords genome_fasta factor_name bias_file peak_file no_of_components background fixed_bg

The arguments are explained below:

    bam_file:
        The bam file from the ATAC-seq or DNase-seq experiment.

    chrom_sizes:
        A tab delimited file with 2 columns, where the first column is
        the chromosome name and the second column is the chromosome
        length for the appropriate organism and genome build.

    motif_coords:
        A 6-column file with the coordinates of motif matches (eg
        resulting from scanning the genome with a PWM) for the
        transcription factor of interest. The 6 columns should contain
        chromosome, start coordinate, end coordinate, name, score and
        strand information in this order. There should not be any additional
        columns.The coordinates should be closed (1-based).

    genome_fasta:
        A multi-fasta format file that contains the whole genome
        sequence for the appropriate organism and genome build. This
        file should be indexed (eg by using samtools faidx) and placed
        in the same directory.

    factor_name:
        The name of the transcription factor of interest supplied by
        the user. This is used in the naming of the output files.

    bias_file:
        A file listing the cleavage/transposition bias of the
        different protocols, for all 6-mers. Provided options: ATAC,
        DNase double hit or DNase single hit protocols.

    peak_file:
        A file with the coordinates of the ChIP-seq peaks for the
        transcription factor of interest. The format is flexible as
        long as the first 3 columns (chromosome, start coordinate, end
        coordinate) are present.

    no_of_components:
        Total number of footprint and background components that
        should be learned from the data. Options are 2 (1 fp and 1 bg)
        and 3 (2 fp and 1 bg) components.

    background:
        The mode of initialization for the background
        component. Options are "Flat" or "Seq". Choosing "Flat"
        initializes this component as a uniform distribution. Choosing
        "Seq" initializes it as the signal profile that would be
        expected solely due to the protocol bias (given by the
        bias_file parameter).

     fixed_bg:
        Whether the background component should be kept fixed. Options
        are TRUE/T or FALSE/F. Setting "TRUE" keeps this component
        fixed, whereas setting "FALSE" lets it be reestimated during
        training. In general, if the background is estimated from bias
        (option "Seq"), it is recommended to keep it fixed.


For instance to run the pipeline on the test data with 2 components, estimating the background from the protocol bias and keeping it fixed, execute the following command:

bash find_footprints.sh test_data/ATAC_HEK293_hg19_chr1.bam test_data/hg19.chr1.chrom.size test_data/CTCF_motifs_hg19_chr1.bed test_data/hg19_chr1.fa CTCF SeqBias_ATAC.txt test_data/CTCF_HEK293_chip_hg19_chr1.bed 2 Seq T

Important note: Please make sure that the "bam_file" and the "motif_coords" inputs are the only files in their respective directories. This is recommended to avoid overwriting any existing files, as the pipeline generates temporary files in these directories.

The pipeline outputs 4 files (in the same directory as the input "bam_file"):

    File with the ".RESULTS" extension:
        Gives the results of the footprinting analysis. The first 6
        columns harbor the motif information (identical to the motif_coords
        input file). The 7th column has the footprint score (log-odds of
        footprint versus background) for each motif instance. The following
        columns show the probabilities for the individual footprint and
        background components.

    File with the ".PARAM" extension:
        Gives the trained parameters for the footprint and background
        components. It includes as many lines as components (eg the first
        line has the parameters for the first component).

    Png file named "plot1":
        A plot with two panels, showing the initial components above
        and the final trained components below. The plotted values for the
        final components are given in the .PARAM output file explained above.

    Png file named "plot2":
        A plot only with the final trained components. In a model where 2
        components are used, this plot is identical to the bottom panel in
        plot1. When 3 components are used, this plot shows the weighted average
        of the 2 footprint components as the final footprint profile.