Computational Regulatory Genomics

Alignment-free orthologous enhancer detection

An alignment-free method to identify candidate orthologous enhancers in multiple Drosophila genomes


Evolutionarily conserved non-coding genomic sequences represent a potentially rich source for the discovery of gene regulatory regions such as transcriptional enhancers. However, detecting orthologous enhancers using alignment-based methods in higher eukaryotic genomes is particularly challenging, as regulatory regions can undergo considerable sequence changes while maintaining their functionality. We have developed an alignment-free method which identifies conserved enhancers in multiple diverged species. Our method uses similarity metrics between two sequences that are derived from the co-occurrence of sequence patterns regardless of their order and orientation, thus tolerating sequence changes observed in non-coding evolution. We show that our method is highly successful in detecting orthologous enhancers in distantly related species without requiring additional information such as knowledge about the transcription factors involved or predicted binding sites. By estimating the significance of similarity scores, we are able to discriminate experimentally validated functional enhancers from seemingly equally conserved candidates without function. We demonstrate the effectiveness of this approach on a wide range of enhancers in Drosophila, and also present encouraging results for detecting conserved functional regions across large evolutionary distances. Our work provides encouraging steps on the way to ab initio unbiased enhancer prediction to complement ongoing experimental efforts.

M. Arunachalam, K. Jayasurya, P. Tomancak, U. Ohler (2010). An alignment-free method to identify candidate orthologous enhancers in multiple Drosophila genomes. Bioinformatics.


Given a known enhancer, this method searches for the orthologous enhancer region in a related species.

Input sequences are read from a FASTA file, and repeats and low-complexity regions in the control sequences are masked. The method identifies conserved enhancers in related species without relying on sequence alignment. All possible patterns in a given window are enumerated and evaluated for their contribution to the overall similarity measure. The number of occurrences of each pattern in both sequences determines its contribution to the total similarity between them [1].
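
The pattern-counting idea can be illustrated with a minimal Python sketch. It is not the distributed program and does not implement the metrics of [1]: the shared-count score below is a simplified stand-in, and the function names and the choice of fixed-length patterns of length k are assumptions made for illustration. Each pattern is counted together with its reverse complement, so the score ignores both order and orientation, and patterns that overlap masked positions (any character other than A, C, G, T) are skipped.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_counts(seq, k):
    """Count orientation-independent k-mers, skipping masked positions."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):  # ignore k-mers containing N or other symbols
            counts[canonical(kmer)] += 1
    return counts


def shared_count_similarity(seq_a, seq_b, k=3):
    """Illustrative score in [0, 1]: shared pattern occurrences over the smaller total."""
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    shared = sum(min(counts_a[w], counts_b[w]) for w in counts_a.keys() & counts_b.keys())
    denom = min(sum(counts_a.values()), sum(counts_b.values()))
    return shared / denom if denom else 0.0
```

With this sketch, a sequence and its reverse complement receive the maximum score, because both contain the same set of orientation-independent patterns.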

The known enhancer region is scanned against the control region in a pairwise sliding-window fashion on both strands. A mixed metric score is computed for each window, which yields a similarity profile. The window with the global maximum mixed metric score is selected, and the consecutive windows that exceed the threshold value are merged. The merged region that exceeds the threshold is reported as a potential orthologous enhancer region in the related species.
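
The scanning and merging steps can be sketched as follows. This is one reading of the description above, not the distributed program: the score is a simple shared-count stand-in rather than the mixed metric, the window width is assumed to equal the enhancer length, and the names scan, step and threshold are hypothetical. Because patterns are counted together with their reverse complements, a single pass over the control sequence covers both strands.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def kmer_counts(seq, k):
    """Count k-mers, merging each with its reverse complement; skip masked positions."""
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[min(kmer, kmer.translate(_COMPLEMENT)[::-1])] += 1
    return counts


def similarity(counts_a, counts_b):
    """Stand-in score (NOT the mixed metric): shared occurrences over the smaller total."""
    shared = sum(min(counts_a[w], counts_b[w]) for w in counts_a.keys() & counts_b.keys())
    denom = min(sum(counts_a.values()), sum(counts_b.values()))
    return shared / denom if denom else 0.0


def scan(enhancer, control, k=3, step=1, threshold=0.8):
    """Return (profile, region); region is a (start, end) slice of control or None."""
    enhancer, control = enhancer.upper(), control.upper()
    width = len(enhancer)  # assumption: window width equals the enhancer length
    reference = kmer_counts(enhancer, k)
    starts = list(range(0, len(control) - width + 1, step))
    if not starts:
        return [], None
    profile = [similarity(reference, kmer_counts(control[s:s + width], k)) for s in starts]

    # Start from the window with the global maximum score.
    best = max(range(len(profile)), key=profile.__getitem__)
    if profile[best] < threshold:
        return profile, None

    # Merge neighbouring windows on either side while they stay above the threshold.
    lo = hi = best
    while lo > 0 and profile[lo - 1] >= threshold:
        lo -= 1
    while hi < len(profile) - 1 and profile[hi + 1] >= threshold:
        hi += 1
    return profile, (starts[lo], starts[hi] + width)
```

As a usage example, scanning a known enhancer embedded in an unrelated flanking sequence should return the coordinates of the embedded copy, while a control region with no similar window returns no region.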

The method was evaluated on several different data sets, which demonstrated the flexibility of the alignment-free approach [2] [3] [4] [5].



Please refer to the provided README for detailed instructions on how to run the program.



Dataset: This includes the program, an example input dataset and the results.

[1] van Helden, J., Metrics for comparing regulatory sequences on the basis of pattern counts. Bioinformatics 2004, 20:399-406.

[2] Berman, B. P. et al., Computational identification of developmental enhancers: conservation and function of transcription factor binding site clusters in Drosophila melanogaster and Drosophila pseudoobscura. Genome Biol 2004, 5:R61.

[3] Papatsenko, D. and Levine, M., Quantitative analysis of binding motifs mediating diverse spatial readouts of the Dorsal gradient in the Drosophila embryo. PNAS 2005, 102:4966-4971.

[4] Gallo, S. M., Li, L., Hu, Z. and Halfon, M. S., REDfly: a Regulatory Element Database for Drosophila. Bioinformatics 2006, 22:381-383.

[5] Hare, E. E. et al., Sepsid even-skipped enhancers are functionally conserved in Drosophila despite lack of sequence conservation. PLoS Genet 2008, 4:e1000106.
