Information about the dataset used throughout the "Automated annotation of gene expression image sequences via nonparametric factor analysis and conditional random fields" paper:

To evaluate the performance of the sparse Bayesian Factor Analysis (sBFA) - Conditional Random Field (CRF) framework, we used the estimated sparse loadings/features only for the set of genes common to all five stage ranges of interest (4-6, 7-8, 9-10, 11-12, 13-16), together with a repertoire of annotation terms from a controlled vocabulary. The most popular annotation terms were selected independently for each stage range so as to cover approximately 85% of the entire set of genes. This resulted in a set of 1,807 images from the BDGP high-throughput in situ hybridization dataset (Tomancak et al., 2002) and 48 annotation terms.
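
The selection described above can be illustrated with a minimal sketch. It is not the code used in the paper: the function names, the toy gene and term labels, and the rule of adding terms from most to least frequent until the coverage target is reached are all assumptions made for illustration.

```python
from collections import Counter

def common_genes(genes_by_stage):
    """Return the genes present in every stage range (hypothetical helper)."""
    return set.intersection(*(set(genes) for genes in genes_by_stage.values()))

def popular_terms(annotations, coverage=0.85):
    """Pick terms for ONE stage range, most frequent first, until at least
    `coverage` of the genes carry one selected term (assumed rule)."""
    counts = Counter(t for terms in annotations.values() for t in set(terms))
    selected, covered = [], set()
    for term, _ in counts.most_common():
        if len(covered) >= coverage * len(annotations):
            break
        selected.append(term)
        covered |= {g for g, terms in annotations.items() if term in terms}
    return selected

# Toy data, invented for this example only.
genes_by_stage = {"4-6": ["a", "b", "c"], "7-8": ["b", "c", "d"]}
shared = common_genes(genes_by_stage)

stage_annotations = {
    "g1": ["maternal"],
    "g2": ["maternal", "ubiquitous"],
    "g3": ["maternal"],
    "g4": ["rare"],
}
terms = popular_terms(stage_annotations, coverage=0.75)
```

In this toy case the single most frequent term already annotates three of the four genes, so the 75% target is met after one term.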

Prior to this selection, we separated informative images from the non-informative ones using Euclidean distances between estimated sparse factor analysis weights and a null vector as reference (Pruteanu-Malinici et al., 2011).
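
As a rough illustration of this filtering step, the sketch below computes the Euclidean distance of each weight vector from the all-zero vector and keeps the rows whose distance exceeds a cutoff. The one-row-per-image layout, the function name, and the threshold value are assumptions; the criterion actually used is described in Pruteanu-Malinici et al. (2011).

```python
import numpy as np

def informative_mask(weights, threshold):
    """Flag rows whose Euclidean distance from the zero vector exceeds
    `threshold` (hypothetical cutoff, chosen by the user)."""
    weights = np.asarray(weights, dtype=float)
    # The distance to a null reference vector is simply the L2 norm of each row.
    distances = np.linalg.norm(weights, axis=1)
    return distances > threshold

# Toy weights, one row per image (invented values).
weights = np.array([
    [0.0, 0.0, 0.0],   # identical to the null vector
    [3.0, 4.0, 0.0],   # distance 5
    [0.1, 0.0, 0.1],   # close to the null vector
])
mask = informative_mask(weights, threshold=1.0)
```

Only the second toy row lies farther than the cutoff from the null vector, so it is the only one flagged as informative.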

All images were scaled to a resolution of 240 x 120 pixels, each containing a single embryo and no background. To reduce the dimensionality of each analyzed image (and correspondingly the number of features), we defined a grid of fixed size (e.g., 80 x 40 patches) and calculated the mean pixel value within each patch; all mean values were then stacked into a single feature vector.
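
The patch-averaging step can be sketched as follows. This assumes a grayscale image stored as a 120-row by 240-column array and reads "80 x 40 patches" as the number of grid cells (80 across, 40 down), which gives 3 x 3 pixel patches and 3,200 features; both the array layout and this reading of the grid size are assumptions, and the function name is hypothetical.

```python
import numpy as np

def patch_mean_features(image, grid_rows, grid_cols):
    """Average pixel values inside each cell of a regular grid and
    stack the means, row by row, into one feature vector."""
    image = np.asarray(image, dtype=float)
    height, width = image.shape
    if height % grid_rows or width % grid_cols:
        raise ValueError("image size must be divisible by the grid size")
    patch_h = height // grid_rows
    patch_w = width // grid_cols
    # Split rows and columns into (cell index, offset within cell) axes.
    patches = image.reshape(grid_rows, patch_h, grid_cols, patch_w)
    # Average over the within-cell offsets, then flatten.
    return patches.mean(axis=(1, 3)).ravel()

# Synthetic 240 x 120 (width x height) image, invented for the example.
image = np.arange(120 * 240, dtype=float).reshape(120, 240)
features = patch_mean_features(image, grid_rows=40, grid_cols=80)
```

With these settings each image is reduced from 28,800 pixels to a vector of 3,200 patch means.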


Pruteanu-Malinici, I. et al. (2011) Automatic annotation of spatial expression patterns via sparse Bayesian factor models. PLoS Computational Biology, 7:e1002098.

Tomancak, P. et al. (2002) Systematic determination of patterns of gene expression during Drosophila embryogenesis. Genome Biology, 3(12).