Information about the dataset used throughout the "Automatic annotation of spatial expression patterns via sparse Bayesian factor models" paper:

The model is demonstrated on a subset of p = 1,231 images from the BDGP high-throughput in situ hybridization dataset (Tomancak et al., 2002), covering 196 genes imaged during developmental stages 4-6, using the segmented and registered images described in Mace et al. (2009). Images in this stage window have previously been annotated with the view from which they were taken (lateral, dorsal/ventral), information that is not as extensively provided for other stages. All images in this set were taken from a lateral view, and the genes are annotated with 34 unique non-trivial terms describing their spatial expression patterns (i.e., excluding no expression and ubiquitous expression); each annotation term is thus associated with one or more genes.

Tar file of the 1,231 images from the BDGP dataset (this tar file is 987 MB)

All images were scaled to a resolution of 240 x 120 pixels; each contains a single embryo and no background. To reduce the domain (and correspondingly the number of features) of each analyzed image, we defined a grid of fixed size (e.g., 80 x 40 patches) and calculated the mean pixel value within each patch; all mean values were then stacked into a single feature vector.
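The patch-averaging step described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the code used in the paper: the function name `patch_means` is hypothetical, the image is assumed to be a single-channel array stored as 120 rows by 240 columns, the grid is assumed to divide the image evenly, and the row-major stacking order of the patch means is an assumption (the text does not specify it). A random array stands in for a real embryo image.

```python
import numpy as np


def patch_means(image, grid_w, grid_h):
    """Return the mean pixel value of each patch as a flat feature vector.

    `image` is a 2D array of shape (height, width). The grid of
    grid_w x grid_h patches is assumed to divide the image evenly.
    """
    height, width = image.shape
    if height % grid_h or width % grid_w:
        raise ValueError("grid must divide the image evenly")
    patch_h, patch_w = height // grid_h, width // grid_w
    # Split rows and columns into (patch index, offset within patch),
    # then average over the two within-patch axes.
    blocks = image.reshape(grid_h, patch_h, grid_w, patch_w)
    # Row-major stacking of the patch means (an assumed ordering).
    return blocks.mean(axis=(1, 3)).ravel()


# Synthetic stand-in for one 240 x 120 image (not real data).
rng = np.random.default_rng(0)
image = rng.random((120, 240))

features = patch_means(image, grid_w=80, grid_h=40)
print(features.shape)  # (3200,)
```

With an 80 x 40 grid each patch covers 3 x 3 pixels, so a 240 x 120 image is reduced from 28,800 pixels to 3,200 features.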

In the work presented here, we considered three different grid sizes (which translates into three different image resolutions): 80 x 40, 60 x 30 and 48 x 24.
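As a quick check of what these three grid sizes imply, the following sketch computes the patch size in pixels and the resulting feature-vector length for each grid, assuming the 240 x 120 pixel images described above. The names `GRIDS` and `grid_summary` are illustrative only.

```python
IMAGE_W, IMAGE_H = 240, 120  # image resolution stated in the text
GRIDS = [(80, 40), (60, 30), (48, 24)]  # grid sizes stated in the text


def grid_summary(grid_w, grid_h):
    """Patch size in pixels and number of features for one grid."""
    return {
        "patch": (IMAGE_W // grid_w, IMAGE_H // grid_h),
        "features": grid_w * grid_h,
    }


summaries = {grid: grid_summary(*grid) for grid in GRIDS}
for (grid_w, grid_h), info in summaries.items():
    patch_w, patch_h = info["patch"]
    print(f"{grid_w} x {grid_h} grid: {patch_w} x {patch_h} pixel patches, "
          f"{info['features']} features")
# 80 x 40 grid: 3 x 3 pixel patches, 3200 features
# 60 x 30 grid: 4 x 4 pixel patches, 1800 features
# 48 x 24 grid: 5 x 5 pixel patches, 1152 features
```

All three grids divide the image evenly, so every patch within a grid has the same size.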


Mace, D. L. et al. (2009) Extraction and comparison of gene expression patterns from 2D RNA in situ hybridization images. Bioinformatics 2009.

Tomancak, P. et al. (2002) Systematic determination of patterns of gene expression during Drosophila embryogenesis. Genome Biology 2002, 3(12).