To evaluate the performance of the combined sparse Bayesian Factor Analysis (sBFA) and Conditional Random Field (CRF) framework, we restricted the estimated sparse loadings/features to the set of genes common to all five stage ranges of interest (4-6, 7-8, 9-10, 11-12, and 13-16) and used a repertoire of annotation terms from a controlled vocabulary. The most popular annotation terms were selected independently for each stage range so as to cover approximately 85% of the entire set of genes. This resulted in a set of p = 1,807 images from the BDGP high-throughput in situ hybridization dataset (Tomancak et al., 2002) and p = 48 annotation terms.
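One way to read the term-selection step is as a popularity-ordered pass that stops once the target fraction of genes is covered. The sketch below illustrates that reading only; the function name `select_popular_terms`, the dictionary input format, and the stopping rule are assumptions made for illustration and are not taken from the original pipeline.

```python
from collections import Counter

def select_popular_terms(gene_terms, target_coverage=0.85):
    """Pick terms in decreasing popularity until enough genes are covered.

    gene_terms: dict mapping a gene name to the set of annotation terms
    assigned to it for one stage range (hypothetical input format).
    Returns the selected terms and the fraction of genes they cover.
    """
    n_genes = len(gene_terms)
    if n_genes == 0:
        return [], 0.0

    # Popularity of a term = number of genes annotated with it.
    counts = Counter(t for terms in gene_terms.values() for t in terms)

    selected, covered = [], set()
    for term, _ in counts.most_common():
        if len(covered) / n_genes >= target_coverage:
            break
        selected.append(term)
        covered.update(g for g, terms in gene_terms.items() if term in terms)
    return selected, len(covered) / n_genes

# Toy example for a single stage range (made-up gene and term names).
toy = {"g1": {"a"}, "g2": {"a", "b"}, "g3": {"a"}, "g4": {"b"}}
selected, coverage = select_popular_terms(toy, target_coverage=0.85)
```

Under this reading, the function would be called once per stage range, since the terms are selected independently for each range.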
Prior to this selection, we separated informative images from non-informative ones using the Euclidean distances between the estimated sparse factor analysis weights and a null vector used as reference (Pruteanu-Malinici et al., 2011).
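The distance-based filtering can be sketched as follows. This is a minimal illustration that assumes the null vector is the all-zero vector, that each row of the weight matrix belongs to one image, and that a fixed cut-off separates the two groups; the name `informative_mask` and the `threshold` parameter are hypothetical, and the criterion actually used is described in Pruteanu-Malinici et al. (2011).

```python
import numpy as np

def informative_mask(weights, threshold):
    """Flag images whose factor weights lie far from the null vector.

    weights: array of shape (n_images, n_factors), one row per image
    (assumed layout). threshold: hypothetical distance cut-off.
    """
    weights = np.asarray(weights, dtype=float)
    # Euclidean distance to the all-zero vector is the row-wise L2 norm.
    distances = np.linalg.norm(weights, axis=1)
    return distances > threshold, distances

# Toy weights for three images (made-up values).
weights = np.array([[0.0, 0.0, 0.0],
                    [3.0, 4.0, 0.0],
                    [0.1, 0.0, 0.0]])
mask, distances = informative_mask(weights, threshold=1.0)
```

In this toy case only the second image, at distance 5 from the null vector, would be kept as informative.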
All images were scaled to a resolution of 240 x 120 pixels, and each contained a single embryo with no background. To reduce the domain of each analyzed image (and, correspondingly, the number of features), we defined a grid of fixed size (e.g., 80 x 40 patches) and calculated the mean pixel value within each patch; the mean values were then stacked into a single feature vector.
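The patch-averaging step can be illustrated with a short NumPy sketch. It assumes a grayscale image stored as a 120 x 240 array (rows by columns), a grid of 80 columns by 40 rows, and image dimensions that divide evenly by the grid; the function name `patch_means` is hypothetical.

```python
import numpy as np

def patch_means(image, grid_cols=80, grid_rows=40):
    """Average pixel values over a fixed grid and stack them into a vector."""
    image = np.asarray(image, dtype=float)
    height, width = image.shape
    if height % grid_rows or width % grid_cols:
        raise ValueError("image size must be divisible by the grid size")
    patch_h = height // grid_rows   # 3 pixels for a 120-row image
    patch_w = width // grid_cols    # 3 pixels for a 240-column image
    # Split rows and columns into (patch index, offset within patch),
    # then average over the two within-patch axes.
    blocks = image.reshape(grid_rows, patch_h, grid_cols, patch_w)
    return blocks.mean(axis=(1, 3)).ravel()

# Synthetic 120 x 240 image used only to show the shapes involved.
image = np.arange(120 * 240, dtype=float).reshape(120, 240)
features = patch_means(image)
```

With these settings each patch spans 3 x 3 pixels, so the 28,800 pixels of an image are reduced to a feature vector of length 3,200.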