Figure 4: Learning a dictionary of cellular expression patterns. (a) Thresholds, drawn as vertical lines, partition the dataset into high-signal and low-signal subpopulations (L1 and L2, respectively). (b) Linear approximations of the L1 (high-signal) and L2 (low-signal) data matrices by the overcomplete dictionaries D and the sparse coding matrices W. The product D × W is a reconstruction of the dataset X. The rows of X and D correspond to estrogen receptor, progesterone receptor, and human epidermal growth factor 2 biomarker intensities, as labeled. The columns of X and W correspond to individual cells. The columns of D correspond to the unique dictionary elements, and the rows of W correspond to their weights. (c) Each cell is phenotyped to a single pattern in dictionary D. A three-dimensional representation of the L1 matrix is shown, where each cell is color-coded by its phenotype. (d) Subspace selection of the overcomplete dictionaries D, for L1 and L2, leads to a dictionary of 11 patterns for each subpopulation. (e) Each pattern in the dictionary is shown as a colored stem plot and refers to (from left to right) the estrogen receptor, human epidermal growth factor 2, and progesterone receptor intensity levels. It is convenient to describe these intensities as high, medium, and low, as we do in the main text. For example, the cyan-colored pattern 2 in the L1 dictionary (left), which accounts for the cyan-colored cloud in panel c, may be described as estrogen receptor high, human epidermal growth factor 2 high, and progesterone receptor low. Next, using k-means clustering, we consolidate the L1 and L2 dictionaries into a final dictionary set of size 8. To denote the outcome of k-means clustering, we draw a colored box around each pattern in the L1 and L2 dictionaries, corresponding to the eight consolidated clusters, and show the mean patterns of the consolidated dictionary to the right.
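
The roles of D, W, and the consolidation step described in this caption can be illustrated with a minimal sketch. It is not the implementation used in the study: all numbers are invented toy values, D and W are given rather than learned, the rule that a cell takes the pattern with the largest absolute weight is an assumption (the caption states only that each cell is assigned a single pattern), and the helper names `matmul`, `phenotype`, and `kmeans` are hypothetical. The toy example consolidates four patterns into two clusters rather than 22 into 8.

```python
# Toy illustration only: every number below is made up, not data from the study.

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def phenotype(W):
    """Assumed rule: each cell (column of W) takes the pattern with the largest |weight|."""
    return [max(range(len(col)), key=lambda k: abs(col[k])) for col in zip(*W)]

def nearest(point, centers):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centers[c])))

def kmeans(points, centers, n_iter=20):
    """Plain Lloyd's algorithm; `centers` holds the initial centroids."""
    for _ in range(n_iter):
        labels = [nearest(pt, centers) for pt in points]
        updated = []
        for c in range(len(centers)):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            # Move the centroid to the mean of its members; keep it if the cluster is empty.
            updated.append([sum(v) / len(members) for v in zip(*members)]
                           if members else centers[c])
        centers = updated
    return [nearest(pt, centers) for pt in points], centers

# D: rows are biomarkers (ER, PR, HER2), columns are dictionary patterns.
D = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 0.0]]
# W: rows are pattern weights, columns are individual cells.
W = [[0.9, 0.1, 0.0],
     [0.0, 0.8, 1.0]]

X = matmul(D, W)            # reconstruction of the data matrix, X = D x W
cell_labels = phenotype(W)  # one pattern index per cell

# Consolidation: pool the patterns of two toy dictionaries and cluster them.
l1_patterns = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
l2_patterns = [[0.8, 0.0, 1.0], [0.0, 0.8, 0.2]]
pooled = l1_patterns + l2_patterns
labels, centers = kmeans(pooled, centers=[pooled[0], pooled[1]])

print(cell_labels)  # pattern assigned to each cell
print(labels)       # consolidated cluster of each pooled pattern
print(centers)      # mean pattern of each consolidated cluster
```

The cluster means returned by `kmeans` play the role of the mean patterns of the consolidated dictionary shown to the right in panel e.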
