ORIGINAL ARTICLE Year : 2013  Volume : 4  Issue : 2  Page : 11 Histological stain evaluation for machine learning applications Jimmy C Azar^{1}, Christer Busch^{2}, Ingrid B Carlbom^{1}, ^{1} Department of Information Technology, Centre for Image Analysis, Uppsala University, Uppsala, Sweden ^{2} Department of Pathology and Cytology, University Hospital, Uppsala, Sweden Correspondence Address: Aims: A methodology for quantitative comparison of histological stains based on their classification and clustering performance, which may facilitate the choice of histological stains for automatic pattern and image analysis. Background: Machine learning and image analysis are becoming increasingly important in pathology applications for automatic analysis of histological tissue samples. Pathologists rely on multiple, contrasting stains to analyze tissue samples, but histological stains are developed for visual analysis and are not always ideal for automatic analysis. Materials and Methods: Thirteen different histological stains were used to stain adjacent prostate tissue sections from radical prostatectomies. We evaluate the stains for both supervised and unsupervised classification of stain/tissue combinations. For supervised classification we measure the error rate of nonlinear support vector machines, and for unsupervised classification we use the Rand index and the Fmeasure to assess the clustering results of a Gaussian mixture model based on expectationmaximization. Finally, we investigate class separability measures based on scatter criteria. Results: A methodology for quantitative evaluation of histological stains in terms of their classification and clustering efficacy that aims at improving segmentation and color decomposition. We demonstrate that for a specific tissue type, certain stains perform consistently better than others according to objective error criteria. Conclusions: The choice of histological stain for automatic analysis must be based on its classification and clustering performance, which are indicators of the performance of automatic segmentation of tissue into morphological components, which in turn may be the basis for diagnosis.
Introduction Pathologists rely on histological stains to identify morphological changes in tissue that are linked to cancer and other diseases. Different stain combinations target different types of tissue. With the most common stain, hematoxylin/eosin, the hematoxylin stains the cell nuclei blue, and the counterstain, eosin, stains the cytoplasm pink and stromal components in various grades of red/pink. This and most other stains are developed for visual examination of the tissue under a microscope, and thus they are often not optimal for automated pattern and image analysis. However automated and computerassisted analysis of histopathology specimens is gaining importance since the introduction of high throughput imaging systems. As a first step in the automated analysis of tissue samples, morphological components, such as nuclei, stroma, and cytoplasm, often need to be identified and segmented from the surrounding tissue for further analysis. This identification relies on the absorption of histological stains by the tissue components, and therefore the choice of stain has a significant effect on the outcome of the segmentation. Color image analysis for histologically stained tissue is further complicated by both inter and intraspecimen intensity variations due to variations in the tissue preparation process and overlap in the stain absorption spectra. [1],[2] Quantitative evaluation of the efficacy of a particular stain for automatic color image analysis relies upon the evaluation of the stained tissue in a color space where the distance between two points represents the chromaticity difference between the corresponding colors, reducing the dependence on the intensity variations that may arise from tissue preparation and image acquisition. In such a color space, the tissue image data forms distinct clusters, one cluster for each stain/tissue combination in the image. With an ideal stain, these clusters are compact and distant from each other, enabling a machine learning algorithm to separate the clusters and yield an accurate color segmentation. We explore a methodology for selecting optimal histological stains for automation in terms of both supervised and unsupervised classification performances. We use three types of quantitative efficacy measures: The classification error for supervised classification, the Fmeasure and the Rand index for clustering performance, and the Fisher criterion and Mahalanobis distance for cluster separability. We demonstrate these measures on 13 different stains applied to prostate tissue. Although these stains are not all typically used for prostate cancer malignancy grading, they are included for illustration purposes. We conclude that regardless of the classification method, the same stains perform the best for this type of tissue; for other types of tissues a different set of stains would undoubtedly perform better. There are numerous methods in the literature dealing with color decomposition, or deconvolution. [1],[2],[3],[4],[5] However they do not attempt to compare the different stains from a classification perspective. Our aim is to demonstrate a methodology for selecting an optimal stain for color decomposition. To our knowledge, there has not been a similar attempt at comparing histological stains from a machine learning point of view. Materials and Methods Our data comprises adjacent sections of prostate tissue from radical prostatectomies prepared by the Department of Pathology, University Hospital, Uppsala, Sweden. The sections were stained with 13 different histological stains and scanned using the Aperio ScanScope XT Slide Scanner (Aperio Technologies, Vista, CA) with a ×40 objective. The histological stains were hematoxylin/eosin (H&E), Mallory's trichrome (MALLORY), Miller's elastic (MILLERSE), Van Gieson (VG), Van Gieson elastic (VGE), Giemsaeosin stain (GIEMSA), Grocott's methenamine silver stain (GROCOTT), hematoxylinHerovici polychrome (HERO), Weigert's elastic (WEIGERT), Cytokeratin immunohistochemistry (CYTK), Periodic acidSchiff (PAS), Ladewig (LADEWIG), and Siriushematoxylin (SIR+), an experimental stain combination we have tested. For information on histological stains, the reader is referred to Bancroft et al.[6] Images of tissue samples stained with hematoxylineosin, Mallory's trichrome, and Siriushematoxylin are shown in [Figure 1].{Figure 1} Our methodology for comparing different stains is based on assessing the performance of both supervised and unsupervised classifications on the given dataset. In our case, we are interested in three stain/tissue combinations, or classes, namely nuclei, stroma, and cytoplasm. For the supervised case, we require a training set constituting the ground truth as seen by a pathologist, and this was obtained by the manual selection of 30 different regions in each of the stained 640 × 816 pixel images and by the labeling of each region according to its class. The regions were selected completely within the target, with 1000 pixels selected per class. We train and optimize a support vector classifier over the data from each stained image and report the classification error using tenfold crossvalidation. We chose the support vector classifier with a radial basis function, as it provides a significant range of complexity that may be controlled by optimizing the kernel and regularization parameters based on the crossvalidated classification error. This allows the classifier to generalize well and avoid overfitting. For the unsupervised case, we assess the clustering performance of the Gaussian mixture model by comparing the cluster labels with those of the ground truth by using precision and recall. We use the Gaussian mixture model, as it is not constrained by the assumption of spherical clusters as are the kmeans and fuzzy cmeans. For cluster separability we use two measures, the Fisher Criterion and the Mahalanobis distance. Results Feature Space To eliminate dependency on inter and intrastain variations, which is essential for robust classification, we adopt the Maxwellian plane as our feature space. We first transform the image RGB data to a linear model using the BeerLambert Law. [7] Next we transform the resulting color data to the Maxwellian plane [8] at a distance of 1/√3 from the origin, using the perspective transformation centered at the origin: [9],[2] [INLINE:1] This decouples the intensity from the color information, thereby eliminating the dependence on intensity. As a result, the distance between two points in the Maxwellian plane corresponds to the chromaticity difference between the corresponding colors. [Figure 2]a shows the scatter plots for the color data in a section stained with MILLERSE in the RGB space, the BeerLambert space, and the Maxwellian chromaticity plane. It is evident that there is considerable overlap between the color regions corresponding to the cytoplasm and the nuclei in the Maxwellian plane. This intrinsic class overlap limits the performance of the classification. In [Figure 2]b, the corresponding scatter plots for CYTK show that the classes are well separated in the Maxwellian plane and consequently the classification should yield a much better result.{Figure 2} Supervised Classification To assess classification performance, a support vector machine [10],[11] was utilized for each stain. Given an image with a particular stain, let denote the set of vectors representing the pixel locations in the image (i.e., patterns) where n is the total number of pixels. In the Maxwellian plane, each such vector x i is in ℝ2 . For every pair of classes, ω1 , ω2 , a general support vector classifier is defined as f(x) = w Tɸ(x)+ w0 , where x is a pattern to be classified, w, w 0 are the weights defining the classifier, and ɸ(.) is a mapping of the patterns into kernel space. We have used a radial basis function (RBF) defined by: [INLINE:2] for the support vector classifier, where γ is a parameter that controls the width of the Gaussian kernel. The classifier may be optimized with regard to γ using tenfold crossvalidation; however, the classifier parameters in this case consist not only of γ, but also of a regularization term C, which may be chosen freely (please refer to the Appendix for details). In order to optimize the classifier with regard to both these parameters simultaneously, an exhaustive grid search algorithm [12] ensures arriving at the optimal combination of the (C, γ) values. An example of such an optimization is shown in [Figure 3] in the form of a contour plot in which the minimum corresponds to the optimal combination.{Figure 3} For each stain, we construct a support vector classifier using an RBF kernel that is then optimized as described above, with γ ranging from 2−15 to 2 5 and the regularization parameter C ranging from 2−5 till 2 9 , to ensure coverage of a wide range of simpletocomplex boundaries. [12] We then perform tenfold crossvalidation on each stain using a support vector classifier optimized for that stain. The final result is shown in [Figure 4], where the abscissa is ordered according to increasing classification error. We conclude that with respect to supervised classification, the stains MALLORY, CYTK, HERO, GIEMSA, and SIR+ show a considerably better performance for prostate tissue than the other stains. This is attributed to a significant class overlap for the other stains. A stain combination that stains several different structures in the tissue with the same color component is at a disadvantage, as this may lead to a high class overlap and low classification rates.{Figure 4} Unsupervised Classification Blind classification is sometimes preferred over supervised classification because it avoids manual labeling. Assuming that the class labels are unknown, we assess the clustering performance by the Gaussian mixture model. [13],[14] For each class, ω j, the probability density function is modeled by a Gaussian component completely described by its mean μ j and covariance matrix ∑j: [INLINE:3] where θj is the set of parameters for the distribution of class ω j, θ j {μ j, ∑ j} that is, and d is the number of features, that is, the α and β coordinates in the Maxwellian plane. Thus, the complete dataset may be described as a mixture of these components given by: [INSIDE:1] where K is the number of Gaussian components, or clusters, Ψ is the total set of parameters for the mixture distribution, and α j is the prior probability associated with the jth Gaussian component, and [INSIDE:2]. The expectationmaximization algorithm is used to maximize the loglikelihood, Q, by observing the data given by the Gaussian model. During the expectation step (or Estep) of the algorithm, the cluster membership of each object, w ji, is computed using the Bayes' theorem. This is followed by an update of the mixture prior probabilities, α j, and the cluster means and covariance matrices, ω j, θ j {μj ∑j}j= 1,…,K during the maximization step (or Mstep). The EM steps are repeated in succession until convergence, that is, until the change in the loglikelihood function Q is below a tolerance of 10−10 (see Appendix for details). We initialize the expectationmaximization algorithm with repeated random initializations to obtain the first estimates for the mixture density parameters and select the result with the highest loglikelihood. The results of the Gaussian mixture model clustering for the MALLORY, GIEMSA, SIR+, and CYTK stains, on the prostate tissue samples in [Figure 1], are shown in [Figure 5]. The highlighted contours in the figures correspond to a level 25% below the maximum value of the Gaussian.{Figure 5} To assess clustering performance objectively, we use the class labels from the previous supervised classifications and compute two cluster validation measures, the Rand index [15] and the Fmeasure [16],[17] , which are computed pairwise over all the objects in the dataset: [INLINE:4], where TP = True Positives, TP = True Negatives, FP = False Positives, and FN = False Negatives, and [INLINE:5], where P and R are precision and recall, defined as: [INLINE:6] We chose β = 1 to give comparable weights to precision and recall, and the measure is then denoted as an F 1 measure. The result of this analysis is shown in [Figure 6]. Both performance measures indicate that the five best stains in the supervised case are also more favorable here.{Figure 6} Class Separability Measures In order to assess the separability of the classes in terms of how compact and distant they are from each other, we have utilized a Fisherbased criterion [18] and a sum of the squared Mahalanobis distance. [19] We define the Fisher criterion as trace[INSIDE:3] where SW and SB are the within and betweenscatter matrices: [INLINE:7] with μ the mean of the entire dataset, and ∑i and μi, the covariance matrix and mean of class ωi , respectively, ni the number of patterns in class ωi , and n the total number of patterns in the dataset. On the other hand, the Mahalanobis distance defined between two classes with means μ1, μ2 and covariance matrices ∑1, ∑2 is given by: [INLINE:8] If the classes are compact and wellseparated, then they should have a relatively high Mahalanobis distance. We compute the squared Mahalanobis distance for every pair of classes and sum up these values. The result is shown in [Table 1], where we see that the optimal stains according to these measures are Mallory's trichrome and Cytokeratin. However, we also see that some of the stains, such as the Miller's elastic, rank well despite their poor classification rate, and this is attributed to the fact that the scatterbased criteria may not always predict the performance of a classifier correctly. A simple example that demonstrates this occurs when two classes are compact, but completely overlapping, and the third class is compact and extremely distant from the other two classes. This third class would influence the result positively, although two of the classes are completely overlapping, and will lead to a low classification rate. One can notice the effect a distant class may have on the Mahalanobis distance by observing the scatter plot of the Cytokeratin stain in the Maxwellian plane [Figure 5], which has led to the high value of the Mahalanobis distance for CYTK in [Table 1].{Table 1} Discussion We have presented a methodology for quantitative comparison of histological stains based on their classification and clustering performance. Among the stains we have investigated, using prostate tissue samples, the classification results suggest that MALLORY, CYTK, HERO, GIEMSA, and SIR+ have an advantage over the other stains, with supervised classification error rates below 11%, and below 5% for MALLORY and CYTK. Furthermore, blind classification was assessed using the Gaussian mixture model and the Rand and F1 measures. Results show that MALLORY and SIR+ outperform other stains in terms of clustering accuracy, with Rand indices above 0.85. We also saw that simple criteria, such as the Fisher criterion and the Mahalanobis distance, give a good ranking of stains, but may in some cases be misleading. The ideal stain is one where the reference colors are compact and separate, with a good classification performance. Furthermore, with an ideal stain, all tissue types absorb a similar amount of stain, without oversaturation. For the purpose of highthroughput imaging, the choice of a histological stain for a given application must be motivated by the objective criteria that optimize the automation process by facilitating machine learning and image analysis techniques. Acknowledgment The authors wish to thank Ulrika A. Larsson at the Uppsala University Hospital for preparing the tissue samples. This research was supported by Vetenskapsrådet under grant 20095418. Appendix Support Vector Machines The correct classification of all objects is assumed if the following conditions hold for i = 1, ..., n: [INLINE:9] where yi is the class label for object xi, ξi and is the slack variable corresponding to object xi. These conditions may be rewritten in compact form as [INLINE:10] The support vector classifier aims at maximizing the margin between the classes, which is equivalent to minimizing the norm of the weights. In addition, the slack variables are added to the quantity to be minimized, while regulated by a cost (or regularization) parameter C to allow for class overlap. This minimization is subject to constraints, namely, the patterns are classified correctly, as described above. Thus the classifier minimizes the following criterion: [INLINE:11] This may be formulated using Lagrange multipliers, whereby the following Lagrangian form is obtained: [INLINE:12] where αi is the Lagrange multiplier associated with the constraint equation corresponding to pattern x i, and r i is the Lagrange multiplier associated with slack variable ξi and ensures that ξi remains positive. By differentiating L with respect to the weights and the slack variables, setting the result to zero, and then solving, we obtain the following: [INLINE:13] Note that the last result C  αi r i = 0 may be combined with r i ≥ 0 to give 0 ≤ αi ≤ C. By substituting these results back into the Lagrangian, we obtain the dual form of the Lagrangian: [INLINE:14] Quadratic programming is then employed to solve for the Lagrange multipliers, αi , i = 1,..., n. Once these are obtained, w is computed from [INSIDE:4], where SV is the set of support vectors (objects associated with αi ≠ 0). Thereafter, the offset w 0 is obtained from: [INSIDE:5] by averaging over objects with 0 < αi < C, that is, ξi = 0. The support vector classifier is then derived as: [INLINE:15] The scalar product ɸT (xl)ɸ (x) is substituted by a kernel function K(xl, x) = (ɸT (xl)ɸ(x), without the need to explicitly define the transformation function or mapping ɸ(.) Gaussian Mixture Model The likelihood of observing the data given by the model for a single component is given by: [INLINE:16] If we assume that the class labels are known and are given by: [INLINE:17] then the likelihood function may be rewritten as: [INLINE:18] The Gaussian mixture model employs the expectationmaximization algorithm to maximize this likelihood or rather loglikelihood function [INLINE:19] However, since zji is not known, the following function is maximized instead: [INLINE:20] During the Estep, the cluster membership of each object, w ji at any given iteration m becomes: [INLINE:21] During the Mstep of the algorithm the distribution parameters are updated: [INLINE:22] These two steps are repeated iteratively until convergence. References


