|J Pathol Inform 2015,
HPASubC: A suite of tools for user subclassification of human protein atlas tissue images
Toby C Cornish1, Aravinda Chakravarti2, Ashish Kapoor2, Marc K Halushka1
1 Department of Pathology, Johns Hopkins University, Baltimore MD, USA
2 Department of Medicine, Institute for Genetic Medicine, Johns Hopkins University, Baltimore MD, USA
|Date of Submission||13-Jan-2015|
|Date of Acceptance||04-May-2015|
|Date of Web Publication||23-Jun-2015|
Toby C Cornish
Department of Pathology, Johns Hopkins University, Baltimore MD
Source of Support: None, Conflict of Interest: None
| Abstract|| |
Background: The human protein atlas (HPA) is a powerful proteomic tool for visualizing the distribution of protein expression across most human tissues and many common malignancies. The HPA includes immunohistochemically-stained images from tissue microarrays (TMAs) that cover 48 tissue types and 20 common malignancies. The TMA data are used to provide expression information at the tissue, cellular, and occasionally, subcellular level. The HPA also provides subcellular data from confocal immunofluorescence imaging of three cell lines. Despite the availability of localization data, many unique patterns of cellular and subcellular expression are not documented. Materials and Methods: To capture these more granular data, we have developed a suite of Python scripts, HPASubC, to aid in subcellular and cell-type-specific classification of HPA images. This method allows the user to download and optimize specific HPA TMA images for review. Then, using a PlayStation-style video game controller, a trained observer can rapidly step through tens of thousands of images to identify patterns of interest. Results: We have successfully used this method to identify 703 endothelial cell (EC)- and/or smooth muscle cell (SMC)-specific proteins within 49,200 heart TMA images. This list will assist us in subdividing cardiac gene or protein array data into expression by one of the predominant cell types of the myocardium: Myocytes, SMCs or ECs. Conclusions: The opportunity to further characterize unique staining patterns across a range of human tissues and malignancies will accelerate our understanding of disease processes and point to novel markers for tissue evaluation in surgical pathology.
Keywords: Biomarker, heart, human protein atlas, subcellular localization, tissue microarray
|How to cite this article:|
Cornish TC, Chakravarti A, Kapoor A, Halushka MK. HPASubC: A suite of tools for user subclassification of human protein atlas tissue images. J Pathol Inform 2015;6:36
|How to cite this URL:|
Cornish TC, Chakravarti A, Kapoor A, Halushka MK. HPASubC: A suite of tools for user subclassification of human protein atlas tissue images. J Pathol Inform [serial online] 2015 [cited 2020 Jun 5];6:36. Available from: http://www.jpathinformatics.org/text.asp?2015/6/1/36/159213
| Introduction|| |
We are in the golden era of classifying the expression of human genes, miRNAs and proteins across all human tissues in a high-throughput fashion. However, methods that homogenize tissues to obtain these results fail to prove that a gene, miRNA or protein will be found within an expected cell type, let alone in a particular subcellular organelle. Thus, a key strength of a tissue immunohistochemical (IHC) or immunofluorescence (IF) approach is that it visualizes the location of a protein based on its staining pattern. Until recently, these were low-throughput methods, with most studies limited to evaluating a protein's location in, at most, a small number of tissues. This has changed with the development of the human protein atlas (HPA).
The HPA is a comprehensive proteomic database for visualizing the distribution of protein expression across most human tissues and many common malignancies. As of version 12 of the HPA, >21,900 antibodies against >16,600 human genes have been evaluated and tested across 48 normal and 20 malignant human tissues. The data were generated across numerous tissue microarrays (TMAs) using antibodies validated by Western blot analysis. Each TMA core image was subsequently digitized and made available to the community through the HPA website. Thus, the location of each protein across a variety of tissues can now be identified, with replicate experiments providing some evidence for the specificity of that localization.
The HPA provides annotations of the staining patterns across its TMA images. For many tissue types, the staining in specific cells is characterized separately (e.g., pneumocytes and macrophages in lung, or glomeruli and tubules in kidney), together with a general intensity (low, medium or strong) and a general subcellular localization (cytoplasmic, membranous or nuclear). In addition, confocal IF has been performed fully on three cell lines (A-431, U-2 OS, and U-251 MG) and, since 2012, on an additional 15 cell lines. The HPA provides IF staining for the protein of interest, microtubules, endoplasmic reticulum, and the nucleus to allow additional subcellular localization. While the information provided by the HPA can be very useful, the granularity of the localization information is uneven across tissues and cell types. Many researchers will require a finer characterization of protein localization than has been provided by existing HPA annotations.
There are many novel questions one can answer about the nature of protein expression using the rich HPA image repository. For example, at the cellular level, one can characterize all of the proteins that variably stain across different tubules of the kidney nephron (e.g., LGALS3, SMAD4) or demonstrate gradient expression along a maturing colonic crypt (e.g., MKI67, GULP1). One can also evaluate nondominant cell type staining in organs to tease out the expression in minor cell types (e.g., ACTA2 in the liver). For many cell types, more staining patterns exist than the three annotated categories of cytoplasmic, membranous and nuclear. For example, in the cardiac myocyte, just by randomly visualizing multiple proteins, we identified a total of seven unique subcellular staining patterns: Intercalated disc, cytoplasm, cytoplasmic membrane, nuclear membrane, nucleus, organelle, and t-tubule. Although this more granular information is obtainable from the HPA image collection, new tools are needed to rapidly review these images and subclassify the staining patterns in the HPA repository.
To further enhance the utility of the HPA, we developed a tool (HPASubC) to subclassify HPA TMA cores into distinct staining patterns. This software tool is unlike other described HPA-based systems that use machine learning and pattern recognition on the confocal cultured cell IF staining to determine subcellular localization. Rather, it allows a trained pathologist or other interested user to quickly scan through tens of thousands of tissue images to characterize any arbitrary staining pattern they choose to investigate. We demonstrate this utility by the example of identifying all proteins that stain smooth muscle cells (SMCs) and endothelial cells (ECs) but not cardiac myocytes in heart TMA images. We also demonstrate how these data can be used with publicly available heart tissue proteomic and genomic expression data to subset these non-cell-specific lists.
| Materials and Methods|| |
HPASubC consists of a suite of tools implemented as Python scripts. The scripts target Python 2.7 and depend on the standard library and several third-party modules. Mechanize-0.2.5 (http://pypi.python.org/pypi/mechanize/) is used for stateful interaction with the HPA website. Pygame (http://www.pygame.org/) is used for the graphical user interface (GUI) to display images and to manage user input, including game controller input. BeautifulSoup 4.3.2 ( http://www.crummy.com/software/BeautifulSoup/ ) is used to parse HTML pages retrieved from the HPA website. Pyexiv2 (http://tilloy.net/dev/pyexiv2/) is used to read and write the Exif metadata in the downloaded images. HPASubC uses the standard Exif UserComment tag to store JSON-encoded metadata (such as Ensembl gene [ENSG ID], tissue type, and antibody) directly in the downloaded HPA image files. The HPASubC tools are freely available and can be downloaded from http://github.com/cornish/HPASubC.
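The JSON-in-UserComment approach can be illustrated with a minimal sketch. It shows only the encoding and decoding of the metadata string, using the standard library; writing that string into the Exif tag itself would be done with Pyexiv2 and is not shown. The field names (`ensg_id`, `tissue`, `antibody`) and the identifier values are illustrative assumptions, not the keys or data that HPASubC actually uses, and the code is written for a current Python 3 rather than the Python 2.7 the scripts target.

```python
import json


def encode_metadata(ensg_id, tissue, antibody):
    """Serialize image context as a JSON string suitable for a comment tag.

    The key names here are hypothetical; HPASubC may use different ones.
    """
    record = {"ensg_id": ensg_id, "tissue": tissue, "antibody": antibody}
    return json.dumps(record, sort_keys=True)


def decode_metadata(user_comment):
    """Recover the metadata dictionary from a stored comment string."""
    return json.loads(user_comment)


# Placeholder identifiers, not real HPA records.
comment = encode_metadata("ENSG00000000001", "heart muscle", "HPA000001")
restored = decode_metadata(comment)
```

Keeping the context inside the image file means an image can be renamed or moved between folders during review without losing track of which gene, tissue, and antibody it came from.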
General HPASubC Usage
Our procedure begins by downloading the normal_tissue.csv.zip file, which lists all proteins evaluated by the HPA [Figure 1]. This file is available at http://www.proteinatlas.org/about/download. From this file, a list of unique ENSG IDs is selected and used with the download_images_from_gene_list.py script to download tissue type-specific images from the HPA and generate a table of image identifiers. The download script writes ENSG, tissue, and antibody data to the JPEG's Exif UserComment tag to maintain the context of the downloaded images. A Sony PlayStation-style USB controller (Logitech Precision) is used with the image_viewer.py script to scroll through the images, zoom images as needed, and select images with the appropriate staining patterns [Figure 2]. The selected images are then optionally scrolled through again and scored with the image_scorer.py script using score values of 0-6 that can be based on any user-defined parameter, such as stain intensity. Additional HPA metadata such as protein names, expression across different tissues, and Entrez data about a protein of interest can be obtained using the download_protein_data_from_gene_list.py script. Detailed instructions for running the scripts are included in the package's README.TXT file.
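The first step of this procedure, reducing the HPA record list to unique ENSG IDs, can be sketched as follows. This is an illustration under stated assumptions rather than the HPASubC implementation: the column names `Gene` and `Tissue`, the sample rows, and the function name `unique_gene_ids` are all hypothetical, and the real normal_tissue.csv layout should be checked before use.

```python
import csv
import io

# Tiny in-memory stand-in for the unzipped CSV; rows are invented placeholders.
SAMPLE_CSV = """Gene,Gene name,Tissue,Cell type,Level
ENSG00000000001,GENE1,heart muscle,myocytes,High
ENSG00000000001,GENE1,lung,pneumocytes,Low
ENSG00000000002,GENE2,heart muscle,myocytes,Medium
ENSG00000000002,GENE2,heart muscle,myocytes,Medium
ENSG00000000003,GENE3,kidney,cells in tubules,High
"""


def unique_gene_ids(csv_file, tissue=None, gene_column="Gene", tissue_column="Tissue"):
    """Return unique gene IDs in first-seen order, optionally filtered by tissue."""
    seen = set()
    ordered = []
    for row in csv.DictReader(csv_file):
        if tissue is not None and row[tissue_column] != tissue:
            continue
        gene_id = row[gene_column]
        if gene_id not in seen:
            seen.add(gene_id)
            ordered.append(gene_id)
    return ordered


all_ids = unique_gene_ids(io.StringIO(SAMPLE_CSV))
heart_ids = unique_gene_ids(io.StringIO(SAMPLE_CSV), tissue="heart muscle")
```

With a real file, the same function could be passed an open file handle in place of the `io.StringIO` object.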
|Figure 1: Schematic and use of HPASubC. All human protein atlas (HPA) records are obtained and used to identify Ensembl gene IDs of immunostained proteins. These IDs are parsed out to capture and download tissue microarray images of interest from a particular tissue source. Images are resized for optimal viewing on an available computer monitor. Images are scrolled through using a game controller, where the blue buttons toggle between images and the red "fire" button selects an image with appropriate staining. Collected images can then be scored for intensity or other classifications, and HPA metadata can be obtained. In our analysis, we started with over 1.1 million records that resulted in >49,200 heart tissue microarray images downloaded. We evaluated smooth muscle cell or endothelial cell staining, and if either was present, the image was selected. This resulted in 1,355 images representing 703 unique proteins|
|Figure 2: Selection and scoring of images. Each tissue microarray image is presented to the user either for selection (using image_viewer.py) or for scoring (using image_scorer.py). Either a gamepad or the keyboard can be used to scroll through the images, zoom in and out, and either select or score the displayed image|
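The selection step can be thought of as a small state machine driven by button presses. The sketch below models only that logic, with no graphics or controller handling; it is not the code of image_viewer.py, which uses Pygame for display and input. The class name `SelectionSession`, the action names, and the button-number mapping are assumptions made for illustration.

```python
class SelectionSession:
    """Track the current image and the set of images the reviewer has selected."""

    def __init__(self, image_names):
        self.images = list(image_names)
        self.index = 0
        self.selected = set()

    def handle(self, action):
        """Apply one input action: 'next', 'previous', or 'select'."""
        if action == "next":
            # Stop at the last image rather than wrapping around.
            self.index = min(self.index + 1, len(self.images) - 1)
        elif action == "previous":
            self.index = max(self.index - 1, 0)
        elif action == "select":
            self.selected.add(self.images[self.index])
        else:
            raise ValueError("unknown action: %r" % (action,))
        return self.images[self.index]


# Hypothetical mapping from gamepad button numbers to actions.
BUTTON_ACTIONS = {0: "previous", 1: "next", 2: "select"}

session = SelectionSession(["img_0001.jpg", "img_0002.jpg", "img_0003.jpg"])
for button in [1, 2, 1, 0, 1, 1]:
    session.handle(BUTTON_ACTIONS[button])
```

Separating the navigation state from the input device is one way to let a gamepad and a keyboard drive the same review session.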
Calculating the Error Rate of Using a Gamepad Controller
A total of 500 core images each of cardiac muscle and lung tissue were downloaded from the HPA. Lung and cardiac muscle were chosen because they are easily distinguished from each other at a glance. The 1,000 total images were mixed in a single folder, randomized using the Python list shuffle method, and renamed sequentially. Two pathologists (MH, TC) each ran the image_viewer.py script on the mixed images, selecting lung images, but not heart images. Each run was timed, and the user results were compared to the actual tissue type.
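A minimal sketch of this blinding and error-rate calculation follows. It operates on file names in memory rather than renaming files on disk, and the function names, file naming pattern, and simulated reviewer are assumptions for illustration; the authors' exact procedure is not given in the text.

```python
import random


def build_blinded_set(labelled_images, seed=None):
    """Shuffle (name, tissue) pairs and assign sequential blinded file names."""
    shuffled = list(labelled_images)
    random.Random(seed).shuffle(shuffled)
    key = {}
    for position, (original_name, tissue) in enumerate(shuffled, start=1):
        key["%04d.jpg" % position] = (original_name, tissue)
    return key


def error_rate(key, selected, target_tissue):
    """Fraction of images where the reviewer's choice disagrees with the true tissue."""
    errors = 0
    for blinded_name, (_, tissue) in key.items():
        chosen = blinded_name in selected
        if chosen != (tissue == target_tissue):
            errors += 1
    return errors / len(key)


images = [("heart_%d.jpg" % i, "heart") for i in range(5)]
images += [("lung_%d.jpg" % i, "lung") for i in range(5)]
key = build_blinded_set(images, seed=0)

# Simulate a reviewer who selects every lung image except one.
lung_names = sorted(name for name, (_, tissue) in key.items() if tissue == "lung")
selected = set(lung_names[1:])
rate = error_rate(key, selected, "lung")
```

Both kinds of mistake count as errors here: a lung image that was skipped and a heart image that was selected.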
HPASubC Usage to Localize Specific Smooth Muscle Cells and Endothelial Cells Staining in the Heart
We chose to investigate all staining for proteins that were present in SMCs and ECs but not in the cardiac myocytes in heart TMA cores. We included any weak, moderate or strong staining of SMCs or ECs in the absence of myocyte staining. We also included strong SMC or EC staining with a weak to moderate non-specific myocyte cytoplasmic blush or perinuclear granular staining interpreted to be lipofuscin. We obtained >1.1 million records, parsed out the ENSG IDs, and used these to obtain ~49,200 heart images [Figure 1]. These images were resized to 1200 × 1200 pixels for viewing on a Dell U2410 monitor. Images were scrolled through in batches of 3,000. The 1,580 positive images were then scored using the following values - 0: Non SMC or EC staining, 1: Clean EC, 2: Predominant EC, 3: Both EC and SMC, 4: Predominant SMC, 5: Clean SMC. At this stage, images of proteins that stained structures other than SMCs or ECs, such as extracellular matrix, serum or inflammatory cells, were removed. The remaining images were collapsed into unique ENSG IDs, and HPA metadata were added. A confirmation of the final list was performed by reviewing all images for a given identified ENSG ID to determine the percent that had positive staining. All work was performed on a Dell Optiplex 9010 with an Intel Core i7 3.40 GHz core (8 processors, Dell, Inc., Plano, Texas) and 16 GB RAM running Windows 7.
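The scoring, collapsing, and confirmation steps can be sketched as below. The score labels are taken from the text; everything else (the function names, the placeholder gene IDs, and the per-gene image totals) is an assumption for illustration and is not drawn from the HPASubC scripts.

```python
from collections import defaultdict

# Score values as described in the text.
SCORE_LABELS = {
    0: "Non SMC or EC staining",
    1: "Clean EC",
    2: "Predominant EC",
    3: "Both EC and SMC",
    4: "Predominant SMC",
    5: "Clean SMC",
}


def collapse_to_genes(scored_images):
    """Group (ensg_id, score) pairs by gene, dropping images scored 0."""
    by_gene = defaultdict(list)
    for ensg_id, score in scored_images:
        if score not in SCORE_LABELS:
            raise ValueError("unexpected score: %r" % (score,))
        if score != 0:
            by_gene[ensg_id].append(score)
    return dict(by_gene)


def percent_positive(positive_counts, total_counts):
    """Percentage of each gene's reviewed images that showed the pattern."""
    return {
        gene: 100.0 * positive_counts[gene] / total_counts[gene]
        for gene in positive_counts
    }


# Placeholder records: (gene ID, score) for each selected image.
scored = [("ENSG_A", 1), ("ENSG_A", 2), ("ENSG_B", 5), ("ENSG_C", 0)]
genes = collapse_to_genes(scored)
positives = {gene: len(scores) for gene, scores in genes.items()}

# Hypothetical totals of all heart images available per gene.
totals = {"ENSG_A": 2, "ENSG_B": 4}
percents = percent_positive(positives, totals)
```

In this toy example, ENSG_C drops out because its only image was scored 0, and ENSG_B is positive in one of four images.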
| Results|| |
HPASubC Speed and Functionality
Due to the large number of samples involved, each step took an appreciable amount of time. In our environment, the download_gene_data_from_gene_list.py script retrieved the image URL for each ENSG ID at an average speed of 1.45 s per ID, or around 6.5 h to download data for all 16,600 proteins currently listed in the HPA. Downloading each full-sized JPEG took an additional 1.8 s (on average), or just over 24 h to download all 49,200 images evaluated. A trained cardiovascular pathologist (MKH) reviewed 49,200 heart TMA images in batches of ~3,000 images looking for SMC and/or EC staining. An average batch of 3,000 images took around 22 min to review, representing approximately 0.5 s per image or 6 h for the entire 49,200 image dataset. The metadata download script obtained 100% of gene names, 99.8% of expression/staining data, and Entrez data for 472 of 703 (67%) ENSG IDs. We additionally measured the error rate of using a gamepad for input in the scripts. For 1,000 randomly mixed cardiac and lung tissue images, the average error rate in classifying tissue type (only) was 0.65% with an average speed of 0.92 s/image.
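The time estimates above follow from multiplying the per-item rates by the item counts. A short worked calculation, using only the figures reported in this section (the helper name `total_hours` is ours), is:

```python
def total_hours(items, seconds_per_item):
    """Convert a per-item time in seconds into a total in hours."""
    return items * seconds_per_item / 3600.0


# Image URL lookup: 16,600 ENSG IDs at 1.45 s each (roughly 6.7 h).
url_lookup_hours = total_hours(16600, 1.45)

# Image download: 49,200 JPEGs at 1.8 s each (roughly 24.6 h).
image_download_hours = total_hours(49200, 1.8)

# Review: 22 min per batch of 3,000 images is 0.44 s per image,
# or roughly 6 h for all 49,200 images.
seconds_per_image = 22 * 60 / 3000.0
review_hours = total_hours(49200, seconds_per_image)
```

These totals agree with the rounded values quoted in the text.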
Unique Staining of Endothelial Cells and Smooth Muscle Cells
We identified 703 proteins present exclusively in ECs and/or SMCs but not in cardiac myocytes [Figure 3], [Supplemental Table 1] [Additional file 1]. Among EC staining patterns, we mostly observed staining in all ECs [Figure 3]a, but occasional proteins were absent in the capillaries and exclusive to small arterioles [Figure 3]b. The overall distribution of staining suggested that most of these unique proteins were found in ECs (403) while 145 were shared, and 155 were exclusive to SMCs [Figure 4]. We also determined the percentage of all images demonstrating appropriate staining for any given protein and found 261 (37%) had staining in all evaluated images and 510 (72.5%) had staining in at least 50% of evaluated images.
|Figure 3: Representative images of different detected staining patterns. (a) CD93 demonstrates pure endothelial cell (EC) staining of arteries and capillaries. (b) NOV has exclusively arteriole EC staining. (c) ATP2B4 stains strongly for smooth muscle cells (SMCs) and more weakly for capillary ECs. (d) ZSWIM5 stains exclusively for SMCs. A faint granular stain is noted in some myocytes, which was considered common and nonspecific (All images are obtained from human protein atlas; bar = 100 μm)|
|Figure 4: Venn diagram of the relationship of protein expression in endothelial cells (ECs) (green) and smooth muscle cells (SMCs) (peach) for all 703 identified proteins. Most unique staining was observed in ECs, but 145 proteins (21%) had staining both in ECs and SMCs|
Using Endothelial Cell and Smooth Muscle Cell Staining to Subset Cardiac Expression Data
We investigated how our list of cardiac non-myocyte proteins could be used to better understand cardiac genetic and proteomic data [Figure 5]. We used a list of cardiac genes whose expression was identified using an Affymetrix Exon array, as described previously. The exon array identified 14,808 expressed transcripts in the heart with ENSG IDs. We determined that 567 of these transcripts overlapped our list of 703 EC and SMC expressed proteins [Supplemental Table 1]. The top half of this gene list (7,404) contained 58% (328 genes/proteins) of the overlap between lists. We obtained adult heart proteomic data from the recent report of Kim et al. This list contains 6,701 proteins with an ENSG ID. Of these, 330 (4.9%) overlapped with our 703 EC and SMC proteins. Further, 5,433 proteins/genes were shared between the two published sources, representing 81% of all reported proteins [Figure 5]. Of the 703 EC and SMC staining proteins from the HPA data, 282 (40%) were found on both the proteomic and transcriptomic lists and 86% (605) were present on at least one list. Our HPA data also suggest that 4-5% of all transcripts/proteins present in cardiac-omic datasets are present exclusively in nonmyocytes.
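Comparisons of this kind reduce to set operations on ENSG IDs. The sketch below uses toy identifiers rather than the study's lists, and the function name `overlap_summary` and its output keys are assumptions for illustration.

```python
def overlap_summary(reference_ids, transcript_ids, proteome_ids):
    """Count how many reference genes appear in each external dataset."""
    reference = set(reference_ids)
    in_transcripts = reference & set(transcript_ids)
    in_proteome = reference & set(proteome_ids)
    return {
        "in_transcripts": len(in_transcripts),
        "in_proteome": len(in_proteome),
        "in_both": len(in_transcripts & in_proteome),
        "in_either": len(in_transcripts | in_proteome),
        "in_neither": len(reference - in_transcripts - in_proteome),
    }


# Toy identifiers standing in for ENSG IDs.
vascular = ["G1", "G2", "G3", "G4", "G5"]
exon_array = ["G1", "G2", "G3", "G9"]
proteome = ["G2", "G3", "G4", "G8"]
summary = overlap_summary(vascular, exon_array, proteome)
```

Matching on ENSG IDs rather than gene symbols avoids mismatches caused by symbol aliases, which is one reason to carry the ENSG ID through every step.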
|Figure 5: Venn diagrams of the relationship of the proteomic and genomic cardiac tissue datasets. The proteome (purple) contained 6,701 proteins of which 81% were also reported in the mRNA dataset (blue). Similarly, of 703 smooth muscle cell (SMC) and endothelial cell (EC) specific proteins found in this study, 88% (282) of those reported in the proteomic data (green) were also noted in the mRNA dataset (peach). Ninety-eight EC or SMC proteins (white) found in human protein atlas images are not found in either larger dataset|
| Discussion|| |
We describe a new suite of software tools to subclassify staining patterns on IHC HPA images. HPASubC can be used in a variety of creative fashions to evaluate any unique staining patterns across the range of HPA images. Delineating EC and SMC staining in heart tissues was a simple proof-of-concept that shows the power of the technique.
Overall, we report on 703 proteins found in ECs and SMCs that are not present in cardiac myocytes. This curated SMC and EC dataset is now a useful reference for groups evaluating cardiac signals in their disease of interest and for subsetting proteins that are not expressed in myocytes. For example, we determined that beta 2 microglobulin (B2M) is expressed only in ECs in the heart. B2M is widely expressed across a variety of other cell types in other organs. This suggests that although B2M has been used as a housekeeping gene to normalize qPCR data in a variety of tissues, it would be a poor choice for normalizing heart qPCR data.
Utilizing HPASubC, over a period of roughly two weeks, we were able to characterize all EC and SMC staining in the heart using publicly available data and minimal resources. Although these methods could be performed more quickly, we do not recommend reviewing all the images in a single sitting due to the potential for a repetitive strain injury to the thumbs and the need to minimize visual and mental fatigue.
The general limitations of this approach are the ability of the reviewer to consistently identify a pattern of interest and the potential for mimickers of the pattern to confound review. The quality and appropriateness of the antibody staining is a further limitation. We note that when images from two separate antibodies against the same protein are compared, often only one antibody shows staining. It is, thus, unknown whether the discrepancy is the result of false positive or false negative staining. With each new version of the HPA, data from numerous old antibodies are being replaced with newer, presumably more specific antibodies and the old images are removed. Thus, any classification of subcellular or cell-specific features will remain in flux for the foreseeable future. In addition, certain cell types may not be present on every image for a given organ system. We found this to be true in our study: our identification of SMC staining is likely artifactually lower than that of ECs because small arterioles were not seen on every heart TMA image [Figure 4]. We are confident that as the HPA nears its goal of IHC for every protein, the stability and utility of the database will increase.
Financial Support and Sponsorship
Conflict of Interest
There are no conflicts of interest.
| References|| |
Kim MS, Pinto SM, Getnet D, Nirujogi RS, Manda SS, Chaerkady R, et al. A draft map of the human proteome. Nature 2014;509:575-81.
GTEx Consortium. The genotype-tissue expression (GTEx) project. Nat Genet 2013;45:580-5.
Kent OA, McCall MN, Cornish TC, Halushka MK. Lessons from miR-143/145: The importance of cell-type localization of miRNAs. Nucleic Acids Res 2014;42:7528-38.
Uhlen M, Oksvold P, Fagerberg L, Lundberg E, Jonasson K, Forsberg M, et al. Towards a knowledge-based human protein atlas. Nat Biotechnol 2010;28:1248-50.
Glory E, Newberg J, Murphy RF. Automated comparison of protein subcellular location patterns between images of normal and cancerous tissues. Proc IEEE Int Symp Biomed Imaging 2008;4540993:304-7.
Li J, Newberg JY, Uhlén M, Lundberg E, Murphy RF. Automated analysis and reannotation of subcellular locations in confocal images from the human protein atlas. PLoS One 2012;7:e50514.
Newberg JY, Li J, Rao A, Pontén F, Uhlén M, Lundberg E, et al. Automated analysis of human protein atlas immunofluorescence images. Proc IEEE Int Symp Biomed Imaging 2009;5193229:1023-6.
Gupta S, Halushka MK, Hilton GM, Arking DE. Postmortem cardiac tissue maintains gene expression profile even after late harvesting. BMC Genomics 2012;13:26.
Waha A, Watzka M, Koch A, Pietsch T, Przkora R, Peters N, et al. A rapid and sensitive protocol for competitive reverse transcriptase (cRT) PCR analysis of cellular genes. Brain Pathol 1998;8:13-8.
Bussolati B, Bruno S, Grange C, Buttiglieri S, Deregibus MC, Cantino D, et al. Isolation of renal progenitor cells from adult human kidney. Am J Pathol 2005;166:545-55.
Meller M, Vadachkoria S, Luthy DA, Williams MA. Evaluation of housekeeping genes in placental comparative expression studies. Placenta 2005;26:601-7.
Bahar R, Hartmann CH, Rodriguez KA, Denny AD, Busuttil RA, Dollé ME, et al. Increased cell-to-cell variation in gene expression in ageing mouse heart. Nature 2006;441:1011-4.
Vaidya HJ. Playstation thumb. Lancet 2004;363:1080.