J Pathol Inform 2020
Value of public challenges for the development of pathology deep learning algorithms
Douglas Joseph Hartman1, Jeroen A. W. M. Van Der Laak2, Metin N Gurcan3, Liron Pantanowitz1
1 Department of Pathology, University of Pittsburgh Medical Center, Pittsburgh, PA, USA
2 Department of Pathology, Radboud University Medical Center, Nijmegen, The Netherlands; Centre for Medical Image Science and Visualisation, Linköping, Sweden
3 Center for Biomedical Informatics, Wake Forest School of Medicine, Winston Salem, NC, USA
Date of Submission: 04-Nov-2019
Date of Decision: 29-Nov-2019
Date of Acceptance: 12-Dec-2019
Date of Web Publication: 26-Feb-2020
Correspondence: Dr. Douglas Joseph Hartman, Department of Pathology, University of Pittsburgh Medical Center, 200 Lothrop Street, A-610, Pittsburgh, PA 15213
Source of Support: None, Conflict of Interest: None
Abstract
The introduction of digital pathology is changing the practice of diagnostic anatomic pathology. Digital pathology offers numerous advantages over using a physical slide on a physical microscope, including more discriminative tools to render a more precise diagnostic report. The development of these tools is being facilitated by public challenges related to specific diagnostic tasks within anatomic pathology. To date, 24 public challenges related to pathology tasks have been published. This article discusses these public challenges and briefly reviews the underlying characteristics of public challenges and why they are helpful to the development of digital tools.
Keywords: Algorithm development, artificial intelligence, digital pathology algorithms, public challenges
How to cite this article: Hartman DJ, Van Der Laak JA, Gurcan MN, Pantanowitz L. Value of public challenges for the development of pathology deep learning algorithms. J Pathol Inform 2020;11:7
How to cite this URL: Hartman DJ, Van Der Laak JA, Gurcan MN, Pantanowitz L. Value of public challenges for the development of pathology deep learning algorithms. J Pathol Inform [serial online] 2020 [cited 2021 May 10];11:7. Available from: https://www.jpathinformatics.org/text.asp?2020/11/1/7/279524
Introduction
There is great excitement about the potential for artificial intelligence (AI) to favorably alter the clinical practice of diagnostic pathologists. One mechanism that has facilitated the development of AI algorithms is the public challenge for a specific task. A public challenge is an image-analysis task open to the public free of charge, for example, mitotic figure counting, gland segmentation, or detection of metastatic tumor foci in lymph nodes. Many of the available public challenges in medical imaging are hosted by one website – grand-challenge.org. A dataset is generally provided as part of the challenge – sometimes as a single set and sometimes divided into training and testing sets. The training set is used to build the algorithm, whereas the testing set is used to evaluate the performance of the algorithm on an independent set of cases. The datasets are usually annotated at a level appropriate to the specific image task being examined, for example, segmentation annotations (boundaries of structures of interest drawn in the image) or slide-level labels (i.e., cancer/no cancer). Users must register as part of the data download process. They are then given a timeframe to build an algorithm using the training set and any other available data and to evaluate the developed algorithm on the test set using predefined evaluation criteria. For some challenges, the results and the methods used to achieve them are presented in conjunction with a conference. The rules for each challenge are published and typically describe how the results will be evaluated as well as the size of the dataset. The results from different algorithms are usually posted on a public leaderboard, and some competitions award prizes to the top-performing algorithms. These public challenges provide the large, annotated datasets necessary to develop AI algorithms.
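The train/test separation described above can be illustrated with a minimal sketch (not from the article; the random features and labels below are hypothetical stand-ins for image-derived data), using Python and scikit-learn:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Hypothetical stand-ins for features extracted from whole-slide images
# and slide-level labels (1 = cancer, 0 = no cancer).
rng = np.random.default_rng(0)
features = rng.normal(size=(100, 8))
labels = rng.integers(0, 2, size=100)

# Challenge organizers hold out a test set; participants build the
# algorithm on the training portion and are scored on the held-out cases.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.3, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
print(acc)
```

With random labels the accuracy hovers near chance; the point is the strict separation of training and evaluation data, which mirrors how challenge submissions are scored.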
The public datasets allow for a common mechanism to compare the algorithms from different developers (including both academia and industry). Public challenges are also a method to advance computational pathology by encouraging competition and enabling a direct comparison between different algorithms. They also foster the AI startup field by reducing the burden of obtaining a large dataset and enabling different groups to work on the same problem and learn from each other. Since these challenges provide a basis for AI development, it is critical to understand the underlying infrastructure used to build AI. Awareness of these public challenges will increase the knowledge of the field of AI development and could also be helpful in the regulatory field. Wider appreciation of how AI is developed and how it performs for given tasks can increase the acceptance of AI within the broader medical community. Although it is still evolving, the Food and Drug Administration has expressed interest in the regulation of AI as a medical device while also approving the first AI algorithm – IDx-DR, a screening algorithm for the evaluation of diabetic retinopathy.
Technical Background
These public challenges essentially offer raw data for many groups worldwide to attempt to solve some of the challenging problems in pathology. The algorithms produced by the participants using the challenge dataset are evaluated based on their submitted outputs. In many cases, the data license prohibits the use of the dataset for purposes other than challenge participation. An exception is the Cancer Metastases in Lymph Nodes (CAMELYON) datasets, which are shared under a CC0 license allowing unlimited use of the data. Although The Cancer Genome Atlas also contains whole-slide images (WSIs), the images are not annotated, limiting their usefulness for task-related challenges. The datasets vary from challenge to challenge, as do the tasks to be solved by AI. Challenges generally have a period during which training data are provided; a testing dataset is then released, and participants submit their results to the host organization so that the results can be shared with the public.
Procedure and Datasets
Using the website grand-challenge.org, the number of challenges related to anatomic pathology was collated. This website is maintained by the Diagnostic Image Analysis Group (led by Bram van Ginneken of the Radboud University Medical Center in Nijmegen, the Netherlands). It includes a listing of various public challenges that have been published since 2007. As of this writing, the website listed 191 challenges. The challenges have been hosted by various organizations or groups, but Medical Image Computing and Computer-Assisted Intervention (MICCAI), the International Symposium on Biomedical Imaging (ISBI), and International Society for Optics and Photonics (SPIE) Medical Imaging have been some of the more prolific supporters of these challenges. As an example, we briefly review one of the better-known challenges – the CAMELYON16 challenge. This challenge (cosponsored by ISBI) consisted of 399 hematoxylin and eosin (H and E)-stained WSIs of sentinel lymph nodes from two hospitals in the Netherlands. This challenge was the first to provide WSIs, and those images were acquired using two different scanners – the Pannoramic 250 Flash II (3DHISTECH, Budapest, Hungary) and the NanoZoomer-XR (Hamamatsu Photonics, Hamamatsu City, Japan). As the ground truth, the presence of metastases was annotated under the supervision of expert pathologists. An example of an annotated WSI is presented in [Figure 1]. A total of 270 images were used as a training set, and 129 digital slides were available as a test set. Two tasks were requested as part of this challenge: (a) identify individual metastases in WSIs and (b) classify each WSI as containing a metastasis or not. The developed solutions had to be able to detect micrometastases and macrometastases but were not required to detect isolated tumor cells.
For the first task, the free-response receiver operating characteristic (FROC) curve was used to evaluate the participants [Figure 2], whereas the second task was evaluated by the area under the receiver operating characteristic curve [Table 1]. The organizers of the challenge also tested pathologists on the same tasks in two settings – with unlimited and with limited time. The training dataset with accompanying annotations was released for download on December 30, 2015; on March 1, 2016, the test WSIs were released, with a deadline for submissions on April 1, 2016. The winners were announced during the ISBI workshop on April 13, 2016. A follow-up challenge from this group – CAMELYON17 – moved from slide-level to patient-level evaluation. These tasks were selected for several reasons: (a) this is a tedious, clinically relevant task that occurs in high volume, (b) the solution would likely generalize to lymph node metastases of other cancers, and (c) the detection of metastatic clusters of tumor cells on H and E slides requires the recognition of subtle textural patterns (solutions would likely advance algorithms in histopathology in general).
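The slide-level metric for task 2 can be sketched as follows (this is an illustration only, not code or data from the challenge; the labels and scores are hypothetical), using scikit-learn's `roc_auc_score`:

```python
from sklearn.metrics import roc_auc_score

# Hypothetical slide-level ground truth (1 = metastasis present) and
# algorithm-predicted probabilities for eight whole-slide images.
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.10, 0.40, 0.35, 0.80, 0.20, 0.90, 0.05, 0.70]

# With no tied scores, the AUC equals the fraction of (positive, negative)
# pairs ranked in the correct order: 15 of the 16 pairs here.
auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.9375
```

The FROC analysis for task 1 is analogous but plots lesion-level sensitivity against the average number of false-positive detections per WSI rather than against the false-positive rate.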
|Figure 1: Example of metastatic regions in an H and E-stained sentinel lymph node tissue section, with annotations of metastases by a pathologist (blue lines)|
|Figure 2: Results of task 1 of Cancer Metastases in Lymph Nodes 16: Detection of individual metastatic regions in sentinel lymph node whole-slide images. The analysis is performed using the free-response receiver operating characteristic curve, displaying sensitivity versus the number of false positives per whole-slide image. The green diamond indicates the performance of a single pathologist who scored the slides in an experimental setting without any time constraint|
|Table 1: Results of task 2 of Cancer Metastases in Lymph Nodes 16: Prediction of sentinel lymph node status on the slide level|
Results
The challenges were evaluated for their subject matter (medical discipline), including radiology, pathology, cell biology, cardiology, ophthalmology, dermatology, dental, gastroenterology, and others. Of the 191 challenges, 24 (12.5%) were related to pathology or combined radiology/pathology. [Figure 3] demonstrates the relative concentration of challenges according to medical discipline. Since the first challenges in 2007, the number of challenges per year has steadily increased [Figure 4]. The medical disciplines represented have diversified from the initially radiology-predominant studies to a much wider range [Figure 4]. The first challenge involving pathology-related images was in 2010. This first pathology challenge involved detecting lymphocytes within H and E-stained slides and counting the number of centroblasts in cases of follicular lymphoma and was sponsored in conjunction with the International Conference on Pattern Recognition 2010. Initially, web technology and data storage were not sufficiently developed to allow the use of WSIs, and therefore, small, mostly manually selected fields of view were used. This strongly limited the applicability of algorithms developed within the challenge context, as the algorithms are not robust to image content not sufficiently covered in the datasets (e.g., artifacts). A description of the pathology-related challenges is presented in [Table 2]. Investigation of the challenges related to pathology images demonstrates that the most frequent organ site (45.8%) was the breast. Other organ sites included the cervix, central nervous system, thyroid, lung, and multi-organ datasets [Figure 5]. The images within these datasets consist of both fields of view and WSIs. The number of images varied per dataset and was sometimes not provided in the study description; where reported, it ranged between 15 and 1000 images.
The studies inconsistently reported how many patients were represented by the images within the dataset. The WSIs within the datasets were sometimes from a single platform, while other datasets provided multiple image file formats (up to 4). The listed image file formats include .svs, .ndpi, .mrxs, .tif, .tiff, .bmp, and .czi, as well as extended depth-of-field cytology images.
|Figure 3: The breakdown of 191 challenges according to the medical discipline of the challenge. Of note, with the exception of one challenge, the challenges each involve tasks within a single medical discipline|
|Figure 4: The number of challenges according to the medical discipline over time since the year 2007. The volume of challenges has been steadily increasing and diversifying since 2007. Radiology still accounts for the majority of challenges, but pathology and ophthalmology are increasing|
|Figure 5: Breakdown of the pathology challenges according to the predominant organ site studied|
Evaluating the performance of the algorithms requires a benchmark or ground truth against which the output of the algorithm is compared. The majority of challenges cite “expert” or “experienced” pathologists as the source of ground truth for the dataset (n = 16). However, eight challenges did not describe how the ground truth was determined, which presents a major problem. Three challenges cite a single pathologist interpretation (one of which was augmented by molecular profiles). One challenge cited an “expert oncologist,” another cited “two medical experts” as the ground truth, and one challenge used HER2 results without specifying the method of HER2 evaluation (immunohistochemistry or fluorescence in situ hybridization). Surprisingly, one study used the annotations of engineering students checked by a single pathologist.
Various statistics were used within the challenges to evaluate the outputs from the algorithms. These included the F1 score, quadratic weighted kappa, instance-level recall, FROC, Dice coefficient, area under the curve, relative target registration, execution time, prediction probability, weighted precision, weighted recall, aggregated Jaccard index, overall prediction accuracy (the number of correctly classified cases divided by the total number of cases), Spearman's correlation coefficient, a point system for correct scores, and counts of true positives, true negatives, false positives, and false negatives. While there is no single method to evaluate such varied problems, it is very important to follow commonly accepted evaluation methodologies. These would vary depending on the nature of the problem as well as how the ground truth is generated. Each challenge needs to pay attention to (1) how the ground truth is generated, (2) which evaluation metrics will be used (e.g., Dice coefficient vs. Hausdorff distance), and (3) asking the participants to submit their results in a standard format (e.g., Extensible Markup Language) so that all submissions can be evaluated using the same set of evaluation techniques and software. The grand challenge platform contains tools for fully automated assessment of submitted results, further increasing reproducibility and efficiency.
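Several of the metrics above quantify segmentation overlap. As a brief illustration (not code from any challenge; the two toy masks are hypothetical), the Dice coefficient and Jaccard index for binary masks can be computed as follows:

```python
import numpy as np

def dice(pred: np.ndarray, truth: np.ndarray) -> float:
    """Dice coefficient: 2|A ∩ B| / (|A| + |B|) for binary masks."""
    inter = np.logical_and(pred, truth).sum()
    return 2.0 * inter / (pred.sum() + truth.sum())

def jaccard(pred: np.ndarray, truth: np.ndarray) -> float:
    """Jaccard index: |A ∩ B| / |A ∪ B| for binary masks."""
    inter = np.logical_and(pred, truth).sum()
    return inter / np.logical_or(pred, truth).sum()

# Toy 2x3 binary masks standing in for pixel-level annotations.
pred = np.array([[1, 1, 0], [0, 1, 0]], dtype=bool)
truth = np.array([[1, 0, 0], [0, 1, 1]], dtype=bool)

d = dice(pred, truth)     # 2*2 / (3+3) = 0.666...
j = jaccard(pred, truth)  # 2 / 4 = 0.5
print(d, j)
```

For binary masks the Dice coefficient equals the F1 score of the positive class, which is one reason the two appear interchangeably across challenge leaderboards.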
Conclusion
As mentioned previously, AI is rapidly being developed, and pathology has not been exempt from these advances. The number of public challenges including pathology datasets has been increasing, reflecting the increased availability of digital pathology data. One interesting observation regarding pathology challenges is the disconnect between the organs studied and the high-volume specimens typically encountered in routine clinical practice. Dermatopathology and gastrointestinal specimens represent the large majority of specimens received in pathology laboratories; yet, there are no dermatopathology public challenges and only a few for gastrointestinal pathology. Many companies are working within these routine areas internally, but the mismatch between the supply from public challenges and the demand of clinical practice limits the wider adoption of AI by the pathology community.
Being aware of public challenges for AI research is important for the pathology community. As AI algorithms will likely be marketed to pathologists in the near future, it is important that the pathology community become aware of the conditions under which algorithms are developed and the performance differences between them. Through public challenges, a common evaluation method and dataset allow for a better comparison of the performance of the algorithms. Of note, there are still some deficiencies with public challenges. For example, the images are sometimes provided in only one file type or a proprietary file type, which may hinder widespread deployment. Even more importantly, in many challenges, the datasets consist of WSIs from a single source or a small number of sources. Even though data augmentation and WSI normalization may help, generalizability of algorithms most ideally comes from a diverse dataset containing images from a larger number of centers. Furthermore, diverse evaluation metrics are used in these public challenges, making it difficult to compare the algorithms and methods from different challenges. Public challenges offer some degree of transparency into the development process for AI and help expand the understanding of this relatively new (to pathology) field (i.e., opening the “black box”). An additional element of the challenges that pathologists should be aware of is the determination of the “ground truth” – this can be highly variable, and although it usually involves a pathologist, it does not always. Well-annotated datasets are critical to the success of these challenges, but annotation can be time-consuming and costly. Several groups have explored using “crowd-sourcing” to obtain the annotations. Whether using nonpathologists for annotation will be effective for algorithm development is yet to be determined. “Crowd-sourcing” does, however, represent an opportunity to overcome individual bias in morphologic assessment.
Aside from providing common ground upon which to evaluate an algorithm, public challenges also foster the development of AI by reducing the start-up costs of commencing AI development. Historically, large curated datasets have been owned by academic medical centers and by companies working on developing the technology, which reduces competition within the market. Public challenges also add transparency to the process by clearly describing the datasets and establishing routine practices/workflows.
We wish to commend the authors and hosts of these public challenges and encourage further such public challenges in this field. In addition to the authors and hosts, numerous grants supported these challenges, and we commend those groups for their support (grants cited within the published work associated with the challenges are listed in the acknowledgments). We also acknowledge the difficulties associated with generating and administering a public challenge. The value these public challenges bring to the broader medical community needs to be emphasized.
Financial support and sponsorship

Many of these public challenges were supported through grants. The project described was supported in part by U24CA199374 (PIs: Gurcan, Madabhushi, Martel), and U01 CA220401 (PIs: Gurcan, Cooper, Flowers), R01 CA235673 (PI: Puduvalli) from the National Cancer Institute, R01 HL145411 (PI: Beamer) from National Heart Lung and Blood Institute, UL1 TR001420 (PI: McClain) from National Center for Advancing Translational Sciences, and OSU CCC Intramural Research (Pelotonia) Award. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Cancer Institute, National Institute on Deafness and Other Communication Disorders, National Heart Lung and Blood Institute, National Center for Advancing Translational Sciences, or the National Institutes of Health. CAMELYON16 – Data collection and annotation were funded by Stichting IT Projecten (Nijmegen, Netherlands) and by the Fonds Economische Structuurversterking (tEPIS/TRAIT project; LSH-FES Program 2009; DFES10 29161 and FES1103JJT8U). Fonds Economische Structuurversterking also supported (in kind) web access to WSIs. This work was supported by grant 601040 from the Seventh Framework Programme for Research-funded VPH-PRISM project of the European Union (Mr. Ehteshami Bejnordi). The Knut and Alice Wallenberg foundation is acknowledged for the generous support of Dr. van der Laak.

Conflicts of interest
Douglas Hartman has received an educational honorarium from Philips. Liron Pantanowitz is on the medical advisory board for Leica and Ibex and is a consultant for Hamamatsu. Jeroen van der Laak is a member of the scientific advisory boards of Philips, The Netherlands, and ContextVision, Sweden. Jeroen van der Laak receives research funding from Sectra, Sweden and receives project remuneration from Philips, the Netherlands.
References
1. Niazi MK, Parwani AV, Gurcan MN. Digital pathology and artificial intelligence. Lancet Oncol 2019;20:e253-61.
2. Rotemberg V, Halpern A, Dusza S, Codella NC. The role of public challenges and data sets towards algorithm development, trust, and use in clinical practice. Semin Cutan Med Surg 2019;38:E38-42.
3. Hipp JD, Sica J, McKenna B, Monaco J, Madabhushi A, Cheng J, et al. The need for the pathology community to sponsor a whole slide imaging repository with technical guidance from the pathology informatics community. J Pathol Inform 2011;2:31.
4. Litjens G, Bandi P, Ehteshami Bejnordi B, Geessink O, Balkenhol M, Bult P, et al. 1399 H&E-stained sentinel lymph node sections of breast cancer patients: The CAMELYON dataset. Gigascience 2018;7:1-8.
5. Ehteshami Bejnordi B, Veta M, Johannes van Diest P, van Ginneken B, Karssemeijer N, Litjens G, et al. Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer. JAMA 2017;318:2199-210.
6. Golden JA. Deep learning algorithms for detection of lymph node metastases from breast cancer: Helping artificial intelligence be seen. JAMA 2017;318:2184-6.
7. Gurcan M, Madabhushi A, Rajpoot N. Pattern recognition in histopathological images: An ICPR 2010 contest. Int Conf Pattern Recognit 2010;226-34.
8. Roux L, Racoceanu D, Loménie N, Kulikova M, Irshad H, Klossa J, et al. Mitosis detection in breast cancer histological images: An ICPR 2012 contest. J Pathol Inform 2013;4:8.
9. Tizhoosh HR, Pantanowitz L. Artificial intelligence and digital pathology: Challenges and opportunities. J Pathol Inform 2018;9:38.
10. Tellez D, Litjens G, Bándi P, Bulten W, Bokhorst JM, Ciompi F, et al. Quantifying the effects of data augmentation and stain color normalization in convolutional neural networks for computational pathology. Med Image Anal 2019;58:101544.
11. Amgad M, Elfandy H, Hussein H, Atteya LA, Elsebaie MAT, Abo Elnasr LS, et al. Structured crowdsourcing enables convolutional segmentation of histology images. Bioinformatics 2019;35:3461-7.
12. Grote A, Schaadt NS, Forestier G, Wemmert C, Feuerhake F. Crowdsourcing of histological image labeling and object delineation by medical students. IEEE Trans Med Imaging 2019;38:1284-94.