Post-Informatics Pathology
Jules J Berman
J Pathol Inform 2011;2:18
Date of Submission: 09-Dec-2010
Date of Acceptance: 23-Feb-2011
Date of Web Publication: 31-Mar-2011
Source of Support: None, Conflict of Interest: None
How to cite this article:
Berman JJ. Post-Informatics pathology. J Pathol Inform 2011;2:18
During the 1970s and 1980s, pathology departments interfaced laboratory instruments and pathologists with computers, permitting the acquisition of large amounts of clinical pathology and anatomic pathology data in digital form. Over the next several decades, pathology data were collected and organized, while contributors from many ancillary fields (e.g., computer science, image analysis, statistics, cryptography, library science, electronic communication, ethics, and law) developed tools for exchanging and analyzing large data sets derived from many diverse sources.
In 2010, we have:
- Nomenclatures to express medical data (e.g., MeSH, the NCI Thesaurus, and LOINC).
- A method to describe data with metadata (XML, eXtensible Markup Language), a specification for binding XML-described data to unique objects (RDF, Resource Description Framework), and an ontology language for relating classes of information (OWL, Web Ontology Language).
- Electronic Medical Records (EMRs), wherein hospital information systems attach all the information collected on a patient to the patient's unique identifier, creating a well-specified database for every patient.
- Laws, regulations, and guidelines detailing how clinical data can be shared in a manner that protects patients from harm.
- Algorithms and software implementations for deidentifying and encrypting confidential medical data.
- Methods for finding and clustering data by feature similarities, and for building a hierarchical grouping that interrelates the clusters. Other methods find trends in data, or find the best cut-off points that distinguish one class of data from another.
- General cross-platform scripting languages, such as Perl, Python, and Ruby, that provide nonprogrammers with the tools to write their own implementations of fundamental data analysis algorithms or to call, from their own scripts, any of thousands of publicly available method modules.
- Specialized scripting languages and tools that support specific types of tasks (e.g., R for statistics, ImageMagick for imaging, POV-Ray for 3-D visualizations, and Tcl/Tk for graphic user interfaces).
- Standard protocols for sharing data and data services across networks (e.g., web services, cloud computing).

All the listed tools are available at no cost, as royalty-free, open source, or public domain products. Moreover, there is a rich literature, in journals, in books, and on the web, explaining how these resources can be obtained and used.
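As one illustration of the deidentification algorithms listed above, a keyed one-way hash can replace a patient identifier with an irreversible pseudonym. The sketch below is a minimal example, not any institution's actual method; the key and medical record number are hypothetical.

```python
import hashlib
import hmac

def deidentify(patient_id: str, secret_key: bytes) -> str:
    """Replace a patient identifier with an irreversible pseudonym.

    An HMAC (keyed one-way hash) maps the same identifier to the same
    pseudonym every time, so records can still be linked across data
    sets, but the original identifier cannot be recovered; without the
    key, an attacker cannot mount a dictionary attack on common
    identifiers.
    """
    return hmac.new(secret_key, patient_id.encode("utf-8"),
                    hashlib.sha256).hexdigest()

# Hypothetical key and medical record number, for illustration only.
key = b"institutional-secret-key"
pseudonym = deidentify("MRN-0012345", key)
```

The same identifier always yields the same pseudonym under a given key, which is what allows deidentified records from one patient to remain linked.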
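The "best cut-off point" methods mentioned above can likewise be sketched in a few lines of a general scripting language. This toy example, using invented lab values for two diagnostic groups, tests each observed value as a candidate threshold and keeps the one with the highest classification accuracy.

```python
def best_cutoff(values_class_a, values_class_b):
    """Find the threshold on a single measurement that best separates
    two classes, assuming class B tends toward the higher values."""
    candidates = sorted(set(values_class_a) | set(values_class_b))
    n = len(values_class_a) + len(values_class_b)
    best_t, best_acc = None, -1.0
    for t in candidates:
        # Count cases classified correctly if we call "B" at value >= t.
        correct = (sum(1 for v in values_class_a if v < t)
                   + sum(1 for v in values_class_b if v >= t))
        acc = correct / n
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

# Hypothetical measurements for two diagnostic groups.
group_a = [2.1, 2.4, 2.8, 3.0, 3.3]   # e.g., benign
group_b = [3.9, 4.2, 4.5, 5.1, 5.8]   # e.g., malignant
cutoff, accuracy = best_cutoff(group_a, group_b)   # cutoff 3.9, accuracy 1.0
```

Real cut-off selection would weigh the costs of false positives against false negatives (e.g., by ROC analysis), but the brute-force search above conveys the idea.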
The acquisition of immense data resources and of the tools to analyze the data marks the arrival of the post-informatics age. Pathologists can now focus their efforts on post-informatics questions. Here are just a few.
How Should We Verify Clinical Trials and Other Evidence-Based Projects?
Clinical trials are experiments, nothing more. Like any experiment, a clinical trial can be poorly designed, misinterpreted, badly implemented, or subject to the deleterious effects of random errors. Clinical trials need to be validated by clinical experience. This often necessitates analyzing pathology and clinical data obtained prior to, and after, the introduction of trial-inspired interventions.
In the past few decades, there have been numerous advances in the design and analysis of clinical trials. Today, clinical trials are usually prospective, randomized, and double-blinded. Patients are accrued under a strict selection protocol, until a predetermined number of participants is reached, sufficient to produce statistically significant trial results. In the case of cancer trials, teams of pathologists review the original diagnostic tissues, ensuring that every accrued patient has a diagnosis, histologic subtype, and stage that are appropriate for the clinical study. Clinical trials are registered, and the data resulting from the trial are collected according to a standard protocol. Under ideal conditions, the data are analyzed by statisticians who are not directly involved in the study and who have no stake in the outcome. Inventing the modern clinical trial was, in no small part, an informatics activity. Determining whether the results of clinical trials yield useful treatment options is a post-informatics task.
How Can We Find New Predictors of Treatment Response?
In the informatics age, diagnostic predictors were concerned with the presence or absence of disease. When a disease has a long preclinical stage (as in the case of Alzheimer's disease), or when a disease has many distinct genetic variants (as in cystic fibrosis), or clinical variants associated with distinctive biological features (as in estrogen-positive/estrogen-negative breast cancer), or morphologic subtypes of possible clinical significance (as in the subtypes of breast cancer), clinicians expect pathologists to develop ways of determining how these subtypes of disease will respond to varying treatment options (i.e., response predictors). Because there are many possible combinations of treatments and disease subtypes, it is not feasible to develop response predictors through long-term, expensive clinical trials. There is a chance that new response predictors can be developed by analyzing preexisting data sets composed of integrated clinical and pathologic data obtained from multiple institutions. This is a job for post-informatics pathologists.
Are There General Methods That Will Produce Testable Hypotheses from Raw Data Sets?
In the informatics age, large data sets were created, either by experiment or clinical trial, or following clinical intervention (i.e., outcomes data), for the purpose of answering a specific set of questions. The questions existed before the data existed, and the questions were the impetus for the collection of the data. In the post-informatics age, data sets will commonly precede research questions. Some of the post-informatics data sets will be data leftovers from prior hypothesis-driven studies. Some data sets, such as the Human Genome Project's completed sequence of human DNA, will be created specifically as data resources, available for a wide range of studies. One of the most important tasks of the post-informatics scientist is the creation of testable hypotheses, starting with preexisting data.
How Can We Improve the Classification of Diseases?
Perhaps the least obvious (and most likely to be ignored) of the post-informatics activities is the development of novel or improved disease classifications. Classifications are tools that simplify complex systems and help us to understand the biologic principles that extend to all the members of a class, and that distinguish the members of one biological class from members of other classes. Classifications allow us to make some sense from a complex reality.
The most successful biological classification is the so-called tree of life, the hierarchical grouping of earthly life forms, built from the lifetime intellectual contributions of thousands of scientists, over the past two millennia. High school students can peruse a one-page chart depicting the ordered classes of life, and quickly grasp the totality of relationships among all the classes of organisms that live, or have ever lived, on this planet.
Disease classifications, developed with the same principles employed by naturalists, were not the focus of much activity during the informatics age. It was sufficient to have comprehensive thesauri uniting synonyms under a canonical term and code (unique identifying string). Thesauri and nomenclatures aid with quotidian tasks, such as indexing and coding; but, unlike classifications, they cannot be used to understand, diagnose, treat, and prevent diseases.
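The distinction above can be made concrete: at its core, a thesaurus is a lookup that unites synonyms under one canonical concept code, and nothing more. The sketch below is illustrative only; the code string is invented and not drawn from any actual nomenclature.

```python
# A thesaurus unites synonyms under one canonical term and code
# (unique identifying string). The code below is hypothetical.
thesaurus = {
    "renal cell carcinoma": "C9385000",
    "hypernephroma":        "C9385000",
    "grawitz tumor":        "C9385000",
}

def canonical_code(term: str) -> str:
    """Return the unique identifying code for a diagnostic term, so
    that synonymous entries index to the same concept."""
    return thesaurus[term.lower().strip()]
```

Such a structure supports indexing and coding, but it carries no class hierarchy and no biological relationships, which is precisely why a thesaurus cannot substitute for a classification.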
The development of comprehensive disease classifications requires us to integrate information held in biological, clinical, and pathological resources, to find the key features that define and distinguish classes of disease, and to create a hierarchical placement for each class. This task is particularly essential to the field of cancer. The number of tumor types and subtypes is far too large to accommodate the evaluation of treatment protocols catered to every kind of tumor. The development of cancer treatments for classes of tumors is a far more practical approach that should emerge in the post-informatics age.
How Can We Protect Complex Information Systems from Our Users, and How Can We Protect Our Users from Complex Information Systems?
Post-informatics information systems are complex. Complex systems can render unintended adverse consequences on users, and users can unpredictably disrupt complex systems. In the informatics age, as systems became increasingly complex, it was common to erect a protective barrier between users and information systems. In most cases, data could enter the system, but access to the data was strictly limited. Hospital staff were routinely denied programmer access to information system software.
In the post-informatics age, users will demand greater access to data. Systems will need to provide new functionalities for users, and the community of users will include nonbiological entities (e.g., software agents). Thus, in the post-informatics age, we will need to open information systems in a manner that protects users and protects systems.
How Can We Preserve Patient Data for All Posterity?
Collecting data is an informatics task. Saving the collected data in a form that can be easily parsed and understood, now and into the distant future, is a post-informatics task.
Methods for the storage and retrieval of electronic data are often ephemeral. Zip disks, 3.5″ disks, 5.25″ disks, and most analog or digital magnetic tape formats (e.g., Betamax or VHS video, 8-track or cassette audio) are obsolete. Software applications that worked well on 32-bit operating systems may not work on current 64-bit operating systems. Early file formats may not be recognized by new word processing applications (XyWrite files, popular among publishers throughout the 1980s and early 1990s, are undecipherable in most modern word processors).
Lest anyone believe long-term data storage is an impossible task, consider the archiving activities of the ancient Sumerians. The Gilgamesh epic was written about 2500 B.C., in Sumerian, on clay tablets. It was acquired for the library of King Ashurbanipal (668-627 B.C.). In about 612 B.C., the library was burned by marauders. The fire baked the clay tablets, hardening them and extending their shelf-life. These Sumerian tablets, written more than 4000 years in the past, can be read today.
The value of post-informatics science is dependent upon our ability to store and accrue patient data, from birth until death, over generations. Whereas the field of informatics has been effectively cremated, post-informatics requires perpetual care.
How Can We Reduce the Number of Diagnostic Errors Rendered by Pathologists?
One of the primary endeavors of the pathology informatics age was the creation of computer-based tools (e.g., sophisticated image analysis applications, rule-based inference engines, neural networks, etc.) all intended to replace pathologists, or to reduce their numbers. These well-funded efforts have all failed. With the possible exception of automated cytologic screening, which may have reduced a few cytologist positions, no pathologist has ever been replaced by an algorithm. If 40 years of unrelenting failure has taught us anything, it is that the age of post-informatics pathology will not yield robot pathologists, anytime soon.
Much more feasible than computerized diagnostics is computerized error reduction. When a pathologist renders a diagnosis that stretches demographic credibility (e.g., uterine cancer in a male patient), or defies likelihood (e.g., four bronchioloalveolar carcinomas diagnosed in four different patients, accessioned on the same day), or compels biologic skepticism (e.g., basal cell carcinoma of skin metastatic to gallbladder), a computer algorithm might serve to halt the release of the report.
Some of the same approaches that tried, and failed, to yield automated diagnoses, might succeed in reducing errors caused by human fallibility. In the post-informatics age, clinical and pathologic data obtained on millions of diagnoses can be used to determine the features most consistent with correctly rendered diagnoses. Diagnosed cases lacking these features can be automatically flagged and reviewed.
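A rule of the kind described above, halting a demographically implausible report before release, can be sketched in a few lines. The report fields and the rule table below are hypothetical, chosen only to mirror the uterine-cancer example; a production system would draw its rules from nomenclature attributes, not a hand-written dictionary.

```python
def plausibility_flags(report: dict) -> list:
    """Return warnings for diagnoses that stretch demographic
    credibility. The fields ('sex', 'diagnosis') and the rule list
    are illustrative only."""
    sex_restricted = {
        "uterine cancer": "F",
        "prostate cancer": "M",
    }
    flags = []
    dx = report["diagnosis"].lower()
    required_sex = sex_restricted.get(dx)
    if required_sex and report["sex"] != required_sex:
        flags.append(f"'{dx}' is implausible for sex "
                     f"'{report['sex']}'; hold report for review")
    return flags

# A report like this one would be halted before release.
suspect = {"sex": "M", "diagnosis": "Uterine cancer"}
warnings = plausibility_flags(suspect)
```

The point is not the trivial lookup, but that the check runs automatically on every report, where a tired human might not.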
Can Pathologists Perform Functions Tomorrow That They Cannot Perform Today?
During the informatics age, the purpose of information systems was to computerize the traditional pathology services (i.e., producing reports, sending the reports to clinicians, storing reports, and retrieving reports as needed). Towards the end of the informatics age, simple display, retrieval, and communications methods were employed to enhance the value of reports (i.e., synoptic reporting, the inclusion of images, links to information related to the diagnoses, graphing and other visualization enhancements to better represent the cumulative data for a patient, automated triggers for specified laboratory results, documentation of laboratory actions, etc.). In the post-informatics age, pathologists must provide services that were unavailable in the past. Here are a few examples:
- Monitoring adverse responses to treatments.
- Finding outlier pathologists who render specific diagnoses at frequencies significantly different from their peers (e.g., diagnosing 50% of lung adenocarcinomas as the bronchioloalveolar type, when other pathologists assign the same diagnosis to 4% of adenocarcinomas).
- Using data to improve diagnostic accuracy.

The key point here is that pathologists must ensure that their data archives do not degenerate into data cemeteries. The post-informatics pathologist must search for new ways to use data to benefit patients, and must not settle for merely automating traditional services.
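The outlier-pathologist example lends itself to a simple statistical screen. The sketch below uses a normal approximation to the binomial to ask whether a pathologist's rate of a specific diagnosis departs from the peer rate by more than a chosen number of standard errors; the counts and the peer rate are hypothetical, echoing the 50%-versus-4% example above.

```python
import math

def rate_outlier(k: int, n: int, peer_rate: float,
                 z_cutoff: float = 3.0) -> bool:
    """Flag a pathologist whose rate of a specific diagnosis differs
    from the peer rate by more than z_cutoff standard errors, using a
    normal approximation to the binomial distribution.

    k         -- times the pathologist rendered the diagnosis
    n         -- eligible cases reviewed by that pathologist
    peer_rate -- fraction of eligible cases given that diagnosis
                 by peers (illustrative input)
    """
    expected = n * peer_rate
    stderr = math.sqrt(n * peer_rate * (1.0 - peer_rate))
    z = (k - expected) / stderr
    return abs(z) > z_cutoff

# 50 bronchioloalveolar calls out of 100 adenocarcinomas, against a
# peer rate of 4%, is flagged as an outlier.
flagged = rate_outlier(k=50, n=100, peer_rate=0.04)
```

A real screen would also require a minimum case count (the normal approximation is poor for small n) and would treat the flag as a prompt for peer review, not as a verdict.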
How Do We Migrate Journals from a Hypothesis-Centric Universe to a Data-Centric Universe?
In the informatics age, the scientific universe was centered on manuscripts, with each manuscript encapsulating a hypothesis, a method to test the hypothesis, the data produced by the method, and a generalizable conclusion. This is an inefficient way to conduct science. For each hypothesis, an entirely new set of data needs to be created. The data, if they are to be believed, must be reproduced by one or more additional laboratories. In recent times, as experimental data sets have become large and complex, the feasibility of repeating an experiment, or even conducting an independent analysis of an author's primary data, has declined sharply. Most manuscripts never receive the minimal scrutiny required to establish credibility.
The post-informatics age has a new object occupying the center of its universe: the set of all publicly available scientific data. Manuscripts in the post-informatics age are tiny satellites that orbit the massive data set. No longer must manuscripts include primary data; they will simply point to the pre-existing data resources upon which their conclusions are based. In many cases, manuscripts will point to trusted data sources, such as the Human Genome Project. A single data source can serve hundreds or thousands of new manuscripts. Methods are also a form of data, because they can be fully specified. A prototypical manuscript will consist of hypotheses, link to the method(s) and data resource(s) used to test the hypothesis, and a generalizable conclusion. The data and the methods can be evaluated separately from the paper, and authors may choose to use only those methods and data that have been validated by their peers. Criticisms of manuscripts can focus on the assumptions and arguments upon which the data and methods were selected.
In the data-centric universe, who will receive credit for the work accomplished: the data set creator, or the manuscript author? Who will pay for data? Who will pay for manuscripts? How will data sets be accredited? How will the public gain access to data derived from confidential medical data sets? How will the public gain access to intellectual property (i.e., methods or data) used in the preparation of manuscripts? Who will publish manuscripts that have neither methods nor data? These are questions for the post-informatics age.
What Are the Limits of Complexity in Computer-Assisted Medical Practice?
In the software realm, it is very simple to create applications with a level of complexity that far exceeds our ability to understand their behavior. The worst programmers are apt to produce the most complex applications. Although software engineers have tools to manage software complexity (e.g., the Unified Modeling Language), at some point applications often become chaotic and unpredictable. Complexity is perhaps the most serious limitation to post-informatics progress. It is possible to undertake large, expensive projects that are doomed from the start, simply because they are too complex to succeed. In the post-informatics age, we must acquire some way of knowing when our observations cease to have practical meaning in the physical realm.
Who Are the Post-Informatics Pathologists?
In the informatics age, pathologists worked in a pathology department. Their chief responsibilities involved producing pathology reports and training the next generation of pathologists. In the post-informatics age, some pathologists will work as free agents (i.e., sans department), and they will use available information and software tools to render consultative reports. They will train the next generation of post-informatics pathologists by creating tutorials and books, and by offering apprenticing opportunities. They will maintain their professional identities by using their diagnostic expertise and their skill in informatics to reduce the suffering and death caused by disease.