Year : 2011 | Volume : 2 | Issue : 1 | Page : 5-
Informatics research using publicly available pathology data
Jules J Berman
7104 Brandywine Way, Columbia
The day has not arrived when pathology departments freely distribute their collected anatomic and clinical data for research purposes. Nonetheless, several valuable public domain data sets are currently available from the U.S. Government. Two public data sets of special interest to pathologists are the SEER (the U.S. National Cancer Institute's Surveillance, Epidemiology and End Results program) public use data files and the CDC (Centers for Disease Control and Prevention) mortality files. The SEER files contain about 4 million de-identified cancer records, dating from 1973. The CDC mortality files contain approximately 85 million de-identified death records, dating from 1968. This editorial briefly describes both data sources, how they can be obtained, and how they may be used for pathology research.
How to cite this article:
Berman JJ. Informatics research using publicly available pathology data. J Pathol Inform 2011;2:5-5
How to cite this URL:
Berman JJ. Informatics research using publicly available pathology data. J Pathol Inform [serial online] 2011 [cited 2019 Oct 17];2:5-5
Available from: http://www.jpathinformatics.org/text.asp?2011/2/1/5/76154
Pathology Data Sources
Disease research in pathology informatics requires archiving, retrieving, organizing, sharing, and analyzing diverse pathology-related data sources. The most important data source for pathologists is the data collected by anatomic and clinical pathologists (e.g., blood tests, surgical pathology reports, autopsy reports, annotated images, and specialized studies).
It is a sad irony that the data collected by pathologists is seldom available for serious scientific inquiry. Many pathologists cannot freely access the full set of anatomic and clinical data collected within their own departments, even though de-identified pathology data is exempt from regulation under HIPAA and the Common Rule [1,2]. Any pathology department could, without violating federal law, de-identify its archived anatomic and clinical pathology datasets and distribute them to the scientific community. For reasons economic, legalistic, and psychological, no pathology department has yet distributed de-identified collections of its data to the public [3,4]. As a result, there is virtually no scientific research currently being conducted on large, multi-institutional collections of pathology data. Moreover, the primary data for research done on single-institution collections is seldom released to the public.
When research is conducted on pathology data and the data is withheld from the public, there is no way to validate the conclusions. Hence, the U.S. National Academy of Sciences, along with the editors of many scientific journals, has established a policy requiring authors to release the primary data supporting their conclusions [5]. Because pathology departments have not released their de-identified research records to the public, record-based pathology informatics research cannot be published in journals that conform to the National Academy of Sciences recommendations. In the rare instances where an institution has published scientific results based on global analyses of its own datasets, the raw data on which those results are based have not been made available for critical review or for secondary analyses. Scientific results have no value unless they are backed by data that can be openly examined.
Where pathology departments have failed, the U.S. government has, to some extent, succeeded. Enormous sources of individualized but de-identified death records and cancer records are available at no cost to medical researchers.
CDC Mortality Files
Eighty-five million de-identified death records are available from the CDC (Centers for Disease Control and Prevention). Each record contains basic demographics on the decedent (age, race, gender, place of death), the cause of death, and (if provided in the death record) the underlying causes of death and significant additional medical conditions. Annual collections are available for the years 1968 to 2007. Each annual data file provides about 2 million byte-indexed sequential-line records. This means that there is one death record per line, and each line contains coded data located in ranges of bytes that are designated by a data dictionary. For example, the ICD-10 code of the underlying cause of death, for deaths occurring in 1999, is found in bytes 142-145 of each record. This may seem like an awkward way of organizing data when you consider the ease with which modern specifications (such as RDF) can encapsulate data with metadata. Nonetheless, the byte-indexed sequential files can be parsed very efficiently with just a few lines of code.
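As a minimal sketch of that kind of parsing, the Python below pulls one byte-indexed field out of each line and tallies the values it finds. The byte range 142-145 is the 1999 underlying-cause-of-death location mentioned above; the function names are illustrative assumptions, and the offsets for any other year should be taken from that year's data dictionary rather than from this example.

```python
from collections import Counter


def extract_field(line, first_byte, last_byte):
    """Return the field occupying bytes first_byte..last_byte (1-based, inclusive)."""
    return line[first_byte - 1:last_byte].strip()


def tally_field(lines, first_byte, last_byte):
    """Count how often each value of a byte-indexed field occurs across lines."""
    counts = Counter()
    for line in lines:
        value = extract_field(line.rstrip("\r\n"), first_byte, last_byte)
        if value:  # skip blank fields and lines too short to hold the field
            counts[value] += 1
    return counts
```

Assuming an annual file has been downloaded and unzipped to some local path (the path is whatever the reader chooses), passing the open file object to `tally_field(f, 142, 145)` and calling `.most_common(10)` on the result would list the ten most frequent codes in that field. Because the file is read one line at a time, memory use stays small even with about 2 million records.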
Though pathologists prefer autopsy data over death certificate data, autopsies are performed on a very small fraction of decedents. Despite early efforts to standardize and collect autopsy data, pathologists have not succeeded in sharing their autopsy data in a national database [6,7]. Consequently, death certificates are the most important source of mortality data available to medical researchers.
Public mortality files can be downloaded by anonymous ftp from the following URL: ftp.cdc.gov/pub/health_statistics/nchs/datasets/dvs/mortality/
SEER Data Files
SEER is the U.S. National Cancer Institute's Surveillance, Epidemiology and End Results program. SEER offers a Public Use dataset, containing de-identified records on about 4 million cancers that have occurred since 1973. Each SEER record is a single cancer case. With about 4 million carefully curated cases, scientists can draw certain types of inferences that could not possibly be made with the data accumulated at a single medical institution.
To get the SEER public use data files, you must first complete a data access request available at: seer.cancer.gov/data/request.html
SEER sends you a username and password that you will need to access the data files. The data is available on a DVD, or by direct Internet download. Each SEER record is a line on a data file, and consists of 264 alphanumeric characters. Byte data includes the patient's race, gender, age at diagnosis, primary tumor site, diagnosis, and information related to tumor size and occurrence of metastases. A data dictionary provides the byte location of the various field values contained in each record. The data dictionary comes bundled with the data files.
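A record of this kind can be split into named fields by pairing each line with the data dictionary. The sketch below assumes the 264-character record length given above; the field names and byte positions in `EXAMPLE_DICTIONARY` are hypothetical placeholders, not the real SEER layout, and must be replaced with the positions from the data dictionary bundled with the files.

```python
RECORD_LENGTH = 264  # characters per SEER record, as stated in the text

# HYPOTHETICAL byte ranges (1-based, inclusive), for illustration only.
# Real positions must come from the SEER data dictionary.
EXAMPLE_DICTIONARY = {
    "race": (1, 2),
    "sex": (3, 3),
    "age_at_diagnosis": (4, 6),
    "primary_site": (7, 10),
}


def parse_record(line, dictionary, record_length=RECORD_LENGTH):
    """Split one fixed-width record into a dict of field name -> coded value."""
    record = line.rstrip("\r\n")
    if len(record) != record_length:
        raise ValueError(
            f"expected {record_length} characters, got {len(record)}"
        )
    return {
        name: record[start - 1:end].strip()
        for name, (start, end) in dictionary.items()
    }
```

Checking the record length before slicing is a design choice: a truncated or shifted line would otherwise yield plausible-looking but wrong codes without any error.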
In the past several years, I have written hundreds of blog entries explaining how open source materials (i.e., data, algorithms, nomenclature, metadata specifications, and software) can be used to collect, organize, integrate, and analyze pathology-related information. The blog is available at: julesberman.blogspot.com
A blog tutorial for SEER data files appeared on November 14, 2008. A blog tutorial on the CDC mortality files appeared on December 2, 2008. All of the 300+ blog entries on the site can be accessed through a linked archive web page at: http://www.julesberman.info/blog_in.htm
Blog entries related to the CDC and SEER data include topics such as data mashups (e.g., mapping the geographic locations of disease occurrences), age distributions (e.g., following the average age of occurrence of diseases that progress through diagnostic categories over time), and trends in disease incidence (e.g., measuring the annual incidence of genetically screened diseases). There are blog entries of general interest to those working with pathology datasets, including entries on de-identification methods, image manipulation and annotation, data specification methods, and various computational algorithms.
1. Standards for privacy of individually identifiable health information. Office of the Assistant Secretary for Planning and Evaluation, DHHS. Final rule. Fed Regist 2000;65:82462-829.
2. Department of Health and Human Services. 45 CFR (Code of Federal Regulations), 46. Protection of Human Subjects (Common Rule). Fed Regist 1991;56:28003-32.
3. Berman JJ, Bhatia K. Biomedical Data Integration: Using XML to Link Clinical and Research Datasets. Expert Rev Mol Diag 2005;5:329-36.
4. Berman JJ. Confidentiality for Medical Data Miners. Artif Intell Med 2002;26:25-6.
5. National Academy of Sciences Report. Sharing Publication-Related Data and Materials: Responsibilities of Authorship in the Life Sciences. Washington, DC: The National Academies Press; 2003.
6. Hutchins GM, Berman JJ, Moore GW, Hanzlick R, and the Autopsy Committee of the College of American Pathologists. Practice guidelines for autopsy pathology. Arch Pathol Lab Med 1999;123:1085-92.
7. Berman JJ, Moore GW, Hutchins GM. Internet Autopsy Database. Hum Pathol 1997;28:393-4.