Journal of Pathology Informatics


ORIGINAL ARTICLE
Year : 2013  |  Volume : 4  |  Issue : 1  |  Page : 23

Extracting laboratory test information from biomedical text


1 Center for Devices and Radiological Health, U.S. Food and Drug Administration, Silver Spring, Maryland, USA
2 The Lister Hill National Center for Biomedical Communications, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA

Correspondence Address:
Yanna Shen Kang
Center for Devices and Radiological Health, U.S. Food and Drug Administration, Silver Spring, Maryland
USA

Source of Support: None, Conflict of Interest: None


DOI: 10.4103/2153-3539.117450


Background: No previous study has reported the efficacy of current natural language processing (NLP) methods for extracting laboratory test information from narrative documents. This study investigates the pathology informatics question of how accurately such information can be extracted from text with current tools and techniques, especially machine learning and symbolic NLP methods. The study data came from a text corpus maintained by the U.S. Food and Drug Administration that contains a rich set of information on laboratory tests and test devices.

Methods: The authors developed a symbolic information extraction (SIE) system to extract device- and test-specific information about four types of laboratory test entities: specimens, analytes, units of measure and detection limits. They compared the performance of SIE with that of three prominent machine learning based NLP systems, LingPipe, GATE and BANNER, which implement three distinct supervised machine learning methods: hidden Markov models, support vector machines and conditional random fields, respectively.

Results: The machine learning systems recognized laboratory test entities with moderately high recall but low precision. Their recall was relatively higher when the number of distinct entity values (e.g., the spectrum of specimens) was very limited or when the lexical morphology of the entity was distinctive (as in units of measure). Yet SIE outperformed them by statistically significant margins in both precision and F-measure on extracting specimen, analyte and detection limit information. Its higher recall was statistically significant for analyte information extraction.

Conclusions: Despite its shortcomings relative to machine learning methods, a well-tailored symbolic system may better discern relevance among many pieces of information of the same type, and may outperform a machine learning system by tapping into lexically non-local contextual information such as the document structure.
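The evaluation terms used in the abstract (precision, recall and F-measure) can be illustrated with a minimal Python sketch. It is not the authors' SIE system: the regular expression, the unit list, the function names `extract_limits` and `precision_recall_f1`, and the example sentence are all assumptions made for illustration. The sketch applies one naive rule that treats every number followed by a concentration unit as a detection limit, then scores the result against a hand-written gold annotation.

```python
import re

# Hypothetical pattern: a number followed by a concentration-style unit such as "ng/mL".
# This is an illustrative assumption, not the rule set used by the SIE system.
UNIT = r"(?:ng|ug|mg|g|IU|U|mmol|umol)/(?:mL|dL|L)"
LIMIT_PATTERN = re.compile(rf"(\d+(?:\.\d+)?)\s*({UNIT})")


def extract_limits(text):
    """Return the set of (value, unit) pairs for every number-plus-unit mention."""
    return {(m.group(1), m.group(2)) for m in LIMIT_PATTERN.finditer(text)}


def precision_recall_f1(predicted, gold):
    """Score a set of predicted entities against a set of gold-standard entities."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented example sentence and gold annotation, for illustration only.
sentence = "The detection limit is 0.5 ng/mL; samples above 200 ng/mL were diluted."
gold = {("0.5", "ng/mL")}

predicted = extract_limits(sentence)
precision, recall, f1 = precision_recall_f1(predicted, gold)
print(predicted, precision, recall, round(f1, 3))
```

On this invented sentence the naive rule returns both "0.5 ng/mL" and "200 ng/mL", so recall is 1.0 while precision is only 0.5. This is a toy version of the situation the abstract describes: several values of the same type appear in the text, and only some of them are relevant. Telling them apart would require context such as the surrounding phrase or the document section, which this sketch deliberately omits.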

