Abstract
We explore methods for extracting information from clinical narratives captured by HealthLink, a public health telephone consulting service. Our research applies state-of-the-art natural language processing and machine learning to these narratives to extract information of interest. The currently available data consist of dialogues recorded by nurses while consulting with patients by phone. Because nurses transcribe the interviews during the phone conversations, the data contain a significant volume and variety of noise. To extract patient-related information from the noisy data, we must remove or correct at least two kinds of noise: explicit noise, which includes spelling errors, unfinished sentences, omitted sentence delimiters, and term variants; and implicit noise, which includes non-patient information and untrustworthy patient information. To filter explicit noise, we propose our own biomedical term detection and normalization method, which resolves misspellings, term variations, and arbitrary abbreviations introduced by nurses. To detect temporal terms, temperatures, and other types of named entities (which convey patients’ personal information such as age and sex), we propose a bootstrapping-based pattern learning process that captures a variety of arbitrary variations of named entities. To address implicit noise, we propose a dependency path-based filtering method. The result of our denoising is normalized patient information, which we visualize by constructing a graph that shows the relations between named entities. The objective of this knowledge discovery task is to identify associations between biomedical terms and to clearly expose trends in patients’ symptoms and concerns. The experimental results show that our noise reduction methods achieve reasonable performance.
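To make the idea of explicit-noise normalization concrete, the following is a minimal sketch and not the authors' method. It assumes that misspellings are resolved by edit distance with adjacent transpositions (the optimal string alignment variant of Damerau-Levenshtein distance) against a small lexicon, and that abbreviations are resolved by a lookup table. The names `LEXICON`, `ABBREVIATIONS`, `osa_distance`, and `normalize_term`, the example terms, and the distance threshold are all hypothetical.

```python
from typing import Optional

# Hypothetical resources: a real system would draw on a biomedical vocabulary.
LEXICON = ["fever", "headache", "vomiting", "diarrhea", "temperature"]
ABBREVIATIONS = {"temp": "temperature", "h/a": "headache"}


def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: insertions, deletions,
    substitutions, and transpositions of adjacent characters."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "re" versus "er"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def normalize_term(token: str, max_distance: int = 2) -> Optional[str]:
    """Map a noisy token to a lexicon term, or None if nothing is close."""
    token = token.lower().strip()
    if token in ABBREVIATIONS:
        return ABBREVIATIONS[token]
    best = min(LEXICON, key=lambda term: osa_distance(token, term))
    return best if osa_distance(token, best) <= max_distance else None


if __name__ == "__main__":
    for noisy in ["fevre", "headach", "temp", "xyz"]:
        print(noisy, "->", normalize_term(noisy))
```

A fixed distance threshold is the simplest possible choice. It over-corrects short tokens, so a practical system would likely scale the threshold with token length or use context to disambiguate.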
Index Terms
- Recognition of Patient-Related Named Entities in Noisy Tele-Health Texts