Analysis and Research of Electronic Medical Records
(This project was funded by Microsoft Research Asia)
INTRODUCTION
The digitization of medical records is an inevitable trend as information and computer technology spreads through modern hospitals around the world. Electronic medical records carry a large amount of varied, first-hand information about patients, but they are mostly unstructured text. Extracting the information relevant to a patient's disease from such text is an essential task and has been an active research topic in recent years. The i2b2 NLP Shared Tasks (Challenges in Natural Language Processing for Clinical Data) have been held four times by Informatics for Integrating Biology and the Bedside (i2b2), an NIH-funded National Center for Biomedical Computing based at Partners HealthCare System in the United States. The i2b2 challenges provide labeled medical records and advance the ability of natural language processing (NLP) tools to extract increasingly useful information from unstructured clinical data. This increasingly fine-grained information can help physicians make more accurate clinical decisions, supply medical researchers with a sizable amount of first-hand data, and support the calculation of medical insurance bills.
However, little work has applied NLP to information extraction from Chinese electronic medical records, even though mining such information from Chinese unstructured text is highly valuable. On the one hand, the Chinese government has promulgated a series of measures to promote the digitization of medical records, and electronic medical records are one of the fundamental requirements for a hospital in China to be rated as a first-class comprehensive hospital with more than 500 inpatients. On the other hand, electronic medical records are an enormous mine of information, and their number is growing significantly in China compared with other countries: China's population is larger than that of any other country and accounts for 20% of the world population.
ACHIEVEMENTS
I. Corpus Creation
A comprehensive, high-quality corpus of named entities was built for Chinese discharge summaries. The gold-standard corpus was created from 336 Chinese discharge summaries, which contain 8811 instances of Medical Problem, 1188 of Treatment, 782 of Medication, 1299 of Test, and 4234 of Anatomy. On average, a summary contains 26.22 medical problems, 3.33 treatments, 2.33 medications, 3.87 tests, and 12.60 anatomy mentions, and has 12.19 sentences. The inter-annotator agreements (IAAs) are 71.94% and 71.87% between the doctors' annotations and the gold standard, and 91.59% and 90.98% between the authors' annotations and the gold standard.
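The per-summary averages follow directly from the entity counts and the number of summaries. The short Python sketch below redoes that arithmetic; the function and variable names are illustrative and not part of the project. Dividing the stated Treatment count (1188) by 336 gives about 3.54 rather than the reported 3.33, so one of those two figures may contain a typo; the other four averages match the text.

```python
# Entity counts and corpus size as reported in the text.
NUM_SUMMARIES = 336
entity_counts = {
    "Medical Problem": 8811,
    "Treatment": 1188,
    "Medication": 782,
    "Test": 1299,
    "Anatomy": 4234,
}


def per_summary_averages(counts, num_summaries):
    """Return the mean number of instances per summary for each entity type."""
    return {name: count / num_summaries for name, count in counts.items()}


averages = per_summary_averages(entity_counts, NUM_SUMMARIES)
for name, value in averages.items():
    print(f"{name}: {value:.2f}")
```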
II. Ontology Building
Two collections of terms were created, one for English and one for Chinese. Each term is classified as Medical Problem, Medication, or Medical Test, following the i2b2 challenge tasks. The English collection contains 49,249 Problem, 89,591 Medication, and 25,107 Test terms, while the Chinese collection contains 66,780 Problem, 101,025 Medication, and 15,032 Test terms. The proposed method for constructing a large collection of medical terms is efficient and effective regardless of language. We will make the collections publicly available to support the research community working on medical record processing in English and Chinese.
III. Challenges
2010: ranked 10th in the i2b2 challenge on concepts, assertions, and relations in clinical text.
2011: ranked 1st in the i2b2 challenge on co-reference resolution.
2011: ranked 2nd in the i2b2 challenge on finding emotions in suicide notes.
2012: in the i2b2 challenge on temporal relations, evaluated by F-measure, ranked first out of 14 teams (event extraction), first out of 14 teams (timex extraction), third out of 12 teams (temporal relation), and second out of seven teams (end-to-end temporal relation).
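The 2012 results above are reported with the F-measure. As a reminder of how such a score is obtained, here is a minimal sketch of the balanced F-measure (F1) over exact-match annotations. The function name, the `(start, end, label)` tuple layout, and the sample data are assumptions made for illustration; the official i2b2 evaluation scripts may define matching differently.

```python
def f_measure(gold, predicted, beta=1.0):
    """Return (precision, recall, F-beta) over two collections of annotations.

    An annotation counts as correct only if it matches a gold one exactly.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f


# Hypothetical annotations: (start offset, end offset, label).
gold = [(0, 5, "EVENT"), (10, 18, "TIMEX"), (25, 31, "EVENT"), (40, 44, "EVENT")]
pred = [(0, 5, "EVENT"), (10, 18, "TIMEX"), (26, 31, "EVENT")]

precision, recall, f1 = f_measure(gold, pred)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

In this toy case two of the three predictions match the four gold annotations, so precision is 2/3, recall is 1/2, and F1 is 4/7.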
PUBLICATIONS
· Yan Xu, Junichi Tsujii, and Eric Chang, Named entity recognition of follow-up and time information in 20,000 radiology reports, in Journal of the American Medical Informatics Association, 28 May 2012
· Yan Xu, Kai Hong, Junichi Tsujii, and Eric Chang, Feature engineering combined with machine learning and rule-based methods for structured information, in Journal of the American Medical Informatics Association, 14 May 2012
· Yan Xu, Jiahua Liu, Jiajun Wu, Yue Wang, Zhuowen Tu, Jian-Tao Sun, Junichi Tsujii, and Eric Chang, A classification approach to coreference in discharge summaries: 2011 i2b2 challenge, in Journal of the American Medical Informatics Association, 2012
· Yan Xu, Yue Wang, Jiahua Liu, Zhuowen Tu, Jian-Tao Sun, Junichi Tsujii, and Eric Chang, Suicide Note Sentiment Classification: A Supervised Approach Augmented by Web Data, in Biomedical Informatics Insights, Libertas Academica, January 2012
· Yan Xu, Yue Wang, Jiahua Liu, and Eric Chang, Sentiment Analysis of Suicide Notes Based On External Resources (Rank 2), in Proceedings of the 2011 i2b2/VA/Cincinnati Workshop on Challenges in Natural Language Processing for Clinical Data, 2011
· Yan Xu, Jiahua Liu, Jiajun Wu, Yue Wang, and Eric Chang, EHUATUO: A Mention-pair Coreference System by Exploiting Document Intrinsic Latent Structures and World Knowledge in Discharge Summaries (Rank 1), in Proceedings of the 2011 i2b2/VA/Cincinnati Workshop on Challenges in Natural Language Processing for Clinical Data, 2011
· Eric Chang, Yan Xu, Kai Hong, Jianqiang Dong, and Zhaoquan Gu, A Hybrid Approach to Extract Structured Information from Narrative Clinical Discharge Summaries, in Proceedings of the 2010 i2b2/VA/Cincinnati Workshop on Challenges in Natural Language Processing for Clinical Data, October 2010