Aldina AVDIĆ, Ulfeta MAROVAC, Dragan JANKOVIĆ

Automated labeling of terms in medical reports in Serbian

Many electronic health reports (EHRs) are stored every day. Each consists of a structured part and an unstructured part written in natural language. Because the time available for a medical examination is limited, EHRs are short reports that often contain errors and abbreviations. It is therefore a challenge to process an EHR and extract knowledge from the unstructured part for different purposes. This paper compares the results of three proposed methods for automatic labeling of medical terms in the unstructured parts of EHRs. All words are categorized as either within the medical domain (symptoms, diagnoses, therapies, anatomy, specialties, etc.) or outside it (numbers, places, stop words, etc.). The first method is based on dictionaries of medical terms, the second on a training set, and the third on a training set and rules. For each method, results are given for different ways of reducing a word to its basic form (pure, prefix, stem). The paper shows that, in labeling medical terms, methods based on medical dictionaries (diagnoses, symptoms, medications, etc.) do not produce the best results; it is therefore better to use a manually annotated part of the data set as a model. A significant share of the words in medical reports (17.36%) are abbreviations and errors, so better results require rules that handle them. The supervised methods outperform the dictionary-based method (relative improvement of 42.82%). Including the algorithm for processing errors and abbreviations improved the results further (relative improvement of 4.21%) and gave the highest F1 measure (0.9082). An advantage of the proposed method is that the rules for processing errors and abbreviations give good results regardless of how a word is reduced to its basic form.
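As an illustration only, the lookup-based labeling idea and the effect of word reduction can be sketched in Python. This is a minimal sketch, not the authors' implementation: the toy lexicon, the label names, the prefix length of 5, and all function names are assumptions made for the example. The stem variant is omitted because it needs a Serbian stemmer. The `f1` and `relative_improvement` helpers use the standard definitions, which the abstract does not spell out.

```python
from typing import Callable, Dict, List

# Hypothetical toy lexicon (term -> label); not taken from the paper.
LEXICON: Dict[str, str] = {
    "glavobolja": "symptom",    # headache
    "temperatura": "symptom",   # fever
    "paracetamol": "therapy",
    "i": "stop_word",           # "and"
}

def pure(word: str) -> str:
    """'Pure' form: the word itself, lowercased."""
    return word.lower()

def prefix(word: str, n: int = 5) -> str:
    """'Prefix' form: the first n characters (n = 5 is an assumption)."""
    return word.lower()[:n]

def build_index(lexicon: Dict[str, str],
                reduce: Callable[[str], str]) -> Dict[str, str]:
    """Index the lexicon by the reduced form of each term."""
    return {reduce(term): label for term, label in lexicon.items()}

def label_tokens(tokens: List[str], index: Dict[str, str],
                 reduce: Callable[[str], str]) -> List[str]:
    """Label each token by lookup; digits become 'number'."""
    labels = []
    for token in tokens:
        if token.isdigit():
            labels.append("number")
        else:
            labels.append(index.get(reduce(token), "unknown"))
    return labels

def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def relative_improvement(baseline: float, improved: float) -> float:
    """Relative improvement over the baseline, in percent."""
    return (improved - baseline) / baseline * 100

# Inflected forms ("glavobolju", "temperaturu") miss an exact lookup
# but match once both sides are reduced to a prefix.
tokens = ["glavobolju", "i", "temperaturu", "38", "paracetamol"]
pure_labels = label_tokens(tokens, build_index(LEXICON, pure), pure)
prefix_labels = label_tokens(tokens, build_index(LEXICON, prefix), prefix)
```

In this toy case the pure form leaves the two inflected words unlabeled, while the prefix form recovers them. Prefix matching can also conflate unrelated words that share their first letters, which is one reason to compare several reduction strategies.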

