Automatic Classification of the Medical Documents on the Medline Database into Relevant Cancer Types

Original Turkish title: Medline Veritabanı Üzerinde Bulunan Tıbbi Dokümanların Kanser Türlerine Göre Otomatik Sınıflandırılması

PubMed, a search engine frequently used by medical researchers, runs queries against the MEDLINE database. MEDLINE is a regularly updated bibliographic database covering studies in medicine, biology, and genetics. Because it contains a large volume of unstructured text, many text classification studies have been conducted on the MEDLINE database or on subsets of it. This study develops a method that examines the abstracts of articles about cancer and automatically determines which type of cancer each article concerns. To train and test the method, 25,962 article abstracts were collected from MEDLINE through the PubMed search engine, using a program (crawler) developed specifically for this study. Two experiments were carried out on the resulting data set. In the first, the classification performance and processing time of Naïve Bayes and Support Vector Machines were analyzed both without feature selection and with Chi-Square and Information Gain feature selection; the cancer type of each article was identified with quite high success rates. In the second, key words were removed from the texts to make the data set harder to classify, and the same methods were trained and tested on this second version of the data set. The results show that the removed key words play a key role in classification performance. In both cases, the proposed method achieved reasonable classification performance.
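
The feature selection and classification steps named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the four-document toy corpus, all function names, and the choice of scoring a term by its maximum chi-square value over the classes are assumptions made for illustration. Only a multinomial Naïve Bayes classifier with Laplace smoothing is shown; the Support Vector Machine is omitted.

```python
import math
from collections import Counter

# Hypothetical toy corpus: (tokens of an abstract, cancer-type label).
DOCS = [
    ("breast tumor mammography screening".split(), "breast"),
    ("breast carcinoma estrogen receptor".split(), "breast"),
    ("lung tumor smoking nodule".split(), "lung"),
    ("lung carcinoma smoking bronchial".split(), "lung"),
]

def chi_square(term, label, docs):
    """Chi-square statistic of the 2x2 table (term present?) x (class == label?)."""
    a = sum(1 for toks, y in docs if term in toks and y == label)
    b = sum(1 for toks, y in docs if term in toks and y != label)
    c = sum(1 for toks, y in docs if term not in toks and y == label)
    d = sum(1 for toks, y in docs if term not in toks and y != label)
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

def max_chi_square(term, docs):
    """Score a term by its largest chi-square value over all classes (an assumption)."""
    return max(chi_square(term, y, docs) for y in {y for _, y in docs})

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def information_gain(term, docs):
    """Reduction in class entropy from knowing whether the term occurs."""
    labels = sorted({y for _, y in docs})
    base = entropy([sum(1 for _, y in docs if y == l) for l in labels])
    with_t = [y for toks, y in docs if term in toks]
    without_t = [y for toks, y in docs if term not in toks]
    cond = 0.0
    for part in (with_t, without_t):
        if part:
            cond += len(part) / len(docs) * entropy([part.count(l) for l in labels])
    return base - cond

def select_top_k(docs, k, score):
    """Keep the k highest-scoring terms; ties are broken alphabetically."""
    vocab = sorted({t for toks, _ in docs for t in toks})
    return sorted(vocab, key=lambda t: (-score(t, docs), t))[:k]

def train_nb(docs, vocab):
    """Collect class priors and per-class term counts over the selected terms."""
    vocab = set(vocab)
    prior = Counter(y for _, y in docs)
    counts = {y: Counter() for y in prior}
    for toks, y in docs:
        counts[y].update(t for t in toks if t in vocab)
    return prior, counts, vocab

def predict_nb(model, tokens):
    """Return the class with the highest Laplace-smoothed log posterior."""
    prior, counts, vocab = model
    n_docs = sum(prior.values())
    best, best_lp = None, -math.inf
    for y in sorted(prior):
        total = sum(counts[y].values())
        lp = math.log(prior[y] / n_docs)
        for t in tokens:
            if t in vocab:  # terms dropped by feature selection are ignored
                lp += math.log((counts[y][t] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = y, lp
    return best

if __name__ == "__main__":
    selected = select_top_k(DOCS, 3, max_chi_square)
    model = train_nb(DOCS, selected)
    print(selected, predict_nb(model, "smoking related lung lesion".split()))
```

Passing `information_gain` instead of `max_chi_square` to `select_top_k` switches the selection criterion without changing the classifier. The second experiment in the abstract could be imitated, under the same assumptions, by deleting chosen key words from every token list before scoring and training.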
