Text Mining-Based Identification of Socio-economic and Epidemiological Topics from News Texts

With advances in information technology, a substantial volume of text documents is shared on the Web on many subjects, including socio-economic and epidemiological topics. Text-based content shared on various Internet platforms, such as news articles, disease reports and news bulletins, is also an important source of information for the early detection of emerging infectious disease outbreaks. This information is likewise critical for developing web-based biosurveillance systems. The steadily growing number of news articles published on the Web makes it harder to use these sources to anticipate diseases, epidemics and socio-economic factors. An effective web-based biosurveillance system therefore requires text mining and machine learning-based systems that assign news texts to the appropriate topics quickly and with high performance. In this study, the performance of basic machine learning classification algorithms, classifier ensemble architectures and basic text representation methods is compared on a corpus of news texts about African swine fever (ASF), a viral disease of animals, and about socio-economic topics. To represent the news texts, three basic n-gram models (1-gram, 2-gram and 3-gram) were combined with term frequency, term presence and TF-IDF term weighting, yielding nine text representations in total. These representations were evaluated with four basic classification algorithms: Naive Bayes, support vector machines, k-nearest neighbors and logistic regression. In addition, the effectiveness of ensemble learning methods in identifying socio-economic and epidemiological topics from news texts was analyzed using bagging, boosting, the random subspace method and majority voting. Among the base classifiers used in the experiments, Naive Bayes achieved the highest performance; among the ensemble learning architectures, the highest performance was obtained by combining the random subspace method with Naive Bayes. The experimental results indicate that text mining and machine learning methods are suitable for the early detection of epidemic diseases.

Keywords:

text mining, machine learning, ensemble learning

Identification of Socio-economic and Epidemiological Issues from News Texts Based on Text Mining

With advances in information technology, a substantial volume of text documents is shared on the Web on many topics, including socio-economic and epidemiological issues. Text-based posts shared on various Internet platforms, such as news articles, disease reports and news bulletins, are also important sources of information for the early detection of emerging infectious disease outbreaks. This information is also critical for the development of web-based biosurveillance systems. The continuous increase in the number of news articles published on the Web makes it difficult to use these sources to anticipate diseases, epidemics and socio-economic factors. Therefore, an effective web-based biosurveillance system requires text mining and machine learning-based systems that assign news texts to the appropriate topics quickly and with high predictive performance. In this study, the performance of conventional machine learning classifiers, ensemble learning architectures and conventional text representation methods was evaluated comparatively on a corpus of news texts about African swine fever (ASF), a viral disease of animals, and about socio-economic issues. Nine text representations were obtained by combining three basic n-gram models (1-gram, 2-gram and 3-gram) with three term weighting schemes (term frequency, term presence and TF-IDF). These representations were evaluated using four basic classification algorithms, namely Naive Bayes, support vector machines, k-nearest neighbors and logistic regression. In addition, the predictive performance of ensemble learning methods (namely bagging, boosting, the random subspace method and majority voting) was evaluated on the identification of socio-economic and epidemiological issues from news texts. Among the base classifiers used in the experimental analysis, the highest performance was obtained with Naive Bayes; among the ensemble learning architectures, the highest performance was obtained by using the random subspace method with Naive Bayes. The experimental results indicate that text mining and machine learning methods are suitable for the early detection of epidemics.
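The nine text representations described above (three n-gram sizes crossed with three term weighting schemes) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the helper names `ngrams` and `represent`, the whitespace tokenization, the three toy documents, and the unsmoothed TF-IDF variant `tf * log(N / df)` are all assumptions made for the example, since the abstract does not specify them.

```python
import math
from collections import Counter


def ngrams(text, n):
    """Return the word n-grams of a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def represent(docs, n, weighting):
    """Map each document to a sparse {term: weight} dictionary.

    weighting is "tf" (raw count), "tp" (term presence, 0/1)
    or "tfidf" (count * log(N / document frequency), an assumed variant).
    """
    counts = [Counter(ngrams(doc, n)) for doc in docs]
    if weighting == "tf":
        return [dict(c) for c in counts]
    if weighting == "tp":
        return [{term: 1 for term in c} for c in counts]
    if weighting == "tfidf":
        # Document frequency: number of documents containing each term.
        df = Counter(term for c in counts for term in c)
        n_docs = len(docs)
        return [
            {term: freq * math.log(n_docs / df[term]) for term, freq in c.items()}
            for c in counts
        ]
    raise ValueError(f"unknown weighting: {weighting}")


# Toy documents, invented for illustration only.
docs = [
    "african swine fever outbreak reported in pigs",
    "pork prices rise after swine fever outbreak",
    "new trade policy affects pork prices",
]

# 3 n-gram sizes x 3 weighting schemes = 9 representations.
representations = {
    (n, w): represent(docs, n, w)
    for n in (1, 2, 3)
    for w in ("tf", "tp", "tfidf")
}
print(len(representations))  # 9
```

Each of the nine representations would then be passed to a classifier; sparse dictionaries are used here only to keep the sketch dependency-free.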

Keywords:

text mining, machine learning, ensemble learning
