The impact of term weighting method on Twitter sentiment analysis

Term weighting is an important step that directly affects the results of text classification. However, in sentiment analysis, which is treated as a text classification task, the behavior of a term weighting method may vary depending on the preprocessing techniques applied. In this study, term weighting methods recently proposed for various research areas, such as information retrieval, text classification, and document filtering, are applied to Twitter sentiment analysis, and their effect on the results is investigated. In the feature extraction phase, two different models are used: Bag of Words (BoW) and character-level N-grams. The experiments are conducted on data sets consisting of Turkish and English Twitter messages. Sentiment classification of the Twitter messages is performed using a topic model based on Latent Dirichlet Allocation (LDA). The Support Vector Machine (SVM) algorithm is employed in the classification stage. Based on the experimental results, the most effective term weighting method for Twitter sentiment analysis studies is suggested.

Keywords:

Twitter, Sentiment analysis, Term weighting
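
To make the notions of term weighting and the two feature models concrete, the following is a minimal Python sketch. It uses classical TF-IDF as the weighting scheme, which is an assumption chosen for illustration and is not necessarily one of the methods evaluated in the study. The function names (`char_ngrams`, `bag_of_words`, `tfidf`) and the sample tweets are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a lowercased string."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def bag_of_words(text):
    """Return lowercase whitespace-separated tokens (no stemming or filtering)."""
    return text.lower().split()


def tfidf(docs, extract):
    """Weight each term by raw term frequency times log(N / document frequency).

    Classical TF-IDF is an illustrative assumption here, not the weighting
    method recommended by the study.
    """
    term_lists = [extract(doc) for doc in docs]
    n_docs = len(docs)

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for terms in term_lists:
        df.update(set(terms))

    weighted = []
    for terms in term_lists:
        tf = Counter(terms)
        weighted.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weighted


# Hypothetical sample messages, invented for this example.
tweets = ["great phone love it", "awful phone hate it", "love love this"]
word_weights = tfidf(tweets, bag_of_words)   # BoW features
ngram_weights = tfidf(tweets, char_ngrams)   # character 3-gram features
```

Each resulting dictionary maps a feature (a word or a character 3-gram) to its weight in one message. In a full pipeline, these weighted vectors would be the input to the classifier, and swapping the body of `tfidf` is where an alternative weighting scheme would be plugged in.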
