Gender Prediction from Social Media Comments with Machine Learning

In the 21st century, which can be called the age of artificial intelligence, machine learning (ML) techniques that spread widely and improve themselves can deliver higher-quality services to humanity in many fields. As a result of these developments, many companies now use predictive models to estimate customer behavior. With the increasing use of social media, companies have also started to deliver their products and services to customers through social media accounts. However, not every customer is interested in every product or service; each customer's areas of interest differ, and gender is one of the main reasons for this difference. If the gender of a social media user is determined correctly, sales may be increased by offering appropriate products or services. The main aim of our study is to predict the gender of commenters with machine learning techniques by analyzing comments on companies' Facebook posts. In the study, commenters' genders were labelled based on their names. The data set was divided into 70% training data and 30% test data. The machine learning methods achieved similar accuracy rates, and the highest accuracy (74.13%) was obtained with logistic regression.
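The labelling and splitting steps described above can be illustrated with a minimal Python sketch. It is not the study's actual code: the name lookup table `NAME_TO_GENDER`, the helper functions `label_by_name` and `split_70_30`, and the toy comments are all hypothetical and exist only to show the idea of name-based labelling followed by a 70-30 split.

```python
import random

# Hypothetical first-name lookup (an assumption for illustration);
# a real study would need a far larger, curated name list.
NAME_TO_GENDER = {
    "anna": "female", "maria": "female", "emma": "female",
    "john": "male", "david": "male", "peter": "male",
}

def label_by_name(full_name):
    """Return a gender label from the commenter's first name, or None if unknown."""
    parts = full_name.strip().lower().split()
    if not parts:
        return None
    return NAME_TO_GENDER.get(parts[0])

def split_70_30(records, seed=0):
    """Shuffle a copy of the records and split it into 70% training, 30% test."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(0.7 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# Toy (commenter name, comment text) pairs, invented for this sketch.
comments = [
    ("Anna Smith", "Love this product"),
    ("John Brown", "When is the next sale"),
    ("Maria Lopez", "Great colors"),
    ("David Jones", "Too expensive"),
    ("Emma Clark", "Ordered two of these"),
    ("Peter White", "Does it ship abroad"),
    ("Anna Young", "Beautiful design"),
    ("John Hall", "Battery life is poor"),
    ("Maria King", "Perfect gift"),
    ("David Scott", "Works as described"),
    ("Qwerty User", "No usable first name here"),
]

# Label each comment by its author's first name; drop commenters whose
# name is not in the lookup, since they cannot be labelled this way.
labelled = [(text, label_by_name(name)) for name, text in comments]
labelled = [(text, gender) for text, gender in labelled if gender is not None]

train, test = split_70_30(labelled)
print(len(train), "training examples,", len(test), "test examples")
```

In a full pipeline, the texts in `train` would be converted to features and passed to a classifier such as logistic regression, with accuracy measured on `test`.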
