A comparative study of author gender identification

In recent years, author gender identification has gained considerable attention in the fields of information retrieval and computational linguistics. In this paper, we employ and evaluate different learning approaches based on machine learning (ML) and neural network language models to address the problem of author gender identification. First, several ML classifiers are applied to the features obtained by bag-of-words. Second, the datasets are represented by low-dimensional real-valued vectors using Word2vec, GloVe, and Doc2vec, which perform on par with the bag-of-words ML classifiers in terms of accuracy. Lastly, two neural network architectures, the convolutional neural network and the recurrent neural network, are trained and their performance is assessed. A variety of experiments are conducted, covering issues such as the effects of the number of dimensions, training architecture type, and corpus size. The main contribution of the study is to identify author gender by applying word embeddings and deep learning architectures to the Turkish language.
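
As a rough illustration of the two text representations named in the abstract, the sketch below builds bag-of-words count vectors and averaged word-embedding vectors in plain Python. It is not the authors' pipeline: the function names, the toy documents, and the two-dimensional embedding values are all hypothetical, and averaging word vectors is only one simple way to turn word embeddings into a document vector.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, count vectors) for a list of tokenized documents."""
    vocab = sorted({tok for doc in docs for tok in doc})
    index = {tok: i for i, tok in enumerate(vocab)}
    vectors = []
    for doc in docs:
        vec = [0] * len(vocab)
        for tok, n in Counter(doc).items():
            vec[index[tok]] = n
        vectors.append(vec)
    return vocab, vectors

def average_embedding(doc, embeddings, dim):
    """Average the vectors of known tokens; return a zero vector if none are known."""
    known = [embeddings[tok] for tok in doc if tok in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

# Toy, made-up data: two tokenized documents and hypothetical 2-d word vectors.
docs = [["bu", "kitap", "iyi"], ["bu", "film", "iyi", "iyi"]]
toy_embeddings = {"kitap": [1.0, 0.0], "film": [0.0, 1.0], "iyi": [1.0, 1.0]}

vocab, bow = bag_of_words(docs)  # sparse, vocabulary-sized representation
dense = [average_embedding(d, toy_embeddings, dim=2) for d in docs]  # dense, low-dimensional
```

Either set of vectors could then be passed to a classifier. The bag-of-words vectors grow with the vocabulary, while the embedding-based vectors keep a fixed, small dimension regardless of vocabulary size.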
