A comparative study of author gender identification
In recent years, author gender identification has gained considerable attention in the fields of information retrieval and computational linguistics. In this paper, we employ and evaluate different learning approaches based on machine learning (ML) and neural network language models to address the problem of author gender identification. First, several ML classifiers are applied to features obtained by bag-of-words. Second, the datasets are represented as low-dimensional real-valued vectors using Word2vec, GloVe, and Doc2vec; these representations are on par with the ML classifiers in terms of accuracy. Lastly, two neural network architectures, the convolutional neural network and the recurrent neural network, are trained and their performance is assessed. A variety of experiments are conducted that consider different issues, such as the effects of the number of dimensions, the training architecture type, and the corpus size. The main contribution of the study is to identify author gender by applying word embeddings and deep learning architectures to the Turkish language.
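As an illustration of the first approach only (a classifier applied to bag-of-words features), the following is a minimal sketch. The choice of multinomial naive Bayes with add-one smoothing, the whitespace tokenizer, the name `NaiveBayes`, and the toy tokens and labels are all assumptions made for illustration; the abstract does not state which classifiers or preprocessing steps were used.

```python
import math
from collections import Counter, defaultdict


def bag_of_words(text):
    """Lowercase the text, split on whitespace, and count each token."""
    return Counter(text.lower().split())


class NaiveBayes:
    """Hypothetical multinomial naive Bayes over bag-of-words counts."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # token counts per class
        self.vocab = set()
        for doc, label in zip(docs, labels):
            bow = bag_of_words(doc)
            self.word_counts[label].update(bow)
            self.vocab.update(bow)
        return self

    def predict(self, doc):
        bow = bag_of_words(doc)
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            total = sum(self.word_counts[label].values())
            score = math.log(count / n_docs)  # log prior
            for word, freq in bow.items():
                if word not in self.vocab:
                    continue  # ignore tokens never seen in training
                # Add-one (Laplace) smoothed log likelihood.
                p = (self.word_counts[label][word] + 1) / (total + len(self.vocab))
                score += freq * math.log(p)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy placeholder data: the tokens and labels are invented, not from the paper.
train_docs = ["w1 w2 w2 w3", "w1 w2 w4", "w5 w6 w6 w3", "w5 w6 w7"]
train_labels = ["F", "F", "M", "M"]

model = NaiveBayes().fit(train_docs, train_labels)
print(model.predict("w2 w2 w1"))
print(model.predict("w6 w5 unseen"))
```

In the embedding-based setting described in the abstract, the count vector would be replaced by a dense document vector; that step is not shown here.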