Key word extraction for short text via word2vec, doc2vec, and textrank

The rapid development of social media encourages people to share their opinions and feelings on the Internet. Every day, a large number of short text comments are generated through Twitter, microblogging, WeChat, etc., and there is high commercial and social value in extracting useful information from these short texts. At present, most studies have focused on extracting text key words. For example, the LDA topic model performs well on long texts, but it loses effectiveness on short texts because of noise and sparsity. In this paper, we attempt to use Word2Vec and Doc2Vec to improve short-text key word extraction. We first added collaborative training of word vectors and paragraph vectors and then used the TextRank model’s clustering nodes. We adjusted the weights of the generated key words by computing the jump probability between nodes, obtained the node-weighted score, and finally ranked the generated key words. The experimental results show that the improved method performs well on the datasets.
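To make the idea of a jump probability between nodes concrete, the following is a minimal sketch of a similarity-weighted TextRank. It is not the authors' implementation. It assumes that words within a small window form graph edges, that each edge weight is the cosine similarity of the two word vectors, and that the jump probability from one node to a neighbor is that edge weight divided by the node's total outgoing weight. The function names (`cosine`, `weighted_textrank`), the parameters, and the tiny hand-made vectors are hypothetical; in practice the vectors would come from a trained Word2Vec or Doc2Vec model.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def weighted_textrank(tokens, vectors, window=2, damping=0.85, iterations=50):
    """Rank words with TextRank, using vector similarity as edge weight."""
    # Undirected co-occurrence graph: words at most `window` positions apart
    # share an edge whose weight accumulates their (non-negative) similarity.
    graph = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other == word:
                continue
            sim = max(cosine(vectors[word], vectors[other]), 0.0)
            graph[word][other] += sim
            graph[other][word] += sim

    nodes = sorted(set(tokens))
    out_weight = {n: sum(graph[n].values()) for n in nodes}
    scores = {n: 1.0 for n in nodes}

    for _ in range(iterations):
        updated = {}
        for n in nodes:
            # w / out_weight[m] is the jump probability from neighbor m to n.
            incoming = sum(
                scores[m] * w / out_weight[m]
                for m, w in graph[n].items()
                if out_weight[m] > 0
            )
            updated[n] = (1.0 - damping) + damping * incoming
        scores = updated

    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical 3-dimensional vectors standing in for learned embeddings.
vectors = {
    "phone": [0.9, 0.1, 0.2],
    "battery": [0.8, 0.3, 0.1],
    "life": [0.4, 0.6, 0.2],
    "great": [0.1, 0.2, 0.9],
    "charge": [0.7, 0.4, 0.1],
    "fast": [0.2, 0.3, 0.8],
}
tokens = ["phone", "battery", "life", "great", "battery", "charge", "fast"]

ranked = weighted_textrank(tokens, vectors)
for word, score in ranked[:3]:
    print(f"{word}\t{score:.3f}")
```

The update rule follows the weighted form of TextRank, so replacing the similarity with a constant weight of 1 recovers the plain co-occurrence variant. How the paper combines word and paragraph vectors and clusters the nodes is not specified in the abstract and is left out of this sketch.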
