Using latent semantic analysis for automated keyword extraction from large document corpora

In this study, we describe a keyword extraction technique that uses latent semantic analysis (LSA) to identify semantically important single-topic words, or keywords. We compare our method against two other automated keyword extractors, TF-IDF (term frequency-inverse document frequency) and MetaMap, using human-annotated keywords as a reference. Our results suggest that the LSA-based keyword extraction method performs comparably to the other techniques. Therefore, in an incremental update setting, the LSA-based method may be preferable to existing keyword extraction methods for extracting keywords from text descriptions in big data.
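
For readers unfamiliar with the term frequency-inverse document frequency baseline named above, the following is a minimal sketch of how such a keyword extractor might score terms. It is not the authors' implementation: the tokenizer, the unsmoothed weighting tf × log(N/df), the function name `tfidf_keywords`, and the toy documents are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens (an assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_keywords(docs, top_n=3):
    """Return the top_n highest-scoring TF-IDF terms for each document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: the number of documents that contain each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    results = []
    for tokens in tokenized:
        tf = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        scores = {
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # Sort by descending score; break ties alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        results.append(ranked[:top_n])
    return results


# Hypothetical toy corpus, loosely styled after short assay descriptions.
docs = [
    "kinase inhibitor assay measures kinase activity",
    "cell viability assay measures cell growth",
    "binding assay measures receptor binding",
]
for keywords in tfidf_keywords(docs, top_n=1):
    print(keywords)
```

Terms that occur in every document ("assay", "measures") receive a weight of zero under this scheme, so the terms that distinguish each document rise to the top. An LSA-based extractor would instead factor the term-document matrix and rank terms by their weight in the leading latent dimensions.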
