Sentence similarity using weighted path and similarity matrices

Sentence similarity is the task of assessing how similar two snippets of text are. Similarity techniques are used extensively in clustering, summarization, classification, plagiarism detection, etc. Because a sentence offers only a small vocabulary to compare, sentence similarity is considered a difficult problem in natural language processing. There are two issues in solving this problem: (1) which similarity technique to use for word pair similarity and (2) how to generalize it to sentence pairs. We used the weighted path, a WordNet-based similarity assessment, and the paraphrase database to obtain word pair similarity values. We then extracted the maximum values from the pairwise similarity matrix and computed a similarity value for the sentence pair. We also incorporated a vector space model technique to form a robust similarity measure. On the STSS65 test dataset, our method outperformed state-of-the-art methods, reaching a Pearson's correlation of 87% with human similarity scores. Moreover, our approach performed on par with other methods on the STSS131 test data using the same test. Our approach outperforms all the other WordNet-based methods compared on both datasets.
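The matrix step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the extracted maxima are combined, so averaging the row and column maxima is an assumption made here, and `toy_word_sim` with its `TOY_SCORES` table is a hypothetical stand-in for the weighted path and paraphrase database scores.

```python
from typing import Callable, List, Sequence

WordSim = Callable[[str, str], float]

# Hypothetical word-pair scores standing in for a WordNet-based or
# paraphrase-database measure; these values are invented for illustration.
TOY_SCORES = {
    frozenset(("car", "automobile")): 0.9,
    frozenset(("fast", "quick")): 0.8,
}


def toy_word_sim(a: str, b: str) -> float:
    """Return 1.0 for identical words, a table score if known, else 0.0."""
    if a == b:
        return 1.0
    return TOY_SCORES.get(frozenset((a, b)), 0.0)


def similarity_matrix(s1: Sequence[str], s2: Sequence[str],
                      word_sim: WordSim) -> List[List[float]]:
    """Pairwise word similarities: one row per word of s1, one column per word of s2."""
    return [[word_sim(a, b) for b in s2] for a in s1]


def sentence_similarity(s1: Sequence[str], s2: Sequence[str],
                        word_sim: WordSim) -> float:
    """Average of the best match found for every word in both sentences.

    Assumption: the mean of row and column maxima is used as the
    sentence-level score; the source only says maxima are extracted.
    """
    if not s1 or not s2:
        return 0.0
    matrix = similarity_matrix(s1, s2, word_sim)
    row_max = [max(row) for row in matrix]
    col_max = [max(row[j] for row in matrix) for j in range(len(s2))]
    return (sum(row_max) + sum(col_max)) / (len(row_max) + len(col_max))


if __name__ == "__main__":
    a = ["car", "fast"]
    b = ["automobile", "quick"]
    print(round(sentence_similarity(a, b, toy_word_sim), 2))
```

Taking maxima in both directions keeps the score symmetric, so swapping the two sentences gives the same value. Any real word-pair measure with the same signature could replace `toy_word_sim`.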
