Sentence similarity using weighted path and similarity matrices
Sentence similarity is the task of assessing how similar two snippets of text are. Similarity techniques are used extensively in clustering, summarization, classification, plagiarism detection, etc. Because a sentence offers only a small vocabulary, sentence similarity is considered a difficult problem in natural language processing. There are two issues in solving this problem: (1) which similarity technique to use for word pair similarity and (2) how to generalize that to sentence pairs. We used the weighted path, a WordNet-based similarity assessment, and the paraphrase database to obtain word pair similarity values. We then extracted the maximum values from the pairwise similarity matrix and computed a similarity value for each sentence pair. We also incorporated a vector space model technique to form a robust similarity measure. Our method outperformed state-of-the-art methods on the STSS65 test dataset, reaching a Pearson’s correlation of 87% with human similarity scores. Moreover, our approach performed on par with other methods on the STSS131 test data using the same test. Our approach outperforms all the other WordNet-based methods compared on both datasets.
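The pipeline described above (a pairwise word similarity matrix, extraction of maximum values, and a vector space component) can be illustrated with a minimal sketch. This is not the paper's exact formulation: the aggregation rule (mean of row and column maxima), the bag-of-words cosine, the mixing weight `alpha`, and the toy `toy_word_sim` lookup standing in for the weighted path and paraphrase database scores are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical word-pair scores; in practice these would come from a
# WordNet-based measure or a paraphrase database (assumption).
TOY_SCORES = {frozenset(("boy", "lad")): 0.8}

def toy_word_sim(a, b):
    """Placeholder word similarity: 1.0 for identical words, else a table lookup."""
    if a == b:
        return 1.0
    return TOY_SCORES.get(frozenset((a, b)), 0.0)

def max_matrix_similarity(tokens_a, tokens_b, word_sim):
    """Build the pairwise matrix, then average its row and column maxima."""
    if not tokens_a or not tokens_b:
        return 0.0
    matrix = [[word_sim(a, b) for b in tokens_b] for a in tokens_a]
    row_max = [max(row) for row in matrix]        # best match for each word in A
    col_max = [max(col) for col in zip(*matrix)]  # best match for each word in B
    return (sum(row_max) + sum(col_max)) / (len(row_max) + len(col_max))

def cosine_similarity(tokens_a, tokens_b):
    """Bag-of-words cosine, a simple stand-in for the vector space model."""
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * \
           math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

def sentence_similarity(tokens_a, tokens_b, word_sim, alpha=0.5):
    """Linear mix of the two scores; alpha is an illustrative assumption."""
    return (alpha * max_matrix_similarity(tokens_a, tokens_b, word_sim)
            + (1 - alpha) * cosine_similarity(tokens_a, tokens_b))

s1 = ["a", "boy", "plays"]
s2 = ["a", "lad", "plays"]
print(round(sentence_similarity(s1, s2, toy_word_sim), 3))
```

Taking maxima in both directions keeps the score symmetric in the two sentences; other aggregation choices (for example, a one-to-one assignment between words) are equally plausible under the description given here.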