Reza JAVADZADEH, Morteza ZAHEDI, Marziea RAHIMI

Sentence similarity using weighted path and similarity matrices

Sentence similarity is the task of assessing how similar two snippets of text are. Similarity techniques are used extensively in clustering, summarization, classification, plagiarism detection, and related tasks. Because a sentence offers only a small set of words to compare, sentence similarity is considered a difficult problem in natural language processing. Solving it raises two issues: (1) which similarity technique to use for word pairs, and (2) how to generalize word-pair similarity to sentence pairs. We used the weighted path, a WordNet-based similarity assessment, together with the paraphrase database to obtain word-pair similarity values. We then extracted the maximum values from the pairwise similarity matrix and computed a similarity value for the sentence pair. We also incorporated a vector space model technique to form a robust similarity measure. On the STSS65 test dataset, our method outperformed state-of-the-art methods, reaching a Pearson's correlation of 87% with human similarity scores. On the STSS131 test data, our approach performed on par with other methods under the same measure. Our approach outperforms all the other WordNet-based methods compared, on both datasets.
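The step from a pairwise word similarity matrix to a single sentence score can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state how the extracted maxima are combined, so averaging the row and column maxima is an assumption here, and `toy_word_sim` with its `TOY_SCORES` table is a hypothetical stand-in for the weighted path and paraphrase database lookups. The vector space model component is left out.

```python
from typing import Callable, Dict, FrozenSet, List

# Hypothetical word-pair scores standing in for the WordNet weighted path
# and paraphrase database lookups; the values are invented for illustration.
TOY_SCORES: Dict[FrozenSet[str], float] = {
    frozenset(("car", "automobile")): 0.9,
    frozenset(("boy", "lad")): 0.8,
}


def toy_word_sim(a: str, b: str) -> float:
    """Return 1.0 for identical words, a table score if known, else 0.0."""
    if a == b:
        return 1.0
    return TOY_SCORES.get(frozenset((a, b)), 0.0)


def similarity_matrix(
    s1: List[str], s2: List[str], word_sim: Callable[[str, str], float]
) -> List[List[float]]:
    """Build the pairwise matrix: rows are words of s1, columns words of s2."""
    return [[word_sim(a, b) for b in s2] for a in s1]


def sentence_similarity(
    s1: List[str],
    s2: List[str],
    word_sim: Callable[[str, str], float] = toy_word_sim,
) -> float:
    """Average the best match of every word in both directions (assumed rule)."""
    if not s1 or not s2:
        return 0.0
    matrix = similarity_matrix(s1, s2, word_sim)
    row_max = [max(row) for row in matrix]        # best match for each word of s1
    col_max = [max(col) for col in zip(*matrix)]  # best match for each word of s2
    return (sum(row_max) + sum(col_max)) / (len(s1) + len(s2))


if __name__ == "__main__":
    print(sentence_similarity(["the", "car"], ["the", "automobile"]))
```

Taking maxima in both directions keeps the score symmetric in the two sentences; a real word-pair measure can be passed in through the `word_sim` parameter.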

Keywords:

Sentence similarity, plagiarism detection, text mining, vector space model, paraphrase database


Turkish Journal of Electrical Engineering and Computer Science

ISSN: 1300-0632
Publication frequency: 6 issues per year
Publisher: TÜBİTAK
