Eşanlatım tespitinde eminlik faktörü modeli

Bu makalede, eşanlatımlı cümle çiftlerinin belirlenmesindeki belirsizlik problemi üzerinde durulmuştur. Eşanlatım cümleleri basitçe aynı olay ve/veya fikri farklı sözcük veya sözcüklerin farklı dizilişleri ile ifade eden cümle çiftleri/kümeleridir. Çalışmada eşanlatım tespitinde eminlik faktörü (EF) modelinin kullanılması önerilmiştir. EF modelinde kullanılmak üzere filtreleme yöntemi ile eşanlatım tespitinde başarılı olan öznitelikler (jenerik ve uzaklık tabanlı öznitelikler) belirlenmiş ve bu öznitelikler kümesi EF modelinde deliller olarak kullanılmıştır. EF modeli Microsoft Eşanlatım derlemi üzerinde F1 ve doğruluk ölçekleri ile sınanmıştır. Yöntemin başarımı Bayes karar verme yaklaşımı ile kıyaslanmıştır. Deney sonuçları EF modelinin eşanlatım tespitinde Bayes modeline bir alternatif yöntem olduğunu göstermiştir.

Anahtar Kelimeler:

Eşanlatım, Eşanlatım tespiti, Eminlik faktörü, Delil, Delil seçimi

Certainty factor model in paraphrase detection

In this paper, we address the problem of managing uncertainty in the identification of paraphrase sentence pairs. Paraphrase sentences are, simply put, pairs or sets of sentences that express the same event and/or idea using different words or a different word order. We propose using the certainty factor (CF) model for paraphrase detection. Features that perform well in paraphrase detection (generic and distance-based features) are selected by filtering, and this feature set is used as evidence in the CF model. The CF model is evaluated with the F1 and accuracy measures on the Microsoft Research Paraphrase corpus, and its performance is compared with that of the well-known Bayesian reasoning approach. The experimental results show that the CF model is an alternative to the Bayes model for paraphrase detection.

Keywords:

Paraphrase, Paraphrase detection, Certainty factor, Evidence, Evidence selection
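
As a rough illustration of how a certainty factor model can pool several pieces of evidence into one decision, the sketch below uses the classical MYCIN-style combination rule from the expert systems literature. It is not taken from the paper: the function names, the decision threshold, and the example CF values are hypothetical, and the paper's actual mapping from feature values to certainty factors is not reproduced here.

```python
from functools import reduce


def combine_cf(cf1: float, cf2: float) -> float:
    """Combine two certainty factors in [-1, 1] with the MYCIN-style rule."""
    if cf1 >= 0 and cf2 >= 0:
        # Two supporting pieces of evidence reinforce each other.
        return cf1 + cf2 * (1 - cf1)
    if cf1 < 0 and cf2 < 0:
        # Two opposing pieces of evidence reinforce the disbelief.
        return cf1 + cf2 * (1 + cf1)
    # Mixed signs: conflicting evidence partially cancels out.
    denom = 1 - min(abs(cf1), abs(cf2))
    if denom == 0:
        raise ValueError("fully conflicting evidence (+1 and -1) cannot be combined")
    return (cf1 + cf2) / denom


def is_paraphrase(evidence_cfs, threshold=0.0):
    """Fold all evidence CFs into one value and compare it with a threshold.

    The threshold of 0.0 is an assumption made for this sketch.
    """
    total = reduce(combine_cf, evidence_cfs, 0.0)
    return total > threshold, total


# Hypothetical CFs derived from three features of one sentence pair.
decision, total_cf = is_paraphrase([0.5, 0.5, -0.25])
print(decision, round(total_cf, 3))
```

Starting the fold from 0.0 leaves the first certainty factor unchanged, so the order of the remaining evidence only matters through the combination rule itself.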
