A comparison of text similarity detection software for Turkish documents and an investigation of the effects of stemming and Turkish character usage

The growing amount of information available on the Web and the widespread use of the Internet and information technologies have increased the incidence of plagiarism in almost every field. In academia, for example, some students plagiarize in various ways on the assignments given by their instructors. Some present others' work as their own without making any changes or citing the owner, while others submit someone else's work after making only minor changes. In this study, we compare the performance of the plagiarism detection tools CopyCatchGold, Sherlock, SIM, WCopyFind, and JPlag, the Text Matching System (Metin Eşleştirme Sistemi) and Document Similarity (Doküman Benzerliği) programs developed by the YTÜ Kemik Group, and our own implementations of the Cosine, Dice, and Jaccard text similarity measures on sample Turkish datasets. We also investigate the effects of Turkish character usage and stemming on plagiarism detection. We observed that using Turkish characters decreases similarity detection, whereas using word stems improves the performance of plagiarism detection tools.
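To make the three measures named in the abstract concrete, the following is a minimal Python sketch of Cosine, Dice, and Jaccard similarity over word tokens. It is not the authors' implementation: the tokenizer, the `fold_turkish` option, the `TR_TO_ASCII` mapping, and the sample sentences are all assumptions introduced here for illustration, and stemming is omitted because it would require a Turkish morphological analyzer.

```python
import math
import re
from collections import Counter

# Assumption: fold Turkish-specific letters to their closest ASCII letters.
TR_TO_ASCII = str.maketrans("çğıöşüÇĞİÖŞÜ", "cgiosuCGIOSU")


def tokenize(text, fold_turkish=False):
    """Split text into lowercase word tokens, optionally folding Turkish letters.

    Note: str.lower() is not Turkish-locale-aware (dotted/dotless I),
    so this sketch is only reliable for lowercase or folded input.
    """
    if fold_turkish:
        text = text.translate(TR_TO_ASCII)
    return re.findall(r"\w+", text.lower())


def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| over the sets of distinct tokens."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def dice(a, b):
    """2 |A ∩ B| / (|A| + |B|) over the sets of distinct tokens."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return 2 * len(sa & sb) / (len(sa) + len(sb))


def cosine(a, b):
    """Cosine of the angle between the two term-frequency vectors."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical pair: the same sentence typed with and without Turkish letters.
original = "öğrenci ödevi kopyaladı"
retyped = "ogrenci odevi kopyaladi"

for fold in (False, True):
    ta = tokenize(original, fold_turkish=fold)
    tb = tokenize(retyped, fold_turkish=fold)
    print(fold, jaccard(ta, tb), dice(ta, tb), cosine(ta, tb))
```

In this constructed pair no token matches until the Turkish letters are folded, after which the two token lists are identical. This only shows how character handling can change a token-based score; it does not reproduce the datasets or results of the study.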

