BÜYÜK VERİDE METİN BENZERLİK ALGORİTMALARININ VERİ EŞLEME PERFORMANSLARININ KARŞILAŞTIRILMASI

Son yıllarda dünya turizmindeki büyük hareketlilik, bu sektörün büyük verinin çalışma alanları arasına girmesini sağlamıştır. Bu çalışmada farklı sağlayıcılardan gelen otel bilgilerinin, veritabanlarına farklı isim ve adreslerle girilmesi sonucu oluşan problemler için, büyük veri ve string similarity algoritmaları (SSA) kullanarak bir çözüm önerisi ortaya konulmuştur. Bunun için geniş bir otel ağına sahip bir turizm acentasının Londra’da bulunan 2599 oteli örneklem olarak seçilmiş ve bu oteller ile yetmiş farklı sağlayıcıdan gelen yaklaşık üç milyon otel bilgisinin eşleştirilmesi için, soundex algoritmasından faydalanılarak Map-Reduce işlemi gerçekleştirilmiştir. Map-Reduce ile eşleme işlem sayısı ve işlem süresinde önemli ölçüde azalma sağlanmıştır. Çalışmanın diğer aşamasında ise Dice coefficient, Levenshtein ve Longest common subsequence (LCS) algoritmaları, doğru eşleyebildikleri veri ve işlem süresi açısından kıyaslanmıştır. Bu aşamada algoritmalar uygulanmadan önce veri tabanında algoritmaların skorunu düşüren kelimeler tespit edilerek çıkartılmıştır. Doğru eşleme bakımından Dice coefficient algoritması, işlem süresi açısından ise Levenshtein algoritması daha iyi sonuçlar üretmiştir.

Anahtar Kelimeler:

Algoritmalar, Metin analizi, Doğal dil işleme, Veri analizi, Veri tabanları

COMPARISON OF THE DATA MATCHING PERFORMANCES OF STRING SIMILARITY ALGORITHMS IN BIG DATA

The surge in world tourism in recent years has brought the sector within the scope of big data research. This study proposes a solution, based on big data techniques and string similarity algorithms (SSA), to the problems that arise when hotel records from different providers are entered into databases under different names and addresses. To this end, 2599 London hotels belonging to a tourism agency with a wide hotel network were selected as the sample, and a Map-Reduce process that uses the Soundex algorithm was run to match these hotels against approximately three million hotel records from seventy different providers. Map-Reduce substantially reduced both the number of matching operations and the processing time. In the second stage of the study, the Dice coefficient, Levenshtein, and longest common subsequence (LCS) algorithms were compared in terms of the records they matched correctly and their processing time. Before the algorithms were applied, the words in the database that lowered their scores were identified and removed. The Dice coefficient algorithm yielded better results in correct matching, and the Levenshtein algorithm yielded better results in processing time.

Keywords:

Algorithms, Text analysis, Natural language processing, Data analysis, Databases
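
The pipeline summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the blocking key (Soundex of the first word of the hotel name), the `NOISE_WORDS` list, the use of character bigrams for the Dice coefficient, and all function names are assumptions made for illustration only. The sketch groups names by Soundex code, as a Map-Reduce job might, and then defines the three measures compared in the study.

```python
from collections import defaultdict

# American Soundex letter-to-digit table.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit

# Hypothetical list of score-lowering words; the study's actual list is not given.
NOISE_WORDS = {"hotel", "the", "london"}


def soundex(word):
    """Return the 4-character American Soundex code of a word."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "HW":  # H and W do not separate letters with equal codes
            prev = digit
    return (code + "000")[:4]


def group_by_key(names):
    """Map each name to a Soundex key (assumed: first word), then group by key."""
    blocks = defaultdict(list)
    for name in names:
        words = name.split()
        blocks[soundex(words[0]) if words else ""].append(name)
    return blocks


def normalize(name):
    """Lowercase a name and drop the assumed noise words."""
    return " ".join(w for w in name.lower().split() if w not in NOISE_WORDS)


def dice(a, b):
    """Dice coefficient over character bigram sets: 2|A & B| / (|A| + |B|)."""
    set_a = {a[i:i + 2] for i in range(len(a) - 1)}
    set_b = {b[i:i + 2] for i in range(len(b) - 1)}
    if not set_a and not set_b:
        return 1.0 if a == b else 0.0
    return 2 * len(set_a & set_b) / (len(set_a) + len(set_b))


def levenshtein(a, b):
    """Edit distance (insert, delete, substitute), computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]
```

Under these assumptions, only names that fall into the same Soundex block need to be compared pairwise, which is the source of the reduction in matching operations; for example, `group_by_key(["Savoy Hotel", "Savoi London", "Ritz"])` places the first two names in one block and "Ritz" in another.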

