MongoDB'nin tam metin arama performansını iyileştirme yöntemi

MongoDB'de kullanılan B-Tree tabanlı metin dizinleri, ters çevrilmiş dizinler gibi farklı yapılara kıyasla yavaştır. Bu çalışmada, metindeki her farklı kelimenin yalnızca bir kez yer aldığı bir yapı indekslenerek tam metin arama hızının önemli ölçüde artırılabileceği gösterilmiştir. Daha önceki çalışmalarımızda geliştirilen Çok Akışlı Kelime Tabanlı Sıkıştırma Algoritması (MWCA), kelime sözlüklerini ve verileri farklı akışlarda saklar. Belgeler bir MongoDB koleksiyonuna eklenirken MWCA ile kodlanmış ve altı farklı akışa ayrılmıştır. Her akış farklı bir alan ismi ile saklanmış ve bunlardan benzersiz kelimeler içeren üçü metin dizini oluşturulurken kullanılmıştır. Bu sayede indeks daha kısa sürede oluşturulabilmiş ve daha az yer kaplamıştır. MongoDB‘de kullanılan Snappy ve Zlib blok sıkıştırma yöntemlerinin MWCA ile kodlanan veriler üzerinde daha yüksek sıkıştırma oranlarına ulaştığı da görülmüştür. Farklı yöntemler ile sıkıştırılan koleksiyonlar üzerinde oluşturulan metin dizinlerinde yapılan arama testleri, yöntemimizin 19 ila 146 kat hız artışı ve %34 ila %40 daha az bellek kullanımı sağladığını göstermiştir. Metin dizinini kullanmayan regex aramaları ile ilgili testler de MWCA modelinin 7 ila 13 kat hız artışı ve %29 ila %34 daha az bellek kullanımı sağladığını göstermiştir.

Anahtar Kelimeler:

NoSQL, MongoDB, Metin dizini, Tam metin arama, MWCA

A method to improve full-text search performance of MongoDB

B-Tree based text indexes used in MongoDB are slow compared to different structures such as inverted indexes. In this study, it has been shown that the full-text search speed can be increased significantly by indexing a structure in which each different word in the text is included only once. The Multi-Stream Word-Based Compression Algorithm (MWCA), developed in our previous work, stores word dictionaries and data in different streams. While adding the documents to a MongoDB collection, they were encoded with MWCA and separated into six different streams. Each stream was stored in a different field, and three of them containing unique words were used when creating a text index. In this way, the index could be created in a shorter time and took up less space. It was also seen that Snappy and Zlib block compression methods used by MongoDB reached higher compression ratios on data encoded with MWCA. Search tests on text indexes created on collections using different compression options shows that our method provides 19 to 146 times speed increase and 34% to 40% less memory usage. Tests on regex searches that do not use the text index also shows that the MWCA model provides 7 to 13 times speed increase and 29% to 34% less memory usage.

Keywords:

NoSQL, MongoDB, Text index, Full-Text search, MWCA,

PDF

___

[1] Qiao Y. An FPGA-Based Snappy Decompressor-Filter. MSc Thesis, Delft University of Technology, Delft, Netherlands, 2018.
[2] Deutsch P, Gailly JL. “Zlib compressed data format specification version 3.3”. RFC 1950, USA, 1996.
[3] Collet Y, Kucherawy M. “Zstandard compression and the application/zstd Media Type”. RFC 8478, USA, 2018.
[4] Öztürk E, Mesut A, Diri B. “Multi-Stream word-based compression algorithm for compressed text search”. Arabian Journal of Science and Engineering, 43(12), 8209–8221, 2018.
[5] Habib A, Islam MJ, Rahman MS. “A dictionary-based text compression technique using quaternary code”. Iran Journal of Computer Science, 3(3), 127–136, 2020.
[6] Rahman MA, Hamada M. “Burrows–Wheeler transform based lossless text compression using keys and huffman coding”. Symmetry, 12(10), 1654-1667, 2020.
[7] Mahmood MA, Hasan KMA. “Efficient compression scheme for large natural text using zipf distribution”. International Conference on Advances in Science, Engineering and Robotics Technology, Dhaka, Bangladesh, 3 May 2019.
[8] Bharathi K, Kumar H, Fairouz A, Al Kawam A, Khatri SP. “A plain-text incremental compression (pic) technique with fast lookup ability”. IEEE 36th International Conference on Computer Design, Orlando, FL, USA, 7-10 October 2018.
[9] Buluş HN, Carus A, Mesut A. “A new word-based compression model allowing compressed pattern matching”. Turkish Journal of Electrical Engineering & Computer Sciences, 25(5), 3607–3622, 2017.
[10] Morishima S, Matsutani H. “Performance evaluations of document-oriented databases using GPU and cache structure”. IEEE Trustcom/BigDataSE/ISPA, Helsinki, Finland, 20-22 August 2015.
[11] Kelec A, Dujlovic I, Obradovic N. “One approach for fulltext search of files in MongoDB based systems”. IEEE 18th International Symposium INFOTEH-JAHORINA, East Sarajevo, Bosnia & Herzegovina, 20-22 March 2019.
[12] Truica CO, Boicea A, Radulescu F. “Building an inverted index at the dbms layer for fast full text search”. Journal of Control Engineering and Applied Informatics, 19(1), 94-101, 2017.
[13] Zobel J, Moffat A. “Inverted files for text search engines”. ACM computing surveys (CSUR), 38(2), 1-56, 2006.
[14] Greca S, Kosta A, Maxhelaku S. “optimizing data retrieval by using mongodb with elasticsearch”. International Conference on Recent Trends and Applications in Computer Science and Information Technology, Tirana, Albania, 23-24 November 2018.
[15] Han L, Zhu L. “Design and implementation of elasticsearch for media data”. IEEE International Conference on Computer Engineering and Application, Guangzhou, China, 18-20 March 2020.
[16] Lu W, Zhu L, Duan S. “Research and implementation of big data system of social media”. IEEE/ACIS 17th International Conference on Computer and Information Science, Singapore, 6-8 June 2018.
[17] Mongodb, Inc. "Getting started with MongoDB Atlas FullText Search". https://www.mongodb.com/blog/post/ getting-started-with-mongodb-atlas-fulltext-search (10.07.2021).
[18] Eyada MM, Saber W, El Genidy MM, Amer F. “Performance evaluation of iot data management using mongodb versus MySQL databases in different cloud environments”. IEEE Access, 8, 110656-110668, 2020.
[19] Mehmood E, Anees T. “Performance analysis of not only SQL semi-stream join using MongoDB for real-time data warehousing”. IEEE Access, 7, 134215-134225, 2019.
[20] Nurseitov N, Paulson M, Reynolds R, Izurieta C. “Comparison of JSON and XML data interchange formats: a case study”. 22nd International Conference on Computer Applications in Industry and Engineering; San Francisco, CA, USA, 4-6 November 2009.
[21] Deutsch P. “DEFLATE Compressed Data Format Specification Version 1.3”. RFC 1951, USA, 1996.
[22] Deutsch P. “GZIP file Format Specification Version 4.3”. RFC 1952, USA, 1996.
[23] Duda J, Tahboub K, Gadgil NJ, Delp EJ. “The use of asymmetric numeral systems as an accurate replacement for Huffman coding”. IEEE Picture Coding Symposium, Cairns, QLD, Australia, 31 May-3 June 2015.
[24] Moffat A. “Word-based text compression”. Software: Practice and Experience, 19(2), 185-198, 1989.
[25] Heaps HS. Information Retrieval: Computational and Theoretical Aspects. 1st ed. Sea Harbor Drive Orlando, FL, USA, Academic Press, 1978.
[26] Alakuijala J, Farruggia A, Ferragina P, Kliuchnikov E, Obryk R, Szabadka Z, Vandevenne L. “Brotli: A general-purpose data compressor”. ACM Transactions on Information Systems, 37(1), 1-30, 2018.