A new compression algorithm for fast text search

A new compression algorithm for fast text search

We propose a new compression algorithm that compresses plain texts by using a dictionary-based model and a compressed string-matching approach that can be used with the compressed texts produced by this algorithm. The compression algorithm (CAFTS) can reduce the size of the texts to approximately 41% of their original sizes. The presented compressed string matching approach (SoCAFTS), which can be used with any of the known pattern matching algorithms, is compared with a powerful compressed string matching algorithm (ETDC) and a compressed string-matching tool (Lzgrep). Although the search speed of ETDC is very good in short patterns, it can only search for exact words and its compression performance differs from one natural language to another because of its word-based structure. Our experimental results show that SoCAFTS is a good solution when it is necessary to search for long patterns in a compressed document.

___

  • [1] Amir A, Benson G. Efficient two-dimensional compressed matching. In: Data Compression Conference; 24–27 March 1992; Snowbird, Utah, USA. Los Alamitos, CA, USA: IEEE. pp. 279-288.
  • [2] Manber U. A text compression scheme that allows fast searching directly in the compressed file. ACM T Inform Syst 1997; 15: 124-136.
  • [3] Moura ES, Navarro G, Ziviani N, Baeza-Yates R. Fast and flexible word searching on compressed text. ACM T Inform Syst 2000; 18: 113-139.
  • [4] Brisaboa NR, Iglesias ER, Navarro G, Param´a JR. An efficient compression code for text databases. In: European Conference on IR Research; 14–16 April 2003; Pisa, Italy. LNCS 2633, Berlin, Germany: Springer-Verlag. pp. 468-481.
  • [5] Brisaboa NR, Fari˜na A, Navarro G, Esteller MF. (S,C)-Dense Coding: an optimized compression code for natural language text databases. In: Symposium on String Processing and Information Retrieval; 8–10 October 2003; Manaus, Brazil. LNCS 2857, Berlin, Germany: Springer-Verlag. pp. 122-136.
  • [6] Brisaboa NR, Fari˜na A, Navarro G, Param´a JR. New adaptive compressors for natural language text. Software Pract Exper 2008; 38: 1429-1450.
  • [7] Brisaboa NR, Fari˜na A, Navarro G, Param´a JR. Dynamic lightweight text compression. ACM T Inform Syst 2010; 28: 1-32.
  • [8] Culpepper JS, Moffat A. Phrase-based pattern matching in compressed text. In: Symposium on String Processing and Information Retrieval; 11–13 October 2006; Glasgow, Scotland, UK. LNCS 4209, Berlin, Germany: SpringerVerlag. pp. 337-345.
  • [9] Ziv J, Lempel A. A universal algorithm for sequential data compression. IEEE T Inform Theory 1977; 23: 337-343.
  • [10] Ziv J, Lempel A. Compression of individual sequences via variable-rate coding. IEEE Inform Theory 1978; 24: 530-536.
  • [11] Storer JA, Szymanski TG. Data compression via textual substitution. J ACM 1982; 29: 928-951.
  • [12] Welch TA. Technique for high-performance data compression. IEEE Computer 1984; 17: 8-19.
  • [13] Farach M, Thorup M. String matching in LempelZiv compressed strings. In: ACM Symposium on Theory of Computing; 29 May–1 June 1995; Las Vegas, NV, USA. New York, NY, USA: ACM. pp. 703-712.
  • [14] Amir A, Benson G, Farach M. Let sleeping files lie: pattern matching in Z-compressed files. J Comput Syst Sci 1996; 52: 299-307.
  • [15] Kida T, Takeda M, Shinohara A, Miyazaki M, Arikawa S. Multiple pattern matching in LZW compressed text. In: Data Compression Conference; 30 March–1 April 1998; Snowbird, Utah, USA. Los Alamitos, CA, USA: IEEE. pp. 103-112.
  • [16] Tao T, Mukherjee A. Pattern matching in LZW compressed files. IEEE T Comput 2005; 54: 929-938.
  • [17] Aho AV, Corasick MJ. Efficient string matching: an aid to bibliographic search. Commun ACM 1975; 18: 333-340.
  • [18] Navarro G, Raffinot M. A general practical approach to pattern matching over Ziv-Lempel compressed text. In: Combinatorial Pattern Matching; 22–24 July 1999; Warwick University, UK. LNCS 1645, Berlin, Germany: Springer-Verlag. pp. 14-36.
  • [19] Boyer RS, Moore JS. A fast string searching algorithm. Commun ACM 1977; 20: 762-772.
  • [20] Navarro G, Tarhio J. LZgrep: A Boyer–Moore string matching tool for Ziv–Lempel compressed text. Software Pract Exper 2005; 35: 1107-1130.
  • [21] Klein ST, Shapira D. A new compression method for compressed matching. In: Data Compression Conference; 28–30 March 2000; Snowbird, Utah, USA. Los Alamitos, CA, USA: IEEE. pp. 400-409.
  • [22] Gage P. A new algorithm for data compression. C Users Journal 1994; 12: 23-28.
  • [23] Larsson NJ, Moffat A. Offline dictionary-based compression. P IEEE 2000; 88: 1722-1732.
  • [24] Shibata Y, Kida T, Fukamachi S, Takeda M, Shinohara A, Shinohara T, Arikawa S. Byte pair encoding: a text compression scheme that accelerates pattern matching. Technical Report DOI-TR-161; 1999; Dept. of Informatics, Kyushu University.
  • [25] Moffat A, Wan R. Re-Store: A system for compressing, browsing, and searching large documents. In: 8th International Symposium on String Processing and Information Retrieval; 13–15 November 2001; Laguna de San Rafael, Chile. Los Alamitos, CA, USA: IEEE. pp. 162-174.
  • [26] Mesut A, Carus A. ISSDC: Digram coding based lossless data compression algorithm. Comput Inform 2010; 29: 741-756.
  • [27] Carus A, Mesut A. Fast text compression using multiple static dictionaries. Information Technology Journal 2010; 9: 1013-1021.
  • [28] Allauzen C, Crochemore M, Raffinot M. Factor oracle: a new structure for pattern matching. In: 26th Conference on Current Trends in Theory and Practice of Informatics; 27 November–4 December 1999; Milovy, Czech Republic. LNCS 1725, Berlin, Germany: Springer-Verlag. pp. 295-310.
  • [29] Faro S, Lecroq T. Efficient variants of the Backward-Oracle-Matching algorithm. Int J Found Comput S 2009; 20: 967-984.
  • [30] Charras C, Lecroq T. Handbook of Exact String-Matching Algorithms. London, UK: King’s College Publications. 2004.
  • [31] Say B, Zeyrek D, Oflazer K, Ozge U. Development of a corpus and a treebank for present-day written Turkish. In: Eleventh International Conference of Turkish Linguistics; 7–9 August 2002; Gazima˘gusa, KKTC. Ankara, Turkey: METU. pp. 183-192.