A new word-based compression model allowing compressed pattern matching

A new word-based compression model allowing compressed pattern matching

In this study a new semistatic data compression model that has a fast coding process and that allows compressed pattern matching is introduced. The name of the proposed model is chosen as tagged word-based compression algorithm (TWBCA) since it has a word-based coding and word-based compressed matching algorithm. The model has two phases. In the rst phase a dictionary is constructed by adding a phrase, paying attention to word boundaries, and in the second phase compression is done by using codewords of phrases in this dictionary. The rst byte of the codeword determines whether the word is compressed or not. By paying attention to this rule, the CPM process can be conducted as word based. In addition, the proposed method makes it possible to also search for the group of consecutively compressed words. Any of the previous pattern matching algorithms can be chosen to use in compressed pattern matching as a black box. The duration of the CPM process is always less than the duration of the same process on the texts coded by Gzip tool. While matching longer patterns, compressed pattern matching takes more time on the texts coded by compress and end-tagged dense code (ETDC). However, searching shorter patterns takes less time on texts coded by our approach than the texts compressed with compress. Besides this, the compression ratio of our algorithm has a better performance against ETDC only on a le that has been written in Turkish. The compression performance of TWBCA is stable and does not vary over 6% on different textfiles.

___

  • [1] Shannon CE. A mathematical theory of communication. Bell System Technical Journal 1948; 27: 379-423.
  • [2] Crochemore M, Rytter W. Jewels of Stringology. Singapore: World Scienti c Pub. Co. Inc., 2002.
  • [3] Mesut A. New methods in data compression. PhD, Trakya University, Edirne, Turkey, 2006.
  • [4] Amir A, Benson G. Efficient two-dimensional compressed matching. In: DCC'92 Data Compression Conference; 24{27 March 1992; Snow Bird, UT, USA: IEEE. pp. 279.
  • [5] Amir A, Benson G, Farach M. Let sleeping les lie: pattern matching in z-compressed les. Journal of Computer and System Sciences 1996; 52: 299-307.
  • [6] Bulus HN. Studying of usage of pattern matching algorithms in compressed text data and developing a new approach. PhD, Trakya University, Edirne, Turkey, 2010.
  • [7] Navarro G, Raffinot M. A general practical approach to pattern matching over Ziv { Lempel compressed text. In: CPM 99 Combinatorial pattern matching: 10th annual symposium; 22{24 July 1999; Warwick University, UK: pp. 14-36.
  • [8] Navarro G, Tarhio J. LZgrep: a Boyer{Moore string matching tool for Ziv{Lempel compressed text. Software| Practice & Experience Archive 2005; 35: 1107-1130.
  • [9] Shibata Y, Kida T, Fukamachi S, Takeda M, Shinohara A, Shinohara T, Arikawa S. Byte pair encoding: a text compression scheme that accelerates pattern matching. Technical Report DOI-TR-161; 1999; Dept. of Informatics, Kyushu University.
  • [10] Brisaboa NR, Iglesias EL, Navarro G, Parama JR. An efficient compression code for text databases. In: ECIR 2003 Advances in Information Retrieval, 25th European Conference on IR Research; 14{16 April 2003; Pisa, Italy. pp. 468-481.
  • [11] Brisaboa NR, Farin~a A, Navarro G, Esteller MF. (S,C)-Dense Coding: an optimized compression code for natural language text databases. In: Symposium on String Processing and Information Retrieval; 8{10 October 2003; Manaus, Brazil. LNCS 2857, Berlin, Germany: Springer-Verlag. pp. 122-136.
  • [12] Brisaboa NR, Farin~a A, Navarro G, Parama JR. New adaptive compressors for natural language text. Software Pract Exper 2008; 38: 1429-1450.
  • [13] Brisaboa NR, Farin~a A, Navarro G, Parama JR. Dynamic lightweight text compression. ACM T Inform Syst 2010; 28: 1-32.
  • [14] Carus A, Mesut A. A new compression algorithm for fast text search. Turk J Elec Eng & Comp Sci 2016; 24: 4355-4367.
  • [15] Horspool RN, Cormack GV. Constructing word-based text compression algorithms. In: DCC'92 Data Compression Conference; 24{27 March 1992; Snow Bird, UT, USA: IEEE. pp. 62-71.
  • [16] Zhang N, Mukherjee A, Tao T, Bell T, Satya RV, Adjeroh D. A exible compressed text retrieval system using a modi ed LZW algorithm. In: DCC 2005 Data Compression Conference; 29{31 March 2005; Snow Bird, UT, USA: IEEE. pp. 493.
  • [17] Moura ES, Navarro G, Ziviani N, Baeza-Yates R. Fast and exible word searching on compressed text. ACM Transactions on Information Systems 2000; 18: 113-139.
  • [18] Ziv J, Lempel A. A universal algorithm for sequential data compression. IEEE T Inform Theory 1977; 23: 337-343.
  • [19] Ziv J, Lempel A. Compression of individual sequences via variable-rate coding. IEEE Inform Theory 1978; 24: 530-536.
  • [20] Welch TA. A technique for high-performance data compression. Journal Computer Archive 1984; 17: 8-19.
  • [21] Gage P. A new algorithm for data compression. C Users Journal 1994; 12: 23-28.
  • [22] Bannai H, Inenaga S, Takeda M. Efficient LZ78 factorization of grammar compressed text. In: 19th International Symposium, SPIRE 2012; 21{25 October, 2012; Cartagena de Indias, Colombia. pp. 86-98.
  • [23] Karp RM, Rabin MO. Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development 1987; 31: 249-260.
  • [24] Say B, Zeyrek D, O azer K, Ozge U. Development of a corpus and a treebank for present-day written Turkish. In: Eleventh International Conference of Turkish Linguistics; August 2002; Near East University, Gazimagusa, Cyprus. pp. 183-192.
Turkish Journal of Electrical Engineering and Computer Sciences-Cover
  • ISSN: 1300-0632
  • Yayın Aralığı: Yılda 6 Sayı
  • Yayıncı: TÜBİTAK
Sayıdaki Diğer Makaleler

A model of optimal burst assembly for delay reduction at ingress OBS nodes

Viet Minh Nhat VO, Van Hoa LE, Hoang Son NGUYEN

Study and analysis of new pulsed electric eld treatment chamber con gurations for food extraction

Abdelkader CHAKER, Yassine BELLEBNA, Hamza BERMMAKI, Abdelkader SEMMAK, Amar TILMATINE

An ant colony optimization algorithm-based classi cation for the diagnosis of primary headaches using a website questionnaire expert system

Nilüfer YURTAY, Ufuk ÇELİK

A meander coupled line wideband power divider with open stubs and DGS for mobile application

Sivaprakash SOMALINGA CHANDRASEKARAN, Sivanantha Raja AVANINATHAN, Pavithra MURUGESAN

A reduced-order observer based on stator ux estimation with straightforward parameter identi cation for sensorless control of DFIGs

Mohammad Reza AZIZIAN, Rahim AJABI-FARSHBAF, Vahid ESLAMPANAH

Robust control for line-of-sight stabilization of a two-axis gimbal system

Kemal LEBLEBİCİOĞLU, Mehmet BASKIN

The reduction of semiconductor devices in a ying capacitor-based multilevel converter for use as an SSSC

Mana ROKHAFROOZ, Ali MOSALLANEJAD

Power oscillation damping control by PSS and DFIG wind turbine under multiple operating conditions

Korakot THANPISIT, Issarachai NGAMROO

Force and torque parameter estimation for a 4-pole hybrid electromagnet by ANFIS hybrid learning algorithm

Kadir ERKAN, Murat ATLIHAN, Barış Can YALÇIN

Calculation of creepage discharge safety factors against the tangential component of electric elds in the insulation structure of power transformers

Arsalan HEKMATI