A new word-based compression model allowing compressed pattern matching

A new word-based compression model allowing compressed pattern matching

In this study a new semistatic data compression model that has a fast coding process and that allows compressed pattern matching is introduced. The name of the proposed model is chosen as tagged word-based compression algorithm (TWBCA) since it has a word-based coding and word-based compressed matching algorithm. The model has two phases. In the rst phase a dictionary is constructed by adding a phrase, paying attention to word boundaries, and in the second phase compression is done by using codewords of phrases in this dictionary. The rst byte of the codeword determines whether the word is compressed or not. By paying attention to this rule, the CPM process can be conducted as word based. In addition, the proposed method makes it possible to also search for the group of consecutively compressed words. Any of the previous pattern matching algorithms can be chosen to use in compressed pattern matching as a black box. The duration of the CPM process is always less than the duration of the same process on the texts coded by Gzip tool. While matching longer patterns, compressed pattern matching takes more time on the texts coded by compress and end-tagged dense code (ETDC). However, searching shorter patterns takes less time on texts coded by our approach than the texts compressed with compress. Besides this, the compression ratio of our algorithm has a better performance against ETDC only on a le that has been written in Turkish. The compression performance of TWBCA is stable and does not vary over 6% on different textfiles.

___

  • [1] Shannon CE. A mathematical theory of communication. Bell System Technical Journal 1948; 27: 379-423.
  • [2] Crochemore M, Rytter W. Jewels of Stringology. Singapore: World Scienti c Pub. Co. Inc., 2002.
  • [3] Mesut A. New methods in data compression. PhD, Trakya University, Edirne, Turkey, 2006.
  • [4] Amir A, Benson G. Efficient two-dimensional compressed matching. In: DCC'92 Data Compression Conference; 24{27 March 1992; Snow Bird, UT, USA: IEEE. pp. 279.
  • [5] Amir A, Benson G, Farach M. Let sleeping les lie: pattern matching in z-compressed les. Journal of Computer and System Sciences 1996; 52: 299-307.
  • [6] Bulus HN. Studying of usage of pattern matching algorithms in compressed text data and developing a new approach. PhD, Trakya University, Edirne, Turkey, 2010.
  • [7] Navarro G, Raffinot M. A general practical approach to pattern matching over Ziv { Lempel compressed text. In: CPM 99 Combinatorial pattern matching: 10th annual symposium; 22{24 July 1999; Warwick University, UK: pp. 14-36.
  • [8] Navarro G, Tarhio J. LZgrep: a Boyer{Moore string matching tool for Ziv{Lempel compressed text. Software| Practice & Experience Archive 2005; 35: 1107-1130.
  • [9] Shibata Y, Kida T, Fukamachi S, Takeda M, Shinohara A, Shinohara T, Arikawa S. Byte pair encoding: a text compression scheme that accelerates pattern matching. Technical Report DOI-TR-161; 1999; Dept. of Informatics, Kyushu University.
  • [10] Brisaboa NR, Iglesias EL, Navarro G, Parama JR. An efficient compression code for text databases. In: ECIR 2003 Advances in Information Retrieval, 25th European Conference on IR Research; 14{16 April 2003; Pisa, Italy. pp. 468-481.
  • [11] Brisaboa NR, Farin~a A, Navarro G, Esteller MF. (S,C)-Dense Coding: an optimized compression code for natural language text databases. In: Symposium on String Processing and Information Retrieval; 8{10 October 2003; Manaus, Brazil. LNCS 2857, Berlin, Germany: Springer-Verlag. pp. 122-136.
  • [12] Brisaboa NR, Farin~a A, Navarro G, Parama JR. New adaptive compressors for natural language text. Software Pract Exper 2008; 38: 1429-1450.
  • [13] Brisaboa NR, Farin~a A, Navarro G, Parama JR. Dynamic lightweight text compression. ACM T Inform Syst 2010; 28: 1-32.
  • [14] Carus A, Mesut A. A new compression algorithm for fast text search. Turk J Elec Eng & Comp Sci 2016; 24: 4355-4367.
  • [15] Horspool RN, Cormack GV. Constructing word-based text compression algorithms. In: DCC'92 Data Compression Conference; 24{27 March 1992; Snow Bird, UT, USA: IEEE. pp. 62-71.
  • [16] Zhang N, Mukherjee A, Tao T, Bell T, Satya RV, Adjeroh D. A exible compressed text retrieval system using a modi ed LZW algorithm. In: DCC 2005 Data Compression Conference; 29{31 March 2005; Snow Bird, UT, USA: IEEE. pp. 493.
  • [17] Moura ES, Navarro G, Ziviani N, Baeza-Yates R. Fast and exible word searching on compressed text. ACM Transactions on Information Systems 2000; 18: 113-139.
  • [18] Ziv J, Lempel A. A universal algorithm for sequential data compression. IEEE T Inform Theory 1977; 23: 337-343.
  • [19] Ziv J, Lempel A. Compression of individual sequences via variable-rate coding. IEEE Inform Theory 1978; 24: 530-536.
  • [20] Welch TA. A technique for high-performance data compression. Journal Computer Archive 1984; 17: 8-19.
  • [21] Gage P. A new algorithm for data compression. C Users Journal 1994; 12: 23-28.
  • [22] Bannai H, Inenaga S, Takeda M. Efficient LZ78 factorization of grammar compressed text. In: 19th International Symposium, SPIRE 2012; 21{25 October, 2012; Cartagena de Indias, Colombia. pp. 86-98.
  • [23] Karp RM, Rabin MO. Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development 1987; 31: 249-260.
  • [24] Say B, Zeyrek D, O azer K, Ozge U. Development of a corpus and a treebank for present-day written Turkish. In: Eleventh International Conference of Turkish Linguistics; August 2002; Near East University, Gazimagusa, Cyprus. pp. 183-192.
Turkish Journal of Electrical Engineering and Computer Sciences-Cover
  • ISSN: 1300-0632
  • Yayın Aralığı: 6
  • Yayıncı: TÜBİTAK
Sayıdaki Diğer Makaleler

Fault location determination for transmission lines with different series-compensation levels using transient frequencies

Mehmet Emin TAĞLUK, Müslüm ARKAN, Mehmet Salih MAMİŞ, Düzgün AKMAZ

Numerical study of AC loss of two-layer HTS power transmission cables composed of coated conductors with a ferromagnetic substrate

Sükrü YILDIZ, Ahmet CİCEK, Fedor GÖMÖRY, Fedai İNANIR

A model of optimal burst assembly for delay reduction at ingress OBS nodes

Viet Minh Nhat VO, Van Hoa LE, Hoang Son NGUYEN

Study and analysis of new pulsed electric eld treatment chamber con gurations for food extraction

Abdelkader CHAKER, Yassine BELLEBNA, Hamza BERMMAKI, Abdelkader SEMMAK, Amar TILMATINE

Generalized referenceless image quality assessment framework using texture energy measures and pattern strength features

Jayashri BAGADE, Kulbir SINGH, Yogesh DANDAWATE

Power oscillation damping control by PSS and DFIG wind turbine under multiple operating conditions

Korakot THANPISIT, Issarachai NGAMROO

Median ltering detection based on variations and residuals in image forensics

Kang Hyeon RHEE

A new proposal for the design of hybrid AC/DC microgrids toward high power quality

Pouria GOHARSHENASAN KHORASANI, Mahmood JOORABIAN, Seyed Ghodratolah SEIFOSSADAT

Planar inverted-f antenna for universal serial bus dongle applications

Yi HUANG, Hassan Tariq CHATTHA, Qammer Hussain ABBASI, Saqer Saleh ALJAAFREH, Muhammad NASIR

Lossless predictive coding of electric signal waveforms

Krzysztof DUDA