A new dictionary-based preprocessor that uses radix-190 numbering

A new dictionary-based preprocessor that uses radix-190 numbering

Various scholarly works in the literature have pointed out that placing a preprocessor in front of a standard postcompressor would help achieve higher gains while compressing natural-language text files. Ever since, there has been much research on preprocessors to improve the gain attained by concatenated systems. With the same goal in mind our paper proposes a new word-based preprocessor named METEHAN190 (M190) and contrasts its performance with four other state-of-the-art preprocessors. Throughout the experiments source files from the Wall Street Journal (WSJ) archive, and the Calgary, Canterbury, Gutenberg, and Pizza and Chili corpora were used. Postcompressors adapted were Prediction by Partial Matching compressor using method-D (PPMD) and Monstrous PPM II compressor (PPMonstr). It was observed that in all three experiments WRT and M190 would achieve the two highest compression gains. For small text and transcription files from the Calgary corpus, M190 would outperform all preprocessors including WRT. On the other hand, a look at average encoding and decoding times shows that the semistatic byte-oriented methods are much faster in comparison to the static dictionary-based methods that encode words with characters.

___

  • [1] Huffman DA. A method for the construction of minimum-redundancy codes. P IRE 1952; 40: 1098-1101.
  • [2] Gallager RG. Variations on a theme by Huffman. IEEE T Inf Theory 1978; 24: 668-674.
  • [3] Rissanen J, Langdon GG. Arithmetic coding. IBM J Res Devel 1979; 23: 149-162.
  • [4] Cleary JG, Witten IH. Data compression using adaptive coding and partial string matching. IEEE T Commun 1984; 32: 396-402.
  • [5] Mahoney MV. (2004-) The PAQ6 data compression program, 2004-present. Available online at http://mattmahoney.net/dc/paq.html.
  • [6] Moura E, Navarro G, Ziviani N, Baeza-Yates R. Fast and flexible word searching on compressed text. ACM T Inf Syst 2000; 18: 113-139.
  • [7] Brisaboa N, Iglesias E, Navarro G, Parama J. An efficient compression code for text databases. In: IEEE 2003 Advances in Information Retrieval Conference; 14–16 April 2003; Pisa, Italy: IEEE. pp. 468-481.
  • [8] Brisaboa N, Farina A, Navarro G, Parama J. Leight-weight natural language text compression. Inf Ret 2007; 10: 1-33.
  • [9] Culpepper JS, Moffat A. Enhanced byte codes with restricted prefix properties. In: Springer-Verlag 2005 String Processing and Information Retrieval Conference; LNCS 3772: Springer-Verlag. pp. 1-12.
  • [10] Brisaboa N, Faria A, Ladra S, Navarro G. Implicit indexing of natural language text by reorganizing bytecodes. J Inf Ret 2012; 15: 527-557.
  • [11] Ziv J, Lempel A. A universal algorithm for sequential data compression. IEEE T Inf Theory 1977; 23: 337-343.
  • [12] Welch TA. A technique for high-performance data compression. Comput J 1984; 17: 8-19.
  • [13] Deutsch P. (1996-) Deflate compressed data format specification, version 1.3, 1951-present. Available online at https://www.ietf.org/rfc/rfc1951.txt.
  • [14] Nelson MR. (2002-) Star encoding, Dr. Dobb’s J., 2002-present. Available online at http://marknelson.us/page/11/.
  • [15] Awan F, Mukherjee A. LIPT: A lossless text transform to improve compression. In: IEEE 2001 Information Technology: Coding and Computing Conference; April 2001: IEEE. pp. 452-460.
  • [16] Sun W, Mukherjee A, Zhang N. A dictionary-based multi corpora text compression system. In: IEEE 2003 Data Compression Conference; March 2003: IEEE. pp. 1-11.
  • [17] Skibinski P, Grabowski S, Deorowicz S. Revisiting dictionary based compression. Softw J : Pract and Exper 2005; 35: 1455-1476.
  • [18] Golomb SW. Run-length encoding. IEEE T Inf Theory 1966; 12: 337-343.
  • [19] Burrows M, Wheeler DJ. A Block-sorting lossless data compression algorithm. Digital Systems Research Center, Research Report 124, 1994.
  • [20] Effros M, Visweswariah K, Kulkarni SR, Verdu S. Universal lossless source coding with the Burrows Wheeler transform. IEEE n Inf Theory 2002; 48: 1061-1081.
  • [21] Seward J. On the performance of BWT sorting algorithms. In: IEEE 2000 Data Compression Conference; March 2000: IEEE. pp. 173-182.
  • [22] Moffat A. Implementing the PPM data compression scheme. IEEE T Commun 1990; 38: 1917-1921.
  • [23] Teahan W. Probability estimation for PPM. In: 1995 New Zealand Computer Science Research Students Conference; University of Waikato, New Zealand, 1995.
  • [24] Shkarin D. (2001-) PPMd: fast PPM compressor for textual data, 2001-present. Available online at ftp://ftp.elf.stuba.sk/pub/pc/pack/ppmdh.rar.
  • [25] Teahan EJ, Witten IH. Unbounded length contexts for PPM.In: IEEE 1995 Data Compression Conference; March 1995: IEEE. pp. 52-61.
  • [26] Shkarin D. (2002-) Monstrous PPM II compressor based on PPMd var.I, 2002-present. Available online at http:/www.compression.ru/ds/ppmdi1.rar.
  • [27] Shkarin D. (2004-) The Durilca and Durilca Light 0.4a programs, 2006-present, Available online at http://www.compression.ru/ds/durilca.rar.
Turkish Journal of Electrical Engineering and Computer Sciences-Cover
  • ISSN: 1300-0632
  • Yayın Aralığı: Yılda 6 Sayı
  • Yayıncı: TÜBİTAK
Sayıdaki Diğer Makaleler

Process time and MPPT performance analysis of CF, LUT, and ANN control methods for a PMSG-based wind energy generation system

Ercüment KARAKAS, Abdulhakim KARAKAYA

A new CMOS logarithmic current generator

Munir ABSI AL, Karama TAMIMI AL

Energy-efficient mobile cluster-head data collection model for wireless sensor networks

Sanjeev SOFAT, Renu VIG, Nidhi GAUTAM

Determination of the striking distance of a lightning rod using finite element analysis

Hazlee Azil ILLIAS, Hazlie MOKHLIS, Ab Halim BAKAR ABU, Syahirah HALIM ABD, Nor Hidayah HASSAN NOR, Chia Kwang TAN, Alyaa ABIDIN ZAINAL

Design of low power system on programmable chip for video zoom-in processing

Lamjed TOUIL, Lilia KECHICHE, Bouraoui OUNI

A multiobjective tuning approach of power system stabilizers using particle swarm optimization

Djamel BOUKHETALA, Hilal LABDELAOUI, Far`es BOUDJEMA

A new step-size searching algorithm based on fuzzy logic and neural networks for LMS adaptive beamforming systems

Mariko MIYATAKE NAKANO, Hector MEANA PEREZ, Walter TUPACYUPANQUI OROZCO

A new market-based approach for daily Volt/Var control of distribution systems in the presence of distributed energy resources using Benders decomposition algorithm

Ahad KAZEMI, Abouzar SAMIMI

IEC 61850-based parallel bus transfer scheme for industrial substations

Cagil OZANSOY

Novel dynamic partial reconfiguration implementations of the support vector machine classifier on FPGA

Hanaa HUSSAIN, Khaled BENKRID2, Hüseyin ŞEKER