Terim-Doküman Matrisleri için Sıralamaya Dayalı Bir Kayıpsız Sıkıştırma Şeması

Kayıpsız veri sıkıştırma, özellikle bellek içi veri tabanları ve önbellek kullanımlı bilgi geri kazanım sistemlerinde, harcanan disk alanını azaltmasının yanı sıra, etkin kod çözme algoritmaları aracılığıyla bilgiye erişimi hızlandırması sebebiyle önem arz etmektedir. Bu çalışmada, ters dizinlerin sıkıştırılması kapsamında yeni bir değişken sekiz ikili kodlama yöntemi geliştirilmiş ve terim-doküman matrisinin bant genişliğinin indirgenmesi amacıyla tepe tırmanmaya dayalı çift kutuplu dizilim şeması önerilmiştir. Bu şema, internet üzerinden toplanan haber metinlerine uygulanarak doküman dizmenin dizin sıkıştırma oranına olan etkisi incelenmiştir.

___

  • [1] G.Graefe and L.Shapiro. Data compression and database performance. In ACM/IEEE-CS Symp. On Applied Computing pages 22 -27, April 1991.
  • [2] M. Stonebraker, D. J. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E. J. O’Neil, P. E. O’Neil, A. Rasin, N. Tran, and S. B. Zdonik. C-Store: A column-oriented DBMS. In VLDB, pages 553–564, 2005.
  • [3] D. J. Abadi, S. Madden, and M. Ferreira. Integrating Compression and Execution in Column-Oriented Database Systems. In SIGMOD, 2006.
  • [4] C. Lin, J. Wang, and Y. Papakonstantinou. Data Compression for Analytics over Large-scale In-memory Column Databases, ACM 2016.
  • [5] H. Schütze, C. D. Manning, and P. Raghavan. Introduction to Information Retrieval. Vol. 39. Cambridge University Press, 2008.
  • [6] I. H. Witten, A. Moffat, T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. San Francisco, CA, USA, 1999.
  • [7] R. Grossi and J. S. Vitter. Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching. In 32nd ACM Symposium on Theory of Computing, pages 397–406, 2000.
  • [8] M. A. Martinez-Prieto, N. Brisaboa, R. Canovas, F. Claude, and G. Navarro. Practical compressed string dictionaries. Information Systems, 56:73–108, 2016.
  • [9] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley, 1999.
  • [10] H. E. Williams,J. Zobel. Compressing Integersfor Fast File Access. Comput.J.42, pp. 193-201, 1999.
  • [11] P. Elias. Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory, IT-21(2):194–203, 1975.
  • [12] V. Anh and A. Moffat. Index compression using fixed binary codewords. In Proc. of the 15th Int. Australasian Database Conference, pages 61–67, 2004.
  • [13] D. A. Huffman. A method for the construction of minimum-redundancy codes. Proceedings of the IRE 40.9, pages 1098-1101, 1952.
  • [14] I. H. Witten, M. N. Radford, and G. C. John. Arithmetic coding for data compression. Communications of the ACM 30.6, pages 520-540, 1987. [15] A. S. Fraenkel and S. T. Klein. Novel compression of sparse bit-strings. Preliminary report. Combinatorial Algorithms on Words, Volume 12. NATO ASI Series F, pages 169-183, 1985.
  • [16] S. W. Golomb. Run-length encoding. IEEE Transactions on Information Theory, vol. IT-12, pp. 399-401, 1966.
  • [17] R. Gallager, and D. Van Voorhis. Optimal source codes for geometrically distributed integer alphabets. IEEE Transactions on Information theory 21.2,pages: 228-230, 1975.
  • [18] A. Trotman. Compressing inverted files. Information Retrieval, 6.1: 5-19, 2003.
  • [19] F. Scholer, H.E. Williams, J. Yiannis, J. Zobel. Compressionof Inverted Indexes for Fast Query Evaluation. In SIGIR 2002, pp. 222-229, 2002.
  • [20] J. Plaisance, N. Kurz, and D. Lemire. Vectorized vbyte decoding. CoRR, 2015.
  • [21] M. C. Sorkunand C. Özbey. Compression experiments on term-document index. Computer Science and Engineering (UBMK), 2017 International Conference on. IEEE, 2017.
  • [22] H. Yan, S. Ding, and T. Suel. Inverted index compression and query processing with optimized document ordering. In Proceedings of the 18th international conference on World wide web. ACM, 401–410, 2009.
  • [23] C. Papadimitriou. The NP-completeness of the bandwidth minimization problem.Computing, vol. 16, pp. 263–270, 1976.
  • [24] A. Lim, B. Rodrigues, F. Xiao. Integrated genetic algorithm with hill climbing forbandwidth minimization problem. GECCO 2003. LNCS, vol. 2724, pp. 1594–1595, 2003.
  • [25] E. Rodriguez-Tello, J. K. Hao, J. Torres-Jimenez. An improved simulated annealing algorithm for bandwidth minimization, European Journal of Operational Research 185, pp. 1319–1335, 2008.
  • [26] R. Martí, M. Laguna, F. Glover and V. Campos.Reducing the bandwidth ofa sparse matrix with tabu search. European Journal of Operational Research 135, pp. 450–459, 2001.
  • [27] E. Cuthill, J. McKee. Reducing the bandwidth of sparse symmetric matrices. In Proceedings of the 1969 24th National Conference (New York, NY, USA, 1969), ACM ’69, ACM, pp. 157–172, 1969.
  • [28] N. E. Gibbs, W. G. Poole and P. K. Stockmeyer. An algorithm for reducing the bandwidth and profile of a sparse matrix. SIAM Journal on Numerical Analysis 13, pp. 236-250, 1976.
  • [29] S. W. Sloan. An algorithm for profile and wavefront reduction of sparse matrices. International Journal for Numerical Methods in Engineering 23, 2. 239–251, 1986.
  • [30] D. Harel and Y. Koren. Graph drawing by high-dimensionalembedding. In Revised Papers from the 10th International Symposium on Graph Drawing. GD ’02, Springer-Verlag, pp. 207–219, 2002.
  • [31] I. Spence andJ. Graef. The determination of the underlying dimensionalityof an empirically obtained matrix of proximities. Multivariate Behavioral Research 9, pp. 331–341, 1974.
  • [32] Y. Cheng, G. M. Church. Biclustering of expression data. In Ismb, vol. 8, pp. 93–103, 2000.
  • [33] F. Silvestri. Sorting out the document identifier assignment problem. In Proc. of 29th European Conf. on Information Retrieval, pages 101–112, 2007.
  • [34] V. Ramaswamy, R. Konow, A. Trotman, J. Degenhardt, and N. Whyte. Document Reordering is Good, Especially for e-Commerce. In Proceedings of the SIGIR 2017 Workshop on eCommerce (ECOM 17), 2017.
  • [35] S. Büttcher, C. Clarke, and G. Cormack. Information Retrieval: Implementing and Evaluating Search Engines. MIT Press, p. 214, 2010.
  • [36] Y. Nekritch. Decoding of canonical Huffman codes with look-up tables. In Proc. Conf. Data Compression, p. 566, 2000.