Estimating the selectivity of LIKE queries using pattern-based histograms

Estimating the selectivity of LIKE queries using pattern-based histograms

Accurate cost and time estimation of a query is one of the major success indicators for database managementsystems. SQL allows the expression of flexible queries on text-formatted data. The LIKE operator is used to search for aspecified pattern (e.g., LIKE “luck%”) in a string database. It is vital to estimate the selectivity of such flexible predicatesfor the query optimizer to choose an efficient execution plan. In this paper, we study the problem of estimating theselectivity of a LIKE query predicate over a bag of strings. We propose a new type of pattern-based histogram structureto summarize the data distribution in a particular column. More specifically, we first mine sequential patterns over agiven string database and then construct a special histogram out of the mined patterns. During query optimization time,pattern-based histograms are exploited to estimate the selectivity of a LIKE predicate. The experimental results on areal dataset from DBLP show that the proposed technique outperforms the state of the art for generic LIKE queries like%s1%s2%...%sn% where si represents one or more characters. What is more, the proposed histogram structure requiresmore than two orders of magnitude smaller memory space, and the estimation time is almost an order of magnitude lessin comparison to the state of the art.

___

  • ] Krishnan P, Vitter JS, Iyer B. Estimating alphanumeric selectivity in the presence of wildcards. In: ACM 1996 Conference on Management of Data; 4–6 June 1996; Montreal, Canada. New York, NY, USA: ACM. pp. 282-293.
  • Lee H, Ng RT, Shim K. Approximate substring selectivity estimation. In: ACM 2009 International Conference on Extending Database Technology; 24-26 March 2009; Saint Petersburg, Russia. New York, NY, USA: ACM. pp. 827-838.
  • Jagadish HV, Kapitskaia O, Ng RT, Srivastava D. One-dimensional and multi-dimensional substring selectivity estimation. VLDB J 2000; 9: 214-230
  • Jagadish HV, Ng RT, Srivastava D. Substring selectivity estimation. In: ACM 1999 Symposium on Principles of Database Systems; 31 May–2 June 1999; Philadelphia, Pennsylvania, USA. New York, NY, USA: ACM. pp. 249-260.
  • Chaudhuri S, Ganti V, Gravano L. Selectivity estimation for string predicates: overcoming the underestimation problem. In: IEEE 2004 Conference on Data Engineering; 30 March–2 April 2004; Boston, MA, USA. New York, NY, USA: IEEE. pp. 227-238.
  • Ferragina P, Grossi R. The string B-tree: a new data structure for string search in external memory and its applications. J ACM 1999; 46: 236-280.
  • To H, Chiang K, Shahabi C. Entropy-based histograms for selectivity estimation. In: ACM 2013 Conference on Information and Knowledge Management; 27 October–1 November 2013; San Francisco, CA, USA. New York, NY, USA: ACM. pp. 1939-1948
  • Poosala V, Haas PJ, Ioannidis YE, Shekita EJ. Improved histograms for selectivity estimation of range predicates. In: ACM 1996 Conference on Management of Data; 4–6 June 1996; Montreal, Canada. New York, NY, USA: ACM. pp. 294-305.
  • Jagadish HV, Koudas N, Muthukrishnan S, Poosala V, Sevcik KC, Suel T. Optimal histograms with quality guarantees. In: VLDB 1998 Conference on Very Large Databases; 24–27 August 1998; New York City, NY, USA. pp. 24-27.
  • Matias Y, Vitter JS, Wang M. Wavelet-based histograms for selectivity estimation. In: ACM 1998 Conference on Management of Data; 2–4 June 1998; Seattle, WA, USA. New York, NY, USA: ACM. pp. 448-459.
  • Muralikrishna M, DeWitt DJ. Equi-depth multidimensional histograms. In: ACM 1988 Conference on Management of Data; 1–3 June 1988; Chicago, IL, USA. New York, NY, USA: ACM. pp. 28-36.
  • Jin L, Li C. Selectivity estimation for fuzzy string predicates in large data sets. In: VLDB 2005 Conference on Very Large Databases; 30 August–2 September 2005; Trondheim, Norway. pp. 397-408
  • Arasu A, Ganti V, Kaushik R. Efficient exact set-similarity joins. In: VLDB 2006 Conference on Very Large Databases; 12–15 September 2006; Seoul, Korea. pp. 918-929.
  • Chaudhuri S, Ganjam K, Ganti V, Motwani R. Robust and efficient fuzzy match for online data cleaning. In: ACM 2003 Conference on Management of Data; 9–12 June 2003; San Diego, CA, USA. New York, NY, USA: ACM. pp. 313-324.
  • Chaudhuri S, Ganti V, Kaushik R. A primitive operator for similarity joins in data cleaning. In: IEEE 2006 Conference on Data Engineering; 3–7 April 2006; Atlanta, GA, USA. New York, NY, USA: IEEE. pp. 5-5.
  • Xiao C, Wang W, Lin X. Ed-join: an efficient algorithm for similarity joins with edit distance constraints. In: VLDB 2008 Conference on Very Large Databases; 23–28 August 2008; Auckland, New Zealand. pp. 933-944.
  • Lee H, Ng RT, Shim K. Extending q-grams to estimate selectivity of string matching with low edit distance. In: VLDB 2007 Conference on Very Large Databases; 23–27 September 2007; Vienna, Austria. pp. 195-206.
  • Kim Y, Park H, Shim K, Woo KG. Efficient processing of substring match queries with inverted variable-length gram indexes. Inf Sci 2013; 244: 119-141.
  • Yao B, Li F, Hadjieleftheriou M, Hou K. Approximate string search in spatial databases. In: IEEE 2006 Conference on Data Engineering; 1–6 March 2010; Long Beach, CA, USA. New York, NY, USA: IEEE. pp. 545-556.
  • Agrawal R, Srikant R. Mining sequential patterns. In: IEEE 1995 Conference on Data Engineering; 6–10 March 1995; Taipei, Taiwan. New York, NY, USA: IEEE. pp. 3-14.
  • Yan X, Han J, Afshar R. CloSpan: Mining: Closed sequential patterns in large datasets. In: SIAM 2003 Conference on Data Mining; 1–3 May 2003; San Francisco, CA, USA. pp. 166-177.
  • Wang J, Han J, Pei J. Closet+: Searching for the best strategies for mining frequent closed itemsets. In: ACM 2003 Conference on Knowledge Discovery and Data Mining; 24–27 August 2003; Washington, DC, USA. New York, NY, USA: ACM. pp. 236-245.
  • Zaki MJ, Hsiao CJ. CHARM: An efficient algorithm for closed itemset mining. In: SIAM 2002 Conference on Data Mining; 11–13 April 2002; Arlington, VA, USA. pp. 457-473.
  • Pasquier N, Bastide Y, Taouil R, Lakhal L. Discovering frequent closed itemsets for association rules. In: ICDT 1999 Conference on Database Theory; 10–12 January 1999; Jerusalem, Israel. Berlin, Germany: Springer. pp. 398-416.
  • Wang J, Han J. BIDE: Efficient mining of frequent closed sequences. In: IEEE 2004 Conference on Data Engineering; 30 March–2 April 2004; Boston, MA, USA. New York, NY, USA: IEEE. pp. 79-90.
Turkish Journal of Electrical Engineering and Computer Sciences-Cover
  • ISSN: 1300-0632
  • Yayın Aralığı: Yılda 6 Sayı
  • Yayıncı: TÜBİTAK
Sayıdaki Diğer Makaleler

Coordination of distance and directional overcurrent relays using a new algorithm: grey wolf optimizer

Omid SOLEIMANI OOREH, Zahra MORAVEJ

Fiber-optic interferometric sensor for monitoring automobile and rail traffic

Radek MARTINEK, Radana KAHANKOVA, Marcel FAJKUS, Vladimir VASINEK, Daniel CVEJN, Jan VANUS, Jan NEDOMA, Marek DVORSKY

Feature extraction using sequential cumulative bin and overlap mean intensity for iris classification

Ahmad Nazri ALI, Mohd Zaid ABDULLAH, Shahrel Azmin SUANDI

FPGA implementation of a low-power and area-efficient state-table-based compression algorithm for DSLR cameras

Mohd Rafi LONE, Najeeb-ud-Din HAKIM

Adaptive antisingularity terminal sliding mode control for a robotic arm with model uncertainties and external disturbances

Minhtuan PHAM, Quyen BUI, Kiem NGUYEN, Tinh NGUYEN

Estimation of the depth of anesthesia by using a multioutput least-square support vector regression

Sirous MOMENZADEH, Seyed Kamaledin SETAREHDAN, Mercedeh JAHANSEIR

Classification and regression analysis using support vector machine for classifying and locating faults in a distribution system

Hazlee Azil Bin ILLIAS, Hazlie MOKHLIS, Sophi Shilpa GURURAJAPATHY

Horizontal diversity in test generation for high fault coverage

Abu Khari Bin A’AIN, Ian GROUT, Norlina PARAMAN, Usman Ullah SHEIKH, Arbab ALAMGIR

Estimating the selectivity of LIKE queries using pattern-based histograms

Mehmet AYTİMUR, Ali ÇAKMAK

Influence of thyristor-controlled series capacitor on wheeling cost incorporating the impact of real and reactive power losses

Kranthi Kiran IRINJILA, Jaya Laxmi ASKANI