Mehmet AYTİMUR, Ali ÇAKMAK

Estimating the selectivity of LIKE queries using pattern-based histograms

Accurate cost and time estimation of a query is one of the major success indicators for database managementsystems. SQL allows the expression of flexible queries on text-formatted data. The LIKE operator is used to search for aspecified pattern (e.g., LIKE “luck%”) in a string database. It is vital to estimate the selectivity of such flexible predicatesfor the query optimizer to choose an efficient execution plan. In this paper, we study the problem of estimating theselectivity of a LIKE query predicate over a bag of strings. We propose a new type of pattern-based histogram structureto summarize the data distribution in a particular column. More specifically, we first mine sequential patterns over agiven string database and then construct a special histogram out of the mined patterns. During query optimization time,pattern-based histograms are exploited to estimate the selectivity of a LIKE predicate. The experimental results on areal dataset from DBLP show that the proposed technique outperforms the state of the art for generic LIKE queries like%s1%s2%...%sn% where si represents one or more characters. What is more, the proposed histogram structure requiresmore than two orders of magnitude smaller memory space, and the estimation time is almost an order of magnitude lessin comparison to the state of the art.

PDF

___

] Krishnan P, Vitter JS, Iyer B. Estimating alphanumeric selectivity in the presence of wildcards. In: ACM 1996 Conference on Management of Data; 4–6 June 1996; Montreal, Canada. New York, NY, USA: ACM. pp. 282-293.
Lee H, Ng RT, Shim K. Approximate substring selectivity estimation. In: ACM 2009 International Conference on Extending Database Technology; 24-26 March 2009; Saint Petersburg, Russia. New York, NY, USA: ACM. pp. 827-838.
Jagadish HV, Kapitskaia O, Ng RT, Srivastava D. One-dimensional and multi-dimensional substring selectivity estimation. VLDB J 2000; 9: 214-230
Jagadish HV, Ng RT, Srivastava D. Substring selectivity estimation. In: ACM 1999 Symposium on Principles of Database Systems; 31 May–2 June 1999; Philadelphia, Pennsylvania, USA. New York, NY, USA: ACM. pp. 249-260.
Chaudhuri S, Ganti V, Gravano L. Selectivity estimation for string predicates: overcoming the underestimation problem. In: IEEE 2004 Conference on Data Engineering; 30 March–2 April 2004; Boston, MA, USA. New York, NY, USA: IEEE. pp. 227-238.
Ferragina P, Grossi R. The string B-tree: a new data structure for string search in external memory and its applications. J ACM 1999; 46: 236-280.
To H, Chiang K, Shahabi C. Entropy-based histograms for selectivity estimation. In: ACM 2013 Conference on Information and Knowledge Management; 27 October–1 November 2013; San Francisco, CA, USA. New York, NY, USA: ACM. pp. 1939-1948
Poosala V, Haas PJ, Ioannidis YE, Shekita EJ. Improved histograms for selectivity estimation of range predicates. In: ACM 1996 Conference on Management of Data; 4–6 June 1996; Montreal, Canada. New York, NY, USA: ACM. pp. 294-305.
Jagadish HV, Koudas N, Muthukrishnan S, Poosala V, Sevcik KC, Suel T. Optimal histograms with quality guarantees. In: VLDB 1998 Conference on Very Large Databases; 24–27 August 1998; New York City, NY, USA. pp. 24-27.
Matias Y, Vitter JS, Wang M. Wavelet-based histograms for selectivity estimation. In: ACM 1998 Conference on Management of Data; 2–4 June 1998; Seattle, WA, USA. New York, NY, USA: ACM. pp. 448-459.
Muralikrishna M, DeWitt DJ. Equi-depth multidimensional histograms. In: ACM 1988 Conference on Management of Data; 1–3 June 1988; Chicago, IL, USA. New York, NY, USA: ACM. pp. 28-36.
Jin L, Li C. Selectivity estimation for fuzzy string predicates in large data sets. In: VLDB 2005 Conference on Very Large Databases; 30 August–2 September 2005; Trondheim, Norway. pp. 397-408
Arasu A, Ganti V, Kaushik R. Efficient exact set-similarity joins. In: VLDB 2006 Conference on Very Large Databases; 12–15 September 2006; Seoul, Korea. pp. 918-929.
Chaudhuri S, Ganjam K, Ganti V, Motwani R. Robust and efficient fuzzy match for online data cleaning. In: ACM 2003 Conference on Management of Data; 9–12 June 2003; San Diego, CA, USA. New York, NY, USA: ACM. pp. 313-324.
Chaudhuri S, Ganti V, Kaushik R. A primitive operator for similarity joins in data cleaning. In: IEEE 2006 Conference on Data Engineering; 3–7 April 2006; Atlanta, GA, USA. New York, NY, USA: IEEE. pp. 5-5.
Xiao C, Wang W, Lin X. Ed-join: an efficient algorithm for similarity joins with edit distance constraints. In: VLDB 2008 Conference on Very Large Databases; 23–28 August 2008; Auckland, New Zealand. pp. 933-944.
Lee H, Ng RT, Shim K. Extending q-grams to estimate selectivity of string matching with low edit distance. In: VLDB 2007 Conference on Very Large Databases; 23–27 September 2007; Vienna, Austria. pp. 195-206.
Kim Y, Park H, Shim K, Woo KG. Efficient processing of substring match queries with inverted variable-length gram indexes. Inf Sci 2013; 244: 119-141.
Yao B, Li F, Hadjieleftheriou M, Hou K. Approximate string search in spatial databases. In: IEEE 2006 Conference on Data Engineering; 1–6 March 2010; Long Beach, CA, USA. New York, NY, USA: IEEE. pp. 545-556.
Agrawal R, Srikant R. Mining sequential patterns. In: IEEE 1995 Conference on Data Engineering; 6–10 March 1995; Taipei, Taiwan. New York, NY, USA: IEEE. pp. 3-14.
Yan X, Han J, Afshar R. CloSpan: Mining: Closed sequential patterns in large datasets. In: SIAM 2003 Conference on Data Mining; 1–3 May 2003; San Francisco, CA, USA. pp. 166-177.
Wang J, Han J, Pei J. Closet+: Searching for the best strategies for mining frequent closed itemsets. In: ACM 2003 Conference on Knowledge Discovery and Data Mining; 24–27 August 2003; Washington, DC, USA. New York, NY, USA: ACM. pp. 236-245.
Zaki MJ, Hsiao CJ. CHARM: An efficient algorithm for closed itemset mining. In: SIAM 2002 Conference on Data Mining; 11–13 April 2002; Arlington, VA, USA. pp. 457-473.
Pasquier N, Bastide Y, Taouil R, Lakhal L. Discovering frequent closed itemsets for association rules. In: ICDT 1999 Conference on Database Theory; 10–12 January 1999; Jerusalem, Israel. Berlin, Germany: Springer. pp. 398-416.
Wang J, Han J. BIDE: Efficient mining of frequent closed sequences. In: IEEE 2004 Conference on Data Engineering; 30 March–2 April 2004; Boston, MA, USA. New York, NY, USA: IEEE. pp. 79-90.