Name spell-check framework for social networks

Name spell-check framework for social networks

The problem of finding similar strings is very important in most real life applications including spell-checking, data cleaning, next generation sequencing, and alignment. In order to query and manage string data online, scalable algorithms and frameworks are essential. Scalable frameworks and algorithms have been introduced in the past few years. However, these frameworks mainly deal with caching and querying structured data. They do not deal with fuzzy queries, where we need to search for an approximate string. In this paper, we propose an edit distance aware filtering algorithm for all kinds of approximate string search problems. We also propose a novel name spell-check engine mainly for social networks. Our experiments show that our edit distance aware filtering mechanism alone improves the query processing time and throughput by almost 30%. Additionally, our name spell-check engine improved the name spell-check response time and throughput almost 10 times by using our filtering scheme and some domain specific observations.

___

  • [1] White T. Hadoop: The Definitive Guide. Sebastopol, CA, USA: O’Reilly Media, 2009.
  • [2] Fitzpatrick B. Distributed caching with memcached. Linux Journal 2004; 2004: 124-129.
  • [3] George L. HBase: the definitive guide. Sebastopol, CA, USA: O’Reilly Media, 2011.
  • [4] Xin H, Lee D, Hormozdiari F, Yedkar S, Mutlu O, Alkan C. Accelerating read mapping with FastHASH. BMC Genomics 2013; 14: S13-S14.
  • [5] Levenstein V. Binary codes capable of correcting spurious insertions and deletions of ones. Probl Inf Transm 1965; 1: 8-17
  • [6] Sarawagi S, Kirpal A. Efficient set joins on similarity predicates. In: Proceedings of the 2004 ACM SIGMOD international conference on Management of data; 13–18 June 2004; Paris, France. New York, NY, USA: ACM. pp. 743-754.
  • [7] Bayardo RJ, Ma Y, Srikant R. Scaling up all pairs similarity search. In: WWW 2007 Proceedings of the 16th International Conference on World Wide Web; 8 May 2007; Banff, Alberta, Canada. New York, NY, USA: ACM. pp. 131-140.
  • [8] Sutinen E, Tarhio J. On using q-gram locations in approximate string matching. In: Spirakis P, editor. Algorithms ESA ’95. Berlin, Germany: Springer, 1995. pp. 327-340.
  • [9] Ukkonen E. Approximate string-matching with q-grams and maximal matches. Theor Comput Sci 1992; 92: 191-211.
  • [10] Chowdhury G. Introduction to Modern Information Retrieval. London, UK: Facet Publishing, 2010.
  • [11] Cutting D, Pedersen J. Optimization for dynamic inverted index maintenance. In: Proceedings of the 13th annual international ACM SIGIR conference on Research and development in information retrieval; 1990; Brussels, Belgium. New York, NY, USA: ACM. pp. 405-411.
  • [12] Chaudhuri S, Ganjam K, Ganti V, Motwani R. Robust and efficient fuzzy match for online data cleaning. In: Proceedings of the 2003 ACM SIGMOD international conference on Management of data; 9–12 June 2003; San Diego, CA, USA. New York, NY, USA: ACM. pp. 313-324.
  • [13] Navarro G. A guided tour to approximate string matching. ACM Comput Surv 2001; 33: 31-88.
  • [14] Li C, Lu J, Lu Y. Efficient merging and filtering algorithms for approximate string searches. In: IEEE 2008 Proceedings of the 24th International Conference on Data Engineering; 7–12 April 2008; Cancun, Mexico. New York, NY, USA: IEEE. pp. 257-266.
  • [15] Kim M, Whang K, Lee J, Lee M. n-gram/2l: A space and time efficient two-level n-gram inverted index structure. In: VLDB 2005 Proceedings of the 31st International Conference on Very Large Data Bases; 30 August–2 September 2005; Trondheim, Norway. New York, NY, USA: ACM. pp. 325-336.
  • [16] Li C, Wang B, Yang X. Vgram: improving performance of approximate queries on string collections using variablelength grams. In: VLDB 2007 Proceedings of the 33rd International Conference on Very Large Data Bases; 23–27 September 2007; Vienna, Austria. New York, NY, USA: ACM. pp. 303-314.
  • [17] Li H, Homer N. A survey of sequence alignment algorithms for next-generation sequencing. Brief Bioinform 2010; 11: 473-483.