Name spell-check framework for social networks

Name spell-check framework for social networks

The problem of finding similar strings is very important in most real life applications including spell-checking, data cleaning, next generation sequencing, and alignment. In order to query and manage string data online, scalable algorithms and frameworks are essential. Scalable frameworks and algorithms have been introduced in the past few years. However, these frameworks mainly deal with caching and querying structured data. They do not deal with fuzzy queries, where we need to search for an approximate string. In this paper, we propose an edit distance aware filtering algorithm for all kinds of approximate string search problems. We also propose a novel name spell-check engine mainly for social networks. Our experiments show that our edit distance aware filtering mechanism alone improves the query processing time and throughput by almost 30%. Additionally, our name spell-check engine improved the name spell-check response time and throughput almost 10 times by using our filtering scheme and some domain specific observations.

___

  • [1] White T. Hadoop: The Definitive Guide. Sebastopol, CA, USA: O’Reilly Media, 2009.
  • [2] Fitzpatrick B. Distributed caching with memcached. Linux Journal 2004; 2004: 124-129.
  • [3] George L. HBase: the definitive guide. Sebastopol, CA, USA: O’Reilly Media, 2011.
  • [4] Xin H, Lee D, Hormozdiari F, Yedkar S, Mutlu O, Alkan C. Accelerating read mapping with FastHASH. BMC Genomics 2013; 14: S13-S14.
  • [5] Levenstein V. Binary codes capable of correcting spurious insertions and deletions of ones. Probl Inf Transm 1965; 1: 8-17
  • [6] Sarawagi S, Kirpal A. Efficient set joins on similarity predicates. In: Proceedings of the 2004 ACM SIGMOD international conference on Management of data; 13–18 June 2004; Paris, France. New York, NY, USA: ACM. pp. 743-754.
  • [7] Bayardo RJ, Ma Y, Srikant R. Scaling up all pairs similarity search. In: WWW 2007 Proceedings of the 16th International Conference on World Wide Web; 8 May 2007; Banff, Alberta, Canada. New York, NY, USA: ACM. pp. 131-140.
  • [8] Sutinen E, Tarhio J. On using q-gram locations in approximate string matching. In: Spirakis P, editor. Algorithms ESA ’95. Berlin, Germany: Springer, 1995. pp. 327-340.
  • [9] Ukkonen E. Approximate string-matching with q-grams and maximal matches. Theor Comput Sci 1992; 92: 191-211.
  • [10] Chowdhury G. Introduction to Modern Information Retrieval. London, UK: Facet Publishing, 2010.
  • [11] Cutting D, Pedersen J. Optimization for dynamic inverted index maintenance. In: Proceedings of the 13th annual international ACM SIGIR conference on Research and development in information retrieval; 1990; Brussels, Belgium. New York, NY, USA: ACM. pp. 405-411.
  • [12] Chaudhuri S, Ganjam K, Ganti V, Motwani R. Robust and efficient fuzzy match for online data cleaning. In: Proceedings of the 2003 ACM SIGMOD international conference on Management of data; 9–12 June 2003; San Diego, CA, USA. New York, NY, USA: ACM. pp. 313-324.
  • [13] Navarro G. A guided tour to approximate string matching. ACM Comput Surv 2001; 33: 31-88.
  • [14] Li C, Lu J, Lu Y. Efficient merging and filtering algorithms for approximate string searches. In: IEEE 2008 Proceedings of the 24th International Conference on Data Engineering; 7–12 April 2008; Cancun, Mexico. New York, NY, USA: IEEE. pp. 257-266.
  • [15] Kim M, Whang K, Lee J, Lee M. n-gram/2l: A space and time efficient two-level n-gram inverted index structure. In: VLDB 2005 Proceedings of the 31st International Conference on Very Large Data Bases; 30 August–2 September 2005; Trondheim, Norway. New York, NY, USA: ACM. pp. 325-336.
  • [16] Li C, Wang B, Yang X. Vgram: improving performance of approximate queries on string collections using variablelength grams. In: VLDB 2007 Proceedings of the 33rd International Conference on Very Large Data Bases; 23–27 September 2007; Vienna, Austria. New York, NY, USA: ACM. pp. 303-314.
  • [17] Li H, Homer N. A survey of sequence alignment algorithms for next-generation sequencing. Brief Bioinform 2010; 11: 473-483.
Turkish Journal of Electrical Engineering and Computer Sciences-Cover
  • ISSN: 1300-0632
  • Yayın Aralığı: 6
  • Yayıncı: TÜBİTAK
Sayıdaki Diğer Makaleler

Theoretical analysis of doping management and its effects on power scaling

Mustafa Sadettin ÖZYAZICI, Fatih Omer İLDAY, Amira GURSEL TANDİROVİÇ, Parviz ELAHİ

An efficient LOF-based long-range correlation filter for the restoration of salt and pepper impulse corrupted digital images

Saudia SUBASH, Justin VARGHESE, Mohamed KHAN SAMIULLA, Krishnan NALLAPERUMAL, Bijoy BABU, Mohammed SAADI RAMADAN

Cascaded half-full-bridge PWM multilevel inverter configuration

Charles Ikechukwu ODEH

Power management using dynamic power state transitions and dynamic voltage frequency scaling controls in virtualized server clusters

Mohan Raj KUMAR VELAYUDHAN, Shriram RAGHUNATHAN

Threshold optimization according to the restricted Bayes criterion in decentralized detection problems

Suat BAYRAM, Hakan SOKU

A novel method of relieving congestion in hybrid deregulated market utilizing renewable energy sources

Joseph NESAMALAR DRUSILA JESLIN, Paramasivam VENKATESH, Sathiasamuel RAJA CHARLES

Enhanced control of a DFIG-based system by sliding-mode control method during network disturbances

Saeed ABAZARI, Sajad FARAJZADEH, Samad BOROUJENI TAGHIPOUR

An implantable microstrip antenna design for MICS-band biomedical applications

Adnan SONDAŞ, Mustafa Hikmet Bilgehan UÇAR

Optimal dispatchable DG allocation in a distribution network considering load growth with a mixed-PSO algorithm

Mehrdad ABEDI, Peyman KARIMYAN, Seyed Mohammad AHADI, Behrooz VAHIDI

Design of fir filters using exponential Hamming window family

Kemal AVCI