Fast bitwise pattern-matching algorithm for DNA sequences on modern hardware

We introduce a fast bitwise exact pattern-matching algorithm, which speeds up short-length pattern searches on large-sized DNA databases. Our contributions are two-fold. First, we introduce a novel exact matching algorithm designed specifically for modern processor architectures. Second, we conduct a detailed comparative performance analysis of bitwise exact matching algorithms by utilizing hardware counters. Our algorithmic technique is based on condensed bitwise operators and multifunction variables, which minimize register spills and instruction counts during searches. In addition, the technique aims to efficiently utilize CPU branch predictors and to ensure smooth instruction flow through the processor pipeline. Analyzing letter occurrence probability estimations for DNA databases, we develop a skip mechanism to reduce memory accesses. For comparison, we exploit the complete \textit{Mus musculus} sequence, a commonly used DNA sequence that is larger than 2 GB. Compared to five state-of-the-art pattern-matching algorithms, experimental results show that our technique outperforms the best algorithm even for the worst-case DNA pattern for our technique.

Fast bitwise pattern-matching algorithm for DNA sequences on modern hardware

We introduce a fast bitwise exact pattern-matching algorithm, which speeds up short-length pattern searches on large-sized DNA databases. Our contributions are two-fold. First, we introduce a novel exact matching algorithm designed specifically for modern processor architectures. Second, we conduct a detailed comparative performance analysis of bitwise exact matching algorithms by utilizing hardware counters. Our algorithmic technique is based on condensed bitwise operators and multifunction variables, which minimize register spills and instruction counts during searches. In addition, the technique aims to efficiently utilize CPU branch predictors and to ensure smooth instruction flow through the processor pipeline. Analyzing letter occurrence probability estimations for DNA databases, we develop a skip mechanism to reduce memory accesses. For comparison, we exploit the complete \textit{Mus musculus} sequence, a commonly used DNA sequence that is larger than 2 GB. Compared to five state-of-the-art pattern-matching algorithms, experimental results show that our technique outperforms the best algorithm even for the worst-case DNA pattern for our technique.

___

  • Bucak ˙I ¨O, Uslan V. Sequence alignment from the perspective of stochastic optimization: a survey. Turk J Electr Eng Co 2011; 19: 59–71.
  • Pehlivan ˙I, Orhan Z. Automatic knowledge extraction for filling in biography forms from Turkish texts. Turk J Electr Eng Co 2011; 19: 157–173.
  • Liu Y,Wu X, Hu X, Gao J, Wang C. Pattern matching with wildcards based on multiple suffix trees. In: IEEE International Conference on Granular Computing; 11–13 August 2012; Hangzhou, China. New York, NY, USA: IEEE. pp. 320–325.
  • Moraru I, Andersen DG. Exact pattern matching with feed-forward Bloom filters. ACM Journal of Experimental Algorithmics 2011; 16: 2.1:1–2.1:19.
  • Sheik SS, Aggarwal SK, Poddar A, Balakrishnan N, Sekar K. A fast pattern matching algorithm. J Chem Inf Comput Sci 2004; 44: 1251–1256.
  • Franek F, Jennings CG, Smyth WF. A simple fast hybrid pattern-matching algorithm. Journal of Discrete Algo- rithms 2007; 5: 682–695.
  • Baeza-Yates R, Gonnet GH. A new approach to text searching. Commun ACM 1992; 35: 74–82.
  • Navarro G, Raffinot M. Fast and flexible string matching by combining bit-parallelism and suffix automata. ACM Journal of Experimental Algorithmics 2000; 5: 4.
  • K¨ulekci O. A method to overcome computer word size limitation in bit-parallel pattern matching. Lect Notes Comput Sc 2008; 5369: 496–506.
  • Horspool RN. Practical fast searching in strings. Software Pract Exper 1980; 10: 501–506.
  • Karp RM, Rabin MO. Efficient randomized pattern-matching algorithms. IBM J Res Dev 1987; 31: 249–260.
  • Sunday DM. A very fast substring search algorithm. Commun ACM 1990; 33: 132–142.
  • Cantone D, Faro S, Giaquinta E. Bit-(parallelism): getting to the next level of parallelism. Lect Notes Comput Sc 2010; 6099: 166–177.
  • Durian B, Holub J, Peltola H, Tarhio J. Tuning BNDM with q-grams. In: Proceedings of the Workshop on Algorithm Engineering and Experiments; 2009; New York, NY, USA. New York, NY, USA: SIAM, pp. 29–37.
  • Fredriksson K, Grabowski S. Practical and optimal string matching. Lect Notes Comput Sc 2005; 3772: 376–387.
  • Zhang G, Zhu E, Mao L, Yin M. A bit-parallel exact string matching algorithm for small alphabet. Lect Notes Comput Sc 2009; 5598: 336–345.
  • K¨ulekci MO, Vitter JS, Xu B. Boosting pattern matching performance via k-bit filtering. Lect Notes Electr En 2010; 62: 27–33.
  • K¨ulekci O. On enumeration of DNA sequences. In: Proceedings of the ACM Conference on Bioinformatics, Com- putational Biology, and Biomedicine; 2012; Orlando, FL, USA. New York, NY, USA: ACM. pp. 442–449.
  • Boyer RS, Moore JS. A fast string searching algorithm. Commun ACM 1977; 20: 762–772.
  • Rosen KH. Discrete Mathematics and Its Applications. 6th ed. New York, NY, USA: McGraw-Hill, 2007.
  • Hennessy JH, Patterson DA. Computer Architecture: A Quantitative Approach. 4th ed. San Francisco, CA, USA: Morgan Kaufmann, 2006.
  • Appel AW, George L. Optimal spilling for CISC machines with few registers. ACM SIGPLAN Notices 2000; 36: 243–253.