Sezer GÖREN UĞURDAĞ, Cem ÜNSALAN, Mahmut Şamil SAĞIROĞLU, Mehmet Yağmur GÖK

Highly accurate and sensitive short read aligner

Next-generation sequencing generates large numbers of short reads from DNA. This makes it difficultto process and store. Therefore, efficient sequence alignment and mapping techniques are needed in bioinformatics.Alignment and mapping are the basic steps involved in genetic data analysis. The Smith–Waterman (SW) algorithm, awell-known dynamic programming algorithm, is often used for this purpose. In this work, we propose to utilize Phredquality scores in Gotoh’s affine gap model to increase the accuracy and sensitivity of the SW algorithm. Hardwareplatforms such as FPGAs and GPUs are commonly used to solve computationally expensive problems. In this work,a hybrid PC-FPGA system is built where the SW algorithm based on the affine gap model with Phred quality scoresis implemented on the FPGA and a read compressor is implemented on the host PC. We compare our method withstate-of-the-art systems such as Bowtie, BWA, and the Kim–Olson FPGA-based system in terms of sensitivity, accuracy,and speed. Based on extensive experiments, we observed that our proposed method is more sensitive and accurate ascompared to other solutions.

PDF

___

[1] Ewing B, Hillier H, Wendl M, Green P. Base-calling of automated sequencer traces using Phred. I. Accuracy assessment. Genome Res 1998; 8: 175-185.
[2] Smith TF, Waterman MS. Identification of common molecular subsequences. J Mol Biol 1981; 147: 195-197.
[3] Gotoh O. An improved algorithm for matching biological sequences. J Mol Biol 1982; 162: 705-708.
[4] Burrows M, Wheeler DJ. A Block-Sorting Lossless Data Compression Algorithm. Palo Alto, CA, USA: Digital Equipment Corporation, 1994.
[5] Ferragina P, Manzini G. Opportunistic data structures with applications. In: Annual Symposium on Foundations of Computer Science; 12–14 November 2000; Washington, DC, USA. New York, NY, USA: IEEE. p. 309.
[6] Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 2009; 10: R25.
[7] Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 2009; 24: 1754-1760.
[8] Homer N, Merriman B, Nelson SF. BFAST: An alignment tool for large scale genome resequencing. PLoS One 2009; 4: e7767.
[9] Knodel O, Preusser TB, Spallek RG. Next generation massively parallel short-read mapping on FPGAs. In: International Conference on Application Specific Systems, Architectures and Processors; 11–14 September 2011; Santa Monica, CA, USA. New York, NY, USA: IEEE. pp. 195-201.
[10] Kim M, Olson CB. Hardware acceleration of short read mapping. In: International Symposium on FieldProgrammable Custom Computing Machines; 29 April–1 May 2012; Ontario, Canada. New York, NY, USA: IEEE. pp. 161-168.
[11] Yongchao L, Bertil S, Douglas LM. CUSHAW: A CUDA compatible short read aligner to large genomes based on the Burrows–Wheeler transform. Bioinformatics 2012; 28: 1830-1837.
[12] Hoang DT. A Systolic Array for the Sequence Alignment Problem. Technical Report CS-92-22. Providence, RI, USA: Brown University, 1992.
[13] Gok M, Yilmaz C. Efficient cell designs for systolic Smith Waterman implementation. In: International Conference on Field Programmable Logic and Applications; 28–30 August 2006; Madrid, Spain. New York, NY, USA: IEEE. pp. 889-892.
[14] Gok M, Unsalan C, G¨oren S, Sagiroglu MS. Programmable hardware based short read aligner using Phred quality scores. In: International Conference on Social Computing, 8–14 September 2013; Washington, DC, USA. New York, NY, USA: IEEE. pp. 864-867.
[15] Lee WP, Stromberg MP, Ward A, Stewart C, Garrison EP, Marth GT. MOSAIK: A hash-based algorithm for accurate next generation sequencing short-read mapping. PLoS One 2014; 9: e90581.
[16] Holtgrewe M. Mason – A Read Simulator for Second Generation Sequencing Data. Technical Report. Berlin, German: Freie Universit¨at Berlin, 2010.
[17] Li H, Ruan J, Durbin R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res 2008; 18: 1851-1858.