System designs to perform bioinformatics sequence alignment

The emerging field of bioinformatics uses computing as a tool to understand biology. Biological data of organisms (nucleotide and amino acid sequences) are stored in databases that contain billions of records. In order to process the vast amount of data in a reasonable time, high-performance analysis systems are developed. The main operation shared by the analysis tools is the search for matching patterns between sequences of data (sequence alignment). In this paper, we present 2 systems that can perform pairwise and multiple sequence alignment operations. Through the optimized design methods, proposed systems achieve up to 3.6 times more performance compared to the previous designs. The proposed systems are modeled with VHDL; these models are simulated and then mapped on a Mezzanine card that contains an FPGA chip. The systolic processing core implemented on the Mezzanine card processes the data obtained from the PC and sends the results back to the PC. The practical functionality of the systems is tested and verified by comparing the system results with pure software results.

System designs to perform bioinformatics sequence alignment

The emerging field of bioinformatics uses computing as a tool to understand biology. Biological data of organisms (nucleotide and amino acid sequences) are stored in databases that contain billions of records. In order to process the vast amount of data in a reasonable time, high-performance analysis systems are developed. The main operation shared by the analysis tools is the search for matching patterns between sequences of data (sequence alignment). In this paper, we present 2 systems that can perform pairwise and multiple sequence alignment operations. Through the optimized design methods, proposed systems achieve up to 3.6 times more performance compared to the previous designs. The proposed systems are modeled with VHDL; these models are simulated and then mapped on a Mezzanine card that contains an FPGA chip. The systolic processing core implemented on the Mezzanine card processes the data obtained from the PC and sends the results back to the PC. The practical functionality of the systems is tested and verified by comparing the system results with pure software results.

___

  • Speed-up CLUSTALW Speed-up sequences (s) (s) (s) 100 0.859 1 28 828 08 200 188 6 03 672 07 400 500 1 24 1500 00 600 688 2 15 40625 88 800 969 0 16 6391 87 18 iterations. Note that Table 2 contains only hardware-related results. In general, the cited previous work does not include any information about performance results of the systems for practical usage. However, the end user is more interested in the wall-clock time required to get the alignment results than the hardware execution times. Therefore, wall-clock time required for alignments is also included in this work. Table 3 presents overall performances for linear and affine gap penalty alignment systems. Here, the first column is the database size to be scanned while the other columns present the performance of the proposed system and the software-only solution. As seen from Table 3, the proposed system performance increases as the database gets larger. For example, the proposed system scans a 100 MB database in less than 30 s, while the software-only application uses over 90 min to do the same operation.
  • Performance evaluations of the multiple sequence alignment systems Proposed multiple sequence alignment designs are modeled in VHDL and mapped on a Xilinx Virtex II XC2V6000 FPGA platform. There are 185 MSA cells placed on the FPGA. Each cell includes 91 flip flops and 354 LUTs mapped on 188 FPGA slices. According to the synthesis results, the clock frequency can be set to 57 MHz at maximum. Therefore, performance of the multiple sequence alignment system is 10.54 GCUPS. Table 4 presents how much the first step of the CLUSTALW algorithm is accelerated. To achieve this information, 100 to 800 sequences with 185 residues were aligned with FPGA-supported CLUSTALW and pure CLUSTALW. As is seen, the first step is accelerated over 80 times as the number of sequences surpasses 600. However, this acceleration is only for the first step of CLUSTALW, and the overall gain is still obscure. A realistic prediction can be made by using Amdahl’s Law [29]. According to Figure 2, about 80% of overall time is spent on the first step. Because of that, the hardware can achieve a maximum of 5 times of acceleration over the software-only solution. For this reason, the first steps are compared in Table 4, as it was made in the previous work. The proposed design is compared with a similar design that aimed to accelerate CLUSTALW, presented in [17]. Table 4 presents results directly obtained from [17], since they used exactly the same FPGA platform. Table 4 also compares the speed of the aforementioned reference design and proposed design. In [17], the authors did not specify the names or the sizes of the globin sequences used to test their system. On the other hand, there exists approximately 8000 different globin sequences on popular databases. A rough examination of these sequences show that these sequences are 140 to 150 letters long in general. Our tests are performed on protein sequences 185 letters long obtained from the PIR database [30]. Conclusion
  • In this work, system designs to perform pairwise and multiple sequence alignment operations are presented. The proposed systems include a host PC and an FPGA Mezzanine card. Through optimizations of the designs, smaller hardware usage and higher execution speed are achieved. The proposed pairwise alignment system can extract most similar sequences in a biological database to a query sequence. Performance of this system is up to 5 times higher than those of reference designs. The system scans a 100 MB database in less than 30 s, while the software-only solution performs the same operation in over 90 min. The proposed MSAS accelerates the slowest step of the CLUSTALW algorithm. The MSAS design performs the first step over 8 times faster than the reference designs for 16-bit precision. In general, if the length of sequences exceeds the number of cells, then additional iterations are needed to complete the alignment. On the other hand, the current state of the art overcomes this problem, since the capacity of the FPGAs has improved to fit sequences with several thousand residues. To justify that, systems are also synthesized for the Xilinx Virtex-6 XC6VLX760 platform. Results indicate that the maximum number of pairwise alignment cells can be 3764 with 225 MHz clock frequency, and the maximum number of multiple alignment cells can be 1928 with 218 MHz clock frequency for 16-bit precision. Therefore, additional iterations are not necessary considering the length of protein sequences in databases. The performance with this state-of-the-art FPGA is expected to be about 100 GCUPS. Acknowledgments
  • This work was supported by the Scientific and Technological Research Council of Turkey (T ¨ UB˙ITAK) under Grant No: 105E075. References National Institutes of Health, Working Definition of Bioinformatics and Computational Biology, Bethesda, NIH, http://www.bisti.nih.gov/CompuBioDef.pdf, 2000. D.W. Mount, Bioinformatics: Sequence and Genome Analysis, Cold Spring Harbor, Cold Spring Harbor Laboratory Press, 2004. National Center for Biotechnology Information, Homepage, http://www.ncbi.nlm.nih.gov/. National Center for Biotechnology Information, GenBank Overview, http://www.ncbi.nlm.nih.gov/genbank/. W.R. Pearson, D.J. Lipman, “Improved tools for biological sequence comparison”, Proceedings of the National Academy of Sciences, Vol. 85, pp. 2444–2448, 1988. S.F. Altschul, W. Gish, W. Miller, E.W. Myers, D.J. Lipman, “Basic local alignment search tool”, Journal of Molecular Biology, Vol. 215, pp. 403–410, 1990. T.F. Smith, M.S. Waterman, “Identification of common molecular subsequences”, Journal of Molecular Biology, Vol. 147, pp. 195–197, 1981.
  • R. Durbin, S. Eddy, A. Krogh, G. Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge, Cambridge University Press, 2002.
  • S.R. Eddy, “Profile hidden Markov models”, Bioinformatics, Vol. 14, pp. 755–763, 1998.
  • J.D. Thompson, D.G. Higgins, T.J. Gibson, “ClustalW: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice”, Nucleic Acids Research, Vol. 22, pp. 4673–4680, 1994.
  • C. Notredame, D.G. Higgins, J. Heringa, “T-Coffee: a novel method for fast and accurate multiple sequence alignment”, Journal of Molecular Biology, Vol. 302, pp. 205–217, 2000.
  • D.T. Hoang, “Searching genetic databases on Splash 2”, Proceedings of IEEE Workshop on FPGAs for Custom Computing Machines, pp. 185–191, 1993.
  • Xilinx home page, http://www.xilinx.com. Y. Yamaguchi, T. Maruyama, A. Konagaya, “High speed homology search with FPGAs”, Pacific Symposium on Biocomputing, pp. 271–282, 2002.
  • T.F. Oliver, B. Schmidt, D.L. Maskell, “Hyper customized processors for bio-sequence database scanning on FPGAs”, Proceedings of the 2005 ACM/SIGDA 13th International Symposium on Field-Programmable Gate Arrays, pp. 229–237, 2005.
  • X. Lin, Z. Peiheng, B. Dongbo, F. Shengzhong, S. Ninghui, “To accelerate multiple sequence alignment using FPGAs”, Eighth International Conference on High-Performance Computing in Asia-Pacific Region, pp. 5–180, 200 T.F. Oliver, B. Schmidt, D. Nathan, R. Clemens, D.L. Maskell, “Multiple sequence alignment on an FPGA”, Proceedings of the 11th International Conference on Parallel and Distributed Systems, Vol. 2, pp. 326–330, 2005. M. G¨ ok, C ¸ . Yılmaz, “Efficient cell designs for systolic Smith–Waterman implementations”, Proceedings of the International Conference on Field Programmable Logic and Applications, pp. 889–892, 2006.
  • M. G¨ ok, C ¸ . Yılmaz, “Hardware designs for local alignment of protein sequences”, Lecture Notes in Computer Science, Vol. 4263M pp. 277–285, 2006.
  • C ¸ . Yılmaz, M. G¨ ok, “An optimized system for multiple sequence alignment”, Proceedings of the International Conference on Reconfigurable Computing and FPGAs, pp. 178–182, 2009.
  • National Center for Biotechnology Information, GenBank Sequences from Pandemic (H1N1) 2009 Viruses, http://www.ncbi.nlm.gov/genomes/FLU/SwineFlu.html.
  • S.B. Needleman, C.D. Wunsch, “A general method applicable to the search for similarities in the amino acid sequence of two proteins”, Journal of Molecular Biology, Vol. 48, pp. 443–453, 1970.
  • M.O. Dayhoff, R.M. Schwartz, B.C. Orcutt, “A model of evolutionary change in proteins”, Atlas of Protein Sequence and Structure, Vol. 5, pp. 345–352, 1978.
  • S. Henikoff, J.G. Hanikoff, “Amino acid substitution matrices from protein blocks”, Proceedings of the National Academy of Sciences, Vol. 89, pp. 10915–10919, 1992.
  • ˙I. ¨ O. Bucak, V. Uslan, “Sequence alignment from the perspective of stochastic optimization: a survey”, Turkish Journal of Electrical Engineering & Computer Sciences, Vol. 19, pp. 157–173, 2011.
  • D.G. Higgins, P.M. Sharp, “Clustal: a package for performing multiple sequence alignment on a microcomputer”, Gene, Vol. 73, pp. 237–244, 1988.
  • ClustalW: Multiple Sequence Alignment Home Page, http://www.clustal.org. Alpha Data Home Page, http://www.alpha-data.com. G.M. Amdahl, “Validity of the single processor approach to achieving large scale computing capabilities”, Proceedings of the Spring Joint Computer Conference, pp. 483–485, 1967.
  • PIR - Protein Information Resource Homepage, http://pir.georgetown.edu.