A Framework for Query Optimization Algorithms for Biological Data

Biological databases have grown significantly in recent years, as have the number of users and the rate of queries; some databases have now reached terabyte size. At the same time, the need to access these databases as quickly as possible is increasing. Computer scientists can help by organizing the data and the queries so that biologists can search existing information quickly. In this paper, a query model for DNA and protein sequence datasets is proposed. The model is intended to retrieve all similar protein or DNA sequences from a large database effectively and rapidly. A theoretical and conceptual framework is derived using query techniques from different applications. The results show that the query optimization algorithms reduce query processing time compared with a conventional query search algorithm.
