A novel Fibonacci hash method for protein family identification by using recurrent neural networks

Identifying and classifying protein families is one of the most significant problems in bioinformatics and protein studies. Specifying the family of a protein is essential because proteins are widely used in smart drug therapies, in the study of protein function, and, in some cases, in phylogenetic trees. Some sequencing techniques allow researchers to identify the biological similarities of protein families and functions. Yet determining these families with sequencing applications requires a huge amount of time. Thus, a classification system based on computers and artificial intelligence is needed to save time and avoid complexity in the protein classification process. To designate protein families with computer-aided systems, protein sequences need to be converted to numerical representations. In this paper, we provide a novel protein mapping method based on Fibonacci numbers and a hash table (FIBHASH). Each amino acid code is assigned a Fibonacci number based on its integer representation. These amino acid codes are then inserted into a hash table of size 20 to be classified with recurrent neural networks. To determine the performance of the proposed mapping method, we used accuracy, F1-score, recall, precision, and AUC as evaluation criteria. In addition, the results are compared with those of other protein mapping techniques, including EIIP, hydrophobicity, CPNR, Atchley factors, BLOSUM62, PAM250, binary one-hot encoding, and randomly encoded representations. The proposed method showed promising results, with an accuracy of 92.77% and an AUC score of 0.98.
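The abstract does not spell out the exact mapping, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' implementation. It assumes the 20 standard residues are taken in alphabetical order of their one-letter codes, that each is paired with the next distinct Fibonacci number (1, 2, 3, 5, ...), that the hash function is the key modulo the table size, and that collisions are resolved by linear probing. The names `fibonacci`, `build_table`, and `encode` are hypothetical.

```python
# Illustrative sketch only: residue ordering, hash function, and collision
# handling are assumptions, not details taken from the paper.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, alphabetical (assumed)
TABLE_SIZE = 20                        # table size stated in the abstract


def fibonacci(n):
    """Return the first n distinct Fibonacci numbers: 1, 2, 3, 5, 8, ..."""
    values, a, b = [], 1, 2
    for _ in range(n):
        values.append(a)
        a, b = b, a + b
    return values


def build_table(keys, size=TABLE_SIZE):
    """Insert keys into a fixed-size table using key % size and linear probing."""
    table = [None] * size
    slot_of = {}
    for key in keys:
        slot = key % size
        while table[slot] is not None:  # occupied: probe the next slot
            slot = (slot + 1) % size
        table[slot] = key
        slot_of[key] = slot
    return table, slot_of


# Residue -> Fibonacci number, then Fibonacci number -> table slot.
FIB = dict(zip(AMINO_ACIDS, fibonacci(len(AMINO_ACIDS))))
TABLE, SLOT_OF = build_table(FIB.values())


def encode(sequence):
    """Map a protein sequence to a list of table slots, one per residue."""
    return [SLOT_OF[FIB[residue]] for residue in sequence.upper()]
```

With 20 keys and 20 slots, every residue ends up in its own slot, so the encoded sequence is a list of integers in the range 0 to 19 that could be fed to a recurrent network. Non-standard residue codes such as `X` or `B` raise a `KeyError` in this sketch and would need separate handling.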

Turkish Journal of Electrical Engineering and Computer Sciences
  • ISSN: 1300-0632
  • Publication frequency: 6 issues per year
  • Publisher: TÜBİTAK