A proposal of a hybrid model to predict the secondary protein structures based on amino acid sequences
Aim: Predicting the secondary structure of proteins from amino acid sequences is one of the most significant open problems in bioinformatics. High accuracy in determining the secondary structure is key to uncovering the 3D structure of proteins programmatically and to individual drug applications of programmable proteins. Success rates in predicting secondary structure (Q3 score) were around 0.60 when research in this area began and have now reached about 0.80.

Material and Methods: In this study, the secondary structure was predicted in three states (Helix, Strand and Turn). Artificial neural networks and machine learning algorithms were combined into a hybrid model, and a framework was developed. Amino acid sequences were digitized using the probability that pairs of amino acids occur together in sequences. These probabilities were calculated separately for each secondary structural element, and a cascade mean filter was used as a thresholding method to sharpen the differences between them. The resulting matrices were used to digitize the protein sequences. The secondary structure was first predicted pairwise (Helix-Strand, Helix-Turn and Strand-Turn), and a final decision among Helix, Strand and Turn was then reached with machine learning models.

Results: The success rates of the pairwise predictions of secondary structural elements were 0.797 for Helix-Strand, 0.848 for Helix-Turn and 0.829 for Strand-Turn, with an average of 0.824. In the proposed model, accuracy was 0.742 for Helix, 0.703 for Strand and 0.880 for Turn. The Q3 score was 0.775.
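The Q3 score and the per-state accuracies reported above can be illustrated with a minimal sketch. It assumes the common definition of Q3 as the fraction of residues whose three-state label is predicted correctly, and takes per-state accuracy as the fraction of residues of a given true state that are predicted correctly. The abstract does not spell out either formula. The single-letter labels `H`, `E` and `T`, the function names and the toy sequences are illustrative assumptions, not the study's code or data.

```python
from collections import Counter

# Assumed single-letter labels: H = Helix, E = Strand, T = Turn.
STATES = ("H", "E", "T")


def q3_score(true_states, predicted_states):
    """Fraction of residues whose three-state label is predicted correctly."""
    if len(true_states) != len(predicted_states):
        raise ValueError("sequences must have the same length")
    if len(true_states) == 0:
        raise ValueError("sequences must not be empty")
    correct = sum(t == p for t, p in zip(true_states, predicted_states))
    return correct / len(true_states)


def per_state_accuracy(true_states, predicted_states):
    """For each state, the fraction of its true residues predicted correctly."""
    totals = Counter(true_states)
    hits = Counter(t for t, p in zip(true_states, predicted_states) if t == p)
    # States absent from the true labels are skipped to avoid dividing by zero.
    return {s: hits[s] / totals[s] for s in STATES if totals[s]}


# Toy example (made-up labels, not data from the study).
true_labels = "HHHHEEETTT"
pred_labels = "HHHEEEETTH"

overall = q3_score(true_labels, pred_labels)
by_state = per_state_accuracy(true_labels, pred_labels)
print(overall, by_state)
```

In the toy example, 8 of 10 residues match, so Q3 is 0.8. The per-state values differ because each is normalized by the number of residues in that state rather than by the full sequence length.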