Determination of highly effective attributes in fold level classification of proteins

This paper aims to determine which protein features or attributes are most significant for classifying proteins according to their folds. Proteins in the database used in this study are represented by six feature groups, called attributes, and by a 125-dimensional feature vector. Representing proteins with such high-dimensional vectors increases the computational load of classification and lengthens processing time. This study proposes dimension reduction as a remedy. Two approaches are used to identify the attributes and features that yield high classification performance. In the first approach, each of the six attributes is tested separately to determine which one gives the highest performance. In the second approach, the most significant of the 125 features are selected using the Divergence Analysis method. A classical classifier, KNN (K-nearest neighbor), and two artificial neural network models, GAL (Grow and Learn) and SOM (Self-Organizing Map), are used as classifiers, and classification performance is analyzed on the reduced-dimension datasets.
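The two approaches can be illustrated with a minimal sketch on synthetic data. Everything here is an assumption made for illustration: the data are random, the four groups `g0` to `g3` are hypothetical stand-ins for the six attributes, and the per-feature score is a simple between-class to within-class variance ratio used in place of the Divergence Analysis method, whose exact formulation is not given in this abstract. Only KNN is shown; GAL and SOM are not implemented.

```python
import numpy as np

def knn_predict(train_x, train_y, test_x, k=3):
    """Classify each test vector by majority vote among its k nearest training vectors."""
    # Pairwise Euclidean distances, shape (n_test, n_train)
    dist = np.linalg.norm(test_x[:, None, :] - train_x[None, :, :], axis=2)
    nearest = np.argsort(dist, axis=1)[:, :k]
    votes = train_y[nearest]
    return np.array([np.bincount(row).argmax() for row in votes])

def accuracy(train_x, train_y, test_x, test_y, cols, k=3):
    """KNN accuracy when only the feature columns in `cols` are used."""
    pred = knn_predict(train_x[:, cols], train_y, test_x[:, cols], k)
    return float(np.mean(pred == test_y))

def separability_scores(x, y):
    """Per-feature ratio of between-class to within-class scatter.

    This is a stand-in criterion (assumption), not the paper's Divergence Analysis.
    """
    overall_mean = x.mean(axis=0)
    between = np.zeros(x.shape[1])
    within = np.zeros(x.shape[1])
    for label in np.unique(y):
        xc = x[y == label]
        class_mean = xc.mean(axis=0)
        between += len(xc) * (class_mean - overall_mean) ** 2
        within += ((xc - class_mean) ** 2).sum(axis=0)
    return between / (within + 1e-12)

# Synthetic stand-in data: 3 classes, 12 features, only features 0..2 informative.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(3), 40)
data = rng.normal(size=(len(labels), 12))
data[:, :3] += 3.0 * labels[:, None]

# Random train/test split.
order = rng.permutation(len(labels))
train_idx, test_idx = order[:80], order[80:]
train_x, train_y = data[train_idx], labels[train_idx]
test_x, test_y = data[test_idx], labels[test_idx]

# Approach 1: evaluate each (hypothetical) attribute group on its own.
groups = {"g0": [0, 1, 2], "g1": [3, 4, 5], "g2": [6, 7, 8], "g3": [9, 10, 11]}
group_acc = {name: accuracy(train_x, train_y, test_x, test_y, cols)
             for name, cols in groups.items()}

# Approach 2: rank individual features on the training set and keep the best ones.
scores = separability_scores(train_x, train_y)
top = np.argsort(scores)[::-1][:3]
reduced_acc = accuracy(train_x, train_y, test_x, test_y, top)

print("accuracy per group:", group_acc)
print("selected features:", sorted(top.tolist()))
print("accuracy on reduced feature set:", reduced_acc)
```

In this toy setup the classifier works with 3 columns instead of 12, which mirrors the motivation for dimension reduction: fewer distance computations per query while keeping the features that separate the classes.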
