COMPARISON OF THE SUCCESS OF KNN WITH THE NEWLY PROPOSED K-SPLIT METHOD AND STRATIFIED CROSS VALIDATION ON REMOTE HOMOLOGUE PROTEIN DETECTION

In this study, the remote homologous protein detection problem, a problem from the field of bioinformatics that has made a great contribution to the field of medicine, is addressed. Protein sequences taken from the SCOP database, an important and widely used protein database, were tested for remote homologue protein detection. Feature vectors were obtained from the protein sequences using the bag-of-words model, and these feature vectors were classified with the k-nearest neighbor (kNN) algorithm. In this classification, the Bray-Curtis, Euclidean, Minkowski, Dice, Jaccard, Chebyshev, Cosine, Sokal-Sneath, Correlation, Matching coefficient, Rogers-Tanimoto, Sokal-Michener, Canberra, Hamming, Kulczynski, and Russell-Rao distances were used with the kNN classifier for remote homologue protein detection. Two new methods are proposed to prevent the imbalanced data problem: the first is a special k-fold value for stratified cross validation, and the second is a novel k-split method. It is observed that the kNN algorithm with the Bray-Curtis distance shows the most successful performance, achieving 98.9% accuracy and a 77% ROC score with cross validation using the special k-fold value, and 83.8% accuracy and a 92% ROC score with the novel k-split method.
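A minimal sketch of this pipeline is given below, assuming scikit-learn; it is an illustration rather than the exact experimental setup of the study. Bag-of-words feature vectors are built from overlapping 3-mers of a few toy protein sequences, classified with kNN using the Bray-Curtis distance, and evaluated with stratified k-fold cross validation. The sequences, labels, 3-mer window, and fold count are illustrative assumptions, and the proposed special k-fold value and novel k-split method are not reproduced here.

# Minimal sketch (not the authors' exact pipeline): bag-of-words 3-mer features
# from toy protein sequences, kNN with the Bray-Curtis distance, and
# stratified k-fold cross validation.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Toy protein sequences and binary labels (1 = member of the target class);
# these values are illustrative assumptions, not data from the study.
sequences = ["MKTAYIAKQR", "MKLVINGKTL", "GDVEKGKKIF", "GDAAKGEKEF"]
labels = [1, 1, 0, 0]

# Bag-of-words over overlapping 3-mers (character n-grams of each sequence).
vectorizer = CountVectorizer(analyzer="char", ngram_range=(3, 3))
X = vectorizer.fit_transform(sequences).toarray()

# kNN with the Bray-Curtis distance; brute-force search is used so that any
# SciPy distance name can be passed through the metric argument.
knn = KNeighborsClassifier(n_neighbors=1, metric="braycurtis", algorithm="brute")

# Stratified k-fold keeps the class ratio in every fold; n_splits=2 only because
# the toy data set is tiny (the study's special k-fold value is chosen differently).
cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
scores = cross_val_score(knn, X, labels, cv=cv, scoring="accuracy")
print("fold accuracies:", scores)

Because the brute-force kNN accepts SciPy distance-metric names, changing the metric argument (for example to "jaccard", "canberra", or "chebyshev") is enough to try many of the other distances compared in the study.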
