Analysis of Protein Data with Discrete Wavelet Transform

Biological databases hold large amounts of data generated by genomic and proteomic studies. Analyzing these data contributes substantially to understanding metabolic disorders in organisms and to advancing drug discovery. Because they save time and cost, machine learning and data analysis methods are frequently used for this purpose. The effectiveness of these methods depends on appropriate parameter selection and on how the protein sequences are encoded; incorporating the physicochemical properties of amino acids improves the performance of the algorithm used. Phylogenetic analysis is one of the best methods for visualizing relationships between species. In this study, the wavelet transform, a method used in digital signal analysis, was adapted to protein sequences. Using the wavelet transform, the genetic similarity between the SOD1 protein sequences of 15 species was determined with the Weighted Pair Group Method with Arithmetic Mean (WPGMA). The results were also compared with the phylogenetic tree obtained from the Jukes-Cantor (JC) distance, which is based on the genetic distances between proteins, and the comparison demonstrated the effectiveness of wavelet analysis in revealing molecular-level relationships between species. Constructing the phylogenetic tree took 2.0711178 s with the wavelet transform and 2.20329 s with the Jukes-Cantor distance. The shorter construction time of the wavelet-based approach relative to the existing JC method is therefore expected to be an advantage in big data analysis.

Keywords:

Wavelet Transform, Protein Sequences, Phylogenetics, Classification
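
The pipeline summarized in the abstract (encode each protein numerically, apply a discrete wavelet transform, compare species, and cluster with WPGMA) can be illustrated with a minimal Python sketch. Several choices here are assumptions rather than details taken from the study: the Kyte-Doolittle hydropathy scale as the physicochemical encoding, the Haar wavelet with three decomposition levels, the mean energy per sub-band as the feature vector, Euclidean distance between feature vectors, and the 20-state Jukes-Cantor formula for amino acids. The sequences are invented toy strings, not SOD1 data, and all function names are hypothetical.

```python
import itertools
import math

# Kyte-Doolittle hydropathy index per amino acid (assumed encoding for this sketch).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def haar_dwt(signal, levels):
    """Multi-level Haar DWT; returns [approx_L, detail_L, ..., detail_1]."""
    approx = list(signal)
    details = []
    for _ in range(levels):
        if len(approx) % 2:  # pad odd lengths by repeating the last sample
            approx.append(approx[-1])
        pairs = list(zip(approx[0::2], approx[1::2]))
        details.insert(0, [(a - b) / math.sqrt(2) for a, b in pairs])
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    return [approx] + details


def wavelet_features(seq, levels=3):
    """Encode a protein by hydropathy, then keep the mean energy of each sub-band."""
    bands = haar_dwt([KD[aa] for aa in seq], levels)
    return [sum(c * c for c in band) / len(band) for band in bands]


def wpgma(names, dist):
    """WPGMA on a dict {frozenset({a, b}): distance}; returns a nested-tuple tree."""
    clusters = list(names)
    d = dict(dist)
    while len(clusters) > 1:
        # Merge the closest pair of current clusters.
        a, b = min(itertools.combinations(clusters, 2),
                   key=lambda pair: d[frozenset(pair)])
        merged = (a, b)
        for c in clusters:
            if c not in (a, b):
                # WPGMA update: plain average of the two merged clusters' distances.
                d[frozenset((merged, c))] = (
                    d[frozenset((a, c))] + d[frozenset((b, c))]
                ) / 2
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
    return clusters[0]


def jc_protein_distance(s1, s2):
    """20-state Jukes-Cantor distance for two aligned sequences ('-' marks a gap)."""
    sites = [(x, y) for x, y in zip(s1, s2) if x != "-" and y != "-"]
    p = sum(x != y for x, y in sites) / len(sites)  # fraction of differing sites
    return -(19 / 20) * math.log(1 - (20 / 19) * p)


# Invented toy sequences, for illustration only.
toy = {"sp1": "MKVLAAGIVG", "sp2": "MKVLAAGLVG", "sp3": "DDEEKRNQST"}
features = {name: wavelet_features(seq) for name, seq in toy.items()}
distances = {
    frozenset((a, b)): math.dist(features[a], features[b])
    for a, b in itertools.combinations(toy, 2)
}
tree = wpgma(list(toy), distances)
print(tree)
print(jc_protein_distance(toy["sp1"], toy["sp2"]))
```

Summarizing each sub-band by its mean energy gives vectors of equal length for sequences of different lengths, so no alignment is needed before clustering. The JC distance, by contrast, requires aligned sequences and is undefined once the fraction of differing sites reaches 0.95.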
