Analysis of Protein Data with Discrete Wavelet Transform
Biological databases contain large amounts of data as a result of genomic and proteomic studies. Analyzing these data contributes substantially to understanding metabolic disorders in organisms and to advancing drug discovery. Because they save time and cost, machine learning and data analysis methods are frequently used for this purpose. The effectiveness of these methods depends on appropriate parameter selection and also on how the protein sequences are encoded; including the physicochemical properties of amino acids in the encoding improves the performance of the algorithm used. Phylogenetic analysis is one of the best methods for visualizing relationships between species. In this study, an adaptation of the wavelet transform, a method used in digital signal analysis, to protein sequences was designed. Using the wavelet transform, the genetic similarity between the SOD1 protein sequences of 15 species was determined with the Weighted Pair Group Method with Arithmetic Mean (WPGMA). The results were then compared with the phylogenetic tree obtained from the Jukes-Cantor (JC) distance, which is based on the genetic distances between proteins, and the comparison demonstrated the effectiveness of wavelet analysis in revealing molecular-level relationships among the species. Constructing the phylogenetic tree took 2.0711178 s with the wavelet transform and 2.20329 s with Jukes-Cantor. The shorter tree construction time of the wavelet-based approach relative to the existing JC method is therefore expected to be an advantage in big data analyses.
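The pipeline described above (numeric encoding, wavelet transform, WPGMA clustering) can be illustrated with a minimal sketch. It is not the authors' implementation, and several choices in it are assumptions that the abstract does not specify: the Kyte-Doolittle hydropathy index as the physicochemical encoding, the Haar wavelet, two decomposition levels, zero-padding to a common length, and Euclidean distance between approximation coefficients. The function names (`haar_approx`, `wavelet_features`, `wpgma`, `jukes_cantor_protein`) are hypothetical. The last function shows the standard 20-state Jukes-Cantor correction for already aligned protein sequences, as a point of comparison.

```python
import math
from itertools import combinations

# Kyte-Doolittle hydropathy index (assumed encoding; the abstract only
# says that physicochemical properties of amino acids are used).
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
      "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
      "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
      "Y": -1.3, "V": 4.2}


def haar_approx(signal, levels):
    """Return the Haar approximation coefficients after `levels` steps."""
    approx = list(signal)
    for _ in range(levels):
        if len(approx) % 2:  # odd length: repeat the last sample
            approx.append(approx[-1])
        approx = [(approx[i] + approx[i + 1]) / math.sqrt(2.0)
                  for i in range(0, len(approx), 2)]
    return approx


def wavelet_features(sequences, levels=2):
    """Encode each sequence numerically, zero-pad, and apply the Haar DWT."""
    width = max(len(seq) for seq in sequences)
    feats = []
    for seq in sequences:
        signal = [KD[aa] for aa in seq] + [0.0] * (width - len(seq))
        feats.append(haar_approx(signal, levels))
    return feats


def wpgma(names, feats):
    """Build a nested-tuple tree by WPGMA on Euclidean feature distances."""
    clusters = {i: name for i, name in enumerate(names)}
    dist = {frozenset((i, j)): math.dist(feats[i], feats[j])
            for i, j in combinations(clusters, 2)}
    next_id = len(names)
    while len(clusters) > 1:
        a, b = sorted(min(dist, key=dist.get))  # closest pair of clusters
        for k in clusters:
            if k not in (a, b):
                # WPGMA update: unweighted mean of the two old distances
                dist[frozenset((next_id, k))] = (
                    dist[frozenset((a, k))] + dist[frozenset((b, k))]) / 2.0
        dist = {key: d for key, d in dist.items()
                if a not in key and b not in key}
        clusters[next_id] = (clusters.pop(a), clusters.pop(b))
        next_id += 1
    return next(iter(clusters.values()))


def jukes_cantor_protein(seq_a, seq_b):
    """20-state Jukes-Cantor distance for two aligned, equal-length sequences."""
    p = sum(x != y for x, y in zip(seq_a, seq_b)) / len(seq_a)
    return -(19.0 / 20.0) * math.log(1.0 - (20.0 / 19.0) * p)


if __name__ == "__main__":
    # Toy sequences, not real SOD1 data.
    names = ["x", "y", "z"]
    seqs = ["AAAA", "AAAV", "RRRR"]
    print(wpgma(names, wavelet_features(seqs)))
```

One design difference is worth noting: in this sketch the wavelet features are computed independently for each sequence, whereas the Jukes-Cantor distance requires the sequences to be aligned first. Whether that difference explains the timing gap reported above is not something the abstract states.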