Prediction of human protein functions with protein mapping techniques and deep learning model

Protein functions are important for understanding the molecular mechanisms of living organisms. Protein structures are used when determining the functions of proteins. Knowledge of protein function is mostly used to annotate uncharacterized protein sequences, to understand the cellular mechanisms of living organisms, to identify functional changes in genes or proteins that cause disease, and to develop new approaches for preventing, treating, and diagnosing diseases. Protein functions can be determined effectively by experimental methods. However, these methods are time-consuming and involve many chemical processes, which makes them slow and costly. In addition, some proteins whose functional structure and sequence are known still cannot be annotated because of the limitations of experimental processes. For these reasons, computational approaches are needed, and they generally rely on artificial intelligence algorithms. To predict protein functions with artificial intelligence methods, protein sequences must first be converted to numerical form with a mapping method. In this study, gene ontology-based protein functions were predicted using several protein mapping techniques. The study consists of four stages: obtaining protein data, numerically mapping protein sequences, classifying protein functions, and determining the performance of the protein mapping techniques. In the biological process category, the best accuracy and AUC score were obtained with the PAM250 protein mapping technique, at 69% and 88%, respectively. In the cellular component category, the best accuracy and AUC were obtained with the FIBHASH protein mapping technique, at 64% and 89%, respectively. In the molecular function category, the best result was obtained with FIBHASH, with a 64% AUC score and 89% accuracy. The combined use of the proposed artificial intelligence method and numerical protein mapping techniques was observed to play an effective role in predicting protein functions.
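
The numerical mapping and evaluation steps summarized above can be illustrated with a minimal Python sketch. It is not the study's implementation. The integer mapping below is a hypothetical stand-in for techniques such as PAM250 or FIBHASH, whose actual values are not given here, and the names `map_sequence`, `accuracy`, and `auc` are chosen only for this example. The AUC function assumes a binary label (one GO term present or absent), which the abstract does not specify.

```python
from typing import Dict, List, Sequence

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical stand-in mapping: each residue becomes its 1-based index.
# This is NOT PAM250 or FIBHASH; it only illustrates the mapping step.
INTEGER_MAP: Dict[str, int] = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}


def map_sequence(sequence: str, length: int) -> List[int]:
    """Map residues to integers, then truncate or zero-pad to a fixed length."""
    values = [INTEGER_MAP.get(aa, 0) for aa in sequence.upper()]  # unknown residue -> 0
    values = values[:length]
    return values + [0] * (length - len(values))


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Binary AUC: chance that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    print(map_sequence("MKV", 5))
    print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
    print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

Fixed-length integer vectors of this kind are the sort of input a sequence classifier expects. Accuracy is computed on hard predictions, while AUC is computed on the classifier's scores, which is why the two metrics can differ as much as they do in the reported results.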