Classification of Transcription Factor DNA in the Brassica Plant Species by Deep Learning

Identifying DNA and protein types and examining their similarities remain challenging problems in this field of research. As a result, both the available data and its use are limited. In this study, we combine the data-processing power of computer science with biology. We classified the DNA sequences of transcription factor proteins found in Brassica plants, which belong to the crucifer family, and identified the DNA associated with the synthesis of transcription factor proteins in these plants. We compiled the dataset from the Plant Transcription Factor Database (PlantTFDB). In preprocessing we used a code dictionary structure, and we built a fast and accurate model using Bidirectional LSTM and Bidirectional GRU networks. The model achieves 90.40% test accuracy and 86.75% accuracy under 5-fold cross-validation. Using LSTM in the layer with fewer units and GRU in the layer with more units shortened the training time. Although the model was prepared to classify the transcription factor DNA of Brassica plants, we expect it to be successful to some degree on the transcription factor DNA of other plants as well. The model is a notable new contribution to the literature in its field of study.

Keywords:

Bioinformatics, DNA classification, Deep learning, Bidirectional, LSTM, GRU
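
The abstract mentions a code dictionary in preprocessing but does not specify its form. As a minimal sketch only, one common way to turn DNA into integer tokens for a recurrent network is shown below. It assumes non-overlapping 3-mers over the alphabet A, C, G, T, with 0 reserved for padding and for k-mers containing ambiguous bases such as N. The names `build_code_dictionary` and `encode_sequence`, the choice of k = 3, and the padding scheme are illustrative assumptions, not the authors' implementation.

```python
from itertools import product

NUCLEOTIDES = "ACGT"  # assumption: only unambiguous bases get their own codes


def build_code_dictionary(k=3):
    """Map every k-mer over ACGT to a positive integer (0 is kept for padding/unknown)."""
    kmers = ("".join(p) for p in product(NUCLEOTIDES, repeat=k))
    return {kmer: index for index, kmer in enumerate(kmers, start=1)}


def encode_sequence(seq, code_dict, k=3, max_len=None):
    """Encode a DNA string as integer codes of its non-overlapping k-mers.

    K-mers that are not in the dictionary (for example, ones containing N)
    become 0. If max_len is given, the result is truncated or zero-padded
    on the right to exactly max_len tokens.
    """
    seq = seq.upper()
    tokens = [code_dict.get(seq[i:i + k], 0) for i in range(0, len(seq) - k + 1, k)]
    if max_len is not None:
        tokens = tokens[:max_len] + [0] * (max_len - len(tokens))
    return tokens


codes = build_code_dictionary(3)
encoded = encode_sequence("ATGGCNTAA", codes, max_len=5)
print(len(codes), encoded)
```

Fixed-length integer sequences of this kind could then be passed to an embedding layer followed by bidirectional recurrent layers; the layer sizes and the LSTM/GRU ordering would follow whatever the full paper specifies.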

