A novel genome analysis method with the entropy-based numerical technique using pretrained convolutional neural networks

Classifying DNA sequence regions as exons or introns is a common problem in genome analysis. Both the mapping technique used to digitize the sequences and the method used for feature extraction directly affect the solution of this problem. Existing mapping techniques are insufficient for detecting coding and noncoding regions in some genomes, because representing each base in a DNA sequence with an integer does not fully reflect the structure of the original sequence. The entropy-based mapping technique overcomes this problem because it increases the distinction rate of exon regions and better reflects the complexity of DNA sequences. Moreover, in the literature, features are extracted using various statistical techniques, and the statistical features to extract are chosen according to the system designer's experience. The second approach proposed in this study is to carry out feature extraction using transfer learning, in which pretrained convolutional neural network models extract features automatically and independently of the data set. In this study, we propose a new method that classifies DNA sequences as exon or intron by combining these two approaches. In the first approach, the entropy-based numerical technique is used for the numerical representation of DNA sequences. In the second approach, transfer learning is used to extract features. The obtained features are then classified by a support vector machine and the k-nearest neighbors algorithm. The classification achieved an accuracy of 97.8%. The performance of the proposed method was compared with that of other numerical mapping techniques and feature extraction methods, and the results showed that the proposed method outperformed them.
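To make the idea of an entropy-based numerical mapping concrete, the following is a minimal sketch. It is not the exact formulation used in this study: as a simplifying assumption, each non-overlapping codon is replaced by its plain Shannon entropy term, -p log2(p), where p is the relative frequency of that codon within the sequence. The function name `codon_entropy_map` and the parameter `k` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def codon_entropy_map(seq, k=3):
    """Map each non-overlapping k-mer of seq to the value -p * log2(p).

    Illustrative assumption: p is the relative frequency of the k-mer
    within this one sequence. This is a simplified stand-in, not the
    mapping defined in the study.
    """
    seq = seq.upper()
    # Split into non-overlapping k-mers, dropping an incomplete tail.
    kmers = [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]
    if not kmers:
        return []
    counts = Counter(kmers)
    total = len(kmers)
    # One entropy term per distinct k-mer.
    weights = {
        kmer: -(n / total) * math.log2(n / total)
        for kmer, n in counts.items()
    }
    # Replace every k-mer in sequence order by its entropy term.
    return [weights[kmer] for kmer in kmers]


# Example: ATG occurs three times (p = 0.75), CCC once (p = 0.25).
signal = codon_entropy_map("ATGATGATGCCC")
```

Unlike a fixed integer assigned to each base, the resulting values depend on how often each codon occurs in the sequence, so the numerical signal carries information about the composition of the sequence as well as its order. A signal of this kind could then be converted into an image-like representation and passed to a pretrained network for feature extraction.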
