Performance Analysis of Machine Learning and Bioinformatics Applications on High Performance Computing Systems

Nowadays, it is becoming increasingly important to run algorithms that extract meaningful information from big data and make smart decisions on the most efficient and most suitable computational resources. In this paper, a comparative performance analysis is presented for various machine learning and bioinformatics programs, including scikit-learn, TensorFlow, WEKA, libSVM, ThunderSVM, GMTK, PSI-BLAST, and HHblits, running big data applications on different high performance computing systems and workstations. Depending on the availability of an implementation, each program is executed on a single central processing unit (CPU) core, on multiple CPU cores, and on graphics processing units (GPUs). The optimum number of CPU cores is determined for the selected programs. The analysis shows that running times depend on many factors, including the CPU/GPU version, the amount of available RAM, the number of CPU cores allocated, and the algorithm used. When parallel implementations are available for a given program, the best running times are typically obtained on GPUs, followed by multi-core CPU and single-core CPU. Although no single system outperforms the others in all of the applications studied, the results are expected to help researchers and practitioners select the most appropriate computational resources for their machine learning and bioinformatics projects.
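
To make the measurement procedure concrete, the short Python sketch below illustrates one way such timings can be collected; it is a minimal illustration assumed for this text, not the benchmark code used in the study. It times scikit-learn's RandomForestClassifier on synthetic data while varying the number of CPU cores through the n_jobs parameter; the dataset size, number of trees, and core counts are placeholder values.

    # Illustrative sketch (not the paper's benchmark code): time scikit-learn
    # random forest training while varying the number of CPU cores (n_jobs).
    # All dataset sizes and hyperparameters are arbitrary placeholders.
    import time
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    # Synthetic classification data standing in for a real big-data workload.
    X, y = make_classification(n_samples=50000, n_features=100, random_state=0)

    for n_jobs in (1, 2, 4, 8, 16):
        clf = RandomForestClassifier(n_estimators=200, n_jobs=n_jobs, random_state=0)
        start = time.perf_counter()
        clf.fit(X, y)  # training is the timed operation
        print(f"n_jobs={n_jobs:2d}  training time: {time.perf_counter() - start:.2f} s")

Plotting such timings against the core count reveals the point of diminishing returns, i.e., the optimum number of CPU cores referred to above.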

References

  • [1]. R. Bekkerman, M. Bilenko, and J. Langford, Scaling Up Machine Learning: Parallel and Distributed Approaches, Cambridge University Press, 2012.
  • [2]. Supercomputer, https://en.wikipedia.org/wiki/Supercomputer.
  • [3]. Y. Kochura, S. Stirenko, O. Alienin, M. Novotarskiy, and Y. Gordienko, “Performance Analysis of Open Source Machine Learning Frameworks for Various Parameters in Single-Threaded and Multi-Threaded Modes”, In: Shakhovska N., Stepashko V. (eds) Advances in Intelligent Systems and Computing II. CSIT 2017. Advances in Intelligent Systems and Computing, vol 689. Springer, 243-256, 2018.
  • [4]. V. Kovalev, A. Kalinovsky, and S. Kovalev, “Deep Learning with Theano, Torch, Caffe, TensorFlow, and Deeplearning4J: Which One Is the Best in Speed and Accuracy?”, International Conference on Pattern Recognition and Information Processing, 2016.
  • [5]. A. Shatnawi, G. Al-Bdour, R. Al-Qurran, and M. Al-Ayyoub, “A Comparative Study of Open Source Deep Learning Frameworks”, 9th International Conference on Information and Communication Systems (ICICS), 72-77, 2018.
  • [6]. S. Bahrampour, N. Ramakrishnan, L. Schott, and M. Shah, “Comparative Study of Deep Learning Software Frameworks”, arXiv:1511.06435, 2016.
  • [7]. D.A. Bader, Y. Li, T. Li, and V. Sachdeva, “BioPerf: A Benchmark Suite to Evaluate High-Performance Computer Architecture on Bioinformatics Applications”, The IEEE International Symposium on Workload Characterization (IISWC 2005), Austin, TX, October 6-8, 2005.
  • [8]. M. Kurtz, F. J. Esteban, P. Hernandez, J. A. Caballero, A. Guevara, G. Dorado, and S. Galvez, “Bioinformatics Performance Comparison of Many-core Tile64 vs. Multi-core Intel Xeon”, CLEI Electronic Journal, vol. 17, no. 1, 1-9, 2014.
  • [9]. NVIDIA DGX-1, https://www.nvidia.com/en-us/data-center/dgx-1/.
  • [10]. M. Abadi et al., “TensorFlow: A system for large-scale machine learning”, 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI), USENIX Association, 265-283, 2016. Software available at https://www.tensorflow.org.
  • [11]. F. Pedregosa et al., “Scikit-learn: Machine Learning in Python”, Journal of Machine Learning Research, vol. 12, pp. 2825-2830, 2011. Software available at https://scikit-learn.org/stable/.
  • [12]. E. Frank, M. A. Hall, and I. H. Witten, “The WEKA Workbench”, online appendix for “Data Mining: Practical Machine Learning Tools and Techniques”, Morgan Kaufmann, Fourth Edition, 2016. Software available at https://www.cs.waikato.ac.nz/ml/weka/.
  • [13]. J. Bilmes and G. Zweig, “The graphical models toolkit: An open source software system for speech and time-series processing”, IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 4, IV-3916-IV-3919, 2002. Software available at https://melodi.ee.washington.edu/gmtk/.
  • [14]. C.-C. Chang and C.-J. Lin, “LIBSVM: a library for support vector machines”, ACM Transactions on Intelligent Systems and Technology, vol. 2, pp. 27:1-27:27, 2011. Software available at https://www.csie.ntu.edu.tw/~cjlin/libsvm/.
  • [15]. Z. Wen, J. Shi, Q. Li, B. He, and J. Chen, “ThunderSVM: A Fast SVM Library on GPUs and CPUs”, Journal of Machine Learning Research, vol. 19, pp. 1-5, 2018. Software available at https://thundersvm.readthedocs.io/en/latest/.
  • [16]. S. F. Altschul, T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman, “Gapped BLAST and PSI-BLAST: a new generation of protein database search programs”, Nucleic Acids Research, vol. 25, no. 17, 3389-3402, 1997. Software available at https://blast.ncbi.nlm.nih.gov/Blast.cgi.
  • [17]. M. Remmert, A. Biegert, A. Hauser, and J. Söding, “HHblits: Lightning-fast iterative protein sequence searching by HMM-HMM alignment”, Nature Methods, vol. 9, no. 2, 173-175, 2012. Software available at https://github.com/soedinglab/hh-suite.
  • [18]. NCBI, https://www.ncbi.nlm.nih.gov (first published on Nov. 4, 1988).
  • [19]. Protein Data Bank (PDB), https://www.rcsb.org.
  • [20]. D. T. Jones, “Protein secondary structure prediction based on position-specific scoring matrices”, Journal of Molecular Biology, vol. 292, no. 2, 195-202, 1999. Software available at http://bioinf.cs.ucl.ac.uk/psipred/.
  • [21]. DSSP, https://swift.cmbi.umcn.nl/gv/dssp/DSSP_1.html (first published in 1983).
  • [22]. Python, https://www.python.org.
  • [23]. Random forest, https://en.wikipedia.org/wiki/Random_forest.
  • [24]. Artnome, https://www.artnome.com/news/2018/11/8/inventing-the-future-of-art-analytics.
  • [25]. Multi-layer perceptron (MLP), https://en.wikipedia.org/wiki/Multilayer_perceptron.
  • [26]. Protein structure prediction, https://en.wikipedia.org/wiki/Protein_structure_prediction.
  • [27]. Multi-layer perceptron, https://www.oreilly.com/library/view/getting-started-with/9781786468574/ch04s04.html.
  • [28]. S. Fourati et al., “A crowdsourced analysis to identify ab initio molecular signatures predictive of susceptibility to viral infection”, Nature Communications, vol. 9, no. 1, pp. 1-11, 2018. Challenge web site: https://www.synapse.org/#!Synapse:syn5647810/wiki/399103.
  • [29]. Google, https://www.google.com.
  • [30]. Convolutional neural network, https://en.wikipedia.org/wiki/Convolutional_neural_network.
  • [31]. Optical character recognition, https://en.wikipedia.org/wiki/Optical_character_recognition.
  • [32]. A comprehensive guide to convolutional neural networks, https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53.
  • [33]. notMNIST dataset, http://yaroslavvb.blogspot.com/2011/09/notmnist-dataset.html.
  • [34]. MNIST dataset, https://en.wikipedia.org/wiki/MNIST_database.
  • [35]. Using notMNIST dataset from Tensorflow, http://enakai00.hatenablog.com/entry/2016/08/02/102917.
  • [36]. X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks”, Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS), 249-256, 2010.
  • [37]. Support vector machine, https://en.wikipedia.org/wiki/Support-vector_machine.
  • [38]. W. Yu, T. Liu, R. Valdez, M. Gwinn, and M. J. Khoury, “Application of support vector machine modeling for prediction of common diseases: the case of diabetes and pre-diabetes”, BMC Medical Informatics and Decision Making, vol. 10, no. 1, 2010.
  • [39]. Jeffrey A. Bilmes, http://melodi.ee.washington.edu/~bilmes/pgs/index.html.
  • [40]. Dynamic Bayesian network, https://en.wikipedia.org/wiki/Dynamic_Bayesian_network.
  • [41]. Hidden Markov model, https://en.wikipedia.org/wiki/Hidden_Markov_model.
  • [42]. J. A. Cuff and G. J. Barton, “Evaluation and improvement of multiple sequence methods for protein secondary structure prediction”, Proteins, 34(4), 508–519, 1999. Dataset is available at http://www.compbio.dundee.ac.uk/jpred/legacy/data/.
  • [43]. I. Y. Y. Koh, V. A. Eyrich, M. A. Marti-Renom, D. Przybylski, M. S. Madhusudhan, N. Eswar, O. Graña, F. Pazos, A. Valencia, and B. Rost, “EVA: Evaluation of protein structure prediction servers”, Nucleic Acids Research, 31(13), 3311-3315, 2003.
  • [44]. Z. Aydin, A. Singh, J. Bilmes and W. S. Noble, “Learning sparse models for a dynamic Bayesian network classifier of protein secondary structure,” BMC Bioinformatics, 12:154, 2011.
  • [45]. Z. Aydin, N. Azgınoglu, H. I. Bilgin, and M. Celik, “Developing Structural Profile Matrices for Protein Secondary Structure and Solvent Accessibility Prediction”, accepted to Bioinformatics, 2019.
  • [46]. TRUBA, https://www.truba.gov.tr/index.php/en/main-page/.
  • [47]. TRUBA wiki page, http://wiki.truba.gov.tr/index.php/Ana_sayfa.
  • [48]. UHeM, http://www.uhem.itu.edu.tr.
  • [49]. ITU UHeM wiki page, http://wiki.uhem.itu.edu.tr/w/index.php/Sarıyer_sistemine_iş_vermek.
  • [50]. CompecTA, https://www.compecta.com.tr.
  • [51]. Abdullah Gul University, http://www.agu.edu.tr.