F. G. FURAT, T. İBRİKÇİ

Classification of Down Syndrome of Mice Protein Dataset on MongoDB Database

The mice protein expression data set contains samples from mice both with and without Down syndrome. Understanding how Down syndrome treatment acts on mouse proteins is important because a similar treatment may be applicable to humans. In the present study, the mice protein expression data set from the UCI repository is classified using five classification methods: Bayesian Network, K-Nearest Neighbor (KNN), Decision Table, Random Forest and Support Vector Machine (SVM). Each algorithm is tested on the data set in two ways: with 10-fold cross validation, and with the data split equally into training and test sets. With 10-fold cross validation, the data set was classified with 94.3519% accuracy in 0.06 seconds using Bayesian Network, with 99.2593% accuracy in 0.01 seconds using KNN, with 95.4630% accuracy in 1.2 seconds using Decision Table, with 100% accuracy in 0.58 seconds using Random Forest and with 100% accuracy in 1.17 seconds using SVM. With the equal train-test partition, the data set was classified with 95.3704% accuracy in 0.22 seconds using Bayesian Network, with 98.3333% accuracy in 0 seconds using KNN, with 98.3333% accuracy in 0.72 seconds using Decision Table, with 100% accuracy in 0.77 seconds using Random Forest and with 100% accuracy in 1.48 seconds using SVM.
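The two evaluation protocols named in the abstract, 10-fold cross validation and an equal train-test split, can be illustrated with a minimal sketch. This is not the authors' experimental setup: it uses only the Python standard library, a plain nearest-neighbor classifier, and synthetic two-feature data in place of the UCI protein measurements. All function names and the class labels `control` and `trisomic` are hypothetical, chosen for illustration.

```python
import random


def knn_predict(train, query, k=1):
    """Return the majority label among the k training rows nearest to query."""
    nearest = sorted(
        train,
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], query)),
    )[:k]
    labels = [label for _, label in nearest]
    # With k=1 there is exactly one label, so no tie-breaking is needed.
    return max(set(labels), key=labels.count)


def accuracy(train, test, k=1):
    """Fraction of test rows whose predicted label matches the true label."""
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)


def k_fold_accuracy(data, folds=10, k=1, seed=0):
    """Mean accuracy over `folds` rounds; each row is tested exactly once."""
    rows = data[:]
    random.Random(seed).shuffle(rows)
    scores = []
    for i in range(folds):
        test = rows[i::folds]  # every folds-th row, starting at offset i
        train = [r for j, r in enumerate(rows) if j % folds != i]
        scores.append(accuracy(train, test, k))
    return sum(scores) / folds


def holdout_accuracy(data, k=1, seed=0):
    """Accuracy when half the shuffled rows train and the other half test."""
    rows = data[:]
    random.Random(seed).shuffle(rows)
    half = len(rows) // 2
    return accuracy(rows[:half], rows[half:], k)


def make_toy_data(n_per_class=50, seed=1):
    """Synthetic stand-in data: two well-separated Gaussian clusters."""
    rng = random.Random(seed)
    data = []
    for label, centre in (("control", 0.0), ("trisomic", 5.0)):
        for _ in range(n_per_class):
            features = [rng.gauss(centre, 0.5), rng.gauss(centre, 0.5)]
            data.append((features, label))
    return data


if __name__ == "__main__":
    toy = make_toy_data()
    print("10-fold cross validation accuracy:", k_fold_accuracy(toy))
    print("Equal train-test split accuracy:", holdout_accuracy(toy))
```

In the cross-validation variant every sample serves as test data once and as training data nine times, whereas the equal split trains on only half of the samples, which is one reason the two protocols can report different accuracies for the same classifier.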
