Analysis and detection of Titanic survivors using generalized linear models and decision tree algorithm

This article investigates the factors that affected survival in the Titanic disaster, an accident that remains legendary today, using several methods, and seeks the method that classifies survival most accurately. For this purpose, logit and probit models from the family of generalized linear models and the random tree algorithm from decision tree methods were used. The study was carried out in two stages. In the first stage, the generalized linear models were used to identify the variables that did not contribute significantly to the model; classification accuracy was 79.89% for the logit model and 79.04% for the probit model. In the second stage, classification was performed with the random tree algorithm, which gave a classification accuracy of 77.21%. The classification was then repeated after removing the variables that the generalized linear models had identified as making no significant contribution. Accuracy rose by 4.36 percentage points to 81.57%. The decision tree analysis on the reduced variable set therefore gave better results than the analysis on the original variables. These results may be useful to researchers working on classification analysis, and they can also inform data preprocessing and data cleaning.
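As a rough illustration of how the two generalized linear models turn a linear predictor into a survival probability, and how a classification accuracy is then computed, the following minimal sketch may help. It is not the study's code: the function names, the 0.5 cutoff, and the toy values are assumptions made for illustration, and no Titanic data or fitted coefficients are used.

```python
import math


def logit_prob(eta):
    """Logit model: P(y = 1) is the logistic function of the linear predictor."""
    return 1.0 / (1.0 + math.exp(-eta))


def probit_prob(eta):
    """Probit model: P(y = 1) is the standard normal CDF of the linear predictor."""
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))


def classification_accuracy(probs, labels, threshold=0.5):
    """Share of cases whose thresholded prediction matches the observed label."""
    predictions = [1 if p >= threshold else 0 for p in probs]
    hits = sum(int(p == y) for p, y in zip(predictions, labels))
    return hits / len(labels)


# Hypothetical linear predictors and outcomes (made up, not Titanic data).
etas = [-2.0, -0.5, 0.3, 1.5, -1.0, 0.8]
survived = [0, 0, 1, 1, 1, 0]

logit_acc = classification_accuracy([logit_prob(e) for e in etas], survived)
probit_acc = classification_accuracy([probit_prob(e) for e in etas], survived)
print(f"logit accuracy:  {logit_acc:.4f}")
print(f"probit accuracy: {probit_acc:.4f}")
```

Both links equal 0.5 when the linear predictor is zero, so with identical predictors they classify identically at this cutoff. In a fitted analysis each model estimates its own coefficients, which is why the two accuracies can differ.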
