Effect of Imputation Methods in the Classifier Performance

Missing values pose a significant problem for almost all traditional and modern statistical methods, since most of these methods were developed under the assumption that the dataset is complete. In practice, complete datasets are rarely available, and missing data is frequently encountered in veterinary field studies, as in other fields. In veterinary field studies, where data mining is only beginning to be applied, imputing missing data is important, and so is the choice of imputation method. In many studies, observations with a missing value in any variable are either removed or completed using traditional methods. Although alternative approaches that avoid removing such observations have become widely available in recent years, they are rarely used. The aim of this study is to examine the mean, median, nearest neighbors, MICE, and missForest methods for imputing simulated missing data, created by randomly removing values from the original veterinary dataset at varying rates (5% to 25%, in steps of 5%). The most accurate methods are then used to impute the original dataset, in order to observe their influence on classifier performance and to determine the optimal imputation method for the original dataset.
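
The simulation procedure described above can be illustrated with a minimal sketch. It is not the study's implementation: the data are synthetic, only mean and median imputation are shown (nearest neighbors, MICE, and missForest need external libraries), and RMSE on the removed entries is an assumed accuracy measure. The helper names `mask_at_random`, `impute`, and `rmse` are hypothetical.

```python
import math
import random
import statistics


def mask_at_random(values, rate, rng):
    """Replace round(rate * n) randomly chosen entries with None.

    Returns the masked copy and the sorted indices that were removed.
    """
    n_missing = round(rate * len(values))
    idx = sorted(rng.sample(range(len(values)), n_missing))
    masked = list(values)
    for i in idx:
        masked[i] = None
    return masked, idx


def impute(masked, strategy):
    """Fill every None with the mean or median of the observed values."""
    observed = [v for v in masked if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    else:
        fill = statistics.median(observed)
    return [fill if v is None else v for v in masked]


def rmse(original, imputed, idx):
    """Root mean squared error over the removed positions only (assumed metric)."""
    return math.sqrt(sum((original[i] - imputed[i]) ** 2 for i in idx) / len(idx))


rng = random.Random(0)
# Synthetic stand-in for one numeric variable; not the veterinary dataset.
data = [rng.gauss(50.0, 10.0) for _ in range(200)]

results = {}
for rate in (0.05, 0.10, 0.15, 0.20, 0.25):  # 5% to 25% in steps of 5%
    masked, idx = mask_at_random(data, rate, rng)
    for strategy in ("mean", "median"):
        results[(rate, strategy)] = rmse(data, impute(masked, strategy), idx)

for (rate, strategy), err in sorted(results.items()):
    print(f"{rate:.0%} missing, {strategy}: RMSE = {err:.3f}")
```

Because the true values at the removed positions are known, each method's error can be compared directly at every missingness rate, which is the basis for selecting the most accurate methods before imputing the real missing values.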
