Improving undersampling-based ensemble with rotation forest for imbalanced problem

The imbalanced problem is one of the most challenging issues in pattern recognition and machine learning and has attracted increasing attention. In two-class data, imbalance means that one class (the majority class) is much larger than the other (the minority class), which biases learned models toward the majority class so that minority class examples are ignored or even misclassified. The undersampling-based ensemble, which learns individual classifiers from undersampled balanced data, is an effective way to cope with class-imbalanced data. Its weakness is that the dataset used to train each classifier is notably small; thus, generating high-performance individual classifiers from such limited data is key to the method's success. In this paper, rotation forest (an ensemble method) is used to improve the performance of the undersampling-based ensemble on the imbalanced problem, because rotation forest outperforms other ensemble methods such as bagging, boosting, and random forest, particularly on small-sized data. In addition, rotation forest is more sensitive to the sampling technique than robust methods such as SVMs and neural networks, so it is easier to create diverse individual classifiers with rotation forest. Two versions of the improved undersampling-based ensemble are implemented: 1) undersampling subsets from the majority class and learning each classifier with rotation forest on the data obtained by combining each subset with the minority class, and 2) the same as the first version, except that majority class examples that are correctly classified with high confidence after each classifier is learned are removed from further consideration. Experimental results on 30 datasets with various data distributions and imbalance ratios show that the proposed methods achieve significantly better recall, g-mean, f-measure, and AUC than other state-of-the-art methods.
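To make the two steps concrete, the following is a minimal sketch of version 1 (undersample the majority class, train one rotation-forest-style member per balanced subset, average the members' votes). It assumes a binary task with the minority class labeled 1; since scikit-learn ships no rotation forest, `fit_rotation_tree` below is a compact approximation of Rodríguez et al.'s construction (PCA fitted per random feature subset on a sample of the rows, assembled into a block-diagonal rotation, and a decision tree trained on the rotated data). All names and parameter defaults (`n_subsets`, `n_members`, the 75% row sample) are illustrative choices, not the authors' settings.

```python
# Minimal sketch, not the authors' implementation. Assumes y in {0, 1}
# with 1 the minority class, and more sampled rows than features per subset.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

def fit_rotation_tree(X, y, n_subsets=3, rng=None):
    """One rotation-forest member: PCA per random feature subset,
    then a decision tree on the rotated features (Rodríguez et al.)."""
    rng = rng if rng is not None else np.random.default_rng()
    n_features = X.shape[1]
    rotation = np.zeros((n_features, n_features))
    for cols in np.array_split(rng.permutation(n_features), n_subsets):
        # Fit PCA on a 75% row sample of this feature subset; the sampling
        # makes the rotations (and hence the trees) differ across members.
        rows = rng.choice(len(X), size=int(0.75 * len(X)), replace=False)
        pca = PCA().fit(X[np.ix_(rows, cols)])
        rotation[np.ix_(cols, cols)] = pca.components_.T
    tree = DecisionTreeClassifier().fit(X @ rotation, y)
    return rotation, tree

def fit_undersampling_ensemble(X, y, n_members=10, seed=0):
    """Version 1: each member trains on all minority examples plus a
    random majority subset of the same size (a balanced subset)."""
    rng = np.random.default_rng(seed)
    min_idx = np.flatnonzero(y == 1)
    maj_idx = np.flatnonzero(y == 0)
    members = []
    for _ in range(n_members):
        sub = rng.choice(maj_idx, size=len(min_idx), replace=False)
        idx = np.concatenate([min_idx, sub])
        members.append(fit_rotation_tree(X[idx], y[idx], rng=rng))
    return members

def predict(members, X):
    """Average positive-class probabilities over all members."""
    p = np.mean([t.predict_proba(X @ R)[:, 1] for R, t in members], axis=0)
    return (p >= 0.5).astype(int)
```

Version 2 would differ only inside the loop of `fit_undersampling_ensemble`: after each member is trained, the majority examples it classifies correctly with high confidence would be dropped from `maj_idx` before the next subset is drawn.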

___

  • [1] Guo H, Li Y, Shang J, Mingyun G, Yuanyue H, Bing G. Learning from class-imbalanced data: review of methods and applications. Expert Systems with Applications 2017; 73: 220-239.
  • [2] He H, Garcia EA. Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering 2009; 21: 1263-1284.
  • [3] He H, Ma Y (editors). Imbalanced Learning: Foundations, Algorithms, and Applications. New York, NY, USA: IEEE Press, 2013.
  • [4] Powers DMW. Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation. Journal of Machine Learning Technologies 2011; 2(1): 37-63.
  • [5] Liu XY, Wu J, Zhou ZH. Exploratory under-sampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics Part B 2009; 39(2): 965-969.
  • [6] Castellanos FJ, Valero-Mas JJ, Calvo-Zaragoza J, Rico-Juan JR. Oversampling imbalanced data in the string space. Pattern Recognition Letters 2018; 103: 32-38.
  • [7] Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 2002; 16: 321-357.
  • [8] Han H, Wang W, Mao B. Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In: Proceedings of the First International Conference on Intelligent Computing (Part I); 23–26 August 2005; Hefei, China. Berlin, Germany: Springer. pp. 878-887.
  • [9] Bunkhumpornpat C, Sinapiromsaran K, Lursinsap C. Safe-level-SMOTE: safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem. In: Proceedings of the 13th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD); 27–30 April 2009; Bangkok, Thailand. Berlin, Germany: Springer. pp. 475-482.
  • [10] Barua S, Islam MM, Yao X, Murase K. MWMOTE-Majority weighted minority oversampling technique for imbalanced data set learning. IEEE Transactions on Knowledge and Data Engineering 2014; 26(2): 405-425.
  • [11] Kazemi Y, Mirroshandel SA. A novel method for predicting kidney stone type using ensemble learning. Artificial Intelligence in Medicine 2018; 84: 117-126.
  • [12] Chan YT, Wang SJ, Tsai C. Real-time foreground detection approach based on adaptive ensemble learning with arbitrary algorithms for changing environments. Information Fusion 2018; 39: 154-167.
  • [13] Han M, Liu B. Ensemble of extreme learning machine for remote sensing image classification. Neurocomputing 2015; 149: 65-70.
  • [14] Tong H, Liu B, Wang S. Software defect prediction using stacked denoising autoencoders and two-stage ensemble learning. Information and Software Technology 2018; 96: 94-111.
  • [15] Seiffert C, Khoshgoftaar T, Hulse JV, Napolitano A. RUSBoost: a hybrid approach to alleviating class imbalance. IEEE Transactions on Systems, Man, and Cybernetics Part A 2010; 40(1): 185-197.
  • [16] Lu W, Li Z, Chu J. Adaptive Ensemble Undersampling-Boost: A novel learning framework for imbalanced data. Journal of Systems and Software 2017; 132: 272-282.
  • [17] Bao L, Juan C, Li J, Zhang Y. Boosted near-miss under-sampling on SVM ensembles for concept detection in large-scale imbalanced datasets. Neurocomputing 2016; 172: 198-206.
  • [18] Barandela R, Valdovinos RM, Sánchez JS. New applications of ensembles of classifiers. Pattern Analysis and Applications 2003; 6(3): 245-256.
  • [19] Liu XY, Wu J, Zhou ZH. Exploratory under-sampling for class-imbalance learning. In: Proceedings of the 6th IEEE International Conference on Data Mining (ICDM); 18–22 December 2006; Hong Kong, China. New York, NY, USA: IEEE. pp. 965-969.
  • [20] Tao D, Tang X, Li X, Wu X. Asymmetric bagging and random subspace for support vector machines-based relevance feedback in image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence 2006; 28(7): 1088-1099.
  • [21] Hido S, Kashima H, Takahashi Y. Roughly balanced bagging for imbalanced data. Statistical Analysis and Data Mining 2009; 2(5-6): 412-426.
  • [22] Rodríguez JJ, Kuncheva LI, Alonso CJ. Rotation forest: a new classifier ensemble method. IEEE Transactions on Pattern Analysis and Machine Intelligence 2006; 28(10): 1619-1630.
  • [23] Breiman L. Bagging predictors. Machine Learning 1996; 24(2): 123-140.
  • [24] Freund Y, Schapire RE. A Decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 1997; 55(1): 119-139.
  • [25] Breiman L. Random forests. Machine Learning 2001; 45(1): 5-32.
  • [26] López V, Fernández A, García S, Palade V, Herrera F. An insight into classification with imbalanced data: empirical results and current trends on using data intrinsic characteristics. Information Sciences 2013; 250: 113-141.
  • [27] Branco P, Torgo L, Ribeiro RP. A survey of predictive modelling under imbalanced distributions. arXiv preprint arXiv:1505.01658, 2015.
  • [28] Zhai J, Zhai M, Kang X. Condensed fuzzy nearest neighbor methods based on fuzzy rough set technique. Intelligent Data Analysis 2014; 18(3): 429-447.
  • [29] Devi D, Biswas SK, Purkayastha B. Redundancy-driven modified Tomek-link based undersampling: a solution to class imbalance. Pattern Recognition Letters 2017; 93: 3-12.
  • [30] Japkowicz N. The class imbalance problem: significance and strategies. In: Proceedings of the International Conference on Artificial Intelligence; 2000; Las Vegas, NV, USA. pp. 111-117.
  • [31] Sun Z, Song Q, Zhu X, Sun H, Xu B, Zhou Y. A novel ensemble method for classifying imbalanced data. Pattern Recognition 2015; 48(5): 1623-1637.
  • [32] Nguyen HM, Cooper EW, Kamei K. Borderline over-sampling for imbalanced data classification. International Journal of Knowledge Engineering and Soft Data Paradigms 2011; 3(1): 4-21.
  • [33] Sun J, Lang J, Fujita H, Li H. Imbalanced enterprise credit evaluation with DTE-SBD: Decision tree ensemble based on SMOTE and bagging with differentiated sampling rates. Information Sciences 2018; 425: 76-91.
  • [34] Estabrooks A, Jo T, Japkowicz N. A multiple resampling method for learning from imbalanced data sets. Computational Intelligence 2004; 20: 18-36.
  • [35] Kang P, Cho S. EUS SVMs: Ensemble of under-sampled SVMs for data imbalance problems. In: Proceedings of the 13th International Conference on Neural Information Processing, Part I; 3–6 October 2006; Hong Kong, China. Berlin, Germany: Springer. pp. 837-846.
  • [36] Alcalá-Fdez J, Fernandez A, Luengo J, Derrac J, García S, Sánchez L, Herrera F. KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework. Journal of Multiple-Valued Logic and Soft Computing 2011; 17(2-3): 255-287.
  • [37] García-Pedrajas N, García-Osorio C, Fyfe C. Nonlinear boosting projections for ensemble construction. Journal of Machine Learning Research 2007; 8: 1-33.
  • [38] Chawla NV. C4.5 and imbalanced data sets: investigating the effect of sampling method, probabilistic estimate, and decision tree structure. In: Proceedings of the ICML ’03 Workshop on Learning from Imbalanced Data Sets; 21–24 August 2003; Washington, DC, USA. Palo Alto, CA, USA: AAAI Press.
  • [39] Quinlan JR. C4.5: Programs for Machine Learning. San Mateo, CA, USA: Morgan Kaufmann, 1993.
  • [40] Demšar J. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 2006; 7: 1-30.
  • [41] García S, Fernández A, Luengo J, Herrera F. Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: experimental analysis of power. Information Sciences 2010; 180: 2044-2064.
  • [42] Li J. A two-step rejection procedure for testing multiple hypotheses. Journal of Statistical Planning and Inference 2008; 138: 1521-1527.
  • [43] Rodríguez-Fdez I, Canosa A, Mucientes M, Bugarín A. STAC: A web platform for the comparison of algorithms using statistical tests. In: Proceedings of the IEEE International Conference on Fuzzy Systems; 2–5 August 2015; İstanbul, Turkey. New York, NY, USA: IEEE. pp. 1-8.
  • [44] Ren F, Cao P, Li W, Zhao D, Zaiane O. Ensemble based adaptive over-sampling method for imbalanced data learning in computer aided detection of microaneurysm. Computerized Medical Imaging and Graphics 2017; 55: 54-67.