Improving undersampling-based ensemble with rotation forest for imbalanced problem

As one of the most challenging and attractive issues in pattern recognition and machine learning, the imbalanced problem has attracted increasing attention. In two-class data, imbalance means that one class (the majority class) is much larger than the other (the minority class), which causes the constructed models to focus on the majority class and to ignore or even misclassify minority class examples. The undersampling-based ensemble, which learns individual classifiers from undersampled balanced data, is an effective method for coping with class-imbalanced data. The difficulty with this method is that the dataset used to train each classifier is notably small; thus, generating individual classifiers with high performance from the limited data is key to the success of the method. In this paper, rotation forest (an ensemble method) is used to improve the performance of the undersampling-based ensemble on the imbalanced problem because rotation forest has higher performance than other ensemble methods such as bagging, boosting, and random forest, particularly for small-sized data. In addition, rotation forest is more sensitive to the sampling technique than some robust methods including SVM and neural networks; thus, it is easier to create diverse individual classifiers using rotation forest. Two versions of the improved undersampling-based ensemble method are implemented: 1) undersampling subsets from the majority class and learning each classifier with rotation forest on the data obtained by combining each subset with the minority class, and 2) the same as the first version, except that after each classifier is learned, the majority class examples that it classifies correctly with high confidence are removed from further consideration.
The experimental results show that the proposed methods perform significantly better in terms of recall, G-mean, F-measure, and AUC than other state-of-the-art methods on 30 datasets with various data distributions and different imbalance ratios.
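
As a rough illustration of the two versions described above, the following Python sketch shows the undersampling loop and the optional removal of confidently classified majority examples. It is not the authors' implementation. A simple nearest-centroid scorer stands in for rotation forest so that the example needs only the standard library, and the names `undersampling_ensemble`, `prune`, and `conf`, the 0.8 confidence threshold, the score-averaging rule in `predict`, and the rule of skipping a removal step that would leave fewer majority than minority examples are all assumptions made for illustration.

```python
import random
from statistics import fmean


def fit_centroid(X, y):
    """Stand-in base learner (NOT rotation forest): one centroid per class."""
    centroids = {}
    for label in (0, 1):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [fmean(col) for col in zip(*rows)]
    return centroids


def minority_score(model, x):
    """Score in [0, 1]; higher means x lies closer to the minority centroid."""
    d0 = sum((a - b) ** 2 for a, b in zip(x, model[0])) ** 0.5
    d1 = sum((a - b) ** 2 for a, b in zip(x, model[1])) ** 0.5
    return 0.5 if d0 + d1 == 0 else d0 / (d0 + d1)


def undersampling_ensemble(majority, minority, n_models=5,
                           prune=False, conf=0.8, seed=0):
    """Train n_models base learners, each on a balanced undersampled set.

    prune=False sketches version 1; prune=True sketches version 2.
    Assumes len(majority) >= len(minority).
    """
    rng = random.Random(seed)
    pool = list(majority)
    models = []
    for _ in range(n_models):
        # Draw a majority subset the same size as the minority class.
        subset = rng.sample(pool, len(minority))
        X = subset + list(minority)
        y = [0] * len(subset) + [1] * len(minority)
        model = fit_centroid(X, y)
        models.append(model)
        if prune:
            # Version 2: drop majority examples already classified as
            # majority with confidence >= conf (assumed threshold).
            kept = [x for x in pool if 1 - minority_score(model, x) < conf]
            # Assumed safeguard: keep enough examples for the next draw.
            if len(kept) >= len(minority):
                pool = kept
    return models


def predict(models, x):
    """Average the minority scores of all members; 1 = minority class."""
    avg = fmean([minority_score(m, x) for m in models])
    return 1 if avg >= 0.5 else 0


# Toy two-dimensional data with a 10:1 imbalance ratio (illustrative only).
rng = random.Random(42)
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
minority = [(rng.gauss(4, 1), rng.gauss(4, 1)) for _ in range(20)]

plain = undersampling_ensemble(majority, minority)
pruned = undersampling_ensemble(majority, minority, prune=True)
```

In an actual experiment, `fit_centroid` would be replaced by a rotation forest learner, and the removal threshold would need to be tuned rather than fixed at the value assumed here.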