On Measuring Classification Performance and Handling Rarity

Intuitively, measuring classifier performance appears straightforward: run the classifier on test samples and observe the ratio of correct decisions to all decisions. The matter is not that simple, however, once the relative importance (weight) of correct and incorrect decisions, which typically depends on the problem at hand, and the various operating points are taken into account. Often, high performance is demanded in terms of one metric while lower performance in terms of another, related metric can be tolerated. Seen this way, performance measurement, like most engineering endeavors, can be understood as a trade-off process. This article presents a survey of performance measurement that takes the different conditions classifiers face into account, and dwells on the problems of case/class rarity and class imbalance (imbalanced case/class distributions), which degrade classification performance and share similar characteristics. In this context, the survey offers a complementary set of example methods aimed at overcoming these bottlenecks.

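To make the opening point concrete, the short Python sketch below (illustrative only; it is not taken from the article, and all numbers in it are invented) contrasts overall accuracy with per-class precision and recall on a deliberately imbalanced toy test set. A degenerate classifier that always predicts the majority class reaches 95% accuracy yet never detects the rare class, which is why imbalance-aware metrics and operating points matter.

    # Illustrative sketch (assumed toy data, not from the article): why overall
    # accuracy alone can mislead when the class distribution is imbalanced.

    # 100 test samples: 95 negatives (0) and 5 positives (1) -- the rare class.
    y_true = [0] * 95 + [1] * 5
    # A degenerate classifier that always predicts the majority class.
    y_pred = [0] * 100

    # Confusion-matrix counts.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / len(y_true)              # 0.95, looks impressive
    recall = tp / (tp + fn) if (tp + fn) else 0.0   # 0.00, rare class never found
    precision = tp / (tp + fp) if (tp + fp) else 0.0

    print(f"accuracy={accuracy:.2f} recall={recall:.2f} precision={precision:.2f}")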