Uğur BİNZAT, Engin YILDIZTEPE

THE ADJUSTED HISTOGRAM-BASED OUTLIER SCORE - AHBOS

Keywords:

unsupervised anomaly detection, outlier, histogram, density estimation

The histogram is a commonly used tool for visualizing the distribution of data. It has also been used in semi-supervised and unsupervised anomaly detection. The histogram-based outlier score (HBOS) is a fast unsupervised anomaly detection method whose popularity has grown with the rapid increase in the amount of data collected in recent decades. HBOS can be computed using either static or dynamic bin-width histograms. When a histogram contains large gaps, the dynamic bin-width approach is preferred over the static bin-width approach. Such gaps usually arise from the variety of distributions found in real data. With a static bin-width histogram, however, these gaps can be exploited to better distinguish outliers from inliers. In this study, we propose an adjusted version of HBOS, the adjusted histogram-based outlier score (AHBOS), which takes neighboring bins into account prior to density estimation. Results from a simulation study and a real-data application indicate that AHBOS yields better performance on both the simulated data and various types of real data.
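For readers unfamiliar with the baseline method, the following is a minimal sketch of a histogram-based outlier score using static (equal-width) bins, in the commonly described form where each feature's bin heights are normalised by the tallest bin and the per-feature terms log(1/height) are summed. It is an illustration under stated assumptions, not the authors' implementation: the function name `hbos_static`, the parameter `n_bins`, and the toy data are hypothetical, and the adjustment involving neighboring bins is not shown because the abstract does not specify it.

```python
import math


def hbos_static(rows, n_bins=10):
    """Baseline histogram-based outlier score with equal-width bins.

    rows: list of equal-length numeric feature vectors.
    Returns one score per row; a larger score means more outlying.
    (Hypothetical helper written for illustration only.)
    """
    n_features = len(rows[0])
    scores = [0.0] * len(rows)
    for j in range(n_features):
        column = [row[j] for row in rows]
        lo, hi = min(column), max(column)
        width = (hi - lo) / n_bins or 1.0  # guard against a constant feature
        # Map each value to a bin index; the maximum falls in the last bin.
        idx = [min(int((v - lo) / width), n_bins - 1) for v in column]
        counts = [0] * n_bins
        for i in idx:
            counts[i] += 1
        tallest = max(counts)
        for r, i in enumerate(idx):
            # Normalised bin height in (0, 1]; a bin holding a point is never empty.
            density = counts[i] / tallest
            scores[r] += math.log(1.0 / density)
    return scores


# Toy one-dimensional data: five values near 1 and one isolated value.
data = [[1.0], [1.1], [0.9], [1.2], [1.0], [9.0]]
scores = hbos_static(data, n_bins=5)
```

In this toy example the isolated value sits alone in the last bin, separated from the dense bin by three empty bins. Those empty bins are the kind of gap the abstract refers to; the baseline score above ignores them and looks only at the height of the bin each point falls into.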
