THE ADJUSTED HISTOGRAM-BASED OUTLIER SCORE - AHBOS

The histogram is a commonly used tool for visualizing the distribution of data. It has also been used in semi-supervised and unsupervised anomaly detection tasks. The histogram-based outlier score is a fast unsupervised anomaly detection method that has grown in popularity with the rapid increase in the amount of data collected in recent decades. The score can be computed using either static or dynamic bin-width histograms. When a histogram contains large gaps, the dynamic bin-width approach is generally preferred over the static bin-width approach. Such gaps are common in real data, whose features follow a variety of distributions. With a static bin-width histogram, however, these gaps can be exploited to distinguish outliers from inliers more clearly. In this study, we propose an adjusted version of the histogram-based outlier score, named the adjusted histogram-based outlier score (AHBOS), which considers neighboring bins prior to density estimation. Results from a simulation study and a real data application indicate that the adjusted histogram-based outlier score yields better performance on both the simulated data and various types of real data.
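For orientation, the static bin-width variant of the ordinary histogram-based outlier score can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name hbos_static and the default of 10 bins are hypothetical, and the neighboring-bin adjustment that defines AHBOS is not reproduced because the abstract does not specify it. Each feature is binned into equal-width bins, bin heights are scaled so the tallest bin has height 1, and a point's score is the sum over features of the log of the inverse height of its bin.

```python
import math


def hbos_static(rows, n_bins=10):
    """Return one HBOS-style score per row; higher means more outlying.

    Hypothetical helper for illustration. `rows` is a list of
    equal-length numeric sequences (one sequence per observation).
    """
    n_features = len(rows[0])
    scores = [0.0] * len(rows)
    for j in range(n_features):
        column = [row[j] for row in rows]
        lo, hi = min(column), max(column)
        # Equal (static) bin width; fall back to 1.0 for a constant feature.
        width = (hi - lo) / n_bins or 1.0
        # Bin index of each value; the maximum value goes into the last bin.
        idx = [min(int((x - lo) / width), n_bins - 1) for x in column]
        counts = [0] * n_bins
        for i in idx:
            counts[i] += 1
        peak = max(counts)
        # Normalized height is counts[i] / peak, so log(1 / height)
        # equals log(peak / counts[i]); sparse bins give large scores.
        for r, i in enumerate(idx):
            scores[r] += math.log(peak / counts[i])
    return scores


# Fifty values packed into [0, 0.49] plus one isolated value at 10.
rows = [[i / 100.0] for i in range(50)] + [[10.0]]
scores = hbos_static(rows)
print(scores.index(max(scores)))  # index of the most outlying row
```

In this toy example the isolated value sits alone in the last bin, separated from the dense bin by eight empty bins. The plain score sees only the low height of that bin; the gap itself is the kind of information that an approach looking at neighboring bins could additionally use.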
