An unsupervised heterogeneous log-based framework for anomaly detection

Log analysis is a method to identify intrusions at the host or network level by scrutinizing the log events recorded by the operating systems, applications, and devices. Most work contemplates a single type of log for analysis, leading to an unclear picture of the situation and difficulty in deciding the existence of an intrusion. Moreover, most existing detection methods are knowledge-dependent, i.e. using either the characteristics of an anomaly or the baseline of normal traffic behavior, which limits the detection process to only anomalies based on the acquired knowledge. To discover a wide range of anomalies by scrutinizing various logs, this paper presents a new unsupervised framework, UHAD, which uses a two-step strategy to cluster the log events and then uses a filtering threshold to reduce the volume of events for analysis. The events from heterogeneous logs are assembled together into a common format and are analyzed based on their features to identify anomalies. Clustering accuracy of K-means, expectation-maximization, and farthest first were compared and the impact of clustering was captured in all the subsequent phases. Even though log events pass through several phases in UHAD before being concluded as anomalous, experiments have shown that the selection of the clustering algorithm and the filtering threshold significantly influences the decision. The framework detected the majority of anomalies by relating the events from heterogeneous logs. Specifically, the usage of K-means and expectation-maximization supported the framework to detect an average of 87.26% and 85.24% anomalous events respectively with various subsets.