A comparison of feature extraction techniques for malware analysis

The manifold growth of malware in recent years has prompted extensive research in malware analysis and detection, and theories from a wide variety of scientific knowledge domains have been applied to the problem. Machine learning algorithms in particular have been widely explored, and many feature extraction methods have been proposed in the literature for representing malware as feature vectors suitable for these algorithms. In this paper we present a comparison of several feature extraction techniques by first applying them to system call logs of real malware and then evaluating them using a random forest classifier. In our experiment the HMM-based feature extraction method outperformed the other methods, obtaining an F-measure of 0.87. We also explored the possibility of using ensembles of feature extraction methods and found that combining HMM-based features with bigram frequency features improved the F-measure by 1.7%.
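To make the bigram frequency representation and the F-measure concrete, the following is a minimal Python sketch. It is an illustration under stated assumptions, not the implementation used in the paper: the function names `bigram_features` and `f_measure`, the toy system call trace, and the choice to normalise counts by the total number of bigrams are all hypothetical.

```python
from collections import Counter
from itertools import product


def bigram_features(calls, vocabulary):
    """Map a system call trace to a fixed-length vector of relative bigram frequencies.

    Assumption: features are ordered over vocabulary x vocabulary and
    normalised by the total number of bigrams in the trace.
    """
    counts = Counter(zip(calls, calls[1:]))  # adjacent pairs of calls
    total = sum(counts.values())
    return [
        counts[pair] / total if total else 0.0
        for pair in product(vocabulary, repeat=2)
    ]


def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy trace; real logs would contain many more distinct calls.
vocabulary = ["close", "open", "read"]
trace = ["open", "read", "read", "close"]
vector = bigram_features(trace, vocabulary)
```

With a vocabulary of three calls the vector has nine entries, one per ordered pair. Vectors of this kind can be passed to any standard classifier, and an ensemble of feature extraction methods can be formed by concatenating the vectors that each method produces for the same sample.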
