Proposal of Machine Learning Approach for Identification of Instant Messaging Applications in Raw Network Traffic

Identifying Internet protocols from either raw network traffic or network flows plays a crucial role in maintaining and improving the security of computer systems. A significant amount of research has explored a variety of identification techniques. Although a certain level of success has been achieved in detecting network protocols in unencrypted traffic, accuracy and performance remain rather poor for encrypted traffic. Following current technological trends, new and existing applications increasingly adopt encryption mechanisms to protect information and privacy. Therefore, classifying encrypted network traffic is essential for ensuring security. Moreover, labelling network protocols and applications is a necessary step in network forensic investigations. In this study, we propose a method to automatically identify instant messaging applications from raw network traffic. To this end, we first extract flow-based static features from network captures and then apply machine learning algorithms. The proposed method is evaluated on a fairly large dataset, which comprises the publicly available NIMS dataset and the network traffic of 9 popular instant messaging applications collected in a controlled environment. Overall, the dataset contains 716,607 network flows belonging to 20 application categories. The proposed method classifies network flows of instant messaging applications into their corresponding application categories with an accuracy of over 99 percent and an F1-score of 99 percent.
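The feature extraction step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the packet record layout, the function names `flow_key` and `extract_flow_features`, and the particular set of statistics are all assumptions chosen for illustration. The sketch groups packets into bidirectional flows by their 5-tuple and computes a few static per-flow statistics.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical packet record (an assumption, not the paper's format):
# (timestamp_s, src_ip, src_port, dst_ip, dst_port, protocol, size_bytes)


def flow_key(src_ip, src_port, dst_ip, dst_port, protocol):
    """Return a direction-independent key so both directions map to one flow."""
    a, b = (src_ip, src_port), (dst_ip, dst_port)
    return (min(a, b), max(a, b), protocol)


def extract_flow_features(packets):
    """Group packets into bidirectional flows and compute static statistics."""
    flows = defaultdict(list)
    for ts, src_ip, src_port, dst_ip, dst_port, proto, size in packets:
        flows[flow_key(src_ip, src_port, dst_ip, dst_port, proto)].append((ts, size))

    features = {}
    for key, pkts in flows.items():
        pkts.sort()  # order packets by timestamp
        times = [t for t, _ in pkts]
        sizes = [s for _, s in pkts]
        # Gaps between consecutive packets; empty for single-packet flows.
        gaps = [later - earlier for earlier, later in zip(times, times[1:])]
        features[key] = {
            "packet_count": len(pkts),
            "total_bytes": sum(sizes),
            "mean_packet_size": mean(sizes),
            "min_packet_size": min(sizes),
            "max_packet_size": max(sizes),
            "duration_s": times[-1] - times[0],
            "mean_inter_arrival_s": mean(gaps) if gaps else 0.0,
        }
    return features


# Made-up example packets using documentation IP ranges.
packets = [
    (0.0, "10.0.0.2", 50000, "203.0.113.5", 443, "TCP", 100),
    (0.5, "203.0.113.5", 443, "10.0.0.2", 50000, "TCP", 300),
    (1.0, "10.0.0.2", 50000, "203.0.113.5", 443, "TCP", 200),
    (0.2, "10.0.0.3", 40000, "198.51.100.7", 5222, "TCP", 80),
]
flow_features = extract_flow_features(packets)
```

Each resulting dictionary can be flattened into a numeric vector and, together with an application label, passed to a supervised classifier. Which features and classifiers work best is an empirical question that this sketch does not address.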
