Gaining New Insight into Machine-Learning Datasets via Multiple Binary-Feature Frequency Ranks with a Mobile Benign/Malware Apps Example
Researchers often compare their Machine Learning (ML) classification performance with that of other studies without examining and comparing the datasets used for training, validation, and testing. One reason is that few convenient methods give initial insight into datasets beyond descriptive statistics applied to individual continuous or quantitative features. After demonstrating initial manual analysis techniques, this study proposes a novel adaptation of the Kruskal-Wallis statistical test to compare a group of datasets over multiple prominent binary features, which are very common in today’s datasets. As an illustrative example, the new method was tested on six benign/malign mobile application datasets over the frequencies of prominent binary features to explore the dissimilarity of the datasets per class. The feature vector consists of over a hundred “application permission requests”, binary flags that form the Android platform’s primary access control for protecting privacy and securing data/information on mobile devices. Permissions are also the leading transparent features for ML-based malware classification. The proposed data-analytical methodology can be applied in any domain through that domain’s prominent features of interest. The results, which are also visualized in three new ways, show that the proposed method gives the degree of dissimilarity among the datasets. Specifically, the conducted test shows that the frequencies in the aggregated dataset and in some of the individual datasets are not substantially different from each other, and that they are even in close agreement in the positive-class datasets. It is expected that the proposed domain-independent method will give researchers useful initial insight when comparing different datasets.
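To make the idea of comparing datasets by binary-feature frequencies concrete, the following is a minimal sketch and not the exact procedure of this study. It assumes each dataset is treated as one group whose observations are the per-feature frequencies (the fraction of apps requesting each permission), and it computes the standard tie-corrected Kruskal-Wallis H statistic over those groups. The function names, the four-permission toy data, and the dataset labels are hypothetical and serve only as an illustration.

```python
from itertools import chain


def feature_frequencies(rows):
    """Fraction of samples in which each binary feature is set.

    rows: list of equal-length lists of 0/1 flags (one list per app).
    """
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def kruskal_wallis_h(groups):
    """Tie-corrected Kruskal-Wallis H statistic for a list of numeric groups."""
    pooled = sorted(chain.from_iterable(groups))
    n = len(pooled)

    # Assign each distinct value the mean of the ranks it occupies.
    rank_of = {}
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        t = j - i                          # size of this tie block
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        tie_term += t**3 - t
        i = j

    # H = 12 / (N (N + 1)) * sum(R_g^2 / n_g) - 3 (N + 1)
    h = 12 / (n * (n + 1)) * sum(
        sum(rank_of[x] for x in g) ** 2 / len(g) for g in groups
    ) - 3 * (n + 1)

    correction = 1 - tie_term / (n**3 - n)
    # If every observation is identical the statistic is undefined; report 0.
    return h / correction if correction > 0 else 0.0


# Hypothetical toy data: rows are apps, columns are four made-up permissions.
toy_datasets = {
    "dataset_A": [[1, 0, 1, 0], [1, 1, 0, 0], [1, 0, 0, 0]],
    "dataset_B": [[1, 0, 1, 1], [1, 1, 1, 0], [0, 0, 1, 0]],
    "dataset_C": [[1, 1, 1, 1], [1, 1, 0, 1], [1, 1, 1, 1]],
}

frequency_groups = [feature_frequencies(rows) for rows in toy_datasets.values()]
h_statistic = kruskal_wallis_h(frequency_groups)

# Chi-square critical value for alpha = 0.05 with k - 1 = 2 degrees of freedom.
CRITICAL_VALUE_DF2 = 5.991
print(f"H = {h_statistic:.3f}, "
      f"differs at 5% level: {h_statistic > CRITICAL_VALUE_DF2}")
```

With groups this small the chi-square approximation is rough, so the comparison against the critical value is only indicative. In practice a statistics package would supply the p-value and any post hoc multiple comparisons.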
___
G. Canbek / Hittite J Sci Eng, 2021, 8 (2) 103–121