Ender ŞAHİNASLAN, Mustafa GÜNERKAN, Önder ŞAHİNASLAN

An Alternative Solution Method to Using Categorical Data Encoding Technique in Machine Learning

Original title (Turkish): Makine Öğrenmesinde Kategorik Veri Kodlama Tekniğinin Kullanımına Alternatif Bir Çözüm Yöntemi

Machine learning is a technology that mimics human intelligence using learning algorithms such as deep learning. Learning algorithms operate only on numerical data sets. Categorical data sets consist of qualitative or quantitative data. Before qualitative data sets can be used in learning algorithms, they must be converted to numerical form. Many encoding techniques exist for this conversion, such as label encoding, ordinal encoding, sum encoding, binary encoding, and one-hot encoding, but these techniques have difficulties and shortcomings in terms of performance, cost, and ease of use. It may also be necessary to recover the original values behind training output that was produced with an encoding technique. This study grew out of the search for a more original and better-performing infrastructure that addresses the shortcomings of using encoding techniques to convert categorical data to numerical form. The developed method was applied at an international logistics company to a total of 46 categorical features and 80,154,139 data items in 7 different categories. In the tests, performance gains ranged from 23.07% to 300.13% across the data sets, with an overall gain of 153.62%. These results indicate that the developed method is more successful and applicable. With its high performance gain and original structure, the method can be readily used in similar areas, and it offers an alternative to the use of encoding techniques in machine learning.
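For readers unfamiliar with the conventional techniques the abstract refers to, the following is a minimal sketch of label encoding and one-hot encoding in plain Python, including the reverse lookup needed to recover original category values from encoded output. It does not represent the method developed in the study, which the abstract does not describe. The function names and the sample `transport_modes` column are hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def fit_label_encoding(values: List[str]) -> Dict[str, int]:
    """Assign each distinct category an integer code, in sorted order."""
    return {category: code for code, category in enumerate(sorted(set(values)))}


def label_encode(values: List[str], mapping: Dict[str, int]) -> List[int]:
    """Replace each category with its integer code."""
    return [mapping[value] for value in values]


def label_decode(codes: List[int], mapping: Dict[str, int]) -> List[str]:
    """Recover the original categories by inverting the mapping."""
    inverse = {code: category for category, code in mapping.items()}
    return [inverse[code] for code in codes]


def one_hot_encode(values: List[str], mapping: Dict[str, int]) -> List[List[int]]:
    """Produce one 0/1 column per distinct category."""
    width = len(mapping)
    rows = []
    for value in values:
        row = [0] * width
        row[mapping[value]] = 1
        rows.append(row)
    return rows


# Hypothetical sample column, not taken from the study's data.
transport_modes = ["sea", "air", "road", "air", "sea"]

mapping = fit_label_encoding(transport_modes)
codes = label_encode(transport_modes, mapping)
one_hot = one_hot_encode(transport_modes, mapping)
restored = label_decode(codes, mapping)
```

In this sketch, label encoding keeps a single column but imposes an arbitrary numeric order on the categories, while one-hot encoding avoids that order at the cost of one column per distinct category. Recovering the original values requires keeping the fitted mapping alongside the encoded data.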

Keywords:

Encoding, Machine Learning, Systems Development, Technology and Innovation, Data Management

