Development of a New Synthetic Data Generation Approach Based on Genetic Algorithm

Artificial intelligence-based studies attract great interest in business sectors as a means of building decision support systems, producing effective outputs, increasing system efficiency, and providing cost-effective solutions. In particular, they drive innovation by helping the innovation process develop, accelerate, and evolve toward its target domain. Data is critical to realizing these innovations: it plays an important role in making models trained by algorithms operational on computers or special-purpose machines. However, insufficient data access, legal regulations, ethical rules, confidentiality procedures, privacy, data-sharing restrictions, and cost are obstacles to unlocking the potential of data. Synthetic data generation is a preferred way of overcoming these obstacles. Because there is no standard framework for synthetic data generation, however, research into new and up-to-date approaches continues. This study proposes a new synthetic data generation approach based on a genetic algorithm. To produce artificial data that follows the dynamics of the original dataset, crossover and mutation genetic operators adapted to the target dataset were used to increase data diversity and obtain a new generation. The category of each artificial sample in this generation was then determined using the cost function component of the genetic algorithm. In the final stage, six different machine learning classifiers were used to measure how closely the generated artificial data resembles the original data. On the enriched dataset, the Support Vector Machine classifier achieved the maximum sensitivity, 100%. This result indicates that training performance tends to improve in proportion to the increasing amount of data.

Keywords:

Synthetic data generation, genetic algorithm, machine learning classifiers
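
The pipeline summarized in the abstract (crossover and mutation to create a new generation, then a cost function to assign each artificial sample a category) can be illustrated with a minimal Python sketch. The abstract does not specify the operators or the cost function, so everything concrete here is an assumption made for illustration: single-point crossover, mutation by uniform resampling within each feature's observed range, and a nearest-class-centroid distance as the cost. The function names (`crossover`, `mutate`, `assign_category`, `generate`), the `rate` parameter, and the toy data are hypothetical and are not taken from the study.

```python
import numpy as np

def crossover(parent_a, parent_b, rng):
    """Single-point crossover (assumed operator): head of a, tail of b."""
    point = rng.integers(1, parent_a.size)  # cut point in 1..n_features-1
    return np.concatenate([parent_a[:point], parent_b[point:]])

def mutate(child, low, high, rate, rng):
    """Assumed mutation: with probability `rate`, resample a feature
    uniformly within the range observed in the original data."""
    mask = rng.random(child.size) < rate
    return np.where(mask, rng.uniform(low, high), child)

def assign_category(sample, centroids):
    """Hypothetical cost function: Euclidean distance to each class
    centroid; the class with the lowest cost becomes the label."""
    labels = list(centroids)
    costs = [np.linalg.norm(sample - centroids[c]) for c in labels]
    return labels[int(np.argmin(costs))]

def generate(X, y, n_new, rate=0.1, seed=0):
    """Create n_new artificial samples and label them via the cost function."""
    rng = np.random.default_rng(seed)
    low, high = X.min(axis=0), X.max(axis=0)
    centroids = {c: X[y == c].mean(axis=0) for c in np.unique(y)}
    new_X, new_y = [], []
    for _ in range(n_new):
        # Pick two distinct original samples as parents.
        i, j = rng.choice(len(X), size=2, replace=False)
        child = mutate(crossover(X[i], X[j], rng), low, high, rate, rng)
        new_X.append(child)
        new_y.append(assign_category(child, centroids))
    return np.array(new_X), np.array(new_y)

# Toy data (made up for this sketch): two classes, three features.
X = np.array([[0.0, 0.1, 0.2], [0.2, 0.0, 0.1],
              [1.0, 0.9, 1.1], [0.9, 1.0, 1.0]])
y = np.array([0, 0, 1, 1])
X_new, y_new = generate(X, y, n_new=6)
```

Under these assumptions every generated feature value stays within the range seen in the original data, which is one simple way of keeping artificial samples close to the original dataset's dynamics. The enriched set (`X` stacked with `X_new`) could then be passed to classifiers such as a Support Vector Machine to compare sensitivity before and after enrichment, as the study does with six classifiers.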
