Mondrian-Based Real-Time Anonymization Model

The presence of individuals' private information in the data collections known as "Big Data" puts personal privacy at risk from disclosure attacks. To protect privacy in big data, anonymization methods are used to produce anonymized data, which is then stored and shared in that form. However, because anonymization causes information loss, anonymized data cannot be restored to its original form. The aim of this study is to develop a new method that anonymizes big data on the fly without altering the structure of the data held in the system. The Hadoop ecosystem is used to process the large data volumes. In the proposed model, user requests are handled by services in a middle layer and processed in the Hadoop ecosystem, which returns anonymized data. The algorithm used for anonymization was optimized, and its advantages over algorithms in the literature were recorded. The proposed model was found to be user-friendly in terms of issuing queries and obtaining an anonymized data set. According to the analysis results, the anonymization algorithm used in the model works 40% more efficiently than the other algorithms in terms of processing speed.
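As a rough illustration of the kind of partitioning that a Mondrian-style anonymizer performs, the sketch below applies strict median cuts to numeric quasi-identifiers until no cut leaves at least k records on both sides, then generalizes each group to (min, max) ranges. It is a single-machine simplification and not the optimized algorithm evaluated in this study; the function name `mondrian`, the sample records, and the column names are hypothetical.

```python
from statistics import median_low


def mondrian(records, qi, k):
    """Group records into partitions of at least k rows and replace each
    quasi-identifier value with the (min, max) range of its partition."""

    def span(part, col):
        vals = [r[col] for r in part]
        return max(vals) - min(vals)

    # Range of each quasi-identifier over the whole table, used to
    # normalize spans so that columns with different units are comparable.
    full = {c: span(records, c) or 1 for c in qi}

    def split(part):
        # Try the quasi-identifiers from widest to narrowest normalized range.
        for col in sorted(qi, key=lambda c: span(part, c) / full[c], reverse=True):
            cut = median_low([r[col] for r in part])
            left = [r for r in part if r[col] <= cut]
            right = [r for r in part if r[col] > cut]
            # A cut is allowed only if both sides keep at least k records.
            if len(left) >= k and len(right) >= k:
                return split(left) + split(right)
        return [part]  # no allowable cut: this partition is final

    out = []
    for part in split(list(records)):
        ranges = {c: (min(r[c] for r in part), max(r[c] for r in part)) for c in qi}
        out.extend({**r, **ranges} for r in part)
    return out


# Hypothetical sample data: "age" and "zip" are quasi-identifiers,
# "disease" is the sensitive attribute and is left unchanged.
rows = [
    {"age": 25, "zip": 53711, "disease": "flu"},
    {"age": 27, "zip": 53712, "disease": "cold"},
    {"age": 30, "zip": 53715, "disease": "asthma"},
    {"age": 45, "zip": 53703, "disease": "flu"},
    {"age": 47, "zip": 53706, "disease": "diabetes"},
    {"age": 52, "zip": 53710, "disease": "cold"},
]
anonymized = mondrian(rows, ["age", "zip"], k=2)
```

Each record in `anonymized` shares its generalized `age` and `zip` ranges with at least k - 1 other records, which is the k-anonymity property. A distributed version would have to perform the median and range computations across a cluster, which this sketch does not attempt.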

Keywords: Anonymization Privacy Protection Model, Spark
