Yavuz CANBAY, Şeref SAĞIROĞLU, Yılmaz VURAL

A New Anonymization Model for Privacy Preserving Data Publishing: CANON

Data privacy is a challenging trade-off problem between privacy preserving and data utility. Anonymization is a fundamental approach for privacy preserving and also a hard trade-off problem. It enables to hide the identities of data subjects or record owners and requires to be developed near-optimal solutions. In this paper, a new multidimensional anonymization model (CANON) that employs vantage-point tree (VPtree) and multidimensional generalization for greedy partitioning and anonymization, respectively, is proposed and introduced successfully for the first time. The main concept of CANON is inspired from Mondrian, which is an anonymization model for privacy preserving data publishing. Experimental results have shown that CANON takes data distribution into consideration and creates equivalence classes including closer data points than Mondrian. As a result, CANON provides better data utility than Mondrian in terms of GCP metric and it is a promising anonymization model for future works.

Keywords:

Data privacy, anonymization, data publishing,

PDF

___

[1] Fang, W., Wen, X., Zheng, Y. & Zhou, M. A Survey of Big Data Security and Privacy Preserving. IETE Technical Review. 34, 544-560 (2017)
[2] Hasan, A. & Jiang, Q. A General Framework for Privacy Preserving Sequential Data Publishing. International Conference On Advanced Information Networking And Applications Workshops. pp. 519-524 (2017)
[3] Almasi, M., Siddiqui, T., Mohammed, N. & Hemmati, H. The Risk-Utility Tradeoff for Data Privacy Models. International Conference On New Technologies, Mobility And Security. pp. 1-5 (2016)
[4] Chen, X. & Huang, V. Privacy Preserving Data Publishing for Recommender System. IEEE Annual Computer Software And Applications Conference Workshops. pp. 128-133 (2012)
[5] Wang, R., Zhu, Y., Chang, C. & Peng, Q. Privacy-preserving Highdimensional Data Publishing for Classification. Computers And Security (2020)
[6] Chibba, M. & Cavoukian, A. Privacy, Consumer Trust and Big Data: Privacy by Design and the 3 C’s. ITU Kaleidoscope: Trust In The Information Society. pp. 1-5 (2015)
[7] Jain, P., Gyanchandani, M. & Khare, N. Big Data Privacy: A Technological Perspective and Review. Journal Of Big Data. 3, 25 (2016)
[8] Nayahi, J. & Kavitha, V. Privacy and utility preserving data clustering for data anonymization and distribution on Hadoop. Future Generation Computer Systems. 74 pp. 393-408 (2017)
[9] Tang, Q., Wu, Y., Liao, S. & Wang, X. Utility-based k-Anonymization. International Conference On Networked Computing And Advanced Information Management. pp. 318-323 (2010)
[10] Fung, B., Wang, K., Fu, A. & Philip, S. Introduction to Privacy-Preserving Data Publishing: Concepts and Techniques. (CRC Press,2010)
[11] Fung, B., Wang, K., Chen, R. & Yu, P. Privacy-Preserving Data Publishing: A Survey of Recent Developments. Computing Surveys. 42, 14 (2010)
[12] Sweeney, L. k-Anonymity: A Model for Protecting Privacy. International Journal Of Uncertainty, Fuzziness And Knowledge-Based Systems. 10, 557-570 (2002)
[13] Machanavajjhala, A., Gehrke, J., Kifer, D. & Venkitasubramaniam, M. l-Diversity: Privacy Beyond k-Anonymity. (IEEE,2006)
[14] Li, N., Li, T. & Venkatasubramanian, S. t-Closeness: Privacy Beyond k-Anonymity and l-Diversity. IEEE International Conference On Data Engineering. pp. 106-115 (2007)
[15] Abdelhameed, S., Moussa, S. & Khalifa, M. Privacy-Preserving Tabular Data Publishing: A Comprehensive Evaluation From Web to Cloud. Computers And Security. 72 pp. 74-95 (2017)
[16] Dwork, C. Differential Privacy. International Colloquium On Automata, Languages And Programming. pp. 1-12 (2006)
[17] Meyerson, A. & Williams, R. On the Complexity of Optimal k-Anonymity. ACM SIGMOD-SIGACT-SIGART Symposium On Principles Of Database Systems. pp. 223-228 (2004)
[18] Aggarwal, G., Feder, T., Kenthapadi, K., Motwani, R., Panigrahy, R., Thomas, D. & Zhu, A. Approximation Algorithms for k-Anonymity. Journal Of Privacy Technology. pp. 1-18 (2005)
[19] Aggarwal, G., Feder, T., Kenthapadi, K., Motwani, R., Panigrahy, R., Thomas, D. & Zhu, A. Anonymizing Tables. International Conference On Database Theory. pp. 246-258 (2005)
[20] Sun, X., Sun, L. & Wang, H. Extended k-Anonymity Models Against Sensitive Attribute Disclosure. Computer Communications. 34, 526-535 (2011)
[21] Bonizzoni, P., Della Vedova, G. & Dondi, R. Anonymizing Binary and Small Tables is Hard to Approximate. Journal Of Combinatorial Optimization. 22, 97-119 (2011)
[22] Blocki, J. & Williams, R. Resolving the Complexity of Some Data Privacy Problems. International Colloquium On Automata, Languages, And Programming. pp. 393-404 (2010)
[23] Scott, A., Srinivasan, V. & Stege, U. k-Attribute-Anonymity is Hard Even for k=2. Information Processing Letters. 115, 368-370 (2015)
[24] Chen, R., Fung, B., Mohammed, N., Desai, B. & Wang, K. Privacypreserving trajectory data publishing by local suppression. Information Sciences. 231 pp. 83-97 (2013)
[25] Sweeney, L. Achieving k-Anonymity Privacy Protection Using Generalization and Suppression. International Journal Of Uncertainty, Fuzziness And Knowledge-Based Systems. 10, 571-588 (2002)
[26] LeFevre, K., DeWitt, D. & Ramakrishnan, R. Incognito: Efficient Fulldomain k-Anonymity. ACM SIGMOD International Conference On Management Of Data. pp. 49-60 (2005)
[27] Kohlmayer, F., Prasser, F., Eckert, C., Kemper, A. & Kuhn, K. Flash: Efficient, Stable and Optimal k-Anonymity. International Conference On Privacy, Security, Risk And Trust And International Confernece On Social Computing. pp. 708-717 (2012)
[28] Sweeney, L. Datafly: A System for Providing Anonymity in Medical Data. Database Security XI. pp. 356-381 (1998)
[29] Wang, K., Yu, P. & Chakraborty, S. Bottom-Up Generalization: A Data Mining Solution to Privacy Protection. IEEE International Conference On Data Mining. pp. 249-256 (2004)
[30] Fung, B., Wang, K. & Yu, P. Top-Down Specialization for Information and Privacy Preservation. International Conference On Data Engineering. pp. 205-216 (2005)
[31] LeFevre, K., DeWitt, D. & Ramakrishnan, R. Mondrian Multidimensional k-Anonymity. International Conference On Data Engineering. pp. 25-25 (2006)
[32] Kacha, L., Zitouni, A. & Djoudi, M. KAB: A new k-anonymity approach based on black hole algorithm. Journal Of King Saud University-Computer And Information Sciences. (2021)
[33] Bhati, B., Ivanchev, J., Bojic, I., Datta, A. & Eckhoff, D. Utility-Driven k-Anonymization of Public Transport User Data. IEEE Access. 9 pp. 23608-23623 (2021)
[34] Mahanan, W., Chaovalitwongse, W. & Natwichai, J. Data privacy preservation algorithm with k-anonymity. World Wide Web. pp. 1-11 (2021)
[35] Andrew, J. & Karthikeyan, J. Privacy-preserving big data publication:(K,L) anonymity. Intelligence In Big Data Technologies—Beyond The Hype. pp. 77-88 (2021)
[36] Wang, P., Huang, P., Tsai, Y. & Tso, R. An Enhanced Mondrian Anonymization Model based on Self-Organizing Map. Asia Joint Conference On Information Security. pp. 97-100 (2020)
[37] Andrew, J., Karthikeyan, J. & Jebastin, J. Privacy preserving big data publication on cloud using mondrian anonymization techniques and deep neural networks. International Conference On Advanced Computing And Communication Systems. pp. 722-727 (2019)
[38] Nezarat, A. & Yavari, K. A distributed method based on mondrian algorithm for big data anonymization. International Congress On High-Performance Computing And Big Data Analysis. pp. 84-97 (2019)
[39] Ashkouti, F. & Sheikhahmadi, A. DI-Mondrian: Distributed improved Mondrian for satisfaction of the L-diversity privacy model using Apache Spark. Information Sciences. 546 pp. 1-24 (2021)
[40] Tang, Q., Wu, Y. & Wang, X. New Algorithm with Lower Upper Size Bound for k-Anonymity. International Conference On Communication Systems, Networks And Applications. 1 pp. 421-424 (2010)
[41] Liu, K., Kuo, C., Liao, W. & Wang, P. Optimized Data de-Identification Using Multidimensional k-Anonymity. IEEE International Conference On Trust, Security And Privacy In Computing And Communication, IEEE International Conference On Big Data Science And Engineering. pp. 1610-1614 (2018)
[42] Gong, Q., Yang, M., Chen, Z. & Luo, J. Utility Enhanced Anonymization for Incomplete Microdata. International Conference On Computer Supported Cooperative Work In Design. pp. 74-79 (2016)
[43] Nergiz, M. & G¨ok, M. Hybrid k-Anonymity. Computers And Security. 44 pp. 51-63 (2014)
[44] LeFevre, K., DeWitt, D. & Ramakrishnan, R. Workload-aware Anonymization. ACM SIGKDD International Conference On Knowledge Discovery And Data Mining. pp. 277-286 (2006)
[45] Kumar, N., Zhang, L. & Nayar, S. What is a Good Nearest Neighbors Algorithm for Finding Similar Patches in Images?. (Springer,2008)
[46] Dolatshah, M., Hadian, A. & Minaei-Bidgoli, B. Ball*-Tree: Efficient Spatial Indexing for Constrained Nearest-Neighbor Search in Metric Spaces. Preprint ArXiv:1511.00628. (2015)
[47] Witten, I., Frank, E., Hall, M. & Pal, C. Data Mining: Practical Machine Learning Tools and Techniques. (Morgan Kaufmann,2016)
[48] Wang, J., Wang, N., Jia, Y., Li, J., Zeng, G., Zha, H. & Hua, X. Trinary-Projection Trees for Approximate Nearest Neighbor Search. IEEE Transactions On Pattern Analysis And Machine Intelligence. 36, 388-403 (2014)
[49] Bentley, J. Multidimensional Binary Search Trees Used for Associative Searching. Communications Of The ACM. 18, 509-517 (1975)
[50] Yianilos, P. Data Structures and Algorithms for Nearest Neighbor Search In General Metric Spaces. Symposium On Discrete Algorithms. 93 pp. 311-321 (1993)
[51] Fu, A., Chan, P., Cheung, Y. & Moon, Y. Dynamic Vp-Tree Indexing for N-Nearest Neighbor Search Given Pair-Wise Distances. International Journal On Very Large Data Bases. 9, 154-173 (2000)
[52] Bozkaya, T. & Ozsoyoglu, M. Distance-Based Indexing for High-Dimensional Metric Spaces. ACM SIGMOD International Conference On Management Of Data. 26 pp. 357-368 (1997)
[53] DeFreitas, T., Saddiki, H. & Flaherty, P. GEMINI: A Computationally-Efficient Search Engine for Large Gene Expression Datasets. BMC Bioinformatics. 17, 102 (2016)
[54] B¨ohm, C., Berchtold, S. & Keim, D. Searching in High-Dimensional Spaces: Index Structures for Improving the Performance of Multimedia Databases. ACM Computing Surveys. 33, 322-373 (2001)
[55] Kuznetsov, A. & Myasnikov, E. Copy-Move Detection Algorithm Efficiency Increase Using Binary Space Partitioning Trees. CEUR Workshop Proceedings. 1638 pp. 373-378 (2016)
[56] Zhang, F., Di, P., Zhou, H., Liao, X. & Xue, J. RegTT: Accelerating Tree Traversals on GPUs by Exploiting Regularities. International Conference On Parallel Processing. pp. 562-571 (2016)
[57] Kristensen, T. & Pedersen, C. Data Structures for Accelerating Tanimoto Queries on Real Valued Vectors. International Workshop On Algorithms In Bioinformatics. pp. 28-39 (2010)
[58] Dheeru, D. & Taniskidou, E. UCI Machine Learning Repository., http://archive.ics.uci.edu/ml
[59] Strack, B., DeShazo, J., Gennings, C., Olmo, J., Ventura, S., Cios, K. & Clore, J. Impact of HbA1c measurement on hospital readmission rates: analysis of 70,000 clinical database patient records. BioMed Research International. (2014)
[60] Skowron, A. & Rauszer, C. The Discernibility Matrices and Functions in Information Systems. Intelligent Decision Support. pp. 331-362 (1992)
[61] Ghinita, G., Karras, P., Kalnis, P. & Mamoulis, N. Fast Data Anonymization with Low Information Loss. International Conference On Very Large Databases. pp. 758-769 (2007)