Differential privacy-based feature selection for classification

One of the most important preliminary steps in data mining and machine learning solutions is determining a suitable subset of the attributes of the data to be used in the analysis. For classification methods, this is done by measuring how strongly each attribute is associated with the class attribute. Many privacy-preserving classification solutions exist. However, no feature selection solutions have been developed for these methods. This study presents novel feature selection methods based on differential privacy, the most comprehensive and secure solution known in statistical database security. The proposed methods are integrated with WEKA, a widely used data mining library, and experimental results show that they have a positive effect on classification performance.

Keywords:

differential privacy, classification, feature selection
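
As a purely illustrative sketch of the general idea (not the method proposed in this study, whose details are not given in the abstract), the Python code below scores each attribute by the information gain computed from Laplace-perturbed contingency counts and keeps the top k attributes. The function names, the even split of the privacy budget across attributes, the add/remove-one-record notion of neighbouring datasets (which gives each contingency table a sensitivity of 1), and the assumption that attribute domains and class labels are public knowledge are all assumptions made for this example.

```python
import math
import random
from collections import Counter


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def entropy(counts):
    # Shannon entropy (in bits) of a list of non-negative counts.
    total = sum(counts)
    if total <= 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def noisy_info_gain(column, labels, values, classes, epsilon, rng):
    # Exact (attribute value, class) contingency counts.
    exact = Counter(zip(column, labels))
    # Assumption: adding or removing one record changes exactly one cell
    # by 1, so Laplace noise with scale 1/epsilon is added to every cell.
    # Clamping at zero is post-processing and does not affect privacy.
    noisy = {
        (v, c): max(0.0, exact[(v, c)] + laplace_noise(1.0 / epsilon, rng))
        for v in values
        for c in classes
    }
    total = sum(noisy.values())
    if total <= 0:
        return 0.0
    class_totals = [sum(noisy[(v, c)] for v in values) for c in classes]
    gain = entropy(class_totals)
    for v in values:
        row = [noisy[(v, c)] for c in classes]
        gain -= (sum(row) / total) * entropy(row)
    return gain


def select_features(rows, labels, domains, classes, k, epsilon, seed=None):
    # `domains` maps each attribute name to its publicly known value list
    # (hypothetical input format chosen for this sketch).
    rng = random.Random(seed)
    # Assumption: the total budget is split evenly across attributes
    # (sequential composition).
    eps_each = epsilon / len(domains)
    scores = {}
    for name, values in domains.items():
        column = [row[name] for row in rows]
        scores[name] = noisy_info_gain(
            column, labels, values, classes, eps_each, rng
        )
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Calling `select_features(rows, labels, domains, classes, k, epsilon)` with `rows` given as a list of dictionaries returns the names of the k highest-scoring attributes. A smaller `epsilon` means more noise and therefore a less reliable ranking, which is the usual trade-off between privacy and utility.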
