Deep learning-aided automated personal data discovery and profiling

In Turkey, the Personal Data Protection Rule (PDPR) No. 6698, in force since 2016, gives citizens legal protection of their personal data. Although the law provides clear guidance, companies still struggle to comply with its requirements for storing, sharing, and monitoring personal data. Because no purpose-built software with wide industrial adoption is available on the market, almost all companies have no choice but to carry out expensive and error-prone manual operations to ensure compliance. In this paper, we present an automated solution to facilitate and accelerate PDPR compliance. In a structured or unstructured document, words or phrases that may contain personal data entities are tagged by a Bi-LSTM-based named entity recognition model together with a rule-based matching component that employs contextual analysis. To find associations among personal data and obtain individual personal profiles, these entities are divided into categories according to their confidence levels. Personal profiles are then constructed using an approach similar to clustering: personal data categories with high identification levels are treated as separate clusters, and related personal data entities are sought to the left and/or right of each cluster's context. We evaluated the system on a data set of 70 documents covering different document types and personal data entities. We obtained a micro-averaged F1-measure of 91.76% for personal data entity classification and an accuracy of 72.46% for profile extraction. We also ran experiments on the performance of the named entity recognition model and on the time complexity of the overall system using a data set of 33K documents.
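The clustering-like profile construction described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Entity` and `Profile` types, the `ANCHOR_CATEGORIES` set, the `window` size, and the nearest-anchor assignment rule are all assumptions introduced here for illustration. The abstract states only that high-identification categories act as separate clusters and that related entities are sought to the left and/or right of their context.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    position: int   # token offset of the entity in the document
    category: str   # e.g. "NAME", "PHONE" (hypothetical labels)
    text: str


@dataclass
class Profile:
    anchor: Entity                               # high-identification entity
    related: list = field(default_factory=list)  # entities attached to it


# Assumption: which categories count as "high identification level".
ANCHOR_CATEGORIES = {"NAME"}


def build_profiles(entities, window=10):
    """Attach each non-anchor entity to the nearest anchor that lies
    within `window` tokens to its left or right (assumed rule)."""
    ordered = sorted(entities, key=lambda e: e.position)
    profiles = [Profile(e) for e in ordered if e.category in ANCHOR_CATEGORIES]
    for ent in ordered:
        if ent.category in ANCHOR_CATEGORIES:
            continue
        in_range = [p for p in profiles
                    if abs(p.anchor.position - ent.position) <= window]
        if in_range:
            nearest = min(in_range,
                          key=lambda p: abs(p.anchor.position - ent.position))
            nearest.related.append(ent)
    return profiles


# Made-up example data, not taken from the paper's data set.
entities = [
    Entity(0, "NAME", "Ayşe Yılmaz"),
    Entity(3, "PHONE", "555 0100"),
    Entity(37, "EMAIL", "mk@example.com"),
    Entity(40, "NAME", "Mehmet Kaya"),
    Entity(100, "IBAN", "TR00"),
]
for profile in build_profiles(entities):
    print(profile.anchor.text, [e.text for e in profile.related])
```

In this sketch an entity with no anchor inside the window (the `IBAN` entry here) is left unassigned. A real system would need a policy for such cases, as well as for ties between two equally distant anchors.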

Turkish Journal of Electrical Engineering and Computer Sciences
  • ISSN: 1300-0632
  • Publication frequency: 6 issues per year
  • Publisher: TÜBİTAK