Turgay Tugay BİLGİN, Murat OĞUZ

Sık Alt Çizge Madenciliği Algoritmalarının Bellek Gereksinimlerini En Aza İndirmek İçin Yeni Bir Yaklaşım

Sık alt çizge madenciliği (SAÇM), çizge sınıflandırma ve çizge kümeleme için yaygın olarak kullanılan bir çizge madenciliği alt türüdür. Son on yılda, birçok verimli SAÇM algoritması geliştirilmiştir. Geliştirmeler genellikle algoritma yapısını değiştirerek veya paralel programlama teknikleri kullanarak zaman karmaşıklığını azaltmaya odaklanmıştır. SAÇM algoritmalarının çözülmesi gereken en önemli problemlerinden biri yüksek bellek tüketimidir. Bu çalışmada, SAÇM algoritmalarının bellek gereksinimini en aza indirmek için Öngörücü Dinamik Boyutlu Yapı Paketleme (ÖDBYP) adı verilen yeni bir yaklaşım önerilmiştir. Önerilen yaklaşım SAÇM algoritmalarının iç veri yapılarında herhangi bir algoritmik değişiklik yapmadan yeniden tasarlamaya olanak sağlamaktadır. Bu çalışma kapsamında geliştirilen ÖDBYP ile very madenciliği alanına iki önemli katkı sağlanmaktadır. Birincisi, yeni tasarlanmış işaretsiz bir tamsayı veri türü olan Dinamik Boyutlu Tamsayı Türüdür (ds_Int). İkinci katkı, derleyicinin davranışını değiştiren bir veri yapısı paketleme tekniği kullanan “Veri Yapısı paketleme” bileşenidir. ÖDBYP yaklaşımının etkinliğini ve verimliliğini, gSpan ve Gaston adlı güncel algoritmalara gömerek çeşitli deneyler gerçekleştirilmiştir. Çalışma kapsamında geliştirilen yöntem ile algoritmaların original halleri ile kıyaslanmıştır. Neredeyse tüm sonuçlar, önerilen uygulamanın her destek düzeyinde daha az bellek harcadığını göstermektedir. Sonuç olarak, ÖDBYP uzantıları bellek tasarrufu sağlayabilir ve veri kümesine bağlı olarak maksimum bellek kullanımı % 38 kadar düşürülebilmektedir.

A New Approach to Minimize Memory Requirements of Frequent Subgraph Mining Algorithms

Frequent subgraph mining (FSM) is a subsection of graph mining domain which is extensively used for graph classification andgraph clustering purposes. Over the past decade, many efficient FSM algorithms have been developed. The improvements generallyfocus on reducing time complexity by changing the algorithm structure or using parallel programming techniques. FSM algorithmshave another problem to solve, which is the high memory consumption. In this study, a new approach called Predictive DynamicSized Structure Packing (PDSSP) have been proposed to minimize the memory requirement of FSM algorithms. Proposed approachredesigns the internal data structures of FSM algorithms without any algorithmic modifications. PDSSP has two contributions. Thefirst one is the Dynamic Sized Integer Type (ds_Int) which is a newly designed unsigned integer data type. The second contributionis “Data Structure packaging” component that uses a data structure packing technique which changes the behaviour of the compiler.A number of experiments have been conducted to examine the effectiveness and efficiency of the PDSSP approach by embeddingit into two state-of-art algorithms called gSpan and Gaston. Proposed implementation have been compared to the official one.Almost all results show that the proposed implementation consumes less memory on each support level. As a result, PDSSPextensions can save memory and the peak memory usage may decrease up to 38% depending on the dataset.

PDF

___

[1] Burkhardt P., Waring C., “An NSA big graph experiment”. In presentation at the Carnegie Mellon University SDI/ISTC Seminar, Pittsburgh, USA, (2013).
[2] Fleury E., Lattanzi S., Mirrokni V., Perozzi B., “ASYMP: faulttolerant mining of massive graphs.” arXiv preprint arXiv:1712.09731, (2017).
[3] Muttipati A. S., Padmaja P., “Analysis of large graph partitioning and frequent subgraph mining on graph data”. International Journal of Advanced Research in Computer Science, 1: 6-7, (2015).
[4] Carlos G. V., Esteban M., “Comparative Analysis of de Bruijn Graph Parallel Genome Assemblers”, 2018 IEEE International Work Conference on Bioinspired Intelligence (IWOBI), 1-8, (2018).
[5] Talukder N., Zaki, M. J., “A distributed approach for graph mining in massive networks”. Data Mining and Knowledge Discovery, 30: 1024-1052, (2016).
[6] Di Fatta G., Berthold M. R., “High performance subgraph mining in molecular compounds”. In International Conference on High Performance Computing and Communications, 866- 877, (2005).
[7] Stratikopoulos A., Chrysos G., Papaefstathiou I., and Dollas A., “HPC-gSpan: An FPGA-based parallel system for frequent subgraph mining”. In IEEE 2014 24th International Conference on Field Programmable Logic and Applications (FPL),1-4, (2014).
[8] Aggarwal C. C., Bhuiyan M. A., Al Hasan M., “Frequent pattern mining algorithms: A survey”. In: Aggarwal CC, Han J, editors. Frequent Pattern Mining. Switzerland: Springer International Publishing, 19–64, (2014).
[9] Anastasiu D. C., Iverson J., Smith S., Karypis G., “Big data frequent pattern mining”. In: Frequent Pattern Mining. Switzerland: Springer International Publishing, 225–259, (2014).
[10] Yan X., Han J., “gSpan: Graph-based substructure pattern mining”. In: IEEE 2003 International Conference on Data Mining, 721–724, (2003).
[11] Nijssen S., Kok J. N., “The Gaston tool for frequent subgraph mining”. Electronic Notes in Theoretical Computer Science, 127: 77–87, (2005).
[12] Yan X., Han J., “CloseGraph: Mining closed frequent graph patterns”. In: ACM SIGKDD 2003 International Conference on Knowledge Discovery and Data Mining, 286–295, (2003).
[13] Lakshmi K., Meyyappan D.T., “A comparative study of frequent subgraph mining algorithms”. IJITCS 2012, 2: 2, (2012).
[14] Borgelt C., Berthold M. R., “Mining molecular fragments: Finding relevant substructures of molecules”. In: IEEE 2003 International Conference on Data Mining, 51–58, (2003).
[15] Guan B., Zan X. Z., Xiao B.Y., Ma R. N., Zhang F.Y., and Liu W. B., "Detecting dense subgraphs in complex networks based on edge density coefficient", Chinese Journal of Electronics, 22: 517–520, (2013).
[16] Kuramochi M., Karypis G., “Frequent subgraph discovery”. In: IEEE 2001 International Conference on Data Mining, 313– 320, (2001).
[17] Vanetik N., Gudes E., Shimony S. E., “Computing frequent graph patterns from semistructured data”. In: IEEE 2002 International Conference on Data Mining, 458–465, (2002).
[18] Rehman S. U., Asghar S., and Fong S. J., “Optimized and Frequent Subgraphs: How Are They Related?”. IEEE Access, 6: 37237-37249, (2018).
[19] Yang S., Guo R., Liu R., Liao X., Zou Q., Shi B. and Peng S., “cmFSM: a scalable CPU-MIC coordinated drug-finding tool by frequent subgraph mining”. BMC bioinformatics, 19: 98, (2018).
[20] Horton I., “Working with fundamental data types”. In: Anglin S, lead editor. Beginning C++. New York, NY, USA: Apress Press, 55–77, (2014).
[21] Bader D. A., Meyerhenke H., Sanders P., Wagner D. “Benchmarking for graph clustering and partitioning”. Encyclopedia of Social Network Analysis and Mining, 73–84, (2012).
[22] Bryant R. E., David Richard O. H., “Computer systems: a programmer's perspective.” Upper Saddle River: Prentice Hall; (2003).
[23] https:/docs.microsoft.com/en-us/previous-versions/ms253935
[24] "Valgrind and Massif Visualizer tools", http://valgrind.org/info/tools.html, (2017).
[25] Wale N., Ian A. W., and Karypis G., "Comparison of descriptor spaces for chemical compound retrieval and classification." Knowledge and Information Systems, 347-375, (2008).
[26] Dobson P. D., Doig A. J., “Distinguishing enzyme structures from non-enzymes without alignments”. Journal of Molecular Biology, 330: 771–783, (2003).
[27] Wajdi D., Sabeur A., Mephu N. E., “MR-SimLab: Scalable subgraph selection with label similarity for big data”. Information Systems, 69: 155-163, (2017).
[28] Thoma M., Cheng H., Gretton A., Han J., Kriegel H. P., Smola A., Song L., Yu P. S., Yan X., Borgwardt K. M., “Discriminative frequent subgraph mining with optimality guarantees”. Statistical Analysis and Data Mining. The ASA Data Science Journal; 3: 302–318, (2010).
[29] http://liacs.leidenuniv.nl/~nijssensgr/gaston/index.html
[30] Wörlein M., Meinl T., Fischer I., and Philippsen, M., “A quantitative comparison of the subgraph miners MoFa, gSpan, FFSM, and Gaston.”, In: European Conference on Principles of Data Mining and Knowledge Discovery, 392-403, (2005).