İsa Ahmet GÜNEY, Gürhan KÜÇÜK

A Fine-grain and scalable set-based cache partitioning through thread classification

As contemporary processors utilize more and more cores, cache partitioning algorithms tend to preserve cacheassociativity with a finer-grain of control to achieve higher throughput and fairness goals. In this study, we propose ascalable set-based cache partitioning mechanism, which welds an allocation policy and an enforcement scheme together.We also propose a set-based classifier to better allocate partitions to more deserving threads, a fast set redirection logicto map accesses to dedicated cache sets, and a double access mechanism to overcome the performance penalty due toa repartitioning phase. We compare our work to the best line-grain cache partitioning scheme that is available in theliterature. Our results show that set-based partitioning improves throughput and fairness by 5.6% and 4.8% on average,respectively. The maximum achievable gains are as high as 33% in terms of throughput and 23% in terms of fairness.

PDF

___

[1] Qureshi MK, Patt YN. Utility-based cache partitioning: A low-overhead, high-performance, runtime mechanism to partition shared caches. In: MICRO 39 Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture; Orlando, FL, USA; 2006. pp. 423-432.
[2] Ovant BS, Güney İAG, Savaş ME, Küçük GK. Allocation of last level cache partitions through thread classification with parallel universes. In: 2016 International Conference on High Performance Computing Simulation; Inssbruck, Austria; 2016. pp. 204-212.
[3] Güney İA, Yıldız A, Bayındır İU, Serdaroğlu KÇ, Bayık U et al. A machine learning approach for a scalable, energy-efficient utility-based cache partitioning. In: High Performance Computing: 30th International Conference, ISC High Performance; Frankfurt, Germany; 2015. pp. 409-421.
[4] Xie Y, Loh GH. PIPP: promotion/insertion pseudo-partitioning of multi-core shared caches. ACM SIGARCH Computer Architecture News 2009; 37 (3): 174-183. doi: 10.1145/1555815.1555778
[5] Sanchez D, Kozyrakis C. Scalable and efficient fine-grained cache partitioning with Vantage. IEEE Micro 2012; 32 (3): 26-37. doi: 10.1109/MM.2012.19
[6] Wang R, Chen L. Futility scaling: High-associativity cache partitioning. In: MICRO-47 Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture; Cambridge, UK; 2014. pp. 256-367.
[7] Beckmann N, Sanchez D. Talus: A simple way to remove cliffs in cache performance. In: IEEE International Symposium on High Performance Computer Architecture (HPCA); Burlingame, CA, USA; 2015. pp. 64-75.
[8] Suh GE, Devadas S, Rudolph L. Analytical cache models with applications to cache partitioning. In: ICS ’01 Proceedings of the 15th International Conference on Supercomputing; Sorrento, Italy; 2001. pp. 1-12.
[9] Stone HS, Turek J, Wolf JL. Optimal partitioning of cache memory. IEEE Transactions on Computers 1992; 41 (9): 1054-1068. doi: 10.1109/12.165388
[10] Brock J, Ye C, Ding C, Li Y, Wang X et al. Optimal cache partition-sharing. In: ICPP ’15 Proceedings of the 2015 44th International Conference on Parallel Processing (ICPP); Beijing, China; 2015. pp. 749-758.
[11] Duong N, Zhao D, Kim T, Cammarota R, Valero M et al. Improving cache management policies using dynamic reuse distances. In: Annual IEEE/ACM International Symposium on Microarchitecture; Vancouver, BC, Canada; 2012. pp. 389-400.
[12] Li L, Lu J, Cheng X. Block value based insertion policy for high performance last-level caches. In: ICS ’14 Proceedings of the 28th ACM International Conference on Supercomputing; Munich, Germany; 2014. pp. 63-72.
[13] Abousamra S, El-Mahdy A, Selim S. Fair and adaptive online set-based cache partitioning. In: The 2011 International Conference on Computer Engineering & Systems; Cairo, Egypt; 2011. pp. 9-16.
[14] Chang J, Sohi GS. Cooperative cache partitioning for chip multiprocessors. In: ACM International Conference on Supercomputing 25th Anniversary Volume; Munich, Germany; 2014. pp. 402-412.
[15] Suh GE, Rudolph L, Devadas S. Dynamic partitioning of shared cache memory. The Journal of Supercomputing 2004; 28 (1): 7-26. doi: 10.1023/B:SUPE.0000014800.27383.8f
[16] Beckmann N, Sanchez D. Jigsaw: Scalable software-defined caches. In: Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques; Edinburgh, UK; 2013. pp. 213-224.
[17] Kim S, Chandrra D, Solihin Y. Fair cache sharing and partitioning in a chip multiprocessor architecture. In: Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques; Antibes, Juan-les-Pins, France; 2004. pp. 111-122.
[18] El-Sayed B, Mukkara A, Tsai O, Kasture H, Ma X et al. KPart: A hybrid cache partitioning-sharing technique for commodity multicores. In: 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA); Vienna, Austria; 2018. pp. 104-117.
[19] Wang X, Chen S, Setter J, Martinez JF. SWAP: Effective fine-grain management of shared last-level caches with minimum hardware support. In: 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA); Austin, TX, USA; 2017. pp. 121-132.
[20] Ye Y, West R, Cheng Z, Li Y. COLORIS: A dynamic cache partitioning system using page coloring., In: International Conference on Parallel Architecture and Compilation Techniques (PACT); Edmonton, AB; 2014. pp. 381-392.
[21] Sherwood T, Calder B, Emer JS. Reducing cache misses using hardware and software page placement. In: ICS ’99 Proceedings of the 13th International Conference on Supercomputing; Rhodes, Greece; 1999. pp. 155-164.
[22] Tam DK, Azimi R, Soares L, Stumm M. Managing shared L2 caches on multicore systems in software. In: Workshop on the Interaction Between Operating Systems and Computer Architecture, Citeseer, 2007. pp. 26-33.
[23] Zhang L, Liu Y, Wanr G, Qian D. Lightweight dynamic partitioning for last-level cache of multicore processor on real system. The Journal of Supercomputing 2014; 69 (2): 547-560. doi: 10.1007/s11227-014-1092-2
[24] Zarandi HR, Sarbazi-Azad H. Hierarchical binary set partitioning in cache memories. The Journal of Supercomputing 2005; 31 (2): 185-202. doi: 10.1007/s11227-005-0106-5
[25] Sanchez D, Kozyrakis C. The ZCache: decoupling ways and associativity. In: MICRO ’43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture; Atlanta, GA, USA; 2010. pp. 187-198.
[26] Ranganathan P, Adve S, Jouppi NP. Reconfigurable caches and their application to media processing. In: ISCA ’00 Proceedings of the 27th Annual International Symposium on Computer Architecture; Vancouver, BC, Canada; 2000. pp. 214-224.
[27] Varadarajan K, Nandy SK, Sharda V, Bharadwaj A, Iyer R et al. Molecular caches: A caching structure for dynamic creation of application-specific Heterogeneous cache regions. In: MICRO 39 Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture; Orlando, FL, USA; 2006. pp. 433-442.
[28] Vassiliadis S, Phillips J, Blaner B. Interlock collapsing ALU’s. IEEE Transactions on Computers 1993; 42 (7): 825-839. doi: 10.1109/12.237723
[29] Kaxiras S, Hu Z, Martonosi M. Cache decay: Exploiting generational behavior to reduce cache leakage power. In: Proceedings 28th Annual International Symposium on Computer Architecture; Goteborg, Sweden; 2001. pp. 240-251.
[30] Henning JL. SPEC CPU2006 benchmark descriptions.ACM SIGARCH Computer Architecture News 2006; 34 (4): 1-17. doi: 10.1145/1186736.1186737
[31] Vandierendonck H, Seznec A. Fairness metrics for multi-threaded processors. IEEE Computer Architecture Letters 2011; 10 (1): 4-7. doi: 10.1109/L-CA.2011.1
[32] Mattson RL, Gecsei J, Slutz DR, Traiger IL. Evaluation techniques for storage hierarchies. IBM Systems Journal 1970; 9 (2): 78-117. doi: 10.1147/sj.92.0078