High-speed data deduplication using parallelized cuckoo hashing Jane

High-speed data deduplication using parallelized cuckoo hashing Jane

Data deduplication is a capacity optimization technology used in backup systems for identifying and storing the nonredundant data blocks. The CPU intensive tasks involved in a hash-based deduplication system remain as challenges in improving the performance of the system. In this paper, we propose a parallel variant of the standard cuckoo hashing that enables the hashing technique to be performed in parallel. The CPU intensive tasks of fingerprint insertion and lookup operations are performed in parallel and distributed among the nodes of the deduplication cluster. Furthermore, the uniform handling of the blocks by the cluster nodes involved in the process of duplicate identification provides good load balance. Experimental evaluations using real-world backup and Linux kernel data sets reveal that the proposed deduplication system achieves up to 100% higher backup speed, up to 28% reduced lookup latency, and up to 24% reduced backup time than the other deduplication systems.

___

  • Bellare M, Keelveedhi S, Ristenpart T. Message-locked encryption and secure deduplication. In: Springer Annual International Conference on the Theory and Applications of Cryptographic Techniques; 26–30 May 2013; Athens, Greece. Berlin, Germany: Springer. pp. 296-312.
  • Paulo J, Pereira J. A survey and classification of storage deduplication systems. ACM Comput Surv 2014; 1: 1-30.
  • Rhea S, Cox R, Pesterev A. Fast, inexpensive content-addressed storage in foundation. In: USENIX Annual Technical Conference; 22–27 June 2008; Boston, MA, USA. Berkeley, CA, USA: USENIX. pp. 143-156.
  • Dan AA, Sharf A, Abbasinejad F, Sengupta S, Mitzenmacher M, Owens JD, Amenta N. Real-time parallel hashing on the GPU. ACM T Graphic 2009; 5: 1-9.
  • Kirsch A, Mitzenmacher M, Udi Wieder. More robust hashing: cuckoo hashing with a stash. Siam J Comput 2009; 4: 1543-1561.
  • Li X, Andersen DG, Kaminsky M, Freedman MJ. Algorithmic improvements for fast concurrent cuckoo hashing. In: Ninth European Conference on Computer Systems; 14–16 April 2014; Amsterdam, Netherlands. New York, NY, USA: ACM. pp. 1-14.
  • Fan B, Andersen DG, Kaminsky M. MemC3: compact and concurrent MemCache with dumber caching and smarter hashing. In: USENIX annual technical conference; 2-5 April 2013; Lombard, IL, USA. Berkeley, CA, USA: USENIX. pp. 385-398.
  • Debnath BK, Sengupta S, Li J. ChunkStash: speeding up inline storage deduplication using flash memory. In: USENIX annual technical conference; 23–25 June 2010; Boston, MA, USA. Berkeley, CA, USA: USENIX. pp.16-16.
  • Pagh R, Rodler FF. Cuckoo hashing. J Algorithm 2004; 2: 122-144.
  • Cormen TH, Leiserson CE, Rivest RL, Stein C. Introduction to Algorithms. 3rd ed. Cambridge, MA, USA: Massachusetts Institute of Technology, 2009.
  • Bhagwat D, Eshghi K, Long DDE, Lillibridge M. Extreme binning: scalable, parallel deduplication for chunkbased file backup. In: IEEE International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems; 21–23 September 2009; London, UK. New York, NY, USA: IEEE. pp. 1-9.
  • Min J, Yoon D, Won Y. Efficient deduplication techniques for modern backup operation. IEEE T Comput 2011; 6:824-840.
  • Stallings W. Cryptography and Network Security: Principles and Practice. 6th ed. Harlow, United Kingdom:Pearson Education Limited, 2013.
  • Arcangeli A, Eidus I, Wright C. Increasing memory density by using KSM. In: Linux Symposium; 13–17 July 2009;Montreal, Canada. pp. 19-28.
  • Won Y, Lim K, Min J. MUCH: multithreaded content-based file chunking. IEEE T Comput 2015; 5: 1375-1388.
  • Lillibridge M, Eshghi K, Bhagwat D, Deolalikar V, Trezise G, Camble P. Sparse indexing: large scale, inline deduplication using sampling and locality. In: USENIX 7th Conference on File and Storage Technologies; 24–27 February 2009; San Francisco, CA, USA. Berkeley, CA, USA: USENIX. pp. 111-123.
  • Xia W, Jiang H, Feng D, Hua Y. Similarity and locality based indexing for high performance data deduplication. IEEE T Comput 2015; 4: 1162-1176.
  • Wei J, Jiang H, Zhou K, Feng D. MAD2: A scalable high-throughput exact deduplication approach for network backup services. In: IEEE 26th Symposium on Mass Storage Systems and Technologies; 3–7 May 2010; NV, USA. New York, NY, USA: IEEE. pp. 1-14.
  • Li J, Li YK, Chen X, Lee PPC, Lou W. A hybrid cloud approach for secure authorized deduplication. IEEE T Parall Distr 2015; 5: 1206-1216.
  • Ha JY, Lee YS, Kim JS. Deduplication with block-level content-aware chunking for solid state drives (SSDs). In: IEEE 10th International Conference on High Performance Computing and Communications & IEEE International Conference on Embedded and Ubiquitous Computing; 13–15 November 2013; Zhangjiajie, China. New York, NY, USA: IEEE. pp. 1982-1989.
  • Meyer DT, Bolosky WJ. A study of practical deduplication. ACM T Storage 2012; 4: 1-13.
  • Sayood K. Introduction to Data Compression. 4th ed. Waltham, MA, USA: Elsevier, 2012.
  • Xia W, Jiang H, Feng D, Douglis F, Shilane P, Hua Y, Fu M, Zhang Y, Zhou Y. A comprehensive study of the past, present, and future of data deduplication. P IEEE 2016; 9: 1681-1710.
Turkish Journal of Electrical Engineering and Computer Sciences-Cover
  • ISSN: 1300-0632
  • Yayın Aralığı: Yılda 6 Sayı
  • Yayıncı: TÜBİTAK
Sayıdaki Diğer Makaleler

Ab initio study of the structural, electronic, and magnetic properties of Co2FeGa and Co2FeSi and their future contribution to the building of quantum devices

Badra BOUABDALLAH, Abdallah OUMSALEM, Yamina BOUREZIG, Zakia NABI

Topological feature extraction of nonlinear signals and trajectories and its application in EEG signals classification

Hamid Reza KOBRAVI, Ali MOGHİMİ, Saleh LASHKAR, Mohammad Reza HASHEMI GOLPAYEGAN, Ali SHEIKHANI

Experimental realization of a multi-input buck–boost DC–DC converter

Madhuri CHAUDHARI, Kushal KANHAV

A new inset-fed UWB printed antenna with triple 3.5/5.5/7.5-GHz band-notched characteristics

Avez SYED, Rabah Wasel ALDHAHERI

Model predictive control of a dual induction motor drive fed by a single voltage source inverter

Muhammad Abbas ABBASI, Abdul Rashid BIN HUSAIN

Adaptive collaborative speed control of PMDC motor using hyperbolic secant functions and particle swarm optimization

Khalid MAHMOOD-UL-HASAN, Omer SALEEM

Analyzing methods of network topologies based on chordal rings

Damian LEDZINSKI, Sandra SMIGIEL, Lukasz ZAB LUDOWSKI

Implementing universal dependency, morphology, and multiword expression annotation standards for Turkish language processing

Umut SULUBACAK, Gül¸sen ERYİĞİT

Performance analysis of radial and axial flux PMSM based on 3D FEM modeling

Faouzi BEN AMMAR, Oussama BOUAZIZ, Imen JAAFAR

Bagged tree classification of arrhythmia using wavelets for denoising, compression, and feature extraction

Özgür TOMAK, Temel KAYIKÇIOĞLU