High-speed data deduplication using parallelized cuckoo hashing

Data deduplication is a capacity optimization technology used in backup systems to identify and store only nonredundant data blocks. The CPU-intensive tasks in a hash-based deduplication system remain a challenge to improving system performance. In this paper, we propose a parallel variant of standard cuckoo hashing in which the CPU-intensive fingerprint insertion and lookup operations are performed in parallel and distributed among the nodes of the deduplication cluster. Furthermore, because the cluster nodes involved in duplicate identification handle blocks uniformly, the load is well balanced. Experimental evaluations using real-world backup and Linux kernel data sets show that the proposed deduplication system achieves up to 100% higher backup speed, up to 28% lower lookup latency, and up to 24% shorter backup time compared with other deduplication systems.
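
For orientation, the fingerprint insertion and lookup operations mentioned above can be illustrated with a minimal single-node sketch of standard (sequential) cuckoo hashing. This is not the parallel variant proposed in the paper. The class name `CuckooFingerprintIndex`, the helper `deduplicate`, the use of SHA-1 as the block fingerprint, the SHA-256-derived slot functions, and the grow-on-failure policy are all assumptions made for illustration.

```python
import hashlib


class CuckooFingerprintIndex:
    """Illustrative two-table cuckoo hash set for block fingerprints (assumed design)."""

    def __init__(self, capacity=1024, max_kicks=32):
        self.capacity = capacity      # slots per table
        self.max_kicks = max_kicks    # displacement limit before growing
        self.tables = [[None] * capacity, [None] * capacity]

    def _slot(self, fingerprint, which):
        # Two slot indices taken from disjoint 8-byte slices of one digest.
        digest = hashlib.sha256(fingerprint).digest()
        chunk = digest[8 * which: 8 * which + 8]
        return int.from_bytes(chunk, "big") % self.capacity

    def contains(self, fingerprint):
        # Lookup probes exactly one slot in each table.
        return any(
            self.tables[t][self._slot(fingerprint, t)] == fingerprint
            for t in (0, 1)
        )

    def insert(self, fingerprint):
        """Return True if the fingerprint is new, False if it is a duplicate."""
        if self.contains(fingerprint):
            return False
        self._place(fingerprint)
        return True

    def _place(self, item):
        t = 0
        for _ in range(self.max_kicks):
            i = self._slot(item, t)
            # Store the item and pick up whatever occupied the slot.
            item, self.tables[t][i] = self.tables[t][i], item
            if item is None:
                return
            t = 1 - t  # the evicted item moves to its slot in the other table
        self._grow(item)

    def _grow(self, pending):
        # Assumed policy: double the tables and reinsert everything.
        stored = [fp for table in self.tables for fp in table if fp is not None]
        self.capacity *= 2
        self.tables = [[None] * self.capacity, [None] * self.capacity]
        for fp in stored + [pending]:
            self._place(fp)


def deduplicate(blocks, index=None):
    """Return the blocks whose fingerprints were not seen before."""
    index = index if index is not None else CuckooFingerprintIndex()
    unique = []
    for block in blocks:
        fingerprint = hashlib.sha1(block).digest()  # assumed fingerprint function
        if index.insert(fingerprint):
            unique.append(block)
    return unique
```

In this sketch a lookup inspects at most two slots, which is the property that makes cuckoo hashing attractive for fingerprint indexes; a call such as `deduplicate([b"a", b"b", b"a"])` keeps only the first occurrence of each block.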