Date of Award
Doctor of Philosophy (PhD)
Electrical and Computer Engineering
The rate of data growth outpaces the decline of hardware costs, and there has been an ever-increasing demand for reducing the storage and network overhead of online database management systems (DBMSs). The most widely used approach for data reduction in DBMSs is block-level compression. Although this method is simple and effective, it fails to address redundancy across blocks and therefore leaves significant room for improvement for many applications. This dissertation proposes a systematic approach, termed similarity-based deduplication, which reduces the amount of data stored on disk and transmitted over the network beyond the benefits provided by traditional compression schemes. To demonstrate the approach, we designed and implemented dbDedup, a lightweight record-level similarity-based deduplication engine for online DBMSs. The design of dbDedup exploits key observations we find in database workloads, including small item sizes, temporal locality, and the incremental nature of record updates. The proposed approach differs from traditional chunk-based deduplication in that, instead of finding identical chunks anywhere else in the data corpus, similarity-based deduplication identifies a single similar data item and performs differential compression against it to remove the redundant parts for greater savings. To achieve high efficiency, dbDedup introduces novel encoding, caching, and similarity selection techniques that significantly mitigate the deduplication overhead with minimal loss of compression ratio. For evaluation, we integrated dbDedup into the storage and replication components of a distributed NoSQL DBMS and analyzed its properties using four real datasets. Our results show that dbDedup achieves up to 37× reduction in the storage size and replication traffic of the database on its own and up to 61× reduction when paired with the DBMS's block-level compression.
dbDedup provides both benefits with negligible effect on DBMS throughput or client latency (average and tail).
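The general idea of similarity-based deduplication described in the abstract (pick one similar record, then store only the differences against it) can be illustrated with a minimal Python sketch. This is not dbDedup's implementation: the names `sketch`, `delta_encode`, `delta_decode`, and `DedupStore` are hypothetical, and MD5 over sliding windows and `difflib.SequenceMatcher` are stand-ins chosen for brevity rather than the encoding, caching, or selection techniques the dissertation introduces.

```python
import hashlib
from difflib import SequenceMatcher


def sketch(data: bytes, window: int = 8, k: int = 4) -> frozenset:
    """Similarity sketch: the k smallest hashes over all sliding windows.

    Records that share most of their content tend to share sketch features.
    (Assumed feature scheme, for illustration only.)
    """
    if len(data) < window:
        return frozenset({hashlib.md5(data).digest()})
    hashes = {
        hashlib.md5(data[i:i + window]).digest()
        for i in range(len(data) - window + 1)
    }
    return frozenset(sorted(hashes)[:k])


def delta_encode(base: bytes, target: bytes) -> list:
    """Encode target as copy-from-base and insert-literal operations."""
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))
        elif j2 > j1:  # "replace" or "insert": emit the new bytes
            ops.append(("insert", target[j1:j2]))
    return ops


def delta_decode(base: bytes, ops: list) -> bytes:
    """Rebuild the target record from its base and the delta operations."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, start, length = op
            out += base[start:start + length]
        else:
            out += op[1]
    return bytes(out)


class DedupStore:
    """Toy record store; assumes each key is written exactly once."""

    def __init__(self):
        self.records = {}  # key -> ("raw", bytes) or ("delta", base_key, ops)
        self.index = {}    # sketch feature -> key of a raw record

    def put(self, key: str, data: bytes) -> None:
        feats = sketch(data)
        votes = {}
        for feat in feats:
            base_key = self.index.get(feat)
            if base_key is not None:
                votes[base_key] = votes.get(base_key, 0) + 1
        if votes:
            # Select the single most similar record and store only the delta.
            base_key = max(votes, key=votes.get)
            ops = delta_encode(self.get(base_key), data)
            self.records[key] = ("delta", base_key, ops)
        else:
            # No similar record found: store as-is and index its features.
            self.records[key] = ("raw", data)
            for feat in feats:
                self.index.setdefault(feat, key)

    def get(self, key: str) -> bytes:
        entry = self.records[key]
        if entry[0] == "raw":
            return entry[1]
        _, base_key, ops = entry
        return delta_decode(self.get(base_key), ops)
```

In this sketch only raw records are indexed, so every delta is decoded against a full record in one step. That keeps reads simple at the cost of some compression, which is a simplification made here and not a claim about how dbDedup manages its encoding chains.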
Xu, Lianghong, "Online Deduplication for Distributed Databases" (2016). Dissertations. 719.