Publication details

Modern Storage Stack with Key-Value Store Interface and Snapshots Based on Copy-On-Write Bε-Trees (Felix Wiedemann), Master's Thesis, School: Universität Hamburg, 2018-01-11

Abstract

The ever increasing gap between computational power and storage capacity on the one side and storage throughput on the other side leads to I/O bottlenecks in many applications. To cope with huge data volumes, many storage stacks apply copy-on-write techniques. Copy-on-write enables efficient snapshots and guarantees on-disk consistency. However, a major downside of copy-on-write is potentially massive fragmentation as data is always redirected on write. As fragmentation leads to random reads, the I/O throughput of copy-on-write storage stacks suffers especially under random write workloads. In this thesis, we will design, implement, and evaluate a copy-on-write storage stack that uses Bε-Trees which are a generalisation of B-Trees and limit fragmentation by design due to larger node sizes. The storage stack has an underlying storage pool which consists of groups of storage devices called vdevs. Each vdev is either a single storage device, a mirror of devices, or a group of devices with additional parity blocks so that a storage pool improves performance and/or resilience compared to a single storage device. In the storage pool, the data is protected by checksums. On top of the storage pool, we use Bε-Trees to save all user data and metadata. The user interface of the storage stack provides data sets which have a simple key-value store interface and save their data in dedicated Bε-Trees. Each data set can be individually snapshotted as we simply use the path-copying technique for the corresponding Bε-Tree. In the performance evaluation, our storage stack shows its advantage over ZFS – a mature copy-on-write storage stack – in a database workload. Our storage stack is not only 10 times faster regarding small random overwrites (6.6 MiB/s versus 0.66 MiB/s) but it also exhibits a much smaller performance degradation in the following sequential read of data. While the sequential read throughput of ZFS drops by 82% due to the random writes, our storage stack only incurs a 23% slowdown. Hence, limiting fragmentation by design can be very useful for copy-on-write storage stacks so that the read performance is higher and more consistent regardless of write access patterns.
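
Illustrative sketch (not taken from the thesis): the abstract says that each data set can be snapshotted using the path-copying technique. The Python example below illustrates only that general idea, and it uses a plain binary search tree as a stand-in for a Bε-Tree. The names `Node`, `insert`, and `lookup` are hypothetical and do not come from the thesis or its implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    """Immutable tree node (hypothetical stand-in for a Bε-Tree node)."""
    key: str
    value: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root: Optional[Node], key: str, value: str) -> Node:
    """Return a new root; only nodes on the search path are copied."""
    if root is None:
        return Node(key, value)
    if key < root.key:
        return Node(root.key, root.value, insert(root.left, key, value), root.right)
    if key > root.key:
        return Node(root.key, root.value, root.left, insert(root.right, key, value))
    # Same key: copy this node with the new value, reuse both subtrees.
    return Node(key, value, root.left, root.right)


def lookup(root: Optional[Node], key: str) -> Optional[str]:
    """Walk down from the given root and return the stored value, if any."""
    while root is not None:
        if key == root.key:
            return root.value
        root = root.left if key < root.key else root.right
    return None


# Build a small data set.
live = None
for k, v in [("m", "1"), ("c", "2"), ("t", "3")]:
    live = insert(live, k, v)

# Taking a snapshot amounts to keeping a reference to the current root.
snapshot = live

# A later overwrite copies the path to "c" and leaves the snapshot untouched.
live = insert(live, "c", "changed")
```

After the overwrite, `lookup(snapshot, "c")` still sees the old value, while the live tree sees the new one. The subtree that was not on the modified path is shared between both roots, which is why a snapshot of this kind needs no bulk copy of the data.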

BibTeX

@mastersthesis{MSSWKSIASB18,
author = {Felix Wiedemann},
title = {{Modern Storage Stack with Key-Value Store Interface and Snapshots Based on Copy-On-Write Bε-Trees}},
advisors = {Michael Kuhn},
year = {2018},
month = {01},
school = {Universität Hamburg},
howpublished = {{Online \url{https://wr.informatik.uni-hamburg.de/_media/research:theses:felix_wiedemann_modern_storage_stack_with_key_value_store_interface_and_snapshots_based_on_copy_on_write_bε_trees.pdf}}},
type = {Master's Thesis},
abstract = {The ever increasing gap between computational power and storage capacity on the one side and storage throughput on the other side leads to I/O bottlenecks in many applications. To cope with huge data volumes, many storage stacks apply copy-on-write techniques. Copy-on-write enables efficient snapshots and guarantees on-disk consistency. However, a major downside of copy-on-write is potentially massive fragmentation as data is always redirected on write. As fragmentation leads to random reads, the I/O throughput of copy-on-write storage stacks suffers especially under random write workloads. In this thesis, we will design, implement, and evaluate a copy-on-write storage stack that uses Bε-Trees which are a generalisation of B-Trees and limit fragmentation by design due to larger node sizes. The storage stack has an underlying storage pool which consists of groups of storage devices called vdevs. Each vdev is either a single storage device, a mirror of devices, or a group of devices with additional parity blocks so that a storage pool improves performance and/or resilience compared to a single storage device. In the storage pool, the data is protected by checksums. On top of the storage pool, we use Bε-Trees to save all user data and metadata. The user interface of the storage stack provides data sets which have a simple key-value store interface and save their data in dedicated Bε-Trees. Each data set can be individually snapshotted as we simply use the path-copying technique for the corresponding Bε-Tree. In the performance evaluation, our storage stack shows its advantage over ZFS – a mature copy-on-write storage stack – in a database workload. Our storage stack is not only 10 times faster regarding small random overwrites (6.6 MiB/s versus 0.66 MiB/s) but it also exhibits a much smaller performance degradation in the following sequential read of data. 
While the sequential read throughput of ZFS drops by 82\% due to the random writes, our storage stack only incurs a 23\% slowdown. Hence, limiting fragmentation by design can be very useful for copy-on-write storage stacks so that the read performance is higher and more consistent regardless of write access patterns.},
}