Modern data systems are required to process an enormous
amount of data generated by a variety of applications.
To support this requirement, modern data stores reduce
read/write interference by employing the out-of-place
paradigm. The classical out-of-place design is the
log-structured merge (LSM) tree. LSM-trees write the
incoming key-value entries in an in-memory buffer to
ensure high write throughput, and uses in-memory
auxiliary data structures (such as Bloom filters
and fence pointers) to provide competitive read
performance. While LSMs are great for writes and
reads in general, we show that all state-of-the-art
LSM-based data stores perform suboptimally in presence
deletes in a workload.
Deletes in LSM-trees are realized logically by inserting a special type of key-value entry, known as a tombstone. Once inserted, a tombstone logically invalidates all entries in a tree that has a matching key, without necessarily disturbing the physical target data entries. The target entries are persistently deleted from the data store only after the corresponding tombstone reaches the last level of the tree through compactions. The time taken to persistently delete the target entries depends on (i) the file picking policy followed by the compaction routine, (ii) the rate of ingestion of entries to the database, (iii) the size ratio of the LSM-tree, and (iv) the number of levels in the tree.
The figure above shows how tombstone-driven logical deletes eventually becomes persistent using an example where we want to delete all entries with key 6 from an LSM-based data store. In Step 1, a tombstone for 6, denoted by 6*, is inserted into the tree. Step 2, shows how the file containing the tombstone 6* moves to the next level of the tree due to (partial) compaction. Finally, Step 3 shows how all entries with a matching key are eventually persistently removed from the LSM-tree after the tombstone reaches the last level of the tree.
All state-of-the-art LSM engines, consider deletes as "second-class citizens". The present out-of-place mechanism that realizes deletes logically has implications on the overall performance of a data store as well as on the privacy frontier.
As discussed, deletes on the sort-key, i.e., the "key" of the key-value entries, is realized by inserting tombstones against the to-be-deleted key. The problems with this are the following. First, retaining the invalidated entries (along with the tombstones) in the tree leads to mis-utilization of disk space, which causes space amplification. Second, the invalidated entries in the tree are re-written over and over again during compaction which causes write amplification. Third, as the invalidated entries (and the tombstones) remain hashed into the Bloom filters, for a fixed memory budget, they increase the false positive rate of the filter, which results in asymptotically poorer read performance. Finally, as there is no bound on the time taken to persistently delete the data objects, out-of-place deletes may lead to very high latency for persistent deletes, which is a serious privacy concern.
Deletes on an attribute that is different from the sort key are referred to as secondary deletes. Secondary range delete operations, that delete a range of entries based on a secondary attribute, are also frequent in several practical workloads. Modern LSM-based data stores are unable to realize secondary range deletes efficiently, as the entries qualifying for such an operation may be scattered all across the tree. The way commercial data stores realize secondary range delete operations is by periodically (typically, every 7, 15, or 30 days) compacting the whole tree to a single base level. Such full tree compactions are highly undesirable as they require many superfluous I/Os to the disk and may also lead to prolonged write stalls and latency spikes. In the example here, we use S to denote the sort key and D for the delete key, and we observe that to realize a secondary range delete operation that deletes all entries with a timestamp smaller than 66, all four pages shown, must be read to memory.
To address the aforementioned problems associated with deletes in LSM-trees, we introduce Lethe. Lethe aims (i) to provide persistence guarantees for primary delete operations and (ii) to enable efficient secondary range deletes in LSM-trees.
To achieve the first design goal, we introduce FADE, a family of delete-aware compaction strategies that ensures that all tombstones are persisted within a delete persistence threshold. FADE uses additional information about the age of the tombstones contained in a file and the estimated invalidated entries per tombstone to ensure that all tombstones in an LSM-tree are persisted within the user/application-set threshold. FADE achieves this by using the delete persistence threshold to assign a time-to-live for every level of a tree. Now, if at any point, the age of a file containing a tombstone exceeds the level-TTL, FADE initiates a compaction.
Note that, this compaction trigger is completely orthogonal to the saturation-based trigger, and is driven by the age of the oldest tombstone contained in a file. This way, FADE ensures timely delete persistence, and by removing the invalidated entries timely and persistently from the tree, FADE also improves the overall performance of the storage engine in presence of deletes. The figure above shows step-wise how FADE guarantees timely delete persistence by setting earlier deadlines for future compaction jobs.
To facilitate the second design goal concerning secondary range deletes, we introduce KiWi, a continuum of physical storage layouts the arranges the data on disk in an interweaved sorted order on the sort key and delete key. KiWi supports secondary range deletes without performing a full-tree compaction, at the cost of minimal extra metadata and a tunable penalty on read performance.
KiWi adds one new layer in the storage layout of LSM trees. In particular, in addition to the levels of the tree, the files of a level, and the pages of a file, we now introduce delete tiles that belong to a file and consist of pages. A delete tile is a logical collection of consecutive pages in a file. Within a delete tile, the entries are partitioned on the delete key, D and are split into pages. This way, by ordering the entries on D allows KiWi to cluster the qualifying entries for secondary range deletes. This, in turn, allows KiWi to perform secondary range deletes at the cost of fewer disk I/Os. To facilitate tight binary search within a page during lookups, KiWi then re-sorts the entries on the sort key, S, as shown in the figure above.
Lethe puts together the benefits of FADE and KiWi to
better support deletes in LSM-trees. For a given workload
and a persistence delete latency threshold, Lethe offers
maximal performance, by carefully tuning the cost of
persistent deletes, and their impact on the overall
performance of the system. We evaluate Lethe against
state-of-the-art LSM-tree designs for a range of workloads
with deletes and different delete persistence
For the following experiments, we insert 1GB data in the database (avg. entry size = 1KB). The size of the memory buffer is 1MB and size ratio of the tree is 10. The delete persistence threshold is set to 16.67%, 25%, and 50% of the experiment's run-time.
Figure (A) shows that Lethe ensures timely persistent
deletion. Lethe persists
between 40% and 80% more tombstones as compared
to RocksDB, and it does so while
honoring the delete persistence threshold. Figure (B)
shows that Lethe significantly reduces space
amplification by persisting deletes timely. Figure (C)
shows that Lethe offers improved read performance as
compared to RocksDB by purging invalidated
entries in a time-bound manner, and
thus, improving the Bloom filter accuracy.
For the following experiments, the workload comprises 0.001% secondary range delete operations along with 1% range queries and 50% point queries. Each file has 256 pages and the size of every page is 4KB.
Figure (D) shows that, a higher delete tile granularity increases the number of full page drops, thus reducing I/O when data is stored in KiWi. Figure (E) shows that different delete tile granularity is optimal for different workloads. Lethe can navigate the continuum of storage layout and offer superior overall performance by determining the optimal storage layout. Figure (F) shows the total time spent in hashing and I/O access for both Lethe and RocksDB while varying delete tile size. Lethe computes the optimal value of h, which in this case is 8. For h = 8, the I/O cost for Lethe is 76% lower than that in RocksDB. This comes at the price of a 5x increase in the hashing cost, which however, is completely hidden behind the massive benefits in total number of I/Os.
State-of-the-art LSM-based key-value stores perform suboptimally for workloads with even a small proportion of deletes, and that the delete persistence latency in these data stores are potentially unbounded. To address this, we build Lethe, a new LSM-based engine that introduces a new family of compaction strategies FADE and KiWi, a continuum of physical data storage layouts. FADE enforces delete persistence within a user-defined threshold while increasing read throughput and reducing space amplification, at the expense of a modest increase in write amplification. KiWi offers efficient secondary range deletes, which can be tuned to outperform state of the art for a given workload. Lethe is the first LSM-based storage engine to our knowledge that offers efficient deletes with no or positive interference with read performance, supports user-defined delete latency thresholds, and enables practical secondary range deletes.