Publication details

Learned Index Structures: An Evalution of their Performance in Key-Value Storage Solutions (Leonhard Reichenbach), Bachelor's Thesis, School: Universität Hamburg, 2018-09-10
Publication details

Abstract

Traditional index structures like B-Trees and hash tables have reached the peaks of their performance. Their existing implementations are highly tuned to use the hardware as efficiently as possible. However, they are static solutions and do not exploit existing inner structures in the indexed data which would offer another source for further performance increases. Additionally, some of their underlying techniques are fundamentally flawed. For instance, the hash function in a hash table that hashes keys to uniformly distributed random numbers will always inevitably cause a collision rate of approximately 36% due to statistic phenomena. Thereby, creating a demand for better solutions. One of those solutions could be learned index structures. The traditional index structures are in the wide sense just models taking a key and predicting an index position. If the resulting array is sorted, the index is effectively modeling a cumulative distribution function. If this distribution can be learned by a machine learning model like a neural network, the potential to reduce collisions further is given. In this thesis, an in-memory key-value store named HTableDB was implemented. It is able to use learned index models as well as regular hash functions for indexing and thus offers an easy way to compare the two. Using HTableDB this comparison was made for keys consisting of various file paths taken from a scientific computing environment. The learned index model used was a two-stage feed-forward network. For datasets with keys that belong to a cluster of similar ones, like the ImageNet dataset, the learned index models showed a collision rate of 43% while being approximately 45 times slower than regular hash functions. However, more sophisticated network architectures could lower the collision rate of learned index models in the future.

BibTeX

@misc{LISAEOTPIK18,
author = {Leonhard Reichenbach},
title = {{Learned Index Structures: An Evalution of their Performance in Key-Value Storage Solutions}},
advisors = {Michael Kuhn and Jakob Lüttgau},
year = {2018},
month = {09},
school = {Universität Hamburg},
howpublished = {{Online \url{https://wr.informatik.uni-hamburg.de/_media/research:theses:leonhard_reichenbach_learned_index_structures_an_evalution_of_their_performance_in_key_value_storage_solutions.pdf}}},
type = {Bachelor's Thesis},
abstract = {Traditional index structures like B-Trees and hash tables have reached the peaks of their performance. Their existing implementations are highly tuned to use the hardware as efficiently as possible. However, they are static solutions and do not exploit existing inner structures in the indexed data which would offer another source for further performance increases. Additionally, some of their underlying techniques are fundamentally flawed. For instance, the hash function in a hash table that hashes keys to uniformly distributed random numbers will always inevitably cause a collision rate of approximately 36\% due to statistic phenomena. Thereby, creating a demand for better solutions. One of those solutions could be learned index structures. The traditional index structures are in the wide sense just models taking a key and predicting an index position. If the resulting array is sorted, the index is effectively modeling a cumulative distribution function. If this distribution can be learned by a machine learning model like a neural network, the potential to reduce collisions further is given. In this thesis, an in-memory key-value store named HTableDB was implemented. It is able to use learned index models as well as regular hash functions for indexing and thus offers an easy way to compare the two. Using HTableDB this comparison was made for keys consisting of various file paths taken from a scientific computing environment. The learned index model used was a two-stage feed-forward network. For datasets with keys that belong to a cluster of similar ones, like the ImageNet dataset, the learned index models showed a collision rate of 43\% while being approximately 45 times slower than regular hash functions. However, more sophisticated network architectures could lower the collision rate of learned index models in the future.},
}