Publication details

Integrating self-describing data formats into file systems (Benjamin Warnke), Master's Thesis, School: Universität Hamburg, 2019-11-08
Abstract

The computational power of huge supercomputers is increasing exponentially every year. The use of this computational power in scientific research in various fields quickly generates large amounts of data. The amount of data that can be stored in individual file systems, as well as their access speed, also increases exponentially. Unfortunately, memory size and speed grow more slowly than computational performance. Researchers need a tool that makes it possible to quickly find the data of interest in these huge file collections. File system interfaces were defined in the early days of computers and have not been modified since their invention. Searching for specific data in huge data collections was not intended. Such a search could be done using custom metadata. This was not directly possible in the past because the Portable Operating System Interface (POSIX) does not contain the required functions. Self-Describing Data Formats (SDDFs) were developed for the exchange and reusability of research results between research groups. These are file formats that store metadata along with their raw data; that is, the file itself contains the description of the data as well as how it is encoded. To look for information matching certain criteria, all candidate files had to be opened. This thesis describes the development of a new interface as well as a reference implementation for the combined storage of data and metadata. This novel interface includes functions for storing metadata in a dedicated Structured Query Language (SQL) backend. Storing metadata in a dedicated backend enables a more efficient search process. Since metadata is now stored in a central database, only the files that meet the search criteria need to be opened, as many files are excluded from the beginning. This saves time. Having metadata available in the file system also makes it possible to distribute file content intelligently across different storage nodes. For example, metadata can be written to faster storage media than the raw data, which is needed less frequently. This saves time and money.

BibTeX

@mastersthesis{ISDFIFSW19,
	author	 = {Benjamin Warnke},
	title	 = {{Integrating self-describing data formats into file systems}},
	advisors	 = {Michael Kuhn and Kira Duwe},
	year	 = {2019},
	month	 = {11},
	school	 = {Universität Hamburg},
	howpublished	 = {{Online \url{https://wr.informatik.uni-hamburg.de/_media/research:theses:benjamin_warnke_integrating_self_describing_data_formats_into_file_systems.pdf}}},
	type	 = {Master's Thesis},
	abstract	 = {The computational power of huge supercomputers is increasing exponentially every year. The use of this computational power in scientific research in various fields quickly generates large amounts of data. The amount of data that can be stored in individual file systems, as well as their access speed, also increases exponentially. Unfortunately, memory size and speed grow more slowly than computational performance. Researchers need a tool that makes it possible to quickly find the data of interest in these huge file collections. File system interfaces were defined in the early days of computers and have not been modified since their invention. Searching for specific data in huge data collections was not intended. Such a search could be done using custom metadata. This was not directly possible in the past because the Portable Operating System Interface (POSIX) does not contain the required functions. Self-Describing Data Formats (SDDFs) were developed for the exchange and reusability of research results between research groups. These are file formats that store metadata along with their raw data; that is, the file itself contains the description of the data as well as how it is encoded. To look for information matching certain criteria, all candidate files had to be opened. This thesis describes the development of a new interface as well as a reference implementation for the combined storage of data and metadata. This novel interface includes functions for storing metadata in a dedicated Structured Query Language (SQL) backend. Storing metadata in a dedicated backend enables a more efficient search process. Since metadata is now stored in a central database, only the files that meet the search criteria need to be opened, as many files are excluded from the beginning. This saves time. Having metadata available in the file system also makes it possible to distribute file content intelligently across different storage nodes. For example, metadata can be written to faster storage media than the raw data, which is needed less frequently. This saves time and money.},
}