Publication details

Data-Aware Compression for HPC using Machine Learning (Julius Plehn), Master's Thesis, School: Universität Hamburg, 2022-05-09

Abstract

While compression can provide significant storage and cost savings, its use within HPC applications is often only of secondary concern. This is in part due to the inflexibility of existing approaches where a single compression algorithm has to be used throughout the whole application, but also because insights into the behavior of the algorithms within the context of individual applications are missing. Several compression algorithms are available, with each one also having a unique set of options. These options have a direct influence on the achieved performance and compression results. Furthermore, the algorithms and options to use for a given dataset are highly dependent on the characteristics of said dataset. This thesis explores how machine learning can help identify fitting compression algorithms with corresponding options based on actual data structure encountered during I/O. To do so, a data collection and training pipeline is introduced. Inferencing is performed during regular application runs and shows promising results. Moreover, it provides valuable insights into the benefits of using certain compression algorithms and options for specific data.

BibTeX

@mastersthesis{DCFHUMLP22,
	author	 = {Julius Plehn},
	title	 = {{Data-Aware Compression for HPC using Machine Learning}},
	advisors	 = {Michael Kuhn and Anna Fuchs and Jakob Lüttgau},
	year	 = {2022},
	month	 = {05},
	school	 = {Universität Hamburg},
	howpublished	 = {{Online \url{https://wr.informatik.uni-hamburg.de/_media/research:theses:julius_plehn_data_aware_compression_for_hpc_using_machine_learning.pdf}}},
	type	 = {Master's Thesis},
	abstract	 = {While compression can provide significant storage and cost savings, its use within HPC applications is often only of secondary concern. This is in part due to the inflexibility of existing approaches where a single compression algorithm has to be used throughout the whole application, but also because insights into the behavior of the algorithms within the context of individual applications are missing. Several compression algorithms are available, with each one also having a unique set of options. These options have a direct influence on the achieved performance and compression results. Furthermore, the algorithms and options to use for a given dataset are highly dependent on the characteristics of said dataset. This thesis explores how machine learning can help identify fitting compression algorithms with corresponding options based on actual data structure encountered during I/O. To do so, a data collection and training pipeline is introduced. Inferencing is performed during regular application runs and shows promising results. Moreover, it provides valuable insights into the benefits of using certain compression algorithms and options for specific data.},
}