Publication details

  • Analyzing Data Properties using Statistical Sampling Techniques – Illustrated on Scientific File Formats and Compression Features (Julian Kunkel), In High Performance Computing: ISC High Performance 2016 International Workshops, ExaComm, E-MuCoCoS, HPC-IODC, IXPUG, IWOPH, P3MA, VHPC, WOPSSS, Lecture Notes in Computer Science (9945), pp. 130–141, (Editors: Michela Taufer, Bernd Mohr, Julian Kunkel), Springer, ISC-HPC 2016, Frankfurt, Germany, ISBN: 978-3-319-46079-6, 2016-06

Abstract

Understanding the characteristics of data stored in data centers helps computer scientists identify the most suitable storage infrastructure to deal with these workloads. For example, knowing the relevance of file formats allows optimizing the relevant formats and also helps to define procurement benchmarks that cover these formats. Existing studies that investigate performance improvements and techniques for data reduction such as deduplication and compression operate on a small set of data. Some of those studies claim the selected data is representative and extrapolate their results to the scale of the data center. One hurdle of running novel schemes on the complete data is the vast amount of data stored and, thus, the resources required to analyze the complete data set. Even if this were feasible, the costs for running many of those experiments must be justified. This paper investigates stochastic sampling methods to compute and analyze quantities of interest on file numbers but also on the occupied storage space. It will be demonstrated that on our production system, scanning 1 % of files and data volume is sufficient to draw conclusions. This speeds up the analysis process and reduces costs of such studies significantly. The contributions of this paper are: (1) the systematic investigation of the inherent analysis error when operating only on a subset of data, (2) the demonstration of methods that help future studies to mitigate this error, and (3) the illustration of the approach on a study for scientific file types and compression for a data center.

BibTeX

@inproceedings{ADPUSSTIOS16,
	author	 = {Julian Kunkel},
	title	 = {{Analyzing Data Properties using Statistical Sampling Techniques -- Illustrated on Scientific File Formats and Compression Features}},
	year	 = {2016},
	month	 = {06},
	booktitle	 = {{High Performance Computing: ISC High Performance 2016 International Workshops, ExaComm, E-MuCoCoS, HPC-IODC, IXPUG, IWOPH, P3MA, VHPC, WOPSSS}},
	editor	 = {Michela Taufer and Bernd Mohr and Julian Kunkel},
	publisher	 = {Springer},
	series	 = {Lecture Notes in Computer Science},
	number	 = {9945},
	pages	 = {130--141},
	conference	 = {ISC-HPC 2016},
	location	 = {Frankfurt, Germany},
	isbn	 = {978-3-319-46079-6},
	doi	 = {10.1007/978-3-319-46079-6_10},
	abstract	 = {Understanding the characteristics of data stored in data centers helps computer scientists identify the most suitable storage infrastructure to deal with these workloads. For example, knowing the relevance of file formats allows optimizing the relevant formats and also helps to define procurement benchmarks that cover these formats. Existing studies that investigate performance improvements and techniques for data reduction such as deduplication and compression operate on a small set of data. Some of those studies claim the selected data is representative and extrapolate their results to the scale of the data center. One hurdle of running novel schemes on the complete data is the vast amount of data stored and, thus, the resources required to analyze the complete data set. Even if this were feasible, the costs for running many of those experiments must be justified. This paper investigates stochastic sampling methods to compute and analyze quantities of interest on file numbers but also on the occupied storage space. It will be demonstrated that on our production system, scanning 1 \% of files and data volume is sufficient to draw conclusions. This speeds up the analysis process and reduces costs of such studies significantly. The contributions of this paper are: (1) the systematic investigation of the inherent analysis error when operating only on a subset of data, (2) the demonstration of methods that help future studies to mitigate this error, and (3) the illustration of the approach on a study for scientific file types and compression for a data center.},
}
