Publication details
- Analyzing Data Properties using Statistical Sampling Techniques – Illustrated on Scientific File Formats and Compression Features (Julian Kunkel), Frankfurt, ISC High Performance 2016, 2016-06-21 – Awards: Best Poster
Abstract
Understanding the characteristics of data stored in data centers helps computer scientists identify the most suitable storage infrastructure to deal with these workloads. For example, knowing the relevance of file formats allows optimizing the relevant file formats, but also helps to define useful benchmarks during procurement. Existing studies that investigate performance improvements and techniques for data reduction, such as deduplication and compression, operate on small sets of data. Some of those studies claim the selected data is representative and extrapolate their results to the scale of the data center. One hurdle to evaluating novel schemes on the complete data is the vast amount of data stored and, thus, the resources required to analyze the complete data set. Even if this were feasible, the costs of running many of those experiments must be justified. This poster investigates stochastic sampling methods to compute and analyze quantities of interest, not only over file counts but also over the occupied storage space. It is demonstrated that scanning 1% of files and data volume is sufficient on DKRZ's supercomputer to obtain accurate results. This not only speeds up the analysis process but also reduces the costs of such studies significantly. The contributions of this poster are: 1) an investigation of the inherent error when operating only on a subset of data, 2) a presentation of methods that help future studies to mitigate this error, and 3) an illustration of the approach with a study of scientific file types and compression.
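The abstract describes the approach only at a high level. As a rough illustration of the underlying idea (not the poster's actual implementation), the following Python sketch estimates the share of files with a given property from a 1% random sample, both by file count and weighted by occupied storage space; all function and variable names here are illustrative assumptions.

```python
# Sketch: estimating a file property (e.g. "is an HDF5 file") from a random sample,
# both by file count and weighted by file size. Illustrative only; names and the
# simple-random-sampling scheme are assumptions, not taken from the poster.
import random
import math

def estimate_property(files, predicate, sample_fraction=0.01, seed=42):
    """Estimate the fraction of files (by count and by occupied bytes) satisfying
    `predicate`, scanning only a simple random sample of the population.

    `files` is a list of (path, size_in_bytes) tuples; `predicate` takes a path and
    returns True/False (in practice it would open and inspect the file)."""
    rng = random.Random(seed)
    n = max(1, int(len(files) * sample_fraction))
    sample = rng.sample(files, n)

    hits = [(path, size) for path, size in sample if predicate(path)]

    # Estimator over file counts: sample proportion, with a normal-approximation
    # standard error to quantify the inherent sampling error.
    p_count = len(hits) / n
    se_count = math.sqrt(p_count * (1 - p_count) / n)

    # Estimator over occupied storage space: ratio of matching sampled bytes to
    # total sampled bytes (a simple ratio estimator).
    sampled_bytes = sum(size for _, size in sample)
    hit_bytes = sum(size for _, size in hits)
    p_bytes = hit_bytes / sampled_bytes if sampled_bytes else 0.0

    return p_count, se_count, p_bytes

# Usage with a synthetic population of 100,000 files (every fourth file is HDF5):
population = [(("file_%d.h5" if i % 4 == 0 else "file_%d.dat") % i,
               random.randint(1, 10**9)) for i in range(100_000)]
is_hdf5 = lambda path: path.endswith(".h5")
print(estimate_property(population, is_hdf5))
```

Note that the count-based and size-based estimates can differ substantially when the property correlates with file size, which is one reason the poster analyzes quantities of interest over both file counts and occupied storage space.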
BibTeX
@misc{ADPUSSTIOS16,
  author   = {Julian Kunkel},
  title    = {{Analyzing Data Properties using Statistical Sampling Techniques – Illustrated on Scientific File Formats and Compression Features}},
  year     = {2016},
  month    = {06},
  location = {Frankfurt},
  activity = {ISC High Performance 2016},
  abstract = {Understanding the characteristics of data stored in data centers helps computer scientists identify the most suitable storage infrastructure to deal with these workloads. For example, knowing the relevance of file formats allows optimizing the relevant file formats, but also helps to define useful benchmarks during procurement. Existing studies that investigate performance improvements and techniques for data reduction, such as deduplication and compression, operate on small sets of data. Some of those studies claim the selected data is representative and extrapolate their results to the scale of the data center. One hurdle to evaluating novel schemes on the complete data is the vast amount of data stored and, thus, the resources required to analyze the complete data set. Even if this were feasible, the costs of running many of those experiments must be justified. This poster investigates stochastic sampling methods to compute and analyze quantities of interest, not only over file counts but also over the occupied storage space. It is demonstrated that scanning 1\% of files and data volume is sufficient on DKRZ's supercomputer to obtain accurate results. This not only speeds up the analysis process but also reduces the costs of such studies significantly. The contributions of this poster are: 1) an investigation of the inherent error when operating only on a subset of data, 2) a presentation of methods that help future studies to mitigate this error, and 3) an illustration of the approach with a study of scientific file types and compression.},
}