Many high-performance computing (HPC) applications generate vast amounts of data and write them to parallel distributed file systems, where they are kept for further processing. All production-level file systems currently in use offer a POSIX I/O interface that treats file data as an opaque byte stream. Since the data format cannot be reconstructed from this byte stream without prior knowledge, self-describing data formats such as NetCDF (Network Common Data Form) and ADIOS (Adaptable IO System) are widely used so that data can be exchanged with other researchers and annotated with meaningful metadata. For example, the World Data Center for Climate (WDCC) alone holds multiple petabytes (PB) of data in such formats and provides access for other researchers. The structural information is encoded in the files themselves, which makes it possible to group and annotate data. Moreover, data can be accessed and interpreted without prior knowledge of its structure.
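The difference between an opaque byte stream and a self-describing file can be sketched in a few lines. The following toy container (a heavily simplified stand-in for what NetCDF or HDF5 do, not their actual on-disk layout) prepends a small header describing the payload, so a reader can recover structure and annotations from the file contents alone:

```python
import json
import struct

# A 2x3 matrix of temperatures, as an application might hold it in memory.
matrix = [[21.5, 22.0, 21.8], [20.9, 21.1, 21.4]]
flat = [v for row in matrix for v in row]

# POSIX view: an opaque stream of bytes. Without prior knowledge, neither
# the element type, the shape, nor the meaning can be recovered from it.
opaque = struct.pack("<6d", *flat)

# Self-describing view (illustrative only): a header describes the payload,
# so a reader needs no out-of-band knowledge of the format.
header = json.dumps({
    "variable": "temperature", "units": "degC",
    "dtype": "float64", "shape": [2, 3],
}).encode()
container = struct.pack("<I", len(header)) + header + opaque

# A reader can reconstruct structure and annotations from the file alone.
hlen, = struct.unpack_from("<I", container)
meta = json.loads(container[4:4 + hlen])
count = meta["shape"][0] * meta["shape"][1]
values = struct.unpack_from("<%dd" % count, container, 4 + hlen)
print(meta["variable"], meta["shape"], values[:3])
```

Given only `container`, the reader recovers the variable name, units, element type and shape; given only `opaque`, it cannot.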
In a typical HPC I/O stack, applications interface directly only with NetCDF, which depends on HDF5 (Hierarchical Data Format), which in turn uses MPI-IO, and so on down to the parallel file system. The coupling between the different layers is loose and mainly used for performance tuning (for example, applications might set optional I/O hints). However, structural information about the data is lost as it is handed down through the layers: while an application might pass multi-dimensional matrices of numerical data to NetCDF, MPI-IO is only aware of a stream of elements (which can be various simple or composed data types), and Lustre's POSIX interface interprets file data as a plain stream of bytes.
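This progressive loss of structure can be made concrete with a small sketch that shows what each layer still "knows" about the same data (the layer representations here are illustrative, not the libraries' real data structures):

```python
import struct

# Application/NetCDF layer: a named, multi-dimensional, annotated variable.
variable = {"name": "pressure", "units": "hPa", "shape": (2, 2),
            "data": [[101.3, 101.1], [100.9, 101.0]]}

# MPI-IO layer: only a flat stream of typed elements remains; the variable
# name, units, and dimensionality are already gone.
elements = [v for row in variable["data"] for v in row]

# POSIX layer: only an opaque byte stream remains; even the element type
# is gone, so the shape (2, 2) can no longer be reconstructed.
byte_stream = struct.pack("<%dd" % len(elements), *elements)

print(len(elements), len(byte_stream))  # 4 elements, 32 bytes
```

Each step is irreversible without out-of-band knowledge, which is exactly why the self-describing formats at the top of the stack must re-encode this information into the file contents.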
Additionally, data and metadata are typically separated in such file systems for performance reasons. This leads to two different types of metadata: file system metadata (such as ownership and permissions) is stored on the file system's metadata servers, while file metadata (such as the type of physical quantities) is stored within the files themselves on the file system's data servers. File system metadata is typically in the range of 5% of the overall data volume, which leads to significant amounts of metadata in current multi-petabyte file systems.
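A minimal sketch of this split, with purely illustrative names and paths, might look as follows; the point is that the two kinds of metadata live in different places and are reached through different operations:

```python
# Toy model of the data/metadata split. All names and paths are invented
# for illustration; real parallel file systems use internal structures.
metadata_server = {
    # File system metadata, held on dedicated metadata servers.
    "/project/run1.nc": {"owner": "alice", "mode": 0o640},
}
data_server = {
    # File contents, held on the data servers. The embedded file metadata
    # is opaque to the file system itself.
    "/project/run1.nc": {
        "file_metadata": {"quantity": "temperature", "units": "K"},
        "payload": b"...",
    },
}

# "Who owns this file?" touches only the metadata servers ...
owner = metadata_server["/project/run1.nc"]["owner"]
# ... while "what physical quantity is stored?" requires reading the file
# contents from the data servers and parsing them.
units = data_server["/project/run1.nc"]["file_metadata"]["units"]
print(owner, units)
```

Because the file system cannot see into the payload, queries over file metadata cannot be answered by the metadata servers alone; this is one of the gaps a unified metadata management would close.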
There are several major problems with the current state:
POSIX requires changes to be visible to all clients directly after each write call and to be performed in an atomic way; this can lead to performance issues in a distributed system due to synchronization overhead. HPC applications often do not require the strict consistency and coherence semantics offered by POSIX but are not able to opt out of them. Even though higher layers such as MPI-IO typically offer relaxed semantics, their dependence on the parallel file system's POSIX interface effectively negates these relaxations. A single static set of semantics cannot satisfy the requirements of vastly different applications and forces application-specific workarounds.
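The cost difference between strict and relaxed semantics can be illustrated with a toy model that counts lock acquisitions as a stand-in for distributed synchronization round trips (real file systems synchronize via network protocols, not a single lock; the model only shows why per-write synchronization is expensive):

```python
from threading import Lock

global_lock = Lock()  # stand-in for distributed synchronization
sync_operations = 0

def strict_write(buf, data):
    # POSIX-like semantics: every write becomes globally visible
    # atomically, paying one synchronization step per call.
    global sync_operations
    with global_lock:
        sync_operations += 1
        buf.extend(data)

def relaxed_writes(buf, chunks):
    # MPI-IO-like relaxation: stage writes locally, synchronize once.
    global sync_operations
    staged = bytearray()
    for data in chunks:
        staged.extend(data)
    with global_lock:
        sync_operations += 1
        buf.extend(staged)

chunks = [b"x" * 8 for _ in range(100)]

out_strict = bytearray()
sync_operations = 0
for c in chunks:
    strict_write(out_strict, c)
strict_cost = sync_operations

out_relaxed = bytearray()
sync_operations = 0
relaxed_writes(out_relaxed, chunks)
relaxed_cost = sync_operations

print(strict_cost, relaxed_cost)  # 100 vs. 1 synchronization points
```

Both variants produce identical file contents, but the strict variant pays a synchronization step per write. When the bottom layer enforces strict semantics, the relaxed upper layer cannot realize this saving, which is the situation described above.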
Finally, existing file systems, especially parallel distributed file systems, do not allow flexible adaptation of their core concepts, such as their semantics or file system structure. This makes evaluating new propositions very expensive and ultimately hinders innovation. Another drawback is that a considerable number of file systems are realized exclusively in kernel space, which increases the complexity of changes significantly: kernel file systems can typically only be set up and mounted by privileged users, and integration into the kernel requires a more sophisticated implementation and involves considerable maintenance costs.
The aim of CoSEMoS is to propose and implement significant improvements in performance and data management through a novel architecture that draws on established technologies from the fields of databases and HPC. By introducing unified global metadata management combined with adaptable I/O semantics and intelligent storage selection, it addresses the problems discussed above. We expect considerable performance benefits: our previous studies have shown that metadata performance in particular can be improved by up to two orders of magnitude using similar techniques. Furthermore, by leveraging existing I/O interfaces, this approach is completely transparent to users. CoSEMoS will also provide a data analysis interface for efficient post-processing as well as an export functionality, so that files can be exchanged between scientists as before.