Since our studies show that the only way to achieve decent parallel output performance is to use one file per process, we have looked into different ways to provide a logical one-file view while using multiple physical files under the hood.
Our solution for the HDF5 file format is simple: we replaced the mpiposix file driver with a version that writes data to multiple files and, when the data is read again, recreates a merged file using a single process. Even with this rather simple approach, we can massively reduce the CPU time wasted while waiting for output to complete, and even the total runtime of a write/read cycle.
Using the patch
A new version of the patch, with several new features, is now available for HDF5 1.8.16; see the home of the improved patch.
Download the patch: hdf5-1.8.11-multifile.patch.gz
The patch is based on HDF5 version 1.8.11, but since all it does is replace one source file with another, it should be very easy to apply to other versions as well. To apply it, run:
    $ cd hdf5-1.8.11
    $ patch -p1 < hdf5-1.8.11-multifile.patch
Configure, build, and install the HDF5 distribution the usual way. After that, you can use the multifile driver by creating HDF5 files with:
    hid_t plist_id = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpiposix(plist_id, MPI_COMM_WORLD, false);
    hid_t fileId = H5Fcreate("path/to/file/to/create.h5", H5F_ACC_TRUNC, H5P_DEFAULT, plist_id);
Files written this way can then be read using this sequence:
    hid_t plist_id = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpiposix(plist_id, MPI_COMM_WORLD, false);
    hid_t fileId = H5Fopen("path/to/file/to/read.h5", H5F_ACC_RDONLY, plist_id);
When a file is created, the patched HDF5 library creates a directory with the given name instead. In this directory, a small file “metafile” is created that contains little more than the number of process files present. It is written only once, by process 0.
Each process creates two files within the directory: a file with its process number as a name and one with the suffix “.offset” appended to its process number. The first holds all the data that the process writes, the second holds the ranges within the reconstructed file where each block of data should end up. This allows us to use purely appending write operations while writing in parallel.
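The append-only scheme described above can be sketched as follows. This is only an illustration, not the code from the patch: the record layout (`block_record`) and the function name `append_block` are hypothetical, and the actual on-disk format of the “.offset” files is not specified here.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical record: where one block belongs in the reconstructed file.
 * The real ".offset" format used by the patch may differ. */
typedef struct {
    uint64_t target_offset; /* byte offset in the reconstructed file */
    uint64_t length;        /* number of bytes in the block */
} block_record;

/* Append one block to the per-process data file and log its target range
 * in the matching offset file. Both files are only ever appended to.
 * Returns 0 on success, -1 on an I/O error. */
int append_block(FILE *data, FILE *offsets,
                 uint64_t target_offset, const void *buf, uint64_t length)
{
    assert(data != NULL && offsets != NULL);
    assert(buf != NULL || length == 0);

    block_record rec = { target_offset, length };

    /* The payload goes to the end of the data file ... */
    if (length > 0 &&
        fwrite(buf, 1, (size_t)length, data) != (size_t)length)
        return -1;

    /* ... and its destination range goes to the end of the offset file. */
    if (fwrite(&rec, sizeof rec, 1, offsets) != 1)
        return -1;

    return 0;
}
```

Because each process touches only its own two files and never seeks, no coordination between processes is needed during the write.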
When the file is subsequently opened for reading, one process reads all the individual files, creates a new file, and replays the recorded writes into this file. Of course, this operation takes longer than the original write, but it is much faster than a parallel write to a shared file would have been. After the file is successfully reconstructed, the directory is replaced by the reconstructed file, which is just a conventional HDF5 file. This file is opened and can be read normally.
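As a rough sketch of the replay step for a single process file, assuming a hypothetical offset-file format of fixed-size (target offset, length) records, the reconstruction could look like this. The names `range_record` and `replay_process_file` are invented for illustration and are not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical record format; the patch's real format may differ. */
typedef struct {
    uint64_t target_offset; /* where the block belongs in the merged file */
    uint64_t length;        /* size of the block in bytes */
} range_record;

/* Replay every write recorded for one process into the merged file `out`.
 * `data` holds the blocks back to back in the order they were written;
 * `offsets` holds one range_record per block in the same order.
 * Returns 0 on success, -1 on error. */
int replay_process_file(FILE *data, FILE *offsets, FILE *out)
{
    assert(data != NULL && offsets != NULL && out != NULL);

    range_record rec;
    while (fread(&rec, sizeof rec, 1, offsets) == 1) {
        if (rec.length == 0)
            continue;

        char *buf = malloc((size_t)rec.length);
        if (buf == NULL)
            return -1;

        /* Read the next block sequentially, then write it at its target
         * position. Production code would use fseeko() for large files. */
        int ok =
            fread(buf, 1, (size_t)rec.length, data) == (size_t)rec.length &&
            fseek(out, (long)rec.target_offset, SEEK_SET) == 0 &&
            fwrite(buf, 1, (size_t)rec.length, out) == (size_t)rec.length;

        free(buf);
        if (!ok)
            return -1;
    }
    return ferror(offsets) ? -1 : 0;
}
```

Calling this once per process file, in any order that preserves the write order within each process, would yield the merged file; how the patch orders overlapping writes from different processes is not covered by this sketch.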
We have tried a number of different reconstruction algorithms. The code for three of them is included in the patch above; you can select among them by commenting or uncommenting #define statements at the top of “src/H5FDmpio_multiple.c”. The default is to read all the data at once, close the input files, then write the output file in one go. Another available option, which needs less memory, is to read the input with mmap(), a strategy that should give good performance on a well-designed OS. We have not made it the default, though, since there are important systems that fail to deliver good performance with mmap().
The reconstruction can also be parallelized by uncommenting another #define at the top of “src/H5FDmpio_multiple.c”; however, this leads to bad performance in all cases.
Some performance numbers
- writing with 32 processes on one node: between 1 and 2 GiB/s
- reconstructing with one process: ca. 500 MiB/s (this includes both reading and writing the data, the raw read/write throughput is roughly twice this number)
- unpatched, writing with 32 processes on one node: 150 MiB/s
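To put these figures in perspective, here is a back-of-the-envelope estimate for 1 GiB (1024 MiB) of output. It simply divides the volume by the throughputs listed above and is not a separate measurement:

```latex
\begin{align*}
t_{\text{unpatched write}} &\approx \frac{1024\ \text{MiB}}{150\ \text{MiB/s}} \approx 6.8\ \text{s} \\
t_{\text{patched write}} &\approx \frac{1024\ \text{MiB}}{2048\ \text{MiB/s}} \ \text{to}\ \frac{1024\ \text{MiB}}{1024\ \text{MiB/s}} = 0.5\ \text{to}\ 1\ \text{s} \\
t_{\text{reconstruct}} &\approx \frac{1024\ \text{MiB}}{500\ \text{MiB/s}} \approx 2.0\ \text{s} \\
t_{\text{patched write}} + t_{\text{reconstruct}} &\approx 2.5\ \text{to}\ 3.0\ \text{s}
\end{align*}
```

Under these assumptions, writing and reconstructing together take less than half the time of the unpatched shared-file write alone. Only the 0.5 to 1 s write phase keeps all 32 processes waiting, since the reconstruction is deferred until the file is read and runs on a single process.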