
External Links for netCDF

Maintainer

Introduction

In climate computing, netCDF is often used as a container for two types of data: the grid and the measurements. Typically, a grid is represented as a set of points, each of which represents a position on the earth's surface. Many models use pairs of longitude and latitude values for this purpose, but there are also more complex grids that include height, vertices, topography, and so on. Each grid point can be assigned a set of different values, such as temperature, air pressure, or insolation. We also need to take into account that these values can change over time.

In netCDF, a simple grid can be constructed from two dimensions, e.g. longitude and latitude. Changes over time can be represented by another dimension, e.g. time. These dimensions can be assigned to a variable that contains our measurements. As a result, we get a variable in which each value is assigned to a longitude, a latitude, and a time step. A netCDF file can hold several such variables connected to the same dimensions. But as soon as we start using several files, the grid is duplicated, because each file must contain one. Depending on the size and complexity of the grid, it can consume a significant amount of disk space. Therefore we need to find a way to reuse the grid.

An obvious solution to this problem is to store the grid in one file and create links to the grid in the other files. However, the current netCDF version lacks such a feature, so we decided to extend netCDF and wrote a patch.

Detailed Description

The netCDF-4 API is implemented on top of the HDF5 library, which has a wide range of useful features that we can use to implement the link functionality in netCDF. We decided to use HDF5 Virtual Datasets (VDS), a feature that was introduced in HDF5-1.10.0. VDS is a powerful feature of HDF5 and provides more functionality than is needed for our purpose. Our patch uses it in a simple way: it looks at the dimensionality of the source dataset and creates a virtual dataset with the same dimensionality in the target file. Unlimited (infinite) dimensions are not supported yet.

The emulation of netCDF dimensions in HDF5 is realized by HDF5 datasets and relies heavily on HDF5 attributes. More precisely, for each dimension netCDF creates a dataset and attaches a set of attributes to it. The dataset can store the dimension labels, and the attributes contain meta information about the dimension, e.g. index, name, attached variables. When virtual datasets are used to create links to datasets, these attributes are not created automatically; this work must be done manually. The attributes “CLASS”, “NAME”, and “REFERENCE_LIST” can easily be created with the HDF5 dimension scale interface. The attribute “_Netcdf4Dimid” is a pure netCDF component and is created with the HDF5 attribute interface.

Although HDF5 allows virtual datasets to be created even if the datasets they point to don't exist, we couldn't make our patch work this way. To work properly, our patch requires information from the source file, such as dimensionality and datatype. This implies that the file being linked to and valid datasets in it must exist at runtime.

If, after the links are created, the linked file becomes inaccessible for some reason (e.g. deleted, renamed, unreadable, …) and the dataset it contains is no longer available, the links will be filled with default values. In our case this is the value 0.

Our patch introduces a new function:

int nc_def_dim_external(int ncid, const int dimncid, const char *name, int *idp)
Name     Type  Description
ncid     in    File id. Links will be created here.
dimncid  in    File id, where dimensions are located.
name     in    Dimension name
idp      out   Dimension id

Measurements

Model    Resolution      Grid size  Data size
HD(CP)2  2323968 cells   2.8GB      100GB
HD(CP)2  7616120 cells   9.1GB      340GB
HD(CP)2  22282304 cells  27GB       1000GB
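
To put these numbers into perspective, the share of the grid in the data volume can be computed directly from the table. The small helper below is only an illustration (the name grid_share_percent is hypothetical), and it assumes that both columns refer to the same set of output files.

```c
/* Share of the data volume that is taken up by the grid, in percent.
 * Both sizes must use the same unit. Returns -1.0 for a non-positive
 * data size. */
double grid_share_percent(double grid_size, double data_size)
{
    if (data_size <= 0.0)
        return -1.0;
    return 100.0 * grid_size / data_size;
}
```

Under that assumption, for the three resolutions above (2.8/100, 9.1/340 and 27/1000) the grid accounts for roughly 2.7 to 2.8 percent of the data size.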

Internal structure of datafiles with high resolution:

$ h5ls 3d_fine_day_DOM03_ML_20130502T141200Z.nc 
bnds                     Dataset {2}
clc                      Dataset {1/Inf, 150, 22282304}
cli                      Dataset {1/Inf, 150, 22282304}
clw                      Dataset {1/Inf, 150, 22282304}
height                   Dataset {150}
height_2                 Dataset {151}
height_bnds              Dataset {150, 2}
hus                      Dataset {1/Inf, 150, 22282304}
ncells                   Dataset {22282304}
ninact                   Dataset {1/Inf, 150, 22282304}
pres                     Dataset {1/Inf, 150, 22282304}
qg                       Dataset {1/Inf, 150, 22282304}
qh                       Dataset {1/Inf, 150, 22282304}
qnc                      Dataset {1/Inf, 150, 22282304}
qng                      Dataset {1/Inf, 150, 22282304}
qnh                      Dataset {1/Inf, 150, 22282304}
qni                      Dataset {1/Inf, 150, 22282304}
qnr                      Dataset {1/Inf, 150, 22282304}
qns                      Dataset {1/Inf, 150, 22282304}
qr                       Dataset {1/Inf, 150, 22282304}
qs                       Dataset {1/Inf, 150, 22282304}
ta                       Dataset {1/Inf, 150, 22282304}
time                     Dataset {1/Inf}
tkvh                     Dataset {1/Inf, 151, 22282304}
ua                       Dataset {1/Inf, 150, 22282304}
va                       Dataset {1/Inf, 150, 22282304}
vertices                 Dataset {3}
wa                       Dataset {1/Inf, 151, 22282304}

Download

netCDF patch

netcdf-c-4.4.1-rc2.patch.gz

  • Depends on HDF5-1.10.0 or higher
  • Adds external dimensions to netCDF-library
    • no support for unlimited dimensions
    • no support for external variables

HDF5 patch

hdf5-1.10.0-patch1.patch.gz

  • Workaround for the “Assertion `ret == NC_NOERR' failed.” issue.

Installation

  1. HDF5
    1. Download and extract HDF5-1.10.0-patch1
    2. Download and extract the patch to the HDF5 directory
    3. Change to the HDF5 directory
    4. Apply the patch
       git apply hdf5-1.10.0-patch1.patch 
    5. Configure, make and install HDF5 in the usual way
  2. netCDF
    1. Download and extract netCDF-c-4.4.1-rc2
    2. Download and extract the patch to the netCDF directory
    3. Change to the netCDF directory
    4. Apply the patch
       git apply netcdf-c-4.4.1-rc2.patch 
    5. Configure, make and install netCDF in the usual way

Usage

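#include <netcdf.h>

int dimid;                 /* id of the new, linked dimension */
int grid_ncid, data_ncid;  /* ids of the grid file and of the data file */
const char* gridfile = "grid.nc";
const char* datafile = "data.nc";
 
/* The grid file must already exist and contain the dimension "lat". */
nc_open(gridfile, NC_NOWRITE, &grid_ncid);
nc_create(datafile, NC_NETCDF4, &data_ncid);
/* Create "lat" in data.nc as a link to "lat" in grid.nc. */
nc_def_dim_external(data_ncid, grid_ncid, "lat", &dimid);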

Example

Step 1: Creating a netCDF file

In the first step a netCDF4 file “grid.nc” is created. It contains labeled dimensions “lat”, “lon” and “time”, and a variable “var1”. The output of ncdump shows the header of the file.

Download: mkncfile.c

Output of ncdump:

$ ncdump -h  grid.nc
netcdf grid {
dimensions:
	time = UNLIMITED ; // (4 currently)
	lat = 6 ;
	lon = 5 ;
variables:
	float time(time) ;
	float lat(lat) ;
	float lon(lon) ;
	int var1(time, lat, lon) ;
}

Output of h5dump:

$ h5dump -A -p grid.nc
HDF5 "grid.nc" {
GROUP "/" {
   ATTRIBUTE "_NCProperties" {
      DATATYPE  H5T_STRING {
         STRSIZE 8192;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_UTF8;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SCALAR
      DATA {
      (0): "version=1|netcdflibversion=4.4.1-rc2|hdf5libversion=1.10.0"
      }
   }
   DATASET "lat" {
      DATATYPE  H5T_IEEE_F32LE
      DATASPACE  SIMPLE { ( 6 ) / ( 6 ) }
      STORAGE_LAYOUT {
         CONTIGUOUS
         SIZE 24
         OFFSET 17098
      }
      FILTERS {
         NONE
      }
      FILLVALUE {
         FILL_TIME H5D_FILL_TIME_IFSET
         VALUE  9.96921e+36
      }
      ALLOCATION_TIME {
         H5D_ALLOC_TIME_LATE
      }
      ATTRIBUTE "CLASS" {
         DATATYPE  H5T_STRING {
            STRSIZE 16;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SCALAR
         DATA {
         (0): "DIMENSION_SCALE"
         }
      }
      ATTRIBUTE "NAME" {
         DATATYPE  H5T_STRING {
            STRSIZE 4;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SCALAR
         DATA {
         (0): "lat"
         }
      }
      ATTRIBUTE "REFERENCE_LIST" {
         DATATYPE  H5T_COMPOUND {
            H5T_REFERENCE { H5T_STD_REF_OBJECT } "dataset";
            H5T_STD_I32LE "dimension";
         }
         DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
         DATA {
         (0): {
               DATASET 11799 /var1 ,
               1
            }
         }
      }
      ATTRIBUTE "_Netcdf4Dimid" {
         DATATYPE  H5T_STD_I32LE
         DATASPACE  SCALAR
         DATA {
         (0): 1
         }
      }
   }
   DATASET "lon" {
      DATATYPE  H5T_IEEE_F32LE
      DATASPACE  SIMPLE { ( 5 ) / ( 5 ) }
      STORAGE_LAYOUT {
         CONTIGUOUS
         SIZE 20
         OFFSET 17122
      }
      FILTERS {
         NONE
      }
      FILLVALUE {
         FILL_TIME H5D_FILL_TIME_IFSET
         VALUE  9.96921e+36
      }
      ALLOCATION_TIME {
         H5D_ALLOC_TIME_LATE
      }
      ATTRIBUTE "CLASS" {
         DATATYPE  H5T_STRING {
            STRSIZE 16;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SCALAR
         DATA {
         (0): "DIMENSION_SCALE"
         }
      }
      ATTRIBUTE "NAME" {
         DATATYPE  H5T_STRING {
            STRSIZE 4;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SCALAR
         DATA {
         (0): "lon"
         }
      }
      ATTRIBUTE "REFERENCE_LIST" {
         DATATYPE  H5T_COMPOUND {
            H5T_REFERENCE { H5T_STD_REF_OBJECT } "dataset";
            H5T_STD_I32LE "dimension";
         }
         DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
         DATA {
         (0): {
               DATASET 11799 /var1 ,
               2
            }
         }
      }
      ATTRIBUTE "_Netcdf4Dimid" {
         DATATYPE  H5T_STD_I32LE
         DATASPACE  SCALAR
         DATA {
         (0): 2
         }
      }
   }
   DATASET "time" {
      DATATYPE  H5T_IEEE_F32LE
      DATASPACE  SIMPLE { ( 4 ) / ( H5S_UNLIMITED ) }
      STORAGE_LAYOUT {
         CHUNKED ( 1024 )
         SIZE 4096
      }
      FILTERS {
         NONE
      }
      FILLVALUE {
         FILL_TIME H5D_FILL_TIME_IFSET
         VALUE  9.96921e+36
      }
      ALLOCATION_TIME {
         H5D_ALLOC_TIME_INCR
      }
      ATTRIBUTE "CLASS" {
         DATATYPE  H5T_STRING {
            STRSIZE 16;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SCALAR
         DATA {
         (0): "DIMENSION_SCALE"
         }
      }
      ATTRIBUTE "NAME" {
         DATATYPE  H5T_STRING {
            STRSIZE 5;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SCALAR
         DATA {
         (0): "time"
         }
      }
      ATTRIBUTE "REFERENCE_LIST" {
         DATATYPE  H5T_COMPOUND {
            H5T_REFERENCE { H5T_STD_REF_OBJECT } "dataset";
            H5T_STD_I32LE "dimension";
         }
         DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
         DATA {
         (0): {
               DATASET 11799 /var1 ,
               0
            }
         }
      }
      ATTRIBUTE "_Netcdf4Dimid" {
         DATATYPE  H5T_STD_I32LE
         DATASPACE  SCALAR
         DATA {
         (0): 0
         }
      }
   }
   DATASET "var1" {
      DATATYPE  H5T_STD_I32LE
      DATASPACE  SIMPLE { ( 4, 6, 5 ) / ( H5S_UNLIMITED, 6, 5 ) }
      STORAGE_LAYOUT {
         CHUNKED ( 1, 6, 5 )
         SIZE 480
      }
      FILTERS {
         NONE
      }
      FILLVALUE {
         FILL_TIME H5D_FILL_TIME_IFSET
         VALUE  -2147483647
      }
      ALLOCATION_TIME {
         H5D_ALLOC_TIME_INCR
      }
      ATTRIBUTE "DIMENSION_LIST" {
         DATATYPE  H5T_VLEN { H5T_REFERENCE { H5T_STD_REF_OBJECT }}
         DATASPACE  SIMPLE { ( 3 ) / ( 3 ) }
         DATA {
         (0): (DATASET 8428 /time ), (DATASET 10954 /lat ),
         (2): (DATASET 11377 /lon )
         }
      }
   }
}
}

Step 2: Linking dimensions

In the second step another netCDF file “data.nc” is created. It has the same structure as the file in the previous step, but the dimensions are connected to the “grid.nc” file. Unlimited dimensions are not supported at the moment. They are converted to limited ones, as you can see in the output of ncdump.

Download: mklink.c

Output of ncdump:

$ ncdump -h  data.nc
netcdf data {
dimensions:
	lat = 6 ;
	lon = 5 ;
	time = 4 ;
variables:
	float lat(lat) ;
	float lon(lon) ;
	float time(time) ;
	int var1(time, lat, lon) ;
}

Output of h5dump:

$ h5dump -A -p  data.nc
HDF5 "data.nc" {
GROUP "/" {
   ATTRIBUTE "_NCProperties" {
      DATATYPE  H5T_STRING {
         STRSIZE 8192;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_UTF8;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SCALAR
      DATA {
      (0): "version=1|netcdflibversion=4.4.1-rc2|hdf5libversion=1.10.0"
      }
   }
   DATASET "lat" {
      DATATYPE  H5T_IEEE_F32LE
      DATASPACE  SIMPLE { ( 6 ) / ( 6 ) }
      STORAGE_LAYOUT {
         MAPPING 0 { 
            VIRTUAL {
               SELECTION REGULAR_HYPERSLAB { 
                  START (0)
                  STRIDE (1)
                  COUNT (1)
                  BLOCK (6)
               }
            }
            SOURCE {
               FILE "grid.nc"
               DATASET "lat"
               SELECTION REGULAR_HYPERSLAB { 
                  START (0)
                  STRIDE (1)
                  COUNT (1)
                  BLOCK (6)
               }
            }
         }
      }
      FILLVALUE {
         FILL_TIME H5D_FILL_TIME_IFSET
         VALUE  H5D_FILL_VALUE_DEFAULT
      }
      ATTRIBUTE "CLASS" {
         DATATYPE  H5T_STRING {
            STRSIZE 16;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SCALAR
         DATA {
         (0): "DIMENSION_SCALE"
         }
      }
      ATTRIBUTE "NAME" {
         DATATYPE  H5T_STRING {
            STRSIZE 4;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SCALAR
         DATA {
         (0): "lat"
         }
      }
      ATTRIBUTE "REFERENCE_LIST" {
         DATATYPE  H5T_COMPOUND {
            H5T_REFERENCE { H5T_STD_REF_OBJECT } "dataset";
            H5T_STD_I32LE "dimension";
         }
         DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
         DATA {
         (0): {
               DATASET 15866 /var1 ,
               1
            }
         }
      }
      ATTRIBUTE "_Netcdf4Dimid" {
         DATATYPE  H5T_STD_I32LE
         DATASPACE  SCALAR
         DATA {
         (0): 0
         }
      }
   }
   DATASET "lon" {
      DATATYPE  H5T_IEEE_F32LE
      DATASPACE  SIMPLE { ( 5 ) / ( 5 ) }
      STORAGE_LAYOUT {
         MAPPING 0 { 
            VIRTUAL {
               SELECTION REGULAR_HYPERSLAB { 
                  START (0)
                  STRIDE (1)
                  COUNT (1)
                  BLOCK (5)
               }
            }
            SOURCE {
               FILE "grid.nc"
               DATASET "lon"
               SELECTION REGULAR_HYPERSLAB { 
                  START (0)
                  STRIDE (1)
                  COUNT (1)
                  BLOCK (5)
               }
            }
         }
      }
      FILLVALUE {
         FILL_TIME H5D_FILL_TIME_IFSET
         VALUE  H5D_FILL_VALUE_DEFAULT
      }
      ATTRIBUTE "CLASS" {
         DATATYPE  H5T_STRING {
            STRSIZE 16;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SCALAR
         DATA {
         (0): "DIMENSION_SCALE"
         }
      }
      ATTRIBUTE "NAME" {
         DATATYPE  H5T_STRING {
            STRSIZE 4;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SCALAR
         DATA {
         (0): "lon"
         }
      }
      ATTRIBUTE "REFERENCE_LIST" {
         DATATYPE  H5T_COMPOUND {
            H5T_REFERENCE { H5T_STD_REF_OBJECT } "dataset";
            H5T_STD_I32LE "dimension";
         }
         DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
         DATA {
         (0): {
               DATASET 15866 /var1 ,
               2
            }
         }
      }
      ATTRIBUTE "_Netcdf4Dimid" {
         DATATYPE  H5T_STD_I32LE
         DATASPACE  SCALAR
         DATA {
         (0): 1
         }
      }
   }
   DATASET "time" {
      DATATYPE  H5T_IEEE_F32LE
      DATASPACE  SIMPLE { ( 4 ) / ( H5S_UNLIMITED ) }
      STORAGE_LAYOUT {
         CHUNKED ( 1024 )
         SIZE 4096
      }
      FILTERS {
         NONE
      }
      FILLVALUE {
         FILL_TIME H5D_FILL_TIME_IFSET
         VALUE  9.96921e+36
      }
      ALLOCATION_TIME {
         H5D_ALLOC_TIME_INCR
      }
      ATTRIBUTE "CLASS" {
         DATATYPE  H5T_STRING {
            STRSIZE 16;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SCALAR
         DATA {
         (0): "DIMENSION_SCALE"
         }
      }
      ATTRIBUTE "NAME" {
         DATATYPE  H5T_STRING {
            STRSIZE 5;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SCALAR
         DATA {
         (0): "time"
         }
      }
      ATTRIBUTE "REFERENCE_LIST" {
         DATATYPE  H5T_COMPOUND {
            H5T_REFERENCE { H5T_STD_REF_OBJECT } "dataset";
            H5T_STD_I32LE "dimension";
         }
         DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
         DATA {
         (0): {
               DATASET 15866 /var1 ,
               0
            }
         }
      }
      ATTRIBUTE "_Netcdf4Dimid" {
         DATATYPE  H5T_STD_I32LE
         DATASPACE  SCALAR
         DATA {
         (0): 2
         }
      }
   }
   DATASET "var1" {
      DATATYPE  H5T_STD_I32LE
      DATASPACE  SIMPLE { ( 4, 6, 5 ) / ( H5S_UNLIMITED, 6, 5 ) }
      STORAGE_LAYOUT {
         CHUNKED ( 1, 6, 5 )
         SIZE 480
      }
      FILTERS {
         NONE
      }
      FILLVALUE {
         FILL_TIME H5D_FILL_TIME_IFSET
         VALUE  -2147483647
      }
      ALLOCATION_TIME {
         H5D_ALLOC_TIME_INCR
      }
      ATTRIBUTE "DIMENSION_LIST" {
         DATATYPE  H5T_VLEN { H5T_REFERENCE { H5T_STD_REF_OBJECT }}
         DATASPACE  SIMPLE { ( 3 ) / ( 3 ) }
         DATA {
         (0): (DATASET 9250 /time ), (DATASET 8428 /lat ),
         (2): (DATASET 8842 /lon )
         }
      }
   }
}
}

Future Work

In the next step we plan to integrate the external dimensions into our workflows.

research/projects/bullio/netcdf_external_links/start.txt · Last modified: 2018-05-09 17:27 (external edit)