Using compound data types in h5py


Compound data types let you build NumPy arrays whose elements combine several different data types, and store them in HDF5. For the example in this blog post, we want to store X and Y coordinates (as unsigned 32-bit integers), an intensity value (as a 64-bit float), and a DNA sequence (as a variable-length string).

import numpy as np
import h5py

my_datatype = np.dtype([('x', np.uint32),
                        ('y', np.uint32),
                        ('intensity', np.float64),
                        ('sequence', h5py.special_dtype(vlen=str))])
                 
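# Example records: (x, y, intensity, sequence)
data = [(1003, 4321, 43.2, "ACGTACTG"), (55, 4098, 12.1, "GGT"), (3209, 909, 59.7, "ACTAC")]
data_array = np.array(data, dtype=my_datatype)

# Open the file for writing and create a 1-D dataset with one element per record
h5 = h5py.File("my_data.h5", "w")
dataset = h5.create_dataset('/coordinates', (len(data),), dtype=my_datatype)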

Inspecting data_array shows what this datatype looks like:

In [1]: data_array
Out[1]: 
array([(1003, 4321,  43.2, 'ACGTACTG'), (  55, 4098,  12.1, 'GGT'),
       (3209,  909,  59.7, 'ACTAC')],
      dtype=[('x', '<u4'), ('y', '<u4'), ('intensity', '<f8'), ('sequence', 'O')])
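
For a closer look at the layout, NumPy exposes the field names and byte offsets on the dtype itself. The following is a small sketch that rebuilds a similar dtype in memory. It uses a plain Python object field in place of the h5py variable-length string type, which is an assumption made so the snippet runs with NumPy alone; the names record_dtype and records are illustrative.

```python
import numpy as np

# Stand-in for the post's datatype; `object` replaces the h5py vlen string type
# (an assumption so that this sketch needs only NumPy).
record_dtype = np.dtype([('x', np.uint32),
                         ('y', np.uint32),
                         ('intensity', np.float64),
                         ('sequence', object)])

# Tuple of field names, in declaration order
field_names = record_dtype.names

# Each entry of .fields starts with (dtype, byte offset) for that field
x_type, x_offset = record_dtype.fields['x'][:2]

records = np.array([(1003, 4321, 43.2, "ACGTACTG"), (55, 4098, 12.1, "GGT")],
                   dtype=record_dtype)

# Indexing by a single field name gives an ordinary uint32 array
x_column = records['x']
```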

To write all the records in a single operation, assign the array to the dataset:

dataset[...] = data_array
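
The file opened earlier stays open for the rest of the session. In a script, a with-block closes it automatically. The sketch below writes a reduced two-record dataset to a temporary directory (the path and the shortened data are assumptions for illustration) and reads the numeric fields back:

```python
import os
import tempfile

import h5py
import numpy as np

my_datatype = np.dtype([('x', np.uint32),
                        ('y', np.uint32),
                        ('intensity', np.float64),
                        ('sequence', h5py.special_dtype(vlen=str))])
data_array = np.array([(1003, 4321, 43.2, "ACGTACTG"), (55, 4098, 12.1, "GGT")],
                      dtype=my_datatype)

# Hypothetical location: a throwaway file in a temporary directory
path = os.path.join(tempfile.mkdtemp(), "my_data.h5")

# The with-block closes the file on exit, even if an error occurs
with h5py.File(path, "w") as h5:
    dataset = h5.create_dataset('/coordinates', (len(data_array),), dtype=my_datatype)
    dataset[...] = data_array

# Reopen read-only and pull values back out
with h5py.File(path, "r") as h5:
    x_values = h5['/coordinates']['x']      # reads only the 'x' field
    first_record = h5['/coordinates'][0]    # reads one whole record
```

Note that in h5py 3.x, variable-length strings are read back as bytes objects, so the sequence field may need decoding before use.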

Regular indexing works:

In [2]: h5['/coordinates'][0]
Out[2]: (1003, 4321,  43.2, 'ACGTACTG')

Selection works much like NumPy boolean indexing: build a mask from one or more fields, then index the dataset with it. Optionally, you can add a further index holding a list of field names to limit which fields are returned:

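# Each field lookup reads that field from the file; the comparisons yield a boolean mask
selector = (h5['/coordinates']['x'] > 1000) & (h5['/coordinates']['intensity'] > 40.0)
fields = ['x', 'y']
# Index the dataset with the mask first, then pick fields from the resulting NumPy array
coords = h5['/coordinates'][selector][fields]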

coords is now a structured array whose compound datatype consists of just two 32-bit unsigned integers:

In [3]: coords
Out[3]:
array([(1003, 4321), (3209,  909)],
      dtype=[('x', '<u4'), ('y', '<u4')])
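
Every field lookup on the dataset goes back to the file. If the dataset fits in memory, one alternative is to read it once and do the filtering in NumPy. The sketch below uses a hand-built array standing in for the result of h5['/coordinates'][...] (an assumption, so that it runs without a file) and leaves out the sequence field for brevity:

```python
import numpy as np

# Stand-in for records = h5['/coordinates'][...] (assumed to fit in memory)
records = np.array([(1003, 4321, 43.2), (55, 4098, 12.1), (3209, 909, 59.7)],
                   dtype=[('x', np.uint32), ('y', np.uint32), ('intensity', np.float64)])

# Boolean mask built from two fields of the in-memory array
mask = (records['x'] > 1000) & (records['intensity'] > 40.0)

# Apply the mask, then keep only the two coordinate fields
coords = records[mask][['x', 'y']]
```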

Even though every field in coords has the same type, it is still a structured array, so many NumPy operations can't be performed on it. This might not be the best way to do it, but you can get a regular array with:

In [4]: raw_coordinates = np.stack((coords['x'], coords['y']), axis=-1)
In [5]: raw_coordinates
Out[5]: 
array([[1003, 4321],
       [3209,  909]], dtype=uint32)
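
One alternative, assuming NumPy 1.16 or later, is structured_to_unstructured from numpy.lib.recfunctions, which converts all fields in one call instead of stacking them by name. A minimal sketch, with coords rebuilt by hand rather than read from the file:

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Hand-built copy of the two-field result from the selection step
coords = np.array([(1003, 4321), (3209, 909)],
                  dtype=[('x', np.uint32), ('y', np.uint32)])

# One row per record, one column per field, in field order
raw_coordinates = rfn.structured_to_unstructured(coords)
```

This saves listing the fields individually, which helps when there are more than two of them.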