Layout for HDF5 files



Hi group,

I'm looking to build a database of numerical data, and I'm evaluating
HDF5 (most likely via PyTables) as the storage format. The source data
currently lives in separate binary files, each holding between zero and
~1000 events. Each event is described by x,y coordinates, a magnitude,
and a detector channel ID (possibly stored as an 8-bit int). Each file
will be complete before it is added
to my database. Associated with each file will be a unique
~15-character string that identifies the measurement.

I might end up with about a million measurements, each with around four
detector channels, each channel with zero to ~1000 events.

Can anyone offer some advice on how I should structure this data into a
single HDF5 file? I would like to be able to retrieve efficiently all
events in a certain magnitude range across a large set of measurements.


Here are some alternative ideas on how to lay out my HDF5 file.

1: Put all the events into one group of datasets, with the x, y,
magnitude in one float32 dataset with ~1E6 rows and 3 columns, the
detector channels in a 1D uint8 dataset with ~1E6 rows, and the
measurement IDs in a 1D string dataset. I would want to create an
index for fast lookup of rows matching a measurement ID. The drawback
here is that the same ~15-character measurement ID string would be
stored once for every event in that measurement. Maybe compression
filters would make that bearable. The datasets would need to be
enlargeable so that measurements can be added as time goes by.

2: Create a group for each measurement, and put the data from each
measurement in datasets under its group. I'm unsure if this will be
efficient when I'm trying to query the database. With this approach,
all the datasets can be fixed-size.

3: Same as 2, but arrange the groups into a binary search tree to speed
up queries.
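
To make options 1 and 2 concrete, here is a rough sketch of how they
might look in PyTables, with the same magnitude-range query run against
both. It rests on several assumptions: option 1 is folded into a single
PyTables Table instead of three parallel datasets (a substitution, not
exactly the layout described above), and every name in it (FLAT_DTYPE,
EVENT_DTYPE, build_flat, build_grouped, the run_0001 IDs, the
/by_measurement group) is made up for illustration.

```python
import os
import tempfile

import numpy as np
import tables

# Hypothetical per-event record (option 2 stores exactly this).
EVENT_DTYPE = np.dtype(
    [("channel", "u1"), ("x", "f4"), ("y", "f4"), ("magnitude", "f4")]
)
# Option 1 repeats the measurement ID on every row.
FLAT_DTYPE = np.dtype([("measurement_id", "S15")] + EVENT_DTYPE.descr)

COND = "(magnitude >= lo) & (magnitude < hi)"


def make_events(channel, xs, ys, mags):
    """Pack the events of one measurement into a structured array."""
    rows = np.zeros(len(mags), dtype=EVENT_DTYPE)
    rows["channel"] = channel
    rows["x"], rows["y"], rows["magnitude"] = xs, ys, mags
    return rows


def build_flat(h5, measurements):
    """Option 1: one enlargeable, compressed table holding every event."""
    filters = tables.Filters(complevel=5, complib="zlib")
    table = h5.create_table("/", "events", description=FLAT_DTYPE,
                            filters=filters)
    for name, rows in measurements.items():
        flat = np.zeros(len(rows), dtype=FLAT_DTYPE)
        flat["measurement_id"] = name.encode("ascii")
        for field in EVENT_DTYPE.names:
            flat[field] = rows[field]
        table.append(flat)
    table.flush()
    # Indexes for lookups by measurement ID and by magnitude.
    table.cols.measurement_id.create_index()
    table.cols.magnitude.create_index()
    return table


def build_grouped(h5, measurements):
    """Option 2: one group per measurement, each with a fixed-size table."""
    root = h5.create_group("/", "by_measurement")
    for name, rows in measurements.items():
        group = h5.create_group(root, name)
        h5.create_table(group, "events", obj=rows)


def query_flat(table, lo, hi):
    """Single query against the one big table."""
    return table.read_where(COND, condvars={"lo": lo, "hi": hi})


def query_grouped(h5, lo, hi):
    """Visit every per-measurement table and concatenate the hits."""
    hits = [
        t.read_where(COND, condvars={"lo": lo, "hi": hi})
        for t in h5.walk_nodes("/by_measurement", classname="Table")
    ]
    return np.concatenate(hits)


def demo():
    measurements = {
        "run_0001": make_events(0, [0.0, 1.0], [0.5, 1.5], [1.0, 2.5]),
        "run_0002": make_events(3, [2.0, 3.0], [2.5, 3.5], [4.0, 7.5]),
    }
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "events.h5")
        with tables.open_file(path, mode="w") as h5:
            flat = build_flat(h5, measurements)
            build_grouped(h5, measurements)
            a = query_flat(flat, 2.0, 5.0)
            b = query_grouped(h5, 2.0, 5.0)
            ids = sorted(set(a["measurement_id"].tolist()))
            return (sorted(a["magnitude"].tolist()),
                    sorted(b["magnitude"].tolist()), ids)


if __name__ == "__main__":
    print(demo())
```

In this sketch the option 1 query is a single call on one table, while
the option 2 query has to open every per-measurement table in turn,
which is the part I'd expect to hurt with around a million groups.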

Does anyone have any wisdom to share?

Thanks,
John
[Please reply to the group, so everyone can learn]
