hdf5 shuffle caffe

What is HDF5? HDF5 is a file format that is useful with Caffe because it lets you store labels that are continuous-valued and multidimensional. For example, you may be interested in a regression problem where, for each person, you want to predict their height in inches and their weight in pounds. Each training sample then needs a label made up of 2 continuous values. HDF5 lets you store this kind of label.
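To make the height/weight example concrete, here is a minimal sketch of writing such labels with the h5py library. The helper name write_hdf5, the file name, and the feature values are made up for illustration. The dataset names "data" and "label" assume your Caffe layer uses those as its top names.

```python
import os
import tempfile

import h5py
import numpy as np


def write_hdf5(path, features, labels):
    """Write one HDF5 file holding a "data" and a "label" dataset.

    write_hdf5 is a hypothetical helper, not part of Caffe or h5py.
    """
    features = np.asarray(features, dtype=np.float32)
    labels = np.asarray(labels, dtype=np.float32)
    if features.shape[0] != labels.shape[0]:
        raise ValueError("features and labels need the same number of rows")
    with h5py.File(path, "w") as f:
        # Assumption: the dataset names match the "top" names of the data layer.
        f.create_dataset("data", data=features)
        f.create_dataset("label", data=labels)


# Three hypothetical people, each described by 4 placeholder input features.
features = np.zeros((3, 4), dtype=np.float32)
# Each row is one label: [height in inches, weight in pounds].
labels = [[70.0, 180.0], [64.5, 130.0], [68.0, 155.5]]

# Write into a temporary directory so the sketch leaves no clutter behind.
example_path = os.path.join(tempfile.mkdtemp(), "train_000.h5")
write_hdf5(example_path, features, labels)

# Read the file back to confirm that every sample carries a 2-value label.
with h5py.File(example_path, "r") as f:
    print(f["data"].shape, f["label"].shape)
```

The point to notice is the label shape: one row per sample and one column per predicted quantity.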

How are HDF5 files allocated? From what I have seen, people typically store their training set across multiple HDF5 files. In theory you could store your whole training set in a single HDF5 file, or you could store a single sample per HDF5 file. My guess is that storing all of your data in a single HDF5 file would result in a large file, on the order of gigabytes or even terabytes, which may be too costly to read in. Storing a single sample per file would create a large number of HDF5 files, and the overhead of opening and reading each one adds up. For my own work, where I have 200,000 training samples, I divided the data into 200 HDF5 files (1,000 samples per file).
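The split described above can be planned in a few lines of plain Python. This is only a sketch: plan_chunks and the train_NNN.h5 naming scheme are hypothetical, and the last few lines assume the usual Caffe setup in which the data layer's source is a text file listing one HDF5 path per line.

```python
def plan_chunks(num_samples, samples_per_file, prefix="train"):
    """Return a (file_name, start, stop) tuple for each HDF5 file to write.

    start is inclusive and stop is exclusive, so samples[start:stop]
    is the slice that belongs in file_name.
    """
    if samples_per_file <= 0:
        raise ValueError("samples_per_file must be positive")
    chunks = []
    for i, start in enumerate(range(0, num_samples, samples_per_file)):
        # The final chunk may be shorter if the split is uneven.
        stop = min(start + samples_per_file, num_samples)
        chunks.append(("%s_%03d.h5" % (prefix, i), start, stop))
    return chunks


chunks = plan_chunks(200000, 1000)
print(len(chunks))   # 200
print(chunks[0])     # ('train_000.h5', 0, 1000)
print(chunks[-1])    # ('train_199.h5', 199000, 200000)

# Contents of the list file handed to Caffe: one HDF5 path per line.
list_file_text = "\n".join(name for name, _, _ in chunks) + "\n"
```

Each tuple tells you which slice of the training set goes into which file, so the actual writing step can loop over chunks.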

What is this shuffling business? Well, when you have a particularly small dataset, the neural network will see the training data multiple times (multiple epochs). The concern is that seeing the training data in the same order each time may bias the network in an undesirable way. To address this, one should randomize the order in which the network sees the training samples.
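As a rough illustration of the idea (this is not how Caffe implements it), the sketch below draws a fresh random ordering of the sample indices for every epoch. The function name epoch_orders and the fixed seed are assumptions made for the example.

```python
import random


def epoch_orders(num_samples, num_epochs, seed=0):
    """Return one shuffled list of sample indices per epoch."""
    rng = random.Random(seed)  # fixed seed only so that runs are repeatable
    orders = []
    for _ in range(num_epochs):
        order = list(range(num_samples))
        rng.shuffle(order)  # in-place shuffle, a new permutation each epoch
        orders.append(order)
    return orders


for epoch, order in enumerate(epoch_orders(num_samples=8, num_epochs=3)):
    print(epoch, order)
```

Every epoch still visits each sample exactly once. Only the order changes.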

From some Google searches I got the impression that shuffling cannot easily be done with HDF5 files in Caffe. However, from looking at this commit, it appears you can shuffle HDF5 files. To shuffle both the order in which the HDF5 files are consumed and the order in which the samples within each HDF5 file are consumed, add shuffle: true to your hdf5_data_param. Note that you only need to add this to the data layer for the training phase. I am currently testing to see if this works…
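For reference, a training-phase data layer with this option might look like the sketch below. The layer name, the list file name train_h5_list.txt, and the batch size are placeholders, not values from my setup.

```
layer {
  name: "data"                    # placeholder layer name
  type: "HDF5Data"
  top: "data"
  top: "label"
  include { phase: TRAIN }        # shuffle is only added to the training phase
  hdf5_data_param {
    source: "train_h5_list.txt"   # placeholder: text file listing one HDF5 path per line
    batch_size: 64                # placeholder value
    shuffle: true
  }
}
```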