Bringing Computer Vision datasets to a Single Format: Step towards Consistency

Hey Folks,

Required Libraries and Versions:

  • Python: 3.x
  • Numpy: 1.12
  • Scipy: 0.18
  • h5py
  • xml.etree.ElementTree
  • multiprocessing
  • PIL
  • six.moves.cPickle
  • functools


When you have a good working algorithm and you want to test your masterpiece on some dataset, you almost always have to spend quite a lot of time on the actual loading and preprocessing of the data. It would be quite nice if we had all the data in one single format and a consistent way of accessing it (e.g. always store training images under the key “train/images”).

Here I’ll be sharing a GitHub repo written by me that converts several popular datasets into the HDF5 format. Currently it supports ImageNet, CIFAR-10, CIFAR-100 and SVHN.

What is HDF5 and Why Use It?

I think it’s fair that I give a quick introduction to HDF5 (which stands for Hierarchical Data Format) and explain why I picked it over all the other formats available. As the name suggests, it allows you to create a huge single file that can contain multiple data arrays. So you do not need to pull your hair out trying to recall where you saved the 126th array of data to run your full code correctly. Another nice property is that it allows you all the subtle organization you need for different data arrays (e.g. you can create data groups). And many different programming languages have libraries readily available to read these files, so it is not language-specific the way pickle files are.
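
To make this concrete, here is a minimal sketch of how several arrays live under named keys in one HDF5 file via h5py (the filename and keys here are illustrative, not the repo’s actual layout):

```python
import numpy as np
import h5py

# Create a single HDF5 file holding several arrays under named groups.
# h5py creates intermediate groups ("train") automatically from the path.
with h5py.File("example.hdf5", "w") as f:
    f.create_dataset("train/images", data=np.zeros((100, 32, 32, 3), dtype=np.float32))
    f.create_dataset("train/labels", data=np.zeros((100,), dtype=np.int64))

# Reopen it later and access arrays by key — no need to remember
# which of many loose files a given array went into.
with h5py.File("example.hdf5", "r") as f:
    print(f["train/images"].shape)  # (100, 32, 32, 3)
```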

What does this code do?

So this repository does quite a few things. First let me tell you the organization. The code base is pretty simple: there is a single file for each dataset that preprocesses the data and saves it as HDF5 (one each for ImageNet, for CIFAR-10 and CIFAR-100, and for SVHN). Essentially, each file does the following:

  • Load the original data into memory
  • Perform any reshaping required to get the data to the proper dimensionality (e.g. the CIFAR datasets give each image as a vector, so it needs to be brought to a 3-dimensional matrix)
  • Create an HDF5 file to save the data in
  • Use the Python multiprocessing library to process each image according to the user’s specifications
Below I’ll explain what the ImageNet file does. This is the most complicated file; the others are quite straightforward.

This file basically saves a subset of the ImageNet data as an HDF5 file. The subset is the data belonging to a number of natural classes (e.g. plant, cat) and artificial classes (e.g. chair, desk). Furthermore, you can normalize the data while saving it. You can provide various arguments at run time to define how you want the data preprocessed.

User Provided Arguments

  • --train_dir: Directory the training data is in (e.g. …/Data/CLS-LOC/train/)
  • --valid_dir: Directory the validation data is in (e.g. …/Data/CLS-LOC/valid/)
  • --valid_annotation_dir: Directory the validation annotations are in (e.g. …/Annotations/CLS-LOC/valid/)
  • --save_dir: Directory you want to save data in
  • --gloss_fname: This file contains a mapping between the synset ID and what the synset represents (i.e. a description) for all the classes in the ImageNet dataset
  • --zero_mean and --unit_variance: Normalization step, you can select zero-mean normalization, unit-variance normalization or both
  • --resize_to: Resize the images to a specific size
  • --nat_classes: Number of natural classes you need in the dataset
  • --art_classes: Number of artificial classes you need in the dataset
  • --n_threads: Number of threads to use while processing data (for the multiprocessing library)
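
The --zero_mean and --unit_variance options correspond to standard normalization steps. A minimal per-image sketch of what they mean (the actual implementation may normalize differently, e.g. per channel or over the whole dataset):

```python
import numpy as np

def normalize_image(img, zero_mean=True, unit_variance=True):
    """Optionally subtract the image mean and divide by its standard
    deviation, as selected by the user's flags."""
    img = img.astype(np.float32)
    if zero_mean:
        img = img - img.mean()
    if unit_variance:
        std = img.std()
        if std > 0:
            img = img / std
    return img

img = np.random.randint(0, 256, size=(32, 32, 3)).astype(np.float32)
out = normalize_image(img)
print(out.mean(), out.std())  # close to 0 and 1
```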

Then the save_imagenet_as_hdf5(...) function takes over. This function first creates a mapping between the validation dataset filenames and labels (i.e. build_or_retrieve_valid_filename_to_synset_id_mapping(...)). Next it isolates the classes related to the classification problem of the ImageNet dataset (1000 classes) with write_art_nat_ordered_class_descriptions(...) or retrieve_art_nat_ordered_class_descriptions(...). Then it writes the selected artificial and natural class information to an XML file using the write_selected_art_nat_synset_ids_and_descriptions(...) method.
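
The validation labels come from the XML annotation files. Reading the synset ID out of one such file with xml.etree.ElementTree might look like this (a sketch; the element names follow the standard ImageNet annotation layout, and the helper and filename are illustrative):

```python
import xml.etree.ElementTree as ET

def synset_id_from_annotation(xml_path):
    """Return the synset ID (e.g. 'n02084071') of the first annotated
    object in an ImageNet-style validation annotation file."""
    root = ET.parse(xml_path).getroot()
    # each annotated object carries its synset ID in the <name> tag
    return root.find("object/name").text

# tiny stand-in annotation file, just for demonstration
with open("demo_annotation.xml", "w") as f:
    f.write("<annotation><object><name>n02084071</name></object></annotation>")

print(synset_id_from_annotation("demo_annotation.xml"))  # n02084071
```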

Next we sweep through all the subdirectories in the training data and load all the related data points into memory. Then we create the HDF5 files to save the data; this is done with the save_train_data_in_filenames(...) function. The data will be saved under the following keys:

  • /train/images/
  • /train/labels/
  • /valid/images/
  • /valid/labels/

Saving and Analysing Data

There is also test code for saving the preprocessed images to disk and visually verifying that they are fine. It is found in test_saved_<dataset>.py.
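
Such a check can be as simple as undoing the normalization and writing a few images back out with PIL (a sketch; the helper name is mine, and test_saved_<dataset>.py may do more):

```python
import numpy as np
from PIL import Image

def save_for_inspection(img, path):
    """Rescale a (possibly normalized) float image back to the 0-255
    range and save it as a PNG for visual inspection."""
    lo, hi = img.min(), img.max()
    img8 = ((img - lo) / (hi - lo + 1e-8) * 255).astype(np.uint8)
    Image.fromarray(img8).save(path)

save_for_inspection(np.random.randn(32, 32, 3), "check_0.png")
```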

Running the Code

Here I’ll explain how to run the code for processing each dataset.


SVHN

First create a folder called data in your project’s home folder. Next create a folder called svhn-10 within the data folder. Copy the original SVHN files in here (they should look like train_32x32.mat, test_32x32.mat). Then run it as:



CIFAR-10

First create a folder called data in your project’s home folder (if it doesn’t exist already). Next create a folder called cifar-10 within the data folder. Copy the original CIFAR-10 files in here (they should look like cifar_10_data_batch_1, ..., cifar_10_data_batch_5, cifar_10_test_batch). Then run it as (for example):

python3 --data_type=cifar-10 --resize_to=24 --zero_mean=1 --unit_variance=1


CIFAR-100

First create a folder called data in your project’s home folder (if it doesn’t exist already). Next create a folder called cifar-100 within the data folder. Copy the original CIFAR-100 files in here (they should look like train, test). Then run it as (for example):

python3 --data_type=cifar-100 --resize_to=32 --zero_mean=1 --unit_variance=1


ImageNet

First create a folder called data in your project’s home folder (if it doesn’t exist already). You don’t need to copy data to the data folder; instead, you provide the specific locations the files are found in, as explained in the “User Provided Arguments” section above.

Accessing and Loading Data Later

You can access this saved data later as:

import os
import h5py

dataset_file = h5py.File("data" + os.sep + "filename.hdf5", "r")
train_dataset, train_labels = dataset_file['/train/images'], dataset_file['/train/labels']
test_dataset, test_labels = dataset_file['/test/images'], dataset_file['/test/labels']
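
Since h5py datasets slice like NumPy arrays, you can also read minibatches straight off disk without loading the whole dataset into memory. A sketch with a small illustrative file standing in for the converted dataset:

```python
import numpy as np
import h5py

# small stand-in for a converted dataset file
with h5py.File("mini.hdf5", "w") as f:
    f.create_dataset("train/images", data=np.zeros((10, 32, 32, 3), dtype=np.float32))
    f.create_dataset("train/labels", data=np.arange(10))

with h5py.File("mini.hdf5", "r") as f:
    images, labels = f["train/images"], f["train/labels"]
    for start in range(0, labels.shape[0], 4):
        batch_x = images[start:start + 4]  # only this slice is read from disk
        batch_y = labels[start:start + 4]
        print(batch_x.shape, batch_y)
```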

You can access the full GitHub repo here.
