Modeling, simulation, and engineering

Storing large Numpy arrays on disk: Python Pickle vs. HDF5

January 10th, 2010 Posted in Python, Scientific computing, Software development

In a previous post, I described how Python’s Pickle module is fast and convenient for storing all sorts of data on disk. More recently, I showed how to profile the memory usage of Python code.  In recent weeks, I’ve uncovered a serious limitation in the Pickle module when storing large amounts of data: Pickle requires a large amount of memory to save a data structure to disk. Fortunately, there is an open standard called HDF, which defines a binary file format that is designed to efficiently store large scientific data sets. I will demonstrate both approaches, and profile them to see how much memory is required. I am writing the HDF file using the PyTables interface. Here’s the little test program I’ve been using:

#!/usr/bin/env python

from numpy import *

array_len = 10000000

a = zeros(array_len, dtype=float)

#import pickle
#f = open('test.pickle', 'wb')
#pickle.dump(a, f, pickle.HIGHEST_PROTOCOL)
#f.close()

import tables
h5file = tables.openFile('test.h5', mode='w', title="Test Array")
root = h5file.root
h5file.createArray(root, "test", a)
h5file.close()

I first ran the program with both the pickle and the HDF code commented out, and profiled RAM usage with Valgrind and Massif (see my post about profiling memory usage of Python code). Massif shows that the program uses about 80MB of RAM. I then uncommented the Pickle code, and profiled the program again. Look at how the memory usage almost triples!

 MB
232.2^                                                                  #
  |                                                                  #
  |                                                                  #
  |                                                                  #
  |                                                                  #
  |                                                                  #
  |                                                                  #
  |                                                         @        #
  |                                                         @        #
  |                                                         @        #
  |                                                         @        #
  |                                                         @        #
  |                                                         @        #
  |                                                ,        @        #  .
  |                                                @        @        #  :
  |                                                @        @        #  :
  |                                                @        @        #  :
  |                                                @        @        #  :
  |                                                @        @        #  :
  |                                                @        @        #  :
0 +----------------------------------------------------------------------->Mi
  0                                                                   161.9

I then commented out the Pickle code and uncommented the HDF code, and ran the profile again. Notice how efficient the HDF library is:

    MB
81.62^                              ,.., .....,.. .,...,... .,......,..,:  .
   |                              @::@ :::::@:: :@:::@::: :@::::::@::#:  :
   |                              @::@ :::::@:: :@:::@::: :@::::::@::#:  :
   |                              @::@ :::::@:: :@:::@::: :@::::::@::#:  :
   |                              @::@ :::::@:: :@:::@::: :@::::::@::#:  :
   |                              @::@ :::::@:: :@:::@::: :@::::::@::#:  :
   |                              @::@ :::::@:: :@:::@::: :@::::::@::#:  :
   |                              @::@ :::::@:: :@:::@::: :@::::::@::#:  :
   |                              @::@ :::::@:: :@:::@::: :@::::::@::#:  :
   |                              @::@ :::::@:: :@:::@::: :@::::::@::#:  :
   |                              @::@ :::::@:: :@:::@::: :@::::::@::#:  :
   |                              @::@ :::::@:: :@:::@::: :@::::::@::#:  :
   |                              @::@ :::::@:: :@:::@::: :@::::::@::#:  :
   |                              @::@ :::::@:: :@:::@::: :@::::::@::#:  :
   |                              @::@ :::::@:: :@:::@::: :@::::::@::#:  :
   |                              @::@ :::::@:: :@:::@::: :@::::::@::#:  :
   |                              @::@ :::::@:: :@:::@::: :@::::::@::#:  :
   |                              @::@ :::::@:: :@:::@::: :@::::::@::#:  :
   |                              @::@ :::::@:: :@:::@::: :@::::::@::#:  :
   |              .,..,. .....    @::@ :::::@:: :@:::@::: :@::::::@::#:  :::
 0 +----------------------------------------------------------------------->Mi
   0                                                                   246.6

I didn’t benchmark the speed because, for my application, it doesn’t really matter, because the data gets dumped to disk relatively infrequently. It’s interesting to note that the data files on disk are almost identical in size: 76.29MB.

Why does Pickle consume so much more memory? The reason is that HDF is a binary data pipe, while Pickle is an object serialization protocol. Pickle actually consists of a simple virtual machine (VM) that translates an object into a series of opcodes and writes them to disk. To unpickle something, the VM reads and interprets the opcodes and reconstructs an object. The downside of this approach is that the VM has to construct a complete copy of the object in memory before it writes it to disk. I’m not sure why the pickle operation in my example code seems to require three times as much memory as the array itself. 99% of the time, memory usage isn’t a problem, but it becomes a real issue when you have hundreds of megabytes of data in a single array. Fortunately, HDF exists to efficiently handle large binary data sets, and the Pytables package makes it easy to access in a very Pythonic way.

You must be logged in to post a comment.