Python Pandas - Working with HDF5 Format



When working with large datasets, we may run into "out of memory" errors. An optimized storage format such as HDF5 can help, because the data is kept on disk and can be read back selectively. The pandas library offers tools like the HDFStore class and read/write APIs to store, retrieve, and manipulate data while optimizing memory usage and retrieval speed.

HDF5 (Hierarchical Data Format version 5) is an open-source file format designed to store large, complex, and heterogeneous data efficiently. It organizes the data in a hierarchical structure similar to a file system, with groups acting like directories and datasets functioning as files. A single HDF5 file can hold different types of data (such as arrays, images, tables, and documents), which makes it well suited to managing heterogeneous data.

Creating an HDF5 file using HDFStore in Pandas

The HDFStore class in pandas is a dictionary-like object that reads and writes pandas data in the HDF5 format using the PyTables library.

Example

The following example demonstrates how to create an HDF5 file using the pandas.HDFStore class.

import pandas as pd

# Create the store using the HDFStore class
store = pd.HDFStore("store.h5")

# Display the store
print(store)

# It is important to close the store after use
store.close()

Following is the output of the above code −

<class 'pandas.io.pytables.HDFStore'>
File path: store.h5

Note: To work with the HDF5 format in pandas, you need the PyTables library. It is an optional dependency of pandas and must be installed separately using one of the following commands −

# Using pip
pip install tables

# or using the conda installer
conda install pytables
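
Before opening a store, it can be handy to check whether PyTables is actually available. The snippet below is a minimal sketch that relies only on the standard library; the variable name has_pytables is chosen here for illustration.

```python
import importlib.util

# PyTables is imported under the module name "tables"
has_pytables = importlib.util.find_spec("tables") is not None

if has_pytables:
    print("PyTables is installed; pandas HDF5 support should be available")
else:
    print("PyTables is missing; install it with: pip install tables")
```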

Writing and Reading Data Using HDFStore in Pandas

HDFStore is a dict-like object, so you can write data to and read data from the HDF5 store directly using key-value pairs.

Example

The example below demonstrates how to write data to and read data from an HDF5 file using HDFStore.

import pandas as pd
import numpy as np

# Create the store
store = pd.HDFStore("store.h5")

# Create the data
index = pd.date_range("1/1/2024", periods=8)
s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])
df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=["A", "B", "C"])

# Write Pandas data to the Store, which is equivalent to store.put('s', s)
store["s"] = s
store["df"] = df

# Read Data from the store, which is equivalent to store.get('df')
from_store = store["df"]
print('Retrieved Data From the HDFStore:\n', from_store)

# Close the store after use
store.close()

Following is the output of the above code −

Retrieved Data From the HDFStore:
                   A         B         C
2024-01-01  0.200467  0.341899  0.105715
2024-01-02 -0.379214  1.527714  0.186246
2024-01-03 -0.418122  1.008820  1.331104
2024-01-04  0.146418  0.587433 -0.750389
2024-01-05 -0.556524 -0.551443 -0.161225
2024-01-06 -0.214145 -0.722693  0.072083
2024-01-07  0.631878 -0.521474 -0.769847
2024-01-08 -0.361999  0.435252  1.177110
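
The dictionary-like behavior also covers membership tests, key listing, and deletion, and keys containing slashes map onto the hierarchical groups of the HDF5 file. The following is a minimal sketch of these operations; the file name and the key "results/run1" are hypothetical, and the file is created in a temporary directory so that it does not interfere with store.h5.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical file name in a temporary directory, used only for this sketch
path = os.path.join(tempfile.mkdtemp(), "dict_demo.h5")

df = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["A", "B"])

# The with-statement closes the store automatically, even if an error occurs
with pd.HDFStore(path) as store:
    store["df"] = df                  # write, like a dict assignment
    store["results/run1"] = df * 2    # a slash places the object in a nested group
    has_df = "df" in store            # membership test
    keys = sorted(store.keys())       # keys are reported with a leading "/"
    del store["df"]                   # remove the object stored under "df"
    keys_after_delete = store.keys()

print(has_df)
print(keys)
print(keys_after_delete)
```

Using the store as a context manager is a design choice made here to avoid forgetting the explicit store.close() call.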

Reading and Writing HDF5 Files Using Pandas APIs

Pandas also provides high-level APIs that simplify working with HDF5 files. These APIs let you read and write data directly without manually creating an HDFStore object. Following are the primary APIs for handling HDF5 files in pandas −

Writing Pandas Data to HDF5 Using to_hdf()

The to_hdf() method writes pandas objects such as DataFrames and Series directly to an HDF5 file; it uses HDFStore internally. It provides optional parameters for compression (complevel, complib), missing-value handling (dropna, nan_rep), the storage format (format), and more, allowing you to store your data efficiently.

Example

This example uses the DataFrame.to_hdf() method to write data to an HDF5 file.

import pandas as pd

# Create a DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['x', 'y', 'z'])

# Write data to an HDF5 file using the to_hdf()
df.to_hdf("data_store.h5", key="df", mode="w", format="table")

print("Data successfully written to HDF5 file")

Following is the output of the above code −

Data successfully written to HDF5 file
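
To illustrate the compression parameters of to_hdf(), the sketch below writes the same DataFrame twice, once without compression and once with complevel=9 and complib="zlib", and compares the file sizes. The file names and the highly repetitive sample data are assumptions made for this illustration; real data will usually compress less well.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical file names in a temporary directory, used only for this sketch
tmp_dir = tempfile.mkdtemp()
plain_path = os.path.join(tmp_dir, "plain.h5")
packed_path = os.path.join(tmp_dir, "packed.h5")

# Highly repetitive data, chosen so that compression has a visible effect
df = pd.DataFrame({"A": np.zeros(100_000), "B": np.ones(100_000)})

# Same data, written without and with zlib compression
df.to_hdf(plain_path, key="df", mode="w", format="table")
df.to_hdf(packed_path, key="df", mode="w", format="table",
          complevel=9, complib="zlib")

plain_size = os.path.getsize(plain_path)
packed_size = os.path.getsize(packed_path)
print("Uncompressed bytes:", plain_size)
print("Compressed bytes:  ", packed_size)

# Compression is transparent when reading the data back
roundtrip = pd.read_hdf(packed_path, key="df")
print(len(roundtrip))
```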

Reading Data from HDF5 Using read_hdf()

The pandas.read_hdf() function retrieves a pandas object stored in an HDF5 file. It accepts a file path (a string or path object) or an open HDFStore from which the data is read.

Example

This example demonstrates how to read data stored under the key "df" from the HDF5 file "data_store.h5" using the pd.read_hdf() method.

import pandas as pd

# Read data from the HDF5 file using the read_hdf()
retrieved_df = pd.read_hdf("data_store.h5", key="df")

# Display the retrieved data
print("Retrieved Data:\n", retrieved_df.head())

Following is the output of the above code −

Retrieved Data:
   A  B
x  1  4
y  2  5
z  3  6
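
When the data was written with format="table", read_hdf() can also load just part of it, which helps keep memory usage low. The sketch below goes slightly beyond the example above: it assumes a hypothetical file written with data_columns=["A"] so that column A can be used in a where condition, and it uses the where, columns, start, and stop parameters of read_hdf().

```python
import os
import tempfile

import pandas as pd

# Hypothetical file name in a temporary directory, used only for this sketch
path = os.path.join(tempfile.mkdtemp(), "query_demo.h5")

df = pd.DataFrame({"A": range(10), "B": range(10, 20)})

# format="table" allows partial reads; data_columns makes "A" queryable
df.to_hdf(path, key="df", mode="w", format="table", data_columns=["A"])

# Read only the rows where A > 6, and only column B
subset = pd.read_hdf(path, key="df", where="A > 6", columns=["B"])

# Read only the first three rows by position
first_rows = pd.read_hdf(path, key="df", start=0, stop=3)

print(subset)
print(first_rows)
```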

Appending Data to HDF5 Files Using to_hdf()

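To append rows to data already stored in an HDF5 file, pass append=True to to_hdf() together with format="table"; only the table format supports appending. The mode="a" option (the default) opens the file without truncating it, so the existing content is preserved. This is useful when you want to add new data to a file without overwriting what is already there.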

Example

This example demonstrates how to append data to an existing HDF5 file using the to_hdf() method.

import pandas as pd

# Create a DataFrame to append
df_new = pd.DataFrame({'A': [7, 8], 'B': [1, 1]}, index=['i', 'j'])

# Append the new data to the existing HDF5 file
df_new.to_hdf("data_store.h5", key="df", mode="a", format="table", append=True)

print("Data successfully appended")

# Now read data from the HDF5 file using the read_hdf()
retrieved_df = pd.read_hdf("data_store.h5", key='df')

# Display the retrieved data
print("Retrieved Data:\n", retrieved_df.head())

Following is the output of the above code −

Data successfully appended
Retrieved Data:
   A  B
x  1  4
y  2  5
z  3  6
i  7  1
j  8  1
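
Appending is also a way to build a large file piece by piece without holding all the data in memory at once. The following is a minimal sketch of that pattern using HDFStore.append() instead of to_hdf(); the file name and the small generated chunks are hypothetical stand-ins for pieces of a larger dataset.

```python
import os
import tempfile

import pandas as pd

# Hypothetical file name in a temporary directory, used only for this sketch
path = os.path.join(tempfile.mkdtemp(), "append_demo.h5")

with pd.HDFStore(path, mode="w") as store:
    for start in range(0, 9, 3):
        # Each chunk stands in for one piece of a dataset too large to load at once
        chunk = pd.DataFrame(
            {"A": range(start, start + 3), "B": [start] * 3},
            index=range(start, start + 3),
        )
        # HDFStore.append() writes in the table format and adds rows under the key
        store.append("df", chunk)

combined = pd.read_hdf(path, key="df")
print(len(combined))
print(combined["A"].tolist())
```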