Welcome to the next step in our journey of handling large datasets. In the last lesson, we explored managing data from compressed JSON files within a zip archive. Today, we will delve into how NumPy, a powerful library for numerical operations, can help us handle large data arrays efficiently by using the `.npz` file format.
NumPy is widely used in data science and machine learning for its ability to process large arrays and matrices swiftly. The `.npz` format stores multiple NumPy arrays in a single file. Under the hood it is a zip archive containing one `.npy` file per array, which makes it a convenient way to store and access large datasets.
To begin handling large datasets, we need some large NumPy arrays as examples. Here’s how you can generate large arrays filled with random values using NumPy.
First, let's start by importing the NumPy library. If you're using a personal development environment, you must install it with `pip install numpy`. On CodeSignal, it's pre-installed.
```python
import numpy as np
```
Next, create two large arrays of random numbers. We'll use `np.random.rand` to generate arrays of size 1000x1000. This size is just for demonstration; you can adjust it based on your needs.
```python
array1 = np.random.rand(1000, 1000)
array2 = np.random.rand(1000, 1000)
```
Here, `array1` and `array2` are two-dimensional arrays of `float64` values drawn uniformly from the half-open interval [0, 1), each with a shape of 1000x1000.
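To get a feel for what "large" means here, it can help to check how much memory one of these arrays occupies. The snippet below is a minimal sketch that relies only on the standard `dtype` and `nbytes` attributes; the variable name `size_mb` is just illustrative.

```python
import numpy as np

array1 = np.random.rand(1000, 1000)

# dtype and nbytes describe the in-memory footprint of the array
print(array1.dtype)   # np.random.rand produces float64 values
print(array1.nbytes)  # 1000 * 1000 elements * 8 bytes per element

# Illustrative helper value: size in megabytes (10**6 bytes)
size_mb = array1.nbytes / 1e6
print(f"{size_mb:.1f} MB")
```

Each 1000x1000 array therefore takes about 8 MB in memory, which is a useful baseline when comparing file sizes on disk.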
Now that we have our arrays, let's save them to a `.npz` file. This file format is efficient, as it can store multiple arrays in a single file.
We use the `np.savez` function to achieve this:
```python
npz_file_path = 'large_data.npz'
np.savez(npz_file_path, array1=array1, array2=array2)
```
In this snippet:
- `np.savez` is used to save the arrays into a `.npz` file.
- `npz_file_path` is the location where the file will be saved.
- `array1=array1, array2=array2` saves each array with a key, so you can retrieve them later.
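The keys matter because they are the only way to address an array inside the archive. As a small sketch (the file names and the tiny arrays `a` and `b` are made up for illustration, and a temporary directory is used so nothing is left behind), the example below compares keyword arguments with positional arguments. When arrays are passed positionally, `np.savez` assigns the default names `arr_0`, `arr_1`, and so on.

```python
import os
import tempfile

import numpy as np

a = np.arange(3)
b = np.ones((2, 2))

with tempfile.TemporaryDirectory() as tmp_dir:
    named_path = os.path.join(tmp_dir, "named.npz")
    positional_path = os.path.join(tmp_dir, "positional.npz")

    # Keyword arguments become the keys stored in the archive
    np.savez(named_path, first=a, second=b)
    # Positional arguments receive automatic keys
    np.savez(positional_path, a, b)

    # .files lists the keys available in each archive
    with np.load(named_path) as data:
        named_keys = list(data.files)
    with np.load(positional_path) as data:
        positional_keys = list(data.files)

print(named_keys)
print(positional_keys)
```

Explicit keys are usually easier to work with later, since `arr_0` says nothing about what the array contains.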
To save with compression, which reduces file size at the cost of extra time spent compressing on save and decompressing on load, use `np.savez_compressed`:
```python
compressed_npz_file_path = 'large_data_compressed.npz'
np.savez_compressed(compressed_npz_file_path, array1=array1, array2=array2)
```
Neither function prints anything on success. After running this, both files should exist in the working directory; add your own `print` calls if you want a confirmation message.
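One way to confirm that the files were written, and to see what compression buys you, is to compare their sizes on disk. The sketch below is illustrative only: the 500x500 arrays, the labels, and the temporary file names are assumptions chosen to keep the example quick. It contrasts random values, which typically shrink very little, with an array of zeros, which compresses extremely well.

```python
import os
import tempfile

import numpy as np

random_data = np.random.rand(500, 500)  # high-entropy values
zeros = np.zeros((500, 500))            # highly repetitive values

sizes = {}
with tempfile.TemporaryDirectory() as tmp_dir:
    for label, arr in [("random", random_data), ("zeros", zeros)]:
        plain_path = os.path.join(tmp_dir, f"{label}.npz")
        packed_path = os.path.join(tmp_dir, f"{label}_compressed.npz")

        np.savez(plain_path, data=arr)
        np.savez_compressed(packed_path, data=arr)

        # Record (uncompressed size, compressed size) in bytes
        sizes[label] = (os.path.getsize(plain_path), os.path.getsize(packed_path))

for label, (plain, packed) in sizes.items():
    print(f"{label}: savez={plain} bytes, savez_compressed={packed} bytes")
```

How much you gain depends heavily on the data, so it is worth measuring on a representative sample before committing to the compressed variant.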
Next, let's read the saved arrays from the `.npz` file. For this, use the `np.load` function.
```python
with np.load(npz_file_path) as data:
    loaded_array1 = data['array1']
    loaded_array2 = data['array2']
```
Explanation:
- `np.load(npz_file_path)` opens the `.npz` file.
- Using `with` ensures that the file is properly closed after loading.
- `data['array1']` and `data['array2']` look up the arrays by the keys `'array1'` and `'array2'` that were assigned when saving.
An important feature of `np.load` is lazy-loading. When you load an `.npz` file, you get a dictionary-like object that doesn't read an array from disk until you access its key. This is beneficial for handling very large datasets because only the arrays you actually request occupy memory. Because the file is closed when the `with` block ends, read every array you need inside the block.
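The lazy behavior described above can be sketched as follows. The array shapes and the temporary path are placeholders for illustration. The example lists the available keys through the `files` attribute and then reads only one of the two stored arrays, so the other one is never pulled into memory.

```python
import os
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "large_data.npz")
    np.savez(
        path,
        array1=np.random.rand(200, 200),
        array2=np.random.rand(300, 300),
    )

    with np.load(path) as data:
        # Listing the keys does not read any array data
        available_keys = list(data.files)
        # Only array2 is read from disk here; array1 is left untouched
        only_array2 = data["array2"]

# The array was fully read inside the block, so it stays usable afterwards
print(available_keys)
print(only_array2.shape)
```

If an archive holds many arrays and you need just a few, this pattern avoids paying the memory cost for the rest.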
Once loaded, you can verify the arrays by checking their shapes:
```python
print("Loaded arrays from .npz file.")
print(f"Array1 shape: {loaded_array1.shape}, Array2 shape: {loaded_array2.shape}")
```
The output should confirm the shapes are unchanged:
```text
Loaded arrays from .npz file.
Array1 shape: (1000, 1000), Array2 shape: (1000, 1000)
```
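Matching shapes show that the structure survived, but not that the values did. A stricter check, sketched here with a small illustrative array and a temporary file name, compares the restored array with the original element by element using `np.array_equal`.

```python
import os
import tempfile

import numpy as np

original = np.random.rand(100, 100)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "roundtrip.npz")
    np.savez(path, original=original)

    with np.load(path) as data:
        restored = data["original"]

# Compare values and dtype, not just shape
same_values = np.array_equal(original, restored)
same_dtype = original.dtype == restored.dtype
print(same_values, same_dtype)
```

The same check works for files written with `np.savez_compressed`, since the compression it applies is lossless.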
You've now learned how to efficiently store and retrieve large NumPy arrays using the `.npz` format. This technique is crucial in scenarios where you work with large datasets, such as image processing, scientific simulations, or machine learning, where saving and loading data efficiently can conserve both time and storage resources.
In summary:
- We created large NumPy arrays.
- We saved them to a single `.npz` file and also demonstrated how to save using compression.
- We loaded the arrays back from the `.npz` file, retaining their structure, and learned about lazy-loading as a memory optimization feature.
Next, you'll have hands-on practice exercises to reinforce these concepts. Congratulations on reaching this point! These skills are fundamental as you continue exploring data handling techniques.