Welcome to the next step in our journey of handling large datasets. In the last lesson, we explored managing data from compressed JSON files within a zip archive. Today, we will delve into how NumPy, a powerful library for numerical operations, can help us handle large data arrays efficiently by using the .npz file format.
NumPy is widely used in data science and machine learning for its ability to process large arrays and matrices swiftly. The .npz format, specifically, allows us to store multiple NumPy arrays in a single file, making it efficient for storing and accessing large datasets.
To begin handling large datasets, we need some large NumPy arrays as examples. Here’s how you can generate large arrays filled with random values using NumPy.
First, let's start by importing the NumPy library. If you're using a personal development environment, you must install it with pip install numpy. On CodeSignal, it's pre-installed.
Next, create two large arrays of random numbers. We’ll use np.random.rand to generate arrays of size 1000x1000. This size is just for demonstration; you can adjust it based on your needs.
Here, array1 and array2 are two-dimensional arrays filled with random floats between 0 and 1, each with a shape of 1000x1000.
Now that we have our arrays, let's save them to a .npz file. This file format is efficient, as it can store multiple arrays in a single file.
We use the np.savez function to achieve this:
In this snippet:
np.savezis used to save the arrays into a.npzfile.npz_file_pathis the location where the file will be saved.array1=array1, array2=array2saves each array with a key, so you can retrieve them later.
For saving with compression, which reduces file size at the cost of longer loading time, use np.savez_compressed:
After running this, you should see messages indicating the files have been saved.
Next, let’s read the saved arrays from the .npz file. For this, use the np.load function.
Explanation:
np.load(npz_file_path)opens the.npzfile.- Using
withensures that the file is properly closed after loading. data['array1']anddata['array2']are keys used to access the respective arrays saved earlier.
An important feature of np.load is lazy-loading. When you load an .npz file, you are essentially creating a dictionary-like object that doesn’t load arrays into memory until accessed. This is beneficial for handling very large datasets because it minimizes memory usage.
Once loaded, you can verify the arrays by checking their shapes:
The output should confirm the shapes are unchanged:
You've now learned how to efficiently store and retrieve large NumPy arrays using the .npz format. This technique is crucial in scenarios where you work with large datasets, such as image processing, scientific simulations, or machine learning, where saving and loading data efficiently can conserve both time and storage resources.
In summary:
- We created large
NumPyarrays. - We saved them to a single
.npzfile and also demonstrated how to save using compression. - We loaded the arrays back from the
.npzfile, retaining their structure and learned about lazy-loading as a memory optimization feature.
Next, you'll have hands-on practice exercises to reinforce these concepts. Congratulations on reaching this point! These skills are fundamental as you continue exploring data handling techniques.
