
Introduction to NumPy and .npz Files

Welcome to the next step in our journey of handling large datasets. In the last lesson, we explored managing data from compressed JSON files within a zip archive. Today, we will delve into how NumPy, a powerful library for numerical operations, can help us handle large data arrays efficiently by using the .npz file format.

NumPy is widely used in data science and machine learning for its ability to process large arrays and matrices swiftly. The .npz format, specifically, allows us to store multiple NumPy arrays in a single file, making it efficient for storing and accessing large datasets.

Creating Large NumPy Arrays

To begin handling large datasets, we need some large NumPy arrays as examples. Here’s how you can generate large arrays filled with random values using NumPy.

First, let's import the NumPy library. If you're working in your own development environment and NumPy is not yet installed, install it with pip install numpy. On CodeSignal, it's pre-installed.
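A minimal sketch of the import, assuming the conventional np alias that the rest of this lesson relies on. Printing the version is only a quick sanity check that the installation works:

```python
# Import NumPy under its conventional alias
import numpy as np

# Optional sanity check: confirm NumPy is importable and see which version is installed
print(np.__version__)
```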

Next, create two large arrays of random numbers. We’ll use np.random.rand to generate arrays of size 1000x1000. This size is just for demonstration; you can adjust it based on your needs.
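One way this step might look, using the array names array1 and array2 referenced throughout the lesson:

```python
import numpy as np

# Two 1000x1000 arrays of random floats drawn uniformly from [0, 1)
array1 = np.random.rand(1000, 1000)
array2 = np.random.rand(1000, 1000)

# Each array holds one million float64 values (about 8 MB)
print(array1.shape, array1.dtype)
```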

Here, array1 and array2 are two-dimensional arrays of random floats drawn uniformly from the interval [0, 1), each with a shape of 1000x1000.

Writing Arrays to .npz Files

Now that we have our arrays, let's save them to a .npz file. This file format is efficient, as it can store multiple arrays in a single file.

We use the np.savez function to achieve this:
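A minimal sketch of the save step. The file name large_arrays.npz is a hypothetical choice for illustration; any writable path works. The first lines recreate the arrays so the snippet runs on its own:

```python
import numpy as np

# Recreate the example arrays so this snippet is self-contained
array1 = np.random.rand(1000, 1000)
array2 = np.random.rand(1000, 1000)

# Hypothetical output path, chosen for illustration
npz_file_path = 'large_arrays.npz'

# Each keyword argument becomes the key under which that array is stored
np.savez(npz_file_path, array1=array1, array2=array2)
print(f"Arrays saved to {npz_file_path}")
```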

In this snippet:

np.savez is used to save the arrays into a .npz file.
npz_file_path is the location where the file will be saved.
array1=array1, array2=array2 passes each array as a keyword argument; the keyword becomes the key used to retrieve that array later.

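To save with compression, use np.savez_compressed. It reduces file size at the cost of extra time to compress on save and decompress on load. Random values like ours compress poorly, so the savings are larger on structured or repetitive data: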
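A sketch comparing the two save functions side by side. Both file names are hypothetical, and the printed sizes will vary with the data, so no specific numbers are claimed here:

```python
import os
import numpy as np

# Recreate the example arrays so this snippet is self-contained
array1 = np.random.rand(1000, 1000)
array2 = np.random.rand(1000, 1000)

# Hypothetical output paths, chosen for illustration
npz_file_path = 'large_arrays.npz'
compressed_npz_file_path = 'large_arrays_compressed.npz'

# Same call signature as np.savez; the archive members are compressed
np.savez(npz_file_path, array1=array1, array2=array2)
np.savez_compressed(compressed_npz_file_path, array1=array1, array2=array2)
print(f"Arrays saved to {npz_file_path}")
print(f"Compressed arrays saved to {compressed_npz_file_path}")

# Compare the on-disk sizes of the two files
print("uncompressed bytes:", os.path.getsize(npz_file_path))
print("compressed bytes:  ", os.path.getsize(compressed_npz_file_path))
```

Files written by either function are read back the same way, so the choice only affects disk usage and save/load time.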

After running this, you should see messages indicating the files have been saved.

Reading Arrays from .npz Files

Next, let’s read the saved arrays from the .npz file. For this, use the np.load function.
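A minimal sketch of the load step. The path large_arrays.npz and the names loaded_array1 and loaded_array2 are illustrative assumptions; the first lines recreate the file so the snippet runs on its own:

```python
import numpy as np

# Recreate the file so this snippet is self-contained (hypothetical path)
npz_file_path = 'large_arrays.npz'
np.savez(
    npz_file_path,
    array1=np.random.rand(1000, 1000),
    array2=np.random.rand(1000, 1000),
)

# The with block closes the underlying file once we are done reading
with np.load(npz_file_path) as data:
    # Indexing by key reads that array from disk into memory
    loaded_array1 = data['array1']
    loaded_array2 = data['array2']

print("Arrays loaded from", npz_file_path)
```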

Explanation:

np.load(npz_file_path) opens the .npz file.
Using with ensures that the file is properly closed after loading.
'array1' and 'array2' are the keys given when saving; data['array1'] and data['array2'] return the corresponding arrays.

An important feature of np.load is lazy loading. Loading an .npz file returns a dictionary-like object that does not read an array into memory until you access its key. This helps with very large datasets because you pay the memory cost only for the arrays you actually use. Because the data is read on access, retrieve the arrays you need inside the with block, before the file is closed.
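To make the lazy behavior concrete, here is a small sketch. The file name lazy_demo.npz and the variable names are hypothetical. It lists the stored keys through the files attribute, which does not read any array data, and then loads only one of the two arrays:

```python
import numpy as np

# Build a small demo file so this snippet is self-contained (hypothetical path)
npz_file_path = 'lazy_demo.npz'
np.savez(
    npz_file_path,
    array1=np.random.rand(1000, 1000),
    array2=np.random.rand(1000, 1000),
)

with np.load(npz_file_path) as data:
    # Listing the keys inspects the archive without reading any array data
    keys = data.files
    print("Stored keys:", keys)

    # Only array1 is read from disk here; array2 is never loaded
    first = data['array1']

# The array was materialized inside the with block, so it stays usable afterwards
print(first.shape)
```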

Once loaded, you can verify the arrays by checking their shapes:
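A sketch of the shape check, again assuming the hypothetical path large_arrays.npz and the illustrative names loaded_array1 and loaded_array2:

```python
import numpy as np

# Recreate and reload the file so this snippet is self-contained
npz_file_path = 'large_arrays.npz'
np.savez(
    npz_file_path,
    array1=np.random.rand(1000, 1000),
    array2=np.random.rand(1000, 1000),
)

with np.load(npz_file_path) as data:
    loaded_array1 = data['array1']
    loaded_array2 = data['array2']

# The shapes should match the arrays that were saved
print("Shape of array1:", loaded_array1.shape)
print("Shape of array2:", loaded_array2.shape)
```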

The output should confirm the shapes are unchanged, reporting (1000, 1000) for each array.

Practical Application and Summary

You've now learned how to efficiently store and retrieve large NumPy arrays using the .npz format. This technique is crucial in scenarios where you work with large datasets, such as image processing, scientific simulations, or machine learning, where saving and loading data efficiently can conserve both time and storage resources.

In summary:

We created large NumPy arrays.
We saved them to a single .npz file and also demonstrated how to save using compression.
We loaded the arrays back from the .npz file with their structure intact, and learned about lazy loading as a memory optimization.

Next, you'll have hands-on practice exercises to reinforce these concepts. Congratulations on reaching this point! These skills are fundamental as you continue exploring data handling techniques.
