Lesson 4
Reading Data in Batches from Multiple CSV Files
Introduction to Reading Data in Batches

In previous lessons, you learned how to handle datasets stored in compressed formats and manage large numerical datasets using NumPy arrays. Building on that foundation, today's lesson will teach you how to read and process data in batches from multiple CSV files. This matters because working with data in smaller chunks, or batches, keeps memory use manageable and lets you process datasets that would be awkward or impossible to handle as a single file.

Our focus in this lesson will be on a practical scenario where a dataset containing car information is spread across multiple files. You will learn to read, process, and analyze this data to extract meaningful insights, such as determining the car with the lowest price.

Understanding CSV Data Structure

In this lesson, we'll work with a set of CSV files containing car data. Here's what a typical record might look like:

  • Model: Ford Mustang
  • Transmission: Automatic
  • Year: 2020
  • Price: 25000.00
  • Distance Traveled (km): 50000
  • Color: Red

These files are divided into multiple parts to allow batch processing, and understanding their structure is crucial as you learn to read and process them efficiently.
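If you don't have such files at hand, a minimal sketch like the following can generate a few to experiment with. The rows other than the Ford Mustang record, the helper name write_sample_files, and the lowercase header names are all assumptions made for illustration; the files are written to a temporary directory so nothing is left behind.

```python
import csv
import os
import tempfile

# Column names matching the example record (the exact header spelling is assumed)
FIELDNAMES = ['model', 'transmission', 'year', 'price', 'distance_traveled', 'color']

# Invented rows for illustration only; each inner list becomes one file
SAMPLE_PARTS = [
    [['Ford Mustang', 'Automatic', 2020, '25000.00', 50000, 'Red']],
    [['Toyota Corolla', 'Manual', 2018, '14500.00', 82000, 'Blue']],
    [['Honda Civic', 'Automatic', 2019, '17200.00', 61000, 'White']],
]

def write_sample_files(directory):
    """Write one CSV file per part and return the list of file paths."""
    paths = []
    for index, rows in enumerate(SAMPLE_PARTS, start=1):
        path = os.path.join(directory, f'data_part{index}.csv')
        with open(path, 'w', newline='') as csv_file:
            writer = csv.writer(csv_file)
            writer.writerow(FIELDNAMES)  # header row
            writer.writerows(rows)       # data rows for this part
        paths.append(path)
    return paths

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_paths = write_sample_files(tmp_dir)
    # Read the first file back as text to inspect its structure
    with open(sample_paths[0], newline='') as csv_file:
        first_file_lines = csv_file.read().splitlines()

print(first_file_lines[0])  # header line
print(first_file_lines[1])  # first record
```

Every part carries its own header row, which is what allows each file to be read independently later on.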

Implementing Batch Reading of CSV Files

Now, let's delve into reading these CSV files in batches. We'll build our solution step-by-step.

First, we need to specify the filenames for our CSV files and prepare a data structure to hold the combined data.

Python
import csv

# Filenames to read
filenames = ['data_part1.csv', 'data_part2.csv', 'data_part3.csv']

# List to store all car data
car_data = []

Here, we initialize a list of filenames and create an empty list car_data to store all the car data read from the files.

Explanation of DictReader

The csv.DictReader class in Python's csv module is a helpful tool for reading CSV files into dictionaries, allowing for easier manipulation of data. When using csv.DictReader, each row in the CSV file is read as a dictionary, with the CSV headers as the dictionary keys and the corresponding data fields as values. This format facilitates accessing individual fields by name, simplifying data handling.

For example, consider a CSV file with the following content:

Plain text
model,transmission,year,price,distance_traveled,color
Ford Mustang,Automatic,2020,25000.00,50000,Red

Reading this file with csv.DictReader produces:

Python
{
    'model': 'Ford Mustang',
    'transmission': 'Automatic',
    'year': '2020',
    'price': '25000.00',
    'distance_traveled': '50000',
    'color': 'Red'
}

This feature is particularly useful when working with CSV data, as it allows for straightforward access to individual elements by their column names, improving code clarity and maintainability.
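To try this without creating a file, here is a small sketch that feeds the same two lines to csv.DictReader through io.StringIO, which behaves like an open text file. The variable names are illustrative only.

```python
import csv
import io

# The same header and record as in the example, held in memory
sample_text = (
    "model,transmission,year,price,distance_traveled,color\n"
    "Ford Mustang,Automatic,2020,25000.00,50000,Red\n"
)

reader = csv.DictReader(io.StringIO(sample_text))

# The header row supplies the dictionary keys
print(reader.fieldnames)

rows = list(reader)
first = rows[0]

# Fields are accessed by column name rather than by position
print(first['model'], first['price'])

# Every value is read as a string, including numeric-looking ones
print(type(first['year']).__name__)
```

Note that DictReader does no type conversion: 'year' and 'price' arrive as strings, so any numeric comparison needs an explicit conversion first.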

Read Data from Each File

Now, we'll loop through each filename, read the data, and append it to our car_data list.

Python
for filename in filenames:
    with open(filename, newline='') as csv_file:
        reader = csv.DictReader(csv_file)
        for row in reader:
            # Convert price from string to float for comparison
            row['price'] = float(row['price'])
            car_data.append(row)

In this snippet:

  • We use a for loop to iterate over our list of filenames.
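  • For each file, we open it using with open(filename, newline=''), ensuring it is closed properly after processing. Passing newline='' is what the csv module documentation recommends, so that the reader handles line endings itself.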
  • csv.DictReader(csv_file) reads each line into a dictionary, which makes accessing fields by name straightforward.
  • We convert the price field from a string to a float for numerical comparison purposes and then store the entire row in car_data.
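The loop above converts only price and assumes every value is well formed. As a hedged extension, the sketch below converts the other numeric columns as well and skips rows that cannot be parsed. The helper name parse_car and the malformed sample row are assumptions for illustration, and skipping bad rows is only one possible policy.

```python
import csv
import io

def parse_car(row):
    """Return a copy of row with numeric fields converted, or None if any conversion fails."""
    try:
        return {
            **row,
            'year': int(row['year']),
            'price': float(row['price']),
            'distance_traveled': int(row['distance_traveled']),
        }
    except (KeyError, ValueError, TypeError):
        # KeyError: column missing; ValueError: not a number;
        # TypeError: DictReader filled a short row with None
        return None

# In-memory sample with one valid row and one invented malformed row
sample_text = (
    "model,transmission,year,price,distance_traveled,color\n"
    "Ford Mustang,Automatic,2020,25000.00,50000,Red\n"
    "Broken Row,Manual,2019,not-a-number,1000,Blue\n"
)

parsed = [parse_car(row) for row in csv.DictReader(io.StringIO(sample_text))]
valid_cars = [car for car in parsed if car is not None]

print(len(valid_cars))
print(valid_cars[0]['year'], valid_cars[0]['price'])
```

Whether to skip, log, or fail on a bad row depends on the dataset; the point here is only that the conversion step is the natural place to make that decision.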

Finding the Car with the Lowest Price

With all data combined in car_data, the next step is identifying the car with the lowest price.

Python
# Find the car with the lowest price
lowest_cost_car = min(car_data, key=lambda car: car['price'])
print(f"Model: {lowest_cost_car['model']}")
print(f"Price: ${lowest_cost_car['price']:.2f}")

Here:

  • We use Python’s min() function with a key argument to find the car with the lowest price in car_data.
  • A lambda function is used to specify that the function should look at the price field while evaluating each dictionary in the list.
  • We then print the model and price of the car with the lowest price, providing a clear output.
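Two edge cases are worth knowing about: min() raises a ValueError on an empty list unless a default is given, and when several cars share the lowest price it returns the first one encountered. The sketch below illustrates both with invented data.

```python
# Invented records for illustration; two cars tie for the lowest price
car_data = [
    {'model': 'Ford Mustang', 'price': 25000.0},
    {'model': 'Toyota Corolla', 'price': 14500.0},
    {'model': 'Honda Civic', 'price': 14500.0},
]

# On a tie, min() returns the first minimal item in iteration order
cheapest = min(car_data, key=lambda car: car['price'], default=None)
print(cheapest['model'])

# With default=None, an empty list yields None instead of raising ValueError
nothing = min([], key=lambda car: car['price'], default=None)
print(nothing)
```

Passing default=None is useful when the files might contain no rows at all, since the result can then be checked before printing.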

Streaming Approach: Finding the Car with the Lowest Price Without Loading All Data into Memory

While the previous solution loads all data into memory, an alternative approach involves streaming data, processing each record as it's read. This is beneficial for systems with limited memory or when working with extremely large datasets. Below, you'll find the implementation of this streaming approach:

Python
import csv

# Filenames to read
filenames = ['data_part1.csv', 'data_part2.csv', 'data_part3.csv']

# Initialize variables to keep track of the lowest price car
lowest_cost_car = None
lowest_price = float('inf')

# Process each file one by one
for filename in filenames:
    with open(filename, newline='') as csv_file:
        reader = csv.DictReader(csv_file)
        for row in reader:
            # Convert price from string to float for comparison
            price = float(row['price'])
            if price < lowest_price:
                lowest_price = price
                lowest_cost_car = row

# Output the car with the lowest price
if lowest_cost_car is not None:
    print(f"Model: {lowest_cost_car['model']}")
    # row['price'] is still a string here, so format the converted value instead
    print(f"Price: ${lowest_price:.2f}")

In this implementation:

  • We maintain two variables: lowest_cost_car to store the data of the car with the lowest price and lowest_price initialized to infinity for comparison purposes.
  • As we stream through each CSV file, we convert the price field and compare it with lowest_price.
  • If the current record's price is lower, we update lowest_price and store the entire row in lowest_cost_car.
  • This approach processes each line individually and does not accumulate the data, effectively reducing the application's memory footprint.
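The same streaming idea can also be expressed with a generator, which lets min() consume rows one at a time without building a list. This is a sketch rather than the lesson's own method: the function name iter_cars is hypothetical, and the two small files written to a temporary directory contain invented data so the example runs on its own.

```python
import csv
import os
import tempfile

def iter_cars(filenames):
    """Yield one row dictionary at a time across all files, with price as a float."""
    for filename in filenames:
        with open(filename, newline='') as csv_file:
            for row in csv.DictReader(csv_file):
                row['price'] = float(row['price'])
                yield row

# Invented two-column files, just enough to demonstrate the generator
demo_parts = [
    "model,price\nFord Mustang,25000.00\n",
    "model,price\nToyota Corolla,14500.00\n",
]

with tempfile.TemporaryDirectory() as tmp_dir:
    paths = []
    for index, text in enumerate(demo_parts, start=1):
        path = os.path.join(tmp_dir, f'data_part{index}.csv')
        with open(path, 'w', newline='') as csv_file:
            csv_file.write(text)
        paths.append(path)

    # min() pulls rows lazily and keeps only the current best in memory
    cheapest = min(iter_cars(paths), key=lambda car: car['price'], default=None)

if cheapest is not None:
    print(f"Model: {cheapest['model']}")
    print(f"Price: ${cheapest['price']:.2f}")
```

Because the generator converts price before yielding, the stored row already holds a float and can be formatted directly.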

Summary and Practice Preparation

In this lesson, you learned how to:

  • Read data in batches from multiple CSV files using Python's csv.DictReader.
  • Process that data efficiently and convert data types when necessary.
  • Identify specific insights, such as the car with the lowest price, by using the min() function with a key argument.

Now, you're ready to apply these skills with practice exercises designed to reinforce your understanding. These exercises will challenge you to read and analyze data from similar datasets efficiently. Continuous practice is key to mastering these data handling techniques.
