In previous lessons, you learned how to handle datasets stored in compressed formats and manage large numerical datasets using NumPy arrays. Building on that foundation, today's lesson will teach you how to read and process data in batches from multiple CSV files. This matters because working with data in smaller chunks, or batches, keeps memory usage manageable and makes your code more efficient when dealing with large datasets.
Our focus in this lesson will be on a practical scenario where a dataset containing car information is spread across multiple files. You will learn to read, process, and analyze this data to extract meaningful insights, such as determining the car with the lowest price.
In this lesson, we'll work with a set of CSV files containing car data. Here's what a typical record might look like:
- Model: Ford Mustang
- Transmission: Automatic
- Year: 2020
- Price: 25000.00
- Distance Traveled (km): 50000
- Color: Red
The dataset is split across multiple files to allow batch processing, and understanding this structure is crucial as you learn to read and process the data efficiently.
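For example, each part might look something like this, with every file sharing the same header row (the filenames match the ones used later in this lesson; the ellipses stand in for rows not shown here):

```text
data_part1.csv
model,transmission,year,price,distance_traveled,color
Ford Mustang,Automatic,2020,25000.00,50000,Red
...

data_part2.csv
model,transmission,year,price,distance_traveled,color
...
```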
Now, let's delve into reading these CSV files in batches. We'll build our solution step-by-step.
First, we need to specify the filenames for our CSV files and prepare a data structure to hold the combined data.
```python
import csv

# Filenames to read
filenames = ['data_part1.csv', 'data_part2.csv', 'data_part3.csv']

# List to store all car data
car_data = []
```
Here, we initialize a list of filenames and create an empty list `car_data` to store all the car data read from the files.
The `csv.DictReader` class in Python's `csv` module is a helpful tool for reading CSV files into dictionaries, allowing for easier manipulation of data. When using `csv.DictReader`, each row in the CSV file is read as a dictionary, with the CSV headers as the dictionary keys and the corresponding data fields as values. This format makes it easy to access individual fields by name, simplifying data handling.
For example, consider a CSV file with the following content:
```text
model,transmission,year,price,distance_traveled,color
Ford Mustang,Automatic,2020,25000.00,50000,Red
```
Reading this file with `csv.DictReader` produces:
```python
{
    'model': 'Ford Mustang',
    'transmission': 'Automatic',
    'year': '2020',
    'price': '25000.00',
    'distance_traveled': '50000',
    'color': 'Red'
}
```
This feature is particularly useful when working with CSV data, as it allows for straightforward access to individual elements by their column names, improving code clarity and maintainability.
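For instance, here is a minimal, self-contained sketch using an in-memory string in place of a file (the `sample` variable is purely illustrative); note that `csv.DictReader` reads every value as a string:

```python
import csv
import io

# A tiny in-memory CSV; real code would read from files on disk
sample = "model,price\nFord Mustang,25000.00\n"

for row in csv.DictReader(io.StringIO(sample)):
    print(row['model'])  # Ford Mustang
    print(row['price'])  # '25000.00' -- still a string until we convert it
```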
Now, we'll loop through each filename, read the data, and append it to our `car_data` list.
```python
for filename in filenames:
    with open(filename, newline='') as csv_file:
        reader = csv.DictReader(csv_file)
        for row in reader:
            # Convert price from string to float for comparison
            row['price'] = float(row['price'])
            car_data.append(row)
```
In this snippet:

- We use a `for` loop to iterate over our list of filenames.
- For each file, we open it using `with open(filename)`, ensuring it is closed properly after processing.
- `csv.DictReader(csv_file)` reads each line into a dictionary, which makes accessing fields by name straightforward.
- We convert the `price` field from a string to a float for numerical comparison purposes and then store the entire row in `car_data`. (An optional sketch after this list shows how you might convert the other numeric columns as well.)
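Depending on the analysis you plan to run, you may also want to convert the other numeric columns while reading. The sketch below extends the same loop with optional conversions for `year` and `distance_traveled`; only the `price` conversion is actually required for the comparison in this lesson:

```python
import csv

filenames = ['data_part1.csv', 'data_part2.csv', 'data_part3.csv']
car_data = []

for filename in filenames:
    with open(filename, newline='') as csv_file:
        for row in csv.DictReader(csv_file):
            # DictReader yields every value as a string, so convert the numeric columns
            row['price'] = float(row['price'])
            row['year'] = int(row['year'])                             # optional
            row['distance_traveled'] = int(row['distance_traveled'])  # optional
            car_data.append(row)
```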
With all the data combined in `car_data`, the next step is identifying the car with the lowest price.
```python
# Find the car with the lowest price
lowest_cost_car = min(car_data, key=lambda car: car['price'])
print(f"Model: {lowest_cost_car['model']}")
print(f"Price: ${lowest_cost_car['price']:.2f}")
```
Here:

- We use Python's `min()` function with a `key` argument to find the car with the lowest price in `car_data`.
- A `lambda` function specifies that `min()` should look at the `price` field while evaluating each dictionary in the list. (An equivalent alternative using `operator.itemgetter` is sketched after this list.)
- We then print the model and price of the car with the lowest price, providing a clear output.
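If you prefer to avoid a `lambda`, the standard library's `operator.itemgetter` can serve as the key function instead. This is simply an equivalent alternative, assuming `car_data` has been built as shown earlier:

```python
from operator import itemgetter

# Equivalent to key=lambda car: car['price']; assumes car_data is already populated
lowest_cost_car = min(car_data, key=itemgetter('price'))
print(f"Model: {lowest_cost_car['model']}")
print(f"Price: ${lowest_cost_car['price']:.2f}")
```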
While the previous solution loads all data into memory, an alternative approach involves streaming data, processing each record as it's read. This is beneficial for systems with limited memory or when working with extremely large datasets. Below, you'll find the implementation of this streaming approach:
```python
import csv

# Filenames to read
filenames = ['data_part1.csv', 'data_part2.csv', 'data_part3.csv']

# Initialize variables to keep track of the lowest-priced car
lowest_cost_car = None
lowest_price = float('inf')

# Process each file one by one
for filename in filenames:
    with open(filename, newline='') as csv_file:
        reader = csv.DictReader(csv_file)
        for row in reader:
            # Convert price from string to float for comparison
            price = float(row['price'])
            if price < lowest_price:
                lowest_price = price
                lowest_cost_car = row

# Output the car with the lowest price
# (use lowest_price here, since the stored row still holds the price as a string)
if lowest_cost_car:
    print(f"Model: {lowest_cost_car['model']}")
    print(f"Price: ${lowest_price:.2f}")
```
In this implementation:

- We maintain two variables: `lowest_cost_car` to store the data of the car with the lowest price, and `lowest_price`, initialized to infinity for comparison purposes.
- As we stream through each CSV file, we convert the `price` field and compare it with `lowest_price`.
- If the current record's price is lower, we update `lowest_price` and store the entire row in `lowest_cost_car`.
- This approach processes each line individually and does not accumulate the data, effectively reducing the application's memory footprint. (A generator-based sketch of the same idea follows this list.)
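If you are comfortable with generators, the same streaming idea can be expressed by yielding rows one at a time and letting `min()` consume them lazily. This is a sketch of an equivalent approach, not part of the lesson's reference solution:

```python
import csv

filenames = ['data_part1.csv', 'data_part2.csv', 'data_part3.csv']

def stream_rows(paths):
    """Yield one row dictionary at a time, never holding a whole file in memory."""
    for path in paths:
        with open(path, newline='') as csv_file:
            for row in csv.DictReader(csv_file):
                row['price'] = float(row['price'])
                yield row

# Note: min() raises ValueError if the files contain no rows at all
lowest_cost_car = min(stream_rows(filenames), key=lambda car: car['price'])
print(f"Model: {lowest_cost_car['model']}")
print(f"Price: ${lowest_cost_car['price']:.2f}")
```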
In this lesson, you learned how to:

- Read data in batches from multiple CSV files using Python's `csv.DictReader`.
- Process that data efficiently and convert data types when necessary.
- Identify specific insights, such as the car with the lowest price, using the `min()` function with a `key` argument.
Now, you're ready to apply these skills with practice exercises designed to reinforce your understanding. These exercises will challenge you to read and analyze data from similar datasets efficiently. Continuous practice is key to mastering these data handling techniques.