In previous lessons, you learned how to handle datasets stored in compressed JSON format and how to manage large numerical datasets in Parquet files using the Arrow package. Building on that foundation, today's lesson will teach you how to read and process data in batches from multiple CSV files in R. This matters because working with data in smaller chunks, or batches, can make your code faster and less memory-hungry when dealing with large datasets.
Our focus in this lesson will be on a practical scenario where a dataset containing car information is spread across multiple files. You will learn to read, process, and analyze this data to extract meaningful insights, such as determining the car with the lowest price.
In this lesson, we'll work with a set of CSV files containing car data. Here's what a typical record might look like:
- Model: Ford Mustang
- Transmission: Automatic
- Year: 2020
- Price: 25000.00
- Distance Traveled (km): 50000
- Color: Red
These files are divided into multiple parts to allow batch processing, and understanding their structure is crucial as you learn to read and process them efficiently.
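If you'd like to follow along, a small setup script can generate the three CSV parts used throughout this lesson. The filenames (`data_part1.csv` through `data_part3.csv`) match the code below, but the specific car rows and column names here are made-up sample data chosen to mirror the record shown above; adjust them to your own dataset.

```r
# Hypothetical setup: build a small data frame matching the record structure
# above, then split it into three CSV parts for batch processing.
cars <- data.frame(
  model = c("Ford Mustang", "Toyota Corolla", "Honda Civic",
            "Tesla Model 3", "BMW 3 Series", "Mazda MX-5"),
  transmission = c("Automatic", "Manual", "Manual",
                   "Automatic", "Automatic", "Manual"),
  year = c(2020, 2018, 2019, 2021, 2017, 2016),
  price = c(25000.00, 15000.50, 17250.00, 39990.00, 21000.00, 12500.75),
  distance_km = c(50000, 80000, 65000, 12000, 90000, 110000),
  color = c("Red", "White", "Blue", "Black", "Grey", "Red")
)

# Split the six rows into three parts of two rows each and write one CSV per part
parts <- split(cars, rep(1:3, each = 2))
for (i in seq_along(parts)) {
  write.csv(parts[[i]], sprintf("data_part%d.csv", i), row.names = FALSE)
}
```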
Now, let's delve into reading these CSV files in batches. We'll build our solution step by step.
First, we need to specify the filenames for our CSV files and prepare a data structure to hold the combined data.
```r
# Filenames to read
filenames <- c("data_part1.csv", "data_part2.csv", "data_part3.csv")

# List to store all car data
car_data <- list()
```
Here, we initialize a vector of filenames for the CSV files we want to read. We use a list (`car_data`) to store the data because we will be appending individual data frames to it, and lists are more flexible for such operations.
Now, we'll loop through each filename, read the data, and append it to our `car_data` list.
```r
for (filename in filenames) {
  data <- read.csv(filename)
  car_data <- append(car_data, list(data))
}
```
In this snippet:
- We use a `for` loop to iterate over our vector of filenames.
- For each file, `read.csv(filename)` reads the data into a data frame, which is appended to the `car_data` list.
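As an aside, the same list can be built in a single call with `lapply`, the more idiomatic R pattern for "apply a function to each element". The sketch below writes two throwaway CSV parts to a temporary directory so it runs standalone; those filenames are hypothetical stand-ins for `data_part*.csv`.

```r
# Write two tiny CSV parts to a temp directory (stand-ins for data_part*.csv)
f1 <- file.path(tempdir(), "part1.csv")
f2 <- file.path(tempdir(), "part2.csv")
write.csv(data.frame(model = "Ford Mustang", price = 25000), f1, row.names = FALSE)
write.csv(data.frame(model = "Honda Civic", price = 17250), f2, row.names = FALSE)

# lapply applies read.csv to each filename and returns the list of
# data frames in one call, equivalent to the for loop with append()
car_data <- lapply(c(f1, f2), read.csv)
length(car_data)  # 2
```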
With all the data combined in `car_data`, the next step is to identify the car with the lowest price.
```r
# Combine the list of data frames into a single data frame
car_data <- do.call(rbind, car_data)

# Convert the price column to numeric
car_data$price <- as.numeric(car_data$price)

# Find the car with the lowest price
lowest_cost_car <- car_data[which.min(car_data$price), ]

cat(sprintf("Model: %s\n", lowest_cost_car$model))
cat(sprintf("Price: $%.2f\n", lowest_cost_car$price))
```
Here:
- We combine the list of data frames into a single data frame using `do.call(rbind, car_data)`.
- The `price` column is converted to numeric for accurate comparison.
- We use `which.min()` to find the car with the lowest price. This function returns the index of the smallest element in a numeric vector.
- Finally, we print the model and price of the car with the lowest price, providing a clear output.
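To see `which.min()` in isolation, here is a minimal standalone example (the prices are made up). Note that `which.min()` returns the index of the smallest value, not the value itself, and it skips `NA` entries such as those produced by a failed `as.numeric()` coercion.

```r
prices <- c(25000.00, 15000.50, 17250.00)

# which.min() returns the position of the smallest element
idx <- which.min(prices)
idx          # 2
prices[idx]  # 15000.5
```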
While the previous solution loads all the data into memory at once, an alternative approach streams through the files, processing each batch as it is read and keeping only a running result. This is beneficial on systems with limited memory or when working with extremely large datasets. Below is the implementation of this streaming approach:
```r
# Filenames to read
filenames <- c("data_part1.csv", "data_part2.csv", "data_part3.csv")

# Initialize variables to keep track of the lowest-priced car
lowest_cost_car <- NULL
lowest_price <- Inf

# Process each file one by one
for (filename in filenames) {
  data <- read.csv(filename)
  data$price <- as.numeric(data$price)
  current_lowest <- data[which.min(data$price), ]

  if (is.null(lowest_cost_car) || current_lowest$price < lowest_price) {
    lowest_price <- current_lowest$price
    lowest_cost_car <- current_lowest
  }
}

# Output the car with the lowest price
if (!is.null(lowest_cost_car)) {
  cat(sprintf("Model: %s\n", lowest_cost_car$model))
  cat(sprintf("Price: $%.2f\n", lowest_cost_car$price))
}
```
In this implementation:
- We maintain two variables: `lowest_cost_car` to store the row for the cheapest car found so far, and `lowest_price`, initialized to infinity for comparison purposes.
- As we stream through each CSV file, we convert the `price` field to numeric and find the lowest price in the current batch.
- If that batch's lowest price is lower than our stored lowest price, we update `lowest_price` and store the entire row in `lowest_cost_car`.
- This approach processes each batch individually and does not accumulate the data, effectively reducing the application's memory footprint.
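One caveat: `read.csv(filename)` still loads each whole file before scanning it. If even a single file is too large for memory, you can go one level deeper and read each file a few rows at a time over an open connection. The helper below is a base-R sketch of that idea; `stream_min_price` and its `chunk_size` parameter are hypothetical names, and real code would use a chunk size in the thousands rather than 2.

```r
# Hypothetical helper: find the cheapest row in one CSV while holding only
# `chunk_size` rows in memory at a time, using an open connection.
stream_min_price <- function(path, chunk_size = 2) {
  con <- file(path, open = "r")
  on.exit(close(con))
  # Read the header line once; strip the quotes that write.csv adds
  header <- gsub('"', "", strsplit(readLines(con, n = 1), ",")[[1]])
  best <- NULL
  repeat {
    # read.csv on an already-open connection resumes where the previous call
    # stopped; it errors when no lines remain, which we treat as end-of-file
    chunk <- tryCatch(
      read.csv(con, header = FALSE, col.names = header, nrows = chunk_size),
      error = function(e) NULL
    )
    if (is.null(chunk) || nrow(chunk) == 0) break
    chunk$price <- as.numeric(chunk$price)
    cand <- chunk[which.min(chunk$price), ]
    if (is.null(best) || cand$price < best$price) best <- cand
  }
  best
}

# Demo on a temporary file
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(model = c("A", "B", "C"), price = c(300, 100, 200)),
          tmp, row.names = FALSE)
stream_min_price(tmp)$price  # 100
```

Packages can also provide this pattern ready-made (for example, readr offers chunked CSV reading with a callback), but the base-R version shows the mechanics without extra dependencies.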
In this lesson, you learned how to:
- Read data in batches from multiple CSV files using R's `read.csv()`.
- Process that data efficiently and convert data types when necessary.
- Identify specific insights, such as the car with the lowest price, using `which.min()`.
Now, you're ready to apply these skills with practice exercises designed to reinforce your understanding. These exercises will challenge you to read and analyze data from similar datasets efficiently. Continuous practice is key to mastering these data handling techniques.