Lesson 4
Reading Data in Batches from Multiple CSV Files
Introduction to Reading Data in Batches

In previous lessons, you learned how to handle datasets stored in compressed JSON formats and how to manage large numerical datasets using the Arrow package with Parquet files. Building on that foundation, today's lesson will teach you how to read and process data in batches from multiple CSV files in R. This matters because working with data in smaller chunks, or batches, keeps memory usage low and makes your code more efficient when dealing with large datasets.

Our focus in this lesson will be on a practical scenario where a dataset containing car information is spread across multiple files. You will learn to read, process, and analyze this data to extract meaningful insights, such as determining the car with the lowest price.

Understanding CSV Data Structure

In this lesson, we'll work with a set of CSV files containing car data. Here's what a typical record might look like:

  • Model: Ford Mustang
  • Transmission: Automatic
  • Year: 2020
  • Price: 25000.00
  • Distance Traveled (km): 50000
  • Color: Red

These files are divided into multiple parts to allow batch processing, and understanding their structure is crucial as you learn to read and process them efficiently.
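If you'd like to follow along locally, you can create sample files with this structure yourself. The values below are hypothetical, chosen only to mirror the record layout shown above:

```r
# Hypothetical sample data mirroring the record structure above
cars <- data.frame(
  model = c("Ford Mustang", "Toyota Corolla", "Honda Civic"),
  transmission = c("Automatic", "Manual", "Automatic"),
  year = c(2020, 2019, 2021),
  price = c(25000.00, 18000.00, 22000.00),
  distance_km = c(50000, 30000, 15000),
  color = c("Red", "Blue", "Black")
)

# Split the rows across three files, as in the lesson's scenario
write.csv(cars[1, ], "data_part1.csv", row.names = FALSE)
write.csv(cars[2, ], "data_part2.csv", row.names = FALSE)
write.csv(cars[3, ], "data_part3.csv", row.names = FALSE)
```

With these three files on disk, every snippet in the rest of the lesson can be run as-is.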

Implementing Batch Reading of CSV Files

Now, let's delve into reading these CSV files in batches. We'll build our solution step by step.

First, we need to specify the filenames for our CSV files and prepare a data structure to hold the combined data.

R
# Filenames to read
filenames <- c("data_part1.csv", "data_part2.csv", "data_part3.csv")

# List to store all car data
car_data <- list()

Here, we initialize a vector of filenames for the CSV files we want to read. We use a list (car_data) to store the data because we will be appending individual data frames to it, and lists are more flexible for such operations.

Read Data from Each File

Now, we'll loop through each filename, read the data, and append it to our car_data list.

R
for (filename in filenames) {
  data <- read.csv(filename)
  car_data <- append(car_data, list(data))
}

In this snippet:

  • We use a for loop to iterate over our vector of filenames.
  • For each file, we use read.csv(filename) to read the data into a dataframe, which is appended to the car_data list.
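As a side note, the same read-and-collect step can be written more compactly with lapply(), which builds the list directly. Here is a self-contained sketch (the setup lines only generate illustrative files so the example runs on its own):

```r
# Illustrative setup: write three tiny CSV parts so the example is runnable
for (i in 1:3) {
  write.csv(
    data.frame(model = paste("Car", i), price = 10000 * i),
    sprintf("data_part%d.csv", i),
    row.names = FALSE
  )
}

# lapply applies read.csv to each filename and returns a list of data frames,
# replacing the explicit for loop and append() calls
filenames <- c("data_part1.csv", "data_part2.csv", "data_part3.csv")
car_data <- lapply(filenames, read.csv)

length(car_data)  # one data frame per file
```

Both styles produce the same list; lapply() simply avoids growing the list one element at a time.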

Finding the Car with the Lowest Price

With all data combined in car_data, the next step is to identify the car with the lowest price.

R
# Combine data into a single dataframe
car_data <- do.call(rbind, car_data)

# Convert price column to numeric
car_data$price <- as.numeric(car_data$price)

# Find the car with the lowest price
lowest_cost_car <- car_data[which.min(car_data$price), ]

cat(sprintf("Model: %s\n", lowest_cost_car$model))
cat(sprintf("Price: $%.2f\n", lowest_cost_car$price))

Here:

  • We combine the list of dataframes into a single dataframe using do.call(rbind, car_data).
  • The price column is converted to numeric for accurate comparison.
  • We use which.min() to find the car with the lowest price. This function returns the index of the smallest element in a numeric vector (the first such index, if there are ties).
  • Finally, we print the model and price of the car with the lowest price, providing a clear output.
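To see what which.min() does on its own, here is a tiny illustration with a made-up price vector:

```r
prices <- c(25000, 18000, 22000)

# which.min returns the index of the smallest element, not the value
which.min(prices)          # 2

# Indexing with that result retrieves the smallest value itself
prices[which.min(prices)]  # 18000
```

Because it returns an index, it can be used to pull out the entire matching row of a data frame, as done above with car_data[which.min(car_data$price), ].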

Streaming Approach: Without Loading All Data into Memory

While the previous solution loads all data into memory, an alternative approach involves streaming data, processing each record as it's read. This is beneficial for systems with limited memory or when working with extremely large datasets. Below, you'll find the implementation of this streaming approach:

R
# Filenames to read
filenames <- c("data_part1.csv", "data_part2.csv", "data_part3.csv")

# Initialize variables to keep track of the lowest price car
lowest_cost_car <- NULL
lowest_price <- Inf

# Process each file one by one
for (filename in filenames) {
  data <- read.csv(filename)
  data$price <- as.numeric(data$price)
  current_lowest <- data[which.min(data$price), ]

  if (is.null(lowest_cost_car) || current_lowest$price < lowest_price) {
    lowest_price <- current_lowest$price
    lowest_cost_car <- current_lowest
  }
}

# Output the car with the lowest price
if (!is.null(lowest_cost_car)) {
  cat(sprintf("Model: %s\n", lowest_cost_car$model))
  cat(sprintf("Price: $%.2f\n", lowest_cost_car$price))
}

In this implementation:

  • We maintain two variables: lowest_cost_car to store the data of the car with the lowest price and lowest_price initialized to infinity for comparison purposes.
  • As we stream through each CSV file, we convert the price field to numeric and find the lowest price in the current batch.
  • If the current batch's lowest price is lower than our stored lowest price, we update lowest_price and store the entire row in lowest_cost_car.
  • This approach processes each batch individually and does not accumulate the data, effectively reducing the application's memory footprint.
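The same idea can be pushed further: even a single large file can be streamed a few rows at a time over a connection, so no one file ever has to fit in memory whole. Below is a base-R sketch of that pattern; the temporary sample file and chunk_size are purely illustrative:

```r
# Illustrative setup: write one larger CSV to a temporary file
tmp <- tempfile(fileext = ".csv")
write.table(
  data.frame(model = paste0("Car", 1:10),
             price = c(9000, 8000, 7000, 6000, 5000,
                       4000, 3000, 2000, 1000, 500)),
  tmp, sep = ",", quote = FALSE, row.names = FALSE
)

chunk_size <- 4
con <- file(tmp, "r")
header <- strsplit(readLines(con, n = 1), ",")[[1]]  # read column names once

lowest_price <- Inf
lowest_model <- NULL
repeat {
  # Each read.csv call on the open connection continues where the last stopped
  chunk <- tryCatch(
    read.csv(con, nrows = chunk_size, header = FALSE, col.names = header),
    error = function(e) NULL  # read.csv errors when no lines remain
  )
  if (is.null(chunk) || nrow(chunk) == 0) break

  i <- which.min(chunk$price)
  if (chunk$price[i] < lowest_price) {
    lowest_price <- chunk$price[i]
    lowest_model <- chunk$model[i]
  }
}
close(con)

cat(sprintf("Model: %s, Price: $%.2f\n", lowest_model, lowest_price))
```

At any moment, only chunk_size rows are held in memory, which is the same trade-off the file-by-file loop makes, applied at a finer granularity.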

Summary and Practice Preparation

In this lesson, you learned how to:

  • Read data in batches from multiple CSV files using R's read.csv().
  • Process that data efficiently and convert data types when necessary.
  • Identify specific insights, such as the car with the lowest price, by using which.min().

Now, you're ready to apply these skills with practice exercises designed to reinforce your understanding. These exercises will challenge you to read and analyze data from similar datasets efficiently. Continuous practice is key to mastering these data handling techniques.
