Lesson 1
Managing Data from Compressed Datasets in R
Introduction to Managing Compressed Data

Welcome to the first lesson of our course on managing data from different datasets. In today's digital world, it's common to encounter large volumes of data. Understanding how to efficiently manage this data, especially when it's compressed, is crucial. This lesson will focus on handling JSON files contained within a zip archive. By the end, you'll be able to extract, read, and process data stored in compressed formats, building a strong foundation for handling real-world datasets.

Recall: Essentials of JSON and File I/O

Before we dive into zip files, let's briefly recall some essentials about JSON and file I/O operations. JSON, or JavaScript Object Notation, is a lightweight data interchange format. It's easy for humans to read and write and easy for machines to parse and generate. In R, we interact with JSON data using the jsonlite package and its fromJSON and toJSON functions.
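
As a quick refresher, the sketch below round-trips a small data frame through JSON text with toJSON() and fromJSON(). The star names and masses are made-up sample values, not taken from the course dataset.

```r
library(jsonlite)

# Hypothetical sample records (not from the course dataset)
stars <- data.frame(
  name = c("Alpha", "Beta"),
  mass = c(1.5, 2.25),
  stringsAsFactors = FALSE
)

# toJSON() serializes the data frame as a JSON array of objects
json_text <- toJSON(stars)
print(json_text)

# fromJSON() parses the JSON text back into a data frame
restored <- fromJSON(json_text)
print(restored)
```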

Working with Zip Files Using the Base R Functions

A zip file is a compressed archive format that bundles many files into one. In R, you can work with zip archives using the built-in unzip() function (from the utils package, which is loaded by default). It can either extract an archive's contents to a directory or list them.

Here's how you open a zip file and extract its contents using the unzip() function:

R
zip_file_name <- "universe_data.zip"
extract_dir <- tempdir()  # Use a temporary directory for extraction
unzip(zip_file_name, exdir = extract_dir)
print(paste("Zip File Extracted to:", extract_dir))

In this example:

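  • tempdir() returns the path of the session's temporary directory, so the extracted files do not clutter the working directory.
  • unzip(zip_file_name, exdir = extract_dir) extracts the zip file's contents into that directory.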
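
To try extraction without the course archive, the following sketch builds a small throwaway zip file first and then extracts it. The file names and contents are hypothetical, and it assumes a system zip program is available on the PATH, because utils::zip() delegates to one.

```r
# Build a small throwaway archive so the sketch is self-contained.
# File names and contents are hypothetical; utils::zip() needs a system
# "zip" program on the PATH.
src_dir <- file.path(tempdir(), "zip_demo_src")
dir.create(src_dir, showWarnings = FALSE)
writeLines('[{"name": "Alpha"}]', file.path(src_dir, "stars.json"))
writeLines("hello", file.path(src_dir, "data.txt"))

zip_path <- tempfile(fileext = ".zip")
# "-j" stores bare file names (no directories), "-q" suppresses output
zip(zip_path, files = list.files(src_dir, full.names = TRUE), flags = "-jq")

# Extract into a dedicated subdirectory; unzip() creates it if needed
extract_dir <- file.path(tempdir(), "zip_demo_out")
extracted <- unzip(zip_path, exdir = extract_dir)

# unzip() returns the paths it wrote, which makes the result easy to verify
print(basename(extracted))
print(all(file.exists(extracted)))
```

Extracting into a dedicated subdirectory rather than tempdir() itself keeps the archive's files separate from other temporary files in the session.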

Listing File Contents

You can also inspect an archive without extracting anything by calling unzip() with the list = TRUE argument:

R
file_list <- unzip(zip_file_name, list = TRUE)
print("Files in the archive:")
print(file_list)

Calling unzip() with list = TRUE lists the contents of the zip file without extracting them. The returned data frame (file_list) has one row per file, with the columns Name, Length (the uncompressed size in bytes), and Date (the modification time). Here is a sample representation of its structure:

Name          Length   Date
stars.json    1024     2025-01-01
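
Because the listing is an ordinary data frame, its Name column can be filtered and passed to the files argument of unzip() to extract only selected entries. The sketch below assumes a throwaway archive with hypothetical file names and a system zip program on the PATH for utils::zip().

```r
# Self-contained setup: a throwaway archive with hypothetical files.
# utils::zip() needs a system "zip" program on the PATH.
src_dir <- file.path(tempdir(), "listing_demo_src")
dir.create(src_dir, showWarnings = FALSE)
writeLines('[{"name": "Alpha"}]', file.path(src_dir, "stars.json"))
writeLines("notes", file.path(src_dir, "readme.txt"))
zip_path <- tempfile(fileext = ".zip")
zip(zip_path, files = list.files(src_dir, full.names = TRUE), flags = "-jq")

# List the archive and keep only the entries whose names end in .json
file_list <- unzip(zip_path, list = TRUE)
json_names <- file_list$Name[grepl("\\.json$", file_list$Name)]
print(json_names)

# Extract just those entries instead of the whole archive
out_dir <- file.path(tempdir(), "listing_demo_out")
unzip(zip_path, files = json_names, exdir = out_dir)
print(list.files(out_dir))
```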

Before we parse data from the universe dataset, let's look at the data itself. The dataset contains information about various stars and is provided in JSON format. Each entry corresponds to one star, with details such as its name, type, and mass. The mass field is stored as text in scientific notation, for instance "90.45 × 10^30 kg". Understanding this structure will help us process and analyze the data efficiently.

Reading and Processing JSON Files from a Zip Archive

Now, let's move on to reading JSON files stored in the zip archive. With the archive already extracted, we read a specific file using the fromJSON() function from the jsonlite package.

Here's how you read a JSON file from the extracted archive:

R
library(jsonlite)

stars_file_path <- file.path(extract_dir, "stars.json")
stars <- fromJSON(stars_file_path)

In this code:

  • file.path(extract_dir, "stars.json") constructs the full path to the stars.json file in the extracted directory.
  • fromJSON(stars_file_path) reads the content of the JSON file. Because the file holds an array of objects with the same fields, the result is simplified into a data frame that we can work with in R.
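
To see what fromJSON() produces, the sketch below writes a tiny stand-in file and inspects the result with str(). The field names follow the lesson's description of the dataset (name, type, mass); the file name and values are made up for illustration.

```r
library(jsonlite)

# Hypothetical stand-in for an extracted stars.json; the field names follow
# the lesson's description (name, type, mass), the values are made up.
json_path <- file.path(tempdir(), "stars_demo.json")
writeLines(c(
  '[',
  '  {"name": "Alpha", "type": "Giant", "mass": "90.45 × 10^30 kg"},',
  '  {"name": "Beta",  "type": "Dwarf", "mass": "1.20 × 10^30 kg"}',
  ']'
), json_path)

stars <- fromJSON(json_path)

# An array of same-shaped objects is simplified to a data frame:
# one row per star, one column per field
str(stars)
```

Note that the mass column arrives as character, which is why it has to be converted before any numeric comparison.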

Analyzing Data: Finding the Most Massive Stars

Once we've loaded our JSON data, we can analyze it. Let's sort the stars by mass to find the five most massive ones. This step shows a realistic use of the data, but the main point of this lesson is opening and managing zipped files, so the practice exercises in this unit focus mainly on that.

Note: This sorting approach works only if every mass uses the same power of ten, such as 10^30 in our dataset. If the exponents differ, the code breaks: the pattern strips only the 10^30 suffix, so other values cannot be converted to numbers, and even a more general pattern would compare only the leading number while ignoring the exponent. For simplicity, every dataset in this unit uses the same power of ten for all objects.
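
If a dataset did mix exponents, one way around the limitation described in the note is to parse the exponent as well and sort on the full value. This is only a sketch under assumptions: every mass follows the "<number> × 10^<exponent> kg" pattern exactly, and the sample values are made up.

```r
# Hypothetical masses with different exponents (not from the course dataset)
mass_text <- c("90.45 × 10^30 kg", "1.20 × 10^32 kg", "5.00 × 10^29 kg")

# Capture the leading number and the exponent with one pattern
pattern <- "^([0-9.]+) × 10\\^([0-9]+) kg$"
coefficient <- as.numeric(sub(pattern, "\\1", mass_text))
exponent <- as.numeric(sub(pattern, "\\2", mass_text))

# Combine both parts into a plain numeric mass in kilograms
mass_kg <- coefficient * 10^exponent

# Sorting on the full value ranks 1.20 × 10^32 above 90.45 × 10^30,
# which comparing the leading numbers alone would get wrong
print(mass_text[order(-mass_kg)])
```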

We'll use the order() function to sort by mass:

R
stars$mass <- as.numeric(gsub(" × 10\\^30 kg", "", stars$mass))
sorted_stars <- stars[order(-stars$mass), ]
most_massive_stars <- head(sorted_stars, 5)

Explanation:

  • gsub(" × 10\\^30 kg", "", stars$mass) removes the " × 10^30 kg" suffix from the mass field, leaving only the leading number as text.
  • as.numeric() converts that text into a numeric type so the values can be compared.
  • order(-stars$mass) returns the row indices that put the masses in descending order; indexing stars with them sorts the data frame.
  • head(sorted_stars, 5) extracts the top 5 stars by mass.

Displaying Results and Code Review

Finally, we'll display the five most massive stars:

R
cat("Top 5 Most Massive Stars:\n")
# seq_len() is safe even if the data frame is empty; %.2f prints two decimals
for (i in seq_len(nrow(most_massive_stars))) {
  cat(sprintf("%d. %s - Mass: %.2f × 10^30 kg\n",
              i, most_massive_stars$name[i], most_massive_stars$mass[i]))
}

Using a loop, we print each star's rank, name, and mass from our sorted list. The output has the following format:

Plain text
Top 5 Most Massive Stars:
1. Star Name - Mass: 90.45 × 10^30 kg
2. Star Name - Mass: 89.70 × 10^30 kg
3. Star Name - Mass: 88.35 × 10^30 kg
4. Star Name - Mass: 87.90 × 10^30 kg
5. Star Name - Mass: 86.10 × 10^30 kg
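
The same listing can also be produced without an explicit loop, because sprintf() is vectorized over its arguments. This is an alternative sketch, not the lesson's approach; the data frame is a hypothetical stand-in for most_massive_stars with made-up names and the masses from the sample output.

```r
# Hypothetical stand-in for most_massive_stars (names are made up)
most_massive_stars <- data.frame(
  name = c("Alpha", "Beta", "Gamma", "Delta", "Epsilon"),
  mass = c(90.45, 89.70, 88.35, 87.90, 86.10),
  stringsAsFactors = FALSE
)

# sprintf() recycles its arguments, so one call formats every row
output_lines <- sprintf(
  "%d. %s - Mass: %.2f × 10^30 kg",
  seq_len(nrow(most_massive_stars)),
  most_massive_stars$name,
  most_massive_stars$mass
)

cat("Top 5 Most Massive Stars:\n")
cat(output_lines, sep = "\n")
```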

Reading and Processing Text Files from a Zip Archive

In addition to JSON files, you might encounter other types of files within a zip archive. The workflow is the same as for JSON: unzip() extracts the archive, and a reader function then loads the extracted file. For plain text, that reader is readLines(). Here's an example of how to read a text file:

R
txt_file_path <- file.path(extract_dir, "data.txt")
content <- readLines(txt_file_path)
cat(content, sep = "\n")

Explanation:

  • file.path(extract_dir, "data.txt") constructs the full path to the data.txt file in the extracted directory.
  • readLines(txt_file_path) reads the contents of the text file as a vector of lines.
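
For large text files it can help to preview a few lines before reading everything. The sketch below uses the n argument of readLines() for that; the file is a hypothetical stand-in for an extracted data.txt.

```r
# Hypothetical stand-in for an extracted data.txt
txt_file_path <- file.path(tempdir(), "data_demo.txt")
writeLines(paste("line", 1:1000), txt_file_path)

# Read only the first three lines instead of the whole file
preview <- readLines(txt_file_path, n = 3)
print(preview)

# The full read returns one element per line
all_lines <- readLines(txt_file_path)
print(length(all_lines))
```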

Reading and Processing CSV Files from a Zip Archive

In addition to JSON and text files, you might encounter CSV files within a zip archive. Once unzip() has extracted the archive, read.csv() parses the extracted CSV file.

Here's how to read a CSV file:

R
csv_file_path <- file.path(extract_dir, "data.csv")
csv_data <- read.csv(csv_file_path)
print(csv_data)

Explanation:

  • file.path(extract_dir, "data.csv") constructs the full path to the data.csv file in the extracted directory.
  • read.csv(csv_file_path) reads the contents of the CSV file into a data frame.
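
As an alternative to extracting first, base R's unz() opens a connection to a single entry inside an archive, and read.csv() accepts such a connection. This goes slightly beyond the lesson's extract-then-read approach and is offered only as a sketch; the archive and its data.csv are hypothetical, and utils::zip() assumes a system zip program on the PATH.

```r
# Self-contained setup: a throwaway archive holding a hypothetical data.csv.
# utils::zip() needs a system "zip" program on the PATH.
src_dir <- file.path(tempdir(), "csv_demo_src")
dir.create(src_dir, showWarnings = FALSE)
write.csv(
  data.frame(name = c("Alpha", "Beta"), mass = c(90.45, 89.70)),
  file.path(src_dir, "data.csv"),
  row.names = FALSE
)
zip_path <- tempfile(fileext = ".zip")
zip(zip_path, files = file.path(src_dir, "data.csv"), flags = "-jq")

# unz() creates a read connection to one entry inside the archive,
# so read.csv() can parse it without extracting anything to disk
csv_data <- read.csv(unz(zip_path, "data.csv"))
print(csv_data)
```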

Summary and Preparation for Practice

In this lesson, you learned to use base R functions and the jsonlite package to manage data stored in compressed formats. Specifically, you saw how unzip() extracts and lists the files in a zip archive, and how fromJSON() reads JSON files and converts them into R objects. You also saw that readLines() reads a text file into a vector of lines, and that read.csv() loads CSV data into an R data frame.

These skills will be invaluable as you tackle more complex data handling tasks. Now, you're ready to apply these concepts in the upcoming practice exercises, where you'll get hands-on experience with data extraction and analysis from compressed datasets. Let's move forward with confidence.
