Welcome to the lesson on parsing tables from CSV files. In our previous lesson, we focused on parsing text-based tables. Now, we're expanding on that knowledge to work with CSV files, a more structured and widely used format for tabular data.
CSV, which stands for Comma-Separated Values, is a file format that stores tabular data, such as a database or spreadsheet, in plain text. Each line in a CSV file corresponds to a row in the table, and each value is separated by a comma. CSV files are popular because they are simple and easily processed by a variety of programs, including Excel and most data analysis tools.
The CSV file is naturally formatted as a table. Here is an example:
```
Name,Age,Occupation
John,28,Engineer
Alice,34,Doctor
Bob,23,Artist
```
It uses newlines to separate rows and a separator character (in this case, a comma) to separate columns.
In Python, the `csv` module provides a straightforward way to handle CSV files. This module is included with Python’s standard library, so you don't need to install anything separately. It simplifies reading from and writing to CSV files, helping you avoid common pitfalls that arise from manual parsing.
The `csv` module contains several functions and classes for reading and writing CSV data. For our purposes, `csv.reader` is particularly important: it reads a CSV file and lets us iterate over it row by row.
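To illustrate the kind of pitfall manual parsing can run into, the sketch below compares a naive `str.split(',')` with `csv.reader` on a line that contains a quoted comma. The sample line is made up for illustration and is not part of the lesson's data file.

```python
import csv
import io

# Hypothetical sample line: the occupation contains a comma, so it is quoted.
line = 'John,28,"Engineer, Senior"'

# Naive splitting cuts the quoted field into two pieces.
naive_fields = line.split(',')

# csv.reader understands the quoting and keeps the field intact.
parsed_fields = next(csv.reader(io.StringIO(line)))

print(naive_fields)   # ['John', '28', '"Engineer', ' Senior"']
print(parsed_fields)  # ['John', '28', 'Engineer, Senior']
```

Here `io.StringIO` only stands in for a real file so the example runs without touching the disk.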
Before we can parse the CSV data, we need to open the file properly. Using Python’s built-in `open()` function, we can open our CSV file for reading. We use the `with` statement to ensure the file is closed automatically when we're done, freeing up resources efficiently.
Here's how you would open a CSV file named `data.csv`:
```python
import csv

file_path = 'data/data.csv'

with open(file_path, newline='') as csvfile:
    # 'csvfile' is a file object that we can work with
    pass
```
Notice the use of `newline=''`. This matters because it turns off Python's automatic newline translation and leaves line-ending handling to the `csv` module, so files are read consistently across platforms and line breaks inside quoted fields are preserved. The `with` statement handles the file closing, saving us from having to call `close()` manually.
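As a rough illustration of why `newline=''` is recommended, the sketch below writes a small file in which one field contains a line break, then reads it back. The file name `notes.csv` and its contents are hypothetical, and the file lives in a temporary directory that is removed afterwards.

```python
import csv
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file used only for this demonstration.
    demo_path = os.path.join(tmp_dir, 'notes.csv')

    # Write a row whose second field contains a line break.
    with open(demo_path, 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(['Name', 'Note'])
        writer.writerow(['John', 'first line\nsecond line'])

    # Read it back; newline='' leaves line-ending handling to the csv module.
    with open(demo_path, newline='') as csvfile:
        rows = list(csv.reader(csvfile))

print(rows)  # [['Name', 'Note'], ['John', 'first line\nsecond line']]
```

The multi-line note comes back as a single field, even though it spans two physical lines in the file.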
Once we've got our file object using `with open()`, we can start reading the contents using `csv.reader`. It processes the CSV data for us, converting each line into a list of string values.
```python
import csv

file_path = 'data/data.csv'

with open(file_path, newline='') as csvfile:
    csv_reader = csv.reader(csvfile)
```
In this snippet, `csv_reader` is an object that allows us to loop over each row in the CSV file. `csv.reader` takes care of correctly splitting each line on commas.
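A minimal sketch of what iterating over the reader yields, using the sample table from this lesson. To keep it runnable without a file on disk, it assumes an in-memory `io.StringIO` object in place of `data.csv`.

```python
import csv
import io

# In-memory stand-in for data.csv, holding the lesson's sample table.
sample = io.StringIO(
    "Name,Age,Occupation\n"
    "John,28,Engineer\n"
    "Alice,34,Doctor\n"
    "Bob,23,Artist\n"
)

rows = []
for row in csv.reader(sample):
    rows.append(row)
    print(row)  # each row is a list of strings, e.g. ['John', '28', 'Engineer']
```

Note that every value, including the age, arrives as a string; any numeric conversion is up to us.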
Often, CSV files have a header row. When reading a table from a `txt` file, we skipped the header with slicing. That doesn't work with `csv.reader`, because the reader object is not subscriptable. Instead, you can skip the header row using the `next()` function, which consumes one row and advances the reader to the next line:
```python
import csv

file_path = 'data/data.csv'

with open(file_path, newline='') as csvfile:
    csv_reader = csv.reader(csvfile)
    next(csv_reader)  # This skips the header
```
Skipping the header keeps the column labels out of the data, so every remaining row can be parsed the same way.
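Since `next()` returns the row it consumes, the header does not have to be thrown away. The sketch below keeps it and uses it to look up a column position by name; the variable names `header` and `age_index` are illustrative, and an in-memory `io.StringIO` object stands in for the file.

```python
import csv
import io

# In-memory stand-in for data.csv, holding the lesson's sample table.
sample = io.StringIO(
    "Name,Age,Occupation\n"
    "John,28,Engineer\n"
    "Alice,34,Doctor\n"
    "Bob,23,Artist\n"
)

csv_reader = csv.reader(sample)
header = next(csv_reader)          # consumes and returns the header row
age_index = header.index('Age')    # position of the 'Age' column
first_data_row = next(csv_reader)  # the reader now continues with the data

print(header)                      # ['Name', 'Age', 'Occupation']
print(first_data_row[age_index])   # 28
```

On an empty file, `next(csv_reader)` raises `StopIteration`; passing a default, as in `next(csv_reader, None)`, avoids that.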
With the header out of the way, you can now iterate through the remaining rows and extract data. This time, let's not only read the data but also solve a real-world task. For example, imagine you want to collect all ages from our CSV file into a list for statistical analysis:
```python
import csv

file_path = 'data/data.csv'
ages = []

with open(file_path, newline='') as csvfile:
    csv_reader = csv.reader(csvfile)
    next(csv_reader)  # Skip the header row
    for row in csv_reader:
        ages.append(int(row[1]))  # Append the age, which is the second item in each row

print(ages)
```
In this code, the `for` loop goes through each row in the CSV file, appending the value from the second column to the `ages` list. Note that we also convert each age to an integer with `int()`, because `csv.reader` returns every value as a string. Once the loop completes, `ages` contains all the age data from the CSV.
Output:
```
[28, 34, 23]
```
This output confirms that ages have been successfully extracted and stored in a list.
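With the ages in a list, a simple analysis is only a few lines away. The sketch below assumes the `ages` list shown in the output above and uses the standard library's `statistics` module; which statistics to compute is an illustrative choice, not something the lesson prescribes.

```python
import statistics

# The list extracted from the sample CSV in the example above.
ages = [28, 34, 23]

average_age = statistics.mean(ages)
median_age = statistics.median(ages)

print(f"Average: {average_age:.1f}")  # Average: 28.3
print(f"Median: {median_age}")        # Median: 28
print(f"Youngest: {min(ages)}, oldest: {max(ages)}")  # Youngest: 23, oldest: 34
```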
By default, `csv.reader` assumes the delimiter is a comma. However, CSV files can use different delimiters, such as semicolons or tabs, depending on how the data was exported. To specify a different delimiter, pass the `delimiter` parameter when creating the `csv.reader` object. For instance, if your CSV file uses semicolons as delimiters, you would specify it like this:
```python
import csv

file_path = 'data/data.csv'

with open(file_path, newline='') as csvfile:
    csv_reader = csv.reader(csvfile, delimiter=';')
```
This tells `csv.reader` to use semicolons instead of commas to separate values, ensuring accurate parsing of the data. Adjust the `delimiter` parameter to match the character used in your specific CSV files.
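The same idea applies to tab-separated data. The sketch below uses a hypothetical tab-separated version of the sample table, held in memory, and also shows what happens when the delimiter does not match the file: the whole line comes back as a single field.

```python
import csv
import io

# Hypothetical tab-separated version of the lesson's sample table.
tsv_text = "Name\tAge\tOccupation\nJohn\t28\tEngineer\nAlice\t34\tDoctor\n"

# Correct delimiter: each row is split into three fields.
csv_reader = csv.reader(io.StringIO(tsv_text), delimiter='\t')
next(csv_reader)  # Skip the header row
names = [row[0] for row in csv_reader]
print(names)  # ['John', 'Alice']

# Default (comma) delimiter on the same data: nothing is split.
unsplit_header = next(csv.reader(io.StringIO(tsv_text)))
print(unsplit_header)  # ['Name\tAge\tOccupation']
```

A row that unexpectedly contains only one long field is often a sign that the delimiter setting does not match the file.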
In this lesson, you've learned how to parse data from a CSV file using Python’s `csv` module. You saw how to open files using the `with` statement, read data using `csv.reader`, skip headers, extract specific columns into Python lists, and handle files that use a delimiter other than a comma.
These skills are essential for working with structured tabular data and will serve as a foundation for more advanced data manipulation tasks. As you move on to the practice exercises, you'll have the opportunity to apply what you've learned, further reinforcing your understanding of CSV parsing.
Keep practicing, and remember, you're well on your way to becoming proficient in handling data from different file formats.