Welcome back! In the previous lesson, you learned how to set up the OpenAI client for code review tasks. Now that you can connect to the OpenAI API, the next step is to prepare the code changes you want to review.
When developers make changes to code, these changes are often shared as "diffs." A diff shows what was added, removed, or changed in a file. For an AI code review assistant to be helpful, it needs to understand these diffs. That’s why parsing code diffs is so important: it lets us break down code changes into a format that both humans and AI can understand and analyze.
In this lesson, you will learn how to parse a unified diff into structured data using Python. This is a key step before sending code changes to an AI for review.
Let’s quickly remind ourselves what a unified diff is. You may have seen diffs before when using tools like Git. A unified diff is a text format that shows the differences between two versions of a file.
Here’s a small example:
- Lines starting with `---` and `+++` show the old and new file names.
- The `@@` line is called a "hunk header" and shows which lines are changing.
- Lines starting with `+` are additions.
- Lines starting with `-` are removals.
- Lines starting with a space are unchanged (context).
This format is what we will parse in this lesson.
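To make these markers concrete, here is a minimal sketch that labels each line of a small, made-up diff by its prefix. The diff contents, the file name `hello.py`, and the helper `classify_line` are illustrative assumptions, not part of the parser built later in this lesson.

```python
# A hypothetical unified diff, built line by line so the leading
# spaces on context lines are easy to see.
SAMPLE_DIFF = "\n".join([
    "--- a/hello.py",
    "+++ b/hello.py",
    "@@ -1,3 +1,3 @@",
    " def greet():",
    '-    print("Hello")',
    '+    print("Hello, world!")',
    "     return None",
])


def classify_line(line: str) -> str:
    """Label a diff line by its prefix (simplified: checks headers first)."""
    # Simplification: a removed line whose content starts with "-- " would
    # also look like a "--- " header. A real parser tracks whether it is
    # inside a hunk to avoid that ambiguity.
    if line.startswith("--- ") or line.startswith("+++ "):
        return "file header"
    if line.startswith("@@"):
        return "hunk header"
    if line.startswith("+"):
        return "added"
    if line.startswith("-"):
        return "removed"
    if line.startswith(" "):
        return "context"
    return "other"


for diff_line in SAMPLE_DIFF.splitlines():
    print(f"{classify_line(diff_line):>11} | {diff_line}")
```

The order of the checks matters: the header prefixes are tested before the single-character `+` and `-` prefixes, because `+++` also starts with `+`.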
To work with diffs in Python, it helps to break them down into smaller parts. We will use data classes to represent these parts. Data classes are a simple way to group related data together.
Let’s look at the three main components we’ll use:
- `DiffLine` represents a single line in the diff. It stores the line number, the content, and the type of change (`added`, `removed`, or `context`).
- `DiffHunk` represents a group of changes (a "hunk") in the file. It stores where the changes start in the old and new files, and a list of `DiffLine` objects.
- `FileDiff` represents all the changes for a single file. It stores the file path, a list of hunks, and flags for whether the file is new or deleted.
For example, if a diff adds a line to a file, we would create a `DiffLine` with `change_type='added'` and include it in a `DiffHunk`, which is then part of a `FileDiff`.
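One way these three components might be written is sketched below. Only the class names and `change_type` come from the text; the other field names (`line_number`, `content`, `old_start`, `new_start`, `lines`, `file_path`, `hunks`, `is_new_file`, `is_deleted_file`) are assumptions chosen to match the descriptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DiffLine:
    line_number: int   # assumed field name
    content: str       # line text without the leading +, -, or space
    change_type: str   # "added", "removed", or "context"


@dataclass
class DiffHunk:
    old_start: int     # where the hunk starts in the old file (assumed name)
    new_start: int     # where the hunk starts in the new file (assumed name)
    lines: List[DiffLine] = field(default_factory=list)


@dataclass
class FileDiff:
    file_path: str
    hunks: List[DiffHunk] = field(default_factory=list)
    is_new_file: bool = False      # assumed flag name
    is_deleted_file: bool = False  # assumed flag name


# The example from the text: one added line, inside a hunk, inside a file diff.
added = DiffLine(line_number=2, content='print("Hello, world!")', change_type="added")
hunk = DiffHunk(old_start=1, new_start=1, lines=[added])
file_diff = FileDiff(file_path="hello.py", hunks=[hunk])
print(file_diff)
```

Using `field(default_factory=list)` gives every instance its own empty list, which avoids the shared-mutable-default problem that a plain `= []` would cause.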
Now, let’s build the parser step by step. We want to turn a unified diff text into a `FileDiff` object.
First, we need to find the file path and check if the file is new or deleted.
- We split the diff into lines.
- We look for the line starting with `+++` to get the new file path.
- We check for lines that mention `new file mode` or `deleted file mode` to see if the file is new or deleted.
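These header steps could be sketched roughly as follows. The function name `extract_file_info` and the sample diff are hypothetical, and stripping the `b/` prefix that Git adds to paths is an assumption about the desired result rather than something the lesson specifies.

```python
from typing import Tuple


def extract_file_info(diff_text: str) -> Tuple[str, bool, bool]:
    """Return (file_path, is_new_file, is_deleted_file) for a single-file diff."""
    file_path = ""
    is_new_file = False
    is_deleted_file = False
    for line in diff_text.splitlines():
        if line.startswith("@@"):
            # Headers end at the first hunk. Stopping here keeps added lines
            # such as "+++ text" inside a hunk from being read as a file path.
            break
        if line.startswith("+++ "):
            path = line[4:].strip()
            # Assumption: drop Git's "b/" prefix so the path reads "notes.txt".
            file_path = path[2:] if path.startswith("b/") else path
        elif line.startswith("new file mode"):
            is_new_file = True
        elif line.startswith("deleted file mode"):
            is_deleted_file = True
    return file_path, is_new_file, is_deleted_file


# A made-up Git diff for a newly created file.
new_file_diff = "\n".join([
    "diff --git a/notes.txt b/notes.txt",
    "new file mode 100644",
    "--- /dev/null",
    "+++ b/notes.txt",
    "@@ -0,0 +1,2 @@",
    "+first line",
    "+second line",
])

print(extract_file_info(new_file_diff))  # ('notes.txt', True, False)
```

Note that for a deleted file, Git writes `/dev/null` on the `+++` line, so a fuller parser would probably fall back to the `---` line for the path in that case.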
Next, we need to find each hunk and parse the lines inside.
- We loop through the lines, looking for hunk headers (lines starting with `@@`).
- When we find a hunk, we record where it starts in the old and new files.
- For each line in the hunk:
  - If it starts with `+`, it’s an added line.
  - If it starts with `-`, it’s a removed line.
  - If it starts with a space, it’s a context (unchanged) line.
- We create `DiffLine` objects for each line and add them to the hunk.
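The hunk loop described above might look something like this sketch. The regular expression, the function name `parse_hunks`, and the field names are assumptions. So is the numbering rule: here added and context lines carry their line number in the new file, while removed lines carry their line number in the old file.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Matches headers such as "@@ -1,3 +1,3 @@" and captures the two start lines.
HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,\d+)? \+(\d+)(?:,\d+)? @@")


@dataclass
class DiffLine:
    line_number: int
    content: str
    change_type: str  # "added", "removed", or "context"


@dataclass
class DiffHunk:
    old_start: int
    new_start: int
    lines: List[DiffLine] = field(default_factory=list)


def parse_hunks(diff_text: str) -> List[DiffHunk]:
    """Collect the hunks of a single-file unified diff (illustrative sketch)."""
    hunks: List[DiffHunk] = []
    current: Optional[DiffHunk] = None
    old_no = new_no = 0
    for raw in diff_text.splitlines():
        match = HUNK_HEADER.match(raw)
        if match:
            old_no, new_no = int(match.group(1)), int(match.group(2))
            current = DiffHunk(old_start=old_no, new_start=new_no)
            hunks.append(current)
        elif current is None:
            continue  # still in the file header, before the first hunk
        elif raw.startswith("+"):
            current.lines.append(DiffLine(new_no, raw[1:], "added"))
            new_no += 1
        elif raw.startswith("-"):
            current.lines.append(DiffLine(old_no, raw[1:], "removed"))
            old_no += 1
        elif raw.startswith(" "):
            current.lines.append(DiffLine(new_no, raw[1:], "context"))
            old_no += 1
            new_no += 1
        # Anything else (e.g. "\ No newline at end of file") is skipped.
    return hunks


hunk_demo = "\n".join([
    "--- a/hello.py",
    "+++ b/hello.py",
    "@@ -1,3 +1,3 @@",
    " def greet():",
    '-    print("Hello")',
    '+    print("Hello, world!")',
    "     return None",
])

for parsed in parse_hunks(hunk_demo):
    print(parsed.old_start, parsed.new_start, [l.change_type for l in parsed.lines])
```

Keeping two counters, one per file version, is what lets a removed line and the added line that replaces it both report line 2 in this example.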
Finally, we return a `FileDiff` object with all the parsed information.
Here’s how you might use this function with a sample diff:
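As a self-contained sketch of the whole flow, the block below combines the steps into one function and runs it on a made-up diff. The function name `parse_unified_diff`, the field names, the `b/` prefix stripping, and the sample diff are all assumptions for illustration. It handles a single-file diff only.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,\d+)? \+(\d+)(?:,\d+)? @@")


@dataclass
class DiffLine:
    line_number: int
    content: str
    change_type: str  # "added", "removed", or "context"


@dataclass
class DiffHunk:
    old_start: int
    new_start: int
    lines: List[DiffLine] = field(default_factory=list)


@dataclass
class FileDiff:
    file_path: str
    hunks: List[DiffHunk] = field(default_factory=list)
    is_new_file: bool = False
    is_deleted_file: bool = False


def parse_unified_diff(diff_text: str) -> FileDiff:
    """Parse a single-file unified diff into a FileDiff (illustrative sketch)."""
    result = FileDiff(file_path="")
    current: Optional[DiffHunk] = None
    old_no = new_no = 0
    for raw in diff_text.splitlines():
        match = HUNK_HEADER.match(raw)
        if match:
            old_no, new_no = int(match.group(1)), int(match.group(2))
            current = DiffHunk(old_start=old_no, new_start=new_no)
            result.hunks.append(current)
        elif current is None:
            # Header section: file path and new/deleted markers.
            if raw.startswith("+++ "):
                path = raw[4:].strip()
                result.file_path = path[2:] if path.startswith("b/") else path
            elif raw.startswith("new file mode"):
                result.is_new_file = True
            elif raw.startswith("deleted file mode"):
                result.is_deleted_file = True
        elif raw.startswith("+"):
            current.lines.append(DiffLine(new_no, raw[1:], "added"))
            new_no += 1
        elif raw.startswith("-"):
            current.lines.append(DiffLine(old_no, raw[1:], "removed"))
            old_no += 1
        elif raw.startswith(" "):
            current.lines.append(DiffLine(new_no, raw[1:], "context"))
            old_no += 1
            new_no += 1
    return result


sample_diff = "\n".join([
    "--- a/hello.py",
    "+++ b/hello.py",
    "@@ -1,3 +1,3 @@",
    " def greet():",
    '-    print("Hello")',
    '+    print("Hello, world!")',
    "     return None",
])

file_diff = parse_unified_diff(sample_diff)
print(f"File: {file_diff.file_path}")
for hunk in file_diff.hunks:
    print(f"Hunk: old start {hunk.old_start}, new start {hunk.new_start}")
    for line in hunk.lines:
        print(f"  [{line.change_type}] {line.line_number}: {line.content}")

# Running this sketch prints:
# File: hello.py
# Hunk: old start 1, new start 1
#   [context] 1: def greet():
#   [removed] 2:     print("Hello")
#   [added] 2:     print("Hello, world!")
#   [context] 3:     return None
```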
Sample Output:
This output shows the file path, the hunks, and each line’s type and content.
In this lesson, you learned how to break down a unified diff into structured data using Python. You saw how to:
- Identify the file path and file status from a diff.
- Find and parse each hunk and its lines.
- Represent the diff using simple data classes.
This is a key step in building an AI code review assistant, as it allows you to analyze and process code changes before sending them to the AI.
In the next set of exercises, you’ll get hands-on practice parsing diffs and working with the parsed data. This will help you become comfortable with handling code changes in real-world scenarios. Good luck!
