Diff Parser: Breaking Down Code Changes for Review

Introduction: Why Parse Code Diffs?

Welcome back! In the previous lesson, you learned how to set up the OpenAI client for code review tasks. Now that you can connect to the OpenAI API, the next step is to prepare the code changes you want to review.

When developers make changes to code, these changes are often shared as "diffs." A diff shows what was added, removed, or changed in a file. For an AI code review assistant to be helpful, it needs to understand these diffs. That’s why parsing code diffs is so important: it lets us break down code changes into a format that both humans and AI can understand and analyze.

In this lesson, you will learn how to parse a unified diff into structured data using Python. This is a key step before sending code changes to an AI for review.

Recall: What Does a Unified Diff Look Like?

Let’s quickly remind ourselves what a unified diff is. You may have seen diffs before when using tools like Git. A unified diff is a text format that shows the differences between two versions of a file.

Here’s a small example:
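The diff below is a hypothetical illustration (the file name `greet.py` and its contents are invented for this sketch). It is stored in a Python string so it can be passed to a parser later:

```python
# Hypothetical diff for illustration only; greet.py is not a real file from the course.
SAMPLE_DIFF = """\
--- a/greet.py
+++ b/greet.py
@@ -1,3 +1,4 @@
 def greet(name):
-    print("Hello " + name)
+    message = f"Hello, {name}!"
+    print(message)
     return name
"""

print(SAMPLE_DIFF)
```

The hunk header `@@ -1,3 +1,4 @@` says the hunk covers 3 lines starting at line 1 of the old file and 4 lines starting at line 1 of the new file.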

Lines starting with --- and +++ show the old and new file names.
The @@ line is called a "hunk header"; it gives the starting line and line count of the change in the old file (after -) and in the new file (after +).
Lines starting with + are additions.
Lines starting with - are removals.
Lines starting with a space are unchanged (context).

This format is what we will parse in this lesson.

Understanding Diff Components In Python

To work with diffs in Python, it helps to break them down into smaller parts. We will use data classes to represent these parts. Data classes are a simple way to group related data together.

Let’s look at the three main components we’ll use:

DiffLine represents a single line in the diff. It stores the line number, the content, and the type of change (added, removed, or context).
DiffHunk represents a group of changes (a "hunk") in the file. It stores where the changes start in the old and new files, and a list of DiffLine objects.
FileDiff represents all the changes for a single file. It stores the file path, a list of hunks, and flags for whether the file is new or deleted.

For example, if a diff adds a line to a file, we would create a DiffLine with change_type='added' and include it in a DiffHunk, which is then part of a FileDiff.
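The three components could be sketched with Python's `dataclasses` module as follows. Only the class names and `change_type` come from the lesson text; the other field names (`line_number`, `old_start`, `new_start`, `is_new_file`, `is_deleted_file`) are assumptions chosen for this illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DiffLine:
    line_number: int   # assumed field name: position of the line in its file
    content: str       # line text without the leading +, -, or space marker
    change_type: str   # 'added', 'removed', or 'context'


@dataclass
class DiffHunk:
    old_start: int     # assumed name: first line of the hunk in the old file
    new_start: int     # assumed name: first line of the hunk in the new file
    lines: List[DiffLine] = field(default_factory=list)


@dataclass
class FileDiff:
    file_path: str
    hunks: List[DiffHunk] = field(default_factory=list)
    is_new_file: bool = False      # assumed flag name
    is_deleted_file: bool = False  # assumed flag name


# One added line, wrapped in a hunk, wrapped in a file diff.
line = DiffLine(line_number=2, content="    print(message)", change_type="added")
hunk = DiffHunk(old_start=1, new_start=1, lines=[line])
file_diff = FileDiff(file_path="greet.py", hunks=[hunk])
print(file_diff)
```

Using `field(default_factory=list)` gives every instance its own empty list, which avoids the shared mutable default problem.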

Step-By-Step: Parsing A Unified Diff

Now, let’s build the parser step by step. We want to turn a unified diff text into a FileDiff object.

1. Extracting the File Path and Status

First, we need to find the file path and check if the file is new or deleted.

We split the diff into lines.
We look for the line starting with +++ to get the new file path. Git prefixes this path with b/, and for a deleted file the +++ line reads /dev/null.
We check for lines that mention new file mode or deleted file mode to see if the file is new or deleted.
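A minimal sketch of this step is shown here. The function name `extract_file_info` and its tuple return value are hypothetical, and the sketch assumes the diff describes a single file.

```python
from typing import Tuple


def extract_file_info(diff_text: str) -> Tuple[str, bool, bool]:
    """Return (file_path, is_new, is_deleted) from the header of a single-file diff."""
    file_path = ""
    is_new = False
    is_deleted = False

    for line in diff_text.splitlines():
        if line.startswith("@@"):
            break  # the file header ends where the first hunk begins
        if line.startswith("+++ "):
            path = line[4:].strip()
            # Git prefixes the new path with "b/"; strip it if present.
            # For a deleted file the path is "/dev/null" and is kept as-is here.
            file_path = path[2:] if path.startswith("b/") else path
        elif line.startswith("new file mode"):
            is_new = True
        elif line.startswith("deleted file mode"):
            is_deleted = True

    return file_path, is_new, is_deleted


header_demo = (
    "diff --git a/x.py b/x.py\n"
    "new file mode 100644\n"
    "--- /dev/null\n"
    "+++ b/x.py\n"
    "@@ -0,0 +1 @@\n"
    "+print('hi')\n"
)
print(extract_file_info(header_demo))
```

Stopping at the first `@@` line keeps added lines inside a hunk from being mistaken for header lines.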

2. Identifying Hunks and Parsing Lines

Next, we need to find each hunk and parse the lines inside.

We loop through the lines, looking for hunk headers (lines starting with @@).
When we find a hunk, we record where it starts in the old and new files.
For each line in the hunk:
- If it starts with +, it’s an added line.
- If it starts with -, it’s a removed line.
- If it starts with a space, it’s a context (unchanged) line.
We create a DiffLine object for each line and add it to the current hunk.
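The loop described above might look like the following sketch. To stay short it returns plain tuples rather than DiffLine and DiffHunk objects; the name `parse_hunks` and the regular expression are assumptions made for this example, and the input is assumed to be a diff for a single file.

```python
import re
from typing import List, Tuple

# Matches headers such as "@@ -1,3 +1,4 @@" and captures the two start lines.
# The line counts after the commas are optional in the unified format.
HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,\d+)? \+(\d+)(?:,\d+)? @@")

Hunk = Tuple[int, int, List[Tuple[str, str]]]


def parse_hunks(diff_text: str) -> List[Hunk]:
    """Return one (old_start, new_start, [(change_type, content), ...]) tuple per hunk."""
    hunks: List[Hunk] = []
    current = None

    for line in diff_text.splitlines():
        match = HUNK_HEADER.match(line)
        if match:
            current = (int(match.group(1)), int(match.group(2)), [])
            hunks.append(current)
        elif current is None:
            continue  # still in the file header (---, +++, mode lines)
        elif line.startswith("+"):
            current[2].append(("added", line[1:]))
        elif line.startswith("-"):
            current[2].append(("removed", line[1:]))
        elif line.startswith(" "):
            current[2].append(("context", line[1:]))
        # Anything else (for example "\ No newline at end of file") is skipped.

    return hunks


print(parse_hunks("--- a/f.py\n+++ b/f.py\n@@ -1,2 +1,2 @@\n a\n-b\n+c\n"))
```

Checking for the hunk header before checking the + and - prefixes matters: the `---` and `+++` header lines would otherwise be classified as removed and added lines.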

3. Putting It All Together

Finally, we return a FileDiff object with all the parsed information.

Here’s how you might use this function with a sample diff:
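The sketch below combines the three steps into one self-contained function and runs it on a hypothetical diff. The function name `parse_unified_diff`, the field names, and the line-numbering rule (removed lines use the old file's numbering, added and context lines use the new file's) are assumptions made for this illustration, not necessarily the lesson's exact implementation.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DiffLine:
    line_number: int
    content: str
    change_type: str  # 'added', 'removed', or 'context'


@dataclass
class DiffHunk:
    old_start: int
    new_start: int
    lines: List[DiffLine] = field(default_factory=list)


@dataclass
class FileDiff:
    file_path: str
    hunks: List[DiffHunk] = field(default_factory=list)
    is_new_file: bool = False
    is_deleted_file: bool = False


HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,\d+)? \+(\d+)(?:,\d+)? @@")


def parse_unified_diff(diff_text: str) -> FileDiff:
    """Parse a unified diff for a single file into a FileDiff."""
    result = FileDiff(file_path="")
    hunk: Optional[DiffHunk] = None
    old_no = new_no = 0  # running line numbers in the old and new files

    for line in diff_text.splitlines():
        match = HUNK_HEADER.match(line)
        if match:
            old_no, new_no = int(match.group(1)), int(match.group(2))
            hunk = DiffHunk(old_start=old_no, new_start=new_no)
            result.hunks.append(hunk)
        elif hunk is None:
            # Step 1: file header lines before the first hunk.
            if line.startswith("+++ "):
                path = line[4:].strip()
                result.file_path = path[2:] if path.startswith("b/") else path
            elif line.startswith("new file mode"):
                result.is_new_file = True
            elif line.startswith("deleted file mode"):
                result.is_deleted_file = True
        elif line.startswith("+"):
            hunk.lines.append(DiffLine(new_no, line[1:], "added"))
            new_no += 1
        elif line.startswith("-"):
            hunk.lines.append(DiffLine(old_no, line[1:], "removed"))
            old_no += 1
        elif line.startswith(" "):
            hunk.lines.append(DiffLine(new_no, line[1:], "context"))
            old_no += 1
            new_no += 1

    return result


# Hypothetical sample diff; greet.py is invented for this example.
sample = """\
--- a/greet.py
+++ b/greet.py
@@ -1,3 +1,4 @@
 def greet(name):
-    print("Hello " + name)
+    message = f"Hello, {name}!"
+    print(message)
     return name
"""

file_diff = parse_unified_diff(sample)
print(f"File: {file_diff.file_path}")
for h in file_diff.hunks:
    print(f"Hunk: old start {h.old_start}, new start {h.new_start}")
    for dl in h.lines:
        print(f"  {dl.line_number:>3} {dl.change_type:<8} {dl.content}")
```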

Sample Output:

This output shows the file path, the hunks, and each line’s type and content.

Summary And What’s Next

In this lesson, you learned how to break down a unified diff into structured data using Python. You saw how to:

Identify the file path and file status from a diff.
Find and parse each hunk and its lines.
Represent the diff using simple data classes.

This is a key step in building an AI code review assistant, as it allows you to analyze and process code changes before sending them to the AI.

In the next set of exercises, you’ll get hands-on practice parsing diffs and working with the parsed data. This will help you become comfortable with handling code changes in real-world scenarios. Good luck!
