Welcome back! In the previous lessons, you learned how to set up the LLM Code Review Assistant project and how to scan a codebase to collect information about code files. Now, we are ready to take the next step: extracting the history of changes made to the project using git.
Git history is a record of all the changes that have been made to a project over time. This history is very valuable for understanding how a project has evolved, who made which changes, and why those changes were made. For code review and analysis, being able to look at past commits and file changes helps you spot patterns, understand the reasoning behind code, and even catch mistakes.
In this lesson, you will learn how to use Python to extract git history from a project. This will give you the tools to analyze changes and prepare for more advanced code review tasks.
Before we dive in, let’s quickly remind ourselves what a git repository and a commit are.
A git repository is a folder that tracks changes to files using git. It stores all the information about the project’s history, including every change made to the files.
A commit is a snapshot of the project at a certain point in time. Each commit has:
- A unique hash (an ID)
- A message describing the change
- The author’s name and email
- The date and time of the change
In the last lesson, you learned how to scan a codebase for files. Now, we will focus on reading the history of commits and the changes they contain.
When we extract git history, we are mainly interested in two things:
- Commit Details: Information about each commit, such as the hash, message, author, and date.
- File Changes: Which files were changed in each commit, and what the changes were.
Here is an example of what a commit might look like:
And an example of a file change in that commit:
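Both can be sketched as plain Python dictionaries. The field names and every value below are hypothetical, chosen only to mirror the commit details and file changes described above.

```python
from datetime import datetime

# Hypothetical commit record: all values are invented for illustration.
commit = {
    "hash": "a1b2c3d",
    "message": "Fix off-by-one error in pagination",
    "author": "Jane Doe",
    "email": "jane@example.com",
    "date": datetime(2024, 1, 15, 10, 30),
}

# Hypothetical file change that belongs to the commit above.
file_change = {
    "commit_hash": commit["hash"],
    "file_path": "app/pagination.py",
    "diff": "-    end = start + size + 1\n+    end = start + size",
}

print(f"{commit['hash']} by {commit['author']}: {commit['message']}")
print(f"  changed {file_change['file_path']}")
```

The shared hash is what links a file change back to its commit.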
This information helps you answer questions like:
- Who made a certain change?
- When was a feature added?
- What exactly was changed in a file?
Let's walk through how to extract git history using Python. We will use the gitpython library, which makes it easy to interact with git repositories from Python code.
Before we can start working with git repositories in Python, we need to install the gitpython library. Run `pip install gitpython` in your terminal.
This will install the library that allows Python to interact with git repositories.
First, we need to import the libraries we will use.
`gitpython` is used to interact with git, and `dataclasses` help us organize the data.
- `Repo` lets us work with a git repository.
- `datetime` is used for handling dates.
- `dataclass` helps us define simple classes for storing data.
We will use two data classes: one for commits and one for file changes.
- `GitCommit` stores information about each commit.
- `FileChange` stores information about each file change in a commit.
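A minimal sketch of what these two data classes might look like. The class names come from the lesson, but the exact fields are assumptions based on the commit details and file changes described earlier.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class GitCommit:
    # Field names are assumed; they mirror the commit details listed earlier.
    hash: str
    message: str
    author: str
    email: str
    date: datetime


@dataclass
class FileChange:
    # Field names are assumed for illustration.
    commit_hash: str   # links the change back to its commit
    file_path: str
    change_type: str   # e.g. "added", "modified", "deleted" (assumed vocabulary)
    diff: str


# Hypothetical values, only to show how the classes are instantiated.
commit = GitCommit("a1b2c3d", "Initial commit", "Jane Doe",
                   "jane@example.com", datetime(2024, 1, 15))
change = FileChange(commit.hash, "README.md", "added", "+# My project")
print(commit)
print(change)
```

The `@dataclass` decorator generates `__init__`, `__repr__`, and `__eq__` automatically, which keeps these record types short.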
Now, let’s create a class that will handle extracting the history.
- The `__init__` method sets up two lists: one for commits and one for file changes.
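As a rough sketch, the class skeleton could look like this. The attribute names `commits` and `file_changes` are assumptions, since only the two lists themselves are described above.

```python
class GitHistoryExtractor:
    """Collects commits and file changes from a git repository."""

    def __init__(self):
        # Attribute names are assumed for illustration.
        self.commits = []        # one entry per extracted commit
        self.file_changes = []   # one entry per changed file in a commit


extractor = GitHistoryExtractor()
print(len(extractor.commits), len(extractor.file_changes))
```

Creating the lists inside `__init__` gives every extractor instance its own storage, so two extractors never share results by accident.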
Let's add a method to extract commits and their file changes.
Let's break down what happens in this method in detail:
Repository Connection:
Let’s see how to use this class in a script.
- We create an instance of `GitHistoryExtractor`.
- We specify the path to the repository.
- We extract up to 10 recent commits.
- We print out the first 3 commits with their details.
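The printing step can be sketched as follows. The `format_commit` helper is hypothetical, and the sample records are made-up stand-ins for what an extractor would return, so the snippet runs without a repository.

```python
from datetime import datetime


def format_commit(commit: dict) -> str:
    """Render one commit as a single summary line (hypothetical helper)."""
    short_hash = commit["hash"][:7]
    date = commit["date"].strftime("%Y-%m-%d %H:%M")
    return f"{short_hash} | {commit['author']} | {date} | {commit['message']}"


# Invented sample data standing in for extracted commits, newest first.
commits = [
    {"hash": "a1b2c3d4e5f6", "author": "Jane Doe",
     "date": datetime(2024, 1, 15, 10, 30), "message": "Add login form"},
    {"hash": "b2c3d4e5f6a1", "author": "John Roe",
     "date": datetime(2024, 1, 14, 9, 5), "message": "Fix typo in README"},
    {"hash": "c3d4e5f6a1b2", "author": "Jane Doe",
     "date": datetime(2024, 1, 13, 16, 45), "message": "Initial commit"},
    {"hash": "d4e5f6a1b2c3", "author": "John Roe",
     "date": datetime(2024, 1, 12, 8, 0), "message": "Set up project"},
]

# Slicing limits the output to the first 3 commits.
for commit in commits[:3]:
    print(format_commit(commit))
```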
Example Output:
This output shows the most recent commits, who made them, and when.
In this lesson, you learned how to extract git history from a project using Python. You saw how to:
- Use the `gitpython` library to access a repository
- Collect commit details and file changes
- Organize this information using data classes
This prepares you for the practice exercises, where you will try out these steps yourself and get comfortable working with git history in Python. Understanding git history is a key skill for code review and project analysis, and you are now ready to put it into practice!
