Introduction: Why Extract Git History?

Welcome back! In the previous lessons, you learned how to set up the LLM Code Review Assistant project and how to scan a codebase to collect information about code files. Now, we are ready to take the next step: extracting the history of changes made to the project using git.

Git history is a record of all the changes that have been made to a project over time. This history is very valuable for understanding how a project has evolved, who made which changes, and why those changes were made. For code review and analysis, being able to look at past commits and file changes helps you spot patterns, understand the reasoning behind code, and even catch mistakes.

In this lesson, you will learn how to use Python to extract git history from a project. This will give you the tools to analyze changes and prepare for more advanced code review tasks.


Quick Recall: Git Repositories and Commits

Before we dive in, let’s quickly remind ourselves what a git repository and a commit are.

A git repository is a folder whose files are tracked by git. Git stores the project’s full history, including every committed change to those files, in a hidden .git subfolder inside it.

A commit is a snapshot of the project at a certain point in time. Each commit has:

  • A unique hash (an ID)
  • A message describing the change
  • The author’s name and email
  • The date and time of the change

In the last lesson, you learned how to scan a codebase for files. Now, we will focus on reading the history of commits and the changes they contain.


What Information Can We Get from Git History?

When we extract git history, we are mainly interested in two things:

  1. Commit Details: Information about each commit, such as the hash, message, author, and date.
  2. File Changes: Which files were changed in each commit, and what the changes were.

Here is an example of what a commit might look like:
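
As an illustration only, a commit record could be represented as a plain Python dictionary like the one below. Every value here (the hash, message, author, and date) is invented for the example, and the key names are assumptions rather than a fixed format.

```python
from datetime import datetime

# Hypothetical commit record; all values are made up for illustration.
example_commit = {
    "hash": "a1b2c3d4e5f60718293a4b5c6d7e8f9012345678",
    "message": "Fix off-by-one error in pagination",
    "author": "Jane Doe",
    "author_email": "jane@example.com",
    "date": datetime(2024, 3, 15, 14, 30),
}

# A short form of the hash is often enough to identify a commit.
print(example_commit["hash"][:7], example_commit["message"])
```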

And an example of a file change in that commit:
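
As a sketch, a file change belonging to a commit could be recorded like this. The file path, the key names, and the diff text are all hypothetical. The diff uses git's unified format, where lines starting with - were removed and lines starting with + were added.

```python
# Hypothetical file-change record; all values are made up for illustration.
example_change = {
    "commit_hash": "a1b2c3d4e5f60718293a4b5c6d7e8f9012345678",
    "file_path": "app/pagination.py",
    "change_type": "modified",
    "diff": (
        "@@ -10,3 +10,3 @@\n"
        "     start = page * size\n"
        "-    end = start + size + 1\n"
        "+    end = start + size\n"
        "     return items[start:end]\n"
    ),
}

# Count added and removed lines by their leading marker.
lines = example_change["diff"].splitlines()
added = sum(1 for line in lines if line.startswith("+"))
removed = sum(1 for line in lines if line.startswith("-"))
print(f"{example_change['file_path']}: +{added} -{removed}")
```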

This information helps you answer questions like:

  • Who made a certain change?
  • When was a feature added?
  • What exactly was changed in a file?


Extracting Git History with Python

Let's walk through how to extract git history using Python. We will use the gitpython library, which makes it easy to interact with git repositories from Python code.

Installing Required Dependencies

Before we can start working with git repositories in Python, we need to install the gitpython library. Run this command in your terminal:

    pip install gitpython

This installs the library that allows Python to interact with git repositories.

Step 1: Import Required Libraries

First, we import the names we will use. gitpython, which is imported under the module name git, is used to interact with git, and dataclasses help us organize the data.

  • Repo (from git) lets us work with a git repository.
  • datetime (from the datetime module) is used for handling dates.
  • dataclass (from the dataclasses module) helps us define simple classes for storing data.

Step 2: Define Data Structures

We will use two data classes: one for commits and one for file changes.

  • GitCommit stores information about each commit.
  • FileChange stores information about each file change in a commit.
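
A minimal sketch of what these two data classes could look like. The field names below are assumptions based on the commit details listed earlier (hash, message, author, date); the lesson's actual definitions may differ.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class GitCommit:
    """Details of a single commit (field names are assumed)."""
    hash: str
    message: str
    author: str
    author_email: str
    date: datetime


@dataclass
class FileChange:
    """One file touched by a commit (field names are assumed)."""
    commit_hash: str
    file_path: str
    change_type: str  # e.g. "added", "modified", "deleted"
    diff: str


# Data classes generate __init__ and a readable __repr__ automatically.
commit = GitCommit("abc123", "Initial commit", "Jane Doe",
                   "jane@example.com", datetime(2024, 1, 1))
print(commit)
```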

Step 3: Create the GitHistoryExtractor Class

Now, let’s create a class that will handle extracting the history.

  • The __init__ method sets up two lists: one for commits and one for file changes.
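
In outline, the class could start like this. The attribute names commits and file_changes are assumed here; the text only says that __init__ sets up two lists.

```python
class GitHistoryExtractor:
    """Collects commits and file changes from a git repository."""

    def __init__(self):
        # Both lists start empty and are filled by the extraction step.
        self.commits = []       # one entry per commit
        self.file_changes = []  # one entry per changed file in a commit


extractor = GitHistoryExtractor()
print(len(extractor.commits), len(extractor.file_changes))
```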

Step 4: Extract Commits and File Changes

Let's add a method to extract commits and their file changes.
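
One way such a method could look is sketched below. This is not necessarily the lesson's own implementation: the method names extract_commits and add_commit, the limit parameter, the field names, and the change-type labels are all assumptions. The gitpython calls used (Repo, iter_commits with max_count, and diff with create_patch=True) are part of the library. The sketch compares each commit only against its first parent and skips file changes for the very first commit, which has no parent.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class GitCommit:
    hash: str
    message: str
    author: str
    author_email: str
    date: datetime


@dataclass
class FileChange:
    commit_hash: str
    file_path: str
    change_type: str
    diff: str


class GitHistoryExtractor:
    def __init__(self):
        self.commits = []
        self.file_changes = []

    def extract_commits(self, repo_path, limit=100):
        """Read up to `limit` of the most recent commits from repo_path."""
        # Imported lazily so the class can be loaded without gitpython installed.
        from git import Repo

        repo = Repo(repo_path)
        for commit in repo.iter_commits(max_count=limit):
            self.add_commit(commit)
        return self.commits

    def add_commit(self, commit):
        """Record one gitpython Commit object and the files it changed."""
        self.commits.append(GitCommit(
            hash=commit.hexsha,
            message=commit.message.strip(),
            author=commit.author.name,
            author_email=commit.author.email,
            date=commit.committed_datetime,
        ))
        if not commit.parents:
            return  # root commit: this sketch skips its file changes
        # Diff from the first parent to this commit, including patch text.
        for diff in commit.parents[0].diff(commit, create_patch=True):
            if diff.new_file:
                change_type = "added"
            elif diff.deleted_file:
                change_type = "deleted"
            elif diff.renamed_file:
                change_type = "renamed"
            else:
                change_type = "modified"
            patch = diff.diff or b""
            if isinstance(patch, bytes):
                patch = patch.decode("utf-8", errors="replace")
            self.file_changes.append(FileChange(
                commit_hash=commit.hexsha,
                file_path=diff.b_path or diff.a_path,
                change_type=change_type,
                diff=patch,
            ))
```

Splitting the work into extract_commits and add_commit is a design choice of this sketch: the second method only relies on the attributes of a commit object, so it can be exercised without a real repository.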

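Let's break down what happens in this method:

Repository Connection: The method opens the repository at the given path by creating a Repo object. Everything else it reads (commits, authors, dates, and diffs) comes from that object.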

Step 5: Using the Extractor

Let’s see how to use this class in a script.

  • We create an instance of GitHistoryExtractor.
  • We specify the path to the repository.
  • We extract up to 10 recent commits.
  • We print out the first 3 commits with their details.
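
The steps above can be sketched as a small helper. The function names summarize and show_recent_commits are hypothetical, and the sketch assumes the extractor exposes an extract_commits(repo_path, limit=...) method that returns records with hash, author, date, and message attributes.

```python
def summarize(commit):
    """One-line summary: short hash, author, date, first line of the message."""
    first_line = commit.message.splitlines()[0] if commit.message else ""
    return f"{commit.hash[:7]} | {commit.author} | {commit.date:%Y-%m-%d %H:%M} | {first_line}"


def show_recent_commits(extractor, repo_path, limit=10, show=3):
    """Extract up to `limit` commits and print the first `show` of them."""
    commits = extractor.extract_commits(repo_path, limit=limit)
    for commit in commits[:show]:
        print(summarize(commit))
    return commits
```

With an extractor object in hand, calling show_recent_commits(extractor, ".") would read up to 10 commits from the repository in the current folder and print the first 3.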

The printed output shows the most recent commits, who made them, and when.


Summary and What’s Next

In this lesson, you learned how to extract git history from a project using Python. You saw how to:

  • Use the gitpython library to access a repository
  • Collect commit details and file changes
  • Organize this information using data classes

This prepares you for the practice exercises, where you will try out these steps yourself and get comfortable working with git history in Python. Understanding git history is a key skill for code review and project analysis, and you are now ready to put it into practice!
