Codebase Scanner Development

Introduction: The Need for a Codebase Scanner

Welcome back! In the previous lesson, you learned about the LLM Code Review Assistant project and how to represent code files and changes using Python data classes. Now, we are ready to take the next step: building a codebase scanner.

A codebase scanner is a tool that automatically finds and reads code files in a project directory. This is important because, before we can review code, we need to know what files exist, which programming languages they use, and what their contents are. Automating this process saves time and reduces the chance of missing important files.

In this lesson, you will learn how to build a simple but effective codebase scanner in Python. This scanner will help us gather all the information we need for future code review tasks.

Quick Recall: Navigating Files and Folders in Python

Before we dive into building the scanner, let’s quickly remind ourselves how Python can work with files and directories. You have already seen how to use Python to open and read files, as well as how to navigate folders.

For example, to visit every file in a directory tree, you can use os.walk() from the os module. On each step, it yields three values:

root is the current folder.
dirs is a list of subfolders in the current folder.
files is a list of files in the current folder.

This is the basic idea we will use to scan a codebase.
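
As a minimal sketch of that idea, the snippet below walks a small temporary folder and collects every file path. The helper name list_files and the sample files are assumptions made for illustration; they are not part of the lesson's project.

```python
import os
import tempfile


def list_files(root_dir):
    """Return the path of every file under root_dir, relative to root_dir."""
    found = []
    for root, dirs, files in os.walk(root_dir):
        # root is the folder being visited; files holds the file names in it.
        for name in files:
            full_path = os.path.join(root, name)
            found.append(os.path.relpath(full_path, root_dir))
    return sorted(found)


# Build a small throwaway tree so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "src"))
    for rel in ("README.md", os.path.join("src", "main.py")):
        with open(os.path.join(tmp, rel), "w", encoding="utf-8") as f:
            f.write("example\n")
    demo_files = list_files(tmp)
    print(demo_files)
```

Returning paths relative to the scanned folder keeps the result independent of where the project lives on disk.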

The CodeFile Dataclass: Storing File Information

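To keep things organized, we use a data structure called a dataclass. In Python, a dataclass is a class marked with the @dataclass decorator: Python generates the constructor and comparison methods for you, which makes it a simple way to group related information together.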

Here is how we define a CodeFile dataclass to store information about each code file:

file_path: The location of the file in the project.
content: The text inside the file.
language: The programming language (like Python or JavaScript).
last_updated: The date and time when the file was last updated.

This makes it easy to keep track of all the important details for each file we scan.
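
A possible definition with these four fields is sketched below. The field names come from the list above; the type annotations (str and datetime) and the sample values are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class CodeFile:
    file_path: str          # location of the file in the project
    content: str            # text inside the file
    language: str           # e.g. "Python" or "JavaScript"
    last_updated: datetime  # assumed type: when the file was last updated


# Sample values chosen only for illustration.
example = CodeFile(
    file_path="src/main.py",
    content="print('hello')\n",
    language="Python",
    last_updated=datetime(2024, 1, 1, 12, 0),
)
print(example.language)
```

Because @dataclass generates the constructor, each field can be passed by name, which keeps call sites readable as the number of fields grows.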

Building the RepositoryScanner: Key Steps

Now, let’s build the main part of our scanner step by step.

1. Detecting File Types Using File Extensions

First, we need a way to figure out which programming language a file uses. We can do this by looking at the file extension (like .py for Python).

self.language_map is a dictionary that matches file extensions to programming languages.
detect_language() checks the file extension and returns the language name, or 'Unknown' if it’s not in the map.
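
One way this lookup could be written is shown below. The specific extensions in the map are assumed examples, and lowercasing the extension is an added design choice so that a name like SCRIPT.PY is still recognized.

```python
import os


class RepositoryScanner:
    def __init__(self):
        # Assumed sample entries; extend the map for the languages you need.
        self.language_map = {
            ".py": "Python",
            ".js": "JavaScript",
            ".ts": "TypeScript",
            ".java": "Java",
        }

    def detect_language(self, file_path):
        """Return the language for file_path, or 'Unknown' if unmapped."""
        _, ext = os.path.splitext(file_path)
        return self.language_map.get(ext.lower(), "Unknown")


scanner = RepositoryScanner()
print(scanner.detect_language("app/main.py"))
print(scanner.detect_language("notes.txt"))
```

Using dict.get() with a default avoids a KeyError for extensions that are not in the map, including files with no extension at all.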

2. Skipping Unnecessary Folders

Some folders, like .git or node_modules, are not useful for code review. We want to skip these.

self.exclude_dirs is a set of folder names to skip.

When walking through the directory, we filter these folders out of the dirs list in place. Because os.walk() reads that same list to decide where to go next, it does not go into the excluded folders.
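
The filtering step might look like the sketch below. For brevity it uses a module-level exclude_dirs set instead of self.exclude_dirs, and the helper name walk_without_excluded and the sample folders are assumptions.

```python
import os
import tempfile

exclude_dirs = {".git", "node_modules"}


def walk_without_excluded(root_dir):
    """Return the folders os.walk() visits, relative to root_dir."""
    visited = []
    for root, dirs, files in os.walk(root_dir):
        # Slice assignment edits the list object that os.walk() holds,
        # so the excluded folders are never entered.
        dirs[:] = [d for d in dirs if d not in exclude_dirs]
        visited.append(os.path.relpath(root, root_dir))
    return sorted(visited)


with tempfile.TemporaryDirectory() as tmp:
    for sub in ("src", ".git", os.path.join("node_modules", "pkg")):
        os.makedirs(os.path.join(tmp, sub))
    visited_dirs = walk_without_excluded(tmp)
    print(visited_dirs)
```

Writing dirs = [...] instead of dirs[:] = [...] would only rebind the local name, and os.walk() would still go into every folder.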

3. Reading File Contents and Handling Errors

Now, let’s read the contents of each file and handle any errors that might occur (such as unreadable files).

We use open() with 'utf-8' encoding to read the file.
If the file cannot be read, we catch the error and print a message.
For each file, we create a CodeFile object and add it to our list.
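
A hedged sketch of this step follows. The helper name read_code_file, the choice to return None on failure, and taking last_updated from the file's modification time are all assumptions rather than the lesson's exact code.

```python
import os
import tempfile
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class CodeFile:
    file_path: str
    content: str
    language: str
    last_updated: datetime


def read_code_file(file_path, language="Unknown") -> Optional[CodeFile]:
    """Read one file; return a CodeFile, or None if it cannot be read."""
    try:
        with open(file_path, "r", encoding="utf-8") as f:
            content = f.read()
        # Assumption: use the file system's modification time.
        last_updated = datetime.fromtimestamp(os.path.getmtime(file_path))
    except (OSError, UnicodeDecodeError) as exc:
        print(f"Error reading {file_path}: {exc}")
        return None
    return CodeFile(file_path, content, language, last_updated)


with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "ok.py")
    bad = os.path.join(tmp, "binary.py")
    with open(good, "w", encoding="utf-8") as f:
        f.write("x = 1\n")
    with open(bad, "wb") as f:
        f.write(b"\xff\xfe\x00")  # bytes that are not valid UTF-8
    good_result = read_code_file(good, "Python")
    bad_result = read_code_file(bad, "Python")
    missing_result = read_code_file(os.path.join(tmp, "missing.py"))
```

Catching OSError covers missing or unreadable files, while UnicodeDecodeError covers files whose bytes are not valid UTF-8, such as binaries with a code-like extension.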

End-to-End Example: Scanning a Project Folder

Let’s see how all these pieces fit together in a real example: pointing the RepositoryScanner at a project folder named my_project.

If the scanner reports 5 files, it found 5 code files in the my_project folder, read their contents, and stored their details in CodeFile objects.
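
To make the whole flow concrete, here is a compact sketch that combines the three steps. The method name scan, the decision to skip files whose language is 'Unknown', and the sample project (built in a temporary folder, so it contains 2 code files rather than 5) are assumptions for illustration.

```python
import os
import tempfile
from dataclasses import dataclass
from datetime import datetime


@dataclass
class CodeFile:
    file_path: str
    content: str
    language: str
    last_updated: datetime


class RepositoryScanner:
    def __init__(self):
        self.language_map = {".py": "Python", ".js": "JavaScript"}
        self.exclude_dirs = {".git", "node_modules"}

    def detect_language(self, file_path):
        ext = os.path.splitext(file_path)[1].lower()
        return self.language_map.get(ext, "Unknown")

    def scan(self, root_dir):
        """Return a CodeFile for every readable, recognized file."""
        code_files = []
        for root, dirs, files in os.walk(root_dir):
            dirs[:] = [d for d in dirs if d not in self.exclude_dirs]
            for name in files:
                path = os.path.join(root, name)
                language = self.detect_language(path)
                if language == "Unknown":
                    continue  # assumption: ignore unrecognized file types
                try:
                    with open(path, "r", encoding="utf-8") as f:
                        content = f.read()
                    updated = datetime.fromtimestamp(os.path.getmtime(path))
                except (OSError, UnicodeDecodeError) as exc:
                    print(f"Error reading {path}: {exc}")
                    continue
                rel_path = os.path.relpath(path, root_dir)
                code_files.append(CodeFile(rel_path, content, language, updated))
        return code_files


# Build a tiny sample project so the example is reproducible.
with tempfile.TemporaryDirectory() as project:
    layout = {
        "app.py": "print('hi')\n",
        os.path.join("web", "index.js"): "console.log('hi');\n",
        os.path.join("node_modules", "lib", "index.js"): "// vendored\n",
        "README.md": "# Demo\n",
    }
    for rel, text in layout.items():
        target = os.path.join(project, rel)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        with open(target, "w", encoding="utf-8") as f:
            f.write(text)
    scanned = RepositoryScanner().scan(project)
    print(f"Found {len(scanned)} code files")
    for code_file in scanned:
        print(f"{code_file.file_path} ({code_file.language})")
```

In this sample, the file under node_modules is skipped by the folder filter and README.md is skipped because its extension is not in the map.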

Summary and What’s Next

In this lesson, you learned how to build a codebase scanner in Python. You saw how to:

Detect programming languages by file extension.
Skip folders that are not useful for code review.
Read file contents safely and handle errors.
Store file information in a structured way using a dataclass.

Next, you will get a chance to practice scanning codebases yourself. This will help you get comfortable with these concepts and prepare you for more advanced code review tasks in future lessons. Good luck, and see you in the practice exercises!
