Welcome back! In the previous lesson, you learned about the LLM Code Review Assistant project and how to represent code files and changes using Python data classes. Now, we are ready to take the next step: building a codebase scanner.
A codebase scanner is a tool that automatically finds and reads code files in a project directory. This is important because, before we can review code, we need to know what files exist, which programming languages they use, and what their contents are. Automating this process saves time and reduces the chance of missing important files.
In this lesson, you will learn how to build a simple but effective codebase scanner in Python. This scanner will help us gather all the information we need for future code review tasks.
Before we dive into building the scanner, let’s quickly remind ourselves how Python can work with files and directories. You have already seen how to use Python to open and read files, as well as how to navigate folders.
For example, to list all files in a directory, you can use the os module:
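A minimal sketch of this idea is shown here. The list_files helper is a hypothetical name introduced for illustration, not part of the lesson's code:

```python
import os


def list_files(base_dir):
    """Return the path of every file under base_dir, including subfolders."""
    found = []
    # os.walk yields one (root, dirs, files) tuple per folder in the tree.
    for root, dirs, files in os.walk(base_dir):
        for name in files:
            # Join the current folder with the file name to get a usable path.
            found.append(os.path.join(root, name))
    return found


if __name__ == "__main__":
    # Walk the current working directory and print each file path.
    for path in list_files("."):
        print(path)
```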
- os.walk() helps you go through every folder and file in a directory tree.
- root is the current folder.
- dirs is a list of subfolders.
- files is a list of files in the current folder.
This is the basic idea we will use to scan a codebase.
To keep things organized, we use a data structure called a dataclass. In Python, a dataclass is a simple way to group related information together.
Here is how we define a CodeFile dataclass to store information about each code file:
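A minimal sketch of such a dataclass follows. The field names come from this lesson; the type annotations (in particular datetime for last_updated) and the sample values are assumptions made for illustration:

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class CodeFile:
    file_path: str          # location of the file in the project
    content: str            # text inside the file
    language: str           # e.g. "Python" or "JavaScript"
    last_updated: datetime  # when the file was last updated (type assumed)


# Sample values below are made up purely to show how an instance is built.
example = CodeFile(
    file_path="src/app.py",
    content="print('hello')\n",
    language="Python",
    last_updated=datetime(2024, 1, 1, 12, 0),
)
print(example.language)
```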
- file_path: The location of the file in the project.
- content: The text inside the file.
- language: The programming language (like Python or JavaScript).
- last_updated: The date and time when the file was last updated.
This makes it easy to keep track of all the important details for each file we scan.
Now, let’s build the main part of our scanner step by step.
First, we need a way to figure out which programming language a file uses. We can do this by looking at the file extension (like .py for Python).
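One way this could look is sketched here. The entries in the map and the lowercasing of the extension are assumptions chosen for illustration; only the names language_map and detect_language, and the 'Unknown' fallback, come from the lesson:

```python
import os


class RepositoryScanner:
    def __init__(self):
        # Extension -> language. These entries are illustrative assumptions.
        self.language_map = {
            ".py": "Python",
            ".js": "JavaScript",
            ".ts": "TypeScript",
            ".java": "Java",
        }

    def detect_language(self, file_path):
        # splitext returns (base, extension); the extension includes the dot.
        _, ext = os.path.splitext(file_path)
        # Lowercase so that "APP.JS" and "app.js" are treated the same.
        return self.language_map.get(ext.lower(), "Unknown")


scanner = RepositoryScanner()
print(scanner.detect_language("src/main.py"))
print(scanner.detect_language("README.md"))
```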
- self.language_map is a dictionary that matches file extensions to programming languages.
- detect_language() checks the file extension and returns the language name, or 'Unknown' if it’s not in the map.
Some folders, like .git or node_modules, are not useful for code review. We want to skip these.
self.exclude_dirs is a set of folder names to skip.
When walking through the directory, we filter out these folders:
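A rough sketch of the filtering step, assuming a particular set of excluded names and a hypothetical visited_folders method that exists only to make the effect visible:

```python
import os


class RepositoryScanner:
    def __init__(self):
        # Folder names to skip; the exact entries are an assumption.
        self.exclude_dirs = {".git", "node_modules", "__pycache__"}

    def visited_folders(self, base_dir):
        """Return every folder os.walk enters (hypothetical helper)."""
        visited = []
        for root, dirs, files in os.walk(base_dir):
            # Slice assignment edits the list in place, so os.walk sees the
            # change and does not descend into the removed folders.
            dirs[:] = [d for d in dirs if d not in self.exclude_dirs]
            visited.append(root)
        return visited
```

The in-place update matters: rebinding the name with dirs = [...] would create a new list and leave the one os.walk() is using untouched.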
- This line updates the dirs list so that os.walk() does not go into excluded folders.
Now, let’s read the contents of each file and handle any errors that might occur (such as unreadable files).
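The reading step might be sketched as follows. The read_code_files helper, the choice of exceptions to catch, and the use of the file's modification time for last_updated are all assumptions for illustration:

```python
import os
from dataclasses import dataclass
from datetime import datetime


@dataclass
class CodeFile:
    file_path: str
    content: str
    language: str
    last_updated: datetime


def read_code_files(paths, language="Unknown"):
    """Read each path and return a list of CodeFile objects (hypothetical helper)."""
    code_files = []
    for path in paths:
        try:
            with open(path, "r", encoding="utf-8") as fh:
                content = fh.read()
            # Assumption: the file's modification time serves as last_updated.
            last_updated = datetime.fromtimestamp(os.path.getmtime(path))
        except (OSError, UnicodeDecodeError) as exc:
            # Missing, unreadable, or non-UTF-8 files are reported and skipped.
            print(f"Error reading {path}: {exc}")
            continue
        code_files.append(CodeFile(path, content, language, last_updated))
    return code_files
```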
- We use open() with 'utf-8' encoding to read the file.
- If the file cannot be read, we catch the error and print a message.
- For each file, we create a CodeFile object and add it to our list.
Let’s see how all these pieces fit together in a real example. Here is how you would use the RepositoryScanner to scan a folder:
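The sketch below assembles the pieces into one class so it can be run on its own. The method name scan_repository, the map and exclusion entries, the decision to skip files whose language is 'Unknown', and the printed message format are all assumptions, not the lesson's exact code:

```python
import os
from dataclasses import dataclass
from datetime import datetime


@dataclass
class CodeFile:
    file_path: str
    content: str
    language: str
    last_updated: datetime


class RepositoryScanner:
    def __init__(self):
        # Illustrative entries only; extend these for a real project.
        self.language_map = {".py": "Python", ".js": "JavaScript"}
        self.exclude_dirs = {".git", "node_modules"}

    def detect_language(self, file_path):
        _, ext = os.path.splitext(file_path)
        return self.language_map.get(ext.lower(), "Unknown")

    def scan_repository(self, repo_path):
        """Walk repo_path and return a CodeFile for each readable code file."""
        code_files = []
        for root, dirs, files in os.walk(repo_path):
            # Prune excluded folders in place so os.walk skips them.
            dirs[:] = [d for d in dirs if d not in self.exclude_dirs]
            for name in files:
                path = os.path.join(root, name)
                language = self.detect_language(path)
                if language == "Unknown":
                    # Assumption: files with unmapped extensions are not code.
                    continue
                try:
                    with open(path, "r", encoding="utf-8") as fh:
                        content = fh.read()
                    last_updated = datetime.fromtimestamp(os.path.getmtime(path))
                except (OSError, UnicodeDecodeError) as exc:
                    print(f"Error reading {path}: {exc}")
                    continue
                code_files.append(CodeFile(path, content, language, last_updated))
        return code_files


if __name__ == "__main__":
    scanner = RepositoryScanner()
    code_files = scanner.scan_repository("my_project")
    print(f"Found {len(code_files)} code files")
    for code_file in code_files:
        print(f"{code_file.file_path} ({code_file.language})")
```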
When you run this code, you might see output like:
This means the scanner found 5 code files in the my_project folder, read their contents, and stored their details in CodeFile objects.
In this lesson, you learned how to build a codebase scanner in Python. You saw how to:
- Detect programming languages by file extension.
- Skip folders that are not useful for code review.
- Read file contents safely and handle errors.
- Store file information in a structured way using a dataclass.
Next, you will get a chance to practice scanning codebases yourself. This will help you get comfortable with these concepts and prepare you for more advanced code review tasks in future lessons. Good luck, and see you in the practice exercises!
