Introduction: The Movie Backend Project

Welcome to "Building the Movie Recommendation Base Backend using Codex"! In this course, we are going to build a powerful backend system capable of recommending movies and TV shows.

To build a recommendation engine, we first need data. We are starting with a raw dataset: a CSV file containing thousands of Netflix titles, including their descriptions, actors, directors, and ratings. However, a simple text file is not powerful enough for a complex application. We need to move this data into a PostgreSQL database.

In this first lesson, we will focus on the "planning and migration" phase. We will not write every single line of code by hand. Instead, we will learn how to leverage Codex, an AI coding assistant, to accelerate our development. We will use Codex to design our database structure, write scripts to load our data, and verify that everything looks correct.

Security First: Working with AI Assistants

Before we start asking Codex to write code for us, we must address a critical topic: Security.

When you use an AI tool like Codex, you are sending data to an external server for processing. This leads to the "Golden Rule" of AI development:

Never send secrets, passwords, or Personally Identifiable Information (PII) to an AI assistant.

Let's look at an example of an unsafe prompt:
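A prompt in this shape is what we want to avoid (the hostname and password here are invented for illustration):

```text
Write a Python function that connects to my PostgreSQL database at
db.example.com. The username is admin and the password is SuperSecret123!
```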

This is dangerous because you just shared your password. If the AI logs this conversation, your password is no longer private.

Here is the safe way to ask the same question:
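A safer version of the same request (the variable names below are a common convention, not a requirement):

```text
Write a Python function that connects to a PostgreSQL database. Read the
host, username, and password from the DB_HOST, DB_USER, and DB_PASSWORD
environment variables.
```

The real values then live only in your shell environment, never in the conversation.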

By using placeholders or environment variables, we get the code we need without exposing our secrets. Throughout this course, we will practice this safe approach.

Designing the Database Schema with Codex

A database needs a blueprint, known as a Schema. This defines what tables we have and how they relate to each other. Our raw data is in a CSV file, which is just one big table. However, a movie might have multiple actors, and an actor might be in multiple movies.

To handle this efficiently, we need to "normalize" our data. This means breaking it into smaller, related tables. We can ask Codex to help us design this.

We might prompt Codex like this:

"I have a dataset of movies with columns for title, director, cast, and country. The 'cast' column contains comma-separated lists of actors. Please generate a PostgreSQL schema that normalizes this data."

Codex will generate SQL code to create these tables. Let's look at how this structure is built, step by step.

First, it creates the main table for the shows:
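A plausible version of that table looks like this (a sketch; the exact columns and names Codex generates may differ):

```sql
CREATE TABLE shows (
    show_id      SERIAL PRIMARY KEY,
    title        TEXT NOT NULL,
    show_type    TEXT,           -- 'Movie' or 'TV Show'
    director     TEXT,
    country      TEXT,
    release_year INTEGER,
    rating       TEXT,
    description  TEXT
);
```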

This table holds the core information for every movie or TV show.

Next, instead of storing a long list of actors inside the shows table, Codex creates a separate table just for actors:
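For example (again a sketch, not the only valid design):

```sql
CREATE TABLE actors (
    actor_id SERIAL PRIMARY KEY,
    name     TEXT NOT NULL UNIQUE   -- each actor is stored exactly once
);
```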

Finally, to link them together, it creates a "junction" table. This table says, "This specific Actor is in this specific Show."
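A junction table for this relationship might look like:

```sql
CREATE TABLE show_actors (
    show_id  INTEGER NOT NULL REFERENCES shows (show_id),
    actor_id INTEGER NOT NULL REFERENCES actors (actor_id),
    PRIMARY KEY (show_id, actor_id)   -- one row per actor-in-show pairing
);
```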

By using this structure, our database is organized and efficient. We save this blueprint as schema.sql.

Automating Data Loading

Now that we have a database structure, we need to move the data from our CSV file into these new tables. Doing this manually for thousands of movies would take forever. We need a Python script to automate it. Here is an overview of the functionality of the script:

```mermaid
flowchart TD
    B[Parse command-line arguments] --> C[Load CSV file into DataFrame]
    C --> D[Validate expected columns]
    D --> E[Normalize DataFrame into tables]
    E --> F{--create-tables flag?}
    F -- Yes --> G[Run schema.sql to create tables]
    F -- No --> I
    G --> I[Bulk insert data into Postgres tables]
```

Let's look at the logic Codex uses to build this script, load_netflix.py.

Reading the Data

First, the script needs to load the raw data. Codex often suggests using a library called pandas for this, as it is very good at handling CSV files.
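A minimal sketch of that loading step (`load_csv` is a hypothetical helper name, and the expected column list is an assumption based on the Netflix dataset):

```python
import pandas as pd

def load_csv(path: str) -> pd.DataFrame:
    """Load the raw titles CSV and fail fast if key columns are missing."""
    df = pd.read_csv(path)
    expected = {"title", "director", "cast", "country"}
    missing = expected - set(df.columns)
    if missing:
        raise ValueError(f"CSV is missing expected columns: {sorted(missing)}")
    return df
```

Validating the columns up front gives a clear error message instead of a confusing failure halfway through the migration.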

Processing the Data

The raw data has strings like "Actor A, Actor B, Actor C." The script needs to split these into individual names so they can go into our actors table.
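The splitting logic might look like this (`split_cast` is a hypothetical helper name; note that pandas represents empty cast cells as NaN floats, not strings):

```python
import math

def split_cast(cast_value) -> list[str]:
    """Turn 'Actor A, Actor B' into ['Actor A', 'Actor B']."""
    # Missing cast cells come through pandas as NaN (a float), not a string.
    if cast_value is None or (isinstance(cast_value, float) and math.isnan(cast_value)):
        return []
    return [name.strip() for name in str(cast_value).split(",") if name.strip()]
```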

Inserting into the Database

Finally, the script connects to Postgres and inserts the data. It separates the data into lists: one list for movies, one for actors, and one for the connections between them.
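A sketch of that shaping-and-inserting logic (`build_rows` and `insert_all` are hypothetical names; the bulk insert uses psycopg2's `execute_values`, and IDs are assigned in Python so the junction rows can reference them directly):

```python
def build_rows(records):
    """Split raw records into the three lists our normalized tables need.

    `records` is a list of dicts with a 'title' and a 'cast' list of names.
    Returns (show_rows, actor_rows, link_rows) with explicit integer IDs.
    """
    show_rows, actor_ids, link_rows = [], {}, []
    for show_id, rec in enumerate(records, start=1):
        show_rows.append((show_id, rec["title"]))
        for name in rec.get("cast", []):
            if name not in actor_ids:
                actor_ids[name] = len(actor_ids) + 1  # new actor gets the next ID
            link_rows.append((show_id, actor_ids[name]))
    actor_rows = [(i, n) for n, i in actor_ids.items()]
    return show_rows, actor_rows, link_rows


def insert_all(conn, show_rows, actor_rows, link_rows):
    """Bulk-insert all three lists; assumes the schema tables already exist."""
    from psycopg2.extras import execute_values  # local import: only needed with a live DB
    with conn.cursor() as cur:
        execute_values(cur, "INSERT INTO shows (show_id, title) VALUES %s", show_rows)
        execute_values(cur, "INSERT INTO actors (actor_id, name) VALUES %s", actor_rows)
        execute_values(cur,
                       "INSERT INTO show_actors (show_id, actor_id) VALUES %s",
                       link_rows)
    conn.commit()
```

Generating the IDs in Python avoids a round trip to fetch the database-generated keys before the junction rows can be written.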

By running this script, we transform our messy text file into a structured, queryable database.

Verifying the Migration

After running a migration script, we should never assume it worked perfectly. We need to verify the results. We can ask Codex to write a "sanity check" script, check_import.py.

This script performs simple queries to ensure the data looks right.

Checking Counts

The most basic check is counting the rows. If our CSV had 8,000 titles, our database should also have 8,000 titles.
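In SQL terms, the check boils down to simple counts (table names assume the schema sketched by Codex):

```sql
SELECT COUNT(*) FROM shows;    -- should equal the number of rows in the CSV
SELECT COUNT(*) FROM actors;   -- should equal the number of distinct actor names
```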

Inspecting Samples

We also want to see a few actual records to ensure the relationships are working. For example, if we query a movie, do we get the correct actors back?
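A spot-check query along these lines would do it (column names assume a shows/actors/show_actors layout; yours may differ):

```sql
SELECT s.title, a.name
FROM shows s
JOIN show_actors sa ON sa.show_id = s.show_id
JOIN actors a       ON a.actor_id = sa.actor_id
ORDER BY s.title
LIMIT 10;
```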

If this script prints out movie titles with their correct cast members, we know our migration was a success.

Summary and Next Steps

In this lesson, we established the foundation for our movie recommendation backend.

  1. Security: We learned never to share secrets with AI tools.
  2. Planning: We used Codex to design a normalized SQL schema (schema.sql) that organizes our data efficiently.
  3. Automation: We generated a Python script (load_netflix.py) to clean and load our CSV data into Postgres.
  4. Verification: We created a check script (check_import.py) to confirm the data landed in Postgres correctly.

Now it is time for you to put this into practice. In the upcoming exercises, you will use Codex to generate these scripts and run the migration yourself. This is the first step in building a robust backend!
