In the previous lesson, you built a comprehensive data cleaning pipeline that systematically filled missing values with the placeholder "Unknown." You created the clean_data.py script that transformed your raw Netflix dataset by replacing empty director, cast, country, and rating fields with a consistent default value. This approach worked well for handling the bulk of your missing data, and you learned how to organize cleaning logic into reusable functions, add validation, and save your cleaned results to a new file.
However, filling missing values with "Unknown" has limitations. While it allows you to proceed with analysis and prevents errors from empty fields, it doesn't actually recover the missing information. When you analyze directors and see that "Unknown" is the most common value in your dataset, that's not particularly insightful. When you try to understand when titles were added to Netflix but 10 records are missing their dates entirely, you're losing potentially valuable information. The placeholder approach is honest about what you don't know, but it doesn't help you find out what you could know.
This is where the distinction between data cleaning and data enrichment becomes important. Data cleaning is about handling problems in your existing data, like filling gaps with defaults, removing duplicates, or correcting obvious errors. Data enrichment goes a step further by actually finding missing information from external sources and adding it to your dataset. Instead of marking a missing date as "Unknown," data enrichment means searching for when that title was actually added to Netflix and recording the real date.
After running your cleaning script from the previous lesson, you still have 13 missing values in your dataset. Specifically, 10 titles are missing their date_added values, and 3 titles are missing their duration information. These represent only 0.01% of all cells in your dataset, but they're fields where having the actual values would be more useful than having "Unknown." A date added of "Unknown" doesn't help you analyze Netflix's content acquisition patterns over time. A duration of "Unknown" doesn't help you understand whether a title is a quick watch or a longer commitment.
This is where Codex's web search capability becomes valuable. Codex is not just an AI coding assistant that helps you write code based on its training data. It can also search the internet to find current information, recent documentation, and specific data points that aren't in its training set. This means you can ask Codex to help you find the actual dates when specific Netflix titles were added or the actual runtime of movies and shows that are missing duration information.
In this lesson, you'll learn how to use Codex's --search option to find real values for your missing data. You'll discover how to craft effective prompts that guide Codex to search for specific information, how to verify that the information you find is accurate, and how to use Codex to generate data overviews in formats that help you understand your enriched dataset. By the end of this lesson, you'll have moved beyond basic data cleaning to intelligent data enrichment, transforming your dataset from one with placeholders to one with actual, verified information.
The skills you learn today extend beyond this specific Netflix dataset. Any time you have missing data that could potentially be found through research, you can use these same techniques. Whether you're enriching customer records with publicly available information, filling in product details from manufacturer websites, or adding geographic coordinates to address data, the pattern of using AI-assisted web search to enrich your data is broadly applicable.
Codex includes a powerful web search feature that allows it to retrieve fresh information from the internet when needed. This capability transforms Codex from a tool that only knows what was in its training data to one that can access current information, recent documentation, and specific data points that you need for your work.
By default, Codex operates without internet access for security reasons. This isolation ensures that Codex can't accidentally expose sensitive information or access resources it shouldn't. However, when you need Codex to search the web, you can enable this functionality through the --search flag.
To enable web search when running Codex from the command line, you simply add the --search flag to your command.
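For example, a search-enabled session can be started like this (the prompt here is just an illustration; substitute whatever question you need answered):

```shell
# The --search flag permits Codex to perform web searches during this run
codex --search "When was the movie Inception added to Netflix in the US?"
```

Without the flag, the same command would run normally, but Codex would answer only from its training data.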
The --search flag tells Codex that it's allowed to perform web searches to answer your question. Without this flag, Codex would only be able to respond based on its training data, which might not include specific information about when individual titles were added to Netflix.
Web search is particularly useful in several scenarios. When you need to access the latest documentation for a library or framework, Codex can search for and retrieve current information rather than relying on potentially outdated training data. When you're looking for recent code examples or best practices, web search ensures you're getting current approaches rather than older patterns. And most relevant to your current task, when you need to find specific data points like dates, durations, or other factual information, web search allows Codex to look up these details from authoritative sources.
It's important to understand that web search should be used thoughtfully. Not every question requires searching the internet. If you're asking about fundamental programming concepts, standard library functions, or general coding patterns, Codex's built-in knowledge is usually sufficient and faster. Web search is most valuable when you need current information, specific data points, or details about recent changes and updates.
Now that you understand how Codex's web search capability works, let's apply it to finding the actual missing values in your Netflix dataset. You have 10 titles missing their date_added values and 3 titles missing their duration information. Rather than leaving these as "Unknown" or making up values, you can use Codex to search for the real information.
Let's start with finding missing date_added values. First, you need to identify which titles are missing this information. You can do this by loading your dataset and filtering for rows where date_added is null:
This will show you the titles that need date information.
Now you can use Codex to search for when each of these titles was added to Netflix. The key to getting good results is crafting effective prompts. You want to be specific about what you're looking for and provide enough context for Codex to find the right information.
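A well-crafted prompt names the exact title, gives disambiguating context like the release year and content type, and states the format you want back. The title below is a placeholder; substitute one of the titles your filtering step surfaced:

```shell
# Specific title, disambiguating context, and an explicit output format
codex --search "Find the exact date the TV show 'Example Title' (released 2019) \
was added to Netflix in the US. Report it in 'Month D, YYYY' format and name \
the source you used."
```

Asking Codex to name its source makes the next step, verifying the information, much easier.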
You've now completed the journey from basic data cleaning to intelligent data enrichment, transforming your Netflix dataset from one with placeholder values to one with actual, verified information. You started this lesson by understanding the limitations of filling missing values with "Unknown" and recognizing the difference between data cleaning, which handles problems in existing data, and data enrichment, which finds missing information from external sources.
You learned how to use Codex's web search capability to find real values for your missing data. You discovered how to enable web search using the --search flag. You explored appropriate use cases for web search, understanding that it's most valuable when you need current information, specific data points, or details about recent changes.
In the upcoming practice exercises, you'll apply everything you've learned to enrich your own dataset. You'll use Codex's web search to find missing dates and durations, verify the accuracy of the information you find, and document your enrichment work. These exercises will give you hands-on experience with the complete data enrichment workflow, solidifying your understanding and building your confidence in using these techniques on real-world data challenges.
