Welcome back! In the previous lessons, you learned how DeepResearcher is structured and how it generates search queries using OpenAI. Now, you are ready for the next step: making sense of the information you collect from the web.
When you run a search, you get a lot of web pages. Not all of them are helpful. Some might be off-topic, and others might have only a small piece of useful information. That’s why parsing (breaking down) and selecting (choosing) the right information are so important. In this lesson, you’ll learn how DeepResearcher uses AI to filter out the noise and keep only what matters for your research question.
By the end of this lesson, you’ll understand how to:
- Decide if a web page is useful for your research.
- Extract only the relevant information from a web page.
- Use these steps in your own code.
Let’s get started!
Now that we have web content, the first thing we need to do is decide: Is this page useful for our research question?
DeepResearcher uses a language model (LLM) to help with this. It sends a special prompt to the LLM, asking it to answer with just `Yes` or `No` to the question: “Is this page relevant to the user’s query?”
Let’s look at how this works in code, step by step.
We need to give the LLM two things:
- The user’s original research question.
- The content of the web page.
Here’s how we set up these variables:
- `user_query` is the question the user asked.
- `page_text[:20000]` is the first 20,000 characters of the web page content. We limit the length to avoid sending too much data to the LLM.
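Putting that together, a minimal sketch of the setup might look like this (the sample query and page text are illustrative stand-ins for real values):

```python
# Illustrative inputs for the relevance check.
user_query = "What are the health benefits of green tea?"
page_text = (
    "Green tea is rich in catechins, antioxidants that have been "
    "studied for their effects on heart health."
)

# Keep only the first 20,000 characters so the prompt stays within
# a reasonable size for the LLM.
truncated_page = page_text[:20000]
```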
We use a function called `generate_boolean` to send our prompt and variables to the LLM. You will write this prompt yourself in the exercises of this unit.
- If the LLM thinks the page is useful, it returns `Yes`.
- If not, it returns `No`.
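Sketched out, the check might look like this. The real `generate_boolean` calls the LLM; here it is faked with a simple keyword test so the example runs on its own, and the prompt wording is an assumption (you will write the real one in the exercises):

```python
def generate_boolean(prompt: str) -> bool:
    """Stand-in for the LLM call: the real version sends `prompt` to
    the model and maps its Yes/No answer to a boolean. This fake just
    looks for a keyword in the page content."""
    return "catechin" in prompt.lower()

user_query = "What are the health benefits of green tea?"
page_text = "Green tea is rich in catechins, antioxidants studied for heart health."

# Assumed prompt wording -- the exercises ask you to write the real one.
prompt = (
    f"User query: {user_query}\n"
    f"Page content: {page_text[:20000]}\n"
    "Is this page relevant to the user's query? Answer Yes or No."
)

is_useful = generate_boolean(prompt)
print("Yes" if is_useful else "No")  # prints: Yes
```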
Example output: for a relevant page, the LLM answers `Yes`; for an off-topic one, `No`.
This way, we can quickly filter out pages that don’t help answer the user’s question.
Once we know a page is useful, the next step is to pull out only the information that answers the user’s question. We don’t want to keep the whole page — just the relevant parts.
DeepResearcher uses another prompt for this that you will also need to write, called the extractor prompt. Let’s see how this works.
We need to give the LLM:
- The user’s research question.
- The search query that led to this page.
- The content of the web page.
Here’s how we set up these variables:
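For example (the sample values are illustrative, and the variable names beyond `user_query` and `page_text` are assumptions):

```python
# Illustrative inputs for the extraction step.
user_query = "What are the health benefits of green tea?"   # the user's research question
search_query = "green tea antioxidants health"              # the query that surfaced this page
page_text = "Green tea is rich in catechins, antioxidants studied for heart health."

page_content = page_text[:20000]  # same length limit as before
```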
We use the `generate_response` function to send our prompt and variables to the LLM. This function uses a prompt that tells the LLM to act as an expert information extractor.
- The LLM reads the web page and returns only the parts that are relevant to the user’s question.
- The result, `context`, is a string containing the extracted information.
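A sketch of the extraction call, again with a stand-in: the real `generate_response` sends the extractor prompt to the LLM, while this fake returns a fixed sentence so the example is self-contained. The prompt wording is an assumption:

```python
def generate_response(prompt: str) -> str:
    """Stand-in for the LLM call: the real version returns the model's
    extraction; this fake returns a fixed relevant sentence."""
    return "Green tea is rich in catechins, antioxidants studied for heart health."

user_query = "What are the health benefits of green tea?"
search_query = "green tea antioxidants health"
page_text = (
    "Green tea is rich in catechins, antioxidants studied for heart health. "
    "The page also lists brewing tips and shipping details."
)

# Assumed extractor prompt -- you will write the real one in the exercises.
prompt = (
    "You are an expert information extractor.\n"
    f"User query: {user_query}\n"
    f"Search query: {search_query}\n"
    f"Page content: {page_text[:20000]}\n"
    "Return only the parts of the page that answer the user's query."
)

context = generate_response(prompt)  # a plain string of extracted text
```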
Example output: `context` holds a short string, such as the few sentences from the page that directly address the user’s question.
This makes it much easier to build a final report or summary later.
Let’s see how these steps fit together in the main code. We’ll focus on the part of the code that checks if a page is useful and then extracts the relevant information.
Here’s a simplified version of the process:
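The snippet below is an illustrative reconstruction of that loop. `fetch_pages` and both LLM helpers are stubs standing in for DeepResearcher’s real web search and model calls, so the example runs on its own; the real code will differ in its details:

```python
def fetch_pages(query: str) -> list[str]:
    """Stub: the real code performs a web search and returns page texts."""
    return [
        "",  # an empty page, to show the content check below
        "Green tea is rich in catechins, antioxidants studied for heart health.",
        "A page about car maintenance schedules.",
    ]

def generate_boolean(prompt: str) -> bool:
    """Stub relevance check: the real version asks the LLM for Yes/No."""
    return "catechin" in prompt.lower()

def generate_response(prompt: str) -> str:
    """Stub extractor: the real version asks the LLM to pull out the
    relevant passages; here we just echo back the page content."""
    return prompt.split("Page content: ", 1)[1]

user_query = "What are the health benefits of green tea?"
search_queries = ["green tea health benefits"]
collected_context = []

for search_query in search_queries:
    for page_text in fetch_pages(search_query):
        if not page_text:  # skip pages with no content
            continue
        relevance_prompt = (
            f"User query: {user_query}\n"
            f"Page content: {page_text[:20000]}\n"
            "Is this page relevant? Answer Yes or No."
        )
        if generate_boolean(relevance_prompt):
            extract_prompt = (
                f"User query: {user_query}\n"
                f"Search query: {search_query}\n"
                f"Page content: {page_text[:20000]}"
            )
            collected_context.append(generate_response(extract_prompt))

print(len(collected_context))  # prints: 1
```

Of the three stubbed pages, the empty one is skipped by the content check, the car-maintenance page fails the relevance check, and only the green-tea page is extracted and saved.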
Let’s break this down:
- For each search query, we get a few web pages.
- For each page, we check if it has any content.
- We ask the LLM if the page is useful for the user’s question.
- If the answer is `Yes`, we ask the LLM to extract the relevant information.
- We save the extracted information for later use.
This process helps us build a collection of only the most useful and relevant information for the user’s research.
In this lesson, you learned how DeepResearcher filters and extracts useful information from web pages. You saw how to:
- Use the LLM to decide if a page is relevant to the user’s question.
- Extract only the key information from useful pages.
- Combine these steps in code to build a focused research tool.
Next, you’ll get to practice these skills yourself. You’ll work with real code to filter and extract information, just like we did here. Take a moment to review the examples, and get ready to try it out on your own!
