Welcome back! In the previous lessons, you learned how DeepResearcher is structured and how it generates search queries using OpenAI. Now, you are ready for the next step: making sense of the information you collect from the web.
When you run a search, you get a lot of web pages. Not all of them are helpful. Some might be off-topic, and others might have only a small piece of useful information. That’s why parsing (breaking down) and selecting (choosing) the right information are so important. In this lesson, you’ll learn how DeepResearcher uses AI to filter out the noise and keep only what matters for your research question.
By the end of this lesson, you’ll understand how to:
- Decide if a web page is useful for your research.
- Extract only the relevant information from a web page.
- Use these steps in your own code.
Let’s get started!
Now that we have web content, the first thing we need to do is decide: Is this page useful for our research question?
DeepResearcher uses a language model (LLM) to help with this. It sends a special prompt to the LLM, asking it to answer with just `Yes` or `No` to the question: “Is this page relevant to the user’s query?”
Let’s look at how this works in code, step by step.
We need to give the LLM two things:
- The user’s original research question.
- The content of the web page.
Here’s how we set up these variables:
- `user_query` is the question the user asked.
- `page_text[:20000]` is the first 20,000 characters of the web page content. We limit the length to avoid sending too much data to the LLM.
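Putting that together, a minimal sketch of the setup might look like this (the sample query and page text are illustrative stand-ins for real values):

```python
# Illustrative inputs for the relevance check.
user_query = "What are the health benefits of green tea?"
page_text = (
    "Green tea is rich in catechins, antioxidants that have been "
    "studied for their effects on heart health."
)

# Keep only the first 20,000 characters so the prompt stays within
# a reasonable size for the LLM.
truncated_page = page_text[:20000]
```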
We use a function called `generate_boolean` to send our prompt and variables to the LLM. You will write this prompt yourself in the exercises of this unit.
- If the LLM thinks the page is useful, it returns `Yes`.
- If not, it returns `No`.
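Sketched out, the check might look like this. The real `generate_boolean` calls the LLM; here it is faked with a simple keyword test so the example runs on its own, and the prompt wording is an assumption (you will write the real one in the exercises):

```python
def generate_boolean(prompt: str) -> bool:
    """Stand-in for the LLM call: the real version sends `prompt` to
    the model and maps its Yes/No answer to a boolean. This fake just
    looks for a keyword in the page content."""
    return "catechin" in prompt.lower()

user_query = "What are the health benefits of green tea?"
page_text = "Green tea is rich in catechins, antioxidants studied for heart health."

# Assumed prompt wording -- the exercises ask you to write the real one.
prompt = (
    f"User query: {user_query}\n"
    f"Page content: {page_text[:20000]}\n"
    "Is this page relevant to the user's query? Answer Yes or No."
)

is_useful = generate_boolean(prompt)
print("Yes" if is_useful else "No")  # prints: Yes
```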
Example output: for a relevant page, the LLM answers `Yes`; for an off-topic one, `No`.
This way, we can quickly filter out pages that don’t help answer the user’s question.
Once we know a page is useful, the next step is to pull out only the information that answers the user’s question. We don’t want to keep the whole page — just the relevant parts.
DeepResearcher uses another prompt for this that you will also need to write, called the extractor prompt. Let’s see how this works.
We need to give the LLM:
- The user’s research question.
- The search query that led to this page.
- The content of the web page.
Here’s how we set up these variables:
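For example (the sample values are illustrative, and the variable names beyond `user_query` and `page_text` are assumptions):

```python
# Illustrative inputs for the extraction step.
user_query = "What are the health benefits of green tea?"   # the user's research question
search_query = "green tea antioxidants health"              # the query that surfaced this page
page_text = "Green tea is rich in catechins, antioxidants studied for heart health."

page_content = page_text[:20000]  # same length limit as before
```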
We use the `generate_response` function to send our prompt and variables to the LLM. This function uses a prompt that tells the LLM to act as an expert information extractor.
- The LLM reads the web page and returns only the parts that are relevant to the user’s question.
- The result, `context`, is a string containing the extracted information.
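A sketch of the extraction call, again with a stand-in: the real `generate_response` sends the extractor prompt to the LLM, while this fake returns a fixed sentence so the example is self-contained. The prompt wording is an assumption:

```python
def generate_response(prompt: str) -> str:
    """Stand-in for the LLM call: the real version returns the model's
    extraction; this fake returns a fixed relevant sentence."""
    return "Green tea is rich in catechins, antioxidants studied for heart health."

user_query = "What are the health benefits of green tea?"
search_query = "green tea antioxidants health"
page_text = (
    "Green tea is rich in catechins, antioxidants studied for heart health. "
    "The page also lists brewing tips and shipping details."
)

# Assumed extractor prompt -- you will write the real one in the exercises.
prompt = (
    "You are an expert information extractor.\n"
    f"User query: {user_query}\n"
    f"Search query: {search_query}\n"
    f"Page content: {page_text[:20000]}\n"
    "Return only the parts of the page that answer the user's query."
)

context = generate_response(prompt)  # a plain string of extracted text
```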
Example output: `context` holds a short string, such as the few sentences from the page that directly address the user’s question.
This makes it much easier to build a final report or summary later.
Let’s see how these steps fit together in the main code. We’ll focus on the part of the code that checks if a page is useful and then extracts the relevant information.
Here’s a simplified version of the process:
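The snippet below is an illustrative reconstruction of that loop. `fetch_pages` and both LLM helpers are stubs standing in for DeepResearcher’s real web search and model calls, so the example runs on its own; the real code will differ in its details:

```python
def fetch_pages(query: str) -> list[str]:
    """Stub: the real code performs a web search and returns page texts."""
    return [
        "",  # an empty page, to show the content check below
        "Green tea is rich in catechins, antioxidants studied for heart health.",
        "A page about car maintenance schedules.",
    ]

def generate_boolean(prompt: str) -> bool:
    """Stub relevance check: the real version asks the LLM for Yes/No."""
    return "catechin" in prompt.lower()

def generate_response(prompt: str) -> str:
    """Stub extractor: the real version asks the LLM to pull out the
    relevant passages; here we just echo back the page content."""
    return prompt.split("Page content: ", 1)[1]

user_query = "What are the health benefits of green tea?"
search_queries = ["green tea health benefits"]
collected_context = []

for search_query in search_queries:
    for page_text in fetch_pages(search_query):
        if not page_text:  # skip pages with no content
            continue
        relevance_prompt = (
            f"User query: {user_query}\n"
            f"Page content: {page_text[:20000]}\n"
            "Is this page relevant? Answer Yes or No."
        )
        if generate_boolean(relevance_prompt):
            extract_prompt = (
                f"User query: {user_query}\n"
                f"Search query: {search_query}\n"
                f"Page content: {page_text[:20000]}"
            )
            collected_context.append(generate_response(extract_prompt))

print(len(collected_context))  # prints: 1
```

Of the three stubbed pages, the empty one is skipped by the content check, the car-maintenance page fails the relevance check, and only the green-tea page is extracted and saved.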
Let’s break this down:
- For each search query, we get a few web pages.
- For each page, we check if it has any content.
- We ask the LLM if the page is useful for the user’s question.
- If the answer is `Yes`, we ask the LLM to extract the relevant information.
- We save the extracted information for later use.
This process helps us build a collection of only the most useful and relevant information for the user’s research.
In this lesson, you learned how DeepResearcher filters and extracts useful information from web pages. You saw how to:
- Use the LLM to decide if a page is relevant to the user’s question.
- Extract only the key information from useful pages.
- Combine these steps in code to build a focused research tool.
Next, you’ll get to practice these skills yourself. You’ll work with real code to filter and extract information, just like we did here. Take a moment to review the examples, and get ready to try it out on your own!
