Welcome back! So far, you have learned how to search the web using Python and how to build a module that fetches and processes web content. In this lesson, we will focus on making your web searcher more reliable by avoiding common mistakes that can cause problems in automated research.
When you build tools that interact with the web, you will often run into issues like duplicate results, broken links, or slow responses. If you do not handle these problems, your tool might waste time, give you bad data, or even stop working. By learning how to avoid these pitfalls, you will make your web searcher much more robust and useful.
Let’s quickly remind ourselves what our web searcher does. In the previous lessons, you learned how to:
- Use the DDGS library to search the web for a query.
- Fetch the content of the top search results using the `httpx` library.
- Convert the HTML content of each page to Markdown using `html_to_markdown`.
All of these steps are combined in a function that takes a search query and returns a list of results, each with a title, URL, and Markdown content.
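As a reminder, a rough sketch of that combined function might look like the one below. The function name (`search_and_fetch`), the exact import path for DDGS, and the result keys (`title`, `href`) are assumptions based on the earlier lessons and may differ slightly from your version:

```python
from ddgs import DDGS  # import path may differ (ddgs vs duckduckgo_search)
import httpx
from html_to_markdown import convert_to_markdown


def search_and_fetch(query: str, max_results: int = 3) -> list[dict]:
    """Search the web and return the top results as Markdown pages."""
    markdown_pages = []
    for result in DDGS().text(query, max_results=max_results):
        url = result.get("href", "")
        response = httpx.get(url, follow_redirects=True)
        markdown_pages.append({
            "title": result.get("title", ""),
            "url": url,
            "content": convert_to_markdown(response.text),
        })
    return markdown_pages
```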
Now, let’s see how we can improve this process by handling some common issues.
One common problem is processing the same web page more than once. This can happen if the same URL appears in multiple searches or if your code is run multiple times. To avoid this, we need a way to remember which pages we have already visited.
Let’s start by creating a set to keep track of visited URLs:
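A minimal version, assuming the set lives at module level next to your search function, might look like this:

```python
# Module-level set of URLs that have already been fetched and processed.
_visited_pages: set[str] = set()
```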
Here, `_visited_pages` is a set that will store the URLs of pages we have already processed. Sets are useful because they do not allow duplicate values, and checking whether a value is in a set is very fast.
Now, before we fetch a page, we check whether its URL is already in `_visited_pages`. If it is, we skip it:
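Inside the loop over search results, that check might look like this (assuming each result is a dictionary with an `href` key, as in the sketch above):

```python
for result in results:
    url = result.get("href", "")

    # Skip missing/empty URLs and pages we have already processed.
    if not url or url in _visited_pages:
        continue

    # ... fetch and convert the page here ...
```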
- `not url` checks if the URL is missing or empty.
- `url in _visited_pages` checks if we have already seen this URL.
- If either is true, we use `continue` to skip to the next result.
After we successfully fetch and process a page, we add its URL to the set:
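For example, right after the page has been converted to Markdown, you might add:

```python
# Remember this URL so later searches (or reruns) can skip it.
_visited_pages.add(url)
```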
This way, we make sure we do not process the same page twice.
Example Output:
If you search for "Python programming" twice, any URLs visited during the first search will be skipped the second time, making your searcher more efficient.
Another common issue is that some web pages might be broken, slow, or unreachable. If your code tries to fetch a page and something goes wrong, it could crash or get stuck.
To handle this, we use a `try` and `except` block when fetching each page:
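Putting the pieces together, the loop body might look roughly like this. This is a sketch under the same assumptions as the earlier code, not the only way to write it:

```python
for result in results:
    url = result.get("href", "")
    if not url or url in _visited_pages:
        continue

    try:
        response = httpx.get(url, follow_redirects=True)
        response.raise_for_status()
        content = convert_to_markdown(response.text)
    except Exception as exc:
        # Even on failure, remember the URL so we do not retry it,
        # and record an error entry instead of crashing.
        _visited_pages.add(url)
        markdown_pages.append({
            "title": result.get("title", ""),
            "url": url,
            "content": f"Error fetching page: {exc}",
        })
        continue

    _visited_pages.add(url)
    markdown_pages.append({
        "title": result.get("title", ""),
        "url": url,
        "content": content,
    })
```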
Let’s break this down:
- The `try` block attempts to fetch the page and convert it to Markdown.
- If anything goes wrong (for example, the page does not load or the server returns an error), the code jumps to the `except` block.
- In the `except` block, we still add the URL to `_visited_pages` so we do not try it again.
- We also add a result to `markdown_pages` with an error message, so we know what went wrong.
Example Output:
If a page is broken, the output might look like this:
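The exact wording depends on how the page fails and on your httpx version, but the error entry might look roughly like this (the title and URL are made up for illustration):

```python
{
    "title": "Some Broken Page",
    "url": "https://example.com/broken",
    "content": "Error fetching page: Server error '500 Internal Server Error' for url 'https://example.com/broken'"
}
```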
This way, your tool does not crash, and you get a clear message about what happened.
Sometimes, a web page might take a long time to respond, or you might get results that are not appropriate for your research. To handle these issues, we use timeouts and safe search settings.
When fetching a page, we set a timeout:
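For example, assuming a 10-second limit (pick whatever limit fits your use case):

```python
# Give up on the request if the server takes longer than 10 seconds.
response = httpx.get(url, timeout=10.0, follow_redirects=True)
```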
- The `timeout` parameter makes sure that if a page takes too long to load, your code will stop waiting and move on. This keeps your tool fast and responsive.
When searching with DDGS, we can also set the `safesearch` and `region` parameters:
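A sketch of that call, assuming the `text()` method used in the earlier lessons (the parameter values here are just examples):

```python
results = DDGS().text(
    query,
    max_results=3,
    safesearch="moderate",  # filter out explicit results
    region="us-en",         # bias results toward a region/language
)
```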
- `safesearch` helps filter out inappropriate or irrelevant results.
- `region` can help you get results that are more relevant to your location or language.
Example Output:
If a page takes too long to load, you might see an error like:
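The exact message depends on your httpx version and on where the timeout happens, but the error entry might look something like this (title and URL invented for illustration):

```python
{
    "title": "Very Slow Page",
    "url": "https://example.com/slow",
    "content": "Error fetching page: timed out"
}
```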
This shows that your timeout setting is working as expected.
In this lesson, you learned how to make your web searcher more reliable by:
- Tracking and skipping already-visited URLs to avoid duplicates.
- Handling errors when fetching pages, so your tool does not crash or get stuck.
- Using timeouts and safe search settings to keep your searches fast and appropriate.
These improvements will help you build a more robust and efficient research tool. In the next practice exercises, you will get a chance to apply these ideas and see how they make your web searcher stronger. Good luck, and keep up the great work!
