Welcome back! So far, you have learned how to search the web using Python and how to build a module that fetches and processes web content. In this lesson, we will focus on making your web searcher more reliable by avoiding common mistakes that can cause problems in automated research.
When you build tools that interact with the web, you will often run into issues like duplicate results, broken links, or slow responses. If you do not handle these problems, your tool might waste time, give you bad data, or even stop working. By learning how to avoid these pitfalls, you will make your web searcher much more robust and useful.
Let’s quickly remind ourselves what our web searcher does. In the previous lessons, you learned how to:
- Use the DDGS library to search the web for a query.
- Fetch the content of the top search results using the `httpx` library.
- Convert the HTML content of each page to Markdown using `html_to_markdown`.
All of these steps are combined in a function that takes a search query and returns a list of results, each with a title, URL, and Markdown content.
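As a reminder, a rough sketch of that combined function might look like the one below. The function name (`search_and_fetch`), the exact import path for DDGS, and the result keys (`title`, `href`) are assumptions based on the earlier lessons and may differ slightly from your version:

```python
from ddgs import DDGS  # import path may differ (ddgs vs duckduckgo_search)
import httpx
from html_to_markdown import convert_to_markdown


def search_and_fetch(query: str, max_results: int = 3) -> list[dict]:
    """Search the web and return the top results as Markdown pages."""
    markdown_pages = []
    for result in DDGS().text(query, max_results=max_results):
        url = result.get("href", "")
        response = httpx.get(url, follow_redirects=True)
        markdown_pages.append({
            "title": result.get("title", ""),
            "url": url,
            "content": convert_to_markdown(response.text),
        })
    return markdown_pages
```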
Now, let’s see how we can improve this process by handling some common issues.
One common problem is processing the same web page more than once. This can happen if the same URL appears in multiple searches or if your code is run multiple times. To avoid this, we need a way to remember which pages we have already visited.
Let’s start by creating a set to keep track of visited URLs:
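A minimal version, assuming the set lives at module level next to your search function, might look like this:

```python
# Module-level set of URLs that have already been fetched and processed.
_visited_pages: set[str] = set()
```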
Here, `_visited_pages` is a set that will store the URLs of pages we have already processed. Sets are useful because they do not allow duplicate values, and checking whether a value is in a set is very fast.
Now, before we fetch a page, we check whether its URL is already in `_visited_pages`. If it is, we skip it:
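Inside the loop over search results, that check might look like this (assuming each result is a dictionary with an `href` key, as in the sketch above):

```python
for result in results:
    url = result.get("href", "")

    # Skip missing/empty URLs and pages we have already processed.
    if not url or url in _visited_pages:
        continue

    # ... fetch and convert the page here ...
```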
- `not url` checks if the URL is missing or empty.
- `url in _visited_pages` checks if we have already seen this URL.
- If either is true, we use `continue` to skip to the next result.
After we successfully fetch and process a page, we add its URL to the set:
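For example, right after the page has been converted to Markdown, you might add:

```python
# Remember this URL so later searches (or reruns) can skip it.
_visited_pages.add(url)
```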
This way, we make sure we do not process the same page twice.
Example Output:
If you search for "Python programming" twice, any URLs visited during the first search will be skipped the second time, making your searcher more efficient.
Another common issue is that some web pages might be broken, slow, or unreachable. If your code tries to fetch a page and something goes wrong, it could crash or get stuck.
To handle this, we use a `try` and `except` block when fetching each page:
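Putting the pieces together, the loop body might look roughly like this. This is a sketch under the same assumptions as the earlier code, not the only way to write it:

```python
for result in results:
    url = result.get("href", "")
    if not url or url in _visited_pages:
        continue

    try:
        response = httpx.get(url, follow_redirects=True)
        response.raise_for_status()
        content = convert_to_markdown(response.text)
    except Exception as exc:
        # Even on failure, remember the URL so we do not retry it,
        # and record an error entry instead of crashing.
        _visited_pages.add(url)
        markdown_pages.append({
            "title": result.get("title", ""),
            "url": url,
            "content": f"Error fetching page: {exc}",
        })
        continue

    _visited_pages.add(url)
    markdown_pages.append({
        "title": result.get("title", ""),
        "url": url,
        "content": content,
    })
```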
Let’s break this down:
- The `try` block attempts to fetch the page and convert it to Markdown.
- If anything goes wrong (for example, the page does not load or the server returns an error), the code jumps to the `except` block.
- In the `except` block, we still add the URL to `_visited_pages` so we do not try it again.
- We also add a result to `markdown_pages` with an error message, so we know what went wrong.
Example Output:
If a page is broken, the output might look like this:
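The exact wording depends on how the page fails and on your httpx version, but the error entry might look roughly like this (the title and URL are made up for illustration):

```python
{
    "title": "Some Broken Page",
    "url": "https://example.com/broken",
    "content": "Error fetching page: Server error '500 Internal Server Error' for url 'https://example.com/broken'"
}
```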
This way, your tool does not crash, and you get a clear message about what happened.
Sometimes, a web page might take a long time to respond, or you might get results that are not appropriate for your research. To handle these issues, we use timeouts and safe search settings.
When fetching a page, we set a timeout:
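For example, assuming a 10-second limit (pick whatever limit fits your use case):

```python
# Give up on the request if the server takes longer than 10 seconds.
response = httpx.get(url, timeout=10.0, follow_redirects=True)
```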
- The `timeout` parameter makes sure that if a page takes too long to load, your code will stop waiting and move on. This keeps your tool fast and responsive.
When searching with DDGS, we can also set the `safesearch` and `region` parameters:
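A sketch of that call, assuming the `text()` method used in the earlier lessons (the parameter values here are just examples):

```python
results = DDGS().text(
    query,
    max_results=3,
    safesearch="moderate",  # filter out explicit results
    region="us-en",         # bias results toward a region/language
)
```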
- `safesearch` helps filter out inappropriate or irrelevant results.
- `region` can help you get results that are more relevant to your location or language.
Example Output:
If a page takes too long to load, you might see an error like:
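The exact message depends on your httpx version and on where the timeout happens, but the error entry might look something like this (title and URL invented for illustration):

```python
{
    "title": "Very Slow Page",
    "url": "https://example.com/slow",
    "content": "Error fetching page: timed out"
}
```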
This shows that your timeout setting is working as expected.
In this lesson, you learned how to make your web searcher more reliable by:
- Tracking and skipping already-visited URLs to avoid duplicates.
- Handling errors when fetching pages, so your tool does not crash or get stuck.
- Using timeouts and safe search settings to keep your searches fast and appropriate.
These improvements will help you build a more robust and efficient research tool. In the next practice exercises, you will get a chance to apply these ideas and see how they make your web searcher stronger. Good luck, and keep up the great work!
