Welcome to the first lesson of our course, "Automating Web Content Retrieval and Parsing in Python". In this course, you will learn how to build a research tool that can search the web, gather content, and process information automatically.
Automating web search is a key part of building modern research tools. Instead of searching for information manually, you can write a Python module to do it for you. This saves time and allows you to collect and process large amounts of information quickly.
In this lesson, we will focus on using DDGS, a metasearch library that aggregates results from diverse web search services.
By the end of this lesson, you will know how to:
- Search the web using DDGS in Python
- Fetch the first result from your search
- Convert the web page content into a readable format
Let’s get started!
The DDGS library allows you to perform web searches directly from Python. This is helpful because you can automate the process of finding information online.
DDGS works as a metasearch tool: it automatically selects and queries different search backends (such as DuckDuckGo, Brave, or others) to provide you with a diverse set of results. You do not need to choose the backend yourself; the library handles this for you.
First, let’s see how to import the DDGS library and perform a simple search.
- Here, we import the DDGS class from the library.
- We define a search query, in this case, "Python programming".
- We call the .text() method to perform the search and ask for just one result (max_results=1).
- The results variable will contain a list of search results.
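The steps above can be sketched roughly as follows. This is a minimal illustration, not the lesson's exact code: it assumes the library is importable as the ddgs package, and search_web is a hypothetical helper name introduced here.

```python
def search_web(query, max_results=1):
    """Run a text search and return a list of result dictionaries."""
    # Assumption: the library is importable as "ddgs". The import sits inside
    # the function so that defining the helper does not require the package.
    from ddgs import DDGS

    # .text() performs the search; max_results limits how many results come back.
    return DDGS().text(query, max_results=max_results)


# Example usage (performs a live web search, so the results vary):
# query = "Python programming"
# results = search_web(query)
# print(results)
```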
Sample Output:
Each result is a dictionary with keys like title, href (the URL), and body (a short description).
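To make the shape of a result concrete, here is a small sketch that reads each key from one result. The dictionary values are invented placeholders for illustration, not real search output.

```python
# Hypothetical result with the three keys described above.
# The values are placeholders, not output from a real search.
result = {
    "title": "Example page title",
    "href": "https://example.com/",
    "body": "A short description of the page.",
}

# Each field is read by its key.
title = result["title"]
url = result["href"]
summary = result["body"]

print(title)
print(url)
print(summary)
```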
Note: On CodeSignal, the library is already installed, so you do not need to install it yourself. On your own computer, you would need to install it first (for example, with pip).
Now that we have search results, let’s extract the URL of the first result and fetch the web page content.
First, let’s check if we got any results and get the URL:
- We check if results is not empty and if the first result has an "href" key.
- If so, we extract the URL and print it.
- If not, we print a message saying no valid results were found.
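The check described above might look like the following sketch. The helper name first_result_url and the sample results list are hypothetical; in practice, results would come from the search.

```python
def first_result_url(results):
    """Return the URL of the first result, or None if there is no usable result."""
    # An empty list is falsy, so this also guards the results[0] lookup.
    if results and "href" in results[0]:
        return results[0]["href"]
    return None


# Hypothetical stand-in for the list returned by the search.
results = [{"title": "Example", "href": "https://example.com/", "body": "..."}]

url = first_result_url(results)
if url:
    print(f"First result URL: {url}")
else:
    print("No valid results found.")
```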
Even though we request only one result with max_results=1, DDGS().text() always returns a list. To access the result itself, we take the first item from the list with results[0].
Next, let’s fetch the content of the web page using the httpx library:
- We use httpx.get() to download the web page at the given URL.
- The timeout=10 argument means the request will wait up to 10 seconds before giving up.
- response.raise_for_status() will raise an error if the server responds with an error status (4xx or 5xx).
- We print the first 200 characters of the web page content to see what we got.
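A minimal sketch of this fetch step is shown below. The helper name fetch_html is hypothetical; the httpx calls are the ones named above.

```python
def fetch_html(url, timeout=10):
    """Download a web page and return its content as text."""
    # Imported inside the function so that defining the helper
    # does not require httpx to be installed.
    import httpx

    # Wait up to `timeout` seconds for the server.
    response = httpx.get(url, timeout=timeout)

    # Raise an exception for 4xx/5xx responses instead of returning an error page.
    response.raise_for_status()

    return response.text


# Example usage (makes a real network request):
# html = fetch_html(url)
# print(html[:200])  # first 200 characters of the page
```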
Web pages are usually written in HTML, which can be hard to read. To make the content easier to work with, we can convert it to Markdown, a simpler text format.
We will use the html_to_markdown library for this. Here’s how you can do it:
- We import the convert_to_markdown function.
- We pass the HTML content to this function, and it returns a Markdown version.
- We print the first 200 characters of the Markdown content.
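The conversion step can be sketched like this. The wrapper name page_to_markdown is hypothetical, and the sketch assumes convert_to_markdown is importable from html_to_markdown as described above (this may depend on the installed version of the library).

```python
def page_to_markdown(html):
    """Convert an HTML string to Markdown text."""
    # Imported inside the function so that defining the helper
    # does not require html_to_markdown to be installed.
    from html_to_markdown import convert_to_markdown

    return convert_to_markdown(html)


# Example usage, given HTML fetched in the previous step:
# markdown = page_to_markdown(html)
# print(markdown[:200])  # first 200 characters of the Markdown
```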
Sample Output:
Markdown is much easier to read and process than raw HTML, which is why we use this step.
In this lesson, you learned how to:
- Use the DDGS library to search the web from Python
- Extract the first search result and fetch its web page content using httpx
- Convert the web page’s HTML to Markdown for easier reading
These steps are the foundation for building an automated research tool. In the practice exercises that follow, you will get hands-on experience with these skills. You will write your own code to search, fetch, and convert web content, preparing you for more advanced features in future lessons.
Let’s move on to the practice exercises and start building your DeepResearcher!
