Introduction

Welcome back to our "Q-Learning Unleashed: Building Intelligent Agents" course! In the first lesson, we explored the fundamental concepts of Q-learning, including the Bellman equation, Q-tables, and the Q-learning update rule. Now, in this second lesson, we're ready to take the next step: training a Q-learning agent through interaction with an environment.

In this lesson, we'll build a complete Q-Learning agent that learns to navigate a simple environment through trial and error. We'll implement a training loop that allows our agent to explore its environment, collect experiences, and update its Q-table to discover optimal behavior. By the end of this lesson, you'll see how a reinforcement learning agent evolves from random actions to strategic decision-making as it learns from experience.

Let's dive in and bring our Q-learning theory to life with practical implementation!

The Line-World Environment

Before we start training our agent, let's understand the environment it will interact with. While our previous course focused on a grid-world environment, in this course we'll work with an even simpler environment, called a line-world, for the sake of clarity.

In this one-dimensional world, the agent exists on a line with positions numbered from 0 to size-1. There's a goal state in the middle of the line, and the agent can move left or right. The agent receives a reward of 1 when it reaches the goal and 0 otherwise. The agent starts in a random cell, and each episode ends when the agent reaches the goal.

This simplified environment allows us to focus on the core aspects of agent training rather than complex environment dynamics. Here's how we can visualize a line-world of size 7:
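```
[ ][ ][ ][G][ ][ ][ ]
 0  1  2  3  4  5  6
```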

Here, [G] represents the goal state at position 3 (size // 2).

Implementing Environment: Reset

Let's implement two core functions for our line-world environment: reset() and step(). These functions will allow our agent to interact with the environment. Note that, for simplicity, we won't wrap them in a class as we did in Course 1.

First, the reset() function initializes the environment at the beginning of each episode:
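Here's a minimal sketch of how it might look (the exact signature and names are illustrative, not prescribed):

```python
import random

def reset(size):
    # Place the goal at the middle of the line.
    goal = size // 2
    # Start the agent at any position except the goal and return that
    # position as the initial state.
    return random.choice([p for p in range(size) if p != goal])
```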

This function sets the goal position at the middle of the line (size // 2), randomly places the agent at any position except the goal, and returns the initial state (position) of the agent. This ensures our agent has to learn to navigate to reach the goal from different starting points.

Implementing Environment: Step

Next, we implement the step() function, which processes the agent's action and returns the next state, reward, and whether the episode is done:
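Again, here's a representative sketch (the signature is an illustrative choice):

```python
def step(state, action, size):
    # Recompute the goal position (middle of the line).
    goal = size // 2
    # Apply the action (-1 = left, +1 = right) and keep the agent on the line.
    next_state = max(0, min(size - 1, state + action))
    # Reward of 1 for reaching the goal, 0 otherwise; the episode ends at the goal.
    reward = 1 if next_state == goal else 0
    done = next_state == goal
    return next_state, reward, done
```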

The step() function takes the current state and action as inputs, computes the next state by adding the action value (-1 for left, +1 for right), and ensures the agent stays within the environment boundaries. It assigns a reward of 1 if the agent reaches the goal and 0 otherwise, and marks the episode as done when the agent reaches the goal. These two functions form the core of our environment interface, allowing our agent to start episodes and interact with the environment. Note that, unlike the grid-world environment we developed in Course 1, we're not setting a max_steps limit here for simplicity, although it is generally recommended to do so.

Q-Learning Update Implementation

As you may recall from our previous lesson, the heart of Q-learning is the update function that modifies the Q-table based on experience. Let's briefly see the implementation once more:
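Here's a sketch of that update function. In this version, q_table maps each state to an array of per-action values and action is the index of the chosen action; both conventions are assumptions of this sketch rather than requirements:

```python
import numpy as np

def update_q_table(q_table, state, action, reward, next_state, alpha, gamma):
    # Best estimated value achievable from the next state.
    best_next_q = np.max(q_table[next_state])
    # Move the current estimate toward the target r + gamma * best_next_q.
    q_table[state][action] += alpha * (reward + gamma * best_next_q - q_table[state][action])
```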

This function implements the Q-learning update rule, which is derived from the Bellman equation:

Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]

The best_next_q variable captures the maximum Q-value for the next state, representing the agent's estimate of future rewards. The learning rate alpha controls how quickly the agent updates its beliefs, while the discount factor gamma balances immediate versus future rewards. The term inside the brackets is the temporal difference error: the difference between the updated target value and the current estimate.

Setting Up the Training Environment

Now let's set up the training loop that will allow our agent to learn through interaction with the environment. We'll start by initializing the necessary parameters:
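For example (the specific hyperparameter values below are illustrative choices):

```python
import random
from collections import defaultdict
import numpy as np

# Environment parameters
size = 7             # number of positions on the line
actions = [-1, 1]    # move left, move right

# Q-learning parameters (illustrative values)
alpha = 0.1          # learning rate
gamma = 0.9          # discount factor
q_table = defaultdict(lambda: np.zeros(len(actions)))  # per-state action values

# Training length (illustrative value)
num_episodes = 1000
```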

Here we set up the environment parameters (size and possible actions), Q-learning parameters (alpha, gamma, and Q-table), and the number of episodes to train for. The Q-table is implemented as a defaultdict that returns an array of zeros for any new state we encounter, allowing our agent to handle new states seamlessly during exploration.

The Learning Process: Episode by Episode

With our setup complete, we can now implement the core training loop that iterates through episodes:
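Putting the pieces above together, the loop might look like this (a sketch under the same assumptions):

```python
for episode in range(num_episodes):
    state = reset(size)
    done = False
    while not done:
        # Pure exploration for now: pick an action at random.
        action_idx = random.randrange(len(actions))
        next_state, reward, done = step(state, actions[action_idx], size)
        # Learn from this experience.
        update_q_table(q_table, state, action_idx, reward, next_state, alpha, gamma)
        state = next_state
```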

For each episode, we reset the environment and then repeatedly choose an action, take a step, and update the Q-values based on the reward received, continuing until the agent reaches the goal. This exploration process allows the agent to discover which actions lead to better outcomes in each state. Initially, the agent may take many steps to reach the goal, but as it learns, it typically becomes more efficient. The loop represents the core of reinforcement learning: taking actions, receiving feedback, and updating knowledge to improve future decisions. Note that we're still taking actions randomly here; this is done only for learning, and in the next lesson we'll discuss how to extract the best action in each state using the trained Q-table!

Examining the Learning Results

After training, we can analyze the Q-values to understand what our agent has learned:
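For instance, a simple loop over the visited states (a sketch, assuming the q_table structure above):

```python
for state in sorted(q_table):
    left_q, right_q = q_table[state]
    print(f"State {state}: left={left_q:.2f}, right={right_q:.2f}")
```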

When we run this code, we might see output similar to the following (exact values will vary between runs):
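```
State 0: left=0.71, right=0.81
State 1: left=0.73, right=0.90
State 2: left=0.80, right=1.00
State 3: left=0.00, right=0.00
State 4: left=1.00, right=0.80
State 5: left=0.90, right=0.73
State 6: left=0.81, right=0.71
```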

These values reveal the learned policy: for positions left of the goal (0, 1, 2), the Q-values for moving right are higher, indicating the agent learned to move right to reach the goal. For positions right of the goal (4, 5, 6), the Q-values for moving left are higher, showing the agent learned to move left. Position 3 is the goal state, featuring Q-values of zero for both actions since no further action is taken once the agent reaches the goal state. This pattern perfectly captures the optimal solution to our line-world problem!

Conclusion and Next Steps

Congratulations! In this lesson, we've built a complete Q-learning agent that learns to navigate a line-world environment. We've implemented the environment interface, built a training loop for agent-environment interaction, updated the Q-table from experience, and analyzed the resulting learned behavior.

This implementation demonstrates the core reinforcement learning loop: the agent takes actions, receives feedback in the form of rewards, and updates its understanding of the world. Over time, this process leads to increasingly optimal behavior, even without explicit programming of the solution.

Keep experimenting with the parameters and environment settings to deepen your understanding of how Q-learning works. The journey of building intelligent agents has just begun!
