Introduction to Q-Learning

Welcome to the first lesson of "Q-Learning Unleashed: Building Intelligent Agents"! This is the second course in our "Playing Games with Reinforcement Learning" path, in which we build upon the environment concepts we explored in the previous course. Specifically, in this course we'll dive into one of the most fundamental algorithms in Reinforcement Learning: Q-Learning.

Q-Learning is a model-free Reinforcement Learning algorithm that allows an agent to learn optimal actions through experience. Unlike supervised learning, where we train on labeled examples, in Q-Learning, our agent learns by interacting with an environment and receiving feedback in the form of rewards. Throughout this lesson, we'll explore the concept of policies in reinforcement learning, delve into the mathematical foundations of the Q-Learning algorithm, and implement a Q-table in Python. We'll also write the core Q-Learning update function and see the algorithm in action through practical examples.

Many widely used RL algorithms build on Q-Learning or some variation of it, making it a fundamental tool in your RL toolbox. By the end of this lesson, you'll have a solid understanding of how Q-Learning works and be able to implement your own Q-Learning agent!

Understanding Policies in Reinforcement Learning

Before diving into Q-Learning, we need to understand the core concept of a policy. In RL, a policy defines the agent's behavior by mapping states to actions; essentially, it's the strategy our agent follows, dictating which action the agent should take in each state.

Formally, a policy $\pi(a|s)$ represents the probability of taking action $a$ when in state $s$. There are two main types of policies:

  1. Deterministic policies: The agent always selects the same action in a given state;
  2. Stochastic policies: The agent selects actions with certain probabilities.

The ultimate goal in Reinforcement Learning is to find an optimal policy $\pi^*$ that maximizes the expected cumulative reward, which is the sum of rewards an agent expects to receive over time. This is where Q-Learning comes in: it helps us find this optimal policy without even explicitly modeling the policy itself! Instead, Q-Learning focuses on learning a value function called the Q-function, which estimates how good a particular action is in a given state. Once we have this function, our optimal policy simply becomes "choose the action with the highest Q-value in each state".
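As a quick preview of what that greedy rule looks like in code, here's a minimal sketch, assuming the Q-values for a state are stored in a NumPy array with one entry per action (we'll build exactly this kind of structure later in the lesson):

```python
import numpy as np

def greedy_action(q_values: np.ndarray) -> int:
    # Choose the action with the highest estimated Q-value for this state
    return int(np.argmax(q_values))

# Example: with four actions, the action at index 2 has the highest Q-value
print(greedy_action(np.array([0.1, -0.2, 0.4, 0.0])))  # -> 2
```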

The Q-Learning Algorithm

Q-Learning is all about estimating a special function called the Q-function, denoted as $Q(s, a)$. This function helps us understand how good it is to take a particular action $a$ when in a specific state $s$, and then continue following the best possible strategy thereafter.

Think of the Q-function as a guide that tells our agent the potential "value" or "benefit" of making certain decisions in different situations. It's like having a map that shows the expected rewards for each possible path (action) you can take. The goal of Q-Learning is to improve this map over time, so our agent can make better decisions.

The core idea of Q-Learning is to iteratively update our estimates of these Q-values using the Bellman equation:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

Here, $s$ is the current state, $a$ is the action taken, $r$ is the reward received, $s'$ is the resulting next state, $\alpha$ is the learning rate, and $\gamma$ is the discount factor (both are discussed in more detail below).

Implementing Q-Tables

To implement Q-Learning, we first need a data structure to store our Q-values. The most straightforward approach is a Q-table — a table where rows represent states and columns represent actions, with each cell containing the corresponding Q-value. While this approach works well for simple environments with discrete states and actions, more advanced approaches use function approximation techniques, such as neural networks in Deep Reinforcement Learning, to estimate Q-values for environments with large or continuous state spaces. Please note we won't deal with Deep RL techniques in this course path.

In Python, we can implement a Q-table using different data structures. For grid-world environments with discrete states and actions, a dictionary or a defaultdict is often the most convenient approach. A minimal sketch, assuming four available actions as in the grid-world examples below, might look like this:
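```python
import numpy as np
from collections import defaultdict

# Number of available actions (e.g., up, down, left, right in a grid-world)
n_actions = 4

# Q-table: each state maps to a NumPy array of Q-values, one per action.
# Unseen states are created automatically and initialized to all zeros.
Q = defaultdict(lambda: np.zeros(n_actions))
```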

This code creates a defaultdict where each key is a state, and each value is a numpy array containing Q-values for each possible action. The beauty of using defaultdict is that it automatically creates entries for states we haven't seen before, initializing their action values to zeros. We can then access the Q-value for a specific state-action pair using Q[state][action].

For example, if our state is represented as a tuple (x, y) for position in a grid, we could access the Q-value for moving right in position (3, 2) with Q[(3, 2)][3], assuming action 3 corresponds to moving right.
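In code, that lookup is just an indexing operation (the state's entry is created with zeros the first time we touch it):

```python
# Q-value for taking action 3 ("move right") from grid position (3, 2)
print(Q[(3, 2)][3])  # 0.0 until the agent has learned something about this state
```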

The Q-Learning Update Function

Now that we have our Q-table, we need to implement the update rule that will allow our agent to learn from experience. Here's a function that performs the Q-Learning update (the function name and the default values chosen for alpha and gamma are just one reasonable way to write it):
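```python
def update_q(Q, state, action, reward, next_state, done, alpha=0.1, gamma=0.9):
    if done:
        # Terminal state or step limit reached: no future rewards to consider
        best_next_q = 0.0
    else:
        # Best Q-value achievable from the next state
        best_next_q = np.max(Q[next_state])

    # Bellman update: move the current estimate toward the observed target
    Q[state][action] += alpha * (reward + gamma * best_next_q - Q[state][action])
```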

Here's how this function implements the Bellman equation:

  • The function takes a parameter done which indicates whether the episode has ended, either because the agent reached a terminal (goal) state (like winning or losing a game) or because the maximum number of steps was reached. This lets the agent treat end-of-episode transitions differently from intermediate ones.
  • For terminal states or when the maximum steps are reached, we set best_next_q = 0.0 because there are no future rewards to consider.
  • For non-terminal states, we calculate best_next_q = np.max(Q[next_state]) to find the maximum possible Q-value from the next state.
  • Finally, we update the Q-table using the Bellman equation: Q[state][action] += alpha * (reward + gamma * best_next_q - Q[state][action]).

The learning process is iterative — with each experience, the agent adjusts its Q-values to better reflect the expected rewards.

Learning Rate and Discount Factor

Two critical parameters in Q-Learning deserve special attention as they dramatically affect how our agent learns:

  1. Learning Rate ($\alpha$):

    • Controls how quickly the agent updates its Q-values.
    • A higher $\alpha$ (e.g., 0.9) makes the agent adapt quickly but might lead to instability.
    • A lower $\alpha$ (e.g., 0.1) makes learning more stable but slower.
    • Typical values range from 0.01 to 0.5.
    • Think of $\alpha$ as adjusting how stubborn your agent is. With $\alpha = 1$, the agent completely overwrites old knowledge with new experiences. With $\alpha$ close to 0, the agent mostly sticks to what it already knows, making only small adjustments based on new information (see the short example after this list).

  2. Discount Factor ($\gamma$):

    • Controls how much the agent values future rewards relative to immediate ones.
    • A $\gamma$ close to 1 (e.g., 0.99) makes the agent far-sighted, weighing long-term rewards heavily.
    • A $\gamma$ close to 0 makes the agent short-sighted, focusing mostly on immediate rewards.
    • Typical values range from 0.9 to 0.99.
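To make the effect of the learning rate concrete, here's a tiny illustration of a single update computed with two different values of $\alpha$ (the numbers are invented purely for demonstration):

```python
# Same transition, updated with two different learning rates (illustrative numbers)
reward, gamma = 1.0, 0.9
old_q, best_next_q = 0.5, 0.8
target = reward + gamma * best_next_q  # 1.72

for alpha in (0.1, 0.9):
    new_q = old_q + alpha * (target - old_q)
    print(f"alpha={alpha}: Q moves from {old_q} to {new_q:.3f}")
# alpha=0.1: Q moves from 0.5 to 0.622  (small, stable adjustment)
# alpha=0.9: Q moves from 0.5 to 1.598  (large, aggressive jump)
```
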
Q-Learning in Action

Let's see how our Q-Learning update works with a concrete example. Imagine a robot navigating a grid-world, trying to reach a charging station. Using the Q-table and update function from above, and assuming a reward of +1 for a helpful move (with $\alpha = 0.1$ and $\gamma = 0.9$), a single positive experience might look like this:
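```python
state = (3, 2)       # the robot's current grid position
action = 3           # action 3 corresponds to "move right"
next_state = (4, 2)  # position after moving right (illustrative)

print(Q[state][action])  # 0.0 (the robot has no experience yet)

# The robot moves right and receives a positive reward of +1
update_q(Q, state, action, reward=1.0, next_state=next_state, done=False)

print(Q[state][action])  # 0.1
```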

Initially, all Q-values are zero because the robot has no experience. After moving right and receiving a positive reward, the Q-value for "move right" in state (3, 2) increases to 0.1. This tells the robot that moving right from this position is promising.

Let's see what happens with a negative experience. Suppose action 2 corresponds to "move left" and that moving left from (3, 2) earns a reward of -1:
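```python
# The robot moves left and receives a negative reward of -1
update_q(Q, state, action=2, reward=-1.0, next_state=(2, 2), done=False)

print(Q[state][2])  # -0.1  ("move left")
print(Q[state][3])  #  0.1  ("move right" is still positive)
```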

After this negative experience (a negative reward, or "punishment"), the Q-value for "move left" drops to -0.1, while "move right" still has a positive value of 0.1. The robot is learning that moving right from state (3, 2) is better than moving left.

Conclusion and Next Steps

In this lesson, we've explored Q-Learning, a fundamental Reinforcement Learning algorithm that enables agents to learn optimal behaviors through experience. We delved into the concept of policies, which guide agent behavior, and examined the mathematical foundation of Q-Learning through the Bellman equation. By implementing a Q-table using Python's defaultdict and creating a Q-Learning update function, we saw how agents can learn from experience and adjust their actions based on rewards. We also discussed the importance of the learning rate ($\alpha$) and discount factor ($\gamma$) in shaping the agent's learning process.

Q-Learning's strength lies in its simplicity and its ability to discover optimal behaviors without requiring a model of the environment. Through practical examples, we also observed how Q-values evolve based on positive and negative experiences, guiding agents toward better decision-making. As you move on to the practice exercises, you'll gain hands-on experience with Q-Learning concepts, reinforcing your understanding by implementing these theoretical ideas in code. Remember, mastering Q-Learning takes practice, so embrace the challenges and build your confidence as you work through the examples.
