Introduction

Hello and welcome to the fourth and final lesson of "Game On: Integrating RL Agents with Environments"! So far in our journey, we've covered the fundamentals of integrating agents with environments, explored the crucial balance between exploration and exploitation, and learned how to track and visualize training statistics to monitor agent performance.

Today, we'll dive into one of the most illuminating aspects of reinforcement learning: visualizing policies and value functions. While our previous lesson helped us understand how well our agent is learning through performance metrics, today we'll look at what our agent has learned by visualizing its decision-making strategy and value estimations.

By the end of this lesson, we'll have implemented two powerful visualization techniques that will allow us to see our agent's learned policy as directional arrows and its value function as a colorful heatmap. These visualizations will transform abstract numbers in our Q-table into intuitive representations that reveal the agent's understanding of the environment. Let's begin!

Understanding Policies and Value Functions

Before we start coding our visualizations, let's understand what we're actually trying to visualize.

A policy in reinforcement learning is essentially the agent's strategy — it defines what action the agent will take in each state. For our Q-learning agent, the policy is derived from the Q-table by selecting the action with the highest Q-value for each state.

The value function represents how good it is to be in a particular state. It estimates the expected future reward the agent can obtain starting from that state and following its policy thereafter. There are two types of value functions:

  1. State-value function V(s): Estimates the expected return when starting in a state and following the current policy.
  2. Action-value function Q(s, a): Estimates the expected return when taking a specific action in a state and following the current policy afterward.

In our Q-learning agent, we directly learn the Q-values (using the Q-table), but we can derive the state-value function from them by simply taking the maximum Q-value for each state:

V(s) = max_a Q(s, a)

This relationship is crucial — our Q-table contains action-values for each state-action pair, and the maximum Q-value for a state gives us its state-value. Visualizing these functions helps us understand which states the agent considers valuable, see the direction of the agent's movement strategy, identify potential misconceptions, and confirm if the agent has successfully learned the optimal path to the goal.
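As a quick illustration with made-up numbers, suppose one row of the Q-table (the Q-values of a single state, one per action) looks like the array below; the state's value is simply its largest entry, and the greedy policy picks the action at that position:

    import numpy as np

    # Hypothetical Q-values for one state, in an assumed order: up, down, left, right
    q_values = np.array([0.2, 0.8, 0.5, 0.1])

    state_value = np.max(q_values)     # V(s) = 0.8
    best_action = np.argmax(q_values)  # greedy action index = 1 ("down" in this ordering)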

Retrieving the Policy from the Q-table

Before we can visualize our agent's policy, we need a method that, for each state, looks up the action with the highest Q-value in our Q-table:
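Here is a minimal sketch of what such a method can look like, assuming the agent stores its Q-table as a NumPy array of shape (n_states, n_actions) in self.q_table; the exact names and constructor may differ in your own implementation:

    import numpy as np

    class QLearningAgent:
        def __init__(self, n_states, n_actions):
            # Illustrative setup: one row per state, one column per action
            self.q_table = np.zeros((n_states, n_actions))

        def get_policy(self, state):
            # Index of the action with the highest Q-value in this state
            return np.argmax(self.q_table[state])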

This method simply returns the action index with the highest Q-value for the given state using np.argmax(), which finds the index of the maximum value in an array. This gives us the agent's current policy – what it believes is the best action to take in each state.

Visualizing the Policy with Directional Arrows

Now that we can retrieve the policy, let's implement a function to visualize it in a grid. We'll represent each action as a directional arrow, making it easy to interpret at a glance:
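Below is a sketch of such a function. It assumes states are numbered row by row (state = row * size + col), the goal sits in the bottom-right cell, and the action indices map to up, down, left, right in that order; any of these details may differ in your own environment:

    def visualize_policy(agent, size):
        # Assumed mapping from action index to arrow: 0=up, 1=down, 2=left, 3=right
        arrows = ['↑', '↓', '←', '→']
        goal_state = size * size - 1  # assumes the goal is the bottom-right cell

        for row in range(size):
            symbols = []
            for col in range(size):
                state = row * size + col
                if state == goal_state:
                    symbols.append('G')  # mark the goal so it stands out
                else:
                    symbols.append(arrows[agent.get_policy(state)])
            print(' '.join(symbols))

After training, calling visualize_policy(agent, 5) prints the arrow grid for our 5×5 environment.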

This function takes our agent and the environment size as inputs and prints a grid of arrows. For each state in our grid world, we call the agent's get_policy(state) method to get the best action, then use our arrow symbols (↑, ↓, ←, →) to visualize that action. The goal state is marked with "G" to make it stand out.

Interpreting the Policy Visualization

After training our agent successfully, the policy visualization might look something like this:
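Shown here is one possible output for our 5×5 grid with the goal in the bottom-right corner; the exact arrows can vary between training runs, since several equally optimal policies exist.

    → → → → ↓
    → → → → ↓
    → → → → ↓
    → → → → ↓
    → → → → G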

By examining this policy grid, we can easily trace the agent's strategy from any starting position. For example, if the agent starts at the top-left corner (0,0), it will follow the arrows: right, right, right, right, down, down, down, down, reaching the goal in 8 steps, the optimal path length for that starting position in our 5×5 grid. This shows us that the agent has learned an optimal policy.

The Value Function and State Assessment

Now let's turn our attention to the value function. As mentioned earlier, the value function tells us how good it is to be in each state, estimated as the expected future reward.

For a Q-learning agent, the value function is directly related to the Q-table. Specifically, the value of a state is the maximum Q-value for that state across all possible actions. This makes sense because a rational agent would always choose the action that promises the highest expected return.

Here's the agent method that extracts the state value from the Q-table:
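Continuing the illustrative agent class sketched earlier, the method can simply take the maximum over that state's row of the Q-table:

    class QLearningAgent:
        # ... __init__ and get_policy as sketched earlier ...

        def get_state_value(self, state):
            # The state's value is the best Q-value achievable from it
            return np.max(self.q_table[state])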

This simple method takes a state and returns the maximum Q-value for that state, representing the state's value according to our agent's current estimates.

In our grid world environment:

  • States close to the goal should have higher values because they can lead to rewards sooner.
  • States far from the goal typically have lower values.
  • The goal state itself usually has the highest value.

The distribution of these values across the grid creates a "value landscape" that guides the agent's decision-making. Visualizing this landscape can reveal how the agent perceives different states and their potential for future reward.

Creating Value Function Heatmaps

Now let's implement a function to visualize the value function using a colorful heatmap:
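Here is one way such a function might look, again assuming row-major state numbering and the illustrative agent interface sketched above; the exact figure styling is a matter of taste:

    import numpy as np
    import matplotlib.pyplot as plt

    def visualize_value_function(agent, size):
        # Collect the value of every state into a grid-shaped matrix
        value_grid = np.zeros((size, size))
        for row in range(size):
            for col in range(size):
                state = row * size + col  # assumes row-major state numbering
                value_grid[row, col] = agent.get_state_value(state)

        # Heatmap: dark blue for low values, bright yellow for high values
        plt.figure(figsize=(6, 5))
        plt.imshow(value_grid, cmap='viridis')
        plt.colorbar(label='State value')

        # Overlay the numeric value on each cell for precise reading
        for row in range(size):
            for col in range(size):
                plt.text(col, row, f'{value_grid[row, col]:.2f}',
                         ha='center', va='center', color='white')

        plt.title('Value Function Heatmap')
        plt.show()

Calling visualize_value_function(agent, 5) after training draws the heatmap we interpret below.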

This function creates a matrix to hold the value of each state in our grid world. We iterate through each cell, get its value using our agent's get_state_value() method, and store it in the corresponding position in our value_grid matrix.

The visualization uses Matplotlib's imshow() to create a heatmap with the 'viridis' colormap, which transitions from blue (low values) to yellow (high values). We add a colorbar for interpretation and overlay the actual numerical values on each cell for precise reading.

The resulting heatmap makes it immediately apparent where the agent believes the valuable states are, with bright yellow areas representing high-value states (typically near the goal) and dark blue areas representing low-value states (typically far from the goal).

Interpreting the Value Function Heatmap

With our value function heatmap generated, let's examine the learned state values in more detail. In the heatmap below, each cell corresponds to a state in our 5×5 grid, and the color indicates the magnitude of that state's value (darker blues represent lower values, brighter yellows represent higher values):

Value Function Heatmap

At a glance, notice how the bottom-right corner — our goal location — is assigned the highest values (1.00). This is expected: since the agent receives a reward for reaching this terminal state, the algorithm learns that being anywhere close to the goal is advantageous, as the agent can collect a high reward quickly. Consequently, cells adjacent to the goal also display high values (0.98–0.99), reflecting the agent’s confidence that from those states, it can secure the reward with only a few moves.

In contrast, states farther away from the goal tend to have lower values (around 0.90–0.94). These cells are still part of a viable path to the goal, but the agent anticipates a longer journey or a bit more uncertainty in receiving the reward.

Overall, this gradient from lower values in the upper-left to higher values in the lower-right reveals that the agent has successfully learned which states are more valuable for collecting future rewards. When combined with the policy visualization, you can trace how the agent uses these state-value estimates to decide the next action, continuously moving toward states with higher expected returns.

Conclusion and Next Steps

Fantastic job! You've now mastered two powerful visualization techniques that give you deeper insights into your reinforcement learning agents. By visualizing both the policy and value function, we've transformed abstract Q-table numbers into intuitive visual representations that reveal what the agent has actually learned. We've covered the relationship between Q-values and the state-value function, implemented text-based policy visualization using directional arrows, created heatmap visualizations of the value function, and learned how to interpret these visualizations to gain insights into agent behavior.

In the upcoming practice exercises, you'll get hands-on experience implementing these visualization techniques and analyzing the results. You'll be able to see firsthand how different training parameters affect the learned policy and value function, and how visualizations can help you debug and understand agent behavior. These visualization skills are invaluable tools in your reinforcement learning toolkit as you tackle increasingly complex environments and agents.
