Welcome back to our course on "Navigating RL Challenges: Strategies and Future Directions"! This is the third lesson in our journey, and I'm excited to continue building on what we've learned so far.
So far, we have implemented random goals to create more dynamic grid world environments and explored reward shaping to provide more informative feedback to our agents, helping them learn more efficiently. Today, we'll tackle another critical challenge in Reinforcement Learning: environmental hazards. In real-world applications, agents often need to navigate environments with dangers or obstacles that must be avoided. Think of a robot that needs to avoid stairs while cleaning a house or an autonomous vehicle that must recognize and avoid hazards on the road.
We'll implement mines in our grid world that the agent must learn to avoid, adding risk assessment and danger avoidance to our agent's learning objectives. This brings us one step closer to training agents that can handle the complexities and dangers of real-world environments.
Let's start by modifying our grid world environment class to include mines. We'll focus first on how to initialize our environment with mines:
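Here's a minimal sketch of how such a constructor might look; the class name `GridWorld` and the `size`, `start`, `goal`, and `max_steps` attributes are assumptions made for illustration:

```python
import random

class GridWorld:
    """Grid world with a start cell, a goal cell, and hazardous mines to avoid."""

    def __init__(self, size=5, n_mines=3, start=(0, 0), goal=None):
        self.size = size                          # the grid is size x size
        self.n_mines = n_mines                    # number of mines to scatter on reset
        self.start = start                        # agent's starting cell (row, col)
        self.goal = goal or (size - 1, size - 1)  # goal cell, bottom-right by default
        self.mines = set()                        # mine positions, filled in by reset()
        self.agent_pos = self.start
        self.max_steps = 4 * size                 # step budget before the episode times out
        self.steps = 0
```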
This constructor introduces a new parameter `n_mines` that determines how many hazards will be placed in the environment. We also initialize an empty set `self.mines` that will store the locations of these mines.
To visualize how mines appear in our environment, here's how a 5x5 grid would render with the agent `A` at (1, 1), the goal `G` at (4, 4), and mines `*` at positions (1, 2), (2, 3), and (3, 1):
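Assuming 0-indexed (row, column) coordinates and `.` for empty cells, the rendering might look like this:

```
. . . . .
. A * . .
. . . * .
. * . . .
. . . . G
```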
Now we should update the `reset` method to place mines when the environment is reset:
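A sketch of the updated method might look like this, assuming the state is a tuple combining the agent's position with the mine-sensor readings:

```python
    def reset(self):
        """Reset the episode: reposition the agent, re-sample mines, return the state."""
        self.agent_pos = self.start
        self.steps = 0
        self.mines = self._sample_mines()       # place mines randomly
        nearby = self._check_nearby_mines()     # danger flags for the four adjacent cells
        return (*self.agent_pos, *nearby)       # state = position + mine sensor readings
```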
When you call this updated `reset()`, the environment now places mines randomly with `self._sample_mines()` and returns information about nearby mines as part of the state via `self._check_nearby_mines()`. This gives your agent sensory data about potential dangers in its vicinity.
Let's implement the `self._sample_mines()` method to place mines randomly:
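Here's a minimal sketch of how this method might be written:

```python
    def _sample_mines(self):
        """Randomly pick mine cells, excluding the start and goal positions."""
        candidates = [
            (r, c)
            for r in range(self.size)
            for c in range(self.size)
            if (r, c) not in (self.start, self.goal)
        ]
        # Sample without replacement; a set gives O(1) membership checks later
        return set(random.sample(candidates, self.n_mines))
```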
This method carefully places mines by:
- Creating a list of all valid positions that aren't the start or goal.
- Randomly selecting positions based on your specified number of mines.
- Returning these positions as a set for efficient lookups.
Now let's add the danger detection functionality with the `self._check_nearby_mines()` method:
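A possible implementation, assuming the sensor reports the four neighbours in up/down/left/right order:

```python
    def _check_nearby_mines(self):
        """Return a 4-tuple of 0/1 flags for mines in the up/down/left/right cells."""
        r, c = self.agent_pos
        neighbours = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]  # up, down, left, right
        return tuple(int(cell in self.mines) for cell in neighbours)
```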
This method simulates a simple danger detection "sensor" for your agent. It examines the four adjacent cells and returns a tuple indicating which directions contain mines. This is crucial for your agent's learning since it provides predictive information about dangers before the agent encounters them directly — much like how real robots use sensors to detect obstacles.
At this point, we need to modify the `step` method to handle what happens when the agent moves onto a mine:
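Here's one way this might look; the action encoding, the small per-step penalty, and the timeout handling are illustrative assumptions:

```python
    def step(self, action):
        """Apply an action; return (state, reward, done, info)."""
        moves = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}  # up, down, left, right
        dr, dc = moves[action]
        r, c = self.agent_pos
        # Clamp the move so the agent stays inside the grid
        self.agent_pos = (max(0, min(self.size - 1, r + dr)),
                          max(0, min(self.size - 1, c + dc)))
        self.steps += 1

        reward, done, info = -0.01, False, {"mine_hit": False}  # assumed small step penalty
        if self.agent_pos in self.mines:
            reward, done = -1.0, True          # stepping on a mine ends the episode
            info["mine_hit"] = True
        elif self.agent_pos == self.goal:
            reward, done = 1.0, True           # reaching the goal pays +1.0
        elif self.steps >= self.max_steps:
            done = True                        # episode times out

        state = (*self.agent_pos, *self._check_nearby_mines())
        return state, reward, done, info
```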
When your agent steps on a mine, three important things happen:
- It receives a significant negative reward (-1.0).
- The episode immediately ends (`done = True`).
- The information dictionary includes a flag indicating a mine was hit.
This creates a clear negative consequence that your agent will learn to avoid. The negative reward is strategically set to cancel out the positive reward for reaching the goal, creating a balanced risk-reward structure.
Now let's implement the usual code to train and evaluate our agent in these hazardous environments. First, we need to update our episode function:
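A sketch of such an episode function, assuming a tabular Q-learning agent with an epsilon-greedy policy and a Q-table keyed on (state, action) pairs (the name `run_episode` and the hyperparameter defaults are illustrative):

```python
import random
from collections import defaultdict

def run_episode(env, q_table, epsilon=0.1, alpha=0.1, gamma=0.99, n_actions=4):
    """Run one Q-learning episode; return total reward, success flag, and mine flag."""
    state = env.reset()
    total_reward, done, mine_hit = 0.0, False, False

    while not done:
        # Epsilon-greedy action selection
        if random.random() < epsilon:
            action = random.randrange(n_actions)
        else:
            action = max(range(n_actions), key=lambda a: q_table[(state, a)])

        next_state, reward, done, info = env.step(action)
        mine_hit = mine_hit or info.get("mine_hit", False)

        # Standard Q-learning update
        best_next = max(q_table[(next_state, a)] for a in range(n_actions))
        q_table[(state, action)] += alpha * (reward + gamma * best_next - q_table[(state, action)])

        state = next_state
        total_reward += reward

    # Success only if the goal was actually reached: no mine hit and no timeout
    success = (env.agent_pos == env.goal) and not mine_hit
    return total_reward, success, mine_hit
```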
Notice how we've updated the success condition — your agent only succeeds if it reaches the goal without hitting a mine or timing out. We also track whether a mine was hit, which will help you evaluate safety performance.
Finally, let's implement code that lets you compare performance across environments with different hazard levels:
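Building on the sketches above, the experiment might look something like this (the mine counts and episode budget mirror the plot discussed below; `compare_hazard_levels` is an illustrative name):

```python
def compare_hazard_levels(mine_counts=(0, 2, 5), n_episodes=1000):
    """Train a fresh agent for each mine count and report final success rates."""
    for n_mines in mine_counts:
        env = GridWorld(size=5, n_mines=n_mines)
        q_table = defaultdict(float)
        successes = []

        for episode in range(n_episodes):
            _, success, _ = run_episode(env, q_table)
            successes.append(success)

        # Success rate over the last 100 episodes as a rough "final" measure
        final_rate = sum(successes[-100:]) / 100
        print(f"{n_mines} mines: final success rate = {final_rate:.2f}")

compare_hazard_levels()
```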
This experiment trains agents in environments with varying numbers of mines, allowing you to directly observe how increasing environmental hazards affects learning performance. When you run this code, you'll typically observe:
- Environments with no mines are easiest to learn.
- As you add mines, success rates decrease and training takes longer.
- With many mines, finding safe paths becomes significantly more challenging.
This progression mirrors real-world RL applications, where increased environmental complexity often requires more sophisticated learning strategies or additional training time.
Below is a plot showing the success rate of three agents trained in environments with different numbers of mines (0, 2, and 5). The x-axis represents the training episodes (from 0 to 1000), and the y-axis represents the success rate — i.e., the fraction of episodes in which the agent reached the goal without hitting a mine or timing out.
- 0 mines (blue line): As expected, this scenario shows the highest success rate overall. Without mines, the agent faces a straightforward navigation problem, and it quickly learns to reach the goal consistently. We see the success rate rapidly climb above 80% and hover near 90% or higher as training progresses.
- 2 mines (green line): Introducing a few mines increases the difficulty, causing the success rate to dip lower than in the 0-mine scenario. Early on, the agent often hits mines while exploring, resulting in lower success rates. Over time, however, the agent learns to avoid the hazards and its performance steadily improves, ultimately stabilizing around 70–80% success.
- 5 mines (red line): With more mines, the environment becomes substantially riskier. The success rate starts low and remains more volatile, reflecting the agent's greater challenge in finding safe paths. Although performance does improve with training, the success rate plateaus at a lower level compared to the less hazardous environments, highlighting the increased complexity of learning safe navigation.
Overall, these results illustrate how adding mines — even a relatively small number — significantly impacts the agent’s learning curve and final performance. More mines mean the agent must devote more exploration and learning capacity to avoiding hazards, slowing down progress toward high success rates.
In this lesson, you've enhanced your grid world to include environmental hazards, transforming a simple navigation task into a risk-aware planning problem. You've implemented random mine placement, a danger detection system, negative rewards for collisions, and metrics to track safety performance. These additions create a more realistic learning challenge where your agent must balance reaching its goal with avoiding hazards — a fundamental concern in real-world applications.
As you continue your Reinforcement Learning journey, the techniques explored today — negative rewards for hazards, sensory information about dangers, and safety metrics — will serve as building blocks for developing safer and more capable learning systems. The Q-learning algorithm has shown remarkable adaptability, autonomously discovering safer paths without explicit programming about what to avoid — demonstrating the power of Reinforcement Learning to handle complex, risk-aware decision-making.
