
Welcome to the second entry in the Unity AI Blog series! For this post, I want to pick up where we left off last time and talk about how to take a Contextual Bandit problem and extend it into a full Reinforcement Learning problem. In the process, we will demonstrate how to use an agent which acts via a learned Q-function that estimates the long-term value of taking certain actions in certain circumstances. For this example we will only use a simple gridworld and a tabular Q-representation. Fortunately, this basic idea applies to almost all games. If you would like to try out the Q-learning demo, follow the link here. For a deeper walkthrough of how Q-learning works, continue to the full text below.

The Q-Learning Algorithm

Contextual Bandit Recap

The goal of Reinforcement Learning is to train an agent which can learn to act in ways that maximize future expected rewards within a given environment. In the last post in this series, that environment was relatively static. The state of the environment was simply which of the three possible rooms the agent was in, and the actions were choosing which chest within that room to open. Our algorithm learned the Q-function for each of these state-action pairs: Q(s, a). This Q-function corresponded to the expected future reward that would be acquired over time by taking that action within that state. We called this problem the “Contextual Bandit.”

The Reinforcement Learning Problem

Two missing elements kept that Contextual Bandit example from being a proper Reinforcement Learning problem: sparse rewards and state transitions. By sparse rewards, we refer to the fact that the agent does not receive a reward for every action it takes. Sometimes these rewards are “delayed,” in that certain actions which may in fact be optimal may not provide a payout until a series of optimal actions has been taken. To use a more concrete example, an agent may be following the correct path, but it will only receive a reward at the end of the path, not for every step along the way. Each of those actions may have been essential to reaching the final reward, even if they didn’t provide a reward at the time. We need a way to perform credit assignment: that is, allowing the agent to learn that earlier actions were valuable, even if only indirectly.

The second missing element is that full Reinforcement Learning problems contain transitions between states. This way, our actions not only produce rewards according to a reward function, R(s, a) ⇨ r, but also produce new states, according to a state transition function, P(s, a) ⇨ s′. A concrete example here is that every step taken while walking along a path brings the agent to a new place on that path, and hence a new state. Therefore we want our agent not only to act to optimize the immediate reward, but to act to move toward states we know provide even larger rewards.
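To make the two function signatures concrete, here is a minimal sketch of R(s, a) and P(s, a) for a hypothetical five-state “path” environment like the one described above (the states, actions, and reward values here are invented for illustration):

```python
# Hypothetical 5-state "path" environment illustrating a reward function
# R(s, a) -> r and a state transition function P(s, a) -> s'.
# States 0..4 lie on a line; only reaching state 4 (the goal) pays out.

def transition(state, action):
    """P(s, a): 'right' moves +1, 'left' moves -1, clamped to the path."""
    step = 1 if action == "right" else -1
    return max(0, min(4, state + step))

def reward(state, action):
    """R(s, a): +1 only when the action reaches the goal state 4."""
    return 1.0 if transition(state, action) == 4 else 0.0

# Walking right along the path only pays out at the final step,
# even though every step was essential to reaching the goal:
s = 0
for _ in range(4):
    r = reward(s, "right")
    s = transition(s, "right")
print(s, r)  # -> 4 1.0
```

Note how the intermediate steps earn nothing, which is exactly the sparse-reward, credit-assignment situation described above.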

Bellman Updates

While these two added elements of complexity may at first seem unrelated, they are in fact directly connected. Both imply a relationship between the future states our agent might end up in and the future rewards our agent might receive. We can take advantage of this relationship to learn to take optimal actions under these circumstances with a simple insight. Namely, that under a “true” optimal Q-function (a theoretical one which we may or may not ever reach ourselves) the value of a current state and action can be decomposed into the immediate reward r plus the discounted maximum future expected reward from the next state the agent will end up in for taking that action.

This is called the Bellman equation, and can be written as follows:

Q*(s, a) = r + γ max_a′ Q*(s′, a′)

Here γ (gamma) is a discount term, which controls how much we want our agent to care about possible future rewards. If we set γ to 1.0, our agent would value all possible future rewards equally, and in training episodes which never end, the value estimate might increase to infinity. For this reason, we set γ to something greater than 0 and less than 1. Typical values are between 0.7 and 0.99.
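A quick sketch of what discounting does in practice: a reward received t steps in the future is worth γ^t of its face value now, so a larger γ makes the agent more far-sighted.

```python
# Sketch: how gamma discounts a stream of future rewards.
# A reward r received t steps in the future is worth (gamma ** t) * r now.

def discounted_return(rewards, gamma):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = [0.0, 0.0, 0.0, 1.0]   # a single reward, three steps away
print(discounted_return(rewards, 0.99))  # ~0.970: distant rewards still matter
print(discounted_return(rewards, 0.7))   # ~0.343: the future is heavily discounted
```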

The Bellman equation is useful because it provides a way for us to think about updating our Q-function by bootstrapping from the Q-function itself. Q*(s, a) refers to an optimal Q-function, but even our current, sub-optimal Q-value estimates of the next state can help push our estimates of the current state in a more accurate direction. Since each update is anchored by the true reward received at that step, we can trust that the Q-value estimates themselves will slowly improve. We can use the Bellman equation to inform the following new Q-learning update:

Q(s, a) ← Q(s, a) + α [r + γ max_a′ Q(s′, a′) − Q(s, a)]

Here α (alpha) is a learning rate. This looks similar to our previous contextual bandit update algorithm, except that our Q-target now includes the discounted future expected reward at the next step.
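The update above can be sketched in a few lines. This is an illustrative implementation, not the actual project code; the learning rate `alpha` and discount `gamma` values are assumed defaults:

```python
import numpy as np

# Minimal sketch of the tabular Q-learning update: move Q[s, a] a small
# step (alpha) toward the Bellman target r + gamma * max_a' Q[s', a'].

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

Q = np.zeros((16, 4))               # 16 states, 4 actions, estimates start at 0
q_update(Q, s=0, a=1, r=1.0, s_next=5)
print(Q[0, 1])  # -> 0.1 (one alpha-sized step toward the target of 1.0)
```

Because the next state's estimates start at zero, the first update simply moves Q(s, a) a fraction alpha of the way toward the observed reward; as estimates of s′ improve, the bootstrapped term starts to matter.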


In order to ensure that our agent properly explores the state space, we will utilize a form of exploration called epsilon-greedy. To use epsilon-greedy, we simply set an epsilon value ϵ to 1.0 and decrease it by a small amount every time the agent takes an action. When the agent chooses an action, it takes a random action with probability ϵ, and otherwise picks the greedy action, argmax(Q(s, a)). The intuition is that at the beginning of training our agent’s Q-value estimates are likely to be very poor, but as the agent learns about the world and ϵ decreases, its Q-function will slowly correspond more closely to the true Q-function of the environment, and the actions it takes using that function will be increasingly accurate.
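Epsilon-greedy selection can be sketched as follows (the decay rate and the floor on ϵ are assumptions for illustration, not values from the project):

```python
import random

# Sketch of epsilon-greedy action selection with a decaying epsilon.
# `q_values` is the row of the Q-table for the current state.

def choose_action(q_values, epsilon):
    if random.random() < epsilon:
        return random.randrange(len(q_values))                   # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit

epsilon = 1.0
for step in range(1000):
    a = choose_action([0.1, 0.5, -0.2, 0.0], epsilon)
    epsilon = max(0.05, epsilon - 0.001)   # decay toward a small floor
```

Keeping a small floor on ϵ (rather than letting it reach zero) is a common choice so the agent never stops exploring entirely.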

The Unity Gridworld

The blue block corresponds to the agent, the red blocks to the obstacles, and the green block to the goal position. The green and red spheres correspond to the value estimates for each of the states within the GridWorld.

To demonstrate a Q-learning agent, we have built a simple GridWorld environment using Unity. The environment consists of the following: (1) an agent placed randomly within the world, (2) a randomly placed goal location that we want our agent to learn to move toward, and (3) randomly placed obstacles that we want our agent to learn to avoid. The state (s) of the environment will be an integer which corresponds to the position on the grid. The four actions (a) will consist of (Up, Down, Left, and Right), and the rewards (r) will be: +1 for moving to the state with the goal, -1 for moving to a state with an obstacle, and -0.05 for each step, to encourage the agent to move quickly to the goal. Each episode will end after 100 steps, or when the agent moves to a state with either a goal or an obstacle. As in the previous tutorial, the agent’s Q-values will be stored using a table, where the rows correspond to the states and the columns to the possible actions. You can play with this environment and agent within your web browser here, and download the Unity project to modify for use in your own games here. As the agent explores the environment, colored orbs will appear in each of the GridWorld states. These correspond to the agent’s average Q-value estimate for that state. Once the agent learns an optimal policy, it will be visible as a direct value gradient from the start position to the goal.
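Putting the pieces together, here is a self-contained sketch of a tabular Q-learning agent on a GridWorld of this kind. This is not the actual Unity project code: the grid size, obstacle positions, fixed start state, hyperparameters, and episode count are all assumptions chosen for illustration, though the rewards and 100-step episode limit follow the numbers given above.

```python
import random

random.seed(0)  # for reproducibility of this sketch

SIZE = 5
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # Up, Down, Left, Right
GOAL, OBSTACLES = (4, 4), {(1, 1), (2, 3)}     # assumed layout

def to_state(pos):
    return pos[0] * SIZE + pos[1]              # grid position -> integer state

def step(pos, action):
    r, c = pos[0] + action[0], pos[1] + action[1]
    new = (min(max(r, 0), SIZE - 1), min(max(c, 0), SIZE - 1))
    if new == GOAL:
        return new, 1.0, True                  # reached the goal
    if new in OBSTACLES:
        return new, -1.0, True                 # hit an obstacle
    return new, -0.05, False                   # small per-step penalty

Q = [[0.0] * 4 for _ in range(SIZE * SIZE)]    # table: states x actions
alpha, gamma, epsilon = 0.1, 0.9, 1.0

for episode in range(2000):
    pos = (0, 0)                               # fixed start (a simplification)
    for _ in range(100):                       # episodes end after 100 steps
        s = to_state(pos)
        if random.random() < epsilon:          # epsilon-greedy exploration
            a = random.randrange(4)
        else:
            a = max(range(4), key=lambda i: Q[s][i])
        pos, r, done = step(pos, ACTIONS[a])
        target = r if done else r + gamma * max(Q[to_state(pos)])
        Q[s][a] += alpha * (target - Q[s][a])  # Bellman update
        if done:
            break
    epsilon = max(0.05, epsilon - 0.001)       # decay exploration

print(max(Q[to_state((3, 4))]))  # value beside the goal should approach 1.0
```

Note that terminal transitions use the raw reward as the target, since there is no next state to bootstrap from; the decaying ϵ mirrors the exploration scheme described earlier.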

Going Forward

The agent and environment presented here represent a classic tabular formulation of the Q-learning problem. If you are thinking that there is not much in common between this basic environment and the ones you find in contemporary games, do not worry. In the years since the algorithm’s introduction in the 1990s, there have been a number of important developments that allow Q-learning to be used in more varied and dynamic situations. One prime example is DeepMind’s Deep Q-Network, which learned to play dozens of different Atari games directly from pixels, a feat impossible using only a lookup table like the one here. To accomplish this, they utilized an agent controlled by a Deep Neural Network (DNN). By using a neural network, it is possible to learn a generalized Q-function that can be applied to completely unseen states, such as novel combinations of pixels on a screen.

In the next few weeks we will release an interface with a set of algorithms and example projects to allow for the training of similar Deep Reinforcement Learning agents in Unity games and simulations. For a sneak peek of what these tools are capable of, you can check out the video link here. While this initial release will be limited, and aimed primarily at those working in research, industry, and game QA testing, we at Unity are excited about the possibilities opened up by using modern Deep Learning methods to learn game behavior. We hope that as this work matures, it will spark interest in using ML within games to control complex NPC behavior, game dynamics, and more. We are at the very beginning of exploring the use of Deep Learning in games, and we look forward to you continuing with us on this journey.

15 replies on “Unity AI – Reinforcement Learning with Q-Learning”

Great initiative! I look forward to more posts in the future.

Are there any plans to integrate support for deep learning toolkits? I know the CNTK team is working on a C# wrapper. This could be a great possibility for Unity to be the first game engine to support next gen AI!

I’m not sure where your major errors are, but why would ML people be fixing bugs in other parts of Unity? You’re complaining in the wrong place mate ;)

Having a blast following this series thus far. I can’t wait until you guys get to explore AI that is trained to react appropriately in dynamic environments ( I guess I am just begging for NNs lol)

Thanks for these blog posts!

ML seems pretty scary from the outside, but the value is very apparent. The implementations you’ve published will be very helpful if we try to do something like this ourselves.

OpenAI is not a framework. If you mean OpenAI’s Gym (or Universe), their library of different training environments (games), it probably doesn’t make much sense for Unity as the environments use the games’ UI and only work with Python atm.

A* finds the optimal path if the graph is known (you can describe exactly each state). Q-Learning (and Reinforcement Learning in general) tries to find the optimal path under unknown circumstances (part of the algorithm is to discover possible states, and often there are so many combinations that you can’t learn all of them anyway) and in stochastic environments (action only leads to expected state with a certain probability). If you know the map of your country and want to navigate between cities, use A*. If you want to find the optimal order of steps (keystrokes) in a game, it’s impossible to describe the problem as something A* could solve. Remember that A* uses heuristics to determine the next action. In games, you typically don’t have any feedback (change in score) until much later, so you need to learn the “heuristic” over time, after playing the game over and over. You’re right that once you know all the states in the world (which you’d get if you played the game an infinite number of times), you can come up with a policy that tells you what’s the best action to take in each step, similar to A*’s policy of selecting the best (cost + heuristic) action. However, the whole process of learning the set of possible states and their costs is an integral part of Q-Learning, which A* doesn’t deal with.

Very nice.
Some of the equations could use still or animated visuals.

Cheers and looking forward to more on this series.


How about breaking down those equations a little more and explain them step by step.

My father always told me: “There are two types of smart people in this world: Those who want to help people, and those who want to impress people with their intelligence. I hope you are the first kind!”

I totally agree, you’ve practically summed up most of the coders/ mathematicians, no offense y’all… But when will everyone understand that math is a language, very bad language because not a lot of people understand it, and those who do are generally not very friendly to… the most (with this weird nose up in the air behavior) . What Unity need is more template kinda stuff, simplify the saving system, most stuff that you see in gaming, make a simple template version of it so people can further customize it in their own way… If something is more needed beyond that, that’s where the coding kicks in, and the big coders should be there to simplify it in general for the software, if there is a demand for it. I’m not talking about custom stuff that you buy, most of it is crap, you end up wasting your money…

Again, simplify, simplify, simplify or if you’re too cool and too smart for that, you might start losing your user base (or at least not get a new one, while somebody else take the simple approach). Make it possible for the biggest doofus to make a Battlefield type game if he wants, very simply. That should be your goal.
