Welcome to the first of Unity’s new AI-themed blog entries! We have set up this space as a place to share and discuss the work Unity is doing around Artificial Intelligence and Machine Learning. In the past few years, advances in Machine Learning (ML) have allowed for breakthroughs in detecting objects, translating text, recognizing speech, and playing games, to name a few. That last point, the connection between ML and games, is something very close to our hearts here at Unity. We believe that breakthroughs in Deep Learning are going to create a sea-change in how games are built, changing everything from how textures and 3D models are generated, to how non-playable characters (NPCs) are programmed, to how we think about animating characters or lighting scenes. These blog entries are a creative space to explore all these emerging developments.

Who are these blog entries for?

Our objective is to inform Unity Game Developers about the power of AI and ML approaches in game development. We also want to show Artists the opportunities of using AI techniques in content creation, and to demonstrate to ML Researchers the potential of Unity as a platform for AI research and development. This includes demonstrating the industry potential of Unity as a simulation platform for robotics and self-driving cars. And finally, we want to get Hobbyists and Students excited about both Unity and ML.

Over the next few months, we hope to use this space to start discussions and build a community around these concepts and use-cases of Unity. Multiple members of the Unity ML team and other related teams within Unity will post here discussing the different connections between Unity and Machine Learning. Whenever possible, we will release open source tools, videos, and example projects to help the different groups mentioned above utilize the ideas, algorithms, and methods we have shared. We will be monitoring this space closely, and encourage the Unity community to contribute commentary as well.

Why Machine Learning?

To begin the conversation, we want to spend this first entry talking specifically about the relationship between ML and game AI. Most game AI that currently exists is hand-coded, consisting of decision trees with sometimes up to thousands of rules, all of which must be maintained by hand and thoroughly tested. In contrast, ML relies on algorithms that can make sense of raw data, without needing an expert to define how to interpret that data.

Take for example the computer vision problem of classifying the content of an image. Until a few years ago, experts would write filters by hand that would extract useful features for classifying an image as containing a cat or dog. In contrast, ML, and in particular the newer Deep Learning approaches, needs only the images and class labels, and learns the useful features automatically. We believe that this automated learning can help simplify and speed up the process of creating games for developers both big and small, in addition to opening up the possibilities of the Unity platform being used in a wider array of contexts, such as simulations of ML scenarios.

This automated learning can be applied specifically to game agent behavior, a.k.a. NPCs. We can use Reinforcement Learning (RL) to train agents to estimate the value of taking actions within an environment. Once they have been trained, these agents can take actions to receive the most value, without ever having been explicitly programmed how to act. The rest of this post consists of a simple introduction to RL and a walkthrough of how to implement a simple RL algorithm in Unity! And of course, all the code used in this post is available in the GitHub repository here. You can also access a WebGL demo.

Reinforcement Learning with Bandits

As mentioned above, a core concept behind RL is the estimation of value, and acting on that value estimate. Before going further, it will be helpful to introduce some terminology. In RL, what performs the acting is called an agent, and what it uses to make decisions about its actions is called a policy. An agent is always embedded within an environment, and at any given moment the agent is in a certain state. From that state, it can take one of a set of actions. The value of a given state refers to how ultimately rewarding it is to be in that state. Taking an action in a state can bring an agent to a new state, provide a reward, or both. The total cumulative reward is what all RL agents try to maximize over time.

The simplest version of an RL problem is called the multi-armed bandit. The name derives from the problem of optimizing payout across multiple slot machines, also referred to as “one-armed bandits” given their propensity for stealing quarters from their users. In this set-up, the environment consists of only a single state, and the agent can take one of n actions. Each action provides an immediate reward to the agent. The agent’s goal is to learn to pick the action that provides the greatest reward.

To make this a little more concrete, let’s imagine a scenario within a dungeon-crawler game. The agent enters a room, and finds a number of chests lined up along the wall. Each of these chests has a certain probability of containing either a diamond (reward +1) or an enemy ghost (reward -1).

The goal of the agent is to learn which chest is the most likely to have the diamond (say, for example, third from the right). The natural way to discover which chest is the most rewarding is to try each of the chests out. Indeed, until the agent has learned enough about the world to act optimally, much of RL consists of simple trial and error. Bringing the example above back to the RL lingo, the “trying out” of each chest corresponds to taking a series of actions (opening each chest multiple times), and the learning corresponds to updating an estimate of the value of each action. Once we are reasonably certain about our value estimations, we can then have the agent always pick the chest with the highest estimated value.
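
The chest room is a multi-armed bandit environment. As an illustration (the post’s actual demo code is C#, in the linked repository; the names and probabilities here are made up), such a room can be sketched in a few lines of Python:

```python
import random

class ChestRoom:
    """A multi-armed bandit: each chest has its own probability of a diamond."""
    def __init__(self, diamond_probs):
        self.diamond_probs = diamond_probs  # one probability per chest

    def open_chest(self, action):
        # Reward +1 for a diamond, -1 for a ghost.
        if random.random() < self.diamond_probs[action]:
            return 1.0
        return -1.0

room = ChestRoom([0.1, 0.5, 0.9])   # the last chest is the best one
reward = room.open_chest(2)
```

The agent never sees `diamond_probs` directly; it can only open chests and observe rewards, which is why trial and error is needed.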

These value estimates can be learned using an iterative process in which we start with an initial set of estimates V(a), and then adjust them each time we take an action and observe the result. Formally, this is written as:

V(a) ← V(a) + α(r − V(a))

where α is a small learning rate and r is the reward received for taking action a.

Intuitively, the above equation is stating that we adjust our current value estimate a little bit in the direction of the obtained reward. In this way we ensure we are always changing our estimate to better reflect the true dynamics of the environment. In doing so, we also ensure that our estimates don’t become unreasonably large, as might happen if we simply counted positive outcomes. We can accomplish this in code by keeping a vector of value estimates, and referencing them with the index of the action our agent took.
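
A minimal Python sketch of this update (the function and variable names are illustrative, not from the post’s repository):

```python
def update_value(values, action, reward, learning_rate=0.1):
    """Nudge the estimate for the taken action toward the observed reward."""
    values[action] += learning_rate * (reward - values[action])
    return values

values = [0.0, 0.0, 0.0]                     # one estimate per action (chest)
values = update_value(values, action=1, reward=1.0)
# values[1] moves 10% of the way from 0.0 toward 1.0
```

Because each update moves the estimate only a fraction of the way toward the latest reward, the estimates settle near the average reward of each action rather than growing without bound.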

Contextual Bandits

The situation described above lacks one important aspect of any realistic environment: it only has a single state. In reality (and any game world), a given environment can have anywhere from dozens (think rooms in a house) to billions (pixel configurations on a screen) of possible states. Each of these states can have its own unique dynamics in terms of how actions provide new rewards or enable movement between states. As such, we need to condition our actions, and by extension our value estimates, on the state as well. Notationally, we will now use Q(s, a) instead of just V(a). Abstractly this means that the reward we expect to receive is now a function of both the action we take, and the state we were in when taking that action. In our dungeon game, the concept of state can enable us to have different sets of chests in different rooms. Each of these rooms can have a different ideal chest, and as such our agent needs to learn to pick different actions in different rooms. We can accomplish this in code by keeping a matrix of value estimates, instead of simply a vector. This matrix can be indexed with [state, action].
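
Continuing the Python sketch (the room and chest counts here are arbitrary, for illustration only), the only change from the bandit case is that the vector of estimates becomes a matrix indexed by both state and action:

```python
num_states, num_actions = 3, 4   # e.g. three rooms, four chests per room

# A matrix of value estimates, indexed [state][action], initialized to zero.
q_values = [[0.0] * num_actions for _ in range(num_states)]

def update_q(q, state, action, reward, learning_rate=0.1):
    """Same incremental update as before, now conditioned on the state."""
    q[state][action] += learning_rate * (reward - q[state][action])

update_q(q_values, state=2, action=0, reward=1.0)
```

Each row of the matrix is learned independently, so the best chest in one room has no effect on the estimates for another room.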

Exploring and Exploiting

There is one more important piece of the puzzle to getting RL to work. Before our agent has learned a policy for taking the most rewarding actions, it needs a policy that will allow it to learn enough about the world to be sure it knows what optimal actually is. This presents us with the classic dilemma of how to balance exploration (learning about the environment’s value structure through trial and error) and exploitation (acting on the environment’s learned value structure). Sometimes these two goals line up, but often they do not. There are a number of strategies for balancing these two goals. Below we have outlined a few approaches.

  • One simple, yet powerful strategy follows the principle of “optimism in the face of uncertainty.” The idea here is that the agent starts with high value estimates V(a) for each action, so that acting greedily (taking the action with the maximum value) will lead the agent to explore each of the actions at least once. If the action didn’t lead to a good reward, the value estimate will decrease accordingly, but if it did, then the value estimate will remain high, as that action might be a good candidate to try again in the future. By itself though, this heuristic is often not enough, since we might need to keep exploring a given state to find an infrequent, but large reward.
  • Another strategy is to add random noise to the value estimates for each action, and then act greedily based on these new noisy estimates. With this approach, as long as the noise is less than the difference between the true optimal action and the other actions, it should converge to optimal value estimates.
  • We could also go one step further and take advantage of the nature of the value estimates themselves by normalizing them, and taking actions probabilistically. In this case if the value estimates for each action were roughly equal, then we would take actions with equal probability. On the flip side, if one action had a much greater value estimate, then we would pick it more often. By doing this we slowly weed out unrewarding actions by taking them less and less. This is the strategy we use in the demo project.
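
The third strategy, normalizing the estimates and sampling actions in proportion to them, is commonly implemented as a softmax over the values. A minimal Python sketch (the demo project’s actual C# implementation may differ in its details):

```python
import math
import random

def softmax_action(values, temperature=1.0):
    """Sample an action with probability proportional to exp(value / temperature)."""
    exps = [math.exp(v / temperature) for v in values]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Draw one action according to the computed probabilities.
    r, cumulative = random.random(), 0.0
    for action, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return action
    return len(values) - 1  # guard against floating-point rounding

action = softmax_action([0.1, 0.1, 2.0])  # the last action is sampled most often
```

The temperature controls the exploration–exploitation balance: a high temperature flattens the probabilities toward uniform (more exploration), while a low one concentrates them on the highest-valued action (more exploitation).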

Going Forward

With this blog and the accompanying code, you should now have all the pieces needed to start working with multi-armed and contextual bandits in Unity. This is all just the beginning. In a follow-up post we will go through Q-learning in full RL problems, and from there start to tackle learning policies for increasingly complex agent behavior in visually rich game environments using deep neural networks. Using these more advanced methods, it is possible to train agents which can serve as companions or opponents in genres ranging from fighting and driving games to first-person shooters, or even real-time strategy games. All without writing rules, focusing instead on what you want the agent to achieve rather than how to achieve it.

In the next few postings we will also be providing an early release of tools to allow researchers interested in using Unity for Deep RL research to connect their models written with frameworks such as TensorFlow or PyTorch to environments made in Unity. On top of all that we have a lot more planned for this year beyond agent behavior, and we hope the community will join us as we explore the uncharted territory that is the future of how games are made!

You can read the second part of this blog series here.

22 replies on “Unity AI-themed Blog Entries”

The way we connect with technology is changing: the way we interface with it is becoming more conversational, perceptual, and social. Since these interfaces must engage with people, they need social and emotional intelligence to be as effective as possible.

This sounds extremely interesting. I wonder if in the future we’ll have an augmented form of this type of system serve as a playtester or QA tester. Developing an agent to hunt down bugs and crashes would be useful… but it’d sure be hard to quantify and deliver the reward!

What about a simple real gaming example, where you take one of the Learn section’s tutorial games and add ML to an enemy’s behaviour? Something we can play with and learn with.

Also you might look into an ML preview build so you start finding out what developers want and need and the various levels of competency you will need to cater for.

I’m excited to see what Unity comes up with to support AI and ML development. If anyone is interested in seeing Q-Learning working in-game and offline, along with Neuro Evolution, I finished a project at University a few months ago which involved developing a 3D Hack ‘n’ Slash game in Unity and applying Machine Learning techniques to create interesting and challenging AI of different difficulty levels. I don’t think I can post links here, but it’s the first link on my website.

If anyone’s interested in seeing Q-Learning working offline and in-game, along with Neuro-evolution, I have a post, video, and download of my university project about Machine Learning working in a 3D Hack ‘n’ Slash game (Made with Unity).

The conclusion –

Mostly about Q-Learning –

And I found that the higher the confidence value, the better the results, so why does the confidence value need to be set low?

Hi Lee,

Good question! You want to start with a lower confidence to ensure the agent properly explores all the possible actions and learns about their expected returns. Then, once it has learned enough to have accurate value estimates, it can act to maximize rewards using high confidence.

So where is the trained and learned AI data stored? I want to load that trained AI. And can these AIs be trained differently?

Very good!!!! I have always feared that by putting my effort into game development only, I was falling further behind the current boom in AI and Machine Learning. So ML can be achieved with C# inside Unity!

Great approach and I feel this can take me to totally advanced dimension of game design and making fun game!

Really looking forward to this. I like the fact that you dive deeper into this topic and are planning to introduce more advanced techniques as well. Keep up the good work!

I’m happy to see you guys bringing RL concepts to the game developers and this is a pretty good intro!

If I may make a couple minor suggestions.

1) We usually say that the environment is in a state, not the agent.

2) I would not use the notation V(a) for an action value. The letter V almost always refers to the state value function in literature (either conditioned on the agent’s current policy or the optimal policy). If you’re talking about an action value, use Q instead, even when you’re talking about a one-state multi-armed bandit. This will keep your naming convention consistent if you later introduce RL algorithms that learn a state value function (e.g., actor critic methods).

Hi James,

Thanks for the feedback. Throughout these posts I’ll do my best to balance terminological consistency with ease of explanation. In this case I felt that talking about the value of each action V(a) would be simpler than the Q value without a state to condition it. I agree that it is non-conventional (and potentially confusing), and will make explicit in future entries the differences between state-conditioned action value estimates Q(s, a) and state value estimates V(s).

A great strategic move for Unity. Looking forward to the coming tools and tutorials.

VERY VERY GOOD explanation of Q-Learning. Obviously basic, but that’s what I have always loved about Unity: you guys go over everything in the most fundamental, basic way first to ensure the user gets it. And it doesn’t stop there; the community has taken on this approach as well. You guys started something really special in the gaming industry, and I am looking forward to seeing what you will do to shake up the AI industry. Why? Because I am both a games programmer by hobby and an AI/SI researcher by profession. And now both of my favorite things are tied to my favorite tool/platform/community.

It would be nice to have an AI blog newsletter. If any of you at Unity have talked to anyone in the AI industry, then you know how desperately we need fresh new ideas in AI (you know what I am talking about). It’s going to be exciting watching the Unity community orient around AI now.

Looking forward to this! I am working on a PhD in Human Centered Computing. My work is focusing on virtual human and human interaction. To this point, we have utilized a canned or wizard of Oz control for the virtual humans, but I have been trying to learn and incorporate ML into Unity for a more fluid and less structured interaction. Great post! Keep up the good work.

I just skimmed over the article, because I did not have the time to read it thoroughly, but this topic interests me greatly. I will read it more carefully when I arrive home.

I look forward to more posts about this subject. Good work!

Comments are closed.