Today, we are releasing a new update to the Unity ML-Agents Toolkit that enables faster training by launching multiple Unity simulations running on a single machine. This upgrade will enable game developers to create character behaviors by significantly speeding up training of Deep Reinforcement Learning algorithms.
In this blog post, we give an overview of our work with our partner JamCity to train agents to play advanced levels of their bubble shooter, Snoopy Pop. Release v0.8 of the Unity ML-Agents Toolkit enabled them to train an agent to play a level on a single machine 7.5 times faster than was previously possible. Our work doesn’t stop here; we are also working on techniques to train multiple levels concurrently by scaling out training across multiple machines.
One of our core guiding principles since first releasing the Unity ML-Agents Toolkit has been to enable game developers to leverage Deep Reinforcement Learning (DRL) to develop behaviors for both playable and non-playable characters. We previously showed how DRL can be used to learn a policy for controlling Puppo using physics-based animations. However, real games are complex and DRL algorithms are computationally intensive and require a large volume of gameplay data in order to learn. Most DRL research leverages very lightweight games that can be sped up greatly (to generate gameplay data faster), whereas real games typically have constraints which require them to run at normal speed (or limit the amount of speed-up). This led us to focus on improving training on the most accessible computation platform available to a developer, their local development machine.
Creating emergent behaviors using DRL involves learning the weights of a neural network that represent a policy, a mapping from the agent’s observation to an action. Learning is accomplished by executing the policy on one or more simulation instances and using the output to update the weights in a manner that maximizes the agent’s reward. Training completes faster when we have more instances on which the policy is evaluated. Today, we are introducing the ability to train faster by having multiple concurrent instances of Unity on a multi-core machine. To illustrate the importance of utilizing multi-core machines in order to train agents in real games, we’ve partnered with JamCity and the Snoopy Pop game team. The changes we provide in v0.8 enable a training speedup of 5.5x on easy levels and up to 7.5x on harder levels by leveraging 16 Unity simulations. Generally speaking, the gains of utilizing multiple Unity simulations are greater for more complex levels and games.
The improvements in this update of the Unity ML-Agents Toolkit will enable you both to fully utilize the resources of your development machine and to greatly speed up training by leveraging a multi-core machine on a cloud provider such as Google Cloud Platform. We’ve additionally been experimenting with and building internal infrastructure to scale out training across multiple machines, so that a single policy can learn to solve many levels of Snoopy Pop at the same time. The video below demonstrates a single, trained agent playing through increasingly difficult levels of Snoopy Pop.
A single trained agent playing multiple levels of Snoopy Pop
ML-Agents Toolkit + Snoopy Pop
Snoopy Pop is a bubble shooter created by JamCity. In Snoopy Pop, the player needs to free the character Woodstock and his flock of birds by popping bubbles. The player can shoot a bubble at a particular angle or switch the color of the bubble before shooting. When the bubble sticks onto the same type of bubble and forms a group of more than three, the group will vanish, the bird in the bubble will be freed, and the player will improve their score. The player completes the level when all of the birds on the board are freed. Conversely, a player loses when they deplete all of the bubbles in their bag. Our goal is to train an agent that can play the game as the player would, and reach the highest level possible.
Using the ML-Agents Toolkit, an agent carries out a policy by receiving observations representing the game state and taking actions based on them. To solve Snoopy Pop using DRL, we first need to define these observations and actions, as well as the reward function which the policy attempts to maximize. As observations, the agent receives a simplified, low-resolution (84×84) version of the game board and the bubbles it is holding. The agent can then choose to shoot the bubble along 21 different angles or swap the bubble before shooting. After the bubble is shot and collides with (or does not collide with) other bubbles, the agent is rewarded for increasing the score, freeing birds, and winning. Negative rewards are also given for each bubble shot (to encourage the agent to solve the level quickly) and for losing.
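To make the reward structure described above more concrete, here is a minimal Python sketch that combines the positive and negative signals into a single per-shot reward. The `ShotOutcome` structure, the function name, and every numeric weight are hypothetical placeholders chosen for readability; the actual values used for Snoopy Pop are not given in this post.

```python
from dataclasses import dataclass


@dataclass
class ShotOutcome:
    """Result of a single bubble shot (hypothetical structure)."""
    score_gained: int  # points added to the score by this shot
    birds_freed: int   # birds released by this shot
    won: bool          # True if this shot completed the level
    lost: bool         # True if the bag ran out of bubbles


# All weights below are illustrative assumptions, not values from the game.
SCORE_WEIGHT = 0.001
BIRD_REWARD = 0.1
WIN_REWARD = 1.0
LOSS_PENALTY = -1.0
SHOT_PENALTY = -0.01  # small cost per shot, to encourage quick solutions


def shot_reward(outcome: ShotOutcome) -> float:
    """Combine the reward signals for one shot into a single number."""
    reward = SHOT_PENALTY
    reward += SCORE_WEIGHT * outcome.score_gained
    reward += BIRD_REWARD * outcome.birds_freed
    if outcome.won:
        reward += WIN_REWARD
    if outcome.lost:
        reward += LOSS_PENALTY
    return reward
```

With weights like these, a shot that frees no birds and scores nothing still costs the agent a little, which is what pushes it toward finishing a level in fewer shots.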
The use of visual observations with a large action space makes Snoopy Pop levels difficult to solve. For a simple level, the agent needs to take more than 80,000 actions to learn an effective policy. More difficult levels may take half a million actions or more.
Additionally, the game uses physics to simulate how the bubbles bounce and collide with other bubbles, making it difficult to change the timescale without substantially changing the dynamics of the game. Even at 5x timescale, we can only collect about two actions a second. At that rate, it would take over 11 hours to collect enough experience to solve a simple level and several days to solve more difficult ones. This makes it critical to scale out the data collection process by launching multiple, concurrent Unity simulations to make the best use of the machine’s resources.
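The time estimates above follow directly from the numbers already quoted. The short Python sketch below reproduces the arithmetic; the constant and function names are ours, and because the action counts are lower bounds ("more than 80,000", "half a million or more"), the results are lower bounds as well.

```python
ACTIONS_PER_SECOND = 2         # approximate collection rate at 5x timescale
SIMPLE_LEVEL_ACTIONS = 80_000  # actions needed for a simple level
HARD_LEVEL_ACTIONS = 500_000   # "half a million actions or more"


def collection_hours(num_actions, actions_per_second=ACTIONS_PER_SECOND):
    """Hours needed to collect num_actions with a single simulation."""
    return num_actions / actions_per_second / 3600


simple_hours = collection_hours(SIMPLE_LEVEL_ACTIONS)
hard_hours = collection_hours(HARD_LEVEL_ACTIONS)

print(f"simple level: {simple_hours:.1f} hours")
# prints "simple level: 11.1 hours"
print(f"hard level: {hard_hours:.1f} hours ({hard_hours / 24:.1f} days)")
# prints "hard level: 69.4 hours (2.9 days)"
```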
Running multiple, concurrent instances of Snoopy Pop
While we are limited in how much we can speed up a single instance of Snoopy Pop, multi-core processors allow us to run multiple instances on a single machine. Since each play-through of the game is independent, we can trivially parallelize the collection of our training data.
Each simulation feeds data into a common training buffer, which is then used by the trainer to update its policy in order to play the game better. This new paradigm allows us to collect much more data without having to change the timescale or any other game parameters which may have a negative effect on the gameplay mechanics. We believe this is the first necessary step in order to bring higher performance training to users of the ML-Agents Toolkit.
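The pattern of several simulations feeding one shared buffer can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the toolkit's implementation: threads stand in for separate Unity processes, `run_simulation` generates random placeholder experiences instead of playing the game, and all names are hypothetical.

```python
import queue
import random
import threading

NUM_ACTIONS = 21  # placeholder: one action per shooting angle


def run_simulation(worker_id, num_steps, buffer):
    """Stand-in for one game instance: pushes fake experiences to the buffer."""
    rng = random.Random(worker_id)  # each simulation gets its own random stream
    for step in range(num_steps):
        observation = rng.random()         # placeholder for the 84x84 board
        action = rng.randrange(NUM_ACTIONS)
        reward = rng.uniform(-1.0, 1.0)
        buffer.put((worker_id, step, observation, action, reward))


def collect_experiences(num_envs, steps_per_env):
    """Run num_envs simulations concurrently and drain the shared buffer."""
    buffer = queue.Queue()  # thread-safe, shared by every simulation
    workers = [
        threading.Thread(target=run_simulation, args=(i, steps_per_env, buffer))
        for i in range(num_envs)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()

    experiences = []
    while not buffer.empty():
        experiences.append(buffer.get())
    return experiences  # a trainer would use this batch to update the policy


batch = collect_experiences(num_envs=4, steps_per_env=10)
print(len(batch))  # prints "40"
```

Because each simulation only writes to the buffer and never reads another simulation's state, adding more of them requires no changes to the game itself.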
To demonstrate the utility of launching multiple, concurrent Unity simulations we’re sharing training times for two different levels of Snoopy Pop (Level 2 and 25). More specifically, we recorded the training time across a varying number of Unity simulations. Since each additional concurrent environment has a small coordination overhead, we expect diminishing returns as we scale further. Additionally, for simple levels or games, adding more Unity simulations may not improve performance as the gameplay data generated from those additional simulations will be highly correlated with existing gameplay data and thus won’t provide a benefit to the training algorithm. To summarize, expect diminishing returns as you add more Unity simulations, where the diminishing rate depends on the difficulty of the level or game on which the model is being trained.
The first graph below shows the time the v0.8 release takes to solve level 2 of Snoopy Pop with between 1 and 16 parallel environments. We took the average time across 3 runs, since randomness in the training process can significantly change the time from run to run. You’ll notice a very large performance boost when scaling from one to two environments, followed by steady but sub-linear scaling, with a 5.5x improvement when using 16 environments versus 1 environment.
We also find that the effect of training with parallel environments becomes more pronounced on more difficult levels of Snoopy Pop. This is because, on more difficult levels, the experiences generated across the multiple simulations are more independent (and thus more beneficial to the training process) than on simpler levels. Here is a graph comparing the performance of our v0.8 release on level 25 of Snoopy Pop. Note that there is an almost 7.5x improvement when using 16 environments compared to 1 environment.
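One way to read these sub-linear results is parallel efficiency: the measured speedup divided by the number of environments. The Python sketch below applies that standard definition to the two figures reported above; the function name is ours, and the 7.5x value is rounded up slightly since the text says "almost 7.5x".

```python
NUM_ENVS = 16

# Speedups reported in this post for 16 environments versus 1.
reported_speedups = {"level 2": 5.5, "level 25": 7.5}


def parallel_efficiency(speedup, num_envs):
    """Fraction of ideal linear scaling actually achieved."""
    return speedup / num_envs


for level, speedup in reported_speedups.items():
    efficiency = parallel_efficiency(speedup, NUM_ENVS)
    print(f"{level}: {speedup}x on {NUM_ENVS} environments = {efficiency:.0%} efficiency")
# prints "level 2: 5.5x on 16 environments = 34% efficiency"
# prints "level 25: 7.5x on 16 environments = 47% efficiency"
```

The harder level retains noticeably more of the ideal speedup, which matches the observation that its additional simulations contribute less correlated data.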
Today’s release of ML-Agents Toolkit v0.8 supports training with multiple, concurrent Unity simulations on a single machine. If you have an existing environment, you’ll just need to update to the latest version of the ML-Agents Toolkit and rebuild your game. After upgrading, you’ll have access to a new option for the mlagents-learn tool that lets you specify the number of parallel environments you’d like to run. See our documentation for more information.
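As a hedged illustration of what launching training with parallel environments might look like, the Python sketch below assembles an mlagents-learn command line without running it. The flag names (`--num-envs`, `--env`, `--run-id`, `--train`) reflect our understanding of the v0.8 command-line interface and should be checked against the documentation; the config path, build path, and run ID are hypothetical examples.

```python
def build_train_command(config_path, env_path, run_id, num_envs):
    """Assemble an mlagents-learn invocation as an argument list."""
    if num_envs < 1:
        raise ValueError("num_envs must be at least 1")
    return [
        "mlagents-learn",
        config_path,                # trainer configuration file
        f"--env={env_path}",        # path to the built Unity game
        f"--run-id={run_id}",       # label for this training run
        f"--num-envs={num_envs}",   # number of concurrent Unity simulations
        "--train",
    ]


# Hypothetical paths and run ID, for illustration only.
command = build_train_command(
    config_path="config/trainer_config.yaml",
    env_path="builds/SnoopyPop",
    run_id="snoopy-level2",
    num_envs=8,
)
print(" ".join(command))
```

A reasonable starting point for `num_envs` is the number of CPU cores on the machine, reduced if the simulations start competing with the trainer for resources.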
In addition to the ability to launch multiple Unity simulations, this update of the ML-Agents Toolkit comes with a few bonus features.
Custom protocol buffer messages
Many researchers need the ability to exchange structured data between Python and Unity outside of what is included by default. In this release, we’ve created an API which allows any developer to create custom protocol buffer messages and use them as observations, actions, or reset parameters.
Render texture observations
In addition to Visual Observations with Cameras, we’ve also included the ability to use RenderTexture. This will enable users to render textures for Visual Observations in ways other than using a camera, such as 2D Sprites, webcam, or other custom implementations.
2D ray casting
Many users have asked about using ray casting in their 2D games. In this release, we have refactored RayPerception and added support for 2D ray casting (RayPerception2D).
Multiple Python packages
We have split the mlagents Python package into two separate packages (mlagents.trainers and mlagents.envs). This will allow users to decouple version dependencies, like TensorFlow, and make it easier for researchers to use Unity environments without having to disrupt their pre-existing Python configurations.
Thanks to our contributors
The Unity ML-Agents Toolkit is an open-source project that has greatly benefited from community contributions. Today, we want to thank the external contributors who have made enhancements that were merged into this release: @pyjamads for render texture, @Tkggwatson for the optimization improvements, @malmaud for the custom protocol buffer feature, @LeSphax for the video recorder, and @Supercurious, @rafvasq, @markovuksanovic, @borisneal, and @dmalpica for various improvements.
This release of the Unity ML-Agents Toolkit enables you to train agents faster on a single machine. We intend to continue investing in this area and release future updates that will enable you to better maximize the resource usage on a single machine.
If you’d like to work on this exciting intersection of Machine Learning and Games, we are hiring for several positions. Please apply!
If you use any of the features provided in this release, we’d love to hear from you. For any feedback regarding the Unity ML-Agents Toolkit, please fill out the following survey and feel free to email us directly. If you encounter any issues or have questions, please reach out to us on the ML-Agents GitHub issues page.