GameTune: Introducing reinforcement learning for optimizing the player lifecycle
Across the board, we’ve found that developers have significant interest in using machine learning to fine-tune a game from a truly objective perspective. We built GameTune to do exactly that: to offer a machine learning solution so developers can improve game performance on a scientific basis.
During the early stages we focused on simple use cases like optimizing the tutorial difficulty to maximize early player retention. Later, we moved on to more complex use cases – like level-balancing and interstitial ad frequency optimization – that maximize the lifetime value (LTV) of players. When we launched GameTune, our approach was based on standard supervised machine learning.
We now have a fully automated solution to optimize sequential decisions per user by using reinforcement learning. Reinforcement learning makes it possible to optimize the full player lifecycle.
Why sequential decisions are crucial to game optimization
When there are multiple points at which we can decide how to interact with the player – and the impact of earlier interactions can accumulate – we need sequential decision-making.
A great example of a use case that requires sequential decision-making is dynamic difficulty adjustment. Adjusting the difficulty can mean a lot of things, for instance making the game easier or harder for players by giving them different amounts of clues to complete a level. If the player is given too many clues, and the game is made too easy, they can become bored, which can lead to decreased enjoyment and ultimately churn. If the player is challenged too much with not enough clues, they may have a low sense of competence and lose interest. Since each player is different, there is no single right answer to how difficult the game should be. The proper balance for a game is different for each player and depends on their specific progression.
One way to approach this problem is to have, for example, three levels of difficulty to choose from per game level: easy, medium, and hard. The difficulty of a certain level can vary across players and each player will have a different journey in the game. By adjusting the difficulty level, we want to ensure that the decisions that are made for the player will maximize key metrics such as D7 retention or LTV. In a free-to-play mobile game, LTV typically comprises revenue from interstitial ads that players can see between levels, rewarded ads, and in-app purchases (IAPs).
There are several considerations that make it difficult to choose the right difficulty for each level. For instance, we might make multiple decisions during a seven-day period. In that case, should the difficulty for each player be determined only once, or separately per level? How can we quantify the impact of each decision within the entire seven-day window? Can we attribute an IAP wholly to the most recent decision to make the game more difficult? After all, long-term retention and LTV depend on the overall engagement of the player.
Using supervised learning to solve sequential problems
Supervised learning is the classic way to approach machine learning problems when we have historical training examples available. If we know the outcome in past cases, we can learn to predict it for new cases. It’s also a good starting point for solving sequential problems. Let’s formalize the problem. Assume that the player is in some state s. We can summarize the state with features that incorporate all the relevant current and past information about the player. This could include general information about the user (demographics, device, language, location) and specific information from the game, such as the current level and the history of relevant actions in the game (sessions, average hints, tokens used, rewarded ads watched, etc.).
In the current state s_i, we want to select the next action a_i. In our level-balancing example, there are three possible actions that define the current level of difficulty: easy, medium, and hard. After we’ve taken the action a_i, the player continues the game. At the next interaction point, the user is in a new state s_{i+1} and we are ready to choose the next action a_{i+1}. Between the interaction points, the user can generate a reward r_i. The reward r_i can be, for example, the total revenue we’ve received from interstitial ads, rewarded ads, and IAPs between the states.
In supervised learning, we predict the reward after taking the action a_i in the state s_i. We can either prioritize short-term benefits and optimize the immediate reward r_i, or prioritize long-term benefits and optimize the total future reward. In practice, we usually use a seven-day reward window. Let’s denote the observed total reward by R_i (i.e., the sum of r_i, r_{i+1}, … over the seven days following a_i). In the supervised setting, we can train a value function Q(s, a) to predict the observed total reward R using standard methods like deep learning. Then, in state s_i, we simply select the action a that gives the highest predicted reward Q(s_i, a) as the next action a_i.
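As a sketch of this setup, here is a minimal tabular stand-in for the value function Q(s, a): it averages the observed seven-day reward R per (state, action) pair and then picks the highest-scoring action. (GameTune itself uses deep learning over player features; the lookup table, the toy state names, and the numbers below are purely illustrative.)

```python
from collections import defaultdict

ACTIONS = ["easy", "medium", "hard"]

class SupervisedQ:
    """Tabular stand-in for Q(s, a): the average observed total reward R
    per (state, action) pair. A real system would fit a neural network
    over player features instead of a lookup table."""
    def __init__(self):
        self.sums = defaultdict(float)
        self.counts = defaultdict(int)

    def fit(self, examples):
        # examples: iterable of (state, action, observed total reward R)
        for s, a, r in examples:
            self.sums[(s, a)] += r
            self.counts[(s, a)] += 1

    def predict(self, s, a):
        n = self.counts[(s, a)]
        return self.sums[(s, a)] / n if n else 0.0

    def best_action(self, s):
        # Select the action with the highest predicted total reward.
        return max(ACTIONS, key=lambda a: self.predict(s, a))

# Toy data: at level 3, "medium" produced the highest average reward.
data = [("level3", "easy", 0.1), ("level3", "medium", 0.5),
        ("level3", "medium", 0.7), ("level3", "hard", 0.2)]
q = SupervisedQ()
q.fit(data)
print(q.best_action("level3"))  # -> medium
```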
For learning, we need historical data with the realized rewards. In the beginning we do not have any, so we can start with a randomized exploration strategy to select actions and gather learning data. The more data we gather, the better predictions and policy we can produce.
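One common randomized exploration strategy – used here purely as an illustration, not necessarily GameTune’s exact scheme – is ε-greedy: with probability ε pick a random action to gather data, otherwise pick the currently best-predicted action.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon, explore by choosing a uniformly random
    action; otherwise exploit the action with the highest predicted value.
    q_values maps action -> predicted reward."""
    actions = list(q_values)
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=q_values.get)

q_values = {"easy": 0.1, "medium": 0.6, "hard": 0.2}
print(epsilon_greedy(q_values, epsilon=0.0))  # epsilon=0 always exploits -> medium
```

As more data comes in, ε is typically decayed so the policy shifts from exploration toward exploitation.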
From supervised learning to reinforcement learning
In GameTune, we had a supervised learning setup in place that we could use to learn such value functions Q(s, a). That allowed us to start solving the sequential cases. However, we quickly learned that it does not work well in cases where we take multiple actions inside the reward window. The major issue was that we could not adequately measure the impact of each action on the final reward.
There are a number of ways to adapt supervised learning for sequential use cases. If, for example, we were to evaluate only one action per player – keeping it constant for the whole game – it would be easy to assign the total reward to that action. However, we know that players evolve during the game, and with constant actions we miss opportunities to optimize player progression. For some players, it might be better to start easy and gradually make the game harder, so we would rather choose the action per level than per player.
The next logical step is to give more weight to rewards that happen closer to the action. We can, for example, use exponential time decay, or discount the value based on the number of states between the action and the reward. A simple discounted reward is R_i = Σ_{j≥i} γ^{j−i} r_j, where γ is less than one. The smaller γ is, the closer the target is to the immediate reward and the easier it is to learn.
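The discounted reward above can be computed for every step in one backward pass over the reward sequence. This small helper is a generic sketch, not GameTune code:

```python
def discounted_returns(rewards, gamma):
    """Compute R_i = sum_{j>=i} gamma^(j-i) * r_j for every step i.
    Working backwards lets each R_i reuse R_{i+1}:
    R_i = r_i + gamma * R_{i+1}."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for i in reversed(range(len(rewards))):
        running = rewards[i] + gamma * running
        returns[i] = running
    return returns

print(discounted_returns([1.0, 0.0, 2.0], gamma=0.5))  # -> [1.5, 1.0, 2.0]
```

With γ = 1 this reduces to the plain sum of future rewards; with γ = 0 it is just the immediate reward.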
Unfortunately, the approach above usually fails to learn policies that optimize the long-term impact. With these drawbacks, we couldn’t get supervised learning to work well enough in GameTune for the sequential decision cases. The essential challenge with supervised learning is credit assignment. Reinforcement learning provides a framework for learning how much credit to assign to each action on the way to the total return.
Implementing reinforcement learning
Reinforcement learning, as an approach, has been around for a long time but it has gained a lot of popularity in recent years due to practical successes such as AlphaGo achieving superhuman performance in the board game Go. Most of these approaches combine deep learning and reinforcement learning. One of the simplest methods is deep Q-learning where we want to learn the state-action value function Q(s, a) by minimizing the following error term, or loss:
(Q(s_i, a_i) − [r_i + γ max_a Q(s_{i+1}, a)])²
It’s interesting to compare this to supervised learning, where we minimize the loss:
(Q(s_i, a_i) − R_i)²
In reinforcement learning, we’ve replaced the future reward R_i with the sum of the immediate reward r_i and the best possible future reward max_a Q(s_{i+1}, a).
To train deep Q-learning, we need to enrich the training data. For supervised learning, we needed a dataset of tuples (s_i, a_i, R_i). For deep Q-learning, we also need the next state s_{i+1}, and only the immediate reward – i.e., tuples (s_i, a_i, r_i, s_{i+1}). Note that we do not need the next action a_{i+1} for learning. Because of this, Q-learning is called “off-policy” learning: it considers all available actions through the max operation, not just the one the data-gathering policy happened to take.
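To make the loss concrete, here is a tabular sketch of one Q-learning update on a single (s, a, r, s_next) tuple. In deep Q-learning a neural network replaces the table and the update becomes a gradient step on the squared loss above, but the target is the same. All state names and numbers here are illustrative.

```python
def q_learning_update(Q, transition, actions, gamma=0.9, lr=0.1):
    """One Q-learning step on a (s, a, r, s_next) tuple: move Q(s, a)
    toward the TD target r + gamma * max_a' Q(s_next, a').  The action
    actually taken in s_next never appears -- the max over all actions
    is what makes this off-policy."""
    s, a, r, s_next = transition
    target = r + gamma * max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = Q.get((s, a), 0.0) + lr * (target - Q.get((s, a), 0.0))

ACTIONS = ["easy", "medium", "hard"]
Q = {}
q_learning_update(Q, ("level1", "easy", 1.0, "level2"), ACTIONS)
print(Q[("level1", "easy")])  # -> 0.1 (0.0 moved 10% toward target 1.0)
```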
From a code point of view, it’s actually straightforward to change your supervised learning to Q-learning: simply include the next state in the learning data and modify your loss function. From a practical point of view, however, a working solution is still a long way off.
Practical learnings from reinforcement learning solutions
Most of the current successes in reinforcement learning occur in settings where we can simulate how the methods perform online. For example, OpenAI Gym provides an easy environment for testing different reinforcement learning algorithms on tasks such as playing games like Pong or Pinball.
There are only a few existing toolkits that focus on training reinforcement learning models in a batch setting, where the main input data consists of historical observations from old policies rather than freshly self-generated training data. We evaluated many of these frameworks. One of the most promising is ReAgent – developed and used at Facebook, built on top of PyTorch, and designed for practical applications. However, most of the frameworks are still in very early phases, and there is no single one we can recommend.
In GameTune, we have built our modeling using TensorFlow and TensorFlow Probability. We had a good pipeline in place for data preprocessing, feature transformation, supervised learning, evaluation, and serving. For us, it was easiest to build the reinforcement learning methods on top of our current solution.
Implementing the first version of deep Q-learning, with updated input data parsing, took about a week. However, it was months before the reinforcement learning reliably achieved better results than the discounted supervised learning solution. We improved our Q-learning with standard DQN extensions like double and dueling networks. Since the revenue from ads and IAPs can behave very differently, we implemented distributional reinforcement learning and separated our reward sources. All of these small improvements have made our policy learning more stable and deliver consistently better results.
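As an illustration of one of those extensions, here is the double-DQN target in tabular form: the online network chooses the best next action, and a slower-moving target network scores it. The dictionaries below stand in for the two networks; the values are toy numbers.

```python
def double_dqn_target(q_online, q_target, r, s_next, actions, gamma=0.9):
    """Double-DQN target: the online estimate selects the best next
    action, the target estimate evaluates it. This damps the
    overestimation bias of the plain max_a Q(s_next, a) target."""
    best = max(actions, key=lambda a: q_online.get((s_next, a), 0.0))
    return r + gamma * q_target.get((s_next, best), 0.0)

q_online = {("level2", "easy"): 1.0, ("level2", "hard"): 0.5}
q_target = {("level2", "easy"): 0.2}
print(double_dqn_target(q_online, q_target, 1.0, "level2",
                        ["easy", "medium", "hard"], gamma=0.5))  # -> 1.1
```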
Evaluating reinforcement learning policies offline is much harder than evaluating supervised learning methods. For supervised methods, we can measure, for example, accuracy and likelihood. In reinforcement learning, however, improving these metrics does not necessarily mean that we get a better policy for selecting actions – what matters is how the actions compare to each other. From historical data, we cannot directly measure what would have happened had we chosen a different action in a specific state. Various offline policy evaluation methods try to tackle this problem, but offline policy evaluation is especially hard when the old policy that took the actions differs substantially from the new policy.
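One of the simplest offline policy evaluation methods is inverse propensity scoring (IPS), sketched below on toy data. It illustrates the reweighting idea, and also why evaluation gets hard when the policies diverge: the importance weights (and hence the variance) blow up. This is an illustration only, not GameTune’s evaluator.

```python
def ips_value(logged, new_policy_prob):
    """Inverse propensity scoring: estimate the average reward a new
    policy would have collected, from logged (state, action, reward,
    logging_probability) tuples. Each logged reward is reweighted by
    how much more (or less) likely the new policy is to take that
    action than the logging policy was."""
    n = len(logged)
    return sum(new_policy_prob(s, a) / p * r for s, a, r, p in logged) / n

# Toy log: the old policy picked actions uniformly (p = 0.5);
# the hypothetical new policy always plays "easy".
logged = [("s0", "easy", 1.0, 0.5), ("s0", "hard", 0.0, 0.5)]
new_policy = lambda s, a: 1.0 if a == "easy" else 0.0
print(ips_value(logged, new_policy))  # -> 1.0
```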
Currently, we run reinforcement learning for all our clients in GameTune by default. It’s delivering strong results – read about Prime Peaks’ usage here. The state of the art in reinforcement learning is constantly evolving and we are following the field very closely. We’re currently exploring fully parameterized quantile reinforcement learning and dualDICE off-policy evaluation to make GameTune reinforcement learning work even better for our customers.
If you’re looking for a scientific way to improve performance in your game, we hope you’ll consider GameTune reinforcement learning. You can apply to be a part of the beta here.
It’s Unity’s priority to invest heavily in machine learning and, as such, we’re hiring. If you’re interested in joining our team, please check out our Careers page here.