The State of Game Performance Reporting
Many of you have been using our Game Performance Reporting service to get crash reports, and have asked questions about our plans going forward. My team and I have read all your posts, so we know you all want cool new features! Please, keep the requests and ideas coming! We’ve spent the past few months reworking the infrastructure, taking it from something that supported a single project to a super-cool, multi-tiered web application that can handle the millions of Unity games out there. Read on to find out more about what we’re up to.
Unity Game Performance started as a Unity Hack Week project a year ago, with the simple goal of trying new things. We had people from different backgrounds contributing, and each one took a piece of the crash service, which consists of:
- The UI
- The Rails API that the UI uses
- The crash report intake pipeline
The UI changes are probably the most visible ones. You may have noticed the launch of developer.cloud.unity3d.com already, which aims to unify access to our growing number of services.
Of the three pieces of the crash service, the one that has changed most in the last 12 months has been the intake pipeline. These intake changes (fortunately) are less visible, but they are crucial, because we want to support every Unity game made.
How It Used To Work
Originally, the intake pipeline looked something like this:
Editor Plugin -> Node -> SQS + DynamoDB -> Rails -> MySQL
The editor plugin listened for exceptions, batched them, and sent them to Node. Node received the events, put each one in DynamoDB, then sent an SQS message to Rails stating where to find the event in Dynamo. Rails then fetched the event back out of Dynamo, processed it, and stored the data in MySQL. Though this workflow was really easy to set up, it was not very elegant, to say the least.
At that time, SQS had a fairly small message-size limit: not enough to hold exceptions of all sizes. This is why the SQS message merely states where the event is stored in Dynamo. SQS has since increased the message-size limit to 256KB (which would have relieved our problem with storing exceptions). At first, we stored every event we received in Dynamo, just in case we made a huge mistake, because we could always re-import the data by replaying the events.
What Happened When We Went Live
We launched our little hack project during GDC ’15, and we got way more activity than we expected. We were expecting thousands of exceptions a day, but we got millions. We had to rate-limit certain projects that were sending thousands of exceptions per second.
Outside of operational issues, we noticed that our setup had one big bottleneck: the time spent putting things into SQS and Dynamo, only to pull them back out in Rails, process them, and write them to the database. The Rails side of that alone took around 75ms per exception!
One positive thing about the original setup was the way that accepting an event and processing an event were decoupled. This design made it easy to start and stop processing while we updated the code, without dropping ANY events.
What We Did Next
In the abstract, processing a crash report consists of the following steps:
- Fingerprint it,
- Find or create it by the fingerprint,
- Increment the counter,
- Associate it with the operating systems and platforms we saw it on.
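Those steps can be sketched in Go. The struct fields, the fingerprint inputs, and the in-memory map standing in for MySQL are all assumptions for illustration; the post doesn’t specify what goes into the fingerprint:

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

// CrashReport is a hypothetical, simplified report payload.
type CrashReport struct {
	Exception string
	StackTop  string
	OS        string
	Platform  string
}

// Fingerprint hashes the fields that identify "the same" crash.
// Which fields feed the real fingerprint is not described in the
// post; this combination is an assumption.
func Fingerprint(r CrashReport) string {
	h := sha1.Sum([]byte(r.Exception + "|" + r.StackTop))
	return hex.EncodeToString(h[:])
}

// CrashGroup is the aggregate row keyed by fingerprint.
type CrashGroup struct {
	Count     int
	OSes      map[string]bool
	Platforms map[string]bool
}

// Process walks the steps above: fingerprint, find-or-create,
// increment, associate. A map stands in for MySQL.
func Process(db map[string]*CrashGroup, r CrashReport) {
	fp := Fingerprint(r)
	g, ok := db[fp] // find...
	if !ok {        // ...or create it by the fingerprint
		g = &CrashGroup{OSes: map[string]bool{}, Platforms: map[string]bool{}}
		db[fp] = g
	}
	g.Count++                     // increment the counter
	g.OSes[r.OS] = true           // associate the OS we saw it on
	g.Platforms[r.Platform] = true // and the platform
}

func main() {
	db := map[string]*CrashGroup{}
	r := CrashReport{"NullReferenceException", "Player.Update", "Windows 7", "standalone"}
	Process(db, r)
	Process(db, r)
	fmt.Println(db[Fingerprint(r)].Count) // same fingerprint, counted twice: 2
}
```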
Of course, I set out to replace just the fast thing that I didn’t like (Node) with something else that I hadn’t learned yet (Golang). I tried this, but realized it wouldn’t work any better, because the AWS libraries for Golang were very young. So I decided to try replacing the whole intake pipeline, just to simplify it.
My goal was to write something like this:
Editor Plugin -> Go -> MySQL
I wanted something really simple and fast. I didn’t want disk space alerts from verbose logging, or memory alerts from abused Ruby processes. Here’s how my process went:
My initial implementation was a literal translation from Rails: it ran all the same MySQL SELECT statements, then created the rows or updated the counters.
My first optimization was to remove all the statements that were duplicated between reports. These duplicates were SELECT statements, such as: ‘SELECT id FROM operating_systems WHERE name = “Windows 7”’. These statements were completely safe to cache in the app, and I made great use of Hashicorp’s Go LRU library to do it. Then I performed the same optimization to cache crash fingerprints, so that I didn’t have to ask the database each time I saw the same exception.
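The read-through caching pattern looks roughly like this in Go. The post uses Hashicorp’s LRU to keep the cache bounded; this sketch substitutes a plain map guarded by a mutex so it stays self-contained, and the fake query function stands in for the MySQL round trip. All names here are invented:

```go
package main

import (
	"fmt"
	"sync"
)

// osIDCache caches the results of queries like
//   SELECT id FROM operating_systems WHERE name = ?
// which are safe to cache because the name-to-id mapping never
// changes. A bounded LRU (as in the post) would cap memory use;
// a plain map keeps this sketch dependency-free.
type osIDCache struct {
	mu     sync.Mutex
	ids    map[string]int64
	query  func(name string) int64 // stands in for the database call
	misses int
}

func (c *osIDCache) Lookup(name string) int64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	if id, ok := c.ids[name]; ok {
		return id // cache hit: no database round trip
	}
	c.misses++
	id := c.query(name)
	c.ids[name] = id
	return id
}

func main() {
	c := &osIDCache{
		ids:   map[string]int64{},
		query: func(string) int64 { return 7 }, // fake DB: always row id 7
	}
	c.Lookup("Windows 7")
	c.Lookup("Windows 7")
	fmt.Println(c.misses) // only the first lookup reached the database: 1
}
```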
I had to implement a fair amount of locking around each of these LRU hashes, which didn’t feel very Go-like, but it worked. One thing I did was use finer-grained locks so that I could update different keys concurrently.
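One common way to get finer-grained locking in Go is lock striping: hash each key to one of N shards, each with its own mutex, so updates to different keys rarely contend. This is a hypothetical sketch of the idea, not the actual service code:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

const shards = 16

// shardedCounter splits one big lock into per-shard locks so that
// different keys can usually be updated concurrently.
type shardedCounter struct {
	mu     [shards]sync.Mutex
	counts [shards]map[string]int
}

func newShardedCounter() *shardedCounter {
	s := &shardedCounter{}
	for i := range s.counts {
		s.counts[i] = map[string]int{}
	}
	return s
}

// shard hashes the key to pick which lock/map pair owns it.
func (s *shardedCounter) shard(key string) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % shards)
}

func (s *shardedCounter) Inc(key string) {
	i := s.shard(key)
	s.mu[i].Lock() // only this shard blocks; other keys proceed
	s.counts[i][key]++
	s.mu[i].Unlock()
}

func (s *shardedCounter) Get(key string) int {
	i := s.shard(key)
	s.mu[i].Lock()
	defer s.mu[i].Unlock()
	return s.counts[i][key]
}

func main() {
	c := newShardedCounter()
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() { defer wg.Done(); c.Inc("fp-1") }()
	}
	wg.Wait()
	fmt.Println(c.Get("fp-1")) // all 100 concurrent increments land: 100
}
```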
The next bottleneck I hit was on writes: each incoming event caused me to increment a counter. My database was dutifully counting from 1 to 100,000,000. One at a time.
I knew I wanted to batch my writes, but I wanted to do it in a robust way. I leveraged Hashicorp’s LRU hash again, which provides an on-evict hook. That way, when a crash report was evicted from memory, it was written to the database. But then I thought, “What if I don’t get enough unique crash reports to cause an eviction?” So, I hacked it and added another method that lets you make an entry with a Time To Live (“TTL”).
It’s important to note that the TTL lives on each entry. That way, each TTL eviction is staggered, so that it doesn’t create a thundering herd of database writes.
Given all the above considerations, an AWS t2.medium instance can process (in bursts) about 10,000 req/s, which is pretty decent.
We also plan to have edge servers in different regions. Your games will send reports to the servers in the closest geographic region. Those servers will do the same batching, then they will forward the events to the area where the database lives. They’ll be using the same eviction hook to make an HTTPS request instead of a database call.