Many of you have been using our Game Performance Reporting service to get crash reports, and have asked questions about our plans going forward. My team and I have read all your posts, so we know you all want cool new features! Please, keep the requests and ideas coming! We’ve spent the past few months reworking the infrastructure from one that supports a project to one that supports a super-cool, multi-tiered web application that can handle all the millions of Unity games out there. Read on to find out more about what we’re up to.

Background

Unity Game Performance started as a Unity Hack Week project a year ago, with the simple goal of trying new things. We had people from different backgrounds contributing and each one took a piece of the crash service, which consists of:

  • The JavaScript UI
  • The Rails API that the UI uses
  • The crash report intake pipeline

The UI changes are probably the most visible ones. You may have noticed the launch of developer.cloud.unity3d.com already, which aims to unify access to our growing number of services.

Of the three pieces of the crash service, the one that has changed most in the last 12 months has been the intake pipeline. These intake changes (fortunately) are less visible, but they are crucial, because we want to support every Unity game made.

How It Used To Work

Originally, the intake pipeline looked something like this:

Editor Plugin -> Node -> SQS + DynamoDB -> Rails -> MySQL

The editor plugin listened for exceptions, batched them, then sent them to Node. Node received the events, put them in DynamoDB, then sent an SQS message to Rails stating where to find the event in Dynamo. Rails then pulled it back out of Dynamo, processed it, and stored the data in MySQL. Though this workflow was really easy to set up, it wasn’t very elegant, to say the least.

At that time, SQS had a fairly small message-size limit; not enough to store exceptions of all sizes. This is why the SQS message merely states where the event is stored in Dynamo. SQS has since raised the per-message limit to 256 KB, and its Extended Client Library can offload payloads of up to 2 GB to S3, either of which would have relieved our problem with storing exceptions. At first, we stored every event we received in Dynamo, just in case we made a huge mistake, because we could always re-import the data by replaying the events.
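
For context, the SQS message in that old pipeline was only a pointer. A hypothetical shape for it, expressed as a Go struct, might look like this (the field names are illustrative, not the actual schema):

```go
// Hypothetical pointer message the old pipeline put on SQS.
// Rails would read this and then fetch the full exception payload from
// DynamoDB. Field names are illustrative, not the real schema.
type EventPointer struct {
	Table      string `json:"table"`       // DynamoDB table holding the raw event
	EventID    string `json:"event_id"`    // key of the stored event
	ReceivedAt int64  `json:"received_at"` // unix timestamp, handy for replays
}
```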

What Happened When We Went Live

We launched our little hack project during GDC ‘15, and we got way more activity than we expected. We were expecting thousands of exceptions a day—but we got millions.  We had to rate-limit certain projects that were sending thousands of exceptions per second.

Outside of operational issues, we noticed that our setup had one big bottleneck: the time spent putting things into SQS and Dynamo, only to pull them back out in Rails, process them, and write them to the database. The Rails side of that alone took around 75ms per exception!

One positive thing about the original setup was the way that accepting an event and processing an event were decoupled. This design made it easy to start and stop processing while we updated the code, without dropping ANY events.

What We Did Next

In the abstract, processing a crash report consists of the following steps:

  1. Fingerprint it,
  2. Find or create it by the fingerprint,
  3. Increment the counter,
  4. Associate it with the operating systems and platforms we saw it on.
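
Concretely, a minimal Go sketch of those four steps against MySQL might look like the following; the table and column names, and the hash-of-stack-trace fingerprint, are assumptions for illustration rather than the service’s actual schema:

```go
package intake

import (
	"crypto/sha1"
	"database/sql"
	"encoding/hex"

	_ "github.com/go-sql-driver/mysql" // MySQL driver for database/sql
)

// Report is a trimmed-down crash report; the real payload carries more fields.
type Report struct {
	StackTrace      string
	OperatingSystem string
	Platform        string
}

// processReport walks the four steps above for a single report.
func processReport(db *sql.DB, r Report) error {
	// 1. Fingerprint it (here: a hash of the stack trace).
	sum := sha1.Sum([]byte(r.StackTrace))
	fingerprint := hex.EncodeToString(sum[:])

	// 2 & 3. Find or create the crash by fingerprint and bump its counter
	// (assumes a UNIQUE index on crashes.fingerprint).
	if _, err := db.Exec(
		`INSERT INTO crashes (fingerprint, count) VALUES (?, 1)
		 ON DUPLICATE KEY UPDATE count = count + 1`,
		fingerprint,
	); err != nil {
		return err
	}

	// 4. Associate it with the operating system and platform we saw it on.
	_, err := db.Exec(
		`INSERT IGNORE INTO crash_platforms (fingerprint, os, platform)
		 VALUES (?, ?, ?)`,
		fingerprint, r.OperatingSystem, r.Platform,
	)
	return err
}
```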

Of course, I set out to replace just the first thing that I didn’t like (Node) with something else that I hadn’t learned yet (Golang). I tried this, but realized it wouldn’t work any better, because the AWS libraries for Golang were still very young. So I decided to try replacing the whole intake pipeline, just to simplify it.

My goal was to write something like this:

Editor Plugin -> Go -> MySQL

I wanted something really simple and fast. I didn’t want disk space alerts from verbose logging, or memory alerts from abused Ruby processes. Here’s how my process went:

My initial implementation was a literal translation from Rails. It did all the same MySQL select statements, then created the rows or updated the counters.

My first optimization was to remove all the statements that were duplicated between reports. These duplicates were SELECT statements, such as: 'SELECT id FROM operating_systems WHERE name = "Windows 7"'. These statements were completely safe to cache in the app, and I made great use of HashiCorp’s Go LRU cache to do it. Then I performed the same optimization to cache crash fingerprints, so that I didn’t have to ask the database each time I saw the same exception.

I had to implement a fair amount of locking around each of these LRU hashes, which didn’t feel very Go-like, but it worked. One thing I did was use finer-grained locks so that I could update different keys concurrently.
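
One way to arrange that caching and locking, sketched with HashiCorp’s golang-lru package (the query and tables are the illustrative ones from above): the cache itself is safe for concurrent use, so the extra mutex here guards the miss-then-load path, keeping two goroutines that see the same new OS name from both hitting MySQL. A finer-grained variant would shard this mutex by key so different keys can be loaded concurrently.

```go
package intake

import (
	"database/sql"
	"sync"

	lru "github.com/hashicorp/golang-lru"
)

// osIDCache memoizes `SELECT id FROM operating_systems WHERE name = ?`
// so repeated reports for the same OS never touch MySQL.
type osIDCache struct {
	mu    sync.Mutex // guards the miss-then-load sequence below
	cache *lru.Cache
	db    *sql.DB
}

func newOSIDCache(db *sql.DB, size int) (*osIDCache, error) {
	c, err := lru.New(size)
	if err != nil {
		return nil, err
	}
	return &osIDCache{cache: c, db: db}, nil
}

// ID returns the row id for an operating-system name, consulting the LRU
// first and falling back to a find-or-create against MySQL (the "create"
// half is an assumption added to make the sketch complete).
func (o *osIDCache) ID(name string) (int64, error) {
	if v, ok := o.cache.Get(name); ok {
		return v.(int64), nil
	}

	o.mu.Lock()
	defer o.mu.Unlock()

	// Re-check after taking the lock in case another goroutine loaded it.
	if v, ok := o.cache.Get(name); ok {
		return v.(int64), nil
	}

	var id int64
	err := o.db.QueryRow(
		`SELECT id FROM operating_systems WHERE name = ?`, name,
	).Scan(&id)
	if err == sql.ErrNoRows {
		res, insErr := o.db.Exec(
			`INSERT INTO operating_systems (name) VALUES (?)`, name)
		if insErr != nil {
			return 0, insErr
		}
		id, err = res.LastInsertId()
	}
	if err != nil {
		return 0, err
	}

	o.cache.Add(name, id)
	return id, nil
}
```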

The next bottleneck I hit was writes: each incoming event caused me to increment a counter. My database was dutifully counting from 1 to 100,000,000. One at a time.

I knew I wanted to batch my writes, but I wanted to do it in a robust way. I leveraged HashiCorp’s LRU cache again, which provides an on-evict hook. That way, when a crash report was evicted from memory, it was written to the database. But then I thought, “What if I don’t get enough unique crash reports to cause an eviction?” So, I hacked it and added another method that lets you make an entry with a Time To Live (“TTL”).

It’s important to note that the TTL lives on each entry. That way, each TTL eviction is staggered, so that it doesn’t create a thundering herd of database writes.
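
Roughly, the batching could be sketched like this, using golang-lru’s NewWithEvict constructor for the eviction hook. The per-entry TTL described above came from a modification to the library itself, so it is only approximated here with a per-entry timer:

```go
package intake

import (
	"database/sql"
	"log"
	"sync"
	"time"

	lru "github.com/hashicorp/golang-lru"
)

// pending accumulates counter increments for one crash fingerprint.
type pending struct {
	count int64
}

// batcher buffers counter updates in an LRU and only writes to MySQL
// when an entry is evicted (by capacity or by its per-entry TTL).
type batcher struct {
	mu    sync.Mutex
	cache *lru.Cache
	db    *sql.DB
	ttl   time.Duration
}

func newBatcher(db *sql.DB, size int, ttl time.Duration) (*batcher, error) {
	b := &batcher{db: db, ttl: ttl}
	c, err := lru.NewWithEvict(size, func(key, value interface{}) {
		// The eviction hook is the only place batched counts reach MySQL.
		b.flush(key.(string), value.(*pending).count)
	})
	if err != nil {
		return nil, err
	}
	b.cache = c
	return b, nil
}

// Record adds one occurrence of a fingerprint to the in-memory batch.
func (b *batcher) Record(fingerprint string) {
	b.mu.Lock()
	defer b.mu.Unlock()

	if v, ok := b.cache.Get(fingerprint); ok {
		v.(*pending).count++
		return
	}
	b.cache.Add(fingerprint, &pending{count: 1})

	// Approximation of the per-entry TTL: entries arrive at different
	// times, so these timers fire staggered rather than all at once.
	time.AfterFunc(b.ttl, func() {
		b.mu.Lock()
		defer b.mu.Unlock()
		b.cache.Remove(fingerprint) // Remove triggers the eviction hook too
	})
}

func (b *batcher) flush(fingerprint string, n int64) {
	// One UPDATE per batch instead of one per crash report.
	if _, err := b.db.Exec(
		`UPDATE crashes SET count = count + ? WHERE fingerprint = ?`,
		n, fingerprint,
	); err != nil {
		log.Printf("flush %s: %v", fingerprint, err)
	}
}
```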

Given all the above considerations, an AWS t2.medium instance can (burst) process about 10,000 req/s, which is pretty decent.

We also plan to have edge servers in different regions. Your games will send reports to the servers in the closest geographic region. Those servers will do the same batching, then they will forward the events to the area where the database lives. They’ll be using the same eviction hook to make an HTTPS request instead of a database call.
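
In that setup, an edge server could reuse the same batching and swap out only the flush step. A hypothetical flush that forwards a batch over HTTPS instead of writing to MySQL (the endpoint and JSON payload are made up for illustration):

```go
package intake

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// flushHTTP is what the edge-server variant of the eviction hook might do:
// forward the batched count over HTTPS to the region that owns the database,
// instead of writing to MySQL directly. The URL and payload are illustrative.
func flushHTTP(client *http.Client, fingerprint string, n int64) error {
	body, err := json.Marshal(map[string]interface{}{
		"fingerprint": fingerprint,
		"count":       n,
	})
	if err != nil {
		return err
	}
	resp, err := client.Post(
		"https://intake.example.com/v1/crash_batches", // hypothetical endpoint
		"application/json",
		bytes.NewReader(body),
	)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("intake returned %s", resp.Status)
	}
	return nil
}
```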

TL;DR: I know there hasn’t been much news around Game Performance Reporting, but we haven’t forgotten about it. I hope this story helped you understand what we’ve been doing behind the scenes. Keep talking to us on our forum!

Comments

  1. Why can’t we access the Unity3d website directly in our country (Iran)?
    Iran and the U.S. reached a deal, and that deal is now in effect!

  2. Dear sir/madam,
    why can I no longer game? Unity Web Player asks for an update every time; when I update, nothing happens and I still can’t game.
    Kind regards, Mr. Mario Viaene

    1. If you’re using Chrome or Firefox, that’s because the plugin uses NPAPI, which is no longer supported by those browsers (I don’t know whether IExplore or Edge supports NPAPI).

      Maybe it’s still possible to enable NPAPI with a config tweak?

    2. Holy shzniit, this is so cool thank you.

  3. Wow… there was so little news about it that I’d actually never heard of it, but it sounds really great.
    Got to check it out, as that’s exactly what I’m looking for at the moment as an alternative to Crashlytics.

  4. It’s nice to see you’re improving the service in terms of performance (no pun intended) to make it scale and operate better.

    I wish you had provided features that other services already have, like native exception tracking just to name one.

    We’d really like to replace our current provider, but Unity’s solution simply doesn’t provide enough features right now to justify it.

    1. Chris Lundquist

      December 3, 2015 at 1:01 am

      That’s exactly what we have in mind. We had to make a solid platform first though.