On Hunting the Uncommon Elephant
GDC 2014 had another such trophy-worthy hunting safari in store for us. We were five days away from presenting Unity 5 to the world when we “spotted” (well, it was kinda hard to miss) an ugly little elephant of a bug: our shiny new 64-bit editor was randomly crashing on OSX to the point of being completely unusable. There’s just nothing like being up on stage to showcase how awesome your bug reporter is every couple minutes.
So, Levi, Jonathan and I dropped all the awesome stuff we were working on (more stories we want to bore our grandkids with) and went stalking. All we knew at that point was that it crashed somewhere in the native code that Mono generates at run-time.
As every programmer knows, when you’re faced with a bug that isn’t obvious, you simply start by gathering evidence. Once you’ve learned enough about the bug’s behavioral patterns, you’ll eventually get a shot at it. And with the clock ticking, we were ready to shoot at pretty much anything.
But we were stumped. For an elephant, the bug turned out to be surprisingly agile and sneaky.
It seemed to happen only on OSX 10.9 although Kim saw something that looked markedly similar on Windows with his heavy duty memory debugger branch. And if you enabled Guard Malloc on earlier versions of OSX, you got what looked fairly similar as well. However, as it was crashing in random script code at arbitrary depths in the call hierarchy, it was difficult to say with certainty what was the same crash and what wasn’t. And the crash could be consistent for ten consecutive runs only to be totally different for the next five.
So while Kim and I waded knee-high through memory and thigh-high through assembly code, Levi ran an extensive trace on all of Mono’s secret and not so secret activities to generate a gigabyte log and an editor that ran at the speed of my grandma. This yielded the first interesting insight: apparently we were always compiling the method we crashed in right before things got ugly.
But what made it crash? The immediate cause was that we were trying to execute code from an invalid address. How did we get there? A bug in Mono’s signal handling where we don’t resume properly? A bug in Mono’s JIT compiler that won’t jump back properly to the compiled code? A different thread corrupting stack memory on the main thread? Fairies and grumkins? (For a bit, that last one seemed the most likely.)
After two days of hunting, the elephant was still well alive and out and about.
So, Saturday night I equipped myself with a notebook, four different colored pens and an ample supply of beer from our trademark Unity fridge (carefully making sure I didn’t touch the awful canned Christmas beer we still have stuck in its crevices). Then I spun up Unity instances until I had four different crashes frozen in the debugger, labeled them “Red Crash”, “Blue Crash”, “Green Crash”, and “Black Crash” and went to work with my respectively colored pens to take notes and draw some not-so-pretty diagrams of everything I found.
Here are my notes for Blue Crash:
And that’s when I made my first discovery: in every case, the stack was 16 bytes larger than it should be!
That then led to the next discovery: for all crashes, looking at those extra 16 bytes turned up a return address back into the function we crashed in. From a trace it was clear that in all cases we already had executed some calls from the same method, and at first I thought the address was from the last call we had traced. However, closer inspection revealed that it was actually the return address for a call whose method had not been compiled yet!
This puzzled me for a moment as in some cases there were several calls in-between the last traced method and this call that hadn’t been compiled yet either. Looking closer, however, revealed that we always had jumped around them.
So, then I looked at that function we apparently were supposed to return from…
And there we have it (highlighted in blue): We were jumping in the wrong direction!
What Mono does here is create little “trampoline” functions that contain only a call to the JIT compiler and some data encoded into the instruction stream after the call (used by the JIT compiler to know which method to compile). Once the JIT compiler has done its work, it will delete those trampolines and erase every trace of having hooked into the method call.
However, the call instruction you see there is what is called a “near call” which incidentally uses a signed 32-bit offset to jump relative to the next instruction.
And since a signed 32-bit number can reach only 2GB up and down and we’re running 64-bit here, we suddenly knew why heap memory layout played such a crucial role in reproducing the bug: once Mono’s trampolines were more than 2GB away from the JIT compiler, the offsets would no longer fit into 32 bits and would get truncated when the call instruction was emitted.
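To make the truncation concrete, here is a small Python sketch of the arithmetic behind an x86-64 near call (opcode `0xE8` followed by a signed 32-bit displacement relative to the next instruction). This is not Mono's emitter: the addresses and all function names are made up for illustration, and the "buggy" encoder simply assumes the emitter kept the low 32 bits of the displacement without a range check.

```python
import struct

CALL_REL32_LEN = 5  # 1-byte opcode (0xE8) + 4-byte signed displacement


def rel32_displacement(call_addr, target_addr):
    """Displacement relative to the instruction following the call."""
    return target_addr - (call_addr + CALL_REL32_LEN)


def fits_in_rel32(disp):
    """True if the displacement is representable as a signed 32-bit value."""
    return -2**31 <= disp < 2**31


def emit_call_truncating(call_addr, target_addr):
    """Encode the call while silently keeping only the low 32 bits (assumed bug)."""
    disp = rel32_displacement(call_addr, target_addr)
    return b"\xE8" + struct.pack("<I", disp & 0xFFFFFFFF)


def decode_call(call_addr, code):
    """Return (signed displacement, address the CPU would actually jump to)."""
    assert code[0] == 0xE8
    (disp,) = struct.unpack("<i", code[1:5])
    return disp, (call_addr + CALL_REL32_LEN + disp) & 0xFFFFFFFFFFFFFFFF


# Hypothetical addresses, chosen only to show the effect.
jit_entry = 0x0000_0001_0000_0000
near_trampoline = jit_entry + 0x1000_0000  # 256 MB above the JIT entry point
far_trampoline = jit_entry + 0xC000_0000   # 3 GB above the JIT entry point

for name, addr in [("near", near_trampoline), ("far", far_trampoline)]:
    wanted = rel32_displacement(addr, jit_entry)
    encoded, landed = decode_call(addr, emit_call_truncating(addr, jit_entry))
    print(f"{name}: fits={fits_in_rel32(wanted)} wanted={wanted:+#x} "
          f"encoded={encoded:+#x} landed={landed:#x} target={jit_entry:#x}")
```

For the near trampoline the encoded displacement is negative and the call lands on the target. For the far one, the wanted displacement is negative (the target is below the trampoline) but the truncated value decodes as positive, so the call goes forward instead of backward and lands exactly 4GB away from where it should. A range check like `fits_in_rel32` is one common way to catch this at emit time; the post does not say what the actual fix looked like.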
At that point, Jonathan quickly pinpointed the right fix and by the time his Sunday was over, we had a stable working build ready in time for GDC.
You all know the history from there. We successfully demoed Unity 5 at GDC 2014 to rave reviews and after launch, it quickly became the most beloved piece of software ever. Oh wait, that bit is yet to come…
Before that launch, there’s a whole lot more black and blue crashes to fix :).