On Hunting the Uncommon Elephant

April 22, 2014 in Rants & Raves

Hunting bugs is fun.  And every now and then you get away alive with a story to bore your grandkids with (“In my days, we still hunted bugs with sticks and stones” and all).

First bug ever found. Taped into log for evidence.

GDC 2014 had another such trophy-worthy hunting safari in store for us.  We were five days away from presenting Unity 5 to the world when we “spotted” (well, it was kinda hard to miss) an ugly little elephant of a bug: our shiny new 64-bit editor was randomly crashing on OSX to the point of being completely unusable.  There’s just nothing like being up on stage to showcase how awesome your bug reporter is every couple minutes.

So, Levi, Jonathan and I dropped all the awesome stuff we’re working on (more stories we want to bore our grandkids with) and went stalking.  All we knew at that point was that it crashed somewhere in the native code that Mono generates at run-time.

As every programmer knows, when you’re faced with a bug that isn’t obvious, you simply start by gathering evidence.  Once you’ve learned enough about the bug’s behavioral patterns, you’ll eventually get a shot at it.  And with the clock ticking, we were ready to shoot at pretty much anything.

But we were stumped.  For an elephant, the bug turned out to be surprisingly agile and sneaky.

It seemed to happen only on OSX 10.9 although Kim saw something that looked markedly similar on Windows with his heavy duty memory debugger branch.  And if you enabled Guard Malloc on earlier versions of OSX, you got what looked fairly similar as well.  However, as it was crashing in random script code at arbitrary depths in the call hierarchy, it was difficult to say with certainty what was the same crash and what wasn’t.  And the crash could be consistent for ten consecutive runs only to be totally different for the next five.

So while Kim and I waded knee-high through memory and thigh-high through assembly code, Levi ran an extensive trace on all of Mono’s secret and not so secret activities to generate a gigabyte log and an editor that ran at the speed of my grandma.  This yielded the first interesting insight: apparently we were always compiling the method we crashed in right before things got ugly.

But what made it crash?  The immediate cause was that we were trying to execute code from an invalid address.  How did we get there?  A bug in Mono’s signal handling where we don’t resume properly?  A bug in Mono’s JIT compiler that won’t jump back properly to the compiled code?  A different thread corrupting stack memory on the main thread?  Fairies and grumkins? (for a bit, the latter seemed the most likely).

After two days of hunting, the elephant was still well alive and out and about.

So, Saturday night I equipped myself with a notebook, four different colored pens and an ample supply of beer from our trademark Unity fridge (carefully making sure I don’t touch the awful canned Christmas beer we still have stuck in its crevices ).  Then I spun up Unity instances until I had four different crashes frozen in the debugger, labeled them “Red Crash”, “Blue Crash”, “Green Crash”, and “Black Crash” and went to work with my respectively colored pens to take notes and draw some not-so-pretty diagrams of everything I found.

Here’s my notes for Blue Crash:

BlueCrashNotes

And that’s when I made my first discovery: in every case, the stack was 16 bytes larger than it should be!

That then led to the next discovery: for all crashes, looking at those extra 16 bytes turned up a return address back into the function we crashed in.  From a trace it was clear that in all cases we already had executed some calls from the same method, and at first I thought the address was from the last call we had traced.  However, closer inspection revealed that it was actually the return address for a call whose method had not been compiled yet!

This puzzled me for a moment as in some cases there were several calls in-between the last traced method and this call that hadn’t been compiled yet either.  Looking closer, however, revealed that we always had jumped around them.

So, then I looked at that function we apparently were supposed to return from…

DebuggerCrashshot

And there we have it (highlighted in blue):  We were jumping in the wrong direction!

What Mono does here is create little “trampoline” functions that contain only a call to the JIT compiler and some data encoded into the instruction stream after the call (used by the JIT compiler to know which method to compile).  Once the JIT compiler has done its work, it will delete those trampolines and erase every trace of having hooked into the method call.

However, the call instruction you see there is what is called a “near call” which incidentally uses a signed 32-bit offset to jump relative to the next instruction.

And since a signed 32-bit number can reach only 2GB up and down and we’re running 64-bit here, we suddenly knew why heap memory layout played such a crucial role in reproducing the bug: once Mono’s trampolines were further than 2GB away from the JIT compiler, offsets wouldn’t fit anymore into 32-bit and would get truncated when emitting the call instruction.

At that point, Jonathan quickly pinpointed the right fix and by the time his Sunday was over, we had a stable working build ready in time for GDC.

You all know the history from there.  We successfully demoed Unity 5 at GDC 2014 to rave reviews and after launch, it quickly became the most beloved piece of software ever.  Oh wait, that bit is yet to come…

Before that launch, there’s a whole lot more black and blue crashes to fix :).

Comments (20)

Subscribe to comments
  1. Andy Martin

    May 6, 2014 at 2:15 am / 

    I’m really hoping to get this bug fix in the 4.x.x cycle of Unity.

  2. koblavi

    April 23, 2014 at 9:23 pm / 

    “One nice thing though about being that deep down is that there’s pretty much nothing underneath you anymore. There’s no libraries, no system calls, no layers of abstraction. Just the machine and raw streams of code. That does have its own appeal.

    But yeah, comfortable is something else :)”

    That’s Deep!

    Unity does the hard work, so we don’t have to :-D .
    Thanks Guys and continue the good work!

  3. René Damm

    April 23, 2014 at 5:44 pm / 

    @Thought
    It’s been fixed upstream.

    @Troy
    Address is Vendersgade 28 in Copenhagen. You will have at least my eternal gratitude if you take if off our hands.

    @Peter
    Heh, well it certainly did cross our minds. Gun laws here in Denmark limit us to all but the smallest forms of artillery, though.

    @Alan
    One nice thing though about being that deep down is that there’s pretty much nothing underneath you anymore. There’s no libraries, no system calls, no layers of abstraction. Just the machine and raw streams of code. That does have its own appeal.

    But yeah, comfortable is something else :)

  4. Alan

    April 23, 2014 at 11:57 am / 

    “Once you’ve learned enough about the bug’s behavioral patterns, you’ll eventually get a shot at it. And with the clock ticking, we were ready to shoot at pretty much anything.”

    That really made me laugh out loud :D Nice job on that bug, it really was a nasty one. I personally wouldn’t want to go “down” that deep, my “comfort zone” ends right where the Mono compiler picks up its work ;)

  5. Peter Dwyer

    April 23, 2014 at 10:56 am / 

    Very nice hunting there! No calls of “Er guys! We’re gonna need a bigger elephant gun!” either.

  6. Troy

    April 23, 2014 at 7:13 am / 

    I’ll take the Christmas beer!

  7. thought

    April 23, 2014 at 4:50 am / 

    Is this bug present in upstream mono?

  8. René Damm

    April 23, 2014 at 3:07 am / 

    @Richard
    That part is a bit of a stretch. I don’t even have kids :)

  9. Andy

    April 23, 2014 at 2:32 am / 

    Thanks Rene.

  10. Richard Fine

    April 23, 2014 at 2:15 am / 

    Cool story, granddad ;)

  11. René Damm

    April 22, 2014 at 11:31 pm / 

    @Andy
    It’s the second case you describe, i.e. the Mono JIT code was crashing.

    From the bytecode in the DLLs, Mono was emitting the same code here at runtime regardless of whether it was running in a 32-bit or 64-bit application. For 32-bit it was okay (well, with a modern 64-bit operating system running a 32-bit process with a full 4GB address space, it could still trigger the bug). But with a 64-bit application, you’d see random crashes.

    So yup, you’d see it in all situations with the 64-bit editor. As long as the memory it used was still reasonably compact, the 32-bit offsets would be fine. But once you started using a good chunk of memory, you’d crash.

    ATM the plan is to ship three installers. 32-bit and 64-bit for Windows and only 64-bit for OSX (as there’s no 32-bit hardware around anymore on the OSX side).

  12. Muhammad

    April 22, 2014 at 9:36 pm / 

    i follow site every day every 5 min :)
    any news from Unity make me happy for my black life :D

  13. Muhammad

    April 22, 2014 at 9:32 pm / 

    i know unity team will make it !!
    can`t wait to see unity 5 …. am currently deleted UE4 :D

  14. Andy

    April 22, 2014 at 9:01 pm / 

    So I’m a bit confused, or rather ignorant, on the Mono compiler and target app on osX – is Mono compiler 64 bit and the app you wanted to compile was 64 bit or did you want to compile an app to 32 bit? Or was the Mono JIT code that was crashing happening in the editor for editor function and so on a 64 bit platform it will be 64 bit JIT function being compiled? Or is it dependent on which target publishing platform you currently have selected and whether it is 32 bit or 64 bit?

    So you said, this function of the Mono compiler that was returning the address was always returning 32 bit address, when it should have returned 64 bit address but in what situation? Or in all situations when using the 64 bit editor?

    Will Unity now have 4 downloads for the editor? 32 bit & 64 bit for Windows & osX?

  15. René Damm

    April 22, 2014 at 5:34 pm / 

    @Haim
    I knew once but there’s a bug in my system that deleted that block of memory prematurely.

    @Amelia & Luis
    Indeed! Tracking down defects is one of the most rewarding tasks in programming.

    @Petrucio
    Heh, yeah, C++ is a super powerful gun but it ends up pointing at your foot more oft than not :)

  16. Petrucio

    April 22, 2014 at 5:30 pm / 

    Ahh, the joys of debugging assembly code. No, wait, scratch that.
    Ahh, the joys of debugging C++ code. No, wait, scratch that.
    Ahh, the joys of developing in Unity with C#. Yes, much better. We’ve had a good run, you and I, but I don’t miss you at all, C++. :)

  17. Luis

    April 22, 2014 at 5:23 pm / 

    This is the reason that I love programming.

  18. Nikos Patsiouras

    April 22, 2014 at 5:23 pm / 

    I believe that’s 0×00000000042 :P

  19. Amelia

    April 22, 2014 at 5:22 pm / 

    Sweet! Bug crushing = sooooo satisfying… especially when they are the tricky ones!

  20. haim

    April 22, 2014 at 4:05 pm / 

    any chance you know the meaning of the universe life and everything? :)

Comments are closed.