Hi, my name is Tautvydas and I’m a software developer at Unity working in the Windows team. I’d like to share a story of debugging an elusive memory corruption bug.
Several weeks ago we received a bug report from a customer that said their game was crashing when using IL2CPP scripting backend. QA verified the bug and assigned it to me for fixing. The project was quite big (although far from the largest ones); it took 40 minutes to build on my machine. The instructions on the bug report said: “Play the game for 5-10 minutes until it crashes”. Sure enough, after following instructions, I observed a crash. I fired up WinDbg ready to nail it down. Unfortunately, the stack trace was bogus:
# Child-SP RetAddr Call Site
00 00000022
e25feb10 00000000
00000010 0x00007ffa00000102
0:050> u 0x00007ffa
00000102 L1000007ffa
00000102 ?? ???
^ Memory access error in 'u 0x00007ffa
00000102 l10′Clearly, it tried executing an invalid memory address. Although the stacktrace had been corrupted, I was hoping that only a part of the whole stack got corrupted and that I should be able to reconstruct it if I look at memory contents past the stack pointer register. Surely enough, that gave me an idea where to look next:
……………
00000022
e25febd8 00007ffa
b1fdc65c ucrtbased!heap_alloc_dbg+0x1c [d:\th\minkernel\crts\ucrt\src\appcrt\heap\debug_heap.cpp @ 447]00000022
e25febe0 00000000
0000000400000022
e25febe8 00000022
0000000100000022
e25febf0 00000022
0000000000000022
e25febf8 00000000
0000000000000022
e25fec00 00000022
e25fec3000000022
e25fec08 00007ffa
99b3d3ab UnityPlayer!std::_Vector_alloc<std::_Vec_base_types<il2cpp::os::PollRequest,std::allocator<il2cpp::os::PollRequest> > >::_Get_data+0x2b [ c:\program files (x86)\microsoft visual studio 14.0\vc\include\vector @ 642]00000022
e25fec10 00000022
e25ff45800000022
e25fec18 cccccccc
cccccccc00000022
e25fec20 cccccccc
cccccccc00000022
e25fec28 00007ffa
b1fdf54c ucrtbased!_calloc_dbg+0x6c [d:\th\minkernel\crts\ucrt\src\appcrt\heap\debug_heap.cpp @ 511]00000022
e25fec30 00000000
0000001000000022
e25fec38 00007ffa
00000001……………
00000022
e25fec58 00000000
0000001000000022
e25fec60 00000022
e25feca000000022
e25fec68 00007ffa
b1fdb69e ucrtbased!calloc+0x2e [d:\th\minkernel\crts\ucrt\src\appcrt\heap\calloc.cpp @ 25]00000022
e25fec70 00000000
0000000100000022
e25fec78 00000000
0000001000000022
e25fec80 cccccccc
0000000100000022
e25fec88 00000000
0000000000000022
e25fec90 00000022
0000000000000022
e25fec98 cccccccc
cccccccc00000022
e25feca0 00000022
e25ff3f000000022
e25feca8 00007ffa
99b3b646 UnityPlayer!il2cpp::os::SocketImpl::Poll+0x66 [ c:\users\tautvydas\builds\bin2\il2cppoutputproject\il2cpp\libil2cpp\os\win32\socketimpl.cpp @ 1429]00000022
e25fecb0 00000000
0000000100000022
e25fecb8 00000000
00000010……………
00000022
e25ff3f0 00000022
e25ff42000000022
e25ff3f8 00007ffa
99c1caf4 UnityPlayer!il2cpp::os::Socket::Poll+0x44 [ c:\users\tautvydas\builds\bin2\il2cppoutputproject\il2cpp\libil2cpp\os\socket.cpp @ 324]00000022
e25ff400 00000022
e25ff45800000022
e25ff408 cccccccc
ffffffff00000022
e25ff410 00000022
e25ff5b400000022
e25ff418 00000022
e25ff59400000022
e25ff420 00000022
e25ff7e000000022
e25ff428 00007ffa
99b585f8 UnityPlayer!il2cpp::vm::SocketPollingThread::RunLoop+0x268 [ c:\users\tautvydas\builds\bin2\il2cppoutputproject\il2cpp\libil2cpp\vm\threadpool.cpp @ 452]00000022
e25ff430 00000022
e25ff45800000022
e25ff438 00000000
ffffffff……………
00000022
e25ff7d8 00000022
e25ff6b800000022
e25ff7e0 00000022
e25ff87000000022
e25ff7e8 00007ffa
99b58d2c UnityPlayer!il2cpp::vm::SocketPollingThreadEntryPoint+0xec [ c:\users\tautvydas\builds\bin2\il2cppoutputproject\il2cpp\libil2cpp\vm\threadpool.cpp @ 524]00000022
e25ff7f0 00007ffa
9da83610 UnityPlayer!il2cpp::vm::g_SocketPollingThread00000022
e25ff7f8 00007ffa
99b57700 UnityPlayer!il2cpp::vm::FreeThreadHandle [ c:\users\tautvydas\builds\bin2\il2cppoutputproject\il2cpp\libil2cpp\vm\threadpool.cpp @ 488]00000022
e25ff800 00000000
0000106c00000022
e25ff808 cccccccc
cccccccc00000022
e25ff810 00007ffa
9da83610 UnityPlayer!il2cpp::vm::g_SocketPollingThread00000022
e25ff818 000001c4
1705f5c000000022
e25ff820 cccccccc
0000106c……………
00000022
e25ff860 00005eaa
e9a6af8600000022
e25ff868 cccccccc
cccccccc00000022
e25ff870 00000022
e25ff8d000000022
e25ff878 00007ffa
99c63b52 UnityPlayer!il2cpp::os::Thread::RunWrapper+0xd2 [ c:\users\tautvydas\builds\bin2\il2cppoutputproject\il2cpp\libil2cpp\os\thread.cpp @ 106]00000022
e25ff880 00007ffa
9da83610 UnityPlayer!il2cpp::vm::g_SocketPollingThread00000022
e25ff888 00000000
0000001800000022
e25ff890 cccccccc
cccccccc……………
00000022
e25ff8a8 000001c4
15508c9000000022
e25ff8b0 cccccccc
0000000200000022
e25ff8b8 00007ffa
99b58c40 UnityPlayer!il2cpp::vm::SocketPollingThreadEntryPoint [ c:\users\tautvydas\builds\bin2\il2cppoutputproject\il2cpp\libil2cpp\vm\threadpool.cpp @ 494]00000022
e25ff8c0 00007ffa
9da83610 UnityPlayer!il2cpp::vm::g_SocketPollingThread00000022
e25ff8c8 000001c4
155a589000000022
e25ff8d0 00000022
e25ff92000000022
e25ff8d8 00007ffa
99c19a14 UnityPlayer!il2cpp::os::ThreadStartWrapper+0x54 [ c:\users\tautvydas\builds\bin2\il2cppoutputproject\il2cpp\libil2cpp\os\win32\threadimpl.cpp @ 31]00000022
e25ff8e0 000001c4
155a5890……………
00000022
e25ff900 cccccccc
cccccccc00000022
e25ff908 00007ffa
99c63a80 UnityPlayer!il2cpp::os::Thread::RunWrapper [ c:\users\tautvydas\builds\bin2\il2cppoutputproject\il2cpp\libil2cpp\os\thread.cpp @ 80]00000022
e25ff910 000001c4
155a5890……………
00000022
e25ff940 000001c4
1e0801b000000022
e25ff948 00007ffa
e6858102 KERNEL32!BaseThreadInitThunk+0x2200000022
e25ff950 000001c4
1e0801b000000022
e25ff958 00000000
0000000000000022
e25ff960 00000000
0000000000000022
e25ff968 00000000
0000000000000022
e25ff970 00007ffa
99c199c0 UnityPlayer!il2cpp::os::ThreadStartWrapper [ c:\users\tautvydas\builds\bin2\il2cppoutputproject\il2cpp\libil2cpp\os\win32\threadimpl.cpp @ 26]00000022
e25ff978 00007ffa
e926c5b4 ntdll!RtlUserThreadStart+0x3400000022
e25ff980 00007ffa
e68580e0 KERNEL32!BaseThreadInitThunkHere’s a rough reconstructed stacktrace:
e25febd8 00007ffa
b1fdc65c ucrtbased!heap_alloc_dbg+0x1c […\appcrt\heap\debug_heap.cpp @ 447]00000022
e25fec28 00007ffa
b1fdf54c ucrtbased!_calloc_dbg+0x6c […\appcrt\heap\debug_heap.cpp @ 511]00000022
e25fec68 00007ffa
b1fdb69e ucrtbased!calloc+0x2e […\appcrt\heap\calloc.cpp @ 25]00000022
e25feca8 00007ffa
99b3b646 UnityPlayer!il2cpp::os::SocketImpl::Poll+0x66 […\libil2cpp\os\win32\socketimpl.cpp @ 1429]00000022
e25ff3f8 00007ffa
99c1caf4 UnityPlayer!il2cpp::os::Socket::Poll+0x44 […\libil2cpp\os\socket.cpp @ 324]00000022
e25ff428 00007ffa
99b585f8 UnityPlayer!il2cpp::vm::SocketPollingThread::RunLoop+0x268 […\libil2cpp\vm\threadpool.cpp @ 452]00000022
e25ff7e8 00007ffa
99b58d2c UnityPlayer!il2cpp::vm::SocketPollingThreadEntryPoint+0xec […\libil2cpp\vm\threadpool.cpp @ 524]00000022
e25ff878 00007ffa
99c63b52 UnityPlayer!il2cpp::os::Thread::RunWrapper+0xd2 […\libil2cpp\os\thread.cpp @ 106]00000022
e25ff8d8 00007ffa
99c19a14 UnityPlayer!il2cpp::os::ThreadStartWrapper+0x54 […\libil2cpp\os\win32\threadimpl.cpp @ 31]00000022
e25ff948 00007ffa
e6858102 KERNEL32!BaseThreadInitThunk+0x2200000022
e25ff978 00007ffa
e926c5b4 ntdll!RtlUserThreadStart+0x34
Alright, so now I knew which thread was crashing: it was the IL2CPP runtime socket polling thread. Its responsibility is tell other threads when their sockets are ready to send or receive data. It goes like this: there’s a FIFO queue that socket poll requests get put in by other threads, the socket polling thread then dequeues these requests one by one, calls select() function and when select() returns a result, it queues a callback that was in the original request to the thread pool.
So somebody is corrupting the stack badly. In order to narrow the search, I decided to put “stack sentinels” on most stack frames in that thread. Here’s how my stack sentinel was defined:
When it’s constructed, it would fill the buffer with “0xDD”. When it’s destructed, it would check if those values did not change. This worked incredibly well: the game was no longer crashing! It was asserting instead:
Somebody had been touching my sentinel’s privates – and it definitely wasn’t a friend. I ran this a couple more times, and the result was the same: every time a value of “2” was written to the buffer first. Looking at the memory view, I noticed that what I saw was familiar:
These are the exact same values that we’ve seen in the very first corrupted stack trace. I realized that whatever caused the crash earlier was also responsible for corrupting the stack sentinel. At first, I thought that this was some kind of a buffer overflow, and somebody was writing outside of their local variable bounds. So I started placing these stack sentinels much more aggressively: before almost every function call that the thread made. However, the corruptions seemed to happen at random times, and I wasn’t able to find what was causing them using this method.
I knew that memory was always getting corrupted while one of my sentinels is in scope. I somehow needed to catch the thing that corrupts it red handed. I figured to make the stack sentinel memory read only for the duration of the stack sentinel life: I would call VirtualProtect() in the constructor to mark pages read only, and call it again in the destructor to make them writable:
To my surprise, it was still being corrupted! And the message in the debug log was:
CrashingGame.exe has triggered a breakpoint.
This was a red flag to me. Somebody had been corrupting memory either while the memory was read only, or just before I set it to read only. Since I got no access violations, I assumed that it was the latter so I changed the code to check whether memory contents changed right after setting my magic values:
My theory checked out:
CrashingGame.exe has triggered a breakpoint.
At this point I was thinking: “Well, it must be another thread corrupting my stack. It MUST be. Right? RIGHT?”. The only way I knew how to proceed in investigating this was to use data (memory) breakpoints to catch the offender. Unfortunately, on x86 you can watch only four memory locations at a time, that means I can monitor 32 bytes at most, while the area that had been getting corrupted was 16 KB. I somehow needed to figure out where to set the breakpoints. I started observing corruption patterns. At first, it seemed that they are random, but that was merely an illusion due to the nature of ASLR: every time I restarted the game, it would place the stack at random memory address, so the place of corruption naturally differed. However, when I realized this, I stopped restarting the game every time memory became corrupted and just continued execution. This led me to discover that the corrupted memory address was always constant for a given debugging session. In other words, once it had been corrupted once, it would always get corrupted at the exact same memory address as long as I don’t terminate the program:
CrashingGame.exe has triggered a breakpoint.
Memory was corrupted at 0x90445febd8.
CrashingGame.exe has triggered a breakpoint.
I set a data breakpoint on that memory address and watched as it kept breaking whenever I set it to a magic value of 0xDD. I figured, this was going to take a while, but Visual Studio actually allows me to set a condition on that breakpoint: to only break if the value of that memory address is 2:
A minute later, this breakpoint finally hit. I arrived at this point in time after 3 days into debugging this thing. This was going to be my triumph. “I finally pinned you down!”, I proclaimed. Or so I so optimistically thought:
I watched at the debugger with disbelief as my mind got filled with more questions than answers: “What? How is this even possible? Am I going crazy?”. I decided to look at the disassembly:
Sure enough, it was modifying that memory location. But it was writing 0xDD to it, not 0x02! After looking at the memory window, the whole region was already corrupted:
As I was ready to bang my head against the wall, I called my coworker and asked him to look whether I had missed something obvious. We reviewed the debugging code together and we couldn’t find anything that could even remotely cause such weirdness. I then took a step back and tried imagining what could possibly be causing the debugger to break thinking that code set the value to “2”. I came up with the following hypothetical chain of events:
1. mov byte ptr [rax], 0DDh modifies the memory location, CPU breaks execution to let debugger inspect the program state
2. Memory gets corrupted by something
3. Debugger inspects the memory address, finds “2” inside and thinks that’s what changed.
So… what can change memory contents while the program is frozen by a debugger? As far as I know, that’s possible in 2 scenarios: it’s either another process doing it, or it’s the OS kernel. To investigate either of these, a conventional debugger will not work. Enter kernel debugging land.
Surprisingly, setting up kernel debugging is extremely easy on Windows. You’ll need 2 machines: the one debugger will run on, and the one you’ll debug. Open up elevated command prompt on the machine which you’re going to be debugging, and type this:
Host IP is the IP address of the machine that has the debugger running. It will use the specified port for the debugger connection. It can be anywhere between 49152 and 65535. After hitting enter on the second command, it will tell you a secret key (truncated in the picture) which acts as a password when you connect the debugger. After completing these steps, reboot.
On the other computer, open up WinDbg, click on File -> Kernel Debug and enter port and key.
If everything goes well, you’ll be able to break execution by pressing Debug -> Break. If that works, the “debugee” computer will freeze. Enter “g” to continue execution.
I started up the game and waited for it to break once so I could find out the address at which memory gets corrupted:
CrashingGame.exe has triggered a breakpoint.
Alright, now that I knew the address where to set a data breakpoint, I had to configure my kernel debugger to actually set it:
PROCESS ffffe00167228080
SessionId: 1 Cid: 26b8 Peb: 49cceca000 ParentCid: 03d8
DirBase: 1ae5e3000 ObjectTable: ffffc00186220d80 HandleCount:
Image: CrashingGame.exe
kd> .process /i ffffe00167228080
You need to continue execution (press ‘g’
to be switched. When the debugger breaks in again, you will be in
the new process context.
kd> g
Break instruction exception – code 80000003 (first chance)
nt!DbgBreakPointWithStatus:
fffff801
7534beb0 cc int 3
kd> .process
Implicit process is now ffffe001
66e9e080kd> .reload /f
kd> ba w 1 0x00000049D05FEDD8 “.if (@@c++(*(char*)0x00000049D05FEDD8 == 2)) { k } .else { gc }”
After some time, the breakpoint actually hit…
00 ffffd000
23c1e980 fffff801
7527dc64 nt!IopCompleteRequest+0xef01 ffffd000
23c1ea70 fffff801
75349953 nt!KiDeliverApc+0x13402 ffffd000
23c1eb00 00007ffd
7e08b4bd nt!KiApcInterrupt+0xc303 00000049
d05fad50 cccccccc
cccccccc UnityPlayer!StackSentinel::StackSentinel+0x4d […\libil2cpp\utils\memory.cpp @ 21]
Alright, so what’s going on here?! The sentinel is happily setting its magic values, then there’s a hardware interrupt, which then calls some completion routine, and that writes “2” into my stack. Wow. Okay, for some reason Windows kernel is corrupting my memory. But why?
At first, I thought that this has to be us calling some Windows API and passing it invalid arguments. So I went through all the socket polling thread code again, and found that the only system call that we’ve been calling there was the select() function. I went to MSDN, and spent an hour rereading the docs on select() and rechecking whether we were doing everything correctly. As far as I could tell, there wasn’t really much you could do wrong with it, and there definitely wasn’t a place in docs where it said “if you pass it this parameter, we’ll write 2 into your stack”. It seemed like we were doing everything right.
After running out of things to try, I decided to step into the select() function with a debugger, step through its disassembly and figure out how it works. It took me a few hours, but I managed to do it. It seems that the select() function is a wrapper for the WSPSelect(), which roughly looks like this:
/* setting up some state
…
*/
IO_STATUS_BLOCK statusBlock;
auto result = NtDeviceIoControlFile(networkDeviceHandle, completionEvent, nullptr, nullptr, &statusBlock, 0x12024,
buffer, bufferLength, buffer, bufferLength);
if (result == STATUS_PENDING)
WaitForSingleObjectEx(completionEvent, INFINITE, TRUE);
/* convert result and return it
…
*/
The important part here is the call to NtDeviceIoControlFile(), the fact that it passes its local variable statusBlock as an out parameter, and finally the fact that it waits for the event to be signalled using an alertable wait. So far so good: it calls a kernel function, which returns STATUS_PENDING if it cannot complete the request immediately. In that case, WSPSelect() waits until the event is set. Once NtDeviceIoControlFile() is done, it writes the result to statusBlock variable and then sets the event. The wait completes and then WSPSelect() returns.
IO_STATUS_BLOCK struct looks like this:
{
union
{
NTSTATUS Status;
PVOID Pointer;
};
ULONG_PTR Information;
} IO_STATUS_BLOCK, *PIO_STATUS_BLOCK;
On 64-bit, that struct is 16 bytes long. It caught my attention that this struct seems to match my memory corruption pattern: the first 4 bytes get corrupted (NTSTATUS is 4 bytes long), then 4 bytes get skipped (padding/space for PVOID) and finally 8 more get corrupted. If that was indeed what was being written to my memory, then the first four bytes would contain the result status. The first 4 corruption bytes were always 0x00000102. And that happens to be the error code for… STATUS_TIMEOUT! That would be a sound theory, if only WSPSelect() didn’t wait for NtDeviceIOControlFile() to complete. But it did.
After figuring out how the select() function worked, I decided to look at the big picture on how socket polling thread worked. And then it hit me like a ton of bricks.
When another thread pushes a socket to be processed by the socket polling thread, the socket polling thread calls select() on that function. Since select() is a blocking call, when another socket is pushed to the socket polling thread queue it has to somehow interrupt select() so the new socket gets processed ASAP. How does one interrupt select() function? Apparently, we used QueueUserAPC() to execute asynchronous procedure while select() was blocked… and threw an exception out of it! That unrolled the stack, had us execute some more code, and then at some point in the future the kernel would complete the work and write the result to statusBlock local variable (which no longer existed at that point in time). If it happened to hit a return address on the stack, we’d crash.
The fix was pretty straightforward: instead of using QueueUserAPC(), we now create a loopback socket to which we send a byte any time we need to interrupt select(). This path has been used on POSIX platforms for quite a while, and is now used on Windows too. The fix for this bug shipped in Unity 5.3.4p1.
This is one of those bugs that keep you up at night. It took me 5 days to solve, and it’s probably one of the hardest bugs I ever had to look into and fix. Lesson learnt, folks: do not throw exceptions out of asynchronous procedures if you’re inside a system call!
Stephane Laroche
junho 15, 2016 às 1:48 amGood work, I like those kind of write-ups.
Just a technicality for correctness/completeness of the article: While the nt!KiApcInterrupt chain indeed starts at the IDT, it doesn’t comme from the HW per se (the APIC wasn’t involved in that), it’s just that (like you probably know) on x86, all processor exceptions go through the same routing table (the IDT), and it happens that in the NT kernel, the kernel initiates kernel mode APCs through an interrupt gate (x86 sw int instruction) in the APC dispatch code that triggers it. You can actually see some MS MASM code for that here:
Marco
maio 21, 2016 às 7:26 amI don’t think I would have ever been able to track down such a thing… I mean, how can you stay days hunting down a thing like this? My compliments.
ukrytysum
maio 4, 2016 às 5:18 pmunity
Springwater
maio 2, 2016 às 6:35 amI think it was worthy of the article, and exhilarating.. reminds me of spelunking, but I was ever only in the caves antechamber compared to this macabre airless crushing descent.
Keith T
abril 29, 2016 às 4:26 pmDamn, I was hoping this was the coroutine related crash we are seeing that was introduced in 5.4.0b14 that’s still in b16. Back to the searching.
Tautvydas Zilys
abril 29, 2016 às 5:44 pmCase number?
ernesto
abril 29, 2016 às 1:04 pmThis was very well written, but I’m lost at how you could determine whether the page was read only or not by comparing the address of the buffer and pageboundary adress, care to elaborate more on this?
sad
abril 28, 2016 às 2:10 pmSo was that your bug or Windows bug?
trengri
abril 28, 2016 às 12:05 pmThe final conclusion “do not throw exceptions out of asynchronous procedures if you’re inside a system call!” is a bit misleading. This can make somebody not very familliar with asynchronous Win32 (like me) to think that it’s an exception thrown from APC that forces select() to terminate. But it’s not – actually, it’s execution of APC procedure itself that forces select() to terminate. I mean even if you don’t throw an exception from APC, but just merely place APC on the thread, this will cause WaitForSingleObjectEx to exit from alertable wait and as a result this bug will be hit. So the final conclusion should be something like “Avoid using APC – they can hit in unpredicable way”.
Stephane Laroche
junho 15, 2016 às 1:33 amI don’t think you’re right. While the APC is asynchronously executed (when viewed from the POV of the execution of the thread) in the context of the thread to which it is attached – and will trigger on thread dispatch if the thread is in an alertable state (which is true with WaitForSingleObject), the APC installs a normal exception frame and will return normally from it – the APC doesn’t mock with the stack beyond that. It is the exception that destroys the stack, and the lower level system code that will process the completion of the IO is oblivious to what happened to the thread stack while completing the request (IRPs can be discard from thread termination, but that’s another story).
trengri
abril 27, 2016 às 11:08 pmIs it a wide-spread practice to throw exception from APC proc? Looks like kind of a hack to me. I cannot even imagine how the stack will be unwinding in this situation, will go and read something on SEH.
CRISTI
abril 27, 2016 às 4:11 pmCongratulations on cracking this nut, Zilys! I know that it can get frustrating while debugging it, but the satisfaction in the end makes it worth it.
I have a small question, if you don’t mind answering; when you call VirtualProtect, you give ‘startingPageBoundary’ as address.
The way you compute this value by adding 4095 to the address of m_Buffer, and then zeroing the lower 3 bytes, makes me believe the result will be higher than the address of m_Buffer, which makes me think that VirtualProtect will not actually make all of m_Buffer non-executable.
Why didn’t you simply call VirtualProtect(reinterpret_cast(m_Buffer), …)?
Tautvydas Zilys
abril 27, 2016 às 5:03 pmVirtualProtect operates on pages, not individual memory addresses, so the arguments passed to it need to be page boundary aligned. It is correct that the whole buffer will not become read only and that’s why I output whether the corrupted address was actually protected. In my case it was, and in case it wasn’t, I can always place some local variables at the base of the stack on that thread to force alignment.
Hal Hoadley
maio 5, 2016 às 4:54 pmYour analysis missed the real bug. NEVER create an object on the stack and pass it to another thread. You will always generate a crash sometime. The correct procedure is to create the object from the HEAP (the new operator) and pass that pointer to the other thread. When that thread is finished with it, it deletes it with the delete operator. If the other thread needs to send the memory back that is modified, then it does not delete the memory pointer but sends the pointer back. That solves all the problems and exceptions still work correctly. This is a common mistake, everybody makes it at least once. Just remember that the final user of the memory pointer must delete it in c++.
Also, calling a method directly in another thread is also a path to disaster, the better method is to post a message to it. The receiving thread than has a method that handles our message with the return value. This allows the to OS to better handle thread priorities, user interface higher priority, processing a bit lower.
Final point, you should not be polling for data. Have the communications thread send your thread a message with the data received. Your thread then processes it. Of course if you are not using a messaging OS, then you have a real problem.
Nachochip
maio 9, 2016 às 11:11 pmTotally agree what you said, the bug is cause by a bad coding practices in C++, while in other languages, garbage collection will solve this problem.
Stephane Laroche
junho 15, 2016 às 2:04 amWhile you make some points, I think you miss the context :
1) WSPSelect( ) is OS code
2) It is written in C
3) It needs to be high performance
So I think the author is correct, the APC + exception way for throwing (no pun intended) the thread out of a (C based no less, and clearly not exception safe) system call was a bad hack.
WEN
abril 27, 2016 às 10:49 amIts nothing . I can Understand easily these codes. ;)
Ashkan
abril 27, 2016 às 10:44 amVery interesting story. Sometimes debugging feels a black art to people but one thinks experience and practice with a good mind helps, This is one of the cases which seems it’s black art indeed :)
windbg and good memories :))) Thanks for sharing and please do more.
P.S To those who only use managed languages or write trivial programs, This is what bleeding edge means. What is not shared the lots of hypothesis he thought about at night and maybe day in the 5 days from issues in IL2CPP to …
Garry White
abril 27, 2016 às 5:33 amThe lesson which should be learnt is DO NOT USE EXCEPTIONS.
krick
abril 27, 2016 às 5:29 amThis takes me back to the days of Windows 95 game development when we had to remote debug with a second PC while developing a full screen DirectX game. People take multiple monitors for granted but it really wasn’t a thing until Windows 98 and even then support was kind of sketchy in the beginning.
We had a guy on the team that wrote his own crude device driver so that we could dump text (physics engine variables) to a separate 9-inch monochrome monitor in real-time while we tested the game full-screen on the primary monitor.
I also remember one time stepping through the code looking at CPU registers in Visual Studio while debugging a particularly nasty bug in order to figure out what the heck was going on. I think the bug was happening in a linked library where we didn’t have the source or something like that.
nick
abril 27, 2016 às 4:43 amI rather enjoy how this was written. Nearly wrapping up my CS degree, I was able to piece together the bits of the picture I didn’t understand and more clearly see what was going on. If there’s somewhere that I can read more writings like this, pointers to where they are would be welcome
Joel
abril 27, 2016 às 1:35 amVery interesting read. Thank you for sharing. As some said, this is like a murder mystery for software developers. Keep up the good work!
René Damm
abril 27, 2016 às 12:37 amThis was a very interesting read. Especially since I was the one who introduced the bug :/ Sorry buddy :)
Mike Vargas
abril 27, 2016 às 12:31 amHow does one learn enough to be able to thoroughly understand this blog post? This is coming from someone with a C.S. degree.
brothir
abril 27, 2016 às 7:15 amYou make a lot of programs while using the Visual Studio debugger.
I recently had a memory corruption problem as well, and the solution was to use VirtualAlloc() instead of declaring a variable as an array.
You don’t experience these things if you don’t try to push the envelope.
Jeff
abril 28, 2016 às 9:21 amStart publishing a really bad post on Facebook so all Friends block you, something extremely inappropriate, then destroy all communication devices (phones, your old pager in the drawer), lock yourself for days in your basement, enable “programming webpage only” in your browser (actually you wrote that plugin) so that you can only access msdn, StackOverflow and some obscure Kernel documentations, forget about wife and kids. Become Unity dev.
Sean C
abril 26, 2016 às 6:55 pmI take it the customer has source access if they were able to use IL2CPP on Windows? Is there any way to get beta/early access to using IL2CPP on Windows as well?
Also, great article :)
Tautvydas Zilys
abril 26, 2016 às 7:48 pmWe released IL2CPP support for our Windows Store target with 5.3. That’s where the bug was – it wasn’t the standalone player.
Sean C
abril 26, 2016 às 6:54 pmI take it the customer has source access if they were able to use IL2CPP on Windows? Is there any way to get beta/early access to using IL2CPP on Windows as well?
Also, great write article :)
Marc
abril 26, 2016 às 5:05 pmThat is very interesting, could it be that this bug is related to this one here too?
http://forum.unity3d.com/threads/results-dynamicbounds-empty-error.325138/
Also crashes at “random” times, and I had also memory access errors in my stack trace. Didn’t test it with the latest Unity version. I will test it asap.
Tautvydas Zilys
abril 26, 2016 às 5:12 pmI am positive that it’s not related. The bug I blogged about happened only on IL2CPP scripting backend on Windows when using asynchronous sockets.
Anondev
abril 26, 2016 às 2:49 pmThe solution is to avoid windows because it’a cancerous buggy mess made for accounting chimps.
dat_wackie
abril 26, 2016 às 4:56 pmStart adding to the technical conversion instead of trolling, little boy
Moschops
abril 26, 2016 às 2:27 pmHi, really interesting read. Enjoyed it a lot, despite the various parts at which I would have liked more detail. The first of these was when you reconstructed the corrupted stacktrace (pretty early on). I don’t suppose you could talk a little about how you did that?
Tautvydas Zilys
abril 26, 2016 às 2:56 pmWinDbg has a command “dps”, which prints pointer-sized words taken from the specified memory location and it also resolves the symbol if the printed word happens to be a function address. I typed “dps @rsp L200”, which started printing stack contents starting with the stack pointer for 200 addresses. Then, I looked at the output and manually cherry picked the stack frames.
Sergey
abril 26, 2016 às 11:14 am“I went to MSDN, and spent an hour rereading the docs on select() and rechecking whether we were doing everything correctly.”
but msdn docs says:
Note When issuing a blocking Winsock call such as select with the timeout parameter set to NULL, Winsock may need to wait for a network event before the call can complete. Winsock performs an alertable wait in this situation, which can be interrupted by an asynchronous procedure call (APC) scheduled on the same thread. Issuing another blocking Winsock call inside an APC that interrupted an ongoing blocking Winsock call on the same thread will lead to undefined behavior, and must never be attempted by Winsock clients.
Tautvydas Zilys
abril 26, 2016 às 11:18 amtimeout wasn’t actually set to 0. It was set to 1 second. We weren’t recursively calling WinSock functions either.
Sergey
abril 26, 2016 às 11:58 am“timeout wasn’t actually set to 0. It was set to 1 second.”
NULL timeout there only for example. any blocking operation.
“We weren’t recursively calling WinSock functions either.”
there nothing about recursion.
never mind
Tautvydas Zilys
abril 26, 2016 às 12:18 pmRight about the blocking operation. By recursion I meant this part: “Issuing another blocking Winsock call inside an APC that interrupted an ongoing blocking Winsock call”. We were not calling into Winsock in the APC. We were throwing an exception out of it.
André
abril 26, 2016 às 8:39 amAmazing detective work! Thanks for sharing!
Joy
abril 26, 2016 às 8:22 amThis is extremely fascinating detective work … ! Not to mention the level details put into it provided just the right amount of context. Thank you for taking your time to share such an interesting story.
Ilia
abril 26, 2016 às 7:08 amso, why sentinels and not Application Verifier?
Alex
abril 26, 2016 às 6:48 amAnd that’s the reason why I’m using C#.
Craig Ringer
abril 26, 2016 às 5:02 amI’m surprised that there’s no mention of Valgrind here, or for Windows users, Dr Memory. Both are excellent *free* memory error tracing tools.
Both have been huge helps for me on multiple occasions and I can’t recommend them enough.
Vladimir
abril 26, 2016 às 3:58 amBut why should the “out” parameter be a stack var (and not, say some static one, used solely by the thread, calling the NTxxx func, was such an option considered?)? :)
Tautvydas Zilys
abril 26, 2016 às 11:02 amIt’s not our code. It’s part of Windows, which means we don’t have an option to change it.
Vladimir
abril 26, 2016 às 11:53 amBut doesn’t NtDeviceIoControlFile ‘s documentation at https://msdn.microsoft.com/en-us/library/ms648411(v=vs.85).aspx specify the “out” parameter’s in question type as “PIO_STATUS_BLOCK” (documented at https://msdn.microsoft.com/en-us/library/windows/hardware/ff550671(v=vs.85).aspx as a pointer to a union _STATUS_BLOCK)?
Meaning, that union instance shouldn’t necessarily be a stack variable. Or do I miss somethig?
Tautvydas Zilys
abril 26, 2016 às 12:19 pmIt doesn’t necessarily have to be a stack variable, but it was in this case.
AndrewS
abril 26, 2016 às 3:28 amAs an ex system-programmer, I thoroughly enjoyed this blog. Excellently written, and a very interesting lesson.
Do please continue with other stories from the trenches!
-A.
Siavash
abril 26, 2016 às 2:39 amWell, the fact that I cannot understand 96% of your article, actually means you’ve done a fantastic job right there! Thank you! :P
Edward Bowen
abril 26, 2016 às 2:11 amEnjoyed reading this! Luckily not all of it flew over my head, but the thought process and the path you took to ultimately arrive at a solution is a bit hard to wrap my mind around. Seemed like trying to solve a murder though with few clues and suspects lined up to be interrogated.
Ivo Levicharov
abril 26, 2016 às 1:58 amWhat cool read! This was like Agatha Christie for SW developers. Great job! :)
Andy Martin
abril 26, 2016 às 1:45 amThanks for sharing.
Martin
abril 26, 2016 às 1:20 amFantastic read, and thank you for fixing it!
Georges Paz
abril 26, 2016 às 12:50 amWe definitely have a winner hero here and his name is “Tautvydas Zilys”! Kudos and good hunting! :)
Adam Tuliper
abril 26, 2016 às 12:19 amLove the low level post. Nice debugging work!!!
Senshi
abril 25, 2016 às 9:27 pmThat was an amazing read, thank you! Always fun to see how other people approach debug headaches.
I feel like I’m missing something though, and I am now too invested in this story to just leave it be. You say “[…] the kernel would complete the work and write the result to statusBlock local variable (which no longer existed at that point in time).”
What I’m curious about is how the corrupted memory was always at the same address. It’s probably a simple explanation, but with my very limited low-level knowledge I would assume that either:
1) ‘statusBlock’ lives at a different location each time, thus corrupting a different address; or
2) ‘statusBlock’ has a permanent location, which makes me wonder how game memory or the sentinal guard could be living on that address as well.
In writing this out though, I’m guessing maybe the STATUS_TIMEOUT itself caused the original crash?
Tautvydas Zilys
abril 25, 2016 às 9:34 pmstatusBlock is a local variable in WSPSelect() function. This function is called by the select() function, and that is called from exactly one place in our code. That means that the stack frames will be identical every time it is called, and the statusBlock variable will have a fixed offset from the beginning of the stack.
STATUS_TIMEOUT is perfectly acceptable result for select() call – however, the issue lied in the fact that the frame was unrolled before the kernel could write the result value.
awss
abril 25, 2016 às 8:59 pmVery nice blog.
i have tried to understand whats going on here but i find the C++ code a bit strange. I know C++ but the stuff you are using i never really used before. Any books would you recommend for Intermidiate programmer to learn more about the topics you covered. I am interested to know what Unity team use on their shelfs as a reference for C++ and other graphics programming maybe. Would be a good topic for another blog in my opinion.
Best Regards.
isscandar
junho 23, 2016 às 10:41 amAWSS: I think it’s not a matter of C++ code but a matter of knowledge about Windows as an operating system. I would suggest the old (1997) book by Jeffrey Richter “Advanced Windows 3rd Edition”, Microsoft Press.
chris willoughby
abril 25, 2016 às 8:48 pmAwesome. Bright people working at Unity. Don’t bang yr head against that wall. We need it so our favorite game engine works. :)
Zirkonium
abril 25, 2016 às 8:25 pmThanks for the detailed and interesting write-up :)
Dan Super
abril 25, 2016 às 8:22 pmI’m impressed.
They say wisdom is knowing that you do not know right?
I’m wise enough to know I do not have the knowledge necessary to solve that problem.
Jeff
abril 25, 2016 às 8:01 pm“Somebody had been touching my sentinel’s privates – and it definitely wasn’t a friend.”
Jonney Shih
maio 9, 2016 às 7:00 pmLOL, glad he finally managed to debug his privates.
Mr. Management
abril 25, 2016 às 7:55 pmHi, I’m in management.
Why did that you 5 days to fix?! You’re paid to fix bugs, not read code or write blog posts. Don’t you know you’re job?
I have another project for you. I’d like my keyfob to launch the space shuttle. Have that done for me by next week. It’s easy, just make the buttons launch the space shuttle!
Landon
abril 25, 2016 às 7:21 pmAh, this is painful for me just to read!
Kulesz
abril 25, 2016 às 7:13 pmI read it with a great interest and my eyes wide open… Looking for more, as it is better than some detective stories :)
Great job nailing the bug!
ConstantineM
abril 25, 2016 às 7:12 pmReally interesting read! Debugging, at this low level is a piece of art! I would love to read more posts like this one!
Onsterion
abril 25, 2016 às 6:58 pmVery interesting, thanks for sharing.
Alkis Tsapanidis
abril 25, 2016 às 6:21 pm“Somebody had been touching my sentinel’s privates – and it definitely wasn’t a friend.”
You were dying to pull that pun, weren’t you? ;-)
Tautvydas Zilys
abril 25, 2016 às 7:41 pm( ͡° ͜ʖ ͡°)
Marcin Szymanski
abril 25, 2016 às 6:07 pmOutstanding work. I like how you casually transitioned to “okay so now we debug the Windows kernel.”
Paul lord
abril 25, 2016 às 6:01 pmGreat Job!
The Multi- platform socket development that I have worked on in C/C++ has always been fun!
It’s good to hear Unity is making the tools rock solid! Thank You!!
Andy Coats
abril 25, 2016 às 5:54 pmGreat post! Thanks for sharing.
I bet the beer at the end of that session tasted very nice :)
DougFinn
abril 25, 2016 às 5:13 pmI wish I understood any of that!
Garth Smith
abril 25, 2016 às 4:59 pmDebugging is such an important skill but I rarely see write-ups on how it’s done. Thank you for this.
Sampsa Lehtonen
abril 25, 2016 às 4:53 pmQuite a story, thanks for sharing :)
Have had my share of heap corruptions. It’s never pretty. Feels like you’re investigating a murder scene and only thing you have is some blood on the floor :P
Stephen Allen
abril 25, 2016 às 4:52 pmInteresting read, 95% of it flew over my head, but nevertheless, still interesting :)
Niall Burton
abril 25, 2016 às 8:47 pmMy case exactly.