This is the eighth and final post in the IL2CPP Internals series. In this post I’ll deviate a bit from the content of previous posts, and not discuss some aspect of how IL2CPP works at compile time or run time. Instead, we’ll take a brief look at how we develop and test IL2CPP.

Test-first development

The IL2CPP team has a strong test-first development mentality. Much of the code for IL2CPP is written using the practice of Test Driven Development (TDD), and very few pull requests are merged to the IL2CPP code without significant test coverage.

Since IL2CPP has a finite (although rather large) set of inputs (the ECMA 335 spec), the process of developing it fits nicely with TDD concepts. Most tests are written before production code, and these tests always need to fail in an expected way before the code to make them pass is written.

This process helps to drive the design of IL2CPP, but it also provides the development team with a large bank of tests which run rather quickly and exercise nearly all of the existing behavior in IL2CPP. For us as a development team, this test suite provides two important benefits.

  1. Confidence: Most changes to refactor code in IL2CPP can be made with high confidence. If the tests pass, it is very unlikely that a regression has been introduced.
  2. Troubleshooting: Since the code in IL2CPP behaves as we expect it to, bugs are almost always unimplemented sections of the code or cases we have not yet considered. By scoping down the space of possible causes of a given bug this way, we can correct bugs much more quickly.

Testing statistics

The various types of tests that we run against the IL2CPP code base break down into a few different levels. Here is the number of tests we currently have at each level (I’ll discuss what each type of test actually is below).

  • Unit tests
    • C#: 472
    • C++: 44
  • Integration tests
    • C#: 1735
    • IL: 173

If all of these tests are green, then we feel confident that we can ship IL2CPP at that moment. We maintain one main development branch for IL2CPP, which always tracks the leading edge branch for development in Unity as a whole. The tests are always green on this main development branch. When they break (which does happen once in a while), someone usually fixes them within a few minutes.

Since developers on our team often fork this main branch for personal development, it needs to be green at all times. The build and test status for both the main development branch and personal branches is maintained on Katana, Unity’s internal build management system.

We use NUnit to run all of these tests, and we drive NUnit in one of three different ways:

  • Windows: ReSharper
  • OSX: Xamarin Studio
  • Command line on Windows and OSX on our build machines: a custom Perl script

Types of tests

I mentioned four different types of tests above without much explanation. Each of these types of tests serves a different purpose, and they all work together to help keep IL2CPP development moving forward.

The unit tests verify the behavior of a small bit of code, typically a method. They set up a situation, execute the code under test, and finally assert some expected behavior.
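
As a rough illustration of that set up, execute, assert shape, here is a minimal sketch. It is written in Python for brevity, not in the C# and NUnit that the actual tests use, and the helper `to_cpp_identifier` is a hypothetical stand-in invented for this example, not a real il2cpp.exe method.

```python
import re
import unittest


def to_cpp_identifier(managed_name: str) -> str:
    """Hypothetical helper: map a managed name to a C++-safe identifier."""
    # Replace every character that is not valid in a C++ identifier.
    cleaned = re.sub(r"[^0-9A-Za-z_]", "_", managed_name)
    # A C++ identifier may not start with a digit.
    if cleaned and cleaned[0].isdigit():
        cleaned = "_" + cleaned
    return cleaned


class ToCppIdentifierTests(unittest.TestCase):
    def test_generic_type_name_is_sanitized(self):
        # Set up a situation.
        managed_name = "System.Collections.Generic.List`1"
        # Execute the code under test.
        result = to_cpp_identifier(managed_name)
        # Assert the expected behavior.
        self.assertEqual(result, "System_Collections_Generic_List_1")
```

A test like this touches no files and launches no processes, which is what keeps tests at this level fast. Saved as a module, it could be run with `python -m unittest`.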

The integration tests for IL2CPP actually run the il2cpp.exe utility on an assembly, compile the generated C++ code to an executable, then run the executable. Since we have a nice reference for IL2CPP behavior (the existing version of Mono used in Unity), these integration tests also run the same assembly with Mono (and .NET, on Windows). Our test runner then compares the results of the two (or three) runs dumped to standard output and reports any differences. So the IL2CPP integration tests don’t have explicit expected values or assertions listed in the test code like the unit tests do.
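
The comparison step can be sketched roughly as follows. This is not the team's actual test runner. The function names are made up for illustration, and the commands are stand-ins that launch the Python interpreter instead of a Mono or IL2CPP executable.

```python
import subprocess
import sys
from typing import Dict, List


def capture_stdout(command: List[str]) -> str:
    """Run one command and return everything it wrote to standard output."""
    completed = subprocess.run(command, capture_output=True, text=True, check=True)
    return completed.stdout


def find_mismatches(reference: str, outputs: Dict[str, str]) -> List[str]:
    """Return the names of the runs whose output differs from the reference run."""
    expected = outputs[reference]
    return [name for name, text in outputs.items() if text != expected]


def run_differential_test(commands: Dict[str, List[str]], reference: str) -> List[str]:
    """Run every command, then compare each output against the reference output."""
    outputs = {name: capture_stdout(command) for name, command in commands.items()}
    return find_mismatches(reference, outputs)


# Stand-in commands (assumption): a real runner would start the same test
# assembly under Mono and as an IL2CPP-built executable instead.
STAND_IN_COMMANDS = {
    "mono": [sys.executable, "-c", "print(6 * 7)"],
    "il2cpp": [sys.executable, "-c", "print(42)"],
}
```

Calling `run_differential_test(STAND_IN_COMMANDS, "mono")` returns an empty list when every run prints the same text. The test body only needs to print values, because the reference runtime supplies the expected output.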

C# unit tests

These tests are the fastest and lowest-level tests that we write. They are used to verify the behavior of many parts of il2cpp.exe, the AOT compiler utility for IL2CPP. Since il2cpp.exe is written entirely in C#, we can use fast C# unit tests to get good turn-around time for changes. All of the C# unit tests complete in a few seconds on a nice development machine.

C++ unit tests

The vast majority of the runtime code for IL2CPP (called libil2cpp) is written in C++. For parts of that code which are not easily accessible from a public API, we use C++ unit tests. We have relatively few of these tests, as most of the behavior of code in libil2cpp can be exercised via our larger integration test suite. These tests do require more time than you might expect for unit tests to run, as they need to run il2cpp.exe itself to set up their fixture data.

C# integration tests

The largest and most comprehensive test suite for IL2CPP is the C# integration test suite. These tests are divided into smaller segments, focusing on tests that verify behavior of icalls, code generation, p/invoke, and general behavior. Most of the tests in this suite are rather short, only about 5 – 10 lines long. The entire suite runs in less than one minute on most machines, but we can run it with various IL2CPP options related to things like stripping and code generation.

IL integration tests

These tests are similar in toolchain to the C# integration tests. However, instead of writing the test code in C#, we use the ILGenerator class to directly create an assembly. Although these tests can take a bit more time to write than C# tests, they offer increased flexibility. Often we run into problems with IL code that is invalid or not generated by our current Mono C# compiler. In these cases, we can often write a good test case with IL code. The tests are also beneficial for comprehensive testing of opcodes like conv.i (and similar opcodes in its family) which have clear behavior with many slight variations. All of the IL tests complete end to end in less than one minute.
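
To show why an opcode family with "many slight variations" suits exhaustive testing, here is a small table-driven sketch. It is a simplified Python model of the non-overflow-checking small-integer conversions (conv.i1, conv.u1, conv.i2, conv.u2) applied to integer inputs, based on the truncation behavior described in ECMA 335. It is not IL2CPP code and does not use ILGenerator. The names `conv` and `failing_cases` are invented for this example.

```python
from typing import List, Tuple


def conv(value: int, bits: int, signed: bool) -> int:
    """Model an integer conv.* opcode: keep the low bits, then sign- or zero-extend."""
    truncated = value & ((1 << bits) - 1)  # drop the high-order bits
    if signed and truncated >= 1 << (bits - 1):
        truncated -= 1 << bits  # reinterpret the top remaining bit as the sign
    return truncated


# Each row: (opcode name, width in bits, signed?, input, expected result).
CASES: List[Tuple[str, int, bool, int, int]] = [
    ("conv.i1", 8, True, 127, 127),
    ("conv.i1", 8, True, 128, -128),
    ("conv.i1", 8, True, 0x1FF, -1),
    ("conv.u1", 8, False, 0x1FF, 255),
    ("conv.u1", 8, False, -1, 255),
    ("conv.i2", 16, True, 0x18000, -32768),
    ("conv.u2", 16, False, 0x18000, 32768),
]


def failing_cases() -> List[str]:
    """Return a description of every row whose modeled result is not the expected one."""
    failures = []
    for name, bits, signed, value, expected in CASES:
        actual = conv(value, bits, signed)
        if actual != expected:
            failures.append(f"{name}({value}) gave {actual}, expected {expected}")
    return failures
```

Each new variation is one more row in the table, so covering boundary values for every member of the family stays cheap.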

We run all of these tests through many variations and options on Katana. From a clean pull of the source code to completed test runs, we see about 20-30 minutes of runtime depending on the load on the build farm.

Why so many integration tests?

Based on these descriptions, it might seem like our test pyramid for IL2CPP is upside down. And indeed, the end-to-end integration tests (near the top of the pyramid) make up most of our test coverage.

Following TDD practice with test times more than a few seconds can be difficult as well. We work to mitigate this by allowing individual segments of the integration test suites to run, and by doing incremental building of the C++ code generated in the test suites (this is how we are proving out some incremental building possibilities for Unity projects with IL2CPP, so stay tuned). Then the turn-around time for an individual test is reasonable (although still not as fast as we would like).

This heavy use of integration tests was a conscious decision though. Much of the code in IL2CPP looks different than it used to, even at our initial public releases in January of 2015. We have learned plenty and changed many of the implementation details in the IL2CPP code base since its inception, but we still have many of the original tests written years ago. After trying out tests at a number of different levels (including even validating the content of the generated C++ source code), we decided that these integration tests give us the best ratio of runtime to test stability. Seldom, if ever, do we need to modify one of the existing integration tests when something changes in the IL2CPP code. This fact gives us tremendous confidence that a code change which causes a test to fail is really a problem. It also lets us refactor and improve the IL2CPP code as much as we need to without fear.

Even larger tests

Outside of IL2CPP itself, the IL2CPP code fits into the much larger Unity testing ecosystem. For each platform we ship supporting IL2CPP, we execute the Unity player runtime tests. These tests build up a single Unity project with more than 1000 scenes, then execute each scene and validate expected behavior via assertions. We usually don’t add new tests to this suite for IL2CPP changes (those tests usually end up being at a lower level). This suite serves as a check against regressions that we might introduce with IL2CPP on a given platform. This suite also allows us to test the code used to integrate IL2CPP into the Unity build toolchain, which again varies for each platform. A typical runtime test suite completes in about 60-90 minutes, although we often execute individual tests locally much faster.

The largest and slowest tests we use for IL2CPP are Unity editor integration tests. Each of these tests actually runs a different instance of the Unity editor. Most of the IL2CPP editor integration tests focus on building and running a project, usually with various editor build settings. We use these tests to verify things like complex editor integration, error message reporting, and project build size (among many others). Depending on the platform, these integration test suites run in a few hours and are usually executed at least nightly, if not more often.

What is the impact of these tests?

At Unity, one of our guiding principles is “solve hard problems”. I like to think about the difficulty of problems in terms of failure. The more difficult a problem is to solve, the more failures I need to accomplish before I can find the solution.

Creating a new highly-performant, highly-portable AOT compiler and virtual machine to use as a scripting backend in Unity is a difficult problem. Needless to say, we’ve accomplished thousands of failures along the way. There are more problems to solve, and so more failures to come. But by capturing the useful information from almost all of those failures in a comprehensive and fast test suite, we can iterate very quickly.

For the IL2CPP developers, our test suite is not so much a means to verify bug-free code (although it does catch bugs), or to help port IL2CPP to multiple platforms (it does that too), but rather, it is a tool we can use to fail fast and solve hard problems so our users can focus on creating beautiful things.

Conclusion

We hope that you have enjoyed the IL2CPP Internals series of posts. We’re happy to share implementation details and provide debugging and performance hints when we can. Let us know if you want to hear more about other topics related to the design and implementation of IL2CPP.

  1. Will Unity Window/Phone 10 Universal Games be using .Net Native during runtime ?

    https://channel9.msdn.com/Blogs/DevRadio/NET-Native-Performance-Optimizing-Your-Windows-Apps-with-NET-Native

  2. Is it safe to assume that you also test the more esoteric features of C#, e.g. ‘stackalloc’? (We had a couple of mysterious crashes in 5.1 that decidedly was ‘definitely maybe’ due to stackalloc usage in our game code. Will try again with 5.2 soon.)

  3. For bugs that occur despite passing the test suite – do you add a test that reproduces the bug before fixing it? Also, how often do you find the unit tests failing after making code changes? Is there a particular area that seems to be more sensitive (changes result in test failures) more than the others? (just curious)

    1. Josh Peterson

      July 22, 2015 at 1:36 pm

      > For bugs that occur despite passing the test suite – do you add a test that reproduces the bug before fixing it?

      Yes. In 99% of cases the test is added first. We have had corner cases where the bug is too difficult to reproduce with our testing infrastructure (e.g. an older asset store package has an assembly with invalid IL code that we cannot generate in any way). Bugs like this are few and far between, and the fix usually generates a good bit of discussion about why we cannot add a test first. We’re pragmatic about getting fixes like this in, but we try to avoid them without a test.

      > Also, how often do you find the unit tests failing after making code changes?

      We don’t see this too often, but it is difficult to make a general statement here. See my next answer.

      > Is there a particular area that seems to be more sensitive (changes result in test failures) more than the others?

      Yes. On the code generation side, changes to things like includes and externs can cause problems in unexpected places. For runtime code, changes in type initialization and field layout can also lead to widespread test failures. I guess the best way to describe this is to say that changes in general code can impact many tests, while changes to specific code seldom cause tests to break.

      This seems like a good thing though, as this is the behavior I would expect if the tests actually verify what they should.

      1. Wow, thank you for the detailed answers!

  4. Marc-André Jutras

    July 20, 2015 at 3:42 pm

    “The more difficult a problem is to solve, the more failures I need accomplish…”

    <..>

    … like Unity’s serialization. :p