A look inside: The Quest for Green Automation
One of the hardest things to achieve in any software development process is a set of automated tests that is almost always green. Just as it is almost impossible to find and fix every bug in a piece of software, it is impossible to keep tests green all the time. That doesn’t mean we shouldn’t try! In this blog post, we go over some of the challenges we face daily with our automated test suites.
If you are developing a simple library and you design it from scratch using a methodology like Test-Driven Development, you are likely to end up with something that works well and has a nice set of fast and stable unit tests to accompany it. If you are developing anything more complex than that, unit tests will not be enough and you will need to add at least one integration test suite. Unity is very complex due to the number of features that need to integrate with each other and the fact that we support building games and applications on more than 20 different platforms. Testing software of this complexity meant we ended up creating a lot of high-level internal testing frameworks.
Let’s take a look at some numbers! Our code base is 12 years old, which means there is lots of legacy code intermixed with lots of new code. Here are some detailed stats collected using cloc:
Engine (Runtimes, Modules, Shaders):
- 663 kloc code + 55 kloc comments in 3.8k files
- 87% C++, 4% C#, 5% bindings, 2% shaders
Editor (Editor, Extensions, some Tools):
- 749 kloc code + 57 kloc comments in 4.6k files
- 51% C++, 44% C#, 4% bindings
- 464 kloc code + 52 kloc comments in 2.8k files
- 59% C++, 37% C#, 2% bindings
Tests (does not include all tests; some are inside editor/runtime, especially c++ unit tests):
- 301 kloc code + 21 kloc comments in 4.6k files
- 1% C++ :), 87% C#, 8% shaders
Total:
- 2.2 million lines of code + 185 thousand lines of comments
These numbers don’t include any of the external libraries we integrate into Unity. The tests are split into low-level C++ unit tests and high-level C# tests. The C# tests can be of many different kinds based on what they test: runtime, integration, asset import, graphics, performance, etc. In total we have about 60,000 automated tests that are executed tens of millions of times every month, both locally through manual runs and on our build farm.
Because we rely so much on high-level automated tests, we constantly have to deal with test failures and instabilities. To keep these to a minimum, we started doing a number of things:
- Make sure the head revision of trunk (main development branch) is always green
- Implement monitoring and reporting of tests
- Optimize the time spent executing tests
We use a development method that relies heavily on having multiple code branches that are kept in sync with and eventually merged back to trunk when ready. We want to always be able to have a build ready for release from the head revision of trunk, which means we want to make absolutely sure that it is always green. Everyone branching from trunk also wants to start their work on code that passes all automation.
The current way we are keeping trunk always green is by using a staging branch. Every day, multiple people submit code that should be merged to trunk. Requests like these get bundled together, merged onto the staging branch and all test automation is executed. If anything fails, we have a tool that reruns the failed tests on the same revision again to verify if it is just an instability or an actual failure. If it is an instability, a notification is posted to an internal chat, where we always have one or more developers investigating any issue that gets posted there. If it is an actual failure, we run a bisection process to quickly figure out which one of the code merge requests introduced it. The person responsible for that gets notified and the code is removed from the staging branch. If everything passes as expected, the staging branch gets merged to trunk. We call this the Trunk Queue Verification process.
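The two automated decisions in this process, telling an instability apart from an actual failure and bisecting to find the offending merge request, can be sketched roughly as follows. This is only an illustration and not our actual tooling: the function names, the number of reruns, and the `run_test` and `passes` callbacks are all hypothetical, and the bisection assumes a single merge request in the bundle introduced the failure.

```python
from typing import Callable, Sequence


def classify_failure(run_test: Callable[[], bool], reruns: int = 3) -> str:
    """Rerun a failed test on the same revision.

    Any passing rerun means the result is not deterministic, so the
    failure is treated as an instability. The rerun count is an assumption.
    """
    for _ in range(reruns):
        if run_test():
            return "instability"
    return "failure"


def find_breaking_request(
    requests: Sequence[str],
    passes: Callable[[Sequence[str]], bool],
) -> str:
    """Bisect the bundled merge requests to find the one that broke a test.

    `passes(applied)` is a hypothetical callback that builds the staging
    branch with only `applied` merged and reports whether the test passes.
    Assumes the test passes with nothing merged and fails with everything
    merged, and that exactly one request is responsible.
    """
    lo, hi = 0, len(requests)  # requests[:lo] passes, requests[:hi] fails
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if passes(requests[:mid]):
            lo = mid
        else:
            hi = mid
    return requests[hi - 1]
```

With five bundled requests, this sketch needs three test runs instead of five to find the culprit, which matters when a single run of the suites takes hours.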
This process does help keep the main development branch always green, but it is far from ideal. It is costly to maintain because running all our test suites takes hours, and finding the source of some failures requires human intervention in a lot of cases. The ideal scenario would be for us to run tests on all branches after every new set of changes is pushed, achieving something close to continuous integration. Right now, we are running tests in the most naive way possible, which usually means that for most new batches of changes, we run all the tests. We are taking the first steps towards changing this and improving everyone’s iteration time on test automation runs here at Unity by introducing a smart test selection service.
We have previously blogged about our Unified Test Runner, which also stores lots of information about every test run in a database. We now have tens of millions of test data points where we can see when a test was executed, by whom, on which machine, whether it passed or failed, how long it took to execute, etc. We are starting to leverage all this data to build a rule-based system for selecting which tests should run based on which code was changed on a specific branch. Here are a few examples:
- Executing all integration tests takes about 90 minutes. We can see from historical data that more than half of these tests have always passed in the last 100 runs. We introduce a rule that will skip always-green tests for 9 consecutive runs and only run them every 10th run. That saves us 60 minutes for each of those 9 runs.
- The code for the AI feature is nicely isolated from the rest of the code base into its own module. Someone makes changes only to the AI code. There is a rule that will determine that only AI tests should be executed.
- A branch only introduces a few new tests. There is a rule that determines that running only those new tests (eventually the full test suites of which they are a part) should be enough.
- If we are running tests on the trunk staging branch, we always run all of them to make sure trunk is always green.
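To make the rules above more concrete, here is a minimal sketch of how such a selection could look. It is not the actual service: the `TrackedTest` record, the `select_tests` function, the `"trunk-staging"` branch name, and the idea of tagging every test with a single module are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class TrackedTest:
    name: str
    module: str                 # hypothetical tag, e.g. "AI"
    recent_results: List[bool]  # oldest to newest, True means passed


def select_tests(
    tests: List[TrackedTest],
    branch: str,
    changed_modules: Set[str],
    new_tests: Set[str],
    run_index: int,
    history_window: int = 100,
    full_run_every: int = 10,
) -> List[str]:
    """Return the names of the tests to run for this batch of changes."""
    # The staging branch always runs everything to keep trunk green.
    if branch == "trunk-staging":
        return [t.name for t in tests]

    # A branch that only adds tests runs just those new tests.
    if new_tests and not changed_modules:
        return [t.name for t in tests if t.name in new_tests]

    selected = []
    for t in tests:
        # Only run tests that belong to a module that was changed.
        if t.module not in changed_modules:
            continue
        # Skip always-green tests, except on every Nth run.
        window = t.recent_results[-history_window:]
        always_green = len(window) >= history_window and all(window)
        if always_green and run_index % full_run_every != 0:
            continue
        selected.append(t.name)
    return selected
```

Using the figures from the first example, nine out of ten runs would take about 30 minutes and the tenth would take the full 90, so a run averages roughly 36 minutes instead of 90.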
Using this rule-based system will save everyone a lot of time, but it will not remove instabilities. Instabilities make working with tests unreliable and slow, which is why we need to fix the source of each instability as fast as possible. Instabilities can be caused by tests (in which case the test is disabled and a bug with the highest priority is opened for someone to fix it) or by infrastructure (mobile devices in the build farm get disconnected or crash/freeze and need to be restarted, etc). For infrastructure issues, all we can do is have good management and monitoring tools.
We are not the only ones struggling to keep test automation green and stable. Google has written about this quite extensively in their Flaky Tests blog post, and they also offer great advice on how to avoid the problem in their Hackable Projects blog post. Facebook also uses a system of bots to make sure automation runs fast and stays stable. You can see more in one of their presentations from GTAC and another from the F8 2015 conference.