Katana: Leveraging Open-Source Tools for Continuous Integration
For a few years, Unity has used TeamCity from JetBrains for automated building and testing. As the R&D team here at Unity grew, the demands on the build infrastructure grew along several axes: the number of users, the number of changesets, and the number of simultaneous branches. We reached a point where we needed to accommodate several thousand builds per day, and we started seeing performance problems on multiple fronts: servers became slow to respond, we encountered unexplained errors that we could not fix, new changes were processed so slowly that they caused delays, webpages took several minutes to load, and so on. After a year of back-and-forth with the makers of TeamCity, and a progressively worsening state of our build infrastructure, I came to the conclusion that the best path forward for us was to switch to a solution that better suited our particular needs. When your way of working is as particular as ours and is combined with our scale, extensibility and flexibility become a necessity in any tool, and both can be hard to get from off-the-shelf proprietary solutions. Being a long-time open-source enthusiast, I felt this was a particularly good opportunity to leverage the power of open source to fix our problems. After some research, I decided that we would build a custom solution on top of Buildbot, an open-source continuous integration framework used by Chromium, Mozilla, Python, and various other projects. Buildbot is written in Python on top of Twisted, an event-driven networking engine.
A Look Back
Phase 1: Prototype and Proof-of-Concept
It was now September 2012, and, luckily for me, we had just expanded the Build Engineering team with a new hire, Maria, who already had experience working with Buildbot. We knew this would be a large project, so we started with a two-month prototyping and proof-of-concept phase in which Maria explored various aspects of Buildbot to test whether it could scale in the future while keeping the flexibility we needed for our complex build chains. We also knew we wanted to decouple as many parts of the build infrastructure as possible to make maintenance and debugging easier.
An early design diagram for Katana.
Phase 2: Requirements Gathering and Beginning of Implementation
After around two months of prototyping and proof-of-concept work, we were confident the toolset we had chosen would work, given some serious investment. The next phase of the project involved a feature comparison between Buildbot and TeamCity and an initial attempt to gather requirements for a system that could be used in production as a TeamCity replacement. This part was hard and required some iteration, because 1) we were still learning about all of the capabilities and limitations of Buildbot, and 2) it was hard to figure out which TeamCity features were really useful to us and which ones we could live without. We started with an initial project plan and schedule, which we revised at regular intervals along the way. At this point, we brought in our IT department to estimate how much hardware we would need to acquire to build a fully functioning system without taking resources away from our production instances.
Phase 3: The Front-end
The version of Buildbot we forked from (0.8.7) does come with a user interface, but for users coming from TeamCity it was practically impossible to use, especially with the number of build configurations and builds we have. Performance was also a concern, of course: after our previous experiences, we knew the most important thing was that the UI loaded fast, and everything else was secondary. We therefore needed someone with UI expertise and a keen eye for design to produce a new UI for us. We hired a front-end developer, Simon, who was experienced with websites where performance is the main concern, and tasked him with creating a new Buildbot front end.
Phase 4: More Implementation
This is where the bulk of feature implementation was done. At one point during these months we decided to reassess and extend the project schedule after discovering that significant work was needed on the Buildbot side to handle one of our use cases, but overall, the project went well. Here and there we also needed to do some work in our build system (e.g., working around the fact that Python on Windows stores and lists all environment variable names in upper case) and in our test frameworks (e.g., making all tests output a standardized XML file containing test results that we could parse).
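Both workarounds are small but easy to get wrong, so here is a minimal Python sketch of each. It is an illustration under stated assumptions, not Katana's actual code: the function names are hypothetical, and the XML report is assumed to be JUnit-style (`<testsuite>`/`<testcase>` elements whose children are `<failure>`, `<error>` or `<skipped>`), which is a common choice for a standardized test-result format but is not necessarily the one used here.

```python
import xml.etree.ElementTree as ET


def get_env_case_insensitive(environ, name, default=None):
    """Hypothetical helper: look up an environment variable ignoring name case.

    On Windows, os.environ reports every variable name in upper case, so code
    that iterates over the environment (or copies it into a plain dict) cannot
    rely on the original spelling of a name such as "Path".
    """
    wanted = name.upper()
    for key, value in environ.items():
        if key.upper() == wanted:
            return value
    return default


def summarize_junit_xml(xml_text):
    """Hypothetical helper: count test outcomes in a JUnit-style XML report."""
    root = ET.fromstring(xml_text)
    summary = {"passed": 0, "failed": 0, "skipped": 0, "failures": []}
    # iter() walks the whole tree, so this works whether the root element is a
    # single <testsuite> or a <testsuites> wrapper around several suites.
    for case in root.iter("testcase"):
        name = f"{case.get('classname', '')}.{case.get('name', '')}"
        if case.find("failure") is not None or case.find("error") is not None:
            summary["failed"] += 1
            summary["failures"].append(name)
        elif case.find("skipped") is not None:
            summary["skipped"] += 1
        else:
            summary["passed"] += 1
    return summary
```

In a build step, `get_env_case_insensitive(os.environ, "unity_branch")` (a hypothetical variable name) returns the same value whether the platform preserved the name's case or upper-cased it. The dictionary returned by `summarize_junit_xml` is the kind of per-build data from which a test report page could be rendered.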
Towards the end of this phase, we transitioned some internal projects (for example, our internal builds of the Mono runtime and classlibs) from TeamCity to Katana. This gave us valuable user testing and feedback in a real-world scenario. We started an internal focus group of users who were working with these “Guinea Pig” projects, and from there we gradually progressed to a more well-rounded feature set.
Phase 5: Production Readiness and Roll-Out
This is where we started counting down the list of to-do items before we could transition the main Unity project. We use Trello for project management on Katana, and it works very well, particularly towards the end of this project, when the team working on Katana had grown. By this point another member, Daniel, had joined our team and started working on Katana, and I had also started working on Katana development and overseeing its configuration management.
Katana’s Trello Board
We migrated the main project to Katana (a manual migration that, because this is a very large project, took quite some time) and invited users to use Katana alongside TeamCity to verify branches before merging them to trunk. During this time, we fixed more issues and gathered more feedback. In late January of this year, we switched our mainline to building officially on Katana instead of TeamCity. We’ve been using it since then, and overall, we are very pleased with the improvements it has brought us.
The Current State
Among other things, we have a good overview of our build status on each branch:
And also an overview of what our buildslaves are doing:
We can see a detailed breakdown of a build or test process:
And we have a nice test report to help us when tests fail:
Katana’s architecture has grown in complexity, but we have been mindful of what elements are important to us. Katana architecture now looks more like this:
In general we have seen vast improvements in:
Overall, I consider Katana a roaring success — both in terms of the improvements it has brought to R&D at Unity and also as a shining example of how to leverage the power of open-source tools. We’re proud to be so instrumental in keeping the wheels turning here in R&D at Unity and I hope you all take advantage of build automation in your own studios.