Katana: Leveraging Open-Source Tools for Continuous Integration

June 2, 2014 in Technology

Hi!  I am Na’Tosha, and, although most people know me as one of the people on the Linux Platform team here at Unity, my main job is working as the Lead Software Developer on the Build & Infrastructure team in R&D.  Back in October of 2011, I wrote a blog post entitled Build Engineering and Infrastructure: How Unity Does It.  A lot has changed at Unity over the years, and how we do infrastructure has changed over the years as well.  The two biggest areas where we have made changes are in the tool we use for our Mercurial hosting and code reviews and in our automated build and continuous integration solution.

For a few years, Unity has used TeamCity from JetBrains for automated building and testing.  As the R&D team grew here at Unity, the demands on the build infrastructure grew along multiple axes (namely the number of users, the number of changesets, and the number of simultaneous branches).  We reached a point where we needed to accommodate several thousand builds per day, and we started seeing performance problems on multiple fronts: servers became slow to respond, we encountered unexplained errors that we could not fix, new changes were processed so slowly that they caused delays, and webpages took several minutes to load.  After a year of back-and-forth with the makers of TeamCity, and a progressively worsening state of our build infrastructure, I came to the conclusion that the best path forward for us was to switch to a solution that better suits our particular needs.  When your way of working is as particular as ours, and you operate at our scale, extensibility and flexibility are a necessity in any tool, and both can be hard to get with off-the-shelf proprietary solutions.  Being a long-time open-source enthusiast, I felt this was a particularly good scenario to leverage the power of open-source to fix our problems.  After some research, I decided that we would build a custom solution on top of Buildbot, an open-source continuous integration framework used by Chromium, Mozilla, Python, and various other projects.  Buildbot is written in Python on top of the Twisted event-driven networking engine.

A Look Back

Phase 1: Prototype and Proof-of-Concept

It was now September of 2012, and, luckily for me, we had just expanded the Build Engineering team with a new hire, Maria, who already had experience working with Buildbot.  We knew this would be a large project, so we started with a two-month-long prototyping/proof-of-concept phase in which Maria explored various aspects of Buildbot to test its potential to scale in the future while maintaining the flexibility we needed for our complex build chains.  We knew we wanted the ability to decouple as many parts of the build infrastructure as possible to allow for easier maintenance and debugging.

katana-prototype.png

An early design diagram for Katana.

Phase 2: Requirements Gathering and Beginning of Implementation

After around two months of prototyping and proof-of-concept work, we were confident the toolset we had chosen would work, given some serious investment.  The next phase of the project involved doing a feature comparison between Buildbot and TeamCity and an initial attempt to gather requirements for a system that could be used in production as a TeamCity replacement.  This part was hard and required some iteration, because 1) we were still learning about all of the capabilities and limitations of Buildbot, and 2) it was hard to figure out which TeamCity features were really useful to us and which ones we could live without.  We started with an initial project plan and schedule, which we revised at regular intervals along the way.  At this point, we brought our IT department in to provide estimates on the amount of hardware we would need to acquire to build a fully-functioning system without taking resources away from our production instances.

Phase 3: The Front-end

The version of Buildbot we forked from (0.8.7) does come with a user interface, but for us, coming from TeamCity, it was practically impossible to use, especially with the number of build configurations and number of builds we have.  Performance was of course also a concern; after our previous experiences, we knew the most important thing was that the UI be fast to load, and that everything else was secondary.  Therefore, we needed someone with UI expertise and a keen eye for design to produce a new UI for us.  We hired a front-end developer, Simon, who was experienced with websites where performance is the main concern.  He was tasked with creating a new Buildbot frontend.

Phase 4: More Implementation

This is where the bulk of feature implementation was done.  At one point during these months we decided to reassess and extend the project schedule after discovering that some significant work was needed on the Buildbot side to handle one of our use-cases, but overall, the project went well.  Here and there, we ended up needing to do some work in our build system (e.g., work around the fact that Python internally stores, and lists, environment variables all in upper-case) and in our test frameworks (e.g., make all tests output a standardized XML file containing test results that we could parse).
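
To illustrate why a standardized test-result file is convenient for a build system, here is a minimal Python sketch of parsing one.  The XML layout, the element and attribute names (testsuite, testcase, status, message), and the function name summarize_results are all assumptions made up for this example; they are not Katana's actual report format.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical report layout, invented for illustration only
# (this is NOT the schema Katana actually uses).
SAMPLE_REPORT = """\
<testsuite name="runtime-tests">
  <testcase name="test_load_scene" status="passed" time="0.42"/>
  <testcase name="test_physics_step" status="failed" time="1.10">
    <message>expected 3 contacts, got 2</message>
  </testcase>
  <testcase name="test_audio_mixer" status="skipped" time="0.00"/>
</testsuite>
"""


def summarize_results(xml_text):
    """Return (counts per status, list of (name, message) for failed tests)."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    failures = []
    for case in root.iter("testcase"):
        # Treat a missing status attribute as "unknown" rather than failing.
        status = case.get("status", "unknown")
        counts[status] += 1
        if status == "failed":
            message = case.findtext("message", default="")
            failures.append((case.get("name"), message))
    return counts, failures


if __name__ == "__main__":
    counts, failures = summarize_results(SAMPLE_REPORT)
    print(dict(counts))
    for name, message in failures:
        print("FAILED {0}: {1}".format(name, message))
```

Because every test framework would emit the same shape of file in a setup like this, one small parser can feed a test report page regardless of which framework ran the tests.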

Towards the end of this phase, we transitioned some internal projects (for example, our internal builds of the Mono runtime and classlibs) from TeamCity to Katana.  This allowed us to gain valuable user testing and feedback in a real-world scenario.  We started an internal focus group of users who were using the “Guinea Pig” projects.  From this, we progressed gradually to a more well-rounded feature set.

Phase 5: Production Readiness and Roll-Out

This is where we started counting down the list of to-do items before we could transition the main Unity project.  We use Trello for project management on Katana, and it works very well; it was particularly helpful towards the end of this project, when the team of people working on Katana had grown.  By this point we had added another member to our team, Daniel, who had started working on Katana, and I had also started working on Katana development and overseeing the configuration management.

katana-trello.png

Katana’s Trello Board

We migrated the main project to Katana (which, because it was a manual migration and is a very large project, actually took quite some time) and invited users to use this alongside TeamCity for verifying branches to be merged to trunk.  During this time, we fixed more issues and gained more feedback.  In late January of this year, we switched our mainline to building officially on Katana instead of TeamCity.  We’ve been using it since then, and overall, we are very pleased with the improvements it has brought us.

The Current State

Katana lives in our buildbot fork on GitHub under a GPLv2 license.  We are still actively developing it; just a few weeks ago we deployed a real-time updating solution that uses Autobahn.

Among other things, we have a good overview of our build status on each branch:

katana-1.png

And also an overview of what our buildslaves are doing:

katana-4.png

We can see a detailed breakdown of a build or test process:

katana-2.png

And we have a nice test report to help us when tests fail:

katana-3.png

Katana’s architecture has grown in complexity, but we have been mindful of which elements are important to us.  Katana’s architecture now looks more like this:

Katana Production.png

In general we have seen vast improvements in:

  • Maintainability
  • Flexibility
  • Reliability
  • Performance

Conclusion

Overall, I consider Katana a roaring success — both in terms of the improvements it has brought to R&D at Unity and also as a shining example of how to leverage the power of open-source tools.  We’re proud to be so instrumental in keeping the wheels turning here in R&D at Unity and I hope you all take advantage of build automation in your own studios.

Comments (8)

  1. Emil "AngryAnt" Johansen

    June 4, 2014 at 12:44 am

    Looks great :D I’ll have to come by and try it out some time.

  2. Corentin

    June 2, 2014 at 10:44 pm

    Thank you for clearing up this point.

  3. Na'Tosha Bard

    June 2, 2014 at 10:21 pm

    @CORENTIN We haven’t announced any plans for the Unity Editor on Linux at this point in time.

  4. Corentin

    June 2, 2014 at 9:42 pm

    Thanks for this success story!
    Sorry if I’m a bit of a troll here, but I have to ask… Do you think we’ll see a Linux Unity 3D Editor anytime soon? Unity is the only reason I have Windows installed… Thank you!

  5. Na'Tosha Bard

    June 2, 2014 at 6:43 pm

    @Lior, it’s true that most companies just pick something off-the-shelf, but it is becoming quite common for companies with large-scale and/or specialized needs to dedicate teams to building/customizing infrastructure that suits their particular needs.

    The front-end is hosted on just one server, the same server that handles scheduling. However, real-time updating of elements on various pages is offloaded to the Autobahn server (to reduce the number of requests on the main server), and the VCS polling / processing is also offloaded to a separate server (this can be intensive when there are a lot of developers making a lot of changes to process and the repositories are big). The main server and VCS poller share a database. The main server and Autobahn server communicate with some simple requests that return the build status via JSON, which is then passed on to the client. We also extracted the build artifact storage out to a completely separate system so we could adjust storage space, speed of storage, etc., independently of the rest of the system.

    Our usage right now varies a lot depending on the day, but analytics indicate we have up to about 140 unique sessions on the busy days of the week . . . which seems about right given the number of developers we have. We are currently split, with Unity 4.x being developed/maintained on TeamCity and 5.x development happening on Katana, so I expect usage will increase some more still. The repository used by Katana is the same one TeamCity used; its size and number of changes continue to grow as normal (our main development repository is a Mercurial repository with over 150,000 revisions).

    It is true that the building itself is done on the slaves, but the server itself is actually responsible for a lot of data processing. I can’t explain all of the reasons behind why TeamCity failed to perform for us at scale, because I don’t know the causes. After working with the system (and its creators) for some time, it seemed to be related to a few different things — one of which is the amount of data from the version control history that TeamCity wanted to process and persist in order to show a lot of the shiny bits in the UI. When we moved our mainline development off of TeamCity, it did get better in terms of responsiveness (but not great), so the number of users using it and the number of builds requested had a real effect as well (it went from a few thousand builds per day to under one thousand). I’m sure TeamCity works very well for most users, and if it is working for you, I would encourage you to keep using it.

    @10FINGERARMY We did evaluate Jenkins, but after having a lot of problems debugging and performance tuning TeamCity, we were, among other things, reluctant to pick another Java-based solution. We also thought the ability to abstract various bits and pieces of a Buildbot-based system to separate machines and services was very appealing when it comes to debugging and performance tuning.

  6. 10FingerArmy

    June 2, 2014 at 6:18 pm

    Very interesting read! Out of curiosity: did you evaluate using Jenkins servers as well? They come with tons of plugins, it’s easy to write your own, etc.

  7. Lior Tal

    June 2, 2014 at 6:02 pm

    Interesting, especially since Unity is not a CI server company. Most companies just pick up a well established product (commercial or open source).

    Regarding the architecture – the front end is hosted on just one server? What’s the load on this server (e.g., users/requests)?

    I wonder (as a TeamCity user) why it gets so slow in rendering those pages, as most of the hard work should be done on the build agents anyway.

  8. Erlend Sogge Heggen

    June 2, 2014 at 4:13 pm

    Didn’t you hear? You can’t be going around calling things “master” and “slave” all willy nilly any longer!
    https://github.com/django/django/pull/2692

    (just to be clear, I’m kidding. It’s a thread well worth skimming through though, for funzies).

    Katana sounds (and looks!) quite marvellous. I hope I get an excuse to use it some time.
