In Unity 4.6 / 5.0, generating batches for rendering the UI is very slow. This is due to a few factors, but ultimately our deadlines kept us from dedicating time to polishing this part of the UI; we focused instead on usability and the API side of things.
In the final sprints of finishing the UI, we were lucky enough to have some help with optimisation. After we shipped, we decided to take a step back and analyse exactly why things were slow and how we could fix them.
If you want the quick and dirty version: we managed to move everything (apart from job scheduling) off the main thread, and drastically improved some of the algorithms we were using in the batch sorting.
We developed a few UI performance test scenes to get a good baseline when testing the performance changes. They stress the UI in a variety of ways. The test most applicable to sorting / batch generation has the canvas completely filled with ‘buttons’. The text on each button overlaps the button background, so there will always be some overhead in calculating what can batch with what. The test constantly modifies the UI elements so that rebatching is required every frame.
The test can be configured to place UI elements in an ordered way (taking advantage of spatial closeness), or a random way (potentially stressing the sorting algorithms more). It was clear to us that batch sorting needed to be fast in both scenarios. In 4.6 / 5.0 it is fast in neither.
It should be noted that the performance test tends to have ~10k UI elements. This is not something we would expect to see in a ‘real’ UI; most UIs we’ve experienced have ~300 items per canvas.
All performance testing and profiling was done on my MacBook Air (13-inch, Mid 2013).
Original (pre 4.6, no stats)
During the 4.6 betas we were getting feedback that batch sorting was very slow when there were many elements on the canvas. This was due to there being basically NO smartness when we tried to figure out batch draw order. We would simply iterate the elements on the canvas, see what each one collided with, and then assign a depth based on some rules. This meant that as we added more elements to the scene, things would get much slower (O(N^2)). This is ‘bad vibes’ in terms of performance.
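The naive scheme can be sketched roughly like this; the types and depth rules are simplified stand-ins for the engine's internals, not the real code. Each new element scans every element already placed, which is where the quadratic cost comes from.

```cpp
#include <algorithm>
#include <vector>

// Illustrative types only; the real engine structures are internal.
struct Rect { float xMin, yMin, xMax, yMax; };

static bool Overlaps(const Rect& a, const Rect& b) {
    return a.xMin < b.xMax && b.xMin < a.xMax &&
           a.yMin < b.yMax && b.yMin < a.yMax;
}

struct Element { Rect bounds; int material; int depth; };

// O(N^2): every new element is tested against every element before it.
void AssignDepthsNaive(std::vector<Element>& elems) {
    for (size_t i = 0; i < elems.size(); ++i) {
        int depth = 0;
        for (size_t j = 0; j < i; ++j) {  // scan everything already placed
            if (!Overlaps(elems[i].bounds, elems[j].bounds))
                continue;
            if (elems[i].material == elems[j].material)
                depth = std::max(depth, elems[j].depth);     // can batch: share depth
            else
                depth = std::max(depth, elems[j].depth + 1); // must draw on top
        }
        elems[i].depth = depth;
    }
}
```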
4.6 / 5.0 release (baseline)
We did some work on the sorting that took advantage of the idea that elements drawn in order would normally be in a similar location on the screen. From this, a bounding box was built (per group of n elements) and new elements were collided with this group before being collided with individual elements. This led to a decent performance increase in scenes that had locality between UI elements, but in randomly ordered scenes, or scenes where elements were spaced far apart, the improvements were only marginal.
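A rough sketch of that grouping idea, with illustrative types rather than the engine's own: each group keeps an enclosing bound, and a query only tests individual members when it hits that bound.

```cpp
#include <algorithm>
#include <vector>

struct Rect { float xMin, yMin, xMax, yMax; };

static bool Overlaps(const Rect& a, const Rect& b) {
    return a.xMin < b.xMax && b.xMin < a.xMax &&
           a.yMin < b.yMax && b.yMin < a.yMax;
}

struct Group {
    Rect bounds{1e9f, 1e9f, -1e9f, -1e9f};  // starts empty (inverted)
    std::vector<Rect> members;

    void Add(const Rect& r) {
        members.push_back(r);
        bounds.xMin = std::min(bounds.xMin, r.xMin);
        bounds.yMin = std::min(bounds.yMin, r.yMin);
        bounds.xMax = std::max(bounds.xMax, r.xMax);
        bounds.yMax = std::max(bounds.yMax, r.yMax);
    }
};

// Tests the query against each group's bound first; 'elementTests'
// counts how many per-element checks were actually needed.
int CountOverlaps(const std::vector<Group>& groups, const Rect& q,
                  int& elementTests) {
    int hits = 0;
    elementTests = 0;
    for (const Group& g : groups) {
        if (!Overlaps(g.bounds, q))
            continue;                       // whole group rejected in one test
        for (const Rect& m : g.members) {
            ++elementTests;
            if (Overlaps(m, q))
                ++hits;
        }
    }
    return hits;
}
```

When elements are clustered, whole groups are rejected with a single test; when they are randomly placed, most group bounds overlap everything and the scheme degenerates, which is exactly the weakness described above.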
If we take a look at this version you can see that when placing random elements the batch performance massively breaks down, taking roughly 100ms to sort and populate a scene… that’s for reals slow.
Looking at this in the timeline profiler also reveals another worrying situation: we are completely blocking anything else from happening. Batch generation runs just before the UI is rendered, which is after LateUpdate and often after the scene cameras are rendered. It would make sense to move batch generation to right after LateUpdate, so that it can happen while the scene would normally be rendered.
Improved sorting (Take 1)
We did a first pass on improving sorting. It was still based on the idea of element locality, but with a few more smarts. It tried to keep groups ‘batchable’, so we could include / exclude batchability at the whole-group level. It was faster, but still fell down when given very spatially separated scenes and did not scale well with the number of renderable elements.
Non spatially grouped input
This is pretty poor. It was clear that we needed a new approach.
Improved sorting (Take 2)
As mentioned earlier, sorting tends to break down and be slow in larger UI scenes with spread elements. We took a step back and thought about what might be a better approach. In the end we decided to implement a canvas grid structure. Each grid square becomes a ‘bucket’ and any UI element that touches a square gets added to that bucket. This means that when adding a new UI element we only need to look into the buckets that the element touches to find what it can / can’t batch with. This led to significant performance improvements when the scene was ordered randomly.
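The bucket idea can be sketched as a simple uniform grid; the names and structure here are illustrative, not the actual implementation:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

struct Rect { float xMin, yMin, xMax, yMax; };

static bool Overlaps(const Rect& a, const Rect& b) {
    return a.xMin < b.xMax && b.xMin < a.xMax &&
           a.yMin < b.yMax && b.yMin < a.yMax;
}

struct Entry { int id; Rect rect; };

class CanvasGrid {
public:
    CanvasGrid(float cellSize, int cols, int rows)
        : cell(cellSize), cols(cols), rows(rows), buckets(cols * rows) {}

    // Add the element to every bucket its rect touches.
    void Insert(int id, const Rect& r) {
        int x0, y0, x1, y1;
        CellRange(r, x0, y0, x1, y1);
        for (int y = y0; y <= y1; ++y)
            for (int x = x0; x <= x1; ++x)
                buckets[y * cols + x].push_back({id, r});
    }

    // Only the touched buckets are searched; an element can sit in
    // several buckets, so hits are deduplicated before returning.
    std::vector<int> QueryOverlaps(const Rect& r) const {
        std::vector<int> hits;
        int x0, y0, x1, y1;
        CellRange(r, x0, y0, x1, y1);
        for (int y = y0; y <= y1; ++y)
            for (int x = x0; x <= x1; ++x)
                for (const Entry& e : buckets[y * cols + x])
                    if (Overlaps(e.rect, r))
                        hits.push_back(e.id);
        std::sort(hits.begin(), hits.end());
        hits.erase(std::unique(hits.begin(), hits.end()), hits.end());
        return hits;
    }

private:
    void CellRange(const Rect& r, int& x0, int& y0, int& x1, int& y1) const {
        x0 = std::max(0, (int)std::floor(r.xMin / cell));
        y0 = std::max(0, (int)std::floor(r.yMin / cell));
        x1 = std::min(cols - 1, (int)std::floor(r.xMax / cell));
        y1 = std::min(rows - 1, (int)std::floor(r.yMax / cell));
    }

    float cell;
    int cols, rows;
    std::vector<std::vector<Entry>> buckets;
};
```

The key property is that query cost now depends on local density rather than total element count, which is why randomly ordered scenes stopped being a worst case.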
We reached the first step on the path to pulling the UI off the main thread by using the new Geometry Job system, an internal feature introduced in Unity 5 that can be used to populate vertex / index buffers in a threaded way. The changes made here allowed us to move a whole bunch of code off the main thread, as the timeline below shows. There is some small overhead in managing the geometry job (we have to create the job and job instructions, for example, which requires some memory), but this is negligible compared to the previous main-thread cost.
Simplifying the batch sort
During the optimisation process, we did a bunch of smaller, profiler-guided optimisations. The biggest gain was probably from vectorising a bunch of our rectangular overlap checks in the sorting. Basically, getting our data into a super nice DOD layout ready for overlap checking, then checking with one call… it removed the overlap checks from a hot spot in the C++ profiler, where before they accounted for ~60% of the sort time. As you can see, doing this really helped our sort performance a bunch. But there was still a ways to go, and that was taking the 7ms off the main thread.
Taking it all off (the main thread)
The next logical step for us was to remove UI generation from the main thread. For this, we used the internal Job system to schedule a number of tasks. Some of them are serial, others are able to go wide and execute in parallel. Here is the breakdown:
1) Split incoming UI instructions into renderable instructions (one UI instruction can contain many draw calls due to submeshes and multiple materials). This task goes wide. It allocates memory to accommodate the maximum possible number of renderable instructions; the incoming instructions are then processed in parallel and placed into the output array. This array is then ‘compressed’ down in a combine job into a contiguous section of memory containing just the valid instructions.
2) Sort the renderable instructions. Compare depths, overlaps, etc. Basically, sort for a command buffer that requires the LEAST amount of state change when rendering.
3) Batch Generation
- Generate the render command buffer. Create draw calls (batches / sub batches).
- Generate the transform instructions that the geometry job can use.
In the example below, you can see the geometry job ‘stall’ as it waits for batch generation to complete. We need to do more testing around this, but as these scenes have no renderable elements aside from the UI, the issue should decrease as scene complexity increases.
Other performance things we did
- 2D Rect clipping (it turns out most UIs don’t really need the stencil buffer, and this reduces draw calls and state changes).
- 2D Rect culling (if your element is outside the render bounds… cull it).
- Smarter canvas command buffer
- Allow text / normal elements to share the same shaders / materials
- Massively reduce set pass calls
- Push a lot of UI specific data into material property blocks
- Normally 1 set pass call for a UI, then multiple draw calls
- Combine UI into 1 mesh / index buffer
- Use DrawIndexRange for rendering
- One VBO / index buffer that resizes as needed
- Splits to a new draw call when > 2^16 indices
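The 2D rect culling and clipping items in the list above boil down to simple rect tests. A minimal sketch with illustrative names (not the engine's API): culling drops elements fully outside the render bounds, and clipping shrinks a rect to the clip area instead of touching the stencil buffer.

```cpp
#include <algorithm>

struct Rect { float xMin, yMin, xMax, yMax; };

// True when the element lies completely outside the render bounds and
// therefore contributes no geometry at all.
static bool ShouldCull(const Rect& element, const Rect& renderBounds) {
    return element.xMax <= renderBounds.xMin ||
           element.xMin >= renderBounds.xMax ||
           element.yMax <= renderBounds.yMin ||
           element.yMin >= renderBounds.yMax;
}

// Intersection of the element rect with the clip rect; geometry is
// generated only for this area, so no stencil state is needed.
static Rect ClipRect(const Rect& r, const Rect& clip) {
    return { std::max(r.xMin, clip.xMin), std::max(r.yMin, clip.yMin),
             std::min(r.xMax, clip.xMax), std::min(r.yMax, clip.yMax) };
}
```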
Right now, the sorting / batch generation is behaving acceptably; there are, of course, things we can do to make it faster, but the biggest issue is the time it takes to process the geometry job. As it’s now off the main thread and an isolated job, it’s a good candidate for tidying and speeding up. I’m fairly certain we are still doing some dumb things (is that branching in a tight inner loop?), and it’s also using a bunch of slow maths that would handle being vectorised very nicely.
At a higher level it is also worth looking at the situations that lead to a rebatch happening and attempting to minimise those. As always there is more work to do, but what is described here is in Unity 5.2 and already a significant improvement.
Many of the new features in Unity 5.2 are pretty great. They allowed us to completely minimise the cost of the UI system on the main thread, as well as optimise the batching in general. While working, we used a strongly profiler-guided approach to find out where the issues were; in one or two places, we decided to step back completely and try again when we realised the old solution was inadequate. Internally at Unity we are doing a lot more of this kind of work, really trying to address the pain points and issues that you are reporting to us, in a way that makes Unity better for everyone. Thank you for reporting bugs and real projects with issues for us to investigate.