Tales from the Optimization Trenches
As a Developer Relations Engineer in our EMEA Consulting & Development team, I spend most of my time visiting Unity’s largest customers and helping them resolve performance issues on their projects. If you’re interested in learning how we do that, so that you can apply the same knowledge and techniques to your own projects, please read on.
At Unite Copenhagen 2019 I presented a session titled ‘Tales From the Optimization Trenches’. My intention was to help intermediate Unity users who might have already seen the advice in our Best practice guides, but who lack the practical knowledge required to diagnose and resolve other performance issues on their own using profiling and analysis tools.
My talk covers:
- An overview of the Developer Relations Engineer role and our core activity: delivering Project Reviews.
- An introduction to optimization and profiling.
- Three separate sections on CPU, GPU, and memory footprint optimization, each one of them featuring two practical examples based on real issues we’ve seen in the field, along with a breakdown of the tools and techniques we used to overcome them.
- A series of general optimization rules.
You can find the video below and the accompanying slides here.
There wasn’t enough time to cover everything, and I’ve had some great follow-up discussions with Unite attendees that aren’t captured in the video, so I wrote this blog post to share all of this extra material with you. Still, I highly recommend that you watch the video first.
About the Project Reviews
Project Reviews are the core of our work. We travel to our customers’ offices and typically spend two full days with them: familiarizing ourselves with their projects, asking questions to understand their requirements and the design decisions they’ve made, and using various profiling tools to detect performance bottlenecks. For well-architected projects that have low build times (modular scenes, heavy usage of AssetBundles, etc.), we actually make changes while onsite and reprofile in order to uncover new issues. This is why it’s so important to optimize build times: doing so enables more frequent iterations. This is even more true for projects whose target hardware differs significantly from the hardware used during development, such as mobile devices and game consoles.
Luckily for us, no two Project Reviews are the same, as our pool of customers is extremely diverse and their projects encompass a wide range of platforms and requirements. When there are complex problems that we cannot resolve during the visit, we capture as much information as we can and investigate further back at the Unity offices, consulting specialized developers across our R&D departments if need be. The deliverables depend on the needs of the customer, but typically they consist of a written report that summarizes our findings and provides recommendations. When deciding what to focus on, our goal is always to deliver whatever provides the greatest value to our customers.
Even though we have access to Unity’s source code, when conducting Project Reviews we try to put ourselves in the same position as our customers. That is, we optimize their projects using publicly available profiling tools and best practices. If we actually need to peek under the hood in order to get to the bottom of a performance issue, we do our best to update our documentation afterward in order to make this new knowledge available to all our users and have a greater impact.
CPU bound vs GPU bound
As discussed during the presentation, before we start optimizing our project we need to find out the actual bottlenecks. One way we can do that is by inspecting the breakdown of our CPU usage using the Unity Profiler. If most of our frame time is spent on ‘Rendering’, as illustrated in the image below, we then need to determine whether we’re CPU bound or GPU bound.
Rendering is a process that is performed jointly by the CPU and the GPU. A comprehensive description of this process is outside the scope of this article but, in a nutshell, the rendering of a scene comprises the following steps:
1. For each group of objects that share a material:
   a. The CPU sends a series of commands to the GPU in order to set its internal state (e.g., shader, bound textures, vertex formats, etc.). This step is also known as a ‘SetPass’ call.
   b. The CPU sends a batch of geometry to the GPU so that it can be rendered using the state set in step 1.a. This step is also referred to as a ‘draw call’ and it’s quite expensive.
   c. If more geometry using the same material needs to be rendered, go to step 1.b.
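The loop above can be mimicked with a small toy model to see why sharing materials reduces the number of state changes. This is only a sketch: `Renderable` and `RenderCostModel` are hypothetical names, not Unity types, and the model ignores everything a real renderer does beyond counting the two kinds of calls.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy model only: Renderable is a hypothetical type, not part of Unity.
public sealed class Renderable
{
    public string Material;   // stands in for shader, textures and other GPU state
    public int VertexCount;

    public Renderable(string material, int vertexCount)
    {
        Material = material;
        VertexCount = vertexCount;
    }
}

public static class RenderCostModel
{
    // Counts state changes and geometry submissions when objects are
    // grouped by material before being submitted.
    public static (int setPassCalls, int drawCalls) Count(IEnumerable<Renderable> objects)
    {
        int setPassCalls = 0;
        int drawCalls = 0;
        foreach (var group in objects.GroupBy(o => o.Material))
        {
            setPassCalls++;               // GPU state is set once per material group
            drawCalls += group.Count();   // one geometry submission per object in the group
        }
        return (setPassCalls, drawCalls);
    }
}
```

In this model, three objects that use two materials between them cost two state changes and three geometry submissions; if each object had its own material, the state changes would rise to three.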
Again, there are several details and caveats to the algorithm above, but the key takeaway is that rendering is an activity shared between the CPU and the GPU. As illustrated in the screenshot below, certain tools, such as Xcode, can give us detailed information on how much time is actually spent by each of them.
This type of information can also be found in the Unity Profiler, though note that the GPU metrics are not always available, as they depend on the support provided by the graphics card and its drivers:
If we cannot get CPU and GPU timings using our profiling tools, we can always inspect a random frame in the Unity Profiler. If there’s a call to Gfx.WaitForPresent in there and that call is taking a considerable amount of time, it means that the CPU is waiting for the GPU to finish processing all the rendering commands and, thus, we’re GPU bound (please refer to this manual page in order to understand the meaning behind other markers, such as WaitForTargetFPS and Gfx.PresentFrame):
There are many factors that could have an impact on the GPU workload, such as:
- Fill rate: our application is coloring an excessive number of pixels multiple times on a given frame, a process known as ‘overdraw’.
- Memory bandwidth: our application is sending a large amount of texture data to the GPU. This can be alleviated by reducing the number of textures (via atlasing, for example), reducing their size, and setting them to a compressed format when applicable.
- Vertex processing: our application is sending too much geometry to the GPU. We covered this scenario as part of one of our examples during the presentation at Unite.
Alternatively, if we’re CPU bound, there could be many things contributing to CPU time (e.g. physics, gameplay code, etc.), and we should check the profiler. If the profiler says we’re spending a lot of time in Rendering, it probably means that the CPU is busy sending too many commands to the GPU. This can be optimized by reducing both the number of state changes (or ‘SetPass’ calls) and the number of batches. Please refer to our ‘Fixing Performance Problems’ tutorial for a deeper discussion on this subject.
Case study: CPU spikes when loading data
A problem we typically see in customer projects is performance hiccups during the startup phase of the application or when transitioning to a new level. These hiccups manifest themselves as spikes in the Unity Profiler:
They are typically caused by a combination of expensive computation and large memory allocations. In this example, the CPU spike causes a stall of nearly 10 seconds and a managed allocation of 3.8 GB, as seen in the screenshot below:
These spikes are undesirable for two main reasons. The first is that their excessive length interrupts the flow of the application. One way to ‘mask’ the stall caused by the CPU spike is to use a loading screen, though note that this solution won’t work if we need to show animated elements on screen, as the animations will stall during the loading process. The second is that their large allocations permanently increase the size of the managed heap. Unity’s automatic memory management system works in such a way that unreferenced memory is reused in subsequent allocations, but the overall size of the managed heap never decreases; it can only go up. In addition, the garbage collector is non-compacting: it does not move live objects to close the gaps between them, so the heap can also become fragmented. Please refer to this entry in our documentation and this article from Unity’s Learn website.
The spikes are normally caused by a combination of factors. Based on what we see in the field, the application is usually storing data in a non-optimized format (e.g., JSON or XML), and the parsers need to allocate a significant amount of memory in order to process the content. Those allocations, coupled with the intensive computations required to operate on said data (and their associated memory allocations), are often the main culprits.
In order to alleviate these problems, we usually recommend that customers do two things: implement a ‘budgeted time manager’, a system that instantiates and initializes objects within a per-frame time limit, and add support for a binary format. The ‘budgeted time manager’ spreads the cost across multiple frames, whereas the binary format helps minimize the size of the allocations.
This idea of having a ‘budgeted time manager’ instead of loading all the data in a single method is analogous to the difference between the regular garbage collector and the incremental garbage collector: while the first one stalls the frame until the whole list of managed objects has been processed, the second one spreads the work across multiple frames.
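A minimal sketch of what such a ‘budgeted time manager’ could look like is shown below. It is not the implementation used by any particular customer: the class name, its methods, and the choice of a simple queue of `Action` delegates are all assumptions made for illustration, and the class deliberately avoids Unity types so that the idea stands on its own.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical helper, not a Unity API: runs queued work items until the
// per-frame time budget is spent, then defers the rest to the next frame.
public sealed class BudgetedTimeManager
{
    private readonly Queue<Action> _pending = new Queue<Action>();
    private readonly Stopwatch _stopwatch = new Stopwatch();
    private readonly double _budgetMs;

    public BudgetedTimeManager(double budgetMilliseconds)
    {
        _budgetMs = budgetMilliseconds;
    }

    public int PendingCount => _pending.Count;

    // Queue one small unit of work, e.g. instantiating a single object.
    public void Enqueue(Action workItem)
    {
        if (workItem == null) throw new ArgumentNullException(nameof(workItem));
        _pending.Enqueue(workItem);
    }

    // Call once per frame (in Unity, typically from a MonoBehaviour's Update()).
    // Always runs at least one item so that loading keeps making progress even
    // if a single item exceeds the budget. Returns the number of items executed.
    public int Tick()
    {
        int executed = 0;
        _stopwatch.Restart();
        while (_pending.Count > 0)
        {
            _pending.Dequeue().Invoke();
            executed++;
            if (_stopwatch.Elapsed.TotalMilliseconds >= _budgetMs)
                break;
        }
        return executed;
    }
}
```

Because the budget is only checked between work items, this approach relies on each item being small; a single item that parses an entire file would still stall the frame.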
Due to their nature, binary formats are usually harder to work with during development. So our recommendation to customers is not to remove support for text formats entirely. Instead, we advise them to support both and use the text or binary formats depending on whether they are executing the development or release versions of their applications, respectively.
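To illustrate why a binary format tends to allocate less than a text format, here is a small sketch that writes and reads fixed-size records with `BinaryWriter` and `BinaryReader`. The `EnemySpawn` type and its layout are hypothetical and chosen only for the example; a real project would define its own records and versioning.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical data type used only for illustration.
public struct EnemySpawn
{
    public int Id;
    public float X, Y, Z;
}

public static class EnemySpawnBinary
{
    // Writes a record count followed by fixed-size records (16 bytes each).
    public static byte[] Write(EnemySpawn[] spawns)
    {
        using (var stream = new MemoryStream())
        using (var writer = new BinaryWriter(stream, Encoding.UTF8))
        {
            writer.Write(spawns.Length);
            foreach (var s in spawns)
            {
                writer.Write(s.Id);
                writer.Write(s.X);
                writer.Write(s.Y);
                writer.Write(s.Z);
            }
            writer.Flush();
            return stream.ToArray();
        }
    }

    // Reads the records back without building intermediate strings or
    // a document tree, which is where text parsers tend to allocate.
    public static EnemySpawn[] Read(byte[] data)
    {
        using (var stream = new MemoryStream(data, writable: false))
        using (var reader = new BinaryReader(stream, Encoding.UTF8))
        {
            int count = reader.ReadInt32();
            var spawns = new EnemySpawn[count];   // one allocation for all records
            for (int i = 0; i < count; i++)
            {
                spawns[i].Id = reader.ReadInt32();
                spawns[i].X = reader.ReadSingle();
                spawns[i].Y = reader.ReadSingle();
                spawns[i].Z = reader.ReadSingle();
            }
            return spawns;
        }
    }
}
```

One possible way to keep both formats is to select the loader with a conditional-compilation symbol such as Unity’s DEVELOPMENT_BUILD, so that development builds read the text files and release builds read the binary ones.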
Some comments regarding Garbage Collection
In the ‘GC spikes in a fast-paced game’ example, we advised the customer to enable the Incremental Garbage Collector and reduce the frame time as much as possible in order to give the algorithm enough room to operate at the end of every frame. One point that wasn’t stressed enough during the presentation is that the incremental garbage collector is not an excuse to become lax when it comes to minimizing the number and size of managed memory allocations. Its main benefit compared to the regular garbage collector is that it spreads its workload across multiple frames instead of stalling the frame until the entire pool of managed objects is processed, which is especially important for a steady framerate. It does not reduce the total amount of work that allocations generate.
It is also possible to disable the garbage collector altogether from script:
GarbageCollector.GCMode = GarbageCollector.Mode.Disabled;
This technique can be useful in scenarios where we don’t want to pay any processing costs associated with the garbage collection algorithm. Though please note that, in order to do that, we need to ensure that no allocations are taking place when the garbage collector is disabled because, as discussed during the presentation, the operating system will happily pull the plug on our application if our memory usage goes above a certain threshold. This is especially true on mobile platforms such as Android and iOS.
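As a sketch of how this could be wrapped in practice, the component below disables collection for the duration of a latency-sensitive section and restores it afterwards. The class and method names are hypothetical, and the explicit `System.GC.Collect()` calls before and after the section are an assumption about what a project might want rather than a requirement.

```csharp
using UnityEngine;
using UnityEngine.Scripting;

// Hypothetical component: turns the garbage collector off during a
// latency-sensitive section and turns it back on afterwards.
public class GcFreeSection : MonoBehaviour
{
    public void BeginCriticalSection()
    {
        // Collect first so the section starts with as much free heap as possible.
        System.GC.Collect();
        GarbageCollector.GCMode = GarbageCollector.Mode.Disabled;
    }

    public void EndCriticalSection()
    {
        GarbageCollector.GCMode = GarbageCollector.Mode.Enabled;
        // Reclaim anything that was allocated while collection was off.
        System.GC.Collect();
    }
}
```

Any managed allocation made between the two calls stays on the heap until collection is re-enabled, so the section in between should be profiled to confirm that it is allocation-free.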
Case study: FPS with authoritative server
A few months ago we conducted a Project Review of a multiplayer first-person shooter game featuring an authoritative server architecture, with the server running in headless mode. We performed a memory capture using the Unity Memory Profiler and discovered that there were hundreds of MBs allocated to meshes, light probes, audio clips, mesh renderers, and various other types of objects that were not actually required in a headless server.
While this extra memory footprint didn’t prevent the server from running a single multiplayer session, it was clearly impacting its ability to scale. More specifically, being able to increase the number of active instances in a given server required a significant increase in memory.
In this scenario, we advised the customer to split every game level scene into two parts and store them in separate AssetBundles. The first part is the ‘logical scene’, which contains all the information required by the headless server, whereas the second part is the ‘visual scene’, which contains all the information that is used exclusively by the clients.
Note that this division can cause some workflow problems. More specifically, artists and level designers can no longer work in a single scene. Rather than disrupting the content creators’ workflow, we recommended that the customer leave the authoring scenes as they are and split them into their ‘logical’ and ‘visual’ parts as part of the build process.
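At runtime, the split could be consumed along the lines of the sketch below, where the server loads only the logical scene and clients additionally load the visual one. Everything project-specific here is hypothetical: the class name, the bundle paths, and the scene names are placeholders, and using `Application.isBatchMode` to detect a headless server assumes the server is launched with the -batchmode flag.

```csharp
using System.Collections;
using UnityEngine;
using UnityEngine.SceneManagement;

// Hypothetical loader: bundle paths and scene names are placeholders.
public class LevelLoader : MonoBehaviour
{
    public string logicalBundlePath;
    public string visualBundlePath;
    public string logicalSceneName;
    public string visualSceneName;

    // Start with StartCoroutine(LoadLevel()).
    public IEnumerator LoadLevel()
    {
        yield return LoadSceneFromBundle(logicalBundlePath, logicalSceneName);

        // Assumption: the headless server runs with -batchmode, so it skips
        // meshes, audio clips and other client-only content entirely.
        if (!Application.isBatchMode)
            yield return LoadSceneFromBundle(visualBundlePath, visualSceneName);
    }

    private static IEnumerator LoadSceneFromBundle(string bundlePath, string sceneName)
    {
        AssetBundleCreateRequest request = AssetBundle.LoadFromFileAsync(bundlePath);
        yield return request;

        if (request.assetBundle == null)
        {
            Debug.LogError("Failed to load AssetBundle: " + bundlePath);
            yield break;
        }

        // The scene becomes loadable by name once its bundle is loaded.
        yield return SceneManager.LoadSceneAsync(sceneName, LoadSceneMode.Additive);
    }
}
```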
Deep profiling and profiler markers
As we’ve discussed before, we should aim for nearly zero per-frame allocations in our applications’ core loops. Doing so will significantly reduce the overhead caused by the garbage collection algorithm. The Unity Profiler is the best tool for the job, but by default the reported call stack only goes as deep as the first level of calls from the engine’s native code into the application’s scripting code (e.g., MonoBehaviour.Start(), MonoBehaviour.Update() and similar methods). In practice, this means that if our scripts invoke methods from other scripts (which they usually do), we won’t be able to easily identify the exact place where the managed allocations are taking place.
One way to work around this problem is by explicitly adding Profiler Markers to our scripts. Doing so will record extra information during the profiling process and will help us narrow down the source of our allocations.
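As an illustration, the sketch below wraps two methods in `ProfilerMarker` scopes so that they show up as named samples in the Profiler. The component, its methods, and the marker names are hypothetical; the per-frame list allocation is there only to give the inner marker something to reveal.

```csharp
using System.Collections.Generic;
using Unity.Profiling;
using UnityEngine;

// Hypothetical component used to show where markers could be placed.
public class EnemySpawner : MonoBehaviour
{
    // Marker names are arbitrary; they appear as samples in the Profiler window.
    static readonly ProfilerMarker s_UpdateMarker =
        new ProfilerMarker("EnemySpawner.UpdateEnemies");
    static readonly ProfilerMarker s_PathMarker =
        new ProfilerMarker("EnemySpawner.BuildPath");

    void Update()
    {
        using (s_UpdateMarker.Auto())
        {
            UpdateEnemies();
        }
    }

    void UpdateEnemies()
    {
        using (s_PathMarker.Auto())
        {
            // Deliberate per-frame allocation: its GC.Alloc sample will be
            // nested under the "EnemySpawner.BuildPath" marker.
            var path = new List<Vector3>();
            path.Add(transform.position);
        }
    }
}
```

With the markers in place, the allocation is attributed to the inner sample instead of being lumped into the top-level Update() entry.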
A second way is to enable Deep Profiling. You can find specific instructions on how to do that in this article from Unity’s Learn website. Bear in mind that deep profiling adds a lot of overhead, which significantly slows down the application, so the reported timings will no longer be accurate. Our recommendation is to conduct a first profiling session with deep profiling disabled and take notes on which scenarios cause unwanted managed allocations. If the reported call stacks are not detailed enough to track down the source of those allocations, conduct a second session with deep profiling enabled.
Please note that before Unity 2019.3, deep profiling was only available when using the Mono scripting backend. This limitation has been lifted in the beta cycle of Unity 2019.3, which provides support for both Mono and IL2CPP backends. From the release notes:
- Profiler: Added Deep Profiler support to Mono and IL2CPP players.
- Profiler: Added Deep Profiling support build option to players. When you build a Player with Deep Profiling, C# code instrumentation can be dynamically enabled and disabled.
- Profiler: Added managed allocation callstacks support in players. When you enable callstacks collection, GC.Alloc samples contain C# code callstack.
The fact that deep profiling is now available for the IL2CPP backend means that developers will now be able to perform deep profile captures on platforms that only support IL2CPP, such as iOS. On top of that, the added support for managed allocation call stacks in players should help developers find the source of their allocations without having to resort to deep profiling.
Performance optimization is a large topic that requires a wide range of skills: understanding how the underlying hardware operates, along with its limitations; understanding the various classes and components provided by Unity; knowing algorithms and data structures; knowing how to use profiling tools; and having the creativity to find efficient solutions that also satisfy the design requirements.
We want to help you make your Unity applications as performant as they can be, so if there’s any optimization topic that you’d like more information on, please let us know via the comments section.