New performance improvements in Unity 2020.2
The Unity 2020.2 release features several optimizations that are now available for testing in beta. Read on to learn where you can expect major speed-ups and to get behind-the-scenes insights into what we’ve done to make these improvements.
Writing high-performance code is an integral part of efficient software development and has always been part of the development process at Unity. Two years ago, we took the bold step of forming a dedicated Optimization Team, which I now have the privilege of leading, to focus on performance as a feature in its own right. See below for an overview of what we’ve got for the Unity 2020.2 release, and check out the Unity 2020.2 beta release notes for a list of all the other improvements.
Nested Prefab optimizations
The Optimization Team worked closely with the original developers of this feature, the Scene Management Team, on various optimizations to Nested Prefabs, including:
- Reduced modifications of dynamic array of Properties
- Changed the sorting strategy for Modification array
- Changed to using a hash set for faster lookups
When loading instances of Prefabs, we apply modifications to the various properties that are different in the instance, compared to the original Prefab asset. These are PropertyModifications. When merging PropertyModifications, there are updates and insertions to a dynamic array, and as this struct is very large, this quickly becomes costly. By not erasing the PropertyModifications from the new Property array, but keeping track of the already updated properties, the method was sped up by 60x (from 3,300 ms to 54 ms on a test project).
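To make the idea concrete, here is a minimal Python sketch of merging without erasing. It is not the engine's C++ code: the `PropertyModification` fields and the `merge_modifications` function are hypothetical stand-ins, and the sketch assumes a modification is identified by its target and property path.

```python
from dataclasses import dataclass


@dataclass
class PropertyModification:
    # Hypothetical stand-in for the much larger engine struct.
    target_id: int
    property_path: str
    value: str


def merge_modifications(existing, incoming):
    """Merge `incoming` into `existing` and return a new list.

    Matched entries are never erased from `incoming` (each erase would
    shift the rest of the array). Instead, the indices that were already
    applied are recorded, and the remainder is appended in one pass.
    """
    # Map (target, path) -> index in `incoming`; a later duplicate wins.
    incoming_index = {
        (m.target_id, m.property_path): i for i, m in enumerate(incoming)
    }
    consumed = set()

    merged = []
    for mod in existing:
        i = incoming_index.get((mod.target_id, mod.property_path))
        if i is None:
            merged.append(mod)          # untouched by the update
        else:
            merged.append(incoming[i])  # updated in place
            consumed.add(i)             # remember it, do not erase it

    # Append the genuinely new modifications, skipping consumed entries
    # and superseded duplicates.
    for i, mod in enumerate(incoming):
        key = (mod.target_id, mod.property_path)
        if i not in consumed and incoming_index[key] == i:
            merged.append(mod)
    return merged
```

The `incoming` list is left untouched, so the cost of the merge is dominated by two linear passes rather than by repeated array erasures.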
When updating PropertyModifications, the modification list might not be sorted correctly, thus a new sort is needed. Previously this was done by sorting the old modifications into a new array. On a test project, this took 11 seconds. By instead doing the sorting in containers and pointing to the modifications, the sorting was brought down to 44 ms (250x faster), and in the case where no modifications were needed down to 11 ms (800x faster).
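The general technique can be sketched as sorting a lightweight list of indices instead of moving the large records, with an early exit when the input is already in order. This is an illustrative Python sketch only; `sorted_order` is a hypothetical helper, and reading "no modifications were needed" as "already sorted" is our assumption. In Python, lists already hold references, so the benefit is conceptual here; in a language with value-type structs, the saving comes from not copying each struct during the sort.

```python
def sorted_order(mods, key):
    """Return the indices of `mods` in sorted order.

    Returns None when `mods` is already sorted, so the caller can skip
    any reordering work entirely.
    """
    keys = [key(m) for m in mods]

    # Fast path: a single linear scan detects already-sorted input.
    if all(keys[i] <= keys[i + 1] for i in range(len(keys) - 1)):
        return None

    # Sort small integers that point at the records; the records stay put.
    return sorted(range(len(mods)), key=keys.__getitem__)
```

A caller would walk the returned indices to visit the modifications in order, or use the original list directly when the result is `None`.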
Additionally, searching for the propertyPath in the list of modifications was sped up by 50x (from 300 ms to 6 ms) by changing the container to a hash set. This gives an overall optimization in generating property diffs.
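The effect of the container change can be shown with a small Python sketch. The function names are hypothetical, and the sketch only contrasts a linear search with a hashed lookup; it does not reproduce the engine's data structures.

```python
def changed_paths_linear(paths_to_check, modified_paths):
    """Look up each path in a list: every lookup scans the whole list."""
    return [p for p in paths_to_check if p in modified_paths]


def changed_paths_hashed(paths_to_check, modified_paths):
    """Build a hash set once, then each lookup is O(1) on average."""
    modified = set(modified_paths)
    return [p for p in paths_to_check if p in modified]
```

Both functions return the same result; only the cost per lookup differs, which matters once the modification list grows large.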
- Optimized a nested loop of linear searches in RegisterScriptedImporters
Database scalability tests showed that the performance of the Editor scripted importers registration function scaled badly as the number of importers being registered increased. The function was optimized by storing importers in a dictionary keyed by file extension, to speed up searching for conflicts. The optimized function was found to be between 12 and over 800 times faster when processing 100 to 5,000 importers (for overall improvement, see the graph on the right).
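As a rough illustration of the dictionary-by-extension approach, the Python sketch below groups importers by the extensions they claim and reports any extension claimed more than once. The `find_conflicts` function, its input shape, and the case-insensitive comparison are all assumptions made for this example, not the Editor's actual registration code.

```python
from collections import defaultdict


def find_conflicts(importers):
    """Return {extension: [importer names]} for contested extensions.

    `importers` is an iterable of (name, extensions) pairs. Grouping by
    extension takes one pass, replacing a nested loop that compared
    every importer against every other importer.
    """
    by_extension = defaultdict(list)
    for name, extensions in importers:
        for ext in extensions:
            # Assumption: extensions are compared case-insensitively.
            by_extension[ext.lower()].append(name)

    return {
        ext: names for ext, names in by_extension.items() if len(names) > 1
    }
```

With this layout, registering one more importer costs work proportional to its own extension list, independent of how many importers are already registered.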
Editor workflow optimizations
- Reduced string copies and allocs in key Editor tasks
- Optimized find references in scenes by using temp memory
The team replaced lots of slow string memory allocations with temp memory labels, for strings that only exist within a single frame. In practice, most strings in Unity exist only as local variables within a single function call, so they can use a fast memory allocator. Most string utility functions now use temp memory.
Some of this work has already landed in 2020.1. Below are some graphs showing how many slow string allocations were removed by this work. The graphs show the number of slow string allocations between 2020.1.0a12 and 2020.2.0a20 in different projects, over several iterations of improvements (x-axis is iteration, y-axis is the number of string allocations):
Another Editor workflow optimization came from FindReferencesInScene. Previously, right-clicking an asset in the Project View and selecting Find References in Scene could be slow in large scenes.
By avoiding excessive smart pointer dereferences, and making use of temp memory, we improved the speed for general use cases by approximately 10%.
In cases where the Scene was missing references, attempting to dereference their smart pointers meant trying to load an invalid filename from the filesystem every time. By detecting the invalid filename, and avoiding asking the filesystem to open a file that we know will fail, we reduced the search time by up to 3x.
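A minimal Python sketch of the idea follows. The `resolve_reference` function, the injected `open_file` callable, and the `known_missing` cache are hypothetical; in particular, remembering failed paths in a set is our assumption about one way to "detect" an invalid filename, not a description of the engine's check.

```python
def resolve_reference(path, open_file, known_missing):
    """Open `path` unless it is already known to be invalid.

    `open_file` is any callable that raises FileNotFoundError for a
    missing file. `known_missing` is a set shared across calls.
    """
    # Reject empty or previously failed paths without touching the disk.
    if not path or path in known_missing:
        return None
    try:
        return open_file(path)
    except FileNotFoundError:
        known_missing.add(path)  # never ask the filesystem again
        return None
```

When the same broken reference appears many times in a scene, only the first lookup pays the cost of a failed filesystem call.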
- JobQueue optimization giving a ~2x speed-up for scheduling of large parallel jobs
In collaboration with other internal teams, we have been working on optimizations to the JobQueue. This started with profiling the DOTS Sample project early in the year, which highlighted an unexpectedly high cost in AtomicStack::Pop(). Further investigation showed that the problem was in the memory management system in the JobQueue, especially for the JobInfo, which was using an AtomicStack as a memory management pool of items.
In the Data-Oriented Technology Stack (DOTS), there are ForEach jobs that require a Pop() per element in the ForEach for memory allocation and a Push() per element for memory deallocation. This leads to contention on the head item in the AtomicStack.
Another team within Unity implemented a new atomic container specifically for the memory management use case, with support for allocating chunks of elements as a single operation, to avoid the Pop() per element in a ForEach job.
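The benefit of chunked allocation can be sketched in Python as follows. This is only an analogy: the real container is a lock-free atomic structure in C++, whereas the hypothetical `ChunkedPool` below uses a plain lock as a stand-in for the contended head of the stack. The `lock_acquisitions` counter exists purely to make the reduction in synchronization visible.

```python
import threading


class ChunkedPool:
    """Free-list pool that hands out many items per synchronized step."""

    def __init__(self, items):
        self._free = list(items)
        self._lock = threading.Lock()  # stand-in for atomic operations
        self.lock_acquisitions = 0     # instrumentation for the example

    def pop_chunk(self, count):
        """Take up to `count` items with a single synchronized operation."""
        if count <= 0:
            return []
        with self._lock:
            self.lock_acquisitions += 1
            chunk = self._free[-count:]
            del self._free[-count:]
        return chunk

    def push_chunk(self, items):
        """Return a batch of items with a single synchronized operation."""
        with self._lock:
            self.lock_acquisitions += 1
            self._free.extend(items)

    def free_count(self):
        return len(self._free)
```

Allocating 64 elements this way synchronizes once instead of 64 times, which is the contention that a per-element Pop() would otherwise create.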
Early local performance testing results were encouraging, showing improved performance scaling of up to 2x as the number of job worker threads is increased:
A member of the DOTS Team pointed out the use case where the new container should show a performance benefit, i.e., the JobQueue ForEach job.
This example is on Android. Green is the new code, Red is the old code:
- Eliminated unnecessary searching by storing a dedicated list of main camera nodes
Using Camera.main has always been ill-advised because of the searching it performs. Previously, all GameObjects with tags were searched, and any GameObjects with a matching tag were pulled out into a temporary array. Then that second list would be searched, and if any object had an enabled camera component, it was returned.
The new approach stores a dedicated list of objects with the MainCamera tag, and does not use a secondary array of potential matches. Instead, the list is queried directly, and as soon as a match is found, it is returned. All objects that are considered are objects with the MainCamera tag, so the chance of success is much higher.
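The difference between the two lookups can be sketched in Python. The `GameObject` and `Scene` classes here are simplified, hypothetical models written for this example; they are not Unity's API or its internal implementation.

```python
class GameObject:
    def __init__(self, name, tag=None, camera_enabled=False):
        self.name = name
        self.tag = tag
        self.camera_enabled = camera_enabled


class Scene:
    def __init__(self):
        self.tagged_objects = []       # every object that has any tag
        self.main_camera_objects = []  # dedicated MainCamera list

    def add(self, obj):
        if obj.tag is not None:
            self.tagged_objects.append(obj)
        if obj.tag == "MainCamera":
            self.main_camera_objects.append(obj)

    def find_main_camera_old(self):
        # Old approach: scan all tagged objects into a temporary list,
        # then scan that list for an enabled camera.
        candidates = [o for o in self.tagged_objects if o.tag == "MainCamera"]
        for obj in candidates:
            if obj.camera_enabled:
                return obj
        return None

    def find_main_camera_new(self):
        # New approach: query the dedicated list directly and return
        # as soon as a match is found.
        for obj in self.main_camera_objects:
            if obj.camera_enabled:
                return obj
        return None
```

Both methods return the same object, but the new lookup only touches objects tagged MainCamera, so its cost no longer grows with the total number of tagged objects in the scene.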
In contrived test cases containing 50,000 objects, we saw speed increase by 21,000x to 51,000x.
In a Spotlight Team customer project (shown below) many hundreds of milliseconds vanished to nothing after this improvement.
Optimized RenderManager camera usage
- Reduced the impact of sorting the cameras in RenderManager
Previously, every time a camera was added to, or removed from, the RenderManager class, a linked list was updated to keep the active cameras sorted by depth. Every change required memory allocations and pointer dereferences to check the depth of each camera, which could be slow with many cameras.
Now, the list is sorted only when ordering is needed, because only rendering cares about a sorted list. During loading, cameras can be added to or removed from a flat array (fewer allocations!), and the sorting happens only the first time a sorted list is requested (during rendering). This test shows the performance improvement in the final timings (the orange bar on the far right is the new code):
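The deferred-sort pattern can be sketched in Python with a dirty flag. The `CameraList` class is a hypothetical model for this example, and `sort_count` is instrumentation added only to show how often sorting actually runs.

```python
class CameraList:
    """Flat array of cameras that is sorted only when order is requested."""

    def __init__(self):
        self._cameras = []   # (depth, name) pairs in insertion order
        self._dirty = False  # True when the array may be out of order
        self.sort_count = 0  # instrumentation for the example

    def add(self, name, depth):
        # Cheap append; no per-insert ordering work.
        self._cameras.append((depth, name))
        self._dirty = True

    def remove(self, name):
        # Removal keeps the relative order of the remaining entries,
        # so it never needs to set the dirty flag.
        self._cameras = [c for c in self._cameras if c[1] != name]

    def sorted_names(self):
        # Sort lazily, at most once per batch of additions.
        if self._dirty:
            self._cameras.sort(key=lambda c: c[0])
            self._dirty = False
            self.sort_count += 1
        return [name for _, name in self._cameras]
```

Adding many cameras during loading triggers a single sort at the first render request, rather than one ordering pass per added camera.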
Texture loading optimizations
- 2D texture and Cubemap creation occur on a thread on most graphics backends
- 2D texture and Single Mip Cubemap loading optimized on consoles
To reduce hitches during texture loading, we moved 2D texture creation from the graphics thread to a worker thread. Unity 2019 releases included this optimization for most graphics backends. In Unity 2020.2, we fixed an additional case for DirectX 12, removing an 80 ms stall with an 8K texture.
We optimized Texture2D loading for consoles in Unity 2020.1 by moving the texture swizzling offline and loading directly to GPU memory. Performance gains are up to 30% for a 2D texture load, depending on texture size and platform.
In Unity 2020.2, we also optimized cubemap loading for consoles. For a 2K cubemap, some consoles saw savings of 30 ms on the job thread, cutting up to 15 ms of overall loading time for an individual texture.
Profile Analyzer 1.0.0
- Released Profile Analyzer 1.0.x as a verified package in 2020.2
Profiling and analysis always guide our performance optimization efforts, using a combination of platform-specific profiling tools and Unity’s own custom Profiler. To assist our profiling efforts, we wrote the Profile Analyzer tool, which will be available as a verified package in 2020.2.
Profile Analyzer 1.0.0 and 1.0.x updates include numerous quality-of-life bug fixes, a few performance optimizations, and the addition of some small features, such as:
- Optional column to show the threads a marker appears in
- Multi-selection support in the frame time graph UI
- Sorting in the thread selection UI
The Profiler Team will be leading the future development of the tool.
Like what you see?
These updates are just part of our ongoing performance-enhancing contributions at Unity. We’re already hard at work to bring you more performance improvements in 2021. Please continue to send us feedback, and let us know in the comments if there are areas you would like us to focus on.
Take Unity 2020.2 beta for a test run
With Unity 2020.2, we’re continuing our 2020 focus on performance, stability and workflow improvements. Join the beta program and let us know what you think about all the upcoming updates on the 2020.2 beta forum.