Unity Labs: AutoLOD — Experimenting with automatic performance improvements
Would you be willing to trade storage cost for up to a 10x improvement in rendering performance? If so, then read on…
Check out the code on GitHub!
As creators of digital 3D content, we want our creations to look and perform their best. In some cases we may be attempting to push rendering to the limit because we are in control of the hardware on which the experience will run. In other cases, we may be building an experience that has to run on a variety of hardware. Either way, we can pick and choose from a variety of performance optimizations to improve both CPU and GPU running time: occlusion culling, texture atlasing, static and dynamic batching, GPU instancing, shader fallbacks, multi-threading, lightmapping, optimizing garbage collection / scripts, and many more techniques. One technique that has been tried-and-true in 3D graphics is using varying levels of detail (LOD) for meshes.
LOD0 is traditionally the original mesh. Each additional LOD is a decimation or reduction of the previous LOD, which reduces the polygon count.
If LOD is tried-and-true, then why talk about it? I’d venture to say it has something to do with how readily available LODs are.
Digital productions may not be taking full advantage of LOD for any of the following reasons:
- Requires an artist to process LODs (e.g. individually, in a batch via UI, etc.)
- Requires a custom pipeline to be set up
- Involves some software engineering effort
- Involves evaluating a variety of products for use at different stages in the pipeline, each of which may take a different approach
- Ultimately, may require estimating the benefits and trade-offs of making use of LOD
This past summer I was looking to challenge some of these barriers to LOD usage in an experimental project I’ve called AutoLOD. I was joined in Labs by Yangguang Liao, a Ph.D. student from the University of California, Davis to assist with the project. Parts of this project were originally started at Hack Week 11 (2016) with Elliot Cuzzillo and continued during Hack Week XII (2017) with Jake Turner.
The above video shows rendering with traditional LOD [left] at an average of 30 fps compared to rendering with SceneLOD (part of the AutoLOD package) [right] at an average of 42 fps. At full zoom, traditional LOD uses ~9 ms / 7 ms (CPU/GPU) compared to the ~1 ms / 0.5 ms (CPU/GPU) for SceneLOD. Note: The color disparities are due to the current shader that is being used for SceneLOD, which can be customized.
Underneath each playback window is a recording of the profiler window. On the left, you can see the rendering cost balloon as more of the scene is shown. On the right, the rendering cost stays relatively constant once Hierarchical LOD kicks in (more on this later). You may notice that there is minimal CPU usage on the right, which is due to the reduced draw call count.
The vision of AutoLOD was to explore what an automatic, extensible, and pluggable level of detail (LOD) system might look like in Unity, which could support rendering-intensive projects and serve as a testbed for continuing LOD research. Let’s define these terms:
- Automatic in that sensible defaults are used in order to auto-generate LODs, which will generally make projects run with better performance
- Extensible in that default LOD generator(s) and runtime(s) can be extended and/or overridden (e.g. discrete vs. continuous)
- Pluggable in that third parties can create their own LOD generators that can be used in place of a default LOD generator
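As a rough illustration of how these three properties fit together, here is a minimal Python sketch (the project itself targets Unity, so this is not its code). The names `register_simplifier`, `keep_every_nth`, and `generate_lod_chain` are hypothetical, and halving the triangle count at each level is an assumed default rather than something the project specifies.

```python
from typing import Callable, Dict, List, Tuple

# A "mesh" here is only a list of triangles (vertex-index triples).
Triangle = Tuple[int, int, int]
Mesh = List[Triangle]
Simplifier = Callable[[Mesh, float], Mesh]  # (mesh, quality in (0, 1]) -> reduced mesh

_SIMPLIFIERS: Dict[str, Simplifier] = {}


def register_simplifier(name: str, simplifier: Simplifier) -> None:
    """Pluggable: third parties register their own generator under a name."""
    _SIMPLIFIERS[name] = simplifier


def keep_every_nth(mesh: Mesh, quality: float) -> Mesh:
    """Toy stand-in for a real decimator: keeps about `quality` of the triangles."""
    if not mesh:
        return []
    target = max(1, round(len(mesh) * quality))
    step = len(mesh) / target
    return [mesh[int(i * step)] for i in range(target)]


register_simplifier("default", keep_every_nth)


def generate_lod_chain(mesh: Mesh, simplifier: str = "default", levels: int = 3) -> List[Mesh]:
    """Automatic: with no arguments, build a chain using assumed defaults."""
    simplify = _SIMPLIFIERS[simplifier]
    chain = [mesh]  # LOD0 is the original mesh
    for _ in range(levels):
        # Each additional LOD is a reduction of the previous LOD.
        chain.append(simplify(chain[-1], 0.5))
    return chain


mesh = [(i, i + 1, i + 2) for i in range(64)]
chain = generate_lod_chain(mesh)
print([len(lod) for lod in chain])
```

Swapping in a different generator only requires another `register_simplifier` call and passing its name to `generate_lod_chain`, which is the kind of extension point the "pluggable" goal describes.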
Our initial goals for this experimental project were:
- LOD generation on model import with sensible defaults
- Project-wide and per-model LOD import settings
- GPU-accelerated default LOD generator*1
- Asynchronous, pluggable LOD generation framework
- Hierarchical LOD support via SceneLOD*
- Extensible runtime that can be paired with LOD generators for alternative techniques (e.g. continuous, view-dependent, etc.)1
- “Workbench” scene that allows for LOD generator comparison1
Not all goals were reached due to time constraints. However, we felt that the experiment was a success in that parts of the vision proved out. Let’s dig into some of the details.
Tying into the vision of LOD generation being automatic, our goal was to have sensible defaults that would work for most projects. Any professional LOD package comes with plenty of sliders and toggles; ideally, those would only be necessary when an automatically generated LOD looked bad enough to warrant tuning by hand. That being said, there are project-wide settings that can be specified in Edit -> Preferences…
If any of the generated LODs are not correct, it’s possible to override them per model file:
It’s possible to change the simplifier/batcher combo for a single file or simply turn off automatic generation on import and supply the LODs manually. You can even add additional LODs in the LOD chain if you prefer. The LOD chain will get included in the imported version of the model file in the project, so no separate prefab is needed in order to set up a LODGroup.
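The relationship between project-wide settings and per-model overrides can be sketched as a simple merge. This is a hedged Python illustration only: `LODImportSettings`, its field names, and `resolve` are hypothetical and do not mirror the actual AutoLOD settings.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class LODImportSettings:
    # Hypothetical fields, chosen to match the options described in the text.
    generate_on_import: bool = True
    simplifier: str = "default"
    batcher: str = "default"
    max_lod: int = 3


def resolve(project: LODImportSettings, overrides: dict) -> LODImportSettings:
    """Per-model values win; anything not overridden falls back to the project default."""
    return replace(project, **overrides)


project_defaults = LODImportSettings()
# One model uses a different simplifier; another supplies its LODs manually.
statue = resolve(project_defaults, {"simplifier": "third_party"})
hero_prop = resolve(project_defaults, {"generate_on_import": False})
print(statue)
print(hero_prop)
```

Keeping overrides sparse means a change to the project-wide defaults still reaches every model that has not explicitly opted out.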
SceneLOD is inspired2 by the work of Erikson, C., D. Manocha, and W. Baxter in a 2001 I3D Paper. We decided to create an implementation that would work with the existing LODGroup component in Unity, so that a custom build of Unity would not be required. A bounding volume hierarchy (currently an Octree) of LODGroup components controls which LOD is being used to render the scene.
As a performance optimization, Hierarchical Level of Detail (HLOD) partitions individual meshes in a scene in order to replace those meshes with a grouped representation. Traditional Level of Detail (LOD) selects an appropriate mesh representation according to screen size, distance, viewpoint or some other metric. Each mesh rendered, regardless of which LOD is selected, typically adds its own draw call. A limitation of traditional LOD is that there is no optimization in the aggregate for draw calls, as each object’s LOD chain is evaluated individually. Static batching only solves part of this problem, since it aggregates by shared material. Draw calls typically burden the CPU, so reducing them will generally improve CPU performance.
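The draw call limitation described above can be seen in a small Python sketch. It assumes a screen-size metric with descending thresholds (similar in spirit to the screen-relative heights a LODGroup uses); `select_lod`, `count_draw_calls`, and the threshold values are hypothetical.

```python
def select_lod(screen_fraction, thresholds):
    """Return the index of the first LOD whose threshold the object still meets.

    `thresholds` is descending, e.g. [0.6, 0.3, 0.1]. An object smaller than
    every threshold is culled, signalled by returning len(thresholds).
    """
    for index, threshold in enumerate(thresholds):
        if screen_fraction >= threshold:
            return index
    return len(thresholds)


def count_draw_calls(screen_fractions, thresholds):
    """One draw call per visible object, whichever LOD was selected."""
    culled = len(thresholds)
    return sum(1 for f in screen_fractions if select_lod(f, thresholds) != culled)


thresholds = [0.6, 0.3, 0.1]
near = [0.8, 0.7, 0.65, 0.9]    # four objects filling most of the screen
far = [0.2, 0.15, 0.12, 0.11]   # the same four objects seen from far away

print([select_lod(f, thresholds) for f in near], count_draw_calls(near, thresholds))
print([select_lod(f, thresholds) for f in far], count_draw_calls(far, thresholds))
```

In both cases the count is four draw calls: the far view renders cheaper meshes, but the number of submissions to the GPU is unchanged, which is the gap HLOD targets.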
HLOD can aid in reducing draw calls by combining all objects within a specific volume into a single mesh and potentially a single material by utilizing a texture atlas. For games that wish to display large sweeping views of a whole scene, HLOD can benefit performance greatly. In other cases, a group of meshes decimated together as one combined mesh may also look better than the individually decimated LODs. The drawback of HLOD is the extra memory required to store an HLOD mesh at every node in the BVH.
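To make the grouping idea concrete, here is a minimal Python sketch that buckets objects into fixed-size cells and counts one draw call per occupied cell. It is a simplification under stated assumptions: a single-level uniform grid stands in for the octree, `group_by_cell` is a hypothetical helper, and real combining would also merge geometry and atlas textures.

```python
from collections import defaultdict


def group_by_cell(positions, cell_size):
    """Map each (x, y, z) position to the grid cell that contains it."""
    cells = defaultdict(list)
    for index, (x, y, z) in enumerate(positions):
        key = (int(x // cell_size), int(y // cell_size), int(z // cell_size))
        cells[key].append(index)
    return dict(cells)


# A 4 x 4 block of buildings spaced 10 units apart on the ground plane.
buildings = [(x * 10.0, 0.0, z * 10.0) for x in range(4) for z in range(4)]
cells = group_by_cell(buildings, cell_size=20.0)

draw_calls_individual = len(buildings)  # one per object
draw_calls_combined = len(cells)        # one per combined cell mesh
print(draw_calls_individual, draw_calls_combined)
```

Each combined cell mesh has to be stored in addition to the original objects, which is where the extra memory cost comes from.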
A slightly modified version of the demo scene provided by the POLYGON — City Pack was used. A camera was animated using Timeline to zoom from a close view to the entire view of the city in 5 seconds. Tests were performed on a Razer Blade laptop3.
Let’s take a closer look at the profiler views for traditional LOD and HLOD:
Traditional LOD (above) shows growing CPU and rendering cost as the camera zooms out and reveals more of the scene.
For the HLOD version, traditional LOD is active when playback initially starts, which explains the rendering cost at the beginning. Eventually the performance moves into near constant CPU and GPU costs once HLOD is fully utilized. BVH evaluation (i.e. determining which HLODs should render) has some CPU cost, too.
Additionally, the following experiments were run with the entire scene in view (camera stationary) in the GameView and the Stats window on:
| Configuration | CPU (ms) | % vs static | Render (ms) | % vs static | Triangles (M) | % vs static | Batches | % vs static |
|---|---|---|---|---|---|---|---|---|
| Static + Instancing | 12 | 2 | 4 | -10 | 5.5 | 0 | 694 | 0 |
| LOD + Instancing | 8.3 | 47 | 4.3 | -16 | 0.8 | 588 | 691 | 1 |
| HLOD + Instancing | 0.8 | 1,425 | 0.6 | 500 | 1.4 | 293 | 4 | 17,275 |
Static batching involved marking all of the objects in the scene as static and then hitting play. Instancing involved enabling GPU Instancing in all materials used.
As you can see, with a large city scene HLOD improves CPU time by 1,425% relative to the static baseline and reduces the draw call count from 1487 to only 6 draw calls!
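The "% vs static" columns appear consistent with the formula (baseline / value - 1) × 100, where lower values are better. The Python sketch below checks that reading against the table. Note the assumptions: the static baseline row itself is not shown in the table, so the 5.5 M triangle baseline is taken from the row that reports 0%, and the roughly 12.2 ms CPU baseline is backed out from the 1,425% figure rather than quoted from the source.

```python
def improvement_vs_baseline(baseline, value):
    """Percent improvement when lower is better: (baseline / value - 1) * 100."""
    return (baseline / value - 1.0) * 100.0


# Triangles (M): Static + Instancing reports 0% at 5.5, so 5.5 is used as the baseline.
hlod_triangles_pct = improvement_vs_baseline(5.5, 1.4)

# CPU (ms): infer the baseline from HLOD's reported 1,425% at 0.8 ms (an assumption).
cpu_baseline_ms = 0.8 * (1.0 + 1425.0 / 100.0)
lod_cpu_pct = improvement_vs_baseline(cpu_baseline_ms, 8.3)
static_instancing_cpu_pct = improvement_vs_baseline(cpu_baseline_ms, 12.0)

# Direct comparison of HLOD against traditional LOD (both with instancing).
hlod_vs_lod_speedup = 8.3 / 0.8

print(round(hlod_triangles_pct), round(cpu_baseline_ms, 1))
print(round(lod_cpu_pct), round(static_instancing_cpu_pct))
print(round(hlod_vs_lod_speedup, 1))
```

Under these assumptions the inferred baseline reproduces the 47% and 2% CPU entries in the table, and the CPU time of HLOD comes out at roughly a tenth of that of traditional LOD.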
However, where HLOD really takes off is when you build scenes that traditional LOD would normally not be able to handle:
This is an example scene with four copies of the original scene for a total of 6.2M triangles and 11655 batches. Rendering at 83.3 ms / 43.9 ms (CPU/GPU), this falls below interactive frame rates.
Now, comparing this to an HLOD version of the same scene:
We’re still rendering at 1 ms / 0.4 ms (CPU/GPU) and only 6 batches even though we’ve increased the triangle count to 7M. Keep in mind that although these are copies of the original scene, you could expect the same performance even if each part of the city were unique.
In a build, SceneLOD would add to the static mesh and texture size, but this can be reduced if the BVH depth is also reduced.
Textures: 36.3 MB (5.3%)
Meshes: 599.9 MB (87.6%)
Textures: 20.3 MB (20.9%)
Meshes: 30.0 MB (30.8%)
The uncompressed size on disk of the HLOD meshes is 1.1 GB.
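Since an HLOD mesh is stored at every node of the BVH, the number of nodes gives a feel for why reducing depth reduces build size. The Python sketch below computes the upper bound for a fully populated octree, 1 + 8 + 8² + … + 8^d = (8^(d+1) − 1) / 7. The function name is hypothetical, and a real scene's octree is usually sparse, so actual counts would be lower.

```python
def max_octree_nodes(depth):
    """Upper bound on nodes in an octree whose root is at depth 0."""
    return (8 ** (depth + 1) - 1) // 7


for depth in range(4):
    print(depth, max_octree_nodes(depth))
```

Each extra level can multiply the number of stored HLOD meshes by up to eight, so trimming even one level of depth can noticeably reduce what ends up in the build.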
There are some one-time costs for our HLOD implementation both in the generation of the BVH and for generating each HLOD, separate from LOD generation. These one-time computation costs will occur any time an object is added, moved, or removed in the scene. However, SceneLOD keeps track of these changes and updates the BVH and HLODs automatically in the background.
Each time a camera renders, it is necessary to walk the BVH and determine which LODGroup components should be enabled before rendering.4
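A per-camera BVH walk of this kind might look like the following Python sketch. It is only an illustration under assumptions: `Node`, `select_nodes`, and the radius-over-distance metric are hypothetical, and the criterion SceneLOD actually uses is described on GitHub rather than here.

```python
import math
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    name: str
    center: Tuple[float, float, float]
    radius: float  # bounding-sphere radius of the node's volume
    children: List["Node"] = field(default_factory=list)


def select_nodes(node, camera_pos, threshold, out=None):
    """Collect the nodes to render: a node's combined HLOD if it looks small
    enough from the camera, otherwise whatever its children select."""
    if out is None:
        out = []
    distance = math.dist(node.center, camera_pos)
    relative_size = node.radius / max(distance, 1e-6)
    if not node.children or relative_size < threshold:
        out.append(node)  # leaf, or small enough: render this node as a whole
    else:
        for child in node.children:
            select_nodes(child, camera_pos, threshold, out)
    return out


left = Node("left", (-5.0, 0.0, 0.0), 5.0)
right = Node("right", (5.0, 0.0, 0.0), 5.0)
root = Node("root", (0.0, 0.0, 0.0), 10.0, [left, right])

far_view = select_nodes(root, (0.0, 0.0, 1000.0), threshold=0.1)
near_view = select_nodes(root, (0.0, 0.0, 20.0), threshold=0.1)
print([n.name for n in far_view], [n.name for n in near_view])
```

From far away only the root's combined mesh is selected, while a closer camera descends to the children. The walk visits at most every node once, which is the per-render CPU cost mentioned earlier.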
We’ve found that Automatic LOD can remove some of the pain points of getting LOD into a digital production. Sensible defaults can get projects most of the way there, and if any problem meshes exist, they can be overridden on a case-by-case basis. SceneLOD provides an example implementation of HLOD that can be used with the current version of Unity on large scenes. If you are willing to trade storage cost for performance, you might be able to improve rendering performance by an order of magnitude for extremely large scenes that have many static elements.
We hope this experimental project provides some insight into your own project’s performance challenges and/or gives you the ability to build more elaborate scenes. Certainly, there are many avenues for future work, such as support for dynamic objects, better compression for HLODs on disk, a default LOD generator, different shader profiles for HLOD rendering, and of course, optimization!
Please check out the code on GitHub and post any comments / issues you have directly to the project!
* Partially complete
2 Differences between our implementation and the paper are detailed on GitHub
3 Hardware / software configuration:
- Intel i7-6700HQ @ 2.6 GHz
- 16.0 GB RAM
- NVIDIA GeForce GTX 1060
- Samsung SSD HD
- Windows 10 64-bit
- Unity 2017.3.0f2
- Simplygon 8.2.307 for LOD generation
4 A more thorough explanation of run-time performance is detailed on GitHub