On DOTS: C++ & C#
This is a brief introduction to our new Data-Oriented Tech Stack (DOTS), sharing some insights in how and why we got to where we are today, and where we’re going next. We’re planning on posting more about DOTS on this blog in the near future.
Let’s talk about C++. The language Unity is written in today.
One of many advanced game programmers’ problems at the end of the day is that they need to provide an executable with instructions the target processor can understand, that when executed will run the game.
For the performance critical part of our code, we know what we want the final instructions to be. We just want an easy way to describe our logic in a reasonable way, and then trust and verify that the generated instructions are the ones we want.
In our opinion, C++ is not great at this task. I want my loop to be vectorized, but a million things can happen that might make the compiler not vectorize it. It might be vectorized today, but not tomorrow if a new seemingly innocent change happens. Just convincing all my C/C++ compilers to vectorize my code at all is hard.
We decided to make our own “reasonably comfortable way to generate machine code”, that checks all the boxes that we care about. We could spend a lot of energy trying to bend the C++ design train a little bit more in a direction it would work a little bit better for us, but we’d much rather spend that energy on a toolchain where we can do all of the design, and that we design exactly for the problem that game developers have.
What checkboxes do we care about?
- Performance is correctness. I should be able to say “if this loop for some reason doesn’t vectorize, that should be a compiler error, not a ‘oh code is now just 8x slower but it still produces correct values, no biggy!’”
- Cross-architecture. The input code I write should not have to be different for when I target iOS than when I target Xbox.
- We should have a nice iteration loop where I can easily see the machine code that is generated for all architectures as I change my code. The machine code “viewer” should do a good job at teaching/explaining what all these machine instructions do.
- Safety. Most game developers don’t have safety very high on their priority list, but we think that the fact that it’s really hard to have memory corruption in Unity has been one of its killer features. There should be a mode in which we can run this code that will give us a clear error with a great error message if I read/write out of bounds or dereference null.
Ok, so now that we know what things we care about, the next step is to decide on what the input language for this machine code generator is. Let’s say we have the following options:
- Custom language
- Some adaption/subset of C or C++
- Subset of C#
Say What C#? For our most performance critical inner loops? Yes. C# is a very natural choice that comes with a lot of nice benefits for Unity:
- It’s the language our users already use today.
- Has great IDE tooling, both editing/refactoring as well as debugging.
- A C#->intermediate IL compiler already exists (the Roslyn C# compiler from Microsoft), and we can just use it instead of having to write our own.
- We have a lot of experience modifying intermediate-IL, so it’s easy to do codegen and postprocessing on the actual program.
- Avoids many of C++’s problems (header inclusion hell, PIMPL patterns, long compile times)
I quite enjoy writing code in C# myself. However, traditional C# is not an amazing language from a performance perspective. The C# language team, standard library team, and runtime team have been making great progress in the last two years. Still, when using C# language, you have no control over where/how your data is laid out in memory. And that is exactly what we need to improve performance.
On top of that, the standard library is oriented around “objects on the heap”, and “objects having pointer references to other objects”.
That said, when working on a piece of performance critical code, we can give up on most of the standard library, (bye Linq, StringFormatter, List, Dictionary), disallow allocations (=no classes, only structs), reflection, the garbage collector and virtual calls, and add a few new containers that you are allowed to use (NativeArray and friends). Then, the remaining pieces of the C# language are looking really good. Check out Aras’s blog for some examples from his path tracer toy project for some examples.
This subset lets us comfortably do everything we need in our hot loops. Because it’s a valid subset of C#, we can also run it as a regular C#. We can get errors on out of bounds access, with great error messages, debugger support and compilation speeds you forgot were possible when working in C++. We often refer to this subset as High-Performance C# or HPC#.
Burst compiler: Where are we today?
We’ve built a code generator/compiler called Burst. It’s been available since Unity 2018.1 as a preview package. We have a lot of work ahead, but we’re already happy with it today.
We’re sometimes faster than C++, also still sometimes slower than C++. The latter case we consider performance bugs we’re confident we can resolve.
Only comparing performance is not enough though. What matters equally is what you had to do to get that performance. Example: we took the C++ culling code of our current C++ renderer and ported it to Burst. The performance was the same, but the C++ version had to do incredible gymnastics to convince our C++ compilers to actually vectorize. The Burst version was about 4x smaller.
To be honest, the whole “you should move your most performance critical code to C#” story also didn’t result in everybody internally at Unity immediately buying it. For most of us, it feels like “you’re closer to the metal” when you use C++. But that won’t be true for much longer. When we use C# we have complete control over the entire process from source compilation down to machine code generation, and if there’s something we don’t like, we just go in and fix it.
We will slowly but surely port every piece of performance critical code that we have in C++ to HPC#. It’s easier to get the performance we want, harder to write bugs, and easier to work with.
Here’s a screenshot of Burst Inspector, allowing you to easily see what assembly instructions were generated for your different burst hot loops:
Unity has a lot of different users. Some can enumerate the entire arm64 instruction set from memory, others are happy to create things without getting a PhD in computer science.
All users benefit as the parts of their frame time that are spent running engine code (usually 90%+) get faster. The parts that are running Asset Store package runtime code gets faster as Asset Store package authors adopt HPC#.
Advanced users will benefit on top of that by also being able to also write their own high-performance code in HPC#.
In C++, it’s very hard to ask the compiler to make different optimization tradeoffs for different parts of your project. The best you have is per file granularity on specifying optimization level.
Burst is designed to take a single method in that program as input: the entry point to a hot loop. It will compile that function and everything that it invokes (which is guaranteed to be known: we don’t allow virtual functions or function pointers).
Because Burst only operates on a relatively small part of the program, we set optimization level to 11. Burst inlines pretty much every call site. Remove if checks that otherwise would not be removed, because in inlined form we have more information about the arguments of the function.
How that helps with common multithreading problems
C++ (nor C#) doesn’t do much to help developers to write thread-safe code.
Even today, more than a decade since game consumer hardware has >1 core, it is very hard to ship programs that use multiple cores effectively.
Data races, nondeterminism and deadlocks are all challenges that make shipping multithreaded code difficult. What we want is features like “make sure that this function and everything that it calls never read or write global state”. We want violations of that rule to be compiler errors, not “guidelines we hope all programmers adhere to”. Burst gives a compiler error.
We encourage both Unity users and ourselves to write “jobified” code: splitting up all data transformations that need to happen into jobs. Each job is “functional”, as in side-effect free. It explicitly specifies the read-only buffers and read/write buffers it operates on. Any attempt to access other data results in a compiler error.
The job scheduler will guarantee that nobody is writing to your read-only buffer while your job is running. And we’ll guarantee that nobody is reading from your read/write buffer while your job is running.
If you schedule a job that violates these rules, you get a runtime error every time. Not just in your unlucky race condition case. The error message will explain that you’re trying to schedule a job that wants to read from buffer A, but that you already scheduled a job before that will write to A, so if you want to do this, you need to specify that previous job as a dependency.
We find this safety mechanism catches a lot of bugs before they get committed and results in efficient use of all cores. It becomes impossible to code a deadlock or a race condition. Results are guaranteed to be deterministic regardless of how many threads are running, or how many time a thread gets interrupted by some other process.
Hacking the whole stack
By being able to hack on all these components, we can make them be aware of each other. For example, a common case for a vectorization not happening is that the compiler cannot guarantee that two pointers do not point to the same memory (aliasing). We know two NativeArray’s will never alias because we wrote the collection library, and we can use that knowledge in Burst, so it won’t have to give up on optimization because it’s afraid two array pointers might point to the same memory.
Similarly, we wrote the Unity.Mathemetics math library. Burst has intimate knowledge of it. It will (in the future) be able to do accuracy sacrificing optimizations for things like math.sin(). Because to Burst math.sin() is not just any C# method to compile, it will understand the trigonometric properties of sin(), understand that sin(x) == x for small values of x (which Burst might be able to prove), understand it can be replaced by a Taylor series expansion for a certain accuracy sacrifice. Cross platform & architecture floating point determinism is also a future goal of burst that we believe is possible to achieve.
The distinction between engine code and game code disappears
By writing Unity’s runtime code in HPC#, the engine and the game are written in the same language. We will distribute runtime systems that we have converted to HPC# as source code. Everyone will be able to learn from them, improve them, tailor them. We’ll have a level playing field, where nothing is stopping users from writing a better particle system, physics system or renderer than we write. I expect many people will. By having our internal development process be much more like our users’ development process, we’ll also feel our users pain more directly, and we can focus all our efforts into improving a single workflow, instead of two different ones.
In my next post, I’ll cover a different part of DOTS: the entity component system.