
I was looking at a curious bug report: “Texture importing got much slower in current beta”. At first glance, I dismissed it as “eh, someone’s being confused” (I quickly tried it on several textures and did not notice any regression). But then I got a proper bug report with several textures. One of them was importing about 10 times slower than it used to.

Why would anyone make texture importing that much slower? No one would, of course. Turns out, this was an unintended consequence of generally improving things.

But, the bug report made me use xperf (a.k.a. Windows Performance Toolkit) for the first time. Believe it or not, I’ve never used it before!

So here’s the story

We’ve got a TGA texture (2048×2048, uncompressed – a 12MB file) that takes about 10 seconds to import in current beta build, but it took ~1 second on Unity 4.6.

First wild guess: did someone accidentally disable multithreaded texture compression? Nope, doesn’t look like it (making final texture be uncompressed still shows massive regression).

Second guess: we are using the FreeImage library to import textures. Maybe someone, I dunno, updated it and committed a debug build? Nope, the last change to our build was done many moons ago.

Time to profile. My quick “I need to get some answer in 5 seconds” profiler on Windows is Very Sleepy, so let’s look at that:

[Image: Very Sleepy profile of the import]

Wait what? All the time is spent in WinAPI ReadFile function?!

Is there something special about the TGA file I’m testing on? Let’s make a same-sized, uncompressed PNG image (so the file size comes out the same).

[Image: the test TGA and PNG files]

The PNG imports in 108ms, while the TGA takes 9800ms (I’ve turned off DXT compression to focus on raw import time). In Unity 4.6 the same work is done in 116ms (PNG) and 310ms (TGA). File sizes roughly the same. WAT!

Enter xperf

Asked a coworker who knows something about Windows: “why would reading one file spend all time in ReadFile, but another file of same size read much faster?”, and he said “look with xperf”.

I’ve read about xperf on Bruce Dawson’s excellent blog, but never tried it myself. Before today, that is.

So, launch Windows Performance Recorder (I don’t even know if it comes with some VS or Windows SDK version or needs to be installed separately… it was on my machine somehow), tick CPU and disk/file I/O and click Start:

[Image: Windows Performance Recorder settings]

Do texture importing in Unity, click save, and on this fairly confusing screen click “Open in WPA”:

[Image: WPR save screen with “Open in WPA”]

The overview in the sidebar gives usage graphs of our stuff. A curious thing: neither the CPU (Computation) nor the Storage graph shows intense activity? The plot thickens!

[Image: WPA sidebar overview graphs]

CPU usage investigation

Double clicking the Computation graph shows a timeline of CPU usage, with graphs for each process. We can see Unity.exe taking up some CPU during a time period, which the UI nicely highlights for us.

[Image: CPU usage timeline in WPA]

Next, we want to know what is using the CPU. The UI groups things by the columns on the left side of the yellow divider, and displays details for them on the right side of it. We’re interested in a callstack now, so context-click on the left side of the divider and pick “Stack”:

[Image: adding the Stack column]

Oh right, to get any useful stacks we’ll need to tell xperf to load the symbols. So you go Trace -> Configure Symbol Paths, add Unity folder there, and then Trace -> Load Symbols.

And then you wait. And wait some more…

And then you get the callstacks! Not quite sure what the “n/a” entry is; my best guess is that it represents unused CPU cores or sleeping threads or something like that.

[Image: call stacks after loading symbols]

Digging into the other call stack, we see that indeed, all the time is spent in ReadFile.

[Image: call stack ending in ReadFile]

Ok, so that was not terribly useful; we already knew that from the Very Sleepy profiling session.

Let’s look at I/O usage

Remember the “Storage” graph on sidebar that wasn’t showing much activity? Turns out, you can expand it into more graphs.

[Image: expanded Storage graphs]

Now we’re getting somewhere! The “File I/O” overview graph shows massive amounts of activity while we were importing our TGA file. We just need to figure out what’s going on there. Double clicking that graph in the sidebar gives I/O details:

[Image: File I/O activity details]

You can probably see where this is going now. We have a lot of file reads, in fact almost 400 thousand of them. That sounds a bit excessive.

Just like in the CPU part, the UI sorts on columns to the left of the yellow divider. Let’s drag the “Process” column to the left; this shows that all these reads are indeed coming from Unity.

[Image: reads grouped by process]

Expanding the actual events reveals the culprit:

[Image: expanded file read events]

We are reading the file alright. 3 bytes at a time.

But why and how?

But why are we reading a 12 megabyte TGA file in three-byte chunks? No one updated our image reading library in a long time, so how come things have regressed?

Found the place in code where we’re calling into FreeImage. Looks like we’re setting up our own I/O routines and telling FreeImage to use them:

[Image: code calling FreeImage with custom I/O routines]

Version control history check: indeed, a few weeks ago that code was changed from basically “hey, load an image from this file path” to “hey, load an image using these I/O callbacks”.

This generally makes sense. If we have our own file system functions, it makes sense to use them. That way we can support reading from non-plain files (e.g. archives, or compressed files), etc. In this particular case, the change was done to support LZ4 compression in the lightmap cache (FreeImage needs to import texture files without knowing that they have LZ4 compression applied on top of them).

So all that is good. Except when that changes things to have wildly different performance characteristics, that is.

When you don’t pass file I/O routines to FreeImage, it uses a “default set”, which is just the C stdio ones:

[Image: FreeImage default stdio I/O functions]

Now, C stdio routines do I/O buffering by default… our I/O routines do not. And FreeImage’s TGA loader does a very large number of one-pixel reads.

To be fair, the “read TGA one pixel at a time” behavior seems to be fixed in upstream FreeImage these days; we’re just using a quite old version. So looking at this bug made me realize how old our version of FreeImage is, and prompted me to make a note to upgrade it at some point. But not today.

The Fix

So, a proper fix here would be to set up buffered I/O routines for FreeImage to use. Turns out we don’t have any at the moment. They aren’t terribly hard to write; I poked the relevant folks to do them.

In the meantime, to check that this was really the culprit, and to not ship a “well, TGAs import much slower” beta, I made a hotfix that reads the whole image into memory, and then loads it from there.

[Image: the hotfix code]

Is it okay to read the whole image into a memory buffer? Depends. I’d guess in 95% of cases it is okay, especially now that the Unity editor is 64 bit. Uncompressed data for the majority of images will end up being much larger than the file size anyway. Probably the only exception could be .PSD files, which can have a lot of layers, while we’re only interested in reading the “composited image” section of the file. So yeah, that’s why I said “hotfix”; the proper solution is having buffered I/O routines, and/or upgrading FreeImage.

This actually made TGA and PNG importing faster than before: 75ms for TGA, 87ms for PNG (Unity 4.6: 310ms TGA, 116ms PNG; current beta before the fix: 9800ms TGA, 108ms PNG).

Yay.

Conclusion

Be careful when replacing built-in functionality of something with your own implementation (e.g. standard I/O or memory allocation or logging or … routines of some library). They might have different performance characteristics!

xperf on Windows is very useful! Go and read Bruce Dawson’s blog for way more details.

On Mac, Apple’s Instruments is a similar tool. I think I’ll use that in some future blog entry.

I probably should have guessed that “too many, too small file read calls” is the actual cause after two minutes of looking into the issue. I don’t have a good excuse on why I did not. Oh well, next time I’ll know :)

27 Comments


  1. It takes forever to import an old project to the Unity 5 64bit release version still and the CPU goes crazy, not fixed I guess.

  2. Is this fix in Unity 5 rc2? Texture import still seems mighty slow…

  3. @Aras – good article! I’d think that reading the whole file in memory could be a bit restrictive for mobile platforms – maybe a sliding approach is better for low spec devices (64k buffer for mobiles, etc)?

    1. Aras Pranckevičius

      January 26, 2015 at 6:18 am

      Everything in this post is about texture importing, which only ever happens in the editor. I don’t think we’re going to run the editor on iPhones anytime soon :)

  4. This article is plain awesome. I have to admit that from the first few lines it felt like I was reading a thriller story instead of a profiling essay. In fact, the writing intrigued me so much that I am about to read all of your existing articles too.

    Very well done Mr. Pranckevicius!

  5. Could something like exponential buffering help? Here’s a blog entry (Mozilla) which explains it: https://blog.mozilla.org/nnethercote/2014/11/04/please-grow-your-buffers-exponentially/

  6. “I probably should have guessed that “too many, too small file read calls” is the actual cause after two minutes of looking into the issue.”

    Nah… Unless it takes less than 1 minute to verify, I always try to stay away from “gut feeling exploration” when doing performance measurements. It’s wrong 99% of the time, even if you have years of experience in performance measurement.

    IMHO it’s way more efficient to learn the tools (like xperf or even the built-in Windows resource monitor) to a point where it only takes you 1–2 minutes to actually step down from a bird’s-eye view to the specific culprit, instead of iterating all your gut feelings one by one and spending 10 minutes each to exclude that specific idea. (I’ve seen way too many man-months wasted on this guessing approach.)

  7. Small side note for anyone who uses Very Sleepy… (which Aras says… is very good): it was updated for the first time in a long time, a few days ago: http://www.codersnotes.com/notes/very-sleepy-0-9

  8. very interesting and informative blog post, thanks for the effort of writing it Aras

  9. I’d never heard of xperf either. I don’t know what your IDE of choice is there, but if it happens to be Visual Studio, then you can use its profiler tool to observe some good information. But I’m sure you’ll know about this.

    Nice post.

    1. haha, good point; there would probably be a bit of filtering required, but it would lead straight to the function.
      xperf probably makes more sense for binaries without sources
      ( another thing is that it was maybe easier to profile the same image the bug was reported for than to pull all the relevant sources :)

  10. I’m actually a bit surprised that you guys don’t use buffering for I/O. Isn’t that quite standard nowadays?

    1. Aras Pranckevičius

      January 12, 2015 at 11:02 am

      Most if not all I/O code we end up doing already reads files in big chunks. E.g. serialization reads 64kb at a time (IIRC); many other parts just read whole file into memory and so on. This is the first time I’ve seen code doing 3 byte reads.

  11. Thank you for taking the time to write out this excellent blog post. As you can see, a lot of folks find this very useful. You might make a few more people more interested in digging into the profiler or other debugging tools more regularly.

  12. I’m glad my blog was helpful and that you were able to figure things out. Nice work.

    The n/a entries in the stack are most definitely not normal — at least not that many. If 1% of my stacks say n/a then that’s unusual and you have something like 85% of them listed as n/a. That probably either means you are profiling generated code in a way that xperf doesn’t like, or you need to reboot to let a registry change take effect (DisablePagingExecutive). Those n/a stacks are hugely distorting your results, which can be risky.

    I’m guessing that you have a CPU with eight hardware threads so that CPU Usage Sampled (~12.5%) represent one core completely busy. I’m not sure why the computation summary shows the system mostly idle — maybe expand that and see if you can figure it out, since it includes many sub-graphs.

    The other area that can be very valuable to explore (not needed this time) is CPU Usage Precise, which shows every context switch. That’s something that Very Sleepy cannot do.

    Xperf does take a bit of practice, but the truth is in there.

    1. BTW, here’s the latest page for doing analysis of CPU usage scenarios:
      https://randomascii.wordpress.com/2013/04/23/xperf-for-excess-cpu-consumption-wpa-edition/

      And, here’s the latest page (not updated for WPA I’m afraid) for doing analysis of CPU idle scenarios:
      https://randomascii.wordpress.com/2012/05/05/xperf-wait-analysisfinding-idle-time/

      The best training information, for those who have the time, is here:
      http://randomascii.wordpress.com/2014/08/19/etw-training-videos-available-now/

    2. Aras Pranckevičius

      January 11, 2015 at 8:14 am

      Thanks for the tips, much appreciated!

  13. I’d like to chime in as well – this was a fantastic article on troubleshooting and problem solving. I didn’t know about xperf, though in the past I have many, many times had to wade through dump files to find the issue for something at work.

    Very well done. I personally love articles like this. Thank you!

  14. very informative and interesting! amazing detail actually. thanks for taking time to write this. =3 please know more of these kinds of posts are really welcome. I personally would love to hear more about how you do things internally. maybe its the old engine monkey in me but it makes me miss the days when i was working on my own engine. then i remember all the hair pulling and i crawl right back into Unity again.

  15. Very interesting!

  16. Cool. I know that there is a lot of work involved in debugging Unity. Seeing how you approach this makes things more transparent, and also gives insight into how we can perform our own debugging by watching how you did yours. Thanks for sharing this:)

  17. Very interesting, I’d also like to have more insights like this.
    It’s nice to see how Unity works under the hood and what problems the team are facing.

  18. This is an amazingly detailed analysis with a happy ending. Loved it. 10/10 would read these all the time.

  19. Nice one Aras!
    Fits in with my current “back to low level” initiative after being spoiled by C# for such a long time (anyone who hasn’t seen Handmade Hero yet?)

    I also like to see more of this!

  20. Awesome article!

  21. Interesting read. It’s nice to see the thought process in so much detail.

  22. Really interesting, please make more posts like this!