Automatic Setup of a Humanoid
The humanoid animation option in Unity 4 makes it possible to retarget the same animations to different characters. The characters can have different proportions and skeleton rigs where the bones have different names etc. But before you can take advantage of that, the bones in your character have to be matched up with the authoritative “human bones” that Mecanim uses. Fortunately, for the majority of models this otherwise lengthy setup process is completely automatic. Let’s have a look at how that works…
What Avatar Mapping Involves
Setting up a humanoid Avatar in Unity involves matching every “human bone” to one of the transforms in the model. It’s possible to do this manually in Unity by clicking Configure… under the Rig tab of the Model Importer. Without fingers there are up to 24 bones to map, and 30 more if you have finger bones too. With the manual method, you have to drag the transform for every one of those into the corresponding slot in the GUI. When I first joined the effort to improve the interface for Mecanim, that was the only way to set up characters. Anyone who wanted to experience the Mecanim awesomeness with their own characters first had these 24 to 54 bones to drag in with the mouse, one by one. Things like that don’t leave an exciting first impression. So I began looking into how to do that automatically.
Hints to Go By
How do you make a computer figure out which transforms correspond to which human bones? There is more than one way to go about it…
Names of Bones
One seemingly straightforward option is to use the names of the transforms. But unfortunately, some of the most commonly used terms are actually used to describe different bones in different models.
- “Hip” can mean the center bone that’s parent to both legs, or it can mean the left or right hip.
- “Leg” can mean upper or lower leg.
- “Shoulder” can mean either the bone for the upper arm or, if present, the collar bone that’s the parent to the upper arm bone.
- “Arm” can mean upper or lower arm.
Naming for bones in the spine is often total anarchy since the number of bones is completely variable, and sometimes a bone called “neck” even has the left and right shoulder as children (due to a flaw in a popular piece of rigging software), meaning the mapping has to assign it as the chest bone despite its name. Furthermore, finger bones are often simply named “finger_0” to “finger_14” or similar, with no convention for which numbers mean what. And bones for eyeballs and jaw have a ridiculous amount of variation in what they’re called, including helpful things like “nurbsToPoly2”. So while names of bones often give us useful hints, they don’t contain sufficient information on their own to cover a wide variety of models.
Just determining if a bone is for the left or right side is a mini challenge by itself. Obviously if the word “left” or “right” is included in the name it’s a given, but a lot of the time only the letter “l” or “r” is used. However, those letters can also be part of a word, so I only regard them as side keywords if they stand by themselves, separated from the rest of the string by spaces, punctuation, or similar.
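The side-keyword rule above can be sketched in a few lines of Python. This is only an illustration, not Unity's actual code: the function name `detect_side` and the exact tokenization (splitting on anything that is not a letter) are assumptions made for the example.

```python
import re
from typing import Optional

def detect_side(name: str) -> Optional[str]:
    """Return 'left', 'right', or None for a transform name (hypothetical helper)."""
    lowered = name.lower()
    # The full words are treated as a given wherever they appear.
    if "left" in lowered:
        return "left"
    if "right" in lowered:
        return "right"
    # Split on anything that is not a letter, so that "l" and "r" only count
    # when they stand alone as their own token.
    tokens = [t for t in re.split(r"[^a-z]+", lowered) if t]
    if "l" in tokens:
        return "left"
    if "r" in tokens:
        return "right"
    return None

print(detect_side("Bip01 L UpperArm"))  # the lone "L" is a side keyword
print(detect_side("collar"))            # the letters inside a word are ignored
```

Note that a camel-case name such as “LArm” is not detected by this sketch, since the letter is not separated from the rest of the string.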
Some information can be inferred by looking at the direction of a bone. While there’s no guarantee about the pose of the character at the time when the auto-mapping happens, it’s generally more likely that bones on the right side point towards +x and that bones in the spine point towards +y while bones in the legs point towards -y etc. For each human bone we have specified a “correct” direction. Direction matches for the different child transforms are computed as the dot product of the normalized transform direction and the target human bone direction, and a score is awarded accordingly. Where the bone direction information really shines is in figuring out the mapping for the fingers, where there is often nothing else to go by.
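As a minimal sketch of the direction hint, the following Python computes the dot product of two normalized directions. How that value is turned into points is not described here, so the function simply returns the raw dot product; the name `direction_score` is made up for the example.

```python
import math

def normalize(v):
    """Scale a 3D vector to unit length (a zero vector is returned unchanged)."""
    length = math.sqrt(sum(c * c for c in v))
    if length == 0.0:
        return (0.0, 0.0, 0.0)
    return tuple(c / length for c in v)

def direction_score(bone_dir, target_dir):
    """Dot product of the normalized directions: 1 = same way, -1 = opposite."""
    a = normalize(bone_dir)
    b = normalize(target_dir)
    return sum(x * y for x, y in zip(a, b))

# A spine bone pointing straight up matches the +y target direction fully,
# while a bone pointing sideways gets no credit.
up_match = direction_score((0.0, 2.0, 0.0), (0.0, 1.0, 0.0))
side_match = direction_score((1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
```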
Bone Length Ratios
While proportions of different models differ, there are usually some aspects of the proportions that are consistent for most models. For example, the upper leg and lower leg usually have roughly the same length, while the foot is typically much shorter. The upper arm bone is usually about twice as long as the collar bone, when present.
The length ratios contribute a lot to ensuring sensible mappings. Imagine for example a model that has no collar bone, and the upper arm bone is called “shoulder”. The upper arm also has a bone in between the shoulder joint and elbow joint used for twisting of the upper arm (but nothing in the name necessarily indicates this). Looking at the names of the bones alone, the computer would be inclined to map the bone called “shoulder” to the human collar bone and map the upper arm twist bone to the upper arm human bone. This would be completely wrong. But thanks to the length ratios, this scenario is almost always avoided and the bones are mapped correctly.
Another place where the length ratios shine is for the bones in the spine. The spine in a model can consist of any number of bones, but they have to be mapped in a sensible way to just the following human bones: hips, spine, chest (optional), neck (optional), and head. If you have, say, 8 transforms in the spine, how should they be mapped? The names are no use, and the bone directions are all the same. Just attempting an even distribution gives terrible results. Instead we strive towards these length ratios: spine-chest should be 1.4 times as long as hip-spine. Chest-neck should be 1.8 times as long as spine-chest. Neck-head should be 0.3 times as long as chest-neck. (Humanoids generally bend mostly in the area below the chest since the chest itself is more rigid.) The result is a mapping that produces bending in the spine that looks natural within the confines of the bones in the model.
The algorithm chooses bone assignments that match the desired length ratio as closely as possible. Calculating “closeness” in this case requires converting both the actual ratio and the desired ratio to a logarithmic scale and measuring the difference between them. The logarithmic scale ensures that a length that’s twice as long as it’s supposed to be is penalized by the same amount as a length that’s half as long as it’s supposed to be.
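The logarithmic comparison is easy to check with a small Python sketch. The function name `ratio_penalty` is hypothetical, and how the penalty feeds into the final score is left out.

```python
import math

def ratio_penalty(actual_ratio: float, desired_ratio: float) -> float:
    """Distance between two length ratios on a logarithmic scale."""
    return abs(math.log(actual_ratio) - math.log(desired_ratio))

desired = 1.4  # spine-chest relative to hip-spine, as given in the text

too_long = ratio_penalty(2.8, desired)   # twice the desired ratio
too_short = ratio_penalty(0.7, desired)  # half the desired ratio
# Both penalties equal log(2), so the two errors are treated symmetrically.
```

On a linear scale the same two cases would differ by 1.4 and 0.7 respectively, so the overly long bone would be punished twice as hard as the overly short one.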
Last but not least, the “topology” of the bones plays a major role. For example, the neck and the left and right arm must all be children of the chest, so a mapping where one bone is mapped to the neck, but children of that bone are mapped to the left and right shoulder, is not an option. Contrary to the other factors described above, the topology requirements work as a hard constraint. The requirements are specified in a form like this:
- The RightShoulder human bone must be a child of the Chest human bone, placed 1-3 levels further down the hierarchy. (1 level would be a child; 2 levels would be a child of a child etc.)
- The RightUpperArm human bone must be a child of the RightShoulder human bone, placed 0-2 levels further down. (A level of 0 means they are the same bone. This is permitted when one of the bones is optional, like the Shoulder ones are.)
- The RightLowerArm human bone must be a child of the RightUpperArm human bone, placed 1-2 levels further down.
Most bones allow for a range of levels in between the mapping of itself and its parent. This is because different models have different numbers of transforms in between the principal human bones. For example there might or might not be a twist bone in between an upper arm bone and an elbow bone. The range also functions as an optimization, since transforms that are further down the bone hierarchy than the number of levels allowed do not need to be considered as potential matches.
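Gathering the candidates that satisfy such a level range could look roughly like the Python below. The `Transform` class and the function `descendants_in_range` are stand-ins invented for this sketch; they are not Unity types.

```python
from dataclasses import dataclass, field

@dataclass
class Transform:
    """Minimal stand-in for a transform in the model hierarchy."""
    name: str
    children: list = field(default_factory=list)

def descendants_in_range(root, min_level, max_level):
    """Collect transforms between min_level and max_level below root.

    Level 0 is root itself, level 1 its children, and so on.
    """
    found = []

    def walk(node, level):
        if level > max_level:
            return  # deeper transforms never need to be considered
        if level >= min_level:
            found.append(node)
        for child in node.children:
            walk(child, level + 1)

    walk(root, 0)
    return found

# A right arm chain with a twist bone between upper arm and forearm.
chest = Transform("chest", [
    Transform("clavicle_r", [
        Transform("upperarm_r", [
            Transform("upperarm_twist_r", [Transform("forearm_r")]),
        ]),
    ]),
])

# Candidates for RightShoulder: 1-3 levels below the transform mapped to Chest.
names = [t.name for t in descendants_in_range(chest, 1, 3)]
```

The early return at `max_level` is where the range doubles as an optimization: `forearm_r` is never visited.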
Optimal Mapping as a Search Problem
We’ve looked at various hints that can be used as part of the mapping algorithm, but how is it all combined? The solution I arrived at is to treat the mapping as a search problem. Consider the function EvaluateBoneMatch, which evaluates a match between a human bone (for example Chest) and a transform (for example “MyModel_Chest”). This function evaluates the match itself according to the keyword, direction, and length ratio hints described above. The result is a score that indicates how good this match is.
But the EvaluateBoneMatch function goes further than that. It iterates through all the child human bones. (For the Chest human bone, the children would be LeftShoulder, RightShoulder, and Neck.) For each of those it gathers all the transforms that are potential matches according to the topology requirements, i.e. all transforms n levels down the hierarchy, where n depends on which human bone is being mapped. And for each of those pairs of a child human bone and a transform it calls EvaluateBoneMatch. Since the function calls itself, this is called a recursive function. Each of those calls returns a score value that determines how good the match is.
Now the function determines the best match for each of the child human bones (LeftShoulder, RightShoulder, and Neck). This is based mostly on the scores, but it can’t simply choose the top ranking choice for each of them, since multiple child bones sometimes pick the same transform as their top choice! More on that later. Anyway, the score for each of the picked child human bone matches is added to the score of the current bone match itself, and the result is what the EvaluateBoneMatch function returns. Confused? It commonly takes a little while and some effort to wrap one’s head around a recursive function, and it sure did for me as I implemented the algorithm.
Having many different inter-related hierarchical structures in parallel didn’t make it easier (the human bone hierarchy, the transform hierarchy, and the search graph hierarchy). But the gist of it is that the EvaluateBoneMatch function returns a score that represents not just how good that specific match is in itself, but also how good the entire child hierarchy of best matches below it is. This in effect means that every bone mapping gets chosen not just based on how good that match is by itself, seen in isolation, but also based on how well it fits as a piece in the entire mapping. Remember how I mentioned that a lot of models incorrectly have the arms as children of a bone called “neck”? Imagine this conversation:
Ah, so this is the Neck bone! It’s called “neck” and points upwards and has a good length ratio, so it seems to match. The neck should only have the Head as a child, but there are some other transform children as well that don’t seem to match any bones in the head. Oh well. … later … Hmm, alternatively we can try to map the bone called “neck” as the Chest bone. This doesn’t score well in terms of keyword matches but let’s try it anyway. This Chest has child transforms that seem to match the LeftShoulder and RightShoulder as well as the Neck. And those left and right shoulders have children that match the various bones in the arms, and so on. All those matches are worth a lot of points so it ends up being better to match the transform “neck” to the Chest than to the Neck, based on the points obtained from the hierarchy further down.
With a search-based algorithm you get seemingly “smart” considerations like this for free all over the place. It’s very similar to the problem of trying to solve a maze: in order to get to the exit that you know is to the North, you may have to go in the wrong direction some of the way first. Knowing the right direction for even the first step requires knowing the entire solution.
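To make the recursion concrete, here is a heavily reduced Python sketch of the idea. Everything in it is an assumption made for illustration: the tiny human bone table, the keyword lists, the one-point keyword score standing in for all the hints, and the fact that sibling conflicts are simply ignored. It is not the actual EvaluateBoneMatch.

```python
from dataclasses import dataclass, field

@dataclass
class Transform:
    name: str
    children: list = field(default_factory=list)

# Hypothetical reduced tables: bone -> [(child bone, min level, max level)].
HUMAN_CHILDREN = {
    "Hips": [("Chest", 1, 2)],
    "Chest": [("LeftShoulder", 1, 1), ("RightShoulder", 1, 1), ("Head", 1, 2)],
}
KEYWORDS = {
    "Hips": ["hip"], "Chest": ["chest", "spine"], "Head": ["head"],
    "LeftShoulder": ["shoulder_l"], "RightShoulder": ["shoulder_r"],
}

def own_score(bone, transform):
    """Stand-in for the keyword, direction, and length ratio hints."""
    name = transform.name.lower()
    return 1.0 if any(k in name for k in KEYWORDS[bone]) else 0.0

def candidates(transform, min_level, max_level, level=0):
    """Transforms between min_level and max_level below the given one."""
    found = [transform] if min_level <= level <= max_level else []
    if level < max_level:
        for child in transform.children:
            found += candidates(child, min_level, max_level, level + 1)
    return found

def evaluate_bone_match(bone, transform):
    """Score of this match plus the best-scoring matches for all child bones."""
    score = own_score(bone, transform)
    mapping = {bone: transform.name}
    for child_bone, lo, hi in HUMAN_CHILDREN.get(bone, []):
        results = [evaluate_bone_match(child_bone, t)
                   for t in candidates(transform, lo, hi)]
        if results:
            best_score, best_mapping = max(results, key=lambda r: r[0])
            score += best_score
            mapping.update(best_mapping)
    return score, mapping

# The arms hang off a transform called "neck", as in the conversation above.
neck = Transform("neck", [Transform("shoulder_l"), Transform("shoulder_r"),
                          Transform("head")])
hips = Transform("hips", [Transform("spine", [neck])])
total, mapping = evaluate_bone_match("Hips", hips)
```

In this toy model, “spine” scores a keyword point as the Chest while “neck” scores none, yet `mapping["Chest"]` ends up as “neck” because the shoulders and head below it contribute more points than the keyword match is worth.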
Child Conflict Resolution
Multiple child human bones sometimes pick the same transform as their best matching choice. For example the human bone LeftHand has the child human bones ThumbProximal, IndexProximal, MiddleProximal, RingProximal, and LittleProximal. And maybe both RingProximal and LittleProximal picked the transform called “Finger_2” as the first choice. Now what to do? The same finger can’t be both the ring finger and the little finger! For each of the human bones we keep not just the first choice, but a ranked list of choices. To make sure that each transform is only mapped to one human bone, and that the transforms are assigned to the human bones in the best possible way, a function called GetBestChildMatchChoices is called. First we make a list that contains the current choice for each human bone. Initially they’re all set to 0 (the first choice) even if that means there are potential conflicts. We pass that list of current choices to the function, and the function then goes through these steps:
- Check if any of the current choices have conflicting transforms. If not, we are done and can return the current choices!
- Make a list of all the human bones that are part of the conflict.
- For each of those human bones, try out an alternative list of current choices where this human bone retains its current choice and all the other human bones in the conflict have to use their next choice on their priority lists. For each of those alternative lists of current choices, call the GetBestChildMatchChoices function. (Yes it’s a recursive function again.)
- Each call of GetBestChildMatchChoices returns a new current choices list and a score value that is simply the summed scores of all the matches corresponding to the current choices.
- Choose the current choices list with the best score and return that list along with its score.
This procedure basically tests all permutations and chooses the best scoring one. Before I came up with this approach I had initially just implemented an approach where the human bone with the best scoring match got its first choice, the human bone with the next best scoring match would get its first choice excluding those already picked, and so on. This was faster computationally and easier to understand, but unfortunately it often produced incorrect mappings. The illustration below demonstrates the difference between picking the best individual match versus the best overall match for all siblings.
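The steps above can be sketched in Python as follows. The data layout is an assumption: each bone has a ranked list of (transform, score) pairs that ends with a (None, 0.0) “unmapped” entry, so a bone that runs out of candidates is simply left unmapped. The sketch also resolves one conflicting transform per call and lets the recursion handle any remaining ones.

```python
def get_best_child_match_choices(ranked, choices):
    """Resolve conflicts between sibling human bones.

    ranked[i] is bone i's list of (transform, score), best first, ending
    with (None, 0.0). choices[i] is the index bone i currently uses.
    Returns (choices, summed score).
    """
    # Step 1: look for a transform that more than one bone currently uses.
    users = {}
    for bone, choice in enumerate(choices):
        transform = ranked[bone][choice][0]
        if transform is not None:
            users.setdefault(transform, []).append(bone)
    conflict = next((b for b in users.values() if len(b) > 1), None)
    if conflict is None:
        total = sum(ranked[b][c][1] for b, c in enumerate(choices))
        return choices, total

    # Steps 2-5: let each conflicting bone keep its choice in turn while the
    # others fall back to their next choice, then keep the best outcome.
    best = None
    for keeper in conflict:
        alternative = [c + 1 if (b in conflict and b != keeper) else c
                       for b, c in enumerate(choices)]
        result = get_best_child_match_choices(ranked, alternative)
        if best is None or result[1] > best[1]:
            best = result
    return best

# Bone 0 = RingProximal, bone 1 = LittleProximal (scores are made up).
ranked = [
    [("Finger_2", 0.9), ("Finger_3", 0.8), (None, 0.0)],
    [("Finger_2", 0.85), ("Finger_3", 0.1), (None, 0.0)],
]
choices, score = get_best_child_match_choices(ranked, [0, 0])
```

With these made-up scores the greedy approach would give “Finger_2” to the ring finger (0.9) and leave the little finger with 0.1, for a total of 1.0. The exhaustive version instead returns `[1, 0]`, giving “Finger_3” to the ring finger and “Finger_2” to the little finger for a total of about 1.65.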
Since the search evaluates many hundreds of possible mappings for a typical avatar, the auto-mapping started to get a bit slow at some point. A lot of the time was spent on string handling related to trying to match keywords, but other parts of the evaluations also took up significant time. I realized that the same pair of human bone and transform was being evaluated many times as part of different possible mappings. The biggest optimization was achieved by caching the results of such an evaluation and simply using the cached result the next time the same pair needed to be evaluated. Different parts of the data needed to be cached differently because they needed different amounts of context. A keyword evaluation only needs the human bone and transform themselves, but the evaluation of bone length ratios needs the parent match and grandparent match as well, since they are used for calculating the lengths.
In the end I used a design where I cache the result of an entire EvaluateBoneMatch call based on a pair and the parent and grandparent pairs. If a potential pair needs to be evaluated and the cache already has an existing evaluation for the same pair with the same parent and grandparent, that result is used and the call to EvaluateBoneMatch is skipped altogether for that pair. If it doesn’t exist, EvaluateBoneMatch is called to do the evaluation, but some of the sub-routines that evaluate keywords etc. use their own caching that requires less context and hence is more likely to have already been evaluated before. Using these cached results sped up the evaluation times by a factor of 8. After that, it was fast enough that the entire Auto-Setup process was no longer a bottleneck. (It usually takes less than a second, which is a fraction of what the model import process takes.) Doing high-level optimization like this is often very satisfying since it can change the fundamental time complexity of an algorithm, which can often result in very big improvements.
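The two levels of caching can be illustrated with Python’s `functools.lru_cache`. This is a sketch under assumptions: pairs are represented as (human bone, transform name) tuples, the function names are invented, and the length ratio part of the evaluation is left out entirely. The `calls` counter exists only to show how often real work is done.

```python
from functools import lru_cache

calls = {"keyword": 0, "full": 0}

@lru_cache(maxsize=None)
def keyword_score(bone, transform_name):
    """Needs only the pair itself, so it is cached on the pair alone."""
    calls["keyword"] += 1
    return 1.0 if bone.lower() in transform_name.lower() else 0.0

@lru_cache(maxsize=None)
def evaluate_pair(pair, parent_pair, grandparent_pair):
    """Cached on the pair plus its parent and grandparent pairs.

    The length ratio evaluation, which is what actually needs the parent
    and grandparent, is omitted from this sketch.
    """
    calls["full"] += 1
    bone, transform_name = pair
    return keyword_score(bone, transform_name)

neck = ("Neck", "neck")
first = evaluate_pair(neck, ("Chest", "spine2"), ("Spine", "spine1"))
again = evaluate_pair(neck, ("Chest", "spine2"), ("Spine", "spine1"))  # full hit
other = evaluate_pair(neck, ("Chest", "spine3"), ("Spine", "spine1"))  # new parent
```

After these three calls the full evaluation has run twice, but the keyword evaluation only once: the third call misses the context-heavy cache yet still hits the cheaper keyword cache.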
The Auto-Mapping algorithm was designed to be primarily data-driven. I wanted to avoid a solution that had all kinds of hard-coded rules for different parts of the body, since code like that is often brittle and error-prone, and doesn’t lend itself to the kind of search-based solution I could see would be necessary for good results. Nevertheless, some compromises ended up being necessary. One compromise is that the search for the body mapping stops at the hands rather than including all the fingers. The fingers include more bones than the entire rest of the body, and excluding all that from the body mapping search reduced the search time significantly. Instead, the fingers are mapped by themselves starting from the hand bones found in the body mapping search.
Another thing that’s handled in a special way is the assumption about body orientation. As mentioned earlier, the search algorithm takes hints from the directions of the various bones. However, sometimes a model is imported in Unity with a completely arbitrary rotation and all the assumptions about bone orientations are wrong. The remaining hints are usually enough to map some parts of the body well enough, but the fingers, if present, are often mapped completely wrong in those cases. To counter this we check the positions of the left and right hip and shoulder after a mapping is completed, and derive the body orientation from those positions. If the body orientation is significantly different from the assumed one where +z is forward and +y is up, then we redo the entire mapping based on the now known actual orientation. In this second pass the bones are usually mapped with a much higher rate of correctness.
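One plausible way to derive an orientation from those four positions is sketched below in Python. The exact method used by the auto-mapping is not described here, so the averaging and the cross product are assumptions, as is the function name `body_orientation`. The component-wise cross product of right and up is used so that +x right and +y up yield +z forward, matching the assumed orientation.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def mid(a, b):
    return tuple((x + y) / 2.0 for x, y in zip(a, b))

def normalize(v):
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v) if length else (0.0, 0.0, 0.0)

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def body_orientation(left_hip, right_hip, left_shoulder, right_shoulder):
    """Estimate (right, up, forward) axes from four mapped joint positions."""
    # Right: from the middle of the left side to the middle of the right side.
    right = normalize(sub(mid(right_hip, right_shoulder),
                          mid(left_hip, left_shoulder)))
    # Up: from the middle of the hips to the middle of the shoulders.
    up = normalize(sub(mid(left_shoulder, right_shoulder),
                       mid(left_hip, right_hip)))
    forward = normalize(cross(right, up))
    return right, up, forward

# An upright character facing +z, as the first mapping pass assumes.
right, up, forward = body_orientation(
    (-0.1, 1.0, 0.0), (0.1, 1.0, 0.0), (-0.2, 1.5, 0.0), (0.2, 1.5, 0.0))
```

If `forward` and `up` come out far from (0, 0, 1) and (0, 1, 0), the model was imported with a different rotation and the direction hints from the first pass cannot be trusted.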
While implementing the automatic setup I also created a function for automatically testing and validating the automatic setup for a model. Every time we have come across a model where the auto-setup fails, we have added it to the testing framework, provided a sensible setup is possible to do manually in the first place. (Some models don’t have a proper skeletal hierarchy at all. Those are not compatible with Mecanim humanoid animation, so obviously auto-mapping won’t work for those.) At this time we have around 30 models in the framework, which include a lot of humans with different proportions as well as some fantasy and toon characters with completely different proportions and a few robots. Between them, the models have wildly varying rigs. While the auto-mapping is based on a solid algorithm, there is a lot of tweaking involved in determining how much each hint should contribute to the scoring. Having the testing framework has been essential for tweaking those parameters and immediately testing that a change fixes the intended edge cases without causing regressions for other characters.
If you have any characters that you CAN do a manual setup for, but which the automatic setup did not handle satisfactorily, you are welcome to send the models in question in a bug report. Include the words “avatar auto-mapping” in the first line of the description. We can’t guarantee that we’ll be able to make the auto-setup handle every single model correctly, but being aware of problematic cases provides us with a better basis for attempting it.
I’ve covered the primary functionality of the Auto-Mapping for humanoid characters in Unity. The Auto-Setup includes other functions as well, such as getting the character into T-pose, but those are outside the scope of this post, and not quite as interesting and challenging anyway. The Auto-Mapping is an interesting feature in that it contains quite advanced functionality yet is practically invisible to the user. Its very purpose is to be something you don’t have to think about at all. Of course it doesn’t always work out that way. For some models here and there the automatic mapping fails, and then you’re suddenly painfully aware of it. But for the majority of cases, the Humanoid Animation Type is just a setting you can enable, and then not have to think about. Instead you can get on with the real fun of animating your characters.