We’ve come up with something that I think is quite cool for conveying presence through our avatars. Interpretive Motion IK is a technique in which all kinds of inputs drive a small set of animated motions, and then the animation results and additional inputs both drive an Inverse Kinematics system. This gives us a rich set of built-in avatar behaviors, while also making use of an ever-evolving set of input devices to produce a dynamic and life-like result.
Why Aren’t Avatars More Like Us?
From static “head shots” in a text chat, to illustrations in a book and famous faces in a movie, avatars give us an intuitive sense of who is present and who is doing what. The story of an animated movie (or high-end game “cut scene”) is largely shown through the fluid motion of anthropoids. A movie studio can hire an army of artists, or record body actors in motion capture suits. But a virtual world does not follow a set script in which all activity can be identified and animated before use. Avatar animation must instead be generated in real time, in response to a world of possible activities.
This challenge leads some systems to show users as simplified images, as just a head, a disembodied “mask and gloves”, or a mostly un-animated “tin can” robot. This may be appropriate for specialized situations, but in the general case of unlimited high fidelity virtual worlds, the lack of whole-body humanoid animation fails to provide a fulfilling sense of personal immersion.
When it works well, the interpretation of motion is so strong that when another avatar turns to face your avatar, we describe it as “YOU can see ME”. In fact, the pixels on the screen have not turned, and cannot see. Think of the personality conveyed by the Pixar desk lamp hopping across the scene and looking at the camera, or the dancing skeletons of Disney’s early Silly Symphonies. Unlike Disney and Pixar, High Fidelity aims to capture this rich whole-body movement as a real-time result of dynamic user input. Alas, today’s input devices give us only incomplete data. Interpretive Motion IK allows us to integrate these clumsy signals into a realistic sense of human personality and action.
Inputs Drive Animated Motion
One part of the High Fidelity solution is to use the same kind of sophisticated state machine as is used in high-end video games. It starts with the idea that high-level input such as the keyboard arrow keys or joysticks can be interpreted as move forward, turn left, etc. The absence of such input can be interpreted as “idle”. The software keeps track of what state the user is in, interprets the inputs to move to another recognizable state, and plays the animation associated with the current state.
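In code, the heart of that loop can be surprisingly small. The following is only a minimal sketch of the idea, not High Fidelity’s actual implementation; the state names, `ControlInput` fields, and clip file names are illustrative assumptions.

```cpp
// Minimal sketch: high-level input is interpreted as a state, and each
// state has an associated canned animation clip. (Illustrative only.)
#include <iostream>
#include <string>

enum class AvatarState { Idle, WalkForward, TurnLeft, TurnRight };

struct ControlInput {
    bool forward = false;   // e.g. up-arrow or joystick pushed forward
    bool left = false;      // e.g. left-arrow
    bool right = false;     // e.g. right-arrow
};

// Interpret high-level input as a state; the absence of input means "idle".
AvatarState interpretInput(const ControlInput& in) {
    if (in.forward) { return AvatarState::WalkForward; }
    if (in.left)    { return AvatarState::TurnLeft; }
    if (in.right)   { return AvatarState::TurnRight; }
    return AvatarState::Idle;
}

// Each state maps to a canned animation clip (here just a file name).
std::string animationFor(AvatarState state) {
    switch (state) {
        case AvatarState::WalkForward: return "walk_fwd.fbx";
        case AvatarState::TurnLeft:    return "turn_left.fbx";
        case AvatarState::TurnRight:   return "turn_right.fbx";
        default:                       return "idle.fbx";
    }
}

int main() {
    ControlInput input;
    input.forward = true;                       // user holds the up arrow
    AvatarState state = interpretInput(input);  // -> WalkForward
    std::cout << "playing " << animationFor(state) << "\n";
}
```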
The animation is defined in terms of a set of “bones” for a stick-figure skeleton. The animation time span is divided evenly into a fixed set of moments called key frames, and each key frame defines the position of all the bones at that particular moment. “Playing an animation” means identifying the bone positions for the current playback frame, and arranging for the graphics system to modify the avatar model to match the resulting bone positions. Alas, the playback speed doesn’t always match the key frame frequency. Playback may need to be faster for smoothness, or may vary with machine activity. High Fidelity interpolates between key frames to define the motion for each actual playback frame.
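A minimal sketch of that sampling step might look like the following, assuming evenly spaced, looping key frames; the `Vec3`, `KeyFrame`, and `sampleClip` names are my own, and a real engine would interpolate bone rotations (typically with quaternion slerp) as well as positions.

```cpp
// Minimal sketch: sample an animation clip at a playback time that falls
// between evenly spaced key frames. (Illustrative, positions only.)
#include <cmath>
#include <vector>

struct Vec3 { float x, y, z; };

Vec3 lerp(const Vec3& a, const Vec3& b, float t) {
    return { a.x + (b.x - a.x) * t,
             a.y + (b.y - a.y) * t,
             a.z + (b.z - a.z) * t };
}

struct KeyFrame {
    std::vector<Vec3> bonePositions;  // one entry per bone in the skeleton
};

// Sample the clip at an arbitrary playback time, in key-frame units.
// e.g. frameTime = 3.25 blends 25% of the way from key frame 3 to 4.
std::vector<Vec3> sampleClip(const std::vector<KeyFrame>& keyFrames, float frameTime) {
    int count = static_cast<int>(keyFrames.size());
    int i0 = static_cast<int>(std::floor(frameTime)) % count;
    int i1 = (i0 + 1) % count;                 // clips loop, so wrap around
    float t = frameTime - std::floor(frameTime);

    std::vector<Vec3> result(keyFrames[i0].bonePositions.size());
    for (size_t bone = 0; bone < result.size(); ++bone) {
        result[bone] = lerp(keyFrames[i0].bonePositions[bone],
                            keyFrames[i1].bonePositions[bone], t);
    }
    return result;
}
```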
It would be very difficult and limiting to try to create an animation for every conceivable state and transition. To avoid discontinuities, the beginning of each animation would have to exactly match the end of the previous one. As with other high-end systems, we handle transitions not with a separate explicit state, but by allowing the last and next state to be active at once. Each playback frame is computed by interpolating from one animation to the next, weighted by how far we are into the next state. Any state can be fading in or out, any number of times.
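Continuing the sketch above, the cross-fade itself is just another interpolation, this time between the two states’ sampled poses; the quarter-second fade time in the usage comment is an arbitrary illustrative value.

```cpp
// Minimal sketch: cross-fade two states' poses during a transition,
// reusing Vec3, lerp(), and sampleClip() from the sketch above.
// transitionAlpha = 0 means all old state, 1 means all new state.
std::vector<Vec3> blendPoses(const std::vector<Vec3>& fromPose,
                             const std::vector<Vec3>& toPose,
                             float transitionAlpha) {
    std::vector<Vec3> out(fromPose.size());
    for (size_t bone = 0; bone < out.size(); ++bone) {
        out[bone] = lerp(fromPose[bone], toPose[bone], transitionAlpha);
    }
    return out;
}

// Per playback frame: advance the fade and blend the two animations, so the
// walk fades in over (say) a quarter of a second with no discontinuity.
// float transitionAlpha = std::min(1.0f, timeSinceStateChange / 0.25f);
// auto pose = blendPoses(sampleClip(idleClip, t), sampleClip(walkClip, t), transitionAlpha);
```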
There are also layerings of one animation over part of another. For example, a single animation of a hand might be useful even when the avatar is in various whole-body states (such as idle or walking or turning). “Bone masks” are used to specify that, for example, the hand and finger bones in the final result should come from the hand animation, while the whole position of the hand at the end of the arm should come from the whole-body animation.
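In the same sketch vocabulary, a bone mask reduces to a per-bone choice between two layered poses. The boolean mask here is an illustrative simplification; real systems often use per-bone blend weights rather than a strict on/off choice.

```cpp
// Minimal sketch: layer a hand animation over a whole-body animation with a
// "bone mask". Masked bones (hand and finger joints) take their pose from the
// hand clip; everything else keeps the whole-body pose. (Illustrative only.)
std::vector<Vec3> applyBoneMask(const std::vector<Vec3>& wholeBodyPose,
                                const std::vector<Vec3>& handLayerPose,
                                const std::vector<bool>& handMask) {
    std::vector<Vec3> out = wholeBodyPose;        // start from the body animation
    for (size_t bone = 0; bone < out.size(); ++bone) {
        if (handMask[bone]) {                     // hand/finger bones only...
            out[bone] = handLayerPose[bone];      // ...come from the hand clip
        }
    }
    return out;
}
```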
Ideally, each avatar model would come with its own rich set of these specific animations. However, the set of animations invoked by our state machine is constantly in development, and in any case is the result of specialized animator expertise. We want users of varying skills to be able to get avatars from multiple sources or create their own. We have, therefore, developed a standard set of conventions for avatar modeling, such that we can supply a standard set of such animations that works quite well with any conforming avatar model. Users can still supply their own animation for all or any of the states, but they do not have to.
All of the above techniques work together to allow a very large set of states to be represented by a fairly small amount of animation data. To avoid having each user machine compute or download the state and animations of all other visible avatars, each user’s machine computes the state and animation only of their own avatar, and a clever system sends only the changed bones to an “avatar mixer”, which in turn distributes only the “interesting” data to each participating machine.
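The wire format and threshold below are purely illustrative (the actual avatar mixer protocol is more involved), but they show the basic “send only what changed” idea, reusing the `Vec3` from the earlier sketches.

```cpp
// Minimal sketch: compare this frame's bone data against what was last sent
// and forward only bones that moved past a small threshold to the avatar
// mixer. (Illustrative only, not High Fidelity's actual protocol.)
#include <cstdint>
#include <vector>

struct BoneUpdate { uint16_t boneIndex; Vec3 position; };

std::vector<BoneUpdate> diffPose(const std::vector<Vec3>& lastSent,
                                 const std::vector<Vec3>& current,
                                 float threshold = 0.001f) {
    std::vector<BoneUpdate> changed;
    for (size_t bone = 0; bone < current.size(); ++bone) {
        Vec3 d = { current[bone].x - lastSent[bone].x,
                   current[bone].y - lastSent[bone].y,
                   current[bone].z - lastSent[bone].z };
        if (d.x * d.x + d.y * d.y + d.z * d.z > threshold * threshold) {
            changed.push_back({ static_cast<uint16_t>(bone), current[bone] });
        }
    }
    return changed;  // only these updates go to the avatar mixer
}
```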
This all gives us a flexible baseline of canned animation that can be invoked as needed in appropriate simultaneous combinations. But as described so far it is still limited to the states that we can build in and the animations we can create for it.
Moving Users Drive a Skeleton
The execution of canned animations is called “forward kinematics”: starting from the root of the skeleton at the hips, each bone is moved based on the rotation specified by the (blended and interpolated) animation key frame. Then the next bone is attached to that and rotated per the animation, and so forth, until we compute a position for each bone out to the tips of the skeleton tree (head, fingers, and toes).
The other way to compute a skeleton is called “inverse kinematics” (IK). It starts with a desired position and/or rotation for a bone (often near the tips), and then works backward to compute the rotations necessary for each parent bone to produce the target result. Of course, there are many (or infinitely many) such solutions. The engineer’s art is to produce a result that meets anatomical constraints and makes sensible and natural choices. We created an IK system that rotates (“swings”) bones to meet positional targets, twists them to meet rotational targets, uses a memory of the previous frame, and otherwise tracks toward or “prefers” a given underlying dynamic pose that we supply. The algorithm is such that the constraints and extra information do not take extra time to check. Instead, they are used to converge to a solution very quickly, allowing the entire whole-body skeleton to be constantly recomputed in real time.
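Our production solver handles swing and twist targets, anatomical constraints, frame-to-frame memory, and a preferred underlying pose, so the following is only a toy: a self-contained 2D cyclic-coordinate-descent (CCD) sketch of the core “swing each parent bone toward a positional target” step, with all names my own.

```cpp
// Toy sketch: 2D CCD inverse kinematics. Each joint stores its rotation
// relative to its parent; forward kinematics accumulates those rotations,
// and each CCD pass swings joints so the chain's tip approaches the target.
#include <cmath>
#include <vector>

struct Joint2D { float angle; float length; };   // rotation relative to parent

// Forward kinematics: walk the chain from the root, accumulating rotations.
static void forwardPositions(const std::vector<Joint2D>& chain,
                             std::vector<float>& xs, std::vector<float>& ys) {
    float x = 0, y = 0, a = 0;
    xs = { x }; ys = { y };
    for (const Joint2D& j : chain) {
        a += j.angle;
        x += j.length * std::cos(a);
        y += j.length * std::sin(a);
        xs.push_back(x); ys.push_back(y);
    }
}

// One CCD pass: for each joint from tip to root, swing it so the tip
// points more directly at the target. Repeating a few passes converges.
void ccdSolve(std::vector<Joint2D>& chain, float targetX, float targetY, int passes = 8) {
    std::vector<float> xs, ys;
    for (int pass = 0; pass < passes; ++pass) {
        for (int i = static_cast<int>(chain.size()) - 1; i >= 0; --i) {
            forwardPositions(chain, xs, ys);
            float tipX = xs.back(), tipY = ys.back();
            // Angle from this joint's base to the current tip, and to the target.
            float toTip = std::atan2(tipY - ys[i], tipX - xs[i]);
            float toTarget = std::atan2(targetY - ys[i], targetX - xs[i]);
            chain[i].angle += toTarget - toTip;    // swing toward the target
        }
    }
}
```

Each pass nudges every joint a little closer to pointing the chain’s tip at the target, which is why a handful of passes per frame is usually enough for a short chain like an arm.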
IK is used in robotics, and some researchers are starting to apply it to automatically produce animations. Games and modeling systems use it in specific situations, such as making an avatar’s hand follow a wall they are walking next to, or to retarget animations from one avatar model to another. At High Fidelity, we use IK to achieve a twin technical goal: mixing high-frequency, high-resolution sensors with an open, heterogeneous platform. Some of our users have only a mouse and keyboard and some have 2D or 3D cameras to track head or hand movement. Some have magnetic- or laser-registered in-palm sensors for hand position and rotation. Some have head mounted displays (such as the Oculus Rift) with rotation and even position tracking. But we have no foot sensors in use, and indeed many users stay at a desk rather than walking around a dedicated VR “cave” space.
The Magic
So here’s what we do: The full-body IK system runs all the time, on every frame (at 60-90 full updates per second). We ALSO run the full state machine and its animations on every frame, with the state machine being driven by the keyboard or controller buttons (e.g., arrow keys). If we don’t have any sensor information, the IK gets target information for head, hands, and feet from the animation result. However, if the user activates a hand controller, the IK hand target comes from that instead. Similarly for head rotation. (We are experimenting with various mechanisms for driving head position. We currently do not drive head position directly from sensors. Instead, we use head position information to heuristically compute a spine lean that is applied to the animation result, before the target positions are read by the IK. This allows a natural “head tracking” effect without making the head bob up and down.)
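Stripped of detail, the per-frame target selection is a simple override rule. The types and names below are illustrative assumptions (reusing the `Vec3` from the earlier sketches), and real targets carry rotations as well as positions.

```cpp
// Minimal sketch: IK targets default to the blended animation result, and any
// active sensor overrides its corresponding target. (Illustrative only.)
struct Pose { Vec3 position; /* rotation omitted for brevity */ };

struct IKTargets { Pose head, leftHand, rightHand, leftFoot, rightFoot; };

IKTargets chooseTargets(const IKTargets& fromAnimation,
                        const Pose* leftController,     // nullptr if not held
                        const Pose* rightController,    // nullptr if not held
                        const Pose* hmd) {              // nullptr if no HMD
    IKTargets targets = fromAnimation;                  // animation is the default
    if (leftController)  { targets.leftHand  = *leftController; }
    if (rightController) { targets.rightHand = *rightController; }
    if (hmd)             { targets.head      = *hmd; }  // rotation in practice;
                                                        // position drives a lean instead
    return targets;                                     // feet stay animation-driven
}
```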
Recall that our full-body IK solution uses any moment’s underlying animation bone state as a guide. The result is quite close to the animation (as modified by the lean) in any part where there is not direct sensor info. For example, triggers on the hand controllers also drive state, leading to animations for pointing, closing hands into a fist, or releasing an object. These finger animations are layered on top of those for walking, turning, idling, etc. The IK solution has no targets in the fingers, and so it just reproduces the underlying animation. Thus the arm, hand, and fingers seamlessly meld artist-created animations (in body and fingers) with IK-driven motion (in arms).
As another example, our state machine recognizes when the user is speaking into their microphone. When there is no hand controller or the hand controllers are on their resting cradle, the state machine plays a “talking with your hands” animation with both hands. However, if you then pick up one hand controller, that hand is driven directly by the controller, while the other continues to animate. Pick up both controllers, and you’re completely in charge of your hands’ positions and rotations, even as trigger-driven finger animations play at that hand position. (Microphone input also drives other animations that are handled separately. Wherever the head ends up, we use “morphs” or “mesh animation” for facial animation, which is driven by camera input and/or the microphone input. The avatar lip-syncs, smiles, blinks, etc. Meanwhile, we make the avatar’s eyes look exactly where the user is looking when eye tracking info is available, and otherwise simulate the data based on the facial features of nearby avatars.)
The sensors also drive the state machine. For example, suppose you are standing idle, but the combination of head and hand controllers produces a final IK solution in which you are leaning to one side. The avatar will have that lean, even though it is not in the underlying animation. If the lean exceeds a programmed limit (based on the physics of the posture), we change the whole avatar’s position. This then creates a velocity that changes the animation state, and the avatar is shown to take a step in that direction. It’s sort of like riding a Segway, with the avatar repositioning to keep its feet under your hips. This works for both “sitting” and “standing” (or “room scale”) uses of Head Mounted Displays (which other systems treat as separate operational modes that users must choose between).
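A toy version of that balance check might look like this; the threshold and names are illustrative assumptions, and the real test is based on the physics of the posture rather than a single distance.

```cpp
// Toy sketch: if the IK solution puts the hips too far from where the feet
// are planted, reposition the avatar's root just enough to rebalance. That
// repositioning creates a velocity the state machine turns into a step.
#include <cmath>

struct Vec2 { float x, z; };   // horizontal plane only

// Returns how far the avatar's root should move this frame
// (zero if the lean is still within the programmed limit).
Vec2 computeRootShift(const Vec2& hips, const Vec2& feetCenter, float maxLean) {
    Vec2 offset = { hips.x - feetCenter.x, hips.z - feetCenter.z };
    float dist = std::sqrt(offset.x * offset.x + offset.z * offset.z);
    if (dist <= maxLean) {
        return { 0.0f, 0.0f };                // balanced: no step needed
    }
    float excess = dist - maxLean;            // move just enough to rebalance
    return { offset.x / dist * excess, offset.z / dist * excess };
}
// The state machine sees the resulting velocity and plays a step in the
// matching direction, keeping the avatar's feet under the user's hips.
```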
Thus some inputs are interpreted as input to the state machine that produces animated motions. As in high-end games, this produces good automated combinations from a reasonably small set of artist-created motions. The new and evolving high-resolution/high-frequency sensors additionally produce a set of direct motions for head and hands. Animated and sensor motions then both feed into the same IK integration. As a result of all this, when a user grabs things, stacks blocks, throws balls, and such, their avatar reaches, bends, steps, and articulates in a realistic way. Interpretive Motion IK populates scenes with lots of natural movement, combining the best of artist-driven and sensor-driven animation.