Gesture recognition is all about the software: Lessons from Microsoft Kinect and Leap

When David Holz, CTO of Leap Motion, first told me not to focus too much on the company’s hardware because the real magic was in the software, I only half believed him. But after the Build conference, and listening to Microsoft Kinect dev lead Alisson Sol do a masterful job of describing exactly what went into evolving Kinect gesture recognition from the first version forward, I was convinced.

Gesture recognition is all about the software. While competent hardware is a requirement, it doesn’t need to be earth-shattering. A teardown of the Leap – revealing three off-the-shelf LEDs and two inexpensive cameras – proved that. What is crucial is plenty of time and effort invested in the hard work of getting software algorithms to accurately figure out what they are seeing. Microsoft’s process for developing its grip and release recogniser for Kinect for Windows is an excellent case study of exactly how that process works.

Machine learning: Making your computer do the programming

For decades, computer science has focused primarily on how to conserve computer resources through clever pre-constructed algorithms. In some ways, machine learning turns this process on its head. With the incredibly low cost of CPU and GPU cycles, it is now practical to throw large amounts of data at the computer and let it sort things out. The process is a lot more complicated than that, and doing it well is something of an art, but the results are perfectly suited for various kinds of recognisers – including those for gestures.

Before you can set your computer off to learn, you need to collect a lot of quality data. In the case of Kinect, the data is many gigabytes of depth-map videos from dozens of subjects performing all types of gestures. In addition, those video segments have to be hand-tagged to indicate at which points the subjects performed each gesture. The process is basically one large piece of old-fashioned scientific data collection. The human-tagged data is referred to as the “ground truth” – essentially the gold standard the recogniser will be measured against.
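As a rough illustration of what hand-tagged ground truth can look like, each recording might carry a list of gesture segments keyed by frame index. The structure and names below are invented for the sketch, not Microsoft’s actual format:

```python
# Hypothetical representation of hand-tagged ground truth: each recording
# lists the frame ranges in which a subject performed a gesture.
from dataclasses import dataclass

@dataclass
class GestureTag:
    gesture: str      # e.g. "grip" or "release"
    start_frame: int  # first frame of the gesture
    end_frame: int    # last frame of the gesture (inclusive)

@dataclass
class TaggedRecording:
    subject_id: str
    tags: list

def frame_labels(recording, total_frames):
    """Expand per-segment tags into one label per frame ('none' by default)."""
    labels = ["none"] * total_frames
    for tag in recording.tags:
        for f in range(tag.start_frame, tag.end_frame + 1):
            labels[f] = tag.gesture
    return labels

rec = TaggedRecording("subject-01", [GestureTag("grip", 10, 14),
                                     GestureTag("release", 30, 33)])
labels = frame_labels(rec, 40)
```

Expanding segment tags into per-frame labels like this is what lets a recogniser’s per-frame output be scored directly against the ground truth.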

You also need to make sure you are solving the right problem. Previous teams working on the grip recogniser at Microsoft had focused on trying to decide if hands were open or closed, and they’d got stuck. Sol’s group took the approach of looking directly for the grip and release gestures – which was really what they needed to know to implement a grasp-based user interface.

Turning data into features

Once you’ve got a large set of tagged data, the next step is deciding which attributes – called features – of the data are important to making the recognition decision. This is as much an art as a science, and you’re likely to get it wrong at least once. The features you choose also need to be relatively cheap to compute: the Kinect for Windows team gave themselves just 2ms to recognise grips and releases, for example.
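A compute budget like that 2ms figure can be checked with a simple timing harness during development. The feature function here is just a stand-in for illustration, not anything from the team’s code:

```python
# Time a per-frame feature computation against a budget (2ms, per the article).
import time

def extract_feature(depth_frame):
    # Stand-in feature: mean depth over the region of interest.
    return sum(depth_frame) / len(depth_frame)

frame = [1000 + (i % 7) for i in range(128 * 128)]  # fake 128x128 depth ROI

start = time.perf_counter()
value = extract_feature(frame)
elapsed_ms = (time.perf_counter() - start) * 1000.0

within_budget = elapsed_ms < 2.0  # the team's 2ms target
```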

In the case of the grip gesture, the team initially used the number of pixels far from the centre of the hand (as reported by the skeletal tracking subsystem) as the main feature to feed its machine learning algorithm. Unfortunately and unexpectedly, the reported hand position wasn’t stable enough for the feature to be highly accurate, so the team had to develop a hand stabilisation algorithm that took into account the various positions and motions of the hand during gesture recognition.
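A minimal sketch of that kind of feature – counting hand pixels that fall outside a radius around the reported hand centre – might look like this. The names, threshold, and toy data are all illustrative:

```python
# Count depth pixels farther than `radius` from the reported hand centre.
def pixels_far_from_centre(depth_roi, centre, radius):
    """depth_roi: dict mapping (x, y) -> depth for pixels belonging to the hand."""
    cx, cy = centre
    far = 0
    for (x, y) in depth_roi:
        if (x - cx) ** 2 + (y - cy) ** 2 > radius ** 2:
            far += 1
    return far

# Toy example: four "hand" pixels around a reported centre of (5, 5).
hand_pixels = {(5, 5): 900, (6, 5): 905, (12, 5): 910, (5, 13): 915}
count = pixels_far_from_centre(hand_pixels, (5, 5), radius=4)
```

The sketch also shows why an unstable hand centre hurts: shifting `centre` by a few pixels changes which pixels count as “far”, and with it the feature value.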

This approach yielded a pretty mediocre result, so the team looked around for other ways to sift through the data. They decided on using the frame-to-frame difference in the depth map at each pixel to signal grip or release. Essentially every pixel would “vote” on whether it thought a gesture was happening, based on whether it was farther from or closer to the Kinect than the frame before.
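The voting idea can be sketched as follows; the direction-to-gesture mapping and the noise threshold are assumptions made for the illustration, not details from the talk:

```python
def vote_gesture(prev_frame, curr_frame, noise_mm=2):
    """Majority vote over pixels: 'grip', 'release', or 'none'.

    Assumed mapping for this sketch: pixels moving toward the sensor
    vote grip, pixels moving away vote release.
    """
    closer = farther = 0
    for prev, curr in zip(prev_frame, curr_frame):
        diff = curr - prev
        if diff < -noise_mm:
            closer += 1       # depth shrank: pixel moved toward the sensor
        elif diff > noise_mm:
            farther += 1      # depth grew: pixel moved away
    if closer > farther:
        return "grip"
    if farther > closer:
        return "release"
    return "none"

prev = [900, 900, 900, 900]   # depth values (mm) for a few hand pixels
curr = [890, 892, 900, 905]   # most pixels moved closer between frames
```

The noise threshold matters: without it, sensor jitter of a millimetre or two would cast spurious votes on every frame.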

Letting your output write the code

Rather than the typical programming model of developing code that produces a desired output, machine learning systems like those used by the Kinect team start from sets of desired output (the tagged data collected earlier) and let a number-crunching machine learning algorithm produce a recogniser – essentially machine-generated code – that can then be run on real data to identify the target gestures.

Number crunching their newly selected feature quickly became a big data problem. At 30 fps the training data contains over 100,000 frames per hour, each with about 300,000 pixels (more for the new Kinect). Even looking at just the 128 x 128 pixel region of interest surrounding each hand, there are over 16,000 pixels per hand – more than 65,000 per frame across four tracked hands – to be analysed. The extracted features are then fed into a machine learning system, typically a variant of one of several openly available implementations of an algorithm like SVM (Support Vector Machines).
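The arithmetic behind those figures is straightforward:

```python
# Back-of-the-envelope numbers for the data volumes quoted above
# (640 x 480 is the original Kinect's depth resolution).
fps = 30
frames_per_hour = fps * 60 * 60            # 108,000 frames of training data
pixels_per_frame = 640 * 480               # about 300,000 depth pixels
roi_pixels = 128 * 128                     # 16,384 pixels per hand region
roi_pixels_per_frame = roi_pixels * 4      # four tracked hands per frame
```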

Sol downplayed the differences between the various machine learning algorithms, basically telling us that with enough data they converge to very similar results. For this project his team used the ID3 (Iterative Dichotomiser 3) method to create decision trees for the recogniser to use. ID3 works by determining which feature provides the most information, splitting the data on that feature, and repeating on each subset until the examples are classified or the features are exhausted. If the features selected originally are sufficient to the task, the result is code that can be tested against more ground truth data. If the feature selection or the data is off, the work has to be redone and new features identified and measured.
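The core of ID3 – picking the feature with the highest information gain – can be sketched in a few lines. This is textbook ID3 over binary features, not the team’s production code:

```python
# Entropy and information gain, the quantities ID3 uses to pick a feature.
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Expected entropy reduction from splitting on `feature` (a column index)."""
    total = len(labels)
    remainder = 0.0
    for value in set(row[feature] for row in rows):
        subset = [l for row, l in zip(rows, labels) if row[feature] == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder

# Toy data: two binary features; the label simply copies feature 0,
# so feature 0 is maximally informative and feature 1 is useless.
rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = ["no", "no", "yes", "yes"]
gains = [information_gain(rows, labels, f) for f in (0, 1)]
best = max((0, 1), key=lambda f: gains[f])
```

ID3 would split on `best`, then recurse on each resulting subset with the remaining features.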

Don’t just test, analyse

Many research papers on machine-learning projects end with a quick “thumbs-up, thumbs-down” assessment of their results based on the percentage of success running against their test data set. For consumer products like Kinect, Sol explained why that approach isn’t nearly good enough. To reach the high bar of acceptance in the market, Microsoft used thousands of test subjects and built sophisticated tools to help it analyse each type of failure to further improve the algorithm. One example he gave was hand velocity. For obvious reasons the hand position data from the skeletal tracking system is less accurate during fast movements, so the grip recogniser had to be modified to take that into account.
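In spirit, that kind of failure analysis amounts to bucketing misclassified frames by the conditions they occurred under – here by hand speed, with all data, field names, and thresholds invented for illustration:

```python
# Group recogniser failures by hand speed to see where accuracy breaks down.
from collections import defaultdict

failures = [
    {"speed_mm_s": 50,  "error": "missed_grip"},
    {"speed_mm_s": 700, "error": "missed_grip"},
    {"speed_mm_s": 900, "error": "missed_grip"},
    {"speed_mm_s": 120, "error": "false_release"},
]

def bucket(speed):
    return "fast" if speed > 500 else "slow"

by_speed = defaultdict(int)
for f in failures:
    by_speed[bucket(f["speed_mm_s"])] += 1
# A skew toward "fast" would point at tracking accuracy during rapid movement,
# which is exactly the kind of finding that drove the velocity fix.
```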

Similarly, given the huge number of frames being analysed, even an apparently rock-solid accuracy of 99.9 per cent would still produce roughly a hundred mistakes every hour. Instead of relying on simple accuracy metrics, the testers had to take into account real-world performance data. Addressing the various glitches the testers found took several iterations of the recogniser code. One iteration was devoted to the fact that left- and right-hand images can’t be treated as pure mirror images of each other, because lighting and shadows are not symmetric. As you can imagine, all this number crunching took a lot of computer time: even running on an 80-core cluster, cranking out a grip recogniser to test took nearly a week each time.
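The per-hour arithmetic makes the point concrete:

```python
# Error count per hour at 30 fps if the recogniser is 99.9 per cent accurate.
fps = 30
frames_per_hour = fps * 60 * 60          # 108,000 frames
errors_per_hour = frames_per_hour * (1 - 0.999)
# About a hundred misclassified frames every hour - far too many for a
# consumer product, which is why per-failure analysis mattered.
```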

As the final step, the team enlisted Microsoft Research to help make the recogniser fast enough to run within its 2ms window. Tricks like skipping some of the pixels helped, and the final result ships as grip-enabled controls in the Kinect for Windows SDK version 1.7 – a highly accurate recogniser behind very usable gesture-driven controls that developers can use and learn from.

Similarly, while Leap Motion hasn’t yet been as forthcoming about its development process, it is clear that its software magic has turned a simple set of off-the-shelf parts into one of the most powerful gesture recognition systems on the market.

To learn more, check out Alisson Sol’s talk from Microsoft Build 2013. You can also read up on the ID3 Algorithm.