How we built a Dynamic Gesture Recognition system using Machine Learning (ML)
In this article, I will show you how we created a Gesture Recognition system based on Machine Learning (ML) techniques. I will focus on the several attempts we made to combine different models and compare their effectiveness at recognizing dynamic hand gestures registered with an RGB camera. Such detection has many applications; for example, it can serve as a control interface in Augmented Reality (AR) glasses.
We started working on the problem of recognizing dynamic hand gestures because we needed an interface to control applications in one of our latest projects, under the working name of Mixed Reality (MR) Glasses. One of the requirements was the ability to easily add custom gestures. Our goal was to recognize dynamic gestures including swipe left, swipe right, swipe up, swipe down, push, and click, as well as a “None” (no gesture) class.
In machine learning, the choice of an appropriate model for a new problem can rarely be made a priori; it often requires many attempts and meticulous tweaking of the model. That’s how it was in our case at xBerry. Below I present the subsequent attempts we made to solve the problem.
In the beginning, we implemented YOLO, a real-time object detection model:
You only look once (YOLO) is a state-of-the-art, real-time object detection system. On a Pascal Titan X it processes images at 30 FPS and has a mAP of 57.9% on COCO test-dev. YOLOv3 is extremely fast and accurate. In mAP measured at .5 IOU YOLOv3 is on par with Focal Loss but about 4x faster. Moreover, you can easily tradeoff between speed and accuracy simply by changing the size of the model, no retraining required! (source: https://pjreddie.com/darknet/yolo/)
A set of 2,500 static hand images enabled us to obtain satisfactory palm detection results using the YOLO model. Next, based on the obtained bounding boxes, we extracted a wide range of handcrafted features, e.g. the location of the center of the bounding box, the size of the bounding box, the velocity (speed and direction) of the center of the bounding box (within 2 seconds), Histograms of Oriented Gradients (HOG), and histograms of color distribution.
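To make the feature-extraction step concrete, here is a minimal sketch (in Python with NumPy, which the article does not specify) of how center location, box size, and center velocity can be derived from a sequence of YOLO bounding boxes; the function name and feature layout are illustrative, not our actual code.

```python
import numpy as np

def bbox_features(boxes, fps=30.0):
    """Derive simple trajectory features from a sequence of hand
    bounding boxes, each given as (x_min, y_min, x_max, y_max).

    Returns per-frame centers and sizes, plus frame-to-frame
    velocities of the bounding-box center in pixels per second.
    """
    boxes = np.asarray(boxes, dtype=float)
    centers = np.stack([(boxes[:, 0] + boxes[:, 2]) / 2,
                        (boxes[:, 1] + boxes[:, 3]) / 2], axis=1)
    sizes = np.stack([boxes[:, 2] - boxes[:, 0],
                      boxes[:, 3] - boxes[:, 1]], axis=1)
    # Velocity of the center between consecutive frames.
    velocities = np.diff(centers, axis=0) * fps
    return centers, sizes, velocities
```

HOG and color histograms would be computed from the image crop inside each box; they are omitted here to keep the sketch short.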
This allowed us to register how these features changed over time (two seconds) for individual dynamic hand gestures. We collected a total of several hundred samples, recorded in different environments, against different backgrounds, and under different lighting conditions. To further increase the size and value of our dataset, we applied Data Augmentation techniques to the collected samples, such as adding noise and variance, manipulating the hand-movement track, and changing colors. This allowed us to enlarge our training dataset synthetically.
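As an illustration of the kind of augmentation described above, the sketch below (NumPy-based, with hypothetical parameter values) adds Gaussian noise and randomly rescales a recorded hand-movement track around its mean:

```python
import numpy as np

def augment_track(track, rng, noise_std=2.0, scale_range=(0.9, 1.1)):
    """Create a synthetic variant of a recorded hand-movement track.

    `track` is a (T, 2) array of hand-center positions over time.
    We rescale the trajectory around its mean and add Gaussian
    noise; the transformations and parameters are illustrative.
    """
    track = np.asarray(track, dtype=float)
    mean = track.mean(axis=0)
    scale = rng.uniform(*scale_range)
    # Scale the whole trajectory around its centroid, then jitter it.
    jittered = mean + (track - mean) * scale
    jittered += rng.normal(0.0, noise_std, size=track.shape)
    return jittered
```

Each call produces a new plausible variant of the same gesture, which is how a few hundred recordings can be stretched into a larger training set.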
In the first attempt to solve the problem of recognizing dynamic hand gestures, we decided to check the performance of a Hidden Markov Model (HMM):
Hidden Markov Model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a Markov process with unobserved (i.e. hidden) states. The hidden Markov model can be represented as the simplest dynamic Bayesian network. (source: https://en.wikipedia.org/wiki/Hidden_Markov_model)
The purpose of choosing a probabilistic model was to calculate the probability of assigning the observed data to the appropriate dynamic gesture class. Unfortunately, this solution did not bring satisfactory results. Although explicit dynamic gestures like swipe left, swipe right, swipe up, swipe down, and push were detected quite well, it was difficult to determine the lack of a gesture.
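For readers unfamiliar with how an HMM scores a gesture, the sketch below implements the scaled forward algorithm in plain NumPy (our actual implementation and parameters are not shown here). One HMM is trained per gesture class, and an observation sequence is assigned to the class whose model yields the highest likelihood.

```python
import numpy as np

def log_likelihood(obs, start, trans, emit):
    """Scaled forward algorithm: log P(observation sequence | HMM).

    obs   : list of discrete observation symbols (e.g. quantized
            hand-movement features)
    start : (S,) initial state distribution
    trans : (S, S) state transition matrix
    emit  : (S, V) emission probabilities per state

    Rescaling alpha at every step avoids numerical underflow on
    longer sequences.
    """
    alpha = start * emit[:, obs[0]]
    log_p = np.log(alpha.sum())
    alpha = alpha / alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ trans) * emit[:, o]
        s = alpha.sum()
        log_p += np.log(s)
        alpha = alpha / s
    return log_p
```

Classification is then an argmax over per-class models, which is exactly where the “None” class struggles: no model is trained to assign a naturally high likelihood to the absence of a gesture.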
In the second attempt, we decided to use an LSTM (Long Short-Term Memory) network:
“LSTMs” – are a special kind of RNN, capable of learning long-term dependencies. (…) LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering information for long periods of time is practically their default behavior, not something they struggle to learn! (source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/)
The dataset of dynamic hand gestures described in the first attempt contains several hundred samples. Although that is not much for this kind of neural network, we decided to use it as the training dataset for our stateful LSTM model. Stateful LSTM networks give us fine-grained control over when the internal state of the network is reset; it is therefore important to understand how the different ways of managing this internal state during fitting and prediction affect the skill of the network. This attempt brought much better results than the previous one, but it still left some serious problems unresolved. The main issue was poor detection of the lack of a gesture: in such a situation, the system returned a random classification result. Detecting less dynamically distinctive gestures, like “click”, was also a significant problem. Apart from that, gestures like swipe left, swipe right, swipe up, swipe down, and push were detected quite effectively.
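In practice, “stateful” simply means that the hidden and cell states are carried over between batches instead of being reset to zero. The toy NumPy LSTM step below (sizes and weights are arbitrary, not our trained model) makes that carried state explicit:

```python
import numpy as np

def lstm_step(x, h, c, W, U, b):
    """One step of a standard LSTM cell.

    Gate pre-activations are stacked as [input, forget, cell, output].
    In a stateful setup, the caller keeps (h, c) across sequences;
    a stateless model implicitly resets them to zeros instead.
    """
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    z = W @ x + U @ h + b                      # shape (4H,)
    H = h.shape[0]
    i, f, g, o = z[:H], z[H:2*H], z[2*H:3*H], z[3*H:]
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)  # updated cell state
    h_new = sigmoid(o) * np.tanh(c_new)               # updated hidden state
    return h_new, c_new
```

Feeding the returned `(h_new, c_new)` into the next call is all that “remembering across batches” amounts to; deciding when to zero them is the state-management question mentioned above.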
In the third attempt, we decided to give up handcrafted features in favor of features learned by a convolutional network. One of the reasons for rejecting them (as well as the LSTM) was that we were not certain they would be sufficient to let users add custom gestures in the future. First, we moved YOLO to an auxiliary role: it is only used to locate the hand in the whole image, and once the CPM starts tracking, YOLO is no longer used. Then, we decided to take advantage of CPMs (Convolutional Pose Machines):
Pose Machines provide a sequential prediction framework for learning rich implicit spatial models. (…) convolutional networks can be incorporated into the pose machine framework for learning image features and image-dependent spatial models for the task of pose estimation. (source: https://arxiv.org/abs/1602.00134)
This model allowed us to track the positions of individual phalanges and record them over time. Based on that, we created a new dataset of dynamic hand gestures of about 0.5 s each. This training set was used to build a classifier using the Random Forest method.
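A minimal sketch of such a classifier, using scikit-learn (the article does not name a library) on synthetic data: each roughly 0.5 s gesture window is flattened into one feature vector of keypoint positions. The frame and keypoint counts are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical shape: 15 frames (~0.5 s at 30 FPS) x 21 hand
# keypoints x 2 coordinates, flattened per sample. Random data
# stands in for real CPM keypoint tracks.
rng = np.random.default_rng(0)
n_samples, n_frames, n_keypoints = 200, 15, 21
X = rng.normal(size=(n_samples, n_frames * n_keypoints * 2))
y = rng.integers(0, 8, size=n_samples)   # 8 classes incl. "None"

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
pred = clf.predict(X[:5])
```

An appealing property here is that adding a custom gesture only means recording a few new keypoint tracks and retraining the forest, which is fast compared with retraining a neural network.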
Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks, that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. Random decision forests correct for decision trees’ habit of overfitting to their training set. (source: https://en.wikipedia.org/wiki/Random_forest)
This combination of models (YOLO + CPMs + Random Forest) resulted in an effective solution for detecting all of the desired dynamic gesture classes.
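To show how the three models hand off to one another, here is a schematic sketch of the final pipeline; `detect_hand`, `track_keypoints`, and `classify` are hypothetical stand-ins for YOLO, the CPM tracker, and the Random Forest, and the window length is an assumption:

```python
def recognize(frames, detect_hand, track_keypoints, classify):
    """Schematic pipeline: YOLO bootstraps the hand location,
    the CPM tracks keypoints frame by frame, and the Random
    Forest classifies each ~0.5 s window of keypoint tracks.
    All three callables are hypothetical stand-ins.
    """
    window, tracking, bbox = [], False, None
    for frame in frames:
        if not tracking:
            bbox = detect_hand(frame)       # YOLO: locate the hand
            tracking = bbox is not None     # then hand off to CPM
            continue
        window.append(track_keypoints(frame, bbox))
        if len(window) == 15:               # ~0.5 s at 30 FPS (assumed)
            label = classify(window)        # Random Forest on the window
            window = []
            if label is not None:           # "None" class -> keep looking
                return label
    return None
```

The key design choice is that the expensive whole-image detector runs only until tracking starts, after which the lighter per-frame keypoint tracker takes over.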
To sum up the project in numbers:
- time spent solving the task: 3 months
- number of collected training samples (static images of a palm) for training YOLO: 2500
- number of collected video samples (dynamic hand gestures) for training LSTM: 3500
- number of collected video samples (dynamic hand gestures) for training Random Forest: 3500
- number of classes: 8 (including “None”)
- RGB camera
- Nvidia GTX970
The following conclusions emerge from the analysis.
- Application of the LSTM requires a much broader training set.
- Random Forest requires much less input data for reasonable training.
- Random Forest is much less computationally demanding than LSTM.