Gestures are a highly intuitive and effective form of non-verbal communication. They are used in a wide diversity of domains, from verbal communication (e.g., to facilitate oral communication) and non-verbal communication to precision surgery, security, entertainment and sports, among many others. Because of this relevance, automated gesture recognition is a research topic of growing popularity in computer science, see e.g., (Aggarwal and Ryoo, 2011; Mitra, 2007). The availability of novel sensors, like Kinect, has made this task even more attractive to researchers, as one can reliably obtain the location of body-parts in real time.
Traditional approaches for the automated recognition of gestures learn a model (e.g., a hidden Markov model (Rabiner, 1989)) from a set of sample videos including the gestures of interest, where, commonly, the variation of the spatial positions of body-parts (e.g., hands) across time is used as input for the models. In general, the more examples we have for building a model, the better its performance is on new data (Aviles-Arriaga et al., 2011; Lee and Kim, 1999; Inoue and Ueda, 2003; Kim et al., 2007). However, in many domains gathering examples of gestures is a time-consuming and expensive process. Hence, gesture recognition methods that can learn from few examples are needed. It is also desirable that gesture recognition methods do not rely on specialized sensors to estimate body-part positions, or on the output of techniques for associated problems like hand detection/tracking or pose estimation (Eichner et al., 2012), as these techniques may introduce noise into the data acquisition process. Methods that can be trained from very few examples using unspecialized equipment would make gesture recognition more widely applicable: e.g., anyone with access to a webcam would be able to build gesture recognizers.
In this paper we approach the problem of gesture recognition by using a single example of each gesture to be recognized. This task, called one-shot gesture recognition, was proposed in the context of the ChaLearn gesture challenge (Guyon et al., 2012, 2013a). The targets of this type of method are user-adaptive applications that require the recognition of gestures from arbitrary and user-defined vocabularies; domains where gestures can change with time and models need to be modified periodically; and scenarios where gathering data is too expensive or users are not willing to spend time collecting large amounts of data.
For each gesture to be recognized, the only information we have for building a model is a single video recorded with a Kinect camera, where both RGB and depth videos are available. Although Kinect can record additional data (e.g., skeleton information), such data was disregarded in the ChaLearn gesture challenge. This favored the development of new methods that do not rely on a first step of skeleton extraction, which is often not robust to occlusions and requires spatial and temporal resolutions not available in many application settings. The problem is restricted to single-user gesture recognition, there is little variation in the background, and the user is placed right in front of the sensor. On the other hand, the problem is very challenging because a single example is available for each gesture; thus, traditional recognition methods (e.g., those based on HMMs) cannot be applied directly. Also, the gestures in the database used in this work were performed by different users with different skill levels; there is a wide diversity of gesture domains, ranging from highly dynamic (e.g., “aircraft-landing” signals) to static (e.g., “Chinese letters”); and some body-parts may be occluded (Guyon et al., 2013b). Additionally, the sampling rate is low (of the order of 12 fps). Clearly, standard gesture recognition methods are not directly applicable and, even though the problem has been simplified, it remains a difficult task.
We propose a simple, efficient, and yet very effective method for one-shot gesture recognition called principal motion components (PMC). The main goal of PMC was to act as a strong baseline for the ChaLearn gesture challenge (Guyon et al., 2012, 2013a, 2013b) and it has inspired several of the top ranking entries, see e.g., Wu et al. (2012b). The proposed method is based on a motion map representation that is obtained by processing the sequence of frames in a video. Motion maps are used in combination with principal component analysis (PCA) under a reconstruction-error classification approach. The proposed method was evaluated on a large database of gestures used in the ChaLearn gesture challenge (Guyon et al., 2012, 2013a, 2013b). We compare the performance of PMC to a wide variety of techniques. Experimental results show that the proposed method is competitive with alternative methods. In particular, we found that the proposed method is very effective for recognizing highly dynamic gestures, although it is less effective for static gestures. The proposed method can be improved in several ways and it can be used in combination with other approaches, see e.g., (Wu et al., 2012b; Cheemaa et al., 2013). The main contributions of this work are threefold.
The introduction of a new representation for motion in video, where we capture motion in successive frames through 2D motion maps. The proposed representation can be seen as a bag-of-frames formulation, where each video is characterized by the (orderless) set of motion maps it contains. The representation can be used with other methods for gesture recognition and it can be used for other tasks, e.g., for segmentation purposes using motion detection.
The proposal of a new one-shot gesture recognition approach based on PCA. Motion maps for a video (in a bag-of-frames representation) are used to generate a PCA model. The reconstruction error of the PCA models is used as the criterion for gesture recognition. Our proposal is capable of building a predictive PCA model from a single video without using any temporal ordering information.
The evaluation of the proposed method in a large-scale heterogeneous database and a comparison of it with a variety of alternative techniques. We show that the proposed method is effective for highly dynamic gestures. Several variants of the bag-of-frames representation (including representations based on HOG, HOF, STIP features) and different recognition techniques (classifiers and template methods) are considered in our study.
The rest of this paper is organized as follows. The next section reviews work closely related to our proposal. Section 3 introduces the principal motion components method. Section 4 describes the experimental settings adopted in this work and Section 5 reports experimental results. Finally, Section 6 presents the conclusions derived from this paper and outlines future work directions.
2 Related work
This section reviews related work on two key components of the proposed approach: motion-based representations and PCA-based recognition.
2.1 Motion-based representations
When it is not possible to track body-parts across a sequence of images, motion-based representations have been used for gesture recognition. Different approaches have been proposed, mainly based on template matching (Aggarwal and Ryoo, 2011; Aggarwal and Cai, 1999). The seminal work of Bobick and Davis (2001) used motion history images (MHIs) to represent videos, where MHIs are obtained by accumulatively adding (thresholded) binary-difference images. This type of template reveals information about the history of motion in a video (i.e., how movement happened). Statistical moments obtained from the MHIs were used for recognition. Davis (2001) extended the MHI representation to generate histograms of motion orientation. The MHI is obtained for each video and the resulting template is divided into spatial regions. Gradients of the motion values are obtained for each region separately, and a histogram is generated for each region using a set of predefined gradient orientations as bins. Per-region histograms are concatenated to obtain a 1D representation for each video. A similarity-based approach was used for recognition in that work.
Polana and Nelson (1994) represented a sequence of images by a spatiotemporal template. As preprocessing, the object of interest is isolated from the rest of the scene. Then, the sequence of cropped frames is processed to obtain optical flow fields. Flow frames are divided into a spatial grid and motion magnitudes are added in each cell. Agrawal and Chaudhuri (2003) obtained motion vectors (correspondences between blocks of pixels in adjacent frames) for successive frames and generated a 2D motion histogram, in which the occurrence of motion vectors is quantized. They used this representation for gesture recognition in a very small data set of easy gestures. Yi et al. (2005) proposed a representation, called Pixel Change Ratio Map (PCRM), based on motion histograms that account for the occurrence of specific values of motion in the video sequence. That is, the bins correspond to different (normalized) motion values. This approach is very similar to that of Davis (2001). However, under PCRM, the average motion energy in the cells of the grid is used instead of the orientation of gradients. The representation proved to be very effective for video retrieval, clustering and classification. Shao and Ji (2009) proposed a method for key frame extraction from video shots. The core of the method is a representation based on motion histograms. Optical flow fields are obtained for each frame, and a subset of the combinations of magnitude and direction of motion values is used as the bins of the motion histogram. Motion histograms, one per frame, are then processed to extract representative frames of the sequence.
Other approaches define motion histograms in terms of symbols derived from optical flow analysis (Pers et al., 2010); build classification models using motion histograms over voxels as features (Luo et al., 2010); and generate histograms of gradient orientations for static gesture recognition (Freeman and Roth, 1995).
In most of the approaches described above, a single template based on motion histograms is obtained to represent a whole sequence of frames. In our proposed representation, a motion map, accounting for the spatial distribution of motion across successive frames, is obtained for each difference image. This can be thought of as a relatively low-resolution 2D map, where each location accounts for the amount of motion at a given position, at a given time. However, we discard the time ordering of the various maps, and time is only taken into account by the fact that the maps are based on consecutive frame differences. Thus, by analogy to bag-of-words representations in text recognition that ignore word ordering in text, we can talk of a “bag-of-frames” type of representation, which is neither a template nor a time-ordered sequence of features. In this way, we have a set of observations (motion maps) associated with a single gesture, which can be used for the induction of classifiers. To the best of our knowledge, none of the methods described above has been evaluated in one-shot gesture recognition (Guyon et al., 2012, 2013a).
In the context of one-shot learning gesture recognition, template-based methods have been popular. A simple average template approach was the first baseline proposed by the organizers of the gesture recognition challenge, and it remained a difficult baseline to beat during the first weeks of the competition (Guyon et al., 2012). Mahbub et al. (2012) proposed a template matching approach for one-shot learning gesture recognition, where three ways of generating templates were proposed (2D standard-deviation, Fourier-transform and MHIs). For recognition the authors used the correlation coefficient to compare templates and testing videos. Wu et al. (2012a) proposed an extended MHI that incorporates gait energy information and inverse recording. Although the method obtained very good performance, it is difficult to assess the contribution of the recognition approach alone, as several pre-processing steps were performed beforehand (the authors mention that pre-processing improves the performance of their method). Other methods have been proposed in the context of the gesture recognition challenge, including probabilistic graphical models (Malgireddy et al., 2013) and techniques from manifold learning (Liu, 2012); these and other methods are summarized by Guyon et al. (2012, 2013a). In Section 5 we compare the performance of our proposal to these methods.
2.2 PCA for gesture recognition
The second component of our proposal is a PCA-based method for gesture recognition. PCA has been widely used in many computer vision tasks, including gesture recognition (Aggarwal and Ryoo, 2011; Polana and Nelson, 1994; Munoz-Salinas et al., 2008). In most cases, PCA has been used to reduce the dimensionality of the representation or to eliminate noisy and redundant information, see e.g., (Gweth et al., 2012); in fact, this is a common preprocessing step when facing any machine learning task (Guyon et al., 2006).
Some authors have used PCA for recognition (Turk and Pentland, 1991; Martin and Crowley, 1997; Gomez and Moens, 2012; Malagon-Borja and Fuentes, 2009). The most common approach consists of estimating the reconstruction error obtained after projecting the data into a PCA model as a measure of the likelihood that an instance belongs to a class. This recognition method was first reported in the seminal work of Turk and Pentland (1991) for face recognition. A similar approach was adopted by Martin and Crowley (1997) to classify hand postures to be used for gesture recognition by a high-level approach. Martin and Crowley (1997) used a large data set of images with diverse hand postures and used the PCA-reconstruction approach to classify hand postures. This approach has proved to be very effective in other domains as well (e.g., spam filtering (Gomez and Moens, 2012) and pedestrian detection (Malagon-Borja and Fuentes, 2009)). The reconstruction approach based on PCA has also been used for one-class classification and outlier detection (Tax, 2001; Hoffmann, 2007).
The motivation behind using a reconstruction-error approach for one-shot recognition stems from the fact that we do not know the underlying motion dimensions associated with a particular gesture, and we would like PCA to determine those dimensions automatically and to use that information for recognition. One should note that previous work has used the PCA-reconstruction approach with a data set of labeled instances, where many instances are available for each class. In our proposal, we have multiple observations taken from a single instance associated with a class (the bag-of-frames for a gesture). Another way of thinking of our model is that it acts as a single-state hidden Markov model for each gesture, with the PCA model representing an i.i.d. generating process. To the best of our knowledge, PCA has not been used in this way for recognition, not even for tasks other than gesture recognition.
3 Principal motion components
In general terms, the proposed principal motion components (PMC) approach to one-shot gesture recognition is as follows. The set of motion maps, i.e., the bag-of-frames representation, associated with a training video is processed to generate a PCA model for each gesture. When a new gesture needs to be recognized, its motion maps are extracted and projected into the PCA model of each gesture; the projected data are then reconstructed back and we measure the average reconstruction error. The underlying idea is that a PCA model can capture the important motion variation across frames and, therefore, motion maps obtained from the same gesture will be reconstructed better by the model of that gesture than by models associated with different gestures. Unintentional movements that are not directly related to the user will vanish with the reconstruction performed by PCA.
3.1 Representation: motion maps (bag-of-frames)
A video is a sequence of frames, where every frame is an image of the same width and height. We represent a video by a set of motion energy maps, one per frame. Each map accounts for the movement taking place between consecutive frames at fixed spatial locations of the frames.
To obtain motion maps, we first generate motion energy images by subtracting consecutive frames in the video (the number of difference images is set to match the number of frames in the video). Next, a grid of equally spaced patches is defined over the difference images. The size of the patches is the same for all of the images, see Figure 1. For each difference image, we estimate the average motion energy in each patch of the grid by averaging the motion values of the pixels within the patch. That is, we obtain a 2D motion map for each difference image, where each element of the map accounts for the average motion energy of the image at the corresponding 2D location. Each 2D map is then transformed into a 1D vector. Hence, each video is associated with a matrix with one row per frame and one column per patch. We call this matrix the bag-of-frames representation of the video under the motion maps characterization. Figure 2 shows motion maps for a subset of frames in a video. In the figure, motion maps are shown in temporal order although, in the proposed approach, the order of motion maps is not taken into account.
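The construction above can be illustrated with a minimal NumPy sketch. It is not the implementation used in this work: the function name `motion_maps`, the `patch` parameter, the use of absolute differences as motion energy, and the repetition of the last difference image to match the number of frames are all assumptions made for illustration.

```python
import numpy as np

def motion_maps(frames, patch):
    """Return a (n_frames, n_patches) bag-of-frames matrix.

    frames: array of shape (n_frames, height, width), grayscale video.
    patch:  side length (in pixels) of the square patches of the grid.
    """
    frames = np.asarray(frames, dtype=float)
    n, h, w = frames.shape
    # Motion energy images: differences of consecutive frames.
    # Taking the absolute value is an assumption of this sketch.
    diffs = np.abs(np.diff(frames, axis=0))
    # Assumption: repeat the last difference image so that there are
    # as many difference images as frames.
    diffs = np.concatenate([diffs, diffs[-1:]], axis=0)
    # Crop so that the image is covered by whole patches only.
    gh, gw = h // patch, w // patch
    diffs = diffs[:, :gh * patch, :gw * patch]
    # Average the motion values of the pixels inside each patch.
    grid = diffs.reshape(n, gh, patch, gw, patch).mean(axis=(2, 4))
    # Flatten each 2D map into a 1D vector (one row per frame).
    return grid.reshape(n, gh * gw)

# Toy video: 3 frames of 4x4 pixels with a single changing pixel.
frames = np.zeros((3, 4, 4))
frames[1, 0, 0] = 4.0
Y = motion_maps(frames, patch=2)
```

Here `Y` has one row per frame and one column per patch; shuffling its rows leaves the representation unchanged for the purposes of the recognition step, since the order of the maps is not used.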
For the implementation we adopted a more efficient approach to generate motion maps. Each motion energy image is downsized (e.g., via cubic interpolation) to a specified scale. Motion maps are obtained by concatenating the rows of the downsized images.
One should note that, as the proposed representation captures motion at fixed spatial locations, translation variations may have a negative impact on the motion maps representation. The extreme case arises when considering a large number of patches (e.g., when having one bin per pixel), resulting in a fine-grained map for which translation variance is a critical issue. In order to overcome this problem, we expand the motion information in each difference image by combining it with four copies of itself translated by a fixed gap of pixels to the left, right, up, and down, respectively. Basically, we are growing the region of motion to make the representation less dependent on the position of the user with respect to the camera.
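A minimal sketch of this expansion step is given next. The exact combination rule is not recoverable from the text here, so the sketch assumes that the difference image and its four translated copies are simply summed, with zeros filling the borders vacated by each translation; the names `expand_motion` and `gap` are hypothetical.

```python
import numpy as np

def expand_motion(diff, gap):
    """Grow the region of motion of a difference image.

    Assumption: the expanded image is the sum of the original image
    and four copies translated by `gap` pixels to the left, right,
    up and down (zero-filled at the borders).
    """
    diff = np.asarray(diff, dtype=float)
    h, w = diff.shape
    # Zero-pad by `gap` pixels on every side, then read shifted windows.
    padded = np.pad(diff, gap)
    center = padded[gap:gap + h, gap:gap + w]
    left = padded[gap:gap + h, 2 * gap:2 * gap + w]   # content moved left
    right = padded[gap:gap + h, 0:w]                  # content moved right
    up = padded[2 * gap:2 * gap + h, gap:gap + w]     # content moved up
    down = padded[0:h, gap:gap + w]                   # content moved down
    return center + left + right + up + down

# A single moving pixel spreads to its four neighbours at distance `gap`.
diff = np.zeros((5, 5))
diff[2, 2] = 1.0
expanded = expand_motion(diff, gap=1)
```

With this choice, motion concentrated in one patch also contributes to neighbouring patches, which is the effect the expansion is meant to have on the motion maps.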
Figure 3 shows motion maps extracted from videos depicting different gestures performed by different persons; row 1 shows a very dynamic gesture, whereas row 2 shows a static one. We can see that motion information is effectively captured by the proposed representation: as expected, the more dynamic the gesture (as depicted in the accompanying MHIs), the higher the values of the motion map. It is interesting that even the representation for the static gesture shows high motion energy values, which can be due to unintentional movement from the user that is not related to the gesture. The PCA model is expected to capture the main dimensions of motion and to limit the contribution of such noisy movements. From Figure 3 we can also see that the motion expansion emphasizes motion energy in neighboring patches (compare the leftmost and center images), which makes the representation more robust against variance in translation.
3.2 Recognition: PCA-based reconstruction
For recognition we consider a reconstruction-error approach based on PCA. Consider a training video representing a single gesture. We first compute its bag-of-frames representation (i.e., its matrix of motion maps), as explained in the previous section. Here the index over motion maps does not represent a time index, and the frames representing motion (converted into feature vectors) can be arbitrarily re-ordered. The modeling approach then consists of treating the feature vectors as training examples of a PCA model that globally represents the frames of that gesture. The principal components can be thought of as “principal motions”. Given a new video, also in a bag-of-frames representation, its similarity to the training video can be assessed by the average reconstruction error of the frames of the video under the PCA model.
Consider the set of training videos corresponding to a gesture vocabulary (e.g., “diving signals”), where each video corresponds to one gesture (e.g., the “out of air” gesture). We apply PCA to the bag-of-frames representation of each training video in this set. We center each matrix and apply singular value decomposition; we then store the top singular values from S together with the corresponding eigenvectors (i.e., the principal components), which form a matrix made of the first columns of the full matrix of eigenvectors. Hence, for each gesture in the vocabulary we obtain a PCA model represented by the pair formed by the retained singular values and the principal components.
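As an illustration, a PCA model of this kind can be sketched with NumPy as follows. The function name `fit_pca_model` and the parameter `n_components` are hypothetical. The sketch assumes motion maps are stored one per row, so the principal components are the right singular vectors of the centered matrix; it also returns the mean motion map, which is needed to center test data.

```python
import numpy as np

def fit_pca_model(Y, n_components):
    """Fit a PCA model to a bag-of-frames matrix Y (frames x patches).

    Returns the mean motion map, the top singular values, and a matrix
    whose columns are the corresponding principal components.
    """
    Y = np.asarray(Y, dtype=float)
    mean = Y.mean(axis=0)
    # Singular value decomposition of the centered matrix.
    _, S, Vt = np.linalg.svd(Y - mean, full_matrices=False)
    # np.linalg.svd returns singular values in descending order, so the
    # first rows of Vt are the leading principal directions.
    return mean, S[:n_components], Vt[:n_components].T

# Toy example with random "motion maps" (20 frames, 6 patches).
rng = np.random.default_rng(0)
Y = rng.normal(size=(20, 6))
mean, singular_values, components = fit_pca_model(Y, n_components=3)
```

One such model would be fitted per training video, that is, per gesture in the vocabulary.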
Figure 4 shows the principal motion components for a particular gesture vocabulary; the figure illustrates the benefits of the proposed approach. We can appreciate that the principal motion components indeed capture the intrinsic dimensions of motion of each gesture. By comparison, informative motion is not as clearly captured by competing motion-based representations, e.g., MHI (column 4) and the sequence of motion maps (column 5). For this particular vocabulary, the principal motion components can be easily associated by visual inspection with the image that visually describes the gesture (column 6).
A test video depicting a single gesture that needs to be classified is processed in the same way as the training videos; thus, it is represented by a matrix of motion maps. (We assume that each video to be processed depicts a single gesture. Gesture segmentation is an open problem by itself that we do not address in this paper, although we evaluate the performance of our method using gestures segmented both manually and automatically with a basic technique.) This matrix is projected into each of the spaces induced by the training PCA models, where the projection under a given PCA model is obtained as follows (Jolliffe, 2002):
The test matrix is first centered by subtracting a matrix in which each row is the average of the training motion maps of the corresponding PCA model, and the result is projected onto the principal components of that model. Next, the projections are reconstructed back into the original space using the transpose of the matrix of principal components.
We then measure the reconstruction error of the test video under each PCA model from the differences between the test matrix and its reconstruction, taken over the rows and columns of the matrix. Finally, we assign the gesture corresponding to the PCA model that obtained the lowest reconstruction error.
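The recognition rule can be sketched as follows. This is a hedged illustration rather than the exact formulation of this work: the helper names (`fit`, `reconstruction_error`, `classify`) are hypothetical, and the error is assumed to be the Euclidean norm of the difference between each motion map and its reconstruction, averaged over the motion maps of the test video.

```python
import numpy as np

def fit(Y, c):
    """PCA model of a bag-of-frames matrix: (mean, top-c components)."""
    mean = Y.mean(axis=0)
    _, _, Vt = np.linalg.svd(Y - mean, full_matrices=False)
    return mean, Vt[:c].T

def reconstruction_error(T, model):
    """Average per-row reconstruction error of T under a PCA model."""
    mean, components = model
    projected = (T - mean) @ components           # project
    reconstructed = projected @ components.T + mean  # reconstruct back
    # Assumption: Euclidean error per motion map, averaged over rows.
    return float(np.mean(np.linalg.norm(T - reconstructed, axis=1)))

def classify(T, models):
    """Return the index of the model with the lowest error, and all errors."""
    errors = [reconstruction_error(T, m) for m in models]
    return int(np.argmin(errors)), errors

# Two synthetic "gestures": motion confined to different patches.
rng = np.random.default_rng(0)
Y0 = np.zeros((30, 6)); Y0[:, :2] = rng.normal(size=(30, 2))
Y1 = np.zeros((30, 6)); Y1[:, 4:] = rng.normal(size=(30, 2))
models = [fit(Y0, 2), fit(Y1, 2)]

# A test video with motion in the same patches as the first gesture.
T = np.zeros((10, 6)); T[:, :2] = rng.normal(size=(10, 2))
label, errors = classify(T, models)
```

In this toy setting the test maps lie in the subspace spanned by the first model, so that model reconstructs them almost exactly and is selected; real motion maps are of course far less cleanly separated.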
Similar reconstruction-error approaches have been adopted for one-class classification (Tax, 2001), where instances of the target class are used to generate the PCA model and a threshold on the reconstruction error is used for classification. Reconstruction error has also been used for spam filtering (Gomez and Moens, 2012), face recognition (Turk and Pentland, 1991) and pedestrian detection (Malagon-Borja and Fuentes, 2009), see Section 2. One should note that in previous work a set of labeled instances has been used to generate the PCA model of each class, whereas under the proposed approach the elements of a single instance (the amount of motion in the frame differences under the bag-of-frames representation) are used. Besides the granularity, the main difference is that, in previous work, one can assume each instance is representative of the category, while in our setting the motion maps associated with a gesture are not necessarily representative of the gesture (e.g., similar motion maps may be shared by different gestures).
Figure 5 shows the difference images obtained by subtracting reconstructed motion maps from the original ones for a particular vocabulary (“helicopter”). Specifically, each image in the array of images depicts the difference between the average of the motion maps of one gesture and the average of the same motion maps reconstructed with the PCA model of a given gesture (e.g., images on the diagonal show the difference obtained when the original representations are reconstructed with the correct model). Only differences exceeding a fixed threshold are shown in the images. As expected, gestures reconstructed with the correct PCA model obtain differences lower than the threshold, while the reconstruction of gestures using other models results in large differences across the whole 2D space.
The main motivation for our recognition technique is the fact that principal components minimize the reconstruction error when projecting the data into the components’ space; it can be shown that this is equivalent to finding the directions that maximize the variance of the data, which is the best-known derivation of PCA, see e.g., Bishop (2006); Jolliffe (2002). Since the PCA model for a gesture is the one that minimizes the average reconstruction error for motion maps belonging to the corresponding video, this model should be the one (among the PCA models for all gestures) that best reconstructs new motion maps belonging to the same gesture. Clearly this is not a discriminant classifier, since the PCA model for a gesture is generated independently of the models for other gestures; hence, no inter-gesture information is captured by the PCA approach. Nevertheless, our experimental study in Section 5 reveals that even with this limitation the proposed approach performs better than supervised methods that use the bag-of-frames representation.
4 Experimental settings
We evaluate the performance of the principal motion components approach on the ChaLearn Gesture Dataset (CGD) (Guyon et al., 2013b). CGD comprises gestures divided into batches of 100 gestures each; the gestures were recorded in RGB and depth video using a Kinect camera. The data set was divided into development (480 batches), validation (20 batches) and additional batches for evaluation (40 batches, referred to as final batches). Each batch is associated with a different gesture vocabulary, and it contains exactly one video of each gesture in the vocabulary for training and several videos containing sequences of gestures taken from the same vocabulary for testing. The number of training videos (one per gesture) ranges from 8 to 12, depending on the vocabulary. There are 47 videos for testing in each batch, each containing a sequence of 1 to 5 gestures; hence, a gesture segmentation method has to be applied before recognition. The number of test gestures in each batch ranges from 88 to 92. About 20 different users contributed to the generation of gestures and there are about 30 different gesture vocabularies. See (Guyon et al., 2013b) for a comprehensive description of the CGD. It is important to mention that gesture vocabularies are quite diverse and come from many domains, e.g., see those mentioned in Table 1.
|Referee wrestling signals||Motorcycle signals||Diving signals|
|Surgeon signals||Taxi South Africa||Gang hand signals|
|Tractor operation signals||Chinese numbers||Mudra signals|
The CGD was developed in the context of the ChaLearn gesture challenge (http://gesture.chalearn.org/), an academic competition that focused on the development of gesture recognition systems under the one-shot-learning scenario (Guyon et al., 2012, 2013a). During the challenge, participants had access to the labels of all of the development batches (1-480), although most participants used only twenty batches (1-20) when developing their systems. This can be due to the fact that for those batches additional information was provided by the organizers (e.g., manual segmentation of test videos, hand tracking information, body-part estimates, etc.). Validation data was used by the organizers to provide immediate (on-line) feedback on the performance of participants’ methods. Final batches were used to evaluate the performance of the different methods. See (Guyon et al., 2013a) for more details on the ChaLearn gesture challenge.
The evaluation measure used in the challenge was the Levenshtein distance (normalized by the length of the truth labeling), which accounts for the number of edits (insertions, deletions and substitutions) that must be performed to transform a sequence of predictions into the ground truth labeling of a test video. In the next section we report experimental results on the CGD benchmark to evaluate the effectiveness of the principal motion components approach.
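For reference, a minimal sketch of this measure for a single test video is shown below. The function names are hypothetical, and how per-video distances are aggregated over a batch is not covered here.

```python
def levenshtein(pred, truth):
    """Minimum number of insertions, deletions and substitutions
    needed to turn the sequence `pred` into the sequence `truth`."""
    prev = list(range(len(truth) + 1))
    for i, p in enumerate(pred, start=1):
        curr = [i]
        for j, t in enumerate(truth, start=1):
            cost = 0 if p == t else 1
            curr.append(min(prev[j] + 1,          # delete p
                            curr[j - 1] + 1,      # insert t
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def normalized_score(pred, truth):
    """Levenshtein distance divided by the length of the truth labeling."""
    return levenshtein(pred, truth) / len(truth)

# Predicted gesture labels versus ground truth for one test video.
score = normalized_score([1, 2, 3], [1, 3])
```

Lower scores are better; a score of zero means that the predicted sequence of gestures matches the ground truth exactly.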
5 Experimental results
In this section we report results from experiments that aim at evaluating different aspects of the proposed approach. First, we evaluate the performance of our method in the whole CGD collection. Next, we evaluate the method under different parameter settings. Then we compare the proposed approach to a number of related techniques we implemented. Finally, we compare the performance of the principal motion components technique to other methods developed in the context of Chalearn’s gesture challenge.
As explained previously, videos must be segmented in order to isolate gestures prior to recognition. We report results of experiments using both manually segmented videos (batches 01-20 for development and validation only) and automatically segmented videos (all of the batches). For automatic segmentation we used a simple method based on dynamic time warping, which is also based on the motion maps representation (a time-ordered version at a very coarse resolution). This method was provided by the organizers of the ChaLearn gesture challenge; it is publicly available from the challenge website.
5.1 Performance over the whole collection
In a first experiment we applied the principal motion components approach to the whole GRC database of gestures using both RGB and depth video. Results in terms of the Levenshtein score are shown in Table 2. For this experiment all of the videos were automatically segmented. The translation gap was set to pixels, the scale for image downsizing was fixed to , while the number of principal components was set to ; our choices were based on the results obtained in a preliminary study, see section 5.2.
|Data set / Type||RGB||DEPTH|
|Devel01-480||0.4079 (0.2387)||0.4103 (0.2068)|
|Valid01-20||0.3178 (0.2030)||0.3189 (0.1891)|
|Final01-20||0.2747 (0.1842)||0.2641 (0.1971)|
|Final21-40||0.2124 (0.1404)||0.2263 (0.1362)|
The performance of our method on the 480 development batches was worse than that obtained on the final and validation batches. This can be due to the difference in the number of batches and the diversity of their vocabularies. On development batches, results using depth video are slightly worse than those obtained with RGB video; nevertheless, the difference in performance is not statistically significant according to a two-sample t-test. The corresponding differences for the validation and final batches were not statistically significant either. Thus, we can conclude that the proposed method performs similarly regardless of the type of information used: either RGB or depth video. This is advantageous, as we do not need a Kinect sensor to achieve acceptable recognition performance with our method; one should note, however, that the standard deviation of performance is lower for depth video (in the 480 development batches and in the validation batches); hence, when available, it would be preferable to use it.
We measured the average time the proposed approach takes to entirely process a batch (i.e., training the PCA models from the training videos and labeling all of the test videos; the time includes feature extraction and gesture segmentation). It corresponds to approximately one second per test video, which evidences the efficiency of the proposed method and indicates that it can be used in real-time applications. (Experiments were performed on a workstation with an Intel Core i7-2600 CPU at 3.4 GHz and 8 GB of RAM.)
The performance of our method on validation and final batches followed the same behavior as on development batches, although it is better. In fact, the performance of our method on validation and final batches is competitive with methods proposed by participants of the GRC. For example, the results on the final batches 1-20 from Table 2 would be ranked for the first round of the challenge, whereas those for the final batches 21-40 would be ranked , see Section 5.3. One should note, however, that this method was not designed to handle all cases (e.g., static gestures). Competitive methods also used some handshape features to recognize static gestures, and that is beyond the scope of this paper.
We now evaluate the impact of gesture segmentation on the performance of the principal motion components technique. Table 3 compares the performance obtained in batches 1-20 for development and validation data when using manually segmented gestures and the automatic segmentation approach. As expected, using manual segmentation improves the performance of our approach; nevertheless, the achieved improvements are modest. In fact, statistical tests did not find the differences to be statistically significant for either modality (RGB and depth video) or either set of batches (development and validation). Therefore, we can conclude that the principal motion approach can be applied using automated methods for gesture segmentation and still obtain competitive performance.
|Data set / Type||RGB||DEPTH||RGB||DEPTH|
5.2 Performance under different parameter settings
Recall that the only parameters of the proposed formulation are (the scale for downsizing the image, see Section 3.1), which is related to the size of the patches used to generate motion maps, and , the number of principal components used to generate PCA models, see Section 2.2. In a third experiment we aimed to determine to what extent varying the values of these parameters affects the performance of the proposed approach. We proceeded by fixing the value of one parameter and evaluating the performance of our approach when varying the other.
We start by analyzing the results in terms of the scale parameter (). For this experiment we fixed the number of principal components to . Results of this experiment are shown in Figure 6. It can be seen that for both modalities there is not too much variation in the performance of the method for the different values we consider. This is due in part to the region growing preprocessing described in Section 3.1. The best results were obtained when . Lower values of are preferred because the dimensionality of the motion maps is reduced and the proposed approach can be applied faster. Besides, the smaller the value of the larger the size of the patches for the motion maps and the more robust is the approach to variations in the position of the user with respect to the camera. For instance, for the dimensionality of the motion maps is of , the corresponding size of the patches is . Nevertheless, it can be seen from Figure 6 that for smaller values than the performance of principal motion components is worse.
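The relation between patch size and the dimensionality of a motion map can be illustrated with the sketch below. It is only an approximation under stated assumptions: the actual construction is defined in Section 3.1, which is not reproduced here, so the use of the mean absolute difference between consecutive frames over non-overlapping square patches, as well as the function name and the `patch` parameter, are hypothetical.

```python
import numpy as np


def motion_map(prev_frame, next_frame, patch):
    """Mean absolute frame difference over non-overlapping patch x patch cells."""
    diff = np.abs(next_frame.astype(float) - prev_frame.astype(float))
    rows = diff.shape[0] // patch
    cols = diff.shape[1] // patch
    diff = diff[:rows * patch, :cols * patch]  # drop incomplete border cells
    cells = diff.reshape(rows, patch, cols, patch)
    return cells.mean(axis=(1, 3))


# Toy 8x8 frames: motion appears only in the top-left quadrant.
frames = np.zeros((2, 8, 8))
frames[1, 0:4, 0:4] = 1.0
m = motion_map(frames[0], frames[1], patch=4)  # 2x2 map, motion in cell (0, 0)
```

Larger patches give a lower-dimensional map in which small shifts of the user fall into the same cell, which is the robustness argument made above.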
To analyze the influence of the number of components on the proposed technique, we fixed the value of the scale to and varied the number of principal components when building PCA models; experimental results are shown in Figure 7. It can be seen from these plots that, in general, the performance of principal motion components is poor when using few components, , for all the combinations of batches/modalities. The best performance for all of the batches/modalities was obtained when using a number of components ; the performance is somewhat stable for and then it decreases considerably. This result may suggest that the best value for is related to the number of gestures in the vocabularies (). Actually, the average vocabulary lengths for development and validation batches are 9.7 and 9.5, respectively. Nevertheless, we did not find a significant correlation between the best value for and the size of the vocabulary ().
We also evaluated the correlation between the best value of and the average and standard deviation of the length of training gestures, the minimum and maximum duration, and the entropy of the duration of training gestures, among other statistics. However, we did not find a statistically significant correlation value either. Thus, other aspects that have to do with the difficulty of vocabularies may have an impact on the optimal value for . In this regard, Table 4 shows information on the performance on each batch when using the optimal number of principal components for each of the development and validation batches (manual segmentation and RGB video were used).
Along with the performance obtained in each batch, Table 4 shows the optimal value of and some characteristics about the dynamism of gestures in batches. Interestingly, a few principal components are enough to obtain outstanding performance for some batches (e.g., “Referee-Volleyball1” (3), “Gestuno-disaster” (4), and “Helicopter” (5)), while a large value for is used for some batches and yet the performance is poor (e.g., “Taxi-SouthAfrica” (39), and “Mudra2” (34)). It seems that easier vocabularies (a lot of motion, movement across the whole image, small inter-class similarity) require fewer components than difficult ones (little motion, motion happening in small regions of the image, large inter-class similarity), although it is not easy to define what an easy/difficult vocabulary is.
Other interesting findings can be drawn from the results of this experiment. First, it can be seen that the principal motion components approach is very effective for some gestures. For example, performance similar to that of humans was obtained for the “Helicopter”, “Gestuno-disaster”, “Gestuno-topography”, “Tractor-Operation” and “Canada-Aviation” vocabularies. These are highly dynamic gestures where motion happens in different regions of the image; thus, the proposed approach can effectively capture the differences among gestures in the same vocabulary. In general, acceptable performance was obtained with the proposed approach when either the gesture is dynamic or the body of the user moves significantly when performing the gesture. The worst results were obtained for static gestures and for users who remained static when performing the gesture. This is a somewhat expected result, as our approach attempts to exploit motion information.
Table 5 shows the average performance one would obtain when selecting the optimal value for in each batch. The (hypothetical) relative improvements over the results reported in Table 3 range from to . Hence, it is worth pursuing research on methods for selecting the number of principal components for each particular batch or gesture. One should note, however, that the raw differences in performance are small: an improvement of (RGB/Devel/MANUAL) corresponds to a raw difference of in Levenshtein score. Development batches have more room for improvement than validation ones, a result that is consistent with previous ones.
|Data set / Type||RGB||DEPTH||RGB||DEPTH|
|Devel01-20||0.2351 (20.1%)||0.2351 (14.2%)||0.2749 (9.1%)||0.2635 (12.6%)|
|Valid01-20||0.2876 (8.7%)||0.2876 (8.23%)||0.2949 (7.2%)||0.2832 (11.1%)|
In summary, the principal motion components approach is rather robust to parameter selection. The scale parameter set to achieved the best results for most of the configurations we evaluated, although other values obtained competitive performance as well. Selecting the number of principal components remains a difficult challenge, yet acceptable performance can be obtained by fixing . Finally, we showed evidence suggesting that the principal motion components method is particularly well suited to vocabularies involving a lot of motion, and to cases where motion happens in different locations of the image.
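The role of the number of principal components can be made concrete with a minimal sketch of recognition by PCA reconstruction error. It assumes that each gesture is represented by a matrix whose rows are vectorized motion maps, that one PCA model is built per training video, and that a test video is assigned to the model with the lowest mean Euclidean reconstruction error. The exact error aggregation is defined in Section 2.2, which is not reproduced here, so that choice, the class and function names, and the synthetic data are all hypothetical.

```python
import numpy as np


class PCAModel:
    """PCA model of one gesture, built from its bag of motion maps."""

    def __init__(self, frames, k):
        # frames: array of shape (n_frames, n_features)
        self.mean = frames.mean(axis=0)
        _, _, vt = np.linalg.svd(frames - self.mean, full_matrices=False)
        self.components = vt[:k]  # top-k principal directions

    def reconstruction_error(self, frames):
        centered = frames - self.mean
        projected = centered @ self.components.T
        reconstructed = projected @ self.components
        residual = centered - reconstructed
        return float(np.mean(np.linalg.norm(residual, axis=1)))


def recognize(models, test_frames):
    """Return the label of the model that reconstructs the test frames best."""
    errors = {label: model.reconstruction_error(test_frames)
              for label, model in models.items()}
    return min(errors, key=errors.get)


# Synthetic stand-in for motion maps: each gesture moves in different cells.
rng = np.random.default_rng(0)


def synthetic_gesture(active_cells, n_frames=30, n_features=6):
    frames = np.zeros((n_frames, n_features))
    frames[:, active_cells] = rng.normal(size=(n_frames, len(active_cells)))
    return frames


models = {"wave": PCAModel(synthetic_gesture([0, 1]), k=2),
          "push": PCAModel(synthetic_gesture([2, 3]), k=2)}
predicted = recognize(models, synthetic_gesture([2, 3]))
```

In this toy setting two components suffice because each synthetic gesture varies in only two cells; with too few components every model reconstructs poorly, and with too many every model reconstructs well, which mirrors the sensitivity discussed in this section.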
5.3 Comparison with alternative methods
We now compare the performance of the principal motion approach to that obtained with alternative methods to solve the same one-shot learning problem. First we compare the performance of principal motion components to that of other techniques that are based on similar ideas/features. Next we compare the performance of the proposed technique to that obtained with other methods that were proposed during the Chalearn gesture challenge (Guyon et al., 2012, 2013a).
|HOG-I||HOG features from frames||PCR|
|HOG-M||HOG features from difference of frames||PCR|
|HOF-I||HOF features from frames||PCR|
|HOF-M||HOF features from difference of frames||PCR|
|HOG-SVM||HOG features from difference frames||SVM|
|HOF-SVM||HOF features from difference frames||SVM|
|MHI||Motion history image||TM|
|SMHI||Static-motion history image||TM|
For the first comparison we implemented the methods described in Table 6. The goal of this comparison is to assess whether using different features to represent the video, under the bag-of-frames formulation, could improve on the performance of the representation based on motion maps. We extracted the following (state-of-the-art) features widely used in computer vision: histograms of oriented gradients (HOG) (Dalal and Triggs, 2005); histograms of oriented optical flow (HOF) (Chaudhry et al., 2009); space-time interest points (STIP) with 3D HOG and HOF features (Wang et al., 2009); and motion history images (Bobick and Davis, 2001). 2D HOG and HOF features were extracted from the frames themselves (HOG-I, HOF-I) and from difference images (HOG-M, HOF-M). For STIP-based features we tried HOG-only, HOF-only and HOG+HOF 3D representations (Wang et al., 2009). The variants of HOG, HOF and STIP-based features were represented under the bag-of-frames representation. Additionally, two variants of motion history images were implemented: the standard approach (MHI) (Bobick and Davis, 2001), and another version that accounts for non-motion (SMHI). The latter variant was intended to help with highly static gestures.
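For reference, the standard motion history image of Bobick and Davis (2001) can be sketched as follows: a pixel is set to a maximum value when motion is detected and decays by one otherwise. Detecting motion by thresholding the absolute frame difference, and the names `tau` and `threshold`, are assumptions of this sketch; the static-motion variant (SMHI) is not reproduced because its definition is not given in this section.

```python
import numpy as np


def motion_history_image(frames, tau, threshold):
    """Standard MHI: tau where motion occurs, otherwise decay by 1 down to 0."""
    history = np.zeros(frames[0].shape, dtype=float)
    for prev, curr in zip(frames[:-1], frames[1:]):
        moving = np.abs(curr.astype(float) - prev.astype(float)) > threshold
        history = np.where(moving, float(tau), np.maximum(history - 1.0, 0.0))
    return history


# Toy 1x3 video: the first pixel changes first, the second pixel changes later.
video = np.array([[[0, 0, 0]],
                  [[1, 0, 0]],
                  [[1, 1, 0]]], dtype=float)
mhi = motion_history_image(video, tau=5, threshold=0.5)
```

More recent motion ends up brighter than older motion, so a single image summarizes where and when movement happened in the video.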
The different bag-of-frames representations were used for gesture recognition under the proposed PCA-based reconstruction-error technique. We also evaluated the recognition performance of supervised approaches using the same representations. For these methods, each vector of features (either motion maps, HOG, HOF, or 3D-HOG/HOF) is treated as an instance of a classification problem, where the class of the instance is the gesture from which the corresponding vector was extracted. In preliminary experimentation we tried several classification methods (linear discriminant analysis, neural networks, random forest, etc.); we report results for the best methods we found. For motion and static-motion history images we used a template matching approach for recognition (correlation). Experimental results obtained with the considered variants and with the principal motion components approach are shown in Figure 8.
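Template matching by correlation, as used for the history-image baselines, can be sketched as follows. The source only states that correlation was used, so the zero-mean normalized form, the function names and the toy templates are assumptions.

```python
import numpy as np


def normalized_correlation(a, b):
    """Zero-mean normalized correlation between two images of equal shape."""
    a = a.ravel() - a.mean()
    b = b.ravel() - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def match_template(templates, query):
    """Return the label of the template most correlated with the query."""
    return max(templates,
               key=lambda label: normalized_correlation(templates[label], query))


# Hypothetical 2x2 history images, one template per gesture.
templates = {"left": np.array([[5.0, 0.0], [4.0, 0.0]]),
             "right": np.array([[0.0, 5.0], [0.0, 4.0]])}
query = np.array([[0.0, 4.0], [0.0, 5.0]])
label = match_template(templates, query)
```

With a single template per gesture this amounts to nearest-neighbor classification under a correlation similarity.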
From Figure 8 it can be seen that principal motion components obtains the best performance for all but one of the configurations. HOG-M obtained the best results when using automatic segmentation and RGB video, the relative improvement was of . This result indicates the suitability of the reconstruction approach for one-shot gesture recognition under the bag-of-frames representation, which is not tied to a particular type of features. In fact, when using automatic segmentation the three methods: HOG-M, HOG-I and PMC obtained very similar results.
When manual segmentation was used, our approach outperformed the other methods by a considerable margin. The improvement over the nearest technique in performance (HOG-M) was of and for RGB and depth video, respectively. The widely used STIP features were not very useful for gesture recognition under either the bag-of-frames or the bag-of-visual-words formulation. This may be because a single video is not enough to capture discriminative features. Actually, none of the supervised approaches to one-shot gesture recognition performed well. This is not surprising, as the labeled samples are features that may overlap heavily across several gestures. It is interesting that the static-motion history images outperformed the standard MHI technique (Bobick and Davis, 2001).
Finally, we also compare the performance of principal motion components to that obtained by other authors who have used the ChaLearn Gesture Dataset (Guyon et al., 2013b). For this comparison we considered methods that have already been described in a scientific publication. The performance of the considered methods, as well as a brief description of each of them, can be seen in Table 7.
|MLS-Wu||Multi-layer Template+DTW||0.1950||Wu et al. (2012b)|
|GM-MM||Graphical model||0.2400||Malgireddy et al. (2013)|
|TM-Wu||Template matching||0.2600||Wu et al. (2012a)|
|PMC-M||PMC / Manual segmentation||0.2696||-|
|MF-LIU||Manifold learning||0.2873||Liu (2012)|
|PMC-A||PMC / Automatic segmentation||0.2890||-|
|TM-Mahbub||Template matching||0.3746||Mahbub et al. (2012)|
|TM-2-Mahbub||Template matching||0.3125||Mahbub et al. (2013)|
|GM-MM||Graphical model||0.2332||Malgireddy et al. (2013)|
|TM-Wu||Template matching||0.2968||Wu et al. (2012a)|
|PMC-A||PMC / Automatic segmentation||0.3178||-|
It can be observed from Table 7 that the performance of the proposed approach is competitive with that obtained by the different methods. The best performance reported so far in a scientific publication is that of Wu et al. (2012b). It is interesting that this method uses principal motion components as a preliminary step in its multi-layer architecture. Roughly, our method is used to determine whether a gesture is dynamic or static. Dynamic gestures are treated with a method based on particle filtering and a tailored dynamic time warping; static ones are processed with a novel method that incorporates contextual information.
The performance of our automatic approach is close to that obtained by Malgireddy et al. (2013) and Liu (2012). The former authors implemented a graphical model inspired by hidden Markov models that have been used for keyword spotting; both modalities (RGB and depth video) are used by the model. On the other hand, Liu (2012) represents videos using a method based on higher-order singular value decomposition; recognition is done via least-squares regression for manifolds. Both approaches obtained outstanding performance on state-of-the-art data sets for human activity recognition and standard gesture recognition; in addition, they achieved acceptable results on data from the ChaLearn Gesture Challenge. The principal motion components approach obtained performance comparable to those techniques; hence, it is worth exploring the performance of our method on other closely related tasks.
Regarding the ChaLearn Gesture Challenge, the latest version of principal motion components would be ranked and in stages one (http://www.kaggle.com/c/GestureChallenge/leaderboard) and two (http://www.kaggle.com/c/GestureChallenge2/leaderboard), respectively. Principal motion components was proposed as a baseline method, whose simplicity and ease of implementation motivated participants to develop better methods. In this respect we accomplished our goal and exceeded it by motivating other researchers to build better methods on top of our proposal.
We introduced a novel gesture recognition approach for the one-shot learning setting called Principal Motion Components. The proposed approach represents the frames of a video by means of maps that account for the amount of motion happening in spatial regions of the video. The bag of motion maps is used with a PCA-based recognition approach in which reconstruction error is used as a measure of gesture affinity.
We report experimental results in a large data set with gestures, and two video modalities. Experimental results show that the proposed approach is very competitive, despite being simple and very efficient. The proposed method can work with RGB or depth video and obtain comparable performance. Likewise, the performance of the method does not degrade significantly when using manual or automatic gesture segmentation. We compare the performance of our approach to alternative methods we implemented ourselves and those reported by other researchers. Our approach compared favorably with some techniques and obtained close performance to others. We analyze the performance of our approach under different parameter settings and show characteristics of gestures that can be effectively recognized with it. This study revealed that the proposed approach is well suited for highly dynamic gestures.
There are several future work directions we would like to explore. First, we would like to study the suitability of the principal motion components approach for related tasks, including gesture segmentation, keyframe extraction and motion-based retrieval. Also, we are interested in developing alternative recognition methods that use the bag-of-frames representation. Other interesting areas for research include developing a hierarchical principal motion components formulation, and extending the proposed representation to spatiotemporal features.
- Aggarwal and Cai (1999) Aggarwal, J. K., Cai, Q., 1999. Human motion analysis: a review. Computer Vision and Image Understanding 73 (3), 428–440.
- Aggarwal and Ryoo (2011) Aggarwal, J. K., Ryoo, M. S., 2011. Human activity analysis: a review. ACM Computing Surveys 43 (3), Art. 16.
- Agrawal and Chaudhuri (2003) Agrawal, T., Chaudhuri, S., 2003. Gesture recognition using motion histogram. In: Proceedings of the Indian National Conference of Communications. pp. 438–442.
- Aviles-Arriaga et al. (2011) Aviles-Arriaga, H. H., Sucar, L. E., Mendoza, C., Pineda, L., 2011. A comparison of dynamic naive Bayesian classifiers and hidden Markov models for gesture recognition. Journal of Applied Research and Technology 9 (1), 81–102.
- Bishop (2006) Bishop, C. M., 2006. Pattern Recognition and Machine Learning. Springer, New York.
- Bobick and Davis (2001) Bobick, A. F., Davis, J. W., 2001. The recognition of human movement using temporal templates. IEEE Trans. Pattern Anal. Mach. Intell 23 (3), 257–267.
- Chaudhry et al. (2009) Chaudhry, R., Ravichandran, A., Hager, G., Vidal, R., 2009. Histograms of oriented optical flow and binet-cauchy kernels on nonlinear dynamical systems for the recognition of human actions. In: Proceedings of the IEEE Conference on Pattern Recognition and Computer Vision. IEEE, pp. 1651–1657.
- Cheemaa et al. (2013) Cheemaa, M. S., Eweiwia, A., Bauckhage, C., 2013. Human activity recognition by separating style and content. Pattern Recognition Letters In press.
- Dalal and Triggs (2005) Dalal, N., Triggs, B., 2005. Histograms of oriented gradients for human detection. In: Proceedings of the IEEE Conference on Pattern Recognition and Computer Vision. pp. 886–893.
- Davis (2001) Davis, J. W., 2001. Hierarchical motion history images for recognizing human motion. In: Proceedings of the IEEE Workshop on Detection and Recognition of Events in Video. pp. 39–46.
- Eichner et al. (2012) Eichner, M., Marín-Jiménez, M. J., Zisserman, A., Ferrari, V., 2012. 2D articulated human pose estimation and retrieval in (almost) unconstrained still images. International Journal of Computer Vision 99 (2), 190–214.
- Freeman and Roth (1995) Freeman, W. T., Roth, M., 1995. Orientation histograms for hand gesture recognition. In: Proceedings of the IEEE International Workshop on Automatic Face and Gesture recognition.
- Gomez and Moens (2012) Gomez, J. C., Moens, M. F., 2012. PCA document reconstruction for email classification. Computational Statistics and Data Analysis 56, 741–751.
- Guyon et al. (2013a) Guyon, I., Athitsos, V., Jangyodsuk, P., Escalante, H., Hamner, B., 2013a. WDIA: Advances in Depth Image Analysis and Applications. Vol. 7854 of LNCS. Springer, Ch. Results and Analysis of the ChaLearn Gesture Challenge 2012, pp. 186–204.
- Guyon et al. (2013b) Guyon, I., Athitsos, V., Jangyodsuk, P., Escalante, H. J., 2013b. The chalearn gesture dataset (cgd 2011). Submitted.
- Guyon et al. (2012) Guyon, I., Athitsos, V., Jangyodsuk, P., Hamner, B., Escalante, H. J., 2012. ChaLearn gesture challenge: design and first results. In: Proceedings of the Conference on Computer Vision and Pattern Recognition Workshops. IEEE, Rhode Island, USA, pp. 1–6.
- Guyon et al. (2006) Guyon, I., Gunn, S., Nikravesh, M., Zadeh, L. (Eds.), 2006. Feature extraction, foundations and applications. Vol. 207 of Studies in Fuzziness and Soft Computing. Springer.
- Gweth et al. (2012) Gweth, Y. L., Plahl, C., Ney, H., 2012. Enhanced continuous sign language recognition using pca and neural network features. In: Proceedings of the Conference on Computer Vision and Pattern Recognition Workshops. IEEE, Rhode Island, USA, pp. 555–60.
- Hoffmann (2007) Hoffmann, H., 2007. Kernel pca for novelty detection. Pattern Recognition 40 (3), 863–874.
- Inoue and Ueda (2003) Inoue, M., Ueda, N., 2003. Exploitation of unlabeled sequences in hidden markov models. IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (12), 1570–1581.
- Jolliffe (2002) Jolliffe, I. T., 2002. Principal Component Analysis, 2nd Edition. Springer, New York.
- Kim et al. (2007) Kim, D., Song, J., Kim, D., 2007. Simultaneous gesture segmentation and recognition based on forward spotting accumulative hmms. Pattern recognition 40 (11), 3012–3026.
- Lee and Kim (1999) Lee, H. K., Kim, J. H., 1999. An hmm-based threshold model approach for gesture recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 21 (10), 961–973.
- Liu (2012) Liu, Y. M., 2012. Human gesture recognition on product manifolds. Journal of Machine Learning Research 13 (2012), 3297–3321.
- Luo et al. (2010) Luo, A., Kong, X., Zeng, G., Fan, J., 2010. Human action detection via boosted local motion histograms. Machine Vision and Applications 21 (3), 377–389.
- Mahbub et al. (2012) Mahbub, U., Imtiaz, H., Roy, T., Rahman, S., Rahman-Ahad, M. A., 2012. A template matching approach to one-shot-learning gesture recognition. Pattern Recognition Letters In press.
- Mahbub et al. (2013) Mahbub, U., Roy, T., Rahman, S., Imtiaz, H., Serikawa, S., Rahman-Ahad, M. A., 2013. A template matching approach to one-shot-learning gesture recognition. In: Proceedings of the 1st International Conference on Industrial Application Engineering. pp. 186–193.
- Malagon-Borja and Fuentes (2009) Malagon-Borja, L., Fuentes, O., 2009. Object detection using image reconstruction with pca. Image and Vision Computing 27 (1-2), 2–9.
- Malgireddy et al. (2013) Malgireddy, M. R., Nwogu, I., Govindaraju, V., 2013. Language-motivated approaches to action recognition. Journal of Machine Learning Research Accepted.
- Martin and Crowley (1997) Martin, J., Crowley, J. L., 1997. An appearance-based approach to gesture recognition. In: Proceedings of the 9th International Conference on Image Analysis and Processing. Vol. 1311 of LNCS. Springer, pp. 340–347.
- Mitra (2007) Mitra, S., 2007. Gesture recognition: a survey. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews 37 (3), 311–324.
- Munoz-Salinas et al. (2008) Munoz-Salinas, R., Medina-Carnicer, R., Madrid-Cuevas, F., Potayo, A. C., 2008. Histograms of optical flow for efficient representation of body motion. Pattern Recognition Letters 29, 319–329.
- Pers et al. (2010) Pers, J., Sulic, V., Kristan, M., Perse, M., Polanec, K., Kovaviv, S., 2010. Histograms of optical flow for efficient representation of body motion. Pattern Recognition Letters 31, 1369–1379.
- Polana and Nelson (1994) Polana, R., Nelson, R., 1994. Low level recognition of human motion. In: Proceedings of the IEEE workshop on Motion of Non-Rigid and Articulated Objects. pp. 77–82.
- Rabiner (1989) Rabiner, L., 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77 (2), 257–286.
- Shao and Ji (2009) Shao, L., Ji, L., 2009. Motion histogram analysis based key frame extraction for human/activity representation. In: Proceedings of the Canadian Conference on Computer and Robot Vision. IEEE, 92, p. 88.
- Tax (2001) Tax, D., 2001. One-class classification. Ph.D. thesis, Delft University of Technology.
- Turk and Pentland (1991) Turk, M., Pentland, A., 1991. Eigenfaces for recognition. Journal of Cognitive Neuroscience 3 (1), 71–86.
- Wang et al. (2009) Wang, H., Ullah, M. M., Klaser, A., Laptev, I., Schmid, C., 2009. Evaluation of local spatio-temporal features for action recognition. In: Proceedings of the British Machine Vision Conference. pp. 1–11.
- Wu et al. (2012a) Wu, D., Zhu, F., Shao, L., 2012a. One-shot learning gesture recognition from rgbd images. In: Proceedings of the Conference on Computer Vision and Pattern Recognition Workshops. IEEE, Rhode Island, USA, pp. 7–12.
- Wu et al. (2012b) Wu, S., Pan, W., Jiang, F., Gao, Y., Zhao, D., 2012b. A multiple-layered gesture recognition system for one-shot learning. In: ICPR 2012 Gesture recognition workshop.
- Yi et al. (2005) Yi, H., Rajan, D., Chia, L. T., 2005. A new motion histogram to index motion content in video segments. Pattern recognition letters 26, 1221–1231.