Improved Optical Flow for Gesture-based Human-robot Interaction

by   Jen-Yen Chang, et al.
The University of Tokyo

Gesture interaction is a natural way of communicating with a robot as an alternative to speech. Gesture recognition methods leverage optical flow in order to understand human motion. However, while accurate optical flow estimation (i.e., traditional) methods are costly in terms of runtime, fast estimation (i.e., deep learning) methods' accuracy can be improved. In this paper, we present a pipeline for gesture-based human-robot interaction that uses a novel optical flow estimation method in order to achieve an improved speed-accuracy trade-off. Our optical flow estimation method introduces four improvements to previous deep learning-based methods: strong feature extractors, attention to contours, midway features, and a combination of these three. This results in a better understanding of motion, and a finer representation of silhouettes. In order to evaluate our pipeline, we generated our own dataset, MIBURI, which contains gestures to command a house service robot. In our experiments, we show how our method improves not only optical flow estimation, but also gesture recognition, offering a speed-accuracy trade-off more realistic for practical robot applications.






I Introduction

In our society, people have a constant need to communicate with each other (e.g., via voice or gestures). Although voice is normally preferred for its directness, verbal communication is sometimes not possible due to environmental factors such as noise, or because one of the interlocutors cannot communicate verbally, e.g., due to a speech or hearing disability, or because the interlocutor speaks another language or has a very strong accent. In such cases, humans use gestures as an alternative modality for interaction [1]. Meanwhile, robots are increasingly being adopted in our daily life. There has been a rising demand for domestic service robots for entertainment purposes [2], taking care of elders or hospital patients [3], and carrying out domestic tasks. Examples of these robots are the Human Support Robot (HSR) developed by Toyota and Pepper developed by Softbank Robotics. The speech recognition modules installed in house service robots (e.g., DeepSpeech [4] or Google Speech API) have achieved near human-level performance. However, as with communication between humans, human-robot verbal communication is also affected by factors such as noisy environments, a malfunctioning microphone, or situations in which the robot and the human do not “speak” the same language. In order to achieve a more reliable human-robot interaction (HRI), an alternative way to communicate with the robot is necessary. As with humans, gestures provide an alternative modality for communication, and thus, gesture recognition plays an important role in HRI.

Fig. 1: Overview of our pipeline. The input video from the robot’s camera is used to estimate the optical flow with our novel method. Then, through optical flow-based gesture recognition, the robot’s operating system (e.g., ROS) selects which module to run.

In computer vision, gestures can be recognized using an action recognition method [5] [6]. These methods include optical flow estimation in their pipeline to facilitate understanding the dynamic nature and the spatio-temporal features of gestures. Current optical flow estimation methods have two limitations: (1) they result in a blurry representation of the human silhouette that misses a lot of detail, and (2) they are time-costly. Algorithms with an excessive runtime are not suitable for real-time gesture-based HRI. If recognizing a gesture takes too long (e.g., 30 s), communication is heavily hindered. On the other hand, if recognition is fast but the accuracy is poor, recognition mistakes will make communication impossible.

In this work, we present a gesture recognition pipeline (Fig. 1) designed for human-robot interaction that achieves a good trade-off between speed and accuracy. In order to obtain a finer spatio-temporal representation of human motion, we propose a novel optical flow estimation method that pays attention to the human body silhouette, and thus achieves a better recognition accuracy. Our contributions are:

  • Three novel optical flow estimation methods that incorporate attention to allow for better gesture recognition.

  • An optical-flow based gesture recognition pipeline with improved speed-accuracy trade-off, designed for human-robot interaction.

  • A self-generated dataset for human-robot interaction, MIBURI, designed to command the robot to complete household tasks.

II Related work

II-A Human-Robot Interaction Via Gestures

Nowadays, most human-robot interaction methods are speech recognition-based [7] [8]. Unfortunately, the use of speech recognition is somewhat limited, due to environmental, hardware, or linguistic problems (Sec. I). Gesture communication serves as a natural alternative to speech [1]. Previous works [9, 10] study the effect of robots performing gestures to communicate with humans. However, our interest is in humans commanding the robot through gestures. Most existing works on human-robot interaction (HRI) via gestures focus on hand gesture recognition [11]. However, hand gestures limit the motion to only the hand part of the human body, which may not be intuitive enough for general human users to perform. Unlike these works, we also refer to motions involving the whole body as gestures.

II-B Gesture Recognition and Optical-Flow based Gesture Recognition

In robot vision, gestures can be recognized using action recognition methods that combine appearance and motion. Whereas color-only methods [12] can recognize certain activities such as sports, their representation of motion is not fine enough to differentiate between similar gestures. In contrast, optical flow is a better representation of motion. Deep learning-based action recognition using optical flow achieved outstanding results with the two-stream network approach [5] developed by Simonyan et al. They trained two separate convolutional neural networks to explicitly capture spatial features from color images and temporal features from optical flow. Finding Action Tubes [13] extended the work of Simonyan et al. by proposing candidate region descriptors instead of full images in order to localize the action and improve the recognition accuracy. However, the trade-off between the runtime and the accuracy of gesture recognition can still be improved. Recent approaches such as Temporal Segment Networks (TSN) [6] and [14] further extended the two-stream approach. More concretely, TSN proposed to concatenate video frames into segments for a better speed-accuracy trade-off.

II-C Optical Flow Estimation

In optical flow-based gesture recognition, such as the previously described [14] and [6], researchers estimated optical flow from the input images with traditional methods, namely TV-L1 [15] and the work of Brox et al. [16]. These optical flow estimation methods reached state-of-the-art accuracy, but they are slow. To tackle this, researchers leveraged deep learning techniques. The first attempt at a deep optical flow estimation network, FlowNet [17], proposed two networks, namely FlowNetS and FlowNetC, and greatly improved the runtime of previous work. Zhu et al. [18] developed an optical flow estimation method called DenseNet by extending a deep network for image classification, and demonstrated that it outperforms other unsupervised methods. FlowNet 2.0 [19] is a follow-up to FlowNet [17]. It stacks two optical flow estimation networks (FlowNetS and FlowNetC) and achieves state-of-the-art results comparable to traditional optical flow methods while running at 8 to 140 fps depending on the desired accuracy (the higher the accuracy, the slower the runtime). However, one of the main problems of FlowNet 2.0 [19] is its model size (parameter count). Since the architecture is stacked, the parameter count is naturally high. Recently, some methods have focused on obtaining the same accuracy while reducing the model size [20].

III Optical Flow-based Gesture Recognition

Figure 1 depicts an overview of the basic pipeline of our gesture recognition method for human-robot interaction (HRI). The RGB video of the user performing a gesture is fed to our optical-flow estimation method, and then, the resulting optical flow is used to recognize the gesture.

We implemented our modules by taking into account the trade-off between the recognition accuracy and speed. While most gesture/action recognition methods employ optical flow (Sec. II-C), they mainly use traditional optical flow estimation. We argue that using deep learning for estimating optical flow results in a better trade-off between gesture recognition runtime and accuracy, and thus, is adequate for HRI. We also argue that adding an attention mechanism to deep learning optical flow estimation results in a higher accuracy, not only for optical flow estimation (reduces blurriness in contours), but also for gesture recognition (refines the user’s silhouette).

Following the same criteria, we feed our optical flow to the TSN action recognition network [6] (Sec. II-B), as it features state-of-the-art accuracy with short recognition time. That is, approximately one second on a GTX 1080 GPU, for action recognition on 120 images (4 seconds of video at 30fps).
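As a rough illustration of the segment-based sampling idea behind TSN [6], the sketch below splits a clip into equal segments and picks one frame index per segment. The function name `segment_center_indices`, the choice of three segments, and the use of the segment centre (rather than a random position) are assumptions made here for a deterministic example; they are not taken from our implementation.

```python
def segment_center_indices(num_frames, num_segments):
    """Return one frame index per segment, taken at the centre of each segment.

    Hypothetical helper: a deterministic stand-in for segment-based sampling.
    """
    if num_frames <= 0 or num_segments <= 0:
        raise ValueError("num_frames and num_segments must be positive")
    indices = []
    for k in range(num_segments):
        start = k * num_frames // num_segments        # first frame of segment k
        end = (k + 1) * num_frames // num_segments    # one past the last frame
        indices.append((start + end) // 2)            # centre of the segment
    return indices


# 120 frames correspond to the 4 s clip at 30 fps mentioned above.
print(segment_center_indices(120, 3))  # [20, 60, 100]
```

Sampling a few snippets per clip instead of processing every frame is what keeps the recognition stage short relative to the clip duration.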

The next section explains in detail our contribution in optical flow estimation.

Fig. 2: Proposed optical flow estimation networks. Top left: original FlowNetS [17]. Top right: AttFlowNet{Res, Inc, NeXt32, NeXt64}. Bottom left: MidFlowNet{Res, Inc, NeXt32, NeXt64} [17]. Bottom right: AttMidFlowNet{Res, Inc, NeXt32, NeXt64}. For all figures below, orange blocks are convolutional layers (feature extractors), blue blocks are transposed convolutional layers, green blocks are optical flow estimators, yellow blocks are attention calculators. Numbers indicate layer size (same number, same size).

IV Optical flow estimation

We define optical flow estimation as the task of predicting motion pixel-wise in two consecutive frames. It can be also seen as a way of understanding the spatio-temporal features of movements, actions and gestures. We aim to improve estimation accuracy in three ways: by using stronger image feature extractors, by adding attention to the estimation, and by emphasizing mid-level feature representations. Based on this, we propose three novel optical flow estimation networks.

IV-A Stronger feature extractors

Optical flow estimation aims to predict how pixels move, and thus, we have to understand images at pixel level. That is, first we extract features from each pixel of an image, then we match the features between two consecutive images, and finally we estimate pixel-level movement. Therefore, we can expect that a better understanding of the images results in a more accurate matching. As shown in [18], in optical flow estimation, feature extractors trained on image classification tasks are recommended for unsupervised learning.

Our novel optical flow estimation adds stronger feature extraction blocks (i.e., blocks capable of extracting more discriminative features, as tested in image classification tasks) to an existing baseline estimation network. We chose FlowNetS as our base network due to its well-balanced speed-accuracy trade-off. For this purpose, we resorted to four feature extractors widely used for classifying the ImageNet dataset [21]: ResNet [22], Inception [23], ResNeXt (4x32d block) and ResNeXt (4x64d block) [24]. We replaced the simple convolutional layers present in FlowNetS [17] with an adaptation of ResNet, Inception and ResNeXt. We named these optical flow estimation networks FlowNetRes, FlowNetInc, FlowNeXt32 and FlowNeXt64, respectively (Fig. 2).

IV-B Attention module

Once the feature extraction is improved, we target the optical flow estimation. Ideally, pixels belonging to the same entity (e.g., a hand) should have similar optical flow values. This makes the contours between different entities sharper. To achieve this, our novel network adopts an attention mechanism that was originally used for putting emphasis on keywords in the field of natural language processing [25]. This attention mechanism enlarges the difference between dissimilar optical flow values, so estimations for pixels within the same moving entity will be similar. The overall formula for applying attention to a layer is

$$y = \mathrm{sigmoid}\big((1 + A(x)) \cdot x\big)$$

where $x$ is the input features to the layer, $y$ is the output of the layer, and $A$ is a set of convolutional layers (details in Table I). When $A(x)$ is 0, the output features will be the same as the original features. Finally, a sigmoid activation function is applied. Thus, large estimations are enlarged whereas smaller estimations remain similar. Attention allows preserving the silhouettes of the moving parts of the image. Fig. 2 (top-right) shows a schematic diagram, derived from the original FlowNetS [17]. We named these optical flow estimation networks AttFlowNetRes, AttFlowNetInc, AttFlowNeXt32, and AttFlowNeXt64, respectively. Note that, from this point on, our networks also upscale the estimation back to the original resolution, since the final estimation of FlowNetS [17] is only 1/4 of the original resolution.

Name       | Kernel | Strides | Input      | Ch (in → out)
attconv1   | 3x3    | 2       | x          | x → x/4
attconv2   | 3x3    | 2       | attconv1   | x/4 → x/8
attdeconv1 | 4x4    | 2       | attconv2   | x/8 → x/4
attdeconv2 | 4x4    | 2       | attdeconv1 | x/4 → x
TABLE I: Architecture of our attention module
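The effect of the attention term can be illustrated with a small element-wise sketch. It assumes the attention is applied in residual form, i.e. the features are scaled by one plus the attention value and then passed through a sigmoid, which is one reading of the description above and not a confirmed implementation. The attention values are passed in directly here, whereas in the network they would come from the convolutional layers of Table I; the names `sigmoid` and `apply_attention` are hypothetical.

```python
import math


def sigmoid(v):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-v))


def apply_attention(features, attention):
    """Element-wise y = sigmoid((1 + a) * x) over two flat lists of equal length.

    Assumption: residual-style attention. With a = 0 the feature passes
    through unchanged before the final sigmoid.
    """
    if len(features) != len(attention):
        raise ValueError("features and attention must have the same length")
    return [sigmoid((1.0 + a) * x) for x, a in zip(features, attention)]


features = [-2.0, 0.5, 3.0]
no_attention = apply_attention(features, [0.0, 0.0, 0.0])
with_attention = apply_attention(features, [1.0, 1.0, 1.0])

# Positive attention pushes each response further away from 0.5,
# widening the gap between dissimilar values.
for plain, attended in zip(no_attention, with_attention):
    print(round(plain, 3), round(attended, 3))
```

Under this assumption, a zero attention map leaves the layer equivalent to a plain sigmoid activation, so the module can only sharpen, not remove, the underlying features.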

IV-C Midway module

One of the major challenges with optical flow estimation is predicting a correct general direction of the motion. This is normally solved by using a pyramid structure, like the one presented in SpyNet [26] or [27]. These pyramid methods learn features starting from the largest feature patches down to the smallest ones. However, applying the same strategy to the estimations loses accuracy when upscaling back to the original resolution. We argue that, in order to estimate the overall optical flow more accurately, mid-level estimations are crucial. Mid-level means that the estimation is scaled down from the original input frame size, but not so much that a pixel over-represents its neighborhood: each pixel can still represent its nearby estimations. As a result, the estimation has more uniform patches overall, with no sharp changes inside the same object but sharp contrasts on the contours. Therefore, we designed a novel module that puts emphasis on the mid-level estimations. A schematic diagram is shown in Fig. 2 (bottom-left). We named these optical flow estimation networks MidFlowNetRes, MidFlowNetInc, MidFlowNeXt32, and MidFlowNeXt64, respectively. Table II shows the architecture and layer sizes of our MidFlowNet.

Layer name | Kernel | Strides | Input                                               | Ch I/O
block1     | 7x7    | 2       | Im1+Im2                                             | 6/64
block2     | 5x5    | 2       | block1                                              | 64/128
block3     | 5x5    | 2       | block2                                              | 128/256
block3_1   | 1x1    | 1       | block3                                              | 256/256
block4     | 1x1    | 2       | block3_1                                            | 256/512
block4_1   | 1x1    | 1       | block4                                              | 512/512
block5     | 1x1    | 2       | block4_1                                            | 512/512
block5_1   | 1x1    | 1       | block5                                              | 512/512
block6     | 1x1    | 2       | block5_1                                            | 512/1024
block6_1   | 1x1    | 1       | block6                                              | 1024/1024
biup_1     | -      | -       | block6_1                                            | 1024/1024
biup_2     | -      | -       | block5_1                                            | 512/512
avgdown_1  | -      | -       | block3_1                                            | 256/256
avgdown_2  | -      | -       | block2                                              | 128/128
concat     | -      | -       | biup_1 + biup_2 + block4_1 + avgdown_1 + avgdown_2  | -/2432
pr4+loss4  | 3x3    | 1       | concat                                              | 2432/2
down4      | 4x4    | 2       | pr4+loss4                                           | 2/2
deconv4    | 4x4    | 2       | concat                                              | 2432/512
concat5    | -      | -       | block5_1 + deconv4 + down4                          | -/1026
pr5+loss5  | 3x3    | 1       | concat5                                             | 1026/2
down5      | 4x4    | 2       | pr5+loss5                                           | 2/2
deconv5    | 4x4    | 2       | concat5                                             | 1026/1024
concat6    | -      | -       | block6_1 + deconv5 + down5                          | -/2050
pr6+loss6  | 3x3    | 1       | concat6                                             | 2050/2
up4        | 4x4    | 2       | pr4+loss4                                           | 2/2
deconv3    | 4x4    | 2       | concat                                              | 2432/256
concat3    | -      | -       | block3_1 + deconv3 + up4                            | -/514
pr3+loss3  | 3x3    | 1       | concat3                                             | 514/2
up3        | 4x4    | 2       | pr3+loss3                                           | 2/2
deconv2    | 4x4    | 2       | concat3                                             | 514/128
concat2    | -      | -       | block2 + deconv2 + up3                              | -/258
pr2+loss2  | 3x3    | 1       | concat2                                             | 258/2
up2        | 4x4    | 2       | pr2+loss2                                           | 2/2
deconv1    | 4x4    | 2       | concat2                                             | 258/64
concat1    | -      | -       | deconv1 + up2                                       | -/66
pr1+loss1  | 3x3    | 1       | concat1                                             | 66/2
up1        | 4x4    | 2       | pr1+loss1                                           | 2/2
deconv0    | 4x4    | 2       | concat1                                             | 66/32
concat0    | -      | -       | deconv0 + up1                                       | -/34
pr0+loss0  | 3x3    | 1       | concat0                                             | 34/2
TABLE II: General architecture of MidFlowNet{Res, Inc, NeXt}. biup means bilinear upsampling. avgdown means average pooling. The outputs of biup and avgdown have the same size as block4_1. pr means optical flow estimation. loss is the EPE loss. deconv is deconvolution followed by LeakyReLU activation. up is deconvolution only. concat concatenates along the channel axis.
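The channel counts of the concatenation layers follow directly from the outputs they join. As a quick sanity check, the sketch below sums the output channels listed in Table II for several of the concatenation layers and compares them with the tabulated values. The dictionary names and the helper `concat_channels` are introduced here only for illustration.

```python
# Output channels of the layers feeding the concatenations (from Table II).
OUT_CHANNELS = {
    "block3_1": 256, "block4_1": 512, "block5_1": 512, "block6_1": 1024,
    "biup_1": 1024, "biup_2": 512, "avgdown_1": 256, "avgdown_2": 128,
    "deconv5": 1024, "deconv4": 512, "deconv3": 256, "deconv1": 64, "deconv0": 32,
    "down5": 2, "down4": 2, "up4": 2, "up2": 2, "up1": 2,
}

# Concatenation layer -> (inputs, output channels stated in Table II).
CONCATS = {
    "concat": (["biup_1", "biup_2", "block4_1", "avgdown_1", "avgdown_2"], 2432),
    "concat5": (["block5_1", "deconv4", "down4"], 1026),
    "concat6": (["block6_1", "deconv5", "down5"], 2050),
    "concat3": (["block3_1", "deconv3", "up4"], 514),
    "concat1": (["deconv1", "up2"], 66),
    "concat0": (["deconv0", "up1"], 34),
}


def concat_channels(inputs):
    """Channel count after concatenating the given layers along the channel axis."""
    return sum(OUT_CHANNELS[name] for name in inputs)


for layer, (inputs, stated) in CONCATS.items():
    computed = concat_channels(inputs)
    print(f"{layer}: computed {computed}, table {stated}")
```

The midway `concat` layer is by far the widest (2432 channels), which reflects the emphasis the module places on the mid-level representation.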

IV-D Attention-midway module

Our proposed optical flow estimation method combines the attention and midway modules. This way, we simultaneously put emphasis on both the mid-level estimations and the contours. Fig. 2 (bottom-right) shows a schematic diagram. We named these optical flow estimation networks AttMidFlowNetRes, AttMidFlowNetInc, AttMidFlowNeXt32, and AttMidFlowNeXt64, respectively.

V Experiments

To the best of our knowledge, previous gesture recognition works do not discuss whether a more accurate optical flow estimator results in more accurate gesture recognition. In order to shed light on the matter, we conducted two experiments. First, we compared our optical flow estimation methods against the state of the art. Then, we evaluated the improvement in gesture recognition accuracy when using our estimated optical flow. In each experiment, our proposed method for optical flow estimation and gesture recognition is trained end-to-end. In order to evaluate the recognition of gestures to command a home service robot, we generated our own dataset, MIBURI. The runtimes provided in this section correspond to the execution times of our algorithms on an Nvidia TITAN X GPU, with data stored on an SSD (to minimize data reading time).

V-A MIBURI: Video gesture Dataset for commanding Home Service Robots

In order to evaluate the performance of gesture recognition for commanding a home service robot, a video dataset of gestures is required. Since, to the best of our knowledge, no existing public dataset fits our needs, we generated our own.

Our dataset, MIBURI, was collected using the head camera of the HSR robot, an ASUS Xtion (640×480 resolution). It contains 10 gestures: pick-bag, drop-bag, follow-me, stop-follow-me, bring-drink, bring-food, on-the-left, on-the-right, yes/confirm, no/decline (see examples in Fig. 3).

When selecting the motion for each gesture, we applied the following criteria: (1) the gestures are easily understandable for the general public, even the first time they are seen; (2) they are natural for humans to perform. These two points are essential from the perspective of HRI, since gestures should be used effortlessly to command the robot [28]. We recorded gestures from 15 users (3 females and 12 males, aged 22 to 30), with four repetitions per gesture, for a total of 600 videos consisting of 69,835 frames. The average video duration is 3.88 s.
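The dataset figures above can be cross-checked with a few lines of arithmetic. The sketch assumes a capture rate of 30 fps, which matches the frame rate quoted earlier for the recognition input but is not stated explicitly for the recordings; the variable names are chosen here for illustration.

```python
users, gestures, repetitions = 15, 10, 4
total_frames = 69835
fps = 30.0  # assumption: recordings were captured at 30 frames per second

videos = users * gestures * repetitions       # one video per user, gesture and repetition
frames_per_video = total_frames / videos      # average clip length in frames
average_seconds = frames_per_video / fps      # average clip length in seconds

print(videos)                        # 600
print(round(frames_per_video, 1))    # 116.4
print(round(average_seconds, 2))     # 3.88
```

Under the 30 fps assumption, the frame count and the number of videos reproduce the reported average duration.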

Fig. 3: Examples of our MIBURI dataset (top): pick-bag, drop-bag, on-the-left. Examples of the ChaLearn IGR dataset (bottom): surgeon needle holder, italian d’accordo, volleyball referee substitution
Model            | FlyingChairs (EPE) | Sintel clean (EPE) | Sintel final (EPE) | Params. | Flow runtime (s)
Related works
FlowNetS [17]    | 2.71               | 4.50               | 5.45               | 38.5M   | 0.080
FlowNetC [17]    | 2.19               | 4.31               | 5.87               | 39M     | 0.150
SpyNet [26]      | 2.63               | 4.12               | 5.57               | 1.2M    | 0.069
Zhu et al. [18]  | 4.73               | -                  | 10.07              | 2M      | 0.130
MSCSL [20]       | 2.08               | 3.39               | 4.70               | -       | 0.060
Ours: base (stronger feature extractors)
FlowNeXt32       | 2.58               | 3.82               | 4.38               | 33.8M   | 0.066
FlowNeXt64       | 2.70               | 4.14               | 4.65               | 49.3M   | 0.071
FlowNetRes       | 2.26               | 3.56               | 4.06               | 52.2M   | 0.055
FlowNetInc       | 2.27               | 3.94               | 4.40               | 25.9M   | 0.044
Ours: with attention
AttFlowNeXt32    | 1.94               | 5.08               | 6.91               | 26.8M   | 0.060
AttFlowNeXt64    | 1.62               | 4.71               | 6.65               | 42.3M   | 0.074
Ours: with midway
MidFlowNeXt32    | 1.74               | 5.85               | 7.86               | 51.2M   | 0.055
MidFlowNeXt64    | 1.48               | 5.30               | 7.45               | 66.8M   | 0.076
Ours: with attention and midway
AttMidFlowNeXt32 | 1.51               | 4.89               | 6.62               | 48.4M   | 0.074
AttMidFlowNeXt64 | 1.60               | 5.25               | 6.74               | 64.0M   | 0.092
TABLE III: Results of optical flow estimation in EPE (End Point Error, the lower the better). The runtime reported is on FlyingChairs.

V-B Optical flow estimation evaluation

Table III shows the optical flow estimation results. We evaluated our networks with the FlyingChairs dataset [17] and the MPI_Sintel dataset [29]. FlyingChairs [17] is a synthetic dataset designed specifically for training CNNs to estimate optical flow. It is created by applying affine transformations to real images and synthetically rendered chairs. The dataset contains 22,872 image pairs: 22,232 training and 640 test samples according to the standard evaluation split, with a resolution of 512×384. MPI_Sintel [29] is a synthetic dataset derived from a short open source animated 3D movie. There are 1,628 frames, 1,064 for training and 564 for testing, with a resolution of 1024×436. It is provided in two passes, Clean and Final. The difference between them is that the Clean pass contains realistic illuminations and reflections, whereas the Final pass additionally includes rendering effects such as atmospheric effects and defocus blur.

We measured the performance of our estimation networks via End Point Error (EPE). EPE is calculated as the Euclidean distance between the ground truth optical flow and the estimated optical flow. For training, we used exclusively the FlyingChairs train set, since MPI_Sintel is too small for large-scale deep learning training. FlyingChairs results are reported on its test set, and both Sintel results are reported on their respective train sets.
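For reference, a minimal sketch of the EPE computation follows. It assumes that each flow field is given as a flat list of per-pixel (u, v) displacement vectors and that the per-pixel errors are averaged over the image, which is the usual convention; the function name `average_epe` is introduced here for illustration.

```python
import math


def average_epe(estimated, ground_truth):
    """Mean Euclidean distance between estimated and ground-truth flow vectors.

    Both arguments are lists of (u, v) tuples, one per pixel, in the same order.
    """
    if len(estimated) != len(ground_truth) or not estimated:
        raise ValueError("flow fields must be non-empty and of equal size")
    total = sum(
        math.hypot(u_est - u_gt, v_est - v_gt)
        for (u_est, v_est), (u_gt, v_gt) in zip(estimated, ground_truth)
    )
    return total / len(estimated)


# Two pixels: one is off by the vector (3, 4), the other is exact.
estimated = [(3.0, 4.0), (1.0, -1.0)]
ground_truth = [(0.0, 0.0), (1.0, -1.0)]
print(average_epe(estimated, ground_truth))  # 2.5
```

Because every pixel contributes equally, a large static background that is estimated correctly lowers the EPE even when the moving region is poorly estimated.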

V-B1 Discussion

Stronger feature extractors generalize better: Our base networks with stronger feature extractors perform similarly to FlowNetS and FlowNetC on our training dataset, FlyingChairs, but outperform both of these networks on both of the Sintel passes. This is because, while FlowNetS and FlowNetC achieved a good EPE on FlyingChairs, they do not generalize well to other datasets.

Stronger feature extractors are more robust to noise: Although some related works, e.g., MSCSL, achieved better Sintel clean results, we observed that our base networks achieved better Sintel final results. This suggests that, by adapting stronger feature extractors to optical flow learning, the networks understand motion better and are better at ignoring the additional atmospheric effects and blur included in the final pass, thus obtaining closer results between the Sintel clean and Sintel final benchmarks.

Our method outperforms others in FlyingChairs: By adding the attention mechanism, the midway estimations, and the two combined, we significantly outperformed other methods on our training dataset, FlyingChairs, with an improvement of between 20% and 45%. However, we performed worse on the Sintel dataset. We think this is because of the differences between Sintel and FlyingChairs, also discussed in [17]. Sintel has larger motions than FlyingChairs. When performing deconvolutions, larger motions are more likely to be upsampled less accurately, resulting in a larger EPE. Hence, networks that reach very accurate estimations on FlyingChairs, which features smaller motions, suffer on Sintel. Nevertheless, to our knowledge, our attention and midway networks, individually and combined, achieve the highest accuracy to date on the FlyingChairs dataset.

Our method improves the speed-accuracy trade-off: Adding an attention module improves runtime (faster execution) and reduces the model size (less memory consumed). Then, the midway module reduces the runtime but increases the model size. The best speed-accuracy trade-off is achieved on the FlyingChairs dataset by combining attention and midway since, although the runtime increases by 10%, the accuracy increases by 45%.
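To make the relative gains concrete, the following sketch computes the percentage reduction in FlyingChairs EPE from the values in Table III. Using FlowNetS as the baseline is an assumption made here, since the text does not state which baseline the quoted percentages refer to; the helper `relative_improvement` is hypothetical.

```python
# FlyingChairs EPE values taken from Table III.
BASELINE_EPE = 2.71  # FlowNetS, assumed here as the reference
OUR_EPE = {
    "AttFlowNeXt32": 1.94,
    "AttFlowNeXt64": 1.62,
    "MidFlowNeXt32": 1.74,
    "MidFlowNeXt64": 1.48,
    "AttMidFlowNeXt32": 1.51,
    "AttMidFlowNeXt64": 1.60,
}


def relative_improvement(baseline, ours):
    """Percentage by which `ours` lowers the error with respect to `baseline`."""
    return 100.0 * (baseline - ours) / baseline


for model, epe in OUR_EPE.items():
    print(f"{model}: {relative_improvement(BASELINE_EPE, epe):.1f}% lower EPE")
```

With this baseline, the largest reduction is obtained by MidFlowNeXt64, at roughly 45%.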

V-C Gesture recognition evaluation

Table IV shows the gesture recognition accuracy when using our optical flow estimation methods. For that, we used our MIBURI dataset and the Isolated Gesture Recognition Challenge (IGR) dataset. In MIBURI, we use the first three repetitions of each gesture by each user for training, and the remaining one for testing. The IGR dataset (in ICPR 2016) was derived from one of the popular gesture recognition datasets, the ChaLearn Gesture Dataset 2011 [30]. It includes a total of 47,933 RGB-D gesture videos, where each RGB-D video represents an isolated gesture (we only used the RGB channels). The dataset is split into 35,878 videos for training, 5,784 videos for validation and 6,271 videos for testing. The resolution is 240×320. There are 249 gesture labels performed by 21 different individuals. The gestures are very diverse (e.g., music notes, diving signals, referee signals), which makes the dataset more challenging than MIBURI.
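The MIBURI split described above can be sketched as follows. Videos are represented by hypothetical (user, gesture, repetition) index triples, since the actual file naming of the dataset is not described here; the function name `split_miburi` is likewise an assumption.

```python
from itertools import product


def split_miburi(num_users=15, num_gestures=10, num_repetitions=4, train_repetitions=3):
    """Split videos by repetition index: the first ones train, the rest test.

    Each video is identified by a (user, gesture, repetition) triple.
    """
    train, test = [], []
    for user, gesture, repetition in product(
        range(num_users), range(num_gestures), range(num_repetitions)
    ):
        target = train if repetition < train_repetitions else test
        target.append((user, gesture, repetition))
    return train, test


train_videos, test_videos = split_miburi()
print(len(train_videos), len(test_videos))  # 450 150
```

Note that with this split every user appears in both the training and the test set, so the evaluation measures generalization to new repetitions rather than to unseen users.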

We chose the optical flow branch of the Temporal Segment Network (TSN) [6] as our gesture recognition method, since its performance is state-of-the-art and its speed-accuracy trade-off is well balanced. We compared it to three baselines. First, improved dense trajectories (iDT) [31] with Fisher vector (FV) encoding and a support vector machine (SVM) classifier, which combined were the state of the art before deep learning [31]. Second, TSN (default), which uses traditional (i.e., non-deep learning-based) optical flow estimation [15]. Third, TSN/FlowNetS, which is TSN’s optical flow branch with the optical flow estimation replaced by FlowNetS [17].

TABLE IV: Gesture recognition accuracy with our proposed optical flow. The runtime reported is on MIBURI, and includes optical flow estimation and gesture recognition.
Model                | MIBURI accuracy (%) | ChaLearn IGR accuracy (%) | Recognition runtime (s)
iDT+FV+SVM           | 98.3                | -                         | 27.1
TSN (default)        | 96.9                | 94.55                     | 11.1
TSN/FlowNetS         | 73.9                | 84.98                     | 1.7
Ours: base (stronger feature extractors)
TSN/FlowNeXt32       | 92.1                | 76.83                     | 4.7
TSN/FlowNeXt64       | 93.8                | 81.79                     | 7.9
Ours: with attention
TSN/AttFlowNeXt32    | 96.9                | 84.31                     | 4.1
TSN/AttFlowNeXt64    | 96.9                | 87.51                     | 6.4
Ours: with midway
TSN/MidFlowNeXt32    | 96.9                | 85.94                     | 6.0
TSN/MidFlowNeXt64    | 96.9                | 85.02                     | 8.5
Ours: with attention and midway
TSN/AttMidFlowNeXt32 | 96.9                | 86.00                     | 7.2
TSN/AttMidFlowNeXt64 | 98.4                | 85.43                     | 9.6

V-C1 Discussion

Attention improves gesture recognition: Adding attention to our deep learning-based networks translated to a better gesture recognition performance in both datasets. This effect can be observed for TSN/FlowNeXt{32/64} vs TSN/AttFlowNeXt{32/64} and TSN/MidFlowNeXt{32/64} vs TSN/AttMidFlowNeXt{32/64}. As we hypothesized, since attention provides a more refined human silhouette, human motion is better represented, and thus, the performance of optical flow-based gesture recognition methods is improved.

Better EPE does not necessarily imply better gesture recognition: In our experiments with MIBURI, improvement in optical flow and gesture recognition is correlated. For ChaLearn IGR on the other hand, in spite of being outperformed in optical flow estimation by our networks, TSN/FlowNetS achieves better recognition accuracy than our base. The reason for this mismatch is that, whereas only correctly-recognized motions account for gesture recognition, EPE metric also takes into account correctly-recognized background pixels. The way of obtaining a good score in both metrics is by providing an accurate representation of the human silhouette, like in our attention mechanism.

Our method improves the speed-accuracy trade-off: Compared to traditional optical flow estimation methods, deep learning-based methods allow for a faster gesture recognition, which has a lot of impact in HRI. With our MIBURI dataset, comparing TSN/FlowNeXt{32/64} vs TSN/AttFlowNeXt{32/64}, adding attention improved recognition accuracy by 7% to 10% while runtime decreases, achieving the best speed-accuracy trade-off. While other configurations achieve higher accuracies, their recognition runtime makes them prohibitive for HRI.
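One way to read the speed-accuracy trade-off in Table IV is to keep only the configurations that are not dominated, i.e., those for which no other configuration is at least as accurate and at least as fast while being strictly better in one of the two. The sketch below applies this Pareto criterion to the MIBURI columns; the criterion itself is an analysis aid introduced here, not part of our evaluation protocol, and the function name `pareto_front` is hypothetical.

```python
# (MIBURI accuracy in %, recognition runtime in s) from Table IV.
RESULTS = {
    "iDT+FV+SVM": (98.3, 27.1),
    "TSN (default)": (96.9, 11.1),
    "TSN/FlowNetS": (73.9, 1.7),
    "TSN/FlowNeXt32": (92.1, 4.7),
    "TSN/FlowNeXt64": (93.8, 7.9),
    "TSN/AttFlowNeXt32": (96.9, 4.1),
    "TSN/AttFlowNeXt64": (96.9, 6.4),
    "TSN/MidFlowNeXt32": (96.9, 6.0),
    "TSN/MidFlowNeXt64": (96.9, 8.5),
    "TSN/AttMidFlowNeXt32": (96.9, 7.2),
    "TSN/AttMidFlowNeXt64": (98.4, 9.6),
}


def pareto_front(results):
    """Return the non-dominated models, sorted by increasing runtime."""
    front = []
    for name, (accuracy, runtime) in results.items():
        dominated = any(
            other_acc >= accuracy
            and other_rt <= runtime
            and (other_acc > accuracy or other_rt < runtime)
            for other, (other_acc, other_rt) in results.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return sorted(front, key=lambda model: results[model][1])


print(pareto_front(RESULTS))
# ['TSN/FlowNetS', 'TSN/AttFlowNeXt32', 'TSN/AttMidFlowNeXt64']
```

On MIBURI, the attention-based configuration TSN/AttFlowNeXt32 sits on this front, while the traditional pipelines (iDT+FV+SVM and the default TSN) do not.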

VI Conclusion

In this paper, we presented a novel optical flow estimation method and evaluated it in a gesture recognition pipeline for human-robot interaction. We presented four ways of improving optical flow estimation: stronger feature extractors, an attention mechanism, a midway mechanism, and a combination of the latter two. Due to the lack of gesture datasets for commanding house service robots, we generated a gesture dataset called MIBURI. We showed that our method allows for a better representation of motion, improving the optical flow estimation used in existing deep learning-based gesture recognition work. These improvements also lead to more accurate gesture recognition, and achieve a better speed-accuracy trade-off. In the future, we plan to extend this work by integrating it with a robot, and comparing the efficiency of gesture-based human-robot interaction against current voice-controlled approaches.


This work was partially supported by JST CREST Grant Number JPMJCR1403, Japan. The authors have participated in the HSR developer community and made use of HSR hardware and software platforms.


  • [1] S. Goldin-Meadow, “The role of gesture in communication and thinking,” Trends in Cognitive Sciences, vol. 3, no. 11, pp. 419–429, 1999.
  • [2] M. Lohse, F. Hegel, and B. Wrede, “Domestic applications for social robots: an online survey on the influence of appearance and capabilities,” Journal of Physical Agents, vol. 2, pp. 21–32, June 2008.
  • [3] I. Leite, C. Martinho, and A. Paiva, “Social robots for long-term interaction: A survey,” International Journal of Social Robotics, vol. 5, April 2013.
  • [4] D. Amodei et al., “Deep speech 2: End-to-end speech recognition in english and mandarin,” in IEEE International Conference on Machine Learning (ICML), 2016.
  • [5] K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” in Neural Information Processing Systems (NIPS), 2014.
  • [6] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. V. Gool, “Temporal segment networks: Towards good practices for deep action recognition,” in European Conference on Computer Vision (ECCV), 2016.
  • [7] H. Medicherla and A. Sekmen, “Human–robot interaction via voice-controllable intelligent user interface,” Robotica, vol. 25, no. 5, pp. 521–527, 2007.
  • [8] O. Mubin, J. Henderson, and C. Bartneck, “You just do not understand me! speech recognition in human robot interaction,” in IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), Aug 2014, pp. 637–642.
  • [9] M. Salem, S. Kopp, I. Wachsmuth, K. Rohlfing, and F. Joublin, “Generation and evaluation of communicative robot gesture,” International Journal of Social Robotics, vol. 4, no. 2, pp. 201–217, April 2012.
  • [10] B. Gleeson, K. MacLean, A. Haddadi, E. Croft, and J. Alcazar, “Gestures for industry intuitive human-robot communication from human observation,” in International Conference on Human-Robot Interaction (HRI), March 2013, pp. 349–356.
  • [11] L. Brethes, P. Menezes, F. Lerasle, and J. Hayet, “Face tracking and hand gesture recognition for human-robot interaction,” in IEEE International Conference on Robotics and Automation (ICRA), vol. 2, April 2004, pp. 1901–1906.
  • [12] K. Soomro, A. Roshan Zamir, and M. Shah, “UCF101: A dataset of 101 human actions classes from videos in the wild,” in IEEE International Conference on Computer Vision (ICCV), Dec 2012.
  • [13] G. Gkioxari and J. Malik, “Finding action tubes,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • [14] C. Feichtenhofer, A. Pinz, and A. Zisserman, “Convolutional two-stream network fusion for video action recognition,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [15] C. Zach, T. Pock, and H. Bischof, “A duality based approach for realtime tv-l1 optical flow,” in Pattern Recognition, F. A. Hamprecht, C. Schnörr, and B. Jähne, Eds., 2007, pp. 214–223.
  • [16] T. Brox, A. Bruhn, N. Papenberg, and J. Weickert, “High accuracy optical flow estimation based on a theory for warping,” in European Conference on Computer Vision (ECCV), vol. 3024, May 2004, pp. 25–36.
  • [17] A. Dosovitskiy, P. Fischer, E. Ilg, P. Häusser, C. Hazırbaş, V. Golkov, P. v.d. Smagt, D. Cremers, and T. Brox, “Flownet: Learning optical flow with convolutional networks,” in IEEE International Conference on Computer Vision (ICCV), 2015.
  • [18] Y. Zhu and S. D. Newsam, “Densenet for dense flow,” in International Conference on Image Processing (ICIP), 2017.
  • [19] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox, “Flownet 2.0: Evolution of optical flow estimation with deep networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [20] S. Zhao, X. Li, and O. E. F. Bourahla, “Deep optical flow estimation via multi-scale correspondence structure learning,” CoRR, vol. abs/1707.07301, 2017.
  • [21] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2009, pp. 248–255.
  • [22] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [23] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [24] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [25] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in International Conference on Learning Representations (ICLR), 2015.
  • [26] A. Ranjan and M. J. Black, “Optical flow estimation using a spatial pyramid network,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [27] B. D. Lucas and T. Kanade, “An iterative image registration technique with an application to stereo vision,” in International Joint Conference on Artificial Intelligence (IJCAI), vol. 2, 1981, pp. 674–679.
  • [28] J. L. Raheja, R. Shyam, U. Kumar, and P. B. Prasad, “Real-time robotic hand control using hand gestures,” in International Conference on Machine Learning and Computing (ICMLC), Feb 2010, pp. 12–16.
  • [29] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black, “A naturalistic open source movie for optical flow evaluation,” in European Conference on Computer Vision (ECCV), ser. Part IV, LNCS 7577, A. Fitzgibbon et al., Eds. Springer-Verlag, Oct 2012, pp. 611–625.
  • [30] I. Guyon, V. Athitsos, P. Jangyodsuk, and H. J. Escalante, “The chalearn gesture dataset (cgd 2011),” Machine Vision and Applications, vol. 25, no. 8, pp. 1929–1951, Nov. 2014.
  • [31] H. Wang and C. Schmid, “Action recognition with improved trajectories,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Dec. 2013, pp. 3551–3558.
  • [32] J. Sánchez, F. Perronnin, T. Mensink, and J. Verbeek, “Image classification with the fisher vector: Theory and practice,” International Journal of Computer Vision (IJCV), vol. 105, no. 3, pp. 222–245, Dec 2013.