Recent robot learning methods such as learning from demonstration (LfD) Argall et al. (2009) and imitation learning allow a transfer of preferences and policy from the expert performing the task to the learner. These methods have allowed the learning robots to successfully perform difficult acrobatic aerial maneuvers Abbeel et al. (2007), carry out nontrivial manipulation tasks Pollard and Hodgins (2004), penetrate patrols Bogert and Doshi (2014), and merge autonomously into a congested freeway Nishi et al. (2017). An important way by which this transfer occurs is the learner simply observing the expert perform the task.
Observations of the expert engaged in the task are expected to yield trajectories of state-action pairs, which are then given as input to the algorithms that drive these methods. Consequently, recognizing the expert’s state and action accurately from observations is crucial for the learner. If the learner is a robot, its observations are sensor streams. Very likely, these will be streams from range and camera sensors yielding RGB and depth (RGB-D) data. Therefore, the learning robot must recognize sequences of state-action pairs quickly and accurately from RGB-D streams. This is a critical component of the LfD and imitation learning pipelines.
In this paper, we present SA-Net, a deep neural network that recognizes state-action pairs from RGB-D data streams with high accuracy. This supervised learning method offers a general deep learning alternative to the current ad hoc techniques, which often rely on problem-specific implementations using OpenCV. Figure 1 gives an overview of how SA-Net is deployed. SA-Net aims to recognize the expert’s state and action from a sensor stream. The state consists of the 2D or 3D coordinates in a global reference frame and the orientation. For example, the state of a ground mobile robot is its 2D coordinates and the angle it is facing, measured counterclockwise from the positive x-axis. The action is derived from the motion performed by the robot.
As the learner’s position may not be fixed, SA-Net seeks to recognize the coordinates and orientation of the observed object(s) relative to the learner’s location. While the RGB frame offers context, the depth data is relative to the observer. Coordinates are recognized by interleaving convolutional neural network (CNN) layers and pooling layers, followed by fully connected layers that feed a softmax. This allows the use of all four channels, RGB-D, in recognizing the coordinates. Identifying the expert’s orientation and action is more challenging. Both rely on temporal data, and SA-Net utilizes frames from time steps $t-2$, $t-1$, and the current time step $t$. Each frame is first cropped by a detection network such as Faster R-CNN Girshick et al. (2014) or YOLO Redmon and Farhadi (2018) to focus attention on the expert. The network backtracks the movement inside the bounding box for time steps $t-1$ and $t-2$ using a layer of time-distributed CNNs followed by two convolutional long short-term memory (LSTM) nets Hochreiter and Schmidhuber (1997). SA-Net continues to utilize the depth channel here as well, through a connection to the previously described fully connected layers, which provide the relative distance.
We evaluate SA-Net in two diverse domains. It is used to identify the state-action sequences of two TurtleBots that are simultaneously but independently patrolling a hallway Bogert and Doshi (2014). In another application, SA-Net is used to identify the state-action sequences of a PhantomX robotic arm that is performing pick-and-place operations. In both domains, SA-Net exhibits high accuracy while being able to run on computing machines on board a robot with limited processing power and memory. Ablation and robustness studies demonstrate that the architecture is necessary and sufficient and that SA-Net can handle typical adverse conditions as well. Consequently, SA-Net offers high-accuracy trajectory recognition to facilitate robots engaged in LfD or imitation learning in various domains.
1.1 Related Work
Traditionally, the state and action of an observed agent are recognized by tracking a marker associated with the agent. For example, Bogert and Doshi (2014) make use of a colored box placed on the TurtleBot, which simplifies the detection of the robot and the estimation of its state and action. A limitation of such methods is a lack of robustness to occlusion of the object and to noise in the context.
Recently, deep neural networks have demonstrated significantly improved performance on tasks involving image and video analysis Yue-Hei Ng et al. (2015); He et al. (2016). Related to our method are the neural network architectures utilized for recognizing human gestures and activities. For example, Ji et al. (2013) recognize human actions in surveillance videos using a 3D CNN. A recurrent neural network (RNN) combined with a 3D CNN is utilized by Montes et al. (2016) to classify and temporally localize activities in untrimmed videos. To leverage depth information in gesture recognition, two separate CNN streams are used Eitel et al. (2015) with a late fusion network. Recently, the RGB and depth modalities were considered as one entity to extract features for action recognition with CNNs Wang et al. (2017). In general, these action recognition methods treat input videos for learning as 3D volumes with multiple adjacent frames Ji et al. (2013), as one or multiple compact images Wang et al. (2017), or as a sequence of image frames Montes et al. (2016). Our proposed method belongs to the last category and handles the image sequence with LSTMs, which are capable of learning temporal dependencies. Furthermore, in contrast to these methods for recognizing actions, SA-Net is tasked with recognizing the state and action pairs simultaneously for use in online learning.
2 SA-Net Architecture
SA-Net’s task of recognizing state-action pairs motivates a network design that efficiently mixes convolutional and recurrent neural networks, which we describe below.
2.1 Problem Definition
We aim to automatically estimate the state and action pairs of an expert from RGB and depth streams using deep neural networks. Given three video frames of the expert captured by a learner at time points $t-2$, $t-1$, and $t$, our network jointly predicts the state $(x, y, z, \theta)$ and action $(a)$ of the expert at the current time point $t$. Here, the tuple $(x, y, z)$ in the state representation describes the location coordinate of the expert in a 3D environment; the $z$ dimension is ignored for 2D cases. The $\theta$ describes the orientation of the expert. In this paper, we consider discrete states and actions, which allows us to formulate our task as a multi-label classification problem. Formally, our problem can be formulated as:

$$(x, y, z, \theta, a) = f(I_{t-2}, I_{t-1}, I_t; W), \quad x \in \{1, \dots, N_x\},\ y \in \{1, \dots, N_y\},\ z \in \{1, \dots, N_z\},\ \theta \in \{1, \dots, N_\theta\},\ a \in \{1, \dots, N_a\}$$

where $f$ indicates the mapping function learned by our classification network; $I_{t-2}$, $I_{t-1}$, and $I_t$ are the three frame inputs; $W$ represents the parameter set of the network for classifying the state and action jointly; $N_x$, $N_y$, and $N_z$ are the discretized dimensions in each coordinate; $N_\theta$ is the number of the expert’s orientations (for instance, we have four orientations, north, south, east, and west, in our TurtleBot application); and $N_a$ is the number of actions, e.g., four different actions: move forward, stop, move right, and move left. In general, the network includes two coupled components for the state and action recognition, which are learned simultaneously. The architecture of SA-Net is shown in Fig. 2.
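To make the multi-label formulation concrete, the following minimal sketch decodes one score vector per output head into a discrete state-action pair for a 2D case. The orientation and action labels come from the TurtleBot example above; the grid resolution (`GRID_X`, `GRID_Y`), the `StateAction` container, and the `decode` helper are hypothetical names introduced only for illustration and are not part of SA-Net itself.

```python
from dataclasses import dataclass
from typing import Dict, List

# Label sets taken from the TurtleBot example in the text.
ORIENTATIONS = ["north", "south", "east", "west"]
ACTIONS = ["move forward", "stop", "move right", "move left"]

# Hypothetical grid resolution; the real values depend on the environment.
GRID_X, GRID_Y = 10, 4


@dataclass(frozen=True)
class StateAction:
    x: int
    y: int
    orientation: str
    action: str


def argmax(scores: List[float]) -> int:
    """Index of the largest score (the first one wins on ties)."""
    return max(range(len(scores)), key=lambda i: scores[i])


def decode(heads: Dict[str, List[float]]) -> StateAction:
    """Turn one score vector per output head into a discrete state-action pair."""
    expected = {
        "x": GRID_X,
        "y": GRID_Y,
        "orientation": len(ORIENTATIONS),
        "action": len(ACTIONS),
    }
    for name, size in expected.items():
        if len(heads[name]) != size:
            raise ValueError(f"head {name!r} should have {size} scores")
    return StateAction(
        x=argmax(heads["x"]),
        y=argmax(heads["y"]),
        orientation=ORIENTATIONS[argmax(heads["orientation"])],
        action=ACTIONS[argmax(heads["action"])],
    )
```

Each head is decoded independently here, which mirrors the idea that the coordinate, orientation, and action are separate classification targets predicted jointly from the same three frames.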
2.2 State Recognition
The state recognition aims to determine the expert’s coordinate $(x, y, z)$ and its orientation $\theta$. Typically, the expert’s coordinate can be identified on the basis of its surrounding environment. Therefore, we use one image frame without considering the temporal information in our coordinate recognition module. Different from coordinate recognition, orientation recognition requires more than one image frame to recognize hard-to-distinguish orientations. As shown in Fig. 3, the TurtleBot is facing in different directions in the two images, but the image difference is too subtle to correctly separate these two orientations of the TurtleBot. In such situations, the image sequence plays an important role in recognizing the orientation. Therefore, in the state recognition of SA-Net, we separate the prediction of the coordinate $(x, y, z)$ from that of the orientation $\theta$: one network stream takes the static image as input while the other takes the image sequence as input.
Coordinate recognition As shown in the top stream of the network in Fig. 2, only the image frame at time point $t$ is used to predict the expert’s location coordinate. We assign a pre-defined coordinate system for each environment; that is, each image frame will be classified into a unique coordinate, which is represented by an absolute location $(x, y, z)$ with respect to the origin in the coordinate system. The expert’s coordinates are learned from images captured by the learner; however, the learner’s location may change in different situations. To improve the generalization of the network, we leverage the relative distance between the expert and the learner to help in the recognition of the expert’s coordinate.
In the coordinate recognition branch, we have two sets of coordinate-related predictions: the relative distance between the expert and the learner, and the absolute coordinate $(x, y, z)$. These two prediction tasks share the same image feature extraction process, which includes five convolutional layers and three max pooling layers. The convolutional layers each use 32 filters with the same kernel size and the same stride. The three pooling layers are located after the first, third, and fifth convolutional layers, respectively. Following the convolutional and pooling layers, two fully connected (FC) layers are used in the classification. Because the prediction of the relative distance contributes to the coordinate prediction, the coordinate classification stream has an additional FC layer after concatenating the pre-activation of the softmax function from the relative distance classification.
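The concatenation step in this branch can be sketched as follows. This is an illustrative toy in plain Python under stated assumptions, not the SA-Net implementation: `dense`, `softmax`, and `coordinate_branch` are hypothetical helpers, the weights are placeholders supplied by the caller, and the shared convolutional features are assumed to have already been flattened into a vector.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def softmax(logits: Vector) -> Vector:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dense(inputs: Vector, weights: Matrix, bias: Vector) -> Vector:
    """Fully connected layer without activation: one output per weight row."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]


def coordinate_branch(features: Vector,
                      dist_w: Matrix, dist_b: Vector,
                      coord_w: Matrix, coord_b: Vector) -> Tuple[Vector, Vector]:
    """Return (relative-distance probabilities, coordinate probabilities)."""
    # Pre-activation (logits) of the relative-distance softmax.
    dist_logits = dense(features, dist_w, dist_b)
    # Concatenate the shared features with those logits ...
    fused = features + dist_logits
    # ... and pass them through the additional FC layer of the coordinate stream.
    coord_logits = dense(fused, coord_w, coord_b)
    return softmax(dist_logits), softmax(coord_logits)
```

Feeding the pre-softmax distance logits, rather than the probabilities, into the coordinate stream follows the description above; how many units each layer has is left to the caller in this sketch.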
Orientation recognition Different from the coordinate parameter, the orientation of the expert guides its movement regardless of the environment, similar to the action parameter discussed in Section 2.3. Therefore, in both orientation and action recognition we would like the network to focus its attention on the expert itself, especially when the expert is far away from the learner and relatively small in the whole image frame. To achieve this goal, we adopt object detection to make the expert stand out for perceiving its behavior. More details about the object detection are given in Section 2.4. After object detection, we have three new sequential frames, which are cropped from the original RGB-D image inputs and resized to a fixed image size to facilitate orientation and action recognition of the expert. The sequential frames are essential in orientation recognition to differentiate hard examples such as those shown in Fig. 3.
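As a small illustration of how three consecutive frames can be grouped from a stream, the sketch below keeps a sliding window over the incoming frames and emits a triple once three frames are available. The function name `frame_triples` is hypothetical, and the frames can be any objects (for example, cropped RGB-D arrays); this is one possible buffering scheme, not necessarily the one used in SA-Net.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, Tuple, TypeVar

Frame = TypeVar("Frame")


def frame_triples(stream: Iterable[Frame]) -> Iterator[Tuple[Frame, ...]]:
    """Yield (frame at t-2, frame at t-1, frame at t) for every time step t >= 2."""
    window: Deque[Frame] = deque(maxlen=3)  # oldest frame is dropped automatically
    for frame in stream:
        window.append(frame)
        if len(window) == 3:
            yield tuple(window)
```

For instance, `list(frame_triples(range(5)))` gives `[(0, 1, 2), (1, 2, 3), (2, 3, 4)]`, so every prediction after the first two frames sees the current frame together with its two predecessors.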
To handle the sequential image inputs, we use time-distributed convolutional (TD-Conv) layers in the network stream for the orientation recognition. These layers collect image features required for orientation recognition from all three time steps. In particular, we have two TD-Conv layers, followed by one time-distributed max pooling layer and another TD-Conv layer. Each TD-Conv layer has 32 filters. In addition, we observe that the orientation and action recognition are connected to coordinate recognition, albeit loosely. For instance, the TurtleBot is less likely to turn left or right if it is in the middle of a corridor. Thus, we concatenate the whole-image features extracted by the coordinate recognition with the spatio-temporal features extracted from the cropped image sequence to predict the expert’s orientation. A similar operation is performed in action recognition, as discussed next.
2.3 Action Recognition
Similar to the orientation recognition, actions are recognized using the same three sequential cropped images after object detection (Section 2.4). The goal is to determine the expert’s action, for example, in which cardinal direction the expert is moving. Because the orientation and action recognition work on the same input, they share the first three layers for extracting lower-level features from the cropped images at all time steps. The action recognition then uses two convolutional LSTM layers to further compose higher-level features and capture temporal changes in the image sequence. These two new layers also use 32 kernels. In this branch, we leverage all features extracted from the state (both coordinate and orientation) recognition to support the action recognition.
2.4 Expert Detection
This object detection module provides the inputs for the orientation and action recognition of the expert. We use the RGB data stream at the current time point $t$ to perform object detection with an existing model, YOLO Redmon and Farhadi (2018). Using the bounding box predicted for the expert, we crop the images from the frames at $t-2$, $t-1$, and $t$. When cropping, we keep a small amount of the surrounding background; this buffered cropping is calculated by the linear equation $d' = \alpha d + \beta$. Here, $\alpha$ is the cropping factor that determines how aggressively the user wants to crop the image; $d$ is the width or the height of the bounding box before the buffered cropping, while $d'$ is the corresponding value after the cropping; and $\beta$ is the minimum amount of cropping, e.g., 10 pixels. Both parameters are held fixed across all of our experiments.
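The buffered cropping can be sketched as below, assuming the linear rule has the form "new size = factor × old size + minimum buffer" and that the enlarged box stays centered on the detection and is clipped to the image. The function names, the `(left, top, right, bottom)` box layout, and the default values `alpha=1.0` and `beta=10.0` are assumptions made for illustration; only the 10-pixel example for the minimum comes from the text.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def buffered_size(size: float, alpha: float, beta: float) -> float:
    """Linear buffering rule: scale the side length, then add a fixed margin."""
    return alpha * size + beta


def buffered_crop_box(box: Box, image_w: int, image_h: int,
                      alpha: float = 1.0, beta: float = 10.0) -> Box:
    """Enlarge a detection box around its center and clip it to the image."""
    left, top, right, bottom = box
    center_x = (left + right) / 2.0
    center_y = (top + bottom) / 2.0
    new_w = buffered_size(right - left, alpha, beta)
    new_h = buffered_size(bottom - top, alpha, beta)
    return (
        max(0, int(round(center_x - new_w / 2.0))),
        max(0, int(round(center_y - new_h / 2.0))),
        min(image_w, int(round(center_x + new_w / 2.0))),
        min(image_h, int(round(center_y + new_h / 2.0))),
    )
```

With these assumed defaults, a 100 × 100 box at `(100, 50, 200, 150)` in a 640 × 480 image becomes `(95, 45, 205, 155)`, that is, 5 extra pixels of background on every side. The same box would be applied to the frames from all three time steps.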
2.5 Masking for Multiple Experts
In practice, there may be more than one expert in the scene. However, the network is not explicitly designed to recognize the state and action pairs for multiple experts. To address this issue, we propose a masking strategy that ensures only one expert is present in the images used for recognition. In particular, we leverage the object detection described in Section 2.4 to separate the experts and generate a new image for each of them. When generating the image for one expert, we remove all unwanted experts using the detected bounding boxes and replace the removed regions with the background image stored in memory. In this way, we have a new image for each expert to pass through the network for its state and action recognition.
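A minimal sketch of this masking strategy is given below. It assumes a stored background image of the same size as the frame, axis-aligned `(left, top, right, bottom)` boxes from the detector, and, for brevity, single-channel images represented as nested lists; `mask_other_experts` and `per_expert_images` are hypothetical names. Overlapping boxes are not handled specially here, so a kept expert that overlaps another box would be partly erased.

```python
from typing import List, Sequence, Tuple

Image = List[List[int]]          # rows of pixel values (single channel for brevity)
Box = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive


def mask_other_experts(frame: Image, background: Image,
                       boxes: Sequence[Box], keep: int) -> Image:
    """Copy the frame and paint background over every box except boxes[keep]."""
    masked = [row[:] for row in frame]  # leave the original frame untouched
    for index, (left, top, right, bottom) in enumerate(boxes):
        if index == keep:
            continue
        for r in range(top, bottom):
            masked[r][left:right] = background[r][left:right]
    return masked


def per_expert_images(frame: Image, background: Image,
                      boxes: Sequence[Box]) -> List[Image]:
    """One image per detected expert, each showing only that expert."""
    return [mask_other_experts(frame, background, boxes, k)
            for k in range(len(boxes))]
```

Each returned image can then be passed through the network separately, giving one state-action prediction per expert.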
SA-Net exhibits a general architecture useful in multiple domains. We evaluate it on two domains, both offline and online on physical robots, and report on our extensive experiments below.
We evaluated SA-Net on two diverse domains. First, it was deployed on a TurtleBot tasked with penetrating cyclic patrols by two other TurtleBots in a hallway as shown in Fig. 4; this domain has been used previously to evaluate inverse reinforcement learning methods Bogert and Doshi (2014). Each patroller can assume one of 4 orientations and 4 actions. The other domain involves observing a PhantomX arm mounted on a TurtleBot (Fig. 4), which is performing a pick-and-place task. The arm is observed from a Kinect 360 RGB-D sensor overlooking the arm. This domain adds a third dimension, the height of the end effector, to the state, and the arm has 6 possible orientations and 6 actions.
3.2 Formative Evaluation
For both domains, we evaluated SA-Net using stratified 5-fold cross validation. 500 annotated RGB and depth image pairs were utilized to train a Faster R-CNN Ren et al. (2015), whose output was then used to train a YOLO network that provides the bounding boxes for the cropped images. The full data sets consist of 60K annotated sets of RGB and depth image frame pairs for the patrolling domain and 10K such sets for the manipulation domain. Each set consists of an uncropped pair and three cropped pairs for time points $t-2$, $t-1$, and $t$.
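For readers unfamiliar with stratified cross validation, the sketch below shows one simple way to build five folds that each preserve the class proportions of the full data set. It is a generic standard-library illustration, not the procedure used for SA-Net; the function name `stratified_folds` and the choice of which label to stratify on (for example, the action class) are assumptions.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def stratified_folds(labels: Sequence[Hashable], k: int = 5,
                     seed: int = 0) -> List[List[int]]:
    """Split sample indices into k folds with roughly equal class proportions."""
    rng = random.Random(seed)
    by_class: Dict[Hashable, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds: List[List[int]] = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the samples of each class round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds
```

In each of the five runs, one fold serves as the test set and the remaining four as the training set; the per-run accuracies can then be summarized by their mean and standard deviation.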
Table 1 shows the prediction accuracy on the 2D or 3D coordinates and orientation that make up the state, and on the action, for each domain. We show the results for each of the 5 runs along with the mean and standard deviation across the runs. Notice that in both domains, SA-Net generates predictions of state and action with very high accuracy, with those for the manipulation domain being slightly less accurate than those for the patrolling domain. This holds consistently across all folds, which is why the standard deviations are low.
| Model variant | X (%) | Y (%) | Orientation (%) | Action (%) |
|---|---|---|---|---|
| SA-Net w/o Relative X & Y | 81.378 ± 1.4965 | 91.436 ± 1.4559 | 91.237 ± 1.2587 | 83.633 ± 1.8564 |
| SA-Net w/o data from t-1, t-2 | 96.56 ± 0.0063 | 98.32 ± 0.0048 | 79.43 ± 1.3344 | 78.86 ± 3.5681 |
| SA-Net w/o depth channel | 87.23 ± 1.4968 | 95.12 ± 1.4861 | 83.56 ± 1.5189 | 81.12 ± 1.00001 |
| SA-Net w/o object detect | 68.74 ± 2.2565 | 69.95 ± 1.2556 | 21.65 ± 2.3909 | 33.89 ± 0.9604 |
3.3 Ablation Study
We performed an ablation study to understand the sensitivity of SA-Net’s performance on key components of the network. The ablation study removes a part of the network and conducts experiments on the revised model.
Relative X and Y In this experiment on the patrolling domain, we eliminate the part of SA-Net that contributes to establishing the 2D grid coordinate of the observed robot relative to the observer’s location. This part relies more heavily on the depth data. Consequently, we may expect the network to memorize the location by relying more on RGB data while being unable to detect changes in its own deployed position. Row 1 of Table 2 shows a significant drop in the prediction accuracy of state and action, with a more pronounced drop in the accuracy of predicting the X-coordinate and the action. These two rely significantly more on the relative distances.
Temporal sequence data In this experiment, we eliminate the part of SA-Net responsible for processing temporal data from the previous time steps $t-1$ and $t-2$. This also eliminates those two input channels and keeps the input from time step $t$ only. We hypothesize that this removal significantly impacts the recognition of orientation and action, both of which are thought to rely on sequence data. On the other hand, a single frame could be sufficient to identify the orientation in many cases. Table 2, row 2 presents prediction accuracies that are significantly lower for the orientation and action, while recognition of the 2D coordinates is generally not affected. As such, the temporal data is indeed important for the orientation and for SA-Net in general.
Multimodal data Next, we study whether depth data is needed for the predictions and how the network behaves when it is removed. Can we make the network learn the state and action from RGB data only? Row 3 of Table 2 shows that the predictions of the X-coordinate, orientation, and action are significantly degraded in the absence of the depth channel. The Y-coordinate is least impacted, as we may expect. As a patroller approaches the observer, there are multiple states for which the RGB frames are similar. In the absence of depth, the network memorizes certain features and overfits on those characteristics.
Object detection Finally, we removed the object detection performed by YOLO, resulting in no cropped images as input. The drastic drop in prediction quality of all coordinates, orientation, and action (row 4) gives evidence that object detection is required. Coordinate recognition suffers because object detection is needed for masking each expert in the context of multiple experts. In recognizing the orientation and action, object detection plays a more integral role by focusing SA-Net’s attention, which is demonstrated by a larger degradation in their prediction accuracy.
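| Method | X (%) | Y (%) | Z (%) | Orientation (%) | Action (%) |
|---|---|---|---|---|---|
| Baseline: Centroid method | 94.15 ± 0.00 | 96.13 ± 0.00 | N/A | 93.16 ± 0.00 | 78.26 ± 0.68 |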
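| Method | X (%) | Y (%) | Orientation (%) | Action (%) |
|---|---|---|---|---|
| SA-Net w/ Noise | 92.65 ± 0.87 | 96.65 ± 0.72 | 95.23 ± 0.40 | 95.12 ± 0.76 |
| Centroid method w/ Noise | 34.20 ± 6.62 | 44.43 ± 1.42 | 23.23 ± 1.31 | 42.45 ± 1.88 |
| SA-Net w/ Occlusion | 45.15 ± 0.87 | 54.60 ± 0.76 | 64.12 ± 0.99 | 46.36 ± 1.00 |
| Centroid method w/ Occlusion | 18.23 ± 2.13 | 17.34 ± 1.57 | 14.42 ± 0.15 | 43.12 ± 0.80 |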
3.4 Summative Evaluation on Physical Robots
We deployed the trained SA-Net on a physical TurtleBot that observed two other TurtleBots patrolling the hallway and on a TurtleBot that is connected to a Kinect 360 overlooking a PhantomX arm. SA-Net can be used in ROS as a service and the corresponding component architecture is shown in Fig. 5.
Although, in general, it is challenging to report the prediction accuracy in online physical experiments, we logged the RGB-D stream and SA-Net’s predictions for each frame in the stream. These predictions were later verified manually. Table 3 reports the prediction accuracy of observed state-action pairs for both domains. We compared SA-Net’s performance on the patrolling domain with a traditional OpenCV based implementation that detects the centroid of the colored box on each robot. The extant method is particularly poor in recognizing the patroller’s action, and SA-Net improves on it drastically. SA-Net’s reduced accuracy on the manipulation domain is due to the increased complexity of a third dimension and more actions of the manipulator. Next, SA-Net’s prediction robustness was evaluated in various scenarios.
Noise test In this experiment, we test whether background noise impacts the prediction accuracy of the network. The noise is defined as objects that look like or have similar characteristics to the target, and dimmed ambient light. Such background objects, shown in Fig. 6, include a human wearing a similar-colored shirt and boxes of the same colors on the floor.
Occlusion test In this experiment, the target is covered partially to approximate 50% occlusion; we cover the TurtleBot with a cardboard box or a white cloth as shown in Fig. 6. These robots then patrol the hallways as before.
In Table 4, we show SA-Net’s prediction accuracy in each context. For the noise test, the predictions are averaged over 15 runs: 5 with a human, 5 with boxes, and 5 with dimmed ambient light. For the occlusion test, an average over 15 runs is again shown, with the object partially covered to approximate 50% occlusion. Notice that SA-Net’s predictions degrade in both tests, and rather dramatically under occlusion of the target object. The latter drop is because of SA-Net’s reliance on RGB data, which is curtailed under occlusion. Nevertheless, its predictions remain significantly better in both tests than those of the traditional centroid-based blob detection method. In particular, the centroid-based method fails to detect the observed robots under occlusion.
| Metric | Value |
|---|---|
| Memory usage | 742 MB ± 3 MB |
| Prediction time, Faster R-CNN + SA-Net | 6 s ± 0.4 s |
| Prediction time, YOLO v2 + SA-Net | 1.1 s ± 0.3 s |
How much memory is consumed by the ROS deployment of SA-Net? Table 5 reports the total amount of RAM held by the ROS service for good performance on state-action recognition. We also show the maximum time in seconds taken by SA-Net for prediction when paired with Faster R-CNN and when paired with YOLO v2 for the patrolling domain, which has two targets. Notice that pairing with YOLO v2 speeds up the prediction by a factor of more than five (6 s versus 1.1 s).
4 Concluding Remarks
SA-Net brings the recent advances in deep supervised learning to bear on a crucial step in LfD and imitation learning. It represents a general architecture for recognizing state-action pairs from RGB-D streams, which are then input to underlying methods for LfD such as inverse reinforcement learning. SA-Net demonstrates recognition accuracies on diverse robotics applications that are significantly better than previous conventional techniques. While minor changes in component layers may be beneficial, an ablation study revealed that the major architectural parts of the neural network are indeed needed. A low resource utilization signature allows SA-Net to be deployed using the relatively sparse computing resources on board robotic platforms.
SA-Net also brings another benefit to LfD. Recent techniques, such as maximum entropy deep inverse reinforcement learning Wulfmeier et al. (2015), utilize a neural network to perform inverse reinforcement learning. Consequently, this offers an opportunity to integrate SA-Net into the neural network for inverse reinforcement learning, optimizing synergies. This offers the potential for an end-to-end deep learning approach for LfD in the future.
- Abbeel et al. (2007) Pieter Abbeel, Adam Coates, Morgan Quigley, and Andrew Y Ng. An application of reinforcement learning to aerobatic helicopter flight. Advances in Neural Information Processing Systems (NIPS), pages 1–8, 2007.
- Argall et al. (2009) Brenna D Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5):469–483, 2009.
- Bogert and Doshi (2014) Kenneth Bogert and Prashant Doshi. Multi-robot inverse reinforcement learning under occlusion with interactions. In International Conference on Autonomous Agents and Multi-Agent Systems, pages 173–180, 2014.
- Eitel et al. (2015) Andreas Eitel, Jost Tobias Springenberg, Luciano Spinello, Martin Riedmiller, and Wolfram Burgard. Multimodal deep learning for robust RGB-D object recognition. In Intelligent Robots and Systems (IROS), pages 681–687. IEEE, 2015.
- Girshick et al. (2014) Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 580–587, 2014.
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
- Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
- Ji et al. (2013) S. Ji, W. Xu, M. Yang, and K. Yu. 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):221–231, 2013.
- Montes et al. (2016) Alberto Montes, Amaia Salvador, Santiago Pascual, and Xavier Giro-i Nieto. Temporal activity detection in untrimmed videos with recurrent neural networks. arXiv preprint arXiv:1608.08128, 2016.
- Nishi et al. (2017) Tomoki Nishi, Prashant Doshi, and Danil V. Prokhorov. Freeway merging in congested traffic based on multipolicy decision making with passive actor critic. CoRR, abs/1707.04489, 2017.
- Pollard and Hodgins (2004) Nancy S Pollard and Jessica K Hodgins. Generalizing demonstrated manipulation tasks. Algorithmic Foundations of Robotics V, pages 523–539, 2004.
- Redmon and Farhadi (2018) Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
- Ren et al. (2015) Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NIPS), pages 91–99, 2015.
- Wang et al. (2017) Pichao Wang, Wanqing Li, Zhimin Gao, Yuyao Zhang, Chang Tang, and Philip Ogunbona. Scene flow to action map: A new representation for RGB-D based action recognition with convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 416–425, 2017.
- Wulfmeier et al. (2015) Markus Wulfmeier, Peter Ondruska, and Ingmar Posner. Maximum entropy deep inverse reinforcement learning. arXiv preprint arXiv:1507.04888, 2015.
- Yue-Hei Ng et al. (2015) Joe Yue-Hei Ng, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and George Toderici. Beyond short snippets: Deep networks for video classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4694–4702, 2015.