Coupled Recurrent Network (CRN)

12/25/2018 ∙ by Lin Sun, et al.

Many semantic video analysis tasks can benefit from multiple, heterogeneous signals. For example, in addition to the original RGB input sequences, sequences of optical flow are usually used to boost the performance of human action recognition in videos. To learn from these heterogeneous input sources, existing methods rely on two-stream architectural designs that contain independent, parallel streams of Recurrent Neural Networks (RNNs). However, two-stream RNNs do not fully exploit the reciprocal information contained in the multiple signals, let alone exploit it in a recurrent manner. To this end, we propose in this paper a novel recurrent architecture, termed Coupled Recurrent Network (CRN), to deal with multiple input sources. In CRN, the parallel streams of RNNs are coupled together. The key design of CRN is a Recurrent Interpretation Block (RIB) that supports learning of reciprocal feature representations from multiple signals in a recurrent manner. Unlike standard RNN training, which applies the training loss at every time step or only at the last time step, we propose an effective and efficient training strategy for CRN. Experiments show the efficacy of the proposed CRN. In particular, we achieve a new state of the art on benchmark datasets for human action recognition and multi-person pose estimation.




1 Introduction

Figure 1: Illustration of CRN with LSTM units. The blocks outlined in red and purple represent the input sources $A$ and $B$. Within each block, one time-step LSTM is illustrated, and $i_t$, $f_t$ and $o_t$ are the input gate, forget gate and output gate for time step $t$, respectively. The filled red and purple rectangles represent the recurrent adaptation block (RAB). The filled red and purple rectangles with a white boundary represent the recurrent interpretation block (RIB).

Many computer vision tasks rely on semantic analysis of data in sequential forms. Typical examples include video-based human action recognition [1], image/video captioning [2], and speech recognition [3]. In other cases, tasks of interest can be recast as sequential learning problems, so that their learning objectives can be achieved iteratively. For example, in human pose estimation, the joint locations can be predicted using multi-stage CNNs, where the hidden features produced by one stage are used as input for the next stage. This multi-stage scheme can be cast as a recurrent scheme in which the hidden features from one time step are fed into the next time step for refinement. Similarly, [4] models the distribution of natural images using an RNN. By factorizing the joint modeling problem into a sequential problem, the RNN offers a compact and shared parametrization of a series of modeling steps, where the model learns to predict the next pixel given all the previously generated pixels. RNNs are a typical choice of model for the above time- or spatial-domain sequential learning problems, in which each neuron or unit can use its internal memory to maintain information about the previous input. Among these sequence-prediction recurrent networks, the Long Short-Term Memory network (LSTM) [5] has been observed to be the most effective.

RNNs have become the de facto learning models for many computer vision tasks whose observed input data are, or can be recast, in a sequential form. In these tasks, we might have access to multiple, heterogeneous input sources. To benefit from these heterogeneous data, existing RNN models can be deployed separately to learn from the individual input sources, and the results of the individual models are then either fused or post-processed to achieve the final objective of sequential learning. For example, in human action recognition, a sequence of RGB frames and a sequence of optical flows are fed into two independent models. The output probabilities from these two models are combined by a plain or weighted average, and the predicted category is the largest entry of the fused probability.
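The score-level fusion described above can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the function names, the weight `w_rgb`, and the toy probabilities are all assumptions made for the example.

```python
def fuse_probabilities(p_rgb, p_flow, w_rgb=0.5):
    """Weighted average of two per-class probability lists.

    w_rgb is a hypothetical weight on the RGB stream; the flow stream
    gets 1 - w_rgb. With w_rgb = 0.5 this is a plain average.
    """
    if len(p_rgb) != len(p_flow):
        raise ValueError("both streams must score the same classes")
    return [w_rgb * a + (1.0 - w_rgb) * b for a, b in zip(p_rgb, p_flow)]


def predict(p_rgb, p_flow, w_rgb=0.5):
    """Return the index of the largest entry of the fused probability."""
    fused = fuse_probabilities(p_rgb, p_flow, w_rgb)
    return max(range(len(fused)), key=fused.__getitem__)


# Toy example with three classes (made-up numbers).
rgb = [0.6, 0.3, 0.1]
flow = [0.2, 0.7, 0.1]
print(predict(rgb, flow))        # plain average of the two streams
print(predict(rgb, flow, 0.9))   # RGB stream weighted heavily
```

Note that the two streams only meet at this final averaging step, which is the limitation the coupled design is meant to address.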

Using CNNs/RNNs independently on different input sources and fusing the probabilities at the end does not fully exploit the reciprocal information between the input sources. Moreover, there has been little research on how to exploit this reciprocal information. Therefore, in this paper, we propose the Coupled Recurrent Network (CRN) to achieve more effective sequential learning from multiple input sources. CRN has two recurrent branches, i.e. branch A and branch B, each of which takes one input source. Unlike in a two-stream architecture, branch A and branch B can benefit from each other during the learning of CRN. To interpret the reciprocal information between them well, we propose two modules within CRN, the recurrent interpretation block (RIB) and the recurrent adaptation block (RAB). The hidden output from branch A passes through the RIB to extract reciprocal information. The extracted reciprocal information is concatenated with the input source of branch B, and the result passes through the RAB to obtain the re-mapped input for branch B. The extraction module RIB and the remapping module RAB are recurrent as well. An illustration of a CRN is shown in Fig. 1.

The proposed CRN can be generalized to many computer vision tasks that can be represented by two input sources. In this paper, we apply the proposed CRN to two human-centric problems, i.e. human action recognition and multi-person pose estimation. In human action recognition, a sequence of RGB frames and the corresponding optical flows are the two heterogeneous input sources for the two branches in CRN. Two cross-entropy losses of identical form, for the same recognition task, are applied at the end of each branch. In multi-person pose estimation, besides the commonly used individual body joints, a field of part/joint affinities that characterizes pair-wise relations between body joints [6] is used as additional supervision information. Two different regression losses, one for joint estimation and the other for vector prediction, are applied at the end of each network.

The standard supervision for RNN training is to apply an appropriate loss for each input source either at every time step or only at the last time step. However, in our experiments, we find that neither of these two training strategies works well for CRN. Supervising at every time step seems to make the supervision signals assertive and arbitrary, so that the whole training becomes numerically unstable and performance is poor. Supervising only at the last time step makes the supervision signal too weak to propagate back through the earlier time steps and leads to a performance drop on the considered tasks. Therefore, we propose a new training scheme that balances the supervision strength along the time steps. In addition to applying the loss at the last time step, we randomly select some previous time steps for supervision. Only the losses at the selected time steps contribute to back-propagation.

Comparative experiments show that our proposed CRN outperforms the baselines by a large margin. CRN sets a new state of the art on benchmark datasets for human action recognition (e.g., HMDB-51 [7], UCF-101 [8] and the larger dataset Moments in Time [9]) and multi-person pose estimation (e.g., MPII [10]). Moreover, since better reciprocal information can be exploited within CRN, even with RGB and RGB differences as input sources, CRN reaches 93.0% average accuracy on UCF-101 at about 200 FPS (Table 6), which serves the requirements of real-time applications well.

  • We propose a novel architecture, Coupled Recurrent Network (CRN), to deal with multiple input sources in a reciprocal and recurrent manner. In order to interpret the representations of each input source well, recurrent interpretation block (RIB) and recurrent adaptation block (RAB) are proposed as two important modules. RIB is used to distill the useful information from the output of one branch for the other branch. RAB provides the remapping of two concatenated representations.

  • In the paper, two tasks, i.e. human action recognition and human pose estimation, are investigated and analyzed using the proposed CRN. However, our proposed method can be generalized to many computer vision tasks that are, or can be recast as, sequential learning problems.

  • An effective and efficient training strategy for CRN is proposed.

  • Extensive quantitative and qualitative evaluations are presented to analyze and verify the effectiveness of our proposed method. We conduct several ablation studies to validate our core contributions.

2 Related Works

In this section, we first present a brief review of existing works that use RNNs for different computer vision tasks, focusing particularly on those algorithms that deal with multiple sources of sequential inputs. We then review representative methods of action recognition and human pose estimation using multiple input sources or CNNs/RNNs.

2.1 RNNs for Multiple Sources of Sequential Data

Countless learning tasks require dealing with sequential data. Image captioning [11], speech synthesis, and music generation all require that a model produce outputs that are sequences. In other domains, such as time series prediction, video analysis [12], and musical information retrieval, a model must learn from inputs that are sequences. RNNs are models that can handle the dynamics of sequences via cycles in the network of nodes. RNNs have also been applied successfully to sequential data with multiple modalities, e.g., they can take text and images as simultaneous input sources for better image recognition.

2.2 Human Action Recognition

Many works have tried to design effective CNNs/RNNs for action recognition in videos. Processing only RGB frames does not work well for video-based action recognition. Therefore, [13] proposes a two-stream network architecture. In this design, one network (the spatial network) processes RGB frames and the other (the temporal network) processes optical flows; the probabilities at the end of the two networks are combined by a plain or weighted average to predict the action category. Here, the two input sources complement each other: RGB images provide appearance representations, while optical flows explicitly capture motion information. Many works follow the trend of ‘two-stream’ networks. Tran et al. explored 3D ConvNets [14] on realistic and large-scale video datasets, where they tried to learn both appearance and motion features with 3D convolution operations. Sun [15] proposed a factorized spatiotemporal ConvNet with the difference between RGB images as additional motion information. Wang [16] proposed the temporal segment network (TSN), which is based on the idea of long-range temporal structure modeling, for fusing RGB images and flows. [17] investigates a gating CNN scheme to combine the information from RGB images and optical flows. [18] presents new approaches for combining different sources of knowledge in spatial and temporal networks. They propose feature amplification, where an auxiliary, hand-crafted feature (e.g. optical flow) is used to perform spatially varying soft-gating on intermediate CNN feature maps. They also present a spatially varying multiplicative fusion method for combining multiple CNNs trained on different sources. Even though the algorithm is sophisticated, its performance is far from satisfying. [19] also proposes that a spatial and a temporal network can be fused at a convolution layer.

Some LSTM-based two-stream networks have been proposed as well. [1, 20] proposed to train video recognition models using LSTMs that capture temporal state dependencies and explicitly model short snippets of ConvNet activations. Ng et al. [20] demonstrated that two-stream LSTMs outperform improved dense trajectories (iDT) [21] and two-stream CNNs [13], although they needed to pre-train their architecture on one million sports videos. VideoLSTM [22] applies convolutional operations within the LSTM on sequences of images or feature maps. Additionally, an attention model is stacked on top of the ConvLSTM to further refine the temporal features. Sun [23] also proposes a lattice LSTM for long and complex temporal modeling. These two-stream LSTMs were all trained independently and combined at the probability level. Even though the lattice LSTM trains the gates of the two streams jointly, its representations are not completely coupled.

2.3 Human Pose Estimation

Beyond action recognition, other computer vision tasks also use multiple branches to improve accuracy. Based on the multi-stage work of [24], Cao et al. [6] present a real-time pose estimation method. They add a bottom-up representation of association scores via part affinity fields (PAFs). Adding the joint association network in parallel to the joint detection network substantially improves multi-person pose estimation.

3 LSTM-based Coupled Recurrent Network

Long Short-Term Memory (LSTM) [5] is the dominant unit for RNNs due to its superior performance in many tasks. Usually, an LSTM contains a memory cell to retain information, plus an input gate, a forget gate and an output gate to control the information flow. In this paper, we adopt the LSTM, in particular the convolutional LSTM (ConvLSTM) [25], as the basic unit in CRN. To simplify presentation, in the rest of the paper we use the abbreviation LSTM instead of ConvLSTM.

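Formally, let $X_t^A$ and $X_t^B$ denote the two input sources at time step $t$, where $A$ and $B$ are the indices of the input sources. $X_t^A$ and $X_t^B$ are usually in the form of 2D images or feature representations. A CRN contains two branches, each of which handles one input source. Since the two branches in CRN are symmetric, in the following we walk through branch A step by step to present the whole computation flow of the proposed CRN. Starting from the cell memory, which maintains information over time within the recurrence, the candidate memory at time step $t$ is

$$\tilde{C}_t^A = \tanh\left(W_{xc}^A * \tilde{X}_t^A + W_{hc}^A * H_{t-1}^A\right),$$

where $W_{xc}^A$ and $W_{hc}^A$ are, respectively, the weights for the input and hidden states. The symbol $*$ denotes the convolution operation and $H_{t-1}^A$ is the hidden output at time step $t-1$ in branch A. The concatenation of $X_t^A$ and the interpreted reciprocal hidden representation $\hat{H}_{t-1}^B$ from branch B is an enhanced representation. $\tilde{X}_t^A$ is the remapped input, which is obtained by passing the enhanced representation through the Recurrent Adaptation Block (RAB):

$$\tilde{X}_t^A = \mathcal{F}_{RAB}^A\left(\left[X_t^A, \hat{H}_{t-1}^B\right]\right),$$

where $\mathcal{F}_{RAB}^A$ denotes the function of the RAB. $\mathcal{F}_{RAB}^A$ can be one or several convolutional layers, and its parameters are shared across time steps. $[\cdot, \cdot]$ denotes the concatenation operation.

As for the controlling gates, the input gate $i_t^A$ and forget gate $f_t^A$ at time step $t$ are computed as

$$i_t^A = \sigma\left(W_{xi}^A * \tilde{X}_t^A + W_{hi}^A * H_{t-1}^A\right), \qquad f_t^A = \sigma\left(W_{xf}^A * \tilde{X}_t^A + W_{hf}^A * H_{t-1}^A\right),$$

where $\sigma$ is the sigmoid function, $W_{xi}^A, W_{hi}^A$ are distinct weights for the input gate and $W_{xf}^A, W_{hf}^A$ are distinct weights for the forget gate. At the end of time step $t$, we obtain the updated memory cell $C_t^A$ from the previous memory cell $C_{t-1}^A$ and $\tilde{C}_t^A$:

$$C_t^A = f_t^A \odot C_{t-1}^A + i_t^A \odot \tilde{C}_t^A,$$

where ‘$\odot$’ denotes pixel-wise multiplication, a.k.a. the Hadamard product. The input gate and forget gate together determine the amount of dynamic information entering and leaving the memory cell. The final hidden output $H_t^A$ for time step $t$ is controlled by the output gate $o_t^A$,

$$o_t^A = \sigma\left(W_{xo}^A * \tilde{X}_t^A + W_{ho}^A * H_{t-1}^A\right), \qquad H_t^A = o_t^A \odot \tanh\left(C_t^A\right),$$

where $W_{xo}^A, W_{ho}^A$ are distinct weights for the output gate. Branch B is processed in the same way, yielding the hidden output $H_t^B$. Even though the features within CRN are coupled, the weights are initialized independently for each branch.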
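The computation flow of the two coupled branches can be sketched with a toy example. This is only an illustration under loud assumptions: scalars stand in for feature maps, every convolution is replaced by a scalar multiplication, the RIB and RAB are reduced to single scalar weights, and each branch is assumed to read the other branch's hidden output from the previous time step. The names `branch_step`, `crn_forward` and the weight keys are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def branch_step(x, h_prev, c_prev, h_other_prev, w):
    """One time step of one branch; scalars stand in for feature maps.

    w is a dict of scalar weights. Every product below stands in for a
    convolution in the ConvLSTM formulation (an assumption of this toy).
    """
    h_hat = w["rib"] * h_other_prev                    # RIB: interpret the other branch
    x_tilde = w["rab_x"] * x + w["rab_h"] * h_hat      # RAB applied to the pair [x, h_hat]
    c_tilde = math.tanh(w["xc"] * x_tilde + w["hc"] * h_prev)
    i = sigmoid(w["xi"] * x_tilde + w["hi"] * h_prev)  # input gate
    f = sigmoid(w["xf"] * x_tilde + w["hf"] * h_prev)  # forget gate
    o = sigmoid(w["xo"] * x_tilde + w["ho"] * h_prev)  # output gate
    c = f * c_prev + i * c_tilde                       # memory update
    h = o * math.tanh(c)                               # hidden output
    return h, c


def crn_forward(xs_a, xs_b, w_a, w_b):
    """Unroll both branches; each reads the other's previous hidden output."""
    h_a = c_a = h_b = c_b = 0.0
    outputs = []
    for x_a, x_b in zip(xs_a, xs_b):
        new_a = branch_step(x_a, h_a, c_a, h_b, w_a)
        new_b = branch_step(x_b, h_b, c_b, h_a, w_b)
        (h_a, c_a), (h_b, c_b) = new_a, new_b
        outputs.append((h_a, h_b))
    return outputs


# Made-up weights, kept separate per branch to mirror independent parameters.
keys = ("rib", "rab_x", "rab_h", "xc", "hc", "xi", "hi", "xf", "hf", "xo", "ho")
w_a = {k: 0.5 for k in keys}
w_b = {k: 0.5 for k in keys}
for h_a, h_b in crn_forward([1.0, 0.5, -0.2], [0.3, -1.0, 0.8], w_a, w_b):
    print(round(h_a, 4), round(h_b, 4))
```

In this toy, changing the first input of branch B leaves the first hidden output of branch A untouched but alters the second one, which is the coupling the section describes.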

3.1 Adapting to Different Tasks

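After obtaining $H_t^A$, we stack additional transformation layer(s) (e.g., convolutional or linear) to extract the final representations $Y_t^A$, adapting to different supervision tasks:

$$Y_t^A = \mathcal{T}^A\left(H_t^A\right),$$

where $\mathcal{T}^A$ denotes the transformation function(s).

In the training process, identical or different losses can be applied to each branch. In our case, $\mathcal{L}^A$ and $\mathcal{L}^B$ are two loss functions,

$$\ell_t^A = \mathcal{L}^A\left(Y_t^A, G^A\right), \qquad \ell_t^B = \mathcal{L}^B\left(Y_t^B, G^B\right),$$

where $G^A$ is the supervision target for branch A, and $G^B$ is the supervision target for branch B. $\mathcal{L}^A$ and $\mathcal{L}^B$ can be the Euclidean distance or the cross-entropy loss, depending on the task. $\ell_t^A$ and $\ell_t^B$ are the losses for branch A and B at time step $t$, respectively. The overall loss for CRN is

$$\ell = \sum_{t \in S} \left(\ell_t^A + \ell_t^B\right),$$

where $S$ is a set of selected time steps for supervision.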
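The selection of supervised time steps can be sketched as follows. This is a minimal illustration, assuming uniform random sampling without replacement among the earlier steps; the function names and the `num_extra` parameter are hypothetical, and the per-step losses are placeholders rather than real network outputs.

```python
import random


def select_supervised_steps(num_steps, num_extra=1, rng=random):
    """Pick the last time step plus num_extra random earlier steps."""
    if num_steps < 1:
        raise ValueError("need at least one time step")
    last = num_steps - 1
    num_extra = min(num_extra, last)  # cannot pick more earlier steps than exist
    extra = rng.sample(range(last), num_extra)
    return sorted(extra + [last])


def crn_loss(losses_a, losses_b, steps):
    """Sum both branches' per-step losses over the selected steps only."""
    return sum(losses_a[t] + losses_b[t] for t in steps)


# Ten time steps, one extra supervised step besides the last one.
steps = select_supervised_steps(10, num_extra=1, rng=random.Random(0))
print(steps)
print(crn_loss([1.0] * 10, [2.0] * 10, steps))
```

Steps outside the selected set contribute nothing to the sum, so they would receive no direct gradient from the loss.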

3.2 Extracting Reciprocal Representations

So far we have not described how the reciprocal representations are extracted. The interpreted reciprocal representation $\hat{H}_{t-1}^B$ is obtained by passing the hidden output $H_{t-1}^B$ from branch B through the Recurrent Interpretation Block (RIB):

$$\hat{H}_{t-1}^B = \mathcal{F}_{RIB}^B\left(H_{t-1}^B\right),$$

where $\mathcal{F}_{RIB}^B$ denotes the RIB. Like the RAB, it consists of one or several convolutional layers whose parameters are shared across time steps. Although the input sources or the supervision targets of the two branches are related, directly concatenating the input sources (i.e. $X_t^A$ or $X_t^B$) with the hidden output of the other branch (i.e. $H_{t-1}^B$ or $H_{t-1}^A$) yields only a limited performance improvement. The key question is how to ‘borrow’ the truly useful information from the other branch. Therefore, the RIB is designed to distill the reciprocal information at every time step. As illustrated in Fig. 2 (a), in order to extract the reciprocal information effectively, the RIB has a design similar to an inception module, with three convolutional paths that use different dilation ratios. To investigate the importance of the capacity of the RIB, a simplified version, sRIB, is provided in Fig. 2 (b). Experiments show that neither direct concatenation nor sRIB performs as effectively as RIB on the specific tasks. This indicates that when learning transferable knowledge from different input sources, it is important to distill as much reciprocal or useful information as possible.
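The idea of parallel convolutional paths with different dilation ratios can be illustrated in one dimension. This is a toy sketch, not the RIB itself: it uses 1D signals instead of 2D feature maps, a single shared made-up kernel, and the dilation values (1, 2, 3) are assumptions, since the text does not state the ratios used.

```python
def dilated_conv1d(signal, kernel, dilation):
    """Same-size 1D cross-correlation with zero padding and a given dilation."""
    half = (len(kernel) // 2) * dilation
    out = []
    for pos in range(len(signal)):
        total = 0.0
        for k, weight in enumerate(kernel):
            idx = pos - half + k * dilation  # taps are spaced `dilation` apart
            if 0 <= idx < len(signal):
                total += weight * signal[idx]
        out.append(total)
    return out


def rib_like(signal, kernel, dilations=(1, 2, 3)):
    """Run parallel paths with different dilations and stack their outputs."""
    return [dilated_conv1d(signal, kernel, d) for d in dilations]


# An impulse shows how each path covers a different neighbourhood.
impulse = [0, 0, 0, 1, 0, 0, 0]
for path in rib_like(impulse, [1, 1, 1]):
    print(path)
```

Larger dilations let a path respond to positions farther from the centre without adding kernel weights, so stacking the paths mixes context at several ranges.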

Figure 2: Illustration of two architectures of Recurrent Interpretation Block (RIB). (a) RIB (b) sRIB

Figure 3: Illustration of CRNs for specific tasks. Each block represents one time step of the recurrent networks. Blocks of different colors process different input sources. Left (light green) presents the Non-Coupled Recurrent Network (N-CRN). Right (dark green) shows the architecture of CRN. $X^A$ represents input source $A$ and $X^B$ is for source $B$. $Y^A$ and $Y^B$ are the outputs for each source, respectively. The inputs for human action recognition and human pose estimation are different. Rectangles with filled colors denote the RAB and thick arrow lines illustrate the RIB. Rectangles filled with light red and light purple represent transformation layers. Better viewed in color.

4 Use of CRN for Specific Tasks

In this section, we present how our proposed CRNs can be applied to two human-centric tasks, namely video-based human action recognition and image-based multi-person pose estimation. Fig. 3 gives an illustration, where the recurrent networks are unrolled to better present the information flow. As a baseline, the architecture of the non-coupled recurrent network (N-CRN) is shown on the left side of the figure, while an exemplar design of CRN is on the right side. N-CRN is a modified two-stream architecture in which each input source is concatenated only with the previous hidden output of its own branch. $X^A$ represents the sequential data for input source $A$ and $X^B$ is for input source $B$. The inputs can be raw images or feature maps extracted from intermediate layers of pre-trained models. $H^A$ and $H^B$ are the sequential hidden outputs from each branch.

In human action recognition, the input is a sequence of RGB frames and a corresponding sequence of optical flows or RGB differences, and the objective is to classify the input video as one of the action categories. The output of branch A at time step $t$ is $Y_t^A \in \mathbb{R}^{N}$, where $N$ is the number of categories. $Y_t^B$ has the same dimension and supervision target as $Y_t^A$. Two cross-entropy losses of identical form are added, one at the end of each branch. The probabilities from the two branches are averaged for the final prediction.

In multi-person pose estimation, the input is a sequence of repeated copies of the same image, and the objective is to estimate the 2D locations of body joints for each person in this image. CRN simultaneously outputs a set of hidden features $Y_t^A$ for heat map prediction and $Y_t^B$ for 2D PAF prediction, which encodes the degree of association between body joints. $Y_t^A$ has $J$ feature maps with resolution $w \times h$, each of which corresponds to one body joint at time step $t$. $Y_t^B$ has $K$ vector fields whose width is $w$ and height is $h$, each of which corresponds to a limb of the human body at time step $t$. An $L_2$ loss is applied at the end of each branch to learn the heat maps and PAFs by supervision. We follow the greedy relaxation in [6] to find the optimal parsing. The joint locations are estimated using both the predicted heat maps and the PAFs.

5 Experiments

We apply our proposed CRNs to popular computer vision tasks of multi-person pose estimation and human action recognition. For the human action recognition, we use three large scale benchmark datasets:

UCF-101 [8] is composed of realistic web videos. It has 101 categories of human actions with more than 13K videos, with an average length of 180 frames per video. It has three split settings to separate the dataset into training and testing videos. The mean classification accuracy over these three splits needs to be reported for evaluation.

HMDB-51 [7] has a total of 6766 videos organized as 51 distinct action categories. Similar to UCF-101, HMDB-51 also has three split settings to separate the dataset into training and testing videos, and the mean classification accuracy over these three splits should be reported.

Moments in Time [9] consists of over 1,000,000 3-second videos corresponding to 339 different verbs depicting an action or activity. Each verb is associated with over 1,000 videos, resulting in a large balanced dataset for learning a basis of dynamical events from videos.

For human pose estimation, we use a multi-person pose estimation dataset:

The MPII dataset [10] consists of 3844 training and 1758 testing groups of real-world images featuring crowding, occlusion, scale variation and overlapping people. We use 3544 images as the multi-person training set and hold out 300 images for validation in our experiments.

5.1 Implementation Details

For action recognition, bninception [26] and inceptionv3 [27] are used as the pre-processing (backbone) networks, and features from their last convolutional layers are fed into the CRN. RGB frames are paired either with the corresponding optical flows or with RGB differences, and each pair passes through the backbones simultaneously. Each branch in CRN is a two-layer LSTM. All 2D convolution kernels share a single size and the number of hidden feature maps is 512. Ten sequential frames are fed into the system for training. The detailed architecture of the RIB is shown in Fig. 2 (a), and the RAB is a single convolutional layer. The transformation layer is a global pooling layer followed by a fully connected layer. The whole system is trained end-to-end using SGD with an initial learning rate of 1e-2; the learning rate of the pre-trained backbone is 1e-3. The momentum is set to 0.9 and the weight decay is 5e-4. Besides the last time step, we randomly select one additional time step from the previous time steps for back-propagation. When testing, we regularly sample ten clips of ten frames each and average their probabilities.
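The test-time protocol of regularly sampling clips and averaging their probabilities can be sketched as below. This is an illustration under assumptions: clips are taken to be runs of consecutive frames with evenly spaced start positions, and the function names are hypothetical; the text does not spell out the exact sampling rule.

```python
def sample_clip_starts(num_frames, num_clips=10, clip_len=10):
    """Evenly spaced start indices so that every clip fits inside the video."""
    last_start = num_frames - clip_len
    if last_start < 0:
        raise ValueError("video is shorter than one clip")
    if num_clips == 1:
        return [0]
    return [round(i * last_start / (num_clips - 1)) for i in range(num_clips)]


def average_probabilities(per_clip_probs):
    """Average per-class probabilities over all sampled clips."""
    n = len(per_clip_probs)
    return [sum(col) / n for col in zip(*per_clip_probs)]


# A 180-frame video, the average UCF-101 length quoted in Section 5.
starts = sample_clip_starts(180)
clips = [list(range(s, s + 10)) for s in starts]  # frame indices per clip
print(starts)
print(average_probabilities([[0.2, 0.8], [0.4, 0.6]]))
```

The averaged vector would then be fused with the other branch's average before taking the largest entry.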

Like [6], we pre-process pose estimation images using VGG-19 [28] pre-trained on ImageNet [29]. The whole system is trained using SGD with an initial learning rate of 2e-4, while the pre-processing network is trained with a learning rate of 5e-5. Each branch in CRN is a two-layer LSTM built entirely from convolutions. Ten repeated copies of the image are fed into CRN. We use RIB and RAB architectures similar to those in the action recognition task.

All implementations, including [6], are based on PyTorch [30] and run on a GTX 1080. All experiments are run and evaluated under the same settings as instructed.

5.2 Evaluation on Action Recognition

The effect of RIB: In order to verify that distilling reciprocal information from each input source is useful for the task, we test CRN using different backbone networks with and without different RIB architectures. The detailed performance is shown in Table 1, where ‘S-Nets’ denotes the spatial networks and ‘T-Nets’ denotes the temporal networks. The paired input sources can be RGB frames and optical flows, or RGB frames and the corresponding RGB differences. From the table, we can see that without RIB (1 vs. 2, 3 and 4 vs. 5, 6 in UCF-101 and HMDB-51, respectively), directly concatenating the hidden output without distilling does not provide good performance. Under all backbones, CRN with RIB achieves better performance than CRN with sRIB, with gains of 0.5 to 4.0 percentage points in Table 1. A stronger RIB module lets our CRN transfer more useful reciprocal representations to the other branch. The results indicate that how, and how much, information is distilled from the other input source affects the final performance. Moreover, the whole procedure is recurrent: iteratively refining the interpreted representations lets our coupled learning generate better features for each input source. A CRN with the inceptionv3 backbone and RIB achieves the best performance. Unless otherwise noted, CRN refers to this architecture in the following paragraphs. Please note that the performance each branch of CRN achieves on the two benchmark datasets is already better than the fused performance of other sophisticated state-of-the-art methods with a two-stream architecture.

The effect of training strategy: CRN cannot be trained well by adding the loss only at the last time step or by adding it at every time step. We evaluate different training strategies on UCF-101 in Table 2: (a) supervises only at the last time step, (b) supervises at every time step, (c) supervises at the last time step and one selected previous time step, and (d) supervises at the last time step and two selected previous time steps. As indicated in the introduction, strategy (c) balances the supervision strength within CRN and therefore achieves better performance. We use (c) for training our CRN.

The effect of coupled recurrence: From Table 3, we can see that by leveraging each other's reciprocal information, both the spatial and the temporal networks achieve better performance. Since our CRN needs a paired input and generates a paired output, the spatial network results listed here are the average of the two CRNs, one trained on RGB images with flows and the other on RGB images with RGB differences. The accuracy on split 1 of UCF-101 with a bninception backbone is 90.4% for spatial networks, 91.8% for temporal networks trained using flows, and 89.5% for temporal networks trained using RGB differences. With an inceptionv3 backbone we achieve 92.1% for spatial networks, 93.5% for temporal networks trained using flows, and 91.6% for temporal networks trained using RGB differences on split 1 of UCF-101. These results surpass all the independently trained two-stream networks for both spatial and temporal networks. Although our N-CRN is not a strong model, we see a performance boost after combining the models (CRN + N-CRN). We expect our CRN models to be a good complement to any independently trained two-stream models.

| Dataset | No. | Training settings | RGB+Flow S-Nets | RGB+Flow T-Nets | RGB+Diff S-Nets | RGB+Diff T-Nets |
|---|---|---|---|---|---|---|
| UCF-101 | 1 | CRN (bninception) | 88.3% | 90.6% | 86.7% | 87.9% |
| UCF-101 | 2 | CRN (bninception + sRIB) | 90.4% | 92.3% | 88.7% | 89.3% |
| UCF-101 | 3 | CRN (bninception + RIB) | 91.4% | 93.0% | 89.5% | 90.7% |
| UCF-101 | 4 | CRN (inceptionv3) | 91.0% | 92.3% | 88.9% | 90.2% |
| UCF-101 | 5 | CRN (inceptionv3 + sRIB) | 91.8% | 92.9% | 90.4% | 90.8% |
| UCF-101 | 6 | CRN (inceptionv3 + RIB) | 93.0% | 93.5% | 91.2% | 91.6% |
| HMDB-51 | 1 | CRN (bninception) | 54.7% | 61.8% | 52.7% | 54.9% |
| HMDB-51 | 2 | CRN (bninception + sRIB) | 58.6% | 63.5% | 55.4% | 57.6% |
| HMDB-51 | 3 | CRN (bninception + RIB) | 60.3% | 67.5% | 55.9% | 59.0% |
| HMDB-51 | 4 | CRN (inceptionv3) | 61.5% | 64.1% | 57.1% | 58.2% |
| HMDB-51 | 5 | CRN (inceptionv3 + sRIB) | 63.1% | 65.8% | 59.0% | 59.9% |
| HMDB-51 | 6 | CRN (inceptionv3 + RIB) | 64.4% | 67.7% | 60.9% | 60.9% |

Table 1: Performance evaluation on UCF-101 and HMDB-51 using different settings of CRN

| Strategy | Setting | S-Nets | T-Nets (Flow) | T-Nets (Diff) |
|---|---|---|---|---|
| a | CRN (bninception) | 55.5% | 62.2% | 56.4% |
| b | CRN (bninception) | 47.3% | 57.2% | 49.7% |
| c | CRN (bninception) | 58.1% | 67.5% | 59.0% |
| d | CRN (bninception) | 56.9% | 66.9% | 58.6% |

Table 2: Evaluation with different training strategies on UCF-101
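
| Dataset | Training setting | S-Nets | T-Nets (Flow) | T-Nets (Diff) |
|---|---|---|---|---|
| UCF-101 | Clarifai [13] | 72.7% | 81.0% | - |
| UCF-101 | VGGNet-16 [16] | 79.8% | 85.7% | - |
| UCF-101 | BN-Inception [16] | 84.5% | 87.2% | 83.8% |
| UCF-101 | BN-Inception+TSN [16] | 85.7% | 87.9% | 86.5% |
| UCF-101 | N-CRN (bninception backbone) | 84.7% | 85.6% | 86.2% |
| UCF-101 | CRN (bninception backbone) | 90.4% | 91.8% | 89.5% |
| UCF-101 | CRN + N-CRN (bninception backbone) | 91.0% (+0.6) | 92.2% (+0.4) | 89.7% (+0.2) |
| UCF-101 | N-CRN (inceptionv3 backbone) | 85.7% | 87.2% | 86.9% |
| UCF-101 | CRN (inceptionv3 backbone) | 92.1% | 93.5% | 91.6% |
| UCF-101 | CRN + N-CRN (inceptionv3 backbone) | 92.8% (+0.7) | 94.0% (+0.5) | 93.9% (+2.3) |
| HMDB-51 | Clarifai [13] | 40.5% | 54.6% | - |
| HMDB-51 | BN-Inception+TSN [16] | 54.4% | 62.4% | - |
| HMDB-51 | N-CRN (bninception backbone) | 51.4% | 56.9% | 53.2% |
| HMDB-51 | CRN (bninception backbone) | 58.1% | 67.5% | 59.0% |
| HMDB-51 | CRN + N-CRN (bninception backbone) | 59.0% (+0.9) | 68.3% (+0.7) | 60.7% (+1.7) |
| HMDB-51 | N-CRN (inceptionv3 backbone) | 52.4% | 57.9% | 54.9% |
| HMDB-51 | CRN (inceptionv3 backbone) | 62.7% | 67.7% | 60.9% |
| HMDB-51 | CRN + N-CRN (inceptionv3 backbone) | 63.2% (+0.5) | 69.2% (+1.5) | 61.9% (+1.0) |

Table 3: Evaluation on split 1 of UCF-101 and HMDB-51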

Comparison with alternative designs and other state-of-the-art methods: The mean performance over the three splits of UCF-101 and HMDB-51, compared with the state of the art and with alternative designs, is shown in Table 4. CRN alone achieves comparable, if not better, performance. Together with an N-CRN model, our method reaches the highest accuracy on both datasets. We also evaluate alternative designs for fusing multiple input sources, as shown in Fig. 4. Most previous state-of-the-art methods adopt (a), a two-stream architecture; we also experiment with (b) and (c) for a fair comparison. (b) is a form of late fusion, in which the two hidden outputs from the two branches are fed into a fusion module. (c) is a form of early fusion, in which the concatenated inputs are fed into a fusion module and then pass through a recurrent network. Although effective compared with some other methods, neither early fusion nor late fusion provides better representations than our proposed CRN for the action recognition task.

Figure 4: Illustration of alternative designs for fusing different input sources.

| Method | UCF-101 | HMDB-51 |
|---|---|---|
| EMV-CNN [31] | 86.4 | - |
| Two Stream [13] | 88.0 | 59.4 |
| (SCI Fusion) [15] | 88.1 | 59.1 |
| C3D (3 nets) [14] | 85.2 | - |
| Feature amplification [18] | 89.1 | 54.9 |
| VideoLSTM [22] | 89.2 | 56.4 |
| TDD+FV [32] | 90.3 | 63.2 |
| Fusion [19] | 92.5 | 65.4 |
| Lattice LSTM [23] | 93.6 | 66.2 |
| ST-ResNet [33] | 93.4 | 66.4 |
| TSN [16] | 94 | 68.5 |
| Gated CNNs [17] | 94.1 | 70 |
| Early fusion (bninception) | 92.7 | 66.3 |
| Late fusion (bninception) | 92.4 | 66.5 |
| N-CRN (bninception backbone) | 92.2 | 65.7 |
| CRN (bninception backbone) | 93.5 | 67.8 |
| CRN + N-CRN (bninception backbone) | 94.6 | 69.4 |
| Early fusion (inceptionv3) | 92.7 | 67.1 |
| Late fusion (inceptionv3) | 92.5 | 67.5 |
| N-CRN (inceptionv3 backbone) | 93.1 | 66.3 |
| CRN (inceptionv3 backbone) | 94.1 | 68.2 |
| CRN + N-CRN (inceptionv3 backbone) | 94.9 | 70.6 |

Table 4: Mean accuracy (%) on the UCF-101 and HMDB-51 datasets

Besides these relatively large datasets, we also evaluate CRN on the much larger dataset Moments in Time [9]. The performance is shown in Table 5. CRN sets a new benchmark on Moments in Time by a large margin.

| Model | Modality | Top-1 (%) | Top-5 (%) |
|---|---|---|---|
| Chance | - | 0.29 | 1.47 |
| ResNet50-scratch [9] | Spatial | 23.65 | 46.73 |
| ResNet50-Places [9] | Spatial | 26.44 | 50.56 |
| ResNet50-ImageNet [9] | Spatial | 27.16 | 51.68 |
| TSN-Spatial [9] | Spatial | 24.11 | 49.10 |
| CRN-Spatial | Spatial | 27.32 | 50.01 |
| BNInception-Flow [9] | Temporal | 11.60 | 27.40 |
| ResNet50-DyImg [9] | Temporal | 15.76 | 35.69 |
| TSN-Flow [9] | Temporal | 15.71 | 34.65 |
| CRN-Flow | Temporal | 26.13 | 47.36 |
| CRN-RGBDiff | Temporal | 27.11 | 49.35 |
| TSN-2stream [9] | Spatial+Temporal | 25.32 | 50.10 |
| TRN-Multiscale [9] | Spatial+Temporal | 28.27 | 53.87 |
| Ensemble All [9] | Spatial+Temporal+Audio | 30.40 | 55.94 |
| CRN + N-CRN | Spatial+Temporal | 35.87 | 64.05 |

Table 5: Performance evaluation on Moments in Time

5.2.1 Real Time Action Recognition

Real-time action recognition is important for practical applications. Inspired by [15], the RGB difference between neighboring frames can serve as a good substitute for optical flow. Whereas optical flow requires a considerable amount of computation, the RGB difference can be obtained directly from the RGB frames at little cost. At the same time, better performance can be obtained by combining the predictions from branches that take the RGB difference and the RGB frames as input. Balancing speed and accuracy, RGB frames and RGB difference are therefore good input sources for real-time action recognition. However, as indicated in [34], compared with optical flow, the RGB difference provides only weak motion information, which limits the performance. Previously, the best two-stream result using RGB and RGB difference as input was 91.0% average accuracy on UCF-101 [34] (Table 6). Our proposed CRN can interpret the spatial and temporal information from each input source, so much better performance can be achieved even with RGB frames and RGB difference as input. As shown in Table 6, applying CRN raises the real-time action recognition accuracy on UCF-101 to 93.0% on average.
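To illustrate why the RGB difference is cheap to obtain, the sketch below computes it as a per-pixel, per-channel subtraction of consecutive frames. It is only an illustration under assumed conventions: the function name, the nested-list frame layout and the toy pixel values are hypothetical, and the paper does not specify its exact preprocessing.

```python
def rgb_difference(frames):
    """Return per-pixel differences between each frame and the one before it.

    `frames` is a list of T frames; each frame is a list of rows, and each row
    is a list of (R, G, B) tuples. The result contains T - 1 difference frames.
    """
    diffs = []
    for prev, curr in zip(frames, frames[1:]):
        diff = [
            [tuple(c - p for p, c in zip(prev_px, curr_px))
             for prev_px, curr_px in zip(prev_row, curr_row)]
            for prev_row, curr_row in zip(prev, curr)
        ]
        diffs.append(diff)
    return diffs


# Three hypothetical 1x2 frames: the first pixel is static, the second brightens.
frames = [
    [[(10, 10, 10), (0, 0, 0)]],
    [[(10, 10, 10), (50, 40, 30)]],
    [[(10, 10, 10), (90, 80, 70)]],
]
diffs = rgb_difference(frames)
```

Static pixels yield zero differences, so only moving or changing regions carry signal. This costs one subtraction per pixel value, with no flow estimation step.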

Method Speed (GPU) UCF101 Split 1 UCF101 Average
Enhanced MV [31] 390 FPS 86.6% 86.4%
Two-stream 3Dnet [35] 246 FPS - 90.2%
RGB Diff w/o TSN [34] 660 FPS 83.0% N/A
RGB Diff + TSN [34] 660 FPS 86.5% 87.7%
RGB Diff + RGB (both TSN) [34] 340 FPS 90.7% 91.0%
Ours (RGB Diff + RGB) 200 FPS* 92.2% 93.0%
  * Speed may vary when a different GPU is used.

Table 6: Performance evaluation of real time action recognition

5.3 Evaluation on Multi-person Pose Estimation

Figure 5: Performance for different joint locations on MPII as the number of time steps changes.

Figure 6: Visualization of the heat maps and corresponding PAFs at different time steps. Best viewed in color and zoomed in.

Human pose estimation is another dimension along which to analyze human activity. CRNs with different numbers of hidden feature maps are evaluated, as shown in Table 7.

Arch Hea Sho Elb Wri Hip Knee Ank mAP
Cao [6] 91.3 90.2 80.6 66.9 79.9 76.0 72.4 79.6
91.4 90.6 79.6 64.0 81.6 74.3 67.8 78.5
92.9 91.4 81.9 69.4 82.8 77.8 73.4 81.4
92.8 91.2 81.9 69.9 84.4 77.7 74.3 81.7
92.1 90.1 81.0 69.9 84.8 80.3 74.2 81.8
Table 7: Performance evaluation with different hidden features

Here, CRN_F* denotes a CRN with ‘*’ hidden feature maps in the LSTM unit. Even with 64 feature maps, our proposed method exceeds the state-of-the-art method [6]. As the number of hidden feature maps increases, the performance improves. Note that even with 128 feature maps, our proposed model is still smaller than the model proposed in [6]. From the table, the best CRN configuration reaches 81.8 mAP, compared with 79.6 for [6].
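The mAP column of Table 7 can be checked as the mean of the seven per-joint values in each row, and the gain over [6] follows by subtraction. The short sketch below does this arithmetic with the numbers copied from Table 7. The variable and function names are hypothetical, and the CRN rows are listed in table order because their labels are not legible in the table.

```python
# Per-joint AP from Table 7: Head, Shoulder, Elbow, Wrist, Hip, Knee, Ankle.
baseline = [91.3, 90.2, 80.6, 66.9, 79.9, 76.0, 72.4]  # Cao [6]
crn_rows = [
    [91.4, 90.6, 79.6, 64.0, 81.6, 74.3, 67.8],
    [92.9, 91.4, 81.9, 69.4, 82.8, 77.8, 73.4],
    [92.8, 91.2, 81.9, 69.9, 84.4, 77.7, 74.3],
    [92.1, 90.1, 81.0, 69.9, 84.8, 80.3, 74.2],
]


def mean_ap(per_joint):
    """Mean of the per-joint AP values, rounded to one decimal as in the table."""
    return round(sum(per_joint) / len(per_joint), 1)


baseline_map = mean_ap(baseline)
crn_maps = [mean_ap(row) for row in crn_rows]

# Gain of each CRN row over the baseline, in mAP points.
gains = [round(m - baseline_map, 1) for m in crn_maps]
```

The recomputed means match the mAP column of Table 7, and the last row gives a gain of 2.2 mAP points over [6].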

The performance varies with the number of time steps used for inference. As shown in Fig. 5, the mAP of easy joint locations, such as the head or shoulders, is almost stable once the time step reaches three. However, more difficult ones, such as the wrist or ankle, require more time steps for better performance. These experiments demonstrate the effectiveness of recursive refinement with CRN for image-based computer vision tasks.

5.4 Qualitative Performance Evaluation

The joint location and PAF predictions at different time steps are visualized in Fig. 6. As the time step increases, both the location predictions and the PAF predictions become more confident (brightness indicates the confidence level). Consider the joint predictions inside the dashed yellow rectangles along the time axis: initially the confidence of the prediction is weak, but it increases markedly as the time step increases.
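One simple way to read a confidence off such a heat map is to take its brightest cell as the predicted joint location and that cell's value as the confidence. The sketch below applies this to one hypothetical joint over three time steps. The function name and the toy heat maps are illustrative assumptions, not outputs of CRN or the paper's decoding procedure.

```python
def peak(heatmap):
    """Return (confidence, (row, col)) of the brightest cell in a 2D heat map."""
    return max(
        (value, (r, c))
        for r, row in enumerate(heatmap)
        for c, value in enumerate(row)
    )


# Hypothetical 2x2 heat maps of one joint at three consecutive time steps.
heatmaps = [
    [[0.1, 0.2], [0.3, 0.2]],
    [[0.1, 0.2], [0.6, 0.2]],
    [[0.0, 0.1], [0.9, 0.1]],
]

peaks = [peak(h) for h in heatmaps]
confidences = [conf for conf, _ in peaks]
locations = [loc for _, loc in peaks]
```

In this toy example the peak stays at the same location while its value grows from step to step, which is the behavior described for the joints inside the dashed rectangles of Fig. 6.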

Visualizations of the pose estimation on sample images from MPII [10] are shown in Fig. 7.

Figure 7: Visualization of pose estimation results on samples from the MPII dataset. (a), (c), (e) and (g) are the results from CRN; (b), (d), (f) and (h) are the results from [6]. Our proposed method gives better pose estimates under variation of viewpoint and appearance ((a) vs. (b), (e) vs. (f) and (g) vs. (h)) and under occlusion ((c) vs. (d)). Best viewed in color and zoomed in.

In this figure, (a), (c), (e) and (g) are the results generated using CRN, and (b), (d), (f) and (h) are the results generated using the method proposed in [6]. The figure shows that our proposed method, Coupled Recurrent Network (CRN), handles rare poses and appearances with fewer or no false part detections. Even for images with substantial overlap between the body parts of two people, our proposed method still works well, correctly associating parts with each person. The pairs (a) vs. (b), (e) vs. (f) and (g) vs. (h) show that CRN works well under variation of viewpoint and appearance, and (c) vs. (d) shows that CRN handles occluded poses better than the state-of-the-art method proposed in [6].

6 Summary

In this paper, we propose a novel architecture, called Coupled Recurrent Network (CRN), to learn better representations from multiple input sources. With the RIB module, reciprocal information can be well distilled from the related input source. Iterative refinement through recurrence improves the performance step by step. Extensive experiments are conducted on two tasks, human action recognition and multi-person pose estimation. Owing to the effective integration of features from different sources, our model achieves state-of-the-art performance on these human-centric computer vision tasks. We hope our work sheds light on other computer vision or machine learning tasks with multiple inputs.

7 Acknowledgment

The author would like to thank Dr. Xingyu Zhang for constructive comments that greatly improved the manuscript.


  • [1] Jeff Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, 2015.
  • [2] Haonan Yu, Jiang Wang, Zhiheng Huang, Yi Yang, and Wei Xu. Video paragraph captioning using hierarchical recurrent neural networks. CoRR, 2015.
  • [3] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013.
  • [4] Aäron Van Den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. In ICML, 2016.
  • [5] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, Nov 1997.
  • [6] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. arXiv preprint arXiv:1611.08050, 2016.
  • [7] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: a large video database for human motion recognition. In ICCV, 2011.
  • [8] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. CoRR, abs/1212.0402, 2012.
  • [9] Mathew Monfort, Bolei Zhou, Sarah Adel Bargal, Tom Yan, Alex Andonian, Kandan Ramakrishnan, Lisa Brown, Quanfu Fan, Dan Gutfreund, Carl Vondrick, et al. Moments in time dataset: one million videos for event understanding. CoRR, 2017.
  • [10] Mykhaylo Andriluka, Leonid Pishchulin, Peter Gehler, and Bernt Schiele. 2d human pose estimation: New benchmark and state of the art analysis. In CVPR, 2014.
  • [11] Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang, and Alan Yuille. Deep captioning with multimodal recurrent neural networks (m-rnn). ICLR, 2015.
  • [12] Subhashini Venugopalan. Natural-Language Video Description with Deep Recurrent Neural Networks. PhD thesis, Department of Computer Science, The University of Texas at Austin, August 2017.
  • [13] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
  • [14] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In ICCV, 2015.
  • [15] Lin Sun, Kui Jia, Dit-Yan Yeung, and Bertram E. Shi. Human action recognition using factorized spatio-temporal convolutional networks. In ICCV, 2015.
  • [16] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, 2016.
  • [17] Novanto Yudistira and Takio Kurita. Gated spatio and temporal convolutional neural network for activity recognition: towards gated multimodal deep learning. EURASIP Journal on Image and Video Processing, 2017(1), Dec 2017.
  • [18] Eunbyung Park, Xufeng Han, Tamara L. Berg, and Alexander C. Berg. Combining multiple sources of knowledge in deep cnns for action recognition. In WACV, 2016.
  • [19] C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutional two-stream network fusion for video action recognition. In CVPR, 2016.
  • [20] Joe Yue-Hei Ng, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and George Toderici. Beyond short snippets: Deep networks for video classification. In CVPR, 2015.
  • [21] Heng Wang and Cordelia Schmid. Action recognition with improved trajectories. In ICCV, 2013.
  • [22] Zhenyang Li, Kirill Gavrilyuk, Efstratios Gavves, Mihir Jain, and Cees G. M. Snoek. Videolstm convolves, attends and flows for action recognition. Computer Vision and Image Understanding, 166:41–50, 2018.
  • [23] Lin Sun, Kui Jia, Kevin Chen, Dit-Yan Yeung, Bertram E. Shi, and Silvio Savarese. Lattice long short-term memory for human action recognition. In ICCV, 2017.
  • [24] Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. Convolutional pose machines. In CVPR, 2016.
  • [25] SHI Xingjian, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-kin Wong, and Wang-chun Woo. Convolutional lstm network: A machine learning approach for precipitation nowcasting. In NIPS, 2015.
  • [26] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
  • [27] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.
  • [28] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [29] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
  • [30] Pytorch.
  • [31] Bowen Zhang, Limin Wang, Zhe Wang, Yu Qiao, and Hanli Wang. Real-time action recognition with enhanced motion vector CNNs. In CVPR, 2016.
  • [32] Limin Wang, Yu Qiao, and Xiaoou Tang. Action recognition with trajectory-pooled deep-convolutional descriptors. In CVPR, 2015.
  • [33] Christoph Feichtenhofer, Axel Pinz, and Richard P. Wildes. Spatiotemporal residual networks for video action recognition. CoRR, abs/1611.02155, 2016.
  • [34] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks for action recognition in videos. arXiv preprint, 2017.
  • [35] Ali Diba, Ali Mohammad Pazandeh, and Luc Van Gool. Efficient two-stream motion and appearance 3d cnns for video classification. arXiv preprint, 2017.