Leveraging Structural Context Models and Ranking Score Fusion for Human Interaction Prediction

08/18/2016 · Qiuhong Ke, et al. · Murdoch University and The University of Western Australia

Predicting an interaction before it is fully executed is very important in applications such as human-robot interaction and video surveillance. In a two-human interaction scenario, there is often a contextual dependency structure between the global interaction context of the two humans and the local context of the different body parts of each human. In this paper, we propose to learn the structure of the interaction contexts and combine it with the spatial and temporal information of a video sequence for a better prediction of the interaction class. The structural models, including the spatial structural and the temporal structural models, are learned with Long Short Term Memory (LSTM) networks to capture the dependency of the global and local contexts of each RGB frame and each optical flow image, respectively. LSTM networks are also capable of detecting the key information from the global and local interaction contexts. Moreover, to effectively combine the structural models with the spatial and temporal models for interaction prediction, a ranking score fusion method is introduced to automatically compute the optimal weight of each model for score fusion. Experimental results on the BIT-Interaction and the UT-Interaction datasets clearly demonstrate the benefits of the proposed method.


I Introduction

Human interaction prediction, or early recognition, has a wide range of applications. It can help prevent harmful events (e.g., fighting) in a surveillance scenario. It is also essential for human-robot interaction (e.g., when a human lifts his/her hand to handshake or opens his/her arms to hug, the robot can respond accordingly). Unlike action recognition, whose objective is to classify a full video, interaction prediction aims to infer the interaction class from a sequence containing only a partial observation of the full interaction activity. Interaction prediction is challenging due to the large variations in human postures during a complete interaction sequence.

Fig. 1: The structural model aims to capture the contextual dependency and salient information from the global and local interaction contexts.

Human interaction involves a sequence of postures of two humans in a specific scenario. The global interaction context of each video frame, containing the two humans, provides the overall information of their positions and relationship. The local context (including different body parts of each human) provides the fine-grained details of the body gestures.

Previous works focus on using the spatial and temporal information from videos for human interaction prediction [1, 2, 3, 4, 5]. Humans usually observe the content of a scene from the global context, to obtain overall information, followed by an observation of the local context, to acquire more details [6]. We hypothesize that the contextual dependency among the global and local contexts thus needs to be learned for a better prediction of the interaction class. Moreover, the global and local contexts play different roles depending on the type of interaction. In the case where the interaction involves the movements of both humans (e.g., walking towards each other or departing), the global context (containing the relationship between the two humans) is more useful than the local details. On the other hand, some other types of interactions mainly consist of the movements of a particular local body part of one human (e.g., the upper body part in “hugging” and the lower body part in “kicking”).

To learn the contextual dependency and to capture the salient information for interaction prediction, we propose to learn structural models from the global and local interaction contexts (as shown in Figure 1). More specifically, we organize the global and local interaction contexts in a sequential order and then use Long Short Term Memory (LSTM) networks [7] to process the sequence and learn the structural models. LSTM networks are designed for temporal modelling. They are capable of detecting salient keywords in sequential data, such as speech and sentences [8, 9]. LSTM networks have also been used to learn the spatial dependency and discover salient local features in images [10]. We propose to exploit the structural information of the interaction context by processing the sequence of the global and local contexts using LSTM networks. This allows us to capture the contextual dependency of the interaction, to detect the discriminative information that is relevant to the interaction class by “memorizing”, and to block the irrelevant information by “forgetting”. The learned structural models enhance the discriminative power of the global and local contexts for interaction prediction. Our experiments clearly show the benefits of using structural models for interaction prediction (see Section IV-B).

The temporal information of a video sequence is also very important for interaction prediction. Human interactions may last for a long period and can consist of multiple different sub-actions. It is insufficient to use a single frame that is captured before the interaction happens to infer the class. The temporal information that is captured along several consecutive frames, on the other hand, provides critical cues to predict a future interaction. To extract the temporal information, we adopt the temporal convolution network proposed in [5], which learns the temporal evolution using the features of several consecutive optical flow images. We also consider the spatial information of each video frame.

On this basis, our proposed interaction prediction framework is achieved through the incorporation of the structural, temporal and spatial models. These models have different discriminative abilities in classification. Previous works average or manually assign weights to fuse the different models [11, 12]. To effectively combine the complementary strengths of the proposed structural, spatial and temporal models, we introduce a new ranking score fusion method, which can automatically find the fusion weights of these models for the final interaction prediction decision. The advantage of the ranking score fusion method over a simple average fusion is shown in Section IV-B.

The main contributions of this paper relate to the proposed learning methodology for interaction prediction. First, we design novel structural models which are exploited using LSTM networks to process a sequence of global and local interaction contexts and improve the performance of interaction prediction. The structural models learn the contextual dependency and extract the discriminative information that is relevant to the interaction class. Experimental results clearly demonstrate the benefits of the proposed structural models (Section IV-B). Second, we develop a ranking score fusion method to combine the structural, spatial, and temporal models for the final prediction of an interaction class. The ranking score fusion method automatically finds the optimal weights of these models and is more robust compared to the average fusion approach. Third, we have evaluated our method on two interaction datasets and the experimental results demonstrate that the proposed method outperforms the state-of-the-art methods for human interaction prediction (see Section IV-B).

II Related Works

The proposed method mainly focuses on learning sequences of global and local contexts for human interaction prediction. In this section, we therefore briefly describe existing works on action prediction and sequence learning.

Ryoo [1] presented one of the early works on human interaction prediction. This work formulated the interaction prediction process as a posterior probability and represented the video frames with integral bag-of-words (IBoW) and dynamic bag-of-words (DBoW) to model the temporal evolution of features. Hoai and De la Torre [13] proposed a structured output SVM to train a detector to recognize partial events. When testing on action data, they used the Euclidean distance transform of binary masks between frames to create a codebook and computed a histogram of temporal words to represent a sequence of frames. Cao et al. [14] divided each activity into multiple ordered temporal segments and constructed a matrix basis for each segment with the spatio-temporal features from the training data. A sparse coding method (SC) is then used to approximate the features of the test video with one matrix basis or a mixture of several matrix bases (MSSC). Lan et al. [2] introduced a “Hierarchical Movemes” (HM) feature (i.e., combining features from coarse to fine temporal levels based on HOG, HOF and MBH features) as a representation and used an SVM to jointly learn the appearance models at different levels and their intra-level relationships to predict interactions. Kong et al. [3] represented partial videos using bag-of-words features and learned a multiple temporal scale support vector machine (MTSSVM) based on a structured SVM to recognize unfinished videos. Kong et al. [4] extended this work and proposed a max-margin action prediction machine (MMAPM) for early recognition of unfinished actions. The limitation of the above mentioned methods lies in their reliance on low-level features. Recently, Ke et al. [5] applied CNNs on flow coding images to learn the temporal information for human interaction prediction. This method uses only temporal features and lacks the discriminative spatial features of human postures.

Sequence learning has been used in the temporal domain to capture the temporal information associated with consecutive frames in a video. Traditional sequential models such as Hidden Markov Models (HMMs) [15] and Conditional Random Fields (CRFs) [16] have been successfully used for action recognition [17, 18, 19, 20]. However, they are not suitable for applications with high dimensional features [1], and they are not designed to learn long-term dependencies. LSTM networks [7], on the other hand, are capable of learning long-term dependencies. An LSTM network is a variant of Recurrent Neural Networks (RNNs) [21] with LSTM cells, which can remove or add information over a period of time. LSTM networks have been successfully applied to speech recognition [22], and to video description and recognition [23, 11, 24, 25]. LSTM networks have also been used to detect salient keywords in sentences for document retrieval [9]. Although RNNs and LSTM networks are designed for temporal modelling, they have also been used to process sequences of local features in images to exploit the spatial dependency and extract discriminative information for powerful image representations [26, 27, 10].

III Proposed Approach

Fig. 2: Outline of the proposed method. The goal of the proposed method is to effectively combine the spatial, spatial structural, temporal and temporal structural information to predict the interaction class from a subsequence containing only a partial observation of the interaction. We first compute optical flow images from consecutive video frames of the partial sequence. The video frames are fed to the spatial and the spatial structural models, while the optical flow images are fed to the temporal and the temporal structural models. The output scores of these models are fused with a ranking score fusion method to predict the class of the partial observation of the interaction.

An overall architecture of the proposed method is shown in Figure 2. It contains a spatial model, a temporal model and two structural models (i.e., a spatial structural model and a temporal structural model). The prediction scores of the four models are fused using a ranking score fusion method for the final decision of the interaction class. The goal of the proposed method is to effectively combine the structural, spatial and temporal information of the videos for interaction prediction. Learning the spatial and temporal information is a common practice for video action recognition. For the scenario of two-human interaction, we introduce structural models to extract the salient information related to the interaction class and to learn the contextual dependency among the global context of the two humans and their local body gestures. In this section, we first describe the proposed structural models in detail, and then briefly introduce the spatial and temporal models. Finally, we present the ranking score fusion and the testing methods.

III-A Structural Models

Fig. 3: Network architecture of the structural models: (a) the spatial structural model and (b) the temporal structural model.

In a two-human-interaction context, the global context provides the overall information of the positions and gesture relevance of the two humans. For example, if the two humans walk towards each other to handshake (as shown in Figure 1), their body gesture movements are similar and the distance between the two actors becomes smaller, whereas if someone intends to kick another person, the gestures of the two humans are generally different. The local context, including each human and their upper and lower body parts, provides fine-grained details of the gestures. There exist intrinsic relationships and dependencies among the global and local contexts. In addition, the global and local contexts play different roles for different interactions. In some interactions, both humans perform actions. The global context, which captures the relationship between the two humans, provides discriminative information to distinguish between the different types of interactions (e.g., walking towards or departing from each other). For some other interactions, only one actor performs an action, while the other actor stands still. In this case, the local context of the moving human provides more details than the global context and is thus more important. The upper and lower body parts also have different importance depending on the type of interaction (e.g., the upper body part is more important in “hugging”, while the lower part is more important in “kicking”). Considering that LSTM networks are capable of learning contextual dependencies and detecting salient information in sequences [9, 26, 27, 10], we propose, in this paper, to organize the global and local interaction contexts in a sequence and to learn this sequence with LSTM networks to derive the structural models for a good understanding of an interaction class.

III-A1 Model Input

As shown in Figure 3, the proposed structural models include the spatial structural and the temporal structural models. These models aim to learn the contextual dependency and detect salient information from the global and local contexts of the frame and optical flow images, respectively. The global context is the image region which contains both humans, while the local context consists of 6 local patches, including the whole body, the upper and the lower body parts of each human. Inspired by the theory that the human visual system analyses the contents of visual scenes sequentially from the global to the local scene contexts [6], we organize the global and local contexts in a sequential order, as shown in Figure 3. More specifically, the order is: the global context with two humans, the whole body of the left human, the whole body of the right human, the upper body part of the left human, the lower body part of the left human, the upper body part of the right human, the lower body part of the right human.
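To make the ordering concrete, the following sketch (Python) shows one way the seven context regions of a frame could be cropped and arranged in the above order. The union of the two person boxes as the global region and the vertical midpoint split of a person box into upper and lower parts are illustrative assumptions; the person boxes themselves are assumed to come from an external detector (Section IV-A uses [44]).

```python
# Hypothetical helper: crop and order the seven context regions of one frame.
# Assumptions: the global region is the union of the two person boxes, and a
# person box is split into upper/lower parts at its vertical midpoint.
def context_sequence(frame, left_box, right_box):
    """frame: HxWx3 array; boxes are (x1, y1, x2, y2) for the left/right person."""
    def crop(box):
        x1, y1, x2, y2 = box
        return frame[y1:y2, x1:x2]

    def split_upper_lower(box):
        x1, y1, x2, y2 = box
        mid = (y1 + y2) // 2
        return (x1, y1, x2, mid), (x1, mid, x2, y2)

    global_box = (min(left_box[0], right_box[0]), min(left_box[1], right_box[1]),
                  max(left_box[2], right_box[2]), max(left_box[3], right_box[3]))
    l_up, l_low = split_upper_lower(left_box)
    r_up, r_low = split_upper_lower(right_box)

    # Order: global, left body, right body, left upper, left lower, right upper, right lower.
    return [crop(b) for b in
            (global_box, left_box, right_box, l_up, l_low, r_up, r_low)]
```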

For the temporal structural model, the inputs are the global and local contexts of the optical flow image, which is derived from the optical flow between consecutive frames. The horizontal and vertical components of each optical flow vector are scaled to values between 0 and 255 using a linear transformation. These two components correspond to the first two channels of the optical flow image, and the third channel of the optical flow image is set to 0.
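A minimal sketch of this optical flow image construction is given below; the per-image min-max scaling used for the linear transformation is an assumption, since the exact normalization is not specified here.

```python
import numpy as np

def flow_to_image(flow):
    """flow: HxWx2 array of horizontal/vertical flow components -> HxWx3 uint8 image."""
    img = np.zeros((flow.shape[0], flow.shape[1], 3), dtype=np.uint8)
    for c in range(2):                                  # first two channels: scaled flow components
        comp = flow[..., c]
        lo, hi = comp.min(), comp.max()
        scale = 255.0 / (hi - lo) if hi > lo else 0.0   # linear transformation to [0, 255]
        img[..., c] = ((comp - lo) * scale).astype(np.uint8)
    return img                                          # third channel stays 0
```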

III-A2 CNN Feature Representation

The global and local contexts of the frame and optical flow images are fed to CNN models that are pre-trained on ImageNet [28] to extract fixed-dimension features. Pre-trained CNN models have been shown to be transferable across domains [29], and have achieved a better performance than hand-crafted features in a variety of visual recognition tasks [30, 31, 32, 33, 34]. In particular, pre-trained CNN models have been successfully used to extract features from video frames and optical flow images for action recognition and detection [35, 36, 37]. In this paper, the CNN-M-2048 model [38] is used for feature extraction due to its successful application in action recognition [35]. The convolutional layers of this CNN model contain 96 to 512 kernels, with sizes varying from 7×7 to 3×3. The rectified linear unit (ReLU) [39] is used as the nonlinear activation function. The output of the first fully connected layer (layer 19) of the network is used as the feature representation of the input image.
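As an illustration of this feature-extraction step, the sketch below uses an ImageNet-pre-trained AlexNet from torchvision (version 0.13 or later) as a stand-in for CNN-M-2048, which is not distributed with torchvision; the output of the first fully connected layer is taken as the fixed-dimension feature.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Stand-in backbone (AlexNet instead of CNN-M-2048); first FC layer output as feature.
backbone = models.alexnet(weights="IMAGENET1K_V1").eval()

preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def cnn_feature(crop_u8):
    """crop_u8: HxWx3 uint8 crop (video frame or optical flow image) -> 4096-d feature."""
    x = preprocess(crop_u8).unsqueeze(0)            # 1x3x224x224
    with torch.no_grad():
        f = backbone.features(x)                    # convolutional layers
        f = backbone.avgpool(f).flatten(1)          # 1x9216
        f = backbone.classifier[:2](f)              # Dropout (identity in eval) + first FC
    return f.squeeze(0)                             # 4096-d feature vector
```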

III-A3 Model Learning

The CNN features of the global and local contexts are fed to LSTM networks in the sequential order described in Section III-A1. LSTM networks [7] are RNNs [21] with LSTM cells. A standard RNN can be regarded as multiple copies of the same network, which allows it to process time series. Traditional RNNs suffer from the problems of vanishing and exploding gradients [40, 41]. Compared to RNNs, LSTM networks contain memory cells, which are capable of learning long-term dependencies. The LSTM cell is composed of a forget gate $f_t$, an input gate $i_t$, a cell state $c_t$, an output gate $o_t$ and a hidden state $h_t$. Given an input $x_t$ at time step $t$ and the hidden value $h_{t-1}$ of the previous time step, the gates and states are updated as follows:

$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f)$
$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i)$
$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)$          (1)
$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o)$
$h_t = o_t \odot \tanh(c_t)$

where $W_{xf}$, $W_{hf}$, $W_{xi}$, $W_{hi}$, $W_{xc}$, $W_{hc}$, $W_{xo}$ and $W_{ho}$ are the weight matrices, $b_f$, $b_i$, $b_c$ and $b_o$ are the biases, $\sigma$ is the sigmoid function and $\odot$ denotes element-wise multiplication.

At each time step $t$, the LSTM updates the cell state $c_t$ and the hidden (output) value $h_t$ with the above equations. By determining when to remember and when to forget, LSTM networks are capable of learning dependencies and detecting the discriminative information in a sequence.

As shown in Figure 3, each sequence fed to the LSTM networks contains seven time steps of CNN features. The LSTM networks output a hidden value $h_t$ at each time step using Equation (1). The output of every time step is fed to a hidden layer consisting of a fully connected (FC) layer and a ReLU, followed by another FC layer and a softmax layer to generate class scores for each step. More specifically, let the output value of the LSTM cell at time step $t$ be $h_t \in \mathbb{R}^{d}$, where $d$ denotes the number of hidden units of the LSTM cell. The probability score of the $j$-th class at the $t$-th step is given by:

$p(j \mid h_t) = \dfrac{e^{z_t(j)}}{\sum_{k=1}^{C} e^{z_t(k)}}, \qquad z_t = W_2\,\mathrm{ReLU}(W_1 h_t + b_1) + b_2$          (2)

where $z_t \in \mathbb{R}^{C}$ is the vector fed to the softmax layer, $W_1 \in \mathbb{R}^{m \times d}$ and $b_1$ denote the parameters of the first FC layer, $m$ is the number of units of the first FC layer, $W_2 \in \mathbb{R}^{C \times m}$ and $b_2$ denote the parameters of the second FC layer, and $C$ is the number of interaction classes. The class scores of all time steps are averaged to produce the final probability scores of the structural model.
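The structural model can thus be summarized by the following PyTorch sketch, which follows the hyper-parameters reported in Section IV-A (512 LSTM units, a 128-unit hidden layer, seven context regions per frame); the CNN feature dimension and the batch layout are assumptions.

```python
import torch
import torch.nn as nn

class StructuralModel(nn.Module):
    """Sketch of the structural model: LSTM over the 7-step context sequence,
    a per-step FC-ReLU-FC-softmax head, and averaging of the per-step scores."""
    def __init__(self, feat_dim=4096, lstm_units=512, hidden_units=128, num_classes=8):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, lstm_units, batch_first=True)
        self.head = nn.Sequential(                  # applied independently at each step
            nn.Linear(lstm_units, hidden_units),
            nn.ReLU(),
            nn.Linear(hidden_units, num_classes),
        )

    def forward(self, context_seq):
        # context_seq: (batch, 7, feat_dim), ordered from the global to the local contexts
        h, _ = self.lstm(context_seq)               # (batch, 7, lstm_units)
        probs = torch.softmax(self.head(h), dim=-1) # (batch, 7, num_classes), Eq. (2)
        return probs.mean(dim=1)                    # average the per-step class scores
```

The same network would be instantiated twice: once on the frame crops (spatial structural model) and once on the optical flow crops (temporal structural model).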

III-B Spatial and Temporal Models

Fig. 4: Network architecture of (a) the spatial model and (b) the temporal model. “FC” denotes a fully connected layer. During the training of the temporal model, every stack of $L$ consecutive optical flow images is fed to the model. During testing, the prediction scores at time step $t$ are generated by feeding the sequence formed by the current and the $L-1$ previous optical flow images to the model. The current optical flow image is repeated when $t < L$.

The spatial interaction context provides the static human postures of the two interacting humans. The spatial model is introduced to extract the spatial information of the interaction context from each individual frame for interaction prediction. As shown in Figure 4(a), the input of the spatial model is a single frame image containing both humans. The CNN-M-2048 model [38] is used to extract a feature from the input frame image, in the same way as for the structural models (see Section III-A2). The feature is then fed to a fully connected (FC) layer, a ReLU, another FC layer and a softmax layer to generate the probability scores of an interaction.

Because human interaction involves the movements of human limbs along a sequence, using only the spatial feature of each individual frame is insufficient to infer the interaction class, while the temporal feature of multiple consecutive frames provides more information. The goal of the temporal model is to exploit the temporal information for interaction prediction. For the interaction prediction task, which aims to recognize an interaction class at an early temporal stage before the interaction is fully executed, the testing sequences consist of subsequences containing only a partial observation of the full activity. The temporal evolution of partial sequences needs to be learned for interaction prediction. The temporal convolution network [5] models the temporal information of subsequences and has been successfully used for interaction prediction. We therefore adopt the temporal convolution network [5] to build our temporal model. As shown in Figure 4(b), the input of the temporal model is a set of $L$ consecutive optical flow images, which are fed to a pre-trained CNN model to extract CNN features, similar to the spatial and structural models. The sequence of features is then processed with a temporal convolution layer, whose output is a compact feature vector of the sequence. It is then fed to a hidden layer and an output layer, consisting of an FC and a softmax layer, to produce the probability scores of the interaction classes. During testing, the prediction result at time step $t$ is derived by feeding a sequence consisting of the current and the $L-1$ previous optical flow images to the model, where $L$ denotes the number of consecutive optical flow images used for training. The current flow image is repeated $L-t$ times to generate a sequence of length $L$ when $t < L$.
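A sketch of the temporal model and its test-time input construction is given below, assuming $L = 7$ flow images per input as in the experiments; the number of temporal convolution channels is an assumption, since only the hidden layer size (512) is specified.

```python
import torch
import torch.nn as nn

class TemporalModel(nn.Module):
    """Sketch of the temporal model: a temporal convolution over the L-step sequence
    of flow CNN features, followed by a hidden layer and a softmax output layer."""
    def __init__(self, feat_dim=4096, L=7, conv_channels=256, hidden_units=512, num_classes=8):
        super().__init__()
        self.temporal_conv = nn.Conv1d(feat_dim, conv_channels, kernel_size=L)
        self.hidden = nn.Sequential(nn.Linear(conv_channels, hidden_units), nn.ReLU())
        self.output = nn.Linear(hidden_units, num_classes)

    def forward(self, flow_feats):
        # flow_feats: (batch, L, feat_dim), ordered from the oldest to the current flow image
        x = flow_feats.transpose(1, 2)              # (batch, feat_dim, L)
        x = self.temporal_conv(x).squeeze(-1)       # compact feature vector of the sequence
        return torch.softmax(self.output(self.hidden(x)), dim=-1)

def test_time_input(flow_feats_so_far, L=7):
    """Build the L-step input at time step t: the current flow feature and the L-1
    previous ones; the current one is repeated when fewer than L are available (t < L)."""
    seq = list(flow_feats_so_far[-L:])
    while len(seq) < L:
        seq.append(flow_feats_so_far[-1])           # repeat the current flow image
    return torch.stack(seq).unsqueeze(0)            # (1, L, feat_dim)
```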

III-C Ranking Score Fusion

For different datasets, the relative importance of the spatial structural, temporal structural, spatial and temporal models may vary due to their different discriminative abilities between classes. The proposed ranking score fusion method is used to find the optimal fusion weights to combine these models. Given a testing sequence, the four models generate a score matrix $S \in \mathbb{R}^{4 \times C}$ at each time step, where $C$ is the number of interaction classes:

$S = [s_1, s_2, \ldots, s_C] = \begin{bmatrix} s_{1,1} & \cdots & s_{1,C} \\ s_{2,1} & \cdots & s_{2,C} \\ s_{3,1} & \cdots & s_{3,C} \\ s_{4,1} & \cdots & s_{4,C} \end{bmatrix}$          (3)

where $s_{m,j}$ denotes the score of the $m$-th model for the $j$-th interaction class. Each column $s_j$ of $S$ corresponds to one class. Inspired by the ranking theory [42], we learn to rank these columns so that $s_y \succ s_j$ for all $j \neq y$, where $y$ is the ground truth class label of the video, $s_j$ is the $j$-th column of $S$, and $\succ$ denotes the order between two vectors. The task is to learn a linear function $F$ which induces an ordering on the columns, i.e.,

$s_i \succ s_j \iff F(s_i) > F(s_j)$          (4)

where

$F(s) = w^{\top} s$          (5)

and $w \in \mathbb{R}^{4}$ is a four-dimensional weight vector of the linear function.

Let $(s_i, s_j)$ be a pair of columns of a score matrix. We have either $s_i \succ s_j$ or $s_j \succ s_i$, or equivalently

$w^{\top} d_{ij} > 0 \quad \text{or} \quad w^{\top} d_{ij} < 0$          (6)

where

$d_{ij} = s_i - s_j$          (7)

Thus one can treat the difference vector $d_{ij}$ as a training example with label $z_{ij} \in \{+1, -1\}$ ($+1$ if $s_i \succ s_j$, and $-1$ otherwise). The weight vector $w$ can be obtained by training a binary classifier. More precisely, by using the score matrices of all the videos at every time step, we can create a training set

$\mathcal{D} = \{(d_k, z_k)\}_{k=1}^{N}$          (8)

where $N$ is the total number of training pairs generated from the score matrices.

The weight vector $w$ can be obtained by solving the following optimization problem:

$\min_{w,\,\xi} \;\; \frac{1}{2}\|w\|^2 + \lambda \sum_{k=1}^{N} \xi_k$          (9)

subject to $z_k\, w^{\top} d_k \geq 1 - \xi_k$ and $\xi_k \geq 0$ for $k = 1, \ldots, N$, where $\lambda$ is a regularization trade-off parameter.

However, this may result in negative weights, which is counter-intuitive since a model should not contribute negatively. To ensure that the contribution of each model is zero at worst, a non-negativity constraint on the weight vector is added. The problem is solved approximately using the following iterative method: first, a weight vector without the constraint is obtained; then the negative weights are set to zero and the remaining weights are re-trained until all of the weights are non-negative.
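The fusion weights can be learned with any linear max-margin solver; the sketch below uses scikit-learn's LinearSVC as a stand-in. The symmetric construction of positive/negative pairs and the exact form of the iterative non-negativity step are assumptions consistent with the description above, not the exact implementation used in this work.

```python
import numpy as np
from sklearn.svm import LinearSVC

def build_pairs(score_mats, labels):
    """score_mats: list of (4, C) score matrices; labels: ground-truth class indices."""
    X, z = [], []
    for S, y in zip(score_mats, labels):
        for j in range(S.shape[1]):
            if j == y:
                continue
            d = S[:, y] - S[:, j]       # the ground-truth column should rank higher, Eq. (7)
            X.append(d);  z.append(+1)
            X.append(-d); z.append(-1)  # symmetric negative example
    return np.asarray(X), np.asarray(z)

def learn_fusion_weights(score_mats, labels, reg=1.0):
    """Learn non-negative fusion weights by iteratively retraining a linear SVM."""
    X, z = build_pairs(score_mats, labels)
    w = np.zeros(X.shape[1])
    active = np.arange(X.shape[1])                  # models still allowed a non-zero weight
    while active.size > 0:
        svm = LinearSVC(C=reg, fit_intercept=False).fit(X[:, active], z)
        w_active = svm.coef_.ravel()
        if (w_active >= 0).all():                   # all remaining weights non-negative: done
            w[active] = w_active
            return w
        active = active[w_active > 0]               # zero out negative weights and retrain
    return w
```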

III-D Sequential-level Prediction

Given the four trained classification models and a testing sequence containing $n$ frames, a score matrix $S_t$, as described in Equation (3), is produced at each time step $t$ of the sequence. Let the weights learned by the ranking score fusion method be $w$. The four rows of $S_t$ are then combined using $w$ to produce the final scores of each time step. Let the final score vector of time step $t$ (after combining the four models) be $r_t = w^{\top} S_t$. Its $j$-th element $r_t(j) = w^{\top} s_j$ is the final score for class $j$, where $s_j$ is the $j$-th column of $S_t$, which consists of the scores of the four models for class $j$, and $C$ corresponds to the number of classes. The prediction label of the sequence at time step $t$ is then identified as class $c_t$, where

$c_t = \arg\max_{j \in \{1, \ldots, C\}} r_t(j)$          (10)

Now let $(h_1, \ldots, h_C)$ be the histogram of the set $\{c_1, \ldots, c_n\}$, where $n$ is the number of frames of the sequence and $h_j$ is the number of elements of $\{c_1, \ldots, c_n\}$ that are equal to $j$. Finally, the sequence is predicted as class $c^{*}$, where

$c^{*} = \arg\max_{j \in \{1, \ldots, C\}} h_j$          (11)

This method is called majority vote, which determines the class label of a sequence by counting the predicted labels at all time steps.
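For clarity, a minimal sketch of this sequence-level decision is given below; score_mats is assumed to hold the (4 × C) score matrices of the partial test sequence and w the learned fusion weights.

```python
import numpy as np

def predict_sequence(score_mats, w):
    """Fuse the four model scores at every time step and majority-vote over the sequence."""
    per_step = [int(np.argmax(w @ S)) for S in score_mats]   # Eq. (10) at each time step
    counts = np.bincount(per_step)                           # histogram of predicted labels
    return int(np.argmax(counts))                            # Eq. (11): majority vote
```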

IV Experiments

In this section, we present our evaluation results on two datasets that have been used for interaction prediction.

IV-A Datasets

The proposed method has been evaluated on two interaction datasets, i.e., the BIT-Interaction dataset [43] and the UT-Interaction dataset [1].

BIT-Interaction dataset [43]: This dataset consists of 8 types of interactions between two humans (bow, boxing, handshake, high-five, hug, kick, pat and push). Each class of interactions contains 50 videos. It is a very challenging dataset, with variations in illumination conditions, scales, subject appearances and viewpoints. In addition, there are also partial occlusions caused by poles, bridges, pedestrians, etc.

UT-Interaction dataset [1]: This dataset includes two sets. The background of Set 1 is simpler and mostly static. In contrast, the background is complex and slightly moving on Set 2. Each set includes 60 sequences of videos belonging to 6 interaction classes, i.e., handshake, hug, pointing, kick, push and punch.

For each testing video, the interaction is predicted at 10 observation ratios, from 0.1 to 1, with a step size of 0.1. In other words, each testing video is divided into 10 partial sequences. The $k$-th sub-sequence consists of the frames $[1, k \cdot T/10]$, where $T$ is the number of frames in the full video. A prediction accuracy under an observation ratio of 0.2 denotes the accuracy that is tested with the second sub-sequence, containing the frames $[1, 0.2T]$. If the observation ratio is 1, the accuracy is tested with the entire video. During training, the two humans in each frame are detected using the detector in [44]. The image region containing both human regions is used as the global region to train the models. In both datasets, the numbers of units of the LSTM and the hidden layers in the structural models are set to 512 and 128, respectively. To train the temporal model, every seven consecutive optical flow images are used as input. The number of units of the hidden layer in the spatial and temporal models is set to 512.
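A minimal sketch of how a testing video could be split into these 10 partial observations is shown below; the rounding and the minimum length of one frame are assumptions.

```python
def partial_sequences(frames):
    """Split a full video (list of frames) into the 10 partial observations used for testing."""
    T = len(frames)
    return [frames[: max(1, round(k * T / 10))] for k in range(1, 11)]  # ratios 0.1 ... 1.0
```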

IV-B Experimental Results

The proposed method is compared with previous methods. For both datasets, the same testing protocol as used in the previous works is adopted for a fair performance comparison. In addition, the following baselines are also evaluated to show the benefits of the structural models and the ranking score fusion method.

Spatial+Temporal+Ranking (Sp_Tp_Rank): the prediction scores of the spatial model and the temporal model are fused using the weights learned by the ranking score fusion method to produce the final prediction. Compared to the proposed method, this baseline does not include the proposed structural models.

Spatial+Temporal+Structural+Average (Sp_Tp_St_Avg): the prediction scores of the spatial, the temporal and the proposed structural models are averaged to generate the prediction results. Compared to the proposed method, this baseline does not include the ranking score fusion method.

IV-B1 Results on the BIT-Interaction Dataset

There are 400 videos in this dataset. Following the same procedure as in [43], a random sample of 272 videos is used for training and the remaining videos are used for testing.

Fig. 5: Performance comparisons of the proposed method with other methods on the BIT-Interaction dataset. (Best viewed in color)

The proposed method is compared with other methods, including IBoW [1], DBoW [1], SC [14], MSSC [14], MTSSVM [3] and MMAPM [4]. The results are shown in Figure 5. It can be seen that the proposed method performs consistently better than the other methods at all observation ratios. When using 20% of the video (i.e., an observation ratio of 0.2) to predict the interaction class, the accuracy of the proposed method is 45.31%, which is 8.59% better than MMAPM [4] (36.72%). When testing with half sequences (i.e., an observation ratio of 0.5), the proposed method achieves an accuracy of 79.69%. Compared to MMAPM [4] (67.97%), the improvement is 11.72%.

Table I shows the comparison of the proposed method with the baseline Sp_Tp_Rank. It can be seen that the proposed method achieves a better performance in 9 out of 10 cases. When using only 10% of each testing video to predict the interaction class, the accuracy of the proposed method is 41.41%. Compared to Sp_Tp_Rank (39.06%), the improvement is 2.35%. When testing with half sequences (i.e., an observation ratio of 0.5), the improvement of the proposed method is 3.13% (from 76.56% to 79.69%). Compared to Sp_Tp_Rank, the proposed method incorporates the proposed structural models, which clearly shows their benefits. The structural models learn the contextual relationships between the global and local contexts, and detect the discriminative salient features that are related to the interaction class. This provides complementary information to the spatial and temporal models and improves the interaction prediction accuracy.

The comparison of the proposed method with the baseline Sp_Tp_St_Avg is also shown in Table I. It can be seen that the proposed method performs better than Sp_Tp_St_Avg at all observation ratios. When testing with an observation ratio of 0.1, the accuracy achieved by the proposed method is 7.82% better than Sp_Tp_St_Avg (i.e., from 33.59% to 41.41%). The improvement is more significant at an observation ratio of 0.5, with an improvement of 10.16% (from 69.53% to 79.69%). Both the proposed method and Sp_Tp_St_Avg incorporate the same models. Sp_Tp_St_Avg combines the spatial, structural and temporal models with the average fusion method. These models are learned with different features, which have different discriminative abilities between classes. Simply averaging these models thus generates suboptimal results. The proposed method uses the proposed ranking method to find the optimal fusion weights between these models and produces better results. These significant improvements clearly show the advantage of the ranking score fusion method.

Table II compares the performance of the proposed method with the four individual models. It can be seen that, on this dataset, the temporal and temporal structural models perform better than the spatial and spatial structural models. When the observation ratio is 0.5, the prediction accuracies of the temporal and temporal structural models are more than 40% better than those of the spatial and spatial structural models. The temporal and temporal structural models are learned from the optical flow images, while the spatial and spatial structural models are learned from the video frames. The backgrounds of the video frames of this dataset are very complex. The optical flow images are generated from the motion between two consecutive frames, so the background noise is removed, resulting in a better prediction accuracy. The ranking score fusion method learns a separate weight for each of the spatial, temporal, spatial structural and temporal structural models. It can be seen that the fusion of these four models using the learned weights improves the performance over each individual model at most observation ratios.


Methods             | Observation ratio
                    | 0.1      0.2      0.3      0.4      0.5      0.6      0.7      0.8      0.9      1.0
--------------------+------------------------------------------------------------------------------------------
Sp_Tp_Rank          | 39.06%   42.19%   57.81%   70.31%   76.56%   79.69%   81.25%   85.16%   86.72%   84.38%
Sp_Tp_St_Avg        | 33.59%   39.84%   51.56%   60.16%   69.53%   72.66%   73.44%   73.44%   75.00%   73.44%
Proposed Method     | 41.41%   45.31%   58.59%   72.66%   79.69%   81.25%   83.59%   83.59%   86.72%   85.94%

TABLE I: Performance comparison of the proposed method with other baselines on the BIT-Interaction dataset.

Methods             | Observation ratio
                    | 0.1      0.2      0.3      0.4      0.5      0.6      0.7      0.8      0.9      1.0
--------------------+------------------------------------------------------------------------------------------
Spatial             | 15.62%   18.75%   21.88%   30.47%   31.25%   35.16%   33.59%   34.38%   33.59%   33.59%
Temporal [5]        | 39.06%   42.19%   57.81%   70.31%   76.56%   79.69%   81.25%   85.16%   86.72%   84.38%
Spatial Structural  | 21.88%   23.44%   25.00%   28.12%   30.47%   32.03%   32.81%   33.59%   32.81%   32.03%
Temporal Structural | 35.16%   44.53%   56.25%   67.97%   79.69%   79.69%   80.47%   82.03%   82.81%   82.03%
Proposed Method     | 41.41%   45.31%   58.59%   72.66%   79.69%   81.25%   83.59%   83.59%   86.72%   85.94%

TABLE II: Performance comparison of the proposed method with individual models on the BIT-Interaction dataset.
Fig. 6: Performance comparisons of the proposed method with other methods on the UT-Interaction dataset (a) Set 1 and (b) Set 2. (Best viewed in color)

Methods             | Observation ratio
                    | 0.1      0.2      0.3      0.4      0.5      0.6      0.7      0.8      0.9      1.0
--------------------+------------------------------------------------------------------------------------------
Sp_Tp_Rank          | 53.33%   56.67%   63.33%   70.00%   85.00%   85.00%   90.00%   91.67%   91.67%   91.67%
Sp_Tp_St_Avg        | 41.67%   45.00%   50.00%   65.00%   75.00%   78.33%   85.00%   88.33%   85.00%   85.00%
Proposed Method     | 55.00%   60.00%   66.67%   78.33%   83.33%   86.67%   93.33%   93.33%   95.00%   93.33%

TABLE III: Performance comparison of the proposed method with other baselines on the UT-Interaction dataset (Set 1).

Methods             | Observation ratio
                    | 0.1      0.2      0.3      0.4      0.5      0.6      0.7      0.8      0.9      1.0
--------------------+------------------------------------------------------------------------------------------
Sp_Tp_Rank          | 38.33%   40.00%   50.00%   70.00%   81.67%   86.67%   88.33%   86.67%   88.33%   88.33%
Sp_Tp_St_Avg        | 38.33%   40.00%   48.33%   60.00%   78.33%   83.33%   85.00%   85.00%   86.67%   83.33%
Proposed Method     | 46.67%   48.33%   55.00%   71.67%   83.33%   91.67%   91.67%   90.00%   91.67%   91.67%

TABLE IV: Performance comparison of the proposed method with other baselines on the UT-Interaction dataset (Set 2).

Methods             | Observation ratio
                    | 0.1      0.2      0.3      0.4      0.5      0.6      0.7      0.8      0.9      1.0
--------------------+------------------------------------------------------------------------------------------
Spatial             | 30.00%   30.00%   31.67%   45.00%   50.00%   50.00%   61.67%   70.00%   68.33%   65.00%
Temporal [5]        | 45.00%   53.33%   61.67%   71.67%   81.67%   86.67%   86.67%   88.33%   90.00%   90.00%
Spatial Structural  | 33.33%   35.00%   36.67%   48.33%   56.67%   65.00%   66.67%   70.00%   73.33%   73.33%
Temporal Structural | 30.00%   36.67%   55.00%   70.00%   76.67%   85.00%   86.67%   88.33%   86.67%   85.00%
Proposed Method     | 55.00%   60.00%   66.67%   78.33%   83.33%   86.67%   93.33%   93.33%   95.00%   93.33%

TABLE V: Performance comparison of the proposed method with individual models on the UT-Interaction dataset (Set 1).

Methods             | Observation ratio
                    | 0.1      0.2      0.3      0.4      0.5      0.6      0.7      0.8      0.9      1.0
--------------------+------------------------------------------------------------------------------------------
Spatial             | 38.33%   38.33%   41.67%   55.00%   58.33%   65.00%   66.67%   70.00%   73.33%   75.00%
Temporal [5]        | 33.33%   33.33%   55.00%   68.33%   85.00%   88.33%   88.33%   90.00%   90.00%   90.00%
Spatial Structural  | 38.33%   38.33%   41.67%   46.67%   46.67%   55.00%   60.00%   56.67%   56.67%   58.33%
Temporal Structural | 36.67%   41.67%   61.67%   71.67%   78.33%   80.00%   78.33%   78.33%   80.00%   81.67%
Proposed Method     | 46.67%   48.33%   55.00%   71.67%   83.33%   91.67%   91.67%   90.00%   91.67%   91.67%

TABLE VI: Performance comparison of the proposed method with individual models on the UT-Interaction dataset (Set 2).

IV-B2 Results on the UT-Interaction Dataset

There is no training/testing split for this dataset. The performance is measured using leave-one-sequence-out cross validation, i.e., for each set, the 54 videos performed by 9 groups of actors are used for training and the remaining sequences of the held-out group of actors are used for testing. This procedure is repeated 10 times and the averaged results are reported as the model performance.

The proposed method is compared with other methods, including IBoW [1], DBoW [1], HM [2], SC [14], MSSC [14] and MMAPM [4], and the results are shown in Figure 6. It can be seen that the proposed method achieves superior results over the other methods in 7 out of 10 observation ratios on both Set 1 and Set 2. When the observation ratio is 0.1, the accuracy of the proposed method on Set 1 is about 55.0%. The improvement compared with the best previous result (i.e., MMAPM [4]) is about 8.33%. The improvements of the proposed method on Set 2 are more significant after an observation ratio of 0.4. When testing with half the length of the videos (i.e., an observation ratio of 0.5), the proposed method achieves an impressive 83.3% accuracy. Compared to MMAPM (75.0%) [4], the improvement is 8.3%.

Table III and Table IV show the comparison of the proposed method with the baseline methods Sp_Tp_Rank and Sp_Tp_St_Avg on Set 1 and Set 2. It can be seen that the proposed method outperforms the baseline methods in 9 out of 10 observation ratios on Set 1 and in all cases on Set 2 of the UT-Interaction dataset. The comparison of the proposed method with the individual models is shown in Table V and Table VI. The ranking score fusion method learns a separate set of weights for the spatial, temporal, spatial structural and temporal structural models on Set 1 and on Set 2. The fusion results outperform the performances of the individual models in all cases on Set 1 and in 8 out of 10 cases on Set 2. This clearly demonstrates the superiority of the proposed structural models and ranking method.

V Conclusion

In this paper, we proposed novel structural models to uncover the contextual dependencies and salient information for interaction prediction. The structural models are learned by using LSTM networks to process the sequence of global and local contexts. We also proposed a novel ranking score fusion method to determine the optimal weights of the spatial, temporal and structural models, to effectively combine their complementary strengths. The proposed method was compared with previous works on two interaction datasets and has achieved superior performance. We also performed an ablative analysis and compared the proposed method with a baseline that does not include the structural models and a baseline that does not use the proposed ranking score fusion method. Experimental results show the benefits of the proposed framework, particularly the structural models and the fusion method.

Acknowledgment

This work was partially supported by Australian Research Council grants DP150100294, DP110103336, and DE120102960.

References

  • [1] M. Ryoo, “Human activity prediction: Early recognition of ongoing activities from streaming videos,” in ICCV 2011.   IEEE, 2011, pp. 1036–1043.
  • [2] T. Lan, T.-C. Chen, and S. Savarese, “A hierarchical representation for future action prediction,” in ECCV 2014, 2014, pp. 689–704.
  • [3] Y. Kong, D. Kit, and Y. Fu, “A discriminative model with multiple temporal scales for action prediction,” in ECCV 2014.   Springer, 2014, pp. 596–611.
  • [4] Y. Kong and Y. Fu, “Max-margin action prediction machine,” IEEE Transactions on Pattern Analysis & Machine Intelligence, 2015.
  • [5] Q. Ke, M. Bennamoun, S. An, F. Boussaid, and F. Sohel, “Human interaction prediction using deep temporal features,” in European Conference on Computer Vision Workshops.   Springer, 2016, pp. 403–414.
  • [6] A. De Cesarei and G. R. Loftus, “Global and local vision in natural scene identification,” Psychonomic bulletin & review, vol. 18, no. 5, pp. 840–847, 2011.
  • [7] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  • [8] H. Sak, A. Senior, and F. Beaufays, “Long short-term memory recurrent neural network architectures for large scale acoustic modeling,” in Fifteenth Annual Conference of the International Speech Communication Association, 2014.
  • [9] H. Palangi, L. Deng, Y. Shen, J. Gao, X. He, J. Chen, X. Song, and R. Ward, “Deep sentence embedding using long short-term memory networks: Analysis and application to information retrieval,” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 24, no. 4, pp. 694–707, 2016.
  • [10] R. R. Varior, B. Shuai, J. Lu, D. Xu, and G. Wang, “A siamese long short-term memory architecture for human re-identification,” in European Conference on Computer Vision.   Springer, 2016, pp. 135–153.
  • [11] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, “Long-term recurrent convolutional networks for visual recognition and description,” in CVPR 2015, 2015, pp. 2625–2634.
  • [12] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool, “Temporal segment networks: towards good practices for deep action recognition,” in European Conference on Computer Vision.   Springer, 2016, pp. 20–36.
  • [13] M. Hoai and F. De la Torre, “Max-margin early event detectors,” International Journal of Computer Vision, vol. 107, no. 2, pp. 191–202, 2014.
  • [14] Y. Cao, D. Barrett, A. Barbu, S. Narayanaswamy, H. Yu, A. Michaux, Y. Lin, S. Dickinson, J. M. Siskind, and S. Wang, “Recognize human activities from partially observed videos,” in CVPR 2013.   IEEE, 2013, pp. 2658–2665.
  • [15] S. R. Eddy, “Hidden markov models,” Current opinion in structural biology, vol. 6, no. 3, pp. 361–365, 1996.
  • [16] J. Lafferty, A. McCallum, and F. C. Pereira, “Conditional random fields: Probabilistic models for segmenting and labeling sequence data,” 2001.
  • [17] P.-C. Chung and C.-D. Liu, “A daily behavior enabled hidden markov model for human behavior understanding,” Pattern Recognition, vol. 41, no. 5, pp. 1572–1580, 2008.
  • [18] J. Zhang and S. Gong, “Action categorization with modified hidden conditional random field,” Pattern Recognition, vol. 43, no. 1, pp. 197–203, 2010.
  • [19] K. Tang, L. Fei-Fei, and D. Koller, “Learning latent temporal structure for complex event detection,” in CVPR 2012.   IEEE, 2012, pp. 1250–1257.
  • [20] Y. Song, L.-P. Morency, and R. W. Davis, “Action recognition by hierarchical sequence summarization,” in CVPR 2013.   IEEE, 2013, pp. 3562–3569.
  • [21] H. Jaeger, Tutorial on training recurrent neural networks, covering BPPT, RTRL, EKF and the “echo state network” approach.   GMD-Forschungszentrum Informationstechnik, 2002, vol. 5.
  • [22] A. Graves, A.-r. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on.   IEEE, 2013, pp. 6645–6649.
  • [23] S. Ma, L. Sigal, and S. Sclaroff, “Learning activity progression in lstms for activity detection and early detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1942–1950.
  • [24] J. Liu, A. Shahroudy, D. Xu, and G. Wang, “Spatio-temporal lstm with trust gates for 3d human action recognition,” in European Conference on Computer Vision.   Springer, 2016, pp. 816–833.
  • [25] J. Liu, G. Wang, P. Hu, L.-Y. Duan, and A. C. Kot, “Global context-aware attention lstm networks for 3d action recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [26] B. Shuai, Z. Zuo, B. Wang, and G. Wang, “Dag-recurrent neural networks for scene labeling,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3620–3629.
  • [27] Z. Zuo, B. Shuai, G. Wang, X. Liu, X. Wang, B. Wang, and Y. Chen, “Convolutional recurrent neural networks: Learning spatial dependencies for image representation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2015, pp. 18–26.
  • [28] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on.   IEEE, 2009, pp. 248–255.
  • [29] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, “Cnn features off-the-shelf: an astounding baseline for recognition,” in CVPRW 2014.   IEEE, 2014, pp. 512–519.
  • [30] Y. Li, S. Wang, Q. Tian, and X. Ding, “Feature representation for statistical-learning-based object detection: A review,” Pattern Recognition, vol. 48, no. 11, pp. 3542–3559, 2015.
  • [31] F. Liu, G. Lin, and C. Shen, “Crf learning with cnn features for image segmentation,” Pattern Recognition, vol. 48, no. 10, pp. 2983–2992, 2015.
  • [32] Q. Ke and Y. Li, “Is rotation a nuisance in shape recognition?” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 4146–4153.
  • [33] Q. Ke, M. Bennamoun, S. An, F. Sohel, and F. Boussaid, “A new representation of skeleton sequences for 3d action recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [34] Q. Ke, S. An, M. Bennamoun, F. Sohel, and F. Boussaid, “Skeletonnet: Mining deep part features for 3d action recognition,” IEEE Signal Processing Letters, 2017.
  • [35] K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” in Advances in Neural Information Processing Systems, 2014, pp. 568–576.
  • [36] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici, “Beyond short snippets: Deep networks for video classification,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 4694–4702.
  • [37] G. Gkioxari and J. Malik, “Finding action tubes,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 759–768.
  • [38] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, “Return of the devil in the details: Delving deep into convolutional nets,” arXiv preprint arXiv:1405.3531, 2014.
  • [39] V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann machines,” in ICML 2010, 2010, pp. 807–814.
  • [40] S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber, “Gradient flow in recurrent nets: the difficulty of learning long-term dependencies,” 2001.
  • [41] R. Pascanu, T. Mikolov, and Y. Bengio, “On the difficulty of training recurrent neural networks,” arXiv preprint arXiv:1211.5063, 2012.
  • [42] T. Joachims, “Optimizing search engines using clickthrough data,” in SIGKDD 2002.   ACM, 2002, pp. 133–142.
  • [43] Y. Kong, Y. Jia, and Y. Fu, “Learning human interaction by interactive phrases,” in ECCV 2012.   Springer, 2012, pp. 300–313.
  • [44] A. Ess, B. Leibe, K. Schindler, and L. V. Gool, “A mobile vision system for robust multi-person tracking,” in CVPR 2008.   IEEE, 2008, pp. 1–8.