Hierarchical Long Short-Term Concurrent Memory for Human Interaction Recognition

11/01/2018 ∙ by Xiangbo Shu, et al. ∙ Nanjing University, University of Central Florida, Columbia University

In this paper, we aim to address the problem of human interaction recognition in videos by exploring the long-term inter-related dynamics among multiple persons. Recently, Long Short-Term Memory (LSTM) has become a popular choice for modeling individual dynamics in single-person action recognition due to its ability to capture temporal motion information over a range. However, existing RNN models capture the dynamics of a human interaction only by simply combining all dynamics of individuals or by modeling them as a whole. Such models neglect the inter-related dynamics of how human interactions change over time. To this end, we propose a novel Hierarchical Long Short-Term Concurrent Memory (H-LSTCM) to model the long-term inter-related dynamics among a group of persons for recognizing human interactions. Specifically, we first feed each person's static features into a Single-Person LSTM to learn the single-person dynamic. Subsequently, the outputs of all Single-Person LSTM units are fed into a novel Concurrent LSTM (Co-LSTM) unit, which mainly consists of multiple sub-memory units, new cell gates and a new co-memory cell. In a Co-LSTM unit, each sub-memory unit stores individual motion information, while the Co-LSTM unit selectively integrates and stores inter-related motion information between multiple interacting persons from the multiple sub-memory units via the cell gates and co-memory cell, respectively. Extensive experiments on four public datasets validate the effectiveness of the proposed H-LSTCM by comparing it against baseline and state-of-the-art methods.


I Introduction

Human interactions (e.g., handshaking and talking) are typical human activities that occur in public places and are attracting substantial attention from researchers [1, 2, 3, 4]. A human interaction usually involves at least two individual motions from multiple persons, who are concurrently inter-related with each other (e.g., some persons are talking together, some persons are handshaking with each other). In most cases of human interaction, the concurrent inter-related motions of multiple persons strongly depend on each other (e.g., person A kicks person B, while person B retreats). It has been shown that the concurrent inter-related motions among multiple persons, rather than single-person motions, contribute discriminative information for recognizing human interactions [5].

Two main types of solutions exist for the problem of human interaction recognition. One solution (e.g., [1, 6, 2, 7]) is to extract individual motion descriptors from each interacting person and then predict the class label of an interaction by inferring the coherence between two individual motions. However, this solution, i.e., regarding human interactions as multiple single-person actions, ignores some inter-related motion information and brings in some irrelevant individual motion information. The other solution is to extract motion descriptors on interacting regions and then train an interaction recognition model [5]. However, interacting regions are difficult to locate before the close interaction occurs.

Fig. 1: The framework of the proposed Hierarchical Long Short-Term Concurrent Memory (H-LSTCM) for modeling human interactions in a human interaction scene. The details of the Co-LSTM unit are shown in Figure 2.

Recently, due to its powerful ability to capture sequential motion information, Long Short-Term Memory (LSTM) [8] has proven to be successful at various human action recognition tasks [9, 10, 11, 12, 13]. Therefore, we aim to explore the long-term inter-related dynamics among a group of interacting persons by leveraging LSTM. However, existing LSTMs model human dynamics independently and do not consider the concurrent inter-relation of dynamics among multiple persons. A straightforward way to overcome this limitation is to either 1) merge individual actions at the preprocessing stage [14] (e.g., consider the interacting persons as a whole); or 2) utilize several LSTMs to model the single-person dynamics of individuals and then fuse the output sequences of these LSTMs [13]. However, both methods neglect the inter-related dynamics of how interactions among these persons change over time.

To this end, we propose a novel Hierarchical Long Short-Term Concurrent Memory (H-LSTCM) for human interaction (activity) recognition to model the long-term inter-related dynamics among a group of interacting persons, as shown in Figure 1. For each person, we first feed her/his static features (e.g., CNN features) into a Single-Person LSTM to learn the single-person dynamic, which describes a person's long-term motion information in a whole video clip. Then, all outputs of the Single-Person LSTM units are fed into a novel Concurrent LSTM (Co-LSTM) unit, which mainly consists of multiple sub-memory units, multiple new cell gates and a new co-memory cell. In a Co-LSTM unit, the sub-memory units store single-person motion information from the Single-Person LSTM units. Following these sub-memory units, the cell gates allow the inter-related motion memory in the sub-memory units to enter a new co-memory cell, and the co-memory cell selectively integrates and stores this inter-related memory to reveal the concurrent inter-related motion information among all interacting persons. Overall, all interacting persons in each frame are jointly modeled by a Co-LSTM unit on the person bounding boxes. At the last time step, the output of the Co-LSTM is a dynamic inter-related representation of the group activity. Extensive experiments on various datasets are conducted to evaluate the performance of H-LSTCM compared with state-of-the-art methods.

The main contributions of this work are summarized as follows:

  • We propose a novel Hierarchical Long Short-Term Concurrent Memory (H-LSTCM) to effectively address the problem of human interaction recognition with multiple persons, by learning the dynamic inter-related representations among all persons in group activity scenes.

  • We design a novel Concurrent LSTM (Co-LSTM) to aggregate the inter-related memory of individuals in collective activity scenes, by capturing the concurrent long-term inter-related dynamics among multiple persons rather than the dynamics of individuals.

Our preliminary Co-LSTSM method in [15] with two sub-memory units can recognize only the interactions between two persons, while the proposed H-LSTCM in this paper can recognize various group activities at a larger scale, including collective activities with multiple persons (more than two persons) and group activities with multiple sub-group activities. This is because Co-LSTSM learns the dynamic inter-related representation between two persons directly from the static single-person features. In fact, there is a large gap between the static single-person features and the dynamic inter-related representation, which limits the performance of Co-LSTSM. Thus, in H-LSTCM we bring in the single-person dynamic, which is a basic element of the group activity that describes a person's long-term motion information in a whole video clip and reflects motion patterns caused by interactions with other persons. H-LSTCM learns the dynamic inter-related representation among multiple persons in a hierarchical way, from static to dynamic features at the single-person level first, and further to an inter-related level of group activities. Specifically, the Single-Person LSTMs in H-LSTCM first learn single-person dynamics from the static single-person features. Then, an extended Co-LSTM with multiple sub-memory units learns the concurrent inter-related representation among all persons based on the single-person dynamics. Such a hierarchical strategy ensures that H-LSTCM learns more discriminative representations than Co-LSTSM for group activities.

II Related Work

Human interaction recognition (activity recognition) aims to automatically understand the interaction performed by at least two persons [2]. In the task of two-person interaction recognition, earlier researchers noted that several interactive attributes provide discriminative information to represent person-person interactions. For example, Kong et al. [1, 6] regarded multiple interactive phrases as latent mid-level features to recognize person-person interactions from individual human actions. Considering that temporal context information exists in a video clip, Zhang et al. [7] and Liu et al. [16] used a new set of spatio-temporal action attribute phrases to describe the person-person interactions in a video. However, the difference between some person-person interactions (e.g., boxing and patting) is too small to be identified via interactive phrases alone. Moreover, some person-person interactions are complex and cannot be described well by a specified number of interactive phrases.

Benefiting from the success of deep learning, some deep learning methods have been proposed over the last five years to understand two-person interactions [17, 14]. For example, Wang et al. [17] adopted deep context features instead of the traditional context features (e.g., [18]) on the event neighborhood to recognize person-person interactions, where the size of the event neighborhood must be manually defined in the preprocessing step. One limitation of the above methods is that locating the interactive region before the close interaction occurs is a challenging task. Therefore, this work aims to design a human interaction recognition method that does not require accurately locating the interactive regions.

In a scene of multiple persons' interaction (i.e., group activity), several persons interact with each other, which makes activity recognition a complex task. Two solutions are commonly used to address the problem of group-person interaction recognition. One solution is to exploit the spatial distribution of human activities and to present spatio-temporal descriptors that capture the spatial distribution of persons [19, 20, 21]. The other solution is to track all the body parts in the video and then learn holistic representations to estimate the class of the collective activity [22, 23]. However, the former solution requires inference of the complex spatial relations between persons, and the latter brings in some individual actions of outlier persons.

Recently, Long Short-Term Memory (LSTM) has been applied to the problem of human interaction recognition by learning high-level dynamic representations of persons [24, 25, 14]. This insight motivates us to employ LSTM models to learn high-level dynamic representations of human activity. Therefore, we propose a new Hierarchical Long Short-Term Concurrent Memory (H-LSTCM) for human interaction recognition. H-LSTCM adopts a hierarchical strategy to first model the single-person dynamics of individuals by LSTMs, and then model the concurrent inter-related dynamics among all the interacting persons by a new Co-LSTM.

Closely related work includes the Hierarchical Deep Temporal Model (HDTM) [13], the Deep Structured Model (DSM) [24], and Structure Inference Machines (SIM) [25]. Specifically, HDTM [13] first models the individual dynamic motions by several LSTMs. Subsequently, the outputs of these LSTMs are pooled into a single vector, which is the input of a following LSTM. HDTM pools single-person dynamics into an overall dynamic representation and does not consider the inter-relations among persons in the group activity. DSM [24] and SIM [25] utilize a CNN to obtain initialized class labels of the single-person actions and the group-level activity, and refine the group activity class label by exploring the relations among the actions of all individuals in an iterative manner. If one person's action is closely related to the group activity and the other persons' actions, this person intensively participates in the group activity; otherwise, this person is an outlier. Since DSM and SIM target "key" persons who play crucial roles in the group activity rather than all persons, some sudden motion information of "outlier" persons may be lost. Compared to HDTM [13], the proposed H-LSTCM considers the inter-relations among persons via the cell gates and co-memory cell. Compared to DSM [24] and SIM [25], the proposed H-LSTCM models the concurrent inter-related dynamics among all persons, and thus does not lose outlier yet informative persons.

Recently, some works [26, 27] proposed to learn a concurrent location-related representation among multiple persons for multi-target tracking. They assume that two persons who are close in position are inter-related with each other. By contrast, the proposed H-LSTCM learns a semantic-related representation among multiple persons by leveraging the inter-relation between the single-person dynamic at the current time step and the dynamic of the whole activity at the previous time step. Here, it is assumed that a person whose current representation is closely related to the hidden representation of the whole activity is likely to be more involved in this activity.

III Preliminaries: LSTM-Based Action Recognition

Given an input video clip $\{x_1, x_2, \ldots, x_T\}$ of length $T$, where $x_t$ is the static feature at time step $t$, a traditional Recurrent Neural Network (RNN) [28] models the dynamics of this video clip through a sequence of hidden states. Due to the exponential decay in retaining the context information of video frames [8], an RNN does not model the long-term dynamics of video sequences well. To this end, Long Short-Term Memory (LSTM) [8], a variant of the RNN, provides a solution by incorporating memory units that enable the network to learn when to forget previous hidden states and when to update hidden states given new information [9].

A traditional LSTM unit [8] at time step $t$ contains an input gate $i_t$, a forget gate $f_t$, an output gate $o_t$, an input modulation gate $g_t$ and a memory cell $c_t$, which are expressed as follows,

$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i)$, (1)
$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f)$, (2)
$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o)$, (3)
$g_t = \phi(W_{xg} x_t + W_{hg} h_{t-1} + b_g)$, (4)
$c_t = f_t \odot c_{t-1} + i_t \odot g_t$, (5)

where $\sigma(\cdot)$ is a sigmoid function, $\odot$ denotes the element-wise product, $\phi(\cdot)$ is the hyperbolic tangent, $W_{x\ast}$ and $W_{h\ast}$ are weight matrices, and $b_\ast$ is a bias vector. Subsequently, the hidden state $h_t$ at time step $t$ can be expressed as

$h_t = o_t \odot \phi(c_t)$, (6)

which denotes the dynamic representation of the $t$-th frame. All hidden states $\{h_1, h_2, \ldots, h_T\}$ describe the dynamics of the video clip. Finally, the output $z_t$ at time step $t$ is computed as

$z_t = W_{hz} h_t + b_z$, (7)

which can be transformed into a probability $p_{t,k}$ ($k = 1, 2, \ldots, K$) corresponding to the $k$-th class of the activity by a softmax function

$p_{t,k} = \dfrac{\exp(z_{t,k})}{\sum_{j=1}^{K} \exp(z_{t,j})}$, (8)

where $z_{t,k}$ in $z_t$ denotes the encoding of the confidence score on the $k$-th activity class. Generally, we set $p_t = [p_{t,1}, \ldots, p_{t,K}]$ as the predicted class label vector.
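For concreteness, the recurrence of Eqs. (1)-(8) can be written compactly in a few lines of PyTorch. This is a minimal sketch in which the layer sizes (4096-d fc6-style input features, a 1024-d hidden state) and the class count are illustrative choices rather than the paper's exact configuration; nn.LSTMCell bundles the gates of Eqs. (1)-(6), so only the classification head of Eqs. (7)-(8) is written out explicitly.

```python
import torch
import torch.nn as nn

class SinglePersonLSTM(nn.Module):
    """Minimal single-person LSTM classifier mirroring Eqs. (1)-(8)."""
    def __init__(self, feat_dim=4096, hidden_dim=1024, num_classes=8):
        super().__init__()
        self.cell = nn.LSTMCell(feat_dim, hidden_dim)   # gates and memory cell, Eqs. (1)-(6)
        self.fc = nn.Linear(hidden_dim, num_classes)    # class scores z_t, Eq. (7)

    def forward(self, x_seq):
        # x_seq: (T, B, feat_dim) static features, e.g., fc6 activations per frame
        T, B, _ = x_seq.shape
        h = x_seq.new_zeros(B, self.cell.hidden_size)
        c = x_seq.new_zeros(B, self.cell.hidden_size)
        probs = []
        for t in range(T):
            h, c = self.cell(x_seq[t], (h, c))          # recurrence over time steps
            probs.append(torch.softmax(self.fc(h), dim=-1))  # probabilities p_t, Eq. (8)
        return torch.stack(probs)                       # (T, B, num_classes)

# usage: per-frame class probabilities for a clip of T=20 frames, batch of 2 persons
p = SinglePersonLSTM()(torch.randn(20, 2, 4096))
```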

IV Hierarchical Long Short-Term Concurrent Memory

IV-A The Architecture

For human interaction recognition, each video frame contains at least two concurrent single-person actions of multiple persons, which are inter-related in a group activity. Existing LSTM models targeting single-person actions cannot handle multiple-person interactions well. As mentioned previously, we can roughly treat all the interacting persons as a whole before training the LSTM network. However, this solution brings in individual-specific motion information that is irrelevant to the interaction. Alternatively, we can model the single-person dynamics of individuals by multiple LSTM networks and then combine (e.g., concatenate or pool) the single-person dynamics obtained by all these LSTM networks into the final representation. Since this strategy assumes that all persons in a group activity are independent of each other, some of the inter-related motion information among these persons is lost.

Fig. 2: Illustration of a Concurrent LSTM (Co-LSTM) unit in H-LSTCM.

Recently, Deng et al. [25] proposed Structure Inference Machines (SIM) for group activity recognition, which indicated that some persons are related to the group activity, while others are outliers. Specifically, SIM first utilizes a CNN to initialize the class labels of the individuals' actions and the group activity. Then, the group activity class label is refined by considering the relations among the actions of all individuals in an iterative manner. Motivated by this, for a group activity with multiple persons, we also consider designing a model that captures the inter-relations among multiple persons. However, SIM targets "key" persons who play crucial roles in the group activity, so some sudden motion information of "outlier" persons may be lost. Empirically, we observe that "outlier" persons are not irrelevant to the group activity at every time step; for example, a person may suddenly spike the ball at one moment in a volleyball game. Therefore, we propose a Hierarchical Long Short-Term Concurrent Memory (H-LSTCM) to capture the concurrent inter-related dynamics among all the persons rather than those of a selection of persons. Specifically, the proposed H-LSTCM first models the temporal motion information of each person via multiple Single-Person LSTMs corresponding to these persons, and then captures the inter-related dynamics among all the persons by a novel Concurrent LSTM (Co-LSTM). Figure 1 shows the whole framework of the proposed H-LSTCM. The key point of H-LSTCM is to utilize multiple sub-memory units in a Concurrent LSTM (Co-LSTM) unit to selectively integrate and store the concurrent inter-related temporal information among multiple persons from the individual temporal information.

Figure 2 illustrates the architecture of a Co-LSTM unit of the proposed H-LSTCM at a time step. The Co-LSTM unit mainly consists of multiple person-specific sub-memory units (the number of units corresponds to the number of interacting persons), multiple cell gates, a common output gate and a new co-memory cell. Specifically, all sub-memory units include their respective input gates, forget gates, and memory cells. Following these sub-memory units, the cell gates allow the inter-related motion memory in the sub-memory units to enter a new co-memory cell, and the co-memory cell selectively integrates and memorizes the inter-related motion information among all the interacting persons. Overall, the stacked Co-LSTM units are applied recurrently over the time sequence to capture the concurrent inter-related dynamics among all interacting persons over time.

Formally, let $\{X^1, X^2, \ldots, X^P\}$ denote the sets of static features (e.g., CNN features) on the tracklets (obtained by an object detector and an object tracker) of $P$ interacting persons in a video clip, wherein $X^p = \{x_1^p, x_2^p, \ldots, x_T^p\}$. For the feature set $X^p$ of the $p$-th person, we can obtain her/his hidden state $h_t^p$ (i.e., the single-person dynamic) at each time step $t$ via a Single-Person LSTM. In the $p$-th sub-memory unit of the Co-LSTM on top of the Single-Person LSTMs, $i_t^p$, $f_t^p$, $g_t^p$, and $c_t^p$ ($p = 1, 2, \ldots, P$) denote the input gate, forget gate, input modulation gate and sub-memory cell at time step $t$, respectively. These components can be expressed by the following equations

$i_t^p = \sigma(W_{hi} h_t^p + W_{\tilde{h}i} \tilde{h}_{t-1} + b_i)$, (9)
$f_t^p = \sigma(W_{hf} h_t^p + W_{\tilde{h}f} \tilde{h}_{t-1} + b_f)$, (10)
$g_t^p = \phi(W_{hg} h_t^p + W_{\tilde{h}g} \tilde{h}_{t-1} + b_g)$, (11)
$c_t^p = f_t^p \odot c_{t-1}^p + i_t^p \odot g_t^p$, (12)

where $W_{h\ast}$ and $W_{\tilde{h}\ast}$ are weight matrices, $b_\ast$ is a bias vector, and the hidden state $\tilde{h}_t$ denotes the dynamic inter-related representation of the whole activity at time step $t$. All hidden states $\{\tilde{h}_1, \tilde{h}_2, \ldots, \tilde{h}_T\}$ describe the inter-related dynamics of the activity scene in the video clip.

Following the $p$-th sub-memory unit, a new cell gate $q_t^p$ aims to control the memory that enters and leaves this sub-memory unit at time step $t$. Like the traditional gates, the cell gate is activated by a nonlinear function of the input $h_t^p$ and the past hidden state $\tilde{h}_{t-1}$,

$q_t^p = \sigma(W_{hq} h_t^p + W_{\tilde{h}q} \tilde{h}_{t-1} + b_q)$, (13)

where $W_{hq}$ and $W_{\tilde{h}q}$ are the weight matrices, and $b_q$ is the bias vector. Based on the consistent interactions among multiple interacting persons, all cell gates $q_t^p$ ($p = 1, 2, \ldots, P$) allow more concurrent inter-related motion information among the interacting persons to enter a new co-memory cell $\tilde{c}_t$, which contributes to a common hidden state $\tilde{h}_t$ at time step $t$. In this work, the co-memory cell can be expressed as

$\tilde{c}_t = \sum_{p=1}^{P} q_t^p \odot c_t^p$. (14)

This co-memory cell corresponds to an output gate $\tilde{o}_t$ that is related to all the inputs and the common hidden state at the previous time step, i.e.,

$\tilde{o}_t = \sigma\Big(\sum_{p=1}^{P} W_{ho} h_t^p + W_{\tilde{h}o} \tilde{h}_{t-1} + b_o\Big)$. (15)

Finally, the hidden state $\tilde{h}_t$ at time step $t$ can be expressed as

$\tilde{h}_t = \tilde{o}_t \odot \phi(\tilde{c}_t)$. (16)

Given $\tilde{h}_t$, we can compute the probability vector of a human interaction by Eq. (7) and Eq. (8).
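To make the Co-LSTM step concrete, the following PyTorch sketch implements one plausible realization of Eqs. (9)-(16): per-person sub-memory units driven by the single-person dynamics $h_t^p$ and the shared hidden state $\tilde{h}_{t-1}$, per-person cell gates, a co-memory cell that accumulates the gated sub-memories, and a common output gate. Sharing the gate weights across persons and the exact wiring of the output gate are simplifying assumptions rather than the authors' released implementation; the 1024/512 sizes follow the configuration reported in Section V-B.

```python
import torch
import torch.nn as nn

class CoLSTMCell(nn.Module):
    """One Co-LSTM step for P interacting persons (a sketch of Eqs. (9)-(16))."""
    def __init__(self, in_dim=1024, mem_dim=512):
        super().__init__()
        # sub-memory unit gates, shared across persons (an assumption)
        self.gates = nn.Linear(in_dim + mem_dim, 3 * mem_dim)   # i, f, g : Eqs. (9)-(11)
        self.cell_gate = nn.Linear(in_dim + mem_dim, mem_dim)   # q       : Eq. (13)
        self.out_gate = nn.Linear(in_dim + mem_dim, mem_dim)    # o       : Eq. (15)
        self.mem_dim = mem_dim

    def forward(self, h_persons, c_persons, h_shared):
        # h_persons: (P, B, in_dim)  single-person dynamics h_t^p at time t
        # c_persons: (P, B, mem_dim) sub-memory cells from time t-1
        # h_shared : (B, mem_dim)    common hidden state from time t-1
        P = h_persons.shape[0]
        co_cell = torch.zeros_like(h_shared)
        out_sum = torch.zeros_like(h_shared)
        new_c = []
        for p in range(P):
            z = torch.cat([h_persons[p], h_shared], dim=-1)
            i, f, g = self.gates(z).chunk(3, dim=-1)
            c_p = torch.sigmoid(f) * c_persons[p] \
                  + torch.sigmoid(i) * torch.tanh(g)             # Eq. (12)
            q_p = torch.sigmoid(self.cell_gate(z))               # Eq. (13)
            co_cell = co_cell + q_p * c_p                        # Eq. (14), gated accumulation
            out_sum = out_sum + self.out_gate(z)                 # inputs to the common output gate
            new_c.append(c_p)
        o = torch.sigmoid(out_sum)                               # Eq. (15)
        h_new = o * torch.tanh(co_cell)                          # Eq. (16)
        return h_new, torch.stack(new_c)
```

Stacking this cell over $t = 1, \ldots, T$ yields the sequence of inter-related hidden states $\tilde{h}_t$ used for classification.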

IV-B Learning Algorithm

We employ a loss function to learn the model parameters of H-LSTCM by measuring the deviation between the ground-truth class label vector $y_t = [y_{t,1}, \ldots, y_{t,K}]$ and the predicted probability vector $p_t = [p_{t,1}, \ldots, p_{t,K}]$ corresponding to $y_t$ at time step $t$,

$\ell_t = -\sum_{k=1}^{K} y_{t,k} \log p_{t,k}$. (17)

When the training label of the activity frame at time step $t$ corresponds to the target class $k$ ($k \in \{1, \ldots, K\}$), the $k$-th element in $y_t$ is $1$, and the other elements in $y_t$ are zero. Then, Eq. (17) can be simplified as

$\ell_t = -\log p_{t,k}$, (18)

where $p_{t,k}$ is defined in Eq. (8). Some researchers [11, 8] indicated that the memory cell of an LSTM at the last time step can store useful sequence information of the whole data sequence (e.g., a video clip). That is, for a video clip of length $T$, if its class label is annotated at the video level, the H-LSTCM model can be trained by minimizing the loss at time step $T$, i.e., $\mathcal{L} = \ell_T$. Otherwise, if the class label is annotated on each frame $t$, we can minimize the cumulative loss over the sequence, i.e., $\mathcal{L} = \sum_{t=1}^{T} \ell_t$.

In this work, given a training video clip with label $k$ at the video level, we choose the loss

$\mathcal{L}(\Theta) = -\log p_{T,k}$, (19)

where $\Theta$ denotes a parameter set including all the parameters of the H-LSTCM model. The loss function of H-LSTCM can be minimized by Backpropagation Through Time (BPTT). The detailed derivations of the derivatives with respect to all the parameters in the H-LSTCM model can be found in Appendix A of the supplemental material. The detailed training procedure of H-LSTCM is summarized in Algorithm 1.

Input: $N$ training video clips with labels; configuration set of H-LSTCM.
Output: Parameter set $\Theta$.
1:  Extract fc6 features of each person on the detected bounding box in each frame of each video.
    // Forward propagation
2:  Forward propagation of the Single-Person LSTMs;
3:  Forward propagation of the Co-LSTM.
    // Back propagation
4:  for each training epoch do
5:     for each training video clip do
6:        Update the parameters of the Single-Person LSTMs via BPTT;
7:        Update the parameters of the Co-LSTM (the weight matrices and bias vectors of the input, forget, input modulation, cell and output gates) via BPTT with momentum $\gamma$ and learning rate $\eta$.
8:     end for
9:  end for
10: return Parameter set $\Theta$.

Here, $\gamma$ and $\eta$ are the momentum parameter and learning rate, respectively. The detailed derivations of the derivatives with respect to all the parameters can be found in Appendix A.

Algorithm 1 Training for H-LSTCM
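As a rough companion to Algorithm 1, the sketch below shows a video-level training loop: the model (Single-Person LSTMs followed by the Co-LSTM) is assumed to return per-time-step class scores, the loss of Eq. (19) is taken at the last time step, and all parameters are updated by BPTT with SGD plus momentum. The data loader, the model interface and the hyperparameter values are illustrative assumptions, not the paper's exact setup.

```python
import torch
import torch.nn as nn

def train_hlstcm(model, loader, epochs=30, lr=1e-3, momentum=0.9):
    """Minimal training loop mirroring Algorithm 1: video-level cross-entropy
    at the last time step, optimized by SGD with momentum via BPTT."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)
    ce = nn.CrossEntropyLoss()              # equals -log p_{T,k} for a one-hot label, Eq. (19)
    for _ in range(epochs):
        for feats, label in loader:         # feats: per-person fc6 sequences; label: class index
            logits = model(feats)           # (T, B, num_classes) class scores z_t
            loss = ce(logits[-1], label)    # video-level loss at time step T
            opt.zero_grad()
            loss.backward()                 # backpropagation through time
            opt.step()
    return model
```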

V Experiments

In experiments, we evaluate the performance of H-LSTCM compared with the state-of-the-art methods and some baselines on four public datasets.

Method bow boxing handshake high-five hug kick pat push Average
Lan et al. [20] 81.25 75.00 81.25 87.50 87.50 81.25 81.25 81.25 82.03
Liu et al. [16] 100.00 75.00 81.25 87.50 93.75 87.50 75.00 75.00 84.37
Kong et al. [6] 81.25 81.25 81.25 93.75 93.75 81.25 81.25 87.50 85.16
Kong et al. [5] 87.50 81.25 87.50 81.25 87.50 81.25 87.50 87.50 85.38
Kong et al. [1] 93.75 87.50 93.75 93.75 93.75 87.50 87.50 87.50 90.63
Donahue et al. [9] 100.00 75.00 85.00 69.75 85.00 69.75 80.00 76.50 80.13
Ke et al. [14] - - - - - - - - 85.20
B1 100.00 75.00 62.50 56.25 93.75 68.75 56.25 62.50 71.88
B2 100.00 75.00 84.50 84.50 88.00 88.00 70.00 78.00 83.50
B3 100.00 79.00 84.50 84.50 94.75 88.00 80.50 90.00 87.66
B4 100.00 82.00 85.75 84.50 94.75 88.00 83.00 90.00 88.50
Co-LSTSM [15] 100.00 90.50 92.50 92.50 94.75 88.00 90.50 94.25 92.88
H-LSTCM 100.00 92.50 94.75 95.50 94.75 89.50 91.00 94.25 94.03
TABLE I: Recognition accuracy (%) on the BIT dataset.

V-A Datasets

The detailed descriptions of four public datasets are as follows:

  • BIT dataset [6]. It consists of eight classes of human interactions, i.e., bow, boxing, handshake, high-five, hug, kick, pat, and push. Each class includes 50 videos with cluttered backgrounds. Following [1], 34 videos per class are randomly chosen as the training data and the remaining ones are used for testing.

  • UT dataset [29]. It consists of ten videos, each video containing six classes of human interactions, i.e., handshake, hug, kick, point, punch and push. After extracting the frames, we obtain 60 video clips, namely 10 video clips per class. Leave-one-out cross-validation is adopted for the experiments.

  • Collective Activity Dataset (CAD) [19]. It contains 44 videos of five multiple-person activities, i.e., crossing, waiting, queuing, walking, and talking. Similar to [20, 30], we select one-third of the video clips from each activity category to form the test set, and the rest of the video clips are used for training. The one-versus-all technique is employed for this recognition task.

  • Volleyball Dataset (VD) [13]. It contains 55 volleyball videos with 4830 annotated frames. Each frame has a group-level activity class label (e.g., left_pass, right_pass, left_set, right_set, left_spike, right_spike, left_winpoint or right_winpoint). Following [13], two-thirds of the annotated frames are used for training and the remaining ones are used for testing.

V-B Implementation Details

In the preprocessing step, the bounding box (tracklet) corresponding to each person is detected and tracked over all frames by an object detector [31] and an object tracker [32]. Following [9], the pretrained AlexNet model [33] is employed to extract the fc6 feature (static feature) on each bounding box around a person.

For the BIT, UT, CAD and VD datasets, the length of the time steps is set to , , and , respectively. In the configurations of H-LSTCM on the four datasets, the number of memory cell nodes of each Single-Person LSTM, the number of output nodes of each Single-Person LSTM, and the number of sub-memory cell nodes of the Co-LSTM are set to 2048, 1024 and 512, respectively. We use the Torch toolbox and Caffe [34] as the deep learning platform and an NVIDIA Tesla K20 GPU to run the experiments. The learning rate, momentum and decay rate are set to , 0.9 and 0.95, respectively. In the experiments, the training of H-LSTCM begins to converge after approximately , , and epochs on the BIT, UT, CAD and VD datasets, respectively. The learning curves for training the proposed H-LSTCM on the BIT, UT, CAD and VD datasets are plotted in Appendix B of the supplemental material.
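As an illustration of this preprocessing step, the snippet below crops each tracked person box from a frame and extracts a 4096-d fc6-style activation. torchvision's pretrained AlexNet is used here as a stand-in for the Caffe model in the paper, and the box format (left, top, right, bottom) is an assumption.

```python
import torch
from torchvision import models, transforms

# Pretrained AlexNet; taking the first three classifier layers stops at the fc6 activation.
alexnet = models.alexnet(weights=models.AlexNet_Weights.DEFAULT).eval()
fc6 = torch.nn.Sequential(
    alexnet.features, alexnet.avgpool, torch.nn.Flatten(),
    *list(alexnet.classifier.children())[:3],   # Dropout, Linear(9216->4096), ReLU -> "fc6"
)

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def person_features(frame, boxes):
    """frame: a PIL image; boxes: list of (left, top, right, bottom) tracklet boxes.
    Returns a (P, 4096) tensor of static per-person features."""
    crops = torch.stack([preprocess(frame.crop(b)) for b in boxes])
    return fc6(crops)
```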

In experiments, the following four baselines are chosen:

  • B1: Person-box CNN. The pre-trained AlexNet is deployed on each person bounding box at each time step, and the fc6 features of all persons are concatenated into a long vector. Then the concatenated features over all time steps are pooled into a single feature. The features of each video clip are trained and tested with a softmax classifier. This baseline illustrates the importance of deep features.

  • B2: One CNN + LSTM. This baseline treats the individual actions as a whole. First, the bounding boxes of all interacting persons at each time step are merged into a larger bounding box. Second, fc6 features are extracted by AlexNet on this larger bounding box at each time step. Third, the fc6 features are used as inputs to train an LSTM. This baseline is similar to Long-term Recurrent Convolutional Networks [9].

  • B3: Multiple CNNs + LSTMs. This baseline models the individual dynamics of multiple persons with multiple LSTMs. First, AlexNet is deployed on each person bounding box at each time step to extract the fc6 feature. Second, the fc6 features extracted from each person are fed into an LSTM network to capture the individual dynamic motions. Third, we average the softmax scores output by all LSTM networks, where the averaged score reflects the probability of the action class. This baseline is similar to the late score fusion of Two-Stream Convolutional Networks [35].

  • B4: Single-Person LSTMs + Whole LSTM. This baseline learns the single-person dynamics via multiple LSTMs, whose outputs are pooled and fed into another LSTM. Specifically, we first use AlexNet to extract fc6 features on the person bounding boxes in each frame. Second, the fc6 features of each person are fed into a traditional LSTM network to learn the single-person hidden states. Third, the hidden states of all persons at each time step are max-pooled into a single vector, which is fed into another LSTM network, followed by a softmax. This baseline is the same as the Hierarchical Deep Temporal Model [13]. A minimal sketch of this baseline is given after this list.
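Since B4 is the baseline closest to H-LSTCM, a minimal PyTorch sketch of it is given below to make the contrast explicit: single-person dynamics are simply max-pooled over persons at every time step before a second "whole" LSTM, with no gating of inter-related memory. Sharing the person-LSTM weights and the layer sizes are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class BaselineB4(nn.Module):
    """Sketch of baseline B4: per-person LSTMs, max-pooling over persons at each
    time step, then a 'whole' LSTM and a linear classifier."""
    def __init__(self, feat_dim=4096, hid=1024, num_classes=8):
        super().__init__()
        self.person_lstm = nn.LSTM(feat_dim, hid)   # shared across persons (an assumption)
        self.whole_lstm = nn.LSTM(hid, hid)
        self.fc = nn.Linear(hid, num_classes)

    def forward(self, feats):
        # feats: (T, P, B, feat_dim) fc6 features for P persons over T time steps
        T, P, B, D = feats.shape
        h, _ = self.person_lstm(feats.reshape(T, P * B, D))   # single-person dynamics
        pooled = h.reshape(T, P, B, -1).max(dim=1).values     # max-pool over persons
        out, _ = self.whole_lstm(pooled)                      # whole-activity dynamics
        return self.fc(out[-1])                               # class scores at time step T
```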

V-C Results on the BIT dataset

Comparison with baselines. Table I shows that the recognition accuracy of the proposed H-LSTCM is better than that of all baseline methods. Adding temporal information by employing LSTMs (i.e., B2, B3, B4, and Co-LSTSM) improves the performance over B1, which uses no temporal information. Specifically, Co-LSTSM achieves higher accuracy than B2, B3 and B4. This illustrates that inter-related motion information among multiple persons is more important than the single-person motion information of individuals for recognizing human interactions. The confusion matrix of H-LSTCM is shown in Appendix C of the supplementary material.

Comparison with state-of-the-art methods. We also compare H-LSTCM with state-of-the-art methods for human interaction recognition, i.e., the methods based on hand-crafted spatio-temporal interest points [36] of Lan et al. [20], Liu et al. [16], and Kong et al. [6, 1, 5], as well as the LSTM-based methods of Donahue et al. [9] and Ke et al. [14]. Table I lists the recognition accuracy, where some results are reported from [1, 5]. H-LSTCM performs better than the alternatives, especially the LSTM-based methods, i.e., Donahue et al. [9] and Ke et al. [14]. In particular, the proposed H-LSTCM gains an approximately 9% improvement over the state-of-the-art LSTM-based method (i.e., Ke et al. [14], with an accuracy of 85.20%). Some examples of the recognition results of H-LSTCM are shown in Figure 3(a).

Fig. 3: Examples of recognition results of the proposed method on the four datasets. In the BIT, UT and CAD datasets, each person is detected and tracked by a bounding box, whose size is enlarged by a moderate scale to cover more context information. The VD dataset provides person bounding boxes. Best viewed in color.

V-D Results on the UT dataset

Comparison with baselines. Table II shows the recognition accuracy of the proposed H-LSTCM compared with that of the baselines (including Co-LSTSM [15]). It is observed that H-LSTCM performs consistently better than all the baselines. In particular, H-LSTCM and Co-LSTSM, which model the inter-related dynamics rather than the individual dynamics, achieve impressive accuracy. The confusion matrix of H-LSTCM is shown in Appendix C of the supplementary material.

Method handshake hug kick point punch push Average
Ryoo et al. [29] 75.00 87.50 62.50 50.00 75.00 75.00 70.80
Yu et al. [37]  100.00 65.00 100.00 85.00 75.00 75.00 83.33
Ryoo  [38] 80.00 90.00 90.00 80.00 90.00 80.00 85.00
Kong et al. [6] 80.00 80.00 100.00 90.00 90.00 90.00 88.33
Kong et al. [1] 100.00 90.00 100.00 80.00 90.00 90.00 91.67
Kong et al. [5] 90.00 100.00 90.00 100.00 90.00 90.00 93.33
Raptis et al. [39] 100.00 100.00 90.00 100.00 80.00 90.00 93.30
Shariat et al. [40] - - - - - - 91.57
Zhang et al. [7] 100.00 100.00 100.00 90.00 90.00 90.00 95.00
Donahue et al. [9]  90.00 80.00 90.00 80.00 90.00  80.00 85.00
Ke et al. [14] - - - - - - 93.33
Wang et al. [17] - - - - - - 95.00
B1 90.00 80.00 80.00 80.00 80.00 80.00 81.67
B2 90.00 80.00 90.00 80.00 90.00 80.00  85.00
B3 100.00 100.00 90.00 80.00 90.00 80.00  90.00
B4 100.00 100.00 90.00 90.00 90.00 80.00 91.67
Co-LSTSM [15] 100.00 100.00 90.00 100.00 90.00 90.00 95.00
H-LSTCM 100.00 100.00 100.00 100.00 100.00 90.00 98.33
TABLE II: Recognition accuracy (%) of different methods on UT dataset.

Comparison with state-of-the-art methods. The proposed H-LSTCM is also compared with state-of-the-art methods, including traditional methods (i.e., Ryoo et al. [29], Yu et al. [37], Kong et al. [6, 1, 5], Raptis et al. [39], Shariat et al. [40], and Zhang et al. [7]), a deep learning method (i.e., Wang et al. [17]), as well as LSTM-based methods (i.e., Ke et al. [14] and Donahue et al. [9]). The recognition accuracy results are shown in Table II. Co-LSTSM achieves satisfactory accuracy, i.e., 95%. By further extending Co-LSTSM in a hierarchical way, the proposed H-LSTCM, which first models single-person dynamics and then captures the concurrent inter-related dynamics among persons, improves the recognition accuracy to 98.33%, which is the state-of-the-art performance. Some of the recognition results of H-LSTCM are shown in Figure 3(b).

V-E Results on the CAD dataset

Comparison with baselines. We compare the recognition accuracy of the proposed H-LSTCM with that of all the baselines. We also regard the preliminary Co-LSTSM [15] as a baseline. Since most of the group activities in the CAD dataset involve more than two interacting persons, the original Co-LSTSM [15], which models two interacting persons, cannot directly model a group activity with more than two persons. Thus, we extend Co-LSTSM to a new version, named Co-LSTSM+. Co-LSTSM+ has multiple sub-memory units corresponding to multiple persons, and its architecture is similar to the Co-LSTM module of H-LSTCM in Figure 2. The recognition accuracy of the proposed H-LSTCM and all baselines is shown in Table III. It is observed that H-LSTCM achieves the best performance. Furthermore, Co-LSTSM+ does not achieve significant performance improvements over B4. Here, Co-LSTSM+ (i.e., the extended version of Co-LSTSM [15]) cannot capture the complex inter-related dynamics among multiple persons from the static single-person CNN features, since the collective activities in the CAD dataset are more complex than the interactions in either the BIT dataset or the UT dataset. The confusion matrix of H-LSTCM is shown in Appendix C of the supplementary material.

Method crossing waiting queuing walking talking Average
Choi et al. [19] 55.4 64.6 63.3 57.9 83.6 65.9
Lan et al. [18] 75 74 74 57 61 68.2
Choi et al. [3] 76.4 76.4 78.7 36.8 85.7 70.9
Antic et al. [41] 73.70 74.50 90.10 62.00 70.00 74.1
Liu et al. [16] 72.73 66.67 71.43 83.33 85.71 76.19
Wang et al. [4] 64.8 66.0 66.7 89.2 99.5 77.2
Lan et al. [20] 68 69 76 80 99 79.7
Choi et al. [22] 61.3 82.9 95.4 65.1 94.9 79.9
Kong et al. [1] 77.27 77.78 85.71 83.33 100 82.54
Zhou et al. [42] 76.83 74.36 93.76 87.63 98.16 82.07
Ibrahim et al. [13] 61.54 66.44 96.77 80.41 99.45 81.50
Hajimirsadeghi et al. [43] 72 75 92 70 99 81.6
Deng et al. [24] - - - - - 80.6
Deng et al. [25] - - - - - 81.2

B1 46.21 53.69 70.20 61.19 74.33 61.12
B2 52.38 54.50 73.89 61.45 76.35 63.71
B3 52.46 54.61 82.00 61.21 79.85 66.02
B4 62.60 65.25 90.74 78.33 95.36 78.46
Co-LSTSM+ 65.50 64.85 94.67 75.33 95.33 79.14
H-LSTCM 65.50 68.29 97.90 87.69 99.35 83.75
TABLE III: Recognition accuracy (%) of different methods on CAD.

Comparison with state-of-the-art methods. We also compare the recognition accuracy of H-LSTCM with that of state-of-the-art methods, including traditional methods (i.e., Choi et al. [19], Lan et al. [18], Choi et al. [3], Antic et al. [41], Liu et al. [16], Wang et al. [4], Lan et al. [20], Choi et al. [22], Kong et al. [1], Zhou et al. [42], and Hajimirsadeghi et al. [43]), a deep learning based method (i.e., Deng et al. [24]), an RNN-based method (i.e., Deng et al. [25]), and an LSTM-based method (i.e., Ibrahim et al. [13]). The recognition accuracy results are shown in Table III. H-LSTCM achieves better performance than the other methods. As a new exploration that leverages a variant of LSTM, the proposed H-LSTCM achieves an improvement of approximately 2% (83.75% vs. 81.50%) over the most closely related work [13], which uses only the traditional LSTM model without any modification. Finally, we present some recognition results of H-LSTCM in Figure 3(c).

V-F Results on the Volleyball dataset

Comparison with baselines. In a volleyball game, there are two sub-groups of players from the two teams. The players on the same team have more interactions among themselves than with players on the other team. We therefore use two Co-LSTMs to model the inter-related dynamics among the players on the two teams, respectively. The extended framework of H-LSTCM is shown in Figure 4. The main extension is that a concatenation operation and an LSTM layer are added on top of the Co-LSTM layer. In this framework, each sub-group is modeled by a Co-LSTM, and the outputs of the two Co-LSTMs are concatenated into a sequence of representations, which are input to an LSTM layer. Likewise, Co-LSTSM+ (introduced in Section V-E) is also modified in this way. In baseline B2, we model the dynamics of each team as in B2 and add a concatenation operation and an LSTM layer on top of the LSTM layer. In baseline B4, the outputs of each team's Single-Person LSTMs are pooled into a sequence of representations; the representations of the two teams are concatenated into a long representation, which is then fed into an LSTM layer. The recognition accuracy of H-LSTCM and all the baselines is shown in Table IV, where "lpass", "rpass", "lset", "rset", "lspike", "rspike", "lwin" and "rwin" denote left_pass, right_pass, left_set, right_set, left_spike, right_spike, left_winpoint and right_winpoint, respectively. H-LSTCM achieves the best performance over all baseline methods. It is noted that Co-LSTSM+ and B4 are comparable. These results illustrate that Co-LSTSM+ cannot learn the concurrent inter-related representations among multiple persons well when a complex group activity pattern exists. The confusion matrix of H-LSTCM is shown in Appendix C of the supplementary material.
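A sketch of this two-sub-group variant, reusing the CoLSTMCell sketch from Section IV, is shown below; the per-team inputs are the single-person dynamics of that team's players, and the module names and sizes are illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn
# CoLSTMCell refers to the sketch given in Section IV.

class TwoGroupHLSTCM(nn.Module):
    """Sketch of the Volleyball extension: one Co-LSTM per team, the two team-level
    hidden states concatenated at every time step and fed to a top LSTM."""
    def __init__(self, in_dim=1024, mem_dim=512, num_classes=8):
        super().__init__()
        self.mem_dim = mem_dim
        self.co = nn.ModuleList([CoLSTMCell(in_dim, mem_dim) for _ in range(2)])
        self.top_lstm = nn.LSTM(2 * mem_dim, mem_dim)
        self.fc = nn.Linear(mem_dim, num_classes)

    def forward(self, teams):
        # teams: list of two tensors, each (T, P, B, in_dim) single-person dynamics per team
        T, P, B, _ = teams[0].shape
        h = [teams[0].new_zeros(B, self.mem_dim) for _ in range(2)]
        c = [teams[0].new_zeros(P, B, self.mem_dim) for _ in range(2)]
        seq = []
        for t in range(T):
            for g in range(2):                               # one Co-LSTM per sub-group
                h[g], c[g] = self.co[g](teams[g][t], c[g], h[g])
            seq.append(torch.cat(h, dim=-1))                 # concatenate the two team states
        out, _ = self.top_lstm(torch.stack(seq))             # (T, B, mem_dim)
        return self.fc(out[-1])                              # group-activity class scores
```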

Fig. 4: Framework of H-LSTCM on the Volleyball activity with two sub-groups of persons. A concatenation operation and an LSTM layer are added on top of the Co-LSTM layer.
Method lpass rpass lset rset lspike rspike lwin rwin Average
Ibrahim et al.[13] 77.9 81.4 84.5 68.8 89.4 85.6 88.2 87.4 82.9
Shu et al. [44] - - - - - - - - 83.6
Li et al. [45] 55.8 69.1 67.3 52.1 82.1 79.2 - - 67.6
Biswas et al.[46] - - - - - - - - 83.0
B1  62.8 62.1 71.4 58.7 65.1 76.5 63.7 61.6 65.2
B2  64.6 66.5 76.5 62.7 77.7 74.0 70.6 68.0 70.1
B3  74.4 77.3 81.8 69.7 88.2 83.7 78.6 78.0 79.0
B4  77.0 80.9 84.1 68.3 88.8 85.3 88.0 87.7 82.5
Co-LSTSM+  81.3 79.5 85.1 70.7 88.8 85.5 88.7 86.9 83.3
H-LSTCM  83.9 88.1 90.3 80.4 93.4 89.8 88.7 92.4 88.4
TABLE IV: Recognition accuracy (%) on Volleyball dataset.

Comparison with state-of-the-art methods. The results of the proposed H-LSTCM and other related methods are also shown in Table IV. H-LSTCM achieves higher recognition accuracy than the state-of-the-art methods, including Ibrahim et al. [13], Shu et al. [44], Li et al. [45], and Biswas et al. [46]. In particular, H-LSTCM, with an accuracy of 88.4%, achieves an approximately 5% improvement over Shu et al., with an accuracy of 83.6%. This demonstrates that H-LSTCM is effective in modeling complex collective activities among sub-groups of persons. Finally, we present some recognition results of H-LSTCM in Figure 3(d).

V-G Evaluation on Human Interaction Prediction

We also evaluate H-LSTCM on human interaction prediction. In contrast to human interaction recognition, human interaction prediction is defined as recognizing an ongoing interaction before the interaction is completely executed [14, 38]. Due to the large variations in appearance and the evolution of scenes, human interaction prediction is a challenging task. Following the experimental setting in [14, 47], a testing video clip is divided into 10 incomplete action executions by using 10 observation ratios (i.e., from 0.1 to 1 with a step size of 0.1), which represent the increasing amount of sequential data over time. For example, given a testing video clip of length $T$, an observation ratio of $r$ denotes that the accuracy is tested with the first $r \times T$ frames. When the observation ratio is 1, namely when the entire video clip is used, H-LSTCM acts as a human interaction recognition model.
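This evaluation protocol can be summarized by the following sketch, which classifies each test clip from only its first $r \times T$ frames for every observation ratio $r$; the model is assumed to return per-time-step class scores as in the earlier sketches, and the clip/label containers are illustrative.

```python
import torch

@torch.no_grad()
def accuracy_vs_observation_ratio(model, clips, labels):
    """Early-prediction evaluation: for each observation ratio r = 0.1, ..., 1.0,
    classify every test clip from only its first round(r*T) frames and report
    the accuracy at that ratio."""
    results = {}
    for r in [k / 10 for k in range(1, 11)]:
        correct = 0
        for feats, label in zip(clips, labels):        # feats: (T, P, 1, feat_dim); label: int
            n = max(1, round(r * feats.shape[0]))      # number of observed frames
            scores = model(feats[:n])                  # (n, 1, num_classes) class scores
            correct += int(scores[-1].argmax(dim=-1).item() == label)
        results[r] = correct / len(clips)
    return results
```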

Fig. 5: Comparisons of human interaction prediction on (a) the BIT dataset and (b) the UT dataset.

The baselines include Dynamic Bag-of-Words (DBoW) [38], Sparse Coding (SC) [48], Sparse Coding with Mixture of training video Segments (MSSC) [48], Multiple Temporal Scales based SVM (MTSSVM) [49], Max-Margin Action Prediction Machine (MMAPM) [47], Long-term Recurrent Convolutional Networks (LRCN) [9], Spatial-Structural-Temporal Feature Learning (SSTFL) [14] and our preliminary Co-LSTSM [15]. The results of all the methods on the BIT and UT datasets with different observation ratios are plotted in Figure 5(a) and Figure 5(b), respectively. Overall, H-LSTCM and Co-LSTSM outperform all the baselines. All the interactions in the BIT dataset are two-person interactions with simple backgrounds, and Co-LSTSM is designed to learn the dynamic inter-related representation between two persons; thus, the performance of H-LSTCM is comparable to that of Co-LSTSM for two-person interaction prediction on the BIT dataset. Specifically, we can observe that: 1) the improvements of H-LSTCM and Co-LSTSM on BIT are more significant when the observation ratio is ; 2) the accuracy of H-LSTCM becomes stable on both BIT and UT when the observation ratio is approximately , which indicates that the close interaction is ending; and 3) since H-LSTCM and Co-LSTSM can accumulate temporal interaction information, their accuracy increases monotonically with increasing video observation ratio.

VI Conclusions

In this work, we propose a novel Hierarchical Long Short-Term Concurrent Memory (H-LSTCM) for human interaction recognition, which learns the dynamic inter-related representation among all persons from the static single-person features in a hierarchical way. Specifically, for each person, we first feed her/his static single-person features into a Single-Person LSTM to learn the single-person dynamic. Afterwards, the outputs of all Single-Person LSTM units are fed into a novel Concurrent LSTM (Co-LSTM) unit, which mainly consists of multiple sub-memory units, cell gates and a new co-memory cell. In the Co-LSTM unit, each sub-memory unit stores individual motion information, while the Co-LSTM unit selectively integrates and stores the inter-related motion information among multiple interacting persons from the multiple sub-memory units via the cell gates and the co-memory cell. The proposed method is evaluated on four public datasets and yields promising improvements over the state-of-the-art methods.

References

  • [1] Y. Kong, Y. Jia, and Y. Fu, “Interactive phrases: Semantic descriptions for human interaction recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 9, pp. 1775–1788, 2014.
  • [2] X. Chang, W.-S. Zheng, and J. Zhang, “Learning person–person interaction in collective activity recognition,” IEEE Transactions on Image Processing, vol. 24, no. 6, pp. 1905–1918.
  • [3] W. Choi, K. Shahid, and S. Savarese, “Learning context for collective activity recognition,” in CVPR, 2011.
  • [4] Z. Wang, S. Liu, J. Zhang, S. Chen, and Q. Guan, “A spatio-temporal crf for human interaction understanding,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 27, no. 8, pp. 1647–1660, 2017.
  • [5] Y. Kong and Y. Fu, “Close human interaction recognition using patch-aware models,” IEEE Transactions on Image Processing, vol. 25, no. 1, pp. 167–178, 2016.
  • [6] Y. Kong, Y. Jia, and Y. Fu, “Learning human interaction by interactive phrases,” in ECCV, 2012.
  • [7] Y. Zhang, X. Liu, M. Chang, W. Ge, and T. Chen, “Spatio-temporal phrases for activity recognition,” in ECCV, 2012.
  • [8] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  • [9] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, “Long-term recurrent convolutional networks for visual recognition and description,” in CVPR, 2015.
  • [10] Y. Du, W. Wang, and L. Wang, “Hierarchical recurrent neural network for skeleton based action recognition,” in CVPR, 2015.
  • [11] V. Veeriah, N. Zhuang, and G.-J. Qi, “Differential recurrent neural networks for action recognition,” in ICCV, 2015.
  • [12] J. Liu, A. Shahroudy, D. Xu, and G. Wang, “Spatio-temporal lstm with trust gates for 3d human action recognition,” in ECCV, 2016.
  • [13] M. Ibrahim, S. Muralidharan, Z. Deng, A. Vahdat, and G. Mori, “A hierarchical deep temporal model for group activity recognition,” arXiv, 2015.
  • [14] Q. Ke, M. Bennamoun, S. An, F. Boussaid, and F. Sohel, “Spatial, structural and temporal feature learning for human interaction prediction,” arXiv, 2016.
  • [15] X. Shu, J. Tang, G.-J. Qi, Y. Song, Z. Li, and L. Zhang, “Concurrence-aware long short-term sub-memories for person-person action recognition,” in CVPRW, 2017.
  • [16] J. Liu, B. Kuipers, and S. Savarese, “Recognizing human actions by attributes,” in CVPR, 2011.
  • [17] X. Wang and Q. Ji, “Hierarchical context modeling for video event recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 9, pp. 1770–1782, 2017.
  • [18] T. Lan, Y. Wang, G. Mori, and S. N. Robinovitch, “Retrieving actions in group contexts,” in ECCV, 2010.
  • [19] W. Choi, K. Shahid, and S. Savarese, “What are they doing?: Collective activity classification using spatio-temporal relationship among people,” in ICCVW, 2009.
  • [20] T. Lan, Y. Wang, W. Yang, S. N. Robinovitch, and G. Mori, “Discriminative latent models for recognizing contextual group activities,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 8, pp. 1549–1562, 2012.
  • [21] M. S. Ryoo and J. K. Aggarwal, “Recognition of composite human activities through context-free grammar based representation,” in CVPR, 2006.
  • [22] W. Choi and S. Savarese, “A unified framework for multi-target tracking and collective activity recognition,” in ECCV, 2012.
  • [23] A. Vahdat, B. Gao, M. Ranjbar, and G. Mori, “A discriminative key pose sequence model for recognizing human interactions,” in ICCVW, 2011.
  • [24] Z. Deng, M. Zhai, L. Chen, Y. Liu, S. Muralidharan, M. J. Roshtkhari, and G. Mori, “Deep structured models for group activity recognition,” in BMVC, 2015.
  • [25] Z. Deng, A. Vahdat, H. Hu, and G. Mori, “Structure inference machines: Recurrent neural networks for analyzing relations in group activity recognition,” in CVPR, 2016.
  • [26] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese, “Social lstm: Human trajectory prediction in crowded spaces,” in CVPR, 2016.
  • [27] A. Sadeghian, A. Alahi, and S. Savarese, “Tracking the untrackable: Learning to track multiple cues with long-term dependencies,” in ICCV, 2017.
  • [28] R. J. Williams and D. Zipser, “A learning algorithm for continually running fully recurrent neural networks,” Neural Computation, vol. 1, no. 2, pp. 270–280, 1989.
  • [29] M. S. Ryoo and J. K. Aggarwal, “Spatio-temporal relationship match: Video structure comparison for recognition of complex human activities,” in ICCV, 2009.
  • [30] H. Hajimirsadeghi, W. Yan, A. Vahdat, and G. Mori, “Visual recognition by counting instances: A multi-instance cardinality potential kernel,” in CVPRW, 2015.
  • [31] R. B. Girshick, “Fast r-cnn,” in ICCV, 2015.
  • [32] A. R. Zamir, A. Dehghan, and M. Shah, “Gmcp-tracker: Global multi-object tracking using generalized minimum clique graphs,” in ICCV, 2012.
  • [33] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in NIPS, 2012.
  • [34] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in ACM MM, 2014.
  • [35] K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” in NIPS, 2014.
  • [36] P. Dollár, V. Rabaud, G. Cottrell, and S. Belongie, “Behavior recognition via sparse spatio-temporal features,” in VS-PETS, 2005.
  • [37] T.-H. Yu, T.-K. Kim, and R. Cipolla, “Real-time action recognition by spatiotemporal semantic and structural forests,” in BMVC, 2010.
  • [38] M. S. Ryoo, “Human activity prediction: Early recognition of ongoing activities from streaming videos,” in ICCV, 2011.
  • [39] M. Raptis and L. Sigal, “Poselet key-framing: A model for human activity recognition,” in CVPR, 2013.
  • [40] S. Shariat and V. Pavlovic, “A new adaptive segmental matching measure for human activity recognition,” in ICCV, 2013.
  • [41] B. Antic and B. Ommer, “Learning latent constituents for recognition of group activities in video,” in ECCV, 2014.
  • [42] Z. Zhou, K. Li, X. He, and M. Li, “A generative model for recognizing mixed group activities in still images,” in IJCAI, 2016.
  • [43] H. Hajimirsadeghi and G. Mori, “Multi-instance classification by max-margin training of cardinality-based markov networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 9, pp. 1839–1852, 2017.
  • [44] T. Shu, S. Todorovic, and S.-C. Zhu, “Cern: Confidence-energy recurrent network for group activity recognition,” in CVPR, 2016.
  • [45] X. Li and M. C. Chuah, “SBGAR: semantics based group activity recognition,” in ICCV, 2017.
  • [46] S. Biswas and J. Gall, “Structural recurrent neural network (srnn) for group activity analysis,” in WACV, 2018.
  • [47] Y. Kong and Y. Fu, “Max-margin action prediction machine,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 9, pp. 1844–1858, 2016.
  • [48] Y. Cao, D. P. Barrett, A. Barbu, S. Narayanaswamy, H. Yu, A. Michaux, Y. Lin, S. J. Dickinson, J. M. Siskind, and S. Wang, “Recognize human activities from partially observed videos,” in CVPR, 2013.
  • [49] Y. Kong, D. Kit, and Y. Fu, “A discriminative model with multiple temporal scales for action prediction,” in ECCV, 2014.