1. Introduction
Recent years have witnessed a revival of the Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997), thanks to the special gating mechanism that controls access to memory cells. The superior capability of LSTM has been shown in a wide range of problems such as machine translation (Sutskever et al., 2014; Bahdanau et al., 2014), speech recognition (Graves et al., 2013), and multimodal translation (Venugopalan et al., 2014). In contrast to many existing spatiotemporal features (Klaser et al., 2008; Scovanner et al., 2007) extracted from time-series data, LSTM uses either a hidden layer (Schuster and Paliwal, 1997) or a memory cell (Hochreiter and Schmidhuber, 1997) to learn the time-evolving states which model the underlying dynamics of the input sequences. Compared with the conventional RNN, the major component of LSTM is the memory cell, which is modulated by three gates: the input, output, and forget gates. These gates determine the amount of dynamic information entering or leaving the memory cell. The memory cell has a set of internal states, which store the information obtained over time. In this context, these internal states constitute a representation of an input sequence learned over time.
LSTMs have shown tremendous potential in activity recognition tasks (Donahue et al., 2015; Baccouche et al., 2010; Grushin et al., 2013). The existing LSTM model represents a video by integrating all the available information from each frame over time. It was pointed out in (Veeriah et al., 2015) that for an activity recognition task, not all frames contain salient spatiotemporal information which is equally discriminative to different classes of activities. Many frames contain non-salient motions which are irrelevant to the performed actions. Since the gate units in LSTM do not explicitly consider whether a frame contains salient motion information when they modulate the input and output of the memory cells, LSTM is insensitive to the dynamical evolution of the hidden states given the input video sequences and cannot capture the salient dynamic patterns. The differential RNN (dRNN) (Veeriah et al., 2015) addresses this problem and models the dynamics of actions by computing different orders of Derivative of State (DoS). DoS models the change in information gain caused by the salient motions between successive frames using higher orders of internal state derivatives. Intuitively, the 1st-order DoS represents the velocity of change of the internal state memory, while the 2nd-order DoS represents the acceleration of memory state change. This reveals that the conventional LSTM, whose internal cell is simply the 0th-order DoS, only captures the locality of information change.
dRNN is formulated such that the gates are modulated by weighted combinations of several orders of DoS. While an individual order of DoS is able to model a certain degree of dynamical structure, the sum of all the orders of DoS could distort the detected salient motion patterns. Figure 1 illustrates the energy curves of the 0th-, 1st-, and 2nd-order DoS over an example sequence for the activity "RunInGroup". The local maxima indicate high-energy landmarks corresponding to the salient motion frames at different levels. While each order of DoS enables the LSTM unit to model the dynamics of local saliency at a certain level, the weighted sum of different orders of DoS risks misaligning the salient motions and produces distorted motion patterns. This inspires us to control the LSTM gates using individual orders of the state derivatives.
As is generally accepted, RNNs are inherently deep in time, because the current hidden state is a function of all previous hidden states. By questioning whether RNNs could also benefit from depth in space, just as feedforward layers are stacked in conventional deep networks, Graves et al. (Graves et al., 2013) introduced Deep Recurrent Neural Networks, also known as stacked LSTMs. Stacked LSTMs have shown superiority over the traditional LSTM in modeling complex sequences and have been used in various types of applications. Inspired by the Deep Recurrent Neural Network, we are motivated to explore whether the stacked deep layers in space could naturally reveal the saliency dynamics over time, thus avoiding the misalignment of DoS in different orders.
To this end, we propose to stack multiple levels of LSTM cells with increasing orders of DoS. The proposed model progressively builds up the ability of the LSTM gates to detect salient dynamic patterns, with deeper memory layers modeling higher orders of DoS. The proposed model is thus termed the deep differential Recurrent Neural Network (d²RNN). The d²RNN differs from conventional stacked LSTMs in that stacked LSTMs use homogeneous LSTM layers while d²RNN uses heterogeneous ones. In this way, d²RNN is not only capable of modeling more complex dynamical patterns, but also enables a hierarchy of DoS saliency in deep layers to model the spatiotemporal dynamics over time.
We demonstrate that d²RNN can outperform the state-of-the-art methods on two publicly available human activity datasets: NUS-HGA (Ni et al., 2009) and Violent-Flows (Hassner et al., 2012). Specifically, d²RNN outperforms the existing LSTM, dRNN, and stacked LSTM models, consistently achieving better performance in detecting human activities in sequences. In addition, we compare with other non-LSTM algorithms, against which the d²RNN model also reaches competitive performance.
The remainder of this paper is organized as follows. In the next section, we briefly review several related works. The background and details of dRNN are reviewed in Section 3. Section 4 presents the proposed deep differential RNN model. The experimental results are presented in Section 5. Finally, we conclude and discuss future work in Section 6.
2. Related Work
2.1. Variants of Long Short-Term Memory
Due to the exponential decay of error signals, traditional RNNs are limited in learning long-term dependencies. Hochreiter et al. (Hochreiter and Schmidhuber, 1997) designed the Long Short-Term Memory (LSTM) to exploit long-range dependencies. As LSTM shows superiority in modeling time-series data, it has been widely applied to various kinds of sequential processing tasks, and several LSTM variants have been proposed to improve the architecture of the standard LSTM. S-LSTM (Zhu et al., 2015) is an LSTM network with tree structures; its hierarchical structure aims to mitigate the gradient vanishing problem and model more complicated input than LSTM. Bidirectional LSTM (Schuster and Paliwal, 1997) captures both the future and past context of the input sequence. Multidimensional LSTM (MDLSTM) (Graves et al., 2007) uses interconnections from the previous states of the cell to extend the memory of LSTM along every dimension. MDLSTM receives inputs in an n-dimensional arrangement and can thus model multidimensional sequences, but it becomes unstable with the growth of the grid size and the LSTM depth in space. Grid LSTM (Kalchbrenner et al., 2015) provides a solution by altering the computation of the output memory vectors.
Even though the above LSTM variants show superiority in some aspects, they do not consider the salient spatiotemporal dynamics which can be modeled by the information gain of the internal memory states. This inspires the use of the Derivative of State (DoS) in the differential Recurrent Neural Network (dRNN) (Veeriah et al., 2015). Unfortunately, dRNN uses a weighted combination of different orders of DoS to modulate the LSTM gates, which could distort the detected salient motion patterns. We are motivated to control the gates using individual orders of DoS.
Stacked LSTMs (Graves et al., 2013) borrow the idea of depth from ANNs and stack hidden layers with LSTM cells in space to increase the network capacity. However, the homogeneous layers of stacked LSTMs limit their ability to model discriminative spatiotemporal structures. We are motivated to explore a hierarchy of DoS saliency in deep layers.
2.2. Activity Recognition
Human activity recognition includes the subproblems of individual human action recognition and multi-person activity recognition. In this paper, we focus on the more challenging problem of multi-person activity recognition, which is further divided into group activity recognition and crowd analysis.
Most existing approaches to group activity recognition are based on the motion trajectories of group participants. Ni et al. (Ni et al., 2009) applied motion trajectory segments as inputs and used digital filters' frequency responses to represent the motion information. Zhu et al. (Zhu et al., 2011) considered a motion trajectory as a dynamic system and used the Markov stationary distribution to acquire local appearance features as a descriptor of group action. Chu et al. (Chu et al., 2012) designed an algorithm that models the trajectories as series of heat sources to create a heat map representing group actions. Cho et al. (Cho et al., 2015) addressed the problem by using group interaction zones to detect meaningful groups and handle noisy information. Cheng et al. (Cheng et al., 2014) proposed a layered model of human group action and represented activity patterns with both motion and appearance information; their performance on NUS-HGA achieved an accuracy of 96.20%. Zhuang et al. (Zhuang et al., 2017) used a combination of a deep VGG network (Simonyan and Zisserman, 2014) and stacked LSTMs. Their model complexity is high, with a large chance of overfitting; accordingly, their model is trained on augmented data and thus cannot be fairly compared with other methods.
Recent methods for crowd scene understanding mostly analyze crowd activities based on motion features extracted from trajectories/tracklets of objects (Shao et al., 2014; Hassner et al., 2012; Su et al., 2016; Mousavi et al., 2015a). Marsden et al. (Marsden et al., 2016) studied scene-level holistic features using tracklets to solve the real-time crowd behavior anomaly detection problem, and their method holds the state-of-the-art performance on the Violent-Flows dataset. Su et al. (Su et al., 2016) used tracklet-based features and explored Coherent LSTM to model the nonlinear characteristics and spatiotemporal motion patterns in crowd behaviors. Trajectory/tracklet features contain more semantic information, but the accuracy of the trajectories/tracklets dictates the performance of crowd scene analysis; in extremely crowded areas, tracking algorithms can fail and generate inaccurate trajectories. A general-purpose RNN does not require such input, holding the potential for more sequence-related applications.

3. Background
In this section, we briefly review Recurrent Neural Network (RNN) as well as differential Recurrent Neural Network (dRNN). Readers who are familiar with them might skip to the next section directly.
3.1. Recurrent Neural Networks
Traditional Recurrent Neural Networks model the dynamics of an input sequence of frames x_1, x_2, …, x_T through a sequence of hidden states h_1, h_2, …, h_T, thereby learning the spatiotemporal structure of the input sequence. For instance, a classical RNN model uses the following recurrent equation

h_t = tanh(W_{xh} x_t + W_{hh} h_{t−1} + b_h)    (1)

to model the hidden state h_t at time t by combining the information from the current input x_t and the past hidden state h_{t−1}. The hyperbolic tangent tanh(·) in the above equation is an activation function with range [−1, 1]; W_{xh} and W_{hh} are two mapping matrices to the hidden states, and b_h is the bias vector.
The hidden states will then be mapped to an output sequence y_1, y_2, …, y_T as

y_t = W_{hy} h_t + b_y    (2)

where each y_t represents a 1-of-K encoding of the confidence scores on the K classes of human activities. This output can then be transformed to a vector of probabilities p_t by the softmax function as

p_{t,c} = exp(y_{t,c}) / Σ_{c′} exp(y_{t,c′})    (3)

where each entry p_{t,c} is the probability of frame t belonging to class c.
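As a concrete illustration, the recurrence of Eqs. (1)–(3) can be sketched in a few lines of NumPy. The dimensions and weight names below are toy choices of our own, not values from the paper:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    # Eq. (1): combine the current input with the past hidden state.
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

def class_probs(h_t, W_hy, b_y):
    # Eq. (2): confidence scores; Eq. (3): softmax to class probabilities.
    y_t = W_hy @ h_t + b_y
    e = np.exp(y_t - y_t.max())   # subtract the max for numerical stability
    return e / e.sum()

# Toy dimensions: 4-dim input frames, 3 hidden units, 2 classes.
rng = np.random.default_rng(0)
W_xh, W_hh, b_h = rng.normal(size=(3, 4)), rng.normal(size=(3, 3)), np.zeros(3)
W_hy, b_y = rng.normal(size=(2, 3)), np.zeros(2)

h_t = np.zeros(3)
for x_t in rng.normal(size=(5, 4)):   # five frames of a synthetic sequence
    h_t = rnn_step(x_t, h_t, W_xh, W_hh, b_h)
p_t = class_probs(h_t, W_hy, b_y)
```

Note how the same weights are reused at every time step; the recurrence, not the parameter count, is what makes the model deep in time.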
3.2. Differential Recurrent Neural Networks
Due to the exponential decay in retaining the context information from video frames, traditional RNNs are limited in learning long-term representations of sequences. Hochreiter et al. (Hochreiter and Schmidhuber, 1997) designed the Long Short-Term Memory (LSTM) to exploit long-range dependencies.
Although the traditional LSTM network is capable of processing complex sequential information, it might fail to capture salient dynamic patterns, because the gate units do not explicitly consider the impact of dynamic structures present in the input sequences. This makes the conventional LSTM model inadequate for learning the evolution of action states. Veeriah et al. (Veeriah et al., 2015) introduced the Derivative of State (DoS) for dRNN, which can explicitly model spatiotemporal structure and learn the salient motion patterns within. Replacing the internal state with the DoS in the gate units, dRNN has the following updated equations:
(i) The input gate i_t regulates the degree to which the input information enters the memory cell to affect the internal state s_t at time t. The activation of the gate has the following recurrent form:

i_t = σ( Σ_{n=0}^{N} W_{di}^{(n)} d^{(n)}s_{t−1} + W_{xi} x_t + W_{hi} h_{t−1} + b_i )

where the sigmoid σ(·) is an activation function in the range [0, 1], with 0 meaning the gate is closed and 1 meaning the gate is completely open; d^{(n)}s denotes the n-th order DoS (with d^{(0)}s = s); W_{di}^{(n)}, W_{xi}, W_{hi} are the mapping matrices and b_i is the bias vector.
(ii) The forget gate f_t modulates the previous state s_{t−1} to control its contribution to the current state. It is defined as

f_t = σ( Σ_{n=0}^{N} W_{df}^{(n)} d^{(n)}s_{t−1} + W_{xf} x_t + W_{hf} h_{t−1} + b_f )

with the mapping matrices W_{df}^{(n)}, W_{xf}, W_{hf} and the bias vector b_f.
With the input and forget gate units, the internal state of each memory cell can be updated as below:

s_t = f_t ⊙ s_{t−1} + i_t ⊙ tanh( W_{xs} x_t + W_{hs} h_{t−1} + b_s )    (4)

where ⊙ stands for the element-wise product.
(iii) The output gate o_t gates the information output from a memory cell, which would influence the future states of the LSTM cells. It is defined as

o_t = σ( Σ_{n=0}^{N} W_{do}^{(n)} d^{(n)}s_t + W_{xo} x_t + W_{ho} h_{t−1} + b_o )

Then the hidden state of the memory cell is output as

h_t = o_t ⊙ tanh(s_t)    (5)
By iteratively applying Eq. 4 and Eq. 5, dRNN updates the internal state and the hidden state over time. In the process, the input gate, forget gate, and output gate play an important role in controlling the information entering and leaving the memory cell.
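The update cycle above can be sketched in NumPy for the case of DoS up to 1st order, with the discrete difference of states standing in for the derivative. All parameter names in the dictionary `p` are our own shorthand, and the dimensions are toy values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gate(p, g, dos0, dos1, x_t, h_prev):
    # A dRNN gate driven by a weighted combination of 0th- and 1st-order DoS.
    return sigmoid(p["Wd0" + g] @ dos0 + p["Wd1" + g] @ dos1
                   + p["Wx" + g] @ x_t + p["Wh" + g] @ h_prev + p["b" + g])

def drnn_step(x_t, h_prev, s_prev, s_prev2, p):
    # Input and forget gates use the DoS of s_{t-1}: s_{t-1} and s_{t-1}-s_{t-2}.
    i_t = gate(p, "i", s_prev, s_prev - s_prev2, x_t, h_prev)
    f_t = gate(p, "f", s_prev, s_prev - s_prev2, x_t, h_prev)
    # Eq. (4): internal state update (element-wise products).
    s_t = f_t * s_prev + i_t * np.tanh(p["Wxs"] @ x_t + p["Whs"] @ h_prev + p["bs"])
    # The output gate uses the DoS of the updated state s_t.
    o_t = gate(p, "o", s_t, s_t - s_prev, x_t, h_prev)
    return o_t * np.tanh(s_t), s_t        # Eq. (5): hidden state output

# Toy parameters: 4-dim input, 3 state units.
rng = np.random.default_rng(0)
p = {"Wxs": rng.normal(size=(3, 4)), "Whs": rng.normal(size=(3, 3)), "bs": np.zeros(3)}
for g in "ifo":
    p.update({"Wd0" + g: rng.normal(size=(3, 3)), "Wd1" + g: rng.normal(size=(3, 3)),
              "Wx" + g: rng.normal(size=(3, 4)), "Wh" + g: rng.normal(size=(3, 3)),
              "b" + g: np.zeros(3)})

h = s = s_old = np.zeros(3)
for x in rng.normal(size=(5, 4)):      # five frames of a synthetic sequence
    h, s_new = drnn_step(x, h, s, s_old, p)
    s, s_old = s_new, s                # shift the state history by one step
```

The point to notice is that every gate sees a *sum* over DoS orders; the next section argues that this summation is exactly what can distort the detected motion patterns.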
4. The Proposed Model
Given an activity recognition task, not all video frames contain salient patterns to discriminate between different classes of activities. dRNN tries to detect and integrate the salient spatiotemporal sequences via the state derivative. As the internal state contains the memory of the previous input sequences, the state derivative explicitly models the change in information gain and considers the impact of dynamic structures. Thus, the state derivative tends to be effective in recognizing actions.
As mentioned in (Greff et al., 2017), the gates serve as the most crucial elements of LSTM. dRNN formulates the input, forget, and output gates using a combination of different orders of DoS. More specifically, the 0th-order DoS, which is the same as the conventional LSTM internal cell, models the locality of memory change; the 1st-order DoS denotes the velocity of change in information gain; and the 2nd-order DoS describes the acceleration of memory change, etc. While each individual order of DoS is effective in capturing a certain level of salient spatiotemporal structure, the sum of all the orders of DoS could distort the detected salient motion patterns and result in less effective modulation of those gates.
In this paper, we propose to modulate the LSTM gates via individual orders of DoS. Inspired by (Graves et al., 2013), we stack multiple levels of LSTM cells with increasing orders of DoS. More specifically, layer 1 of d²RNN uses the 0th-order DoS, which resembles the conventional LSTM cell; layer 2 uses an LSTM cell with 1st-order DoS; and layer 3 uses an LSTM cell with 2nd-order DoS, etc. Since we integrate the idea of DoS from dRNN with the deep stacked layers of deep RNNs, our proposed model is termed the deep differential Recurrent Neural Network (d²RNN). Within each layer of d²RNN, our model learns the change in information gain with an individual order of DoS. With deeper layers, our model learns higher-degree and more complex dynamical patterns.
Figure 2 illustrates the LSTM unit in layer l of the proposed d²RNN model. Hollow lines indicate the information flow of the DoS. Formally, we have the following recurrent equations to control the LSTM gates in layer l of d²RNN.
(i) Input gate:

i_t^l = σ( W_{di}^l d^{(l−1)}s_{t−1}^l + W_{xi}^l x_t^l + W_{hi}^l h_{t−1}^l + b_i^l )    (6)

(ii) Forget gate:

f_t^l = σ( W_{df}^l d^{(l−1)}s_{t−1}^l + W_{xf}^l x_t^l + W_{hf}^l h_{t−1}^l + b_f^l )    (7)

(iii) Output gate:

o_t^l = σ( W_{do}^l d^{(l−1)}s_t^l + W_{xo}^l x_t^l + W_{ho}^l h_{t−1}^l + b_o^l )    (8)

where d^{(l−1)} denotes the (l−1)-th order of DoS, so that each layer is modulated by a single, individual order.
4.1. Discretized Model
Since the d²RNN model is defined in the discrete-time domain, the 1st-order derivative ds_t/dt, as the velocity of information change, can be discretized as the difference of states:

ds_t/dt ≈ s_t − s_{t−1}    (9)

for simplicity (Epperson, 2013).

Similarly, we consider the 2nd-order DoS as the acceleration of information change. It can be discretized as:

d²s_t/dt² ≈ s_t − 2s_{t−1} + s_{t−2}    (10)
In this paper, we only consider the first two orders of DoS. Higher orders can be derived in a similar way.
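To make the layer recursion concrete, here is a NumPy sketch of a 3-layer stack in which layer l modulates its gates with the (l−1)-th order discretized DoS of Eqs. (9)–(10). All dimensions and parameter names are toy choices of our own; for brevity every gate here uses the DoS of s_{t−1}, whereas Eq. (8) evaluates the output gate on s_t:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Finite-difference coefficients for the discretized DoS of orders 0-2:
# order 0: s_{t-1};  order 1 (Eq. 9): s_{t-1} - s_{t-2};
# order 2 (Eq. 10): s_{t-1} - 2 s_{t-2} + s_{t-3}.
DOS_COEFFS = {0: [1.0], 1: [1.0, -1.0], 2: [1.0, -2.0, 1.0]}

def d2rnn_layer_step(x_t, h_prev, past_s, order, p):
    """One step of a layer whose gates use only the `order`-th DoS.

    `past_s` holds [s_{t-1}, s_{t-2}, s_{t-3}]; parameter names in `p`
    are our own shorthand for the mapping matrices of Eqs. (6)-(8).
    """
    dos = sum(c * s for c, s in zip(DOS_COEFFS[order], past_s))
    def gate(g):
        return sigmoid(p["Wd" + g] @ dos + p["Wx" + g] @ x_t
                       + p["Wh" + g] @ h_prev + p["b" + g])
    i_t, f_t, o_t = gate("i"), gate("f"), gate("o")
    s_t = f_t * past_s[0] + i_t * np.tanh(p["Wxs"] @ x_t + p["Whs"] @ h_prev + p["bs"])
    return o_t * np.tanh(s_t), s_t        # Eqs. (4)-(5)

rng = np.random.default_rng(1)
def make_params(n_in, n_s=3):
    p = {"Wxs": rng.normal(size=(n_s, n_in)), "Whs": rng.normal(size=(n_s, n_s)),
         "bs": np.zeros(n_s)}
    for g in "ifo":
        p.update({"Wd" + g: rng.normal(size=(n_s, n_s)),
                  "Wx" + g: rng.normal(size=(n_s, n_in)),
                  "Wh" + g: rng.normal(size=(n_s, n_s)), "b" + g: np.zeros(n_s)})
    return p

# layers[k] uses the k-th order of DoS, i.e. layer l uses order l-1.
layers = [{"order": k, "p": make_params(4 if k == 0 else 3),
           "h": np.zeros(3), "s": [np.zeros(3)] * 3} for k in range(3)]

for x in rng.normal(size=(5, 4)):          # five frames of a toy 4-dim feature
    inp = x
    for L in layers:
        L["h"], s_new = d2rnn_layer_step(inp, L["h"], L["s"], L["order"], L["p"])
        L["s"] = [s_new] + L["s"][:2]      # shift the state history
        inp = L["h"]                       # hidden states feed the next layer
top_h = layers[-1]["h"]
```

In contrast to the dRNN gate, each layer here sees exactly one DoS order, which is the heterogeneity this section argues for.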
4.2. Algorithm and Learning
With the above recurrent equations, the d²RNN model proceeds layer by layer, starting in layer 1 (l = 1) at time step t.

For a frame-by-frame prediction task, we output the label by applying the softmax to y_t using Eqs. (2) and (3). To learn the model parameters of d²RNN, we define a loss function to measure the deviation between the target class z_t and the predicted probabilities p_t at time t:

ℓ(p_t, z_t) = −log p_{t,z_t}

Then, we can minimize the cumulative loss over the sequence:

min Σ_{t=1}^{T} ℓ(p_t, z_t)
For an activity recognition task, the label of an activity is often given at the video level. Since LSTMs have the ability to memorize the content of an entire sequence, the last memory cell of an LSTM ought to contain all the necessary information for recognizing the activity. The sequence-level class probability p is generated by computing the output of d²RNN at the last time step with Eq. (2) and applying the softmax function with Eq. (3). For a given training label z, the d²RNN can then be trained by minimizing the loss function below, i.e.,

ℓ(p, z) = −log p_z
The loss function can be minimized by Back-Propagation Through Time (BPTT) (Cuéllar et al., 2007), which unfolds an LSTM model over several time steps and then runs the back-propagation algorithm to train the model. To prevent the back-propagated errors from decaying or exploding exponentially, we use truncated BPTT, following Hochreiter et al. (Hochreiter and Schmidhuber, 1997), to learn the model parameters. Specifically, in our model, errors are not allowed to re-enter the memory cell once they leave it through the DoS nodes.
Formally, we assume the following truncated derivatives of the gate activations:

∂i_t^l / ∂(d^{(l−1)}s_{t−1}^l) ≜ 0,  ∂f_t^l / ∂(d^{(l−1)}s_{t−1}^l) ≜ 0

and

∂o_t^l / ∂(d^{(l−1)}s_t^l) ≜ 0

where ≜ stands for the truncated derivative.
5. Experimental Results
We compare the performance of the proposed method with state-of-the-art LSTM and non-LSTM methods from the existing literature on human activity datasets.
5.1. Datasets and Feature Extraction
The proposed method is evaluated on two publicly available human activity datasets: NUS-HGA (Ni et al., 2009) and Violent-Flows (Hassner et al., 2012).

We choose the NUS-HGA dataset as it is a well-collected benchmark for evaluating group activity recognition techniques. The NUS-HGA dataset includes 476 video clips covering six group activity classes: Fight, Gather, Ignore, RunInGroup, StandTalk, and WalkInGroup. Each instance involves 4–8 persons. The sequences are captured over different backgrounds with a static camera recording 25 frames per second. Each video clip has a resolution of 720 × 576 and lasts around 10 seconds.

The Violent-Flows (VF) dataset consists of real-world video footage of crowd violence, along with standard benchmark protocols designed for violent/non-violent classification. It includes 246 real-world videos downloaded from YouTube. The shortest clip lasts 1.04 seconds, the longest 6.52 seconds, and the average video length is 3.6 seconds.
We use densely sampled HOG3D features (Klaser et al., 2008) to represent each frame of the video sequences from the NUS-HGA and Violent-Flows datasets. Specifically, we uniformly divide the 3D video volumes into a dense grid and extract the descriptors from each cell of the grid. The parameters for HOG3D are the same as those used in (Klaser et al., 2008), and the standard dense-sampling parameters for extracting HOG3D features can be found on the authors' webpage. All the videos are resized to a fixed resolution. The size of the descriptor is 300 per grid cell, and there are 58 such cells in each frame, yielding a 17,400-dimensional feature vector per frame. To construct a compact input to the d²RNN model, Principal Component Analysis (PCA) is then applied to reduce the feature dimension. After PCA dimension reduction, NUS-HGA has a feature dimension of 300 and Violent-Flows of 500, both retaining 90% of the energy among the principal components. For the sake of fair comparison, we use the same features as input for the other LSTM models as well.
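The dimension-reduction step can be reproduced with scikit-learn, which accepts the retained energy fraction directly as `n_components`. The array sizes below are synthetic stand-ins, not the real HOG3D features:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for per-frame HOG3D descriptors: 200 frames, 1,740 dims
# (the real descriptors are 17,400-dimensional; these numbers are toy values).
rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 1740))

# Passing a float in (0, 1) keeps the smallest number of principal components
# whose cumulative explained variance reaches that fraction, i.e. "retaining
# 90% of energy" in the paper's terms.
pca = PCA(n_components=0.90)
reduced = pca.fit_transform(frames)
```

In practice the PCA basis should be fit on the training split only and then applied to the test split, so that no test information leaks into the features.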
5.2. Architecture and Training
The architectures of the d²RNN models trained on the above datasets are shown in Table 1. We keep the number of state units the same for all the LSTM layers in d²RNN. For the sake of fair comparison, we adopt the same architecture for the stacked LSTM models, and for the dRNN model we keep the same number of state units as stacked LSTMs and d²RNN. Note that the number of memory cell units is smaller than the number of input units on both datasets. This can be interpreted as follows: the sequence of a human activity video often forms a continuous pattern embedded in a low-dimensional manifold of the input space, so a lower-dimensional state space is sufficient to capture the dynamics of such patterns. The number of output units corresponds to the number of classes in each dataset.
                NUS-HGA    Violent-Flows
Input Units       300          500
State Units       200          300
Output Units        6            2
We plot the learning curves for training the d²RNN models on the NUS-HGA dataset in Figure 4. An n-layer d²RNN refers to the model of n layers with DoS ranging from 0th-order to (n−1)th-order. The learning rate of the BPTT algorithm is set to 0.0001. The figure shows that the objective loss continuously decreases over 50 epochs. Deeper d²RNN models converge faster due to their larger model complexity.

5.3. Results on the NUS-HGA Dataset
Several different evaluation protocols have been used on the NUS-HGA dataset in the literature, which can lead to fairly large differences in performance across experimental protocols.

First, to evaluate the performance of the proposed d²RNN vs. other LSTM models, we perform five-fold cross-validation. This setup makes the task more challenging than Monte-Carlo cross-validation. For the NUS-HGA dataset, the activity videos are produced by chopping longer sequences into shorter ones. Due to random sampling, Monte-Carlo cross-validation makes the task easier by putting similar video instances into both the training and testing sets. Five-fold cross-validation, however, dramatically increases the difficulty, since training and testing examples usually have high in-class variations regarding background, view angle, lighting, and activity participants.
We show the performance comparison of the LSTM models in Figure 6. For the dRNN and d²RNN models, we report the accuracy up to the 2nd-order of DoS. For stacked LSTMs, we report the accuracy up to 3 layers, the same depth as d²RNN. All the LSTM models use the same HOG3D features presented above. The conventional LSTM yields the lowest performance among the LSTM models, because it uses neither DoS nor deep layers to capture the motion saliency present in the video frames.

Generally, higher orders of DoS produce better performance for both dRNN and d²RNN. However, the performance gain for dRNN is smaller than for d²RNN. This can be explained by the fact that, for dRNN, the combination of all the orders of DoS distorts the detected motion patterns. d²RNN, on the other hand, uses an individual order of DoS on each layer, and can therefore preserve and align the salient dynamic structures.

It can also be seen that d²RNN outperforms stacked LSTMs given the same number of deep layers, which demonstrates the advantage of the heterogeneous LSTM layers used in d²RNN over the homogeneous ones in stacked LSTMs. More specifically, the higher orders of DoS in d²RNN detect salient spatiotemporal structures which cannot be captured by the conventional LSTM layers used in stacked LSTMs. The above analysis shows that d²RNN is not only capable of modeling more complex dynamical patterns, but also enables a hierarchy of DoS saliency in deep layers to model the spatiotemporal dynamics over time.

Although deeper layers or higher orders of DoS might improve the accuracy further, we do not report such results, since the improvement obtained by simply adding more deep layers with higher-order DoS becomes trivial and might not compensate for the increased computational cost. Moreover, with an increased number of deep layers modeling higher orders of DoS, a larger number of model parameters would have to be learned from the limited training examples. This tends to cause overfitting, making the performance stop improving or even begin to degenerate. Therefore, for most practical applications, the 3-layer setup of d²RNN should be sufficient. More applications of deep architectures of RNNs can be found in (Irsoy and Cardie, 2014; Sutskever et al., 2014).
In order to compare d²RNN with the non-LSTM state-of-the-art methods, we follow (Cheng et al., 2014) and evaluate our method via Monte-Carlo cross-validation. More specifically, we randomly select 80% of the examples from each class of the dataset to form the training set, and assign the rest to the test set. The performance is reported as the average accuracy across five such trials.
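This Monte-Carlo protocol amounts to the following splitting procedure. The helper below is hypothetical (the real experiments train the recognition model on each split); the class counts are toy values:

```python
import numpy as np

def monte_carlo_splits(labels, n_trials=5, train_frac=0.8, seed=0):
    """Yield per-class 80/20 train/test index splits, repeated n_trials times.

    A hypothetical helper mirroring the protocol of (Cheng et al., 2014):
    80% of each class is sampled at random for training in every trial.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    for _ in range(n_trials):
        train, test = [], []
        for c in np.unique(labels):
            idx = rng.permutation(np.flatnonzero(labels == c))
            cut = int(round(train_frac * len(idx)))
            train.extend(idx[:cut])
            test.extend(idx[cut:])
        yield np.array(train), np.array(test)

# Toy labels: six classes with 30 clips each (NUS-HGA itself has 476 clips).
y = np.repeat(np.arange(6), 30)
splits = list(monte_carlo_splits(y))
```

Sampling per class keeps every split stratified, so each trial sees the same class proportions in training and testing.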
Methods                                      Accuracy (%)
Ni et al. (Ni et al., 2009)                      73.50
Zhu et al. (Zhu et al., 2011)                    87.00
Cho et al. (Cho et al., 2015)                    96.03
Cheng et al. (Cheng et al., 2014) (MF)           93.20
Cheng et al. (Cheng et al., 2014) (MAF)          96.20
3-layer d²RNN                                    98.24
We compare the d²RNN model with other non-LSTM state-of-the-art algorithms in Table 2. In addition, Figure 7 shows the confusion matrices obtained by (Cheng et al., 2014) and our proposed method. The d²RNN model generally achieves better performance than the other non-LSTM methods. Traditional group activity algorithms need human supervision to acquire accurate human object trajectories from the videos. According to (Cheng et al., 2014), they acquired human bounding boxes using existing tracking tools, which require manual annotation for bounding box initialization. This constraint prevents their method from being used in automatic or real-time applications. In contrast, d²RNN models can outperform these traditional methods without the aid of manual operation, enabling broader applications for group behavior analysis. In addition, since traditional models rely on such special assumptions, they might not be applicable to other types of sequences which do not satisfy these assumptions. The proposed d²RNN model, by contrast, is a general-purpose model, not tailored to any specific type of sequence, and holds the potential for other sequence-related applications.
Zhuang et al. (Zhuang et al., 2017) reported an accuracy of 99.25% on the NUS-HGA dataset. It is worth noting that they used a combination of a deep VGG network (Simonyan and Zisserman, 2014) with stacked LSTMs, where the deep VGG network plays a crucial role in reaching such performance. In addition, to avoid overfitting of the complex model, they applied data augmentation to the training set to increase the diversity of training sequences. In our experiments, no data augmentation technique is used.
In order to better understand the disadvantage of combining different orders of DoS in the dRNN model, we perform the following experimental study. First, we construct variants of LSTM using individual orders of DoS and treat them as separate models, which we call the "1st-order LSTM" and the "2nd-order LSTM". Then we use the AdaBoost algorithm (Freund and Schapire, 1997) to ensemble these models, which we call the Ensemble RNN (eRNN). More specifically, the 1st-order eRNN ensembles the 0th- and 1st-order LSTMs; the 2nd-order eRNN ensembles the 0th-, 1st-, and 2nd-order LSTMs. By doing so, we also intend to study whether each individual order of DoS is good at modeling a certain level of motion saliency.
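The ensembling step can be sketched as follows. This is a simplified AdaBoost (SAMME-style) weighting of pre-trained models by their held-out error rates; the full algorithm of (Freund and Schapire, 1997) also reweights the training samples between rounds, which is omitted here, and all numbers below are synthetic:

```python
import numpy as np

def samme_alphas(preds, y, n_classes):
    """Per-model vote weights from held-out error rates (simplified sketch)."""
    errs = (preds != y).mean(axis=1)
    errs = np.clip(errs, 1e-6, 1 - 1e-6)           # guard the logarithm
    return np.log((1 - errs) / errs) + np.log(n_classes - 1)

def weighted_vote(preds, alphas, n_classes):
    votes = np.zeros((preds.shape[1], n_classes))
    for p, a in zip(preds, alphas):
        votes[np.arange(preds.shape[1]), p] += a   # each model casts alpha votes
    return votes.argmax(axis=1)

# Toy setup: three "models" (0th/1st/2nd-order LSTMs) on 6-class labels,
# each simulated to be correct with a fixed probability.
rng = np.random.default_rng(0)
y = rng.integers(0, 6, size=120)
preds = np.stack([np.where(rng.random(120) < acc, y, rng.integers(0, 6, size=120))
                  for acc in (0.60, 0.65, 0.70)])
alphas = samme_alphas(preds, y, 6)
ensemble = weighted_vote(preds, alphas, 6)
```

More accurate models receive larger weights, so the ensemble prediction is dominated by the orders of DoS that best match the motion saliency of the data.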
In Figure 8, the leftmost three bars show the performances of the conventional LSTM, the 1st-order LSTM, and the 2nd-order LSTM. LSTMs with higher individual orders of DoS achieve slightly better performance. Their ensemble models, the eRNNs, achieve substantially better results, which shows that each individual order of DoS can indeed detect a certain level of motion saliency and contributes to the ensemble. Comparing dRNN with eRNN of the same order, we find that eRNN consistently achieves higher results. This demonstrates that combining different orders of DoS within the LSTM gates is suboptimal and that the sum of all the orders of DoS distorts the detected motion patterns.
5.4. Results on the Violent-Flows Dataset
To evaluate our method on the Violent-Flows dataset, we follow the standard 5-fold cross-validation protocol in (Hassner et al., 2012) and report the results in terms of mean accuracy.
Figure 9 compares the results of the LSTM models on the Violent-Flows dataset. For the dRNN models, we report the accuracy up to the 2nd-order of DoS. For stacked LSTMs and d²RNN, we report the accuracy up to 3 layers.
From the results, dRNN, stacked LSTMs, and the proposed d²RNN all outperform the conventional LSTM with the same features. d²RNN outperforms both dRNN and stacked LSTMs, which demonstrates the effectiveness of learning the intrinsic dynamical patterns present in video sequences with both deeper LSTM layers and higher orders of DoS. While deeper layers enable greater model complexity, higher orders of DoS strengthen the model's ability to detect salient spatiotemporal structures. Meanwhile, the stacked deep layers in space naturally reveal the salient dynamics over time and decrease the chance of misaligning DoS of different orders. The dRNN model suffers from the distortion of salient motion patterns due to its combination of different orders of DoS, while stacked LSTMs ignore the salient spatiotemporal structures by simply stacking homogeneous layers onto the model.
It is worth pointing out that for stacked LSTMs and d²RNN, the 3-layer architectures do not improve the performance much compared to the corresponding 2-layer models on the Violent-Flows dataset. For stacked LSTMs, the performance even decreases slightly. This is probably because, with more deep layers, the model grows in complexity and has a larger chance of overfitting.
From Figure 10, we can see that d²RNN can effectively detect the violent scenes, which demonstrates the superiority of using individual orders of DoS. Stacked LSTMs, without using DoS, perform less reliably in recognizing violent activities, probably due to their weaker ability to detect motion saliency.
In Table 3, we compare the 3-layer d²RNN with the non-LSTM state-of-the-art algorithms on the Violent-Flows dataset. The d²RNN model outperforms the other state-of-the-art methods, which again demonstrates our model's effectiveness in learning more complex dynamical patterns via deep stacked layers and detecting spatiotemporal saliency via a hierarchy of DoS.
Methods                                              Accuracy (%)
Violent Flows (Hassner et al., 2012)                     81.30
Common Measure (Mousavi et al., 2015b)                   81.50
Hist of Tracklet (Mousavi et al., 2015a)                 82.30
Substantial Derivative (Mohammadi et al., 2015)          85.43
Holistic Features (Marsden et al., 2016)                 85.53
3-layer d²RNN                                            86.58
6. Conclusion
In this paper, we present a novel LSTM model, the deep differential Recurrent Neural Network (d²RNN), which integrates stacked LSTMs and the Derivative of State (DoS). Instead of simply stacking homogeneous LSTM layers, d²RNN stacks multiple levels of LSTM cells with individual and increasing orders of DoS. Our model inherits the strength of stacked LSTMs to model more complex dynamical patterns than the conventional LSTM. In addition, it gains the ability to detect salient spatiotemporal structures via the hierarchy of DoS. On the other hand, d²RNN differs substantially from dRNN. Instead of using the combination of different orders of DoS, which has been shown to be suboptimal, our model modulates the LSTM gates with individual orders of DoS and mitigates the problem of information distortion. We demonstrate our model's superiority on human activity datasets by showing that d²RNN outperforms LSTM, dRNN, and stacked LSTMs. Even in comparison with the other state-of-the-art methods, which are based on strong assumptions about the motion structure of the activities being studied, the general-purpose d²RNN model still demonstrates competitive performance. In future work, we will explore the potential of d²RNN in broader applications, such as speech recognition, music synthesis, online handwriting recognition, video captioning, and gesture recognition.
References
 Baccouche et al. (2010) Moez Baccouche, Franck Mamalet, Christian Wolf, Christophe Garcia, and Atilla Baskurt. 2010. Action classification in soccer videos with long short-term memory recurrent neural networks. In International Conference on Artificial Neural Networks. Springer, 154–159.
 Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).
 Cheng et al. (2014) Zhongwei Cheng, Lei Qin, Qingming Huang, Shuicheng Yan, and Qi Tian. 2014. Recognizing human group action by layered model with multiple cues. Neurocomputing 136 (2014), 124–135.

 Cho et al. (2015) Nam-Gyu Cho, Young-Ji Kim, Unsang Park, Jeong-Seon Park, and Seong-Whan Lee. 2015. Group activity recognition with group interaction zone based on relative distance between human objects. International Journal of Pattern Recognition and Artificial Intelligence 29, 05 (2015), 1555007.
 Chu et al. (2012) Hang Chu, Weiyao Lin, Jianxin Wu, Xingtong Zhou, Yuanzhe Chen, and Hongxiang Li. 2012. A new heatmap-based algorithm for human group activity recognition. In Proceedings of the 20th ACM international conference on Multimedia. ACM, 1069–1072.

 Cuéllar et al. (2007) Manuel P Cuéllar, Miguel Delgado, and MC Pegalajar. 2007. An application of nonlinear programming to train recurrent neural networks in time series prediction problems. In Enterprise Information Systems VII. Springer, 95–102.
 Donahue et al. (2015) Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. 2015. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE conference on computer vision and pattern recognition. 2625–2634.
 Epperson (2013) James F Epperson. 2013. An introduction to numerical methods and analysis. John Wiley & Sons.
 Freund and Schapire (1997) Yoav Freund and Robert E Schapire. 1997. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences 55, 1 (1997), 119–139.
 Graves et al. (2007) Alex Graves, Santiago Fernández, and Jürgen Schmidhuber. 2007. Multi-Dimensional Recurrent Neural Networks. CoRR abs/0705.2011 (2007). arXiv:0705.2011 http://arxiv.org/abs/0705.2011
 Graves et al. (2013) Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 6645–6649.
 Greff et al. (2017) Klaus Greff, Rupesh K Srivastava, Jan Koutník, Bas R Steunebrink, and Jürgen Schmidhuber. 2017. LSTM: A search space odyssey. IEEE Transactions on Neural Networks and Learning Systems 28, 10 (2017), 2222–2232.
 Grushin et al. (2013) Alexander Grushin, Derek D Monner, James A Reggia, and Ajay Mishra. 2013. Robust human action recognition via long short-term memory. In Neural Networks (IJCNN), The 2013 International Joint Conference on. IEEE, 1–8.
 Hassner et al. (2012) Tal Hassner, Yossi Itcher, and Orit Kliper-Gross. 2012. Violent flows: Real-time detection of violent crowd behavior. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2012 IEEE Computer Society Conference on. IEEE, 1–6.
 Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9, 8 (1997), 1735–1780.

 Irsoy and Cardie (2014) Ozan Irsoy and Claire Cardie. 2014. Opinion mining with deep recurrent neural networks. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). 720–728.
 Kalchbrenner et al. (2015) Nal Kalchbrenner, Ivo Danihelka, and Alex Graves. 2015. Grid long short-term memory. arXiv preprint arXiv:1507.01526 (2015).
 Klaser et al. (2008) Alexander Klaser, Marcin Marszałek, and Cordelia Schmid. 2008. A spatio-temporal descriptor based on 3d-gradients. In BMVC 2008 - 19th British Machine Vision Conference. British Machine Vision Association, 275–1.
 Marsden et al. (2016) Mark Marsden, Kevin McGuinness, Suzanne Little, and Noel E O’Connor. 2016. Holistic features for real-time crowd behaviour anomaly detection. In Image Processing (ICIP), 2016 IEEE International Conference on. IEEE, 918–922.
 Mohammadi et al. (2015) Sadegh Mohammadi, Hamed Kiani, Alessandro Perina, and Vittorio Murino. 2015. Violence detection in crowded scenes using substantial derivative. In Advanced Video and Signal Based Surveillance (AVSS), 2015 12th IEEE International Conference on. IEEE, 1–6.
 Mousavi et al. (2015a) Hossein Mousavi, Sadegh Mohammadi, Alessandro Perina, Ryad Chellali, and Vittorio Murino. 2015a. Analyzing tracklets for the detection of abnormal crowd behavior. In Applications of Computer Vision (WACV), 2015 IEEE Winter Conference on. IEEE, 148–155.
 Mousavi et al. (2015b) Hossein Mousavi, Moin Nabi, Hamed Kiani, Alessandro Perina, and Vittorio Murino. 2015b. Crowd motion monitoring using tracklet-based commotion measure. In Image Processing (ICIP), 2015 IEEE International Conference on. IEEE, 2354–2358.
 Ni et al. (2009) Bingbing Ni, Shuicheng Yan, and Ashraf Kassim. 2009. Recognizing human group activities with localized causalities. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 1470–1477.
 Schuster and Paliwal (1997) Mike Schuster and Kuldip K Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45, 11 (1997), 2673–2681.
 Scovanner et al. (2007) Paul Scovanner, Saad Ali, and Mubarak Shah. 2007. A 3dimensional sift descriptor and its application to action recognition. In Proceedings of the 15th ACM international conference on Multimedia. ACM, 357–360.
 Shao et al. (2014) Jing Shao, Chen Change Loy, and Xiaogang Wang. 2014. Scene-independent group profiling in crowd. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2219–2226.
 Simonyan and Zisserman (2014) Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
 Su et al. (2016) Hang Su, Yinpeng Dong, Jun Zhu, Haibin Ling, and Bo Zhang. 2016. Crowd Scene Understanding with Coherent Recurrent Neural Networks.. In IJCAI, Vol. 2. 5.
 Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems. 3104–3112.
 Veeriah et al. (2015) Vivek Veeriah, Naifan Zhuang, and Guo-Jun Qi. 2015. Differential recurrent neural networks for action recognition. In Computer Vision (ICCV), 2015 IEEE International Conference on. IEEE, 4041–4049.
 Venugopalan et al. (2014) Subhashini Venugopalan, Huijuan Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney, and Kate Saenko. 2014. Translating videos to natural language using deep recurrent neural networks. arXiv preprint arXiv:1412.4729 (2014).
 Zhu et al. (2011) Guangyu Zhu, Shuicheng Yan, Tony X Han, and Changsheng Xu. 2011. Generative group activity analysis with quaternion descriptor. In International Conference on Multimedia Modeling. Springer, 1–11.

 Zhu et al. (2015) Xiaodan Zhu, Parinaz Sobihani, and Hongyu Guo. 2015. Long short-term memory over recursive structures. In International Conference on Machine Learning. 1604–1612.
 Zhuang et al. (2017) Naifan Zhuang, Tuoerhongjiang Yusufu, Jun Ye, and Kien A Hua. 2017. Group Activity Recognition with Differential Recurrent Convolutional Neural Networks. In Automatic Face & Gesture Recognition (FG 2017), 2017 12th IEEE International Conference on. IEEE, 526–531.