StartNet: Online Detection of Action Start in Untrimmed Videos

We propose StartNet to address Online Detection of Action Start (ODAS), where action starts and their associated categories are detected in untrimmed, streaming videos. Previous methods aim to localize action starts by learning feature representations that can directly separate the start point from its preceding background. This is challenging due to the subtle appearance difference near action starts and the lack of training data. Instead, StartNet decomposes ODAS into two stages: action classification (using ClsNet) and start point localization (using LocNet). ClsNet focuses on per-frame labeling and predicts action score distributions online. Based on the predicted action scores of the past and current frames, LocNet conducts class-agnostic start detection by optimizing long-term localization rewards using policy gradient methods. The proposed framework is validated on two large-scale datasets, THUMOS'14 and ActivityNet. The experimental results show that StartNet significantly outperforms the state-of-the-art by 15%-30% p-mAP under offset tolerances of 1-10 seconds on THUMOS'14, and achieves comparable performance on ActivityNet with a 10 times smaller time offset.

1 Introduction

Figure 1: Comparison between (a) the previous method [29] and (b) the proposed framework. [29] aims to generate an action score sequence that produces a low score for background and a high score for the correct action immediately when the action starts, like a step function. We propose a two-stage framework: the first stage only focuses on per-frame action classification, and the second stage learns to localize the start points given the historical trend of the action scores generated by the first stage.

Temporal action localization (TAL) in untrimmed videos has been widely studied in offline settings, where start and end times of an action are recognized after the action is fully observed [30, 39, 8, 4, 13, 7]. With emerging applications that require identifying actions in real time, e.g., autonomous driving, surveillance systems, and collaborative robots, online action detection (OAD) methods [9, 12, 29, 38] have been proposed. They typically pose the TAL problem as a per-frame class labeling task.

However, in some time-sensitive scenarios, detecting accurate action starts in a timely manner is more important than successfully detecting every frame containing actions. For example, an autonomous driving car needs to detect the start of “pedestrian crossing” as soon as it happens to avoid collision; a surveillance system should generate an alert as soon as a dangerous event is initiated. Online Detection of Action Start (ODAS) was proposed to address this problem specifically [29]. Instead of classifying every frame, ODAS detects the occurrence and category of an action start as soon as possible. Thus, it addresses two sub-tasks: (i) whether an action starts at time t and (ii) its associated action class.

The existing method [29] handles the two sub-tasks jointly by training a classification network that is capable of localizing the starts of different action classes. The network attempts to make the representation of a start point close to that of its associated action class and far from its preceding background. As shown in Fig. 1 (a), the network is encouraged to react immediately when an action starts. However, it is hard to achieve this goal due to the subtle appearance differences near start points and the lack of labeled training data (each action instance contains only one start point).

Our method is inspired by three key insights. First, decomposing a complex task properly allows sub-modules to focus on their own sub-tasks and makes the learning process easier. A good example is the success of the two-stage object detection framework [16, 15, 27]. Second, as mentioned in [16], when training data is scarce, learning from a representation that is pre-trained on an auxiliary task may lead to a significant performance boost. Third, OAD (per-frame labeling) is closely related to ODAS. Compared to the scarce labeled data of action starts, the amount of per-frame action labels is much larger. Thus, there are potential benefits in taking advantage of the per-frame labeling task.

Instead of focusing on learning the subtle differences near start points, we propose an alternative framework, StartNet, and address ODAS in two stages: classification (using ClsNet) and localization (using LocNet). ClsNet conducts per-frame labeling as an auxiliary task based on spatial-temporal feature aggregation from input videos, and generates score distributions of action classes as a high-level representation. Based on the historical trend of the score distributions, LocNet predicts a class-agnostic start probability at each time (see Fig. 1 (b)). At the end, late fusion is applied to the outputs of both modules to generate the final result. When designing LocNet, we consider the implicit temporal constraint between action starts: two start points are unlikely to be close to each other. To impose this temporal constraint under the online setting, historical decisions are taken into account for later predictions. To optimize the long-term reward for start detection, LocNet is trained using reinforcement learning techniques. The proposed framework and its variants are validated on THUMOS’14 [21] and ActivityNet [11]. Experimental results show that our approach significantly outperforms the state-of-the-art by 15%-30% p-mAP under offsets of 1-10 seconds on THUMOS’14, and achieves comparable p-mAP with a 10 times smaller time offset on ActivityNet.

2 Related Work

Temporal Action Detection. Most existing methods [30, 39, 8, 4, 13, 7] on temporal action detection formulate the problem in an offline manner. These methods segment actions from long, untrimmed videos and require observing the entire video before making a decision. S-CNN [30] localizes actions with three stages: action proposal generation, proposal classification, and proposal regression. Dai et al. [8] propose TCN, which incorporates local context of each proposal for proposal ranking. By sharing features between proposal generation and classification, R-C3D [37] reduces computational cost significantly. Buch et al. [4] propose an efficient proposal generation model that avoids working on overlapping regions. Instead of treating temporal action detection as segment-level classification, Shou et al. [28] propose the CDC network to produce per-frame predictions using 3D convolutional networks.

Online Action Detection. Online action detection is usually solved as a per-frame labeling task [9] on live, streaming videos. As soon as a video frame arrives, it is classified as an action class or background without accessing future frames. De Geest et al. [9] first introduced the problem and proposed several models as baselines. Gao et al. [12] propose a Reinforced Encoder-Decoder network for action anticipation and treat online action detection as a special case of their framework. Temporal Recurrent Networks [38] set a new state-of-the-art performance by conducting current and future action detection jointly. With the same goal of online per-frame labeling, these methods can serve as ClsNet in our framework.

Early Action Detection. Early action detectors detect actions after only processing a fraction of a video. The earlier a detector recognizes an action, the better it performs. Hoai and De la Torre [18] solve this problem by proposing a max-margin framework with structured SVMs. However, this method only works in simple scenarios, e.g., where one video contains only one action. Ma et al. [24] design a ranking loss for training, assuming that the gaps of predicted scores between correct and incorrect actions should be non-decreasing when a model observes more of an activity.

Online Detection of Action Start (ODAS). As with early action detection, ODAS also aims to recognize actions as soon as possible. Specifically, it focuses on detecting action starts and tries to minimize the time delay of identifying the start point of an action. To the best of our knowledge, [29] is the first and only work that is designed to address ODAS. They solve the problem by encouraging a classification network to learn a representation that can separate action starts from their preceding backgrounds. To achieve the goal, they force the learned representation of an action start window to be similar to that of the following action window and different from that of the preceding background.

Sequential Search with RL. Reinforcement learning (RL) techniques are popular for sequential search problems, since RL allows models to be optimized for long-term rewards. Caicedo and Lazebnik [5] propose a framework based on Deep Q-learning [26] that transforms an initial bounding box iteratively until it lands on an object. In order to speed up object detection on large images, Gao et al. [14] design a coarse-to-fine framework, also based on Deep Q-learning, that sequentially selects regions to zoom in only when needed. Wu et al. [35] propose BlockDrop, which is trained with policy gradient [32] and improves computational efficiency by dropping unnecessary blocks of ResNets [17]. AdaFrame [36] is also optimized with policy gradient to reduce the computation of LSTMs by skipping input frames.

3 Action Start Detection Network (StartNet)

Figure 2: Our method works in two stages with ClsNet and LocNet. ClsNet: at time t, features are extracted by deep convolutional networks and input to a one-layer LSTM; the LSTM generates action score distributions at each time step, and ClsNet is optimized with a cross-entropy loss between action labels and the generated action scores. LocNet: after action score generation, the scores are input, together with a historical decision vector H, to a second one-layer LSTM which works as an agent to generate a two-dimensional start probability sequentially; H is updated and the state is changed accordingly; the agent is trained using a policy gradient mechanism to optimize the long-term reward of start localization. At the end, results from ClsNet and LocNet are fused to obtain the final action start detection results at each time step. Here, ClsNet is implemented with an LSTM. CNN and C3D can also be used to construct ClsNet (see Sec. 3.1 for details).

The input of an ODAS system is untrimmed, streaming video frames. The system processes each video frame sequentially and detects the start of each action instance. At time step t, it outputs a probability distribution indicating the start probability of each action class, without accessing any future information.

The overview of the proposed framework is illustrated in Fig. 2. The framework contains two sub-networks, i.e., a classification network (ClsNet) and a localization network (LocNet). ClsNet focuses on per-frame class labeling. It takes the raw video frames as input and outputs action class probabilities at every time step in an online manner. ClsNet serves two purposes. First, it learns a simpler but useful representation for localizing action starts. Second, the classification results can be combined later with the localization results to produce the action starts for each class. LocNet takes the output of ClsNet together with the historical decision vector as inputs. At each time step, it outputs a two-dimensional probability distribution indicating the probability that this frame contains an action start. The historical decision vector records its predictions in the previous steps in order to model the effect of historical decisions on later ones. Finally, the results of the two networks are fused to construct the final output.
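To make the data flow concrete, below is a minimal sketch of the online loop implied by this design; it is not the released implementation, and `cls_net`, `loc_net`, and `fuse` are placeholder callables standing in for the modules described in Sec. 3.1-3.3.

```python
from typing import Callable, Iterable, List

def detect_starts_online(
    frames: Iterable,          # streaming video frames (or short chunks)
    cls_net: Callable,         # frame -> per-class action scores (stage 1)
    loc_net: Callable,         # (scores, history) -> class-agnostic start probability (stage 2)
    fuse: Callable,            # (scores, start probability) -> per-class start scores
    history_len: int = 8,
) -> List:
    """Process frames strictly in arrival order; no future frames are accessed."""
    history = [0.0] * history_len            # historical decision vector H, initialized to zeros
    outputs = []
    for frame in frames:
        scores = cls_net(frame)              # per-frame classification
        start_prob = loc_net(scores, history)
        history = history[1:] + [start_prob] # roll the decision history
        outputs.append(fuse(scores, start_prob))
    return outputs
```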

3.1 Classification Network (ClsNet)

Inspired by recent online action detection methods [9, 12, 38], we utilize recurrent networks, specifically an LSTM [19], to construct ClsNet. At each time t, it uses the previous hidden state h_{t-1}, the previous cell state c_{t-1}, and the feature f_t extracted from the current video frame as inputs to update its hidden state h_t and cell state c_t. Then, the likelihood distribution over all the action classes is obtained as in Eq. 1,

p_t = softmax(W_c h_t + b_c)    (1)

where W_c and b_c are the parameters of a fully-connected layer, p_t is a K-dimensional vector, and K indicates the number of action classes including background.

To learn ClsNet, an action class label for each frame is needed. The cross-entropy loss L_cls(θ_c) is used for optimization during training, where θ_c represents the parameter set of ClsNet.

We observe that ClsNet can be implemented with different architectures. Thus, we validate our framework using two additional structures as the backbone of ClsNet, i.e., CNN and C3D [33]. CNN conducts action classification based only on the arriving frame. It focuses on the spatial information of the current frame without considering temporal patterns of actions. C3D labels each temporal segment consisting of 16 consecutive video frames, from frame t-15 to the current frame t. It captures spatial and temporal information jointly using 3D convolutional operations. Comparisons and explanations are discussed in Sec. 4.
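As a concrete illustration, the following is a minimal PyTorch sketch of an LSTM-based ClsNet consistent with Eq. 1; the hidden size and layer names are illustrative assumptions rather than the exact configuration used in our experiments.

```python
import torch
import torch.nn as nn

class ClsNet(nn.Module):
    """Per-frame action classifier: a one-layer LSTM over chunk features plus a softmax head (Eq. 1)."""
    def __init__(self, feat_dim: int, num_classes: int, hidden_size: int = 512):
        super().__init__()
        self.lstm = nn.LSTMCell(feat_dim, hidden_size)   # one-layer LSTM
        self.fc = nn.Linear(hidden_size, num_classes)    # W_c, b_c over K classes (incl. background)

    def forward(self, feats, state=None):
        # feats: (T, B, feat_dim) features of streaming frames/chunks, processed causally.
        T, B, _ = feats.shape
        if state is None:
            h = feats.new_zeros(B, self.lstm.hidden_size)
            c = feats.new_zeros(B, self.lstm.hidden_size)
        else:
            h, c = state
        scores = []
        for t in range(T):
            h, c = self.lstm(feats[t], (h, c))                 # update h_t, c_t
            scores.append(torch.softmax(self.fc(h), dim=-1))   # p_t in Eq. 1
        return torch.stack(scores), (h, c)

# Training minimizes the per-frame cross-entropy between p_t and the frame's action label.
```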

3.2 Localization Network (LocNet)

As discussed in Sec. 1, historical action scores can provide useful cues for identifying action starts. At time t, LocNet observes the action score distribution p_t over the K classes obtained from ClsNet and outputs a two-dimensional vector indicating the start and non-start probability distribution.

The start probability is generated sequentially. In general, if an action starts at time step t, there is a low probability that another action also starts shortly after t, given a reasonable frame rate (FPS). Thus, there are implicit temporal constraints between nearby start points. To enable the model to consider constraints between decisions, we record the historical decisions made by LocNet and use this history to influence later decisions. To enable long-term decision planning, we formulate the problem as a Markov Decision Process (MDP) and use reinforcement learning to optimize our model. When making a decision (the term “action” is generally used in reinforcement learning; we use “decision” instead to avoid confusion with action classes), the model not only considers the effect of the decision at the current step, but also how it will influence later ones, by maximizing the expected long-term reward. In the following, we first discuss the inference phase of LocNet and then the training phase in detail.

3.2.1 Inference Phase

LocNet is built upon an LSTM structure. It acts as an agent which interacts with historical action scores recurrently. During testing, at each state, the agent makes a decision d_t (predicts the start probability) that produces the maximum expected long-term reward and updates the state according to the decision. To model the dependency between decisions, we incorporate the record of historical decisions (the decisions made by the agent at previous steps) as a part of the state. The state update procedure is described in Eq. 2 and 3, where H_{t-k:t-1} indicates the historical decisions from step t-k to t-1 and ⊕ indicates concatenation of vectors. At the beginning, H is initialized with zeros.

s_t = p_t ⊕ H_{t-k:t-1}    (2)
H_{t-k+1:t} = H_{t-k+1:t-1} ⊕ d_t    (3)
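A minimal PyTorch sketch of this inference step is given below; the hidden size, the two-way softmax head, and the fixed-length rolling history are illustrative assumptions consistent with the description above, not the exact released architecture.

```python
import torch
import torch.nn as nn

class LocNet(nn.Module):
    """Class-agnostic start detector over ClsNet score histories (Eqs. 2-3, sketch)."""
    def __init__(self, num_classes: int, history_len: int = 8, hidden_size: int = 128):
        super().__init__()
        self.history_len = history_len
        self.lstm = nn.LSTMCell(num_classes + history_len, hidden_size)
        self.head = nn.Linear(hidden_size, 2)          # (non-start, start) probabilities

    def step(self, p_t, history, state=None):
        # p_t: (B, K) action scores from ClsNet; history: (B, history_len) past decisions H.
        if state is None:
            zeros = p_t.new_zeros(p_t.size(0), self.lstm.hidden_size)
            state = (zeros, zeros)
        s_t = torch.cat([p_t, history], dim=-1)        # Eq. 2: state = scores ⊕ history
        h, c = self.lstm(s_t, state)
        start_prob = torch.softmax(self.head(h), dim=-1)[:, 1]
        return start_prob, (h, c)

# The caller rolls the historical decision vector H (Eq. 3), e.g.
#   history = torch.cat([history[:, 1:], decision.unsqueeze(-1)], dim=-1)
```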

3.2.2 Training Phase

We train an agent that acts optimally based on the state of the environment. The goal is to maximize the reward by changing the predicted start probability distribution: at a given state, the start probability should be increased when the decision brings a bigger reward and decreased otherwise. The start prediction procedure is formulated as a decision-making policy defined using a Gaussian distribution. Following [25, 36], the policy is defined as π(d_t | s_t) = N(a_t, σ²), where the decision d_t is sampled from this Gaussian and a_t indicates the output start probability that determines the distribution.

Reward function. Each decision at a given state is associated with an immediate reward that measures the decision made by the agent at the current time. With the goal of localizing start points, we define the immediate reward function in Eq. 4, where g_t indicates the ground-truth label of action start and d_t is the sampled start probability. The reward function encourages a high probability when there is an actual start and a low probability when there is not by giving a negative reward. Considering the sample imbalance between start points and background, weighted rewards are used by setting a parameter α. In particular, we set α to the ratio between the number of negative samples and the number of positive samples for each dataset.

r_t = α · d_t if g_t = 1;  r_t = −d_t if g_t = 0    (4)
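A direct translation of this reward into code could look like the following sketch (using the case structure of Eq. 4 as reconstructed above):

```python
def immediate_reward(d_t: float, g_t: int, alpha: float) -> float:
    """Reward the sampled start probability d_t when a ground-truth start is present (g_t = 1),
    penalize it otherwise; alpha is the negative-to-positive sample ratio countering imbalance."""
    return alpha * d_t if g_t == 1 else -d_t
```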

The long-term reward is the summation of discounted future rewards. In order to maximize the expected long-term reward, the policy is trained by maximizing the objective in Eq. 5, where θ_l represents the parameters of the network and γ is a constant scalar for calculating the discounted rewards over time.

J(θ_l) = E_π [ Σ_t γ^(t−1) r_t ]    (5)

Optimization. When optimizing Eq. 5, it is not possible to train the network using error back propagation directly, since the objective is not differentiable. Following [32], we use policy gradient to calculate the expected gradient of J(θ_l) as in Eq. 6, where R_t indicates the long-term reward at time step t and b_t is a baseline value which is widely used in policy gradient frameworks to reduce the variance of the gradient. The principle of policy gradient is to maximize the probability of a decision with high reward given a state. The baseline value encourages the model to be optimized in the direction of performance improvement.

∇_{θ_l} J = E [ Σ_t ∇_{θ_l} log π_{θ_l}(d_t | s_t) (R_t − b_t) ]    (6)

Following [36], we use the expected long-term reward at the current state as the baseline value and approximate it by minimizing the mean squared error, L_b, between the predicted baseline b_t and the observed long-term reward R_t. The training procedure of LocNet is summarized in Alg. 1.

Initialize parameters θ_l of LocNet
for iteration = 1 to M do
     Obtain training sequence samples of length T
     for t = 1 to T do
          Obtain a_t based on the current policy
          Sample decisions: d_t ~ N(a_t, σ²)
          Obtain r_t and b_t for each sample
     end for
     Compute R_t, the policy gradient in Eq. 6, and L_b
     Update parameters θ_l of LocNet
end for
Algorithm 1 Training Process of LocNet
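The following is a compact PyTorch sketch of one iteration of Algorithm 1, reusing the LocNet sketch from Sec. 3.2.1; the Gaussian standard deviation, the separate baseline head, and the loss weighting are illustrative assumptions rather than the exact hyperparameters of the paper.

```python
import torch
import torch.nn.functional as F
from torch.distributions import Normal

def train_sequence(loc_net, baseline_head, optimizer, score_seq, start_labels,
                   alpha: float, gamma: float, sigma: float = 0.1):
    """One policy-gradient update over a training sequence (sketch of Alg. 1).
    score_seq: (T, B, K) ClsNet scores; start_labels: (T, B) 0/1 ground-truth start labels."""
    T, B, _ = score_seq.shape
    history = score_seq.new_zeros(B, loc_net.history_len)   # H initialized with zeros
    state = None
    log_probs, rewards, baselines = [], [], []

    for t in range(T):
        a_t, state = loc_net.step(score_seq[t], history, state)   # policy mean (output start prob)
        dist = Normal(a_t, sigma)                                  # Gaussian policy, assumed fixed sigma
        d_t = dist.sample().clamp(0.0, 1.0)                        # sampled decision
        log_probs.append(dist.log_prob(d_t))
        rewards.append(torch.where(start_labels[t] == 1, alpha * d_t, -d_t))  # Eq. 4
        baselines.append(baseline_head(state[0]).squeeze(-1))     # b_t: expected return at the state
        history = torch.cat([history[:, 1:], d_t.unsqueeze(-1)], dim=-1)      # Eq. 3

    # Discounted long-term rewards R_t (Eq. 5), accumulated backwards in time.
    R, returns = torch.zeros_like(rewards[0]), []
    for r_t in reversed(rewards):
        R = r_t + gamma * R
        returns.insert(0, R)
    returns, baselines, log_probs = map(torch.stack, (returns, baselines, log_probs))

    policy_loss = -(log_probs * (returns - baselines).detach()).mean()   # Eq. 6
    baseline_loss = F.mse_loss(baselines, returns.detach())              # L_b
    loss = policy_loss + baseline_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```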

The full objective, including the classification loss of ClsNet, is shown in Eq. 7, where λ1 and λ2 are constant scalars.

L = L_cls + λ1 · L_loc + λ2 · L_b,  with L_loc = −J(θ_l)    (7)

3.3 Late Fusion

ClsNet outputs an action score distribution and LocNet produces class-agnostic start probabilities at each time step. Then, late fusion is applied to obtain the start probability of each action class following Eq. 8, where superscript c > 0 indicates positive action classes and c = 0 indicates background.

s_t^c = p_t^c · a_t,  c ∈ {1, …, K−1}    (8)

Action start generation. Following [29], a final action start is generated online if all three conditions are satisfied: (i) the class predicted at the current time is an action, not background; (ii) the predicted class differs from that of the previous time step; and (iii) the fused start score exceeds a threshold. We set this threshold to 0 by default. An action score sequence generated by ClsNet alone can also produce action start points online following this procedure. LocNet locally adjusts the start points by boosting time points with higher start probabilities and suppressing those with lower start probabilities.
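A minimal sketch of the fusion and the online start-generation rules (with condition (ii) as reconstructed above) could look like:

```python
import torch

def generate_starts(cls_scores, start_probs, threshold: float = 0.0):
    """cls_scores: (T, K) ClsNet scores with class 0 as background;
    start_probs: (T,) class-agnostic start probabilities from LocNet.
    Returns (t, class, fused score) triples for detected action starts (sketch of Sec. 3.3)."""
    fused = cls_scores[:, 1:] * start_probs.unsqueeze(-1)   # Eq. 8: fuse positive classes only
    starts = []
    prev_class = 0                                          # assume background before the video
    for t in range(cls_scores.size(0)):
        c = int(cls_scores[t].argmax())                     # per-frame predicted class
        if c != 0 and c != prev_class and float(fused[t, c - 1]) > threshold:
            starts.append((t, c, float(fused[t, c - 1])))   # conditions (i)-(iii)
        prev_class = c
    return starts
```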

4 Experiments

To validate the proposed framework, we conduct extensive experiments on two large-scale action recognition datasets, i.e., THUMOS’14 [21] and ActivityNet v1.3 [11].

Evaluation protocol. To permit fair comparisons, we use the point-level average precision (p-AP) proposed in [29] to evaluate our framework. Under this protocol, each action start prediction is associated with a time point. For each action class, predictions of all frames are first sorted in descending order based on their confidence scores and then measured accordingly. An action start prediction is counted as correct only if it matches the correct action class and its temporal distance from a ground-truth point is smaller than an offset threshold (offset tolerance). Similar to segment-level average precision, no duplicate detections are allowed for the same ground-truth point. p-mAP is then calculated by averaging p-AP over all the action classes.

Following [29], we use two metrics based on p-AP to evaluate our framework on THUMOS’14. First, we use p-AP under different offset tolerances, varying from 1 to 10 seconds. Also, we adopt the metric AP depth at recall (Rec) X, which averages p-AP on the Precision-Recall curve over recall rates from 0 to X. p-mAPs under different offset thresholds are then averaged to obtain the final average p-mAP at each depth. This metric is particularly used to evaluate top-ranked predictions and to measure what precision a system can achieve if low recall is allowed. For ActivityNet, we evaluate our methods using p-mAP under offset thresholds of 1-10 seconds at depth Rec = 1.0.
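As a concrete illustration of the matching step, a simplified sketch of point-level AP for a single class (greedy matching, no duplicate detections per ground-truth start) might look like:

```python
def point_level_ap(predictions, gt_times, offset: float) -> float:
    """predictions: list of (time, score) start detections for one class;
    gt_times: ground-truth start times for that class.
    A prediction is correct if it lies within `offset` seconds of an unmatched ground truth.
    This is a simplified sketch of the p-AP protocol (interpolation details omitted)."""
    predictions = sorted(predictions, key=lambda p: p[1], reverse=True)  # sort by confidence
    matched = set()
    tp, precisions = 0, []
    for rank, (t, _score) in enumerate(predictions, start=1):
        hit = next((j for j, g in enumerate(gt_times)
                    if j not in matched and abs(t - g) <= offset), None)
        if hit is not None:
            matched.add(hit)
            tp += 1
            precisions.append(tp / rank)        # precision at each new true positive
    return sum(precisions) / max(len(gt_times), 1)
```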

Baselines. We compare the proposed framework with the state-of-the-art method, i.e., Shou et al. [29], and two baselines that were presented in [29], i.e., SceneDetect and ShotDetect. The numbers were obtained from the authors [29]. Comparison results with Shou et al. [29] demonstrate the superior performance of StartNet. SceneDetect and ShotDetect are also two-stage methods. Similar to two-stage frameworks of object detection, they first conduct localization by generating action start proposals with shot/scene boundary detectors, and then classify them into different classes. Comparison with SceneDetect and ShotDetect shows the effectiveness of our decomposition design. Our framework trained by policy gradient is indicated by StartNet-PG.

Offsets (second) 1 2 3 4 5 6 7 8 9 10
Baselines SceneDetect [1] 1.0 2.0 2.3 3.1 3.6 4.1 4.7 5.0 5.1 5.2
ShotDetect [2] 1.1 1.9 2.3 3.0 3.4 3.9 4.3 4.5 4.6 4.9
Shou et al. [29] 3.1 4.3 4.7 5.4 5.8 6.1 6.5 7.2 7.6 8.2
StartNet-PG C3D [33] + LocNet 6.8 8.0 9.4 10.1 10.6 10.9 10.9 11.1 11.2 11.2
CNN [34] + LocNet 17.0 23.6 27.6 29.9 31.3 32.1 33.2 33.5 33.9 34.5
LSTM [19] + LocNet 19.5 27.2 30.8 33.9 36.5 37.5 38.3 38.8 39.5 39.8
Table 1: Comparisons using p-mAP at depth Rec = 1.0 on THUMOS’14. Results are under different offset thresholds. ClsNet is implemented with different structures, i.e., C3D, CNN and LSTM. CNN and LSTM use TS features.
Depth Rec. @0.1 @0.2 @0.3 @0.4 @0.5 @0.6 @0.7 @0.8 @0.9 @1.0
Baselines SceneDetect [1] 30.0 18.3 12.2 9.1 7.2 6.1 5.2 4.6 4.0 3.6
ShotDetect [2] 26.3 15.9 11.3 8.6 6.8 5.8 4.9 4.3 3.8 3.4
Shou et al. [29] 42.7 27.3 19.8 14.9 11.8 10.0 8.5 7.4 6.6 5.9
StartNet-PG C3D [33] + LocNet 34.8 27.7 22.6 19.0 16.3 14.4 12.9 11.8 10.8 10.0
CNN [34] + LocNet 71.8 64.7 58.0 52.4 47.2 43.3 39.5 35.9 32.5 29.6
LSTM [19] + LocNet 77.4 70.2 64.5 59.1 54.2 49.3 45.1 41.2 37.6 34.2
Table 2: Comparisons using average p-mAP at different depths on THUMOS’14. Average p-mAP means averaging p-mAP over offsets from 1 to 10 seconds. ClsNet is implemented with different structures, i.e., C3D, CNN and LSTM. CNN and LSTM use TS features.

Implementation details. Following [38, 12, 29], decisions are made on short temporal chunks, where each chunk is represented by its central frame. The appearance feature (RGB) is extracted from the central frame and the motion feature (optical flow) is computed using the whole chunk as input. Following [38, 12], the chunk size is fixed to 6 and image frames are obtained at 24 FPS. Two adjacent chunks do not overlap, so there are exactly 4 chunks per second. Following [38], for ClsNet, we fix the size of the LSTM's hidden state and set the length of each training sequence to 64. When using CNN, we finetune a fully-connected (FC) layer with different CNN features as input (see the feature descriptions for each dataset). C3D is pretrained on Sports-1M [22] and finetuned for the per-frame labeling task on each dataset. The hidden state size of LocNet and the length of its training sequences are also fixed. Following [36], γ in Eq. 5 is fixed, and the length of the historical decision vector is set to 8. λ1 and λ2 in Eq. 7 are fixed constant scalars. We adopt an alternating strategy for classification and localization training: ClsNet is first trained and fixed afterwards, and then LocNet is trained upon the pre-trained ClsNet. We implement the models in PyTorch [3], and set the batch size to 32 for THUMOS’14 and 64 for ActivityNet. For parameter optimization, we use the Adam [23] optimizer with fixed learning rate and weight decay.
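To make the chunking concrete, a small sketch (hypothetical helper, following the non-overlapping 6-frame chunks at 24 FPS described above):

```python
def make_chunks(num_frames: int, chunk_size: int = 6):
    """Split a frame index range into non-overlapping chunks of `chunk_size` frames.
    With chunk_size=6 at 24 FPS, this yields exactly 4 chunks (decision points) per second.
    Each chunk is represented by (start_idx, center_idx, end_idx)."""
    return [(s, s + chunk_size // 2, s + chunk_size - 1)
            for s in range(0, num_frames - chunk_size + 1, chunk_size)]

# Example: one second of video at 24 FPS -> 4 chunks.
assert len(make_chunks(24)) == 4
```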

4.1 Experiments on THUMOS’14

Dataset. THUMOS’14 [21] is a popular benchmark for temporal action detection. It contains 20 action classes related to sports. The training set contains only trimmed videos, which makes it inappropriate for training ODAS methods. Following [29], we use the validation set (200 untrimmed videos, 3K action instances) for training and the test set (213 untrimmed videos, 3.3K action instances) for testing.

Feature description. Two types of features are adopted on the THUMOS’14 dataset: RGB and Two-Stream (TS) features. Following [12, 38], we extract the appearance (RGB) feature at the Flatten 673 layer of ResNet-200 [17] and the motion feature at the global pool layer of BN-Inception [20] with optical flows of consecutive frames as inputs. The TS feature is the concatenation of appearance and motion features, which are extracted with models pre-trained on ActivityNet (https://github.com/yjxiong/anet2016-cuhk).

Features Offsets (second) 1 2 3 4 5 6 7 8 9 10
RGB ClsNet-only 11.8 17.2 21.3 24.9 27.9 28.7 29.5 30.0 30.4 30.7
StartNet-CE 13.7 20.7 23.8 27.2 29.4 30.7 31.9 32.5 33.2 33.6
StartNet-PG 15.9 21.0 24.8 28.4 30.7 31.8 33.0 33.5 34.0 34.4
Two Stream ClsNet-only 13.9 21.6 25.8 28.9 31.1 32.5 33.5 34.3 34.8 35.2
StartNet-CE 17.4 25.4 29.8 33.0 34.6 36.3 37.2 37.7 38.6 38.8
StartNet-PG 19.5 27.2 30.8 33.9 36.5 37.5 38.3 38.8 39.5 39.8
Table 3: Ablation study of our framework using p-mAP at depth Rec = 1.0 on THUMOS’14. LSTM is used to implement ClsNet. Different offset thresholds are used to evaluate our framework with different features. Best performance is marked in bold.
Features Depth Rec. @0.1 @0.2 @0.3 @0.4 @0.5 @0.6 @0.7 @0.8 @0.9 @1.0
RGB ClsNet-only 71.2 61.1 52.8 47.0 42.0 37.7 34.0 30.6 27.5 25.3
StartNet-CE 73.2 64.5 56.8 50.2 45.1 40.5 36.6 33.5 30.5 27.7
StartNet-PG 73.6 65.0 58.0 51.2 45.9 41.5 37.8 34.3 31.5 28.8
Two Stream ClsNet-only 71.3 63.0 56.9 52.0 46.9 42.3 38.7 35.0 31.8 29.2
StartNet-CE 72.7 65.6 60.2 55.3 51.0 46.8 43.0 39.2 36.0 32.9
StartNet-PG 77.4 70.2 64.5 59.1 54.2 49.3 45.1 41.2 37.6 34.2
Table 4: Ablation study of our framework using average p-mAP at different depths on THUMOS’14. At each depth, we average p-mAP over offset thresholds from 1 to 10 seconds. LSTM is used to implement ClsNet. Best performance is marked in bold.
Offsets (second) 1 2 3 4 5 6 7 8 9 10
Baselines SceneDetect [1] – – – – – – – – – 4.7
ShotDetect [2] – – – – – – – – – 6.1
Shou et al. [29] – – – – – – – – – 8.3
StartNet ClsNet-only-VGG 2.7 4.1 5.1 5.9 6.7 7.5 8.1 8.7 9.2 9.8
StartNet-CE-VGG 4.2 6.1 7.4 8.7 9.7 10.5 11.4 12.0 12.6 13.1
StartNet-PG-VGG 6.0 7.6 8.8 9.8 10.7 11.5 12.2 12.6 13.1 13.5
ClsNet-only-TS 4.2 6.1 7.7 8.8 9.8 10.7 11.3 12.2 13.0 13.6
StartNet-CE-TS 6.0 8.3 10.1 11.7 12.9 13.9 15.0 15.8 16.7 17.5
StartNet-PG-TS 8.1 10.2 11.8 13.3 14.4 15.3 16.1 16.7 17.4 18.0
Table 5: Comparisons using p-mAP under varying offset thresholds at depth Rec = 1.0 on ActivityNet. ClsNet is implemented with LSTM. Numbers of baseline methods are cited from [29]. – indicates that numbers are not provided in [29].

4.1.1 Evaluation Results

Comparisons with previous methods are shown in Table 1 and Table 2. Table 1 shows comparisons based on p-mAP at depth Rec = 1.0 under different offset thresholds. All previous methods are below 3.2 p-mAP at a 1-second offset, while StartNet with LSTM achieves 19.5 p-mAP, outperforming the state-of-the-art by more than 16 p-mAP. At a 10-second offset, previous methods obtain less than 8.5 p-mAP and StartNet (LSTM) improves over Shou et al. [29] by more than 31 p-mAP. Table 2 shows comparisons based on average p-mAP (averaging over offsets from 1 to 10 seconds) at different depths. The results demonstrate that StartNet with LSTM outperforms previous methods significantly (by around 28-35 average p-mAP) at depths from Rec = 0.1 to Rec = 1.0. Under both metrics, StartNet outperforms previous methods by a very large margin.

Figure 3: Ablation study of LocNet: (a) effect of the length of the historical decision vector; (b) effect of different gamma values in Eq. 5. Generally, the model performs better with a bigger gamma and a longer historical decision vector.
Figure 4: Qualitative results on THUMOS’14 and ActivityNet after action start generation (see Sec. 3.3). A dash means that no starts are detected at those times. Numbers indicate the scores of detected action starts. Results of ClsNet and StartNet are marked in blue and red, respectively. Yes/No (ground-truth) indicates whether an action of the associated class starts at that time. Best viewed in color.

4.1.2 Ablation Experiments

ClsNet implemented with different structures. Comparisons among StartNet variants with different ClsNet backbones are shown in Table 1 and Table 2. LSTM+LocNet achieves the best performance among the three structures. It is worth noticing that C3D performs much worse than CNN and LSTM, which shows its disadvantage in the online action detection task. In the offline setting, C3D can observe the entire temporal context of an action before making a decision, but it has to recognize the occurring action based only on the preceding temporal segment when working online. Compared to LSTM, it has no recurrent structure to learn long-term patterns. Compared to CNN, it has more complicated operations and is more prone to overfitting. Shou et al. [29] chose C3D as the backbone and proposed sophisticated training strategies for optimization. However, C3D may not be suitable for the task according to our comparisons with other structures. Even with C3D, StartNet still significantly outperforms Shou et al. [29], which demonstrates the effectiveness of our framework. Since LSTM+LocNet achieves the best performance, the following ablation studies are conducted using ClsNet implemented with LSTM.

Effectiveness of LocNet. The results from ClsNet alone can be used to generate action starts by following the action start generation procedure in Sec. 3.3. To evaluate the contribution of LocNet, we construct ClsNet-only by removing LocNet from our framework. Results of ClsNet-only also demonstrate the performance of OAD methods if applied to the ODAS task directly. As shown in Table 3, ClsNet-only already achieves good results, outperforming C3D-based methods. When adding LocNet, StartNet-PG improves ClsNet-only by up to 5.6 p-mAP with TS features and up to 4.1 p-mAP with RGB features under varying offsets. We can also observe a trend that the gaps between StartNet-PG and ClsNet-only are larger when the offset is smaller. As shown in Table 4, StartNet-PG outperforms ClsNet-only by up to 7.6 average p-mAP with TS features and up to about 5 average p-mAP with RGB features at different depths. The qualitative comparison in Fig. 4 shows an example where ClsNet-only generates a false positive at the last frame, possibly because the frame contains a classic appearance of the action, i.e., Basketball Dunk. With the help of LocNet, the false positive is corrected by StartNet-PG.

Effectiveness of long-term planning. In order to investigate the effect of long-term planning, we replace the policy gradient training strategy with a simple per-frame cross-entropy loss, such that every frame is considered independently. This baseline is referred to as StartNet-CE. Similar to StartNet-PG, a weight factor α is used to handle sample imbalance. As in Eq. 4, we set α equal to the ratio between the number of negative samples and the number of positive ones. As shown in Table 3 and 4, StartNet-PG significantly outperforms StartNet-CE under each offset threshold and at different depths, which proves the usefulness of long-term planning.

In order to further investigate the effects of parameter settings for LocNet, we conduct an ablation study on different values of the length of the historical decision vector and of gamma in Eq. 5, with the offset threshold set to 1 second and depth Rec = 1.0. Results are shown in Fig. 3. Increasing the length of the historical decision vector means increasing the dependency of later decisions on previous ones. As is shown, the model performs much better when incorporating historical decisions, and it reaches its highest performance when 8 historical decisions are considered. Increasing gamma increases the effect of future rewards on the total long-term reward. The model performs better as gamma increases.

Results with different features. To investigate the performance of our framework when using different features, we add experiments with ClsNet-only, StartNet-CE and StartNet-PG using appearance (RGB) features only. Results are displayed in Table 3 and Table 4. We see that when using only RGB features, the performance of the three models drops. However, even with RGB features, our method still outperforms Shou et al. [29] by a large margin.

Effectiveness of two-stage design. We validate our two-stage design by comparing with a one-stage network which has a similar structure to ClsNet (LSTM), except that we modify it to directly predict action starts for all classes and optimize it with cross-entropy loss. At a 1-second offset (depth Rec = 1.0), this one-stage network performs much worse than StartNet-CE and StartNet-PG with both RGB and TS features, demonstrating that simply learning classification and localization of action starts jointly is not a good strategy.

Learning from low-level features. Our framework uses action score distributions pretrained on an auxiliary task as inputs of LocNet. We believe that learning from this high-level representation is better than learning from low-level, noisy features for our task due to the lack of training data. To verify this point, we construct StartNet-img, where LocNet learns directly from the low-level image features. Its p-mAP using RGB and TS features under an offset of 1 second (depth Rec = 1.0) is much lower than that of our framework, confirming the benefit of learning from the high-level representation.

4.2 Experiments on ActivityNet

Dataset. ActivityNet v1.3 [11] is one of the largest datasets for action recognition. It contains annotations of 200 action classes. There are around 10K untrimmed videos (15K action instances) in the training set and 5K untrimmed videos (7.6K action instances) in the validation set. On average, there are around 1.6 action instances in each video. Following [29], we train our models on the training set and test them on the validation set.

Feature description. The TS feature is constructed by concatenating appearance and motion features that are extracted from a TSN model (with BN-Inception) [34] pretrained on Kinetics [6]. Besides, we validate our method using appearance features extracted from the fc6 layer of VGG-16 [31]. The VGG-16 model is pretrained on ImageNet [10]. VGG-16 features are not as good as ResNet and InceptionNet features for action recognition tasks. We use VGG-16 features to show that our framework can produce reasonable results even when using simple features pretrained only on images.

Training sample strategy of LocNet. Unlike THUMOS’14, which contains around 16 action instances per video on average, ActivityNet has only one action instance in most videos. Thus, ActivityNet has a much more severe imbalance problem between start and non-start classes. To balance the samples, we randomly select equal numbers of positive and negative sequences for each training batch. A positive sequence is defined as one containing at least one action start; a negative sequence contains no action start. Then, α is set to the ratio between the number of negative samples and the number of positive ones after this balancing.
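A sketch of this balanced batch sampling (with hypothetical helper names, assuming sequences have already been split into positive and negative pools):

```python
import random

def sample_balanced_batch(positive_seqs, negative_seqs, batch_size: int):
    """Draw equal numbers of positive (containing at least one action start) and
    negative (no action start) training sequences for one batch."""
    half = batch_size // 2
    batch = random.sample(positive_seqs, half) + random.sample(negative_seqs, half)
    random.shuffle(batch)
    return batch
```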

Evaluation results. Comparisons of StartNet with previous methods on ActivityNet are shown in Table 5. StartNet significantly outperforms previous methods. Specifically, StartNet with TS features achieves similar performance under a 1-second offset tolerance to Shou et al. [29] under a 10-second offset. At an offset of 10 seconds, our method improves over Shou et al. [29] by around 10 p-mAP. It also outperforms SceneDetect and ShotDetect by large margins of around 13 and 12 p-mAP, respectively. Even with VGG features pretrained only on images, our method significantly outperforms the state-of-the-art. Besides, we demonstrate the contribution of each module by comparing with ClsNet-only and StartNet-CE (refer to Sec. 4.1.2 for detailed model descriptions). Results show that by adding LocNet, StartNet-PG improves ClsNet-only by over 3 p-mAP (using VGG) and around 4 p-mAP (using TS). With long-term planning, StartNet-PG significantly outperforms StartNet-CE with both features, especially when the offset tolerance is small. Qualitative results in Fig. 4 show a hard case where ClsNet-only misses an action start due to the subtle appearance difference near the start point. With LocNet, StartNet-PG successfully captures the start point, although the score is low.

5 Conclusion

We proposed StartNet to handle Online Detection of Action Start. StartNet consists of two networks, i.e., ClsNet and LocNet. ClsNet processes the input streaming video and generates action scores for each video frame. LocNet localizes start points by optimizing long-term planning rewards using policy gradient methods. At the end, results from the two sub-networks are fused to produce the final action start predictions. Experimental results on THUMOS’14 and ActivityNet demonstrate that our framework significantly outperforms the state-of-the-art. Extensive ablation studies were conducted to show the effectiveness of each module of our method.

References

  • [1] https://github.com/Breakthrough/PySceneDetect.
  • [2] https://github.com/johmathe/Shotdetect.
  • [3] http://pytorch.org/.
  • [4] S. Buch, V. Escorcia, C. Shen, B. Ghanem, and J. C. Niebles. SST: Single-stream temporal action proposals. In CVPR, 2017.
  • [5] J. C. Caicedo and S. Lazebnik. Active object localization with deep reinforcement learning. In ICCV, 2015.
  • [6] J. Carreira and A. Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, 2017.
  • [7] Y.-W. Chao, S. Vijayanarasimhan, B. Seybold, D. A. Ross, J. Deng, and R. Sukthankar. Rethinking the faster r-cnn architecture for temporal action localization. In CVPR, 2018.
  • [8] X. Dai, B. Singh, G. Zhang, L. S. Davis, and Y. Q. Chen. Temporal context network for activity localization in videos. In ICCV, 2017.
  • [9] R. De Geest, E. Gavves, A. Ghodrati, Z. Li, C. Snoek, and T. Tuytelaars. Online action detection. In ECCV, 2016.
  • [10] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
  • [11] F. Caba Heilbron, V. Escorcia, B. Ghanem, and J. C. Niebles. ActivityNet: A large-scale video benchmark for human activity understanding. In CVPR, 2015.
  • [12] J. Gao, Z. Yang, and R. Nevatia. RED: Reinforced encoder-decoder networks for action anticipation. In BMVC, 2017.
  • [13] J. Gao, Z. Yang, C. Sun, K. Chen, and R. Nevatia. TURN TAP: Temporal unit regression network for temporal action proposals. ICCV, 2017.
  • [14] M. Gao, R. Yu, A. Li, V. I. Morariu, and L. S. Davis. Dynamic zoom-in network for fast object detection in large images. In CVPR, 2018.
  • [15] R. Girshick. Fast R-CNN. In ICCV, 2015.
  • [16] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
  • [17] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [18] M. Hoai and F. De la Torre. Max-margin early event detectors. IJCV, 2014.
  • [19] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 1997.
  • [20] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167, 2015.
  • [21] Y.-G. Jiang, J. Liu, A. Roshan Zamir, G. Toderici, I. Laptev, M. Shah, and R. Sukthankar. THUMOS challenge: Action recognition with a large number of classes. http://crcv.ucf.edu/THUMOS14/, 2014.
  • [22] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
  • [23] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.
  • [24] S. Ma, L. Sigal, and S. Sclaroff. Learning activity progression in lstms for activity detection and early detection. In CVPR, 2016.
  • [25] V. Mnih, N. Heess, A. Graves, et al. Recurrent models of visual attention. In NIPS, 2014.
  • [26] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 2015.
  • [27] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
  • [28] Z. Shou, J. Chan, A. Zareian, K. Miyazawa, and S.-F. Chang. CDC: Convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos. In CVPR, 2017.
  • [29] Z. Shou, J. Pan, J. Chan, K. Miyazawa, H. Mansour, A. Vetro, X. Giro-i Nieto, and S.-F. Chang. Online action detection in untrimmed, streaming videos-modeling and evaluation. In ECCV, 2018.
  • [30] Z. Shou, D. Wang, and S.-F. Chang. Temporal action localization in untrimmed videos via multi-stage cnns. In CVPR, 2016.
  • [31] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.
  • [32] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT press, 2018.
  • [33] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In ICCV, 2015.
  • [34] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, 2016.
  • [35] Z. Wu, T. Nagarajan, A. Kumar, S. Rennie, L. S. Davis, K. Grauman, and R. Feris. Blockdrop: Dynamic inference paths in residual networks. In CVPR, 2018.
  • [36] Z. Wu, C. Xiong, C.-Y. Ma, R. Socher, and L. S. Davis. Adaframe: Adaptive frame selection for fast video recognition. arXiv:1811.12432, 2018.
  • [37] H. Xu, A. Das, and K. Saenko. R-C3D: Region convolutional 3d network for temporal activity detection. In ICCV, 2017.
  • [38] M. Xu, M. Gao, Y.-T. Chen, L. S. Davis, and D. J. Crandall. Temporal recurrent networks for online action detection. arXiv:1811.07391, 2018.
  • [39] Y. Zhao, Y. Xiong, L. Wang, Z. Wu, X. Tang, and D. Lin. Temporal action detection with structured segment networks. In ICCV, 2017.