Long Short-Term Transformer for Online Action Detection

by   Mingze Xu, et al.

In this paper, we present Long Short-term TRansformer (LSTR), a new temporal modeling algorithm for online action detection that employs a long- and short-term memory mechanism able to model prolonged sequence data. It consists of an LSTR encoder that dynamically exploits coarse-scale historical information from an extensively long time window (e.g., 2048 long-range frames of up to 8 minutes), together with an LSTR decoder that focuses on a short time window (e.g., 32 short-range frames of 8 seconds) to model the fine-scale characterization of the ongoing event. Compared to prior work, LSTR provides an effective and efficient method to model long videos with less heuristic algorithm design. LSTR achieves significantly improved results on standard online action detection benchmarks, THUMOS'14, TVSeries, and HACS Segment, over the existing state-of-the-art approaches. Extensive empirical analysis validates the setup of the long- and short-term memories and the design choices of LSTR.



1 Introduction

Given an incoming stream of video frames, online action detection de2016online is concerned with classifying what is happening at each frame without seeing the future. It is a causal reasoning problem, unlike offline action recognition, where the entire video snippet is available. In this paper, we tackle the online action detection problem by presenting a new temporal modeling algorithm that captures temporal correlations on prolonged sequences up to 8 minutes long, while retaining fine granularity of the event in the representation. This is achieved by modeling activities at different temporal scales, which is necessary to capture the variety of events: some occur in brief bursts, others manifest as slowly varying trends.

Specifically, we propose a new model, named Long Short-term TRansformer (LSTR), to jointly model the long- and short-term temporal dependencies. LSTR has two advantages over previous work. 1) Storing history in a “plaintext” manner avoids the difficulty of learning with recurrent models elman1990finding; schuster1997bidirectional; hochreiter1997long; chung2014empirical (backpropagation through time, BPTT, is not needed); the model can directly attend to any useful frames in the long-term memory. 2) Separating long- and short-term memories allows us to conduct focused modeling of short-term information while simultaneously extracting well-informed compressed knowledge from the long-term history. This means we can compress the long-term history without worrying about losing information that is important for classification at the current time.

As shown in Fig. 1, we explicitly divide the entire history into long- and short-term memories and build our model with an encoder-decoder architecture. Specifically, the LSTR encoder compresses and abstracts the long-term memory into a fixed-length latent representation, and the LSTR decoder takes a short window of recent frames, applies self-attention among them, and then cross-attends to the token embeddings produced by the LSTR encoder. In the LSTR encoder, a two-stage memory compression makes an extended temporal support practical for untrimmed, streaming videos, and is shown to be computationally efficient in both training and inference. Our overall long short-term Transformer architecture gives rise to an effective and efficient representation for modeling prolonged time sequence data.

To evaluate the effectiveness of our model, we validate LSTR on three standard datasets (THUMOS’14 THUMOS14, TVSeries de2016online, and HACS Segment Zhao_2019_ICCV), which differ widely in video length (from a few seconds to tens of minutes). Experimental results show that LSTR significantly outperforms all state-of-the-art methods for online action detection, and the ablation studies further demonstrate LSTR’s superior performance for modeling long video sequences.

Figure 1: Overview of our Long Short-term TRansformer (LSTR). Given a live streaming video, LSTR sequentially identifies the actions happening in each incoming frame by using an encoder-decoder architecture, without future context. The dashed brown arrows indicate the data flow of the long- and short-term memories following the first-in-first-out (FIFO) logic. (Best viewed in color.)

2 Related Work

Online Action Detection. Temporal action localization targets detecting the temporal boundaries of action instances after observing the entire video shou2016temporal; xu2017r; gao2017turn; shou2017cdc; Zhao_2017_ICCV; buch2017sst; lin2018bsn; liu2019multi; lin2019bmn. Emerging real-world applications, such as robotic perception yao2019unsupervised, require causal processing, where only past and current frames are used for detecting actions of the present. There is a rich body of recent work on this online action detection problem de2016online. RED gao2017red uses a reinforcement loss to encourage recognizing actions as early as possible. TRN xu2019temporal models greater temporal context by simultaneously performing online action detection and anticipation. IDN eun2020learning learns discriminative features and accumulates only relevant information for the present. LAP-Net qu2020lap proposes an adaptive sampling strategy to obtain optimal features. PKD zhao2020privileged transfers knowledge from offline to online models using curriculum learning. As with early action detection hoai2014max; ma2016learning, Shou et al. shou2018online focus on online detection of action start (ODAS). StartNet gao2019startnet decomposes ODAS into two stages and learns with policy gradient. WOAD gao2020woad uses weakly-supervised learning with video-level labels.

Temporal/Sequence Modeling. Modeling temporal information, especially from long-term context, is important but challenging for many video understanding tasks wu2019long; oh2019video; wu2021long. Prior work on action recognition mostly relies on heuristic sub-sampling (typically 3 to 7 video frames) to make training feasible yue2015beyond; wang2016temporal; feichtenhofer2017spatiotemporal; miech2017learnable; tang2018non, or on recurrent neural networks hochreiter1997long; chung2014empirical that maintain the entire history in a summary state, at the cost of losing track of specifics for the individual time steps in the past. 3D CNNs carreira2017quo; tran2018closer; xie1712rethinking are also widely used for spatio-temporal feature modeling, but they are limited by their receptive field and hence unable to model long-range temporal context. Wu et al. wu2019long propose a long-term feature bank for generic video understanding, which stores various supportive information, such as object and scene features, as context references. However, these features are usually pre-computed independently, with no notion of position or order, and thus cannot provide an effective representation of the full context. Moreover, most of the above work does not explicitly integrate the long- and short-term features, but combines them with simple mechanisms, such as pooling and concatenation. Recent findings in cognitive science on the underlying mechanisms of working memory and attention oberauer2019working; cowan1998attention; chun2011visual; kiyonaga2013working have also shed light on the design principle for modeling long-term information with attention wang2018non; wu2019long. Compared with RNNs, the attention mechanism is able to maintain long-term information at fast inference speed vaswani2017attention.

Transformers for Action Understanding. Transformers have achieved breakthrough success in NLP radford2018improving; devlin2018bert and have been adopted in computer vision for image recognition dosovitskiy2020image; touvron2020training and object detection carion2020end. Recent papers exploit Transformers for temporal modeling in videos, such as action recognition neimark2021video; sharir2021image; li2021vidtr; bertasius2021space; arnab2021vivit and temporal action localization nawhal2021activity; tan2021relaxed, and achieve state-of-the-art results on common benchmarks. However, most of these works only perform temporal modeling on relatively short video clips due to their computational demands and limited modeling capability. Several works dai2019transformer; burtsev2020memory focus on designing Transformers to model long-term dependencies, but how to aggregate long- and short-term information is still not well explored jaegle2021perceiver.

3 Long Short-Term Transformer

Figure 2: Visualization of our Long Short-Term Transformer (LSTR), which is formulated in an encoder-decoder manner. Specifically, the LSTR encoder compresses the long-term memory of size $m_L$ to $n_1$ encoded latent features, and the LSTR decoder references related context information from the encoded memory with the short-term memory of size $m_S$ for action recognition of the present. The LSTR encoder and decoder are built with Transformer decoder units vaswani2017attention, which take the input tokens (dark green arrows) and output tokens (dark blue arrows) as inputs. During inference, LSTR processes every incoming frame in an online manner, absent future context. (Best viewed in color.)

Given a live streaming video, our goal is to identify the actions performed in each video frame using only past and current observations. Future information is not accessible during inference. Formally, a streaming video at time $t$ is represented by a batch of $\tau$ past frames $\mathbf{I}^t = \{I_{t-\tau+1}, \dots, I_t\}$, which reads “$\mathbf{I}$ up to time $t$.” The online action detection system receives $\mathbf{I}^t$ as input and classifies the action category $\hat{y}_t$ as one of $(K+1)$ classes, $\hat{y}_t \in \{0, 1, \dots, K\}$, ideally using the posterior probability $P(\hat{y}_t = k \mid \mathbf{I}^t)$, where $k = 0$ denotes the probability that no event is occurring at frame $t$. We design our method by assuming that there is a pretrained feature extractor wang2016temporal that processes each video frame $I_t$ into a feature vector $\mathbf{f}_t \in \mathbb{R}^C$ of $C$ dimensions. These vectors form a $(\tau \times C)$-dimensional temporal sequence that serves as the input of our method. (In practice, some feature extractors tran2015learning take consecutive frames to produce one feature vector. Nonetheless, it is still temporally “centered” on a single frame, so we use the single-frame notation here for simplicity.)

3.1 Overview

Our method is based on the intuition that recently observed frames provide precise information about the ongoing action instance, while frames over an extended period offer contextual references for actions that are potentially happening right now. We propose Long Short-term TRansformer (LSTR) in an explicit encoder-decoder manner, as shown in Fig. 2. In particular, the feature vectors of frames in the distant past are stored in a long-term memory, and a short-term memory stores the features of recent frames. The LSTR encoder compresses and abstracts features in the long-term memory to an encoded latent representation of $n_1$ vectors. The LSTR decoder queries the encoded long-term memory with the short-term memory for decoding, leading to the action prediction $\hat{y}_t$. This design follows the line of thought in combining long- and short-term information for action understanding donahue2015long; wang2016temporal; wu2019long, but addresses several key challenges to achieve this goal effectively and efficiently, thanks to the modeling flexibility of Transformers vaswani2017attention.

3.2 Long- and Short-Term Memories

We store the streaming input of feature vectors in two consecutive memories. The first is the short-term memory, which stores only a small number of recently observed frames. We implement it with a first-in-first-out (FIFO) queue of $m_S$ slots. At time $T$, it stores the feature vectors as $\mathbf{M}_S = \{\mathbf{f}_t\}_{t=T-m_S+1}^{T}$. When a frame becomes “older” than $m_S$ time steps, it graduates from $\mathbf{M}_S$ and enters the long-term memory, which is implemented with another FIFO queue of $m_L$ slots. The long-term memory stores $\mathbf{M}_L = \{\mathbf{f}_t\}_{t=T-m_S-m_L+1}^{T-m_S}$. The long-term memory serves as the input memory to the LSTR encoder, and the short-term memory serves as the queries for the LSTR decoder. In practice, the long-term memory covers a much longer time span than the short-term memory ($m_L \gg m_S$). A typical choice is $m_L = 2048$, which represents 512 seconds of video content at a sampling rate of 4 frames per second (FPS), and $m_S = 32$, representing 8 seconds. We add a sinusoidal positional encoding vaswani2017attention to each frame feature in the memories relative to the current time (i.e., the frame at $T - \tau$ receives a positional embedding of $\mathbf{s}_\tau$).
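The hand-off between the two queues can be sketched in a few lines of Python. This is only an illustration of the FIFO logic described above, not the authors' implementation: the class name `LongShortMemory` and its method `push` are hypothetical, and plain integers stand in for per-frame feature vectors.

```python
from collections import deque


class LongShortMemory:
    """Two consecutive FIFO queues: recent frames first, older frames second."""

    def __init__(self, long_size=2048, short_size=32):
        # deque(maxlen=...) silently drops the oldest entry once it is full.
        self.long = deque(maxlen=long_size)
        self.short = deque(maxlen=short_size)

    def push(self, feature):
        # When the short-term queue is full, its oldest frame "graduates"
        # into the long-term queue before the new frame is appended.
        if len(self.short) == self.short.maxlen:
            self.long.append(self.short[0])
        self.short.append(feature)


# Toy sizes so the behaviour is easy to follow.
mem = LongShortMemory(long_size=4, short_size=2)
for t in range(8):
    mem.push(t)

print(list(mem.long), list(mem.short))  # [2, 3, 4, 5] [6, 7]
```

With the toy sizes, frames 0 and 1 have already been evicted from the long-term queue, frames 2 to 5 form the long-term memory, and frames 6 and 7 form the short-term memory.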

3.3 LSTR Encoder

The LSTR encoder aims at encoding the long-term memory of $m_L$ feature vectors into a latent representation that LSTR can use for decoding useful temporal context. This task requires a large capacity for capturing the relations and temporal context among hundreds or even thousands of frames. Prior work on modeling long-term dependencies for action understanding relies on heuristic temporal sub-sampling wu2019long; wang2016temporal or recurrent networks donahue2015long to make training feasible, at the cost of losing specific information of each time step. Attention-based architectures, such as the Transformer vaswani2017attention, have recently shown promise for similar tasks that require long-range temporal modeling sharir2021image. A straightforward choice for the LSTR encoder would be a Transformer encoder based on self-attention. However, its time complexity, $O(m_L^2 C)$, grows quadratically with the memory sequence length $m_L$. This limits our ability to model a long-term memory long enough to cover long videos. Though recent work wang2020linformer has been exploring self-attention with linear complexity, repeatedly referencing information from the long-term memory with multi-layer Transformers is still computationally heavy. In LSTR, we propose to use a two-stage memory compression based on Transformer decoder units vaswani2017attention to achieve more effective memory encoding.

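The Transformer decoder unit vaswani2017attention takes two sets of inputs. The first set includes a fixed number of $n$ learnable output tokens $\boldsymbol{\lambda} \in \mathbb{R}^{n \times C}$, where $C$ is the embedding dimension. The second set includes another $m$ input tokens $\boldsymbol{\theta} \in \mathbb{R}^{m \times C}$, where $m$ can be a rather large number. The unit first applies one layer of multi-head self-attention on $\boldsymbol{\lambda}$. The outputs are then used as queries in a “QKV cross-attention” operation, and the input embeddings $\boldsymbol{\theta}$ serve as key and value. The two steps can be written as

$$\boldsymbol{\lambda}' = \mathrm{SelfAttention}(\boldsymbol{\lambda}) = \mathrm{Softmax}\left(\frac{\boldsymbol{\lambda}\boldsymbol{\lambda}^{\top}}{\sqrt{C}}\right)\boldsymbol{\lambda},$$

$$\mathrm{CrossAttention}(\sigma(\boldsymbol{\lambda}'), \boldsymbol{\theta}) = \mathrm{Softmax}\left(\frac{\sigma(\boldsymbol{\lambda}')\boldsymbol{\theta}^{\top}}{\sqrt{C}}\right)\boldsymbol{\theta},$$

where $\sigma: \mathbb{R}^{n \times C} \rightarrow \mathbb{R}^{n \times C}$ denotes the intermediate layers between the two attention operations. One appealing property of this design is that it transforms the $m \times C$ dimensional input tokens into output tokens of $n \times C$ dimensions in $O(n^2 C + nmC)$ time complexity. When $n \ll m$, the time complexity becomes linear in $m$, making it an ideal candidate for compressing the long-term memory. This property is also utilized in jaegle2021perceiver to efficiently process large volume inputs, such as image pixels.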
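The compression behaviour of such a unit can be sketched with NumPy. The sketch below makes several simplifying assumptions that are not taken from the text: a single attention head, no learned query/key/value projections, and the intermediate layers between the two attention steps treated as the identity. The function names `attention` and `decoder_unit` are illustrative only.

```python
import numpy as np


def softmax(x):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def attention(q, k, v):
    # Scaled dot-product attention: (len_q, C) queries over (len_k, C) keys.
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v


def decoder_unit(output_tokens, input_tokens):
    # Step 1: self-attention among the n output tokens, cost O(n^2 C).
    hidden = attention(output_tokens, output_tokens, output_tokens)
    # Step 2: cross-attention of n queries over m input tokens, cost O(n m C).
    return attention(hidden, input_tokens, input_tokens)


rng = np.random.default_rng(0)
n, m, C = 16, 2048, 64  # toy sizes: few output tokens, many input tokens
compressed = decoder_unit(rng.standard_normal((n, C)),
                          rng.standard_normal((m, C)))
print(compressed.shape)  # (16, 64)
```

Whatever the number of input tokens, the output always has one row per output token, which is what makes the unit usable as a fixed-size bottleneck.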

Two-Stage Memory Compression. Stacking multiple Transformer decoder units on the long-term memory, as in vaswani2017attention, can form a memory encoder with linear complexity with respect to the memory size $m_L$. However, running the encoder at each time step can still be time consuming. We further reduce the time complexity with a two-stage memory compression design. The first stage has one Transformer decoder unit with $n_0$ output tokens. Its input tokens are the whole long-term memory of size $m_L$. The outputs of the first stage are used as the input tokens to the second stage, which has $\ell_{enc}$ stacked Transformer decoder units and $n_1$ output tokens. The long-term memory of size $m_L \times C$ is thereby compressed into a latent representation of size $n_1 \times C$, which can then be efficiently queried in the LSTR decoder. This two-stage memory compression design is illustrated in Fig. 2.

Compared to an $\ell_{enc}$-layer Transformer encoder with $O(m_L^2 \ell_{enc} C)$ time complexity, or stacked Transformer decoder units with $n$ output tokens having $O((n^2 + n m_L)\,\ell_{enc} C)$ time complexity, the proposed LSTR encoder has a complexity of $O(n_0^2 C + n_0 m_L C + (n_1^2 + n_1 n_0)\,\ell_{enc} C)$. Because both $n_0$ and $n_1$ are much smaller than $m_L$, and $\ell_{enc}$ is usually larger than 1, using two-stage memory compression can be more efficient. In Sec. 3.6, we will show that, during online inference, it further enables us to reduce the runtime of the Transformer decoder unit of the first stage. In Sec. 4.5, we show empirically that this design also leads to better performance for online action detection.
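To get a feel for the relative costs, the three designs can be compared with a simple operation-count model. The cost expressions below are an assumed reading of the comparison above (quadratic self-attention over the memory, decoder units that each cross-attend to the memory, and a two-stage design that touches the memory only once); constant factors are ignored, and the token counts, layer count, and feature dimension are illustrative values rather than settings taken from the text.

```python
def encoder_cost(m, layers, C):
    # Self-attention over all m memory slots in every layer.
    return m * m * layers * C


def single_stage_cost(m, n, layers, C):
    # Every decoder unit: self-attention on n tokens + cross-attention to m slots.
    return (n * n + n * m) * layers * C


def two_stage_cost(m, n0, n1, layers, C):
    # Stage 1 reads the memory once with n0 tokens; stage 2 only sees n0 inputs.
    stage1 = (n0 * n0 + n0 * m) * C
    stage2 = (n1 * n1 + n1 * n0) * layers * C
    return stage1 + stage2


m, C = 2048, 1024           # memory length and feature dimension (illustrative)
n0, n1, layers = 16, 32, 2  # illustrative token and layer counts

costs = {
    "transformer encoder": encoder_cost(m, layers, C),
    "single-stage decoder units": single_stage_cost(m, n1, layers, C),
    "two-stage compression": two_stage_cost(m, n0, n1, layers, C),
}
for name, cost in costs.items():
    print(f"{name:28s} {cost:>14,d}")
```

Under these assumptions the two-stage design is the cheapest because the expensive term that scales with the memory length appears only once, independent of the number of stacked layers.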

3.4 LSTR Decoder

The short-term memory contains informative features for classifying actions at the latest time step. The LSTR decoder uses the short-term memory as queries to retrieve useful information from the encoded long-term memory produced by the LSTR encoder. The LSTR decoder is formed by stacking $\ell_{dec}$ layers of Transformer decoder units. It takes the outputs of the LSTR encoder as input tokens and the $m_S$ feature vectors in the short-term memory as output tokens. It outputs $m_S$ probability vectors $\{\mathbf{p}_t\}$ with $\mathbf{p}_t \in [0, 1]^{K+1}$, each representing the predicted probability distribution over $K$ action categories and one “background” class at time $t$. During inference, we only take the probability vector from the output token corresponding to the current time as the classification result. However, having the additional outputs on the older frames allows the model to leverage more supervision signals during training. The details are described below.

3.5 Training LSTR

Unlike LSTMs hochreiter1997long, LSTR can be trained without temporal unrolling and Backpropagation Through Time (BPTT), which is a common property of Transformers vaswani2017attention. We construct each training sample by randomly sampling an ending time $T$ and filling the long- and short-term memories by tracing back in time for $m_L + m_S$ frames. We use the empirical cross-entropy loss between the predicted probability distribution $\mathbf{p}_T$ at time $T$ and the ground-truth action label $y_T \in \{0, 1, \dots, K\}$:

$$L(y_T, \mathbf{p}_T) = -\sum_{j=0}^{K} \delta(j - y_T) \log p_T^j,$$

where $p_T^j$ is the $j$-th element of the probability vector $\mathbf{p}_T$, predicted on the latest frame at $T$, and $\delta(\cdot)$ is 1 when its argument is zero and 0 otherwise. Additionally, we add a directional attention mask vaswani2017attention to the short-term memory so that any frame in the short-term memory can only depend on its previous frames. In this way, we can make predictions on all frames in the short-term memory as if they were the latest ones. Thus we can provide supervision on every frame in the short-term memory, and the complete loss function is then

$$\mathcal{L} = \sum_{t=T-m_S+1}^{T} L(y_t, \mathbf{p}_t),$$

where $\mathbf{p}_t$ denotes the prediction from the output token corresponding to time $t$.
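The two ingredients of this training objective, a directional (causal) mask over the short-term memory and a cross-entropy term for every short-term frame, can be sketched as follows. This is a minimal NumPy illustration under the assumption that the per-frame losses are simply summed; the helper names `causal_mask` and `short_term_loss` are hypothetical.

```python
import numpy as np


def causal_mask(num_frames):
    # mask[i, j] is True where frame i may attend to frame j, i.e. j <= i.
    return np.tril(np.ones((num_frames, num_frames), dtype=bool))


def short_term_loss(probs, labels):
    # probs: (num_frames, K + 1) predicted distributions, one per short-term frame.
    # labels: (num_frames,) ground-truth class ids, with 0 as background.
    picked = probs[np.arange(len(labels)), labels]
    # Sum of per-frame cross-entropy terms over the short-term memory.
    return float(-np.log(picked).sum())


mask = causal_mask(4)
probs = np.full((4, 3), 1.0 / 3.0)  # uniform predictions over K + 1 = 3 classes
labels = np.array([0, 0, 2, 1])
loss = short_term_loss(probs, labels)
print(mask.astype(int))
print(loss)  # equals 4 * log(3) for uniform predictions
```

Because each row of the mask only exposes earlier frames, every output token is trained under the same "no future" condition that the latest frame faces at inference time.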

3.6 Online Inference with LSTR

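During online inference, the video frame features are streamed to the model as time passes. Running LSTR’s long-term memory encoder from scratch for each frame results in a time complexity of $O(n_0^2 C + n_0 m_L C)$ and $O((n_1^2 + n_1 n_0)\,\ell_{enc} C)$ for the first and second memory compression stages, respectively. However, at each time step, there is only one new video frame to be updated. We show it is possible to achieve even more efficient online inference by storing the intermediate results for the Transformer decoder unit of the first stage. First, the queries of the first Transformer decoder unit are fixed, so their self-attention outputs can be pre-computed and used throughout the inference. Second, the cross-attention operation in the first stage can be written as

$$\mathrm{CrossAttention}\big(\mathbf{q}_i, \{\mathbf{f}_{T-\tau} + \mathbf{s}_\tau\}\big) = \sum_{\tau=m_S}^{m_S+m_L-1} \frac{\exp\big((\mathbf{f}_{T-\tau} + \mathbf{s}_\tau) \cdot \mathbf{q}_i / \sqrt{C}\big)}{\sum_{\tau'=m_S}^{m_S+m_L-1} \exp\big((\mathbf{f}_{T-\tau'} + \mathbf{s}_{\tau'}) \cdot \mathbf{q}_i / \sqrt{C}\big)} \, (\mathbf{f}_{T-\tau} + \mathbf{s}_\tau), \tag{3}$$

where the index $\tau = T - t$ is the relative position of a frame in the long-term memory to the latest time $T$. This calculation depends on the un-normalized attention weight matrix $\mathbf{A} \in \mathbb{R}^{m_L \times n_0}$, with elements $a_{\tau i} = (\mathbf{f}_{T-\tau} + \mathbf{s}_\tau) \cdot \mathbf{q}_i$. $\mathbf{A}$ can be decomposed into the sum of two matrices $\mathbf{A}^f$ and $\mathbf{A}^s$, with elements $a^f_{\tau i} = \mathbf{f}_{T-\tau} \cdot \mathbf{q}_i$ and $a^s_{\tau i} = \mathbf{s}_\tau \cdot \mathbf{q}_i$. The queries after the first self-attention operation, $\mathbf{Q} = \sigma(\boldsymbol{\lambda}')$, and the position embeddings $\mathbf{s}_\tau$ are fixed during inference. Thus the matrix $\mathbf{A}^s$ can be pre-computed and used for every incoming frame. We additionally maintain a FIFO queue of vectors $\mathbf{a}_t = \mathbf{Q}\mathbf{f}_t$ of size $m_L$. $\mathbf{A}^f$ at any time step $T$ can be obtained by stacking all vectors currently in this queue. Updating this queue at each time step requires $O(n_0 C)$ time complexity for the matrix-vector product. Now we can obtain the matrix $\mathbf{A}$ with only $n_0 \times m_L$ additions by adding $\mathbf{A}^f$ and $\mathbf{A}^s$ together, instead of $n_0 \times m_L \times C$ multiplications and additions using Eq. (3). This means the amortized time complexity of computing the attention weights can be reduced to $O(n_0 (m_L + C))$. Although the time complexity of the cross-attention operation is still $O(n_0 m_L C)$ due to the inevitable weighted sum, considering that $C$ is usually large he2016deep; szegedy2016inception, this is still a considerable reduction of runtime cost. We compare LSTR’s runtime with other baseline models in Sec. 6.1.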
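The caching trick can be checked numerically. The sketch below streams random frame features through a small long-term memory and confirms that the un-normalized attention logits obtained from the cached per-frame products plus the pre-computed positional part match a full recomputation. All sizes and variable names (`Q`, `S`, `A_s`, `cache`) are illustrative assumptions; attention scaling, the softmax, and the weighted sum are left out because they are unaffected by the decomposition.

```python
import numpy as np
from collections import deque

rng = np.random.default_rng(0)
n0, m_L, C = 4, 6, 8                # toy sizes, chosen only for illustration

Q = rng.standard_normal((n0, C))    # fixed queries after the first self-attention
S = rng.standard_normal((m_L, C))   # stand-in positional embeddings, one per slot
A_s = S @ Q.T                       # positional part of the logits, computed once

cache = deque(maxlen=m_L)           # FIFO queue of cached products Q @ f_t
frames = deque(maxlen=m_L)          # raw features, kept only to verify the shortcut
max_error = 0.0

for _ in range(10):                 # stream ten frames into a six-slot memory
    f = rng.standard_normal(C)
    frames.append(f)
    cache.append(Q @ f)             # O(n0 * C) work for the single new frame
    if len(cache) == m_L:
        # Fast path: stack cached vectors and add the fixed positional part.
        A_fast = np.stack(list(cache)) + A_s
        # Slow path: recompute every dot product from scratch.
        A_full = (np.stack(list(frames)) + S) @ Q.T
        max_error = max(max_error, float(np.abs(A_fast - A_full).max()))

print(A_fast.shape, max_error < 1e-9)
```

The positional part stays attached to the memory slot while the cached frame products slide through the queue, which is why a frame's contribution never has to be recomputed as it ages.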

4 Experiments

4.1 Datasets

We evaluate our model on three publicly-available datasets: THUMOS’14 THUMOS14, TVSeries de2016online and HACS Segment Zhao_2019_ICCV. THUMOS’14 includes over 20 hours of sports video annotated with 20 actions. We follow prior work xu2019temporal; eun2020learning and train on the validation set (200 untrimmed videos) and evaluate on the test set (213 untrimmed videos). TVSeries contains 27 episodes of 6 popular TV series, totaling 16 hours of video. The dataset is annotated with 30 realistic, everyday actions (e.g., open door). HACS Segment is a large-scale dataset of web videos. It contains 35,300 untrimmed videos over 200 human action classes for training and 5,530 untrimmed videos for validation.

4.2 Settings

Feature Encoding. We follow the experimental settings of state-of-the-art methods xu2019temporal; eun2020learning. Specifically, we extract video frames at 24 FPS and set the video chunk size to 6. Decisions are made at the chunk level, and thus performance is evaluated every 0.25 seconds. For feature encoding, we adopt the two-stream network wang2016temporal with a ResNet-50 backbone, and use the open-source toolbox 2020mmaction2. We experiment with feature extractors pretrained on two datasets: ActivityNet and Kinetics. The visual features are extracted at the global pooling layer from the central frame of each chunk, and the motion features are extracted from pre-computed stacked optical flow fields between 6 consecutive frames. The visual and motion features are concatenated along the channel dimension as the final feature $\mathbf{f}$.

Implementation Details. We implemented our proposed model in PyTorch pytorch, and performed all experiments on a system with 8 Nvidia V100 graphics cards. To learn model weights, we used the Adam kingma2014adam optimizer with weight decay . The learning rate was linearly increased from zero to in the first training iterations and was reduced to zero according to a cosine function. Our models were optimized using batch size of , and the training was terminated after epochs.

Evaluation Protocols. We follow prior work and use per-frame mean average precision (mAP) to evaluate the performance of online action detection. We also use per-frame calibrated average precision (cAP) de2016online, which was proposed for TVSeries to correct the imbalance between positive and negative samples: $\mathrm{cAP} = \frac{\sum_k \mathrm{cPrec}(k) \cdot I(k)}{P}$, where $\mathrm{cPrec} = \frac{TP}{TP + FP/w}$, $I(k)$ is 1 if frame $k$ is a true positive, $P$ is the number of true positives, and $w$ is the ratio of negative to positive frames.
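A small Python function makes the calibration concrete. It is a sketch of one reading of the metric described above for a single action class: frames are ranked by score, false positives are down-weighted by the negative-to-positive ratio, and the calibrated precision is averaged over the positive frames. The function name `calibrated_ap` and the toy scores are illustrative.

```python
def calibrated_ap(scores, labels):
    """Calibrated average precision for one class.

    scores: per-frame confidence for the class.
    labels: 1 for positive frames, 0 for negative frames.
    """
    num_pos = sum(labels)
    num_neg = len(labels) - num_pos
    if num_pos == 0 or num_neg == 0:
        raise ValueError("need both positive and negative frames")
    w = num_neg / num_pos  # negative-to-positive ratio

    # Rank frames from most to least confident.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    tp = fp = 0
    total = 0.0
    for _, label in ranked:
        if label == 1:
            tp += 1
            total += tp / (tp + fp / w)  # calibrated precision at this cut-off
        else:
            fp += 1
    return total / num_pos


scores = [0.9, 0.8, 0.3, 0.2, 0.1]
labels = [1, 0, 1, 0, 0]
print(calibrated_ap(scores, labels))  # approximately 0.875
```

In the toy example the single false positive ranked above the second positive counts as only 1/1.5 of an error, so the second cut-off has calibrated precision 0.75 instead of the uncalibrated 2/3.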

4.3 Comparison with the State-of-the-art Methods

(a) Results on ActivityNet pretrained features

| Method | THUMOS’14 mAP (%) | TVSeries mcAP (%) |
|---|---|---|
| CDC shou2017cdc | 44.4 | - |
| RED gao2017red | 45.3 | 79.2 |
| TRN xu2019temporal | 47.2 | 83.7 |
| FATS kim2021temporally | 51.6 | 81.7 |
| IDN eun2020learning | 50.0 | 84.7 |
| LAP qu2020lap | 53.3 | 85.3 |
| TFN eun2021temporal | 55.7 | 85.0 |
| LFB* wu2019long | 61.6 | 84.8 |
| LSTR (ours) | 65.3 | 88.1 |

(b) Results on Kinetics pretrained features

| Method | THUMOS’14 mAP (%) | TVSeries mcAP (%) |
|---|---|---|
| FATS kim2021temporally | 59.0 | 84.6 |
| IDN eun2020learning | 60.3 | 86.1 |
| TRN xu2019temporal | 62.1 | 86.2 |
| PKD zhao2020privileged | 64.5 | 86.4 |
| WOAD gao2020woad | 67.1 | - |
| LFB* wu2019long | 64.8 | 85.8 |
| LSTR (ours) | 69.5 | 89.1 |

Table 1: Online action detection results on THUMOS’14 and TVSeries, comparing LSTR and the state-of-the-art in mAP (%) and mcAP (%). LSTR outperforms the state-of-the-art methods on THUMOS’14 by 3.7% and 2.4% in mAP and on TVSeries by 3.1% and 2.7% in cAP, using ActivityNet and Kinetics pretrained features, respectively. *Results are reproduced using their papers’ default settings.
| Method | Input Features | 0%-10% | 10%-20% | 20%-30% | 30%-40% | 40%-50% | 50%-60% | 60%-70% | 70%-80% | 80%-90% | 90%-100% |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CNN de2016online | ActivityNet | 61.0 | 61.0 | 61.2 | 61.1 | 61.2 | 61.2 | 61.3 | 61.5 | 61.4 | 61.5 |
| RNN de2016online | ActivityNet | 63.3 | 64.5 | 64.5 | 64.3 | 65.0 | 64.7 | 64.4 | 64.4 | 64.4 | 64.3 |
| TRN xu2019temporal | ActivityNet | 78.8 | 79.6 | 80.4 | 81.0 | 81.6 | 81.9 | 82.3 | 82.7 | 82.9 | 83.3 |
| IDN eun2020learning | ActivityNet | 80.6 | 81.1 | 81.9 | 82.3 | 82.6 | 82.8 | 82.6 | 82.9 | 83.0 | 83.9 |
| TFN eun2021temporal | ActivityNet | 83.1 | 84.4 | 85.4 | 85.8 | 87.1 | 88.4 | 87.6 | 87.0 | 86.7 | 85.6 |
| LSTR (ours) | ActivityNet | 83.6 | 85.0 | 86.3 | 87.0 | 87.8 | 88.5 | 88.6 | 88.9 | 89.0 | 88.9 |
| IDN eun2020learning | Kinetics | 81.7 | 81.9 | 83.1 | 82.9 | 83.2 | 83.2 | 83.2 | 83.0 | 83.3 | 86.6 |
| PKD zhao2020privileged | Kinetics | 82.1 | 83.5 | 86.1 | 87.2 | 88.3 | 88.4 | 89.0 | 88.7 | 88.9 | 87.7 |
| LSTR (ours) | Kinetics | 84.4 | 85.6 | 87.2 | 87.8 | 88.8 | 89.4 | 89.6 | 89.9 | 90.0 | 90.1 |

Table 2: Online action detection results in cAP (%) on TVSeries when only portions of action instances are considered (e.g., 80%-90% means only frames in this range of each action instance were evaluated).

We compare LSTR against the state-of-the-art methods xu2019temporal; eun2020learning; gao2020woad on THUMOS’14, TVSeries, and HACS Segment. Specifically, on THUMOS’14 and TVSeries, we implement LSTR with the long- and short-term memories of 512 and 8 seconds, respectively. On HACS Segment, we reduce the long-term memory to 256 seconds, considering that its videos are strictly shorter than 4 minutes. For LSTR, we implement the two-stage memory compression using Transformer decoder units. We set the tokens with and and the Transformer layers with and .

THUMOS’14. We compare LSTR with recent work on THUMOS’14, including methods that use 3D ConvNets shou2017cdc, RNNs xu2017end; eun2020learning; gao2020woad, reinforcement learning gao2017red, and curriculum learning zhao2020privileged. Table 1 shows that LSTR significantly outperforms the state-of-the-art methods wu2019long; gao2020woad by 3.7% and 2.4% in terms of mAP using ActivityNet and Kinetics pretrained features, respectively.

TVSeries. Table 1 shows that LSTR outperforms the state-of-the-art methods eun2021temporal; zhao2020privileged by 3.1% and 2.7% in terms of cAP using ActivityNet and Kinetics pretrained features, respectively. Following prior work de2016online, we also investigate LSTR’s performance at different action stages by evaluating each decile (ten-percent interval) of the video frames separately. Table 2 shows that LSTR outperforms existing methods at every stage of action instances.

HACS Segment. LSTR achieves 82.6% on HACS Segment in terms of mAP using Kinetics pretrained features. Note that HACS Segment is a new large-scale dataset with only a few previous results. LSTR outperforms the existing methods RNN hochreiter1997long (77.6%) by 5.0% and TRN xu2019temporal (78.9%) by 3.7%.

4.4 Design Choices of Long- and Short-Term Memories

| Temporal stride | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 |
|---|---|---|---|---|---|---|---|---|
| LSTR mAP (%) | 69.5 | 69.5 | 69.5 | 69.2 | 68.7 | 67.3 | 66.6 | 65.9 |

Table 3: Results of LSTR using downsampled long-term memory on THUMOS’14 in mAP (%). We use a long-term memory of 512 seconds and a short-term memory of 8 seconds.

We experiment for design choices of long- and short-term memories. Unless noted otherwise, we use THUMOS’14, which contains various video lengths, and Kinetics pretrained features.

Lengths of long- and short-term memories. We first analyze the effect of different lengths of long-term and short-term memory. In particular, we test seconds with starting from 0 second (no long-term memory). Note that we choose the max length (1024 seconds for THUMOS’14 and 256 seconds for HACS Segment) to cover length of 98% videos, and do not have proper datasets to test longer . Fig. 3 shows that LSTR is beneficial from larger in most cases. In addition, when is short ( in our cases), using larger obtains better results and when is sufficient ( in our cases), increasing does not always guarantee better performance.

Figure 3: Effect of using different lengths of long- and short-term memories.

Can we downsample long-term memory? We implement LSTR with $m_S$ as 8 seconds and $m_L$ as 512 seconds, and test the effect of downsampling the long-term memory. Table 3 shows that downsampling with strides up to 4 does not cause a performance drop, but more aggressive strides dramatically decrease the detection accuracy. Note that, when extracting frame features at 4 FPS, both the LSTR encoder and downsampling with stride 128 compress the long-term memory to 16 features, but LSTR achieves much better performance (69.5% vs. 65.9% in mAP). This demonstrates the effectiveness of our “adaptive compression” compared to heuristic downsampling.

Can we compensate reduced memory length with RNN? We note that LSTR’s performance notably decreases when it can only access very limited memory (e.g., seconds). Here we test whether an RNN can be used to compensate for LSTR’s reduced memory or even fully replace the LSTR encoder. We implement LSTR using seconds with an extra Gated Recurrent Unit (GRU) chung2014empirical (its architecture is visualized in Fig. 6) to capture all history outside the long- and short-term memory. The dashed line in Fig. 3 shows the results. Plugging in RNNs indeed improves the performance when $m_L$ is small, but when $m_L$ is large (i.e., > 64 seconds), it no longer improves the accuracy. Note that RNNs are not used in any other experiments in this paper.

4.5 Design Choices of LSTR

Length of long-term memory (secs) in the column headers; results in mAP (%).

| Row | LSTR Encoder | LSTR Decoder | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | N/A | TR Encoder | 65.7 | 66.8 | 67.1 | 67.2 | 67.3 | 66.8 | 66.5 | 66.2 |
| 2 | N/A | TR Decoder | 66.5 | 67.3 | 67.7 | 68.1 | 68.3 | 67.9 | 67.0 | 66.5 |
| 3 | TR Encoder | TR Decoder | 65.9 | 66.4 | 66.7 | 67.4 | 67.5 | 67.2 | 67.0 | 66.6 |
| 4 | TR Decoder | TR Decoder | 66.1 | 67.1 | 67.4 | 68.0 | 68.5 | 68.6 | 68.7 | 68.7 |
| 5 | TR Decoder + TR Encoder | TR Decoder | 66.2 | 67.3 | 67.6 | 68.4 | 68.6 | 68.8 | 68.9 | 69.0 |
| 6 | TR Decoder + TR Decoder | N/A | 64.0 | 64.7 | 65.9 | 66.1 | 66.5 | 66.2 | 65.4 | 65.2 |
| 7 | TR Decoder + TR Decoder | TR Decoder | 66.6 | 67.8 | 68.2 | 68.8 | 69.2 | 69.4 | 69.5 | 69.5 |

Table 4: Results of different designs of the LSTR encoder and decoder. The length of short-term memory is set to 8 seconds. “TR” denotes Transformer. The last row is our best performance.

We continue to explore the design trade-offs of LSTR. Unless noted otherwise, we use short-term memory of 8 seconds, long-term memory of 512 seconds, and Kinetics pretrained features.

Number of layers and tokens. First, we test to use different numbers of token embeddings (i.e., and ) in LSTR encoder. Fig 4 (left) shows that LSTR is quite robust to different choices (the best and worst performance gap is only about ), but using and gets highest accuracy. Second, we experiment for the effect of using different number of Transformer decoder units (i.e., and ) in LSTR. As shown in Fig 4 (right), LSTR does not need a large model to get best performance, and in practice, using more layers can easily cause the overfitting problem.

Figure 4: Left: Results of different number of token embeddings for our two-stage memory compression. Right: Results of different number of Transformer decoder units for and .

Can we unify the temporal modeling using only self-attention models? We test whether long-term and short-term memory can be learned as a whole using self-attention models. Specifically, we concatenate the long- and short-term memories and feed them into a standard Transformer encoder vaswani2017attention with a model size similar to LSTR. Table 4 (row 7 vs. row 1) shows that LSTR achieves better performance, especially when the long-term memory is large (e.g., 69.5 vs. 66.5 at 512 seconds and 69.5 vs. 66.2 at 1024 seconds). This demonstrates the advantages of LSTR for temporal modeling on long- and short-term context.

Can we remove the LSTR encoder? We explore this by directly feeding the long-term memory into the LSTR decoder and using the short-term memory as tokens to reference useful information. Table 4 shows that LSTR outperforms this baseline (row 7 vs. row 2), especially when the long-term memory is large. We also compare it with self-attention models (row 1) and observe that, although neither of them can effectively model prolonged memory, this baseline outperforms the Transformer encoder. This also demonstrates the effectiveness of our idea of using short-term memory to query related context from long-range context.

Can the LSTR encoder learn effectively using only self-attention? To evaluate the “bottleneck” design in the LSTR encoder, we implement a baseline that models the long-term memory using standard Transformer encoder units vaswani2017attention. Note that it still captures the long- and short-term memories with a workflow similar to LSTR, but it does not compress and encode the long-term memory using learnable tokens. The comparison (row 3 vs. row 7) in Table 4 shows that LSTR outperforms this baseline in all settings. The performance of this baseline also decreases as the long-term memory gets larger, which further suggests the superior ability of LSTR for modeling long-range patterns.

How to design the memory compression for the LSTR encoder? First, we test the effectiveness of the single-stage design with Transformer decoder units. Table 4 shows that two-stage memory compression (row 7) is consistently better than single-stage (row 4), and interestingly, the performance gap gets larger with longer long-term memory (0.5% at 8 seconds and 0.8% at 1024 seconds). Second, we compare cross-attention with self-attention for two-stage compression by replacing the second Transformer decoder units with Transformer encoder units. Table 4 shows that LSTR consistently outperforms this baseline (row 5) by about 0.5% in mAP. However, this baseline still outperforms the models with one-stage compression by about 1.3% (row 5 vs. row 3) and 0.2% (row 5 vs. row 4) on average.
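The two-stage compression and the short-term query described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes single-head dot-product attention with no learned projections, no feed-forward layers, and random vectors standing in for the learnable tokens. The names `cross_attention`, `N_STAGE1`, `N_STAGE2`, and all sizes are hypothetical.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, memory):
    """Single-head scaled dot-product attention without learned projections.

    Each query attends over all memory vectors and returns a weighted
    average of them, so the output has one vector per query.
    """
    dim = len(memory[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in memory]
        weights = softmax(scores)
        outputs.append([sum(w * v[d] for w, v in zip(weights, memory)) for d in range(dim)])
    return outputs


def random_vectors(count, dim, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
DIM = 8        # feature size (hypothetical)
N_STAGE1 = 4   # tokens produced by the first stage (hypothetical)
N_STAGE2 = 6   # tokens produced by the second stage (hypothetical)

long_memory = random_vectors(64, DIM, rng)    # stand-in for long-term features
short_memory = random_vectors(5, DIM, rng)    # stand-in for short-term features
tokens1 = random_vectors(N_STAGE1, DIM, rng)  # random stand-ins for learnable tokens
tokens2 = random_vectors(N_STAGE2, DIM, rng)

# Stage 1: a small token set reads the whole long-term memory once.
stage1 = cross_attention(tokens1, long_memory)
# Stage 2: a second token set reads only the stage-1 output.
stage2 = cross_attention(tokens2, stage1)
# Decoder step: short-term memory queries the compressed long-term memory.
decoded = cross_attention(short_memory, stage2)

print(len(stage1), len(stage2), len(decoded))  # prints: 4 6 5
```

In this sketch only stage 1 touches the full long-term memory; every later step works on a handful of tokens, which is the property the single-stage and self-attention baselines lack.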

Can we remove the LSTR decoder? We remove the LSTR decoder to evaluate its contribution. Specifically, we feed the entire memory to the LSTR encoder and attach a non-linear classifier to its output token embeddings. As in the experiments above, we increase the model size to ensure a fair comparison. Table 4 shows that LSTR outperforms this baseline (row 6) by about 4% with large long-term memory (e.g., 512 and 1024 seconds) and by about 3% with relatively small long-term memory (e.g., 8 and 16 seconds).
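The row-to-row comparisons in this section can be checked against Table 4 with a few lines of arithmetic. The sketch below copies the table values as printed and reports, for each baseline row, the per-length gap to the full model (row 7) and its mean. The helper names `gaps` and `mean` are illustrative only.

```python
# Values copied from Table 4; keys are row numbers, columns follow `lengths`.
lengths = [8, 16, 32, 64, 128, 256, 512, 1024]
table4 = {
    1: [65.7, 66.8, 67.1, 67.2, 67.3, 66.8, 66.5, 66.2],
    2: [66.5, 67.3, 67.7, 68.1, 68.3, 67.9, 67.0, 66.5],
    3: [65.9, 66.4, 66.7, 67.4, 67.5, 67.2, 67.0, 66.6],
    4: [66.1, 67.1, 67.4, 68.0, 68.5, 68.6, 68.7, 68.7],
    5: [66.2, 67.3, 67.6, 68.4, 68.6, 68.8, 68.9, 69.0],
    6: [64.0, 64.7, 65.9, 66.1, 66.5, 66.2, 65.4, 65.2],
    7: [66.6, 67.8, 68.2, 68.8, 69.2, 69.4, 69.5, 69.5],
}


def gaps(row_a, row_b):
    """Per-column difference row_a - row_b, rounded to one decimal."""
    return [round(a - b, 1) for a, b in zip(table4[row_a], table4[row_b])]


def mean(values):
    return round(sum(values) / len(values), 2)


for baseline in range(1, 7):
    diff = gaps(7, baseline)
    print(f"row 7 - row {baseline}: {diff} (mean {mean(diff)})")
```

Reading the gaps column by column shows how each baseline falls further behind as the long-term memory grows, which is the trend the ablations emphasize.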

4.6 Error Analysis

Action Classes | HammerThrow | PoleVault | LongJump | Diving | BaseballPitch | FrisbeeCatch | Billiards | CricketShot
AP | 92.8 | 89.7 | 86.9 | 86.7 | 55.4 | 49.4 | 39.8 | 38.6
Table 5: Action classes with highest and lowest performance on THUMOS’14 in AP (%).
Figure 5: Failure cases on THUMOS’14. Action classes from left to right are “BaseballPitch”, “FrisbeeCatch”, “Billiards”, and “CricketShot”. Red circle indicates where the action is happening.

In Table 5, we list the action classes from THUMOS’14 THUMOS14 where LSTR gets the highest (first four) and the lowest (last four) per-frame APs. In Fig. 5, we illustrate four sample frames with incorrect predictions. More visualizations are included in Sec. 6.3. We observe that LSTR's detection accuracy drops when the action involves only tiny motion or the subject is very far from the camera, but that it excels at recognizing actions with a long temporal span and multiple stages, such as “PoleVault” and “LongJump”. This suggests one direction for exploration: extending the modeling capability of LSTR to both the spatial and temporal domains.

5 Conclusion

We presented Long Short-term TRansformer (LSTR) for online action detection, which divides the historical observations into long- and short-term memories. LSTR compresses the long-term memory into encoded latent features, and then references related temporal context from them with short-term memory. Experimental results demonstrated that our memory design and the joint modeling of long- and short-term temporal information are beneficial for online action detection. Ablation studies showed LSTR’s superior ability to deal with prolonged sequences with less heuristic algorithm design.

6 Appendix

6.1 Runtime

LSTR Encoder | LSTR Decoder | Optical Flow Computation (FPS) | RGB Feature Extraction (FPS) | Optical Flow Feature Extraction (FPS) | Action Detection Model (FPS)
N/A | TR Encoder | 8.1 | 70.5 | 14.6 | 43.2
TR Encoder | TR Decoder | | | | 50.2
TR Decoder | TR Decoder | | | | 59.5
TR Decoder + TR Decoder | TR Decoder | | | | 91.6
Table 6: Runtime of LSTR using a single V100 GPU on THUMOS’14.

We test the runtime (measured in frames per second, or FPS) of LSTR using our default design choices in Sec. 4.3. This experiment is conducted on a system with a single V100 GPU. More details about the experimental settings are discussed in Sec. 4.2. The results are shown in Table 6.

We start by evaluating LSTR’s runtime without considering the pre-processing (e.g., feature extraction). LSTR can run at 91.6 FPS on average (row 4). We compare the runtime of the different design choices discussed in Sec. 4.5. First, we test LSTR with a one-stage encoder (row 3), and the results show that LSTR with the two-stage design runs much faster (91.6 FPS vs. 59.5 FPS). The main reason is that, with two-stage compression, the LSTR encoder does not need to reference information from the long-term memory multiple times, and it can be further accelerated during online inference (discussed in Sec. 3.6). Second, we test implementing the LSTR encoder using only self-attention (row 2). This design does not compress the long-term memory, which increases the computational cost of both the LSTR encoder and decoder (50.2 FPS vs. 91.6 FPS). Third, we compare LSTR with the Transformer encoder vaswani2017attention (row 1), which is about 2× slower than LSTR (row 4).
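A rough way to see why compressing the long-term memory reduces cost is to count query-key pairs per attention pass. The sketch below is a back-of-the-envelope illustration under stated assumptions, not a measurement. It uses the 2048 long-range and 32 short-range frames quoted in the abstract, hypothetical token counts `n0` and `n1`, a single layer per stage, and it ignores projections and feed-forward layers.

```python
def self_attention_pairs(num_tokens):
    """Every token attends to every token."""
    return num_tokens * num_tokens


def cross_attention_pairs(num_queries, num_keys):
    """Every query attends to every key."""
    return num_queries * num_keys


m_long, m_short = 2048, 32   # frame counts quoted in the abstract
n0, n1 = 16, 32              # token counts per compression stage (hypothetical)

# Baseline: one self-attention pass over the concatenated memories.
pairs_concat = self_attention_pairs(m_long + m_short)

# Two-stage compression followed by a decoder pass over short-term memory.
pairs_two_stage = (
    cross_attention_pairs(n0, m_long)       # stage 1 reads the long-term memory
    + cross_attention_pairs(n1, n0)         # stage 2 reads the stage-1 tokens
    + self_attention_pairs(m_short)         # short-term self-attention
    + cross_attention_pairs(m_short, n1)    # short-term queries compressed memory
)

print(pairs_concat, pairs_two_stage)
```

Under these assumptions the concatenated design scales quadratically with the long-term memory length, while the compressed design scales linearly with it. The measured speedups in Table 6 are smaller than this count alone would suggest, since attention scores are only part of the total cost.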

We also compare LSTR with state-of-the-art recurrent models. We are not aware of any prior work that reports their runtime, thus we test TRN xu2019temporal using their official open-source code trn. The result shows that TRN can run at 123.3 FPS, which is faster than LSTR. This is because recurrent models abstract the entire history as a compact representation but LSTR needs to process much more information. On the other hand, LSTR achieves much higher performance than TRN, outperforming by about in mAP on THUMOS’14 and about in cAP on TVSeries.

For end-to-end testing, we follow the experimental settings of state-of-the-art methods xu2019temporal; eun2020learning; qu2020lap; zhao2020privileged; gao2020woad and build LSTR on two-stream features. Specifically, the visual and motion features are extracted using ResNet-50 he2016deep and BNInception szegedy2016inception, respectively. End-to-end online inference runs at 4.6 FPS when LSTR is combined with the pre-processing steps. The speed bottleneck is motion feature extraction: including the optical flow computation with DenseFlow denseflow, it accounts for about 90% of the total runtime. Efficiency can be improved substantially by using real-time optical flow extractors (e.g., PWC-Net sun2018pwc) or by using only visual features extracted by a light-weight backbone (e.g., MobileNet howard2017mobilenets and FBNet wu2019fbnet). Building an end-to-end model with light-weight backbones could be one direction toward real-time online action detection.
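The end-to-end figure can be related to the per-stage numbers in Table 6 with a short calculation. The sketch below assumes the four stages run one after another for every frame, so per-frame times add up; this sequential pipeline is an assumption for illustration, not something the text states.

```python
# Per-stage throughput from Table 6 (row 4 for the detection model), in FPS.
stage_fps = {
    "optical_flow_computation": 8.1,
    "rgb_feature_extraction": 70.5,
    "optical_flow_feature_extraction": 14.6,
    "action_detection_model": 91.6,
}

# Assuming sequential stages, seconds per frame add up across stages.
seconds_per_frame = {name: 1.0 / fps for name, fps in stage_fps.items()}
total_seconds = sum(seconds_per_frame.values())
end_to_end_fps = 1.0 / total_seconds

# Share of time spent on the motion stream (flow computation + flow features).
motion_seconds = (
    seconds_per_frame["optical_flow_computation"]
    + seconds_per_frame["optical_flow_feature_extraction"]
)
motion_share = motion_seconds / total_seconds

print(f"end-to-end: {end_to_end_fps:.1f} FPS")
print(f"motion stream share of runtime: {motion_share:.0%}")
```

Under this assumption the calculation gives roughly 4.6 FPS, with the motion stream taking close to 90% of the per-frame time, in line with the numbers reported above.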

6.2 Can we compensate reduced memory length with RNN? Cont’d

Figure 6: Overview of the architecture of using LSTR with an extra GRU.

To better understand the design in Sec. 4.4, we show its overall structure in Fig 6. Specifically, in addition to the long- and short-term memories, we use an extra GRU to capture all the history “outside” the long- and short-term memories as a compact representation, . We then concatenate the outputs of the LSTR encoder and the GRU as more comprehensive temporal features of size , and feed them into the LSTR decoder as input tokens.

6.3 Qualitative Results

Fig. 7 shows some qualitative results. In most cases, LSTR quickly recognizes the actions and makes relatively consistent predictions for each action instance. Two typical failure cases are also shown in Fig. 7. The top sample contains the “Billiards” action, which involves only tiny motion. As discussed in Sec. 4.6, LSTR’s detection accuracy is observed to decrease on this kind of action. The bottom sample is challenging: the “Open door” action occurs behind the female reporter and is barely visible. The red circle indicates where the action is happening in each frame.

(a) Qualitative results on THUMOS’14 (top two samples) and TVSeries (bottom two samples).
(b) Failure cases on THUMOS’14 (top sample) and TVSeries (bottom sample).
Figure 7: Qualitative results and failure cases of LSTR. The curves indicate the predicted scores of the ground truth and “background” classes. (Best viewed in color.)