1 Introduction
Given an incoming stream of video frames, online action detection de2016online is concerned with classifying what is happening at each frame without seeing the future. It is a causal reasoning problem, different from offline action recognition where the entire video snippet is available. In this paper, we tackle the online action detection problem with a new temporal modeling algorithm that captures temporal correlations over prolonged sequences up to 8 minutes long, while retaining fine granularity of the event in the representation. This is achieved by modeling activities at different temporal scales, which is necessary to capture the variety of events: some occur in brief bursts, while others manifest as slowly varying trends.
Specifically, we propose a new model, named Long Short-term TRansformer (LSTR), to jointly model the long- and short-term temporal dependencies. LSTR has two advantages over previous work. 1) Storing the history in a “plaintext” manner avoids the difficulty of learning with recurrent models elman1990finding; schuster1997bidirectional; hochreiter1997long; chung2014empirical (backpropagation through time, BPTT, is not needed); the model can directly attend to any useful frame in the long-term memory. 2) Separating the long- and short-term memories allows us to conduct focused modeling of the short-term information while simultaneously extracting well-informed, compressed knowledge from the long-term history. This means we can compress the long-term history without worrying about losing historical information that is important for classification at the current time.
As shown in Fig. 1, we explicitly divide the entire history into long- and short-term memories and build our model with an encoder-decoder architecture. Specifically, the LSTR encoder compresses and abstracts the long-term memory into a fixed-length latent representation, and the LSTR decoder uses a short window of recent frames that performs self-attention in addition to cross-attention over the token embeddings produced by the LSTR encoder. In the LSTR encoder, we devise a two-stage memory compression that makes an extended temporal support practical for untrimmed, streaming videos and is computationally efficient in both training and inference. The overall long short-term Transformer architecture gives rise to an effective and efficient representation for modeling prolonged sequence data.
To evaluate the effectiveness of our model, we validate LSTR on three standard datasets (THUMOS’14 THUMOS14, TVSeries de2016online, and HACS Segment Zhao_2019_ICCV), which differ markedly in video length (from a few seconds to tens of minutes). Experimental results show that LSTR significantly outperforms all state-of-the-art methods for online action detection, and the ablation studies further demonstrate LSTR’s superior performance in modeling long video sequences.

2 Related Work
Online Action Detection. Temporal action localization targets detecting the temporal boundaries of action instances after observing the entire video shou2016temporal; xu2017r; gao2017turn; shou2017cdc; Zhao_2017_ICCV; buch2017sst; lin2018bsn; liu2019multi; lin2019bmn. Emerging real-world applications, such as robotic perception yao2019unsupervised, require causal processing, where only past and current frames are used for detecting actions of the present. There is a rich body of recent work on this online action detection problem de2016online. RED gao2017red uses a reinforcement loss to encourage recognizing actions as early as possible. TRN xu2019temporal models greater temporal context by simultaneously performing online action detection and anticipation. IDN eun2020learning learns discriminative features and accumulates only the information relevant to the present. LAP-Net qu2020lap proposes an adaptive sampling strategy to obtain optimal features. PKD zhao2020privileged transfers knowledge from offline to online models using curriculum learning. As with early action detection hoai2014max; ma2016learning, Shou et al. shou2018online focus on online detection of action start (ODAS). StartNet gao2019startnet decomposes ODAS into two stages and learns with policy gradient. WOAD gao2020woad uses weakly-supervised learning with video-level labels.
Temporal/Sequence Modeling. Modeling temporal information, especially from long-term context, is important but challenging for many video understanding tasks wu2019long; oh2019video; wu2021long. Prior work on action recognition mostly relies on heuristic sub-sampling (typically 3 to 7 video frames) for more feasible training yue2015beyond; wang2016temporal; feichtenhofer2017spatiotemporal; miech2017learnable; tang2018non, or on recurrent networks hochreiter1997long; chung2014empirical that maintain the entire history in a summary state, at the cost of losing track of the specifics of individual past time steps. 3D CNNs carreira2017quo; tran2018closer; xie1712rethinking are also widely used to perform spatio-temporal feature modeling, but they are limited by their receptive field and hence unable to model long-range temporal context. Wu et al. wu2019long propose a long-term feature bank for generic video understanding, which stores various supportive information, such as object and scene features, as context references. However, these features are usually pre-computed independently and carry no position or order information, so they cannot provide an effective representation of the full context. Moreover, most of the above work does not explicitly integrate long- and short-term features, but combines them with simple mechanisms, such as pooling and concatenation. Recent findings in cognitive science on the underlying mechanisms of working memory and attention oberauer2019working; cowan1998attention; chun2011visual; kiyonaga2013working have also shed light on design principles for modeling long-term information with attention wang2018non; wu2019long. Compared with RNNs, the attention mechanism is able to maintain long-term information at fast inference speed vaswani2017attention.
Transformers for Action Understanding. Transformers have achieved breakthrough success in NLP radford2018improving; devlin2018bert
and are adopted in computer vision for image recognition
dosovitskiy2020image; touvron2020training and object detection carion2020end. Recent papers exploit Transformers for temporal modeling in videos, such as action recognition neimark2021video; sharir2021image; li2021vidtr; bertasius2021space; arnab2021vivit and temporal action localization nawhal2021activity; tan2021relaxed, and achieve state-of-the-art results on common benchmarks. However, most of these works only perform temporal modeling on relatively short video clips due to their computational demands and limited modeling capability. Several works dai2019transformer; burtsev2020memory focus on designing Transformers to model long-term dependencies, but how to aggregate long- and short-term information is still not well explored jaegle2021perceiver.
3 Long Short-Term Transformer

Given a live streaming video, our goal is to identify the actions performed in each video frame using only past and current observations; future information is not accessible during inference. Formally, a streaming video at time $t$ is represented by a batch of past frames $\mathbf{f}_{t^-}$, which reads “$\mathbf{f}$ up to time $t$.” The online action detection system receives $\mathbf{f}_{t^-}$ as input and classifies the action category $y_t$ belonging to one of $K+1$ classes, $y_t \in \{0, 1, \dots, K\}$, ideally using the posterior probability $p(y_t \mid \mathbf{f}_{t^-})$, where $y_t = 0$ denotes that no event is occurring at frame $t$. We design our method by assuming that there is a pretrained feature extractor wang2016temporal that processes each video frame $\mathbf{f}_i$ into a feature vector $\mathbf{x}_i$ of $D$ dimensions.[^1] These vectors form a $D$-dimensional temporal sequence that serves as the input to our method.

[^1]: In practice, some feature extractors tran2015learning take consecutive frames to produce one feature vector. Nonetheless, the resulting vector is still temporally “centered” on a single frame, so we use the single-frame notation here for simplicity.

3.1 Overview
Our method is based on the intuition that recently observed frames provide precise information about the ongoing action instance, while frames over an extended period offer contextual references for actions that are potentially happening right now. We propose the Long Short-term TRansformer (LSTR) with an explicit encoder-decoder structure, as shown in Fig. 2. In particular, the feature vectors of frames in the distant past are stored in a long-term memory, and a short-term memory stores the features of recent frames. The LSTR encoder compresses and abstracts the features in the long-term memory into an encoded latent representation of a small, fixed number of vectors. The LSTR decoder queries the encoded long-term memory with the short-term memory for decoding, leading to the action prediction $\hat{\mathbf{p}}_t$. This design follows the line of thought of combining long- and short-term information for action understanding donahue2015long; wang2016temporal; wu2019long, but addresses several key challenges to achieve this goal effectively and efficiently, thanks to the modeling flexibility brought by the recently emerged Transformers vaswani2017attention.
3.2 Long- and Short-Term Memories
We store the streaming input of feature vectors in two consecutive memories. The first is the short-term memory, which stores only a small number of recently observed frames. We implement it as a first-in-first-out (FIFO) queue of $m_S$ slots. At time $t$, it stores the feature vectors $\{\mathbf{x}_{t - m_S + 1}, \dots, \mathbf{x}_t\}$. When a frame becomes “older” than $m_S$ time steps, it graduates from the short-term memory and enters the long-term memory, which is implemented as another FIFO queue of $m_L$ slots. The long-term memory thus stores $\{\mathbf{x}_{t - m_S - m_L + 1}, \dots, \mathbf{x}_{t - m_S}\}$. The long-term memory serves as the input memory to the LSTR encoder, and the short-term memory serves as the queries for the LSTR decoder. In practice, the long-term memory spans a much longer time than the short-term memory ($m_L \gg m_S$). A typical choice is $m_L = 2048$, which represents 512 seconds worth of video content at a 4 frames per second (FPS) sampling rate, and $m_S = 32$, representing 8 seconds. We add a sinusoidal positional encoding vaswani2017attention to each frame feature in the memories, relative to the current time (i.e., the frame that is $i$ steps older than time $t$ receives the positional embedding of position $i$).
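As a concrete illustration, the following is a minimal sketch (not the authors’ released code) of how the two FIFO memories and the relative positional encoding could be maintained in PyTorch. The class name `StreamingMemory` and the default sizes `m_long=2048`, `m_short=32`, `dim=1024` are illustrative assumptions.

```python
from collections import deque
import math
import torch

def sinusoidal_encoding(position: int, dim: int) -> torch.Tensor:
    # Standard sinusoidal embedding for a single relative position (dim assumed even).
    pe = torch.zeros(dim)
    div = torch.exp(torch.arange(0, dim, 2).float() * (-math.log(10000.0) / dim))
    pe[0::2] = torch.sin(position * div)
    pe[1::2] = torch.cos(position * div)
    return pe

class StreamingMemory:
    """Two consecutive FIFO queues: recent frames (short) and older frames (long)."""
    def __init__(self, m_long: int = 2048, m_short: int = 32, dim: int = 1024):
        self.short = deque(maxlen=m_short)
        self.long = deque(maxlen=m_long)
        self.dim = dim

    def push(self, feature: torch.Tensor):
        # A frame evicted from the short-term queue graduates into the long-term queue.
        if len(self.short) == self.short.maxlen:
            self.long.append(self.short[0])
        self.short.append(feature)

    def tensors(self):
        # Positional encodings are relative to the current time t:
        # a frame that is i steps old receives the embedding of position i.
        def stack(queue, offset):
            feats = list(queue)  # oldest first
            if not feats:
                return torch.empty(0, self.dim)
            pos = [sinusoidal_encoding(offset + len(feats) - 1 - j, self.dim)
                   for j in range(len(feats))]
            return torch.stack(feats) + torch.stack(pos)
        return stack(self.long, len(self.short)), stack(self.short, 0)
```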
3.3 LSTR Encoder
The LSTR encoder aims at encoding the long-term memory of $m_L$ feature vectors into a latent representation from which LSTR can decode useful temporal context. This task requires large capacity for capturing the relations and temporal context among a span of hundreds or even thousands of frames. Prior work on modeling long-term dependencies for action understanding relies on heuristic temporal sub-sampling wu2019long; wang2016temporal or recurrent networks donahue2015long to make training more feasible, at the cost of losing the specific information of each time step. Attention-based architectures, such as the Transformer vaswani2017attention, have recently shown promise for similar tasks that require long-range temporal modeling sharir2021image. A straightforward choice for the LSTR encoder would be a Transformer encoder based on self-attention. However, its time complexity, $O(m_L^2)$, grows quadratically with the memory length $m_L$. This limits our ability to model a long-term memory of sufficient length to cover long videos. Although recent work wang2020linformer explores self-attention with linear complexity, repeatedly referencing information from the long-term memory with multi-layer Transformers is still computationally heavy. In LSTR, we propose a two-stage memory compression based on Transformer decoder units vaswani2017attention to achieve more effective memory encoding.
The Transformer decoder unit vaswani2017attention takes two sets of inputs. The first set includes a fixed number of $n$ learnable output tokens $\mathbf{Q} \in \mathbb{R}^{n \times D}$, where $D$ is the embedding dimension. The second set includes another $m$ input tokens $\mathbf{X} \in \mathbb{R}^{m \times D}$, where $m$ can be a rather large number. The unit first applies one layer of multi-head self-attention on $\mathbf{Q}$. The outputs are then used as queries in a “QKV cross-attention” operation, in which the input embeddings $\mathbf{X}$ serve as key and value. The two steps can be written as
$$\tilde{\mathbf{Q}} = \phi\big(\mathrm{SelfAttn}(\mathbf{Q})\big), \qquad \mathbf{O} = \mathrm{CrossAttn}\big(\tilde{\mathbf{Q}}, \mathbf{X}, \mathbf{X}\big),$$
where $\phi$ denotes the intermediate layers between the two attention operations. One appealing property of this design is that it transforms the $m$ input tokens into $n$ output tokens of $D$ dimensions in $O(n^2 + nm)$ time. When $n \ll m$, the time complexity becomes linear in $m$, making it an ideal candidate for compressing the long-term memory. This property is also utilized in jaegle2021perceiver to efficiently process large-volume inputs, such as image pixels.
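A minimal PyTorch sketch of such a Transformer decoder unit is shown below. The layer arrangement (post-norm residual blocks, a simple feed-forward block standing in for the intermediate layers $\phi$) and the default sizes are illustrative assumptions rather than the exact configuration used in the paper.

```python
import torch
import torch.nn as nn

class DecoderUnit(nn.Module):
    """n learnable output tokens attend to themselves, then cross-attend to m inputs."""
    def __init__(self, dim: int = 1024, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, queries: torch.Tensor, inputs: torch.Tensor) -> torch.Tensor:
        # queries: (B, n, D) output tokens; inputs: (B, m, D), where m may be large.
        q = self.norm1(queries + self.self_attn(queries, queries, queries)[0])  # O(n^2)
        q = self.norm2(q + self.cross_attn(q, inputs, inputs)[0])               # O(n * m)
        return self.norm3(q + self.ffn(q))                                      # (B, n, D)
```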
Two-Stage Memory Compression. Stacking multiple Transformer decoder units on the long-term memory, as in vaswani2017attention, forms a memory encoder with linear complexity with respect to the memory size $m_L$. However, running this encoder at every time step can still be time consuming. We further reduce the time complexity with a two-stage memory compression design. The first stage has one Transformer decoder unit with $n_1$ output tokens; its input tokens are the whole long-term memory of size $m_L$. The outputs of the first stage are used as the input tokens of the second stage, which has a stack of Transformer decoder units with $n_2$ output tokens. In this way, the long-term memory of size $m_L$ is compressed into a latent representation of size $n_2$, which can then be efficiently queried by the LSTR decoder. This two-stage memory compression design is illustrated in Fig. 2.
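Building on the `DecoderUnit` sketch above, the two-stage compression could look roughly as follows; the token counts `n1=32`, `n2=16` and the number of second-stage layers are placeholders, not the values used in the paper.

```python
import torch
import torch.nn as nn

class TwoStageCompression(nn.Module):
    """Compress a long-term memory of m_L frames into n2 latent vectors."""
    def __init__(self, dim: int = 1024, n1: int = 32, n2: int = 16, num_layers: int = 2):
        super().__init__()
        self.tokens1 = nn.Parameter(torch.randn(1, n1, dim))  # stage-1 learnable output tokens
        self.tokens2 = nn.Parameter(torch.randn(1, n2, dim))  # stage-2 learnable output tokens
        self.stage1 = DecoderUnit(dim)                        # DecoderUnit sketched above
        self.stage2 = nn.ModuleList(DecoderUnit(dim) for _ in range(num_layers))

    def forward(self, long_memory: torch.Tensor) -> torch.Tensor:
        # long_memory: (B, m_L, D) -> compressed latent: (B, n2, D)
        B = long_memory.size(0)
        z = self.stage1(self.tokens1.expand(B, -1, -1), long_memory)  # touches the m_L frames once
        out = self.tokens2.expand(B, -1, -1)
        for layer in self.stage2:
            out = layer(out, z)  # each layer attends only to the n1 stage-1 outputs
        return out
```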
Compared to an $l$-layer Transformer encoder with $O(l\,m_L^2)$ time complexity, or $l$ stacked Transformer decoder units with $n$ output tokens having $O(l\,n\,m_L)$ time complexity, the proposed LSTR encoder has a complexity of $O(n_1 m_L + l\,n_1 n_2)$. Because both $n_1$ and $n_2$ are much smaller than $m_L$, and $l$ is usually larger than one, the two-stage memory compression can be more efficient. In Sec. 3.6, we show that, during online inference, it further enables us to reduce the runtime of the first-stage Transformer decoder unit. In Sec. 4.5, we empirically find that this design also leads to better performance for online action detection.
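As a rough worked example (with $n_1 = 32$, $n_2 = 16$, and $l = 2$ chosen purely for illustration, not as the paper’s settings), for the typical $m_L = 2048$:

$$\underbrace{l\,m_L^2}_{\text{self-attention encoder}} = 2 \cdot 2048^2 \approx 8.4\times 10^{6}, \quad \underbrace{l\,n_1 m_L}_{\text{stacked decoder units}} = 2 \cdot 32 \cdot 2048 \approx 1.3\times 10^{5}, \quad \underbrace{n_1 m_L + l\,n_1 n_2}_{\text{two-stage}} = 65{,}536 + 1{,}024 \approx 6.7\times 10^{4},$$

so the long-term memory is touched only once, and all subsequent layers operate on a few dozen tokens.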
3.4 LSTR Decoder
The short-term memory contains informative features for classifying actions at the latest time step. The LSTR decoder uses the short-term memory as queries to retrieve useful information from the encoded long-term memory produced by the LSTR encoder. The LSTR decoder is formed by stacking Transformer decoder units: it takes the outputs of the LSTR encoder as input tokens and the feature vectors in the short-term memory as output tokens. It outputs $m_S$ probability vectors, each representing the predicted probability distribution over the $K$ action categories and one “background” class at the corresponding time step. During inference, we only take the probability vector from the output token corresponding to the current time $t$ as the classification result. However, having the additional outputs on older frames allows the model to leverage more supervision signals during training; the details are described below.

3.5 Training LSTR
LSTR can be trained without temporal unrolling and backpropagation through time (BPTT) as in LSTM hochreiter1997long, which is a common property of Transformers vaswani2017attention. We construct each training sample by randomly sampling an ending time $t$ and filling the long- and short-term memories by tracing back $m_L + m_S$ frames in time. We use the empirical cross-entropy loss between the predicted probability distribution at time $t$ and the ground-truth action label as

$$\mathcal{L}_t = -\sum_{k=0}^{K} y_{t,k} \log \hat{p}_{t,k}, \tag{1}$$

where $\hat{p}_{t,k}$ is the $k$-th element of the probability vector $\hat{\mathbf{p}}_t$ predicted on the latest frame at time $t$, and $y_{t,k}$ is the corresponding ground-truth indicator. Additionally, we add a directional attention mask vaswani2017attention to the short-term memory so that any frame in the short-term memory can only depend on its previous frames. In this way, we can make predictions on all frames in the short-term memory as if each of them were the latest one. Thus we can provide supervision on every frame in the short-term memory, and the complete loss function is then

$$\mathcal{L} = \sum_{i=0}^{m_S - 1} \mathcal{L}_{t-i} = -\sum_{i=0}^{m_S - 1} \sum_{k=0}^{K} y_{t-i,k} \log \hat{p}_{t-i,k}, \tag{2}$$

where $\hat{\mathbf{p}}_{t-i}$ denotes the prediction from the output token corresponding to time $t-i$.
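A hedged sketch of this training objective follows: the decoder runs over the short-term memory with a causal (directional) mask, and the per-frame cross-entropy of Eq. (1) is applied at every short-term position as in Eq. (2). The `decoder` and `classifier` arguments are assumed modules (e.g., a stack of batch-first Transformer decoder layers and a linear head); they stand in for the LSTR decoder, not its exact implementation.

```python
import torch
import torch.nn.functional as F

def lstr_training_loss(decoder, classifier, short_memory, encoded_long, labels):
    # short_memory: (B, m_S, D); encoded_long: (B, n2, D); labels: (B, m_S) class indices in {0..K}.
    m_S = short_memory.size(1)
    # Directional mask: position i in the short-term memory may only attend to positions <= i,
    # so every position behaves as if it were the latest frame.
    causal_mask = torch.triu(torch.ones(m_S, m_S, dtype=torch.bool), diagonal=1)
    hidden = decoder(tgt=short_memory, memory=encoded_long, tgt_mask=causal_mask)
    logits = classifier(hidden)  # (B, m_S, K + 1), one prediction per short-term frame
    # Eq. (2): cross-entropy (Eq. (1)) at every short-term position
    # (averaged here rather than summed, which only rescales the gradient).
    return F.cross_entropy(logits.flatten(0, 1), labels.flatten())
```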
3.6 Online Inference with LSTR
During online inference, the video frame features are streamed to the model as time passes. Running LSTR’s long-term memory encoder from scratch for each frame results in time complexities of $O(n_1 m_L)$ and $O(l\,n_1 n_2)$ for the first and second memory compression stages, respectively. However, at each time step only one new video frame enters the memory. We show that even more efficient online inference is possible by storing intermediate results of the first-stage Transformer decoder unit. First, the queries of the first Transformer decoder unit are fixed, so their self-attention outputs can be pre-computed once and reused throughout inference. Second, the cross-attention operation in the first stage can be written (omitting key/value projections and the scaling factor for brevity) as

$$\mathbf{o}_j = \sum_{i=0}^{m_L - 1} \frac{\exp(A_{ji})}{\sum_{i'} \exp(A_{ji'})}\,\big(\mathbf{x}_{t-i} + \mathbf{e}_i\big), \qquad A_{ji} = \tilde{\mathbf{q}}_j^{\top}\big(\mathbf{x}_{t-i} + \mathbf{e}_i\big), \tag{3}$$

where the index $i$ is the relative position of a frame in the long-term memory with respect to the latest time $t$, $\tilde{\mathbf{q}}_j$ is the $j$-th query after the first self-attention operation, and $\mathbf{e}_i$ is the positional embedding at relative position $i$. This calculation depends on the un-normalized attention weight matrix $\mathbf{A}$ with elements $A_{ji}$. $\mathbf{A}$ can be decomposed into the sum of two matrices $\mathbf{A}^{x}$ and $\mathbf{A}^{e}$ with elements $A^{x}_{ji} = \tilde{\mathbf{q}}_j^{\top}\mathbf{x}_{t-i}$ and $A^{e}_{ji} = \tilde{\mathbf{q}}_j^{\top}\mathbf{e}_i$. The queries $\tilde{\mathbf{q}}_j$ and the positional embeddings $\mathbf{e}_i$ are fixed during inference, so the matrix $\mathbf{A}^{e}$ can be pre-computed and reused for every incoming frame. We additionally maintain a FIFO queue of vectors of size $n_1$: for each incoming frame we compute its dot products with all $n_1$ queries and push the resulting vector into the queue; stacking all vectors currently in the queue yields $\mathbf{A}^{x}$ at any time step. Updating this queue at each time step requires $O(n_1 D)$ time for the matrix-vector product. We can then obtain $\mathbf{A}$ with only $n_1 m_L$ additions by summing $\mathbf{A}^{x}$ and $\mathbf{A}^{e}$, instead of the $n_1 m_L D$ multiplications and additions required by evaluating Eq. (3) directly, which reduces the amortized time complexity of computing the attention weights to $O(n_1 m_L)$. Although the time complexity of the full cross-attention operation is still $O(n_1 m_L D)$ due to the inevitable weighted sum over values, considering that $D$ is usually large for common backbone features he2016deep; szegedy2016inception, this is still a considerable reduction of runtime cost. We compare LSTR’s runtime with other baseline models in Sec. 6.1.
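The following sketch illustrates the amortized attention-weight computation described above, in the notation of Eq. (3). The key projection `W_k`, the tensor shapes, and the class name are illustrative assumptions; the point is that $\mathbf{A}^{e}$ is pre-computed once, while $\mathbf{A}^{x}$ is maintained incrementally as a FIFO of per-frame dot products.

```python
from collections import deque
import torch

class StreamingAttentionWeights:
    def __init__(self, q_tilde: torch.Tensor, pos_emb: torch.Tensor, W_k: torch.Tensor, m_L: int):
        # q_tilde: (n1, D) fixed queries after self-attention; pos_emb: (m_L, D); W_k: (D, D).
        self.q = q_tilde
        self.W_k = W_k
        self.A_e = q_tilde @ (pos_emb @ W_k).T  # (n1, m_L), pre-computed once
        self.cols = deque(maxlen=m_L)           # FIFO of (n1,) vectors, newest appended last

    def push(self, frame_feature: torch.Tensor):
        # O(n1 * D) per incoming frame: one projected dot product per query.
        self.cols.append(self.q @ (self.W_k @ frame_feature))

    def weights(self) -> torch.Tensor:
        # Un-normalized attention weights A = A^x + A^e, obtained with additions only.
        # Column i corresponds to relative position i (newest long-term frame at i = 0).
        A_x = torch.stack(list(self.cols)[::-1], dim=1)  # (n1, current memory length)
        return A_x + self.A_e[:, :A_x.size(1)]
```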
4 Experiments
4.1 Datasets
We evaluate our model on three publicly-available datasets: THUMOS’14 THUMOS14, TVSeries de2016online and HACS Segment Zhao_2019_ICCV. THUMOS’14 includes over 20 hours of sports video annotated with 20 actions. We follow prior work xu2019temporal; eun2020learning and train on the validation set (200 untrimmed videos) and evaluate on the test set (213 untrimmed videos). TVSeries contains 27 episodes of 6 popular TV series, totaling 16 hours of video. The dataset is annotated with 30 realistic, everyday actions (e.g., open door). HACS Segment is a large-scale dataset of web videos. It contains 35,300 untrimmed videos over 200 human action classes for training and 5,530 untrimmed videos for validation.
4.2 Settings
Feature Encoding. We follow the experimental settings of state-of-the-art methods xu2019temporal; eun2020learning. Specifically, we extract video frames at 24 FPS and set the video chunk size to 6. Decisions are made at the chunk level, and thus performance is evaluated every 0.25 seconds. For feature encoding, we adopt a two-stream network wang2016temporal with a ResNet-50 backbone, using the open-source toolbox 2020mmaction2. We experiment with feature extractors pretrained on two datasets: ActivityNet and Kinetics. The visual features are extracted at the global pooling layer from the central frame of each chunk, and the motion features are extracted from pre-computed stacked optical flow fields between 6 consecutive frames. The visual and motion features are concatenated along the channel dimension to form the final feature vector.
Implementation Details. We implemented the proposed model in PyTorch pytorch and performed all experiments on a system with 8 Nvidia V100 GPUs. To learn the model weights, we used the Adam kingma2014adam optimizer with weight decay regularization. The learning rate was linearly warmed up from zero during the first training iterations and then decayed to zero following a cosine schedule. Our models were optimized with mini-batches, and training was terminated after a fixed number of epochs.

Evaluation Protocols. We follow prior work and use per-frame mean average precision (mAP) to evaluate the performance of online action detection. We also use per-frame calibrated average precision (cAP) de2016online, which was proposed for TVSeries to correct the imbalance between positive and negative samples: $\mathrm{cAP} = \frac{\sum_k \mathrm{cPrec}(k)\,\mathbb{1}(k)}{P}$, where the calibrated precision is $\mathrm{cPrec} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}/w}$, $\mathbb{1}(k)$ is $1$ if frame $k$ is a true positive, $P$ is the total number of true positives, and $w$ is the ratio between negative and positive frames.
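For reference, a short sketch of per-class calibrated average precision under the definition reconstructed above (the function name and array conventions are assumptions):

```python
import numpy as np

def calibrated_average_precision(scores: np.ndarray, labels: np.ndarray) -> float:
    # scores: per-frame confidences for one class; labels: 1 for positive frames, 0 otherwise.
    order = np.argsort(-scores)
    labels = labels[order]
    w = max((labels == 0).sum() / max((labels == 1).sum(), 1), 1e-8)  # negative-to-positive ratio
    tp = np.cumsum(labels == 1)
    fp = np.cumsum(labels == 0)
    c_prec = tp / np.maximum(tp + fp / w, 1e-8)  # calibrated precision at each rank
    # Average the calibrated precision over the ranks that are true positives.
    return float((c_prec * (labels == 1)).sum() / max(tp[-1], 1))
```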
4.3 Comparison with the State-of-the-art Methods
| Method | Input Features | THUMOS’14 mAP (%) | TVSeries mcAP (%) |
|---|---|---|---|
| CDC shou2017cdc | ActivityNet | 44.4 | - |
| RED gao2017red | ActivityNet | 45.3 | 79.2 |
| TRN xu2019temporal | ActivityNet | 47.2 | 83.7 |
| FATS kim2021temporally | ActivityNet | 51.6 | 81.7 |
| IDN eun2020learning | ActivityNet | 50.0 | 84.7 |
| LAP qu2020lap | ActivityNet | 53.3 | 85.3 |
| TFN eun2021temporal | ActivityNet | 55.7 | 85.0 |
| LFB* wu2019long | ActivityNet | 61.6 | 84.8 |
| LSTR (ours) | ActivityNet | 65.3 | 88.1 |
| Method | Input Features | THUMOS’14 mAP (%) | TVSeries mcAP (%) |
|---|---|---|---|
| FATS kim2021temporally | Kinetics | 59.0 | 84.6 |
| IDN eun2020learning | Kinetics | 60.3 | 86.1 |
| TRN xu2019temporal | Kinetics | 62.1 | 86.2 |
| PKD zhao2020privileged | Kinetics | 64.5 | 86.4 |
| WOAD gao2020woad | Kinetics | 67.1 | - |
| LFB* wu2019long | Kinetics | 64.8 | 85.8 |
| LSTR (ours) | Kinetics | 69.5 | 89.1 |
| Method | Input Features | 0%-10% | 10%-20% | 20%-30% | 30%-40% | 40%-50% | 50%-60% | 60%-70% | 70%-80% | 80%-90% | 90%-100% |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CNN de2016online | ActivityNet | 61.0 | 61.0 | 61.2 | 61.1 | 61.2 | 61.2 | 61.3 | 61.5 | 61.4 | 61.5 |
| RNN de2016online | ActivityNet | 63.3 | 64.5 | 64.5 | 64.3 | 65.0 | 64.7 | 64.4 | 64.4 | 64.4 | 64.3 |
| TRN xu2019temporal | ActivityNet | 78.8 | 79.6 | 80.4 | 81.0 | 81.6 | 81.9 | 82.3 | 82.7 | 82.9 | 83.3 |
| IDN eun2020learning | ActivityNet | 80.6 | 81.1 | 81.9 | 82.3 | 82.6 | 82.8 | 82.6 | 82.9 | 83.0 | 83.9 |
| TFN eun2021temporal | ActivityNet | 83.1 | 84.4 | 85.4 | 85.8 | 87.1 | 88.4 | 87.6 | 87.0 | 86.7 | 85.6 |
| LSTR (ours) | ActivityNet | 83.6 | 85.0 | 86.3 | 87.0 | 87.8 | 88.5 | 88.6 | 88.9 | 89.0 | 88.9 |
| IDN eun2020learning | Kinetics | 81.7 | 81.9 | 83.1 | 82.9 | 83.2 | 83.2 | 83.2 | 83.0 | 83.3 | 86.6 |
| PKD zhao2020privileged | Kinetics | 82.1 | 83.5 | 86.1 | 87.2 | 88.3 | 88.4 | 89.0 | 88.7 | 88.9 | 87.7 |
| LSTR (ours) | Kinetics | 84.4 | 85.6 | 87.2 | 87.8 | 88.8 | 89.4 | 89.6 | 89.9 | 90.0 | 90.1 |
We compare LSTR against the state-of-the-art methods xu2019temporal; eun2020learning; gao2020woad on THUMOS’14, TVSeries, and HACS Segment. Specifically, on THUMOS’14 and TVSeries, we implement LSTR with long- and short-term memories of 512 and 8 seconds, respectively. On HACS Segment, we reduce the long-term memory to 256 seconds, considering that its videos are strictly shorter than 4 minutes. We implement the two-stage memory compression using Transformer decoder units, with fixed numbers of output tokens ($n_1$, $n_2$) for the two stages and fixed numbers of Transformer layers for the encoder and decoder.
THUMOS’14. We compare LSTR with recent work on THUMOS’14, including methods that use 3D ConvNets shou2017cdc, RNNs xu2017end; eun2020learning; gao2020woad, reinforcement learning gao2017red, and curriculum learning zhao2020privileged. Table 1 shows that LSTR significantly outperforms the state-of-the-art methods wu2019long; gao2020woad by 3.7% and 2.4% in terms of mAP using ActivityNet and Kinetics pretrained features, respectively.

TVSeries. Table 1 shows that LSTR outperforms the state-of-the-art methods eun2021temporal; zhao2020privileged by 3.1% and 2.7% in terms of cAP using ActivityNet and Kinetics pretrained features, respectively. Following prior work de2016online, we also investigate LSTR’s performance at different action stages by evaluating each decile (ten-percent interval) of the video frames separately. Table 2 shows that LSTR outperforms existing methods at every stage of the action instances.

HACS Segment. LSTR achieves 82.6% mAP on HACS Segment using Kinetics pretrained features. Note that HACS Segment is a new large-scale dataset with only a few published results. LSTR outperforms the existing methods RNN hochreiter1997long (77.6%) by 5.0% and TRN xu2019temporal (78.9%) by 3.7%.
4.4 Design Choices of Long- and Short-Term Memories
| Temporal Stride | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 |
|---|---|---|---|---|---|---|---|---|
| LSTR mAP (%) | 69.5 | 69.5 | 69.5 | 69.2 | 68.7 | 67.3 | 66.6 | 65.9 |

We experiment with design choices for the long- and short-term memories. Unless noted otherwise, we use THUMOS’14, which contains videos of various lengths, and Kinetics pretrained features.
Lengths of long- and short-term memories. We first analyze the effect of different long-term memory lengths $m_L$ and short-term memory lengths $m_S$. In particular, we vary $m_L$ starting from 0 seconds (no long-term memory) up to the maximum length, under several choices of $m_S$. Note that we choose the maximum length (1024 seconds for THUMOS’14 and 256 seconds for HACS Segment) to cover the duration of 98% of the videos, and we do not have suitable datasets to test longer $m_L$. Fig. 3 shows that LSTR benefits from larger $m_L$ in most cases. In addition, when $m_L$ is short, using a larger $m_S$ obtains better results, whereas when $m_L$ is sufficiently long, increasing $m_S$ does not always guarantee better performance.

Can we downsample the long-term memory? We implement LSTR with $m_S$ of 8 seconds and $m_L$ of 512 seconds, and test the effect of temporally downsampling the long-term memory. Table 3 shows that downsampling with strides of up to 4 does not cause a performance drop, but more aggressive strides dramatically decrease the detection accuracy. Note that, when extracting frame features at 4 FPS, both the LSTR encoder (whose second stage outputs 16 latent vectors) and downsampling with a stride of 128 compress the long-term memory to 16 features, yet LSTR achieves much better performance (69.5% vs. 65.9% mAP). This demonstrates the effectiveness of our learned, “adaptive” compression compared to heuristic downsampling.
Can we compensate for reduced memory length with an RNN? We note that LSTR’s performance notably decreases when it can only access very limited memory. Here we test whether an RNN can compensate for LSTR’s reduced memory or even fully replace the LSTR encoder. We implement LSTR with reduced long-term memory lengths and an extra Gated Recurrent Unit (GRU) chung2014empirical (its architecture is visualized in Fig. 6) to capture all history outside the long- and short-term memories. The dashed line in Fig. 3 shows the results. Plugging in RNNs indeed improves the performance when $m_L$ is small, but when $m_L$ is large (i.e., > 64 seconds), it does not improve the accuracy anymore. Note that RNNs are not used in any other experiments in this paper.

4.5 Design Choices of LSTR
We continue to explore the design trade-offs of LSTR. Unless noted otherwise, we use a short-term memory of 8 seconds, a long-term memory of 512 seconds, and Kinetics pretrained features.
Number of layers and tokens. First, we test different numbers of token embeddings (i.e., $n_1$ and $n_2$) in the LSTR encoder. Fig. 4 (left) shows that LSTR is quite robust to these choices (the gap between the best and worst configurations is small), although one particular setting of $n_1$ and $n_2$ gives the highest accuracy. Second, we examine the effect of using different numbers of Transformer decoder units (i.e., the numbers of layers in the encoder and decoder). As shown in Fig. 4 (right), LSTR does not need a large model to reach its best performance; in practice, using more layers can easily cause overfitting.

Can we unify the temporal modeling using only self-attention? We test whether the long- and short-term memories can be modeled as a whole using self-attention. Specifically, we concatenate the long- and short-term memories and feed them into a standard Transformer encoder vaswani2017attention with a model size similar to LSTR. Table 4 (row 7 vs. row 1) shows that LSTR achieves better performance, especially when $m_L$ is large. This demonstrates the advantage of LSTR for temporal modeling over long- and short-term context.

Can we remove the LSTR encoder? We explore this by directly feeding the long-term memory into the LSTR decoder and using the short-term memory as output tokens to reference useful information. Table 4 shows that LSTR outperforms this baseline (row 7 vs. row 2), especially when $m_L$ is large. We also compare this baseline with the self-attention model (row 1) and observe that, although neither of them can effectively model prolonged memory, this baseline outperforms the Transformer encoder. This further supports our idea of using the short-term memory to query related context from long-range context.
Can the LSTR encoder learn effectively using only self-attention? To evaluate the “bottleneck” design of the LSTR encoder, we implement a baseline that models the long-term memory using standard Transformer encoder units vaswani2017attention. Note that it still processes the long- and short-term memories with a workflow similar to LSTR, but it does not compress and encode the long-term memory using learnable tokens. The comparison in Table 4 (row 3 vs. row 7) shows that LSTR outperforms this baseline in all settings. The performance of this baseline decreases as $m_L$ grows, which also suggests the superior ability of LSTR in modeling long-range patterns.
How should the memory compression for the LSTR encoder be designed? First, we test a single-stage design with Transformer decoder units. Table 4 shows that two-stage memory compression (row 7) is consistently better than the single-stage design (row 4), and, interestingly, the performance gap widens with larger $m_L$. Second, we compare cross-attention with self-attention for the second compression stage by replacing the second-stage Transformer decoder units with Transformer encoder units. Table 4 shows that LSTR consistently outperforms this baseline (row 5) in mAP. However, this baseline is still better on average than the models with one-stage compression (row 5 vs. row 3 and row 5 vs. row 4).
Can we remove the LSTR decoder? We remove the LSTR decoder to evaluate its contribution. Specifically, we feed the entire memory into the LSTR encoder and attach a non-linear classifier on its output token embeddings. Similar to the above experiments, we increase the model size to ensure a fair comparison. Table 4 shows that LSTR outperforms this baseline (row 6), with a larger margin for large $m_L$ (e.g., 512 and 1024 seconds) and a smaller margin for relatively small $m_L$ (e.g., 8 and 16 seconds).
4.6 Error Analysis

In Table 5, we list the action classes from THUMOS’14 THUMOS14 on which LSTR obtains the highest (green) and the lowest (red) per-frame APs. In Fig. 5, we illustrate four sample frames with incorrect predictions; more visualizations are included in Sec. 6.3. We observe that LSTR’s detection accuracy decreases when the action involves only tiny motion or the subject is very far away from the camera, but it excels at recognizing actions with long temporal spans and multiple stages, such as “PoleVault” and “Long Jump”. This suggests extending the temporal modeling capability of LSTR to the joint spatial and temporal domains as one direction for future exploration.
5 Conclusion
We presented the Long Short-term TRansformer (LSTR) for online action detection, which divides the historical observations into long- and short-term memories. LSTR compresses the long-term memory into encoded latent features and then references related temporal context from them using the short-term memory. Experimental results demonstrated that our memory design and the joint modeling of long- and short-term temporal information are beneficial for online action detection. Ablation studies showed LSTR’s superior ability to handle prolonged sequences with fewer heuristic design choices.
6 Appendix
6.1 Runtime
We test the runtime (reported as frames per second, FPS) of LSTR using our default design choices from Sec. 4.3. This experiment is conducted on a system with a single V100 GPU. More details about the experimental settings are discussed in Sec. 4.2. The results are shown in Table 6.
We start by evaluating LSTR’s runtime without considering pre-processing (e.g., feature extraction). LSTR runs at 91.6 FPS on average (row 4). We then compare the runtime of the different design choices discussed in Sec. 4.5. First, we test LSTR with the one-stage encoder (row 3); the results show that the two-stage design runs much faster (91.6 FPS vs. 59.5 FPS). The main reason is that, with two-stage compression, the LSTR encoder does not need to reference information from the long-term memory multiple times and can be further accelerated during online inference (as discussed in Sec. 3.6). Second, we test implementing the LSTR encoder using only self-attention (row 2). This design does not compress the long-term memory, which increases the computational cost of both the LSTR encoder and decoder (50.2 FPS vs. 91.6 FPS). Third, we compare LSTR with the Transformer encoder vaswani2017attention (row 1), which is about 2× slower than LSTR (row 4).
We also compare LSTR with state-of-the-art recurrent models. We are not aware of prior work that reports runtime, so we test TRN xu2019temporal using its official open-source code trn. TRN runs at 123.3 FPS, which is faster than LSTR. This is because recurrent models abstract the entire history into a compact representation, whereas LSTR needs to process much more information. On the other hand, LSTR achieves much higher accuracy than TRN, outperforming it by about 7% in mAP on THUMOS’14 and about 3% in cAP on TVSeries (cf. Table 1).
For end-to-end testing, we follow the experimental settings of state-of-the-art methods xu2019temporal; eun2020learning; qu2020lap; zhao2020privileged; gao2020woad and build LSTR on two-stream features. Specifically, the visual and motion features are extracted using ResNet-50 he2016deep and BN-Inception szegedy2016inception, respectively. The end-to-end online inference runs at 4.6 FPS when LSTR is combined with these pre-processing steps. The speed bottleneck is the motion feature extraction, which accounts for about 90% of the total runtime, including the optical flow computation with DenseFlow denseflow. One could largely improve the efficiency by using real-time optical flow extractors (e.g., PWC-Net sun2018pwc) or by using only visual features extracted by a light-weight backbone (e.g., MobileNet howard2017mobilenets or FBNet wu2019fbnet). Building an end-to-end model with light-weight backbones is one direction toward real-time online action detection.
6.2 Can we compensate for reduced memory length with an RNN? Cont’d

To better understand the design in Sec. 4.4, we show its overall structure in Fig. 6. Specifically, in addition to the long- and short-term memories, we use an extra GRU to capture all the history “outside” the long- and short-term memories as a compact representation. We then concatenate the outputs of the LSTR encoder and the GRU into a more comprehensive set of temporal features and feed them into the LSTR decoder as input tokens.
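A minimal sketch of this GRU-augmented variant follows; the module names and shapes are illustrative assumptions, and `encoder` stands for the LSTR encoder (e.g., the two-stage compression sketched in Sec. 3.3).

```python
import torch
import torch.nn as nn

class GRUCompensatedEncoder(nn.Module):
    def __init__(self, encoder: nn.Module, dim: int = 1024):
        super().__init__()
        self.encoder = encoder                  # LSTR encoder (long-term memory compression)
        self.gru = nn.GRU(dim, dim, batch_first=True)

    def forward(self, outside_history: torch.Tensor, long_memory: torch.Tensor) -> torch.Tensor:
        # outside_history: (B, T, D) frames older than the long-term memory window.
        _, h = self.gru(outside_history)        # h: (1, B, D) compact summary of the far past
        compressed = self.encoder(long_memory)  # (B, n2, D)
        # Concatenate the GRU summary with the encoder outputs as the decoder's input tokens.
        return torch.cat([compressed, h.transpose(0, 1)], dim=1)  # (B, n2 + 1, D)
```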
6.3 Qualitative Results
Fig. 7 shows qualitative results. In most cases, LSTR quickly recognizes the actions and makes relatively consistent predictions over each action instance. Two typical failure cases are also shown in Fig. 7. The top sample contains the “Billiards” action, which involves only tiny motion; as discussed in Sec. 4.6, LSTR’s detection accuracy tends to decrease on this kind of action. The bottom sample is challenging: the “Open door” action occurs behind the female reporter and is barely visible. The red circle indicates where the action is happening in each frame.