
TraMNet - Transition Matrix Network for Efficient Action Tube Proposals

Current state-of-the-art methods solve spatiotemporal action localisation by extending 2D anchors to 3D-cuboid proposals on stacks of frames, to generate sets of temporally connected bounding boxes called action micro-tubes. However, they fail to consider that the underlying anchor proposal hypotheses should also move (transition) from frame to frame, as the actor or the camera does. Assuming we evaluate n 2D anchors in each frame, then the number of possible transitions from each 2D anchor to the next, for a sequence of f consecutive frames, is in the order of O(n^f), expensive even for small values of f. To avoid this problem, we introduce a Transition-Matrix-based Network (TraMNet) which relies on computing transition probabilities between anchor proposals while maximising their overlap with ground truth bounding boxes across frames, and enforcing sparsity via a transition threshold. As the resulting transition matrix is sparse and stochastic, this reduces the proposal hypothesis search space from O(n^f) to the cardinality of the thresholded matrix. At training time, transitions are specific to cell locations of the feature maps, so that a sparse (efficient) transition matrix is used to train the network. At test time, a denser transition matrix can be obtained either by decreasing the threshold or by adding to it all the relative transitions originating from any cell location, allowing the network to handle transitions in the test data that might not have been present in the training data, and making detection translation-invariant. Finally, we show that our network can handle sparse annotations such as those available in the DALY dataset. We report extensive experiments on the DALY, UCF101-24 and Transformed-UCF101-24 datasets to support our claims.



1 Introduction

Current state-of-the-art spatiotemporal action localisation works [23, 15, 12] focus on learning a spatiotemporal multi-frame 3D representation by extending frame-level 2D object/action detection approaches  [8, 30, 7, 22, 18, 24, 20, 26]. These networks learn a feature representation from pairs [23] or chunks [15, 12] of video frames, allowing them to implicitly learn the temporal correspondence between inter-frame action regions (bounding boxes). As a result, they can predict micro-tubes [23] or tubelets [15], i.e., temporally linked frame-level detections for short subsequences of a test video clip. Finally, these micro-tubes are linked [23, 15, 12] in time to locate action tube instances [26] spanning the whole video.

Figure 1: Illustrating the key limitation of anchor cuboids using a “dynamic” action like “horse riding”. (a) A horse rider changes location from frame f_t to frame f_{t+Δ}, as shown by the ground-truth bounding boxes (in green). As the anchor cuboid generation [23, 15] is constrained by the spatial location of the anchor box in the first frame f_t, the overall spatiotemporal IoU overlap between the ground-truth micro-tube and the anchor cuboid is relatively low. (b) In contrast, our anchor micro-tube proposal generator is much more flexible, as it efficiently explores the video search space via an approximate transition matrix estimated based on a hidden Markov model (HMM) formulation. As a result, the anchor micro-tube proposal (in blue) generated by the proposed model exhibits higher overlap with the ground-truth. (c) For “static” actions (such as “clap”) in which the actor does not change location over time, anchor cuboids and anchor micro-tubes have the same spatiotemporal bounds.

These approaches, however, raise two major concerns. Firstly, they [23, 15, 12] generate action proposals by extending 2D object proposals (anchor/prior boxes for images) [18, 22] to 3D proposals (anchor cuboids for multiple frames) (cf. Fig. 1 (a)). This cannot, by design, provide an optimal set of training hypotheses, as the video proposal search space (O(n^f)) is much larger than the image proposal search space (O(n)), where n is the number of anchor boxes per frame and f is the number of video frames considered. Furthermore, 3D anchor cuboids are very limiting for action detection purposes. Whereas they can be suitable for “static” actions (e.g. “handshake” or “clap”, in which the spatial location of the actor(s) does not vary over time), they are ill-suited to “dynamic” ones (e.g. “horse riding”, “skiing”). Fig. 1 underscores this issue. For “horse riding”, for instance, allowing “flexible” anchor micro-tubes (such as those generated by our approach, Fig. 1 (b)) considerably improves the spatio-temporal overlap with the ground-truth compared with anchor cuboids (Fig. 1 (a)). Designing a deep network which can effectively make use of the video search space to generate high-quality action proposals, while keeping the computing cost as low as possible, is then highly desirable. To this end, we produced a new action detection dataset which is a “transformed” version of UCF-101-24 [27], in which we force action instances to be dynamic (i.e., to change their spatial location significantly over time) by introducing random translations in the 2D spatial domain. We show that our proposed action detection approach outperforms the baseline [23] when trained and tested on this transformed dataset.
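To give a feel for the size of the two search spaces discussed above, the following minimal sketch compares the number of anchor cuboids with the number of unconstrained anchor sequences. The function names and the example value of 100 anchors per frame are illustrative assumptions, not figures taken from the experiments.

```python
def num_anchor_cuboids(n: int) -> int:
    """Anchor cuboids repeat the same anchor in every frame: one hypothesis per anchor."""
    return n


def num_anchor_sequences(n: int, f: int) -> int:
    """Unconstrained case: any of the n anchors may be chosen independently in each of f frames."""
    return n ** f


if __name__ == "__main__":
    n = 100  # hypothetical number of anchor boxes per frame
    for f in (2, 3, 5):
        print(f, num_anchor_cuboids(n), num_anchor_sequences(n, f))
```

Even for this small hypothetical n, the unconstrained count grows by a factor of n with every additional frame, which is why some form of pruning of the transitions is needed.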

Secondly, action detection methods such as [15, 12] require dense ground-truth annotation for network training: bounding-box annotation is required for k consecutive video frames, where k is the number of frames in a training example. Kalogeiton et al. [15] use k = 6, whereas for Hou et al. [12] k = 8. Generating such dense bounding box annotation for long video sequences is highly expensive and impractical [31, 10]. The latest generation action detection benchmarks DALY [31] and AVA [10], in contrast, provide sparse bounding-box annotations. More specifically, DALY has 1 to 5 frames with bounding box annotation per action instance irrespective of the duration of an instance, whereas AVA has only one annotated frame per second. This motivates the design of a deep network able to handle sparse annotations, while still being able to predict micro-tubes over multiple frames.

Unlike [15, 12], Saha et al. [23] recently proposed to use pairs of successive frames (f_t, f_{t+Δ}), eliminating the need for dense training annotation when the inter-frame distance Δ is large or arbitrary, as in DALY [31]. If the spatio-temporal IoU (Intersection over Union) overlap between the ground-truth micro-tube and the action proposal could be improved (cf. Fig. 1), such a network would be able to handle sparse annotation (e.g., pairs of frames which are Δ frames apart). Indeed, the use of pairs of successive frames, in combination with the flexible anchor proposals introduced here, is arguably more efficient than any other state-of-the-art method [23, 16, 12] for handling sparse annotations (e.g. DALY [31] and AVA [10]).

Figure 2: Overview of our proposed TraMNet at training time. The diagram is described in the text.

Concept. Here we support the idea of constructing training examples using pairs of successive frames. However, the model we propose is able to generate a rich set of action proposals (which we call anchor micro-tubes, cf. Fig. 1) using a transition matrix (cf. Section 3.3) estimated from the available training set. Such a transition matrix encodes the probability of a temporal link between an anchor box at time t and one at time t+Δ, and is estimated within the framework of discrete state/continuous observation hidden Markov models (HMMs, cf. Section 3.2) [4]. Here, the hidden states are the 2D bounding-box coordinates of each anchor box from a (finite) hierarchy of fixed grids at different scales. The (continuous) observations are the kindred four-vectors of coordinates associated with the ground truth bounding boxes (which are instead allowed to be placed anywhere in the image). Anchor micro-tubes are not bound to be strictly cuboidal in shape (as in [23, 15, 12]), thus giving higher IoU overlap with the ground-truth, specifically for instances where the spatial location of the actor changes significantly from f_t to f_{t+Δ} in a training pair. We thus propose a novel configurable deep neural network architecture (see Fig. 2 and Section 3) which leverages high-quality micro-tubes shaped by learnt anchor transition probabilities.

We quantitatively demonstrate that the resulting action detection framework: (i) is suitable for datasets with temporally sparse frame-level bounding box annotation (e.g. DALY [31] and AVA [10]); (ii) outperforms the current state-of-the-art [23, 15, 26] by exploiting the anchor transition probabilities learnt from the training data; (iii) is suitable for detecting highly ‘dynamic’ actions (Fig. 1), as shown by its outperforming the baseline [23] when trained and tested on the “transformed” UCF-101-24 dataset.

Overview of the approach. Our network architecture builds on some of the architectural components of [18, 23, 15] (Fig. 2). The proposed network takes as input a pair of successive video frames (f_t, f_{t+Δ}), where Δ is the inter-frame distance (Fig. 2 (a)), and propagates these frames through a base network comprised of two parallel CNN networks (§ 3.1, Fig. 2 (b)), which produce two sets of conv feature maps, each forming a pyramid. These feature pyramids are used by a configurable pooling layer (§ 3.4 and Fig. 2 (d)) to pool features based on the transition probabilities defined by a transition matrix (§ 3.3, Fig. 2). The pooled conv features are then stacked (§ 3.4 and Fig. 2 (e)), and the resulting feature vector is passed to two parallel fully connected (linear) layers (one for classification and another for micro-tube regression, see § 3.5 and Fig. 2 (f)), which predict the output micro-tube and its classification scores for each class (Fig. 2 (g)). Each training mini-batch is used to compute the classification and micro-tube regression losses given the output predictions, ground truth and anchor micro-tubes. We call our network “configurable” because the configuration of the pooling layer (see Fig. 2 (d)) depends on the transition matrix, and can be changed either by altering the threshold applied to it (cf. Section 3.3) or by replacing the transition matrix with a new one for another dataset.

In summary, we present a novel deep learning architecture for spatio-temporal action localisation which:

  • introduces an efficient and flexible anchor micro-tube hypothesis generation framework to generate high-quality action proposals;

  • handles significant spatial movement in dynamic actors without penalising more static actions;

  • is a scalable solution for training models on either sparse or dense annotations.

2 Related work

Traditionally, spatio-temporal action localisation was widely studied using local or figure-centric features [6, 19, 14, 25, 28]. Inspired by Oneata et al. [19] and Jain et al. [14], Gemert et al. [6] used unsupervised clustering to generate 3D tubelets using unsupervised frame-level proposals and dense trajectories. As their method is based on dense-trajectory features [29], however, it fails to detect actions characterised by small motions [6].

Recently, inspired by the record-breaking performance of CNN-based object detectors [21, 22, 18], several scholars [26, 24, 8, 20, 30, 32, 35] tried to extend object detectors to videos for spatio-temporal action localisation. These approaches, however, fail to tackle spatial and temporal reasoning jointly at the network level, as spatial detection and temporal association are treated as two disjoint problems. Interestingly, Yang et al. [33] use features from the proposals in the current frame t to ‘anticipate’ region proposal locations in frame t+Δ and use them to generate detections at time t+Δ, thus failing to take full advantage of the anticipation trick to help with the linking process.

More recent works try to address this problem by predicting micro-tubes [23] or tubelets [15, 12] for a small set of frames taken together. As mentioned, however, these approaches use anchor hypotheses which are simply extensions of the hypothesis in the first frame, thus failing to model significant location transitions. In contrast, here we address this issue by proposing anchor regions which move across frames, as a function of a transition matrix estimated at training time from anchor proposals of maximal overlap.

Advances in action recognition are always going to be helpful in action detection from a general representation learning point of view. For instance, Gu et al. [10] improve on [20, 15] by plugging in the inflated 3D network proposed by [3] as a base network on multiple frames. Although they use a very strong base network pre-trained on the large “kinetics” [16] dataset, they do not handle the linking process within the network as the AVA [10] dataset’s annotations are not temporally linked.

Temporal association is usually performed by some form of “tracking-by-detection” [26, 30, 8] of frame-level detections. Kalogeiton et al. [15] adapt the linking process proposed by Singh et al. [26] to link tubelets, whereas Saha et al. [23] build on [8] to link micro-tubes. Temporal trimming is handled separately either by sliding window [31, 20], or in a label smoothing formulation solved using dynamic programming [24, 5]. For this task we adopt the micro-tube linking from [15, 26] and the online temporal trimming from [26]. We demonstrate that the temporal trimming aspect does not help on UCF101-24 (in fact, it damages performance), while it helps on the DALY dataset in which only 4% of the video duration is covered by action instances.

3 Methodology

Figure 3: Base network architecture. (a) SSD convolutional layers; (b) the corresponding conv feature maps outputted by each conv layer; (c) anchor boxes with different aspect ratios assigned to each cell location of a feature map grid; (d) transition matrices for the feature map grids in the pyramid.

In Section 3.1, we introduce the base network architecture used for feature learning. We cast the action proposal generation problem in a hidden Markov model (HMM) formulation (Section 3.2), and introduce an approximate estimation of the HMM transition probability matrix using a heuristic approach (Section 3.3). The proposed approximation is relatively inexpensive and works gracefully (§ 4). In Section 3.4, a configurable pooling layer architecture is presented which pools convolutional features from the regions in the two frames linked by the estimated transition probabilities. Finally, the output layers of the network (i.e., the micro-tube regression and classification layers) are described in Section 3.5.

3.1 Base network

The base network takes as inputs a pair of video frames and propagates them through two parallel CNN streams (cf. Fig. 2 (b)). In Fig. 3 (a), we show the network diagram of one of the CNN streams; the other follows the same design.
The network architecture is based on the Single-Shot Detector (SSD) [18]. The CNN stream outputs a set of convolutional feature maps forming a feature pyramid (cf. Fig. 3 (b)); each map has its own height, width and depth, which depend on the network depth at which it is produced. The feature maps at the lower depth levels are responsible for encoding smaller objects/actions, whereas feature maps at higher depth levels encode larger actions/objects. For each cell location of a feature map grid, a fixed number of anchor boxes (with different aspect ratios) is assigned (Fig. 3 (c)). These anchor boxes, assigned for all distinct feature map grids, are then used to generate action proposal hypotheses based on the transition probability matrix, as explained below.

Note that the proposed framework is not limited to any particular base network architecture, and is flexible enough to accommodate more recent networks [17, 3].

3.2 HMM-based action proposal generation

A hidden Markov model (HMM) models a time series of (directly measurable) observations, either discrete or continuous, as randomly generated at each time instant t by a hidden state. The series of hidden states forms a Markov chain, i.e., the conditional probability of the state at time t given all the previous states only depends on the value of the state at time t-1. The whole information on the time series’ dynamics is thus contained in a transition probability matrix, whose entry (i, j) is the probability of moving from state i to state j.

In our setting, a state is a vector containing the 2D bounding-box coordinates of one of the anchor boxes in one of the grids forming the pyramid (§ 3.1). The transition matrix encodes the probabilities of a temporal link existing between an anchor box (indexed by i) at time t and another anchor box (indexed by j) at time t+Δ. The continuous observations are the ground-truth bounding boxes, so that a sequence of observations corresponds to a ground-truth action tube.

In hidden Markov models, observations are assumed to be Gaussian distributed given a state, with a state-specific mean and covariance. After assuming an appropriate distribution for the initial state, the transition model allows us to predict at each time t the probability of the current state given the history of previous observations, i.e., the probability of each anchor box at time t given the observed (partial) ground-truth action tube. Given a training set, the optimal HMM parameters (the transition matrix, and the mean and covariance associated with each state) can be learned using standard expectation maximisation (EM) or the Baum-Welch algorithm, by optimising the likelihood of the predictions produced by the model.

Once training is done, at test time, the mean of the conditional distribution of the observations given the state associated with the predicted state at time t can be used to initialise the anchor boxes for each of the CNN feature map grids (§ 3.1). The learnt transition matrix can be used to generate a set of training action proposal hypotheses (i.e., anchor micro-tubes, Fig. 1). As in our case the mean vectors are known a priori (as the coordinates of the anchor boxes are predefined for each feature map grid, § 3.1), we do not allow the M-step of the EM algorithm to update them. Only the covariance matrices are updated.
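The role played by the transition matrix in the formulation above can be illustrated with a single Markov-chain prediction step. This is only a minimal sketch: the three-state matrix, the variable names and the helper `predict_next_state` are hypothetical, and Gaussian observation models and EM training are deliberately left out.

```python
import numpy as np


def predict_next_state(state_probs, transition):
    """One Markov-chain step: p(next = j) = sum_i p(current = i) * transition[i, j]."""
    state_probs = np.asarray(state_probs, dtype=float)
    transition = np.asarray(transition, dtype=float)
    if not np.allclose(transition.sum(axis=1), 1.0):
        raise ValueError("each row of the transition matrix must sum to 1")
    return state_probs @ transition


# Toy example with 3 hypothetical anchor boxes (states).
A = np.array([[0.8, 0.2, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.3, 0.7]])
p_t = np.array([1.0, 0.0, 0.0])      # the actor is known to be at anchor 0
p_next = predict_next_state(p_t, A)  # distribution over anchors in the next frame
```

Because every row of a transition matrix sums to one, the predicted vector is again a probability distribution over anchor boxes.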

3.3 Approximation of the HMM transition matrix

Although the above setting perfectly formalises the anchor box-ground truth detection relation over the time series of training frames, a number of computational issues arise. At training time, some states (anchor boxes) may not be associated with any of the observations (ground-truth boxes) in the E-step, leading to zero covariance for those states. Furthermore, for a large number of states (in our case, anchor boxes), a single HMM training iteration takes on the order of days to complete.
In response, we propose to approximate the HMM’s transition probability matrix with a matrix generated by a heuristic approach explained below.

The problem is to learn a transition probability, i.e., the probability of a temporal link (edge) between two anchor boxes belonging to the feature map grids of the two frames at times t and t+Δ. If we assume that transitions only take place between states at the same level of the feature pyramid, the two sets of anchor boxes belonging to such a pair of grids are identical, allowing us to remove the time superscript. Recall that each feature map grid has its own spatial dimension (§ 3.1).
We compute a transition probability matrix individually for each grid level, resulting in one such matrix per level (see Fig. 3 (d)). Each cell in the grid is assigned several anchor boxes (§ 3.1).

Transition matrix computation.

Initially, all entries of the transition matrix are set to zero. Given a ground-truth micro-tube (a pair of temporally linked ground-truth boxes [23]), we compute the IoU overlap of each ground-truth box with all the anchor boxes in the considered grid. We select the pair of anchor boxes (which we term anchor micro-tube) having the maximum IoU overlap with the ground-truth micro-tube, where i and j are the cell locations of the two selected anchor boxes. If i = j (the resulting anchor boxes are in the same location) we get an anchor cuboid, otherwise a general anchor micro-tube.
This is repeated for all feature map grids to select the anchor micro-tube with the highest overlap. The best match anchor micro-tube for a given ground-truth micro-tube is selected among those, and the transition matrix is updated by incrementing its entry (i, j) by one. The above steps are repeated for all the ground-truth micro-tubes in a training set. Finally, each row of the transition matrix is normalised by dividing each entry by the sum of that row.
Fig. 4 plots the transition matrix for one of the feature map grids, for different values of Δ. As explained in the following, the configurable pooling layer employs these matrices to pool conv features for action proposal classification and regression.
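The counting-and-normalising heuristic described above can be sketched as follows for a single grid level. This is a simplified illustration under stated assumptions: there is one anchor box per cell (so anchor index and cell index coincide), the best anchor is chosen independently in each of the two frames, and the names `iou_one_to_many` and `estimate_transition_matrix` are hypothetical.

```python
import numpy as np


def iou_one_to_many(box, boxes):
    """IoU between one box and an array of boxes, all given as [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)


def estimate_transition_matrix(anchors, gt_micro_tubes):
    """Count best-matching anchor pairs, then row-normalise the counts."""
    anchors = np.asarray(anchors, dtype=float)
    n = len(anchors)
    counts = np.zeros((n, n))  # all entries start at zero
    for box_t, box_next in gt_micro_tubes:
        i = int(np.argmax(iou_one_to_many(np.asarray(box_t, dtype=float), anchors)))
        j = int(np.argmax(iou_one_to_many(np.asarray(box_next, dtype=float), anchors)))
        counts[i, j] += 1      # i == j corresponds to an anchor cuboid
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows that never received a match stay all-zero instead of dividing by zero.
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)


# Toy 2x2 grid of unit cells, one anchor per cell (hypothetical data).
anchors = [[0, 0, 1, 1], [1, 0, 2, 1], [0, 1, 1, 2], [1, 1, 2, 2]]
static = ([0, 0, 1, 1], [0, 0, 1, 1])   # actor stays in cell 0
moving = ([0, 0, 1, 1], [1, 0, 2, 1])   # actor moves from cell 0 to cell 1
transition = estimate_transition_matrix(anchors, [static, static, static, moving])
```

In this toy example three of the four micro-tubes starting in cell 0 stay there, so the first row of `transition` puts most of its mass on the diagonal, with the remainder on the off-diagonal entry for cell 1.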

Although our approach learns transition probabilities for anchor boxes belonging to the same feature map grid, we realise that the quality of the resulting action proposals could be further improved by learning transitions between anchors across different levels of the pyramid. As the feature dimension of each map varies in SSD (e.g. 1024 for one map and 512 for another), a more consistent network such as FPN [17] with ResNet [11] would be a better choice as base architecture. Here we stick to SSD to produce a fair comparison with [15, 26, 23], and leave this extension to future work.

Figure 4: (a) Transition matrix for one feature map grid, for different values of Δ. As Δ increases, off-diagonal probability values also increase, indicating a need for anchor micro-tubes rather than anchor cuboids. (b) Top: Monte Carlo sampling of transition hypotheses based on uniform sampling. Bottom: our anchor micro-tube sampling scheme, based on thresholding the transition probabilities, is also stochastic in nature and emulates Monte Carlo sampling. The blue line denotes the threshold and the shaded area above the threshold line shows the sampling region, a subset of the product grid.

3.4 Configurable pooling layer

The SSD [18] network uses convolutional kernels as classification and regression layers (called classification and regression heads). More specifically, SSD uses one set of kernels for bounding box regression (recall that several anchor boxes with different aspect ratios are assigned to each cell location (§ 3.1)) and another set for classification over the conv feature maps (§ 3.1). This is fine when the number of proposal hypotheses is fixed (e.g., for object detection in images, where the number of anchor boxes is set in advance). In our setting, however, the number of proposals varies depending upon the cardinality of the transition matrix (§ 3.3). Consequently, it is more principled to implement the classification and regression heads as fully connected layers (see Fig. 2 (f)). If we observed consistent off-diagonal entries in the transition matrices (e.g. lots of cells moving one step in the same direction), we could perform pooling as convolution feature map stacking with padding to allow spatial movement. However, transition matrices are empirically extremely sparse (e.g., there are only 25 and 1908 off-diagonal non-zero entries in the transition matrices at Δ equal to 4 and 20, respectively, on the UCF101-24 dataset).

Anchor micro-tube sampling. Each transition matrix is converted into a binary one by thresholding, so that the cardinality of the matrix depends not only on the data but also on the transition probability threshold. Our transition-matrix-based anchor micro-tube sampling scheme is stochastic in nature and emulates a Monte Carlo sampling technique (Fig. 4 (b)). Thresholding the transition matrix allows us to sample a variable number of anchors rather than a fixed one. We empirically found that a 10% threshold gives the best results in all of our tests. We discuss the threshold and its effect on performance in § 3.3.
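The thresholding step can be sketched in a few lines. This is an illustrative sketch only: the helper name `anchor_micro_tube_pairs`, the strict `>` comparison and the toy matrix are assumptions, while the 10% default mirrors the threshold quoted above.

```python
import numpy as np


def anchor_micro_tube_pairs(transition, threshold=0.1):
    """Return the (i, j) cell pairs whose transition probability exceeds the threshold."""
    binary = np.asarray(transition, dtype=float) > threshold  # binarised transition matrix
    return [(int(i), int(j)) for i, j in np.argwhere(binary)]


# Hypothetical row-normalised transition matrix for a grid with 3 cells.
T = np.array([[0.75, 0.25, 0.00],
              [0.05, 0.90, 0.05],
              [0.00, 0.00, 1.00]])
pairs = anchor_micro_tube_pairs(T, threshold=0.1)
```

Lowering the threshold keeps more off-diagonal pairs, which is one way to obtain a denser set of anchor micro-tubes at test time.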

The pooling layer (see Fig. 2 (d)) is configured to pool features from a pair of convolutional feature maps, one per input frame and both of the same shape. The pooling is done at cell locations i and j, specified by the estimated (thresholded) transition matrix (§ 3.3). Pooled features are subsequently stacked (Fig. 2 (e)) to get a single feature representation per anchor micro-tube.
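A minimal sketch of the pool-and-stack operation is given below. It makes several simplifying assumptions that are not taken from the text: features are pooled with a 1x1 window (a single cell), cells are indexed in row-major order, and the helper name `pool_and_stack` is hypothetical.

```python
import numpy as np


def pool_and_stack(feat_t, feat_next, pairs):
    """Gather the feature at cell i of the first frame and cell j of the second, then concatenate."""
    h, w, d = feat_t.shape
    flat_t = feat_t.reshape(h * w, d)        # cells indexed row-major: cell = y * w + x
    flat_next = feat_next.reshape(h * w, d)
    idx = np.asarray(pairs, dtype=int).reshape(-1, 2)
    return np.concatenate([flat_t[idx[:, 0]], flat_next[idx[:, 1]]], axis=1)


# Toy 2x2 feature maps with depth 3 (hypothetical values).
feat_t = np.arange(2 * 2 * 3, dtype=float).reshape(2, 2, 3)
feat_next = feat_t + 100.0
stacked = pool_and_stack(feat_t, feat_next, [(0, 0), (0, 1)])
```

Each row of `stacked` is the feature vector of one anchor micro-tube, so the number of rows follows the number of retained (i, j) pairs rather than a fixed number of anchors.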

3.5 Classification and regression layers

After pooling and stacking, we get one conv feature vector for each anchor micro-tube cell region; the number of such regions is the sum of the cardinalities of the transition matrices. We pass these features to a linear classification layer and a linear regression layer (see Fig. 2 (f)). The classification layer outputs class scores and the regression layer outputs bounding-box coordinates for the anchor micro-tubes of each anchor micro-tube cell region (see Fig. 2 (g)). The linear classification and regression layers have the same number of parameters as the convolutional heads in the SSD network [18].

3.6 Online action tube generation and temporal trimming

The output of the proposed network is a set of detection micro-tubes and their class confidence scores (see Fig. 2 (g)). We adapt the online action tube generation algorithm proposed by Singh et al. [26] to compose these detection micro-tubes into complete action paths (tracklets) spanning the entire video. Note that Singh et al. [26] use their tube generation algorithm to temporally connect frame-level detection bounding-boxes, whereas our modified version of the algorithm connects video-level detection micro-tubes. Similarly to [26], we build action paths incrementally by connecting micro-tubes across time. As the action paths are extracted, their temporal trimming is performed using dynamic programming [24, 5]. In Section 4 we show that temporal segmentation helps improve detection performance for datasets containing highly temporally untrimmed videos, e.g., DALY [31], where on average only 4% of the video duration is covered by action instances.

Fusion of appearance and flow cues. We follow a late fusion strategy [15, 26] to fuse appearance and optical flow cues, performed at test time after all the detections are extracted from the two streams. Kalogeiton et al. [15] demonstrated that mean fusion works better than both boost fusion [24] and union-set fusion [26]. Thus, in this work we produce all results (cf. Section 4) using mean fusion [15]. We report an ablation study of the appearance and flow stream performance in the supplementary material.

4 Experiments

We first present the datasets, evaluation metrics, fair comparison protocol and implementation details in Section 4.1. Secondly, we show how TraMNet is able to improve spatial-temporal action localisation in Section 4.2. Thirdly, in Section 4.3, we discuss how a network learned using transition matrices is able to generalise at test time, when more general anchor micro-tubes are used to evaluate the network. Finally, in Section 4.4, we quantitatively demonstrate that TraMNet is able to effectively handle sparse annotation as in the DALY dataset, and generalises well across various train and test Δ's.

4.1 Datasets

We selected UCF-101-24 [27] to validate the effectiveness of the transition matrix approach, and DALY [31] to evaluate the method on sparse annotations.

UCF101-24 is a subset of 24 classes from the UCF101 [27] dataset, which has 101 classes. Initial spatial and temporal annotations provided in THUMOS-2013 [13] were later corrected by Singh et al. [26]; we use this version in all our experiments. UCF101 videos contain a single action category per video, but sometimes multiple action instances in the same video. Each action instance covers on average 70% of the video duration. This dataset is relevant to us as we can show how the increase in Δ affects the performance of AMTnet [23], and how the transition matrix helps recover from that performance drop.

Transformed-UCF101-24 was created by us by padding all images along both the horizontal and the vertical dimension. We set the maximum padding values to 32 and 20 pixels, respectively, i.e. roughly 40% of the average width (80) and height (52) of bounding box annotations. A uniformly sampled random fraction of the 32 pixels is padded on the left edge of the image, and the remainder is padded on the right edge of the image. Similar random padding is performed at the top and bottom of each frame. The padding itself is obtained by mirroring the adjacent portion of the image through the edge. The same offset is applied to the bounding box annotations.

The DALY dataset was released by Weinzaepfel et al. [31] for 10 daily activities and contains 520 videos (200 for test and the rest for training) with 3.3 million frames. Videos in DALY are much longer, and the action duration to video duration ratio is only 4% compared to UCF101-24’s 70%, making the temporal labelling of action tubes very challenging. The most interesting aspect of this dataset is that it is not densely annotated, as at most 5 frames are annotated per action instance, and 12% of the action instances only have one annotated frame. As a result, annotated frames are 2.2 seconds apart on average.

Note. THUMOS [9] and Activity-Net [2] are not suitable for spatiotemporal detection, as they lack bounding box annotation. Annotation at 1fps for AVA [10] was released in week 1 of March 2018 (to the best of our knowledge). Also, AVA’s bounding boxes are not linked in time, preventing a fair evaluation of our approach there.
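The random mirror padding used to build Transformed-UCF101-24, as described above, could be implemented along the following lines. This is a sketch under assumptions rather than the exact procedure used: pad amounts are drawn as whole pixels, NumPy's `symmetric` mode stands in for the mirroring through the edge, boxes are in `[x1, y1, x2, y2]` pixel format, and the function name `random_mirror_pad` is hypothetical.

```python
import numpy as np


def random_mirror_pad(image, boxes, max_pad_x=32, max_pad_y=20, rng=None):
    """Pad an HxWxC image by mirroring, splitting each pad budget randomly between the two sides."""
    rng = np.random.default_rng() if rng is None else rng
    left = int(rng.integers(0, max_pad_x + 1))   # remainder of the budget goes to the right
    top = int(rng.integers(0, max_pad_y + 1))    # remainder of the budget goes to the bottom
    right, bottom = max_pad_x - left, max_pad_y - top
    padded = np.pad(image, ((top, bottom), (left, right), (0, 0)), mode="symmetric")
    # Shift the annotations by the same offset as the image content.
    shifted = np.asarray(boxes, dtype=float) + np.array([left, top, left, top], dtype=float)
    return padded, shifted


rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(240, 320, 3), dtype=np.uint8)  # hypothetical frame
padded, boxes = random_mirror_pad(image, [[100, 60, 180, 112]], rng=rng)
```

The output frame always grows by the full pad budget (32 pixels in width, 20 in height), while the position of the original content inside it, and hence of the bounding boxes, changes from call to call.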
Evaluation metric. We evaluate TraMNet using video-mAP [20, 34, 26, 15, 23]. As a standard practice [26], we use “average detection performance” (avg-mAP) to compare TraMNet’s performance with the state-of-the-art. To obtain the latter, we first compute the video-mAPs at higher IoU thresholds (δ) ranging over [0.5 : 0.95], and then take the average of these video-mAPs. On the DALY dataset, we also evaluate at various thresholds in both an untrimmed and a trimmed setting. The latter is achieved by trimming the action paths generated by the boundaries of the ground truth [31]. We further report the video classification accuracy using the predicted tubes as in [26], in which videos are assigned the label of the highest scoring tube. One can improve classification on DALY by taking other tube scores into consideration. Nevertheless, in our tests we adopt the existing protocol.
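The averaging step behind avg-mAP can be sketched as below. The step of 0.05 between thresholds is an assumption (the text only gives the range 0.5 to 0.95), the function name `average_video_map` is hypothetical, and the video-mAP computation itself is abstracted as a callable supplied by the caller.

```python
import numpy as np


def average_video_map(video_map_at, thresholds=None):
    """Average video-mAP over a range of spatio-temporal IoU thresholds."""
    if thresholds is None:
        thresholds = np.linspace(0.5, 0.95, 10)  # assumed step of 0.05
    return float(np.mean([video_map_at(float(t)) for t in thresholds]))


# Hypothetical, made-up mAP curve that decays linearly with the threshold.
avg_map = average_video_map(lambda t: 100.0 * (1.0 - t))
```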
For fair comparison we re-implemented the methods of our competitors [24, 15, 26] with SSD as the base network. As in our TraMNet network, we also replaced SSD’s convolutional heads with new linear layers. The same tube generation [26] and data augmentation [18] methods were adopted, and the same hyperparameters were used for training all the networks, including TraMNet. The only difference is that the anchor micro-tubes used in [24, 15] were cuboidal, whereas TraMNet’s anchor micro-tubes are generated using transition matrices. We refer to these approaches as SSD-L (SSD-linear-heads) [26], AMTnet-L (AMTnet-linear-heads) [23] and ACT-L (ACT-detector-linear-heads) [15].
Network training and implementation details. We used the established training settings for all the above methods. While training on the UCF101-24 dataset, we used a batch size of and an initial learning rate of , with the learning rate dropping after iterations for the appearance stream and for the flow stream. Whereas the appearance stream is only trained for iterations, the flow stream is trained for iterations. In all cases, the input image size was for the appearance stream, while a stack of five optical flow images [1] () was used for flow. Each network was trained on 1080Ti GPUs. More details about parameters and training are given in the supplementary material.

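| Methods | Train Δ | Test Δ | δ = 0.2 | δ = 0.5 | δ = 0.75 | δ = .5:.95 | Acc % |
|---|---|---|---|---|---|---|---|
| T-CNN [12] | NA | NA | 47.1 | - | - | - | - |
| MR-TS [20] | NA | NA | 73.5 | 32.1 | 02.7 | 07.3 | - |
| Saha et al. [24] | NA | NA | 66.6 | 36.4 | 07.9 | 14.4 | - |
| SSD [26] | NA | NA | 73.2 | 46.3 | 15.0 | 20.4 | - |
| AMTnet [23] rgb-only | 1,2,3 | 1 | 63.0 | 33.1 | 00.5 | 10.7 | - |
| ACT [15] | 1 | 1 | 76.2 | 49.2 | 19.7 | 23.4 | - |
| Gu et al. [10] ([20] + [3]) | NA | NA | - | 59.9 | - | - | - |
| SSD-L with-trimming | NA | NA | 76.2 | 45.5 | 16.4 | 20.6 | 92.0 |
| SSD-L | NA | NA | 76.8 | 48.2 | 17.0 | 21.7 | 92.1 |
| ACT-L | 1 | 1 | 77.9 | 50.8 | 19.8 | 23.9 | 91.4 |
| AMTnet-L | 1 | 1 | 79.4 | 51.2 | 19.0 | 23.4 | 92.9 |
| AMTnet-L | 5 | 5 | 77.5 | 49.5 | 17.3 | 22.5 | 91.6 |
| AMTnet-L | 21 | 5 | 76.2 | 47.6 | 16.5 | 21.6 | 90.0 |
| TraMNet (ours) | 1 | 1 | 79.0 | 50.9 | 20.1 | 23.9 | 92.4 |
| TraMNet (ours) | 5 | 5 | 77.6 | 49.7 | 18.4 | 22.8 | 91.3 |
| TraMNet (ours) | 21 | 5 | 75.2 | 47.8 | 17.4 | 22.3 | 90.7 |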
Table 1: Action localisation results on untrimmed videos from UCF101-24 split1. The table is divided into 4 parts. The first part lists approaches which have single frames as input; the second part approaches which take multiple frames as input; the third part contemplates the re-implemented versions of approaches in the second group; lastly, we report our TraMNet’s performance.

4.2 Action localisation performance

Table 1 shows the resulting performance on UCF101-24 at multiple train and test Δs for TraMNet versus other competitors [24, 15, 26, 20, 12]. Note that Gu et al. [10] build upon MR-TS [20] by adding a strong I3D [3] base network, making it unfair to compare [10] to SSD-L, AMTnet-L, ACT-L and TraMNet, which all use VGG as a base network.
ACT is a dense network (processing consecutive frames), which shows the best performance at high overlap (an avg-mAP of 23.9%). AMTnet-L is slightly inferior (23.4%), most likely due to it learning representations from pairs of consecutive frames only at its best training and test settings (Δ = 1). TraMNet is able to match ACT-L’s performance at high overlap (23.9%), while being comparatively more efficient.

The evaluation of AMTNet-L on Transformed-UCF101-24 (§ 4.1) shows an avg-mAP of using the appearance stream only, whereas TraMNet records an avg-mAP of , a gain of that can be attributed to its estimating grid location transition probabilities. It shows that TraMNet is more suited to action instances involving substantial shifts from one frame to the next. A similar phenomenon can be observed on the standard UCF101-24 when the train or test is greater than in Table 1.

We cross-validated different transition probability thresholds on transition matrices. Thresholds of , , , and yielded an avg-mAP of , , , and , respectively, on the appearance stream. Given such evidence, we concluded that a transition probability threshold was to be adopted throughout all our experiments.
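As a rough illustration of what thresholding a transition matrix involves, the sketch below row-normalises a matrix of transition counts, binarises it with a probability threshold, and counts the surviving transitions (the number of anchor micro-tube hypotheses). How the probabilities are estimated is described in § 3.3 and is not reproduced here; the count-based normalisation, the strict `>` comparison and all function names are assumptions.

```python
def normalise_rows(counts):
    """Turn per-cell transition counts into row-stochastic probabilities.

    Rows with no observed transitions are left as all zeros.
    """
    probs = []
    for row in counts:
        total = sum(row)
        probs.append([c / total if total else 0.0 for c in row])
    return probs


def threshold_transitions(probs, threshold):
    """Keep only transitions whose probability exceeds the threshold."""
    return [[1 if p > threshold else 0 for p in row] for row in probs]


def num_hypotheses(binary):
    """Cardinality of the thresholded (binary) transition matrix."""
    return sum(sum(row) for row in binary)
```

A higher threshold yields a sparser matrix and fewer hypotheses, while a lower one admits rarer transitions at a higher computational cost.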

4.3 Location invariance at test time

Anchor micro-tubes are sampled, based on the training data, according to the transition probabilities from specific cells (in the first frame of a pair) to other specific cells (in the second frame) (§ 3.3). However, since at test time action instances of the same class may appear in regions of the image plane other than those observed at training time, it is desirable to generate anchor micro-tube proposals in addition to those produced by the learnt transition matrices. Such a location invariance property can be achieved at test time by augmenting the binary transition matrix (§ 3.4) with likely transitions from other grid locations.

Each row/column of the transition matrix (§ 3.3) corresponds to a cell location in the grid. One augmentation technique is to set all the diagonal entries to 1 (i.e., , where ). This amounts to generating anchor cuboids which may have been missing at training time (cf. Fig. 4 (a)). The network can then be evaluated using this new set of anchor micro-tubes by configuring the pooling layer (§ 3.4) accordingly. When doing so, however, we observed only a very minor difference in avg-mAP, at the second decimal point, for TraMNet with test . Similarly, we also evaluated TraMNet by incorporating the transitions from each cell to its 8 neighbouring cells (also at test time), but observed no significant change in avg-mAP.
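The two test-time augmentations just described (adding the diagonal, and adding transitions to the 8 neighbouring cells) can be sketched on a binary transition matrix stored as a list of lists. The row-major mapping between grid cells and matrix indices, and the function names, are assumptions made for this example.

```python
def add_diagonal(binary):
    """Allow every cell to transition to itself (anchor cuboids)."""
    out = [row[:] for row in binary]
    for i in range(len(out)):
        out[i][i] = 1
    return out


def add_neighbours(binary, height, width):
    """Allow every cell to transition to its (up to) 8 neighbouring cells.

    Cell (r, c) of a height x width grid is assumed to map to index
    r * width + c in the transition matrix.
    """
    out = [row[:] for row in binary]
    for r in range(height):
        for c in range(width):
            i = r * width + c
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if dr == 0 and dc == 0:
                        continue  # the diagonal is handled by add_diagonal
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < height and 0 <= cc < width:
                        out[i][rr * width + cc] = 1
    return out
```

Both functions only ever add entries, so the transitions learnt from the training data are always preserved.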

A third approach, given a pyramid level , and the initial binary transition matrix for that level, consists of computing the relative transition offsets for all grid cells (offset where ). All such transition offsets correspond to different spatial translation patterns (of action instances) present in the dataset at different locations in the given video. Augmenting all the rows with these spatial translation patterns, by taking each diagonal entry in the transition matrix as reference point, yields a more dense transition matrix whose anchor micro-tubes are translation invariant, i.e., spatial location invariant. However, after training TraMNet at train we observed that the final avg-mAP at test was as compared to when using the original (sparse) transition matrix. As in the experiments (i.e., added diagonal and neighbour transitions) explained above, we evaluated the network that was trained on the original transition matrices at train by using the transition matrix generated via relative offsets, observing an avg-mAP consistent (i.e., ) with the original results.
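The relative-offset augmentation can be sketched in the same spirit: collect every (row, column) displacement present in the binary matrix, then replay all of them from every grid cell, discarding targets that fall outside the grid. As before, the row-major cell indexing and the function names are assumptions, and the sketch covers a single pyramid level.

```python
def relative_offsets(binary, width):
    """Collect the set of (d_row, d_col) displacements present in the matrix."""
    offsets = set()
    for i, row in enumerate(binary):
        r, c = divmod(i, width)
        for j, allowed in enumerate(row):
            if allowed:
                rr, cc = divmod(j, width)
                offsets.add((rr - r, cc - c))
    return offsets


def augment_with_offsets(binary, height, width):
    """Apply every observed displacement from every cell of the grid."""
    offsets = relative_offsets(binary, width)
    out = [row[:] for row in binary]
    for i in range(height * width):
        r, c = divmod(i, width)
        for dr, dc in offsets:
            rr, cc = r + dr, c + dc
            if 0 <= rr < height and 0 <= cc < width:
                out[i][rr * width + cc] = 1
    return out
```

The resulting matrix is denser than the learnt one, since a displacement observed at one location becomes available at all locations.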

This shows that the system should be trained using the original transition matrices learned from the data, whereas more anchor micro-tube proposals can be assessed at test time without loss of generality. It also shows that UCF101-24 is not sufficiently realistic a dataset from the point of view of translation invariance, which is why we conducted tests on Transformed-UCF101-24 (§ 4.1) to highlight this issue.

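| Methods | Test Δ | Untrimmed δ = 0.2 | Untrimmed δ = 0.5 | Untrimmed Acc% | Trimmed δ = 0.5 | Trimmed δ = .5:.95 | Trimmed Acc% | Trimmed CleaningFloor |
|---|---|---|---|---|---|---|---|---|
| Weinzaepfel et al. [31] | NA | 13.9 | - | - | 63.9 | - | - | - |
| SSD-L without-trimming | NA | 06.1 | 01.1 | 61.5 | - | - | - | - |
| SSD-L | NA | 14.6 | 05.7 | 58.5 | 63.9 | 38.2 | 75.5 | 80.2 |
| AMTnet-L | 3 | 12.1 | 04.3 | 62.0 | 63.7 | 39.3 | 76.5 | 83.4 |
| TraMNet (ours) | 3 | 13.4 | 04.6 | 67.0 | 64.2 | 41.4 | 78.5 | 86.6 |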
Table 2: Action localisation results (video-mAP) on the DALY dataset. SSD-L without trimming refers to when action paths are not trimmed and the network is SSD.

4.4 Handling sparse annotations

Table 2 shows the results on the DALY dataset. We can see that TraMNet significantly improves on SSD-L and AMTnet-L in the trimmed video setting, with an avg. video-mAP of 41.4%. TraMNet reaches top classification accuracy in both the trimmed and the untrimmed cases. As we would expect, TraMNet improves the temporal linking via better micro-tubes and classification, as clearly indicated in the trimmed videos setting. Nevertheless, SSD-L is the best when it comes to temporal trimming. We think this is because each micro-tube in our case is 4 frames long, as the test Δ is equal to 3, and each micro-tube has only one score vector rather than 4 score vectors, one for each frame, which might smooth out the temporal segmentation.
DALY allows us to show how TraMNet is able to handle sparse annotations better than AMTnet-L, which uses anchor cuboids, strengthening the argument that learning transition matrices helps generate better micro-tubes.

TraMNet’s performance on ‘CleaningFloor’ at δ equal to 0.5 in the trimmed case highlights the effectiveness of general anchor micro-tubes for dynamic classes. ‘CleaningFloor’ is one of DALY’s classes in which the actor moves spatially while the camera is mostly static. To further strengthen the argument, we picked classes showing fast spatial movements across frames in the UCF101-24 dataset and observed the class-wise average-precision (AP) at δ equal to . For ‘BasketballDunk’, ‘Skiing’ and ‘VolleyballSpiking’ TraMNet performs significantly better than both AMTnet-L and ACT-L; e.g. on ‘Skiing’, the performance of TraMNet, AMTnet-L and ACT-L is , and , respectively. More class-wise results are discussed in the supplementary material.
Training and testing at multiple Δ’s. To test whether TraMNet can handle sparse annotation we introduced an artificial gap (Δ) in UCF101’s training examples, while testing on frames that are far away (e.g. ). We can observe in Figure 5(a) that performance is preserved when increasing the training Δ while keeping the test Δ small (e.g. equal to 5, as shown in plot (a)). One could think of increasing Δ at test time to improve run-time efficiency: we can observe from Figure 5(b) that performance drops linearly as speed linearly increases. In both cases TraMNet consistently outperforms AMTnet-L. When Δ is large TraMNet’s improvement is large as well.

Figure 5: Avg mAP (δ = .5:.95) performance of TraMNet vs the competitor AMTnet-L, (a) when tested at constant Δ equal to 5 and trained on increasing Δ from 1 to 20, (b) when tested at increasing Δ from 1 to 30 and trained at constant Δ equal to 5.

Temporal labelling is performed using the labelling formulation presented in [26]. Temporal labelling actually hurts the performance on UCF101-24, as shown in Table 1, where ‘SSD-L-with-trimming’ uses [26]’s temporal segmenter, whereas ‘SSD-L’ and the other methods below it do not. In contrast, on DALY the results are quite the opposite: the same temporal labelling framework improves the performance of SSD-L from 6.1% to 14.6% at δ = 0.2 (Table 2). We think that these (superficially) contradictory results relate to the fact that action instances cover on average a very different fraction (70% versus 4%) of the video duration in UCF101-24 and DALY, respectively.
Detection speed. We measured the average time taken for a forward pass for a batch size of 1, as compared to 8 by [26]. A single-stream forward pass takes 29.8 milliseconds (i.e. 33fps) on a single 1080Ti GPU. One can improve speed even further by evaluating TraMNet with Δ equal to 2 or 4, obtaining a 2× or 4× speed improvement while paying very little in terms of performance, as shown in Figure 5(b).

5 Conclusions

We presented TraMNet, a deep learning framework for action detection in videos which, unlike previous state-of-the-art methods [23, 15, 12] that generate action cuboid proposals, can cope with real-world videos containing “dynamic” actions whose location significantly changes over time. This is done by learning a transition probability matrix for each feature pyramid layer from the training data in a hidden Markov model formulation, leading to an original configurable layer architecture. Furthermore, unlike its competitors [15, 12], which require dense frame-level bounding box annotation, TraMNet builds on the network architecture of [23], in which action representations are learnt from pairs of frames rather than chunks of consecutive frames, thus eliminating the need for dense annotation. An extensive experimental analysis supports TraMNet’s action detection capabilities, especially under dynamic actions and sparse annotations.


  • [1] Brox, T., Bruhn, A., Papenberg, N., Weickert, J.: High accuracy optical flow estimation based on a theory for warping (2004)
  • [2] Caba Heilbron, F., Escorcia, V., Ghanem, B., Carlos Niebles, J.: Activitynet: A large-scale video benchmark for human activity understanding. In: IEEE Int. Conf. on Computer Vision and Pattern Recognition. pp. 961–970 (2015)
  • [3] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 4724–4733. IEEE (2017)
  • [4] Elliott, R.J., Aggoun, L., Moore, J.B.: Hidden Markov models: estimation and control, vol. 29. Springer Science & Business Media (2008)
  • [5] Evangelidis, G., Singh, G., Horaud, R.: Continuous gesture recognition from articulated poses. In: ECCV Workshops (2014)
  • [6] van Gemert, J.C., Jain, M., Gati, E., Snoek, C.G.: APT: Action localization proposals from dense trajectories. In: BMVC. vol. 2, p. 4 (2015)
  • [7] Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: IEEE Int. Conf. on Computer Vision and Pattern Recognition (2014)
  • [8] Gkioxari, G., Malik, J.: Finding action tubes. In: IEEE Int. Conf. on Computer Vision and Pattern Recognition (2015)
  • [9] Gorban, A., Idrees, H., Jiang, Y., Zamir, A.R., Laptev, I., Shah, M., Sukthankar, R.: Thumos challenge: Action recognition with a large number of classes (2015)
  • [10] Gu, C., Sun, C., Vijayanarasimhan, S., Pantofaru, C., Ross, D.A., Toderici, G., Li, Y., Ricco, S., Sukthankar, R., Schmid, C., et al.: Ava: A video dataset of spatio-temporally localized atomic visual actions. arXiv preprint arXiv:1705.08421 (2017)
  • [11] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)
  • [12] Hou, R., Chen, C., Shah, M.: Tube convolutional neural network (t-cnn) for action detection in videos. In: IEEE Int. Conf. on Computer Vision (2017)
  • [13] Idrees, H., Zamir, A.R., Jiang, Y.G., Gorban, A., Laptev, I., Sukthankar, R., Shah, M.: The thumos challenge on action recognition for videos “in the wild”. Computer Vision and Image Understanding 155, 1–23 (2017)
  • [14] Jain, M., Van Gemert, J., Jégou, H., Bouthemy, P., Snoek, C.G.: Action localization with tubelets from motion. In: Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on. pp. 740–747. IEEE (2014)
  • [15] Kalogeiton, V., Weinzaepfel, P., Ferrari, V., Schmid, C.: Action tubelet detector for spatio-temporal action localization. In: IEEE Int. Conf. on Computer Vision (2017)
  • [16] Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., et al.: The kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017)
  • [17] Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. arXiv preprint arXiv:1708.02002 (2017)
  • [18] Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C.: SSD: Single shot multibox detector. arXiv preprint arXiv:1512.02325 (2015)
  • [19] Oneata, D., Verbeek, J., Schmid, C.: Efficient action localization with approximately normalized fisher vectors. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2545–2552 (2014)
  • [20] Peng, X., Schmid, C.: Multi-region two-stream R-CNN for action detection. In: ECCV 2016 - European Conference on Computer Vision. Amsterdam, Netherlands (Oct 2016)
  • [21] Redmon, J., Farhadi, A.: Yolo9000: Better, faster, stronger. arXiv preprint arXiv:1612.08242 (2016)
  • [22] Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems. pp. 91–99 (2015)
  • [23] Saha, S., Singh, G., Cuzzolin, F.: Amtnet: Action-micro-tube regression by end-to-end trainable deep architecture. In: IEEE Int. Conf. on Computer Vision (2017)
  • [24] Saha, S., Singh, G., Sapienza, M., Torr, P.H.S., Cuzzolin, F.: Deep learning for detecting multiple space-time action tubes in videos. In: British Machine Vision Conference (2016)
  • [25] Sapienza, M., Cuzzolin, F., Torr, P.H.: Learning discriminative space-time action parts from weakly labelled videos. Int. Journal of Computer Vision (2014)
  • [26] Singh, G., Saha, S., Sapienza, M., Torr, P., Cuzzolin, F.: Online real-time multiple spatiotemporal action localisation and prediction. In: IEEE Int. Conf. on Computer Vision (2017)
  • [27] Soomro, K., Zamir, A.R., Shah, M.: UCF101: A dataset of 101 human action classes from videos in the wild. Tech. rep., CRCV-TR-12-01 (2012)
  • [28] Sultani, W., Shah, M.: What if we do not have multiple videos of the same action? - video action localization using web images. In: IEEE Int. Conf. on Computer Vision and Pattern Recognition (2016)
  • [29] Wang, H., Kläser, A., Schmid, C., Liu, C.: Action Recognition by Dense Trajectories. In: IEEE Int. Conf. on Computer Vision and Pattern Recognition (2011)
  • [30] Weinzaepfel, P., Harchaoui, Z., Schmid, C.: Learning to track for spatio-temporal action localization. In: IEEE Int. Conf. on Computer Vision and Pattern Recognition (June 2015)
  • [31] Weinzaepfel, P., Martin, X., Schmid, C.: Human action localization with sparse spatial supervision. arXiv preprint arXiv:1605.05197 (2016)
  • [32] Weinzaepfel, P., Martin, X., Schmid, C.: Towards weakly-supervised action localization. arXiv preprint arXiv:1605.05197 (2016)
  • [33] Yang, Z., Gao, J., Nevatia, R.: Spatio-temporal action detection with cascade proposal and location anticipation. In: BMVC (2017)
  • [34] Yu, G., Yuan, J.: Fast action proposals for human action detection and search. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1302–1311 (2015)
  • [35] Zolfaghari, M., Oliveira, G.L., Sedaghat, N., Brox, T.: Chained multi-stream networks exploiting pose, motion, and appearance for action classification and detection. In: IEEE Int. Conf. on Computer Vision. pp. 2923–2932. IEEE (2017)