Compositional Temporal Visual Grounding of Natural Language Event Descriptions

12/04/2019 ∙ by Jonathan C. Stroud, et al. ∙ University of Michigan 6

Temporal grounding entails establishing a correspondence between natural language event descriptions and their visual depictions. Compositional modeling becomes central: we first ground atomic descriptions such as "girl eating an apple" or "batter hitting the ball" to short video segments, and then establish the temporal relationships between the segments. This compositional structure enables models to recognize a wider variety of events not seen during training by recognizing their atomic sub-events. Explicit temporal modeling accounts for the wide variety of temporal relationships that can be expressed in language: e.g., in the description "girl stands up from the table after eating an apple," the visual ordering of the events is reversed, with "eating an apple" occurring first, followed by "standing up from the table." We leverage these observations to develop a unified deep architecture, CTG-Net, which performs temporal grounding of natural language event descriptions to videos. We demonstrate that our system outperforms prior state-of-the-art methods on the DiDeMo, Tempo-TL, and Tempo-HL temporal grounding datasets.







1 Introduction

We consider the task of temporally grounding natural language event descriptions in videos. In this task, we are given a video and a natural language query, and we must locate the point in time that corresponds to the query. For example, we may have a video of a baseball game, along with the query “the batter hits the ball, and a player tags him out at first base.” The system must find the point in time that best corresponds with the event described in the query. A system that can accomplish this task would have many applications, such as video retrieval and human-robot interaction.

Figure 1: Example query and video. Queries are often compositional, and they impose a temporal ordering over their components.

Like many tasks at the intersection of vision and language, temporal grounding requires that we generalize to both unseen videos and unseen queries. Since we make no assumptions about the content of a particular video or query, it is possible for them to depict a completely novel scene or event. This presents a challenge, and one way to overcome it is to leverage our prior knowledge; these modalities have structural properties which can be encoded into a model, which allows us to make better use of training data and generalize more effectively. In our work, we focus on two structural properties inherent to temporal grounding, which we leverage in our model.

The first property is compositionality, that is, a single query may be composed of many events. Consider the example in Figure 1, which contains two events (“walks by” and “touches”). Conceptually, there is no limit to the number of events that a query could describe. We can take advantage of this structure by breaking up a query into atomic components, and localizing each of them individually. This makes the problem more manageable; while the full query may be a novel combination of components, it is likely that we will have seen some of these components before.

Figure 2: Overview of our proposed CTG-Net. We represent the input query as several sub-events (depicted as red and blue boxes) using our Event Representation Network (Section 3). We match these sub-events to video segments, and then combine and refine these matchings using our Temporal Grounding Network (Section 4).

The second property is temporal ordering, that is, each query designates a particular ordering of its components. Often, the events must occur in the video in the same order that they appear in the sentence, like in “this then that”. But this is not always the case, as in the example in Figure 1, where the order is reversed. Determining the ordering is non-trivial, as natural language is complex and orderings can be implied from context. However, a model that can determine this ordering can then use it to its advantage, as it allows the model to prune spurious detections which do not match the temporal constraints.

We propose a model, called Compositional Temporal Grounding Network or CTG-Net, which explicitly leverages both compositional and temporal structure for temporal grounding, depicted in Figure 2. Specifically, our model first segments the query into discrete sub-events, leveraging compositionality (Section 3). Next, we ground each sub-event in the video, and then refine these groundings to enforce temporal ordering (Section 4). We apply our model to three temporal grounding datasets: DiDeMo, Tempo-TL, and Tempo-HL [7, 8], and demonstrate that our approach outperforms state-of-the-art methods (Section 5). Concretely, we improve Recall@1 from 26.8% to 28.7% on Tempo-TL and from 20.8% to 21.5% on Tempo-HL.

2 Related Work

Temporal grounding is a recently-introduced task, and it is often referred to in prior work as moment retrieval or action localization with natural language queries [7, 5]. Prior work on this task generally takes one of two approaches: sliding window or single-shot.

Sliding Window. In this approach, we embed the query and video segments into the same low-dimensional space, such that the query embedding is similar to the video embedding for the correct segment, and dissimilar for other segments. To ground a query, we slide its embedding across the video and find the most similar segment. Three notable sliding window approaches include Moment Context Networks (MCN) and Moments Localized with Latent Context (MLLC) by Hendricks et al. [7, 8], Cross-modal Temporal Regression Localizers (CTRL), introduced by Gao et al. [5], and Activity Concepts-based Localizer (ACL), by Ge et al. [6]. While these approaches have been successful, they fail to account for compositional events. MLLC takes a step toward this goal by representing each query as two sub-events, the “base” and the “context.” However, it is restrictive to assume that every event description has exactly two sub-events. Our approach lifts this restriction by allowing any number of sub-events.


Single-Shot. In this approach, we compare each token in the query to each segment in the video, and then aggregate these comparisons to classify each segment as a correct or incorrect grounding. This includes Temporal GroundNet (TGN), by Chen et al. [2], Moment Alignment Networks (MAN), by Zhang et al. [33], and Temporal Modular Networks (TMN), introduced by Liu et al. [16]. These approaches do leverage compositionality, but only at the level of individual words, and they generally do not account for temporal relationships. Temporal Modular Networks make significant improvements in terms of compositionality, in that they use the parse tree of the query to gradually aggregate more refined groundings. However, TMN is limited by its use of a fixed dependency parser, while we demonstrate that our method can work both with a fixed parser and in an end-to-end architecture.

Compositional Representations. It is well known that events are compositional, and this observation has inspired many prior works in action recognition [26, 4, 15], captioning [32], and temporal grounding [16]. Leveraging compositional structures is central to many tasks in computer vision [3]. To the best of our knowledge, our work is the first to leverage explicit compositional structure for temporal grounding.

Temporal Relationships. There is a large body of work that addresses temporal relationships in natural language, enabled by corpora with labeled temporal relations [24, 18]. Prior approaches have used formal logic [23, 13] in addition to machine learning [17, 14, 1]. Few papers have approached this problem using the most recent developments in deep learning for NLP; however, there has been some recent work on determining temporal ordering from clinical notes [25, 11]. We provide a mechanism for incorporating temporal relations which does not rely on hand-designed features or formal logic, by creating a “position embedding” for each sub-event. We find that this mechanism, while simple, is effective for reasoning about temporal relations.

Weakly-Supervised Temporal Localization. Our work is related to the task of weakly-supervised temporal localization, a well-studied problem in which a system must learn to perform temporal action localization when given only video-level labels [21, 27, 19, 30, 10]. Similarly, in our work, we do not have individual temporal labels for each atomic sub-event, and must learn to localize these without labels. However, we do have temporal labels for each query. In addition, these prior works use action category labels, as opposed to natural language queries, as the description for each labeled event. Only recently has weakly-supervised temporal localization been studied in the context of natural language queries [29].

Figure 3: Event Representation Network with the parser approach. The Sub-Event Segmentation module produces masks that indicate which words belong to each detected sub-event (Section 3.1). The Sub-Event Representation module produces a triplet representation for each sub-event (Section 3.2).

Relation to MCN and MLLC. Our work is a generalization of Moment Context Networks (MCN) [7]. Specifically, we use the same encoders for embedding words (Section 3.2) and video segments (Section 3.3), as well as the same distance metric to compute matching scores (Section 4). We generalize MCN by allowing multiple sub-events to each be localized simultaneously, incorporating compositional structure, and later combining and refining their matching scores, incorporating temporal structure. Moments Localized with Latent Context (MLLC) [8] similarly builds off of MCN, but is limited to localizing just two sub-events, and does not include temporal refinement. Our model, therefore, is more flexible and performs well with both simple and complex queries.

3 Event Representation Network

We represent each natural language query as a set of one or more sub-events, where the number of sub-events is chosen flexibly depending on the query. We produce this representation using a novel network architecture, which is depicted in Figure 3 and has two primary components. The first component is the sub-event segmentation module, which determines the number of sub-events as well as which words belong to each. The second component is the sub-event representation module, which creates a vector representation for each sub-event.

3.1 Sub-Event Segmentation Module

We propose two methods for segmenting a query into sub-events.

Figure 4: Result of segmenting sub-events with a parser. This query contains three clauses (with tags S, SBAR, and S). Each word is assigned to the lowest-level clause to which it belongs, and length-1 clauses are discarded. Two sub-events are detected in this example, depicted by the red and blue outlines.

Parser. In this approach, we use the Stanford Parser [12] to segment the sentence into clauses, and we consider each clause to be a sub-event. We define “clauses” to include all clause-level tags in the Penn Treebank (S, SBAR, SINV), as well as fragment tags (FRAG). Since clauses can themselves contain subordinate clauses, we assign each word to the lowest-level clause to which it belongs. For an example of this, see Figure 4. We discard all sub-events that contain only one word, as these are typically due to a conjunction such as “after,” which we do not consider to be events.
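For illustration, the clause-assignment rule above can be sketched in pure Python. The nested-tuple tree below stands in for a Stanford Parser constituency parse; the encoding and helper names are our own, not part of the released implementation.

```python
import itertools

# Clause-level tags (Penn Treebank) plus fragments, as defined above.
CLAUSE_TAGS = {"S", "SBAR", "SINV", "FRAG"}

def segment_into_subevents(tree):
    """Assign each word to its lowest enclosing clause; drop length-1 clauses."""
    clauses, counter = {}, itertools.count()

    def walk(node, clause_id):
        label, children = node
        if label in CLAUSE_TAGS:                # entering a (sub)clause
            clause_id = next(counter)
            clauses[clause_id] = []
        for child in children:
            if isinstance(child, str):          # leaf word: assign it to the
                clauses[clause_id].append(child)  # innermost clause seen so far
            else:
                walk(child, clause_id)

    walk(tree, None)
    # Discard length-1 clauses (typically bare conjunctions like "after").
    return [words for words in clauses.values() if len(words) > 1]

# (S (NP the girl) (VP stands up (SBAR after (S (NP she) (VP eats an apple)))))
parse = ("S", [
    ("NP", ["the", "girl"]),
    ("VP", ["stands", "up",
            ("SBAR", ["after",
                      ("S", [("NP", ["she"]),
                             ("VP", ["eats", "an", "apple"])])])])])
subevents = segment_into_subevents(parse)
# → [['the', 'girl', 'stands', 'up'], ['she', 'eats', 'an', 'apple']]
```

The one-word SBAR clause ("after") is discarded, leaving the two sub-events.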

Bi-directional LSTM. While the parser approach allows us to segment reasonable sub-events, it imposes some limitations. First, all sub-events are contiguous in the query and cannot overlap. This introduces an issue when a clause contains a pronoun, as we cannot resolve the reference of the pronoun without the broader context of the sentence. By removing the restrictions of contiguity and non-overlap, we can replace a pronoun with its referent, making it easier to localize. Another limitation is that the parser is fixed, and therefore any noise in its output cannot be fine-tuned away during training.

As a flexible and end-to-end learnable alternative to the parser, we propose a simple attention mechanism using a bi-directional LSTM. We feed the word-level features (Section 3.2) into the Bi-LSTM, and then apply linear classifiers to its output to get a set of attention masks corresponding to sub-events. Unlike in the parser approach, this always results in a fixed number of sub-events K, which we set to the maximum number of sub-events identified by the parser in our experiments. However, as we will explain in the following section, each sub-event is later associated with a weight, and it is ignored when its weight is zero. In this way, we can still flexibly represent the query as any number of sub-events.
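A minimal PyTorch sketch of this attention mechanism follows. The module and parameter names are ours, and the sigmoid normalization of the masks is an assumption: the text only says that linear classifiers produce the masks.

```python
import torch
import torch.nn as nn

class SubEventAttention(nn.Module):
    """Hypothetical sketch: Bi-LSTM over word features, then K linear
    heads producing one per-word attention mask per sub-event."""

    def __init__(self, d_word, d_hidden, K):
        super().__init__()
        self.lstm = nn.LSTM(d_word, d_hidden, batch_first=True,
                            bidirectional=True)
        self.heads = nn.Linear(2 * d_hidden, K)   # one logit per sub-event

    def forward(self, words):          # words: (B, T, d_word)
        h, _ = self.lstm(words)        # (B, T, 2 * d_hidden)
        logits = self.heads(h)         # (B, T, K)
        # Sigmoid keeps each mask entry in [0, 1] -- our assumption.
        return torch.sigmoid(logits).transpose(1, 2)   # (B, K, T)

masks = SubEventAttention(d_word=300, d_hidden=256, K=4)(
    torch.randn(2, 12, 300))           # e.g., batch of 2 queries, 12 words
```

Sub-events whose learned weight (Section 3.2) is zero are simply ignored downstream, so a fixed K does not force K sub-events per query.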

Output. Mathematically, both segmentation approaches result in a set of masks m_k ∈ [0, 1]^T for k = 1, …, K, where T is the number of tokens in the input query and K is the number of detected sub-events.

3.2 Sub-Event Representation Module

To create the sub-event representations, we first compute GloVe embeddings [22] for each word in the query, and then pass these into an LSTM [9]. The output of the LSTM is a sequence of feature vectors h_t for t = 1, …, T, where T is the number of words and the dimension of h_t is a hyperparameter corresponding to the hidden feature dimension. To convert these word-level features to sub-event-level features, we pool them via a weighted average using the sub-event masks from Section 3.1. That is, e_k = (Σ_t m_k[t] h_t) / (Σ_t m_k[t]), resulting in e_k, the feature vector for the k-th sub-event.
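The mask-weighted pooling of word features into sub-event features can be sketched in NumPy (variable names are illustrative; the exact normalization of the weighted average is as we reconstruct it here):

```python
import numpy as np

def pool_subevents(H, M, eps=1e-8):
    """Weighted-average pooling of word features into sub-event features.

    H: (T, d) word-level LSTM outputs.
    M: (K, T) sub-event masks, one row per sub-event.
    Returns E: (K, d), where E[k] = sum_t M[k, t] * H[t] / sum_t M[k, t].
    """
    denom = np.maximum(M.sum(axis=1, keepdims=True), eps)  # guard empty masks
    return (M @ H) / denom

H = np.random.rand(12, 8)                  # 12 words, feature dim 8
M = np.zeros((2, 12))
M[0, :5] = 1.0                             # sub-event 1: words 0..4
M[1, 5:] = 1.0                             # sub-event 2: words 5..11
E = pool_subevents(H, M)                   # (2, 8) sub-event features
```

With hard binary masks, as produced by the parser, this reduces to a per-clause mean of the word features.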

We then create a triplet representation for each sub-event. Specifically, we represent the k-th sub-event as (f_k, p_k, w_k), where f_k is the language embedding, p_k is the position embedding, and w_k is a scalar weight. The language embedding f_k is used to co-embed the sub-event features in the same space as the visual features in Section 3.3. The position embedding, p_k, represents the position of the sub-event in time, and is used to enforce temporal consistency in Section 4.1. The weight w_k allows the model to ignore sub-events which describe something non-visual, or sub-events which are used purely as context for describing another event (Section 4). Each of these embeddings is created by passing the sub-event features e_k through a single fully-connected layer. We normalize f_k and p_k with L2 normalization, and we normalize the weights w_k with a softmax across the K sub-events.
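The triplet construction and its two normalizations can be sketched as follows; the fully-connected weights here are random stand-ins for the learned layers, and all names are ours:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def l2_normalize(x, eps=1e-8):
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

rng = np.random.default_rng(0)
K, d, d_emb, d_pos = 3, 8, 16, 4
E = rng.standard_normal((K, d))            # pooled sub-event features e_k

# Hypothetical fully-connected weights for the three heads.
Wf = rng.standard_normal((d, d_emb))       # -> language embedding f_k
Wp = rng.standard_normal((d, d_pos))       # -> position embedding p_k
ww = rng.standard_normal(d)                # -> scalar weight logit

f = l2_normalize(E @ Wf)                   # each f_k has unit L2 norm
p = l2_normalize(E @ Wp)                   # each p_k has unit L2 norm
w = softmax(E @ ww)                        # weights sum to 1 over K sub-events
```

A sub-event whose weight w_k collapses toward zero contributes nothing to the query-level matching score in Section 4.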

3.3 Video Representation

We represent each video segment as an embedding v(s, e), where (s, e) are the start and end-points of the segment. To create the embeddings, we adopt the approach introduced by [7]. In this approach, either RGB or Optical Flow frames are passed into a CNN to create frame-level feature vectors. These features are averaged within each segment to create local features, and averaged across the entire video to get global features. We then concatenate the local and global feature vectors along with the start and end-points (called temporal end-point features, or TEF) into a single feature vector, whose length depends on whether RGB or Optical Flow features are used. This vector is then fed into a 2-layer multi-layer perceptron (MLP) to get the video segment embedding v(s, e).
4 Temporal Grounding

When queries are compositional, it is not always straightforward to define the expected output of temporal grounding. Consider a query of the form “X after Y.” Should the grounding include the temporal extent of both X and Y, or only X? Some datasets (DiDeMo [7]) use the former option, opting to ground the full extent of events described by the query. Other datasets (Tempo-TL, Tempo-HL [8]), however, use the latter, since Y in this case is used only to refer to a particular instance of X. We propose a method which is agnostic to the particular grounding scheme and can be trained end-to-end for either scenario.

Intuitively, our proposed temporal grounding procedure leverages the sub-event representations, as well as their temporal ordering, to find a matching between the query and the video. We first attempt to localize each individual sub-event in the video. Then, we apply a refinement network which updates these locations to ensure temporal consistency, and combines them to compute the final grounding. Concretely, this network takes as input the sub-event representations (Section 3.2) and video segment representations (Section 3.3), and as output, it produces a score D(s, e) for each time segment (s, e), which corresponds to how well that time segment matches the query.

To locate each sub-event k, we compare its language embedding f_k with the embedding v(s, e) of each video segment. We compute the Euclidean distance between each pair, that is, d_k(s, e) = ||f_k − v(s, e)||_2, where pairs with smaller distances between them are considered better matches. The distance d_k(s, e) therefore serves as a matching score between sub-event k and video segment (s, e).

We then combine these matching scores across sub-events to compute an initial matching score for the entire query. Specifically, we perform a weighted average of the sub-event matching scores, where the weights are given by the weights w_k from the event representation network, that is, D_0(s, e) = Σ_k w_k d_k(s, e). The weights allow the matching to favor particular sub-events over others, or exclude a sub-event entirely. This is helpful when a sub-event is used only for context, or when a sub-event is not visible in the video.
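The per-sub-event distances and their weighted combination can be written compactly with broadcasting (a sketch with illustrative names; segments are indexed rather than labeled by (s, e)):

```python
import numpy as np

def match_scores(f, V, w):
    """Per-sub-event Euclidean distances and their weighted combination.

    f: (K, d_emb) sub-event language embeddings f_k.
    V: (S, d_emb) video segment embeddings, one row per candidate segment.
    w: (K,) softmax-normalized sub-event weights w_k.
    Returns d: (K, S) distances and D0: (S,) initial query-level scores.
    """
    d = np.linalg.norm(f[:, None, :] - V[None, :, :], axis=-1)  # (K, S)
    D0 = w @ d                                                   # (S,)
    return d, D0

f = np.array([[0.0, 0.0], [1.0, 0.0]])    # K = 2 sub-events
V = np.array([[0.0, 0.0], [3.0, 4.0]])    # S = 2 candidate segments
w = np.array([0.5, 0.5])
d, D0 = match_scores(f, V, w)             # d[0] = [0, 5]; D0[0] = 0.5
```

Lower scores indicate better matches, so a sub-event with weight near zero leaves D_0 essentially unchanged.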

4.1 Refinement Network

The initial matching score accounts for compositional structure, but does not account for temporal structure. To leverage temporal structure, we propose an additional step, where the matching scores and expected positions of each sub-event are used to update the matching score for the entire query. This phase allows the model to downweight segments that do not match the expected position of each sub-event in the video, which is encoded using the position embedding (Section 3.2).

More precisely, for all video segments (s, e), we apply a refinement function r which takes as input the initial matching score D_0(s, e), the matching score d_k(s, e) for a particular sub-event k, the position embedding p_k for that sub-event, and the temporal extent (s, e) of the segment. We apply the refinement function to each sub-event and add the results to the matching score to get a refined score D(s, e). That is,

D(s, e) = D_0(s, e) + Σ_k r(D_0(s, e), d_k(s, e), p_k, s, e).

For the refinement function r, we use a 2-layer MLP. The inputs are concatenated along the feature dimension to form a single input vector. The final matching score D(s, e) is used to rank the candidate segments, and as the final grounding we choose the segment that results in the lowest score, that is, (s*, e*) = argmin_(s, e) D(s, e).
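The refinement step, as we reconstruct it, can be sketched with an untrained 2-layer MLP; the dimensions, random weights, and loop structure are illustrative rather than the released implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
K, d_pos, d_hid = 2, 4, 8
segs = [(0.0, 0.2), (0.2, 0.6), (0.0, 1.0)]   # candidate (s, e) extents
S = len(segs)
D0 = rng.random(S)                            # initial query-level scores
d = rng.random((K, S))                        # per-sub-event distances d_k
P = rng.standard_normal((K, d_pos))           # position embeddings p_k

d_in = 2 + d_pos + 2                          # [D0, d_k, p_k, s, e]
W1, b1 = rng.standard_normal((d_in, d_hid)), np.zeros(d_hid)
W2, b2 = rng.standard_normal(d_hid), 0.0

def refine(D0, d, P, segs):
    """Refined score: D(s,e) = D0(s,e) + sum_k r(D0, d_k, p_k, s, e)."""
    D = D0.copy()
    for k in range(d.shape[0]):
        for i, (s, e) in enumerate(segs):
            x = np.concatenate(([D0[i], d[k, i]], P[k], [s, e]))
            h = np.maximum(0.0, x @ W1 + b1)  # hidden layer with ReLU
            D[i] += h @ W2 + b2               # scalar refinement term
    return D

D = refine(D0, d, P, segs)
best = segs[int(np.argmin(D))]                # final grounding (s*, e*)
```

In training, the MLP learns to raise the score of segments whose temporal extent conflicts with the sub-events' position embeddings.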

4.2 Training

CTG-Net is fully end-to-end differentiable and can be optimized via gradient descent. We adopt the triplet ranking loss from Hendricks et al. [7] and apply it directly to the refined matching scores D(s, e). This loss imposes a penalty when the score for an incorrect segment is lower than that of a correct segment (recall that lower scores are better), or when the two scores are within some margin b, that is, L_rank(D_pos, D_neg) = max(0, D_pos − D_neg + b). For every positive example, we compare it with two negative examples: one is a segment chosen randomly from the same video, and the other is a segment chosen from a random video at the same timepoint. The loss is a weighted sum of the ranking losses for these two negative examples, that is, L = L_intra + λ L_inter, where λ is the inter-video loss weight.
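The ranking loss and its intra-/inter-video combination can be sketched in a few lines (a plain-Python illustration; the function names and the value of the weight lam are ours):

```python
def ranking_loss(D_pos, D_neg, margin):
    # Lower scores are better: penalize when the positive (correct)
    # segment's score is not at least `margin` below the negative's.
    return max(0.0, D_pos - D_neg + margin)

def total_loss(D_pos, D_neg_intra, D_neg_inter, margin, lam):
    # lam is the inter-video loss weight (a hyperparameter).
    return (ranking_loss(D_pos, D_neg_intra, margin)
            + lam * ranking_loss(D_pos, D_neg_inter, margin))

# A positive score well below both negatives incurs no penalty:
total_loss(0.2, D_neg_intra=1.0, D_neg_inter=1.0, margin=0.1, lam=0.5)  # → 0.0
```

In practice the max(0, ·) would be a differentiable hinge (e.g., a clamp in PyTorch) applied batch-wise.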

Implementation Details.

We optimize CTG-Net using stochastic gradient descent with a batch size of 120, an initial learning rate of 0.05, and we decay this learning rate by a factor of 10 every 33 epochs (50 for Tempo-HL). We multiply the learning rate by a factor of 10 when updating the LSTM used to create word-level features. We train each network for a maximum of 100 epochs, with early stopping if the validation accuracy plateaus.

We choose hyperparameters from MCN [7] where applicable, that is, the word feature dimension, the embedding dimension, and the inter-video loss weight λ. The remaining hyperparameters, including the dimension of the position embedding p_k and the size of the hidden layer in the refinement MLP, are selected using the validation set.

As in Hendricks et al. [7], we divide each video into 5-second clips, and consider all segments which consist of one or more contiguous clips. For a 30-second video, this gives 6 possible segments of length 1, 5 possible segments of length 2, and so on. The number of candidate segments for a video with N clips is therefore N(N + 1)/2.
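Enumerating the candidate segments is a one-liner (clip indices here are 0-based and spans are inclusive; the function name is illustrative):

```python
def candidate_segments(n_clips):
    """All contiguous clip spans (s, e), inclusive: n(n+1)/2 in total."""
    return [(s, e) for s in range(n_clips) for e in range(s, n_clips)]

segs = candidate_segments(6)   # a 30-second video at 5 seconds per clip
# 6 + 5 + 4 + 3 + 2 + 1 = 21 candidate segments
```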

Late Fusion. We use two sets of visual features, one extracted from RGB video frames, and the other extracted from Optical Flow sequences. For RGB, we pass each frame of the video through a pretrained VGG16 network up to the fc7 layer [28], producing a feature vector for each frame. For Optical Flow, we use the penultimate layer of a Temporal Segment Network [31] trained on action recognition. For each experiment, we train two networks, one for each of the two visual modalities, and then perform a weighted average of their refined matching scores D(s, e) to get a fused result. All results include late fusion.


Our implementation is built in PyTorch [20], and uses GloVe embeddings and visual features (both RGB and Flow) provided by the original creators of the datasets [7, 8]. Our implementation will be made publicly available.

5 Experiments

DiDeMo. Distinct Describable Moments (DiDeMo) is a recent dataset introduced by Hendricks et al. [7] as a benchmark for temporal grounding. This dataset consists of over 10K unedited videos from Flickr and 40K unique queries. Each query describes a distinct moment in the corresponding video, and its temporal location is annotated independently by four annotators. The queries describe a rich range of events that are not limited to human activities, such as camera and object motion. The dataset includes a high percentage of queries (18.4%) that include words about temporal relationships, such as “first”, “begin”, “after”, and “final”. These properties make DiDeMo a realistic benchmark dataset for temporal grounding.

Tempo-TL and Tempo-HL. Temporal Template Language (Tempo-TL) and Temporal Human Language (Tempo-HL) are two recent temporal grounding datasets which build off of DiDeMo [8]. These datasets contain complex queries which are constructed specifically to test compositional grounding, making them an ideal testbed for our method.

Tempo-TL queries are procedurally constructed from pairs of DiDeMo queries. They are constructed using one of five templates: “X before Y,” “Y before X,” “X after Y,” “Y after X,” and “X then Y,” depending on the temporal relationship of the two events. While this procedure sometimes results in unnatural sentences, it enables fine-grained evaluation of grounding under specific temporal relationships. Depending on the template, the system must localize the base moment X (“before” and “after”), or the concatenation of both events X and Y (“then”).

Tempo-HL queries are constructed by asking each annotator to describe an event relative to an existing query from DiDeMo. The resulting queries include all of the same temporal relationships as Tempo-TL, as well as “while” relationships, which are not covered by Tempo-TL. Because these queries are rewritten in natural language from scratch, they include coreference statements and a wider range of temporal prepositions. Tempo-HL is therefore a much more realistic and challenging dataset.

DiDeMo+Tempo. When training on Tempo-TL or Tempo-HL, we also include training examples from DiDeMo, as suggested by the original creators. This is referred to as “DiDeMo+Tempo-TL” or “DiDeMo+Tempo-HL.”

            Tempo-TL splits (R@1)   Average  |  Tempo-HL splits (R@1)          Average
Method      DiDeMo Before After Then  R@1  R@5  |  DiDeMo Before After Then While  R@1  R@5
Prior         10.7   17.9  22.4  0.0 12.7 52.6  |   19.4   29.3   0.0   0.0   4.7 10.7 37.6
MCN [7]       24.9   32.3  26.1 25.1 27.1 73.4  |   26.1   26.8  14.9  18.6  10.7 19.4 70.9
TALL [5]      21.0   27.1  26.3  4.8 19.8 64.7  |   21.8   25.9  14.4   2.5   8.1 14.6 60.7
MLLC [8]      25.9   32.0  24.3 25.0 26.8 74.0  |   27.4   32.3  14.4  19.6  10.4 20.8 71.7
CTG-Net-P     26.8   34.1  27.6 26.4 28.7 76.0  |   27.6   27.9  16.9  18.7  10.5 20.3 71.3
CTG-Net-A     26.6   35.1  26.1 23.6 27.9 75.5  |   27.6   28.6  18.8  20.8  11.6 21.5 72.7
Table 1: Results on Tempo-TL and Tempo-HL. We compare against prior work, using two variants of our model: with the fixed parser (CTG-Net-P), and with the Bi-LSTM attention mechanism (CTG-Net-A). For both Tempo-TL and Tempo-HL, we train on the DiDeMo+Tempo-TL and DiDeMo+Tempo-HL training sets, respectively, and evaluate on the test sets. Average refers to the average of each metric across the splits (Section 5).


We adopt the suggested evaluation metrics for both DiDeMo and Tempo. In both datasets, four annotators are each given a video and query and are asked to select the temporal segment which corresponds best with the query. In some cases, there is disagreement between the annotators. To account for disagreement, each metric is computed by comparing the prediction to each of the four annotations, and the annotation that most disagrees with the prediction is discarded. Using this method, we compute three metrics: Recall@1 (R@1), Recall@5 (R@5), and Mean Intersection over Union (mIOU). We note that prior works use the names Rank@1 and Rank@5 when computing these metrics. Our Recall@1 and Recall@5 metrics are equivalent to these.

For DiDeMo, the three metrics are computed for all videos in the dataset. For TEMPO, we use a modification suggested by the original creators [8]: we split the dataset into subsets based on the temporal words (“before”, “after”, “then”, “while”) which are present in the queries. We compute the metrics within each subset, and then average the results with equal weight to get a final set of metrics. This allows us to get a fine-grained understanding of how our model performs under different temporal relationships.

5.1 Comparison with State of the Art

Tempo-TL. In Table 1, we compare CTG-Net with several prior works and baselines on Tempo-TL. All methods outperform the Prior baseline, which always selects the first video segment. Our model outperforms MCN [7] and TALL [5], two sliding-window approaches which do not account for the compositional structures present in Tempo. We also outperform MLLC [8], the previous state of the art on Tempo-TL, which achieves an average R@1 of 26.8%, compared to 27.9% for our method with the Bi-LSTM attention mechanism (CTG-Net-A) and 28.7% for our method with the parser (CTG-Net-P). In terms of average R@5, we also find that CTG-Net-P outperforms MLLC (+2.0%, from 74.0% for MLLC to 76.0% for ours), and performs comparably in terms of mIOU (-0.3%, from 42.3% for MLLC to 42.0% for ours). We perform particularly well on queries containing “before” (+3.1%, from 32.0% for MLLC to 35.1% for ours) and “after” (+3.3%, from 24.3% for MLLC to 27.6% for ours), demonstrating our robustness to different temporal relationships.

Tempo-HL. In Table 1, we also perform the same set of comparisons on Tempo-HL. We find that Tempo-HL is more challenging for all temporal words, likely because of the wider range of temporal prepositions and coreferences present in natural language. Our model with the Bi-LSTM (CTG-Net-A) outperforms all prior work on this challenging dataset. We note that our model performs well on the less common temporal relations “after” and “while”. These relations are particularly challenging because they require the model to perform temporal reasoning about sub-events that occur out of order (“after”) and simultaneously (“while”). Our temporal refinement procedure allows the model to take this into account.

We find that, while our model with the fixed parser (CTG-Net-P) outperforms that with the Bi-LSTM attention mechanism (CTG-Net-A) on Tempo-TL, the opposite is true on Tempo-HL. This is likely due to the rigid, procedurally constructed queries in Tempo-TL, which lend themselves well to parsing. Tempo-HL queries, on the other hand, contain more variation in sentence structure, which may not be accounted for by the parser. However, we find that both methods are competitive, demonstrating that compositional structure is useful, regardless of the method of decomposition. Concretely, CTG-Net-A achieves 21.5% Average R@1 on Tempo-HL, an improvement over MLLC (20.8%), the previous state-of-the-art. As with Tempo-TL, we also find that CTG-Net outperforms MLLC in terms of R@5 (+1.0%, from 71.7% for MLLC to 72.7% for ours), but performs worse in terms of mIOU (-1.4%, from 44.6% for MLLC to 43.2% for ours). This drop in mIOU is because our model, when compared to MLLC, tends to be over-confident when predicting shorter segments. These shorter segments may contain important components of the event, but this is not reflected in the mIOU metric.

DiDeMo. In Table 1, we demonstrate that CTG-Net outperforms MLLC on DiDeMo when trained on both DiDeMo and Tempo-TL (R@1 +1.0%, from 25.8% to 26.8%) and on DiDeMo and Tempo-HL (+0.2%, from 27.4% to 27.6%). This demonstrates that CTG-Net performs well on the relatively simple queries present in DiDeMo, in addition to the complex queries present in Tempo.

Prior work on DiDeMo does not use the additional training data from Tempo-TL and Tempo-HL, so for fair comparison we additionally report the performance of CTG-Net trained on DiDeMo only. When trained on DiDeMo only, CTG-Net-A (27.8%) outperforms several recent baselines, including Temporal Modular Networks (22.9%) [16] and Moment Alignment Networks (27.0%) [33], both of which are competitive recent baselines which leverage compositional reasoning as part of their approach. We additionally achieve competitive results with MCN (28.1%), MLLC (28.4%), and Temporal GroundNet (28.2%) [7, 8, 2]. We observe that, surprisingly, CTG-Net-A performs slightly below MCN and MLLC when trained on DiDeMo only, while the opposite is true when we use additional data from Tempo. A simple explanation for this discrepancy is that our model is more complex, and therefore requires more data than is available in the DiDeMo-only setting. In support of this, we observe that our model performs similarly on DiDeMo when trained using the additional data from Tempo-HL (-0.1%), while other models experience steep drops in performance when using this additional data, such as -2.0% for MCN and -1.0% for MLLC. This demonstrates that our model has higher capacity and is more robust to changes in distributions between datasets, both of which are important qualities in practice.

Figure 5: Recall@1 on DiDeMo+Tempo-HL as a function of query complexity (above) and query novelty (below). Complexity: queries with a high number of clauses are more complex and therefore more difficult to ground, and our method outperforms prior work at all levels. Novelty: queries which are dissimilar to previously-seen queries are also more difficult to ground, and our model outperforms prior work at all levels. Error bars depict one standard deviation.

Tempo-HL - Average (Val)
Method                                  R@1    R@5    mIOU
Ours w/o masks m_k, refinement r        20.75  71.91  41.68
Ours w/o masks m_k                      20.69  71.54  41.72
Ours w/o refinement r                   21.24  72.09  42.55
Ours w/o positions p_k, weights w_k     21.49  72.46  42.31
Ours w/o positions p_k                  21.37  71.68  42.58
Ours w/o weights w_k                    21.27  71.89  42.88
Ours-A (Full)                           21.83  72.98  43.25
Table 2: Ablation studies. Top section: to demonstrate the impact of compositional and temporal structure, we remove the sub-event masks m_k and the temporal refinement network r. Middle section: we remove the position embedding p_k and weights w_k from the sub-event representations. Our full model outperforms all variants. All numbers are reported on the Tempo-HL validation set.
Figure 6: Examples of sub-event localization on Tempo-TL. The highlighted portions of each query indicate the sub-event masks, and the colored boxes below the video frames indicate the predicted locations of each sub-event (the location with the lowest matching score d_k(s, e)). In both cases, CTG-Net correctly identifies both sub-events, both in the query and in the video.

5.2 Complex & Novel Queries

We expect CTG-Net to generalize well to a broad range of queries. Two challenging areas are complex queries, which have many sub-events, and novel queries, which are dissimilar to queries seen during training. To test our model in these scenarios, we evaluate its performance on subsets of queries from DiDeMo and Tempo-HL with increasing complexity and novelty.

Complexity. To measure complexity, we count the number of clauses in each query, using the results of the Stanford Parser [12]. We divide the queries into categories based on the number of clauses, and we give the performance on each split in Figure 5. We find that queries with more clauses are more challenging to ground. We also find that our full model (Ours-A) achieves better performance than MCN and MLLC [7, 8] at all levels, demonstrating that we are able to generalize to both simple and complex queries.
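The clause-counting step above can be sketched in a few lines, given parses in the Stanford Parser's bracketed output format. The set of clause-level labels used here (S, SBAR, SINV, SQ) is an assumption about what counts as a clause, not necessarily the paper's exact criterion:

```python
import re

# Penn Treebank clause-level constituent labels (an assumption about
# which labels we count as "clauses").
CLAUSE_LABELS = {"S", "SBAR", "SINV", "SQ"}

def count_clauses(bracketed_parse: str) -> int:
    """Count clause-level nodes in a bracketed constituency parse string."""
    # Each constituent opens as "(LABEL ..."; extract all label tokens.
    labels = re.findall(r"\(([A-Z]+)", bracketed_parse)
    return sum(1 for label in labels if label in CLAUSE_LABELS)

parse = ("(ROOT (S (NP (DT the) (NN girl)) (VP (VBZ stands) "
         "(SBAR (IN after) (S (VP (VBG eating) (NP (DT an) (NN apple))))))))")
print(count_clauses(parse))  # 3: the outer S, the SBAR, and the inner S
```

Queries would then be bucketed by this count before computing per-bucket Recall@1.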

Novelty. To measure novelty, we compute the average GloVe embedding for each query in the test set, find the training query with the most similar average embedding, and take the Euclidean distance between the two embeddings. We consider queries with a high GloVe distance to be more novel. We divide the queries into four categories based on the quartile of the GloVe distance and give the performance in Figure 5. Novel queries are more challenging, and again we find that our full model achieves better performance at all levels, demonstrating that it is robust to novel queries.
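The novelty metric can be sketched directly from this description. The following is an illustrative implementation that uses random vectors in place of pretrained GloVe embeddings; the 50-dimensional size and the handling of out-of-vocabulary words are assumptions:

```python
import numpy as np

def mean_embedding(query, embeddings, dim=50):
    """Average the word embeddings of a query; zero vector if no word is known."""
    vecs = [embeddings[w] for w in query.split() if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def novelty(test_query, train_queries, embeddings):
    """Euclidean distance from the test query to its nearest training query."""
    q = mean_embedding(test_query, embeddings)
    train = np.stack([mean_embedding(t, embeddings) for t in train_queries])
    dists = np.linalg.norm(train - q, axis=1)
    return dists.min()

rng = np.random.default_rng(0)
vocab = "girl eats an apple the dog runs".split()
emb = {w: rng.normal(size=50) for w in vocab}  # stand-in for GloVe
train = ["girl eats an apple", "the dog runs"]
print(novelty("girl eats an apple", train, emb))     # 0.0: seen verbatim
print(novelty("the dog eats an apple", train, emb))  # > 0: partially novel
```

Test queries would then be split into quartiles of this distance before evaluation.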

In these experiments, we use Recall@1 over the entire Tempo-HL+DiDeMo test set, not the Average R@1 metric (Section 1) that separates by temporal words. This allows us to fairly compare queries based only on their complexity and novelty, rather than their source dataset. We use our own implementations of MCN and MLLC, and verify that these implementations achieve similar or better performance than that reported by the original authors: specifically, we find that MCN and MLLC achieve Average R@1 scores of 20.6% and 20.6% on Tempo-HL, respectively, compared to 19.4% and 20.8% in the original work.

5.3 Ablation Study

We perform an in-depth ablation study in Table 2. In these experiments, we demonstrate the contribution of the novel components of our model, namely our use of compositional and temporal structure, and our use of weights and position embeddings in the sub-event representation. In all ablation experiments, we use CTG-Net with attention (Ours-A) trained on DiDeMo+Tempo-HL, and we report results on the validation set.

Compositional and Temporal Structure. In the top section of Table 2, we demonstrate the effect of eliminating compositional and temporal reasoning from our model. Ours w/o masks refers to our model without sub-event masks. This model still performs the temporal refinement step, but it does not have the benefit of receiving refinement updates from multiple sub-events. Ours w/o refinement refers to our model without the temporal refinement step, which still uses compositional structure by combining sub-event matching scores. Ours w/o masks, refinement has both components removed, and is equivalent to MCN [7]. We find that removing both compositional and temporal structure leads to decreased performance (-1.1%), and that removing compositional structure alone (-1.1%) is more detrimental than removing temporal structure alone (-0.6%). Interestingly, we see no benefit to including temporal refinement without also including sub-event decomposition, demonstrating that temporal refinement leverages updates from multiple sub-events as intended.

Sub-Event Representation. In the middle section of Table 2, we demonstrate the effect of removing two pieces of our proposed sub-event representation. Ours w/o pos. emb. refers to our model without the position embeddings, which are later used as part of the refinement step. Ours w/o weights refers to our model without sub-event weights, meaning that each sub-event is weighted equally. Ours w/o pos. emb., weights refers to a model with neither component. We find that removing either component leads to decreased performance (-0.46% and -0.56%, respectively), but that removing both does not decrease performance further, demonstrating that there is still utility in including a sub-event decomposition without these components.
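To make the weighting ablation concrete, the following is a minimal sketch of weighted versus uniform combination of per-sub-event matching scores. The softmax normalization and scalar per-sub-event scores are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def combine_scores(sub_scores, logits=None):
    """Combine per-sub-event matching scores for one candidate segment.

    With logits=None, this mimics the "w/o weights" ablation, where every
    sub-event contributes equally; otherwise, learned weight logits are
    normalized with a softmax (an assumed normalization).
    """
    sub_scores = np.asarray(sub_scores, dtype=float)
    if logits is None:
        # Uniform weighting: each of the k sub-events gets weight 1/k.
        w = np.full(len(sub_scores), 1.0 / len(sub_scores))
    else:
        logits = np.asarray(logits, dtype=float)
        e = np.exp(logits - np.max(logits))  # stable softmax
        w = e / e.sum()
    return float(w @ sub_scores)

print(combine_scores([0.2, 0.8]))              # uniform: 0.5
print(combine_scores([0.2, 0.8], [0.0, 2.0]))  # emphasizes the second sub-event
```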

Qualitative Examples. In Figure 6, we present examples of CTG-Net correctly identifying the locations of individual sub-events in complex queries. For more examples, please refer to the appendix.

6 Conclusion

We demonstrate that compositional and temporal structure are useful for temporal grounding. Specifically, we show that event descriptions are composed of sub-events, and that the descriptions impose an ordering on these sub-events. To leverage these structures, we propose Compositional Temporal Grounding Networks (CTG-Net), and show that this model achieves higher performance on challenging datasets than models that do not leverage such structure. We make our code publicly available.

7 Acknowledgements

This work is partially supported by King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research (OSR) under Award No. OSR-CRG2017-3405, and by the Toyota Research Institute (TRI).


  • [1] N. Chambers, S. Wang, and D. Jurafsky (2007) Classifying temporal relations between events. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pp. 173–176. Cited by: §2.
  • [2] J. Chen, X. Chen, L. Ma, Z. Jie, and T. Chua (2018) Temporally grounding natural sentence in video. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 162–171. Cited by: §2, §5.1.
  • [3] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan (2009) Object detection with discriminatively trained part-based models. IEEE transactions on pattern analysis and machine intelligence 32 (9), pp. 1627–1645. Cited by: §2.
  • [4] A. Gaidon, Z. Harchaoui, and C. Schmid (2013) Temporal localization of actions with actoms. IEEE transactions on pattern analysis and machine intelligence 35 (11), pp. 2782–2795. Cited by: §2.
  • [5] J. Gao, C. Sun, Z. Yang, and R. Nevatia (2017) Tall: temporal activity localization via language query. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5267–5275. Cited by: §2, §2, §5.1, Table 1.
  • [6] R. Ge, J. Gao, K. Chen, and R. Nevatia (2019) MAC: mining activity concepts for language-based temporal localization. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 245–253. Cited by: §2.
  • [7] L. A. Hendricks, O. Wang, E. Shechtman, J. Sivic, T. Darrell, and B. Russell (2017) Localizing moments in video with natural language. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5803–5812. Cited by: §A.1, Figure 7, §1, §2, §2, §2, §3.3, §4.2, §4.2, §4.2, §4.2, §4, §5.1, §5.1, §5.2, §5.3, Table 1, §5.
  • [8] L. A. Hendricks, O. Wang, E. Shechtman, J. Sivic, T. Darrell, and B. Russell (2018) Localizing moments in video with temporal language. In Empirical Methods in Natural Language Processing (EMNLP), Cited by: §A.1, Figure 7, §1, §2, §2, §4.2, §4, §5.1, §5.2, Table 1, §5, §5.
  • [9] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §3.2.
  • [10] D. Huang, L. Fei-Fei, and J. C. Niebles (2016) Connectionist temporal modeling for weakly supervised action labeling. In European Conference on Computer Vision, pp. 137–153. Cited by: §2.
  • [11] S. Jeblee and G. Hirst (2018) Listwise temporal ordering of events in clinical notes. In Proceedings of the Ninth International Workshop on Health Text Mining and Information Analysis, pp. 177–182. Cited by: §2.
  • [12] D. Klein and C. D. Manning (2003) Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics-Volume 1, pp. 423–430. Cited by: §3.1, §5.2.
  • [13] S. Konur (2008) An interval logic for natural language semantics.. Advances in Modal Logic 7, pp. 177–191. Cited by: §2.
  • [14] M. Lapata and A. Lascarides (2006) Learning sentence-internal temporal relations. Journal of Artificial Intelligence Research 27, pp. 85–117. Cited by: §2.
  • [15] X. Liang, L. Lin, and L. Cao (2013) Learning latent spatio-temporal compositional model for human action recognition. In Proceedings of the 21st ACM international conference on Multimedia, pp. 263–272. Cited by: §2.
  • [16] B. Liu, S. Yeung, E. Chou, D. Huang, L. Fei-Fei, and J. Carlos Niebles (2018) Temporal modular networks for retrieving complex compositional activities in videos. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 552–568. Cited by: §2, §2, §5.1.
  • [17] I. Mani, M. Verhagen, B. Wellner, C. M. Lee, and J. Pustejovsky (2006) Machine learning of temporal relations. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics, pp. 753–760. Cited by: §2.
  • [18] P. Muller and X. Tannier (2004) Annotating and measuring temporal relations in texts. In Proceedings of the 20th international conference on Computational Linguistics, pp. 50. Cited by: §2.
  • [19] P. Nguyen, T. Liu, G. Prasad, and B. Han (2018) Weakly supervised action localization by sparse temporal pooling network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6752–6761. Cited by: §2.
  • [20] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, Cited by: §4.2.
  • [21] S. Paul, S. Roy, and A. K. Roy-Chowdhury (2018) W-talc: weakly-supervised temporal activity localization and classification. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 563–579. Cited by: §2.
  • [22] J. Pennington, R. Socher, and C. Manning (2014) Glove: global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543. Cited by: §3.2.
  • [23] I. Pratt-Hartmann (2005) Temporal prepositions and their logic. Artificial Intelligence 166 (1-2), pp. 1–36. Cited by: §2.
  • [24] J. Pustejovsky, P. Hanks, R. Sauri, A. See, R. Gaizauskas, A. Setzer, D. Radev, B. Sundheim, D. Day, L. Ferro, et al. (2003) The timebank corpus. In Corpus linguistics, Vol. 2003, pp. 40. Cited by: §2.
  • [25] P. Raghavan, E. Fosler-Lussier, N. Elhadad, and A. M. Lai (2014) Cross-narrative temporal ordering of medical events. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1, pp. 998–1008. Cited by: §2.
  • [26] M. Rohrbach, M. Regneri, M. Andriluka, S. Amin, M. Pinkal, and B. Schiele (2012) Script data for attribute-based recognition of composite activities. In European Conference on Computer Vision, pp. 144–157. Cited by: §2.
  • [27] Z. Shou, H. Gao, L. Zhang, K. Miyazawa, and S. Chang (2018) Autoloc: weakly-supervised temporal action localization in untrimmed videos. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 154–171. Cited by: §2.
  • [28] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §4.2.
  • [29] R. Tan, H. Xu, K. Saenko, and B. A. Plummer (2019) WMAN: weakly-supervised moment alignment network for text-based video segment retrieval. arXiv preprint arXiv:1909.13784. Cited by: §2.
  • [30] L. Wang, Y. Xiong, D. Lin, and L. Van Gool (2017) Untrimmednets for weakly supervised action recognition and detection. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 4325–4334. Cited by: §2.
  • [31] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool (2016) Temporal segment networks: towards good practices for deep action recognition. In European conference on computer vision, pp. 20–36. Cited by: §4.2.
  • [32] R. Xu, C. Xiong, W. Chen, and J. J. Corso (2015) Jointly modeling deep video and compositional text to bridge vision and language in a unified framework. In Twenty-Ninth AAAI Conference on Artificial Intelligence, Cited by: §2.
  • [33] D. Zhang, X. Dai, X. Wang, Y. Wang, and L. S. Davis (2018) MAN: moment alignment network for natural language moment retrieval via iterative graph adjustment. arXiv preprint arXiv:1812.00087. Cited by: §2, §5.1.

Appendix A Qualitative Examples

A.1 Comparison with MCN and MLLC

In Figure 7, we provide examples of challenging instances from Tempo-HL and DiDeMo, and show the temporal segment chosen by our model compared to prior work. Our model reliably localizes difficult queries that are missed by both MCN [7] and MLLC [8].

A.2 Examples of Compositional Grounding

In Figure 8, we provide examples of instances from Tempo-HL and DiDeMo, and show the temporal segments that we identify as sub-events. We show that CTG-Net identifies sensible sub-events, and later uses these to create a final grounding.

A.3 Examples Before and After Refinement

In Figure 9, we provide examples of instances from Tempo-HL and DiDeMo, grounded using CTG-Net before and after the refinement step. Refinement improves groundings.

Appendix B Dependency Parser

B.1 Example Segmentations

In Figure 10, we show examples of sub-events identified by CTG-Net with the parser approach. This model identifies sensible sub-events, which can each be grounded in the video.

B.2 Distribution of Sub-Events

In Figure 11, we show the distribution of the number of sub-events identified by the parser in each of the three datasets: DiDeMo, Tempo-TL, and Tempo-HL.

Figure 7: Example results from the Tempo-HL and DiDeMo training sets. We compare CTG-Net with the Bi-directional LSTM attention mechanism against results from MCN [7] and MLLC [8].
Figure 8: Example results of temporally grounded sub-events from Tempo-HL and DiDeMo. We use the CTG-Net model with the dependency parser, and show the individual temporal groundings before combination and refinement. The top 3 examples depict accurate compositional groundings, and the bottom 2 depict incorrect groundings.
Figure 9: Example results of CTG-Net before and after the refinement step, shown on Tempo-HL and DiDeMo. In the top 3 examples, we see that the refinement step is able to improve the prediction. In the bottom example, we show a difficult case where the refinement process leads to incorrect predictions.
Figure 10: Example sub-event segmentations produced by the dependency parser.
Figure 11: Distribution of number of sub-events in DiDeMo (left), Tempo-TL (middle) and Tempo-HL (right), as identified by our dependency parser approach. DiDeMo queries are typically shorter and contain few sub-events. Tempo-TL and Tempo-HL are strongly biased towards two sub-events per query.