Tripping through time: Efficient Localization of Activities in Videos

04/22/2019 ∙ by Meera Hahn, et al. ∙ Georgia Institute of Technology

Localizing moments in untrimmed videos via language queries is a new and interesting task that requires the ability to accurately ground language into video. Previous works have approached this task by processing the entire video, often more than once, to localize relevant activities. In the real-world applications that this task lends itself to, such as surveillance, efficiency is a pivotal trait of a system. In this paper, we present TripNet, an end-to-end system that uses a gated attention architecture to model fine-grained textual and visual representations in order to align text and video content. Furthermore, TripNet uses reinforcement learning to efficiently localize relevant activity clips in long videos by learning how to intelligently skip around the video. It extracts visual features for fewer frames to perform activity classification. In our evaluation over Charades-STA, ActivityNet Captions and the TACoS dataset, we find that TripNet achieves high accuracy and saves processing time by only looking at 32-41% of the total video.




1 Introduction

The increasing availability of videos and their importance in application domains, such as social media and surveillance, has created a pressing need for automated video analysis methods. A particular challenge arises in long video clips which have noisy or nonexistent labels, and which are common in video surveillance, instructional videos, and many other settings. While classical video retrieval works at the level of entire clips, a more challenging and important task is to efficiently sift through large amounts of unorganized video content and retrieve specific moments of interest.

A promising approach to this task is the paradigm of temporal activity localization through language query (TALL), which was introduced and developed in two pioneering works by Gao et al. [13] and Hendricks et al. [18]. The TALL task is illustrated in Fig. 1, which shows a few frames from a climbing video along with a language query describing the event of interest: “Climber adjusts his feet for the first time.” The green frame denotes the desired output, the first frame in which the feet are being adjusted. Note that the solution to this problem requires local and global temporal information: local cues are needed to identify the specific frame in which the feet are adjusted, but global cues are also needed to identify the first time this occurs, since the climber adjusts his feet throughout the clip.

Figure 1: A demonstration of the video search method: given an input query, a human would normally fast-forward the video until relevant objects or actions appear and then proceed slowly to obtain the appropriate temporal boundaries.

While the original MCN [18] and CTRL [13] architectures, along with more recent work by Yuan et al. [46], obtained encouraging results for the TALL task, they all suffer from a significant limitation which we address in this paper: existing methods construct temporally dense representations of the entire video clip, which are then analyzed to identify the target events of interest. In long video clips, where the event of interest is a single short moment, this can be very inefficient. This also stands in stark contrast to how humans search for events of interest, as illustrated schematically in Fig. 1. A human would fast-forward through the clip from the beginning, effectively sampling a sparse set of frames until they got closer to the region of interest. Then they would go frame by frame until the starting point was localized. At that point the search would terminate, leaving the vast majority of frames unexamined. Note that efficient solutions must go well beyond simple heuristics like starting from the beginning, since events of interest can occur anywhere within a target clip and the global position may not be obvious from the query.

In order to develop an efficient solution to the TALL task it is necessary to address two main challenges: 1) devising an effective joint representation (embedding) of video and language features to support localization; and 2) learning an efficient search strategy that can mimic the human ability to sample a video clip intelligently to localize an event of interest. Most prior works on TALL address the issue of constructing a joint embedding space through a sliding window approach that pools at different temporal scales. While this is an effective strategy, it is not well suited to efficient search because it provides only coarse control over which frames are evaluated. In contrast, we use a gated-attention architecture, which is effective in aligning the text queries, which often consist of an object and its attributes or an action, with the video features, which consist of convolutional filters that can identify these elements. Two prior works [46, 43] also utilize an attention model, but they do not address its use in efficient search.

We address the challenge of efficient search through a combination of reinforcement learning (RL) and fine-grained video analysis. Our approach is inspired in part by strategies used by human video annotators. In addition, we draw on the success of RL in language-guided navigation tasks in 3D environments [1, 6, 10, 15]. Specifically, we make an analogy between temporally localizing events in a video through playback controls and the actions an agent would take to navigate around an environment looking for a specific object of interest. As in the navigation task, labeled data is not available for explicitly modeling the relationship between actions and rewards, but it is possible to learn a model through simulation. Our approach to temporal localization uses a novel architecture for combining the multi-modal video and text features with a policy learning module that learns to step forward and rewind the video and receives rewards for accurate temporal localization.

In summary this paper makes two contributions:

  1. We present a novel end-to-end reinforcement learning framework called TripNet that addresses the problem of temporal activity localization via language query. TripNet uses gated attention to align text and visual features, improving its accuracy.

  2. We present experimental results on the datasets Charades-STA, ActivityNet Captions and TACoS. These results demonstrate that TripNet achieves state-of-the-art results in accuracy while significantly increasing efficiency by evaluating only 32-41% of the total video.

2 Related Work

Querying using Natural Language

This paper is most closely related to prior works on the TALL problem, beginning with the two works [13, 18] that introduced it. These works each introduced a dataset for the task: Charades-STA [13] and DiDeMo [18]. Each dataset contains untrimmed videos with multiple sentence queries and the corresponding start and end timestamp of the clip within the video. The two papers adopt a supervised cross-modal embedding approach, in which sentences and videos are projected into an embedding space, optimized so that queries and their corresponding video clips lie close together, while non-corresponding clips are far apart. At test time both approaches run a sliding window across the video and compute an alignment score between the candidate window and the language query. The window with the highest score is then selected as the localized clip. Follow-up works on the TALL task have adopted a similar approach, but have differed in the design of the embedding process [7]. The works [24, 38, 46] modified the original approach by adding self-attention and co-attention to the embedding process. [43] introduces early fusion of the text queries and video features rather than using an attention mechanism. Additionally, [43] uses the text to produce activity segment proposals as candidate windows instead of using a fixed sliding window approach. Pre-processing of the sentence queries has not been a primary focus of previous works, with most methods using a simple LSTM-based architecture for sentence embedding, with GloVe [30] or Skip-Thought [21] vectors to represent the sentences. The primary difference with this paper is that all previous methods require the analysis of the entire video in order to generate their candidate windows. In contrast, our learned agent examines 32-41% of frames on average, a substantial increase in efficiency (see Sec. 4 for details).

Other work in video-based localization which predated the introduction of TALL either used a limited text vocabulary to search for specific events or used structured videos [3, 33, 39, 16]. Other earlier works addressed the task of retrieving objects using natural language questions, which is in general less complex than localizing actions in videos [19, 20, 27].

Temporal Localization

Temporal action localization refers to localizing activities over a known set of action labels in untrimmed videos. Some existing work in this area has found success by extracting CNN features from the video frames, pooling the features and feeding them into either single or multi-stage classifiers to obtain action predictions, along with temporal labels [5, 9, 26, 37, 45, 42]. These methods use a sliding window approach. Often, multiple window sizes are used and candidate windows are densely sampled, meaning that the candidates overlap with each other. While successful in terms of accuracy, this is an exhaustive search method that leads to high computational costs. Another set of methods for temporal action localization forgoes the end-to-end approach and instead uses a two-stage approach: first generating temporal proposals, and second, classifying the proposals into action categories [4, 14, 17, 34, 35]. Unlike both sets of previous methods, our proposed model, TripNet, uses reinforcement learning to perform temporal localization using natural language queries. In the related field of action recognition [2, 25, 12, 23, 28, 36, 40], there is a method called Frame Glimpse [44] that uses reinforcement learning to identify an action while looking at the smallest possible number of frames. That work only classifies actions in trimmed videos and does not perform any type of temporal boundary localization.

3D navigation and Reinforcement Learning

Locating a specific activity within a video using a reinforcement learning agent is very similar to an agent navigating through a three-dimensional world. How can we learn to efficiently navigate around the temporal landscape of the video? The parallels between our approach and navigation lead us to additionally review the related work in the language-based navigation area [1, 15]. Most recently, the task of Embodied Question Answering [10] was introduced. In this task, an agent is asked questions about the environment, such as “what color is the bathtub,” and the agent must navigate to the bathtub and answer the question. The method introduced in [10] focuses on grounding the language of the question not directly into the pixels of the scene but into actions for navigating around the scene. We seek to ground our text query into actions for skipping around the video to narrow down on the correct clip. Another recent work [6] explores giving visually grounded instructions such as “go to the green red torch” to an agent and having it learn to navigate its environment to find the object and complete the instruction. This line of work is similar to ours, although in a different task and domain, because the agent is not given navigational instructions but is instead given a specific visual description of the object to find. However, locating actions additionally requires a temporal understanding of the video stream instead of just processing video frames to locate spatial objects.

In reinforcement learning, an agent learns how to act through trial-and-error interactions with an environment. There are many different reinforcement learning approaches, and we choose to use a model-free actor-critic framework. We choose the actor-critic framework because it estimates both a value function, for being in a certain state, and a policy function, to map the state to an action directly. Named for their functionality, the value estimate is called the critic and is used to update the policy, which is called the actor. We specifically use the Asynchronous Advantage Actor-Critic (A3C) method [29], which deploys multiple workers in parallel that each have their own network parameters. At the end of each episode the workers update a global set of parameters.

3 Methods

In this section, we describe TripNet, an end to end reinforcement learning method for localizing and retrieving temporal activities in videos given a natural language query. TripNet can be broken up into two major components: the state processing module and the policy learning module. The state processing module creates a visual-linguistic encoding of the current state and passes it to the policy module which generates an action policy.

3.1 Problem Formulation

The localization problem that we are solving is defined as follows: given an untrimmed video $V$ and a language query $L$, the aim is to temporally localize the specific clip $W$ in $V$ which is described by $L$. In other words, if we denote the untrimmed video as $V = \{f_n\}_{n=1}^{N}$, where $N$ is the number of frames in the video, we want to find the clip $W = \{f_n, \dots, f_{n+k}\}$ that corresponds best to $L$. It is possible to solve this problem efficiently because videos have an inherent temporal structure, such that an observation made at frame $n$ conveys information about frames in the past and in the future. The basic questions are how to encode the uncertainty in the location of the target event in a video, and how to update the uncertainty from successive observations. While a Bayesian formulation could be employed, the measurement and update model would need to be learned and supervision is not available.

Since it is computationally feasible to simulate the search process (in fact it is only a one-dimensional space, in contrast to standard navigation tasks) we adopt an RL approach instead. We are motivated by human annotators who observe a short clip and make a decision to skip forward or backward in the video by some number of frames, until they can narrow in on the target clip. We emulate this sequential decision process using reinforcement learning (RL). Using RL we train an agent that can steer a fixed-size window around the video to efficiently find the target clip $W$ without looking at all frames of the video $V$. We employ the actor-critic method A3C [29] to learn the policy $\pi$ that maps $(V, L) \rightarrow W$, where $L$ is the language query. The intuition is that the agent will take large jumps around the video until it finds visual features that identify proximity to $W$, and then it will start to take smaller steps as it narrows in on localizing the clip.

Figure 2: An overview of our reinforcement learning framework, TripNet, which localizes specific moments in videos based on a natural language description of the moment. Each state consists of the natural language query and the set of consecutive frames within the current bounding window. There are two main components of TripNet: the state processing module and the policy learning module. The state processing module encodes the state into a joint visual and linguistic representation, which is then fed to the policy module, which generates the action policy. The actions in our framework are either moving the bounding window forward or backward by some number of frames or terminating the search and returning the clip from the current state.

3.2 State and Action Space

At each time step, the agent observes the current state, which consists of the sentence $L$ and a candidate clip of the video. The clip is defined by a bounding window $[W_{start}, W_{end}]$, where the start and end are frame numbers. At time step $t = 0$, the bounding window is set to $[0, X]$, where $X$ is the average length of annotated clips within the dataset. This window size is fixed and does not change. At each time step, the state processing module creates a state representation vector, from which the policy module generates an action policy. This policy is a distribution over all the possible actions. An action is then sampled according to the policy. Our action space consists of 7 predefined actions: move the entire bounding window $[W_{start}, W_{end}]$ forward or backward by $h$ frames, $j$ frames, or 1 second of frames, or TERMINATE, where $h$ and $j$ are predefined jump sizes. If the window cannot be shifted because it is already at the start or end of the video, the bounding window does not change. The TERMINATE action ends the search and returns the clip of the current state as the clip in the video best matched to $L$.
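The window update described above can be sketched in a few lines. This is only an illustration under stated assumptions: the action names, the `Window` type, and the jump sizes `h`, `j` and `fps` passed in as parameters are all hypothetical, and a shift that would leave the video is treated as a no-op, which is one reading of the boundary rule.

```python
from dataclasses import dataclass

# Hypothetical action names; the text specifies seven actions but no identifiers.
ACTIONS = ("FWD_H", "BACK_H", "FWD_J", "BACK_J", "FWD_1S", "BACK_1S", "TERMINATE")


@dataclass(frozen=True)
class Window:
    start: int  # first frame of the bounding window
    end: int    # last frame of the bounding window (inclusive)


def apply_action(window, action, num_frames, h, j, fps):
    """Return (new_window, done) after applying one of the seven actions."""
    if action == "TERMINATE":
        return window, True
    # Map the action suffix to a jump size in frames.
    jump = {"H": h, "J": j, "1S": fps}[action.split("_")[1]]
    shift = jump if action.startswith("FWD") else -jump
    new_start, new_end = window.start + shift, window.end + shift
    # Assumption: if the shift would leave the video, the window stays put.
    if new_start < 0 or new_end > num_frames - 1:
        return window, False
    return Window(new_start, new_end), False
```

For example, `apply_action(Window(0, 99), "FWD_H", 1000, h=100, j=200, fps=25)` moves the window to frames 100 through 199 while keeping its length fixed.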

3.3 TripNet architecture

We now describe the architecture of TripNet which is illustrated in Figure 2. TripNet consists of a state-processing module that processes the video and text features, followed by a policy learning module. This module allows TripNet to localize actions described in the text without going over the entire video.

3.3.1 State Processing Module

At each time step, the state-processing module takes the current state as input and outputs a joint representation of the input video clip and the sentence query $L$. The joint representation is used by the policy learner to create an action policy, from which the action to take is sampled. The clip is fed into C3D [41] to extract the spatio-temporal features from the fifth convolutional layer. We mean-pool the C3D features across frames and denote the result as $x_M$.

To encode the sentence query, we first pass $L$ through a Gated Recurrent Unit (GRU) [8], which outputs a vector $x_L$. We then transform the query embedding into an attention vector that we can apply to the video embedding. To do so, the sentence query embedding $x_L$ is passed through a fully connected linear layer with a sigmoid activation function. The output of this layer is expanded to the same dimension as $x_M$; we call the result the attention vector $att_L$. We then perform a Hadamard multiplication between $att_L$ and $x_M$ and output the result as the state representation $s_t$. This attention unit is our gated attention architecture for activity localization. The gated-attention unit is designed to gate specific filters based on the attention vector from the natural language query [11]. In order to attend to specific objects (or their attributes) in the search query, the gating mechanism allows us to focus on the specific filters that can identify this information. For example, to identify the “lady in red shirt”, we need to focus on ‘red’ and ‘lady’ in the visual features.
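The gating step can be illustrated with a small pure-Python sketch. It is not the authors' implementation: the function name, the list-of-lists layout of the video features (channels by positions), and the explicit `weight` and `bias` parameters of the linear layer are assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gated_attention(x_m, x_l, weight, bias):
    """Gate video feature channels with a query-derived attention vector.

    x_m:    video features as [channels][positions]
    x_l:    query embedding (e.g. the GRU output)
    weight: linear layer weights as [channels][len(x_l)]
    bias:   linear layer biases as [channels]
    """
    # Linear layer + sigmoid: one gate in (0, 1) per visual channel.
    att = [
        sigmoid(sum(w * x for w, x in zip(row, x_l)) + b)
        for row, b in zip(weight, bias)
    ]
    # "Expanding" the gate means reusing it at every position of its channel,
    # so the Hadamard product reduces to scaling each channel by its gate.
    return [[a * v for v in channel] for a, channel in zip(att, x_m)]
```

A gate near 1 passes a filter's response through unchanged, while a gate near 0 suppresses it, which is how the query can emphasize the filters relevant to, say, ‘red’ or ‘lady’.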

In our experiments, we also implement an additional baseline using a simple concatenation operation between the video and text representations to demonstrate the effectiveness of our gated-attention architecture. Here, we only do self-attention over the mean pooled C3D features. Then we take a Skip-Thought [21] encoding of the sentence query and concatenate it with the features of the video frames to produce the state representation. In the experiments, we denote this method as TripNet-Concat.

3.3.2 Policy Learning Module

We use an actor-critic method to model the sequential decision process of grounding the language query to a temporal video location. The module employs a deep neural network to learn the policy and value functions. The network consists of a fully connected linear layer followed by an LSTM, which is followed by a fully connected layer to output the value function $v(s_t|\theta_v)$ and a fully connected layer to output the policy $\pi(a_t|s_t, \theta_\pi)$, where $s_t$ is the state representation at time $t$, $\theta_v$ denotes the critic branch parameters and $\theta_\pi$ denotes the actor branch parameters. The policy $\pi$ is a probability distribution over all possible actions given the current state. Since we are modeling a sequential problem, we use an LSTM so that the system has memory of previous states, which informs future actions. Specifically, we use the asynchronous actor-critic method known as A3C [29] with Generalized Advantage Estimation [32], which reduces policy gradient variance. The method runs multiple parallel threads that each run their own episodes and update global network parameters at the end of the episode.
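To make the role of the actor output concrete, the following sketch turns raw actor scores (logits) into a probability distribution and samples an action from it. The function names are hypothetical, and the network that would produce the logits (linear layer, LSTM, actor head) is left out.

```python
import math
import random


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(logits, rng):
    """Sample an action index from the policy defined by the actor logits."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Calling `sample_action(logits, random.Random(0))` with seven logits would return an index into the seven-action space; sampling rather than taking the arg-max keeps the agent exploring during training.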

Since the goal is to learn a policy that returns the best matching clip, we want to encourage each action to bring the bounding window $[W_{start}, W_{end}]$ closer to the bounds of the ground truth clip. Hence, the action taken should return a state whose clip has more overlap with the ground truth than the previous state. However, we also want to ensure the agent takes an efficient number of jumps and does not excessively sample the clip. In order to encourage this behavior, we give a small negative reward in proportion to the total number of steps thus far. As a result, the agent is encouraged to find the clip window as quickly as possible. We experiment to find the optimal negative reward factor $\beta$. We found that using a negative reward factor also results in the agent taking more actions with larger frame jumps. Hence, our reward at any time step $t$ is calculated as follows:

$$r_t = IOU(s_t) - IOU(s_{t-1}) - \beta \cdot t \qquad (1)$$

where we set $\beta$ to .01. We calculate the IOU between the clip of the state at time $t$, $[W^t_{start}, W^t_{end}]$, and the ground truth clip for sentence $L$, $[G_{start}, G_{end}]$, as follows:

$$IOU(s_t) = \frac{\min(W^t_{end}, G_{end}) - \max(W^t_{start}, G_{start})}{\max(W^t_{end}, G_{end}) - \min(W^t_{start}, G_{start})} \qquad (2)$$
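A minimal sketch of this reward follows. It assumes the reward is the gain in IoU over the previous state minus a penalty of the factor (.01) times the step count, and that the IoU is the min/max ratio of the two intervals, left unclamped so that it goes negative for disjoint windows. The function names and tuple layout are hypothetical.

```python
def temporal_iou(window, truth):
    """Overlap divided by hull for two (start, end) frame intervals.

    Assumption: the ratio is left unclamped, so it is negative when the
    intervals are disjoint. The two intervals must not both be empty.
    """
    w_start, w_end = window
    g_start, g_end = truth
    overlap = min(w_end, g_end) - max(w_start, g_start)
    hull = max(w_end, g_end) - min(w_start, g_start)
    return overlap / hull


def step_reward(prev_window, window, truth, t, beta=0.01):
    """Gain in IoU over the previous state minus a penalty that grows with t."""
    return temporal_iou(window, truth) - temporal_iou(prev_window, truth) - beta * t
```

Because the penalty grows with the step index, two trajectories that reach the same window earn different totals, and the shorter one is preferred.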


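We use the common A3C loss functions for the value and policy losses. For training the value function, we make the value loss the mean squared loss between the discounted reward sum and the estimated value:

$$Loss_{value} = \gamma_1 \sum_t \left(R_t - v(s_t|\theta_v)\right)^2 \qquad (3)$$

where we set $\gamma_1$ to .5 and $R_t$ is the accumulated reward. For training the policy function, we use the policy gradient loss:

$$Loss_{policy} = -\sum_t \left[\log \pi(a_t|s_t, \theta_\pi)\, GAE(s_t) + \gamma_0 H(\pi(a_t|s_t, \theta_\pi))\right] \qquad (4)$$

where GAE is the generalized advantage estimation function, $H$ is the calculation of entropy and $\gamma_0$ is set to .5. Therefore the total loss for our policy learning module is:

$$Loss = Loss_{policy} + Loss_{value} \qquad (5)$$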
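The two losses can be sketched for a single finished episode as below. This is a generic actor-critic computation written under assumptions, not the authors' code: the discount factor and the GAE parameter `lam` are hypothetical values that the text does not give, the squared errors are summed rather than averaged, and only the two .5 weights come from the text.

```python
def a3c_losses(rewards, values, log_probs, entropies,
               discount=0.99, lam=1.0, gamma_0=0.5, gamma_1=0.5):
    """Return (value_loss, policy_loss) for one episode.

    values has one more entry than rewards: the last entry is the bootstrap
    value of the final state (0.0 after TERMINATE).
    """
    value_loss, policy_loss = 0.0, 0.0
    ret, gae = values[-1], 0.0
    for t in reversed(range(len(rewards))):
        # Discounted reward sum from step t onward.
        ret = rewards[t] + discount * ret
        value_loss += gamma_1 * (ret - values[t]) ** 2
        # Generalized advantage estimate built from one-step TD errors.
        delta = rewards[t] + discount * values[t + 1] - values[t]
        gae = delta + discount * lam * gae
        # Policy gradient term plus an entropy bonus that encourages exploration.
        policy_loss += -log_probs[t] * gae - gamma_0 * entropies[t]
    return value_loss, policy_loss
```

The total loss would then be the sum of the two returned values. In a real training loop the inputs would be tensors from an autodiff framework so that gradients can flow to the actor and critic parameters.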

Figure 3: This figure shows the two models we explore, TripNet-GA and TripNet-Concat, where gated-attention over text features and simple concatenation are explored, respectively.

4 Evaluation

In this section we describe the methods of evaluation and discuss the results with TripNet.

4.1 Datasets

We evaluate the TripNet architecture over three video datasets: Charades-STA [13], ActivityNet Captions [22] and TACoS [31]. Charades-STA was created specifically for the moment retrieval task, and the other datasets were created for the video captioning task but are commonly used to evaluate the moment retrieval task. Note that we chose not to include the DiDeMo [18] dataset because in previous work, the evaluation is based on splitting the video into 21 pre-defined segments, instead of specific start and end times. This would mean changing the set of actions for our agent, and we wanted the set of actions to be consistent across datasets. We do, however, compare against the method from [18] on the other datasets. All the datasets that we use contain untrimmed videos and natural language descriptions of specific moments in the videos. These language descriptions are annotated with the start and end time of the corresponding clip.

Charades-STA [13]. This dataset takes the original Charades dataset, which contains video annotations of activities and video descriptions, and transforms these annotations to temporal sentence annotations which have a start and end time. This dataset was made for the task of temporal activity localization based on sentence descriptions. There are 13898 video-to-sentence pairs in the dataset. For evaluation, we use the dataset’s predefined test and train splits. The videos are 31 seconds long on average and the described temporally annotated clips are 8 seconds long on average.

ActivityNet Captions [22]. In order to test the robustness of our system with longer video lengths, we use ActivityNet Captions, which contains 100K temporal description annotations over 20k videos that are on average 2.5 minutes long. This dataset was originally created for video captioning but is easily adaptable to our task and showcases the efficient performance of our architecture on longer videos. The videos are 2 minutes long on average and the described temporally annotated clips are 36 seconds long on average.

TACoS [31]. This dataset was built on top of the MPII Composite dataset and has 127 videos. It contains both activity labels and natural language descriptions, both with temporal annotations. Following previous work, for evaluation we randomly split the dataset into 50% for training, 25% for validation and 25% for testing. We also choose this dataset because of its long videos, which are 4.5 minutes long on average, while the temporally annotated clips are 5 seconds long on average.

4.2 Implementation details.

During training, we take a video and a single query sentence that has a ground truth temporal alignment in the clip. At time $t = 0$ we set the bounding window $[W_{start}, W_{end}]$ to $[0, X]$, where $X$ is the average length of ground truth clips in the dataset. This is the initial clip in the sequential decision process, which also means that the first actions selected will most likely skip forward in the video. The input to the system is sequential video frames and a sentence query. The sentence is first encoded through a Gated Recurrent Unit of size 256 and then through a fully-connected linear layer of size 512 with sigmoid activation. We run the video frames within the bounding window through a 3D-CNN [40] which is pre-trained on the Sports-1M dataset and extract the 5th convolution layer. The A3C reinforcement learning method is then used for the policy learning module and is trained with stochastic gradient descent (SGD). The first fully-connected (FC) layer of the policy module is followed by a long short-term memory (LSTM) layer. During training, we set A3C to run 8 parallel threads.

4.3 Experiments

MCN [18]          21.37   9.58    -
CTRL [13]         28.70   14.00   -
TGN [7]           45.51   28.47   -
ABLR [46]         55.67   36.79   -
ACRN [24]         31.29   16.17   -
MLVI [43]         45.30   27.70   13.60
TripNet-Concat    36.75   25.64   10.25
TripNet-GA        48.42   32.19   13.93
Table 1: The accuracy of each method on ActivityNet measured by IoU at different alpha values.

Evaluation Metric.

We use Intersection over Union (IoU) at different alpha thresholds to measure the difference between the ground truth clip and the clip that TripNet temporally aligns to a sentence. If a predicted window has an IoU with the ground truth window above the set alpha threshold, we classify the window as correct; otherwise we classify it as incorrect. See Equation 2 for how the IoU is calculated. Most of the previous work in this space has used R@k-IoU, which is an IoU score for the top k returned clips. These previous works have used a sliding window alignment approach, which allows their alignment systems to return the top k candidate windows based on confidence scores. Instead, our architecture, TripNet, searches the video until it finds the best aligned clip and returns only that. As a result, all of our IoU scores are only measured at R@1.
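The R@1 metric can be sketched as follows, with one predicted window per query. The function name and the `(start, end)` tuple layout are assumptions, and whether an IoU exactly equal to the threshold counts as correct is a convention chosen here, not something the text specifies.

```python
def accuracy_at_iou(predictions, ground_truths, alpha):
    """Percentage of queries whose single returned clip reaches IoU >= alpha."""
    def iou(a, b):
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = (a[1] - a[0]) + (b[1] - b[0]) - inter
        return inter / union if union > 0 else 0.0

    hits = sum(iou(p, g) >= alpha for p, g in zip(predictions, ground_truths))
    return 100.0 * hits / len(predictions)
```

Evaluating the same predictions at several thresholds gives one row of an accuracy table, with the score falling as the threshold becomes stricter.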

4.3.1 Comparison and Baseline Methods

We compare against other methods both from prior work and a baseline version of the TripNet architecture. The prior work we compare against is as follows: MCN [18], CTRL [13], TGN [7], ABLR [46], ACRN [24], MLVI [43], and VAL [38]. All of these prior works tackle the task by learning to jointly represent the ground truth moment clip and the moment description query and then, during testing, going across the whole video to generate candidate windows either by sliding window or otherwise. Then they find the candidate that best corresponds to the moment description query encoding. These methods rely on seeing all frames of the video at least once, if not more, during test time.

So far, previous work has focused only on improving the method of encoding the visual and linguistic features, such that the joint representation is as accurate as possible. Instead, our work focuses on how to find the clip most efficiently while maintaining accuracy. Therefore, we compare the methods in terms of both accuracy and efficiency. We explore efficiency by looking at the average number of frames seen in relation to the size of the video as well as the average time it takes to find a clip during test time. In addition to the prior works, we also run TripNet without the gated attention mechanism, shown as TripNet-Concat. For TripNet-Concat, we do self-attention over the mean-pooled C3D features of the video frames and concatenate this with the Skip-Thought encoding of the sentence query to produce the state representation. Testing both these methods allows us to explore the performance of our state processing module separately from our policy learning module.

MCN [18]          1.64    1.25    -
CTRL [13]         18.32   13.3    -
TGN [7]           21.77   18.9    -
ACRN [24]         19.52   14.62   -
VAL [38]          19.76   14.74   -
TripNet-Concat    18.24   14.16   6.47
TripNet-GA        23.95   19.17   9.52
Table 2: The accuracy of each method on TACoS measured by IoU at different alpha values.

We can see that, in terms of accuracy, TripNet outperforms all other methods on the TACoS dataset and that it performs comparably to the state-of-the-art methods on ActivityNet Captions and Charades-STA. Using a state processing module for TripNet that does not use attention (TripNet-Concat) performs worse than the state processing module that uses the gated attention architecture (TripNet-GA). This is unsurprising, as previous work as well as other tasks that require multi-modal fusion between vision and language show improvement with different attention mechanisms. This happens because attention gives more weight to the visual features that correspond with the language query, which improves accuracy. Our results also show that a temporal localization method does not necessarily need to see the entire clip to achieve success.

Looking at the scores between the datasets we do see a drop in performance of TripNet on TACoS but we don’t believe this is due to the length of the videos but instead due to the nature of the content in the video. TACoS contains long cooking videos all taken in the same kitchen making the dataset more difficult. In comparison, ActivityNet videos are up to four times longer than Charades-STA videos and TripNet is still able to achieve good accuracy meaning that this method is able to scale to long or short videos.

In the analysis of our results we found that one source of IoU inaccuracy came from the size of the bounding window. We use a fixed-size bounding window, and the agent moves the bounding window around the video until it returns a predicted clip. Our fixed size comes from the average length of the ground truth annotated clips. Since this is just the mean length, there are ground truth clips that are longer or shorter. A possible direction for future work would be to add actions to expand or contract the size of the bounding window.

CTRL [13]         -       23.63   8.89
VAL [38]          -       23.12   9.16
MLVI [43]         54.70   35.60   15.80
TripNet-Concat    41.84   27.23   12.62
TripNet-GA        51.33   36.61   14.50
Table 3: The accuracy of each method on Charades-STA measured by IoU at different alpha values.

4.3.2 Efficiency

We now briefly describe how each of the methods that we compare against generates its candidate windows. We do this in order to get a better understanding of the computational cost of each approach and how it compares to using a reinforcement learning approach.

Dataset                % of frames seen   Avg. num. of actions   Bounding window size (seconds)
ActivityNet-Captions   41.65              5.56                   35.7
Charades-STA           33.11              4.16                   8.3
TACoS                  32.7               9.84                   8.0
Table 4: The efficiency of the TripNet architecture on different datasets. ‘% of frames seen’ is the share of the total video's frames that were seen. Note that this does not reflect when the same frame is seen more than once. Number of actions includes the TERMINATE action.

Figure 4: Qualitative performance of TripNet-GA: We show two examples where the TripNet agent skips through the video looking at different candidate windows before terminating the search. Both these videos are from the Charades-STA dataset.

MCN [18]: segments a video into 6 clips and creates 21 candidate windows from possible combinations of the 6 clips.
CTRL [13]: uses a sliding window to generate candidate windows, with window sizes of 128 and 256 frames and an overlap of 0.8 between adjacent windows.
VAL [38]: uses a sliding window to generate candidate windows, with window lengths of 32, 64, 128, 256, and 512 frames and an overlap of 0.8.
ACRN [24]: uses a sliding window to generate candidate windows, with window lengths of 64, 128, 256, and 512 frames and an overlap of 0.8.
TGN [7]: does not use a sliding window but generates candidate windows across the video at different scales.
ABLR [46]: does not use a sliding window but instead encodes the entire video and then uses attention via the language query to localize the clip. The method still requires two passes over the entire video.
MLVI [43]: trains a separate network to generate candidate window proposals.
TripNet (ours): to our knowledge, TripNet is the only method that does not need to watch the entire video to temporally localize a described moment. Instead, our trained agent intelligently moves a candidate window around the video until it localizes the described clip. Different measurements of efficiency are reported in Table 4. A comparison of run times between the methods is included in the supplementary materials.
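To give a rough sense of how many candidate windows the proposal schemes above produce, the sketch below enumerates them under simple assumptions. It treats the MCN candidates as runs of one or more consecutive clips, and it builds sliding windows with a stride of (1 - overlap) times the window size, rounded to whole frames, discarding any window that would run past the end of the video. The function names, the rounding, and the tail handling are assumptions of this sketch; the original implementations may differ. The 792-frame video length is borrowed from the first example discussed with Figure 4.

```python
def contiguous_segment_windows(num_segments):
    """All windows made of one or more consecutive segments, as (first, end)."""
    return [(i, j)
            for i in range(num_segments)
            for j in range(i + 1, num_segments + 1)]


def sliding_windows(num_frames, window_sizes, overlap):
    """Enumerate (start, end) frame windows for each size.

    Adjacent windows of the same size share `overlap` of their length.
    """
    windows = []
    for size in window_sizes:
        stride = max(1, round(size * (1.0 - overlap)))
        start = 0
        while start + size <= num_frames:
            windows.append((start, start + size))
            start += stride
    return windows


# 6 clips give 6 + 5 + 4 + 3 + 2 + 1 = 21 contiguous windows.
print(len(contiguous_segment_windows(6)))

# Window sizes and overlap as listed for CTRL, on a 792-frame video.
ctrl_like = sliding_windows(792, window_sizes=(128, 256), overlap=0.8)
print(len(ctrl_like))
```

With an overlap of 0.8, each frame away from the video boundaries falls inside about five windows per scale, so these schemes process every frame, often several times, whereas the agent described above only visits the windows along its search path.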

To better understand efficiency, we run CTRL [13], the method that introduced the Charades-STA dataset, as well as TripNet-GA over the Charades-STA test set and measure the average time each takes to localize a moment in a video. The results are shown in Table 5. For reference, the videos in the Charades-STA dataset are 31 seconds long on average.

Method | Seconds
CTRL [13] | 0.044186
TripNet-GA | 0.005125
Table 5: The average time it takes to localize a moment on Charades-STA.

4.3.3 Qualitative Results

In Figure 4, we show qualitative results of TripNet-GA on the Charades-STA dataset. The figure shows the sequence of actions the agent takes to temporally localize the described moment in each video. The green boxes represent the bounding window of the state at time t, and the yellow box represents the ground-truth bounding window. The first video is 33 seconds long (792 frames) and the second is 20 seconds long (480 frames). In both examples, the agent skips both backwards and forwards. In the first example, TripNet sees 408 frames of the video, which is 51% of the frames. In the second example, TripNet sees 384 frames, which is 80% of the frames. Intuitively, the shorter the video is relative to the size of the bounding window, the larger the percentage of frames that will be seen.
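The percentages above can be checked with a short sketch that computes the share of distinct frames covered by the windows an agent visits. It assumes that a frame seen in several windows is counted once, which is how we read the note in Table 4. The function name, the 192-frame window, and the trajectory are hypothetical and are not the actual path taken by the agent in Figure 4.

```python
def fraction_of_frames_seen(windows, num_frames):
    """Share of distinct frames covered by a list of (start, end) windows."""
    seen = set()
    for start, end in windows:
        # Clip each window to the video and record its frame indices.
        seen.update(range(max(0, start), min(end, num_frames)))
    return len(seen) / num_frames


# Reported counts from the two examples in the text.
print(round(408 / 792, 3))  # first example, reported as 51%
print(round(384 / 480, 3))  # second example, reported as 80%

# Hypothetical trajectory: three overlapping 192-frame windows, 480 frames.
trajectory = [(0, 192), (192, 384), (96, 288)]
print(fraction_of_frames_seen(trajectory, num_frames=480))
```

In this made-up trajectory the third window revisits frames already covered, so three window evaluations add up to 384 distinct frames rather than 576.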

5 Conclusion

Localizing moments in long, untrimmed videos using natural language queries is a useful and challenging task for fine-grained video retrieval. While existing methods have obtained encouraging results, prior work has not addressed the challenge of efficient localization and search, and prior methods generally analyze 100% of the video frames. In this paper, we have introduced a system that uses a gated-attention mechanism over cross-modal features to localize, with high accuracy, a moment in time given a natural language text query. Furthermore, we extend this model with a policy network, resulting in an efficient system that on average looks at less than 50% of the video frames to make a prediction. We provide quantitative and qualitative evaluations and set new state-of-the-art accuracies on multiple benchmark datasets.


  • [1] P. Anderson, Q. Wu, D. Teney, J. Bruce, M. Johnson, N. Sünderhauf, I. Reid, S. Gould, and A. van den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3674–3683, 2018.
  • [2] Y. Bian, C. Gan, X. Liu, F. Li, X. Long, Y. Li, H. Qi, J. Zhou, S. Wen, and Y. Lin. Revisiting the effectiveness of off-the-shelf temporal modeling approaches for large-scale video classification. arXiv preprint arXiv:1708.03805, 2017.
  • [3] P. Bojanowski, R. Lajugie, E. Grave, F. Bach, I. Laptev, J. Ponce, and C. Schmid. Weakly-supervised alignment of video with text. In Proceedings of the IEEE international conference on computer vision, pages 4462–4470, 2015.
  • [4] S. Buch, V. Escorcia, C. Shen, B. Ghanem, and J. Carlos Niebles. Sst: Single-stream temporal action proposals. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 2911–2920, 2017.
  • [5] Y.-W. Chao, S. Vijayanarasimhan, B. Seybold, D. A. Ross, J. Deng, and R. Sukthankar. Rethinking the faster r-cnn architecture for temporal action localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1130–1139, 2018.
  • [6] D. S. Chaplot, K. M. Sathyendra, R. K. Pasumarthi, D. Rajagopal, and R. Salakhutdinov. Gated-attention architectures for task-oriented language grounding. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • [7] J. Chen, X. Chen, L. Ma, Z. Jie, and T.-S. Chua. Temporally grounding natural sentence in video. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 162–171, 2018.
  • [8] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
  • [9] X. Dai, B. Singh, G. Zhang, L. S. Davis, and Y. Qiu Chen. Temporal context network for activity localization in videos. In Proceedings of the IEEE International Conference on Computer Vision, pages 5793–5802, 2017.
  • [10] A. Das, S. Datta, G. Gkioxari, S. Lee, D. Parikh, and D. Batra. Embodied question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 5, page 6, 2018.
  • [11] B. Dhingra, H. Liu, Z. Yang, W. W. Cohen, and R. Salakhutdinov. Gated-attention readers for text comprehension. arXiv preprint arXiv:1606.01549, 2016.
  • [12] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2625–2634, 2015.
  • [13] J. Gao, C. Sun, Z. Yang, and R. Nevatia. Tall: Temporal activity localization via language query. arXiv preprint arXiv:1705.02101, 2017.
  • [14] J. Gao, Z. Yang, and R. Nevatia. Cascaded boundary regression for temporal action detection. arXiv preprint arXiv:1705.01180, 2017.
  • [15] D. Gordon, A. Kembhavi, M. Rastegari, J. Redmon, D. Fox, and A. Farhadi. Iqa: Visual question answering in interactive environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4089–4098, 2018.
  • [16] M. Hahn, N. Ruiz, J.-B. Alayrac, I. Laptev, and J. M. Rehg. Learning to localize and align fine-grained actions to sparse instructions. arXiv preprint arXiv:1809.08381, 2018.
  • [17] F. C. Heilbron, W. Barrios, V. Escorcia, and B. Ghanem. Scc: Semantic context cascade for efficient action detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3175–3184. IEEE, 2017.
  • [18] L. A. Hendricks, O. Wang, E. Shechtman, J. Sivic, T. Darrell, and B. Russell. Localizing moments in video with natural language. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 5803–5812, 2017.
  • [19] R. Hu, H. Xu, M. Rohrbach, J. Feng, K. Saenko, and T. Darrell. Natural language object retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4555–4564, 2016.
  • [20] J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. Lawrence Zitnick, and R. Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2901–2910, 2017.
  • [21] R. Kiros, Y. Zhu, R. R. Salakhutdinov, R. Zemel, R. Urtasun, A. Torralba, and S. Fidler. Skip-thought vectors. In Advances in neural information processing systems, pages 3294–3302, 2015.
  • [22] R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. C. Niebles. Dense-captioning events in videos. In Proceedings of the IEEE International Conference on Computer Vision, 2017.
  • [23] J. Liu, J. Luo, and M. Shah. Recognizing realistic actions from videos “in the wild”. In Computer vision and pattern recognition, 2009. CVPR 2009. IEEE conference on, pages 1996–2003. IEEE, 2009.
  • [24] M. Liu, X. Wang, L. Nie, X. He, B. Chen, and T.-S. Chua. Attentive moment retrieval in videos. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 15–24. ACM, 2018.
  • [25] C.-Y. Ma, A. Kadav, I. Melvin, Z. Kira, G. AlRegib, and H. Peter Graf. Attend and interact: Higher-order object interactions for video understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6790–6800, 2018.
  • [26] S. Ma, L. Sigal, and S. Sclaroff. Learning activity progression in lstms for activity detection and early detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1942–1950, 2016.
  • [27] J. Mao, J. Huang, A. Toshev, O. Camburu, A. L. Yuille, and K. Murphy. Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 11–20, 2016.
  • [28] A. Miech, I. Laptev, and J. Sivic. Learnable pooling with context gating for video classification. arXiv preprint arXiv:1706.06905, 2017.
  • [29] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pages 1928–1937, 2016.
  • [30] J. Pennington, R. Socher, and C. Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543, 2014.
  • [31] A. Rohrbach, M. Rohrbach, W. Qiu, A. Friedrich, M. Pinkal, and B. Schiele. Coherent multi-sentence video description with variable level of detail. In German conference on pattern recognition, pages 184–195. Springer, 2014.
  • [32] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.
  • [33] O. Sener, A. R. Zamir, S. Savarese, and A. Saxena. Unsupervised semantic parsing of video collections. In Proceedings of the IEEE International Conference on Computer Vision, pages 4480–4488, 2015.
  • [34] Z. Shou, J. Chan, A. Zareian, K. Miyazawa, and S.-F. Chang. Cdc: Convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5734–5743, 2017.
  • [35] Z. Shou, D. Wang, and S.-F. Chang. Temporal action localization in untrimmed videos via multi-stage cnns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1049–1058, 2016.
  • [36] G. A. Sigurdsson, S. Divvala, A. Farhadi, and A. Gupta. Asynchronous temporal fields for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • [37] B. Singh, T. K. Marks, M. Jones, O. Tuzel, and M. Shao. A multi-stream bi-directional recurrent neural network for fine-grained action detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1961–1970, 2016.
  • [38] X. Song and Y. Han. Val: Visual-attention action localizer. In Pacific Rim Conference on Multimedia, pages 340–350. Springer, 2018.
  • [39] S. Tellex and D. Roy. Towards surveillance video search by natural language query. In Proceedings of the ACM International Conference on Image and Video Retrieval, page 38. ACM, 2009.
  • [40] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. C3d: generic features for video analysis. CoRR, abs/1412.0767, 2:7, 2014.
  • [41] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 4489–4497. IEEE, 2015.
  • [42] H. Xu, A. Das, and K. Saenko. R-c3d: Region convolutional 3d network for temporal activity detection. 2017.
  • [43] H. Xu, K. He, L. Sigal, S. Sclaroff, and K. Saenko. Text-to-clip video retrieval with early fusion and re-captioning. arXiv preprint arXiv:1804.05113, 2018.
  • [44] S. Yeung, O. Russakovsky, G. Mori, and L. Fei-Fei. End-to-end learning of action detection from frame glimpses in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2678–2687, 2016.
  • [45] J. Yuan, B. Ni, X. Yang, and A. A. Kassim. Temporal action localization with pyramid of score distribution features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3093–3102, 2016.
  • [46] Y. Yuan, T. Mei, and W. Zhu. To find where you talk: Temporal sentence localization in video with attention based location regression. arXiv preprint arXiv:1804.07014, 2018.