Hopper: Multi-hop Transformer for Spatiotemporal Reasoning

03/19/2021 ∙ by Honglu Zhou, et al. ∙ Rutgers University

This paper considers the problem of spatiotemporal object-centric reasoning in videos. Central to our approach is the notion of object permanence, i.e., the ability to reason about the location of objects as they move through a video while being occluded, contained, or carried by other objects. Existing deep learning based approaches often suffer from spatiotemporal biases when applied to video reasoning problems. We propose Hopper, which uses a Multi-hop Transformer for reasoning about object permanence in videos. Given a video and a localization query, Hopper reasons over image and object tracks to automatically hop over critical frames in an iterative fashion and predict the final position of the object of interest. We demonstrate the effectiveness of using a contrastive loss to reduce spatiotemporal biases. We evaluate on the CATER dataset and find that Hopper achieves 73.2% Top-1 accuracy by hopping through just a few critical frames. We also demonstrate that Hopper can perform long-term reasoning by building CATER-h, a dataset that requires multi-step reasoning to correctly localize objects of interest.


1 Introduction

Figure 1: Snitch Localization in CATER (cater) is an object permanence task where the goal is to classify the final location of the snitch object within a 2D grid space.

In this paper, we address the problem of spatiotemporal object-centric reasoning in videos. Specifically, we focus on the problem of object permanence, the ability to represent the existence and the trajectory of hidden moving objects (baillargeon1986representing). Object permanence can be essential for understanding videos in domains such as: (1) sports like soccer, where one needs to reason about “which player initiated the pass that resulted in a goal?”; (2) activities like shopping, where one needs to infer “what items should the shopper be billed for?”; and (3) driving, where one must infer “is there a car next to me in the right lane?”. Answering these questions requires detecting and understanding the motion of objects in the scene, including the temporal order of one or more actions performed by those objects. It also requires object permanence: the ability to predict the location of non-visible objects as they are occluded, contained, or carried by other objects (opnet). Hence, solving this task requires compositional, multi-step spatiotemporal reasoning, which has been difficult to achieve with existing deep learning models (bottou2014machine; lake2017building).

Existing models have been found lacking when applied to video reasoning and object permanence tasks (cater). Despite rapid progress on video understanding benchmarks such as large-scale action recognition, deep learning based models often suffer from spatial and temporal biases and are easily fooled by spurious statistical patterns and undesirable dataset biases (johnson2017inferring). For example, researchers have found that models can recognize the action “swimming” even when the actor is masked out, because the models rely on the swimming pool (a scene bias) instead of the dynamics of the actor (whycannotdance).

Hence, we propose Hopper for debiased video reasoning. Hopper uses multi-hop reasoning over videos to reason about object permanence. Humans achieve object permanence by identifying key frames where objects become hidden (bremner2015perception) and by reasoning to predict the motion and final location of objects in the video. Given a video and a localization query, Hopper uses a Multi-hop Transformer (MHT) over image and object tracks to automatically identify and hop over critical frames in an iterative fashion and predict the final position of the object of interest. Additionally, Hopper uses a contrastive debiasing loss that enforces consistency between attended objects and correct predictions, which improves model robustness and generalization. We also build a new dataset, CATER-h, that reduces the temporal bias of CATER and requires long-term reasoning.

We demonstrate the effectiveness of Hopper on the recently proposed CATER ‘Snitch Localization’ task (cater) (Figure 1). Hopper achieves 73.2% Top-1 accuracy on this task while processing only a few frames per video. More importantly, Hopper identifies the critical frames where objects become invisible or reappear, providing an interpretable summary of the reasoning performed by the model. To summarize, the contributions of our paper are as follows: First, we introduce Hopper, a framework for multi-step compositional reasoning in videos that achieves state-of-the-art accuracy on the CATER object permanence task. Second, we describe how to perform interpretable reasoning in videos by iteratively reasoning over critical frames. Third, we perform extensive studies to understand the effectiveness of the multi-step reasoning and debiasing methods used by Hopper. Based on our results, we also propose a new dataset, CATER-h, that requires longer reasoning hops and demonstrates the gaps in existing deep learning models.

2 Related Work

Video understanding. Video understanding has matured quickly in recent years (hara2018can); approaches have migrated from 2D or 3D ConvNets (ji20123d) to two-stream networks (simonyan2014two), inflated designs (carreira2017quo), models with additional emphasis on capturing temporal structure (trn), and, recently, models that better capture spatiotemporal interactions (nonlocal; girdhar2019video). Despite this progress, these models often suffer from undesirable dataset biases and are easily confused by background objects in new environments as well as by varying temporal scales (whycannotdance). Furthermore, they are unable to capture reasoning-based constructs such as causal relationships (fire2017inferring) or long-term video understanding (cater).

Visual and video reasoning. Visual and video reasoning have been well studied recently, but existing research has largely focused on the task of question answering (clevr; macnetwork; neuralstatemachine; CLEVRER). CATER, a recently proposed diagnostic video recognition dataset, focuses on spatial and temporal reasoning as well as localizing a particular object of interest. There has also been significant research in object tracking, often with an emphasis on occlusions and with the goal of providing object permanence (deepsort; SiamMask). Traditional object tracking approaches often require expensive supervision of object locations in every frame. In contrast, we address object permanence and video recognition on CATER with a model that performs tracking-integrated object-centric reasoning without this strong supervision.

Multi-hop reasoning. Reasoning systems vary in expressive power and predictive abilities, and include symbolic reasoning, probabilistic reasoning, causal reasoning, etc. (bottou2014machine). Among them, multi-hop reasoning is the ability to reason with information collected from multiple passages to derive the answer (wang2019multi); it produces a discrete intermediate output of the reasoning process, which can help gauge a model’s behavior beyond just the final task accuracy (chen2019multi). Several multi-hop datasets and models have been proposed for the reading comprehension task (welbl2018constructing; yang2018hotpotqa; dua2019drop; Dhingra2020Differentiable). We extend multi-hop reasoning to the video domain by developing a dataset that explicitly requires aggregating clues from different spatiotemporal parts of a video, as well as a multi-hop model that automatically extracts a step-by-step reasoning chain, which improves interpretability and imitates a natural way of thinking. We provide an extended discussion of related work in Appendix I.

Figure 2: An overview of the Hopper framework. Hopper first obtains frame representations from the input video. Object representations and object tracks are then computed to enable tracking-integrated object-centric reasoning for the Multi-hop Transformer (details in Section 4).

3 Hopper

Hopper (Figure 2) is a framework inspired by the observation that humans think in terms of entities and relations. Unlike traditional deep visual networks that process pixels from which they learn and extract features, an object-centric learning-based architecture explicitly separates information about entities from low-level information through grouping and abstraction (slotattention). Hopper obtains representations of object entities from the low-level pixel information of every frame (Section 3.2). Additionally, to maintain object permanence, humans are able to identify key moments when objects disappear and reappear. To imitate that, Hopper computes object tracks with the goal of obtaining more consistent object representations (Section 3.3) and then achieves multi-step compositional long-term reasoning with the Multi-hop Transformer to pinpoint these critical moments. Furthermore, Hopper combines both fine-grained (object) and coarse-grained (image) information to form a contextual understanding of a video. As shown in Figure 2, Hopper contains the components described below.

3.1 Backbone

Starting from the initial RGB-based video representation of shape T × 3 × H × W, where T represents the number of frames of the video, 3 corresponds to the color channels, and H and W denote the original resolution height and width, a conventional CNN backbone extracts a spatial feature map and, for every frame, a compact image representation. The backbone we use is the ResNeXt-101 model from sinet. A 1×1 convolution (detr) then reduces the channel dimension of the feature map to a smaller dimension d, and a linear layer is used to project the compact image representation to dimension d as well.
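To make the tensor shapes concrete, the following PyTorch sketch illustrates this step; the frame count, input resolution, backbone width, and reduced dimension d are placeholder values for illustration, not the settings used in the paper, and the backbone module is a stand-in rather than the actual ResNeXt-101.

```python
import torch
import torch.nn as nn

# Minimal sketch of the backbone step; all sizes below are illustrative assumptions.
T, H, W = 13, 224, 224        # hypothetical number of frames and input resolution
C_backbone, d = 2048, 256     # hypothetical backbone channels and reduced dimension

backbone = nn.Sequential(     # stand-in for the CNN that yields a per-frame feature map
    nn.Conv2d(3, C_backbone, kernel_size=7, stride=32),
    nn.AdaptiveAvgPool2d((7, 7)),
)
reduce_1x1 = nn.Conv2d(C_backbone, d, kernel_size=1)  # 1x1 conv reduces the channel dimension
frame_proj = nn.Linear(C_backbone, d)                 # linear layer for the compact frame vector

video = torch.randn(T, 3, H, W)                 # frames folded into the batch dimension
fmap = backbone(video)                          # (T, C_backbone, 7, 7) spatial feature map
fmap_d = reduce_1x1(fmap)                       # (T, d, 7, 7), later fed to the DETR encoder
frame_repr = frame_proj(fmap.mean(dim=(2, 3)))  # (T, d) compact per-frame representation
print(fmap_d.shape, frame_repr.shape)
```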

3.2 Object Detection and Representation

We collapse the spatial dimensions of the feature map into a single dimension and fold the temporal dimension into the batch dimension. Positional encodings are learned for each time step and each spatial location, and are added to the feature map element-wise. The positional-encoding-augmented feature map is the source input to the transformer encoder (transformer) of DETR (detr). DETR is a recently proposed transformer-based object detector for images; it additionally feeds embeddings of N object queries for every image to the transformer decoder (assuming every image has at most N objects; the none class ∅ is predicted when an image contains fewer than N objects). We likewise fold the temporal dimension into the batch dimension for the object queries. Outputs from DETR are transformed object representations that are passed to a multilayer perceptron (MLP) to predict the bounding box and class label of every object. For Snitch Localization, DETR is trained on object annotations from LA-CATER (opnet).
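As a rough sketch of how per-object class labels and boxes are read out from DETR-style decoder outputs (following the general DETR recipe rather than the exact configuration used here; the hidden dimension, number of object queries, and number of classes are placeholders):

```python
import torch
import torch.nn as nn

d, N, num_classes = 256, 10, 20   # hypothetical hidden dim, queries per frame, classes (incl. the none class)

class DetectionHeads(nn.Module):
    """Class and box heads applied on top of decoder object representations."""
    def __init__(self) -> None:
        super().__init__()
        self.class_head = nn.Linear(d, num_classes)   # single linear layer for the class label
        self.box_head = nn.Sequential(                # small MLP for the bounding box
            nn.Linear(d, d), nn.ReLU(),
            nn.Linear(d, d), nn.ReLU(),
            nn.Linear(d, 4), nn.Sigmoid(),            # (x1, y1, x2, y2) relative to the image size
        )

    def forward(self, obj_repr: torch.Tensor):
        # obj_repr: (T, N, d), one set of N object representations per frame
        return self.class_head(obj_repr), self.box_head(obj_repr)

heads = DetectionHeads()
obj_repr = torch.randn(13, N, d)          # stand-in for the decoder outputs over 13 frames
class_logits, boxes = heads(obj_repr)
print(class_logits.shape, boxes.shape)    # (13, N, 20) and (13, N, 4)
```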

3.3 Tracking

Tracking produces consistent object representations, as it links the representations of each object through time. We perform tracking using the unordered object representations, bounding boxes, and labels as inputs, and apply a Hungarian-based algorithm to match objects between every two consecutive frames. We describe the details as follows.

Tracking is essentially an association problem (bewley2016simple). An association between objects from consecutive frames can be defined by object class agreement and by the difference between the two bounding boxes. Let O = {O^t}_{t=1..T} denote the predicted lists of objects at all frames of a video, where O^t denotes the predicted set of N objects at frame t. Each object i at frame t is represented as a tuple (c_i^t, b_i^t, p_i^t, z_i^t), where c_i^t denotes the class label with the maximum predicted likelihood for object i at frame t, b_i^t is a vector of the bounding box top-left and bottom-right coordinates relative to the image size, p_i^t(c) denotes the predicted likelihood of class c (c ranges over the object classes, e.g., large metal green cube, small metal green cube, ..., and the none class ∅), and z_i^t denotes the representation vector of this object at frame t.

In order to obtain the optimal bipartite matching between the sets of predicted objects at frame t and frame t+1, we search for a permutation σ of N elements with the lowest total matching cost:

    σ̂_t = argmin_{σ ∈ S_N} Σ_{i=1}^{N} C_track(o_i^t, o_{σ(i)}^{t+1}),    (1)

where C_track(o_i^t, o_{σ(i)}^{t+1}) is the pair-wise track matching cost between predicted object o_i^t (i.e., object i at frame t) and the predicted object at frame t+1 with index σ(i) from the permutation σ. Following detr, the optimal assignment is computed efficiently with the Hungarian algorithm. The track matching cost at time t for object i is defined as

    C_track(o_i^t, o_{σ(i)}^{t+1}) = −1[c_i^t ≠ ∅] · p_{σ(i)}^{t+1}(c_i^t) + 1[c_i^t ≠ ∅] · L_box(b_i^t, b_{σ(i)}^{t+1}),    (2)

where 1[·] denotes an indicator function: the term following it only takes effect when the condition inside the brackets is true, and otherwise the term is 0. L_box is defined as a linear combination of the ℓ1 loss and the generalized IoU loss (GIoU), with a scalar weight for each term. In other words, when the predicted class label of object i at frame t is not ∅, we aim to maximize the likelihood of that class label for the matched object at frame t+1 and to minimize the bounding box difference between the two. The total track matching cost of a video is the aggregation of C_track over all objects i = 1, ..., N and frames t = 1, ..., T−1.

This Hungarian-based tracking algorithm is used due to its simplicity. A more sophisticated tracking solution (e.g. DeepSORT (deepsort)) could be easily integrated into Hopper, and may improve the accuracy of tracking in complex scenes.
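As a sketch, the per-frame matching can be implemented with SciPy's Hungarian solver; the cost mirrors Eq. (2) but uses only an ℓ1 box term with a placeholder weight (the GIoU term is omitted for brevity):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_frames(cls_t, prob_t1, boxes_t, boxes_t1, none_label=0, w_box=1.0):
    """Link objects of frame t to objects of frame t+1 by minimizing a pairwise cost.

    cls_t:    (N,)   predicted class labels at frame t
    prob_t1:  (N, C) class likelihoods at frame t+1
    boxes_t:  (N, 4) boxes at frame t, (x1, y1, x2, y2) relative to the image size
    boxes_t1: (N, 4) boxes at frame t+1
    Returns perm so that object i at frame t is linked to object perm[i] at frame t+1.
    """
    N = len(cls_t)
    cost = np.zeros((N, N))
    for i in range(N):
        if cls_t[i] == none_label:
            continue                       # "no object" slots contribute zero cost, as in Eq. (2)
        cost[i] -= prob_t1[:, cls_t[i]]    # keep the same class label ...
        cost[i] += w_box * np.abs(boxes_t1 - boxes_t[i]).sum(axis=1)  # ... and a similar box
    row, col = linear_sum_assignment(cost)
    perm = np.empty(N, dtype=int)
    perm[row] = col
    return perm

# toy usage with two object slots per frame (slot 1 of frame t is an empty "no object" slot)
perm = match_frames(
    cls_t=np.array([3, 0]),
    prob_t1=np.array([[0.7, 0.1, 0.1, 0.1], [0.1, 0.1, 0.1, 0.7]]),
    boxes_t=np.array([[0.1, 0.1, 0.3, 0.3], [0.0, 0.0, 0.0, 0.0]]),
    boxes_t1=np.array([[0.5, 0.5, 0.7, 0.7], [0.12, 0.1, 0.31, 0.3]]),
)
print(perm)   # -> [1 0]: the tracked object is linked to the candidate with matching class and box
```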

3.4 Video Query Representation and Recognition

The object tracks obtained from the Hungarian algorithm and a single track of image features from the backbone are further combined with learned positional time encodings to form the source input to our Multi-hop Transformer (introduced in Section 4). The Multi-hop Transformer produces the final latent representation of the video query. An MLP takes the video query representation as input and predicts the grid class for the Snitch Localization task.

4 Multi-hop Transformer

Motivated by how humans reason about an object permanence task by identifying critical moments of key objects in the video (bremner2015perception), we propose the Multi-hop Transformer (MHT). MHT reasons by hopping over frames and selectively attending to objects in the frames, until it arrives at the object that is most important for the task. MHT operates in an iterative fashion, and each iteration produces one hop of reasoning by selectively attending to objects from a collection of frames. Objects in that collection of frames form the candidate pool of that hop. Later iterations build upon the knowledge collected from previous iterations, and the size of the candidate pool decreases as the iterations run. We illustrate MHT in Figure 16 in the Appendix; the overall module is described in Algorithm 1. MHT accepts a frame track F = [f_1, ..., f_T], an object track sequence X = [x_1^1, ..., x_N^1, ..., x_1^T, ..., x_N^T], an initial target video query embedding q, the number of objects N, and the number of frames T. In Algorithm 1, h denotes the hop index and t* is the frame index that the previous hop (i.e., iteration) mostly attended to.

Overview. Multiple iterations are applied to MHT, and each iteration performs one hop of reasoning by attending to certain objects in critical frames. With a total of H hops, MHT produces a refined representation of the video query. As the complexity of videos varies, H should also vary across videos. In addition, MHT operates in an autoregressive manner to process the incoming frames. This is achieved by ‘Masking()’ (described below). The autoregressive processing in MHT allows hop h+1 to attend only to objects in frames after frame t* if hop h mostly attends to an object at frame t*. We define the most attended object of a hop as the object with the highest attention weight (averaged over all heads) in the encoder-decoder multi-head attention layer of Transformer-B (described below). The hopping ends when the most attended object is an object in the last frame.

MHT Architecture. Inside MHT, there are two encoder-decoder transformer units (transformer), which we refer to as Transformer-A and Transformer-B. This architecture is inspired by studies in cognitive science showing that reasoning consists of two stages: first, one has to establish the domain about which one reasons and its properties, and only after this initial step can the reasoning itself happen (stenning2012human). We first use Transformer-A to adapt the representations of the object entities, which form the main ingredients of the domain in the context of an object-centric video task. Then, Transformer-B is used to produce the task-oriented representation that performs the reasoning.

We separate two types of information from the input: attention candidates and helper information. This separation comes from the intuition that humans sometimes rely on additional information outside of the candidate answers. We call such additional information helper information (in our case, it can be the coarse-grained global image context or information related to the previous reasoning step). We define the candidate answers as attention candidates, which are representations of object entities (because object permanence is a task that requires reasoning about relations between objects). For each hop, we first extract attention candidates and helper information from the source sequence, then use Transformer-A to condense the most useful information by letting the attention candidates attend to each other via self-attention and to the helper information via encoder-decoder attention. After that, we use Transformer-B to learn the latent representation of the video query by attentively utilizing the information extracted by Transformer-A (via encoder-decoder attention). Thus, given the current representation of the video query, MHT decides which object to mostly attend to by reasoning about the relations between the object entities, and about how each object entity relates to the reasoning performed by the previous hop or to the global information.

Input: frame track F = [f_1, ..., f_T], object track X = [x_1^1, ..., x_N^1, ..., x_1^T, ..., x_N^T], initial video query q, number of objects N, number of frames T

Params: LayerNorm, Transformer-A, Transformer-B, Attentional Feature-based Gating

Algorithm 1 Multi-hop Transformer module.
1: h ← 1, t* ← 0
2: while the most attended object is not in the last frame do
3:      if h > 1 then
4:            helper ← Extract(X, t*)            ▷ representations of all objects in frame t*
5:      else
6:            helper ← F                         ▷ the frame track serves as global helper information
7:      end if
8:      X' ← Transformer-A(X, helper)
9:      X' ← Gating(X, X')                       ▷ attentional feature-based gating (Sigmoid feature mask)
10:     X'' ← Masking(X', t*)                    ▷ retain only objects in frames after t*
11:     q ← Transformer-B(q, X'')
12:     t* ← Softargmax(attention weights of Transformer-B)
13:     h ← h + 1
14: end while
15: q ← LayerNorm(q)

Return q
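To make the control flow of Algorithm 1 concrete, the following is a deliberately simplified, runnable PyTorch sketch of the hop loop: each transformer unit is reduced to a single multi-head attention layer, the gating step is omitted, and all sizes are placeholder assumptions, so it illustrates the iteration, masking, and soft-argmax pattern rather than the exact model.

```python
import torch
import torch.nn as nn

d, N, T, heads = 64, 4, 6, 2   # hypothetical feature dim, objects per frame, frames, heads

attn_A = nn.MultiheadAttention(d, heads, batch_first=True)   # stand-in for Transformer-A
attn_B = nn.MultiheadAttention(d, heads, batch_first=True)   # stand-in for Transformer-B
norm = nn.LayerNorm(d)

def softargmax(scores: torch.Tensor, beta: float = 100.0) -> torch.Tensor:
    """Differentiable argmax over a 1-D score vector."""
    idx = torch.arange(scores.numel(), dtype=scores.dtype)
    return (torch.softmax(beta * scores, dim=-1) * idx).sum()

def multi_hop(frames, objects, query, max_hops=10):
    # frames: (1, T, d) frame track; objects: (1, N*T, d) object track; query: (1, 1, d)
    t_star, hop = 0, 1
    while t_star < T - 1 and hop <= max_hops:
        # helper info: the frame track for hop 1, else the objects of the frame hop h-1 attended to
        helper = frames if hop == 1 else objects[:, t_star * N:(t_star + 1) * N]
        adapted, _ = attn_A(objects, helper, helper)          # adapt object entities with the helper
        mask = torch.zeros(1, N * T, dtype=torch.bool)        # autoregressive masking:
        if hop > 1:
            mask[:, :(t_star + 1) * N] = True                 # hide objects in frames up to t*
        query, weights = attn_B(query, adapted, adapted, key_padding_mask=mask)
        obj_idx = softargmax(weights[0, 0])                   # most attended object (soft index)
        t_star = int(obj_idx.round().item()) // N             # frame that object belongs to
        hop += 1
    return norm(query)                                        # refined video query representation

out = multi_hop(torch.randn(1, T, d), torch.randn(1, N * T, d), torch.randn(1, 1, d))
print(out.shape)   # (1, 1, d)
```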

Transformer-A. Transformer-A uses helper information from the previous hop to adapt the representations of the object entities used in the current reasoning step. Formally, the attention candidates are the object track sequence X (N·T tokens), whereas the helper information carries a different meaning for hop 1 than for the remaining hops. For hop 1, the helper information is the frame track F (T tokens). This is because hop 1 is necessary for all videos, with the goal of finding the first critical object (and frame) from global information; incorporating frame representations is also beneficial because it provides complementary information and can mitigate occasional errors from the object detector and tracker. For the remaining hops, the helper information is the set of representations of all objects in the frame that the previous hop mostly attended to (N tokens; Extract() in Algorithm 1). The idea is that, to select an answer from the object candidates after frame t*, the objects in frame t* can be the most important helper information. Transformer-A produces X' (N·T tokens), an updated version of X, by selectively attending to the helper information. Further, MHT conditionally integrates the helper-fused representations X' and the original representations X. This conditional integration is achieved by Attentional Feature-based Gating (Algorithm 1), whose role is to combine the newly modified representation with the original representation. This layer, added on top of Transformer-A, provides additional information because it switches the perspective to learning new representations of the object entities: it learns a feature mask (with values between 0 and 1) that selects salient dimensions of X, conditioned on the adapted representations of the object entities produced by Transformer-A. Please see margatina2019attention for details about this layer.
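As a rough illustration of the gating idea (a simplified reading of feature-based gating in the spirit of margatina2019attention, not necessarily the exact layer used here): a sigmoid feature mask computed from the adapted representations decides, per dimension, how much of the original versus the helper-fused representation to keep.

```python
import torch
import torch.nn as nn

class FeatureGate(nn.Module):
    """Simplified attentional feature-based gating: a sigmoid mask over feature dimensions,
    conditioned on the helper-fused (adapted) object representations.  This is an
    illustrative reading; the exact formulation used in the paper follows margatina2019attention."""
    def __init__(self, d: int) -> None:
        super().__init__()
        self.proj = nn.Linear(d, d)

    def forward(self, original: torch.Tensor, adapted: torch.Tensor) -> torch.Tensor:
        gate = torch.sigmoid(self.proj(adapted))           # per-dimension mask in (0, 1)
        return gate * original + (1.0 - gate) * adapted    # conditional integration of both

gating = FeatureGate(d=64)
x, x_adapted = torch.randn(1, 12, 64), torch.randn(1, 12, 64)   # 12 object tokens (hypothetical)
print(gating(x, x_adapted).shape)   # (1, 12, 64)
```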

Transformer-B. Transformer-B is then used to produce the task-oriented video query representation q. As mentioned above, MHT operates in an autoregressive manner to proceed through time. This is achieved by ‘Masking()’, which turns X' into X'' for Transformer-B by retaining only the object entities in frames after the frame t* that the previous hop mostly attended to (for hop 1, t* is 0). Masking is commonly used in NLP for the purpose of autoregressive processing (transformer); masked objects receive zero attention weight. Transformer-B learns the representation of the video query by attending to X''. Note that, unlike Transformer-A, in which message passing is performed across all connections between tokens in X, between tokens in the helper information, and especially across X and the helper information (we use the unmasked X for Transformer-A, instead of X'', because, to determine which object the model should mostly attend to in frames after t*, objects in and before frame t* might also be beneficial), message passing in Transformer-B is performed only between tokens in the video query (which has only one token for Snitch Localization), between unmasked tokens in X'', and, more importantly, across connections between the video query and unmasked tokens in X''. The indices of the most attended object and of the frame that object is in are determined from the attention weights of the previous hop with a differentiable ‘Softargmax()’ (chapelle2010gradient; honari2018improving), defined as softargmax(a) = Σ_i i · exp(β·a_i) / Σ_j exp(β·a_j), where β is an arbitrarily large constant. The attention weights a are averaged over all heads. q is updated over the hops, serving as the information exchange between hops.

Summary & discussion. The video query q, the object representations, and t* are updated in every hop. q should be seen as an encoding of a query over the entire video. Even though a single token is used for the video query in this dataset, and the self-attention in the decoder part of Transformer-B is thus reduced to a stacking of linear transformations, multiple queries could be desirable in other applications. The structural priors embedded in MHT (e.g., the iterative hopping mechanism and attention, which can be treated as a soft tree) essentially provide the composition rules that algebraically manipulate previously acquired knowledge and lead to higher forms of reasoning (bottou2014machine). Moreover, MHT could potentially correct errors made by the object detector and tracker, but poor performance of these components (especially the object detector) would also hurt MHT, because (1) inaccurate object representations will confuse MHT during learning, and (2) the heuristic-based losses for the intermediate hops (described in Section 5) will be less accurate.

5 Training

We propose the following training methods for the Snitch Localization task and present an ablation study in Appendix A. We provide the implementation details of our model in Appendix H.

Dynamic hop stride. A basic version of autoregressive MHT sets the per-hop frame stride to 1 with ‘Masking()’, as usually done in NLP: Transformer-B only takes in objects in frame t*+1 as the source input if the previous hop mostly attended to an object in frame t*. However, this could produce an unnecessarily long reasoning chain. With dynamic hop stride, we let the model automatically decide which upcoming frame to reason over by setting ‘Masking()’ so that the unmasked candidates are the objects in all frames after the frame that the previous hop mostly attended to.

Minimal hops of reasoning. We empirically set a minimal number of hops that the model has to perform for any video, to encourage multi-hop reasoning with a reasonably large number of hops (unless this is not possible, e.g., if the last visible snitch is in the second-to-last frame, the model is only required to do fewer hops). This is also achieved by ‘Masking()’: when a hop mostly attends to an object in frame t*, ‘Masking()’ leaves unmasked only the objects in frames after t* and before the last few frames, so that each of the remaining required hops still has later frames left to attend to.
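A sketch of how such a mask can be built; the frame indexing and the way frames are reserved for the remaining required hops are simplifying assumptions:

```python
import torch

def build_object_mask(t_star: int, num_objects: int, num_frames: int,
                      hops_left: int = 0) -> torch.Tensor:
    """Boolean key-padding mask over the N*T object tokens (True = masked).

    Objects in frames up to and including t_star are hidden (autoregressive processing
    with dynamic hop stride).  If `hops_left` further hops must still happen after the
    current one, the last `hops_left` frames are hidden as well so that enough frames
    remain for them.
    """
    mask = torch.zeros(num_frames, num_objects, dtype=torch.bool)
    mask[: t_star + 1] = True                    # everything up to t* is hidden
    if hops_left > 0:
        mask[num_frames - hops_left:] = True     # reserve frames for the remaining hops
    return mask.reshape(-1)

# e.g. the previous hop attended to frame 3 of 13, and two more hops must follow this one
print(build_object_mask(t_star=3, num_objects=5, num_frames=13, hops_left=2).reshape(13, 5))
```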

Auxiliary hop-1 object loss. Identifying the correct object to attend to in the early hops is critical, and for Snitch Localization the object to attend to in hop 1 should, intuitively, be the last visible snitch. Hence, we define an auxiliary hop-1 object loss as the cross-entropy of classifying the index of the last visible snitch. Inputs to this loss are the index of the last visible snitch (computed with a heuristic that approximates it from the predicted object bounding boxes and labels), as well as the attention weights of Transformer-B at hop 1, which serve as the predicted likelihood of each index.

Auxiliary hop-2 object loss. Similarly, we let the second hop attend to the immediate occluder or container of the last visible snitch. The auxiliary hop-2 object loss is defined as the cross-entropy of classifying the index of the immediate occluder or container of the last visible snitch. Inputs to this loss are the heuristically computed index (the heuristic uses the distance between bounding-box bottom midpoints: it finds the object in the relevant frame whose bottom midpoint is closest to that of the last visible snitch) and the attention weights of Transformer-B at hop 2.
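The container/occluder heuristic can be sketched as follows; the choice of candidate frame and the exclusion of invalid candidates are simplifications:

```python
import numpy as np

def nearest_container_index(snitch_box: np.ndarray, boxes: np.ndarray) -> int:
    """Index of the object whose bounding-box bottom midpoint is closest to that of the
    last visible snitch.  Boxes are (x1, y1, x2, y2).  A simplified version of the hop-2
    heuristic; in practice the snitch itself and empty object slots would be excluded."""
    def bottom_mid(b: np.ndarray) -> np.ndarray:
        return np.stack([(b[..., 0] + b[..., 2]) / 2.0, b[..., 3]], axis=-1)
    dists = np.linalg.norm(bottom_mid(boxes) - bottom_mid(snitch_box), axis=-1)
    return int(np.argmin(dists))

snitch = np.array([0.40, 0.40, 0.50, 0.50])
candidates = np.array([[0.10, 0.10, 0.20, 0.20],
                       [0.38, 0.35, 0.55, 0.52],   # a cone sitting right on top of the snitch
                       [0.80, 0.80, 0.90, 0.90]])
print(nearest_container_index(snitch, candidates))   # -> 1
```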

Auxiliary hop frame loss. Attending to objects in the correct frames in hops 1 and 2 is critical for the later hops. An analogous loss term guides the model to identify the correct frame index.

Teacher forcing. Teacher forcing is a strategy often used for training recurrent neural networks in which the ground truth from a prior time step is fed as an input (williams1989learning). We use teacher forcing for hops 1 and 2 by providing the ground-truth attended frame indices (since we can compute the frame index of the last visible snitch with the heuristics described above).

Contrastive debias loss via masking out. This loss is inspired by the human mask confusion loss in whycannotdance. It penalizes the model if it can make correct predictions when the most attended object in the last hop is masked out. In contrast to masking out the human, however, we enforce consistency between attended objects and correct predictions, ensuring that the model understands why it is making a correct prediction: the model should not be able to predict the correct location without seeing the correct evidence. Technically, the contrastive debias loss is an entropy term that we aim to maximize, defined as follows.

    L_debias = H(g_θ(x̃)) = − Σ_{c=1}^{C} [g_θ(x̃)]_c · log [g_θ(x̃)]_c,    (3)

where g_θ denotes the video query representation and recognition module (the Multi-hop Transformer along with the MLP) with parameters θ that produces the likelihood of each grid class, x̃ is the source sequence to the Multi-hop Transformer with the most attended object in the last hop masked out (set to zeros), and C denotes the number of grid classes.
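A sketch of this term in PyTorch; the sign convention returns the negative entropy so that minimizing the total loss maximizes the entropy of the prediction on the masked input (the zeroing-out of the attended object is assumed to happen upstream):

```python
import torch
import torch.nn.functional as F

def contrastive_debias_loss(logits_masked: torch.Tensor) -> torch.Tensor:
    """logits_masked: (B, C) grid-class logits computed from the source sequence with the
    most attended object of the last hop zeroed out.  Returns the negative entropy, so
    adding this term to the total loss pushes the model toward an uninformed prediction
    whenever the key evidence is missing."""
    log_probs = F.log_softmax(logits_masked, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)   # H(g_theta(x_tilde)) per sample
    return -entropy.mean()

# toy check: a confident prediction on the masked input incurs a higher loss than an uncertain one
confident = torch.tensor([[8.0, 0.0, 0.0, 0.0]])
uncertain = torch.tensor([[0.1, 0.0, 0.1, 0.0]])
print(contrastive_debias_loss(confident) > contrastive_debias_loss(uncertain))   # tensor(True)
```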

Summary & discussion. The total loss of the model is a linear combination of the hop-1 and hop-2 object & frame losses, the contrastive debias loss for the last hop, and the final grid classification cross-entropy loss. The object & frame losses for hops 1 and 2 are based on heuristics. The motivation is to provide weak supervision for the early hops to avoid error propagation, since multi-hop models can be difficult to train without intermediate supervision or when a ground-truth reasoning chain is not available (as in Hopper) (dua2020benefits; qi2019answering; ding2019cognitive; wang2019multi; chen2018learn; jiang2019self). Similar ideas can be applied to other tasks that require multi-hop reasoning (e.g., designing self-supervision or task-specific heuristic-based weak supervision for the intermediate hops, as the existing literature often does).

6 Experiments

Datasets. Snitch Localization (Figure 1) is the most challenging task in the CATER dataset, and it requires maintaining object permanence to be solved successfully (cater). However, CATER is highly imbalanced for the Snitch Localization task in terms of temporal cues: the snitch is entirely visible at the end of the video for nearly 60% of the samples (see Section 6), and visible in the second-to-last frame for many of the remaining ones. As a result, models learn a temporal bias and predict based on the last few frames. To address the temporal bias in CATER, we create a new dataset, CATER-hard (CATER-h for short), with diverse temporal variations. In CATER-h, each frame index is associated with a roughly equal number of videos whose last visible snitch occurs at that frame (Figure 13).

Baselines & metrics. We experiment with TSM (tsm), TPN (tpn), and SINet (sinet). Additionally, a Transformer baseline that uses the time-encoded frame track as the source sequence is included. Both SINet and Transformer are also used to substitute for our novel MHT within the Hopper framework in order to demonstrate the effectiveness of the proposed MHT; this means that Hopper-transformer and Hopper-sinet use the same representations of the image track and object tracks as our Hopper-multihop. Moreover, we report results from a Random baseline, a tracking baseline used by cater (DaSiamRPN), and a tracking baseline based on our Hungarian algorithm (Section 3.3) in order to understand how well our tracking component performs on its own. For CATER, we also compare with the results reported in cater. Details of the baselines are available in Appendix H. We evaluate models using Top-1 and Top-5 accuracy, as well as the mean L1 distance of the predicted grid cell from the ground truth, following cater. The L1 metric is cognizant of the grid structure and penalizes confusion between adjacent cells less than confusion between distant cells.

Methods FPS # Frames Top-1 Top-5 L1
Random - -
DaSiamRPN (Tracking) (DaSiamRPN) - -
Hungarian (Tracking - ours) - -
TSN (RGB) (tsn) -
TSN (RGB) + LSTM (tsn) -
TSN (Flow) (tsn) -
TSN (Flow) + LSTM (tsn) -
I3D-50 (carreira2017quo)
I3D-50 + LSTM (carreira2017quo)
I3D-50 + NL (nonlocal)
I3D-50 + NL + LSTM (nonlocal)
TPN-101 (tpn) 65.3
TSM-50 (tsm) 0.93
SINet (sinet)
Transformer (transformer)
Hopper-transformer (last frame)
Hopper-transformer 90.1
Hopper-sinet 69.1 91.8 1.02
Hopper-multihop (our proposed method) 73.2 93.8 0.85
Table 1: CATER Snitch Localization results (on the test set). The top 3 performance scores are highlighted as: First, Second, Third. Hopper outperforms existing methods while using only a low frame rate.
Methods FPS # Frames Top-1 Top-5 L1
Random - -
DaSiamRPN (Tracking) (DaSiamRPN) - -
Hungarian (Tracking - ours) - -
TPN-101 (tpn) 88.3
TSM-50 (tsm)
SINet (sinet)
Transformer (transformer)
Hopper-transformer (last frame)
Hopper-transformer 57.6 90.1 1.39
Hopper-sinet 62.8 91.7 1.25
Hopper-multihop (our proposed method) 68.4 1.09
Table 2: CATER-h Snitch Localization results (on the test set). The top 3 performance scores are highlighted as: First, Second, Third. Hopper outperforms existing methods while using only a low frame rate.

Results. We present the results on CATER in Table 1 and on CATER-h in Table 2. For all methods, we find that the performance on CATER-h is lower than on CATER, which demonstrates the difficulty of CATER-h. The performance drop is particularly severe for the tracking baseline and the temporal video understanding methods (TPN and TSM). The DaSiamRPN tracking approach solves only about a third of the videos on CATER and even fewer on CATER-h, because the tracker is unable to maintain object permanence through occlusions and containments, showcasing the challenging nature of the task. On CATER, TPN and TSM, two state-of-the-art methods focusing on temporal modeling for videos, achieve higher accuracy than the methods in cater; however, both drop substantially on CATER-h. SINet performs poorly even though it reasons about higher-order object interactions via multi-headed attention and fuses both coarse- and fine-grained information (similar to our Hopper). The poor performance of SINet can be attributed to its less accurate object representations and the lack of tracking, whereas effective temporal modeling is critical for this task. Without object-centric modeling, the Transformer baseline that uses a sequence of frame representations also performs poorly. However, for both SINet and Transformer, a significant improvement is observed after utilizing our Hopper framework. This shows the benefit of the tracking-enabled object-centric learning embedded in Hopper. Since almost 60% of videos in CATER have the last visible snitch in the last frame, we run another experiment using Transformer in Hopper with only the representations of the frame and objects in the last frame. This model variant achieves a high Top-1 accuracy on CATER, even higher than the best-performing method in cater, but a much lower Top-1 accuracy on CATER-h, which reiterates the necessity of CATER-h; the accuracy it retains on CATER-h is still relatively high, likely due to other dataset biases (see Appendix G.6). Our method, Hopper with the MHT, outperforms all baselines in Top-1 accuracy using only a few frames per video, highlighting the effectiveness and importance of the proposed multi-hop reasoning. opnet report slightly better Top-1 accuracy and L1 on CATER, but they impose strong domain knowledge, operate at a higher frame rate (more frames per video), and require the location of the snitch to be labeled in every frame, even when contained or occluded, for both their object detector and their reasoning module. Labeling these non-visible objects for every frame of a video would be very difficult in real applications. Furthermore, their method achieves a much lower Top-1 accuracy and worse L1 on CATER-h. Please refer to the Appendix, where we provide more quantitative results.

Figure 3: Qualitative result & interpretability of our model. We highlight the object attended per hop and per head by Hopper-multihop. The frame border of each attended object is colored based on the hop index (accordingly: Hop1, Hop2, Hop3, Hop4, and Hop5). The bounding box of the most attended object in each hop shares the color of its hop index. Please zoom in to see the details. Best viewed in color.

Interpretability. We visualize the attended objects in Figure 3. As illustrated, the last visible snitch appears early in this video, after which the snitch is contained by the purple cone. Later, the purple cone is contained by the blue cone, but in the end the blue cone is moved away, ending the containment of the purple cone. Hop 1 of Hopper-multihop attends to the last visible snitch, and hop 2 attends to the snitch's immediate container, the purple cone. Hop 3 mostly attends to the purple cone's immediate container, the blue cone, and secondarily attends (with the other head) to the blue cone in a later frame as it keeps sliding. Hop 4 mostly attends to the blue cone, and hop 5 mostly attends to the purple cone (which contains the snitch) while secondarily attending to the blue cone. The visualization exhibits that Hopper-multihop performs reasoning by hopping over frames while selectively attending to objects in the frames. It also showcases that MHT provides more transparency into the reasoning process. Moreover, MHT implicitly learns to perform snitch-oriented tracking automatically. More visualizations are available in the Appendix.

7 Conclusion and Future Work

This work presents Hopper, with a novel Multi-hop Transformer, to address object permanence in videos. Hopper achieves 73.2% Top-1 accuracy on CATER while processing only a few frames per video, and demonstrates the benefits of multi-hop reasoning. In addition, the proposed Multi-hop Transformer uses an iterative attention mechanism and produces a step-by-step reasoning chain that improves interpretability. Multi-hop models are often difficult to train without supervision for the intermediate hops; we propose several training methods, applicable to other tasks, that address the lack of a ground-truth reasoning chain. In the future, we plan to experiment on real-world video datasets and extend our methods to other complex tasks (such as video QA).

Acknowledgments

The research was supported in part by NSF awards: IIS-1703883, IIS-1955404, and IIS-1955365.

References

Appendix A Ablation Study

Dynamic Stride, Min Hops, Hop-1 Loss, Hop-2 Loss, Frame Loss, Teacher Forcing, Debias Loss, Top-1, Top-5
Table 3: Ablation study of Hopper training methods. We gradually add the training methods described in Section 5, i.e., dynamic hop stride, minimal hops of reasoning, auxiliary hop-1 object loss, auxiliary hop-2 object loss, auxiliary hop frame loss, teacher forcing, and the contrastive debias loss via masking out, onto the base Hopper-multihop model. The results are obtained from the CATER-h test set.

We conduct an ablation study of the training methods described in Section 5 in Table 3. As shown, all of the proposed training methods are beneficial. ‘Dynamic Stride’ gives the model more flexibility, whereas ‘Min Hops’ constrains the model to perform a reasonable number of reasoning steps. ‘Hop-1 Loss’, ‘Hop-2 Loss’, ‘Frame Loss’ and ‘Teacher Forcing’ stress the importance of the correctness of the first hops to avoid error propagation. ‘Debias Loss’ is the most effective one, contrastively inducing the latent space to capture information that is maximally useful to the task at hand.

Object Detector & Tracker Reasoning Model Top-1 Top-5
DETR + Hungarian (ours) MHT (both Masked)
DETR + Hungarian (ours) MHT (no Gating)
DETR (no Tracking) MHT (no Tracking)
DETR + Hungarian (ours) MHT (mask out LAST)
DETR + Hungarian (ours) MHT (mask out ALL)
Table 4: Ablation study & comparative results of analyzing components of our method (on CATER-h test set).

We then study how different choices of the sub-components of our method affect Snitch Localization performance. In Table 4:

  • ‘MHT (both Masked)’: This refers to using our Hopper-multihop but replacing the original input to Transformer-A with a masked version. In this way, both Transformer-A and Transformer-B have ‘Masking()’ applied beforehand.

  • ‘MHT (no Gating)’: This refers to using our Hopper-multihop but removing the Attentional Feature-based Gating (Algorithm 1) inside MHT.

  • ‘MHT (no Tracking)’: This refers to using our Hopper-multihop but entirely removing the Hungarian tracking module. Thus, MHT directly takes in unordered object representations as inputs.

  • ‘MHT (mask out LAST)’: This refers to taking our trained Hopper-multihop, masking out the representation of the most attended object in the last hop by zeros, and then making predictions. This is to verify whether the most attended object in the last hop is important for the final Snitch Localization prediction task.

  • ‘MHT (mask out ALL)’: Similar to the above, ‘MHT (mask out ALL)’ refers to taking our trained Hopper-multihop, masking out the representations of the most attended objects in all hops with zeros, and then making predictions. This is to verify how important the most attended objects identified by Hopper-multihop across all hops are.

As shown in Table 4, all of these ablations give worse performance, indicating that our motivations for these designs are reasonable (see Section 4). Recall that in Table 2, ‘DETR + Hungarian’ (without MHT) achieves only a low Top-1 accuracy on CATER-h (learning a perfect object detector or tracker is not the focus of this paper). This highlights the superiority of our MHT as a reasoning model, and suggests that MHT has the potential to correct mistakes from the upstream object detector and tracker by learning more robust object representations while learning the Snitch Localization task. Masking out the most attended object identified by Hopper-multihop in the last hop sharply reduces Top-1 accuracy, and so does masking out all of the most attended objects from all hops. Such results reassure us about the interpretability of our method.

Appendix B Parameter Comparison

Model # Parameters (M) GFLOPs Top-1 Acc.
SINet
Transformer
Hopper-transformer (last frame)
Hopper-transformer
Hopper-sinet
Hopper-multihop (our proposed method)
Table 5: Parameter and FLOPs comparison of our Hopper-multihop to alternative methods. (M) indicates millions. Results of the methods on the CATER-h test set are also listed.

In Table 5, we compare the number of parameters of our Hopper-multihop with alternative methods. Our proposed method is the most parameter-efficient, because of the iterative design embedded in MHT. Unlike most existing uses of Transformers, which stack multiple encoder and decoder layers in the traditional way, MHT has only one layer each of Transformer-A and Transformer-B; as multiple iterations are applied to MHT, the parameters of Transformer-A and Transformer-B are shared across iterations. This iterative transformer design is inspired by previous work (slotattention). The design saves parameters, accumulates previously learned knowledge, and adapts to a varying number of hops, e.g., some videos require only one hop while others require several. Because stacking multiple transformer layers would be wasteful and unnecessary for videos that require only one hop, our design of MHT addresses this issue and is more parameter-efficient. We also report a GFLOPs comparison. Given that the FLOPs of MHT depend on the number of hops predicted for a video, we report the average number of FLOPs over the CATER-h test set.

Appendix C Diagnostic Analysis

c.1 Diagnostic Analysis on the Hopping Mechanism

# Hops
Ground Truth
Prediction
Jaccard Similarity
Table 6: Diagnostic analysis of the Multi-hop Transformer in terms of the ‘hopping’ ability (# hops performed).

We evaluate the ‘hopping’ ability of the proposed Multi-hop Transformer in Table 6. The prediction is made by our Hopper-multihop, which requires at least a minimal number of hops of reasoning unless this is not possible. For each test video, we compute the ground-truth number of hops required by the video and obtain the number of hops that Hopper actually runs. In the table, we provide the ground-truth and predicted counts of test videos grouped by the number of hops they require. Since the counts are close, we further compute the Jaccard similarity (ranging from 0 to 1; higher is better) to measure the overlap between the ground-truth set of test videos and the predicted set of test videos for each hop count. According to these metrics, our proposed Hopper-multihop performs the functionally correct number of hops for almost all test videos.
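The overlap metric here is the standard Jaccard similarity between the two sets of video IDs per hop count; a minimal sketch:

```python
def jaccard(ground_truth: set, predicted: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| between the set of videos that truly require a
    given number of hops and the set predicted to require that many (1.0 = perfect overlap)."""
    union = ground_truth | predicted
    return len(ground_truth & predicted) / len(union) if union else 1.0

print(jaccard({"vid_01", "vid_02", "vid_03"}, {"vid_02", "vid_03", "vid_04"}))   # 0.5
```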

Frame Index
Hop
Hop
Hop
Hop
Hop
Hop
Table 7: Hop Index vs. Frame Index: the number of times each hop mostly attends to each frame index (results are obtained from Hopper-multihop on CATER-h). See details in Appendix C.1.
Figure 4: Hop Index vs. Frame Index: we plot the index of frame of the most attended object identified by each hop. Each hop has its unique color, and the transparency of the dot denotes the normalized frequency of that frame index for that particular hop.

In Table 7, we show the number of times each hop mostly attends to each frame index. The results are obtained from Hopper-multihop on CATER-h, for the test videos whose predicted hop counts are shown in Table 6. In Figure 4, we plot the index of the frame of the most attended object identified by each hop (conveying the same information as Table 7). The transparency of each dot denotes the normalized frequency of that frame index for that particular hop.

We can observe that: (1) the intermediate hops tend to attend to later frames, due to the lack of supervision for the intermediate hops. As discussed in Section 5, multi-hop models are hard to train in general when the ground-truth reasoning chain is missing during training (dua2020benefits; chen2018learn; jiang2019self; wang2019multi); researchers tend to use a ground-truth reasoning chain as supervision when training a multi-hop model (qi2019answering; ding2019cognitive; chen2019multi). The results reconfirm that, without supervision for the intermediate steps, it is not easy for a model to automatically figure out the ground-truth reasoning chain. (2) MHT has learned to predict the frame immediately following the frame identified by one hop as the frame the next hop should attend to. (3) Only a few videos are predicted with more hops than the required minimum, even though we only constrain the model to perform at least the minimal number of hops (unless not possible). Again, this is because no supervision is provided for the intermediate hops: as the Snitch Localization task itself is largely focused on the last frame of the video, the model tends to “look at” later frames as soon as possible. These results suggest where the current MHT can be improved; one possibility is to design self-supervision for each intermediate hop.

c.2 Comparative Diagnostic Analysis across Frame Index

Figure 5: Diagnostic analysis of the performance in terms of when snitch becomes last visible in the video.

In Figure 5, we present a comparative diagnostic analysis of performance as a function of when the snitch becomes last visible. We bin the test set using the frame index at which the snitch becomes last visible in the video. For each bin, we show the test set distribution with the bar plot, the performance over that bin with the line plot, and the performance of that model on the full test set with the dashed line. We find that for Tracking (DaSiamRPN) and TSM, the Snitch Localization performance drops when the snitch becomes invisible earlier in the video. This phenomenon, though it still exists, is alleviated for Hopper-multihop. We compute the standard deviation (SD) and coefficient of variation (CV); both measure the dispersion of the per-bin performance, and the higher the value, the greater the dispersion around the mean. The values of these metrics shown in Figure 5 further reinforce the stability of our model and the necessity of the CATER-h dataset.

Appendix D Extra Qualitative Results

In Figure 6, we visualize the attention weights per hop and per head from the Multi-hop Transformer to showcase the hops performed by Hopper-multihop for video ‘CATERh054110’ (the one in Figure 3) in detail. Please see Figures 7, 8, 9, 10, and 11 for extra qualitative results from Hopper-multihop. We demonstrate the reasoning process for different cases (i.e., ‘visible’, ‘occluded’, ‘contained’, ‘contained recursively’, and ‘not visible very early in the video’).

Figure 6: Visualization of attention weights & interpretability of our model. In (a), we highlight the object(s) attended in every hop by Hopper-multihop (the frame border is colored accordingly: Hop1, Hop2, Hop3, Hop4, and Hop5). In (b), we visualize the attention weights per hop (the smaller the attention weight of an object, the larger the opacity plotted over that object entity). As shown, Hopper-multihop performs several hops of reasoning for the video ‘CATERh054110’. Our model performs reasoning by hopping over frames while selectively attending to objects in the frames. Please zoom in to see the details. Best viewed in color.
Figure 7: We visualize the attention weights per hop and per head from the Multi-hop Transformer in our Hopper-multihop. Objects attended in every hop are highlighted (their frame border is colored accordingly: Hop1, Hop2, Hop3, Hop4, and Hop5). Please zoom in to see the details. Best viewed in color.
Figure 8: We visualize the attention weights per hop and per head from the Multi-hop Transformer in our Hopper-multihop. Objects attended in every hop are highlighted (their frame border is colored accordingly: Hop1, Hop2, Hop3, Hop4, and Hop5). Please zoom in to see the details. Best viewed in color.
Figure 9: We visualize the attention weights per hop and per head from the Multi-hop Transformer in our Hopper-multihop. Objects attended in every hop are highlighted (their frame border is colored accordingly: Hop1, Hop2, Hop3, Hop4, and Hop5). Please zoom in to see the details. Best viewed in color.
Figure 10: We visualize the attention weights per hop and per head from the Multi-hop Transformer in our Hopper-multihop. Objects attended in every hop are highlighted (their frame border is colored accordingly: Hop1, Hop2, Hop3, Hop4, and Hop5). Please zoom in to see the details. Best viewed in color.
Figure 11: We visualize the attention weights per hop and per head from the Multi-hop Transformer in our Hopper-multihop. Objects attended in every hop are highlighted (their frame border is colored accordingly: Hop1, Hop2, Hop3, Hop4, and Hop5). Please zoom in to see the details. Best viewed in color.

Appendix E Track Representation Visualization

Please see Figure 12 for a visualization of the object track representations of video ‘CATERh054110’ (attention weights from Hopper-multihop for this video are shown in Figure 6). Hopper utilizes tracking-integrated object representations, since tracking links object representations through time and the resulting representations are more informative and consistent. As shown in the figure, the tracks obtained from our Hungarian-based algorithm are competitive. Our model Hopper-multihop takes the best-effort object track representations (along with the coarse-grained frame track) as the source input to the Multi-hop Transformer, and then further learns the most useful and correct task-oriented track information implicitly (as shown in Figure 6).

Figure 12: Tracking-integrated object representation visualization. Hopper utilizes tracking-integrated object representations since tracking can link object representations through time and the resulting representations are more informative and consistent. We visualize the object track representations of video ‘CATERh054110’ (attention weights from Hopper-multihop for this video are shown in Figure 6). Here, every column corresponds to a track and every row to a frame. The bounding box and object class label computed from each object representation are plotted (one class label denotes the none object ∅, and another denotes the snitch). As shown in the figure, the tracks obtained from our designed Hungarian algorithm are not perfect but acceptable, since perfect tracking is not the goal of this paper. Our model Hopper-multihop takes the (imperfect) object track representations (along with the coarse-grained frame track) as the source input to the Multi-hop Transformer, and then further learns the most useful and correct task-oriented track information implicitly (as shown in Figure 6). Hopper-multihop performs several hops of reasoning for this video; objects attended in every hop are highlighted (their frame border is colored accordingly: Hop1, Hop2, Hop3, Hop4, and Hop5). Please zoom in to see the details. Best viewed in color.

Appendix F Failure Cases

We present a sample of the failure cases of Hopper-multihop in Figure 15. Generally, a video is more difficult if: (1) similar-looking objects are present simultaneously (especially if an object is similar to the snitch or to another cone in the video); (2) the object or frame identified by hop 1 or hop 2 is wrong; (3) critical moments are occluded (e.g., in Figure 15, when the snitch becomes occluded, it is contained by the brown cone); or (4) there are complex object interactions such as recursive containment combined with container movement (since such cases usually have the last visible snitch very early). Hopper-multihop fails in the first scenario due to errors made by the object representation and detection module, which can be avoided by using a fine-tuned object detector. Failures in the second scenario can be attributed to errors made by the object detector, the tracker, our heuristics, or the limited capability of an inadequately trained Multi-hop Transformer. The third scenario is not easy even for humans at a low frame rate, so increasing the FPS with extra care might ease the problem. The last scenario requires more sophisticated multi-step reasoning; increasing the minimal number of hops of the Multi-hop Transformer and adding self-supervision for the intermediate hops to handle long reasoning chains should help. Overall, a more accurate backbone, object detector, tracking method, or heuristic for determining visibility and the last visible snitch's immediate container (or occluder) will help improve the performance of Hopper-multihop. We plan to focus on enhancing Hopper-multihop for these challenges and verify these hypotheses in future work.

Appendix G The CATER-h Dataset

g.1 Basics: CATER

CATER (cater) provides a diagnostic video dataset that requires spatial and temporal understanding to be solved. It is built to resist models that take advantage of spurious scene biases. With fully observable and controllable scene bias, the videos in CATER are rendered synthetically at 24 FPS using a library of standard 3D objects that combines several object shapes (cube, sphere, cylinder, cone, snitch), sizes (small, medium, large), materials (shiny metal and matte rubber) and colors. Every video has a small metal snitch (see Figure 1). There is a large “table” plane on which all objects are placed. At a high level, the dynamics in CATER videos are analogous to the cup-and-balls magic routine (https://en.wikipedia.org/wiki/Cups_and_balls). A subset of atomic actions (‘rotate’, ‘pick-place’, ‘slide’ and ‘contain’) is afforded by each object; see Appendix G.2 for the definitions of the actions. Note that ‘contain’ is only afforded by cones, and recursive containment is possible, i.e., a cone can contain a smaller cone that contains another object. Every video in CATER is split into several time slots, and objects randomly perform actions (including ‘no action’) in each time slot. Objects and actions vary across videos. The “table” plane is divided into a grid of rectangular cells, and the Snitch Localization task is to determine the grid cell that the snitch is in at the end of the video, as a single-label classification task. The task implicitly requires understanding object permanence, because objects can be occluded or contained (hidden inside) by another object.

g.2 Definition of Actions

We follow the definition of the four atomic actions in  cater. Specifically:

  1. ‘rotate’: the ‘rotate’ action rotates the object about its perpendicular or horizontal axis, and is afforded by cubes, cylinders and the snitch.

  2. ‘pick-place’: The ‘pick-place’ action means the object is picked up into the air along the perpendicular axis, moved to a new position, and placed down. This is afforded by all objects.

  3. ‘slide’: the ‘slide’ action means the object is moved to a new location by sliding along the bottom surface, and is also afforded by all objects.

  4. ‘contain’: ‘contain’ is a special operation, only afforded by the cones, in which a cone is pick-placed on top of another object, which may be a sphere, a snitch or even a smaller cone. This allows for recursive containment, as a cone can contain a smaller cone that contains another object. Once a cone ‘contains’ an object, the ‘slide’ action of the cone effectively slides all objects contained within the cone. This holds until the top-most cone is pick-placed to another location, effectively ending the containment for that top-most cone.

g.3 Dataset Generation Process

The generation of the CATER-h dataset is built upon the CLEVR (clevr) and CATER (cater) codebases. Blender is used for rendering. The animation setup is the same as in CATER. A random number of objects with random parameters are spawned at random locations at the beginning of the video. They exist on a portion of a 2D plane with the global origin at the center. Every video has a snitch, and every video is split into several time slots. Each action is contained within its time slot. At the beginning of each slot, objects are randomly selected to perform a random action afforded by that object (while ensuring no collisions). Please refer to cater for more animation details.

In order to have a video dataset that emphasizes recognizing the effect of temporal variations on the state of the world, we require a roughly equal number of video samples to have their last visible snitch at each position along the temporal axis. To obtain such a dataset, we generated a large number of videos and computed the frame index of the last visible snitch in every video at the sampled frame rate. Then, for every frame index, we collected the set of videos whose last visible snitch is at that frame index, randomly kept a fixed number of videos from this set, and discarded the rest. Finally, we split the data randomly into a training set and a test set.
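The selection step amounts to bucketing candidate videos by the frame index of their last visible snitch and keeping an equal number per bucket; a sketch with a placeholder per-bucket quota:

```python
import random
from collections import defaultdict

def build_balanced_subset(videos, last_visible_frame, per_bucket=100, seed=0):
    """videos: list of video IDs; last_visible_frame: dict mapping a video ID to the frame
    index of its last visible snitch (at the sampled frame rate).  Keeps at most
    `per_bucket` randomly chosen videos per frame index and discards the rest, so every
    frame index contributes a roughly equal number of samples."""
    buckets = defaultdict(list)
    for v in videos:
        buckets[last_visible_frame[v]].append(v)
    rng = random.Random(seed)
    kept = []
    for frame_idx in sorted(buckets):
        pool = list(buckets[frame_idx])
        rng.shuffle(pool)
        kept.extend(pool[:per_bucket])
    return kept

# toy usage: three candidate videos, a quota of one video per frame index
print(build_balanced_subset(["a", "b", "c"], {"a": 2, "b": 2, "c": 5}, per_bucket=1))
```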

g.4 CATER-h v.s. CATER

Figure 13 compares the CATER-h dataset and the CATER dataset.

Figure 13: Histogram of the frame index of the last visible snitch of every video. We find that CATER is highly imbalanced for the Snitch Localization task in terms of temporal cues: e.g., the snitch is entirely visible at the end of the video for a large fraction of the samples. This temporal bias allows a model to reach high accuracy even if it ignores all but the last frame of the video. Our dataset CATER-h addresses this issue with a balanced distribution.

G.5 Train/Test Distribution

Figure 14 shows the data distribution over classes in CATER-h.

G.6 Other Potential Dataset Bias

As shown in Table 2, ‘Hopper-transformer (last frame)’ still achieves a relatively high accuracy on CATER-h. We hypothesize that its Top-1 accuracy on CATER-h might be due to other dataset biases (apart from the snitch grid distribution bias and the temporal bias that CATER has). Upon further investigation, we identify one such additional bias, the “cone bias”: in the videos of CATER and CATER-h, the snitch can only be contained by a cone.

In order to verify the existence of the “cone bias”, we compute the accuracy obtained by guessing uniformly at random among the grid cells of cones that are not covered by any other cone, over all test videos whose snitch is covered at the end. The resulting Top-1 accuracy is notably above chance, which shows that the “cone bias” does exist in the dataset. The “cone bias” comes from the nature of the objects used in CATER and the fact that only a cone can carry the snitch (thus, it is closer to a feature of the dataset than a “bias” per se). Furthermore, because of the animation rules of CATER, there may exist other dataset biases, such as biases in terms of object size and shape, which are hard to discover and address. This highlights the challenge of building a fully unbiased (synthetic or real) dataset. CATER-h addresses the temporal bias that CATER has: a model has to perform long-term spatiotemporal reasoning in order to achieve high accuracy on CATER-h.
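The “cone bias” check can be reproduced in spirit with a short script: for each test video whose snitch ends up covered, guess uniformly among the grid cells of cones that are not themselves covered. The record format below is hypothetical.

```python
import random

def cone_bias_guess_accuracy(test_videos, num_cells=36, seed=0):
    """Top-1 accuracy of guessing among the cells of uncovered cones.

    Each entry of `test_videos` is assumed to provide the true snitch cell,
    a flag for whether the snitch is covered at the end, and the grid cells
    of cones not covered by any other cone (hypothetical field names).
    """
    rng = random.Random(seed)
    correct, total = 0, 0
    for video in test_videos:
        if not video["snitch_covered_at_end"]:
            continue
        candidates = video["uncovered_cone_cells"] or list(range(num_cells))
        guess = rng.choice(candidates)
        correct += int(guess == video["snitch_cell"])
        total += 1
    return correct / max(total, 1)
```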

Appendix H Implementation Details

H.1 Hopper

We describe the implementation of Hopper-multihop in this section. For both training and testing, we sampled frames at a low FPS (frames per second) to demonstrate the efficiency of our approach, which leaves only a small number of frames per video. Note that we trained the video query representation and recognition part (the Multi-hop Transformer along with the final MLP) end to end.

The CNN backbone we utilized is the pre-trained ResNeXt-101 (xie2017aggregated) model from sinet. We trained DETR (detr) (https://github.com/facebookresearch/detr) on LA-CATER (opnet), a dataset of videos generated under the same configuration as CATER but with additional ground-truth object bounding box and class label annotations (opnet predicts the bounding box of the snitch in a video given supervision of the snitch bounding box in the frames). We followed the settings in detr to set up and train DETR, e.g., stacking multiple transformer encoder and decoder layers, and using the object detection set prediction loss together with the auxiliary decoding loss per decoder layer. The initial object query embeddings are learned. The MLP for recognizing the object class label is a single linear layer, and the object bounding box is predicted by a small MLP with hidden layers. After DETR was trained, we ran it on CATER to obtain object representations, predicted bounding boxes and class labels. For tracking, we chose a setting that avoids relying on the predicted bounding boxes, because under a low FPS using bounding boxes for tracking is counterproductive; this yielded reasonable results. We then trained the video recognition part, i.e., the Multi-hop Transformer along with the final MLP, end to end with the Adam (kingma2014adam) optimizer. The final MLP is a single linear layer that transforms the video query representation into the grid class logits. A fixed initial learning rate, weight decay, batch size, number of attention heads (set separately for DETR and for the Multi-hop Transformer) and transformer dropout rate were used. We used multi-stage training with the proposed training methods. Moreover, we found that DETR tends to predict a snitch for every cone on the “table” plane when there is no visible snitch in a frame. To mitigate this issue of the DETR object detector trained on opnet, we further compute an object visibility map, a binary vector determined by a heuristic: an object is visible if its bounding box is not completely contained by the bounding box of any other object in that frame. The ‘Masking()’ function uses this visibility map to consider only the visible objects.
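The visibility heuristic described above admits a direct implementation; boxes are assumed to be (x1, y1, x2, y2) tuples in image coordinates.

```python
def box_contains(outer, inner):
    """True if `outer` completely contains `inner`; boxes are (x1, y1, x2, y2)."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])

def visibility_map(boxes):
    """Binary visibility per detected object in one frame: an object is visible
    unless its box is completely contained by the box of some other object."""
    visible = []
    for i, box_i in enumerate(boxes):
        hidden = any(box_contains(box_j, box_i)
                     for j, box_j in enumerate(boxes) if j != i)
        visible.append(0 if hidden else 1)
    return visible

print(visibility_map([(0, 0, 10, 10), (2, 2, 5, 5)]))  # [1, 0]
```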

H.2 Description of Reported CATER Baselines

For CATER, we additionally compare our results with the ones reported by cater for TSN (tsn), I3D (carreira2017quo) and NL (nonlocal), as well as their LSTM variants. Specifically, TSN (Temporal Segment Networks) was the top-performing method based on long-range temporal structure modeling prior to TSM and TPN; it was evaluated with two modalities, RGB and Optical Flow (which captures local temporal cues). I3D inflates a 2D ConvNet into 3D for efficient spatiotemporal feature learning. NL (Non-Local Networks) proposed a spacetime non-local operation as a generic building block for capturing long-range dependencies in video classification. To better capture temporal information for these methods, cater further experimented with a 2-layer LSTM aggregation that operates on the last-layer features before the logits. The conclusions from cater are: (1) TSN performs significantly worse than I3D, rather than comparably, which contrasts with standard video datasets; (2) the optical flow modality does not work well, as the Snitch Localization task requires recognizing objects, which is much harder from optical flow alone; (3) denser sampling from the video gives higher performance; and (4) LSTM-based temporal aggregation leads to a major improvement in performance.

H.3 Baselines

This section describes the implementation of the baselines we experimented with. First, for our Hungarian tracking baseline, for every test video we obtain the snitch track (chosen as the track whose first object is labeled as the snitch) produced by our Hungarian algorithm, and project the center point of the bounding box of the last object in that track to the plane (and eventually the grid class label) using a homography transformation between the image and the plane (the same method used in cater). We also tried a majority vote, i.e., selecting as the snitch track the track with the highest number of frames classified as snitch. We report the majority-vote result in Tables 1 and 2 because it is the more robust method; we also measured the Top-1 and Top-5 accuracy of the first-frame variant of our Hungarian tracking baseline on both CATER and CATER-h.
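A minimal sketch of the two snitch-track selection rules compared above (first-frame label versus majority vote) is given below; the track format, a list of per-frame (class_label, box) pairs, is a hypothetical simplification.

```python
def pick_snitch_track(tracks, use_majority_vote=True, snitch_label="snitch"):
    """Select the track treated as the snitch track.

    `tracks` is assumed to be a list of tracks produced by the Hungarian
    association step, each a list of per-frame (class_label, box) tuples.
    """
    if not tracks:
        return None
    if use_majority_vote:
        # track with the highest number of frames classified as snitch
        return max(tracks, key=lambda t: sum(1 for label, _ in t if label == snitch_label))
    # first-frame rule: track whose first detection is labeled as the snitch
    for track in tracks:
        if track and track[0][0] == snitch_label:
            return track
    return None
```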

We used the publicly available implementation provided by the authors (tpn) for TSM and TPN. Models were initialized by default from models pre-trained on ImageNet (deng2009imagenet). The original ResNet (he2016deep) serves as the 2D backbone, and the inflated ResNet (feichtenhofer2019slowfast) as the 3D backbone network. We used the default settings of TSM and TPN provided by tpn, i.e., the TSM settings (2D ResNet backbone) used to obtain results on Something-Something (goyal2017something), which are also the protocols used in tsm, as well as the TPN settings (I3D backbone, i.e., inflated 3D ResNet) with the multi-depth pyramid and the parallel flow used to obtain results on Kinetics, their best-performing configuration of TPN (carreira2017quo). Specifically, random-crop and horizontal-flip augmentation together with dropout were adopted to reduce overfitting, and BatchNorm (BN) was not frozen. Synchronized SGD with momentum and weight decay was used, with the initial learning rate reduced by a fixed factor at scheduled epochs; TSM used its own default weight decay. TPN used the auxiliary head, spatial convolutions in semantic modulation, temporal rate modulation and information flow (tpn). For SINet, we used the implementation provided by sinet. Specifically, image features for SINet were obtained from a pre-trained ResNeXt-101 (xie2017aggregated) with standard data augmentation (randomly cropping and horizontally flipping video frames during training). Note that the image features used by SINet are the same as the ones used in our Hopper. The object features were generated from a Deformable R-FCN (dai2017deformable), with a capped maximum number of objects per frame and a fixed number of subgroups of higher-order object relationships. SGD with Nesterov momentum, weight decay and a learning rate that drops when the validation loss saturates for several epochs was used as the optimizer, and the same batch size was used for these baselines. Transformer, Hopper-transformer, and Hopper-sinet used the Adam optimizer; as with our model, the learning rate drops by a fixed factor when there has been no improvement on the validation set for several epochs. For the Transformer (and Hopper-transformer), the number of transformer layers was set to match the number of hops in our Multi-hop Transformer, and multi-head attention and transformer dropout were used. For the OPNet-related experiments, we used the implementation provided by the authors (opnet) and verified that we could reproduce their results at our low FPS on CATER using their provided code and trained models.
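A rough PyTorch sketch of the optimizer setup described for these models (Adam with weight decay and a plateau-based learning-rate drop driven by the validation metric) is shown below; the specific values and the stand-in model are placeholders, not the paper's settings.

```python
import torch

model = torch.nn.Linear(256, 36)   # stand-in for the video recognition head
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=5)  # drop LR when validation loss plateaus

for epoch in range(3):
    logits = model(torch.randn(8, 256))                                   # dummy batch
    loss = torch.nn.functional.cross_entropy(logits, torch.randint(0, 36, (8,)))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step(loss.item())    # in practice this would be the validation loss
```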

The Random baseline is computed as the average performance of random scores passed into the evaluation functions of cater. For the Tracking baseline, we use the DaSiamRPN implementation from cater (https://github.com/rohitgirdhar/CATER). Specifically, the ground-truth starting position of the snitch is first projected to screen coordinates using the render camera parameters. A fixed-size box around the snitch is used to initialize the tracker, which is then run until the end of the video. At the last frame, the center point of the tracked box is projected to the 2D plane using a homography transformation between the image and the plane, and then converted to the class label. For TSN, I3D, NL and their variants, the results are taken from cater, and we used the same train/val split as theirs when obtaining our results on CATER.
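The projection step shared by the tracking baselines (tracked box center → plane coordinates → grid class) can be sketched with OpenCV's perspective transform; the identity homography and the grid convention below are placeholders rather than the actual camera-derived values.

```python
import numpy as np
import cv2

def box_center_to_grid_class(box, H, plane_half_extent=3.0, grid_dim=6):
    """Project an image-space box center onto the plane and quantize it to a cell.

    `box` is (x1, y1, x2, y2) in pixels; `H` is a 3x3 image-to-plane homography
    (in the paper it is derived from the render camera parameters).
    """
    cx, cy = (box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0
    point = np.array([[[cx, cy]]], dtype=np.float32)
    px, py = cv2.perspectiveTransform(point, H)[0, 0]
    cell_size = 2.0 * plane_half_extent / grid_dim
    col = int(np.clip((px + plane_half_extent) // cell_size, 0, grid_dim - 1))
    row = int(np.clip((py + plane_half_extent) // cell_size, 0, grid_dim - 1))
    return row * grid_dim + col

# Identity homography as a placeholder for the camera-derived one
print(box_center_to_grid_class((100, 120, 140, 160), np.eye(3, dtype=np.float32)))
```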

Figure 14: Data distribution over classes in CATER-h (CATER-h train set and CATER-h test set).
Figure 15: Failure cases. Hopper produces wrong Top-1 and Top-5 predictions for these failure cases. As before, we highlight the attended object per hop and per head (Hop1, Hop2, Hop3, Hop4, and Hop5). Case (a), ‘CATERh048295’: the occlusion makes the Snitch Localization task extremely difficult, since the snitch was already contained by the brown cone when it became occluded; Hopper fails to attend to the immediate container of the last visible snitch (which should be the brown cone) in the corresponding hop. Case (b), ‘CATERh022965’: the snitch becomes invisible very early in the video; the recursive containment, as well as the presence of two similar-looking cones, makes the task extremely difficult. Hopper fails to attend to the correct object (which should be the yellow cone) in the corresponding hop.

Appendix I Related Work (Full Version)

In this section, we provide a detailed discussion of related work. Our work is related to the following recent research directions.

Video understanding & analysis. With the release of large-scale datasets such as Kinetics (carreira2017quo), Charades (sigurdsson2016hollywood), and Something-Something (goyal2017something), video representation learning has matured quickly in recent years. Early approaches use deep visual features from 2D ConvNets with LSTMs for temporal aggregation (donahue2015long; yue2015beyond). As a natural extension to video data, 3D ConvNets were later proposed (ji20123d; taylor2010convolutional), but at the cost of inefficiency and a large increase in parameters. Using both RGB and optical flow modalities, two-stream networks (simonyan2014two; feichtenhofer2016convolutional) and Two-Stream Inflated 3D ConvNets (I3D) (carreira2017quo) were designed. With an emphasis on capturing the temporal structure of a video, TSN (tsn), TRN (trn), TSM (tsm) and TPN (tpn) were successively proposed and gained considerable improvements. Recently, attention mechanisms and the Transformer design (transformer) have been utilized for more effective and transparent video understanding. Such models include Non-local Neural Networks (NL) (nonlocal) that capture long-range spacetime dependencies, SINet (sinet) that learns higher-order object interactions, and the Action Transformer (girdhar2019video) that learns to attend to relevant regions of the actor and their context. Nevertheless, existing benchmarks and models for video understanding and analysis have mainly focused on pattern recognition from complex visual and temporal input rather than on reasoning capabilities.

Visual reasoning from images. To expand beyond image recognition and classification, research on visual reasoning has largely focused on Visual Question Answering (VQA). For example, the diagnostic VQA benchmark CLEVR (clevr) was built to reduce spatial biases and test a range of visual reasoning abilities, and several visual reasoning models have been proposed (santoro2017simple; perez2017film; mascharka2018transparency; suarez2018ddrprog; aditya2018explicit). Inspired by module networks (neuralmodulenet; andreas2016learning), johnson2017inferring proposed a compositional model for visual reasoning on CLEVR that consists of a program generator, which constructs an explicit representation of the reasoning process to be performed, and an execution engine, which executes the resulting program to produce an answer; both are implemented by neural networks. Also evaluated on CLEVR, hu2017learning proposed End-to-End Module Networks (N2NMNs), which learn to reason by directly predicting network structures while simultaneously learning network parameters. MAC networks (macnetwork) approach CLEVR by decomposing VQA problems into a series of attention-based reasoning steps, stringing MAC cells end-to-end and imposing structural constraints to effectively learn iterative reasoning. Further, datasets for real-world visual reasoning and compositional question answering have been released, such as GQA (hudson2019gqa). The Neural State Machine (neuralstatemachine) was introduced for real-world VQA; it performs sequential reasoning by traversing the nodes of a probabilistic graph predicted from the image.

Video reasoning. There has been notable progress in joint video and language reasoning. For example, to strengthen the ability to reason about temporal and causal events in videos, the CLEVRER video question answering dataset (CLEVRER) was introduced: a diagnostic dataset generated under the same visual settings as CLEVR but aimed at the systematic evaluation of video models. Other synthetic video question answering datasets include COG (yang2018dataset) and MarioQA (mun2017marioqa). There are also numerous datasets based on real-world videos and human-generated questions, such as MovieQA (tapaswi2016movieqa), TGIF-QA (jang2017tgif), TVQA (lei2018tvqa) and Social-IQ (zadeh2019social). Moving beyond question answering, CoPhy (baradel2019cophy) studies physical dynamics prediction in a counterfactual setting, and a small causality video dataset (fire2017inferring) was released to study the causal relationships between human actions and hidden statuses. To date, research on general video understanding and reasoning is still limited. Focusing on both video reasoning and general video recognition and understanding, we experiment on the recently released CATER dataset (cater), a synthetic video recognition dataset, also built upon CLEVR, that focuses on spatial and temporal reasoning as well as localizing a particular object of interest. There has also been significant research on object tracking, often with an emphasis on occlusions, with the goal of providing object permanence (bewley2016simple; deepsort; SiamRPN; DaSiamRPN; SiamMask). Traditional object tracking approaches focus on fine-grained temporal and spatial understanding and often require expensive supervision of object locations in every frame (opnet). We address object permanence and video recognition on CATER with a model that performs tracking-integrated object-centric reasoning to localize the object of interest.

Multi-hop reasoning. Reasoning systems vary in expressive power and predictive abilities, and include systems focused on symbolic reasoning (e.g., with first-order logic), probabilistic reasoning, causal reasoning, etc. (bottou2014machine). Among them, multi-hop reasoning is the ability to reason with information collected from multiple passages to derive the answer (wang2019multi). Because of the desire for chains of reasoning, several multi-hop datasets and models have been proposed for natural language processing tasks (Dhingra2020Differentiable; dua2019drop; welbl2018constructing; talmor2018repartitioning; yang2018hotpotqa). For example, das2016chains introduced a recurrent neural network model that allows chains of reasoning over entities, relations, and text. chen2019multi proposed a two-stage model that identifies intermediate discrete reasoning chains over the text via an extractor model and then separately determines the answer through a BERT-based answer module. wang2019multi investigated whether providing the full reasoning chain over multiple passages, instead of just the final passage where the answer appears, could improve the performance of existing models; their results demonstrate the potential improvement from explicit multi-hop reasoning. Multi-hop reasoning yields a discrete intermediate output of the reasoning process, which can help gauge a model's behavior beyond final task accuracy (chen2019multi). Motivated by these benefits, in this paper we develop a video dataset that explicitly requires aggregating clues from different spatiotemporal parts of the video, and a multi-hop model that automatically extracts a step-by-step reasoning chain. Our proposed Multi-hop Transformer improves interpretability and imitates a natural way of thinking. The iterative attention-based neural reasoning (slotattention; neuralstatemachine) with a contrastive debiasing loss further offers robustness and generalization.

Appendix J MHT Architecture

We illustrate the architecture of the proposed Multi-hop Transformer (MHT) in Figure 16.

Figure 16: Architecture of the Multi-hop Transformer (MHT), which learns a comprehensive video query representation while encouraging multi-step compositional long-term reasoning over a spatiotemporal sequence. As inputs to this module, the ‘Source Sequence’ is the concatenation of the frame-level and object-track representations, and the ‘Target Video Query’ is the video query embedding; the output is the ‘Final Video Query Representation’. The attention weights are taken from the encoder-decoder multi-head attention layer of the Transformer, averaged over all heads. To connect this figure with Algorithm 1, ‘Obtain Helper Information & Attention Candidates’ refers to the lines of the algorithm that compute the ‘Helper Information’ (handled differently in the first iteration) and the ‘Attention Candidates’; the dimensionality of the ‘Helper Information’ differs between the first hop and the remaining hops. Both Transformer blocks use the original Transformer architecture (transformer). ‘Attentional Feature-based Gating’ produces the ‘Updated Attention Candidates’, and ‘Masking’ produces the ‘Masked Attention Candidates’ (certain tokens are masked and receive zero attention weight); both correspond to their respective lines in Algorithm 1, as does the final ‘Layer Norm’ after the Transformer. The total number of hops, i.e., the total number of iterations, varies across videos. Please refer to Section 4 for details.
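As a rough reading aid for this figure, the control flow of the depicted loop can be paraphrased in PyTorch-like code. The module names, shapes and masking rule below are assumptions made for the sketch; they simplify the actual MHT (e.g., the helper information and feature-based gating are omitted) and do not reproduce the released implementation.

```python
import torch
import torch.nn as nn

class MultiHopSketch(nn.Module):
    """Schematic multi-hop loop: at each hop, the video query attends over the
    (masked) source sequence, the most-attended token is recorded, and that
    token is masked out so later hops must attend elsewhere."""

    def __init__(self, d_model=256, nhead=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, video_query, source, num_hops=5):
        # video_query: (B, 1, d); source: (B, T, d), e.g., frame + object tokens
        mask = torch.zeros(source.shape[:2], dtype=torch.bool, device=source.device)
        attended = []
        for _ in range(num_hops):
            out, weights = self.attn(video_query, source, source,
                                     key_padding_mask=mask)      # weights averaged over heads
            video_query = self.norm(video_query + out)
            top = weights.mean(dim=1).argmax(dim=-1)             # most-attended source token
            attended.append(top)
            mask = mask.clone()
            mask[torch.arange(source.size(0)), top] = True       # hide it for later hops
        return video_query, attended

sketch = MultiHopSketch()
final_query, hop_tokens = sketch(torch.randn(2, 1, 256), torch.randn(2, 10, 256))
```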