Answering natural questions about a video requires the capability to purposefully reason about what we see in a dynamic scene. Modern neural networks promise a scalable approach to train such reasoning systems directly from examples in the form (video, question, answer). However, the networks’ high degree of trainability leads to an undesirable behavior: they tend to exploit shallow patterns, creating shortcuts through surface statistics instead of performing true systematic reasoning. Stretching this methodology leads to data inefficiency, poor deliberative reasoning, and limited systematic generalization to novel questions and scenes [greff2020binding].
Humans take a different approach. From an early age, we identify objects as the core “living” constructs that lend themselves naturally to high-level whole-scene reasoning about object compositions, temporal dynamics, and interactions across space-time [spelke2007core]. Objects admit spatio-temporal principles of cohesion, continuity in space-time, and local interaction, allowing humans to infer the past and predict the future without relying too heavily on constant sensing. Cognitively, objects offer a basis for important abstract concepts such as sets and numbers, which can be symbolically manipulated without grounding to sensory modalities, and for concrete concepts such as spatial layout and affordances. Mathematically, objects offer a decompositional scheme to disentangle the complexity of the world, giving rise to modularity through encapsulation and separating sensory patterns from functions, thus enabling high-order transfer learning across completely separate domains.
Inspired by these capabilities, we advocate for new paths to train neural networks for video question answering (Video QA) via object-centric video representation [desta2018object]. Here, objects in video are primary constructs that have unique, evolving lives throughout space-time. In addition to the usual visual parts, moving objects have temporal parts [hawley2020temporal], which are essential to understanding their evolution and contextualized interaction with other objects. More concretely, we first extract from a video a set of object tubelets using recent deep neural nets for object detection and tracking. Tubelets are then partitioned into short sub-tubelets, each of which corresponds to a brief period of an object’s life with small changes in appearance and position. This allows object representations to be temporally summarized and refined, keeping only information relevant to the query. Objects living in the same period form context-specific relationships, which are conditionally inferred from the scene under the guidance of the query. The objects and their relationships are represented as a graph, one per period. As the objects change throughout the video, their relationships also evolve accordingly. Thus the video is abstracted as a query-conditioned evolving object graph.
The object graphs are then parameterized as a sequence of deep graph convolutional networks (DGCNs) [Kipf2017SemiSupervisedCW]. The DGCNs refine the objects’ temporal part representations within each short period. These representations serve as input to a BiLSTM, which sequentially connects different temporal parts of the same object. Thus the first and the last states of the BiLSTM effectively encode the lifetime information of an object in the context of others, i.e., a contextualized résumé. At this stage, the video is abstracted into a set of résumés, which are then reasoned about given a linguistic query using any general-purpose relational reasoning engine.
The system is evaluated on three major Video QA datasets: MSVD-QA [xu2017video], consisting of 50K questions over 2K videos; MSRVTT-QA [Xu2016MSRVTTAL], with 243K questions over 10K videos; and SVQA [emrvqasongMM18], with 120K questions over 12K videos of moving objects. These datasets are suitable for testing reasoning capability against complex compositional questions about spatio-temporal relationships among objects. Our results establish new state-of-the-art accuracies on all three datasets.
To summarize, we make the following contributions: (a) proposal of a new object-centric representation for Video QA, (b) introduction of a new dynamic graph neural architecture that enables learning to reason over multiple objects, their relations and events, and (c) establishing new state-of-the-art results on three common Video QA benchmarks of MSVD-QA, MSRVTT-QA, and SVQA.
II Related Work
Question answering in its fullest form requires a strong reasoning capability for deliberately manipulating previously acquired knowledge to form the answer to a given question [bottou2014machine]. While high-level reasoning in humans seems to involve symbols and logic [marcus2018algebraic], the computational substrate is largely neuronal, suggesting that there exists an intermediate form of the reasoning process that is not yet formally logical, but still powerful [bottou2014machine]. Neural networks can be a good computational model for reasoning [greff2020binding, le2020dynamic], but networks need to be dynamic and deliberative as driven by the question, stepping away from simple associative responses. The dynamics can be manifested in many ways: by rearranging computational units on-the-fly [hu2017learning], by constructing relations as they emerge and linking them to the answer [le2020dynamic], or by dynamically building neural programs from memories [le2020neural].
Visual reasoning presents a great challenge because the formal knowledge needed for reasoning has not been acquired in advance, but must be inferred from low-level pixels. Adding to this challenge, visual question answering typically involves high-level symbolic cues presented in the form of linguistic questions, thus requiring automatic binding between linguistic symbols and visual concepts [hu2017learning, le2020dynamic]. A powerful way to bridge this semantic gap is through object-centric representation as an abstraction over pixels [greff2020binding, le2020dynamic]. Each object can package instance features such as spatio-temporal information, shape, and color, helping disentangle complex scenes into manageable units. Learning objects and relations has been demonstrated to be essential for symbolic representations [garnelo2019reconciling]. A popular way to construct objects is through semantic segmentation or object detection. Extending to video requires tracking algorithms, which exploit the temporal permanence of objects to form tubelets [kalogeiton2017action].
In recent years, video question answering (Video QA) has become a major playground for spatio-temporal representation and visual-linguistic integration. A popular approach treats video as a sequential data structure to be queried upon. Video is typically processed using classic techniques such as RNNs [zhu2017uncovering] or 3D CNNs [qiu2017learning], augmented by attention mechanisms [ye2017video], or manipulated in a memory network [kim2017deepstory]. Stepping away from this flat structure, video can be abstracted as a hierarchy of temporal relations [le2020hierarchical, le2020neural-reason], which can either directly generate answers given the query, or serve as a representation scheme to be reasoned about by a generic reasoning engine [le2020neural-reason]. More recently, objects have been suggested as the core component of Video QA [yi2019clevrer], as they offer clean structural semantics compared to whole-frame unstructured features. Our work pushes along this line of object-centric representation for Video QA, built on the premise that object interactions are local, and that object representations should depend on other objects in context as well as being guided by the query. In particular, objects are tracked into tubelets so that their lifelines are well defined. Their attributes and relations along these lifelines are constantly updated in the dynamic graph of their social circles of co-occurring fellow objects.
The Video QA task seeks to build a conditional model to infer an answer a* from an answer set A, given a video V and a natural linguistic query q:

a* = argmax_{a ∈ A} P_θ(a | V, q),  (1)

where θ denotes the model parameters.
The challenges stem from (a) the long-range dependencies within objects across time and between objects across space in the video; (b) the arbitrary expression of the linguistic question; and (c) the spatio-temporal reasoning over the visual domain as guided by the linguistic semantics. Fig. 2 gives some typical examples that illustrate the major problems in the Video QA task. In what follows, we present our object-oriented solution to tackle these challenges.
III-A Model Overview
Fig. 3 illustrates the overall architecture of our model. Our main contribution is the object-centric video representation for Video QA, in which a video is abstracted as a dynamic set of interacting objects that live in space-time. The interactions between objects are interpreted in the contextual information given by the query (see Sec. III-B). The output of the object-centric video representation is a set of object résumés. These résumés later serve as a knowledge base for a general-purpose relational reasoning engine to extract the visual information relevant to the question. In the reasoning process, it is desirable that query-specific object relations are discovered and query words are bound to objects in a deliberative manner. The reasoning module can be generic, as long as it can handle natural querying over the set of résumés. At the end of our model, an answer decoder takes as input the output of the reasoning engine and the question representation, and outputs a vector of probabilities across words in the vocabulary for answer prediction.
Linguistic question representation
We make use of a BiLSTM with GloVe word embeddings [pennington2014glove] to represent a given question. In particular, each word in the question is first embedded into a vector of 300 dimensions. A BiLSTM running on top of this vector sequence produces a sequence of state pairs, one for the forward pass and the other for the backward pass. Each state pair is concatenated into a contextual word representation e_s ∈ R^d, for s = 1, …, S, where d is the vector length and S is the length of the question. The global sequential representation q̄ is obtained by combining the two end states of the BiLSTM. Finally, we integrate q̄ with the contextual words by an attention mechanism to output the final question representation:

q = Σ_{s=1}^{S} softmax_s(w^T (e_s ⊙ q̄)) e_s,

where w denotes learnable network weights.
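As a concrete sketch, this attention pooling can be written in a few lines of NumPy. The scoring form, variable names, and toy dimensions below are illustrative assumptions, not the exact implementation:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def question_representation(E, q_bar, w):
    """Attention-pool contextual word states E (S x d) with the
    global BiLSTM summary q_bar (d,) into one question vector.
    w (d,) plays the role of the learnable scoring weights."""
    scores = E @ (w * q_bar)      # one scalar score per word
    alpha = softmax(scores)       # attention over the S words
    return alpha @ E              # weighted sum of word states

# toy check: a 4-word question with 6-dim contextual states
rng = np.random.default_rng(0)
E = rng.standard_normal((4, 6))
q_bar = rng.standard_normal(6)
w = rng.standard_normal(6)
q = question_representation(E, q_bar, w)
assert q.shape == (6,)
```

The pooled vector stays in the same space as the word states, which is what lets it be compared against visual features later.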
III-B Object-centric Video Representation
We represent an object’s life based on its tubelet, the 3D structure of the object in space-time. At each time step, an object is a 2D bounding box with a unique identity assigned by object tracking (Sec. III-B1). We obtain tubelets by simply linking bounding boxes of the same identity throughout the given video. Viewing a video as a composition of events, the interactions between objects often happen within a short period of time. Hence, we break the object tubelets into sub-tubelets, where each sub-tubelet is equivalent to a temporal part [hawley2020temporal]. As we model the interactions between objects in the context given by the query, we then summarize each object-wise temporal part into a query-conditioned vector representation (Sec. III-B2). The representations of temporal parts of objects are then treated as nodes of a query-conditioned evolving object graph, where the edges denote the relationships of objects living in the same temporal part. In order to obtain the full representations of object lives through the video, we link object representations in consideration of their interference with neighboring objects across temporal parts. Finally, each object life is summarized into a résumé, preparing the object system for relational reasoning. See Fig. 4 for an illustration of our object-centric video representation in a scene of three objects.
III-B1 Constructing Object Tubelets
Detecting and tracking objects
We use Faster R-CNN [ren2015faster] to detect frame-wise objects in a given video. This returns, for each object, (a) appearance features (representing “what”), (b) a bounding box (“where”), and (c) a confidence score. We use the popular object tracking framework DeepSort [wojke2017simple], which makes use of the appearance features and confidence scores of detected bounding boxes to track objects, assigning a unique ID to bounding boxes of the same object. For ease of implementation, we assume that objects live from the beginning to the end of the video, and objects missing at a given time step are marked with null values. Each object is now a tubelet, consisting of a unique ID, a sequence of bounding box coordinates through time, and a set of corresponding appearance features.
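The linking step can be sketched as follows; the data layout (a dict of per-frame detections) and the helper name are hypothetical, but the null-padding convention matches the description above:

```python
def build_tubelets(detections, num_frames):
    """Group per-frame detections into tubelets keyed by track ID.
    detections: dict frame_idx -> list of (track_id, box, feature).
    Frames where a track is absent get None, so every tubelet
    spans the whole video."""
    ids = {tid for dets in detections.values() for tid, _, _ in dets}
    tubelets = {tid: [None] * num_frames for tid in ids}
    for t, dets in detections.items():
        for tid, box, feat in dets:
            tubelets[tid][t] = (box, feat)
    return tubelets

# toy example: 2 tracks over 3 frames; track 2 is missed at frame 1
dets = {0: [(1, (0, 0, 10, 10), "f10"), (2, (5, 5, 8, 8), "f20")],
        1: [(1, (1, 0, 11, 10), "f11")],
        2: [(1, (2, 0, 12, 10), "f12"), (2, (6, 5, 9, 8), "f22")]}
tubes = build_tubelets(dets, 3)
assert tubes[2][1] is None       # missing detection marked as null
assert len(tubes[1]) == 3        # tubelet spans the full video
```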
Joint encoding of “what” and “where”
Positions are of critical importance in reasoning about the spatial arrangement of objects at any given time [zhuang2017towards, wang2019neighbourhood]. Therefore, to represent the geometric information of each object tubelet at a time step, we incorporate a 7-dimensional spatial vector of the bounding box, consisting of its four normalized coordinates, its normalized width and height, and its relative area w.r.t. the whole video frame, i.e., s = [x1/W, y1/H, x2/W, y2/H, w/W, h/H, wh/(WH)], where w and h are the width and height of the box, and W and H are the width and height of the video frame, respectively.
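The 7-dimensional spatial vector can be computed as below; the (x1, y1, x2, y2) corner convention is an assumption:

```python
def spatial_vector(box, frame_w, frame_h):
    """7-d geometry feature for a box (x1, y1, x2, y2): normalized
    corners, normalized width/height, and relative area."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    return [x1 / frame_w, y1 / frame_h, x2 / frame_w, y2 / frame_h,
            w / frame_w, h / frame_h, (w * h) / (frame_w * frame_h)]

s = spatial_vector((32, 24, 96, 72), frame_w=128, frame_h=96)
assert len(s) == 7
assert abs(s[6] - 0.25) < 1e-9   # box covers a quarter of the frame
```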
Considering a brief period of time in which an action takes place, the appearance feature of an object may change very little while the change in its position is more noticeable. For example, given a video where a person is walking towards a car, it is likely that the changes in position are more informative than those in appearance. Hence, we propose to use a multiplicative gating mechanism to encode the position-specific appearance of objects:

x_t = f(a_t) ⊙ g(s_t),  (3)

where f is a non-linear mapping function and g is a position gating function that controls the flow of appearance information, making the representations of objects at different time steps more discriminative of each other. We choose f(a_t) = ELU(W_a a_t) and g(s_t) = σ(W_s s_t) in our implementation, where W_a and W_s are network parameters.
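A minimal NumPy sketch of this gating, assuming an ELU appearance mapping and a sigmoid position gate; the weight shapes and feature sizes are illustrative:

```python
import numpy as np

def elu(z):
    return np.where(z > 0, z, np.exp(np.minimum(z, 0.0)) - 1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def position_gated_feature(a, s, Wa, Ws):
    """Gate the mapped appearance a by a sigmoid of the 7-d spatial
    vector s, so position modulates which appearance information flows."""
    return elu(Wa @ a) * sigmoid(Ws @ s)

rng = np.random.default_rng(1)
a = rng.standard_normal(16)           # appearance feature
s = rng.standard_normal(7)            # 7-d spatial vector
Wa = rng.standard_normal((8, 16))
Ws = rng.standard_normal((8, 7))
x = position_gated_feature(a, s, Wa, Ws)
assert x.shape == (8,)
```

Because the gate lies in (0, 1), two frames with identical appearance but different positions receive different representations, which is the point of the mechanism.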
In addition, to compensate for the loss caused by imperfect object detection, we incorporate a so-called contextual object for each video frame. In particular, we utilize a pre-trained ResNet [He2016DeepRL] and take its pool5 output features as the contextual object feature at each time step.
III-B2 Language-conditioned Representation of Temporal Parts
Once the position-specific appearance of objects is computed, we reduce complexity by grouping several frames into a short clip, creating a language-conditioned temporal part that takes the question into account. The goal is to make the parts question-specific, readying them for subsequent relational reasoning.
We split an object’s life into K equal temporal parts, where each part contains T frames. Let X_k = {x_{k,t}}_{t=1}^{T}, where x_{k,t} is the position-conditioned feature, computed in Eq. (3), of the t-th frame in the k-th part. Given the question representation q, we then apply a temporal attention mechanism to compute the probability with which a frame in a part is attended:

α_{k,t} = softmax_t(w^T (W_x x_{k,t} ⊙ W_q q)),  (4)

where ⊙ is the Hadamard product and w, W_x, W_q are learnable weights. The temporal part of an object is then summarized as:

v_k = Σ_{t=1}^{T} α_{k,t} m_{k,t} x_{k,t},  (5)

where m_k is a binary mask vector to exclude missed detections of objects; m_k is calculated from the indices of the time steps marked with null values, as in Sec. III-B1.
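The temporal attention and masked summarization above can be sketched as follows; the single-vector scoring weights and toy dimensions are assumptions:

```python
import numpy as np

def masked_softmax(scores, mask):
    """Softmax over positions where mask is True; zero elsewhere."""
    scores = np.where(mask, scores, -np.inf)
    e = np.exp(scores - scores[mask].max())
    e = np.where(mask, e, 0.0)
    return e / e.sum()

def summarize_part(X, q, w, mask):
    """X: (T, d) frame features of one temporal part; q: (d,) question
    vector; w: (d,) scoring weights; mask: (T,) False for null frames."""
    alpha = masked_softmax(X @ (w * q), mask)
    return alpha @ X              # attention-weighted part summary

rng = np.random.default_rng(2)
X = rng.standard_normal((5, 6))
q = rng.standard_normal(6)
w = rng.standard_normal(6)
mask = np.array([True, True, False, True, True])  # frame 2 is a miss
v = summarize_part(X, q, w, mask)
assert v.shape == (6,)
```

Masking before the softmax guarantees that null frames receive exactly zero attention rather than a small residual weight.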
III-B3 Query-conditioned Object Graph
Imagine that we have sequences of objects living together in a period of time (a temporal part); their living paths are not only constrained by their own behavior at different points in time but also influenced by neighboring objects living nearby. Moreover, in the context of Video QA, relationships between objects are understood in the context given by the query. Note that temporal object parts living in the same period can be highly correlated, depending on the spatial distance between objects and the semantics of the query. We represent the objects in a temporal part and their relationships by a graph G_k = (V_k, A_k), where V_k is the set of nodes and A_k is the adjacency matrix. The adjacency matrix is query-dependent and is given by:

A_k(i, j) = a_{k,i} a_{k,j},  with  a_{k,i} = softmax_i(w^T (W_v v_{k,i} ⊙ W_q q)),

where v_{k,i} is the representation of part k of object i, as computed in Eq. (5), and a_{k,i} is the probability with which object i is attended.
Given the adjacency matrix A_k, we use graph convolutional networks (GCNs) [Kipf2017SemiSupervisedCW] to correlate objects and systematically refine all the object representations by considering their connections with other objects in the same temporal part. We stack GCNs into multiple layers with skip-connections between them, allowing the refinement to run through multiple rounds. To mitigate the effects of imperfect object detection, we use the contextual features of video frames, as explained in Sec. III-B1, as an additional node of graph G_k. Different from the original bidirectional graph in [Kipf2017SemiSupervisedCW], our temporal-part-specific object graph is a heterogeneous graph, where connections between objects are bidirectional while connections between the contextual node and other nodes are directed. Let

H^(0) = [v_{k,1}, v_{k,2}, …, v_{k,N}, c_k]  (8)

be the initial node embedding matrix, where c_k is the contextual feature of part k. We refine the representations of nodes in each graph as follows:
H^(l) = σ(A_k H^(l-1) W^(l)),  for l = 1, …, L_g,

where L_g is the number of GCN layers and σ is a nonlinear transformation, chosen to be ELU in our implementation. Skip-connections are then added between GCN layers to perform multiple rounds of representation refinement:

H^(l) ← H^(l) + H^(l-1).

Finally, the contextualized part representations are taken from the last layer, V'_k = H^(L_g).
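To make the graph refinement concrete, here is a NumPy sketch combining a query-dependent adjacency with stacked GCN layers and skip-connections. The adjacency form and the ELU nonlinearity follow the description above, while weight shapes and names are assumptions:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def elu(z):
    return np.where(z > 0, z, np.exp(np.minimum(z, 0.0)) - 1)

def gcn_refine(V, q, w, Ws):
    """V: (N, d) temporal-part features of N objects; q: (d,) question
    vector; w: (d,) attention weights; Ws: list of (d, d) GCN weights."""
    a = softmax(V @ (w * q))      # query-conditioned object attention
    A = np.outer(a, a)            # query-dependent adjacency matrix
    H = V
    for W in Ws:                  # stacked GCN layers
        H = elu(A @ H @ W) + H    # propagate, then skip-connection
    return H

rng = np.random.default_rng(4)
V = rng.standard_normal((4, 6))
q = rng.standard_normal(6)
w = rng.standard_normal(6)
Ws = [0.1 * rng.standard_normal((6, 6)) for _ in range(3)]
H = gcn_refine(V, q, w, Ws)
assert H.shape == (4, 6)
```

The skip-connection keeps each node’s own features intact while successive rounds mix in information from attended neighbors.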
III-B4 Video as Evolving Object Graphs
An object is a sequence of temporal parts whose representations are contextualized as in Sec. III-B3. Hence the video is abstracted as a spatio-temporal graph whose spatial and temporal dependencies are conditioned on the query, as illustrated in Fig. 4. Let {v'_{i,k}}_{k=1}^{K} be the contextualized part representations of object i. Temporal parts are then connected through a BiLSTM:

h^fw_k = LSTM_fw(v'_{i,k}, h^fw_{k-1}),  (10)
h^bw_k = LSTM_bw(v'_{i,k}, h^bw_{k+1}),  (11)

with the first states of the forward and backward passes initialized accordingly.
The dynamic object graph represents the scene evolution well, but it remains open how to efficiently perform arbitrary reasoning, given the free-form expression of the linguistic query. For example, given a scene of multiple objects of different shapes, we may want to answer the question “Is there a cylinder that starts rotating before the green cylinder?”. For this we need to search for the matching cylinders, traverse the graph to identify the event of rotating, then follow the cylinder trajectories before the event. In the process of matching, we need to find the right visual colored cylinder that agrees with the linguistic words “cylinder”, “green cylinder” and “rotating”. For large temporal graphs, learning to perform these discrete graph operations poses a great challenge.
To mitigate this potential complexity, we propose to integrate out the temporal dimension of the dynamic object graph, creating an unordered set of object résumés. This method of temporal integration has been empirically shown to be useful in Video QA [le2020neural-reason]. In particular, we compute a résumé for each object i by summarizing its lifetime as r_i = [h^fw_K; h^bw_1], where h^fw_K and h^bw_1 are the end states of the BiLSTM in Eqs. (10, 11). This representation implicitly codes the appearance, geometric information, and temporal relations of an object in space-time, as well as the part-wise relations between objects living in the same spatio-temporal context.
III-C Relational Reasoning
We now have two sets, one consisting of the contextualized visual résumés and the other consisting of the query representation. Reasoning, as defined in Eq. (1), amounts to the process of constructing the interactions between these two sets (e.g., see [hudson2018compositional]). More precisely, it is the process of manipulating the representation of the visual object set, as guided by the word set. An important manipulation is to dynamically construct language-guided predicates of object relations, and to chain these predicates in a correct order so that the answer emerges. It is worth emphasizing that our object-centric video representation can combine with a wide range of reasoning models. Examples of generic reasoning engines are the recently introduced MACNet (Memory, Attention, and Composition Network) [hudson2018compositional] and LOGNet (Language-binding Object Graph Network) [le2020dynamic]. Although these models were originally devised for static images, their generality allows adaptation to Video QA through object résumés.
The application of generic iterative relational reasoning engines to object-oriented Video QA somewhat resembles the dual-process idea proposed in [le2020neural-reason]: it disentangles the visual QA system into two sub-processes, one for domain-specific modeling and the other for deliberative general-purpose reasoning.
III-D Answer Decoder
We follow prior works in Video QA [le2020neural-reason, fan2019heterogeneous, le2020hierarchical] to design an answer decoder consisting of two fully connected layers followed by a softmax function that produces label probabilities for prediction. As we treat Video QA as multi-class classification, we use the cross-entropy loss to train the model.
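A minimal NumPy sketch of such a decoder; the ReLU hidden layer and the toy dimensions are assumptions:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def answer_decoder(x, W1, b1, W2, b2):
    """Two fully connected layers + softmax over the answer vocabulary."""
    h = np.maximum(0.0, W1 @ x + b1)   # hidden layer (ReLU assumed)
    return softmax(W2 @ h + b2)

def cross_entropy(p, label):
    """Training loss for the ground-truth answer index."""
    return -np.log(p[label])

rng = np.random.default_rng(3)
x = rng.standard_normal(12)            # reasoning output + question rep.
W1, b1 = rng.standard_normal((8, 12)), np.zeros(8)
W2, b2 = rng.standard_normal((5, 8)), np.zeros(5)  # 5 candidate answers
p = answer_decoder(x, W1, b1, W2, b2)
assert abs(p.sum() - 1.0) < 1e-9 and p.shape == (5,)
```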
We evaluate our proposed architecture on three recent benchmarks, namely MSVD-QA, MSRVTT-QA [xu2017video, Xu2016MSRVTTAL] and SVQA [emrvqasongMM18].
MSVD-QA contains 1,970 real-world video clips and 50,505 QA pairs of five question types: what, who, how, when, and where, of which 61% of the QA pairs are used for training, 13% for validation and 26% for testing.
MSRVTT-QA consists of 243K QA pairs annotated over 10K real videos. Similar to MSVD-QA, the questions fall into five types: what, who, how, when, and where. The proportions of videos in the training, testing and validation splits are 65%, 30% and 5%, respectively.
SVQA is designed specifically for multi-step reasoning in Video QA, similar to CLEVR [johnson2017clevr], the well-known visual reasoning benchmark on static images. It contains 12K short synthetic videos and 120K machine-generated compositional and logical question-answer pairs covering five categories: attribute comparison, count, query, integer comparison and exist. Each question is associated with a question program. SVQA helps mitigate several limitations of current Video QA datasets, such as language bias and the lack of compositional logic structure. It therefore serves as an excellent testbed for multi-step reasoning in space-time. We follow prior work [le2020neural-reason] and use 70% of the videos for training, 20% for testing, and the remaining 10% for validation.
IV-B Implementation Details
Unless otherwise stated, each video is segmented into 10 clips of 16 consecutive frames. Faster R-CNN (https://github.com/airsplay/py-bottom-up-attention) [ren2015faster] is used to detect frame-wise objects. We take 40 tubelets per video for MSVD-QA and MSRVTT-QA, and 30 for SVQA, based on empirical experience. Regarding language processing, we embed each word in a given question into a vector of 300 dimensions, initialized with pre-trained GloVe [pennington2014glove]. The default configuration of our model uses six GCN layers for the query-conditioned object graph (see Sec. III-B3) and a common feature dimension in all sub-networks. Regarding the general relational reasoning engine, we use 12 reasoning steps for both MACNet and LOGNet, with other parameters at their defaults as in [le2020dynamic, hudson2018compositional]. We refer to our models as OCRL+LOGNet and OCRL+MACNet, respectively, where OCRL stands for Object-Centric Representation Learning. Unless explicitly mentioned, we report results of OCRL+LOGNet as it generally performs favorably over OCRL+MACNet.
We train our models with the Adam optimizer and a batch size of 64. To be compatible with related works [le2020neural-reason, emrvqasongMM18, le2020hierarchical], we use accuracy as the evaluation metric for all tasks.
IV-C Comparison against SOTAs
We compare our proposed method against recent state-of-the-art models on each dataset. MSVD-QA and MSRVTT-QA datasets: Results on the MSVD-QA and MSRVTT-QA datasets are presented in Table I. As can be seen, our OCRL+LOGNet model consistently surpasses all SOTA methods. In particular, our model significantly outperforms the most advanced model, HCRN [le2020hierarchical], by 2.1 absolute points on MSVD-QA, while it slightly advances the accuracy on MSRVTT-QA from 35.6 to 36.0. Note that MSRVTT-QA is considerably bigger than MSVD-QA and contains long videos with complex relations between objects, so it might need more tubelets per video than the current implementation of 40. As for the contribution of each model component, we provide further ablation studies on the MSVD-QA dataset in Sec. IV-E.
SVQA dataset: Table II shows the comparison of our model against the SOTA methods. Our proposed network significantly outperforms all recent SOTA models by a large margin (3.7 points). Specifically, we achieve the best performance in the majority of sub-tasks (11/15), which account for 91% of all question-answer pairs, and slightly underperform in the other four categories. We speculate that this is due to the advantages of CRN+MACNet [le2020neural-reason] in explicitly encoding high-order temporal relations. Nevertheless, the results clearly demonstrate the necessity of an object-centric approach in handling such a challenging problem as Video QA.
IV-D Object-Centric vs. Grid Representation
In order to understand the effects of object-centric representation learning (OCRL) in comparison with the popular grid representation used by [le2020neural-reason, le2020hierarchical], we use MACNet [hudson2018compositional] as the relational reasoning engine for a fair comparison. Fig. 5 presents comparison results on the MSVD-QA, MSRVTT-QA, and SVQA datasets between OCRL and the non-object feature baselines in [le2020neural-reason, le2020hierarchical]. For the first two real-world datasets, the results confirm that the object-oriented representation makes it considerably easier to arrive at correct answers than the grid representation counterpart. In particular, the object-centric representation outperforms the SOTA non-object methods on both MSVD-QA and MSRVTT-QA (by 1.8 and 0.2 points, respectively). On SVQA, our proposed OCRL model surpasses the grid representation by roughly 2 points. The empirical results on SVQA specifically reveal the strong need for proper object-centric representation in solving multi-step reasoning.
IV-E Ablation Study
We conduct ablation studies on the MSVD-QA dataset to justify the contributions of different components of our model. Empirically, we have noticed that the contribution of each component is more noticeable when decreasing the number of reasoning steps of the relational module. In particular, we set the number of reasoning steps of LOGNet to 2 in all ablated experiments. The results in Table III reveal that each design component improves the overall performance of the model. The effects are detailed as follows:
Position-specific mechanism: We study the effect of the gating mechanism in Eq. (3). In particular, we replace this module with a simple concatenation. As can be seen, adding the gating mechanism leads to an improvement of 1.1 absolute points. The results convincingly demonstrate the effect of our position-gated mechanism compared to the common approach of simply combining appearance and spatial features [le2020dynamic, wang2019neighbourhood].
Language-conditioned representation of temporal parts: We study the effect of language on representing temporal parts, as in Sec. III-B2. In particular, we replace the temporal attention in Eq. (4) with a simple mean-pooling operation. We experience a significant drop in performance, from 37.6% to 35.3%, when disregarding the language-conditioned representation. We conjecture that question-specific representations of object lives make it easier for the later relational reasoning to arrive at correct answers.
Query-conditioned object graph: We conduct a series of experiments, going from shallow to very deep GCNs, to study the effect of refining temporal part representations based on the information carried by surrounding objects. Empirical results suggest that performance gradually improves as the depth of the GCN increases, and that 6 layers are sufficient on the MSVD-QA dataset.
Object graphs summary: We verify the effect of the sequential modeling of temporal parts as described in Sec. III-B4. The simplest way of computing the summary of object graphs is an average-pooling operator over all elements in the sequence, which totally ignores the temporal dependencies between the elements. Results in Table III show that connecting the chain of object temporal parts with a sequence model such as a BiLSTM is beneficial. Compared to the clip-based relation network [le2020neural-reason], the BiLSTM in our model is less computationally expensive and easier to implement.
Contextual object: Finally, we investigate the effect of the contextual feature in Eq. (8). The performance of our model notably decreases from 37.6 to 35.8 when this special object is removed. This result once again confirms the importance of the contextual object in compensating for the loss of information caused by imperfect object detection.
Table III: Ablation results on the MSVD-QA dataset.

| Model variant | Test accuracy (%) |
| --- | --- |
| Default config. (*) | 37.6 |
| w/o Gating mechanism | 36.5 |
| Language-conditioned rep. of temporal parts | |
| w/o Temporal attention | 35.3 |
| Interaction graph of temporal parts | |
| w/ 1 GCN layer | 36.9 |
| w/ 4 GCN layers | 37.3 |
| w/ 8 GCN layers | 37.1 |
| Object graphs summary | |
| w/o Contextual Obj. | 35.8 |
We have introduced a novel neural architecture for object-centric representation learning in video question answering. Object-centric representation adds modularity and compositionality to neural networks, offering a structural alternative to the default vectorial representation. This is especially important for complex video scenes with multi-object dynamic interactions. More specifically, our representation framework abstracts a video as an evolving relational graph of objects, whose nodes and edges are conditionally inferred. We also introduce the concept of a résumé that summarizes the life of an object over the entire video. This allows seamless plug-and-play with existing reasoning engines that operate on a set of items in response to natural questions. The whole object-centric system is supported by a new dynamic graph neural network, which learns to refine object representations given the query and the context defined by other objects and the global scene. Our architecture establishes new state-of-the-art results on MSVD-QA, MSRVTT-QA, and SVQA, three well-known video datasets designed for complex compositional questions and relational, spatial and temporal reasoning.