Consistent Multiple Sequence Decoding

by   Bicheng Xu, et al.
The University of British Columbia

Sequence decoding is one of the core components of most visual-lingual models. However, typical neural decoders, when faced with decoding multiple, possibly correlated, sequences of tokens, resort to simple independent decoding schemes. In this paper, we introduce a consistent multiple sequence decoding architecture which, while relatively simple, is general and allows for consistent and simultaneous decoding of an arbitrary number of sequences. Our formulation utilizes a consistency fusion mechanism, implemented using message passing in a Graph Neural Network (GNN), to aggregate context from related decoders. This context is then utilized as a secondary input, in addition to the previously generated output, to make a prediction at a given step of decoding. Self-attention in the GNN is used to modulate the fusion mechanism locally at each node and each step in the decoding process. We show the efficacy of our consistent multiple sequence decoder on the task of dense relational image captioning and illustrate state-of-the-art performance (+5.2% in mAP) on the task. More importantly, we illustrate that the decoded sentences for the same regions are more consistent (improvement of 9.5%), while sentences across images and regions maintain diversity.



1 Introduction

Sequence decoding has emerged as one of the fundamental building blocks for a large variety of computer vision problems. For example, it is a critical component in a range of visual-lingual architectures, for tasks such as image captioning [18, 24, 31, 35] and question answering [2, 3, 22], as well as in generative models that tackle trajectory prediction or forecasting [1, 14, 17, 33]. Most existing methods assume a single sequence and implement neural decoding using recurrent architectures, e.g., LSTMs or GRUs; recent variants include models like BERT [6]. However, in many scenarios, more than one sequence needs to be decoded at the same time. Common examples include trajectory forecasting in team sports [7, 32, 38] or autonomous driving [27, 30], where the trajectories of multiple agents (players/cars) need to be predicted and the behavior of one agent may closely depend on the others.

Similarly, in dense image captioning [10, 37, 39], or dense relational captioning [11], multiple linguistic captions need to be decoded for a set of possibly overlapping regions, with the goal of describing the regions themselves [10] or relationships among them [11]. In such scenarios, a trivial (and most common) solution is to decode each sequence independently of the others. While simple, this loses the important context among the various sequences being processed.

Consider, for example, dense image relational captioning, which is the main focus of this paper. As defined in [11], the goal of the task is to generate captions denoting relationships between pairs of objects in an image. This means that, in general, multiple captions will refer to the same object region as either the subject or object of the relation that needs to be captioned. If processed independently, each decoder will interpret the same image region and combine it with a language model to produce a caption. This may result in different proper names (possibly synonyms or worse) being used in each decoder, e.g., “desk (subject) standing on the floor”, “lamp standing on the furniture (object)”, “table (subject) is next to a chair” when referring to the exact same image region containing the table. This seems unnatural and problematic, as a given person typically would have a preferred expression for an object and would use it consistently; while different people may use different terms or names to refer to a given object, any one individual would typically be self-consistent and not switch naming convention. In other words, a given individual would use “table” or “desk” in all three aforementioned captions, but not switch between these phrases. A more drastic example is when one sequence is directly correlated with another (e.g., a player following an opponent; prediction of temperature in close geographic regions).

Addressing this challenge in a multi-sequence decoder implies coupling the various decoders so that they can communicate and generate sequences consistently. Specifically, we propose a model in which, at each time step during decoding, all decoders are allowed to communicate among themselves using a graph neural network to arrive at the context required to generate the next output (word or 2D location). We note that, depending on the application, the communication may take the form of propagating the hidden state or the readout (output) from each decoder, which may involve sampling (choosing a word). We illustrate the performance of this architecture on both synthetic data and a real application of dense relational captioning, where we simultaneously generate each word, in each sequence, conditioned on the previous words of all sequences. We illustrate improvements in the quality of the captions and, more significantly, in their consistency.

Our contributions are as follows:

  • We propose a novel consistent multiple sequence decoder which can learn to incorporate consistent patterns among sequences, while decoding them.

  • The consistency fusion mechanism in the proposed decoder, implemented using message passing in a graph neural network, enables arbitrary topology of problem-dependent contextual information flow.

  • We validate our approach by producing new state-of-the-art results on the dense relational captioning task, with an improvement in mAP obtained using the consistent decoding mechanism alone.

  • We also introduce a new measurement to evaluate the consistency in decoded sequences (captions), and illustrate improvement on this measure.

We note that while our main focus is on dense image relational captioning, the formulation of our consistent multiple sequence decoder is general and can be used in any application and architecture which requires multi-sequence decoding.

2 Related Work

2.1 Multiple Sequence Decoding

Sequence decoding takes some conditioning information, such as images or videos, and decodes sequences from it. Applications range from image/video captioning [37, 29, 2, 25, 8] to dense image/video captioning [10, 37, 39, 15, 19, 28]. Image/video captioning takes an image or a short video clip as an input and automatically produces a caption describing the content. Dense captioning, instead of generating one caption for the whole image or video, is tasked with generating a caption for every region of interest (for images) or every time period of interest (for videos).

The task of dense image captioning was first proposed in [10], where, given an image, the model detects the objects in that image and then generates a caption for every detected object. Recently, Kim et al. [11] extended the task to generate a caption for every pair of detected objects, which is called dense relational captioning. These two tasks are both multiple sequence decoding tasks. However, all these works generate captions independently of one another and do not consider the potential correlations that exist among the sequences to be decoded. Similar to Kim et al. [11], we address the dense relational captioning task, but specifically focus on modeling correlations among language decoders using our proposed consistent multiple sequence decoding architecture.

2.2 Multiple Sequence Prediction

Sequence prediction takes state information from the past and predicts future states. The problem of consistent multi-sequence decoding is related to multi-agent state or trajectory prediction, which has been handled by contextual spatial pooling in SocialLSTM [1] or by a graph-structured variational recurrent neural network in [32]. SocialLSTM [1] predicts the future trajectories of people based on their past positions. It proposes the idea of social pooling, which incorporates the past positions of a person’s neighbors into that person’s trajectory prediction. Sun et al. [32] propose a graph-structured variational recurrent neural network model that learns to integrate temporal information with ambiguous visual information to forecast the future states of the world, which is represented as multiple sequences. The work in [21] proposes a non-autoregressive decoding procedure for deep generative models that can impute missing values for spatiotemporal sequences with long-range dependencies.

Different from these works, which are designed to operate specifically on 2D trajectory data, we propose a more flexible and effective multi-sequence consistent decoding architecture.

3 Multiple Sequence Decoder

Sequence decoding is usually performed by recurrent neural networks (RNNs). Traditionally, multiple sequence decoding decodes every sequence independently. During decoding, at every time step, the input to the recurrent unit is only the output representation from its own previous time step (see Figure 1), or the representation of a special token (e.g., the start or end of sequence token).

However, in general, the sequences may not be independent of one another and may exhibit a high degree of correlation. In such cases, the independent decoder will lose out on the across-sequence context and correlations, making predictions poor or inconsistent. Therefore, it is important to learn problem-dependent correlations among the decoders and to be able to integrate this contextual information effectively and continuously as the decoding takes place.

Our proposed consistent multiple sequence decoder, or consistent decoder for short, can learn and take advantage of the consistency/correlation patterns existing among sequences for better decoding results. To utilize these consistency patterns, we want each decoder to have a summary of the previous outputs from other decoders when generating its output. Therefore, during decoding, at every time step, rather than receiving one input, our consistent decoder gets two. The first is its own output representation from the previous time step, as in the traditional independent decoder. The second is a fused output representation from the previous time step of all other sequence decoders that it may be correlated with. We propose a novel graph-based consistency fusion mechanism (Section 3.1) for this output fusion. The proposed fusion mechanism has two core benefits: (1) it allows modeling of arbitrary dependencies among the decoders (i.e., it allows one to define chains of potential directed or undirected influences among the decoders, e.g., Decoder A can influence Decoder B and C and vice versa, Decoder C can influence Decoder D but not vice versa); and (2) a graph attention mechanism can guide the contribution of information at each decoding step, allowing the context to be dynamic over time. The overall structure of our consistent multiple sequence decoder is illustrated in Figure 2.

Figure 1: Independent Decoder. The illustration shows the hidden state, input, and output representations at consecutive time steps. In the independent decoder, the input at the current time step is the output representation from its own previous time step.

Figure 2: Consistent Multiple Sequence Decoder. Illustrated is an example of decoding three sequences simultaneously, indicated by blue, orange, and green colors. Shown are the hidden state, input, and output representations of the relevant recurrent unit at consecutive time steps, together with the consistency fusion mechanism, which acts on the output representations from all decoders. At every time step, our decoder receives two inputs. One is the output representation generated by itself at the previous time step; the other is the fused output representation from the consistency fusion mechanism. The two inputs are concatenated and fed into a recurrent unit.

3.1 Consistency Fusion Mechanism

Graph neural networks (GNNs) [12, 13, 34, 36] are powerful mechanisms for contextual feature refinement. As such, we use gated GNN [20], a type of graph neural networks, to implement the consistency fusion mechanism. In doing so, our goal is to propagate information among related decoders in order to arrive at a fused representation for each decoding step in each of the decoders.

During each step of decoding, this mechanism fuses the output representations of the previous time step from all decoders. A graph is built on these output representations. Every output representation is treated as a node in the graph, and the number of nodes is equal to the number of sequences/decoders. Formally, $G = (V, A)$, where $V$ is the set of vertices in the graph and $A$ is the adjacency matrix representing the dependencies between any pair of nodes/decoders. Message passing is used to refine these representations and the results are fed into the corresponding decoders as additional context.

Adjacency Matrix. The adjacency matrix in the gated GNN is predefined before sequence decoding, and the same adjacency matrix is used at every time step. We say that two sequences are correlated if they are believed to be somehow dependent. We set the element $A_{ij} = 1$ if two sequences $i$ and $j$ are correlated, and $A_{ij} = 0$ otherwise. Note that a reasonable adjacency matrix can often be estimated based on the constraints of the problem; e.g., for dense relational captioning, all decoders that take a certain region into account (or regions with sufficient IoU overlap) can be connected. In problems where no prior information is known, a complete adjacency matrix can be used.
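The adjacency rule described above can be illustrated with a minimal Python sketch. It assumes each caption/decoder is identified by a (subject box, object box) pair and that two decoders are connected when they share a region; the function name `build_adjacency` and the choice to leave the diagonal at zero are assumptions made for illustration, not details given in the text.

```python
from itertools import combinations


def build_adjacency(pairs):
    """Build a symmetric 0/1 adjacency matrix over decoders.

    pairs[i] is the (subject_box_id, object_box_id) pair captioned by decoder i.
    """
    n = len(pairs)
    adj = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        # Two captions are treated as correlated if they share a region.
        if set(pairs[i]) & set(pairs[j]):
            adj[i][j] = adj[j][i] = 1
    return adj


# Four relational captions over five hypothetical bounding boxes (ids 0..4).
pairs = [(0, 1), (1, 2), (3, 4), (0, 4)]
adjacency = build_adjacency(pairs)
for row in adjacency:
    print(row)
```

When no prior information is available, the same interface could return a matrix of ones off the diagonal, which corresponds to the complete graph mentioned in the text.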

Gated Graph Neural Network. We use gated GNN [20] to fuse the output representations at every time step based on the predefined adjacency matrix. This involves two steps in one iteration of feature update. The first step is AGGREGATE (message passing), and the second step is COMBINE (feature fusion). The AGGREGATE step can be formulated as follows:

$$a_v^{(k)} = \mathrm{AGGREGATE}^{(k)}\left(\{h_u^{(k-1)} : u \in \mathcal{N}(v)\}\right) = \sum_{u \in \mathcal{N}(v)} \alpha_u^{(k)} W h_u^{(k-1)},$$

where $a_v^{(k)}$ is the aggregated information received by node $v$ from all its neighbors $\mathcal{N}(v)$ during iteration $k$ of message passing. $h_u^{(k-1)}$ is a $d$-dimensional vector representing the feature of node $u$ before the message passing. $W$ is a learnable $d \times d$-dimensional graph kernel matrix and $\alpha_u^{(k)}$ is the adaptive self-attention which will be described later.

After obtaining $a_v^{(k)}$, the feature of node $v$ will be updated using the COMBINE operation,

$$h_v^{(k)} = \mathrm{COMBINE}^{(k)}\left(h_v^{(k-1)}, a_v^{(k)}\right).$$

We adopt the GRU gating mechanism [5] in the COMBINE operation as proposed in [20]. After $K$ iterations of AGGREGATE and COMBINE, we get $h_v^{(K)}$, the final representation of node $v$, which is our fused output representation.

Initialization. The initial state of each node is set to the output representation produced by the recurrent unit of the corresponding decoder.

Adaptive Self-Attention. We calculate the self-attention for every node at each message passing iteration to decide the edge weights of the graph. For efficiency, all edges that originate from a node $u$ have the same weight $\alpha_u^{(k)}$, which is calculated by

$$\alpha_u^{(k)} = f_{att}\left(h_u^{(k-1)}\right),$$

where $f_{att}$ is a three-layer fully-connected neural network with leaky ReLU activations [23]. When aggregating the information for a node $v$, for all of its neighbors $u \in \mathcal{N}(v)$, $\alpha_u^{(k)}$ is normalized to sum to 1 by the softmax operation.

We do not calculate a pair-wise attention between all connected nodes, as this is computationally expensive for a large graph: the number of elements in the attention matrix grows quadratically with the number of nodes. In the dense relational captioning experiments (Section 4.2), the average number of sequences to decode per image during evaluation is large enough to make pair-wise attention untenable.
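To make the fusion mechanism concrete, the following is a minimal pure-Python sketch of message passing with per-node attention and a GRU-style update, as described in this section. It is a sketch under stated assumptions, not the authors' implementation: the function names (`aggregate`, `combine`, `fuse`), the omission of bias terms, the convention that `adj[v][u]` marks `u` as a neighbor of `v`, the zero message for nodes without neighbors, and the random weights are all illustrative choices.

```python
import math
import random


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def softmax(xs):
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def attention_logit(h_u, layers, slope=0.01):
    """Three fully-connected layers with leaky ReLU in between; scalar output."""
    x = h_u
    for i, w in enumerate(layers):
        x = matvec(w, x)
        if i < len(layers) - 1:
            x = [t if t > 0 else slope * t for t in x]
    return x[0]


def aggregate(h, adj, w, layers):
    """Attention-weighted sum of transformed neighbor features, per node."""
    logits = [attention_logit(h_u, layers) for h_u in h]  # one logit per source node
    messages = [matvec(w, h_u) for h_u in h]
    dim = len(h[0])
    out = []
    for v in range(len(h)):
        nbrs = [u for u in range(len(h)) if adj[v][u]]
        if not nbrs:  # assumption: isolated nodes receive a zero message
            out.append([0.0] * dim)
            continue
        alpha = softmax([logits[u] for u in nbrs])  # normalized over neighbors of v
        out.append([sum(a * messages[u][k] for a, u in zip(alpha, nbrs))
                    for k in range(dim)])
    return out


def combine(h_v, a_v, p):
    """GRU-style gating: a_v plays the role of the input, h_v of the state."""
    z = [sigmoid(x + y) for x, y in zip(matvec(p["Wz"], a_v), matvec(p["Uz"], h_v))]
    r = [sigmoid(x + y) for x, y in zip(matvec(p["Wr"], a_v), matvec(p["Ur"], h_v))]
    gated = [ri * hi for ri, hi in zip(r, h_v)]
    cand = [math.tanh(x + y)
            for x, y in zip(matvec(p["Wc"], a_v), matvec(p["Uc"], gated))]
    return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_v, cand)]


def fuse(h, adj, w, layers, p, iterations):
    for _ in range(iterations):
        a = aggregate(h, adj, w, layers)
        h = [combine(h_v, a_v, p) for h_v, a_v in zip(h, a)]
    return h


random.seed(0)
DIM = 4


def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


w = rand_matrix(DIM, DIM)
layers = [rand_matrix(DIM, DIM), rand_matrix(DIM, DIM), rand_matrix(1, DIM)]
params = {name: rand_matrix(DIM, DIM) for name in ("Wz", "Uz", "Wr", "Ur", "Wc", "Uc")}
outputs = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(3)]
adj = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]  # decoders 0 and 1 correlated; 2 isolated
fused = fuse(outputs, adj, w, layers, params, iterations=2)
```

Because one attention logit is computed per source node rather than per edge, the attention cost grows linearly with the number of decoders, which reflects the efficiency argument made above.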

3.2 Input Combination

At every time step, our consistent decoder receives two inputs. One is the output representation from its own previous time step, and the other is the fused output context representation from the consistency fusion mechanism. We simply concatenate the two to form the input to the recurrent unit.

4 Experiments

We conduct two sets of experiments. The first is on a synthetic mathematical dataset that we created, in which we build two mathematical sequences, one being a function of the other. The goal is to illustrate the full power of our model in a scenario where the dependency between sequences is strong, and to illustrate the power of the consistent decoder in the absence of the additional complexities introduced by a specific vision problem. The second set of experiments focuses on the use of consistent multiple sequence decoding for dense relational captioning on a real image dataset introduced in [11]. We also conduct ablation experiments to illustrate the importance of various components.

4.1 Mathematical Sequence Decoding

4.1.1 Dataset.

We build a dataset consisting of paired mathematical sequences. By design, the second sequence is a function of the first and hence depends on its value. The four constants of the two functions are randomly drawn from a uniform distribution. Paired sequences are constructed by evaluating the two functions on integer values. The goal of the task is to forecast the values of both sequences over a number of steps, conditioned on their respective inputs. We build 80K paired sequences in total: 70K for training and 5K each for validation and testing.

Figure 3: Synthetic Experiment. One recurrent unit, at a single time step, of the models. (a) The baseline model, where the LSTM input is its own output representation from the previous time step. (b) Our consistent decoder, where the LSTM unit receives two inputs: its own output representation from the previous time step and the output of the consistency fusion mechanism. Both decoders are shown; the consistency fusion mechanism is applied to the two decoders’ output representations.

Baseline. As a baseline, we train two separate decoders to decode the sequences independently of one another, where every decoder is an LSTM [9] recurrent neural network. One such recurrent unit is illustrated in Figure 3; it contains three components: a feature embedding layer, an LSTM unit, and an output layer. In the first time step, the input to the feature embedding layer is a concatenation of the two coefficients of the corresponding sequence. The input in the second time step is the first value of that sequence. The output layer is trained to produce the next value in the sequence. The input in the remaining time steps is the output from the previous time step. Except for the feature embedding layer in the first time step, all feature embedding layers in one decoder share the same weights. All LSTM units and output layers in one decoder also share the same weights.

Baseline (2x). Our consistent decoder takes two inputs, increasing the input dimension of the LSTM cell by a factor of two (compared to the baseline above). To ensure that the improved performance is not due to the increased capacity of the LSTM, we also implement a baseline where the input to the LSTM cell matches ours in size; we call this variant Baseline (2x). In practice, we simply concatenate two identical copies of the input and feed this vector into the LSTM unit instead. We highlight that the number of parameters in the LSTM of this variant exactly matches that of our consistent multiple sequence decoder.

Consistent decoder. Our consistent decoder (see Figure 3) has the same structure as the Baseline (2x) above. The difference is that we run the consistent fusion mechanism, that involves message passing in a gated GNN, to obtain the fused context vector that is concatenated to the encoding of the previous output. Because only two sequences are decoded in this scenario, we set the number of and iteration () in the fusion mechanism to .

Implementation details. The LSTM input size is set to 32 for Baseline, and 64 for both Baseline (2x) and our Consistent Decoder. The hidden state size is set to 2048. In the LSTM decoder, the first time step encodes the coefficients of the corresponding sequence, and the remaining time steps are used for sequence decoding. The second input in the first two time steps of the Baseline (2x) and the Consistent Decoder is a zero vector. We train with the SGD optimizer with momentum and a constant learning rate. Models are trained using the prediction mean squared error (MSE) loss.

Results. The quantitative results are shown in Table 1, measured using the MSE loss on the test set for the two sequences respectively. All three models do well in decoding the first sequence. However, the baseline models do poorly when decoding the second sequence, which depends on the first. Our consistent decoder learns to incorporate predictions from the first sequence when decoding the second, resulting in much more accurate decoding: its MSE loss on the second sequence is roughly 1,000 times lower than that of the baselines (0.3356 vs. about 340). One qualitative result for decoding the second sequence is shown in Table 2.

Model                 MSE loss (first sequence)    MSE loss (second sequence)
Baseline              0.0200                       340.1852
Baseline (2x)         0.0043                       340.0925
Consistent Decoder    0.0049                       0.3356

Table 1: Synthetic Experiment. Results on the mathematical sequence dataset.

Time step             1        2        3        4        5        6        7        8
Baseline              45.33    66.37    87.19    107.52   128.24   148.93   169.57   190.28
Baseline (2x)         45.33    66.63    86.99    107.37   128.08   148.58   169.14   189.73
Consistent decoder    45.33    66.72    95.09    121.88   147.06   172.58   198.09   223.54
Ground-truth          45.33    70.83    96.32    121.81   147.31   172.80   198.30   223.79

Time step             9        10       11       12       13       14       15       16
Baseline              210.82   231.54   252.40   273.14   293.95   314.66   335.11   355.56
Baseline (2x)         210.40   231.24   252.24   273.37   294.50   315.54   336.20   356.61
Consistent decoder    248.99   274.44   299.89   325.35   350.80   376.25   401.76   427.31
Ground-truth          249.28   274.78   300.27   325.76   351.26   376.75   402.24   427.74

Table 2: Synthetic Experiment. A qualitative result on the mathematical sequence dataset for decoding the second (dependent) sequence.

4.2 Dense Relational Captioning

The task of dense relational captioning was first introduced in [11]. It is an extension of dense image captioning, initially proposed by Johnson et al. [10]. Dense relational captioning follows a pipeline similar to that of [10], where regions are detected first and then captioned. The main difference is that relational captions are generated for every pair of detected object regions.

Dataset. The dense relational captioning dataset, introduced in [11], is built upon Visual Genome [16] (VG, v.1.2). It consists of 75,456 training, 4,871 validation, and 4,873 testing images. The ground-truth relational captions are formed using the relationship labels and attribute labels of VG. We use the original splits proposed in [16] for our experiments. However, we also find an unusual lack of consistency in the dataset among the captions for the same object. This stems from the merging of different annotations provided by Amazon Mechanical Turk workers for a given image in VG. Because we expect captions produced by one person to be (self-)consistent, we also report the performance on a manually constructed variant of the dense relational captioning dataset which has consistent labels.

Original label set. This is the original dense image relational captioning label set, which has vocabulary size of 15,596 words across all captions.

Consistent label set. For every image, we first count the number of different descriptions for every bounding box. Then, for each bounding box that has multiple descriptions, we select the most frequent description as its only ground truth. This makes the descriptions of a bounding box consistent within a specific image. After the label pre-processing of [11], we obtain a vocabulary size of 15,009 words.
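The construction of the consistent label set can be sketched in a few lines of Python. This is an illustrative sketch: the input format (triples of image id, box id, description), the function name `consistent_labels`, and the tie-breaking behavior (the first-seen description wins among equally frequent ones) are assumptions, since the text does not specify them.

```python
from collections import Counter, defaultdict


def consistent_labels(annotations):
    """Keep the most frequent description for every (image, bounding box).

    annotations: iterable of (image_id, box_id, description) triples.
    """
    counts = defaultdict(Counter)
    for image_id, box_id, description in annotations:
        counts[(image_id, box_id)][description] += 1
    # most_common(1) returns the top entry; ties keep first-seen order.
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}


# Hypothetical annotations: box 7 is called "table" twice and "desk" once.
annotations = [
    ("img1", 7, "table"),
    ("img1", 7, "desk"),
    ("img1", 7, "table"),
    ("img1", 9, "lamp"),
]
labels = consistent_labels(annotations)
print(labels)
```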

Figure 4: Captioning Experiment. One recurrent unit, at a single time step, of the models. (a) Baseline, the MTTSNet [11] model, where {SUBJ, PRED, OBJ} LSTMs are the triple-stream decoders and {SUBJ, PRED, OBJ} W-Ebds are the respective word embedding layers. The outputs are the POS tag and the word predicted at that time step. (b) Baseline (2x). (c) Our consistent decoder, where we have one consistency fusion mechanism (CFM) for every LSTM stream.

Models. As in the mathematical sequence decoding (Sec. 4.1), we also have three models for this dataset: Baseline, Baseline (2x), and our Consistent Decoder.

Baseline. Kim et al. [11] propose a multi-task triple-stream network (MTTSNet) for dense relational captioning. MTTSNet has three LSTM RNNs to generate outputs based on the subject region (SUBJ), object region (OBJ), and the union region (PRED) respectively. The outputs from the three RNNs are then concatenated to predict a caption word and its part-of-speech (POS) tag (subject, predicate, or object) in one time step via a multi-task module. We treat this as our baseline. One such recurrent unit is illustrated in Figure 4, comprising subject, object, and union embedding layers, LSTM units, and a multi-task module.

Baseline (2x). To mimic the structure of our consistent decoder, which receives two inputs, as in the Baseline (2x) in Section 4.1, we also build another Baseline (2x) on the dense relational captioning dataset. One such recurrent unit is shown in Figure 4. Built upon the baseline, the other input to the LSTM unit is the identical output word embedding from the decoder’s own previous time step.

Consistent decoder. Our consistent decoder has the same structure as the Baseline (2x), except that the other input to the LSTM unit is the fused output word embedding from the previous time step of all the decoders. We apply our consistency fusion mechanism over the output word embeddings from all the decoders. We define two captions as correlated if they involve the same proposed bounding box. We use this correlation information to build the adjacency matrix. One such recurrent unit is illustrated in Figure 4.

Implementation details. Rather than training the whole detection and captioning framework end-to-end from scratch, we utilize the trained detection module in the original dense relational captioning model provided by [11]. We use the trained detection module to extract region proposal (bounding box) locations and the relevant features. Given an image, we run the trained detection module, and keep the 100 most confident proposed regions. We then apply one round of non-maximum suppression (NMS) with an intersection over union (IoU) threshold to reduce the number of overlapping region proposals. The remaining region proposals are treated as detected bounding boxes for an image. On average, in the validation set, there are detected bounding boxes per image. We fix the detected bounding box information, and train the captioning module from scratch. Training requires pairs of bounding boxes. We do a pairwise combination on the detected bounding boxes and keep those which have corresponding ground-truth relational captions for training, which results in an actual number of training images. A proposed bounding box pair has a ground-truth relational caption if the two proposed bounding boxes both have non-zero overlaps with their respective ground-truth bounding boxes. On average, in the validation set, there are bounding box pairs having ground-truth captions per image.
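The proposal filtering and pairing steps can be sketched as follows. This is a minimal greedy NMS in plain Python, not the detection module of [11]: the box format `(x1, y1, x2, y2)`, the threshold value used in the demo, and the use of ordered (subject, object) pairs are assumptions made for illustration.

```python
from itertools import permutations


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold):
    """Greedy NMS: visit boxes by descending score, drop heavy overlaps."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
keep = nms(boxes, scores, iou_threshold=0.5)  # 0.5 is a placeholder threshold
# Assumption: relational captions are directed, so pairs are ordered.
box_pairs = list(permutations(keep, 2))
print(keep, box_pairs)
```

In the demo, the second box overlaps the first with an IoU of 81/119 (about 0.68) and is suppressed, leaving two boxes and two ordered pairs to caption.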

Network setting. As in [11], we feed the visual features to the relevant LSTMs in the first time step, and a special start-of-sequence token (SOS) at the second time step to start decoding captions. The visual feature and word embedding dimensions are set to 512. The LSTM input dimension is set to 512 for Baseline, and 1024 for Baseline (2x) and our Consistent Decoder. In the first two time steps of the Baseline (2x) and the Consistent Decoder, the other input is a zero vector. For our consistent decoder, we set the number of message passing and feature fusion iterations in the consistency fusion mechanism to 2.

Training procedure. We use the SGD optimizer with a fixed batch size to train both baselines and our consistent decoder. The learning rate is halved every 2 epochs from its initial value until it reaches a capped minimum. Momentum is set to 0.98. Since we fix the bounding box locations and their feature representations, the models are trained with the captioning loss $\mathcal{L}_{cap}$ and the POS classification loss $\mathcal{L}_{POS}$:

$$\mathcal{L} = \mathcal{L}_{cap} + \lambda \mathcal{L}_{POS},$$

where $\lambda$ is set to 0.1. $\mathcal{L}_{cap}$ and $\mathcal{L}_{POS}$ are both cross-entropy losses at every time step for word and POS classifications respectively.
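The training schedule and loss combination can be sketched as below. Several details are assumptions: the initial and minimum learning rates in the demo are placeholder values (the actual values are not recoverable from the text), epochs are counted from zero, and the 0.1 weight is assumed to multiply the POS classification term.

```python
def learning_rate(epoch, initial_lr, min_lr, halve_every=2):
    """Step decay: halve every `halve_every` epochs, never below `min_lr`."""
    return max(initial_lr * 0.5 ** (epoch // halve_every), min_lr)


def total_loss(caption_loss, pos_loss, pos_weight=0.1):
    """Captioning loss plus the weighted POS classification loss."""
    return caption_loss + pos_weight * pos_loss


# Placeholder values, not taken from the source.
INITIAL_LR, MIN_LR = 0.01, 0.001
schedule = [learning_rate(epoch, INITIAL_LR, MIN_LR) for epoch in range(10)]
for epoch, lr in enumerate(schedule):
    print(epoch, lr)
```

With these placeholder values the rate stays at 0.01 for epochs 0 and 1, is halved for each subsequent pair of epochs, and is clamped to 0.001 from epoch 8 onward.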

Evaluation metrics. Our consistent decoder has the property that it can produce more consistent captions in terms of bounding box level descriptions. Beyond measuring only caption correctness as in earlier works [10, 37, 39], we propose a caption consistency evaluation to measure the consistency patterns exhibited in the generated captions. For completeness, we also report the image level recall (Img-Lv. Recall) measurement [11] to evaluate the word diversity of generated captions for a given image. However, this measure is flawed, as it measures the diversity of captions within an image. We argue that one desires consistency within an image (the opposite of diversity) but diversity across images. To address this, we introduce another bounding box level description diversity measurement.

Caption correctness measurement. Following [11], we use mean average precision (mAP) to measure both localization and language accuracy, and average METEOR score [4] to evaluate the caption correctness regardless of the bounding box location. When calculating mAP, the METEOR score thresholds for language are {0, 0.05, 0.1, 0.15, 0.2, 0.25}, and the IoU thresholds for bounding box localization are {0.2, 0.3, 0.4, 0.5, 0.6}. For the localization AP, both bounding boxes in the bounding box pair are measured with their respective ground-truths. Only the results with IoUs of both bounding boxes greater than the localization threshold are considered for the language thresholds.
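The thresholding rule used in the mAP computation can be illustrated with a short sketch. It only shows how a single predicted pair is gated at each of the 30 threshold combinations; it does not compute average precision. The strictness of the METEOR comparison (`>=`) and the function name `is_correct` are assumptions.

```python
from itertools import product

METEOR_THRESHOLDS = [0, 0.05, 0.1, 0.15, 0.2, 0.25]
IOU_THRESHOLDS = [0.2, 0.3, 0.4, 0.5, 0.6]


def is_correct(subj_iou, obj_iou, meteor, iou_thr, meteor_thr):
    """A pair counts only if BOTH boxes are localized and the caption passes."""
    localized = subj_iou > iou_thr and obj_iou > iou_thr
    return localized and meteor >= meteor_thr


grid = list(product(IOU_THRESHOLDS, METEOR_THRESHOLDS))

# Hypothetical prediction: subject IoU 0.55, object IoU 0.45, METEOR 0.12.
passed = sum(is_correct(0.55, 0.45, 0.12, iou_thr, m_thr) for iou_thr, m_thr in grid)
print(len(grid), passed)
```

In this example the object box fails the 0.5 and 0.6 IoU thresholds, so the pair is rejected there regardless of its language score.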

Caption consistency measurement. We propose a measurement to evaluate the consistency of generated captions. In one image, if a bounding box appears in multiple relations, we calculate the average pair-wise BLEU-1 score [26] among the generated descriptions for this specific bounding box. We utilize the predicted POS tag to extract the bounding box descriptions from a generated caption. We then take the mean of the averaged pairwise BLEU-1 scores over the whole test set to get the consistency score. A higher consistency score is better and implies more consistent results for one region.
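A minimal sketch of the consistency score follows. It uses a simplified BLEU-1 (clipped unigram precision with a brevity penalty, one reference, whitespace tokenization) rather than a reference BLEU implementation, and it averages over ordered pairs so that the asymmetric BLEU score is applied in both directions; both choices, as well as the function names, are assumptions.

```python
import math
from collections import Counter
from itertools import permutations


def bleu1(candidate, reference):
    """Simplified BLEU-1: clipped unigram precision times brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    ref_counts = Counter(ref)
    clipped = sum(min(n, ref_counts[w]) for w, n in Counter(cand).items())
    precision = clipped / len(cand)
    penalty = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return precision * penalty


def box_consistency(descriptions):
    """Average pairwise BLEU-1 among the descriptions of one bounding box."""
    pairs = list(permutations(descriptions, 2))
    return sum(bleu1(c, r) for c, r in pairs) / len(pairs)


def consistency_score(per_box_descriptions):
    """Mean over all boxes that appear in more than one relation."""
    scores = [box_consistency(d) for d in per_box_descriptions if len(d) > 1]
    return sum(scores) / len(scores)


# Hypothetical descriptions: one box named consistently, one named inconsistently.
per_box = [["table", "table", "table"], ["table", "desk"], ["lamp"]]
score = consistency_score(per_box)
print(score)
```

The sketch returns a value between 0 and 1; the box described as "table" in every caption scores 1, the box described as both "table" and "desk" scores 0, and the box with a single description is skipped.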

Bounding box level diversity measurement (BBox Div.). We argue that one desires more consistent captioning when decoding multiple sequences involving the same region (bounding box). However, when we use sampling for caption decoding and generate captions for an image multiple times, each individual run should be consistent, while the results across runs should remain diverse. In other words, a model should be able to choose among diverse references for a given region/object, but be consistent in its use of the chosen reference within a given run. To capture this intuition, we propose a word diversity measurement on the bounding box level to measure the diversity of the bounding box descriptions obtained with multiple runs of captioning for the same image. We calculate the pairwise BLEU-1 score [26] among the multiple descriptions for a specific bounding box in a specific relation, and average over all the bounding boxes. This measure is treated as our diversity score on the bounding box level. A lower score is better and indicates more diverse bounding box descriptions.


During testing, we give the model the pairwise combination information of the detected bounding boxes for caption generation. When decoding captions, at every time step, we take the most probable word and POS tag for all the measurements except the BBox Div. measurement, which requires sampling during caption decoding. For this metric, we caption a given image multiple times, which results in multiple descriptions for a specific bounding box in a specific relation. We only measure this on the original label set because the ground-truth descriptions for a specific bounding box in the consistent label set are not diverse by design. Table 3 and Table 4 show the quantitative results on the original label set and the consistent label set, respectively. For completeness, we report the direct results from [11] (line 1 in Table 3, MTTSNet-50 RP), where the model is trained end-to-end from scratch using the original label set, and evaluated using the 50 most confident region proposals before the second round of NMS. We also evaluate the same trained model using the 100 most confident region proposals (same as in our model). The results are listed in line 2 in Table 3 (MTTSNet-100 RP). Our baseline model has the same structure as the original relational captioning model [11]. For our Baseline, Baseline (2x), and Consistent Decoder, only the decoding mechanism is trained on top of a fixed region detection backbone. Figure 5 shows two qualitative results from the baseline model and our consistent decoder trained with the original label set.
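The two decoding modes used at test time (most probable token for most measurements, sampling for BBox Div.) differ only in how a token is picked from the predicted distribution at each step. A minimal sketch, assuming the decoder exposes a per-step probability table as a dictionary; the function name `pick_token` and the example vocabulary are hypothetical:

```python
import random
from typing import Dict, Optional


def pick_token(probs: Dict[str, float], sample: bool,
               rng: Optional[random.Random] = None) -> str:
    """Pick one token from a per-step distribution.

    sample=False: most probable token (deterministic decoding).
    sample=True:  draw a token in proportion to its probability.
    """
    if not sample:
        return max(probs, key=probs.get)
    rng = rng or random.Random()
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical distribution over the next word of a box description.
step_probs = {"man": 0.6, "person": 0.3, "guy": 0.1}
greedy_word = pick_token(step_probs, sample=False)
sampled_words = [pick_token(step_probs, sample=True) for _ in range(5)]
```

The same choice would apply to the POS tag predicted at each step. Repeating the sampled mode over several full captioning runs yields the multiple descriptions per bounding box that the BBox Div. score compares.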

| Model | mAP | METEOR | Consistency | Img-Lv. Recall | BBox Div. |
|---|---|---|---|---|---|
| MTTSNet-50 RP [11] | 0.88 | 18.73 | - | 34.27 | - |
| MTTSNet-100 RP [11] | 1.29 | 20.26 | - | 55.68 | - |
| Baseline | 1.93 | 22.86 | 33.08 | 71.32 | 17.26 |
| Baseline (2x) | 1.94 | 22.86 | 33.27 | 71.35 | 17.07 |
| Consistent Decoder | 2.04 | 23.32 | 36.43 | 69.60 | 17.50 |

Table 3: Test results on the original label set. Same training protocol, decoder capacity and structure are used for Baseline (2x) and our Consistent Decoder. The only difference is that Consistent Decoder has our proposed consistency fusion mechanism.

| Model | mAP | METEOR | Consistency | Img-Lv. Recall |
|---|---|---|---|---|
| Baseline | 1.90 | 22.49 | 33.35 | 71.58 |
| Baseline (2x) | 1.93 | 22.45 | 33.50 | 71.60 |
| Consistent Decoder | 2.01 | 22.58 | 38.71 | 69.60 |

Table 4: Test results on the consistent label set. Same training protocol, decoder capacity and structure are used for Baseline (2x) and our Consistent Decoder. The only difference is that Consistent Decoder has our proposed consistency fusion mechanism.

Figure 5: Qualitative results. Sample results for dense relational captioning trained using the original label set. Given a pair of bounding boxes (subject object), the generated captions are of the form: “subject description predicate object description”.

Result analysis. Our consistent decoder not only performs better than the baselines on caption correctness, but also generates more consistent captions. Compared to the results of the trained model provided by [11], our consistent decoder achieves new state-of-the-art results. On the original label set, the consistent decoder has a 5.2% relative improvement on mAP and a 9.5% relative improvement on the consistency measure over Baseline (2x) (Table 3). Because Baseline (2x) shares the training protocol, decoder capacity, and structure of the consistent decoder, these improvements are due solely to the consistency fusion mechanism. Comparing the BBox Div. scores among the models, our consistent decoder generates bounding box level descriptions whose word-level diversity is similar to that of the baselines. The consistent label set, by design, has consistent ground-truth bounding box descriptions. As a result, our relative improvement in the consistency score on the consistent label set (about 15.6% over Baseline (2x), from Table 4) is greater than that on the original one (9.5%).
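The relative improvements discussed in this analysis can be re-derived from the table entries. The short check below copies the relevant values from Table 3 and Table 4 and assumes that Baseline (2x) is the reference model, which is the choice that reproduces the 5.2 and 9.5 figures quoted in the abstract.

```python
def relative_improvement(new: float, base: float) -> float:
    """Relative improvement of `new` over `base`, in percent."""
    return 100.0 * (new - base) / base


# (Consistent Decoder, Baseline (2x)) values copied from the tables.
original_label_set = {"mAP": (2.04, 1.94), "Consistency": (36.43, 33.27)}
consistent_label_set = {"Consistency": (38.71, 33.50)}

for metric, (ours, base) in original_label_set.items():
    print(f"original label set, {metric}: "
          f"{relative_improvement(ours, base):.1f}%")

for metric, (ours, base) in consistent_label_set.items():
    print(f"consistent label set, {metric}: "
          f"{relative_improvement(ours, base):.1f}%")
```

Measured against the plain Baseline rows instead, the same formula gives slightly larger numbers, so the choice of reference row matters when comparing against other reports.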

Graph neural network matters. Our consistency fusion mechanism is implemented by a gated GNN [20]. To explore how the number of message-passing iterations in the gated GNN impacts our captioning results, we conduct experiments on the original label set with three different iteration counts. The results are shown in Table 5. As the number of iterations increases, the captioning results improve. With a single iteration, our consistency fusion mechanism can learn a weighted average of output representations from all directly relevant decoders. As the number of iterations increases, information can be propagated among decoders that do not directly affect each other, but do so indirectly through other intermediate decoders. This appears to be an important property, particularly for improving consistency. We find that the intermediate setting (second row of Table 5, which matches the Consistent Decoder row of Table 3) is a reasonable compromise between quality of results and running time for the dense relational captioning task.
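The effect of the iteration count on how far information travels can be illustrated with a toy example. The sketch below is not the gated GNN of [20]: it replaces the gated update and self-attention with plain neighbor averaging, and the three-decoder chain A, B, C is a hypothetical graph in which A and C are only indirectly related through B.

```python
from typing import Dict, List


def propagate(states: Dict[str, List[float]],
              edges: Dict[str, List[str]],
              steps: int) -> Dict[str, List[float]]:
    """Run `steps` rounds of simplified message passing.

    Each round, every node mixes its own state with the mean of its
    neighbors' states (a stand-in for the gated update in the paper).
    """
    for _ in range(steps):
        new_states = {}
        for node, state in states.items():
            messages = [states[n] for n in edges.get(node, [])]
            if messages:
                mean_msg = [sum(v) / len(messages) for v in zip(*messages)]
                new_states[node] = [0.5 * s + 0.5 * m
                                    for s, m in zip(state, mean_msg)]
            else:
                new_states[node] = list(state)
        states = new_states
    return states


# Chain of decoders: A - B - C. One-hot states make the flow visible:
# the third component of A's state is non-zero only once C has reached A.
edges = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
init = {"A": [1.0, 0.0, 0.0], "B": [0.0, 1.0, 0.0], "C": [0.0, 0.0, 1.0]}

after_one = propagate(init, edges, steps=1)
after_two = propagate(init, edges, steps=2)
```

After one round, A has only mixed in B's state; after two rounds, A also carries a contribution from C that arrived through B, which mirrors the indirect propagation described above.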

| Model | mAP | METEOR | Consistency | Img-Lv. Recall |
|---|---|---|---|---|
| Consistent Decoder () | 2.04 | 23.26 | 35.71 | 70.21 |
| Consistent Decoder () | 2.04 | 23.32 | 36.43 | 69.60 |
| Consistent Decoder () | 2.05 | 23.27 | 36.47 | 69.50 |

Table 5: Test results on the original label set with different numbers of gated GNN iterations.

5 Conclusion

We propose a consistent multiple sequence decoder, which can utilize the consistent pattern in the sequences to be decoded. With the core consistency fusion mechanism component, one decoder can utilize the outputs from all other decoders for better decoding. We show the advantage of our consistent decoder on two datasets, one synthetic dataset having two correlated mathematical sequences, and one real dense relational captioning dataset. Our consistent decoder achieves new state-of-the-art results on the dense relational captioning task.


  • [1] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese (2016) Social lstm: human trajectory prediction in crowded spaces. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 961–971. Cited by: §1, §2.2.
  • [2] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang (2018) Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6077–6086. Cited by: §1, §2.1.
  • [3] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh (2015) Vqa: visual question answering. In Proceedings of the IEEE international conference on computer vision, pp. 2425–2433. Cited by: §1.
  • [4] S. Banerjee and A. Lavie (2005) METEOR: an automatic metric for mt evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pp. 65–72. Cited by: §4.2.
  • [5] K. Cho, B. Van Merriënboer, D. Bahdanau, and Y. Bengio (2014) On the properties of neural machine translation: encoder-decoder approaches. arXiv preprint arXiv:1409.1259. Cited by: §3.1.
  • [6] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §1.
  • [7] P. Felsen, P. Agrawal, and J. Malik (2017) What will happen next? forecasting player moves in sports videos. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3342–3351. Cited by: §1.
  • [8] L. Gao, Z. Guo, H. Zhang, X. Xu, and H. T. Shen (2017) Video captioning with attention-based lstm and semantic consistency. IEEE Transactions on Multimedia 19 (9), pp. 2045–2055. Cited by: §2.1.
  • [9] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §4.1.1.
  • [10] J. Johnson, A. Karpathy, and L. Fei-Fei (2016) Densecap: fully convolutional localization networks for dense captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4565–4574. Cited by: §1, §2.1, §2.1, §4.2, §4.2.
  • [11] D. Kim, J. Choi, T. Oh, and I. S. Kweon (2019) Dense relational captioning: triple-stream networks for relationship-based captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6271–6280. Cited by: §1, §1, §2.1, Figure 4, §4.2, §4.2, §4.2, §4.2, §4.2, §4.2, §4.2, §4.2, §4.2, §4.2, Table 3, §4.
  • [12] T. N. Kipf and M. Welling (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §3.1.
  • [13] B. Knyazev, G. W. Taylor, and M. Amer (2019) Understanding attention and generalization in graph neural networks. In Advances in Neural Information Processing Systems, pp. 4204–4214. Cited by: §3.1.
  • [14] V. Kosaraju, A. Sadeghian, R. Martín-Martín, I. Reid, H. Rezatofighi, and S. Savarese (2019) Social-bigat: multimodal trajectory forecasting using bicycle-gan and graph attention networks. In Advances in Neural Information Processing Systems, pp. 137–146. Cited by: §1.
  • [15] R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. Carlos Niebles (2017) Dense-captioning events in videos. In Proceedings of the IEEE international conference on computer vision, pp. 706–715. Cited by: §2.1.
  • [16] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L. Li, D. A. Shamma, et al. (2017) Visual genome: connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision 123 (1), pp. 32–73. Cited by: §4.2.
  • [17] N. Lee, W. Choi, P. Vernaza, C. B. Choy, P. H. Torr, and M. Chandraker (2017) Desire: distant future prediction in dynamic scenes with interacting agents. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 336–345. Cited by: §1.
  • [18] G. Li, L. Zhu, P. Liu, and Y. Yang (2019) Entangled transformer for image captioning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 8928–8937. Cited by: §1.
  • [19] Y. Li, T. Yao, Y. Pan, H. Chao, and T. Mei (2018) Jointly localizing and describing events for dense video captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7492–7500. Cited by: §2.1.
  • [20] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel (2015) Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493. Cited by: §3.1, §3.1, §3.1, §4.2.
  • [21] Y. Liu, R. Yu, S. Zheng, E. Zhan, and Y. Yue (2019) NAOMI: non-autoregressive multiresolution sequence imputation. arXiv preprint arXiv:1901.10946. Cited by: §2.2.
  • [22] J. Lu, J. Yang, D. Batra, and D. Parikh (2016) Hierarchical question-image co-attention for visual question answering. In Advances in neural information processing systems, pp. 289–297. Cited by: §1.
  • [23] A. L. Maas, A. Y. Hannun, and A. Y. Ng (2013) Rectifier nonlinearities improve neural network acoustic models. In Proc. icml, Vol. 30, pp. 3. Cited by: §3.1.
  • [24] S. Olivastri, G. Singh, and F. Cuzzolin (2019) End-to-end video captioning. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 0–0. Cited by: §1.
  • [25] Y. Pan, T. Yao, H. Li, and T. Mei (2017) Video captioning with transferred semantic attributes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6504–6512. Cited by: §2.1.
  • [26] K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pp. 311–318. Cited by: §4.2, §4.2.
  • [27] S. H. Park, B. Kim, C. M. Kang, C. C. Chung, and J. W. Choi (2018) Sequence-to-sequence prediction of vehicle trajectory via lstm encoder-decoder architecture. In 2018 IEEE Intelligent Vehicles Symposium (IV), pp. 1672–1678. Cited by: §1.
  • [28] T. Rahman, B. Xu, and L. Sigal (2019) Watch, listen and tell: multi-modal weakly supervised dense event captioning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 8908–8917. Cited by: §2.1.
  • [29] S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel (2017) Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7008–7024. Cited by: §2.1.
  • [30] N. Rhinehart, R. McAllister, K. Kitani, and S. Levine (2019) Precog: prediction conditioned on goals in visual multi-agent settings. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2821–2830. Cited by: §1.
  • [31] Z. Shen, J. Li, Z. Su, M. Li, Y. Chen, Y. Jiang, and X. Xue (2017) Weakly supervised dense video captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1916–1924. Cited by: §1.
  • [32] C. Sun, P. Karlsson, J. Wu, J. B. Tenenbaum, and K. Murphy (2019) Stochastic prediction of multi-agent interactions from partial observations. arXiv preprint arXiv:1902.09641. Cited by: §1, §2.2.
  • [33] C. Tang and R. R. Salakhutdinov (2019) Multiple futures prediction. In Advances in Neural Information Processing Systems, pp. 15398–15408. Cited by: §1.
  • [34] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio (2017) Graph attention networks. arXiv preprint arXiv:1710.10903. Cited by: §3.1.
  • [35] G. Vered, G. Oren, Y. Atzmon, and G. Chechik (2019) Joint optimization for cooperative image captioning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 8898–8907. Cited by: §1.
  • [36] K. Xu, W. Hu, J. Leskovec, and S. Jegelka (2018) How powerful are graph neural networks?. arXiv preprint arXiv:1810.00826. Cited by: §3.1.
  • [37] L. Yang, K. Tang, J. Yang, and L. Li (2017) Dense captioning with joint inference and visual context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2193–2202. Cited by: §1, §2.1, §4.2.
  • [38] R. A. Yeh, A. G. Schwing, J. Huang, and K. Murphy (2019) Diverse generation for multi-agent sports games. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4610–4619. Cited by: §1.
  • [39] G. Yin, L. Sheng, B. Liu, N. Yu, X. Wang, and J. Shao (2019) Context and attribute grounded dense captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6241–6250. Cited by: §1, §2.1, §4.2.