1 Introduction
Recurrent Neural Networks (RNNs) and their generalizations (e.g., LSTMs and GRUs) have emerged as popular and effective frameworks for modeling sequential data across varied domains. The application of these models has led to significantly improved performance on a variety of tasks: speech recognition [17, 10], machine translation [2, 9, 20], conversation modeling [33], image captioning [34, 21, 14, 7, 13], visual question answering (VQA) [1, 27, 23, 24, 15], and visual dialog [11, 12].
Broadly speaking, in these applications RNNs are typically used in two distinct roles: (1) as encoders that convert sequential data into real-valued vectors, and (2) as decoders that convert encoded vectors into sequential output. Models for image caption retrieval and VQA (with classification over answers) [1, 27] consist of encoder RNNs but not decoders. Image caption generation models [34] consist of decoder RNNs but not encoders (image encoding is performed via Convolutional Neural Networks). Visual dialog models use encoders to embed dialog history and model state, while using decoders to generate dialog responses.
Regardless of the setting, the task of decoding a sequence from an RNN consists of finding the most likely sequence $Y = (y_1, \ldots, y_T)$ given some input $x$. Unidirectional RNNs model this probability by estimating the likelihood of outputting a symbol at time $t$ (say $y_t$) given the history of previous outputs ($y_1, \ldots, y_{t-1}$) by “compressing” the history into a hidden state vector $h_t$ such that $P(y_t \mid y_1, \ldots, y_{t-1}) \approx P(y_t \mid h_t)$. Since each output symbol is conditioned on all previous outputs, the search space of possible sequences is exponential in the sequence length and exact inference is intractable. As a result, approximate inference algorithms are applied, with Beam Search (BS) being the primary workhorse. BS is a greedy heuristic search that maintains the top $B$ most likely partial sequences through the search tree, where $B$ is referred to as the beam width. At each time step, BS expands these $B$ partial sequences to all possible beam extensions and then selects the $B$ highest scoring among the expansions.
In contrast to Unidirectional RNNs, Bidirectional RNNs model both forward (increasing time) and backward (decreasing time) dependencies via two hidden state vectors $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$. This enables Bidirectional RNNs to consider both past and future when predicting an output. Unfortunately, these dependencies also make exact inference in these models more difficult than in Unidirectional RNNs, and to the best of our knowledge no efficient approximate algorithms exist. In this paper, we present the first efficient approximate inference algorithm for these models.
As a challenging testbed for our method, we propose a fill-in-the-blank image captioning task. As an example, given the blanked image caption “A man on ________________ how to ski” shown in Figure 1, our goal is to generate the missing content “skis showing a young child” (or an acceptable paraphrase) to complete the sentence. This task serves as a concrete stand-in for a broad class of other similar sequence completion tasks, such as predicting missing sections in a DNA sequence or path planning problems where an agent must hit intermediate checkpoints.
On the surface, this task perhaps seems easier than generating an entire caption from scratch; there is, after all, more information in the input. However, the need to condition on the context when generating the missing symbols is challenging for existing greedy approximate inference algorithms. Figure 1(a) shows a sample decoding from standard ‘left-to-right’ BS on a Unidirectional RNN. Note the grammatically incorrect “with a how to” transition produced. Similar problems occur at the other boundary for right-to-left models. Simply put, the inability to consider both the future and past contexts in BS leads Unidirectional RNNs to fill the blank with words that clash abruptly with the context around the blank.
Moreover, decoding also poses a computational challenge. Consider the following sentence that we know has only a single word missing: “The _____ was barreling down the tracks.” Filling in this blank feels simple: we just need to find the best single word in the vocabulary $\mathcal{Y}$. However, since all future outputs in a Unidirectional RNN are conditioned on the past, selecting the best word at time $t$ requires evaluating the likelihood of the entire sequence once for each possible word (similarly for Bidirectional RNNs). This amounts to $O(|\mathcal{Y}| \times T)$ forward passes through an RNN’s basic computational unit, where $T$ is the sequence length, to fill in a single blank optimally! More generally, for an arbitrarily sized blank covering $w$ words, this number grows exponentially as $O(|\mathcal{Y}|^w \times T)$ and quickly becomes intractable.
To overcome these shortcomings, we introduce the first approximate inference algorithm for 1-Best (and M-Best) inference in bidirectional neural sequence models (e.g., RNNs, LSTMs, and GRUs): Bidirectional Beam Search (BiBS). We show BiBS performs well on fill-in-the-blank tasks, efficiently incorporating both forward and backward time information from Bidirectional RNNs.
To give an algorithmic overview, we begin by decomposing a Bidirectional RNN into two calibrated but independent Unidirectional RNNs (one going forward in time and the other backward). To perform approximate inference with these decomposed models, our method alternately performs BS in one direction while holding the beams in the opposite direction fixed. The fixed, oppositely-directed beams are used to roughly approximate the conditional probability of the future sequence given the past, such that a BS-like update maximizes an approximation of the full joint at each time step. Figure 1(b) shows an example result of our algorithm, “A man on skis is teaching a child how to ski”, which smoothly fits within its context while still describing the image content.
We compare BiBS against natural ablations and baselines for fill-in-the-blank tasks. Our results show that BiBS is an effective and efficient approach for decoding Bidirectional RNNs, consistently outperforming all baselines.
2 Related Work
While Unidirectional RNNs are popular models with widespread adoption [17, 10, 2, 9, 20, 33, 1, 27], Bidirectional RNNs have been utilized relatively infrequently [19, 28, 35] and even more rarely as decoders [6] – we argue due to the lack of efficient inference approaches for these models.
Wang et al. [35] use Bidirectional RNNs for image caption generation, but do not perform bidirectional inference; rather, they simply use Bidirectional RNNs to rescore candidates. Specifically, at inference time they decompose a Bidirectional RNN into two independent Unidirectional RNNs, apply standard Beam Search in each direction, and then rerank these two collections of beams based on the max probability of each beam under the forward or backward model. We compare to this method and show that approximate joint optimization via Bidirectional Beam Search leads to better performance in the fill-in-the-blank image captioning task.
Most related to our work is that of Berglund et al. [6], which studies generating missing sections of sequential data in an unsupervised setting using Bidirectional RNNs. They propose three probabilistically justified approaches to fill these gaps by drawing samples from the full joint.
Their first model, Generative Stochastic Networks (GSN), resamples the output $y_t$ at a random time $t$ from the conditional output $P(y_t \mid \{y_i\}_{i \neq t})$. For a blank of length $w$, resampling each output token $M$ times requires $wM$ passes of the RNN. Thus, the cost of producing a sample with the GSN method scales linearly with the size of the gap, with each resampling step requiring a full pass of the Bidirectional RNN. Their second approach, NADE, trains a model specifically for filling in the blank: at train time, some inputs are set to a specific ‘missing’ token to indicate the content that needs to be generated. At inference time, the inputs from the gap are set to this token and outputs are sampled from the resulting conditional. Note that this approach is ‘trained to fill in gaps’ and as such requires training data of this kind. In contrast, NADE is a new model for filling in gaps, while we propose a new inference algorithm, which can be broadly applied to any generative bidirectional model. Finally, they propose a third sampling approach based on a Unidirectional RNN which draws from the conditional $P(y_t \mid \{y_i\}_{i \neq t})$; however, as the model is a left-to-right Unidirectional RNN, this term requires computing the likelihood of the remaining sequence given each possible token at time $t$. This costly approach requires $O(|\mathcal{Y}| \times T)$ steps of the RNN per resampled token, where $\mathcal{Y}$ is the vocabulary and $T$ is the sequence length, and is intractable for large vocabularies.
3 Preliminaries: RNNs and Beam Search
We begin by establishing notation and reviewing RNNs and standard Beam Search for completeness. While our exposition details the classical RNN updates, the techniques developed in this paper are broadly applicable to any recurrent neural architecture (LSTMs [18] or GRUs [8]).
Notation. Let $X = (x_1, x_2, \ldots, x_T)$ denote an input sequence, where $x_t$ is an input vector at time $t$. Similarly, let $Y = (y_1, y_2, \ldots, y_T)$ denote an output sequence, where $y_t$ is an output vector at time $t$. To avoid notational clutter, our exposition uses the same length for both input and output sequences ($T$); however, this is not a restriction in theory or practice. Given integers $a \le b$, we use the notation $y_{[a:b]}$ to denote the subsequence $(y_a, y_{a+1}, \ldots, y_b)$; thus, $Y = y_{[1:T]}$ by convention. Given discrete variables $x$ and a scoring function $f$, we generalize the classical maximization notation $\arg\max_x f(x)$ to find the (unique) top $B$ states with highest $f(x)$ via the notation $\text{top-}B_x\, f(x)$.
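As a small illustration of the top-$B$ operator described above, the following Python sketch selects the $B$ highest-scoring states from a finite candidate set. The helper name `top_b` and the toy scoring function are hypothetical and chosen only for illustration.

```python
import heapq

def top_b(candidates, score, b):
    """Return the b highest-scoring candidates, best first."""
    return heapq.nlargest(b, candidates, key=score)

# Toy usage: f(x) = -(x - 3)^2 over the states {0, ..., 6}.
states = range(7)
best = top_b(states, lambda x: -(x - 3) ** 2, 3)
```

With $B = 1$ this reduces to the usual arg max. Ties (here between states 2 and 4) are outside the "unique" assumption made in the text.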
Unidirectional RNNs (URNNs) model the probability of $y_t$ given the history of inputs $x_{[1:t]}$ by “compressing” the history into a hidden state vector $h_t$ such that
$$P(y_t \mid x_{[1:t]}) \approx P(y_t \mid h_t) = \phi(W_y h_t + b_y), \tag{1a}$$
$$h_t = \sigma(W_x x_t + W_h h_{t-1} + b_h), \tag{1b}$$
where $W_y, b_y$ and $W_x, W_h, b_h$ are learned parameters defining the transforms from the input $x_t$ and hidden state $h_{t-1}$ to the output $y_t$ and updated hidden state $h_t$, and $\sigma$ is an elementwise nonlinearity. In applications with symbol sequences as output (such as image captioning), the nonlinear function $\phi$ is typically the softmax function, which produces a distribution over the output vocabulary $\mathcal{Y}$. An example left-to-right Unidirectional RNN architecture is shown in Figure 2(a).
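To make the classical recurrent update concrete, here is a minimal pure-Python sketch of a single URNN step. It assumes a tanh hidden nonlinearity and a softmax output layer; the function names (`urnn_step`, `matvec`) and the toy weights are illustrative assumptions, not values from the trained models.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def urnn_step(x_t, h_prev, W_x, W_h, b_h, W_y, b_y):
    """One step: h_t = tanh(W_x x_t + W_h h_prev + b_h); output = softmax(W_y h_t + b_y)."""
    pre = [a + b + c for a, b, c in zip(matvec(W_x, x_t), matvec(W_h, h_prev), b_h)]
    h_t = [math.tanh(p) for p in pre]  # tanh is an assumed choice of hidden nonlinearity
    logits = [a + b for a, b in zip(matvec(W_y, h_t), b_y)]
    return h_t, softmax(logits)

# Toy parameters: 2-d input, 2-d hidden state, 3-word vocabulary (all hypothetical).
W_x = [[0.5, -0.2], [0.1, 0.3]]
W_h = [[0.0, 0.4], [-0.3, 0.2]]
b_h = [0.0, 0.1]
W_y = [[1.0, -1.0], [0.5, 0.5], [-0.7, 0.2]]
b_y = [0.0, 0.0, 0.0]
h, probs = urnn_step([1.0, 0.0], [0.0, 0.0], W_x, W_h, b_h, W_y, b_y)
```

Here `probs` is a distribution over the three toy vocabulary entries, and `h` would be fed back in as `h_prev` at the next time step.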
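Bidirectional RNNs (BiRNNs) (shown in Figure 2(b)) model both forward (positive time) and backward (negative time) dependencies via two hidden state vectors, forward $\overrightarrow{h}_t$ and backward $\overleftarrow{h}_t$, each with its own update dynamics and corresponding weights. For a BiRNN, we can write the probability of the token $y_t$ given the input sequence $X$ as
$$P(y_t \mid x_{[1:T]}) \approx P(y_t \mid \overrightarrow{h}_t, \overleftarrow{h}_t) = \phi(\overrightarrow{W}_y \overrightarrow{h}_t + \overleftarrow{W}_y \overleftarrow{h}_t + b_y), \tag{2a}$$
$$\overrightarrow{h}_t = \sigma(\overrightarrow{W}_x x_t + \overrightarrow{W}_h \overrightarrow{h}_{t-1} + \overrightarrow{b}_h), \tag{2b}$$
$$\overleftarrow{h}_t = \sigma(\overleftarrow{W}_x x_t + \overleftarrow{W}_h \overleftarrow{h}_{t+1} + \overleftarrow{b}_h). \tag{2c}$$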
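BiRNNs as URNNs. Consider a Bidirectional RNN with the output nonlinearity $\phi$ defined as the softmax function $\phi(z)_i = e^{z_i} / \sum_j e^{z_j}$. It is straightforward to show that the conditional probability of $y_t$ given all other tokens can be written as
$$P(y_t \mid y_{[1:t-1]}, y_{[t+1:T]}) \propto e^{\overrightarrow{W}_y \overrightarrow{h}_t + \overrightarrow{b}_y} \, e^{\overleftarrow{W}_y \overleftarrow{h}_t + \overleftarrow{b}_y},$$
where the output bias is split as $b_y = \overrightarrow{b}_y + \overleftarrow{b}_y$, the exponentials are taken elementwise over the vocabulary, and the resulting terms in the proportionality resemble the URNN output equation in Eq. 1a. Intuitively, this expression shows that the output of a Bidirectional RNN with a softmax output layer can be equivalently expressed as the product of the outputs from two independent but oppositely directed URNNs with specifically constructed weights, renormalized after multiplication. This construction also works in reverse, such that an equivalent Bidirectional RNN can be constructed from two independently trained but oppositely directed URNNs. As such, we will consider a Bidirectional RNN as consisting of a forward-time model $\overrightarrow{\text{URNN}}$ and a backward-time model $\overleftarrow{\text{URNN}}$.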
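The equivalence between a softmax over summed forward and backward logits and a renormalized product of two separate softmax outputs can be checked numerically. The sketch below uses made-up logit vectors standing in for the forward and backward contributions; the names `product_of_experts`, `forward_logits`, and `backward_logits` are hypothetical.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def product_of_experts(p, q):
    """Elementwise product of two distributions, renormalized to sum to one."""
    prod = [a * b for a, b in zip(p, q)]
    total = sum(prod)
    return [v / total for v in prod]

# Toy logits for a 3-word vocabulary (illustrative values only).
forward_logits = [2.0, 0.5, -1.0]    # stands in for the forward-direction term
backward_logits = [-0.5, 1.5, 0.0]   # stands in for the backward-direction term

# A BiRNN-style output: a single softmax over the summed logits.
joint = softmax([f + b for f, b in zip(forward_logits, backward_logits)])

# Two independent URNN-style outputs, multiplied and renormalized.
combined = product_of_experts(softmax(forward_logits), softmax(backward_logits))
```

Up to floating-point error, `joint` and `combined` are the same distribution, since the per-direction normalizers cancel in the renormalization.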
RNNs for decoding are trained to produce sequences $Y$ conditioned on some encoded representation $x$. For machine translation tasks, $x$ may represent an encoding of some source language sequence to be translated and $Y$ is the translation. For image captioning, $x$ is typically a dense vector embedding of the image produced by a Convolutional Neural Network (CNN) [29], and $Y$ is a sequence of 1-hot encodings of the words of the corresponding image caption. Regardless of its source, this encoded representation is considered the first input, $x_1 = x$, and $x_t = y_{t-1}$ for all remaining time steps $t > 1$, such that decoder RNNs are learning to model $P(y_t \mid y_{[1:t-1]}, x)$. This is the setting of interest in this paper, but we drop this explicit dependence on the encoding $x$ to reduce notational clutter in later sections.
Beam Search (BS). Maximum a posteriori (MAP) (or more generally, M-Best-MAP [4, 25]) inference in RNNs consists of finding the most likely sequence under the model. The primary difficulty for decoding is that the number of possible length-$T$ sequences grows exponentially as $|\mathcal{Y}|^T$, so approximate inference algorithms are employed. Due to this exponential output space and the dependence on previous outputs, exact inference is NP-hard in the general case. Beam Search (BS) is a greedy heuristic search algorithm that traverses the search tree using breadth-first search, while only expanding the most promising nodes at each depth. Specifically, BS in Unidirectional RNNs involves maintaining and expanding the top $B$ highest-scoring partial hypotheses, called beams. Let $y_{[1:t],b}$ denote the $b$-th partial hypothesis (beam) at time $t$. We use the notation $Y_{[1:t]} = (y_{[1:t],1}, \ldots, y_{[1:t],B})$ to denote a collection of $B$ beams. BS begins with empty beams, $Y_{[1:0]} = \emptyset$, and proceeds in a left-to-right manner up to time $T$ or until a special END token is generated. At each time $t$, BS considers the space of all possible beam extensions $\mathcal{Y}_t = Y_{[1:t-1]} \times \mathcal{Y}$ and selects the top $B$ high-scoring length-$t$ beams among this expanded hypothesis space. We can formalize this search for optimal updated beams as
$$Y_{[1:t]} = \underset{y_{[1:t]} \in \mathcal{Y}_t}{\text{top-}B} \; \sum_{i=1}^{t} \log P(y_i \mid y_{[1:i-1]}). \tag{3}$$
Each log probability term in the above expression can be computed via a forward pass in Unidirectional RNNs, such that implementing the top-$B$ operation simply requires sorting $B|\mathcal{Y}|$ values. An example run of BS on a left-to-right URNN is shown in Figure 2(c).
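The left-to-right beam update described above can be sketched in a few lines of Python. This is a simplified illustration rather than the implementation used in the experiments: the model is a hypothetical bigram table (`TABLE`, `toy_model`) standing in for an RNN, all sequences have a fixed length, and END-token handling is omitted.

```python
import heapq
import math

def beam_search(log_prob_next, length, beam_width):
    """Left-to-right beam search; returns up to beam_width (score, sequence) pairs, best first."""
    beams = [(0.0, ())]  # a single empty beam with log-probability 0
    for _ in range(length):
        extensions = []
        for score, seq in beams:
            # Expand every beam by every token in the vocabulary.
            for token, lp in log_prob_next(seq).items():
                extensions.append((score + lp, seq + (token,)))
        # Keep only the top-B extensions by cumulative log-probability.
        beams = heapq.nlargest(beam_width, extensions, key=lambda e: e[0])
    return beams

# Hypothetical first-order model: next-token probabilities given the last token.
TABLE = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"a": 0.5, "b": 0.5},
    ("b",): {"a": 0.9, "b": 0.1},
}

def toy_model(prefix):
    """Return log P(next token | prefix) as a dict, using only the last token."""
    return {tok: math.log(p) for tok, p in TABLE[prefix[-1:]].items()}

greedy = beam_search(toy_model, length=2, beam_width=1)
wide = beam_search(toy_model, length=2, beam_width=2)
```

In this toy model, a beam width of 1 commits to "a" first and ends with probability 0.3, while a beam width of 2 recovers the more likely sequence ("b", "a") with probability 0.36, which shows why wider beams can help even though the search remains approximate.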
4 Bidirectional Beam Search (BiBS)
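We begin by analyzing the decision made by left-to-right Beam Search at time $t$. Specifically, at each time $t$, we can factorize the joint probability $P(y_{[1:T]})$ in a particular way:
$$P(y_{[1:T]}) = P(y_{[1:t-1]}) \, P(y_t \mid y_{[1:t-1]}) \, P(y_{[t+1:T]} \mid y_{[1:t]}). \tag{4}$$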
This left-to-right decomposition of the joint around $y_t$ is comprised of three terms:

the ‘marginal’ of the sequence prior to $y_t$: $P(y_{[1:t-1]})$,

the conditional of $y_t$ given this past: $P(y_t \mid y_{[1:t-1]})$, and

the conditional of the remaining sequence after $y_t$ given all prior terms: $P(y_{[t+1:T]} \mid y_{[1:t]})$.
If we consider choosing $y_t$ to maximize this joint, the first two terms can be computed exactly via the forward pass of a left-to-right URNN given the existing sequence; however, the third term cannot be exactly computed because it depends on all futures. Even approximating the third term with $B$ beams requires rerunning beam search for each possible setting of $y_t$, which is prohibitively expensive.
One way of interpreting left-to-right BS is to view it as approximating the joint in Eq. 4 with just the first two terms. Specifically, if we assume that $P(y_{[t+1:T]} \mid y_{[1:t]})$ is uniform, i.e., all futures are equally likely given the sequence so far, then BS is picking the optimal $y_t$. This approximation does not hold in practice and results in poor performance for fill-in-the-blank tasks, where all future sequences are not equally likely by design. In this section, we consider an alternative approximation and derive our BiBS approach.
Efficiently Approximating the Future. In order to derive a tractable approximation to this third term (and by proxy the full joint), we make two simplifying assumptions (which we know will be violated in practice, but result in an efficient approximate inference algorithm). First, we assume that future sequence tokens are independent of past sequence tokens given $y_t$, i.e., we treat RNNs as first-order Markov around $y_t$ during inference. Second, we assume that $P(y_t)$ is uniform, avoiding the need to estimate marginal distributions over $y_t$ for all time steps. Under these assumptions, we write the conditional probability of the remaining sequence tokens given the past sequence as
$$P(y_{[t+1:T]} \mid y_{[1:t]}) \propto P(y_t \mid y_{[t+1:T]}) \, P(y_{[t+1:T]}). \tag{5}$$
Notice that the resulting terms are exactly the output of a right-to-left Unidirectional RNN. Substituting Eq. 5 into Eq. 4, we arrive at an expression that is proportional to the full joint, but comprised of terms which can be independently computed from a pair of oppositely-directed Unidirectional RNNs (or equivalently a Bidirectional RNN),
$$P(y_{[1:T]}) \propto P(y_{[1:t-1]}) \, P(y_t \mid y_{[1:t-1]}) \, P(y_t \mid y_{[t+1:T]}) \, P(y_{[t+1:T]}). \tag{6}$$
Note that the two central conditional terms are proportional to the output of an equivalent softmax Bidirectional RNN as discussed in the previous section.
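For readers who want the intermediate steps, the approximation of the future term under the two stated assumptions can be worked out as follows. The notation $y_{[a:b]}$ for the subsequence $(y_a, \ldots, y_b)$ is assumed here; the first line uses the first-order Markov assumption around $y_t$, the second applies Bayes' rule, and the third drops $P(y_t)$ because it is assumed uniform.

```latex
\begin{align*}
P(y_{[t+1:T]} \mid y_{[1:t]})
  &= P(y_{[t+1:T]} \mid y_t)
     && \text{(future independent of past given } y_t\text{)} \\
  &= \frac{P(y_t \mid y_{[t+1:T]})\, P(y_{[t+1:T]})}{P(y_t)}
     && \text{(Bayes' rule)} \\
  &\propto P(y_t \mid y_{[t+1:T]})\, P(y_{[t+1:T]})
     && \text{(uniform } P(y_t)\text{)}
\end{align*}
```

Multiplying this by the two exactly computable terms $P(y_{[1:t-1]})$ and $P(y_t \mid y_{[1:t-1]})$ gives the four-term approximate joint, in which only the two central conditionals depend on the choice of $y_t$ when the rest of the sequence is held fixed.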
Coordinate Descent. Given some initial sequence $y_{[1:T]}$, a simple coordinate descent algorithm could select a random time $t$ and update $y_t$ such that this approximate joint is maximized, repeating until convergence. Computing Eq. 6 would require feeding $y_{[1:t-1]}$ to the forward RNN and $y_{[t+1:T]}$ to the backward RNN. Therefore, updating all $T$ outputs $K$ times in this approach would require $O(KT^2)$ RNN steps (combined from both the forward and backward models). If we instead follow an alternating left-to-right then right-to-left update order, this can be reduced to $O(KT)$ by reusing cached log probabilities from the previous direction. This algorithm resembles a beam search with $B = 1$ which bases extensions on the value of Eq. 6.
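The alternating single-token update can be sketched as below. This is a toy illustration under strong assumptions: the forward and backward models are hypothetical bigram tables (`FWD`, `BWD`) rather than RNNs, so each conditional depends only on the neighboring token, and the blank positions must have a known token on both sides. Since only the two central terms of the approximate joint depend on the token being updated, each update maximizes their sum in log space.

```python
import math

def make_lp(table, floor=0.01):
    """Build a log-probability lookup with a small floor for unseen pairs (assumed smoothing)."""
    return lambda context, token: math.log(table.get((context, token), floor))

def coordinate_ascent_fill(seq, blank, vocab, fwd_lp, bwd_lp, rounds=2):
    """Alternate left-to-right and right-to-left single-token updates (beam width 1)."""
    seq = list(seq)
    for _ in range(rounds):
        for order in (blank, reversed(blank)):
            for t in order:
                # log P(y_t | left neighbor) + log P(y_t | right neighbor)
                seq[t] = max(vocab, key=lambda v: fwd_lp(seq[t - 1], v) + bwd_lp(seq[t + 1], v))
    return seq

# Hypothetical models: the forward model alone slightly prefers "dog" after "the",
# but the backward model strongly prefers "train" before "was".
FWD = {("the", "dog"): 0.5, ("the", "train"): 0.4}
BWD = {("was", "train"): 0.6, ("was", "dog"): 0.1}
fwd_lp, bwd_lp = make_lp(FWD), make_lp(BWD)

forward_only = max(["dog", "train"], key=lambda v: fwd_lp("the", v))
filled = coordinate_ascent_fill(["the", "dog", "was"], blank=[1], vocab=["dog", "train"],
                                fwd_lp=fwd_lp, bwd_lp=bwd_lp)
```

In this toy case the forward model alone picks "dog", while the bidirectional update selects "train", because the product of the two central terms is 0.24 for "train" versus 0.05 for "dog".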
Bidirectional Beam Search. Finally, we arrive at our full Bidirectional Beam Search (BiBS) algorithm by generalizing the simple algorithm outlined above to maintain multiple beams during each update pass. Given some set of initial sequences (perhaps from a left-to-right beam search), we alternate between forward (left-to-right) and backward (right-to-left) beam searches with respect to the approximate joint. We consider a pair of forward and backward updates a single round of BiBS.
Without loss of generality, we will describe a forward update pass of beam width $B$. At each time $t$, we have updated the first $t-1$ tokens of each beam such that we have partial forward sequences $\overrightarrow{Y}_{[1:t-1]}$, and the values $y_{[t:T]}$ have yet to be updated. To update the forward beams, we consider all possible connections between the current left-to-right beams $\overrightarrow{Y}_{[1:t-1]}$ and the right-to-left beams $\overleftarrow{Y}_{[t+1:T]}$ (held fixed from the previous round) through any token in the dictionary $\mathcal{Y}$. Our search space is then $\mathcal{Y}_t = \overrightarrow{Y}_{[1:t-1]} \times \mathcal{Y} \times \overleftarrow{Y}_{[t+1:T]}$ and $|\mathcal{Y}_t| = B^2 |\mathcal{Y}|$.
Figure 3 shows an example left-to-right update step for image captioning as well as the precise update rule based on Eq. 6 for this time step. For each combination of forward beam and backward beam, this objective can be computed easily from the stored sum of log probabilities of each beam and the conditional outputs of the forward and backward RNNs. Like standard Beam Search, the optimal extensions can be found exactly by sorting these values for all possible combinations. Our approach requires only $O(BKT)$ RNN steps to perform $K$ rounds of updates. Our approach is summarized in Alg. 1.
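A single forward update step of this kind can be sketched as follows. This is a hedged illustration, not the authors' Alg. 1: the function name `bibs_forward_step`, the toy lookup tables, and in particular the choice to score each forward extension by its best connection to any fixed backward beam (so that the returned forward beams contain no duplicates) are assumptions made for this sketch.

```python
import heapq
import math

def bibs_forward_step(fwd_beams, bwd_beams, fwd_cond, bwd_cond, vocab, beam_width):
    """One left-to-right update at a single time step t.

    fwd_beams: list of (log P(prefix), prefix) for updated prefixes y_[1:t-1]
    bwd_beams: list of (log P(suffix), suffix) for fixed suffixes y_[t+1:T]
    fwd_cond(prefix, v): log P(y_t = v | prefix) under the forward model
    bwd_cond(suffix, v): log P(y_t = v | suffix) under the backward model
    Returns up to beam_width tuples (joint score, forward score, new prefix).
    """
    candidates = []
    for f_score, prefix in fwd_beams:
        for v in vocab:
            new_f = f_score + fwd_cond(prefix, v)
            # Best connection through token v to any fixed backward beam (assumed tie-handling).
            link = max(bwd_cond(suffix, v) + b_score for b_score, suffix in bwd_beams)
            candidates.append((new_f + link, new_f, prefix + (v,)))
    return heapq.nlargest(beam_width, candidates, key=lambda c: c[0])

# Hypothetical toy models keyed on the neighboring token only.
F = {("on", "with"): 0.6, ("on", "skis"): 0.4}
Bk = {("how", "with"): 0.1, ("how", "skis"): 0.1,
      ("showing", "with"): 0.05, ("showing", "skis"): 0.8}
fwd_cond = lambda prefix, v: math.log(F[(prefix[-1], v)])
bwd_cond = lambda suffix, v: math.log(Bk[(suffix[0], v)])

fwd_beams = [(math.log(0.5), ("man", "on"))]
bwd_beams = [(math.log(0.4), ("how", "to")), (math.log(0.2), ("showing", "how"))]
result = bibs_forward_step(fwd_beams, bwd_beams, fwd_cond, bwd_cond, ["with", "skis"], 1)
```

In this toy example the forward model alone prefers "with", but connecting through the fixed backward beam ("showing", "how") makes "skis" the top extension. The returned forward score is what would be carried into the next time step of the pass.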
5 Experiments
Qualitative examples: a) Ground Truth, b) URNN-f, c) URNN-b, d) BiRNN-BiBS.

| Example | a) Ground Truth | b) URNN-f | c) URNN-b | d) BiRNN-BiBS |
|---|---|---|---|---|
| 1 | The woman has many bananas and other fruit at her stand | The woman has a bunch of bananas on at her stand | The woman has holding a bunch of bananas at her stand | The woman has a large bunch of bananas at her stand |
| 2 | A man is skateboarding on a ramp in a basement | A man riding a skateboard up the in a basement | A man a trick on a skateboard in a basement | A man doing tricks on a skateboard in a basement |
| 3 | A number of small planes behind a fence | A number of small planes on a fence | A number plane is parked near a fence | A number of planes parked near a fence |
| 4 | A black and yellow bird with a colorful beak | A black and yellow bird sitting a colorful beak | A black a yellow bird with a colorful beak | A black and yellow bird with in a basement |
| 5 | A group of people standing on top of a snow covered slope | A group of people on skis on a snowy snow covered slope | A group of riding skis on top of a snow covered slope | A group of people standing on top of a snow covered slope |
| 6 | A row of transit buses sitting in a parking lot | A row of buses parked in a a parking lot | A row of double decker buses parked a parking lot | A row of red buses parked in a parking lot |
| 7 | The person is riding the waves in the water | The person is is riding a wave in the water | The person is person on a surfboard in the water | The person is is on a surfboard in in the water |
| 8 | Two people riding a motorcycle to the beach | Two people on a motorcycle on the beach | Two people riding a motorcycle on the beach | Two people on a motorcycle on the beach |
In this section, we evaluate the effectiveness of our proposed Bidirectional Beam Search (BiBS) algorithm for inference in BiRNNs. To examine the performance of bidirectional inference, we evaluate on tasks that require the generated sequence to fit well with existing structures both in the past and the future. We choose fill-in-the-blank style tasks where a number of tokens have been removed from a sequence and must be reconstructed. Specifically, we evaluate on fill-in-the-blank tasks on image captioning for the Common Objects in Context (COCO) [22] dataset and descriptions from the Visual Madlibs [36] dataset.
Baselines. We compare our approach, which we denote BiRNN-BiBS, against several baseline approaches:

URNN-f: runs BS on a forward LSTM to produce $B$ output beams (ranked by their probabilities under the forward LSTM),

URNN-b: runs BS on a backward LSTM to produce $B$ output beams (ranked by their probabilities under the backward LSTM),

URNN-f+b: runs BS on forward and backward LSTMs to produce $2B$ output beams (ranked by the maximum of the probabilities assigned by the forward and backward LSTMs). This is the method used by Wang et al. [35].

BiRNN-f+b: runs BS on two LSTMs (forward and backward) to produce $2B$ output beams (ranked by the sum of the log probabilities assigned by the forward and backward LSTMs). This lacks formal justification, but we find it to be a reasonable heuristic for this task.

GSN (Ordered): samples tokens from the BiRNN for each time step. We found that randomly selecting the time step as in [6] resulted in poor performance on our tasks, and instead perform updates in an alternating left-to-right / right-to-left order. For fairness, we compare at the same number of updates as our method, and all sampled sequences are reranked based on log probability.
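The two reranking rules used by the f+b baselines differ only in how the forward and backward scores are combined. The sketch below contrasts them on two made-up candidates; the dictionary layout and the function names `rerank_max` and `rerank_sum` are hypothetical. Ranking by the maximum log-probability gives the same order as ranking by the maximum probability, since the logarithm is monotonic.

```python
def rerank_max(candidates):
    """URNN-f+b style: rank by the larger of the forward and backward log-probabilities."""
    return sorted(candidates, key=lambda c: max(c["fwd"], c["bwd"]), reverse=True)

def rerank_sum(candidates):
    """BiRNN-f+b style: rank by the sum of the forward and backward log-probabilities."""
    return sorted(candidates, key=lambda c: c["fwd"] + c["bwd"], reverse=True)

# Illustrative log-probabilities: candidate A looks good to only one direction,
# candidate B is moderately likely under both.
candidates = [
    {"text": "A", "fwd": -1.0, "bwd": -5.0},
    {"text": "B", "fwd": -2.0, "bwd": -2.5},
]
by_max = [c["text"] for c in rerank_max(candidates)]
by_sum = [c["text"] for c in rerank_sum(candidates)]
```

Here the max rule ranks A first because one direction likes it, while the sum rule ranks B first because both directions find it acceptable.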
All baselines perform inference on the same trained model, which we train using neuraltalk2 [21] with standard maximum-likelihood training over complete human captions.
Evaluation. For all models, we evaluate only the top beam from the sorted list returned by the algorithm. We compare methods on standard sentence-level metrics, namely CIDEr [31], Meteor [3], and Bleu [26], computed between the ground truth captions and the (full) reconstructed sentences. We note that the metrics are computed over the entire sentence (and not just the blank region) in order to capture the quality of the alignment of the generated text with the existing sentence structure. As a side effect, the absolute magnitude of these metrics is inflated due to the correctness of the context words, so we focus on the relative performance.
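Table 1: Fill-in-the-blank image captioning results on COCO for known (upper half) and unknown (lower half) blank lengths, with 25%, 50%, or 75% of the caption removed.

| Setting | Method | CIDEr (25%) | Bleu-4 (25%) | Meteor (25%) | CIDEr (50%) | Bleu-4 (50%) | Meteor (50%) | CIDEr (75%) | Bleu-4 (75%) | Meteor (75%) |
|---|---|---|---|---|---|---|---|---|---|---|
| Known Length | URNN-f | 6.54 | 0.661 | 0.488 | 3.744 | 0.345 | 0.350 | 1.927 | 0.143 | 0.238 |
| Known Length | URNN-b | 6.58 | 0.668 | 0.491 | 3.931 | 0.372 | 0.356 | 2.476 | 0.219 | 0.259 |
| Known Length | URNN-f+b [35] | 6.98 | 0.709 | 0.510 | 4.15 | 0.398 | 0.367 | 2.40 | 0.209 | 0.257 |
| Known Length | BiRNN-f+b | 6.94 | 0.705 | 0.508 | 3.99 | 0.385 | 0.361 | 2.24 | 0.201 | 0.252 |
| Known Length | GSN [6] (Ordered) | 6.90 | 0.701 | 0.507 | 3.63 | 0.337 | 0.334 | 1.876 | 0.135 | 0.232 |
| Known Length | BiRNN-BiBS (ours) | 7.12 | 0.720 | 0.517 | 4.26 | 0.408 | 0.368 | 2.57 | 0.228 | 0.265 |
| Unknown Length | URNN-f | 5.607 | 0.569 | 0.440 | 4.232 | 0.432 | 0.370 | 2.594 | 0.268 | 0.269 |
| Unknown Length | URNN-b | 5.514 | 0.561 | 0.436 | 4.151 | 0.424 | 0.367 | 2.909 | 0.303 | 0.285 |
| Unknown Length | URNN-f+b [35] | 5.632 | 0.570 | 0.440 | 4.377 | 0.451 | 0.376 | 2.924 | 0.306 | 0.287 |
| Unknown Length | BiRNN-f+b | 5.640 | 0.588 | 0.452 | 4.380 | 0.453 | 0.378 | 2.930 | 0.305 | 0.303 |
| Unknown Length | GSN [6] (Ordered) | 5.725 | 0.589 | 0.447 | 3.591 | 0.413 | 0.357 | 2.456 | 0.257 | 0.261 |
| Unknown Length | BiRNN-BiBS (ours) | 5.935 | 0.614 | 0.460 | 4.40 | 0.454 | 0.380 | 2.936 | 0.305 | 0.288 |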
5.1 Fill-in-the-Blank Image Captioning on COCO
The COCO [22] dataset contains over 120,000 images, each with a rich set of annotations. These include five captions describing the content of each image, collected from Amazon Mechanical Turk workers. We propose a novel fill-in-the-blank image captioning task based on this data. Given an image and a corresponding ground truth caption from the dataset, we remove a sequential portion of the caption such that we are left with a prefix and a suffix consisting of the remaining words on either side of the blank. Using the image and the context from these remaining words, the goal is to generate the missing tokens. This is a challenging task that explores how well models and inference algorithms reason about the past and future during sequence generation. We first consider the known blank length setting (where the inference algorithm knows the blank length) and then generalize to the unknown blank length setting.
Known Blank Length. In this experiment, we remove 25%, 50%, or 75% of the words from the middle of a caption for each image and task the model with generating the lost content. For example, at 50% the caption “A close up of flowers and plants inside of a bowl” would appear to the system as the blanked caption “A close __ __ ______ ___ ______ ______ of a bowl” and the generation task would then be to reproduce the removed subsequence of words “up of flowers and plants inside.”
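The blanking procedure can be sketched as a small helper that removes a centered run of words. The exact rounding and centering rule is not stated in the text; the choice below (round the number of removed words up and round the start index down) is an assumption that happens to reproduce the 50% example above, and the function name `blank_middle` is hypothetical.

```python
import math

def blank_middle(caption, fraction):
    """Split a caption into (prefix, removed, suffix) by removing a centered run of words."""
    words = caption.split()
    n_removed = math.ceil(fraction * len(words))  # assumed rounding rule
    start = (len(words) - n_removed) // 2         # assumed centering rule
    end = start + n_removed
    return words[:start], words[start:end], words[end:]

prefix, removed, suffix = blank_middle(
    "A close up of flowers and plants inside of a bowl", 0.5
)
```

For this 11-word caption, six words are removed, leaving the two-word prefix "A close" and the three-word suffix "of a bowl" as context.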
As we are interested in bidirectional inference (not learning), we train our models on the original COCO image captioning task (we do not explicitly train to fill blanked captions). Like [21], we use 5000 images for test, 5000 images for validation, and the rest for training. We evaluate on a single random caption per image in the test set.
The upper half of Table 1 reports the performance of our approach (BiBS) on this fill-in-the-blank inference task for differently sized blanks. We run GSN and BiBS for four full forward / backward passes of updates. Generally we find that bidirectional methods outperform unidirectional ones on this task. We find that BiBS outperforms all baselines on all metrics. We note that the nearest baselines in performance (URNN-f+b, BiRNN-f+b) are reranked from $2B$ beams. While BiBS operates in an alternating left-to-right and right-to-left fashion, it only ever maintains $B$ beams.
Interestingly, the backward-time URNN-b model consistently outperforms the forward-time model URNN-f on all metrics and across all sizes of blanks. This may be due to the way the dataset was collected. When tasked with describing the content of an image, people often begin by grounding their sentences with respect to specific entities visible in the image (especially when humans are depicted). Given this, we would expect many more sentences to begin with similar words, such that generating the beginning of a sentence from the end would be an easier task.
Figure 4 shows several qualitative examples, comparing completed captions from URNN-f, URNN-b, and our BiRNN-BiBS method with ground truth human annotations. The unidirectional models running standard BS typically generate sentences that abruptly clash with existing words at the edge of the blank. For example, in the top-left instance, the forward model produces the grammatically incorrect phrase “bananas on at her stand” and similarly the backward model outputs “The woman has holding a bunch”. This behavior is a natural consequence of the inability of these models to efficiently reason about the past and future simultaneously. While these unidirectional models struggle to reason about word transitions on either side of the blank, our BiRNN-based BiBS algorithm typically produces reconstructions that smoothly fit with the context, producing a reasonable sentence “The woman has a large bunch of bananas at her stand.”
| Stage | Example 1 | Example 2 | Example 3 |
|---|---|---|---|
| (GT) | Some baseball players are playing a game | a woman in a red shirt is holding a blue and orange kite | this woman is sitting in front of a restaurant smoking a cigarette |
| (Init) | Some baseball baseball bat during a game | a woman in a man is holding a blue and orange kite | this woman is is working on a laptop and smoking a cigarette |
| (1st) | Some baseball baseball players in a game | a woman in a a hat is holding a blue and orange kite | this woman is sitting down on a laptop while smoking a cigarette |
| (2nd) | Some baseball players playing in a game | a woman in a red shirt is holding a blue and orange kite | this woman is sitting down in a chair and smoking a cigarette |
This example also highlights a possible deficiency in our evaluation metrics; while a human observer can clearly tell which of the three sentences is most natural, the sentence-level statistics of each are quite similar, with each sharing only the word banana with the ground truth caption “The woman has many bananas and other fruit at her stand.” Evaluating generated language is a difficult and open problem that is further complicated in the fill-in-the-blank task.
BiBS Convergence. To study the convergence of our approach, we consider the true joint probability of filled-in captions as a function of the number of rounds of BiBS. We compute the average of these joint log probabilities after each meta-iteration of BiBS, where we define a meta-iteration as a pair of full forward and backward update passes. We find that joint log probabilities drop quickly (reducing from 2.47 to 2.11 in a single meta-iteration), indicating high quality solutions are found from unidirectional initializations within only a few meta-iterations of BiBS. In practice, we find the beams typically converge in 1 to 2 meta-iterations for fill-in-the-blank image captioning. Figure 5 shows how predicted sentence completions change over meta-iterations of BiBS for three example images.
Unknown Length Blanks. While our method is designed for known blank lengths, we can apply BiBS as a black-box inference algorithm over a range of blank lengths and then rank these solutions. We calibrate which lengths to search over by first generating the top-1 left-to-right beam, conditioning only on the words to the left of the blank, and the top-1 right-to-left beam, conditioning only on the words to the right of the blank. We then define the range of candidate blank lengths from the lengths of these two beams. We perform inference at each length in this range and select the highest probability completion across all lengths. The lower half of Table 1 reports the results. We find that BiBS outperforms nearly all baselines on all metrics (narrowly being bested by URNN-f+b at Bleu-4). All methods perform worse at the unknown length task than when the blank length is known, due largely to the difficulty of comparing sequences of differing lengths.
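Treating fixed-length inference as a black box, the search over blank lengths can be sketched as a simple loop. The helper name `best_over_lengths`, the `infer` callback signature, and the toy scores are hypothetical; how the length range is derived from the two unidirectional beams is left to the caller here.

```python
def best_over_lengths(infer, min_len, max_len):
    """Run fixed-length inference at each blank length and keep the most probable completion.

    infer(n) is assumed to return (log_probability, tokens) for a blank of length n.
    """
    results = [infer(n) for n in range(min_len, max_len + 1)]
    return max(results, key=lambda r: r[0])

# Toy stand-in for a fixed-length inference routine (illustrative scores only).
TOY_RESULTS = {
    2: (-4.1, ["on", "skis"]),
    3: (-3.2, ["skis", "showing", "child"]),
    4: (-3.9, ["skis", "showing", "a", "child"]),
}
score, tokens = best_over_lengths(lambda n: TOY_RESULTS[n], 2, 4)
```

Because longer sequences accumulate more log-probability terms, comparing raw scores across lengths is itself a heuristic, which is consistent with the difficulty noted above.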
5.2 Visual Madlibs
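Table 2: Fill-in-the-blank results on Visual Madlibs for known (top) and unknown (bottom) blank lengths.

| Setting | Method | Bleu-1 (type 7) | Bleu-2 (type 7) | Bleu-1 (type 12) | Bleu-2 (type 12) |
|---|---|---|---|---|---|
| Known Length | URNN-f | 0.313 | 0.138 | 0.275 | 0.160 |
| Known Length | URNN-b | 0.460 | 0.284 | 0.346 | 0.213 |
| Known Length | URNN-f+b [35] | 0.447 | 0.275 | 0.347 | 0.214 |
| Known Length | BiRNN-f+b | 0.448 | 0.275 | 0.347 | 0.213 |
| Known Length | GSN [6] (Ordered) | 0.427 | 0.28 | 0.148 | 0.099 |
| Known Length | BiRNN-BiBS (ours) | 0.470 | 0.389 | 0.353 | 0.216 |
| Known Length | nCCA | 0.56 | 0.1 | 0.46 | 0.07 |
| Known Length | nCCA (box) | 0.60 | 0.11 | 0.48 | 0.08 |
| Unknown Length | URNN-f | 0.317 | 0.155 | 0.285 | 0.174 |
| Unknown Length | URNN-b | 0.334 | 0.184 | 0.309 | 0.186 |
| Unknown Length | URNN-f+b | 0.334 | 0.181 | 0.302 | 0.184 |
| Unknown Length | BiRNN-f+b | 0.343 | 0.195 | 0.291 | 0.190 |
| Unknown Length | GSN [6] (Ordered) | 0.348 | 0.203 | 0.270 | 0.184 |
| Unknown Length | BiRNN-BiBS | 0.351 | 0.197 | 0.31 | 0.190 |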
In this section, we evaluate our approach on the Visual Madlibs[36] fillintheblank description generation task. The Visual Madlibs dataset contains 10,738 images with 12 types of fillintheblank questions answered by 3 workers on Amazon Mechanical Turk. We use object’s affordance (type 7) and pair’s relationship (type 12) fillintheblank questions as these types have blanks in the middle of questions. For instance, People could relax on the couches. and The person is putting food in the bowl. We use 2000 images for validation, train on the remaining images from the train set, and evaluate on their 2,160 image test set. To the best of our knowledge, ours is the first paper to explore the performance of CNN+LSTM text generation for this task.
We compare to two additional baselines for these experiments, nCCA[16] and the nCCA(box) method implemented in the Visual Madlibs paper [36]. nCCA maps image and text to a jointembedding space and then finds the nearest neighbor from the training set to this embedded point. We note that this is a retrieval and not a description generation technique such that it cannot be directly compared with BiBS and report it only for the sake of completeness. nCCA(box) operates similarly to nCCA, but extracts visual features from the groundtruth bounding box of the relevant person or object referred to in the question and thus is an ‘oracle’ result that makes use of ground truth.
We again use the neuraltalk2 [21] framework to train a CNN+LSTM model for both the object’s affordance (type 7) and pair’s relationship (type 12) question types. We evaluate on the test data using Bleu-1 and Bleu-2 (to be consistent with [36]). Table 2 shows the results of this experiment for the known blank length (top) and unknown blank length (bottom) settings. For known blank length, we find that BiBS outperforms the other generation-based baselines on both question types and is competitive with the retrieval-based nCCA techniques, greatly outperforming the nCCA retrieval and nCCA(box) oracle methods on Bleu-2. For unknown blank lengths, BiBS similarly outperforms the baselines, but is narrowly bested by GSN [6] (Ordered) for type 7 questions under the Bleu-2 metric. We note that the nCCA techniques have not been evaluated in this setting.
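For readers unfamiliar with the metric, the following is a minimal sentence-level sketch of Bleu-N as defined in [26]: clipped n-gram precisions for n = 1..N, combined by a geometric mean and multiplied by a brevity penalty. It is an illustration only and not the evaluation script used for Table 2. Whitespace tokenization, the absence of smoothing, and sentence-level rather than corpus-level aggregation are all assumptions, and the function names are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of a candidate against several references."""
    cand = ngram_counts(candidate, n)
    if not cand:
        return 0.0
    # For each n-gram, the largest count seen in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in ngram_counts(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return clipped / sum(cand.values())


def bleu(candidate, references, max_n):
    """Sentence-level Bleu-max_n with uniform weights and no smoothing."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    # Reference length closest to the candidate length (ties go to the shorter).
    ref_len = min((abs(len(r) - len(candidate)), len(r)) for r in references)[1]
    if len(candidate) > ref_len:
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1.0 - ref_len / len(candidate))
    return brevity_penalty * math.exp(sum(math.log(p) for p in precisions) / max_n)


if __name__ == "__main__":
    # Three hypothetical worker answers for one blank.
    refs = [s.split() for s in ["putting food in", "placing food into", "pouring food in"]]
    hyp = "putting the food".split()
    print(bleu(hyp, refs, 1), bleu(hyp, refs, 2))
```

Without smoothing, a candidate that shares no bigram with any reference scores zero on Bleu-2, which is one reason Bleu-2 values tend to be much lower than Bleu-1 for short blanks.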
6 Conclusions
In summary, we presented the first approximate inference algorithm for 1-Best (and M-Best) decoding in bidirectional neural sequence models (RNNs, LSTMs, GRUs, etc.). We study our method in the context of a novel fill-in-the-blank image captioning task, which evaluates how well sequence generation models and their associated inference algorithms incorporate known information from both the past and the future to ‘fill in the gaps’. This is a challenging setting, and we demonstrate that standard Beam Search is poorly suited to it. We develop a Bidirectional Beam Search (BiBS) algorithm based on an approximation of the full joint distribution over output sequences that is efficient to compute in Bidirectional Recurrent Neural Network models. To the best of our knowledge, this is the first algorithm for top-B MAP inference in Bidirectional RNNs. We have demonstrated that BiBS outperforms natural baselines at both fill-in-the-blank image captioning and Visual Madlibs. Future work involves generalizing these ideas to tree-structured or more general recursive neural networks [30], and producing diverse M-Best sequences [5, 32].
Acknowledgements. We thank Rama Vedantam for initial brainstorming. This work was funded in part by the following awards to DB: NSF CAREER, ONR YIP, ONR Grant N000141410679, ARO YIP, and NVIDIA GPU donations. SL was supported in part by the Bradley Postdoctoral Fellowship. Conclusions contained herein are those of the authors and are not to be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the U.S. Government, or any sponsor.
References
 [1] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. Vqa: Visual question answering. In ICCV, 2015.
 [2] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
 [3] S. Banerjee and A. Lavie. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005.
 [4] D. Batra. An Efficient Message-Passing Algorithm for the M-Best MAP Problem. In UAI, 2012.
 [5] D. Batra, P. Yadollahpour, A. Guzman-Rivera, and G. Shakhnarovich. Diverse M-Best Solutions in Markov Random Fields. In ECCV, 2012.
 [6] M. Berglund, T. Raiko, M. Honkala, L. Karkkainen, A. Vetek, and J. Karhunen. Bidirectional recurrent neural networks as generative models. In NIPS, 2015.
 [7] X. Chen and C. L. Zitnick. Mind’s Eye: A Recurrent Visual Representation for Image Caption Generation. In CVPR, 2015.
 [8] K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio. On the properties of neural machine translation: Encoder-decoder approaches. In Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation (SSST-8), 2014.
 [9] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
 [10] G. E. Dahl, D. Yu, L. Deng, and A. Acero. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. Audio, Speech, and Language Processing, 20(1):30–42, 2012.
 [11] A. Das, S. Kottur, K. Gupta, A. Singh, D. Yadav, J. Moura, D. Parikh, and D. Batra. Visual dialog. CVPR, 2016.
 [12] H. de Vries, F. Strub, S. Chandar, O. Pietquin, H. Larochelle, and A. C. Courville. Guesswhat?! visual object discovery through multimodal dialogue. CVPR, 2017.
 [13] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term Recurrent Convolutional Networks for Visual Recognition and Description. In CVPR, 2015.
 [14] H. Fang, S. Gupta, F. N. Iandola, R. K. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. C. Platt, C. L. Zitnick, and G. Zweig. From Captions to Visual Concepts and Back. In CVPR, 2015.
 [15] D. Geman, S. Geman, N. Hallonquist, and L. Younes. A Visual Turing Test for Computer Vision Systems. In PNAS, 2014.
 [16] Y. Gong, Q. Ke, M. Isard, and S. Lazebnik. A multi-view embedding space for modeling internet images, tags, and their semantics. International Journal of Computer Vision, 106(2):210–233, 2014.
 [17] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. Signal Processing Magazine, IEEE, 29(6):82–97, 2012.
 [18] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8), 1997.
 [19] Z. Huang, W. Xu, and K. Yu. Bidirectional lstm-crf models for sequence tagging. CoRR, 2015.
 [20] N. Kalchbrenner and P. Blunsom. Recurrent continuous translation models. In EMNLP, 2013.
 [21] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In CVPR, 2015.
 [22] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick. Microsoft COCO: Common objects in context, 2014.
 [23] J. Lu, C. Xiong, D. Parikh, and R. Socher. Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning. CVPR, 2017.
 [24] M. Malinowski and M. Fritz. A Multi-World Approach to Question Answering about Real-World Scenes based on Uncertain Input. In NIPS, 2014.
 [25] D. Nilsson. An efficient algorithm for finding the m most probable configurations in probabilistic expert systems. Statistics and Computing, 8:159–173, 1998. 10.1023/A:1008990218483.
 [26] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318, 2002.
 [27] M. Ren, R. Kiros, and R. Zemel. Exploring models and data for image question answering. In NIPS, 2015.
 [28] H. Sak, A. Senior, K. Rao, and F. Beaufays. Fast and accurate recurrent neural network acoustic models for speech recognition. arXiv preprint arXiv:1507.06947, 2015.
 [29] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 [30] R. Socher, C. C. Lin, C. Manning, and A. Y. Ng. Parsing natural scenes and natural language with recursive neural networks. In ICML, 2011.
 [31] R. Vedantam, C. Lawrence Zitnick, and D. Parikh. Cider: Consensus-based image description evaluation. In CVPR, pages 4566–4575, 2015.
 [32] A. K. Vijayakumar, M. Cogswell, R. R. Selvaraju, Q. Sun, S. Lee, D. J. Crandall, and D. Batra. Diverse beam search: Decoding diverse solutions from neural sequence models. arXiv, abs/1610.02424, 2016.
 [33] O. Vinyals and Q. V. Le. A neural conversational model. http://arxiv.org/pdf/1506.05869v3.pdf, 2015.
 [34] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In CVPR, 2015.
 [35] C. Wang, H. Yang, C. Bartz, and C. Meinel. Image captioning with deep bidirectional lstms. CoRR, 2016.
 [36] L. Yu, E. Park, A. C. Berg, and T. L. Berg. Visual Madlibs: Fill in the blank Description Generation and Question Answering. ICCV, 2015.