Deep Bayesian Active Learning for Multiple Correct Outputs

12/02/2019 ∙ by Khaled Jedoui, et al. ∙ Stanford University

Typical active learning strategies are designed for tasks, such as classification, with the assumption that the output space is mutually exclusive. The assumption that these tasks always have exactly one correct answer has resulted in the creation of numerous uncertainty-based measurements, such as entropy and least confidence, which operate over a model's outputs. Unfortunately, many real-world vision tasks, like visual question answering and image captioning, have multiple correct answers, causing these measurements to overestimate uncertainty and sometimes perform worse than a random sampling baseline. In this paper, we propose a new paradigm that estimates uncertainty in the model's internal hidden space instead of the model's output space. We specifically study a manifestation of this problem for visual question answer generation (VQA), where the aim is not to classify the correct answer but to produce a natural language answer, given an image and a question. Our method overcomes the paraphrastic nature of language. It requires a semantic space that structures the model's output concepts and that enables the usage of techniques like dropout-based Bayesian uncertainty. We build a visual-semantic space that embeds paraphrases close together for any existing VQA model. We empirically show state-of-the-art active learning results on the task of VQA on two datasets, being 5 times more cost-efficient on Visual Genome and 3 times more cost-efficient on VQA 2.0.


1 Introduction

Figure 1: Many vision tasks, such as image captioning and visual question answering, have multiple correct answers for the same input. In this example, multiple correct answers exist: “Glasses”, “A pair of Glasses”, etc. Existing active learning strategies overestimate uncertainty as they fail to account for paraphrases. We propose a new uncertainty estimation framework built over a visual-semantic embedding space that extends Monte-Carlo dropout-based Bayesian uncertainty for VQA and demonstrate that it is more sample efficient across multiple datasets.

Active Learning is an approach for data annotation in supervised learning problems that selectively seeks labels for informative examples [tong2001support, settles2012active]. It has proven to be a promising solution [tong2001support] for maximizing learning while reducing the data annotation cost for many problems, including vision-language generation tasks like Visual Question Answering (VQA) [misra2018learning, deng2018adversarial, lin2017active] and image captioning [shen2019learning] that rely on large, expensive, curated datasets [antol2015vqa, krishna2017visual, zhu2016visual7w, gurari2018vizwiz]. Nonetheless, applying existing active learning strategies to language generation tasks does not yield a performance increase over randomly annotating examples [figueroa2012active, mairesse2010phrase, settles2008active].

Traditionally designed for classification and regression [settles2012active], popular active learning strategies use some notion of model-output-based uncertainty to select informative data [joshi2009multi, abramson2004active, collins2008towards, gal2017deep, kendall2017uncertainties]. There are two main challenges with existing approaches when they are adapted for vision-language tasks. First, they assume that the model's output space is a set of mutually exclusive labels, i.e., that no two output categories can be associated with any given input. Unfortunately, this assumption is invalid for numerous vision-language tasks, such as image captioning and VQA, as one image or image-question pair can have multiple correct captions or answers [gurari2018vizwiz, bhattacharya2019does, yang2018visual]. This paraphrastic nature of language causes these measures to fail. Figure 1 illustrates an instance of this phenomenon, where the question “what is the dog wearing?” can be answered by “Glasses”, “Spectacles”, “Eye-wear” and countless other paraphrases. Measures like entropy [shen2017deep], margin [culotta2005reducing], or least confidence [shen2017deep] confound multiple correct answers with high uncertainty in the model’s output.

Second, extending existing uncertainty functions to language generation models quickly renders them computationally intractable. Consider a VQA task with a model that can generate answers up to a length of L, with each word belonging to a vocabulary of size |V|. The output space is of the order of |V|^L, so measuring uncertainty over model outputs grows exponentially with the vocabulary size. To circumvent intractability, Monte-Carlo simulations or similar approximations are often used to reduce the output space [settles2012active, Settles:2008:AAL:1613715.1613855]. However, there is no clear extension of such approximations for discrete tokens, like the words in an answer or a caption.

In this paper, we propose a novel active learning uncertainty sampling paradigm that estimates uncertainty in an embedding space instead of an output probability space. Our proposed method enables the use of dropout-based Bayesian Monte-Carlo uncertainty estimation [gal2016dropout] for sequence generation. We claim that our strategy is effective, even in the presence of multiple correct outputs with discrete words. We showcase the efficacy of our approach in a VQA setting, as it represents a natural manifestation of the phenomenon of having multiple correct candidates for a given input, but insist that our solution is extendable to any language generation task, provided that a semantic space can be created. Instead of immediately generating answers, we impose a semantic structure on the hidden representation of existing VQA models. Our semantic space is designed to embed paraphrases close together. It allows us to abstract away from the discrete tokens generated by VQA models and estimate uncertainty in the embedding space instead.

We show the utility of our uncertainty sampling strategy for VQA generation on the Visual Genome v1.4 [krishna2017visual] and the VQA v2.0 [antol2015vqa] datasets in an active learning setting. We observe that our solution achieves state-of-the-art results across multiple language generation metrics compared to existing uncertainty sampling strategies [shen2017deep, culotta2005reducing, scheffer2001active], while being 5 times more cost-efficient on Visual Genome and 3 times more cost-efficient on VQA 2.0.

2 Related Work

We explore the field of active learning and comment on how existing uncertainty measurements fail in environments with multiple correct outputs. We place our work in the context of VQA, as we believe it is a highly representative manifestation of these environments. Next, we dive into visual-semantic embeddings, a practical way to construct semantic spaces using visual priors. Finally, we explore Bayesian uncertainty measurement techniques.

Active learning. A typical active learning setting starts with an initial small set of labeled data and a large set of unlabeled data. Knowing that labeling data is costly and time consuming, the task is to minimize the number of samples annotated from the unlabeled set while maximizing performance on a given task that requires those labels [tong2001support]. Active learning strategies have been successfully applied to a wide range of machine learning tasks, including image recognition [joshi2009multi, sener2017active], information extraction [scheffer2001active, finn2003active, jones2003active, culotta2005reducing], named entity recognition [shen2017deep, hachey2005investigating] and text categorization [lewis1994sequential, hoi2006batch]. Active learning strategies have included uncertainty-based sampling [joshi2009multi, abramson2004active, collins2008towards] and information gain [houlsby2011bayesian]. Others have introduced a theoretical dropout-based framework to measure uncertainty [gal2017deep, kendall2017uncertainties]. Similar solutions have been adapted to NLP tasks like Named Entity Recognition [shen2017deep] and Neural Semantic Parsing by adding Gaussian noise to the network weights [dong2018confidence]. In this paper, we empirically show that previous uncertainty sampling strategies do not perform better than random sampling in a VQA setting and propose a novel strategy that uses Bayesian uncertainty in a semantically structured embedding space.

Visual question answering. VQA systems expect an input image and a natural language question and attempt to output the correct answer [antol2015vqa]. VQA has received a considerable amount of attention in recent years with the development of several datasets, proposed as benchmarks [antol2015vqa, malinowski2015ask, johnson2017clevr, goyal2017making, krishna2017visual, ren2015exploring, zhu2016visual7w], and of various models [antol2015vqa, fukui2016multimodal, lu2016hierarchical, zhou2015simple, yang2016stacked, zhu2016visual7w, jabri2016revisiting, malinowski2014multi, wu2016ask]. To tackle the task, many proposed architectures encode and then merge the visual and textual information in order to classify answers [malinowski2015ask, ma2016learning, jabri2016revisiting]. Attention mechanisms have proved to be successful in this task [shih2016look, xu2016ask, lu2016hierarchical, anderson2018bottom]. Other work focuses on designing effective multi-modal feature fusion schemes [ben2017mutan, fukui2016multimodal, kim2016hadamard]. While the performance has been encouraging, such approaches require a large amount of labelled data with a predefined, mutually exclusive set of answer categories. We explore a more realistic variant of VQA where models generate natural language answers, resulting in the generation of paraphrases.

Semantically structured embeddings. Our key insight lies in moving uncertainty estimation from the model’s output space to a semantically structured internal embedding space. Since we specifically study vision-language tasks, we build a visual-semantic space that combines both visual and textual information in order to create a unified latent representation. This problem has been studied extensively in the last few years with work that involves jointly embedding images and text, at the word level [frome2013devise, kong2014you, joulin2016learning, matuszek2012joint, klein2015associating, socher2010connecting], and at the sentence level [zitnick2013learning, karpathy2015, karpathy2014, Chen_2015_CVPR, kiros2014unifying, reed2016learning]. Visually grounded text representations have been applied to different tasks, including caption generation [kiros2014unifying, karpathy2015], image retrieval [lin2014visual], and visual question answering [malinowski2015ask]. Our approach is inspired by previous work [karpathy2014]: we use image regions with their associated captions and adapt a margin objective to build the semantic space. We use the semantic space to enable a new active learning strategy.

Uncertainty. As uncertainty estimation represents the most popular sampling strategy for active learning, we explore uncertainty estimation techniques developed for deep learning models. Even though deep learning systems are performant, they remain uninterpretable, poor at quantifying predictive uncertainty, and overconfident in their predictions. To mitigate this problem, Bayesian models have been proposed that place a prior distribution over model weights [williams1996gaussian, neal2012bayesian, mackay1992bayesian]. Yet, even though the problem is simple to formulate, deriving a posterior distribution for deep Bayesian Neural Networks (BNNs) is intractable, which makes Bayesian inference difficult [gal2016dropout]. Therefore, focus has shifted to approximating BNNs with variational inference. Bayesian modeling of stochastic processes introduced new techniques into the field, such as sampling-based and stochastic variational inference [blei2017variational, blundell2015weight].

On the other hand, Monte-Carlo Dropout [gal2017deep, gal2016dropout] has empirically demonstrated uncertainty estimation quality comparable to variational inference [gal2017deep]. A model trained with dropout can be used as a Bayesian model by making multiple predictions while sampling a different dropout mask for each forward propagation. Estimating the posterior amounts to computing the mean and variance of the predictions. While previous work has shown promise in classification and regression tasks [yang2015multi, gal2016dropout, settles2012active], such methods have not been extended to language generation. In fact, most language-based uncertainty work adapts classification-based uncertainty techniques to language tasks, transforming the problem from an open language problem to a constrained closed problem [shen2017deep, hachey2005investigating, lewis1994sequential, hoi2006batch, dong2018confidence]. We propose a new sampling strategy that extends Monte-Carlo dropout uncertainty [gal2016dropout] to measure uncertainty in a semantic space.

3 Method

We design an uncertainty measurement for active learning even when questions have multiple correct answers. In this section, we formulate our overall active learning framework, and then describe our uncertainty measurement approach. Our approach depends on two components: a semantic space that structures the model’s output representations and a denoiser that refines our representations to get an accurate dropout-based Bayesian uncertainty estimation.

3.1 Visual question answer generation

We specifically study the scenario with multiple correct outputs using the VQA task, which expects an image-question pair as input and a natural language answer as output. The goal is to train a model that, given an image and a question, generates an answer. Since different sentences can have the same meaning, we define, for each answer, the set of all sentences semantically similar to it. When evaluating the performance of the VQA model, we consider any member of this set a correct candidate for that answer. For example, in Figure 1, the question “What is the dog wearing?” can have multiple correct answer candidates, including “Glasses”, “Spectacles”, etc.

Figure 2: We build a visual-semantic embedding using a small number of image-caption pairs from Visual Genome with a contrastive loss, such that similar language descriptions are mapped close together since they refer to semantically similar image regions.
Figure 3: (a) We train existing VQA models, which almost always follow an encoder-decoder architecture, with an additional embedding loss that structures the output according to the visual-semantic space. (b) We define an embedding denoiser, by combining the VQA decoder with the pretrained visual-semantic encoder. The denoiser refines these representations to get an accurate uncertainty estimation.

Traditionally, VQA classifies the answer within a mutually exclusive set of correct answers [antol2015vqa]. However, we study a more realistic variant of VQA where our model generates the answer in natural language. So, in all our experiments, we use the popular bottom-up attention model [anderson2018bottom] and replace its classifier with a long short-term memory network (LSTM) [sundermeyer2012lstm] to generate answers instead of classifying them. Although we report results with the bottom-up attention model, our results are consistent with other existing VQA models [yang2016stacked]. Almost all VQA models follow a traditional encoder-decoder architecture, where the image-question pair is encoded into a hidden representation, which is then decoded to produce the answer.
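For concreteness, the sketch below shows the generic encoder-decoder structure assumed throughout: an encoder that fuses image and question into a single hidden vector and an LSTM decoder that generates the answer word by word. The module names, dimensions, and simple mean-pooled fusion are illustrative placeholders, not the exact bottom-up attention implementation.

```python
import torch
import torch.nn as nn

class VQAEncoder(nn.Module):
    """Encodes an (image, question) pair into a single hidden vector.
    Placeholder for any attention-based VQA encoder (e.g. bottom-up attention)."""
    def __init__(self, img_dim=2048, q_vocab=10000, hidden=512, p_drop=0.3):
        super().__init__()
        self.q_embed = nn.Embedding(q_vocab, 300)
        self.q_rnn = nn.LSTM(300, hidden, batch_first=True)
        self.img_proj = nn.Linear(img_dim, hidden)
        self.dropout = nn.Dropout(p_drop)   # kept stochastic later for MC dropout
        self.fuse = nn.Linear(2 * hidden, hidden)

    def forward(self, img_feats, question_tokens):
        _, (q_h, _) = self.q_rnn(self.q_embed(question_tokens))  # question summary
        v = self.img_proj(img_feats.mean(dim=1))                 # pooled region features
        h = torch.tanh(self.fuse(torch.cat([q_h[-1], v], dim=-1)))
        return self.dropout(h)                                   # hidden representation

class AnswerDecoder(nn.Module):
    """LSTM decoder that generates an answer word-by-word from the hidden vector."""
    def __init__(self, a_vocab=10000, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(a_vocab, 300)
        self.rnn = nn.LSTM(300, hidden, batch_first=True)
        self.out = nn.Linear(hidden, a_vocab)

    def forward(self, h, answer_tokens):
        # Teacher forcing: condition the LSTM state on h and predict the next token.
        state = (h.unsqueeze(0), torch.zeros_like(h).unsqueeze(0))
        out, _ = self.rnn(self.embed(answer_tokens), state)
        return self.out(out)                                     # per-step vocabulary logits
```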

3.2 Active learning for VQA

We follow a traditional pool-based active learning setting [Settles:2008:AAL:1613715.1613855, settles2012active]. The model is initially trained on a small bootstrapping training set to produce a starting model. At every time step, we receive a new pool of question-image pairs and must choose a fixed number of pairs to annotate using an oracle. The pairs are chosen using an uncertainty measurement. Once the oracle answers the chosen questions, they are added to the training set and the model is re-trained.

The process of incorporating new annotations by choosing data points from a pool continues for a fixed number of iterations, yielding the final model. In our experiments, we compare and evaluate different uncertainty measurements by the performance of their resulting models.
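The loop below is a minimal sketch of this pool-based setup (mirroring Algorithm 1 in the appendix); `train`, `get_pool`, `uncertainty`, and `oracle_annotate` are placeholder callables standing in for the components described in this paper.

```python
def active_learning(bootstrap_set, num_iterations, k,
                    train, get_pool, uncertainty, oracle_annotate):
    """Generic pool-based active learning loop.

    train(dataset) -> model; get_pool(t) -> list of (image, question) pairs;
    uncertainty(model, pair) -> float; oracle_annotate(pairs) -> labeled examples.
    """
    labeled = list(bootstrap_set)
    model = train(labeled)                      # starting model trained on the bootstrap set
    for t in range(1, num_iterations + 1):
        pool = get_pool(t)                      # new unlabeled (image, question) pairs
        scores = [uncertainty(model, pair) for pair in pool]
        # Pick the k most uncertain pairs and send them to the oracle.
        ranked = sorted(zip(scores, range(len(pool))), reverse=True)
        chosen = [pool[i] for _, i in ranked[:k]]
        labeled += oracle_annotate(chosen)
        model = train(labeled)                  # retrain on the enlarged training set
    return model
```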

3.3 Our contribution: Bayesian uncertainty as variance in semantic space

As discussed earlier, traditional uncertainty estimation techniques for language generation are based on the assumption that all outputs must be mutually exclusive and thus fail when paraphrases exist. Moreover, since existing methods take exponentially long to compute with respect to the size of the vocabulary, they require approximations that introduce further error [settles2012active].

Our solution tackles both of these challenges. First, we build a visual-semantic space that captures semantic similarity in language and maps paraphrases close together; second, we use Monte-Carlo dropout-based Bayesian uncertainty [gal2016dropout] with a denoiser to measure the model’s uncertainty.

Monte-Carlo dropout [gal2016dropout] is an approximate inference approach to Bayesian Neural Networks that approximates a Gaussian Process by applying dropout at both training and test time. At test time, due to the randomness induced by dropout, model inference becomes stochastic, which makes the output variable. Applying a Monte-Carlo process with N simulations, equivalent to applying N forward passes each with a different dropout mask, results in a distribution over outputs that admits a Bayesian interpretation of the neural network and allows the use of variance as an uncertainty measure.
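In a framework like PyTorch, keeping dropout stochastic at test time amounts to leaving the dropout modules in training mode while the rest of the network stays in eval mode. The helper below is a small sketch of this; the default number of passes is illustrative, not the value used in the paper.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def mc_dropout_passes(model, inputs, n_passes=20):
    """Run n stochastic forward passes, each with a different dropout mask."""
    model.eval()                                   # freeze batch norm etc.
    for m in model.modules():
        if isinstance(m, nn.Dropout):
            m.train()                              # keep dropout active at test time
    return torch.stack([model(*inputs) for _ in range(n_passes)])  # (n_passes, ...)
```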

Figure 4: Active learning strategies’ performance on Visual Genome [krishna2017visual] measured using multiple language generation metrics. Performance of the initial model is not shown as it is the same for all strategies. We outperform all existing strategies. Additional metrics with exact numbers are included in the appendix.

Instead of dealing with intractable output spaces, we measure uncertainty within a visual-semantic space that is computationally independent of the vocabulary size. More specifically, given an input image-question pair, we encode the pair into a hidden representation using N forward passes, each with a randomly sampled dropout mask, resulting in N hidden representations. We assume that the representations embed into a visual-semantic space, a space where similar visual and language concepts lie close together. We measure uncertainty as the sum of variances across all the dimensions of the representation vectors:

u(I, Q) = Σ_d Var( h_d^(1), …, h_d^(N) )     (1)

where h^(i) denotes the i-th sampled hidden representation and d indexes its dimensions.

The inputs with the highest uncertainty scores are chosen and sent to an oracle for annotation.

Intuitively, our measurement outputs a small uncertainty score when the forward passes all produce representations that occupy a small volume in the embedding space, implying that all the answers the model produces are paraphrases or at least semantically similar. Similarly, our measurement outputs a large uncertainty score when the representations occupy a large volume in the embedding space, implying that the answers produced by the model are different.
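Concretely, given the stack of sampled representations produced by the dropout passes above, Equation (1) reduces to the per-dimension variance summed over the embedding dimensions; a minimal sketch:

```python
import torch

def embedding_variance_uncertainty(sampled_embeddings):
    """sampled_embeddings: tensor of shape (n_passes, embed_dim) for one input.

    Returns the sum of variances across all embedding dimensions (Eq. 1)."""
    return sampled_embeddings.var(dim=0).sum().item()
```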

3.4 Training the visual-semantic space

Since we demonstrate our method using a vision-language task, we build a visual-semantic space to structure language outputs. Similar to previous work, we design the embedding space using a dataset of image regions and their corresponding natural language captions [kiros2014unifying, frome2013devise, kong2014you, joulin2016learning, matuszek2012joint, klein2015associating, socher2010connecting, zitnick2013learning, karpathy2015, karpathy2014, Chen_2015_CVPR, reed2016learning]. Concretely, given a dataset of image-caption pairs, we first embed the images using a pretrained ResNet50 [he2016deep] and the captions using a visual-semantic encoder. Next, we use a contrastive loss, often called a triplet loss, to ensure that similar image regions and similar captions are embedded nearby:

ℓ_cap(i, j) = max(0, m + d(v_i, t_i) − d(v_i, t_j))     (2)
ℓ_img(i, j) = max(0, m + d(v_i, t_i) − d(v_j, t_i))     (3)
L_emb = Σ_{i ≠ j} [ ℓ_cap(i, j) + ℓ_img(i, j) ]     (4)
L_VS = λ_1 L_emb + λ_2 L_CE     (5)

where v_i and t_i are the embeddings of the image and caption of pair i, (i, j) index two distinct image-caption pairs, and m is a hyper-parameter representing our loss margin. d is the euclidean distance function, though any other distance function can also be used. max(0, ·) lower bounds all values to zero. To ensure that our embedding space does not collapse, we add a cross-entropy loss L_CE which reconstructs the caption from its embedding. λ_1 and λ_2 are hyperparameters we optimize using a validation set of image-caption pairs. Figure 2 visualizes this training process.
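The sketch below shows one standard bidirectional margin-based formulation consistent with the description above (euclidean distance, hinge at a margin, in-batch negatives); the margin value and the use of in-batch negatives are assumptions rather than the authors' exact choices, and the caption-reconstruction cross-entropy term would be added on top of this loss.

```python
import torch
import torch.nn.functional as F

def visual_semantic_triplet_loss(img_emb, cap_emb, margin=0.2):
    """img_emb, cap_emb: (batch, dim) embeddings of matching image regions and captions.

    For each positive pair (i, i), every other item j in the batch acts as a negative."""
    d = torch.cdist(img_emb, cap_emb, p=2)          # pairwise euclidean distances d(v_i, t_j)
    pos = d.diag().unsqueeze(1)                     # matching-pair distances d(v_i, t_i)
    # Hinge terms: pull matching pairs closer than non-matching ones by `margin`.
    cost_cap = F.relu(margin + pos - d)             # image i anchored against wrong captions
    cost_img = F.relu(margin + pos.t() - d)         # caption j anchored against wrong images
    mask = torch.eye(d.size(0), dtype=torch.bool, device=d.device)
    return cost_cap.masked_fill(mask, 0).sum() + cost_img.masked_fill(mask, 0).sum()
```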

3.5 Training VQA using visual semantic space

Finally, with the visual-semantic space trained, we can use it to train a VQA model and structure its outputs. Almost all VQA models proposed in computer vision follow an encoder-decoder architecture, with an encoder which embeds the question-image pair and a decoder that converts the embedding into an answer [yang2016stacked, anderson2018bottom]. We train our model in the traditional way by imposing the following reconstruction cross-entropy loss along with a visual-semantic embedding loss:

L_CE = −(1/N) Σ_i log p(a_i | I_i, Q_i)     (6)
L_emb = (1/N) Σ_i d(h_i, â_i)     (7)
L_VQA = L_CE + λ L_emb     (8)

where N is the size of the dataset, h_i is the encoder's hidden representation for the i-th image-question pair, and â_i is the projected answer embedding in the visual-semantic space. Figure 3(a) visualizes the training objective.
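A sketch of how the two terms of this training objective might be combined in practice; the weighting coefficient and the distance function are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def vqa_training_loss(answer_logits, answer_tokens, h, answer_vs_emb, lam=1.0):
    """answer_logits: (batch, T, vocab) decoder outputs; answer_tokens: (batch, T) targets.
    h: (batch, dim) VQA-encoder hidden representations.
    answer_vs_emb: (batch, dim) ground-truth answers projected into the visual-semantic space.
    lam: assumed weighting between the two terms."""
    # Reconstruction term: standard token-level cross entropy (Eq. 6).
    ce = F.cross_entropy(answer_logits.flatten(0, 1), answer_tokens.flatten())
    # Embedding term: pull h toward the answer's visual-semantic embedding (Eq. 7).
    emb = F.pairwise_distance(h, answer_vs_emb).mean()
    return ce + lam * emb                           # combined objective (Eq. 8)
```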

3.6 Denoising the representations for uncertainty estimation

Our approach depends on two central components: a semantic space that structures the model’s output representations and a denoiser that refines our representations to get an accurate uncertainty estimation. As discussed previously, we use a distance loss with respect to the visual-semantic space to semantically structure our VQA encoder’s hidden representation. In practice, the resulting representation is an approximation of the reference visual-semantic space. We find that the embeddings for some concepts are often noisy. In our dropout-based framework, the variance measured on these noisy representations does not guarantee an accurate model uncertainty estimation and results in random-sampling-like behavior.

Figure 5: Active learning strategies’ performance on VQA [antol2015vqa] measured using multiple language generation metrics. Performance of the initial model is not shown as it is the same for all strategies. We outperform all existing strategies. Additional metrics with exact numbers are included in the appendix.

To refine the image-question pair representations to be similar to the answer representation, we introduce an embedding denoiser module. The denoiser consists of components we have already introduced during training: the pretrained VQA decoder and the pretrained visual-semantic encoder. As the VQA decoder is trained to map the image-question representations to text, it learns the inherent noise in the embedding space generated by the VQA encoder. By decoding these representations into answers and then using the visual-semantic encoder to re-project them back, we semantically correct the embeddings. Figure 3(b) visualizes the denoiser. As the denoiser does not expect the VQA model’s hidden representation to be semantically structured, we might expect that it renders training VQA models with the embedding loss unnecessary. However, we find in practice that the semantic information provided by the embedding loss, combined with the denoiser, not only increases the quality of the samples but also improves model performance (see Section 4.3 for ablations).
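A minimal sketch of the denoising step, assuming the pretrained VQA decoder exposes a greedy `generate` method and the visual-semantic encoder can embed the generated answer tokens directly; both interfaces are illustrative rather than the paper's exact code.

```python
import torch

@torch.no_grad()
def denoise(hidden_reps, vqa_decoder, vs_encoder):
    """hidden_reps: (n_passes, dim) sampled VQA-encoder representations for one input.

    Decode each noisy representation into an answer, then re-encode that answer
    with the pretrained visual-semantic encoder to obtain a cleaned embedding."""
    answers = [vqa_decoder.generate(h.unsqueeze(0)) for h in hidden_reps]  # token sequences
    cleaned = [vs_encoder(a) for a in answers]                             # re-projected embeddings
    return torch.cat(cleaned, dim=0)        # (n_passes, dim), used in the Eq. (1) variance
```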

3.7 Implementation Details

We implement our visual-semantic model by combining a ResNet50 [he2016deep] encoder for the image with a bidirectional LSTM encoder for the question, and a 2-layer LSTM decoder. We train the model using image regions and their corresponding description pairs from Visual Genome [krishna2017visual]. We also feed in VQA image-answer pairs from the VQA training set at every iteration to ensure that answer concepts are mixed together with the captions. The model is optimized using Adam [kingma2014adam] with zero weight decay, and we initialize our LSTM embeddings using GloVe [pennington2014glove]. When training our VQA model [anderson2018bottom], we find that a moderate dropout rate results in good performance along with reasonable hidden representations for our uncertainty framework. To estimate uncertainty, we simulate N forward passes with dropout to generate our embedding distribution; increasing N further yields a negligible change in the uncertainty value.

4 Experiments

In our experiments, we empirically demonstrate that Bayesian uncertainty, measured as variance in the visual-semantic space, is a better sampling strategy for active learning than existing strategies. Furthermore, we show that all existing uncertainty measures perform better when the output of the VQA model is structured with a visual-semantic space.

Datasets. We study our approach using two existing VQA datasets: Visual Genome [krishna2017visual] and VQA 2.0 [antol2015vqa]. Visual Genome contains images densely annotated with scene graphs containing objects, attributes and relationships, as well as region descriptions and visual question answers. We use a randomly sampled subset of its region descriptions to create our visual-semantic space and use its question answers for the VQA model, splitting the data into a train set and a validation set of images with their associated question-answer pairs. VQA 2.0 is a dataset of MSCOCO [lin2014microsoft] images with visual question answers; we use the default train/val split.

Active learning setup. In order to showcase the advantage of using our uncertainty measurement, we test its efficiency against widely used uncertainty sampling strategies [culotta2005reducing, scheffer2001active, shen2017deep]. We set up our active learning pipeline as follows: we randomly initialize our bootstrapping set with a small fraction of the original training set and pretrain our starting model on it. This model is then used as the starting point for all our experiments. For each active learning iteration, we sample a pool representing a fixed fraction of the original dataset size. Using our sampling strategies, we update the training set with the best scoring data points such that it grows by 5% of the entire dataset in each iteration. We retrain our model using the resulting train set and repeat the procedure for five iterations, resulting in a final size of 30% of the original dataset (Table 2). We train the visual-semantic space with the same number of image-caption pairs as the question-answer pairs used by the initial model.

Active learning iteration
Q-Type Model 1 2 3 4 5
Performance What Ours 97.60 104.38 110.17 112.47 114.62
Margin 84.98 86.47 95.00 98.75 99.44
Where Ours 100.77 107.06 112.73 114.35 116.62
Margin 85.57 92.01 95.09 99.56 100.48
Who Ours 104.79 111.27 116.92 119.31 121.2
Margin 84.98 92.07 96.84 99.21 98.92
How Ours 98.14 104.84 111.23 112.45 114.77
Margin 84.40 90.28 95.79 100.36 97.08
How many Ours 104.95 110.60 115.98 117.13 120
Margin 86.15 91.44 93.55 99.34 101.87
Sampling % What Ours 62.80% 60.50% 70.00% 60.30% 61.60%
Margin 53.20% 50.10% 49.20% 48.30% 49.50%
Where Ours 20.30% 21.60% 22.20% 21.40% 22.70%
Margin 25.90% 28.80% 29.00% 28.50% 28.40%
Who Ours 3.60% 4.80% 3.80% 4.80% 4.00%
Margin 5.30% 5.90% 6.10% 6.90% 6.90%
How Ours 6.10% 6.30% 5.40% 6.50% 5.20%
Margin 4.80% 4.50% 4.50% 4.20% 4.10%
How many Ours 1.40% 1.00% 0.90% 1.00% 1.00%
Margin 5.30% 5.40% 6.20% 7.20% 6.60%
Table 1: We report CIDEr performance per question type for our measure, as well as Margin sampling. We also report the corresponding sample distributions.
Bert Recall Bert Precision METEOR CIDEr
Iteration 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
Dataset Size (%) 10 15 20 25 30 10 15 20 25 30 10 15 20 25 30 10 15 20 25 30
Random 66.94 67.61 68.02 68.39 68.64 65.28 65.86 66.22 66.54 66.89 12.96 13.57 14.00 14.37 14.68 87.01 91.75 94.90 97.63 100.09
Margin 66.90 67.70 68.16 68.55 68.81 65.16 66.02 66.4 66.75 66.95 12.82 13.70 14.11 14.52 14.70 86.10 92.28 95.66 98.70 100.30
LC 67.05 67.72 68.18 68.55 68.79 65.28 65.95 66.45 66.65 66.99 12.91 13.62 14.16 14.43 14.76 86.76 91.80 95.91 97.94 100.22
Entropy 67.00 67.69 68.26 68.62 68.74 65.29 66.04 66.54 66.76 67.00 12.92 13.66 14.22 14.49 14.76 87.05 92.26 96.36 98.47 100.45
Random + VS 69.57 70.35 70.72 71.10 71.35 66.39 67.24 67.81 68.09 68.41 14.52 15.34 15.91 16.23 16.55 101.68 107.76 111.88 114.27 116.72
Margin + VS 69.56 70.36 70.75 71.08 71.42 66.29 67.24 67.64 68.11 68.49 14.43 15.40 15.86 16.30 16.70 100.73 108.20 111.61 115.14 117.97
LC + VS 69.53 70.26 70.56 71.00 71.27 66.25 67.13 67.42 67.96 68.34 14.39 15.28 15.68 16.18 16.56 100.29 107.00 110.44 113.68 116.56
Entropy + VS 69.47 70.31 70.68 71.07 71.37 66.17 67.22 67.67 68.10 68.38 14.31 15.33 15.83 16.27 16.56 100.26 108.02 111.49 114.66 116.98
Ours (Baye) 67.15 67.76 68.14 68.46 68.73 65.47 65.99 66.35 66.70 67.05 13.04 13.67 14.03 14.42 14.75 87.92 92.24 95.33 97.71 100.15
Ours (Baye + Deno) 67.13 67.86 68.38 68.68 68.98 65.49 66.17 66.63 66.93 67.22 13.05 13.83 14.31 14.67 14.96 87.59 93.01 96.81 99.34 101.44
Ours (Baye + VS) 69.48 70.30 70.59 70.88 71.27 66.34 67.27 67.62 68.01 68.47 14.53 15.38 15.74 16.18 16.59 101.47 107.97 110.73 113.91 117.06
Ours (Baye + VS + Deno) 69.74 70.51 70.99 71.35 71.63 66.54 67.44 68.01 68.34 68.70 14.66 15.53 16.12 16.47 16.91 102.19 108.87 112.80 115.67 118.67
Table 2: We report our strategy’s performance and find that it outperforms all baselines. We also notice that the baselines improve considerably if we add visual-semantic information (+VS). (+Deno refers to the denoiser.)

Baselines. We compare our uncertainty measure against the most popular uncertainty-based strategies. All baselines use a traditional encoder-decoder architecture without imposing a visual-semantic embedding loss. Random sampling (Random) is a passive learning strategy that queries points following a uniformly random distribution. Least confidence (LC) [culotta2005reducing] queries the instances for which the model has the lowest probability for its most likely generated sequence. Margin sampling (Margin) [scheffer2001active] queries the instances with the smallest margin between the output probabilities of the two most likely generated answers. Maximum entropy sampling (Entropy) [shen2017deep] queries the instances that maximize the entropy of the model’s output distribution. We use beam search to approximate these uncertainty scores (see the appendix for the exact acquisition functions).

While other active learning strategies like information gain and density based methods exist, some require an explicit enumeration over the output space, which is intractable, while others are not scalable for large datasets and adapting them in language generation models is still an open research problem [settles2012active].

Evaluation. To evaluate answer quality, we use standard automatic evaluation metrics, namely CIDEr [vedantam2015cider], METEOR [denkowski2014meteor], and BERT precision and recall [zhang2019bertscore]. We report additional metrics, like BERT F1 [zhang2019bertscore], BLEU [papineni2002bleu], ROUGE-L [lin2004rouge], and accuracy, in the appendix.

4.1 Active learning performance

Setup. We report quantitative results from our active learning experiments in Figures 4 and 5, as well as in Table 2. To ensure that our results are not a product of noise, we run every experiment multiple times for each strategy on both datasets. Each run uses a different random seed to alter how the weights of the model are initialized and which pool of image-question pairs arrives at every step. We present the average scores across all runs.

Results. Our solution outperforms all traditional active learning strategies in all metrics by a large margin for both Visual Genome (Figure 4) and VQA 2.0 (Figure 5). The increase in performance is larger for Visual Genome, which has more unique answers than VQA 2.0, resulting in the model learning more sequences to express the same answer. On Visual Genome, our measure is over 5 times more cost efficient than most existing sampling strategies, performing better with a fraction of the annotation budget than other strategies manage after using the whole budget, across all metrics. Similarly for VQA 2.0, we find that our solution is around 3 times more cost efficient. This demonstrates our framework’s ability to maximize performance when multiple correct answers are present.

4.2 What questions are sampled?

Setup. Next, we dive into what types of questions are sampled by Ours versus Margin, which is the best performing baseline. We report the sampling statistics per question category, i.e., the percentage of a specific type of question the strategy chooses at every step. We also report the performance of the models on the test set for those specific categories. We only report CIDEr scores as we see the same trend across the other metrics. Results are reported in Table 1. We look at “what”, “where”, “who”, “how” and “how many” (counting) questions in particular, as together they make up the large majority of questions in Visual Genome. We also report how well the models perform at every iteration on the test split containing only that type of question.

Results. Not only do we outperform Margin in cases where we sample more points for a specific type of question, but also in cases where we sample fewer. In order to explain this behavior, we study samples from the third iteration of active learning, in which we sample significantly fewer “where” questions and more “what” questions and yet perform better than Margin on both types. We find that Margin samples longer answers on “where” questions. We also look at the average length of answers in Visual Genome [krishna2017visual] and find that “where” questions have longer answers on average than “what” questions. Longer answers typically have a higher potential of having paraphrases. We find this assumption to be true qualitatively: Margin samples questions with ground truth answers like “next to the large brick buildings”, “near the tall buildings” and “by the buildings” when it already has paraphrases in its training set. Furthermore, Ours samples more new concepts on “what” questions, which may be less prone to redundancy since their answers are generally shorter (refer to the appendix for more details). We also see that our method samples very few counting questions, as the model picks up on the dataset bias where the answer is usually “2” and therefore has a lower uncertainty.

4.3 Ablations

Setup. Our solution involves three components: a visual-semantic space (VS), applying dropout-based Bayesian uncertainty within that space (Baye), and utilizing a denoiser to make the measurements more accurate (Deno). The visual-semantic space, however, can be used during training for any VQA model to structure the outputs and doesn’t need to specifically rely on dropout-based Bayesian uncertainty. So, we can add the visual-semantic embedding loss during training and still utilize existing uncertainty functions. Here, we perform experiments ablating the three components of our model, as well as adding the visual-semantic space loss to existing methods and reporting how it impacts them.

Results. We report the impact of the structure provided by the visual-semantic space when still using existing uncertainty measures in Table 2. We notice that even though our proposed strategy outperforms the extended traditional strategies, all existing uncertainty methods achieve a considerable boost in performance over iterations. This leads to two important conclusions: (1) structuring the output space of problems that have multiple correct answers can improve uncertainty estimates from all existing measures, and (2) measuring uncertainty in the embedding space instead of the model outputs proves to be more robust with multiple correct answers. Our ablations also demonstrate the importance of the denoiser and visual-semantic space when using Bayesian uncertainty estimation. We show that Baye by itself performs just as well as the existing baselines, but combining it with Deno or VS increases its performance, and Baye+VS+Deno performs best.

5 Discussion

While promising, it is important to note that our method does have a few limitations. First, constructing a semantic space requires additional data. In our case, the availability of Visual Genome’s [krishna2017visual] region descriptions made constructing a visual-semantic space feasible. However, in different tasks where such data might not be already available, such an approach can increase the overall cost. Second, it is possible for semantic embeddings for some concepts to occupy a larger convex hull than others. Therefore, variance as a measure might overestimate uncertainty for concepts that cover a larger convex hull while underestimating uncertainty for those placed within a smaller hull. Measuring uncertainty with respect to the density of specific concepts is an open research question we leave to future work. Third, as we measure the variance on sampled embeddings as our uncertainty, our solution needs multiple forward passes. This gives us a runtime and computational disadvantage over existing baselines. Even though these limitations exist, we consider them implementation limitations, left for further research work, and emphasize the higher level contribution of our paper which is the proposal of a new paradigm that investigates uncertainty as variance in semantic space.

6 Conclusion

Existing uncertainty sampling strategies in active learning are no better than random sampling for sequence generation tasks with multiple correct answers. We propose a novel uncertainty sampling paradigm that moves from a probabilistic to an embedding-based uncertainty estimation and overcomes the paraphrastic nature of language. We evaluate our solution in an active learning setting for the VQA generation task and outperform existing sampling strategies. Our model samples fewer paraphrases and more novel concepts and is 5 times more cost-efficient on Visual Genome and 3 times more cost-efficient on VQA 2.0.

Acknowledgements. This work was partially funded by the Brown Institute of Media Innovation and by Toyota Research Institute (“TRI”) but solely reflects the opinions and conclusions of its authors and not TRI or any Toyota entity.

References

7 Appendix

We provide more details on how we implement our VQA model. We then motivate our design decisions in developing a new active learning framework for tasks with multiple correct outputs by investigating when and how existing uncertainty measurements fail. Next, we explore the active learning framework baselines and provide a more comprehensive quantitative evaluation, as well as a qualitative analysis of our proposed sampling strategy.

7.1 Implementation details: VQA model

We implement a bottom-up, top-down attention [anderson2018bottom] VQA generation model and use it for all of our active learning experiments. We use a question encoder similar to the LSTM encoder of the visual-semantic encoder. We feed our bottom-up attention mechanism the output hidden state of the LSTM encoder as the question’s embedding, along with the region features of the corresponding image. We add a dropout layer to our attention mechanism when we attend over the joint question-image representation and use the original architecture otherwise. We decode the model’s answer using an LSTM decoder. We optimize the model with the same hyper-parameters as the visual-semantic model.

7.2 Uncertainty estimation with multiple answers: CLEVR experiment

To motivate our design decisions in developing a new active learning framework for VQA with multiple correct answers, we investigate when and how existing uncertainty measurements fail. To systematically perform this evaluation without confounding factors like noise in real-world datasets, we use the synthetic CLEVR dataset [johnson2017clevr]. Even though CLEVR only has questions with one correct answer, we modify the answers by introducing paraphrases. Our insights from experimenting on modified-CLEVR motivate both the importance of moving to a different kind of uncertainty estimation for language generation models and the specific uncertainty measure proposed in the main paper.

Modifying CLEVR to include paraphrases

CLEVR is a diagnostic dataset that tests a range of visual reasoning abilities, including visual question answering. It contains minimal biases and has detailed annotations describing the kind of reasoning each question requires. The dataset is composed of a training set of image-question-answer triples with multiple answer categories, including binary (yes / no), attributes, counts, objects and spatial relationships.

In a real-world dataset, some questions will have multiple correct answers while others might only have one. To mimic such a setup, we modify CLEVR by taking a fraction of answer categories and replacing their answers with synonyms. For example, we corrupt the answer ‘yes’ to seven different tokens: ‘yes’, ‘yeah’, …, ‘yup’. Specifically, for every answer that is ‘yes’, we randomly modify it to one of the seven paraphrases. This is a conservative modification; in language, there are usually more than seven ways of expressing the same meaning.
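A small sketch of this corruption step; the synonym list below is illustrative and not the exact seven tokens used in the paper.

```python
import random

# Illustrative synonym sets; the paper replaces a fraction of answer categories this way.
PARAPHRASES = {
    "yes": ["yes", "yeah", "yep", "yup", "sure", "certainly", "of course"],
}

def corrupt_answer(answer, rng=random):
    """Replace an answer with a uniformly sampled paraphrase when one is defined."""
    candidates = PARAPHRASES.get(answer)
    return rng.choice(candidates) if candidates else answer
```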

Figure 6: Even though the model is uncertain about the question on the left, existing uncertainty measurements assign it a lower uncertainty than the example on the right. When an input has multiple correct answers, existing measures overestimate uncertainty as they are unable to relate which outputs are paraphrases.

Overestimation of uncertainty

We train a state-of-the-art VQA model [yang2016stacked] on the modified-CLEVR dataset. Next, for all the data points in modified-CLEVR’s validation set, we measure the model’s uncertainty from the model’s outputs and compare how uncertainty measurements differ between questions with multiple correct answers and those with only one.

We report uncertainty scores using entropy, which is the most popular uncertainty measurement used in active learning settings. We run similar experiments with other measures, like least confidence and margin, but omit them from our analysis here as they follow the same trend. Since exactly measuring entropy is intractable, we approximate it by decoding the answer with a fixed beam size [shen2017deep].

Qualitatively, Figure 6 demonstrates how the model’s uncertainty scores differ for questions with one or multiple correct answers. On the left, we show an example of a question that the model is relatively unsure about. It assigns most of its weight to the correct answer (“large”) but also assigns a sizable weight to an incorrect answer (“small”). However, the uncertainty associated with this question is lower than the example we show on the right. When multiple correct answers exist, the model fails to choose a clear answer and tends to assign similar weights to multiple answers with the same meaning. Even though we can interpret such a result as the model learning that for such a question, multiple correct candidates exist, uncertainty measures are unaware of which answers are paraphrases and overestimate uncertainty. This scenario showcases a common failure case of active learning with existing uncertainty measurements, which would choose to collect more labels for the example with multiple correct answers, even though the model might benefit more from sampling the question on the left.

Figure 7: As existing uncertainty measurements tend to overestimate uncertainty when an input has multiple correct answers, we end up sampling data points that the model already knows how to answer reducing active learning efficiency.

Quantitatively, we show how prevalent this problem is for all questions in modified-CLEVR’s validation set in Figure 7. Entropy assigns high uncertainty to all questions with multiple correct answers (shown in orange), and the majority of these questions are assigned a higher uncertainty score than questions with a single correct answer (shown in green). Finally, if we correct the uncertainty scores by summing up the weights assigned to all the synonyms together, we find that the model is actually quite certain about many of these questions (shown in blue).
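The correction described above amounts to summing the probability mass of all paraphrases before computing entropy. A minimal sketch, assuming the groups of synonymous answers are given:

```python
import math

def corrected_entropy(answer_probs, synonym_groups):
    """answer_probs: dict mapping answer string -> probability.
    synonym_groups: list of sets of answers known to be paraphrases.

    Merges the mass of paraphrased answers before computing entropy."""
    merged = {}
    for ans, p in answer_probs.items():
        key = next((frozenset(g) for g in synonym_groups if ans in g), ans)
        merged[key] = merged.get(key, 0.0) + p
    return -sum(p * math.log(p) for p in merged.values() if p > 0)
```

For example, a prediction that spreads its mass evenly over “yes”, “yeah”, and “yup” has high raw entropy but near-zero corrected entropy once the three are merged.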

Our experiments with CLEVR allow us to conclude that we need an uncertainty measurement that can perform a similar correction, moving overestimated uncertainty scores (shown in orange) to their corrected values (shown in blue). We need a mechanism that suggests which answers are semantically similar so that such a correction can be performed.

7.3 Active learning framework

We summarize the active learning framework described in the main paper in Algorithm 1. We also add additional notes and details about the baselines used below:

1:   Initialize an initial training set D_0
2:   Initialize VQA model M and pretrain it on D_0 to get M_0
3:   for t = 1 to T do
4:       Get a new pool P_t of (image, question) pairs
5:       Initialize an empty list U
6:       for each pair (I, Q) in P_t do
7:           Measure uncertainty u(I, Q) with M_{t-1}
8:           Add the uncertainty score u(I, Q) to U
9:       end for
10:      SAMPLES ← top-K pairs of P_t ranked by U
11:      Annotate SAMPLES using an oracle
12:      D_t ← D_{t-1} ∪ SAMPLES
13:      Retrain M using the updated D_t to get M_t
14:  end for
Algorithm 1 Pool Based Active learning for VQA

Random Sampling (Passive Learning):

We choose SAMPLES by drawing K data points uniformly at random from the pool P_t.

Least Confidence Sampling [culotta2005reducing]:

This approach queries the instances for which our model has the least confidence in its most likely generated sequence. We choose SAMPLES as the top-K points from P_t for which the current VQA model has the least confidence in generating an answer. We define our acquisition function as:

u_LC(I, Q) = 1 − P(y* | I, Q)

where P(y* | I, Q) is the probability of the most confident model response y*.

Margin Sampling [scheffer2001active]:

This approach queries the instances with the least margin between the output probabilities of the two most likely generated answers of our model. In this strategy, we use a beam size of 2 and choose SAMPLES as the top-K points from P_t with the smallest margin. We define our acquisition function as:

u_M(I, Q) = −( P(y_1* | I, Q) − P(y_2* | I, Q) )

where P(y_1* | I, Q) and P(y_2* | I, Q) are the probabilities of the two most likely model responses.

Maximum Entropy Sampling [shen2017deep]:

This approach queries the instances that maximize the entropy of our model’s output. We choose SAMPLES as the K data points with the highest prediction entropy. We define our acquisition function as:

u_H(I, Q) = − Σ_i P(y_i | I, Q) log P(y_i | I, Q)

In order to measure the entropy of our predictions in a VQA setting, we use beam search to obtain the top predictions y_i with their probabilities P(y_i | I, Q).
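All three acquisition scores can be computed from the sequence probabilities of the top beam candidates; a minimal sketch, where `beam_probs` is a list of candidate probabilities sorted in decreasing order (an assumed interface, not the paper's code):

```python
import math

def least_confidence(beam_probs):
    """Higher score = less confident in the single best answer."""
    return 1.0 - beam_probs[0]

def margin_score(beam_probs):
    """Higher score = smaller gap between the two best answers (needs beam size >= 2)."""
    return -(beam_probs[0] - beam_probs[1])

def entropy_score(beam_probs):
    """Entropy approximated over the top beam candidates (renormalized)."""
    z = sum(beam_probs)
    return -sum((p / z) * math.log(p / z) for p in beam_probs if p > 0)
```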

7.4 More active learning performance

The experiments we ran in the main paper can also be evaluated using additional language metrics like BLEU, ROUGE and BERT F1. We include those metrics here. They follow the same trend as the other metrics. We also report model accuracy.

Results on Visual Genome are shown in Table 3 and results on VQA 2.0 are shown in Table 4.

VG Bert F1 Bert Recall Bert Precision
AL iteration 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
Dataset Size (%) 10 15 20 25 30 10 15 20 25 30 10 15 20 25 30
Random 66.12 66.75 67.16 67.46 67.74 66.96 67.63 68.05 68.38 68.60 65.31 65.88 66.30 66.57 66.89
Margin 65.92 66.84 67.23 67.60 67.87 66.83 67.65 68.10 68.51 68.77 65.04 66.04 66.38 66.71 66.98
LC 66.17 66.82 67.36 67.59 67.88 67.05 67.72 68.18 68.55 68.79 65.28 65.95 66.45 66.65 66.99
Max Entropy 66.14 66.84 67.36 67.68 67.86 67.00 67.69 68.26 68.62 68.74 65.30 66.01 66.48 66.77 67.00
Ours 68.10 68.94 69.47 69.81 70.14 69.74 70.51 70.99 71.35 71.63 66.54 67.44 68.01 68.34 68.70
VG METEOR CIDEr ROUGE-L
AL iteration 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
Dataset Size (%) 10 15 20 25 30 10 15 20 25 30 10 15 20 25 30
Random 12.96 13.57 14.00 14.37 14.69 87.01 91.75 94.90 97.63 100.09 30.38 31.67 32.55 33.31 33.96
Margin 12.82 13.70 14.11 14.52 14.70 86.10 92.28 95.66 98.70 100.30 30.06 31.68 32.57 33.37 33.79
LC 12.91 13.62 14.16 14.43 14.76 86.76 91.80 95.91 97.94 100.22 30.17 31.44 32.53 33.06 33.65
Max Entropy 12.92 13.66 14.22 14.49 14.76 87.05 92.26 96.36 98.47 100.45 30.31 31.66 32.77 33.36 33.89
Ours 14.66 15.53 16.12 16.47 16.91 102.19 108.87 112.80 115.67 118.67 35.02 36.70 37.62 38.32 39.02
VG Bleu-1 Bleu-2 Bleu-3
AL iteration 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
Dataset Size (%) 10 15 20 25 30 10 15 20 25 30 10 15 20 25 30
Random 29.1 29.96 30.52 31.12 31.74 18.86 19.63 20.08 20.57 20.94 12.72 13.41 13.81 14.2 14.47
Margin 28.7 30.22 30.85 31.41 31.63 18.71 19.91 20.47 21.03 21.21 12.67 13.62 14.17 14.72 14.92
LC 28.98 29.94 30.96 31.14 31.77 18.99 19.78 20.59 20.95 21.31 12.92 13.66 14.26 14.71 14.98
Max Entropy 29.0 29.99 31.01 31.25 31.79 18.91 19.75 20.56 20.84 21.17 12.81 13.52 14.17 14.51 14.72
Ours 31.59 33.07 33.75 34.29 35.19 23.16 24.22 24.82 25.41 25.91 17.21 18.17 18.67 19.37 19.75
VG Bleu-4 Accuracy
AL iteration 1 2 3 4 5 1 2 3 4 5
Dataset Size (%) 10 15 20 25 30 10 15 20 25 30
Random 7.94 8.40 8.88 9.15 9.26 16.36 17.40 18.12 18.7 19.21
Margin 7.98 8.72 9.20 9.71 9.93 16.14 17.34 18.05 18.67 19.02
LC 8.17 8.85 9.23 9.67 9.88 16.14 17.11 17.94 18.40 18.81
Max Entropy 8.05 8.74 9.10 9.51 9.47 16.32 17.43 18.19 18.69 19.09
Ours 13.20 14.04 14.16 15.18 15.40 19.85 21.25 22.07 22.60 23.07
Table 3: We report our model’s efficacy with multiple metrics. We use language modeling metrics to measure its capability to generate answers similar to the ground truth as we progress through the AL process. Scores are multiplied by 100 to show more significant digits.
VQA 2.0 Bert F1 Bert Recall Bert Precision
AL iteration 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
Dataset Size (%) 10 15 20 25 30 10 15 20 25 30 10 15 20 25 30
Random 83.82 84.87 85.57 86.04 86.4 83.83 84.9 85.58 86.05 86.42 83.8 84.85 85.57 86.02 86.39
Margin 83.68 84.92 85.32 85.8 86.29 83.7 84.93 85.33 85.82 86.31 83.66 84.91 85.32 85.78 86.28
LC 84.08 85.19 85.71 86.1 86.03 84.1 85.20 85.73 86.12 86.04 84.06 85.17 85.69 86.09 86.02
Max Entropy 84.30 85.05 85.54 84.8 86.42 84.32 85.07 85.56 84.83 86.44 84.29 85.03 85.52 84.77 86.39
Ours 85.02 85.87 86.44 86.85 87.01 85.07 85.88 86.47 86.89 87.02 84.98 85.85 86.42 86.82 87.00

VQA 2.0 METEOR CIDEr ROUGE-L
AL iteration 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
Dataset Size (%) 10 15 20 25 30 10 15 20 25 30 10 15 20 25 30
Random 29.00 32.74 35.42 35.92 36.73 112.51 120.92 126.53 130.08 133.25 43.96 47.22 49.31 50.69 51.94
Margin 28.60 32.88 32.84 35.65 36.66 111.04 120.35 123.17 127.08 131.65 43.4 46.96 47.98 49.53 51.31
LC 29.49 34.21 34.92 35.31 35.85 113.35 121.95 125.72 128.98 127.84 44.25 47.53 48.97 50.21 49.72
Max Entropy 30.06 32.72 34.75 33.90 37.03 114.96 119.50 123.54 119.54 130.58 44.84 46.56 48.05 46.62 50.80
Ours 31.21 34.58 36.82 37.62 37.03 118.64 125.72 130.23 134.07 136.36 46.35 48.96 50.67 52.17 52.98
VQA 2.0 Bleu-1 Bleu-2 Bleu-3
AL iteration 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
Dataset Size (%) 10 15 20 25 30 10 15 20 25 30 10 15 20 25 30
Random 43.93 47.18 49.35 50.74 51.94 38.7 41.98 44.29 45.17 46.81 37.55 40.7 42.8 42.89 45.05
Margin 43.38 46.96 48.06 49.56 51.34 37.35 41.4 42.75 44.77 45.58 35.42 40.94 41.08 43.96 43.31
LC 44.30 47.60 49.05 50.30 49.85 38.92 42.57 44.04 45.53 45.58 35.79 40.15 41.73 43.92 43.03
Max Entropy 44.87 46.65 48.21 46.61 50.89 40.79 42.41 45.12 40.96 48.40 40.26 40.52 44.06 35.75 48.58
Ours 46.33 49.07 50.78 52.26 53.15 42.52 44.79 47.31 49.09 48.97 42.55 43.38 46.99 48.81 47.05
VQA 2.0 Bleu-4 Accuracy
AL iteration 1 2 3 4 5 1 2 3 4 5
Dataset Size (%) 10 15 20 25 30 10 15 20 25 30
Random 16.07 19.59 6.92 13.09 11.53 43.36 46.6 48.69 50.05 51.32
Margin 0.01 10.64 10.63 7.50 18.92 42.77 46.33 47.35 48.90 50.66
LC 5.18 4.36 4.81 13.10 14.64 43.60 46.88 48.32 49.56 49.05
Max Entropy 0.33 18.00 18.46 11.94 26.87 44.22 45.91 47.41 46.00 50.19
Ours 8.85 19.83 33.98 12.88 27.01 45.68 48.27 50.03 51.52 52.31
Table 4: We report our model’s efficacy with multiple metrics. We use language modeling metrics to measure its capability to generate answers similar to the ground truth as we progress through the AL process. Scores are multiplied by 100 to show more significant digits.

7.5 Qualitative analysis between Margin vs Ours

As discussed in Section 4.2, we qualitatively study the behavior of both Margin and our solution on different types of questions. We find that our solution outperforms Margin on all types of questions by a considerable margin. We also study the percentage of sampled points with respect to performance for each type of question. We find that our solution not only performs better when we sample more points but also in cases where we sample fewer.

In order to understand this behavior, we study samples from the third iteration of active learning, where we sample significantly fewer “where” questions and more “what” questions and yet perform better than Margin on both types. We investigate the distribution of answer lengths for samples in both types of questions. Figures 8 and 9 show the sampled distributions. We find that Margin samples more long answers on “where” questions.

Figure 8: Distribution of sampled answer lengths in “where” questions for Ours and Margin in the 3rd iteration.

We also look at the average length of answers in Visual Genome [krishna2017visual] and find that “where” questions have longer answers on average than “what” questions. Longer answers typically have a higher potential of having paraphrases. We find this assumption to be true qualitatively: an example of the redundancy of Margin is shown in Table 5. We look at sampled points containing the word “building” and find that Margin heavily re-samples concepts that are already available or have paraphrases in the training set. We also report the corresponding samples for Ours in Table 6 and find that we sample fewer paraphrases and are less redundant.

Figure 9: Distribution of sampled answer lengths in “what” questions for Ours and Margin in the 3rd iteration.
Margin third-iteration samples    Paraphrases already in the Training Set    Points
between the buildings
between the red brick buildings
over the street and between buildings
between the buildings
in between two buildings
between the pear and lemon buildings
overhead between the buildings
between two buildings
between buildings
hanging between buildings
in between buildings
above the buildings
on buildings
on the buildings
on the buildings behind the vehicles
above buildings
on buildings
above the train and buildings
on the buildings
on top of buildings
on the buildings in the background
on the buildings
above the city buildings
above the buildings
hanging over buildings
in front of the buildings
in front of buildings
parked in front of the buildings
in front of the buildings and tower
in front of buildings
in front of the buildings
near the buildings
near buildings
next to the large brick buildings
by the buildings
in the street , near buildings
next to the buildings
by the tall buildings
near the buildings
beside the buildings
by the trees and buildings
near buildings
close to buildings
next to the buildings
behind buildings
behind the buildings
behind the buildings
behind the other buildings
behind all the buildings
Table 5: We look at all the sampled phrases containing the word “building” with Margin. We realize that we end up heavily re-sampling concepts that are already available in our training set. This shows the redundancy of this strategy and explains why sampling more points leads to worse performance.
Ours third-iteration samples    Paraphrases already in the Training Set    Points
between the buildings
between the two buildings
between two buildings
in the middle of the buildings
overhead between the buildings
over the street and between buildings
between the buildings
in between the buildings
in between buildings
above the buildings
on buildings
on top of buildings
on the buildings
on buildings
above the buildings
in front of buildings
in front of the buildings and tower
in front of buildings
in front of the buildings
near buildings
by the buildings
near the buildings
by the trees and buildings
near buildings
sidewalk along buildings
near the brown buildings
by the buildings
behind older buildings
behind the buildings
behind the other buildings
behind the buildings
Table 6: We look at all the sampled phrases containing the word “building” with Ours. We find that compared to Margin, we cover the same range of concepts while being less redundant.