A Simple Loss Function for Improving the Convergence and Accuracy of Visual Question Answering Models

08/02/2017 ∙ by Ilija Ilievski, et al. ∙ National University of Singapore

Visual question answering, a recently proposed multimodal learning task, has enjoyed wide attention from the deep learning community. Lately, the focus has been on developing new representation fusion methods and attention mechanisms to achieve superior performance. On the other hand, very little attention has been paid to the models' loss function, arguably one of the most important aspects of training deep learning models. The prevailing practice is to use a cross entropy loss function that penalizes the probability given to every answer in the vocabulary except the single most common answer for the particular question. However, the VQA evaluation function compares the predicted answer with all the ground-truth answers for the given question, and awards a partial point if there is a match. This causes a discrepancy between the model's cross entropy loss and the model's accuracy as calculated by the VQA evaluation function. In this work, we propose a novel loss, termed soft cross entropy, that considers all ground-truth answers and thus reduces the loss-accuracy discrepancy. The proposed loss leads to improved training convergence of VQA models and an increase in accuracy of as much as 1.6%.


1 Introduction

Visual question answering (VQA) requires an AI agent to answer questions about an image. As a challenging multimodal problem and a proxy task for visual reasoning, it has attracted a lot of attention from the deep learning community. Multiple models have been introduced [6, 16, 10], along with a new dataset with a specific focus on visual reasoning [9].

The currently largest VQA dataset, VQA v2.0 [3], contains over one million questions for roughly two hundred thousand MS COCO images [13]. Each question is paired with ten human-provided answers. The usual VQA model uses a pretrained ResNet [4] to obtain an image representation and an LSTM [5] unit to learn a representation of the question words. The model then fuses the two representations into a single multimodal representation via element-wise multiplication or other, more sophisticated methods. Finally, the most common answer out of the ten provided is used to train the model to classify the multimodal representation to the correct answer [20, 7, 19, 17, 18].
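For illustration, the following is a minimal sketch of such a baseline; it assumes PyTorch, and the dimensions, vocabulary sizes, and layer choices are illustrative assumptions rather than any specific published configuration.

    import torch
    import torch.nn as nn

    class SimpleVQA(nn.Module):
        # Minimal baseline: precomputed ResNet image features and an LSTM
        # question encoder, fused by element-wise multiplication.
        def __init__(self, vocab_size=10000, num_answers=3000,
                     img_dim=2048, emb_dim=300, hidden_dim=1024):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
            self.img_proj = nn.Linear(img_dim, hidden_dim)
            self.classifier = nn.Linear(hidden_dim, num_answers)

        def forward(self, img_feat, question_tokens):
            # img_feat: (batch, img_dim) pretrained ResNet activations
            # question_tokens: (batch, seq_len) word indices
            _, (h, _) = self.lstm(self.embed(question_tokens))
            q = h[-1]                                  # question representation
            v = torch.tanh(self.img_proj(img_feat))    # image representation
            fused = q * v                              # element-wise fusion
            return self.classifier(fused)              # answer logits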

Recently, several representation fusion methods [2, 11, 1] and novel attention mechanisms [15, 14] were introduced. However, very little attention has been paid to the VQA model's loss function, which is an essential part of its training. The prevailing approach is to use the most common answer and a cross entropy loss function (Eq. (1)). However, a VQA model is evaluated by comparing the predicted answer with all the ground-truth answers for a given question, and a partial point is given if there is a match. This causes a discrepancy between the model's cross entropy loss and the model's accuracy as calculated by the VQA evaluation function, which in turn results in delayed training convergence and reduced test accuracy.

In this work, we propose a new loss function, termed soft cross entropy, that considers all ground-truth answers and thus resolves the discrepancy problem. In contrast to the standard cross entropy loss, the soft cross entropy loss provides the model with a set of plausible answers for a given question and with information about the question's ambiguity. As a consequence, VQA models trained with the proposed loss have a stable training process, converge faster, and achieve on average higher accuracy than models trained with the standard cross entropy loss function.

In summary, the contributions of this work are:

  • We propose a novel loss function for VQA that more closely reflects a VQA model's performance. The proposed loss is justified with error analysis and empirical evaluation.

  • We provide efficient code for reproducing the experimental results, which can also serve as starter code for the VQA community.

Model   Loss Function        All     Y/N     Num     Other
AVG     Cross Entropy        46.8    55.8    29.8    42.4
        Soft Cross Entropy
POOL    Cross Entropy        58.8    70.1    37.5    53.1
        Soft Cross Entropy

Table 1: Best validation accuracy (%) on the VQA v2.0 validation set.
Figure 1: Training (dashed lines) and validation (solid lines) loss and validation accuracy (dotted lines) for the AVG VQA model.
Figure 2: Training (dashed lines) and validation (solid lines) loss and validation accuracy (dotted lines) for the POOL VQA model.

2 Soft Cross Entropy Loss

The VQA problem can be reduced to a maximum likelihood estimation problem, where the model classifies a question-image pair to an answer from the training set. Generally, a deep learning model is trained on a classification problem using a cross entropy loss function:

\mathcal{L}_{\mathrm{CE}}(a, y) = -\log\frac{\exp(a_y)}{\sum_{j}\exp(a_j)}, \quad (1)

where a is a vector of network activations, one per class, and y is the index of the correct class.
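Concretely, training with Eq. (1) reduces VQA to ordinary classification against the single most common answer. A minimal sketch, assuming PyTorch and illustrative tensor shapes:

    import torch
    import torch.nn.functional as F

    # logits: the activation vector a from Eq. (1), one score per answer
    logits = torch.randn(4, 3000)           # batch of 4, 3000 candidate answers
    # majority_idx: the index y of the single most common ground-truth answer
    majority_idx = torch.tensor([0, 17, 42, 5])

    # Standard cross entropy: all probability mass off the one "correct"
    # class is penalized, even mass placed on other valid answers.
    loss = F.cross_entropy(logits, majority_idx)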

However, contrary to conventional classification problems, the VQA evaluation metric considers a predicted answer as correct if the answer was given by at least three out of ten human annotators. The accuracy is then averaged over all \binom{10}{9} subsets of ground-truth answers:

\mathrm{Acc}(a) = \frac{1}{10}\sum_{k=1}^{10}\min\!\left(\frac{\#\{\text{answers in subset } k \text{ matching } a\}}{3},\ 1\right),

where each subset k leaves out one of the ten ground-truth answers.
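A small sketch of this evaluation rule (the answer-string normalization performed by the official evaluation script is omitted for brevity):

    def vqa_accuracy(pred, gt_answers):
        # gt_answers: the ten human-provided answers for one question.
        # Average min(#matches / 3, 1) over the ten 9-answer subsets.
        accs = []
        for i in range(len(gt_answers)):
            subset = gt_answers[:i] + gt_answers[i + 1:]
            matches = sum(a == pred for a in subset)
            accs.append(min(matches / 3.0, 1.0))
        return sum(accs) / len(accs)

    # Four of ten annotators answered "red": full credit for "red".
    print(vqa_accuracy("red", ["red"] * 4 + ["maroon"] * 6))  # 1.0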

As a result, the model's performance is not properly assessed by the cross entropy loss function during the training phase. The improper loss function has a significant negative impact on the model's training and delays convergence. Furthermore, it results in an abnormal and counter-intuitive validation loss-accuracy relationship, where both the loss and the accuracy increase (Fig. 1 and 2).

As a solution, we propose to use a loss function that considers all ground-truth answers. The proposed loss function, termed soft cross entropy, is a simple weighted average of the cross entropy of each unique ground-truth answer:

\mathcal{L}_{\mathrm{SCE}}(a, g, w) = -\sum_{i} w_i \log\frac{\exp(a_{g_i})}{\sum_{j}\exp(a_j)}, \quad (2)

where g is a vector of unique ground-truth answers and w is a vector of answer weights, computed as the number of times each unique answer appears in the ground-truth set divided by the total number of answers.
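A minimal PyTorch sketch of Eq. (2); this illustrates the formula rather than reproducing the authors' released implementation, and the tensor layout is an assumption:

    import torch
    import torch.nn.functional as F

    def soft_cross_entropy(logits, answer_idx, answer_weights):
        # logits:         (batch, num_answers) network activations a
        # answer_idx:     (batch, k) indices of the unique ground-truth answers g
        # answer_weights: (batch, k) occurrence counts / total answers, i.e. w
        log_probs = F.log_softmax(logits, dim=1)
        picked = log_probs.gather(1, answer_idx)    # log-probability of each g_i
        return -(answer_weights * picked).sum(dim=1).mean()

Since questions have different numbers of unique answers, answer_idx and answer_weights would in practice be padded to a common width, with padding entries given zero weight.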

3 Experiments

We evaluate the proposed loss function on the recently released VQA v2.0 benchmark dataset [3]. To demonstrate the general applicability of the proposed loss, we train variants of the two most common VQA model types [8, 11].

Both models use an LSTM to encode the question words into a vector representation. The AVG model is based on [8]; it utilizes the activations of the penultimate layer of a pretrained ResNet-152 [4] as the image representation and does not employ an attention mechanism. The POOL model is based on [11]; it considers the tensor of activations of the last pooling layer of the same ResNet and employs an attention mechanism over the image regions to obtain an image representation in vector form. Both models are trained with Adam [12]; the code, including all hyperparameter settings, is available at github.com/ilija139/vqa-soft.
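To illustrate how per-question targets for the soft loss might be assembled from the ten human answers, here is a hypothetical helper; answer_to_idx, a map from answer strings to class indices, is an assumed input, and answers outside the candidate vocabulary are simply dropped:

    from collections import Counter

    def answer_targets(gt_answers, answer_to_idx):
        # Map the ten human answers to (class indices, normalized weights).
        counts = Counter(a for a in gt_answers if a in answer_to_idx)
        idx = [answer_to_idx[a] for a in counts]
        weights = [c / len(gt_answers) for c in counts.values()]
        return idx, weights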

4 Discussion and Conclusion

From Table 1 we can observe that the proposed loss increases the overall accuracy of both the simpler AVG model and the POOL model, by as much as 1.6%. The accuracy is increased for both models and all answer types, which demonstrates the general applicability of the soft cross entropy loss.

In Figures 1 and 2 we can clearly observe the abnormal relationship between the validation loss and accuracy, where both start to increase around the halfway point of training. Furthermore, we can observe how the cross entropy loss rapidly decreases on the training set without a corresponding increase in validation accuracy or decrease in validation loss.

The evaluation results show that by modeling the VQA evaluation metric more faithfully than conventional classification loss functions, the proposed loss function is able to bring a consistent increase in accuracy for VQA models.

References