Exploiting Test Time Evidence to Improve Predictions of Deep Neural Networks

11/24/2018 ∙ by Dinesh Khandelwal, et al.

Many prediction tasks, especially in computer vision, are inherently ambiguous. For example, the output of semantic segmentation may depend on the scale one is looking at, and image saliency or video summarization is often user or context dependent. Arguably, in such scenarios, exploiting instance specific evidence, such as scale or user context, can help resolve the underlying ambiguity, leading to improved predictions. While existing literature has considered incorporating such evidence in classical models such as probabilistic graphical models (PGMs), there is limited (or no) prior work looking at this problem in the context of deep neural network (DNN) models. In this paper, we present a generic multi-task learning (MTL) based framework which handles the evidence as the output of one or more secondary tasks, while modeling the original problem as the primary task of interest. Our training phase is identical to the one used by standard MTL architectures. During prediction, we back-propagate the loss on the secondary task(s) such that network weights are re-adjusted to match the evidence. An early stopping or two-norm based regularizer ensures that the weights do not deviate significantly from the ones learned originally. Implementation in two specific scenarios, (a) predicting semantic segmentation given image level tags and (b) predicting instance level segmentation given a text description of the image, demonstrates the effectiveness of our proposed approach.




1 Introduction

Over the last decade, Deep Neural Networks (DNNs) have become a leading technique for solving a variety of problems in Artificial Intelligence, for example, Semantic Segmentation [10, 11, 35, 47, 6], Image Classification [22], and Optical Flow Estimation [25, 39] in Computer Vision; and Named Entity Recognition [13, 28] and Machine Translation [44, 14] in Natural Language Processing.

Figure 1: Three examples of Mooney face images [37]. Even for human perception, the images are difficult to interpret at the outset, but once additional information is provided that each image contains the face of a lady, it becomes easier to perceive them. Our proposed framework is inspired by this behavior and tries to improve the prediction of a DNN by providing additional cues at test time. As we show in the results section, Mask R-CNN [20], as is, fails to detect a face in the above images, but when given an additional cue in the form of a natural language description of the images (“Beautiful lady smiling in front of a screen”), it easily detects the faces.

Often, such networks are designed and trained for a specific task at hand. However, when multiple correlated tasks are given, the Multi-Task Learning (MTL) [9] framework allows a deep network to learn shared features from multiple tasks jointly. MTL often achieves better generalization by using the shared information contained in the different tasks, and improves the performance of each individual task.

In an MTL framework, one is usually interested in the output of all the tasks at hand. However, researchers have also looked at scenarios where only a subset of the tasks (called ‘principal’ task(s)) are of interest and the other tasks (called ‘auxiliary’ task(s)) serve only to jointly learn the generic shared representation [31, 8, 41]. In such cases, auxiliary tasks are generally derived from easily available side information about the data. For example, one can use an MTL framework with semantic segmentation of an image as the principal task and prediction of the types of objects present in the image as an auxiliary task.

One significant limitation of the MTL frameworks suggested so far is that they make use of auxiliary information only during the training process. This is despite the fact that such information is often also available at test time, e.g., tags on a Facebook image. Further, in several other cases, this additional information could be gathered using relatively inexpensive means. Arguably, exploiting this information at test time can significantly boost prediction accuracy by resolving the underlying ambiguity and/or correcting modelling errors.

The motivation for incorporating auxiliary information at test time can also be drawn from human perception. Figure 1 shows two-tone Mooney images [37] used by Craig Mooney to study perceptual closure in children. Here, perceptual closure is the ability to form a complete percept of an object or pattern from an incomplete one. It was shown that, though it may be difficult to make much sense of any structure in the given images at first, once additional information is provided that these represent the face of a woman, one easily starts perceiving them.

A natural question to ask is whether there is a way to incorporate similar instance specific additional clues in modern deep neural network models. While works in the classical machine learning literature, such as those on PGMs [27], have considered conditional inference, to the best of our knowledge there is no prior work incorporating such auxiliary information in the context of DNN models, especially using an MTL based framework. We will henceforth refer to the instance specific auxiliary information as evidence.

We model our task of interest, e.g., semantic segmentation, as the primary task in the MTL framework. The evidence is modelled as the output of one or more secondary tasks, e.g., image tags. While our training process is identical to the one used by standard MTL architectures, our testing phase is quite different. Instead of simply doing a forward propagation during prediction, we back-propagate the loss on the output of the secondary tasks (evidence) and re-adjust the weights learned to match the evidence. In order to avoid over-fitting the observed evidence, we employ a regularizer in the form of two norm penalty or early stopping, so that the weights do not deviate significantly from their originally learned values.

We provide the implementation of our framework for two specific tasks: (a) Image segmentation with image level tags available as evidence (b) Instance level segmentation with image captions as evidence. In both the cases, our experiments show significant improvements in prediction accuracy using our approach. The contributions of our work can be summarized as follows: (1) We propose a generic MTL framework to incorporate evidence at prediction time in deep learning models (2) We propose an approach to re-adjust the weights of the deep network so as to match the network output to evidence (3) We provide two task specific implementations of our proposed approach demonstrating its effectiveness.

2 Related Work

As argued earlier, though our architecture may seem similar in style to existing work that boosts the performance of the primary task using auxiliary tasks [31, 8, 41], the key difference is that, while earlier works exploit correlated tasks only during the training process, we additionally back-propagate the available (instance specific) evidence at prediction time. This is an important conceptual difference and can result in significant improvements by exploiting additional information, as shown by our experiments.

We would also like to differentiate our work from that of posterior inference with priors. While priors can be learned for sample distributions, our work suggests conditional inference in the presence of sample specific evidence. Similarly, posterior regularization technique [17] changes the output distribution directly, albeit, only based on characteristics of the underlying data statistics. No sample specific evidence is used.

Another closely related research area is multi-modal inference [26, 12], which also incorporates additional features from auxiliary information. While this does effectively incorporate evidence at prediction time in the form of additional features, practically speaking, designing a network to take additional information from a highly sparse information source is non-trivial. (For example, in one of our experiments we show semantic segmentation conditioned upon image level tags as given by an image classification task. It is easy to see that designing an MTL based DNN architecture for semantic segmentation and image classification is not difficult. On the other hand, designing a network which takes a single image label and generates features for merging with RGB image features seems non-trivial.) However, the strongest argument in support of our framework is its ability to work even when only a single set of annotations is available. It is possible to train our architecture even when we have a dataset containing either primary or auxiliary annotations. A multi-modal input based architecture, on the other hand, would require a dataset containing both annotations at the same time, which greatly restricts its applicability. Note that the argument extends to test time as well: if auxiliary information is unavailable, our framework can fall back to regular predictions, while an architecture with multi-modal input will not work at all.

Some recent works [38, 45, 36] have proposed constraining the output of a DNN to help regularize the output and reduce the amount of training data required. While all these works suggest constraints during training, our approach imposes the constraints both at training and at inference time.

We note that our framework is similar in spirit to another contemporary work by Lee et al. [29], who have also proposed to enforce test time constraints on a DNN output. However, while their idea is to enforce ‘prior deterministic constraints’ arising out of natural rule based processing, our framework is inspired by the use of any easily available and arbitrary type of auxiliary information. Our framework can be used together with theirs, and is also more general because it does not require the constraints to be defined on the output of the same stream.

3 Framework for Back-propagating Evidence

In this section, we present our approach for boosting the accuracy of a given task of interest by incorporating evidence. Our solution employs a generic MTL [43] based architecture which consists of a main (primary) task of interest, and an auxiliary task whose desired output (label) represents the evidence in the network. The key contribution of our framework is its ability to back-propagate the loss on the auxiliary task during prediction time, such that the weights are re-adjusted to match the output of the auxiliary task with the given evidence. In this process, as we will see, the shared weights (in MTL) also get re-adjusted, producing a better output on the primary task. This is what we refer to as back-propagating evidence through the network (at prediction time). We note that though we describe our framework using a single auxiliary task to keep the notation simple, it is straightforward to extend this to a setting with more than one auxiliary task (and associated evidence at prediction time).

Figure 2: Model Architecture for Multi-task learning (MTL)
Figure 3: Loss propagation at train and prediction time

3.1 Background on MTL


We will use $T_p$ to denote the primary task of interest. Similarly, let $T_a$ denote the auxiliary task in the network. Let $(x^{(i)}, y^{(i)}, z^{(i)})$ denote the $i$-th training example, where $x^{(i)}$ is the input feature vector, $y^{(i)}$ is the desired output (label) of the primary task, and $z^{(i)}$ denotes the desired output (label) of the auxiliary task. Correspondingly, let $\hat{y}^{(i)}$ and $\hat{z}^{(i)}$ denote the outputs produced by the network for the primary and auxiliary tasks, respectively.


Figure 2 shows the MTL based architecture [43] for this set-up. There is a common set of layers shared between the two tasks, followed by the task specific layers. Here, $h$ represents the common hidden feature representation fed to the two task specific parts of the architecture. For ease of notation, we will refer to the shared set of layers as the trunk.

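The network has three sets of weights. First, there are the weights associated with the trunk, denoted by $W_s$. $W_p$ and $W_a$ are the sets of weights associated with the two task specific branches, respectively. The total loss is a function of these weight parameters and can be defined as:

$$L(W_s, W_p, W_a) = \sum_{i} \left[ L_p\big(\hat{y}^{(i)}, y^{(i)}\big) + \lambda \, L_a\big(\hat{z}^{(i)}, z^{(i)}\big) \right]$$

Here, $L_p$ and $L_a$ denote the loss for the primary and auxiliary tasks, respectively. $\lambda$ is the importance weight for the auxiliary task. The sum is taken over the examples in the training set. $L_p$ is a function of the shared set of weights $W_s$ and the task specific weights $W_p$. Similarly, $L_a$ is a function of the shared weights $W_s$ and the task specific weights $W_a$.

The goal of training is to find the weights which minimize the total loss over the training data. Using the standard approach of gradient descent, the gradients can be computed as follows:

  1. $\nabla_{W_p} L = \sum_{i} \dfrac{\partial L_p\big(\hat{y}^{(i)}, y^{(i)}\big)}{\partial W_p}$

  2. $\nabla_{W_a} L = \lambda \sum_{i} \dfrac{\partial L_a\big(\hat{z}^{(i)}, z^{(i)}\big)}{\partial W_a}$

  3. $\nabla_{W_s} L = \sum_{i} \left[ \dfrac{\partial L_p\big(\hat{y}^{(i)}, y^{(i)}\big)}{\partial W_s} + \lambda \, \dfrac{\partial L_a\big(\hat{z}^{(i)}, z^{(i)}\big)}{\partial W_s} \right]$

Note that the weights in the task specific branches, i.e., $W_p$ and $W_a$, can only affect losses defined over the respective tasks (items 1 and 2 above). On the other hand, the weights $W_s$ in the trunk affect the losses defined over both the primary and the auxiliary tasks. Next, we describe our approach of back-propagating the loss over the evidence.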
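The gradient structure described above can be checked on a toy model. The following is a minimal sketch, assuming a scalar trunk and two scalar linear heads with squared-error losses; the model and all names (`w_s`, `w_p`, `w_a`, `lam`) are illustrative assumptions and not the architecture used in the paper.

```python
# Toy scalar network (illustrative assumption, not the paper's architecture):
# trunk h = w_s * x, primary head y_hat = w_p * h, auxiliary head z_hat = w_a * h.

def forward(x, w_s, w_p, w_a):
    h = w_s * x
    return h, w_p * h, w_a * h


def total_loss(x, y, z, w_s, w_p, w_a, lam):
    """Primary loss plus lam times auxiliary loss (squared error assumed)."""
    _, y_hat, z_hat = forward(x, w_s, w_p, w_a)
    return (y_hat - y) ** 2 + lam * (z_hat - z) ** 2


def gradients(x, y, z, w_s, w_p, w_a, lam):
    """Analytic gradients of total_loss with respect to (w_s, w_p, w_a)."""
    h, y_hat, z_hat = forward(x, w_s, w_p, w_a)
    d_p = 2.0 * (y_hat - y)  # derivative of the primary loss w.r.t. y_hat
    d_a = 2.0 * (z_hat - z)  # derivative of the auxiliary loss w.r.t. z_hat
    g_p = d_p * h                            # primary head: primary loss only
    g_a = lam * d_a * h                      # auxiliary head: auxiliary loss only
    g_s = (d_p * w_p + lam * d_a * w_a) * x  # trunk: both losses contribute
    return g_s, g_p, g_a


x, y, z = 1.5, 2.0, -1.0
w_s, w_p, w_a, lam = 0.8, 1.2, -0.5, 0.3
g_s, g_p, g_a = gradients(x, y, z, w_s, w_p, w_a, lam)
```

Changing `lam` alters `g_a` and `g_s` but leaves `g_p` unchanged, which is the sense in which the primary branch is unaffected by the auxiliary loss.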

3.2 Our Approach - Prediction

During test time, we are given additional information about the output of the auxiliary task. Let us denote this by $e$ (evidence) to distinguish it from the auxiliary outputs during training time. Then, for inference, instead of directly proceeding with forward propagation, we first adjust the weights of the network such that the network is forced to match the evidence on the auxiliary task. Since the two tasks are correlated, we expect that this process will adjust the weights of the network in a manner such that resolving the ambiguity over the auxiliary output will also result in an improved prediction over the primary task of interest.

This can be achieved by defining a loss $L_a(\hat{z}, e)$ in terms of the evidence $e$ and then back-propagating its gradient through the network. Note that this loss only depends on the set of weights $W_a$ in the auxiliary branch and the weights $W_s$ in the trunk. In particular, the weights $W_p$ remain untouched during this process. Finally, we would also like to make sure that our weights do not deviate too much from the originally learned weights, to avoid over-fitting to the evidence. This can be achieved by adding a two-norm based regularizer which discourages weights that are far from the originally learned weights. The corresponding weight update equations can be derived using the following gradients:

  1. $\nabla_{W_a} = \dfrac{\partial L_a(\hat{z}, e)}{\partial W_a} + \alpha \, (W_a - W_a^{*})$

  2. $\nabla_{W_s} = \dfrac{\partial L_a(\hat{z}, e)}{\partial W_s} + \alpha \, (W_s - W_s^{*})$

Here, $W_s^{*}$ and $W_a^{*}$ denote the weights learned during training and $\alpha$ is the regularization parameter. Note that these equations are similar to those used during training (items 2 and 3), with the differences that (1) the loss is now computed with respect to the single test example, (2) the effect of the term dependent on the primary loss has been zeroed out, and (3) a regularizer term has been added. In our experiments, we also tried early stopping instead of adding the norm based regularizer, though the latter worked marginally better in our experimental analysis.

Algorithm 1 describes our algorithm for weight update during test time, and Figure 3 explains it pictorially. Once the new weights are obtained, they can be used in the forward propagation to obtain the desired value on the primary task.

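Input: $x$ (input), $e$ (evidence)
Input: $\eta$ (learning rate), $k$ (iterations), $\alpha$ (regularization parameter)
Input: $W_s^{*}$, $W_a^{*}$: originally trained weights
Initialize $W_s \leftarrow W_s^{*}$, $W_a \leftarrow W_a^{*}$
for $i = 1$ to $k$ do
     Calculate the loss over evidence, $L_a(\hat{z}, e)$
     Compute $\nabla_{W_s}$ and $\nabla_{W_a}$ using back-propagation
     Update $W_s$ and $W_a$ using the gradient descent rule
end for
Return the newly optimized weights $W_s$, $W_a$
Algorithm 1: Weight update at test time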
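The test-time update can be sketched on the same kind of toy model. This is a minimal illustration, assuming a scalar trunk weight, scalar head weights, a squared-error evidence loss, and a proximity penalty whose gradient is `alpha * (w - w_original)`; the function name `adapt_to_evidence` and all parameter values are hypothetical and chosen only to keep the example small.

```python
def adapt_to_evidence(x, e, w_s0, w_a0, lr=0.05, iters=10, alpha=0.1):
    """Adjust trunk and auxiliary weights so the auxiliary output moves toward e.

    Toy model (assumption): h = w_s * x, z_hat = w_a * h, loss = (z_hat - e) ** 2.
    The primary head is never touched; it only sees the updated trunk.
    """
    w_s, w_a = w_s0, w_a0
    for _ in range(iters):  # a small iteration count acts as early stopping
        h = w_s * x
        z_hat = w_a * h
        d_a = 2.0 * (z_hat - e)
        # Evidence-loss gradient plus a pull back toward the trained weights.
        g_s = d_a * w_a * x + alpha * (w_s - w_s0)
        g_a = d_a * h + alpha * (w_a - w_a0)
        w_s, w_a = w_s - lr * g_s, w_a - lr * g_a
    return w_s, w_a


x, e = 1.0, 2.0                   # test input and observed evidence
w_s0, w_p, w_a0 = 1.0, 2.0, 1.0   # weights "learned during training"

w_s_new, w_a_new = adapt_to_evidence(x, e, w_s0, w_a0)

z_before, z_after = w_a0 * w_s0 * x, w_a_new * w_s_new * x
y_before, y_after = w_p * w_s0 * x, w_p * w_s_new * x  # primary head reuses w_p
```

In this sketch the primary prediction changes only because the shared trunk weight moved; a larger `alpha` keeps the adapted weights closer to their trained values.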

3.3 Interpreting as a Graphical Model

In this section, we present a probabilistic graphical model's perspective of our approach. Referring back to Figure 2, we can define a probabilistic graphical model over the random variables $X$ (input), $H$ (hidden representation), $Y$ (primary output) and $Z$ (auxiliary output). Interpreting this as a Bayesian network (with arrows going from $X \to H$, $H \to Y$ and $H \to Z$), we are interested in computing the probabilities $P(Y \mid X)$ and $P(Z \mid X)$ at inference time. Further, we have:

$$P(Y \mid X) = \sum_{H} P(Y \mid H, X)\, P(H \mid X) = \sum_{H} P(Y \mid H)\, P(H \mid X)$$

In the first conditional probability term on the RHS, the dependence on $X$ is taken away since $Y$ is independent of $X$ given $H$. Since, in our network, $H$ is fully determined by $X$ (due to the nature of forward propagation), we can write this dependence as $H = f(X)$. In other words, for $X = x$ there is a value $h = f(x)$ such that $P(H = h \mid X = x) = 1$. Therefore, the above equation can be equivalently written as:

$$P(Y \mid X = x) = P(Y \mid H = f(x))$$

Note that the sum over $H$ disappears since $P(H \mid X = x)$ is non-zero only when $H = f(x)$, as defined above. Similarly:

$$P(Z \mid X = x) = P(Z \mid H = f(x))$$

The goal of inference is to find the values of $Y$ and $Z$ maximizing $P(Y \mid X)$ and $P(Z \mid X)$, respectively. The parameters of the graphical model are learned by minimizing the cross entropy or some other kind of surrogate loss over the training data.

Let us analyze what happens at test time. We are given the evidence $Z = e$ at test time. In the light of this observation, we would like to change our distribution over $Z$ such that the probability of observing $e$ is maximized, i.e., $\arg\max_{z} P(Z = z \mid X = x)$ is equal to $e$. Recalling that $P(Z \mid X = x) = P(Z \mid H = f(x))$, in order to effect this, we may:

  1. Change the distribution $P(Z \mid H)$ to $P'(Z \mid H)$, or

  2. Change the function $f$ to $f'$,

such that $\arg\max_{z} P'(Z = z \mid H = f'(x))$ is as close to $e$ as possible. How can this be done in a principled manner? We define an appropriate loss capturing the discrepancy between the value $\hat{z}$ predicted using the distribution $P(Z \mid X = x)$ and the evidence $e$, i.e., $L_a(\hat{z}, e)$. The loss term also incorporates a regularizer so that the new parameters do not deviate significantly from the original set of parameters, avoiding over-fitting to the evidence $e$.

In order to minimize the loss, we can back-propagate its gradient through the DNN and learn the new set of parameters. This results in a change in the dependence of $H$ on $X$, i.e., $f \to f'$, as well as in that of $Z$ on $H$, i.e., $P(Z \mid H) \to P'(Z \mid H)$. The resulting parameters are $W_s'$ and $W_a'$, which effectively generate a new distribution $P'(Y \mid X)$ over $Y$. Hence, adjusting the DNN weights in order to match the evidence also results in an updated prediction over the primary task aligned with the observed evidence.
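The step in which the sum over the hidden variable disappears can be illustrated with a small discrete example. This is a sketch under assumed values: the conditional table `P_Y_GIVEN_H`, the two-valued hidden state, and the threshold function `f` are all made up for illustration.

```python
# Assumed toy conditional table: P(Y | H) for a hidden state H in {0, 1}
# and a primary label Y in {0, 1}.
P_Y_GIVEN_H = {0: [0.7, 0.3], 1: [0.2, 0.8]}


def f(x):
    """Deterministic 'forward pass': the hidden state is a function of x."""
    return 0 if x < 0 else 1


def p_h_given_x(h, x):
    """P(H = h | X = x) puts all its mass on h = f(x)."""
    return 1.0 if h == f(x) else 0.0


def p_y_given_x(x):
    """Full marginalization: sum over h of P(Y | h) * P(h | x)."""
    num_labels = len(P_Y_GIVEN_H[0])
    return [
        sum(P_Y_GIVEN_H[h][y] * p_h_given_x(h, x) for h in P_Y_GIVEN_H)
        for y in range(num_labels)
    ]


marginal = p_y_given_x(2.0)       # computed with the explicit sum
shortcut = P_Y_GIVEN_H[f(2.0)]    # computed as P(Y | H = f(x))
```

Because the hidden state is deterministic, the explicit sum and the direct lookup agree; replacing `f` with a different function is then enough to change the resulting distribution over `Y`.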

4 Semantic Segmentation

Figure 4: DeepLab-MTL architecture with auxiliary task as classification (lower branch).
Figure 5: Comparison of visual results: DeepLab-MTL vs DeepLab-Aux. Our approach results in significantly improved segmentation.
Figure 6: Sensitivity of results on number of back-propagation iterations (early stopping) and parameter (L2-norm)

The task of semantic segmentation involves assigning a label to each pixel in the image from a fixed set of object categories. Semantic segmentation is an important part of scene understanding and is a critical first step in many computer vision tasks. In many semantic segmentation applications, image level tags are easily available and encapsulate important information about the context, scale and saliency. We explore the use of such tags as auxiliary information at test time for improving the prediction accuracy. As noted in earlier sections, though the use of auxiliary information in the form of natural language sentences [24, 34] has been suggested, these earlier works have used this information only during training. In contrast, we are interested in exploiting this information both during training and at test time.


Most current state-of-the-art methods for semantic segmentation, such as SegNet [6], DeepLabv2 [10], PSPNet [47], and U-Net [42], are based on DNN architectures. Most of these works use a fully convolutional network (FCN) architecture, replacing earlier models which used fully connected layers at the end. Each convolution layer is typically followed by a pooling layer. Several innovations have been proposed in the design of pooling layers (or their replacements), including the introduction of atrous/dilated convolutions [23] and pyramid pooling [18, 21]. Another set of models is based on the encoder-decoder architecture [32, 42], which retains the spatial resolution by using long-range residual connections.

Our Implementation

Our implementation builds on DeepLabv2 [10], which is one of the popular segmentation models. DeepLabv2 has been one of the leaders on the Pascal VOC data challenge [16]. DeepLab builds on the ResNet-101 architecture, which was originally designed for classification tasks. We have used the publicly available implementation of DeepLabv2 [1]. For ease of notation, we will refer to the DeepLabv2 model as ‘DeepLab’.

To use our framework, we have extended the DeepLab architecture to simultaneously solve the classification task in an MTL setting. Figure 4 describes our proposed MTL architecture in detail. Starting with the original DeepLab architecture (top part in the figure), we branch off from layer 3 to solve the classification task (branching from this layer worked best in our experiments). The resultant feature map is passed through an average pooling layer, a fully connected layer, and finally a softmax over 20 classes (the background class is excluded).
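The classification branch described above (average pooling, a fully connected layer, then a softmax over 20 classes) can be sketched in plain Python. This is only an illustration of the data flow: the two-channel 2x2 feature map and the weight values are made-up assumptions, and a real implementation would use a deep learning framework on the actual DeepLab feature map.

```python
import math


def global_avg_pool(feature_map):
    """Average each channel (a 2D list of floats) down to a single number."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature_map]


def fully_connected(vec, weights, bias):
    """One logit per class: dot product of the pooled vector with a weight row."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


NUM_CLASSES = 20  # foreground classes only; background is excluded

# Assumed toy inputs: a 2-channel 2x2 feature map and arbitrary weights.
feature_map = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [2.0, 2.0]]]
weights = [[0.1 * k, -0.05 * k] for k in range(NUM_CLASSES)]
bias = [0.0] * NUM_CLASSES

pooled = global_avg_pool(feature_map)
probs = softmax(fully_connected(pooled, weights, bias))
```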

For training, we make use of cross-entropy based loss, both for the primary as well as the secondary tasks. We first train the segmentation only network to get the initial set of weights. These are then used to initialize the weights in the MTL based architecture (for the segmentation branch). The weights in the classification branch are randomly initialized. This is followed by a joint training of the MTL architecture. During prediction time, for each image, we back-propagate the loss based on observed evidence over the auxiliary task (for test image) resulting in weights re-adjusted to fit the evidence. These weights are used to make the final prediction (per-image). The parameters in our experiments were set as follows. During training, the parameter controlling the relative weights of the two losses is set of in all our experiments. During prediction, number of early stopping iterations was set of . parameter for weighing the two norm regularizer was set to .

Methodology and Dataset

We compare the performance of following four models in our experiments: (a) DeepLab (b) DeepLab-MTL (c) DeepLab-Aux-ES (d) DeepLab-Aux-L2. The first model uses vanilla DeepLab based architecture. The second enhances it further by using an MTL framework as described above. The last two models are based on our proposed approach and start with suffix DeepLab-Aux. We experiment with two variations based on the choice of regularizer during prediction: DeepLab-Aux-ES uses early stopping and DeepLab-Aux-L2 uses an L2-norm based penalty.

For our evaluation, we make use of the PASCAL VOC 2012 segmentation benchmark [16]. It consists of 20 foreground object classes and one background class. We further augmented the training data with the additional segmentation annotations provided by Hariharan et al. [19]. For our experiments, we only worked with the subset of images which contain a single object. This is because our classification network is currently designed to handle single labels. The resultant dataset had 6120 training and 927 validation images. We use mean intersection over union (mIoU) as our evaluation metric, which is standard for segmentation tasks.
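For reference, mean intersection over union can be computed as in the sketch below. It is a simplified version that works on one flattened label map and skips classes absent from both prediction and ground truth; both choices are assumptions, and benchmark toolkits typically accumulate intersections and unions over the whole dataset before averaging.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that appear in the prediction or the ground truth.

    pred and target are flat sequences of integer class labels, one per pixel.
    """
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes that do not occur at all
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
```

In this example the per-class IoUs are 1/3, 2/3 and 1/2, so the mean is 0.5.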


Method mIoU
DeepLab 76.5
DeepLab-MTL 78.4
DeepLab-Aux-ES 82.6
DeepLab-Aux-L2 83.1
Table 1: Comparison of results for semantic segmentation.

Table 1 compares the performance of the four models. We see some improvement in prediction accuracy due to the use of the MTL framework. However, adding auxiliary information at test time results in further significant improvement over the baselines.

The gain is as much as 6 mIoU points compared to vanilla DeepLab and more than 4 points compared to the MTL based architecture. Both our variations have comparable performance, with the L2-norm based model doing slightly better. Table 2 presents the results for each of the object categories. For all the object categories except one, we perform better than the baselines. The gain is as high as 10 points (or more) for the first three classes. For more than half of the classes (12 of 21 for DeepLab-Aux-L2), we beat the MTL baseline by at least 3 points.

Model           chair  table  sofa   dog    sheep  cow    boat   mbike  plant  bike   train  aero   tv     bird   car    cat    horse  bus    bg     person bottle
DeepLab-MTL     40.3   73.2   67.3   84.3   81.4   85.4   70.3   84.7   65.0   34.2   88.6   89.1   73.5   83.8   92.2   91.7   83.9   94.3   95.6   90.4   78.2
DeepLab-Aux-ES  55.7   83.6   78.7   92.8   89.1   93.1   74.2   88.0   68.4   37.4   91.6   92.1   75.9   86.0   94.2   93.7   84.7   94.8   95.8   90.9   75.1
DeepLab-Aux-L2  55.8   85.4   79.1   92.7   89.7   93.3   74.6   88.5   68.8   37.7   91.6   92.1   76.4   86.1   94.4   93.7   85.6   95.3   95.9   90.7   76.9
Table 2: Object category-wise comparison of results for semantic segmentation on Pascal VOC. Numbers denote mIoU.

Figure 5 shows the visual comparison of results for a set of hand picked examples. Our results are significantly better in terms of visual quality; our model is not only able to enhance the segmentation quality of already discovered objects, it can also discover new objects which are completely missed by the baseline. Figure 6 presents the sensitivity analysis with respect to number of early stopping iterations and the parameter controlling the weight of the L2 regularizer (during prediction). There is a large range of values in both cases where we get significant improvements.

5 Instance Segmentation

Figure 7: Mask-RCNN with LSTM based caption generator.
Figure 8: Comparison of visual results: Mask-MTL vs Mask-Aux-L2. Our approach is able to discover new objects (and their segmentation) which are missed by the baseline.
Method       AP    AP0.5  APS   APM   APL
Mask-MTL     31.4  52.8   26.0  58.5  72.8
Mask-Aux-ES  32.5  54.8   28.2  60.7  74.8
Mask-Aux-L2  32.6  54.8   27.5  60.8  75.0
Table 3: Comparison of results for instance segmentation. AP0.5 refers to AP at mIoU of 0.5. APL, APM, APS represent AP0.5 values for large, medium and small objects, respectively.
Type     Mask-MTL            Mask-Aux-L2
         P     R     F1      P     R     F1
All      88.4  38.6  53.8    86.7  42.2  56.8
Small    81.6  18.0  29.5    78.1  19.9  31.7
Medium   88.7  45.6  60.3    87.5  50.0  63.6
Large    91.9  64.8  76.0    90.9  70.4  79.3
Table 4: Mask-MTL vs Mask-Aux-L2 at 0.5 mIoU and 0.9 confidence threshold. P: Precision, R: Recall, F1: F-measure.

Next, we present our experimental evaluation on a multi-modal task: object instance segmentation given a textual description of the image. In object instance segmentation, the goal is to detect and localize individual objects in the image, along with a segmentation mask around each object. In our framework, we model instance segmentation as the primary task and image captioning as the auxiliary task. Arguably, instance segmentation is a more challenging task than the semantic segmentation discussed in the last section, since the latter can also be described as an instance of the former.


The recently proposed Mask R-CNN [20] is one of the most successful instance segmentation approaches. It is based on the Faster R-CNN [40] technique for object detection. In the first step, Faster R-CNN generates box level proposals using the Region Proposal Network (RPN). In the second step, each box level proposal is given an object label to detect the objects present in the overall image. Mask R-CNN uses the detector feature map and produces a segmentation mask for each detected bounding box, re-aligning the misaligned feature maps using a specially designed RoIAlign operation. Mask R-CNN predicts masks and class labels in parallel. Other notable works [30, 15] predict the instance segmentation using a fully convolutional network, to get benefits similar to those of FCNs for semantic segmentation. There have also been proposals to use CRFs for post-processing FCN outputs to group pixels of individual object instances [7, 5]. We have used Mask R-CNN in our experiments.

Our Implementation

Our MTL based architecture is shown in Figure 7. Here we combine Mask R-CNN with an LSTM decoder to generate the captions. We take the LSTM decoder from the state of the art caption generator “Show, Attend and Tell” (SAT) [46]. We use the publicly available implementations of both Mask R-CNN [3] and SAT [4]. To extract image features, we use ResNet-50 as the convolutional backbone network, denoted as ResNet-50-C4 in Figure 7. Here C4 denotes that the features are extracted from the final convolutional layer of the 4th stage of ResNet-50. The backbone architecture is shared between Mask R-CNN and the captioning decoder. We use the pre-trained weights of Mask R-CNN provided in their implementation [2] for the primary network that does instance segmentation. No fine tuning has been done for the primary task. The weights of the backbone network also remain fixed during the training phase of the caption decoder (secondary task). We do not perform joint training of the primary and secondary task networks for this particular application. The number of early stopping iterations was set to 10, and the parameter weighing the two-norm regularizer was set to 1000.

Methodology and Dataset

We compare three different models. We refer to the baseline model as Mask-MTL, and to our approaches as Mask-Aux-ES and Mask-Aux-L2, respectively, for the two types of regularizers used during prediction. We use the MS-COCO dataset [33] to evaluate our approach. The dataset consists of nearly 115k training images and 5k validation images. We report our results on the validation images. In the dataset, each image has at least five captions assigned by different annotators. We use AP (average precision) as our evaluation metric.


Table 3 compares the performance of the three approaches. We are two points better than the baseline for AP0.5. Comparing results across different object sizes, our gain seems to be maximum on large objects. The two variations of our model perform similarly. For the remaining experiments, we only present results comparing Mask-MTL with our Mask-Aux-L2 variant. Table 4 compares their performance in terms of precision, recall and F-measure at AP0.5. Our model gains on recall while suffering a slight loss on precision. The total gain in F1 score is 3.0 points over the total set of objects (53.8 to 56.8 in Table 4), and is mostly maintained across different object sizes. A careful analysis revealed that the ground truth itself has inconsistencies and misses a large number of very small objects, which are discovered by our algorithm. This leads to undue penalization of the scores of our algorithm. We plan to fix the ground truth in the final version, leading to even better comparison numbers.
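The F-measure reported in Table 4 is the harmonic mean of precision and recall. The short sketch below recomputes it from the precision and recall values in the "All" row; the helper name `f1_score` is our own, and small differences from the tabulated F1 values are expected because the table's precision and recall are themselves rounded.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Precision and recall for the "All" row of Table 4.
f1_mtl = f1_score(88.4, 38.6)   # Mask-MTL
f1_aux = f1_score(86.7, 42.2)   # Mask-Aux-L2
gain = f1_aux - f1_mtl
```

The recomputed values agree with the table to within rounding, and show how a gain in recall can outweigh a small loss in precision.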

Figure 8 presents a visual comparison of results. Our algorithm can detect new objects (sometimes even those not mentioned in the caption but correlated with it; see the caption containing “messy counter space”) and also improve segmentation at the same time. The comparison in the last row and last column is on a Mooney face [37], as referred to in the introduction. Mask-MTL incorrectly detects a bird, whereas Mask-Aux correctly detects a person with reasonable segmentation.

6 Conclusion

We have presented a novel approach to incorporate evidence into deep networks at prediction time. Our key idea is to model the evidence as auxiliary information in an MTL architecture and then modify the weights at prediction time such that the output of the auxiliary task(s) matches the evidence. Experiments on two different computer vision applications demonstrate the efficacy of our proposed model over the state of the art. In future, we would like to experiment with additional applications, including those defined over video.


  • [1] DeepLab implementaion. https://bitbucket.org/aquariusjay/deeplab-public-ver2. Accessed: 2017-11-02.
  • [2] Detectron Model Zoo and Baselines. https://github.com/facebookresearch/Detectron/blob/master/MODEL_ZOO.md. Accessed: 2018-10-22.
  • [3]

    Mask RCNN pytorch implementaion.

    https://github.com/roytseng-tw/Detectron.pytorch. Accessed: 2018-10-22.
  • [4] Show, Attend, and Tell pytorch implementation. https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Image-Captioning. Accessed: 2018-10-23.
  • [5] A. Arnab and P. H. Torr. Pixelwise instance segmentation with a dynamically instantiated network. In Proc. of CVPR, pages 879–888. IEEE, 2017.
  • [6] V. Badrinarayanan, A. Kendall, and R. Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12):2481–2495, 2017.
  • [7] M. Bai and R. Urtasun. Deep watershed transform for instance segmentation. In Proc. of CVPR, pages 2858–2866. IEEE, 2017.
  • [8] J. Bingel and A. Søgaard. Identifying beneficial task relations for multi-task learning in deep neural networks. In Proc. of EACL, pages 164–169, 2017.
  • [9] R. Caruana. Learning many related tasks at the same time with backpropagation. In Proc. of NIPS, pages 657–664, 1995.
  • [10] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. arXiv preprint arXiv:1606.00915, 2016.
  • [11] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.
  • [12] M.-M. Cheng, S. Zheng, W.-Y. Lin, V. Vineet, P. Sturgess, N. Crook, N. J. Mitra, and P. Torr. Imagespirit: Verbal guided image parsing. ACM Transactions on Graphics (TOG), 34(1):3, 2014.
  • [13] J. Chiu and E. Nichols. Named entity recognition with bidirectional lstm-cnns. Transactions of the Association of Computational Linguistics, 4(1):357–370, 2016.
  • [14] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
  • [15] J. Dai, K. He, Y. Li, S. Ren, and J. Sun. Instance-sensitive fully convolutional networks. In Proc. of ECCV, pages 534–549, 2016.
  • [16] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. International journal of computer vision, 88(2):303–338, 2010.
  • [17] K. Ganchev, J. Gillenwater, B. Taskar, et al. Posterior regularization for structured latent variable models. Journal of Machine Learning Research, 11(Jul):2001–2049, 2010.
  • [18] K. Grauman and T. Darrell. The pyramid match kernel: Discriminative classification with sets of image features. In Proc. of ICCV, pages 1458–1465, 2005.
  • [19] B. Hariharan, P. Arbeláez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In Proc. of ICCV, pages 991–998, 2011.
  • [20] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. In Proc. of ICCV, pages 2980–2988, 2017.
  • [21] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In Proc. of ECCV, pages 346–361, 2014.
  • [22] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proc. of CVPR, pages 770–778, 2016.
  • [23] M. Holschneider, R. Kronland-Martinet, J. Morlet, and P. Tchamitchian. A real-time algorithm for signal analysis with the help of the wavelet transform. In Wavelets, pages 286–297. 1989.
  • [24] R. Hu, M. Rohrbach, and T. Darrell. Segmentation from natural language expressions. In Proc. of ECCV, pages 108–124, 2016.
  • [25] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. Flownet 2.0: Evolution of optical flow estimation with deep networks. In Proc. of CVPR, pages 2462–2470, 2017.
  • [26] M. Kilickaya, N. Ikizler-Cinbis, E. Erdem, and A. Erdem. Leveraging captions in the wild to improve object detection. In Proc. of the 5th Workshop on Vision and Language, pages 29–38, 2016.
  • [27] D. Koller and N. Friedman. Probabilistic graphical models: principles and techniques. MIT press, 2009.
  • [28] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer. Neural architectures for named entity recognition. In Proc. of NAACL-HLT, pages 260–270, 2016.
  • [29] J. Y. Lee, M. Wick, J.-B. Tristan, and J. Carbonell. Enforcing constraints on outputs with unconstrained inference. arXiv preprint arXiv:1707.08608, 2017.
  • [30] Y. Li, H. Qi, J. Dai, X. Ji, and Y. Wei. Fully convolutional instance-aware semantic segmentation. In Proc. of CVPR, pages 2359–2367, 2017.
  • [31] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. Yuille. Semantic image segmentation with deep convolutional nets and fully connected crfs. In Proc. of ICLR, 2015.
  • [32] G. Lin, A. Milan, C. Shen, and I. Reid. Refinenet: Multi-path refinement networks for high-resolution semantic segmentation. In Proc. of CVPR, pages 5168–5177. IEEE, 2017.
  • [33] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In Proc. of ECCV, pages 740–755. Springer, 2014.
  • [34] C. Liu, Z. Lin, X. Shen, J. Yang, X. Lu, and A. L. Yuille. Recurrent multimodal interaction for referring image segmentation. In Proc. of ICCV, pages 1280–1289, 2017.
  • [35] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proc. of CVPR, pages 3431–3440, 2015.
  • [36] P. Márquez-Neila, M. Salzmann, and P. Fua. Imposing hard constraints on deep networks: Promises and limitations. arXiv preprint arXiv:1706.02025, 2017.
  • [37] C. M. Mooney. Age in the development of closure ability in children. Canadian Journal of Psychology/Revue canadienne de psychologie, 11(4):219, 1957.
  • [38] D. Pathak, P. Krahenbuhl, and T. Darrell. Constrained convolutional neural networks for weakly supervised segmentation. In Proc. of CVPR, pages 1796–1804, 2015.
  • [39] A. Ranjan and M. J. Black. Optical flow estimation using a spatial pyramid network. In Proc. of CVPR, pages 2720–2729, 2017.
  • [40] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Proc. of NIPS, pages 91–99, 2015.
  • [41] B. Romera-Paredes, A. Argyriou, N. Berthouze, and M. Pontil. Exploiting unrelated tasks in multi-task learning. In Proc. of AISTATS, pages 951–959, 2012.
  • [42] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In Proc. of MICCAI, pages 234–241, 2015.
  • [43] S. Ruder. An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098, 2017.
  • [44] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Proc. of NIPS, pages 3104–3112, 2014.
  • [45] J. Xu, Z. Zhang, T. Friedman, Y. Liang, and G. V. d. Broeck. A semantic loss function for deep learning with symbolic knowledge. arXiv preprint arXiv:1711.11157, 2017.
  • [46] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In Proc. of ICML, pages 2048–2057, 2015.
  • [47] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In Proc. of CVPR, pages 2881–2890, 2017.

Instance Segmentation

In the main paper, we have presented results on the semantic segmentation and object instance segmentation problems. We notice that in the case of instance segmentation, though our results are significantly better qualitatively, this is not fully reflected in the quantitative comparison. A careful analysis of the results reveals that there are often inconsistencies in the ground truth annotations themselves. For example, when many smaller objects are present, ground truth annotations often tag only a few of them as separate objects, whereas the others are combined into a single object. In other cases, some of the objects are either completely missed or their boundaries are not marked correctly. In many such instances, our framework is able to predict the correct output, but since the ground truth label is incorrect, we are penalized for detecting false positives, resulting in a smaller than expected improvement in the numbers. In the figures below, we highlight some examples to support our claim.
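The penalization described above can be illustrated with a small sketch. It assumes a simplified evaluation that is not the exact benchmark protocol: masks are sets of pixel coordinates, predictions are matched greedily in the given order (rather than by confidence score), and a match requires an IoU of at least 0.5. The helper names `mask_iou`, `count_matches`, and `rect`, and the toy masks, are hypothetical.

```python
def mask_iou(a, b):
    """Intersection over union of two masks given as sets of (x, y) pixels."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def count_matches(predictions, ground_truth, threshold=0.5):
    """Greedily match predictions to unused ground-truth masks.

    Returns (true_positives, false_positives, false_negatives).
    """
    unused = list(range(len(ground_truth)))
    tp = fp = 0
    for pred in predictions:
        best = max(unused, key=lambda i: mask_iou(pred, ground_truth[i]), default=None)
        if best is not None and mask_iou(pred, ground_truth[best]) >= threshold:
            unused.remove(best)
            tp += 1
        else:
            fp += 1
    return tp, fp, len(unused)


def rect(x0, y0, x1, y1):
    """Axis-aligned rectangular mask covering x0 <= x < x1 and y0 <= y < y1."""
    return {(x, y) for x in range(x0, x1) for y in range(y0, y1)}


# Toy scene with two objects; the second one is missing from the annotation.
object_a = rect(0, 0, 10, 10)
object_b = rect(20, 0, 26, 6)
predictions = [rect(0, 0, 10, 9), rect(20, 0, 26, 6)]  # both objects detected

print(count_matches(predictions, [object_a]))            # (1, 1, 0)
print(count_matches(predictions, [object_a, object_b]))  # (2, 0, 0)
```

With the incomplete annotation, the correct detection of the second object is counted as a false positive, which lowers precision even though the prediction is right; with the complete annotation, both detections are true positives.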

Figures 9–18 each show four panels: Input Image, Ground Truth, Mask-MTL Output, and Mask-Aux-L2 Output (Our Approach).
Figure 9: Input caption: “A cat looking at another cat on the television”.
Mask-Aux-L2 is able to detect both cats: the one in the main image and the one on the television. This is in contrast to Mask-MTL, which detects only the main cat. The ground truth also misses the cat on the television.
Figure 10: Input caption: “A group of sheep that are in the grass”.
Mask-Aux-L2 is able to detect 7 extra sheep given the caption as compared to Mask-MTL. These are marked as a single sheep in the ground truth.
Figure 11: Input caption: “Pizza on a serving tray with empty plates next to it”.
Mask-Aux-L2 is penalized relative to Mask-MTL because it (correctly) detects a piece of pizza that is not annotated in the ground truth.
Figure 12: Input caption: “Two plastic containers next to a banana on a table”.
Mask-Aux-L2 is able to detect both bowls present in the image, unlike Mask-MTL. The ground truth has only one bowl annotated.
Figure 13: Input caption: “A picture of a cellphone on a cellphone”.
Mask-Aux-L2 is able to detect the cellphone inside a cellphone, which is not discovered by Mask-MTL. The inner cellphone is also not annotated in the ground truth.
Figure 14: Input caption: “A woman is cooking on a large black stove”.
Mask-Aux-L2 is able to detect the (single) oven which is missed by Mask-MTL. The ground truth annotation incorrectly marks two visible sides of a single oven as two separate ovens.
Figure 15: Input caption: “A sandwich with nachos and a salad on a plate.”.
Mask-Aux-L2 correctly detects the sandwich as a single piece, which Mask-MTL misses. The ground truth annotation incorrectly labels it as two pieces.
Figure 16: Input caption: “Two teenage girls playing a video game together.”.
Mask-Aux-L2 is able to correctly detect two of the remotes present in the image. Mask-MTL misses them altogether. Since these remotes are occluded by the hands, one of them is counted as a false positive because the IoU between the predicted and ground truth remote is less than 0.5.
Figure 17: Input caption: “A baby sitting on a blanket on the ground eating a apple.”.
Mask-Aux-L2 is able to predict a bed in the image (there is no separate blanket class in the annotation) which is missed by Mask-MTL. Ground truth annotation only labels the baby.
Figure 18: Input caption: “A sink a mirror a towel and some bottles”.
Mask-Aux-L2 is able to detect 2 extra bottles and the sink as compared to Mask-MTL. One of the bottles predicted by our model (present in the middle of two bottles) is not annotated in the ground truth.