1 Introduction
†This work was done as part of a remote internship at NAVER WEBTOON Corp.
Questions play a crucial role in communication and learning. Not only do the questions of a curious child help her learn facts about the world, but they also help others understand what she has learnt and what she has not. In this paper, we explore the merit of questions from the perspective of observers rather than learners. For instance, if the learning agent asks, “what is in the box?”, and the answer is “a cat”, we focus on the new information that the agent did not know what was in the box.
Questions help us understand the information state of the model. Allowing models to ask questions thus brings a positive impact on the interpretability of a learning system. We explicitly explore the asking ability of a learning agent for a better understanding of the decisions it makes. Our research hypothesis is thus:
Allowing a learning agent to ask questions makes its inner-workings more interpretable to observers.
To demonstrate our hypothesis, we propose a class of neural networks that can ask questions, which we call asking networks. Following our asking paradigm, an agent learns a task while iteratively asking questions and being guided by pre-defined answering rules. The answers should be designed to help the agent predict the label, so that, in an attempt to reduce the loss, the agent asks questions and takes the answers in return. We focus on the questions the agent generates because they serve as a communication window between the agent and its observers.
Among possible applications, we test our asking paradigm on deep automatic colorization, a field of conditional image generation in which the model performs colorization given a grayscale image or outline drawing. Deep colorization is inherently multimodal [26, 29], meaning that there are multiple plausible outputs given the same input. The one-to-many ambiguity of deep colorization has been partially addressed by using an autoregressive method or by incorporating user priors and interactions [14, 29]. We instead design an asking network that asks for ground-truth colors in circumstances where the model output may vary.
Concretely, we allow our model to ask for the color of a model-chosen region of an image. Then, the single color answer, which is computed as the average ground-truth color of the corresponding region, is given to the model, and based on this answer the model performs colorization. We design our network in a recurrent setting so that the model asks questions one by one and updates the colorization output sequentially.
Fig. 1 illustrates an example of this process. Quantitative analyses show that our model learns to ask carefully chosen questions whose answers maximally reduce the colorization loss. Interestingly, the first question is shown to be the most effective for reducing the loss. The quantitative analysis on VOC, an image dataset with labeled class segmentation, also shows that our model learns semantically meaningful segmentations to be colorized with a single color and asks questions based on the learned segmentations.
We complete our discussion by validating that questions generated by our model justify our initial hypothesis that questions help us interpret the learner. To summarize, the contribution of this paper is twofold. First, we introduce the new class of asking networks that are capable of interpretable modeling. Second, we develop an exemplary asking network in the deep colorization domain to show the potential application of our proposed approach and support our research hypothesis.
2 Related Work
Interpretable deep learning
Despite the rapid growth of the field of deep neural networks, their often unexplainable outputs and black-box nature have prompted calls for methods that disclose the inner workings of these complicated models. Interpretable models have been studied in different areas, such as image generation, speech generation and text grounding for images. Many approaches to interpreting the hidden features of convolutional neural networks rely on visualization, either directly visualizing activation maps [8, 25] or revealing the subparts of an image that are responsible for the predicted outputs [19, 30]. These visualization methods are powerful tools for post-hoc analysis of a model, but they hardly convey information about the inner workings of a model, i.e., which step of the prediction contributes to the output. In contrast, in our proposed asking paradigm, we explicitly stimulate models to expose their weak parts, so that observers can interpret a particular stage of model behavior.
Deep automatic colorization has been gaining significant attention owing to recent advances in deep-learning-based image processing and encoding methods [14, 28]. Most deep colorization models are built on the U-net architecture, a popular encoder-decoder framework that has been adopted by many deep colorization studies [10, 3]. One practical application of deep colorization is outline drawing colorization. Colorizing line sketches could mitigate laborious and repetitive work, because some shapes and characters follow similar color patterns. A common approach to outline colorization is to use the U-net for color prediction and then further boost performance by employing adversarial training mechanisms [10, 12, 27] for realistic or artistic colorization.
Colorization is essentially a multimodal task in which the desired outcome could vary by person. Numerous studies have introduced new interactive colorization methods to incorporate user color perception [14, 28, 29]. These models postulate that user priors are necessary components for real-time user experience, and allow human interaction via global control [6, 14] or local control [17, 29]. However, in most models users have limited control over the colorization procedure, because it is unclear how much a single provided hint will influence the resulting image. Our model explicitly gives a region of influence, making it easy to understand how the hints provided by the user will be applied.
3 Asking Paradigm
Given a feature space and a target space, a conventional approach to supervised learning consists of a single predictive function that models the target given the features. We establish our asking paradigm by introducing an additional question space and an answer space, together with two new functions that map into the question space and into the answer space, respectively. The high-level goal is to learn to ask questions so as to improve the prediction. As we assume sequential learning and training of the model, our framework has three functions, each indexed by the timestep for each data instance: a question-generating function, a pre-specified function that provides a hint from the target, and a predictive function that outputs the predicted target.
Under this framework, we let the learning agent model both the predictive function and the question-generating function. More concretely, we build a model that is given a limited number of opportunities to ask questions. Answers to the corresponding questions are calculated by the hint-providing function, an information-reducing function that provides some limited information about the target that corresponds to the question. The calculated answer is relayed back to the model. The model then produces intermediate predictions before producing the final output. A loss is then applied based on a differentiable similarity metric between the final output and the target. This process encourages models to ask meaningful questions about the final answer. We call this class of networks, which model both the predictive function and the question-generating function, asking networks.
Fig. 2 shows a generic overview of asking networks. The agent takes the features as input and generates a question at every timestep. Based on the question and a pre-defined answer-generating function, the agent obtains an answer to its question. At the next timestep, using all available information, the agent predicts the next question and output. The same pattern iterates until the agent receives the error signal at the terminal step.
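One possible way to write this loop in symbols is sketched below. Every symbol name here is an assumption chosen for illustration and may differ from the notation of the original equations: $x$ is the input, $y$ the target, $F_\theta$ the learned network that outputs both a prediction and a question, $h$ the pre-specified answer function, $T$ the number of question opportunities, and $d$ the differentiable similarity metric.

```latex
% Hypothetical notation, illustrative only.
\begin{align*}
(\hat{y}_t,\, q_t) &= F_\theta\bigl(x,\; q_{1:t-1},\; a_{1:t-1}\bigr),
  & t &= 1, \dots, T, \\
a_t &= h(q_t,\, y),
  & t &= 1, \dots, T, \\
\hat{y}_{T+1} &= F_\theta\bigl(x,\; q_{1:T},\; a_{1:T}\bigr)\big|_{\text{prediction}}, \\
\mathcal{L} &= d\bigl(\hat{y}_{T+1},\, y\bigr).
\end{align*}
```

Under this reading, only $F_\theta$ is trained; $h$ stays fixed and acts as the channel through which limited information about $y$ reaches the agent.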
4 Automatic Colorization via Asking
As a particular application of the asking paradigm described above, this paper mainly considers automatic colorization, an example of image-to-image translation in which each pixel value of a target image has to be predicted. This task is a typical one-to-many problem that inherently involves ambiguity in prediction.
A natural, straightforward form of a question in this task would be, say, "what is the (ground-truth) color of a particular sub-region of a target image?". To prevent the model from trivially asking for the colors of the entire image, we limit the provided answer to a single color, namely the average color of the pixels contained in the sub-region.
Relaxing the sub-region so that each pixel can be partially contained in it, the question takes the form of an image-sized heatmap in which each pixel value lies in [0, 1]. Here, a value of 1 indicates that the corresponding pixel is completely contained in the sub-region of interest, while 0 indicates that it is not contained at all. Accordingly, the answer-providing function is straightforwardly computed as the heatmap-weighted average, over the height and width of the image, of a characteristic function of the target image. The characteristic function calculates the desired local characteristic, say, ground-truth pixel values.
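The answer computation can be sketched in a few lines of plain Python. This is a minimal illustration, assuming the answer is the heatmap-weighted average of ground-truth pixel values normalized by the total heatmap mass; the function name `answer`, the type aliases, and the small `eps` guard are hypothetical and not taken from the paper.

```python
from typing import List, Sequence

Heatmap = List[List[float]]          # H x W question, values in [0, 1]
Image = List[List[Sequence[float]]]  # H x W x C ground-truth colors


def answer(question: Heatmap, target: Image, eps: float = 1e-8) -> List[float]:
    """Return the heatmap-weighted average color of `target`."""
    channels = len(target[0][0])
    weighted_sum = [0.0] * channels
    total_weight = 0.0
    for q_row, y_row in zip(question, target):
        for q, pixel in zip(q_row, y_row):
            total_weight += q
            for c in range(channels):
                weighted_sum[c] += q * pixel[c]
    # eps guards against an all-zero question (an assumption of this sketch).
    return [s / (total_weight + eps) for s in weighted_sum]


# A 2 x 2 image with two color channels; the question covers the left column,
# with the lower-left pixel only half contained.
question = [[1.0, 0.0], [0.5, 0.0]]
target = [[(0.9, 0.1), (0.0, 0.0)], [(0.3, 0.7), (0.0, 0.0)]]
print(answer(question, target))
```

Because the answer is a single color regardless of how large the asked region is, a question spanning several differently colored objects returns a blurred average that is of little use to the model.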
As explained in Section 3, we allow the model to ask questions multiple times in a sequential manner so that it gradually improves its prediction of the target using the provided answers. Specifically, at each timestep we obtain both the predicted target image and the next question.
We transform the answer into the image size by multiplying it by the question heatmap, resulting in a new hint image.
Then all images are concatenated to form the input to the network at the next timestep. When generating the first prediction and question, as there is no previous prediction or hint image, we simply set these inputs to zeros.
When applying Eq. 1 in practice, we do not directly provide all previous hints to the model. Instead, we expect the model to embed and transmit its previous hints through its previous prediction. This ensures that the answers provided to the questions of the network are immediately applied. Fig. 3 gives a visual illustration of the proposed colorization asking networks.
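The recurrent procedure can be sketched as the loop below. This is only an illustration under stated assumptions: images are flattened single-channel lists for brevity, `toy_model` is a hypothetical stand-in for the trained U-net, and the names `run_asking_loop`, `answer` and `Model` are invented for this sketch.

```python
from typing import Callable, List, Tuple

Vec = List[float]  # a flattened single-channel image, for brevity

# Hypothetical model signature: (input, previous prediction, hint image)
# -> (new prediction, question heatmap).
Model = Callable[[Vec, Vec, Vec], Tuple[Vec, Vec]]


def answer(question: Vec, target: Vec, eps: float = 1e-8) -> float:
    """Heatmap-weighted average of the ground truth (the single-color answer)."""
    return sum(q * y for q, y in zip(question, target)) / (sum(question) + eps)


def run_asking_loop(model: Model, x: Vec, target: Vec, num_questions: int) -> Vec:
    """Ask `num_questions` questions one by one and return the final prediction."""
    prediction = [0.0] * len(x)  # no previous prediction at the first step
    hint = [0.0] * len(x)        # no previous hint image at the first step
    for _ in range(num_questions):
        prediction, question = model(x, prediction, hint)
        a = answer(question, target)
        # Spread the scalar answer over the asked region: hint = answer * question.
        hint = [a * q for q in question]
    # One last pass so that the final answer is reflected in the output.
    prediction, _ = model(x, prediction, hint)
    return prediction


def toy_model(x: Vec, prev: Vec, hint: Vec) -> Tuple[Vec, Vec]:
    """Stand-in for the network: copy hints in, then ask about the first unfilled pixel."""
    pred = [h if h > 0 else p for p, h in zip(prev, hint)]
    question = [0.0] * len(x)
    for i, p in enumerate(pred):
        if p == 0.0:
            question[i] = 1.0
            break
    return pred, question


final = run_asking_loop(toy_model, x=[0.0] * 3, target=[0.2, 0.5, 0.8], num_questions=3)
print(final)
```

Only the most recent hint image is passed to the model at each step, so earlier answers survive only if the model carries them forward in its prediction, which mirrors the design choice described above.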
During training, the model is given a random integer number of question opportunities, sampled uniformly. After the model has used its question opportunities, we apply an L2 loss to the final predicted target-domain image according to Eq. 9. We also apply a small ‘smoothing loss’ according to Eq. 10 to make the questions more understandable to humans; without this smoothing loss, questions tend to be discontinuous. We optimize the model by back-propagating the total loss, which combines these two terms.
4.1 Improving Question Quality
During our colorization experiments, we find a few techniques useful for improving the quality of questions generated by asking networks, as follows.
In our objective function (Eq. 8), if the model optimizes only the L2 loss and not the smoothing loss while it learns to utilize questions, the generated questions tend to be discontinuous, making them difficult for humans to interpret. To mitigate this problem, we apply the smoothing loss, which encourages nearby pixels in the question to hold similar values.
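The effect of such a smoothing term can be illustrated with a total-variation-style penalty. The sketch below is an assumption rather than the exact smoothing loss of the paper: it penalizes squared differences between adjacent heatmap pixels, and the names `smoothing_penalty`, `total_loss` and the default `weight` value are hypothetical.

```python
from typing import List

Heatmap = List[List[float]]


def smoothing_penalty(question: Heatmap) -> float:
    """Mean squared difference between horizontally and vertically adjacent pixels."""
    height, width = len(question), len(question[0])
    total, count = 0.0, 0
    for i in range(height):
        for j in range(width):
            if i + 1 < height:
                total += (question[i + 1][j] - question[i][j]) ** 2
                count += 1
            if j + 1 < width:
                total += (question[i][j + 1] - question[i][j]) ** 2
                count += 1
    return total / count if count else 0.0


def total_loss(l2_loss: float, question: Heatmap, weight: float = 0.1) -> float:
    """Combine the reconstruction loss with the weighted smoothing penalty."""
    return l2_loss + weight * smoothing_penalty(question)


smooth = [[0.5, 0.5], [0.5, 0.5]]   # constant heatmap: no penalty
checker = [[1.0, 0.0], [0.0, 1.0]]  # checkerboard: maximal penalty
print(smoothing_penalty(smooth), smoothing_penalty(checker))
print(total_loss(0.2, checker))
```

A discontinuous, checkerboard-like question is penalized heavily while a flat one is not, so the optimizer is nudged towards contiguous regions that are easier for a human to read.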
Injecting random noise to answer
The generated questions often exhibit low contrast between the pixels mainly contained in them and those that are not. In other words, the model essentially seeks a color answer across multiple different objects, which reduces training efficiency. To address this problem, we inject a small amount of random noise into the answer. Such noise makes small color differences indistinguishable, forcing the model to identify colors and objects as clearly as possible. We observe that the model learns to suppress the regions it does not highlight and to focus its question solely on a relatively small region, which mostly corresponds to a single object. A significant performance gain from the noise-injecting technique is reported in Section 5.1.3.
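A minimal sketch of the noise injection follows. The noise distribution and its scale are not specified above, so zero-mean Gaussian noise and the default `noise_std` value are assumptions, as is the function name `noisy_answer`.

```python
import random
from typing import List, Optional


def noisy_answer(answer: List[float], noise_std: float = 0.05,
                 rng: Optional[random.Random] = None) -> List[float]:
    """Add zero-mean Gaussian noise to each channel of the answer color."""
    rng = rng or random.Random()
    return [channel + rng.gauss(0.0, noise_std) for channel in answer]


clean = [0.20, 0.45]
print(noisy_answer(clean, rng=random.Random(0)))
```

With noise of this kind, an answer averaged over a low-contrast question that mixes two similar colors becomes hard to tell apart from either color alone, which is one way to read why sharper, single-object questions are favored.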
4.2 Peripheral Details
We use a U-net as the network architecture of our model. In addition to the main output channels that contain the predicted color values of an uncolored input image, our output contains an additional channel that acts as a question heatmap, generated by a sigmoid layer at the end so that its values range from 0 to 1. We use for natural images and for cartoon images. We use the Adam optimizer to optimize the neural networks. When we colorize real-world images, we use the CIELAB color space and predict the a*b* channel values given the grayscale L channel. When we colorize comic images given an outline sketch, we predict the RGB channel values. We use the 2011 ImageNet training dataset to train the real-world image colorization models. For the cartoon images, we use a set of images collected from the NAVER WEBTOON platform; they are used with the authorization of the artists, which was obtained with the support of NAVER WEBTOON Corp.
5.1 Quantitative Analysis
5.1.1 Effects of Question Order
We allow the model to ask several questions per image. However, the number of questions varies randomly from 0 up to a maximum during training, so the model does not know in advance how many chances it will be given. This drives the model to act greedily and benefit as much as possible from every question opportunity. To evaluate how much the model benefits from each question, we train an auxiliary colorization network, following the main network architecture proposed in Zhang et al., as a baseline trained with no hints at all. We use the baseline to measure the average loss at every pixel of an image and to check whether our model actually asks about regions that are ambiguous to colorize. To this end, using 10,000 ImageNet validation images, we forward-propagate each image through the baseline model to compute the pixel-wise error map of the color prediction. Using the same input image, we generate three questions from our trained asking network and obtain the three corresponding question maps. Afterwards, we compute the global average error per pixel, which we call the baseline, as well as its weighted versions, each using one of the three question maps as the weights.
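The two error statistics can be sketched as follows. Normalizing the weighted version by the total heatmap mass is an assumption of this sketch, and the function names `average_error` and `weighted_error` are hypothetical.

```python
from typing import List

Grid = List[List[float]]


def average_error(error_map: Grid) -> float:
    """Global average error per pixel (the baseline statistic)."""
    values = [e for row in error_map for e in row]
    return sum(values) / len(values)


def weighted_error(error_map: Grid, question: Grid, eps: float = 1e-8) -> float:
    """Average error weighted by the question heatmap."""
    num = sum(q * e for q_row, e_row in zip(question, error_map)
              for q, e in zip(q_row, e_row))
    den = sum(q for q_row in question for q in q_row)
    return num / (den + eps)


error_map = [[4.0, 1.0], [1.0, 2.0]]
first_question = [[1.0, 0.0], [0.0, 0.0]]  # asks about the hardest pixel
print(average_error(error_map), weighted_error(error_map, first_question))
```

A weighted error above the global average indicates that the question concentrates on pixels the hint-free baseline colors poorly.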
Fig. 4(a) shows the distributions of these errors across 10,000 validation images. The highest errors of Q1 among the baseline as well as all the questions indicate that the first question asks about the color of regions that the model predicts worse than any other regions. Q2 asks about those regions that the model can predict slightly better than Q1 but still worse than the global average. Finally, Q3 asks about those regions with substantially low prediction error. These results imply that our model learns to greedily ask what it finds as the most challenging to predict at every step.
| |0|22.82 ± 0.52|22.82 ± 0.30|22.82 ± 0.30|± 0.14|± 0.13|
| |1|22.96 ± 0.51|23.28 ± 0.30|24.01 ± 0.30|25.37 ± 0.14|± 0.14|
| |2|22.93 ± 0.52|23.55 ± 0.29|24.85 ± 0.29|26.19 ± 0.14|± 0.15|
| |3|23.44 ± 0.51|23.85 ± 0.29|25.27 ± 0.29|26.69 ± 0.14|± 0.14|
|Max|0|23.13 ± 0.31|22.97 ± 0.30|22.97 ± 0.30|24.43 ± 0.14|-|
| |1|18.21 ± 0.27|23.82 ± 0.30|19.22 ± 0.29|25.58 ± 0.14|-|
| |2|20.95 ± 0.29|24.54 ± 0.29|23.68 ± 0.29|26.94 ± 0.14|-|
| |3|22.14 ± 0.29|25.04 ± 0.29|24.94 ± 0.28|27.76 ± 0.14|-|
5.1.2 Performance Comparisons of Hint-based Colorization
In this experiment, we analyze the gain per answer given to the question asked by our model, in the context of interactive colorization. To this end, we compare the performance gain in terms of the peak signal-to-noise ratio (PSNR) between our proposed method and other hint-based colorization models [2, 7, 16, 29]. Specifically, for our model, we calculate the performance gain for every answer given to a model-generated question. For the baseline models, we follow Zhang et al. and adopt two ways of providing hints (i.e., ground-truth colors): at random positions (Rand) and at the positions that have the highest errors (Max).
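For reference, PSNR follows the standard definition 10 · log10(MAX² / MSE). The sketch below computes it for flattened images with values in [0, 1]; the function name `psnr`, the `max_value` default and the toy data are assumptions for illustration.

```python
import math
from typing import List


def psnr(prediction: List[float], target: List[float], max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized flattened images."""
    mse = sum((p - t) ** 2 for p, t in zip(prediction, target)) / len(target)
    if mse == 0.0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


target = [0.4, 0.6]
before_hint = [0.5, 0.5]    # prediction with no hint
after_hint = [0.45, 0.55]   # prediction after one answered question
gain = psnr(after_hint, target) - psnr(before_hint, target)
print(psnr(before_hint, target), gain)  # first value is about 20 dB
```

The per-answer gain discussed above is then the difference between the PSNR values of consecutive predictions.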
Table 1 shows the results of this experiment. Compared to the results of Rand methods, our model outperforms all the baseline models at all of the steps. It is notable that our model performs better than any other baseline with only one hint. We attribute the competitive performance to the early question regions being the most difficult regions to predict. Qualitative results (see Section 5.2) suggest our model learns to distinguish semantically meaningful objects based on color similarity and then ask about the region belonging to a single object with consistent colors. In other words, our asking network reveals that the colorization model implicitly learns object segmentation in an intelligent manner while learning to colorize.
5.1.3 Class Precision Analysis with VOC Segmentation Dataset
To test whether the region corresponding to a particular question is semantically meaningful, we measure how precisely each heatmap falls within a single object class in an image. To this end, we use the Visual Object Classes Challenge 2012 (VOC2012) segmentation dataset to generate four questions with a model trained on ImageNet. In detail, we match each question heatmap against the 22 different classes of the VOC2012 dataset to find the class with the highest match. To illustrate, in Fig. 4 (b), three question maps are shown at the bottom, and we compute the percentage of each question that overlaps (precision) with the class segmentation map shown at the top right. For instance, the first question records a precision of 93% with the person class. By averaging the precision values across all VOC2012 images and question maps, we compute the overall precision.
We record a precision of 75.4% without the noise injection technique (Section 4.1). As discussed in Section 4.1, adding random noise to the answer improves results both qualitatively and quantitatively. The questions shown in Fig. 4 (b) display high contrast between pixels and clearly show which region the model is asking about. With the random noise, we record a precision of 86.7%, which suggests that our model asks about a region corresponding mostly to a single object.
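One way to compute such a precision is sketched below: the heatmap mass is accumulated per class label, and the best-matching class and its share of the total mass are returned. Using soft heatmap mass without thresholding is an assumption, and the function name `best_class_precision` and the toy label ids are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def best_class_precision(question: List[List[float]],
                         labels: List[List[int]]) -> Tuple[int, float]:
    """Return (class id, fraction of heatmap mass inside that class) for the best class."""
    mass: Dict[int, float] = defaultdict(float)
    total = 0.0
    for q_row, l_row in zip(question, labels):
        for q, label in zip(q_row, l_row):
            mass[label] += q
            total += q
    if total == 0.0:
        raise ValueError("question heatmap is empty")
    best = max(mass, key=lambda k: mass[k])
    return best, mass[best] / total


question = [[0.9, 0.1], [0.8, 0.0]]
labels = [[15, 0], [15, 0]]  # toy segmentation map with two class ids
print(best_class_precision(question, labels))
```

Averaging the second returned value over all images and question maps gives an overall precision of the kind reported in this section.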
5.2 Qualitative Analysis
In this section, we visually illustrate the question maps and the step-by-step colorization process of our model. To show that our model works well without grayscale input, which may act as a hint for segmentation, we also train another model using the comic image dataset described in Section 4.2 and show colorized results on test images. Our model generates distinctive segmentations given simple outlines. More examples can be found in the supplemental material. Figs. 5 and 6 show example colorization results for a natural image and a cartoon image, respectively. The top row shows the answers provided, the middle row shows the intermediate colorization results, and the bottom row shows the questions generated by the model. One can see that the model asks semantically coherent questions.
6 Conclusion and Future Work
Questioning is an effective source of learning for human beings, and it is also an essential way of telling what a person knows and does not know. Thus, providing a network with the opportunity to ask questions, so that it reveals its own weaknesses, is an important milestone in machine learning. In this work, we proposed a general framework in which models are allowed to ask about parts of the answer, and suggested a novel approach to image colorization. Both quantitative and qualitative analyses show that the proposed colorization model creates meaningful questions that are readily interpretable and understandable by humans. This is a promising new approach to the interpretability of neural networks, in which they not only produce results but can also actively exhibit why they reasoned in such a manner and which part of the answer was responsible.
Nonetheless, our model has room to improve. One problem is that training takes a relatively long time; a significant amount of training is required before the model learns to properly generate different questions at different timesteps. Our future work involves building a theoretical foundation for the optimal training of asking networks.
Nevertheless, we believe that the proposed asking networks provide multiple interesting research directions. One bright side of our model is that it naturally enables an interactive steering of deep neural networks. Since the model asks questions in a human-readable or an interpretable form, we could communicate with the machine and steer the final output production to what we desire, via any custom answers. We briefly show how an interactive steering of deep colorization works in the supplemental material.
Moreover, we do not limit the scope of potential application areas to computer vision; other domains such as natural language processing could also benefit largely from the proposed asking paradigm to make outcomes interpretable. For instance, in machine translation, one might adopt asking networks in the decoding stage to find out in which part of the word sequence prediction the model struggles most. Our future work therefore involves extending our asking paradigm to a broader range of domains, including computer vision as well as natural language processing.
We would like to express our gratitude to NAVER WEBTOON Corp. and the comic artists Donggeon Lee, Pipp Choi, Tae Hoon Shin/Seung Hoon Ra, Joong Rok Kook/Sang Sin Lee, and Omyo for providing the comic images for this research.
- Bahdanau et al.  Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. Proceedings of the International Conference on Learning Representations (ICLR), 2014.
- Barron and Poole  Jonathan T Barron and Ben Poole. The fast bilateral solver. In European Conference on Computer Vision, pages 617–632. Springer, 2016.
- Chang et al.  Huiwen Chang, Ohad Fried, Yiming Liu, Stephen DiVerdi, and Adam Finkelstein. Palette-based photo recoloring. ACM Transactions on Graphics (TOG), 34(4):139, 2015.
- Charpiat et al.  Guillaume Charpiat, Matthias Hofmann, and Bernhard Schölkopf. Automatic image colorization via multimodal predictions. In Proceedings of the IEEE European Conference on Computer Vision (ECCV), pages 126–139. Springer, 2008.
- Chen et al.  Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172–2180, 2016.
- Cho et al.  Junho Cho, Sangdoo Yun, Kyoungmu Lee, and Jin Young Choi. Palettenet: Image recolorization with given color palette. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 62–70, 2017.
- Endo et al.  Yuki Endo, Satoshi Iizuka, Yoshihiro Kanamori, and Jun Mitani. Deepprop: extracting deep features from a single image for edit propagation. In Computer Graphics Forum, volume 35, pages 189–201. Wiley Online Library, 2016.
- Erhan et al.  Dumitru Erhan, Yoshua Bengio, Aaron Courville, and Pascal Vincent. Visualizing higher-layer features of a deep network. 2009.
-  M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.
- Frans  Kevin Frans. Outline colorization through tandem adversarial networks. arXiv preprint arXiv:1704.08834, 2017.
- Guadarrama et al.  Sergio Guadarrama, Ryan Dahl, David Bieber, Mohammad Norouzi, Jonathon Shlens, and Kevin Murphy. Pixcolor: Pixel recursive colorization. Proceedings of the British Machine Vision Conference (BMVC), 2017.
- Hensman and Aizawa  Paulina Hensman and Kiyoharu Aizawa. cgan-based manga colorization using a single training image. arXiv preprint arXiv:1706.06918, 2017.
- Hsu et al.  Wei-Ning Hsu, Yu Zhang, and James Glass. Unsupervised learning of disentangled and interpretable representations from sequential data. In Advances in neural information processing systems, pages 1876–1887, 2017.
- Iizuka et al.  Satoshi Iizuka, Edgar Simo-Serra, and Hiroshi Ishikawa. Let there be color!: joint end-to-end learning of global and local image priors for automatic image colorization with simultaneous classification. ACM Transactions on Graphics (TOG), 35(4):110, 2016.
- Kingma and Ba  Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014. URL http://arxiv.org/abs/1412.6980.
- Levin et al.  Anat Levin, Dani Lischinski, and Yair Weiss. Colorization using optimization. In ACM Transactions on Graphics (ToG), volume 23, pages 689–694. ACM, 2004.
- Li et al.  Xujie Li, Hanli Zhao, Guizhi Nie, and Hui Huang. Image recoloring using geodesic distance based color harmonization. Computational Visual Media, 1(2):143–155, 2015.
- Lipton  Zachary C Lipton. The mythos of model interpretability. arXiv preprint arXiv:1606.03490, 2016.
- Lu et al.  Jiasen Lu, Caiming Xiong, Devi Parikh, and Richard Socher. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 6, 2017.
- Pathak et al.  Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2536–2544, 2016.
- Ronneberger et al.  Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI), pages 234–241. Springer, 2015.
- Russakovsky et al.  Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
-  Justus Thies, Michael Zollhöfer, Matthias Nießner, Levi Valgaerts, Marc Stamminger, and Christian Theobalt. Real-time expression transfer for facial reenactment.
- Yeh et al.  Raymond Yeh, Jinjun Xiong, Wen-Mei Hwu, Minh Do, and Alexander Schwing. Interpretable and globally optimal prediction for textual grounding using image concepts. In Advances in Neural Information Processing Systems, pages 1909–1919, 2017.
- Zeiler and Fergus  Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European conference on computer vision, pages 818–833. Springer, 2014.
- Zhang et al. [2017a] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaolei Huang, Xiaogang Wang, and Dimitris Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 5907–5915, 2017a.
- Zhang et al. [2017b] Lvmin Zhang, Yi Ji, and Xin Lin. Style transfer for anime sketches with enhanced residual u-net and auxiliary classifier gan. Asian Conference on Pattern Recognition (ACPR), 2017b.
- Zhang et al.  Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In Proceedings of the IEEE European Conference on Computer Vision (ECCV), pages 649–666. Springer, 2016.
- Zhang et al. [2017c] Richard Zhang, Jun-Yan Zhu, Phillip Isola, Xinyang Geng, Angela S Lin, Tianhe Yu, and Alexei A Efros. Real-time user-guided image colorization with learned deep priors. ACM Transactions on Graphics (TOG), 2017c.
- Zhou et al.  Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on, pages 2921–2929. IEEE, 2016.