Human-AI communication is rapidly gaining importance across multiple modalities. For example, voice-machine interaction is becoming increasingly popular, with deep learning networks recognizing text from speech. Similarly, progress in image recognition has lowered error rates in gesture and optical character recognition. Still, key technologies in AI such as deep learning are not perfect. They might err, for instance when given ambiguous inputs created by humans. Such errors become more likely when humans are in a hurry, are unaware of the AI’s recognition mechanism, are sloppy or lack skill. Safety-critical application areas such as autonomous driving or medical applications, where humans and AI communicate in one way or another, are becoming more and more prominent. Thus, errors should be avoided. Apart from avoiding errors, humans might also have an incentive to communicate with less effort, e.g. “Why try to speak clearly and loudly in the presence of noise, if mumbling works just as well? Why make that extra stroke in writing a character, if detection works just as well without it?” In this work, we do not focus on how to improve AI systems that recognize and interpret human communication. Instead, we aim at strategies that let humans improve their interaction with such a system by adjusting their behavior. Identifying potential improvements is generally difficult for humans, and even more so when deep learning is involved. Finding improvements often requires a deep understanding of the mechanisms of the task at hand, i.e. of how an AI system processes inputs. Deep learning, however, is said to behave like a black box. Even worse, deep learning is well known to reason very differently from humans: deep learning models might astonish with their high accuracy, but disappoint at the same time by failing on simple examples that were just slightly modified, as well documented by so-called “adversarial examples”.
As such, humans might depend even more on being shown opportunities for generating better data that serves as input to AI systems. In this work, we formalize the aforementioned, partially conflicting goals: minimizing misunderstandings between AI and humans and reducing effort for humans, both in terms of the need to adjust their behavior and of the effort to interact. We focus on the classification of handwritten digits, where we aim to provide suggestions to humans by altering their generated inputs, as illustrated in Figure 1. We express the problem as a multi-objective optimization problem, i.e. as a linear weighted sum of objectives, using a conditional convolutional autoencoder as model. Our qualitative and quantitative evaluation highlights that the generated samples are visually appealing, easy to interpret and also lead to improved communication of humans to AI systems.
2 Challenges of Human to AI communication
We consider the problem of improving communication from a human to an AI-system, as illustrated in Figure 1. A human wants to communicate information to an AI-system using some mode, e.g. speech, writing or gestures. The processing of the received signals by the AI-system often involves two steps: (i) recognition, i.e. identifying and extracting relevant information in the input signal, and (ii) interpretation, i.e. deriving actions by utilizing the information in a specific context. For recognition, the information has to be extracted from a physical (analog) signal, e.g. using speech recognition or image recognition. If information is communicated digitally using structured data, recognition is commonly unnecessary. Often the extracted information has to be further processed by the AI-system using some form of sense-making or interpretation. This potentially requires semantic understanding capabilities and might rely on context such as prior discourse or the surroundings. We assume that the human interacts frequently with such a system, so that it is reasonable for the human to improve on objectives such as errors and efficiency in communication. In this paper, we consider the challenge of discovering variations of the original inputs that might help a human to improve.
More formally, we consider a classification problem, where a user provides data $D = \{(x, y)\}$. Each sample $x$ should be recognized as class $y$ by a classifier $C$. We denote by $x_i$ the $i$-th feature of sample $x$. For illustration, in the case of handwritten digits a sample $x$ is a gray-tone scan of a digit and $y$ the digitized number; $x_i$ gives the brightness of the $i$-th pixel in the scan. The classification model $C$ was trained to optimize classification performance on human samples, e.g. to maximize $p(y|x)$. We regard the model $C$ as a given, i.e. we do not alter it in any way, but use it in our optimization process. The Human-to-AI coach “H2A” takes as input one sample $x$ with its label $y$. It returns at least one proposal $\hat{x}$, i.e. $\mathrm{H2A}(x, y) = \hat{x}$. The new sample $\hat{x}$ should be superior to $x$ according to some objective, e.g. we might demand higher certainty in recognition, $p(y|\hat{x}) > p(y|x)$. In a handwriting scenario, a human might use the feedback sample $\hat{x}$ based on an input $x$ to adjust her strokes.
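The acceptance criterion described above (a proposal should be recognized with higher certainty than the original) can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: `is_improvement` is a hypothetical helper name, and `toy_classifier` is a made-up stand-in for the fixed classifier that is treated as a given.

```python
from typing import Callable, Sequence

# Hypothetical type (not from the paper): a fixed classifier that maps a
# sample, given as a flat list of pixel intensities, to class probabilities.
Classifier = Callable[[Sequence[float]], Sequence[float]]


def is_improvement(classifier: Classifier,
                   original: Sequence[float],
                   proposal: Sequence[float],
                   label: int) -> bool:
    """Return True if the proposal is recognized as `label` with a higher
    probability than the original sample."""
    return classifier(proposal)[label] > classifier(original)[label]


def toy_classifier(sample: Sequence[float]) -> Sequence[float]:
    """Made-up two-class stand-in: the brighter the sample is on average,
    the more probability class 1 receives (illustration only)."""
    mean = sum(sample) / len(sample)
    return [1.0 - mean, mean]


original = [0.2, 0.4, 0.6]   # average brightness about 0.4
proposal = [0.4, 0.6, 0.8]   # average brightness about 0.6
print(is_improvement(toy_classifier, original, proposal, label=1))
```

In this toy setting the proposal counts as an improvement for class 1 but not for class 0, since the classifier itself is never modified and only the input changes.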
3 Model and Objectives
An essential requirement is that the modified samples are similar to the given input, otherwise a trivial solution is to always return “the perfect sample” that is the same for any input. This motivates utilizing an auto-encoder (Section 3.1) and adding multiple loss terms to handle various objectives (Section 3.2).
Two approaches for creating (modified) samples are generative adversarial networks (GANs) and autoencoders (AE). There are also combinations thereof, e.g. the pix2pix architecture [9] or the conditional variational auto-encoder [3]. Both contain an autoencoder whose decoder serves as a generator based on a latent representation from the encoder and, additionally, a discriminator. Autoencoders tend to generate outcomes that are closer to the inputs, but these are often smoother and less realistic looking. In our application, staying close to the input is a key requirement, since we only want to show how a sample can be modified rather than generate completely new samples. Thus, we decided to focus on an autoencoder-based architecture. We also investigate including a discriminator to improve generated samples. More precisely, we utilize conditional autoencoders with extra loss terms for regularization, covering not only a discriminator loss but also losses for efficiency and classification of modified samples, as shown in Figure 3. Conditional auto-encoders are given the class of a sample as input in addition to the sample itself. This often improves generated samples, in particular for ambiguous samples, i.e. samples that seem to match multiple classes equally well.
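To make the idea of conditioning concrete, the following minimal sketch shows one common way to hand the class to a network together with the sample. It rests on an assumption: the text does not specify how the class is injected, so concatenating a one-hot vector to the flattened image is only one plausible choice. The helper names `one_hot` and `conditional_input` are hypothetical.

```python
from typing import List


def one_hot(label: int, num_classes: int = 10) -> List[float]:
    """Encode a class label as a one-hot vector."""
    if not 0 <= label < num_classes:
        raise ValueError("label out of range")
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def conditional_input(pixels: List[float], label: int) -> List[float]:
    """Append the one-hot label to the flattened sample.

    Assumption: conditioning by concatenation. Other schemes, such as
    feeding the label to intermediate layers, are equally possible.
    """
    return list(pixels) + one_hot(label)


sample = [0.0] * (28 * 28)        # flattened image of MNIST size
x = conditional_input(sample, label=3)
print(len(x))                     # 784 pixel entries plus 10 label entries
```

With the label available, an ambiguous stroke pattern can be resolved toward the intended class rather than toward whichever class it happens to resemble most.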
Convolutional autoencoders are known to work well on image data. Therefore, we propose a convolutional conditional autoencoder (CCAE) as shown in Figure 2, where the NN-upsample layers in the decoder denote nearest-neighbor upsampling. After each convolutional layer, there is a ReLU unit that is not shown in Figure 2. Compared to transposed convolutional layers, NN-upsampling with convolutional layers prevents checkerboard artifacts in the resulting images.
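Nearest-neighbor upsampling itself is a simple operation, sketched below in plain Python on a nested list. This is an illustration of the operation only, not the decoder of Figure 2; the function name `nn_upsample` and the default factor of 2 are assumptions.

```python
from typing import List

Image = List[List[float]]


def nn_upsample(image: Image, factor: int = 2) -> Image:
    """Nearest-neighbor upsampling: repeat every pixel `factor` times
    along both the horizontal and the vertical axis."""
    out: Image = []
    for row in image:
        # Repeat each value horizontally ...
        wide = [value for value in row for _ in range(factor)]
        # ... and repeat the widened row vertically.
        out.extend(list(wide) for _ in range(factor))
    return out


small = [[0.0, 1.0],
         [0.5, 0.2]]
for row in nn_upsample(small):
    print(row)
```

Every input pixel contributes to the same number of output pixels, so the subsequent convolution sees an evenly covered grid.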
3.2 Objectives and Loss Terms
The generated input samples should meet multiple criteria, each of which is implemented as a loss term. The loss terms and their weighted sum are given in Equation 1 and illustrated in Figure 3. The total loss contains four weight parameters, one per loss term. It is possible to keep one of them fixed and use the other three to control the relative importance of the following objectives:
Minimal effort to change: Change might be difficult and tedious for humans. Thus, the effort for humans to adjust their behavior should be minimized. This implies that the original samples created by humans and the variations thereof generated by the human-to-AI coach should be similar. This is covered by the reconstruction loss of the autoencoder (see Equation 1), which enforces the output and the input to be similar. However, parts of the input might be changed fairly drastically, e.g. for handwritten digits pixels might change from 0 (black) to 1 (white) and vice versa. For that reason, we do not employ an $L_2$-metric, which heavily penalizes such differences, but rather opt for an $L_1$-metric.
Reduce mis-communication: The error, i.e. the amount of information wrongly extracted or interpreted by the AI, should be reduced. Auto-encoders are known to have a denoising, averaging effect. They are also known to improve performance in some cases in conjunction with classification tasks. To further foster a reduction in miscommunication, we minimize the classification loss of the generated examples for the model the human communicates with.
Realistic samples: The generated samples should still be comprehensible for humans or other systems, i.e. look realistic. In principle, the generated proposals could be so optimized for the given AI model that they are not meaningful in general. They might appear very different from prototypical examples of the class they are classified as. We therefore use a discriminator, resulting in a GAN architecture, that should distinguish between real and generated samples. The added discriminator loss is computed on the sample generated for an input sample of a human.
Minimal effort to create samples: Communication should be effortless for the human and the AI. To quantify the effort of a human to create a sample, time might be a good option if available. If not, application-specific measures might be more appropriate. For measuring effort in handwriting, the number (and length) of strokes can be used. A good approximation can be the total amount of needed “ink”, which corresponds to the $L_1$-norm of the sample, i.e. the sum of its pixel intensities. We chose the $L_1$- over the $L_2$-metric, since having many low-intensity pixels (as fostered by $L_2$) is generally discouraged.
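The four objectives above can be combined as a linear weighted sum, as the following minimal sketch illustrates. It is not the paper's Equation 1 and makes several assumptions: the weight names `w_rec`, `w_cls`, `w_disc` and `w_eff` are hypothetical, cross-entropy is assumed for the classification loss, and the discriminator term is assumed to be the negative log of the probability that the discriminator assigns to "real". The first lines also show why an absolute-difference metric treats one drastic pixel change the same as many small ones, whereas a squared metric does not.

```python
import math
from typing import Sequence


def l1(a: Sequence[float], b: Sequence[float]) -> float:
    """L1 distance: sum of absolute differences."""
    return sum(abs(u - v) for u, v in zip(a, b))


def l2_squared(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared L2 distance: sum of squared differences."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def total_loss(x, x_hat, class_probs, label, disc_prob_real,
               w_rec=1.0, w_cls=0.0, w_disc=0.0, w_eff=0.0) -> float:
    """Weighted sum of the four loss terms (all weight names hypothetical)."""
    rec = l1(x, x_hat)                    # minimal effort to change
    cls = -math.log(class_probs[label])   # reduce mis-communication (assumed cross-entropy)
    disc = -math.log(disc_prob_real)      # realistic samples (assumed form)
    eff = sum(abs(v) for v in x_hat)      # minimal effort to create: total "ink"
    return w_rec * rec + w_cls * cls + w_disc * disc + w_eff * eff


# One drastic pixel change versus ten small changes of the same total size.
blank = [0.0] * 10
one_flip = [1.0] + [0.0] * 9
spread = [0.1] * 10
print(l1(blank, one_flip), l1(blank, spread))                  # both about 1.0
print(l2_squared(blank, one_flip), l2_squared(blank, spread))  # 1.0 versus about 0.1

x = [0.0, 1.0, 0.5, 0.0]       # made-up human input
x_hat = [0.0, 1.0, 0.0, 0.0]   # made-up proposal with less "ink"
print(total_loss(x, x_hat, class_probs=[0.1, 0.9], label=1,
                 disc_prob_real=0.5, w_cls=0.1, w_disc=0.1, w_eff=0.1))
```

Setting all weights except `w_rec` to zero corresponds to a plain autoencoder objective; raising a single other weight mirrors the ablation setup in which each loss is added in isolation.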
We conducted both a qualitative and quantitative evaluation on the MNIST dataset. It consists of 50000 handwritten digits from 0 to 9 for training and 10000 for testing. (The dataset is commonly used by recent work in similar contexts [7, 6].) The classification model, i.e. the system a user is supposed to communicate well with, is a simple convolutional neural network (CNN) consisting of two convolutional layers (8 and 16 channels) that are each followed by a ReLU and a 2x2 max-pooling layer. The last layer is a fully connected layer. The network achieved a test accuracy of 95.97%. While this could be improved, it is not of prime relevance for our problem, since the classifier is treated as a given. The architecture of the H2A coach is shown in Figure 3, with details of the autoencoder in Figure 2 and the loss terms in Equation 1. We did not employ any data augmentation. We used the Adam optimizer with a learning rate of 1e-4 for all models. Training lasted for 10 epochs with a batch size of 8. We trained 5 networks for each hyperparameter setting and performed a statistical analysis of our results. We consider a difference in a metric, e.g. accuracy, between two hyperparameter settings significant if the p-value of a t-test is below 0.1.
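The significance check can be sketched with the standard library alone. The sketch makes assumptions that the text does not confirm: it uses a two-sided, two-sample Student t-test with pooled variance (the text only says "t-test"), and it replaces the p-value computation by a comparison with the critical value 1.860, which corresponds to a two-sided level of 0.1 at 8 degrees of freedom (two groups of 5 runs). The accuracy values are made up for illustration.

```python
import math
from statistics import mean, variance
from typing import Sequence

# Two-sided critical value of Student's t for alpha = 0.1 and
# 8 degrees of freedom (5 + 5 - 2).
T_CRIT_DF8_ALPHA10 = 1.860


def t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sample Student t statistic with pooled variance."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))


def is_significant(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if |t| exceeds the critical value, i.e. the p-value is below 0.1."""
    if len(a) != 5 or len(b) != 5:
        raise ValueError("the critical value is only valid for 5 runs per setting")
    return abs(t_statistic(a, b)) > T_CRIT_DF8_ALPHA10


# Made-up accuracies of 5 trained networks per hyperparameter setting.
setting_a = [0.960, 0.961, 0.959, 0.960, 0.962]
setting_b = [0.990, 0.991, 0.989, 0.992, 0.990]
print(is_significant(setting_a, setting_b))
```

With only 5 runs per setting, low run-to-run variance matters a great deal: even small differences in mean accuracy can be significant when the individual networks behave almost identically.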
For the ablation study we consider adding each of the losses in isolation to the baseline by varying parameters that control their impact. Finally, we consider a model, where we add all losses.
Our qualitative analysis is a visual assessment of the generated images. We investigate images that were improved (in terms of each of the metrics), worsened or remained roughly the same. As quantitative measures we used the losses as defined in Equation 1, except for classification, where we used the more common accuracy metric.
We first describe outcomes on a qualitative level before discussing our outcomes in terms of computed metrics.
4.1 Qualitative Analysis
Figure 4 shows unmodified samples (leftmost column) and samples generated for various configurations of loss weights. The autoencoder on its own (2nd column) already has an overall positive impact, yielding smoother images than the original ones. It tends to improve efficiency by removing “exotic” strokes, e.g. for the 2 in the 6th row and the 5 in the last row, and sometimes also improves readability (i.e. ease of classification), e.g. the 8 in the first row and the 6 in the second-to-last row both become more readable. Other digits might seem more readable but are actually worsened, e.g. the 6 in the 6th row appears to become a 0 (it is actually a 6) and the 7 in the 7th row appears to become a 9 (it is actually a 7). Many others do not change significantly, though there are obvious ways of improvement. When optimizing in addition for efficiency (3rd column), some parts of digits get deleted, which is sometimes positive and sometimes negative. Some benefits of the AE seem to get undone, e.g. the 6 in the second-to-last row now again looks more like the original with missing parts. The same holds for the 8 in the first row, though for both some improvement in shape remains. More interestingly, two digits, including the one in the 6th row, get changed to 0, which is incorrect. On the positive side, several digits become more readable through subtle changes, e.g. removals of parts as for the 5 in the last row, the 2 in the second-to-last row or the 3 in the 3rd row. When using the auto-encoder and the discriminator without the efficiency loss (4th column in Figure 4), we can observe that the samples become slightly more realistic, i.e. crisper. We can see clear improvements for the 7 in row 7 and the 2 in row 9, with many digits remaining the same. When using the auto-encoder and the classification loss (last column), smoothness increases and digits appear blurry. Readability worsens for a few digits, e.g. the left 4 in row 2 can now be easily confused with a 9, and the 6 in row 9 is no better than the original and worse than the one using a discriminator. Overall, however, the classification loss helps to improve many other samples. Some only now become well readable, e.g. a 6 and a 2 as well as the 5 in row 8. Also, some digits become simpler, e.g. the one in the first row and the 7s in rows 3, 4 and 7.
When combining all losses (Figure 5), it can be observed that larger values are possible for some parameters while still obtaining reasonable results, since the objectives might counteract each other. For example, the discriminator loss pushes pixels to become brighter, whereas the efficiency loss pushes them to become darker. We noticed that the strong smoothing effect due to the classification loss is essentially removed by the discriminator loss, but also by the efficiency loss. The benefits of the classification loss, however, mainly remain and are even improved: the 4 in the 2nd row and the 6 in the 9th row become more readable. However, there are also some differences in quality among the three configurations. Interestingly, the original images show somewhat more contrast, in particular compared to the second column. A careful observer will notice a few bright points in the second row in the upper part of both 4s. These seem to be artifacts of the optimization. It is well known that training GANs might lead to non-convergence, which we also observed if the discriminator loss weight is set too high, but other forms of undesired behavior might also arise. For example, we observed a form of mode collapse and bad outcomes for large values of some loss weights, as shown in the last column. Examples in the last column still score high in some of the metrics, i.e. accuracy and efficiency loss, but perform poorly with respect to reconstruction loss. Still, overall, combining all losses leads to the best results.
4.2 Quantitative Analysis
We first discuss accuracy. The autoencoder on its own leads to a small gain in accuracy compared to the baseline classifier accuracy of 95.97%. Not surprisingly, optimizing accuracy directly by increasing the weight of the classification loss leads to the best results, e.g. perfect accuracy on the test set for a value of 0.24, and even for a seemingly small value accuracy exceeds .999. While it appears that differences in accuracy between various values of the classification loss weight are not significant, from a statistical perspective (using a t-test) they are (p-value .001). For any such value, the network tends to always fail to learn the same samples, leading to very low variance in accuracy. In all these cases the high accuracies are no surprise, since also for the test set the network is fed the correct label and could therefore in principle always return a “prototypical” class sample, ignoring all other information. When varying the efficiency loss weight, accuracy decreases, but the decrease was only statistically significant for large weights (p-value .001). Adding a discriminator also negatively impacts accuracy, in some settings with statistically significantly worse results (p-value .01).
The reconstruction loss is most tightly correlated with the visual quality of the outcomes. In particular, a large auto-encoder loss is likely to imply poor visual outcomes, even though other metrics such as accuracy might look good. This can be observed in Table 1, for instance, for large values of some loss weights. Generally, the reconstruction loss worsens when optimizing for accuracy or adding a discriminator. Differences to the baseline are significant (p-value .01). When adding an efficiency loss, differences are only significant for some weight values (p-value .01). The efficiency loss decreases when adding other losses. For the discriminator, differences are not significant compared to the baseline, while for all other losses they are for any weight value (p-value .01).
5 Related Work
There are numerous types of auto-encoders. Related to our application are de-noising autoencoders, which typically rely on intentional noise injection with the goal of weight regularization. In contrast, we assume that noise is part of the input data and its removal is thus not motivated by regularization. The idea to combine auto-encoders and GANs for image generation has been explored previously, e.g. [3] uses a conditional variational auto-encoder and applies it to image inpainting and attribute morphing. Essentially, in this work we consider a novel application of this architecture type. Our work is a form of image-to-image translation. Typically, inputs and outputs are fairly different, e.g. the input could be a colored segmentation of an image not showing any details and the output could be a photo-like image with many details. In contrast, in our scenario inputs and outputs are fairly similar. For image in-painting or completion [8, 14] a network learns to fill in blank spaces of an image. In contrast, we might both in-paint and erase. Image manipulation based on user edits has been studied in [15]. They learn the natural image manifold using a generative adversarial network and express manipulations as a constrained optimization problem. They apply both spatial and channel (i.e. color) flow regularization. Their primary goal is to obtain realistic-looking images after manipulations. Thus, their problem and approach are fairly different. Furthermore, in contrast to the mentioned prior works [9, 8, 14, 15, 3], our learning is more of an unsupervised nature. That is, we do not know the final outputs, i.e. the images that should be proposed to the human. Prior work trains by comparing its outcome to a target. In our case, we do not have pairs of human input (images) and improved input (images) in our training data.
The field of human-AI interaction is fairly broad. The effect of various user and system characteristics has been extensively studied [12]. There has been little work on how to improve communication and prevent misunderstandings. [11] discusses high-level, non-technical strategies to deal with errors in spoken communication that originate either from humans or from machines. [4] lists some errors that occur when interacting with a robot using natural language, such as grammatical and geometrical misunderstandings as well as ambiguities. [5] highlighted the impact of nonverbal communication on efficiency and robustness in communication, showing that nonverbal communication can reduce errors.
Our work aims to explain to a user how she might improve interaction with an AI-system. Explainability in the context of machine learning is generally more focused on interpreting decisions and models (see [1, 13] for recent surveys). Counterfactual explanations also seek to identify some form of modification of the input. [6] explains by answering “How to modify an input to get classification Y?” and “What is minimally needed?”. The former focuses on mis-classified examples with the goal of changing them with minimal effort to the correct class. For the latter, all objectives except efficiency are ignored and there is only the constraint of maintaining classification confidence above a threshold. Thus, [6] discusses special cases of our work. Technically, [6] generates a perturbation added to the sample such that the perturbation is minimal given that a threshold confidence of the prediction (either as the correct class or as an alternative class) has been achieved. They use an ordinary auto-encoder as an optional element on the perturbation, which only slightly alters results. In contrast, we use a CCAE on the inputs, which is essential. We optimize for multiple linearly weighted objectives without thresholds. [7] aims at explaining counterfactuals, i.e. showing how to change a class to another by combining images of both classes. That is, given a query image and a distractor image, they generate a composite image that essentially uses parts of each input. For instance, in the right part of Figure 6 the “7” in the second row serves as query image, the “2” in the middle as distractor and the rightmost column shows the outcome. The implementation relies on a gating mechanism to select image parts. Differences are also noticeable in the outcomes, as shown in Figure 6. The highlighted differences appear noisy in [6] and are not necessarily intuitive. (E.g. for column CEM-PP for digit “3” a stroke on top is missing, but [6] finds a miniature “3” within the given digit.) The generated images in [7] appear more natural, but do have artifacts, e.g. the “2” being a composition of a “7” and a “2” has a “dot” in the bottom originating from the “7”. In conclusion, while counterfactual explanations [6, 7] are related to our work, the objectives (e.g. we include efficiency) as well as the methodology and outcomes differ.
Human to AI interaction is likely to gain in importance. This paper investigated improving human to AI communication by proposing adjustments to human generated examples based on optimizing multiple objectives. Our evaluation highlights that such an automatic approach is indeed feasible.
- [1] A. Adadi and M. Berrada. Peeking inside the black-box: A survey on explainable artificial intelligence (XAI). IEEE Access, 6:52138–52160, 2018.
- [2] C. C. Aggarwal. Neural networks and deep learning. Cham: Springer Int. Publishing, 2018.
- [3] J. Bao, D. Chen, F. Wen, H. Li, and G. Hua. CVAE-GAN: fine-grained image generation through asymmetric training. In Proc. of the Int. Conf. on Computer Vision, 2017.
- [4] Y. Bisk, D. Yuret, and D. Marcu. Natural language communication with robots. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016.
- [5] C. Breazeal, C. D. Kidd, A. L. Thomaz, G. Hoffman, and M. Berlin. Effects of nonverbal communication on efficiency and robustness in human-robot teamwork. In Int. Conf. on Intelligent Robots and Systems, 2005.
- [6] A. Dhurandhar, P.-Y. Chen, R. Luss, C.-C. Tu, P. Ting, K. Shanmugam, and P. Das. Explanations based on the missing: Towards contrastive explanations with pertinent negatives. In Advances in Neural Information Processing Systems, 2018.
- [7] Y. Goyal, Z. Wu, J. Ernst, D. Batra, D. Parikh, and S. Lee. Counterfactual visual explanations. arXiv preprint arXiv:1904.07451, 2019.
- [8] S. Iizuka, E. Simo-Serra, and H. Ishikawa. Globally and locally consistent image completion. ACM Transactions on Graphics (ToG), 36(4):107, 2017.
- [9] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1125–1134, 2017.
- [10] A. Makhzani and B. Frey. K-sparse autoencoders. arXiv preprint arXiv:1312.5663, 2013.
- [11] A. I. Niculescu and R. E. Banchs. Strategies to cope with errors in human-machine spoken interactions: using chatbots as back-off mechanism for task-oriented dialogues. In Proceedings ERRARE 2015 - Errors by Humans and Machines in multimedia, multimodal and multilingual data processing, 2015.
- [12] C. Rzepka and B. Berger. User interaction with AI-enabled systems: a systematic review of IS research. 2018.
- [13] J. Schneider and J. Handali. Personalized explanation in machine learning. In European Conference on Information Systems (ECIS), 2019.
- [14] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang. Generative image inpainting with contextual attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5505–5514, 2018.
- [15] J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros. Generative visual manipulation on the natural image manifold. In European Conference on Computer Vision, pages 597–613. Springer, 2016.