Image inpainting is the process of restoring damaged or missing sections of an image such that the result is visually plausible. Naturally, the performance of restoration algorithms degrades as the corrupted sections become denser or larger, since more semantic information is missing.
Due to the lack of semantic information, restored images can contain artifacts such as areas with inconsistent texture or monotone color, as shown in the third image of Fig. 1. Despite this, given those pre-restored images, humans can easily deduce the semantics of the corrupted regions, and this awareness can be used to accomplish restoration tasks. Based on this intuition, we extend Deep Image Prior (DIP) ulyanov2018deep with Human-Computer Interaction (HCI) and present Interactive Deep Image Prior (iDIP), a collaborative, interactive image restoration framework (Sect. 3). This framework enables humans and algorithms to collaboratively restore images in an iterative manner. With the proposed framework, even people with little painting knowledge can generate plausible images and manage restoration tasks. Furthermore, frequent feedback promises a higher sense of control and better user satisfaction than non-interactive methods.
We then evaluate an iDIP-based image restoration system with respect to two research questions:
Does the interactive approach produce higher quality images?
How do users view such a system regarding user experience and satisfaction?
2 Related Work
Previous research attempted to fully automate the image restoration process. As one of the state-of-the-art approaches, DIP restores images by exploiting the image prior modelled by a Convolutional Neural Network (CNN) krizhevsky2012imagenet. DIP minimizes the following loss function for image inpainting:
$$\min_{\theta} \left\| \left( f_{\theta}(z) - x_0 \right) \odot m \right\|^2 \qquad (1)$$
where $f_{\theta}$ is a CNN parameterized with $\theta$, $z$ is a fixed input, $x_0$ is a corrupted image, $\odot$ is the Hadamard product and $m$ is the mask for the damaged area. DIP overcomes the drawbacks of exemplar-based barnes2009patchmatch; hays2007scene; kwatra2005texture; he2012statistics and learning-based methods yu2018generative; iizuka2017globally; yeh2017semantic; yan2018shift, namely the difficulty of recovering sophisticated textures and the requirement of a large training set, respectively.
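The masked loss described above can be illustrated with a minimal, dependency-free Python sketch. It assumes a grayscale image stored as nested lists and a mask `known` that is 1 on intact pixels and 0 on damaged ones, so that only the unmasked region contributes. This convention, the name `masked_squared_error` and the list-based image type are assumptions made for illustration and are not the authors' implementation; in DIP the first argument would be the CNN output for the fixed input.

```python
from typing import List

Image = List[List[float]]  # grayscale image as rows of pixel values


def masked_squared_error(output: Image, corrupted: Image, known: Image) -> float:
    """Sum of squared differences, restricted to pixels where known == 1."""
    total = 0.0
    for out_row, cor_row, mask_row in zip(output, corrupted, known):
        for o, c, m in zip(out_row, cor_row, mask_row):
            diff = (o - c) * m  # element-wise (Hadamard) product with the mask
            total += diff * diff
    return total


# Toy 2x2 example: the two damaged pixels (mask 0) do not affect the loss.
output = [[0.5, 0.2], [0.9, 0.4]]
corrupted = [[0.5, 0.0], [0.0, 0.4]]
known = [[1, 0], [0, 1]]
loss = masked_squared_error(output, corrupted, known)
```

Because damaged pixels never enter the loss, the network is free to fill them in from the image prior alone.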
As with classic machine learning models, the training of DIP is non-interactive and performed only once. Consequently, DIP cannot use human understanding of textural semantics, and its low transparency leads to poor user satisfaction. In contrast, interactive Machine Learning (iML) fails2003interactive increases the sense of control by introducing human intervention into the learning loop amershi2014power. The increased sense of control can improve trust and user experience in many scenarios amershi2014power; cohn2003semi; holzinger2016interactive; johnson2008active.
To the best of our knowledge, no previous work combines DIP with iML. In this work, we extend DIP with interactivity fails2003interactive and bring humans into the training loop of iDIP. The updates of iDIP are iterative, focused and rapid. These properties make the restoration process more transparent and contribute to higher user satisfaction (Sect. 5).
iDIP restores images by iteratively exploiting the image prior and human knowledge via human-in-the-loop intervention. The human involvement can be either creating a new mask (correction) or painting on the corrupted regions (guidance).
Training iteration: One training iteration is visualized in Fig. 3 and consists of three stages. 1. The user is presented with the image $x_{t-1}$ restored by iDIP in the last iteration, where $t$ is the current timestamp. 2. The user paints on the image to obtain a refined image $\tilde{x}_{t-1}$. 3. iDIP restores the image $\tilde{x}_{t-1}$ by minimizing the loss function in Eq. 1 and outputs the image $x_t$. Note that the output image of the $t$-th iteration is the input image of the $(t+1)$-th iteration.
Given the pre-restored image, users can infer the textural semantics of the damaged region more easily than when given only the corrupted image. Furthermore, iDIP improves its restoration performance by distilling the reconstructed textural information in the refined image. In this way, iDIP and human knowledge reinforce each other.
Besides, this iterative approach gives users better control over their impact through trial and error, so they can better determine the intensity of their involvement in subsequent iterations. Frequent interaction contributes to better user satisfaction and system transparency. Moreover, early stopping can be applied in time, since users continuously observe the textural consistency and can terminate the process at any iteration to avoid overfitting.
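The control flow of such an interactive session can be sketched as follows. This is only a structural sketch under stated assumptions: `restore` stands in for the DIP optimization (it is passed in as a callable rather than implemented), `ask_user` stands in for the painting step and returns `None` when the user stops early, and `max_rounds` is a hypothetical safety limit. The default of 600 optimization steps per round mirrors the 600 iterations before and after painting reported in the experiments.

```python
from typing import Callable, List, Optional

Image = List[List[float]]


def interactive_session(
    corrupted: Image,
    restore: Callable[[Image, int], Image],
    ask_user: Callable[[Image], Optional[Image]],
    steps_per_round: int = 600,
    max_rounds: int = 10,
) -> Image:
    """Alternate between automatic restoration and user refinement.

    restore(target, steps): hypothetical optimizer returning a restored image.
    ask_user(image): returns the user-refined image, or None to stop early.
    """
    current = restore(corrupted, steps_per_round)  # initial pre-restoration
    for _ in range(max_rounds):
        refined = ask_user(current)  # stages 1 and 2: show the image, let the user paint
        if refined is None:  # the user judges the result good enough: early stopping
            break
        current = restore(refined, steps_per_round)  # stage 3: restore the refined image
    return current
```

Passing the two steps as callables keeps the loop independent of any particular network or user interface, which also makes it easy to exercise with stand-in functions.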
User interface (UI): Fig. 3 shows the UI for iDIP. The image in the center is the pre-restored one, with the mask shown as a red overlay. Users pick an appropriate color and paint in the masked region.
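How painted strokes could be merged into the pre-restored image is sketched below. The sketch assumes that strokes only take effect inside the damaged region, that unpainted pixels are marked with `None`, and that `damaged` is 1 on damaged pixels; the function `apply_strokes` and these conventions are hypothetical and are not taken from the described UI.

```python
from typing import List, Optional

Image = List[List[float]]
Strokes = List[List[Optional[float]]]  # None where the user did not paint


def apply_strokes(restored: Image, strokes: Strokes, damaged: List[List[int]]) -> Image:
    """Overlay user strokes on the restored image, only inside the damaged mask."""
    refined = []
    for img_row, stroke_row, mask_row in zip(restored, strokes, damaged):
        refined.append([
            s if (s is not None and d == 1) else p  # keep the pixel unless painted in the mask
            for p, s, d in zip(img_row, stroke_row, mask_row)
        ])
    return refined


restored = [[0.1, 0.2], [0.3, 0.4]]
strokes = [[0.9, None], [0.8, 0.7]]
damaged = [[1, 1], [0, 1]]
refined = apply_strokes(restored, strokes, damaged)
```

In this toy example the stroke at the lower-left pixel is ignored because that pixel lies outside the damaged region.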
We conducted experiments to answer the first research question in this section: Does the interactive approach produce higher quality images?
Dataset: We used the Dunhuang Grottoes Painting Dataset yu2019dunhuang for the experiments. The dataset contains 500 full-frame paintings with artificially generated masks for the damaged regions, of which we randomly picked ten.
Metrics: As performance measures, we computed the Structural Dissimilarity (DSSIM) wang2004image and the Local Mean Squared Error (LMSE) grosse2009ground between the restored and ground-truth images. Mean Squared Error (MSE) and Structural Similarity Index Measure (SSIM) are common and easy-to-compute measures of the perceived quality of digital images or videos in computer vision. In this paper, we compute MSE and make it equivalent to LMSE through the parameter setting. We report DSSIM instead of SSIM so that, like LMSE, lower values indicate better restoration quality.
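To make the direction of the two measures concrete, the following sketch computes MSE and a dissimilarity derived from SSIM. It rests on two assumptions that the text does not confirm: DSSIM is taken as (1 - SSIM) / 2, which is one common definition, and SSIM is computed over the whole image as a single window rather than over local windows as in wang2004image. The constants 0.01 and 0.03 are the usual SSIM stabilizers; all function names are illustrative.

```python
from typing import List

Image = List[List[float]]


def _flatten(img: Image) -> List[float]:
    return [p for row in img for p in row]


def mse(a: Image, b: Image) -> float:
    """Mean squared error over all pixels."""
    xs, ys = _flatten(a), _flatten(b)
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def global_ssim(a: Image, b: Image, dynamic_range: float = 1.0) -> float:
    """Single-window SSIM (a simplification of the windowed original)."""
    xs, ys = _flatten(a), _flatten(b)
    n = len(xs)
    mu_x, mu_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mu_x) ** 2 for x in xs) / n
    var_y = sum((y - mu_y) ** 2 for y in ys) / n
    cov = sum((x - mu_x) * (y - mu_y) for x, y in zip(xs, ys)) / n
    c1 = (0.01 * dynamic_range) ** 2  # stabilizes the luminance term
    c2 = (0.03 * dynamic_range) ** 2  # stabilizes the contrast/structure term
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def dssim(a: Image, b: Image, dynamic_range: float = 1.0) -> float:
    """Assumed definition: (1 - SSIM) / 2, so lower means more similar."""
    return (1.0 - global_ssim(a, b, dynamic_range)) / 2.0
```

With both measures, identical images score zero and larger values indicate a worse restoration.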
Baselines: To show its effectiveness, we compared our approach with five state-of-the-art baselines. For the learning-based methods, we used their models pre-trained on Places2 zhou2017places, because it is one of the most widely used scene recognition datasets.
EdgeConnect: EdgeConnect nazeri2019edgeconnect proposed a two-stage adversarial model and can deal with irregular masks.
PartialConv: PartialConv liu2018image used partial convolutions with an automatic mask update step.
PatchMatch: PatchMatch barnes2009patchmatch can quickly find approximate nearest-neighbor matches between image patches and was adopted by Photoshop.
PatchOffset: PatchOffset he2012statistics minimizes an energy function to find patches with dominant offsets.
Deep Image Prior: DIP ulyanov2018deep exploits the image prior by minimizing Mean Squared Error (MSE) in the unmasked region.
For objective evaluation, we compared the images restored by all six algorithms on ten randomly picked corrupted images using the two metrics. The images generated by iDIP for the objective evaluation were recovered by a domain expert. Each image was completed within 1200 iterations (600 iterations before painting and 600 after), and the domain expert painted only once per image. However, pixel-wise measures alone cannot account for the human criteria used to judge quality, such as semantic correctness and consistency. Consequently, we asked each of the 19 participants of our user study (described in the following section) to subjectively select the two best recovered images out of the six produced by the six algorithms.
Objective evaluation: Tab. 5 shows that, although we initialized the networks with pre-trained weights, the two learning-based methods still have the worst performance. Style transfer failed because the image style of the Dunhuang dataset differs too much from that of the training set. PatchMatch has the best LMSE score by a large margin. However, our approach slightly outperformed all non-interactive methods on DSSIM and has the second-smallest LMSE score. This suggests that interactivity contributes positively to output quality.
Subjective evaluation: Fig. 5 shows the probability of each algorithm being picked as one of the top two algorithms in the subjective evaluation. We left out the learning-based methods, since they were never picked. The two DIP variants clearly outperformed the other methods, even though PatchMatch achieved the best result on LMSE. Compared to DIP, iDIP still showed a considerable improvement, which indicates that the interactivity introduced in iDIP added to the output quality.
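One plausible way to obtain such a pick probability is to count, for each algorithm, the share of ballots on which it appears among the two picks. The sketch below assumes this per-ballot normalization, which the text does not specify; the function name is illustrative, and the ballots in the usage lines use placeholder labels `A`, `B` and `C` rather than any result from the study.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence


def top_two_pick_rate(ballots: Iterable[Sequence[str]]) -> Dict[str, float]:
    """Fraction of ballots on which each algorithm was one of the two picks."""
    ballots = list(ballots)
    counts: Counter = Counter()
    for picks in ballots:
        if len(picks) != 2 or len(set(picks)) != 2:
            raise ValueError("each ballot must name two distinct algorithms")
        counts.update(picks)
    return {name: n / len(ballots) for name, n in counts.items()}


# Placeholder ballots for illustration only.
rates = top_two_pick_rate([("A", "B"), ("A", "C"), ("A", "B"), ("B", "C")])
```

Algorithms that never appear on a ballot are simply absent from the returned dictionary, which matches leaving unpicked methods out of the plot.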
The difference between the two evaluations is also noteworthy: While PatchMatch has the lowest LMSE score, subjectively it appears far inferior to the DIP-based methods. This may be an indicator that simple similarity measures are insufficient to account for human perception.
To summarize, introducing interactivity in iDIP positively affected the restoration performance. Therefore, we confidently give a positive answer to the first research question.
5 User Study
With iDIP outperforming the other baselines in subjective perception and being not far off with respect to objective measures, it remains to be determined whether an iML approach is attractive from a usage point of view. We evaluated this in a user study with a questionnaire.
Participants in this study (n = 19; 9 male, 9 female, 1 other; 20-29 years old: 10, 30-39 years old: 7, 40-49 years old: 2) were people with medium expertise in image manipulation (mean: 2.68/5, std: 1.25) and low expertise in image reconstruction (mean: 1.74/5, std: 1.19). We presented the UI to them and asked them to reconstruct two images. For practical reasons, we limited their working time to seven minutes per image. We then asked the participants to fill out a questionnaire covering general satisfaction with the process, using the System Usability Scale (SUS), workload, using NASA TLX, and the benefits of our interactive approach.
Results from the SUS and TLX were very positive (average score SUS: 86/100, TLX: 3.4/10). Measured on a 5-point Likert scale, the opinions of the participants regarding iML being suitable for image reconstruction (4.5/5) and in general (4.0/5) were also very positive. Participants also did not believe that a non-interactive ML process (0.9/5) or a manual approach (1.8/5) would perform better.
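As background on the 0 to 100 SUS scale, the standard scoring rule for a single questionnaire can be sketched as follows: ten items are answered from 1 to 5, odd-numbered items contribute the response minus 1, even-numbered items contribute 5 minus the response, and the sum is multiplied by 2.5. The function name is illustrative, and it is an assumption that the reported average was obtained by averaging such per-participant scores.

```python
from typing import Sequence


def sus_score(responses: Sequence[int]) -> float:
    """Standard SUS score (0 to 100) for one participant's ten answers."""
    if len(responses) != 10:
        raise ValueError("SUS has exactly ten items")
    if any(r < 1 or r > 5 for r in responses):
        raise ValueError("each response must be between 1 and 5")
    total = 0
    for index, response in enumerate(responses):
        if index % 2 == 0:  # items 1, 3, 5, 7, 9 are positively worded
            total += response - 1
        else:  # items 2, 4, 6, 8, 10 are negatively worded
            total += 5 - response
    return total * 2.5
```

A participant who answers every item neutrally (3) therefore scores 50, and the most favorable possible answers yield 100.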
The fact that all participants stated that they liked the combination of interactivity and machine learning, as well as other feedback, led us to conclude that iML can make machine learning more approachable. Whether it is an actual boost to expert-productivity remains to be seen in future work.
6 Conclusion and Future Work
In this paper we have outlined our framework for interactive image restoration. This framework allows users to interactively contribute to the DIP-based image restoration process so that both the image prior and human knowledge can be leveraged in an iML fashion. Our experiments show that the designed interactions positively affected the output quality, as iDIP outperformed all five state-of-the-art baselines on DSSIM and in the subjective evaluation. Meanwhile, good user satisfaction was achieved according to the user study, as participants stated their appreciation of and confidence in the proposed method. In summary, the positive answers to the two research questions indicate that our goal of human-centric machine learning has been fulfilled for image restoration tasks.
As the human-in-the-loop approach demonstrated its effectiveness in terms of algorithm performance and user satisfaction, we leave the investigation of richer forms of interaction in image restoration as future work.
7 Supplementary Material
As supplementary materials, we provide the subjective evaluation record of restoration performance, the questionnaire used in the user study and the statistical summary of the user study.