Replacing the background of an image while simultaneously adjusting the foreground objects is a challenging task in image editing. Current techniques for producing such composites rely heavily on user interaction with image editing software, a tedious job for professional retouchers. Some exciting progress on image editing has been made to ease their workload. However, few models focus on guaranteeing semantic consistency between the foreground and background. To solve this problem, we propose a framework, ART (Auto-Retoucher), to generate images with sufficient semantic and spatial consistency from a given image. Inputs are first processed by semantic matting and scene parsing modules; a multitask verifier model then gives two confidence scores for the current matching and the foreground location. We demonstrate that our jointly optimized verifier model successfully guides the foreground adjustment and improves global visual consistency.
The goal of most image editing tasks is an automatic model that can replace human labor. Consider someone who wants to turn their everyday photos into tourist photos in Hawaii: this would be mission impossible for a Photoshop layman. How can we replace the background and adjust the foreground location simultaneously so that the retouched image looks consistent and semantically reasonable? We call this specific task Auto-Retouching and propose an Auto-Retoucher (ART) framework to solve it.
The appearance and relative location of foreground and background regions are vital to the realism of composite images. Recently, Generative Adversarial Networks have successfully transferred images from one domain to another. Johnson et al. proposed a method for generating images from scene graphs. Although these generative models process images end-to-end, the generated images mostly have poor visual quality and low fidelity in appearance. ST-GAN seeks image realism by operating in the geometric warp parameter space to find the best location, but its appearance consistency is relatively poor.
Other works on image harmonization focus on adjusting the appearances of foreground and background to generate realistic composites. Tsai et al. proposed an end-to-end deep convolutional neural network for image harmonization that captures both the context and the semantic information of the composite image. However, image harmonization models only adjust foreground appearance while retaining the background region; the location and scale of the given foreground are fixed. The edited image cannot look real if the foreground object is placed in the wrong location to begin with. Zhao et al. presented a new image search technique: given a specific bounding box on a background image, the method returns the most compatible foreground objects to paste from a set of candidates. However, the model is still unable to find the best location and scale of the foreground, since the bounding box is supplied by a human.
To overcome these shortcomings, we propose the Auto-Retoucher (ART) framework to generate composite images that are harmonious in both appearance and location. A multitask verifier model utilizes semantic parsing and content features to score both global consistency and spatial consistency. A gradient-based adjustment algorithm operating on the surface of the verifier model then moves the foreground object to the most suitable location. Since no previous work addresses this specific task, no existing dataset is available; we therefore created a new dataset of 300K images for the auto-retouching task. Experiments on this dataset empirically evaluate the effectiveness of our method and show that our model performs well.
To summarize, the main contributions of our work are four-fold:
We propose a novel multitask verifier model that evaluates semantic and spatial consistency, jointly fitting both global semantic consistency and spatial rationality.
We introduce a gradient-based adjustment algorithm to adjust the foreground objects into a plausible location and scale.
We construct the ART framework to solve the auto-retouching task.
A large scale and high quality dataset for the auto-retouching task is created.
2 Related Work
- Image Harmonization
Traditional methods for image harmonization use color and tone matching techniques to ensure consistent appearance, such as transferring global statistics, applying gradient-domain methods, and matching multi-scale statistics. Tsai et al. proposed an end-to-end deep convolutional neural network for image harmonization that captures both the context and the semantic information of the composite image. These methods only learn to adjust the color tone of foreground objects and do not consider where to put the foreground, which can be semantically inconsistent. With our framework, composite images with high semantic and spatial consistency can be generated for later harmonization, further improving the quality of edited images.
In ST-GAN, spatial transformer networks are used to find geometric corrections to foreground images such that the composite images look natural and realistic. However, the model cannot ensure color-tone consistency between foreground and background, so selecting coherent foreground objects and background scenes still requires human interaction. In our work, once the foreground image is given, no human intervention is required: the best-matching background and the best location and scale are produced by a model pretrained on a large dataset.
- Multi-task learning in CV
Multi-task learning has been widely used in various computer vision problems. As Xu et al. summarized, progress has been made on many exciting tasks, such as jointly inferring scene geometry and semantics, face attribute estimation, and joint contour detection and semantic segmentation. Yao et al. proposed an approach for jointly learning three tasks, i.e., object detection, scene classification, and semantic segmentation. Hariharan et al. proposed to simultaneously learn object detection and semantic segmentation based on the R-CNN framework. We introduce multitask learning in our verifier model, enabling it to jointly fit both global semantic consistency and spatial rationality.
3 Auto-Retouching Dataset
Since the auto-retouching task is newly defined in our paper, the first challenge is the lack of data. To solve this problem, we create a large scale auto-retouching dataset. The foregrounds in this dataset are persons in different clothes. The backgrounds contain 16 different types of scenes (beach, office, desert, etc.), which fully meets our requirement for scene diversity.
The source images come from the Celebrity in Places dataset, which contains 36K images of celebrities in different types of scenes, involving 4,611 celebrities and 16 place categories. We processed these images and divided the data into three categories: positive cases, content-level negative cases, and spatial negative cases.
The foreground persons are first detected and cut out. The holes in the remaining backgrounds are then filled by a content-aware filling algorithm. The filled backgrounds are fed into a Cascade-DilatedNet to generate scene parsing maps. We denote the foreground, background, and scene parsing map of an image $I$ as $F$, $B$, and $S$, respectively.
Training Data Preparation and Evaluation
To generate positive and negative cases, we hypothesize that the foreground and background of an original image are consistent in content regardless of the foreground's location and scale, while foreground-background combinations drawn from different images are inconsistent.
Under this assumption, the positive cases are simply the set of tuples $P = \{(F_i, B_i, S_i)\}$ drawn from the same source image, whose content labels are True. Similarly, $N_c = \{(F_i, B_j, S_j) \mid i \neq j\}$ is the set of content-level negative cases, whose content labels are False.
To create the spatial negative samples, we randomly select a series of locations and scales for the foreground objects, assuming that randomly scattered, differently scaled foreground patches are not consistent with the original background. Each foreground thus has several substitutes with the wrong location and scale. The spatial negative samples are $N_s = \{(F'_i, B_i, S_i)\}$, where $F'_i$ is a perturbed copy of $F_i$; their content labels remain True under our hypothesis.
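The spatial-negative construction can be sketched as a random perturbation of the foreground's placement. The shift and scale ranges below are illustrative assumptions, not the paper's actual sampling parameters:

```python
import random

def spatial_negative(fg_box, canvas_w, canvas_h,
                     max_shift=0.5, scale_range=(0.5, 1.5)):
    """Perturb a foreground's location and scale to build a spatial
    negative sample. `fg_box` is (x, y, w, h) of the original placement;
    the perturbation limits are illustrative, not values from the paper."""
    x, y, w, h = fg_box
    # Shift by up to `max_shift` of the canvas size in each direction.
    dx = random.uniform(-max_shift, max_shift) * canvas_w
    dy = random.uniform(-max_shift, max_shift) * canvas_h
    # Rescale by a random ratio.
    s = random.uniform(*scale_range)
    # Clamp so the perturbed box stays on the canvas.
    nx = min(max(x + dx, 0), canvas_w - w * s)
    ny = min(max(y + dy, 0), canvas_h - h * s)
    return (nx, ny, w * s, h * s)
```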
To quantify the inconsistency of the spatial negative cases, we propose a spatial rationality score:

$$r = 1 - \alpha \frac{d}{d_{\max}} - \beta \left| s - 1 \right| \quad (1)$$

where $d$ is the moving distance of the foreground, $d_{\max}$ is the maximum moving distance bounded by the canvas, $s$ is the scaling ratio, and $\alpha$, $\beta$ are two constants. According to this formula, a lower spatial score indicates a larger deviation from the original foreground placement.
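A minimal sketch of such a score, assuming a simple linear penalty on the normalized displacement and on the scale deviation (the paper's exact formula and constants are not reproduced here):

```python
def spatial_score(d, d_max, s, alpha=0.5, beta=0.5):
    """Toy spatial rationality score: penalize the normalized moving
    distance d/d_max and the deviation of the scaling ratio s from 1.
    alpha and beta are the two weighting constants; this linear form
    is an illustrative assumption, not the paper's exact formula."""
    score = 1.0 - alpha * (d / d_max) - beta * abs(s - 1.0)
    return max(score, 0.0)  # clip so the score stays non-negative
```

An unmoved, unscaled foreground gets the maximum score, and the score falls off as displacement or scale deviation grows.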
Some statistics of our dataset are listed in Table 1 below.
| content label | spatial score | number of cases |
We choose accuracy as the metric of content label classification, and RMSE as the metric of spatial score regression.
4 Auto-Retoucher Framework
The ART framework consists of 2 stages: background selection stage and foreground adjusting stage.
In the background selection stage, candidate background images are sampled from a gallery. A global consistency score is given by our pre-trained multitask model, and the top-k backgrounds with the highest scores are sent to the next stage.
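The background selection stage amounts to ranking a gallery by the verifier's global consistency score; `verifier_score` below is a hypothetical stand-in for the pre-trained model:

```python
def select_backgrounds(foreground, gallery, verifier_score, k=5):
    """Rank candidate backgrounds by global consistency and keep the
    top-k. `verifier_score(foreground, background)` is assumed to
    return the verifier model's global consistency score."""
    scored = [(verifier_score(foreground, bg), bg) for bg in gallery]
    scored.sort(key=lambda t: t[0], reverse=True)  # highest score first
    return [bg for _, bg in scored[:k]]
```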
In the foreground adjusting stage, the multitask verifier model scores the current foreground location and scale. Guided by numerical gradients, the foreground is moved through the three-dimensional space of horizontal position, vertical position, and scale until it reaches a plausible location and scale. This adjustment is performed multiple times from random initial points, and the best result is selected as the output.
4.2 Multi-task Verifier Model
Formally, we represent the tasks of our verifier model as follows: given a set of tuples $(F_i, B_i, S_i, y_i, r_i)$, where $F_i$ and $B_i$ are foregrounds and backgrounds, $S_i$ is the corresponding scene parsing representation of $B_i$, and $y_i$ and $r_i$ are the content label and spatial consistency score, the verifier model handles a classification task and a regression task simultaneously. The classification task estimates the probability of global consistency; the regression task fits the spatial score. Our model jointly optimizes these two tasks during training.
The architecture of our model is illustrated in Figure 2. The verifier model consists of three layers: Encoding Layer, Fusion Layer and Prediction layer.
4.2.1 Encoding Layer:
In our work, we utilize a pretrained ResNet-50 as our encoder, finetuned on our auto-retouching dataset. We denote the extracted features of the foreground, background, and scene parsing map as $v_F$, $v_B$, and $v_S$.
4.2.2 Fusion Layer:
The feature matrices $v_F$, $v_B$, and $v_S$ are then flattened into one-dimensional vectors. Since estimating semantic consistency and spatial consistency are highly correlated, we apply a bi-attention mechanism to the deep features, where four matrices $W_1, \ldots, W_4$ project the feature vectors into 30 dimensions. The fusion layer then generates the final representation by soft parameter sharing.
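Under the stated dimensions (four projection matrices into 30 dims), the fusion step might look like the following sketch; the particular affinity and cross-conditioning used here are assumptions, since the exact bi-attention and soft-parameter-sharing equations are not spelled out:

```python
import numpy as np

def fuse(v_f, v_b, v_s, dim=30, seed=0):
    """Sketch of the fusion layer: four matrices W1..W4 project the
    flattened features into `dim` dimensions, an affinity score couples
    the foreground and background projections, and the conditioned
    vectors are concatenated into the final representation. This
    specific form is an illustrative assumption."""
    rng = np.random.default_rng(seed)
    d = v_f.shape[0]
    # In a trained model these would be learned; random here for the sketch.
    W1, W2, W3, W4 = (rng.standard_normal((dim, d)) * 0.01 for _ in range(4))
    p_f, p_b, p_s = W1 @ v_f, W2 @ v_b, W3 @ v_s
    # Scalar affinity between the foreground and background projections.
    a = float(np.tanh(p_f @ p_b))
    # Cross-conditioned features: each stream is softened by the other.
    h_f = p_f + a * (W4 @ v_b)
    h_b = p_b + a * (W4 @ v_f)
    return np.concatenate([h_f, h_b, p_s])  # final fused representation
```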
4.2.3 Prediction Layer:
The prediction layer receives the fused representations and handles two prediction tasks: (1) global consistency classification and (2) spatial consistency regression. Fully connected layers summarize the fused vectors, and softmax and sigmoid functions produce a conditional probability for global consistency and a confidence score for spatial consistency, where $W_c$ and $W_s$ are the trainable weight matrices of the two heads.
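The two heads can be sketched as a softmax classifier and a sigmoid regressor over the fused vector; the weight names `W_c` and `W_s` are illustrative:

```python
import numpy as np

def predict(h, W_c, W_s):
    """Two prediction heads on the fused vector h: a softmax over
    {inconsistent, consistent} gives the global-consistency probability,
    and a sigmoid gives the spatial confidence score. W_c (2 x d) and
    W_s (d,) stand in for the trainable weights."""
    logits = W_c @ h                         # shape (2,)
    e = np.exp(logits - logits.max())        # numerically stable softmax
    p_consistent = (e / e.sum())[1]
    r_spatial = 1.0 / (1.0 + np.exp(-(W_s @ h)))  # sigmoid
    return p_consistent, r_spatial
```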
We define independent losses for the two tasks: a content-level loss $L_c$ and a spatial-level loss $L_s$:

$$L_c = -\sum_i \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right], \qquad L_s = \sum_i (\hat{r}_i - r_i)^2$$

where $y_i$ is the indicator of the content label and $p_i$, $\hat{r}_i$ are the predicted probability and spatial score. Finally, we combine the two losses as $L = L_c + \lambda L_s$, where $\lambda$ is a weight constant.
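A plausible reading of the combined objective, assuming binary cross-entropy on the content label and a lambda-weighted squared error on the spatial score (the pairing is an assumption consistent with the classification-plus-regression setup described above):

```python
import math

def multitask_loss(p, y, r_pred, r_true, lam=1.0, eps=1e-8):
    """Per-sample combined objective: binary cross-entropy on the
    content label y plus a lambda-weighted squared error on the
    spatial score. eps guards against log(0)."""
    l_content = -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))
    l_spatial = (r_pred - r_true) ** 2
    return l_content + lam * l_spatial
```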
4.3 Gradient Based Foreground Adjustment
Finding the best location and scale for foreground objects is a hard problem in image composition. In our work, we abandon methods that directly regress the optimal location and scale of the foreground, since their performance is relatively poor due to the extremely large solution space. Instead, inspired by the spirit of Generative Adversarial Networks, we use gradients of the verifier model to guide the adjustment of the foreground.
We therefore propose a gradient-based adjustment algorithm. We train the verifier model to score foregrounds that have been randomly moved and scaled, with the score defined as in equation (1). Consequently, the surface of the verifier model is meaningful in the sense of gradients: foregrounds with the right location and scale receive high scores, and vice versa. By computing numerical partial gradients with respect to the location $(x, y)$ and scale $s$, we can move the foreground over the surface of the verifier model by gradient ascent and finally find the best location and scale.
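The adjustment loop described above can be sketched with central finite differences over a black-box scorer; `score(x, y, s)` is a hypothetical callable standing in for the verifier's spatial confidence at placement (x, y) with scale s:

```python
def adjust_foreground(score, x, y, s, steps=50, lr=1.0, eps=1e-2):
    """Gradient-based adjustment: approximate the partial gradients of
    the scorer with central finite differences and follow them by plain
    gradient ascent, moving the foreground toward a high-scoring
    placement. Step count and rates are illustrative."""
    for _ in range(steps):
        gx = (score(x + eps, y, s) - score(x - eps, y, s)) / (2 * eps)
        gy = (score(x, y + eps, s) - score(x, y - eps, s)) / (2 * eps)
        gs = (score(x, y, s + eps) - score(x, y, s - eps)) / (2 * eps)
        x, y, s = x + lr * gx, y + lr * gy, s + lr * gs
    return x, y, s
```

In the full framework this loop would be restarted from several random initial placements, keeping the highest-scoring result.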
Figure 4 shows two visualized moving sequences during the gradient ascent. In case 1, the foreground is gradually adjusted to the appropriate location and scale, proving that the surface of our verifier model is well-trained and meaningful. In case 2, the river background is globally inconsistent with foreground, thus the foreground is shifted out of the background, which is intuitively reasonable and illustrates the capability of our model to utilize the semantic information of background.
Our model is trained on our auto-retouching dataset, with a portion of the data held out as the test set. We use the Adam optimizer with a learning rate of 1e-5, and dropout of 0.3 is applied to prevent overfitting. The loss-weight hyper-parameter is selected by grid search. The batch size is 20, and input images are resized to a fixed resolution. We choose accuracy as the metric for the classification task and root mean square error (RMSE) for the regression task.
We did ablation experiments on our dataset to validate the performance of the verifier model and the function of the attention mechanism.
| Task 1 Accuracy | Task 2 RMSE |
The result in Table 2 shows that our model does well on the background selection task and foreground adjustment task. Also, the attention mechanism actually improves the overall performance.
We focused on a specific auto-retouching task and designed a novel framework for it. The ART framework replaces the image background and adjusts the foreground location and scale simultaneously, while keeping the edited image semantically and visually harmonious. We introduced a multitask learning method that combines two auxiliary losses so that the verifier model attends to content-level consistency and spatial consistency respectively, and then used its confidence scores to guide background selection and foreground adjustment. We also created an auto-retouching dataset containing 300K images. Our system achieves good visual performance under human judgment. Looking forward, we plan to design new verifier structures to handle more complicated circumstances.
- Lin, C.H., Yumer, E., Wang, O., Shechtman, E. and Lucey, S., 2018. ST-GAN: Spatial Transformer Generative Adversarial Networks for Image Compositing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 9455-9464).
- Johnson, J., Gupta, A. and Fei-Fei, L., 2018. Image generation from scene graphs. arXiv preprint.
- Tsai, Y.H., Shen, X., Lin, Z., Sunkavalli, K., Lu, X. and Yang, M.H., 2017. Deep image harmonization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Zhao, H., Shen, X., Lin, Z., Sunkavalli, K., Price, B. and Jia, J., 2018. Compositing-aware image search. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 502-516).
- Xu, D., Ouyang, W., Wang, X. and Sebe, N., 2018. PAD-Net: Multi-tasks guided prediction-and-distillation network for simultaneous depth estimation and scene parsing. arXiv preprint arXiv:1805.04409.
- Misra, I., Shrivastava, A., Gupta, A. and Hebert, M., 2016. Cross-stitch networks for multi-task learning. In CVPR.
- Ruder, S., Bingel, J., Augenstein, I. and Søgaard, A., 2017. Sluice networks: Learning what to share between loosely related tasks. arXiv preprint arXiv:1705.08142.
- Kendall, A., Gal, Y. and Cipolla, R., 2017. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. arXiv preprint arXiv:1705.07115.
- Han, H., Jain, A.K., Shan, S. and Chen, X., 2017. Heterogeneous face attribute estimation: A deep multi-task learning approach. arXiv preprint arXiv:1706.00906.
- Gupta, S., Arbelaez, P. and Malik, J., 2013. Perceptual organization and recognition of indoor scenes from RGB-D images. In CVPR.
- Yao, J., Fidler, S. and Urtasun, R., 2012. Describing the scene as a whole: Joint object detection, scene classification and semantic segmentation. In CVPR.
- Hariharan, B., Arbelaez, P., Girshick, R. and Malik, J., 2014. Simultaneous detection and segmentation. In ECCV.
- Pitie, F. and Kokaram, A., 2007. The linear Monge-Kantorovitch linear colour mapping for example-based colour transfer. In CVMP.
- Reinhard, E., Ashikhmin, M., Gooch, B. and Shirley, P., 2001. Color transfer between images. IEEE Computer Graphics and Applications, 21(5):34-41.
- Perez, P., Gangnet, M. and Blake, A., 2003. Poisson image editing. ACM Transactions on Graphics (Proc. SIGGRAPH), 22(3).
- Sunkavalli, K., Johnson, M.K., Matusik, W. and Pfister, H., 2010. Multi-scale image harmonization. ACM Transactions on Graphics (Proc. SIGGRAPH), 29(4).
- Zhong, Y., Arandjelović, R. and Zisserman, A., 2016. Faces in places: Compound query retrieval. In BMVC - 27th British Machine Vision Conference.
- He, K., Gkioxari, G., Dollár, P. and Girshick, R., 2017. Mask R-CNN. In Proc. ICCV.
- He, K., Zhang, X., Ren, S. and Sun, J., 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.