The official pytorch code of DeFLOCNet: Deep Image Editing via Flexible Low-level Controls (CVPR2021)
User-intended visual content fills the hole regions of an input image in the image editing scenario. The coarse low-level inputs, which typically consist of sparse sketch lines and color dots, convey user intentions for content creation (i.e., free-form editing). While existing methods combine an input image and these low-level controls as CNN inputs, the corresponding feature representations are not sufficient to convey user intentions, leading to unfaithfully generated content. In this paper, we propose DeFLOCNet, which relies on a deep encoder-decoder CNN to retain the guidance of these controls in the deep feature representations. In each skip-connection layer, we design a structure generation block. Instead of attaching low-level controls to an input image, we inject these controls directly into each structure generation block for sketch line refinement and color propagation in the CNN feature space. We then concatenate the modulated features with the original decoder features for structure generation. Meanwhile, DeFLOCNet involves another decoder branch for texture generation and detail enhancement. Both structures and textures are rendered in the decoder, leading to user-intended editing results. Experiments on benchmarks demonstrate that DeFLOCNet effectively transforms different user intentions into visually pleasing content.
The investigation of image editing is growing as it reduces significant manual effort during image content generation. Benefiting from the realistic image representations brought by convolutional neural networks (CNNs), image editing is able to create meaningful and visually pleasant content. As shown in Fig. 1, users can draw arbitrary holes in a natural image as inputs to indicate the regions to be edited. If there are no further inputs, as shown in (a), image editing degenerates to image inpainting, where CNNs automatically fill hole regions by producing coherent image content as shown in (b). If there are additional inputs from users (e.g., lines in (c), and both lines and colors in (e)), CNNs will create meaningful content accordingly while maintaining visual pleasantness. Deep image editing provides flexibility for users to generate diversified content, which can be widely applied in the areas of data enhancement, occlusion removal, and privacy protection.
The flexibility of user controls and the quality of user-intended content generation are challenging to achieve simultaneously in practice. The main difficulty resides in how to transform flexible controls into user-intended content. Existing attempts utilize high-level inputs (e.g., semantic parsing maps, attributes, latent codes, language, and visual context) for semantic content generation, but their flexibility is limited by the predefined semantics.
On the other hand, utilizing coarse low-level controls (e.g., sketch lines and colors) makes the editing more interactive and flexible. In this paper, we focus on incorporating such user inputs for image editing, in which we observe two main challenges: (1) Most prior investigations [33, 10, 23] simply combine an input image and low-level controls together at the image level for CNN inputs. The guidance from these low-level inputs gradually diminishes in the CNN feature space, weakening their influence on generating user-intended content. Fig. 6 (c)-(f) show such examples, where facial components are not effectively produced. (2) Since users only provide sparse color strokes to control the generated colors, the model needs to propagate these spatially sparse signals to the desired regions guided by sketches (i.e., colors should fill in the regions indicated by the sketches and not be wrongly rendered across sketch lines), as illustrated in Figs. 5 and 7.
To resolve these issues, we propose DeFLOCNet (i.e., Deep image editing via FLexible LOw-level Controls) to retain the guidance of low-level controls for reinforcing user intentions. Fig. 2 summarizes DeFLOCNet, which is built on a deep encoder-decoder for structure and texture generation in the hole regions. At the core of our contribution is a novel structure generation block (Fig. 3 and Sec. 3.1), which is plugged into each skip connection in the network. Low-level controls are directly injected into these blocks for sketch line generation and color propagation in the feature space. The structure features from these blocks are concatenated to the original decoder features accordingly for user-intended structure generation in the hole regions.
Moreover, we introduce another decoder for texture generation (Sec. 3.2). Each layer of the texture generation decoder is concatenated to the original decoder for texture enhancement. Thus, both structure and texture are effectively produced in the CNN feature space. They supplement original decoder features to bring coarse-to-fine user-intended guidance in the CNN feature space and output visually pleasing editing results. Experiments on the benchmark datasets demonstrate the effectiveness of our DeFLOCNet compared to state-of-the-art approaches.
Deep Generative Models.
The advancements in deep generative models [25, 27, 29] are inspired by generative adversarial learning [5, 26]. Beyond image generation from random noise, conditioned image generation activates a series of image translation work. Prior work proposes a general framework to translate semantic labels to natural images; this framework is further improved by using a coarse-to-fine generator and a multi-scale discriminator. Besides holistic image generation, subregion image generation (e.g., image inpainting) has received extensive investigation [32, 16, 18]. In contrast to existing image-to-image generation frameworks, our free-form image editing is more flexible in transferring user intentions (e.g., monotonous sketch lines and color dots) into natural image content.
GANs have a lasting influence on image editing development. Invertible Conditional GANs are proposed to control high-level attributes of generated faces. More effective editing is later achieved by approximating a disentangled latent space. Semantic parsing maps are utilized in [8, 6, 3] as the intermediate representation for guided image editing, while natural language navigates editing in [2, 19]. Methods based on semantic guidance typically require an explicit correspondence between editing content and semantic guidance. As the semantic guidance is usually fixed with limited options, the editing is thus not flexible (e.g., lacking color and sketch controls). To improve input flexibility, SC-FEGAN proposes to directly combine sketch lines and colors as inputs and send them together with an input image to a CNN. Gated convolution is also proposed for flexibility improvement. As these methods attach user controls directly to input images for the CNN input, the influence of user controls diminishes gradually. As a result, limited editing scenarios are supported by these methods (e.g., facial component editing). Different from existing approaches, we inject low-level controls into the skip connection layers of an encoder-decoder with our structure generation block to gradually reinforce user intentions in a coarse-to-fine manner.
Fig. 2 shows an overview of our DeFLOCNet, built on top of an encoder-decoder CNN model. The input to the encoder is an image with arbitrary hole regions. Low-level controls, represented by sketch lines and color dots, are sent to the structure generation blocks (SGBs) placed on the skip-connection layers (Sec. 3.1). Meanwhile, we propose another decoder named the texture generation branch (TGB) (Sec. 3.2). The features from the SGBs and TGB are fused with the original decoder features hierarchically for output image generation.
A vanilla encoder-decoder CNN is prevalent for image generation, but it is insufficient to recover missing content in hole regions that are almost empty except for sparse low-level controls. This is due to the limited ability of the encoder features to represent both natural image content and free-form controls. To improve feature representations, we only send images to the encoder while injecting controls multiple times into all the skip connection layers via SGBs. Consequently, user intentions are reinforced continuously via feature modulations. The reinforced features, together with the texture generation features, supplement the original decoder features in a coarse-to-fine fashion to generate both user-intended and visually pleasant content.
We inject low-level controls into SGBs to modulate features for user intention reinforcement. Fig. 3 shows the architecture of one SGB, which consists of three branches for progressive sketch line generation, color propagation, and feature fusion, respectively. The size of one SGB increases when it is integrated into a shallower encoder layer, since shallower layers are closer to the image level and require stronger low-level guidance for feature generation. The sketch line generation branch repeatedly injects and refines the sketch features, avoiding the diminishing of user controls. Then, the sketch features are utilized in the color propagation branch to regularize the colors to fill in the desired regions. Finally, the fusion branch injects the sketch and color features into the original feature to produce the output editing results.
We first introduce a control injection operation, which is a basic building block in our SGB. The control injection follows prior feature modulation techniques. Specifically, we denote $F$ as the input feature map and $I$ as the information we want to inject into $F$. Suppose the injection operation is $f_{\rm inject}$; the element of the injected feature $F' = f_{\rm inject}(F, I)$ at location $(x, y)$ can be obtained by:

$$F'_{x,y} = \alpha_{x,y} \cdot F_{x,y} + \beta_{x,y},$$

where $\alpha$ and $\beta$ are two variables controlling the influence of $I$ at element-wise precision. In this paper, we use two convolutional layers on $I$ to generate $\alpha$ and $\beta$ at each element location. As a result, low-level controls are mapped into the feature space and correlated with the input feature maps.
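As a framework-agnostic sketch of this modulation (the official code is in PyTorch; here NumPy callables stand in for the two convolutional layers, and names such as `inject`, `conv_alpha`, and `conv_beta` are ours, not the official API):

```python
import numpy as np

def inject(F, I, conv_alpha, conv_beta):
    """SPADE-style control injection (illustrative sketch, our naming).

    F : input feature map, shape (C, H, W)
    I : control signal to inject, shape (Ci, H, W)
    conv_alpha, conv_beta : callables standing in for the two conv layers
        that map I to per-element scale and shift maps.
    """
    alpha = conv_alpha(I)          # element-wise scale, shape (C, H, W)
    beta = conv_beta(I)            # element-wise shift, shape (C, H, W)
    return alpha * F + beta        # F'_{x,y} = alpha_{x,y} * F_{x,y} + beta_{x,y}

# Toy 1x1 "convolutions": per-channel linear maps of the control signal.
rng = np.random.default_rng(0)
C, Ci, H, W = 4, 2, 8, 8
Wa = rng.normal(size=(C, Ci))
Wb = rng.normal(size=(C, Ci))
conv_a = lambda I: np.einsum("ci,ihw->chw", Wa, I)
conv_b = lambda I: np.einsum("ci,ihw->chw", Wb, I)

F = rng.normal(size=(C, H, W))
I = rng.normal(size=(Ci, H, W))
F_out = inject(F, I, conv_a, conv_b)
assert F_out.shape == F.shape
```

Note that with an all-zero control signal, both the scale and shift vanish under these toy linear maps, so the injected feature is fully suppressed; in practice the learned convolutions decide how strongly the control overrides the input feature.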
Sketch line generation.
Given an input feature $F$, the sketch line generation branch first performs an element-wise average along the channel dimension to produce a single-channel feature map $F_{\rm avg}$. We denote the sketch image as $S$, a random noise image as $N$, and a mask image containing a hole region as $M$. The output feature after control injection can be written as:

$$F_s = f_{\rm inject}\big(F_{\rm avg},\; [S, N] \odot M\big),$$

where $F_s$ is the injected feature, $[\cdot\,,\cdot]$ is the concatenation operator, and $\odot$ is the element-wise multiplication operator. In practice, a single injection does not completely generate the recovered sketch lines. We therefore use several injections in the sketch line generation branch to progressively refine the sketch lines. The $t$-th injection can be written as:

$$F_s^{t} = f_{\rm inject}\big(\mathrm{conv}(F_s^{t-1}),\; [S, N] \odot M\big),$$

where the output features from the previous injection are passed through a convolutional layer to form the current injection input. During training, we use ground truth sketch lines extracted from the original images. When editing images, we adopt user inputs for sketch line refinement.
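The progressive refinement loop described above can be sketched as follows (a toy NumPy illustration; `refine_sketch` and the toy `inject` and `conv` callables are our stand-ins, not the paper's actual layers):

```python
import numpy as np

def refine_sketch(F_avg, control, n_steps, conv, inject):
    """Progressive sketch-line refinement (illustrative sketch, our naming).

    F_avg   : channel-averaged encoder feature, shape (H, W)
    control : masked control signal [S, N] * M, collapsed to shape (H, W)
    conv    : stand-in for the intermediate convolutional layer
    inject  : injection operation f_inject(F, I)
    """
    F_s = inject(F_avg, control)            # first injection
    for _ in range(n_steps - 1):
        F_s = inject(conv(F_s), control)    # re-inject the same controls
    return F_s

rng = np.random.default_rng(1)
H = W = 8
S = (rng.random((H, W)) > 0.9).astype(float)   # toy sparse sketch lines
M = np.ones((H, W))                            # hole mask (all hole here)
control = S * M

inject = lambda F, I: 0.5 * F + I              # toy injection: fixed scale/shift
conv = lambda F: np.clip(F, 0, None)           # toy conv: ReLU-like map
F0 = rng.normal(size=(H, W))
F_s = refine_sketch(F0, control, n_steps=3, conv=conv, inject=inject)
assert F_s.shape == (H, W)
```

The key point the loop illustrates is that the same masked control signal is injected at every step, so its guidance cannot diminish across refinement steps.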
We propose a color propagation branch in parallel to the sketch generation branch. In order to guide the color propagation via sketch lines, we use the injected features $F_s^{t}$ from the sketch line generation branch. The guiding process can be written as:

$$F_c^{t} = \hat{F}_c^{t} \odot \big(1 - \sigma(F_s^{t})\big),$$

where $F_c^{t}$ is the color feature guided by $F_s^{t}$, $\hat{F}_c^{t}$ is the color feature after injecting the masked color controls, and $\sigma$ is the sigmoid activation function.
Fig. 4 illustrates how color propagates under sketch guidance. In (a) and (b), the sketch lines in gray are not recovered well and the blue color tends to diffuse in all directions. As the sketch lines are gradually refined into complete contours, as shown in (c)-(e), the blue color does not penetrate the contour lines thanks to the consecutive $\sigma(\cdot)$ and $\odot$ operations. Finally, the blue color propagates along the contour lines without penetration, as shown in (f).
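Assuming the gating takes the form $F_c \odot (1 - \sigma(F_s))$ (our reading of the consecutive sigmoid and element-wise multiplication operations), a minimal NumPy sketch shows how strong sketch activations block color from crossing a contour:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gate_color(F_c, F_s):
    """Sketch-guided color gating (our interpretation of the paper's gating):
    color features are suppressed where sketch activations are strong,
    so color cannot leak across contour lines."""
    return F_c * (1.0 - sigmoid(F_s))

H = W = 6
F_s = np.full((H, W), -10.0)       # weak sketch response almost everywhere
F_s[:, 3] = 10.0                   # one strong vertical contour line
F_c = np.ones((H, W))              # uniform color feature

gated = gate_color(F_c, F_s)
assert gated[:, 3].max() < 1e-3    # color blocked on the contour
assert gated[:, 0].min() > 0.99    # color preserved away from it
```

Because the color features are driven to zero on the contour, subsequent convolutions cannot carry color signals across it, which matches the "no penetration" behavior in Fig. 4 (f).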
The features from the sketch and color branches are fused together via the injection operation as follows:

$$F_f^{t} = f_{\rm inject}\big(f_{\rm inject}(F^{t}, F_s^{t}),\; F_c^{t}\big),$$

where $F_f^{t}$ is the fused feature. We set a different number of injection operations in the fusion branch for each scale. For the skip connection from the first encoder layer, where features have large resolution, we employ 6 injection operations in the fusion branch. We gradually decrease this number to 1 for the skip connection from the last encoder layer.
We use SGBs to reinforce user intentions during hole filling in the CNN feature space. The modulated features represent structure content, while their texture representation is limited. This is partially because the low-level controls injected into the skip-connection layers do not contain sufficient texture guidance. Meanwhile, the sketch lines steer the encoder features toward structure content rather than textures.
As encoder features do not relate to low-level controls, we propose a texture generation branch that takes the features from the last encoder layer as input. Fig. 2 shows how TGB is integrated into the current pipeline. The architecture of TGB is the same as that of the original decoder. We add the feature maps from each layer of TGB to the corresponding decoder features for texture generation. TGB supplements decoder features via residual aggregations. As structure features are learned via SGB, TGB will focus on the features representing region details. The enriched decoder features are then concatenated with the features from SGBs for output generation where there are both structures and textures in the hole regions.
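A minimal sketch of this residual aggregation (the layer callables and the per-depth addition point are our simplification, not the official architecture):

```python
import numpy as np

def decode_with_tgb(z, decoder_layers, tgb_layers):
    """Texture generation branch as residual aggregation (illustrative sketch).

    z : last-encoder-layer feature; both branches start from it.
    At each depth, the TGB feature map is added to the decoder feature map,
    so TGB supplements the decoder with texture details.
    """
    d, t = z, z
    for dec, tgb in zip(decoder_layers, tgb_layers):
        d = dec(d)
        t = tgb(t)
        d = d + t          # residual aggregation of texture features
    return d

# Toy layers: per-depth affine maps on a flat feature vector.
rng = np.random.default_rng(2)
z = rng.normal(size=16)
dec_layers = [lambda x, s=s: 0.9 * x + s for s in (0.1, 0.2)]
tgb_layers = [lambda x, s=s: 0.1 * x - s for s in (0.1, 0.2)]
out = decode_with_tgb(z, dec_layers, tgb_layers)
assert out.shape == z.shape
```

The design choice the sketch reflects is that TGB never sees the low-level controls: it only refines what the encoder already captured, leaving structure to the SGB path.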
We utilize several objective loss functions to train DeFLOCNet in an end-to-end fashion. These functions include a pixel reconstruction loss, perceptual loss, style loss, relativistic average LS adversarial loss, and total variation loss. During training, we extract the sketch lines and colors in the hole regions. We denote $I_{\rm out}$ as the output result and $I_{\rm gt}$ as the ground truth. The loss terms can be written as follows:
Pixel reconstruction loss.
We measure the pixel-wise difference between $I_{\rm out}$ and $I_{\rm gt}$ as:

$$\mathcal{L}_{\rm pixel} = \big\|I_{\rm out} - I_{\rm gt}\big\|_1.$$
Perceptual loss.
We consider high-level feature representations and human perception by utilizing the perceptual loss, which is based on an ImageNet-pretrained VGG-16 backbone. The perceptual loss can be written as:

$$\mathcal{L}_{\rm perc} = \sum_{i} \frac{1}{N_i}\,\big\|\Phi_i(I_{\rm out}) - \Phi_i(I_{\rm gt})\big\|_1,$$

where $\Phi_i(\cdot)$ is the feature map of the $i$-th chosen layer of the VGG-16 backbone and $N_i$ is the number of elements in $\Phi_i(\cdot)$. In our work, $\Phi_i$ corresponds to the activation maps from layers ReLU1_1, ReLU2_1, ReLU3_1, ReLU4_1, and ReLU5_1.
Style loss.
The transposed convolutional layers of the decoder bring a checkerboard effect, which can be mitigated by the style loss. Suppose the size of feature map $\Phi_i$ is $C_i \times H_i \times W_i$. We write the style loss as:

$$\mathcal{L}_{\rm style} = \sum_{i} \frac{1}{C_i \times C_i}\,\Big\|\frac{1}{C_i H_i W_i}\big(G_i(I_{\rm out}) - G_i(I_{\rm gt})\big)\Big\|_1,$$

where $G_i(\cdot) = \Phi_i(\cdot)\,\Phi_i(\cdot)^{\top}$ is a Gram matrix computed from the feature maps and $C_i H_i W_i$ is the number of elements in $\Phi_i$. These feature maps are the same as those used in the perceptual loss illustrated above.
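A generic Gram-based style loss can be sketched as follows (mean-normalized L1 over Gram matrices of toy feature maps; the paper's exact normalization constants may differ):

```python
import numpy as np

def gram(F):
    """Gram matrix of a (C, H, W) feature map, normalized by element count."""
    C, H, W = F.shape
    X = F.reshape(C, H * W)
    return X @ X.T / (C * H * W)

def style_loss(feats_out, feats_gt):
    """L1 distance between Gram matrices over the chosen feature layers
    (a generic Gram-based style loss, not the paper's exact formula)."""
    loss = 0.0
    for Fo, Fg in zip(feats_out, feats_gt):
        loss += np.abs(gram(Fo) - gram(Fg)).mean()
    return loss

rng = np.random.default_rng(3)
feats = [rng.normal(size=(4, 8, 8)) for _ in range(3)]
assert style_loss(feats, feats) == 0.0          # identical features, zero loss
assert style_loss(feats, [2.0 * f for f in feats]) > 0.0
```

Because the Gram matrix discards spatial layout and keeps only channel correlations, this loss penalizes texture statistics rather than exact pixel placement, which is why it suppresses checkerboard artifacts.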
Relativistic average LS adversarial loss.
We utilize global and local discriminators for perception enhancement. The relativistic average LS adversarial loss is adopted for our discriminators, which can be written as:

$$\mathcal{L}_{D} = \mathbb{E}_{x_r}\big[(D(x_r) - \mathbb{E}_{x_f}[D(x_f)] - 1)^2\big] + \mathbb{E}_{x_f}\big[(D(x_f) - \mathbb{E}_{x_r}[D(x_r)] + 1)^2\big],$$

$$\mathcal{L}_{G} = \mathbb{E}_{x_r}\big[(D(x_r) - \mathbb{E}_{x_f}[D(x_f)] + 1)^2\big] + \mathbb{E}_{x_f}\big[(D(x_f) - \mathbb{E}_{x_r}[D(x_r)] - 1)^2\big],$$

where $D$ denotes the local or global discriminator, and the real and fake data $x_r$ and $x_f$ are sampled from $I_{\rm gt}$ and $I_{\rm out}$, respectively.
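The relativistic average least-squares formulation can be sketched with discriminator logits as plain arrays (a generic RaLSGAN implementation; the paper's exact variant may differ in detail):

```python
import numpy as np

def ralsgan_d_loss(real_logits, fake_logits):
    """Discriminator loss: real samples should score above the average fake
    score by ~1, and fake samples below the average real score by ~1."""
    ra_real = real_logits - fake_logits.mean()
    ra_fake = fake_logits - real_logits.mean()
    return ((ra_real - 1) ** 2).mean() + ((ra_fake + 1) ** 2).mean()

def ralsgan_g_loss(real_logits, fake_logits):
    """Generator loss: the roles of real and fake are swapped."""
    ra_real = real_logits - fake_logits.mean()
    ra_fake = fake_logits - real_logits.mean()
    return ((ra_real + 1) ** 2).mean() + ((ra_fake - 1) ** 2).mean()

real = np.array([1.0, 1.0])
fake = np.array([-1.0, -1.0])
# Here ra_real = 2 and ra_fake = -2, a "perfectly relativistic" discriminator.
assert ralsgan_d_loss(real, fake) == 2.0   # (2-1)^2 + (-2+1)^2 = 1 + 1
assert ralsgan_g_loss(real, fake) == 18.0  # (2+1)^2 + (-2-1)^2 = 9 + 9
```

Unlike the standard LS-GAN, the relativistic average form measures each sample against the mean score of the opposite class, which tends to stabilize training.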
Total variation loss.
This loss adds a smoothness penalty on the generated regions, and is defined as:

$$\mathcal{L}_{\rm tv} = \frac{1}{N_M}\sum_{(i,j)\in M}\Big(\big\|I_{\rm out}^{i,j+1} - I_{\rm out}^{i,j}\big\|_1 + \big\|I_{\rm out}^{i+1,j} - I_{\rm out}^{i,j}\big\|_1\Big),$$

where $N_M$ is the number of elements in the hole region and $M$ denotes the hole regions.
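A minimal sketch of a hole-restricted total variation penalty (assuming $N_M$ counts the hole elements; the exact boundary handling at the hole border is our choice):

```python
import numpy as np

def tv_loss(image, hole_mask):
    """Total variation penalty restricted to hole pixels (illustrative sketch).

    image     : (H, W) array (one channel for simplicity)
    hole_mask : (H, W) boolean array, True inside the hole
    """
    # Horizontal and vertical first differences, kept only where the
    # left/top pixel of each pair lies inside the hole.
    dh = np.abs(image[:, 1:] - image[:, :-1]) * hole_mask[:, :-1]
    dv = np.abs(image[1:, :] - image[:-1, :]) * hole_mask[:-1, :]
    n = max(hole_mask.sum(), 1)        # N_M, guarded against empty holes
    return (dh.sum() + dv.sum()) / n

img = np.zeros((4, 4))
img[:, 2:] = 1.0                       # one vertical edge between columns 1 and 2
hole = np.zeros((4, 4), dtype=bool)
hole[:, :2] = True                     # hole covers the left half
# Each of the 4 rows contributes one |0-1| horizontal difference at column 1,
# and the hole contains 8 pixels, so the loss is 4/8.
assert tv_loss(img, hole) == 0.5
```

Restricting the sum to the hole region penalizes roughness only where content was synthesized, leaving the observed pixels untouched by this term.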
The overall objective function of DeFLOCNet can be written as:

$$\mathcal{L} = \lambda_{\rm pixel}\mathcal{L}_{\rm pixel} + \lambda_{\rm perc}\mathcal{L}_{\rm perc} + \lambda_{\rm style}\mathcal{L}_{\rm style} + \lambda_{\rm adv}\mathcal{L}_{G} + \lambda_{\rm tv}\mathcal{L}_{\rm tv},$$

where $\lambda_{\rm pixel}$, $\lambda_{\rm perc}$, $\lambda_{\rm style}$, $\lambda_{\rm adv}$, and $\lambda_{\rm tv}$ are scalars controlling the influence of each loss term; their values are set empirically.
[Fig. 5: (a) Input, (b)-(g) visualized fusion-branch features, (h) Output]
The feature representations of input sketch lines are gradually enriched to become those of contours within one SGB. The enriched lines guide the color propagation process for edge-preserving feature modulation. To validate this effect, we visualize the feature maps from the fusion branch of one SGB set at the coarse level. Following standard visualization techniques, we visualize the 6 injection operations in the fusion branch. Specifically, we use one convolutional layer to map each CNN feature to a color image, and another convolutional layer to map each CNN feature to a grayscale image. The weights of the mapping convolutions are learned. The content shown in the images indicates the corresponding feature representations in the SGB.
Fig. 5 shows the visualization results. The input image is shown in (a), where the low-level controls are sent to the SGBs. The visualizations are shown in (b)-(g), where the features from the sketch generation branch are shown at the top left corner of each panel. We observe that during initial sketch line generation, shown in (b)-(c), color propagates in all directions. When the sketch lines are gradually completed, as shown in (d)-(g), color propagates along these lines to form a clear boundary. These feature maps are then concatenated to the original decoder for image structure generation, shown in (h). The visualization of color propagation is similar to that in Fig. 4, where the gating operation is effective in preventing color diffusion.
We evaluate on the natural image dataset Places2 and the face image dataset CelebA-HQ. During training, we follow PConv and HED to create hole regions and input sketch lines, respectively. The color inputs for face images are from GFC, and those for natural images are from RTV. During training, we choose arbitrary hole regions and low-level controls. The Adam optimizer is adopted. The training epochs for the CelebA-HQ and Places2 datasets are 120 and 40, respectively. The resolution of synthesized images is 256×256. All experiments are conducted on one Nvidia 2080 Ti GPU. The varied edges of the training images enable our method to handle diverse and deformed strokes.
[Fig. 6: (a) Original, (b) Input, (c) PConv, (d) DeepFill v2, (e) Pix2Pix, (f) SC-FEGAN, (g) Ours]
We compare our method with PConv, DeepFill v2, SC-FEGAN, and Pix2Pix. All these methods are retrained using their official implementations on the same datasets with the same input configurations for fair comparison. The only difference between DeFLOCNet and the other methods is that we send low-level controls into the skip-connection layers rather than the encoder.
Fig. 6 shows the visual comparison results. The original clean images are shown in (a), and the inputs to existing methods in (b). For straightforward display, we combine input images, masks, and low-level controls together. The results produced by Partial Conv and DeepFill v2 are shown in (c) and (d), where structure distortions and blurry textures exist. This is because these two methods tend to incorporate neighboring content when filling hole regions; they are not effective at generating the meaningful content given by users. In comparison, the results generated by Pix2Pix and SC-FEGAN, shown in (e) and (f), are improved. These two methods focus more on user controls and utilize adversarial learning to generate perceptually realistic content. However, as they attach user controls to color images for the network input, structure generation is limited. The feature representations are not sufficient to convey both color images and low-level controls since their data distributions are extremely unbalanced. The structures of the eye regions shown in the last three rows are not effectively generated in (e) and (f).
Unlike existing methods that combine all the inputs together, DeFLOCNet sends the color image into the encoder and low-level controls into skip-connection layers via SGBs. These controls gradually enrich encoder features in each skip-connection layer, and refine these features from coarse to fine across multiple skip-connection layers. The results of our method are shown in (g) where image content is effectively generated with detailed textures.
[Table 2: ablation metrics on both datasets for the settings 1 Inject, 1 block, w/o sketch line constraint, w/o texture, and Ours]
We evaluate existing methods and ours on the two benchmark datasets numerically from two aspects. First, we use standard metrics including PSNR, SSIM, and FID to measure the pixel-level, structure-level, and holistic-level similarities between the output results and the original images. When producing these results, we use the sketch lines and color dots from the hole regions without modifications. Table 1 shows the evaluation results, where our method performs favorably against existing methods. This indicates that our method is more effective at generating user-intended content while maintaining visual pleasantness.
Besides standard metrics, we perform a human subject evaluation. Over 20 volunteers, all with an image processing background, evaluate the results on both the CelebA-HQ and Places2 datasets. There are 15 rounds for each subject. In each round, the subject selects the most visually pleasant result from the 5 results generated by the compared methods, without knowing the hole region in advance. We tally the votes and show the statistics in the last row of Table 1. The comparison with existing methods indicates that our method is more effective at generating visually high-quality image content.
Our DeFLOCNet improves baseline results via SGBs and TGB. We analyze the effects of SGB and TGB on the output results.
[Fig. 7: (a) Input, (b) 1 inject, (c) 1 SGB, (d) w/o sketch line constraint, (e) Ours]
Structure generation block.
We set SGBs across multiple skip connection layers to reinforce user intentions from coarse to fine. Meanwhile, within each skip connection layer we gradually enrich the encoder feature representations. Fig. 7 shows two visual examples. Input images are in (a), and we use only 1 injection within each SGB to produce the results in (b). In another configuration, we use only one SGB, set on the middle-level skip connection layer, to generate the results in (c). The results in (b) and (c) indicate that using limited injections or SGBs is not effective for structure generation. In contrast to these two configurations, we also remove the sketch line constraint and produce the results in (d). They show that color propagates in arbitrary directions, leading to unintended structure generation. By using more injections and propagation guidance, we are able to produce visually pleasant structures in (e). Table 2 numerically shows that multiple injections and propagation constraints improve the structure quality of the generated content.
[Fig. 8: (a) Input, (b) w/o TGB, (c) w/ TGB]
Texture generation branch.
We analyze the effect of TGB by comparing results produced with and without TGB. Fig. 8 shows two examples. Input images with user controls are shown in (a). Without TGB, texture details are blurry in some regions in (b) (e.g., the forelock hair and mountain boundaries). Utilizing TGB synthesizes texture details based on the input image and thus reduces the blurry artifacts in (c). The numerical evaluation in Table 2 also indicates that TGB improves the generated content.
We propose structure generation blocks set on skip connection layers to receive low-level controls, while only color images are sent to the encoder. The encoder features, representing only color images, are gradually modulated via these blocks to reinforce user intentions within each skip connection layer. Furthermore, the modulated encoder features with structure ingredients supplement the decoder features, together with the generated texture features, across multiple skip connection layers. Therefore, both structures and textures are generated from coarse to fine in the CNN feature space, bringing both user-intended and visually pleasing image content. Experiments on the benchmark datasets indicate the effectiveness of our method both numerically and visually compared to state-of-the-art approaches.
This work was supported in part by the National Natural Science Foundation of China under grant 62072169.
This work was partially supported by an ECS grant from the Research Grants Council of Hong Kong (Project No. CityU 21209119) and an APRC grant from CityU, Hong Kong (Project No. 9610488).
Language-based image editing with recurrent attentive models, 2018.
Texture synthesis using convolutional neural networks. In Neural Information Processing Systems, 2015.
SC-FEGAN: Face editing generative adversarial network with user's sketch and color. In IEEE/CVF International Conference on Computer Vision, 2019.
Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, 2016.
Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.