Reference-based line-art colorization has achieved impressive performance in generating realistic color images from line-art inputs [graph-anime-colorization, cocosnet2]. This technique is in high demand in comics, animation, and other content-creation applications [aiqiyi-cmft, animation-transformer]. Unlike painting with other conditions such as color strokes [s2p5, DCSGAN], palettes [palette], or text [tag2pix], using a style reference image as the condition not only provides richer semantic information for the model but also frees users from supplying precise color information and geometric hints at every step. Nevertheless, due to the huge information discrepancy between the sketch and the reference, it is challenging to correctly transfer colors from the reference to the corresponding semantic regions of the sketch.
Several methods tackle reference-based colorization by fusing the style latent code of the reference into the sketch [s2p, icon, spade]. Inspired by the success of the attention mechanism [Attention_is_All_you_Need, non_local_net], researchers adopt attention modules to establish semantic correspondence and inject colors by mapping the reference onto the sketch [scft, aiqiyi-cmft, cocosnet]. However, as shown in Figure 1, the images generated by these methods often contain color bleeding or semantic mismatches, indicating considerable room for improving attention methods in line-art colorization.
There are many possible reasons for the deficiency of attention-based line-art colorization: the model pipeline, the module architecture, or the training. Motivated by recent works [ham, mocov3] concerning the training issues of attention models, we are particularly interested in the training stability of attention modules in line-art colorization. Training attention models is even harder in this setting because state-of-the-art models [scft] deploy multiple losses in a GAN-style training pipeline, which compounds the training instability. Therefore, we carefully analyze the training dynamics of attention in terms of its gradient flow in the context of line-art colorization. We observe a gradient conflict phenomenon, namely, a gradient branch that has negative cosine similarity with the summed gradient.
To eliminate the gradient conflict, we detach the conflicting branch while preserving the dominant gradient, which ensures that the resulting inexact gradient has positive cosine similarity with the exact gradient and meets the theoretical requirements of [Geng2021Training, pcgrad]. This training strategy visibly boosts training stability and performance compared with baseline attention colorization models. Combined with a new architecture design, this paper introduces Stop-Gradient Attention (SGA), whose training strategy eliminates the gradient conflict and helps the model learn better colorization correspondence. SGA properly transfers the style of the reference image to the sketch, establishing accurate semantic correspondence between sketch-reference pairs. Our experimental results on several image domains show clear improvements over previous methods, i.e., up to 27.21% and 25.67% in FID and SSIM, respectively.
Our contributions are summarized as follows:
We reveal the gradient conflict in the attention mechanism for line-art colorization, i.e., a gradient branch has negative cosine similarity with the summed gradient.
We propose a novel attention mechanism with a stop-gradient operation and design two attention blocks based on SGA, i.e., cross-SGA and self-SGA.
Both quantitative and qualitative results verify that our method outperforms state-of-the-art modules on several image datasets.
2 Related Work
2.0.1 Reference-based Line-Art Colorization.
Reference-based line-art colorization is a user-friendly approach that assists designers in painting a sketch with their desired colors [s2p, scft, aiqiyi-cmft, animation-transformer]. Early studies extract the style latent code of the reference and directly mix it with the sketch feature maps to generate the color image [s2p, icon]. To make better use of reference images, some studies propose spatially-adaptive normalization methods [spade, SEAN].
Different from the aforementioned methods that adopt latent vectors for style control, [scft, aiqiyi-cmft, cocosnet] learn dense semantic correspondences between sketch-reference pairs. These approaches utilize dot-product attention [Attention_is_All_you_Need, non_local_net] to model the semantic mapping between sketch-reference pairs and inject color into the sketch correctly. Although non-local attention is excellent at feature alignment and integration across modalities, the model cannot learn robust representations due to the gradient conflict in the attention's optimization. Thus, our work proposes a stop-gradient operation for attention to eliminate the gradient conflict problem in line-art colorization.
2.0.2 Attention Mechanism.
The attention mechanism [visual_attention, Attention_is_All_you_Need] was proposed to capture long-range dependencies and align signals from different sources. It is widely applied in vision [non_local_net, SAGAN], language [Attention_is_All_you_Need, TransformerXL], and graph [GAT] areas. Due to the quadratic memory complexity of standard dot-product attention, many researchers from the vision [A^2-Nets, GloRe, EMANet, latentGNN, ham] and language [TransformersAreRNN, Linformer, RoutingT] communities endeavor to reduce the memory consumption to linear complexity. Recently, the vision transformer [ViT] started a new era for modeling visual data through the attention mechanism. The boom of transformer research has substantially changed the trend in image [Deit, Swin, PVT, CvT], point cloud [PCT, PT], gauge [he2021gauge], and video [VTN, ViViT] processing.
Unlike existing works concerning the architecture of the attention mechanism, we focus on the training of attention modules in terms of their gradient flow. Although some strategies have been developed to improve training efficiency [Deit, SAMViT] for vision transformers, they mainly modify the objective function to impose additional supervision. From another perspective, our work investigates the gradient issue inside the attention mechanism.
2.0.3 Stop-Gradient Operation.
Backpropagation is the foundation for training deep neural networks, and recently some researchers have paid close attention to the gradient flow in deep models. Hamburger [ham] proposes a one-step gradient to tackle the gradient conditioning and gradient norm issues in its implicit global context module, which yields stable learning and performance. SimSiam [simsiam] adopts a one-sided stop-gradient operation to implicitly introduce an extra set of variables and implement an Expectation-Maximization (EM)-like algorithm in contrastive learning. VQ-VAE [VQVAE] also encourages discrete codebook learning through stop-gradient supervision. All of these works indicate the indispensability of gradient manipulation, demonstrating that neural network performance depends on both an advanced architecture and an appropriate training strategy.
Inspired by these prior arts, our work investigates the gradient conflict issue in training non-local attention. Our stop-gradient operation clips the conflicting gradient branches while preserving a correct descent direction for model updates.
3 Proposed Method
3.1 Overall Workflow
As illustrated in Fig. 2, we adopt a self-supervised training process similar to [scft]. Given a color image, we first use XDoG [2012xdog] to convert it into a line-art image. Then, the expected coloring result is obtained by applying random color jittering to the color image. Additionally, we generate a style reference image by applying a thin-plate-spline transformation to the color image.
In the training process, using the reference to color the sketch, our model first extracts the sketch and reference features with two encoders. To leverage multi-level representations simultaneously for feature alignment and integration, we concatenate the feature maps output by all convolutional layers after down-sampling them to the same spatial size with a 2D adaptive average pooling function.
To integrate the content of the sketch and the style of the reference, we employ our SGA blocks. There are two types of SGA blocks in our module: cross-SGA integrates features from different domains, and self-SGA models the global context of its input features. Then several residual blocks and a U-Net decoder with skip connections to the sketch encoder generate the output image from the mixed feature map. Finally, we add an adversarial loss [gan] by using a discriminator to distinguish the output from the ground truth.
3.2 Loss Function
Image Reconstruction Loss. As described in Section 3.1, both generated images and ground-truth images should keep style consistency with the reference and outline preservation with the sketch. Thus, we adopt a pixel-wise reconstruction loss to measure the difference between the generated image and the ground truth, which ensures that the model colors correctly and distinctly:
where the generator output denotes the result of coloring the sketch with the reference.
Adversarial Loss. To generate a realistic image with the same outline as the prior sketch, we leverage a conditional discriminator to distinguish generated images from real ones [isola2017image]. The least-squares adversarial loss [lsgan] for optimizing our GAN-based model is formulated as:
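Since the least-squares objective is standard, it can be sketched in a few lines of numpy; the real/fake target labels 1/0 follow the common LSGAN convention, and loss weights and the conditioning on the sketch are omitted here:

```python
import numpy as np

def lsgan_d_loss(d_real, d_fake):
    # Discriminator: push real logits toward 1 and fake logits toward 0.
    return 0.5 * np.mean((d_real - 1.0) ** 2) + 0.5 * np.mean(d_fake ** 2)

def lsgan_g_loss(d_fake):
    # Generator: push the discriminator's logits on fake images toward 1.
    return 0.5 * np.mean((d_fake - 1.0) ** 2)
```

A perfectly fooled discriminator (all fake logits at 1) yields zero generator loss, which is the fixed point the adversarial game drives toward.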
Style and Perceptual Loss. As shown in previous works [scft, Perceptual_loss], perceptual loss and style loss encourage a network to produce perceptually plausible outputs. Leveraging an ImageNet-pretrained network, we reduce the gaps between the multi-layer activations of the target image and the generated image by minimizing the following losses:
where the activation maps are extracted at the ReLU layers of a VGG19 network, and the style loss compares their Gram matrices.
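The Gram matrix used by the style loss is just the channel-by-channel inner products of an activation map; a minimal numpy sketch (the normalization by the number of spatial positions is one common convention and varies across implementations):

```python
import numpy as np

def gram_matrix(feat):
    # feat: activation map of shape (channels, height, width).
    c, h, w = feat.shape
    f = feat.reshape(c, h * w)
    # Inner product between every pair of channels, normalized by position count.
    return f @ f.T / (h * w)
```

The style loss then compares the Gram matrices of the target and generated images layer by layer, which captures texture statistics while discarding spatial layout.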
In summary, the overall loss function for the generator and discriminator is defined as:
3.3 Gradient Issue in Attention
In this section, we use SCFT [scft], a classic attention-based colorization method, as an example to study the gradient issue in attention. A query projection is computed from the sketch feature by a learned linear map, while key and value projections are computed from the reference feature. Given the resulting attention map, the classic dot-product attention mechanism can be formulated as follows:
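A minimal numpy sketch of this dot-product attention with the skip connection; shapes and the scaling factor follow the standard formulation, and the projection matrices here are illustrative placeholders rather than SCFT's exact parameterization:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(sketch_feat, ref_feat, w_q, w_k, w_v):
    # sketch_feat: (n, d) flattened sketch pixels; ref_feat: (m, d) reference pixels.
    q, k, v = sketch_feat @ w_q, ref_feat @ w_k, ref_feat @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (n, m) attention map
    # Skip connection plus color features gathered from the reference.
    return sketch_feat + attn @ v
```

Each row of the attention map is a distribution over reference pixels, so every sketch pixel receives a convex combination of reference value features on top of its own representation.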
Previous works [mocov3, ham, SAMViT, Deit] report the training difficulties of visual attention: instability, worse generalization, etc. For line-art colorization, it is even more challenging to train attention models, as the training involves a GAN-style loss and a reconstruction loss, which are understood to lead to mode collapse [gan] or trivial solutions. Given a training schedule, the loss of the colorization network can oscillate during training and finally deteriorate.
Figure: Histograms of the gradient cosine value distribution over 40 epochs. A large cosine value means that the network mainly uses this branch of the gradient to optimize the loss function.
To better understand the reasons behind the training difficulty of attention in colorization, we analyze the gradient issue through the classic SCFT model [scft]. We visualize the gradient flowing back through the attention module in terms of each gradient branch and the summed gradient.
Fig. 4 shows the cosine values between the different gradient branches and the total gradient. For each pixel, we separately calculate the cosine similarities of the two gradient branches in the sketch feature maps and of the two branches in the reference feature maps to explore the gradient flow of the network during learning.
Note that first-order optimization methods usually require the surrogate gradient $\tilde{g}$ used for the update to be ascent, i.e., $\langle \tilde{g}, g \rangle > 0$, where $g$ is the exact gradient; the update direction based on such a surrogate gradient is then a descent direction. The visualization in Fig. 4 implies that the gradient from the skip connection in the sketch branch and the gradient through the value projection in the reference branch already form an ascent direction for optimization, while the gradients flowing through the attention map constitute the "conflict gradient" with respect to the total gradient, i.e., their cosine similarity with the total gradient is negative.
Figs. 3(b) and 3(a) show that the dominant branches are usually highly correlated with the total gradient: over 78.09% and 52.39% of the cosine values are greater than 0.935 in the 40th epoch, respectively. Moreover, these percentages increase during training, indicating the significance of the dominant gradient. On the other hand, nearly 30.57% of the cosine values in Fig. 3(c) and 10.77% in Fig. 3(d) are negative in the 40th epoch. These proportions were 22.81% and 5.32% in the 20th epoch, respectively, and they gradually grow during training.
The visualization of the gradient flows demonstrates that the two gradient branches compete with each other for the dominant position during training: the skip-connection and value-path gradients constitute an ascent direction, while the gradients through the attention map remain the conflict gradient with respect to the total gradient in each branch. According to existing work in multi-task learning [pcgrad], large gradient conflict ratios may result in a significant performance drop. This motivates us to detach the conflicting gradient while preserving the dominant gradient as an inexact gradient that approximates the original one, as illustrated in Fig. 3.
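The conflict criterion and the detach step above can be sketched on flattened gradient vectors as follows; this illustrates only the selection rule (drop any branch whose cosine similarity with the total gradient is negative), not the autograd mechanics of detaching a computation path:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def drop_conflicting(branches):
    # branches: list of flattened gradient vectors whose sum is the total gradient.
    total = np.sum(branches, axis=0)
    # Keep only branches that form an ascent direction w.r.t. the total gradient.
    kept = [g for g in branches if cosine(g, total) > 0]
    return np.sum(kept, axis=0), total

# Toy example: the second branch conflicts with the total gradient and is dropped.
surrogate, total = drop_conflicting([np.array([3.0, 0.0]), np.array([-1.0, 0.5])])
```

The surrogate keeps a positive cosine with the exact total gradient, which is precisely the ascent condition required for the update to remain a descent direction.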
The results show that gradient clipping through the stop-gradient operation effectively improves model performance. We can also remove the projection matrices on the detached paths, since no gradient propagates through them and they would never be updated during training. The lower FID and higher SSIM mean that the model can generate more realistic images with higher outline preservation during colorization after the stop-gradient clipping.
To investigate the reliability of the gradient conflicts, we measure the gradient cosine distributions when training with each loss individually, confirming that the trigger of the gradient issue is the dot-product attention itself. Using the SCFT model, we compute the gradient cosine distribution under each loss to check whether the loss functions or the architecture causes the conflict. Fig. 5 shows that all loss terms cause similar conflicts, implying that the attention architecture leads to the gradient conflicts.
3.4 Stop-Gradient Attention
Combining the above training strategy with an improved architecture, we propose Stop-Gradient Attention (SGA). As Fig. 5(a) illustrates, in addition to the stop-gradient operation, we also design a new feature integration and normalization strategy for SGA. Inspired by [gcn, wang2019learning], we treat the stop-gradient attention map as a prior graph structure, so that features can be effectively aggregated from adjacent nodes and from the node itself:
Here a leaky ReLU activation function is applied, and the attention map is normalized by a double normalization method analogous to the Sinkhorn algorithm [cuturi2013sinkhorn]. Different from the softmax employed in classic non-local attention, double normalization makes the attention map insensitive to the scale of the input features [ex-att]. The normalized attention map can be formulated as follows:
where each entry measures the correlation between a feature vector of the sketch and a feature vector of the reference. The pseudo-code of SGA is summarized in Algorithm 1.
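A minimal numpy sketch of one common form of double normalization (a softmax over one axis followed by an l1 normalization over the other, analogous to a single Sinkhorn step); the axis convention here is illustrative and may differ from the paper's exact formulation:

```python
import numpy as np

def double_norm(scores):
    # scores: (n, m) raw correlations between sketch (rows) and reference (cols).
    a = np.exp(scores - scores.max(axis=0, keepdims=True))
    a = a / a.sum(axis=0, keepdims=True)       # softmax over the sketch axis
    return a / a.sum(axis=1, keepdims=True)    # l1-normalize over the reference axis
```

After the second step each sketch pixel's attention weights form a proper distribution over reference pixels, while the first step already balanced how much influence each sketch pixel exerts, which is what tempers sensitivity to the feature scale.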
Cross-SGA and self-SGA share the same structure; the only difference between them is whether the two inputs are the same. Cross-SGA calculates pixel-wise correlations between features from different image domains and integrates the features under a stop-gradient attention map, while self-SGA models the global context and fine-tunes the integration. For stable training, we also adopt batch normalization layers and shortcut connections [he2016identity]. Combining the above techniques, our SGA blocks effectively integrate the sketch and reference features into the generated feature.
4.1 Experiment Setup
Dataset. We evaluate our method on the popular anime portraits dataset [art-editing-dataset] and the Animal Faces-HQ (AFHQ) dataset [afhq-dataset]. The anime portraits dataset contains 33,323 anime faces for training and 1,000 for evaluation. AFHQ consists of 15,000 high-quality animal-face images at 512 × 512 resolution in three categories, i.e., cat, dog, and wildlife; each class provides 5,000 images for training and 500 for evaluation. To simulate line-art drawn by artists, we use XDoG [2012xdog] to extract the sketch inputs, setting its parameters to keep a sharp step transition at the border of sketch lines. We randomly vary one XDoG parameter among 0.3/0.4/0.5 to obtain different levels of line thickness, which generalizes the network to various line widths and avoids overfitting; the remaining XDoG parameters keep their default values.
Implementation Details. We implement our model with the input image size fixed at 256×256 for each dataset. For training, we set fixed coefficients for each loss term and use the Adam solver [adam] for optimization. The learning rates of the generator and discriminator are initially set to 0.0001 and 0.0002, respectively. Training lasts 40 epochs on each dataset.
Evaluation Metrics. The Fréchet inception distance (FID) is used to assess the perceptual quality of generated images by comparing the distance between the distributions of generated and real images in a deep feature embedding. Besides measuring perceptual credibility, we also adopt the structural similarity index measure (SSIM) to quantify outline preservation during colorization, calculated between the image colored with the reference and the original color image of the sketch.
4.2 Comparison Results
We compare our method with existing state-of-the-art modules, including not only reference-based line-art colorization [scft] but also image-to-image translation, i.e., SPADE [spade], CoCosNet [cocosnet], UNITE [uot-cvpr2021], and CMFT [aiqiyi-cmft]. For fairness, all networks use the same auto-encoder architecture and the aforementioned training losses in our experiments. Table 2 shows that SGA outperforms the other techniques by a large margin. With respect to our main competitor SCFT, SGA improves FID and SSIM by 27.21% and 25.67% on average, respectively. This clear-cut improvement means that SGA produces more realistic images with higher outline preservation than previous methods. According to Fig. 7, the images generated by SGA have less color bleeding and higher perceptual color consistency.
Furthermore, we explore the superiority of SGA over SCFT in terms of rescaling the spectrum concentration of the representations. We compare the accumulative ratios of the squared top singular values over the total squared singular values of the unfolded feature maps before and after passing through the attention module, as illustrated in Fig. 8. The sum of the singular values is the nuclear norm, i.e., the convex relaxation of the matrix rank, which measures how compact the representations are and is widely applied in machine learning [kang2015logdet]. The accumulative ratios are clearly lifted after passing through SCFT or SGA, which helps the model focus on critical global information [ham]. However, SGA not only further denoises the feature maps but also pushes the encoder before the attention module to learn energy-concentrated representations, i.e., under the effect of SGA, the CNN encoder also learns to focus on global information.
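The accumulative energy ratio above can be computed directly from the singular values of an unfolded feature map; a minimal numpy sketch (the channels-by-positions unfolding is the usual convention and is assumed here):

```python
import numpy as np

def cumulative_energy(feat):
    # feat: unfolded feature map of shape (channels, height * width).
    s = np.linalg.svd(feat, compute_uv=False)     # singular values, descending
    return np.cumsum(s ** 2) / np.sum(s ** 2)     # energy captured by top-k values
```

A representation whose top few ratios are close to 1 concentrates its energy in a few directions, which is the "compactness" that the nuclear-norm view measures.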
4.3 Ablation Study
We perform several ablation experiments to verify the effectiveness of the components of our SGA blocks, i.e., the stop-gradient operation, the attention map normalization, and self-SGA. The quantitative results reported in Table 3 show the superiority of our SGA blocks.
Specifically, to evaluate the necessity of stop-gradient in non-local attention, we design a variant of SGA without stop-gradient. As shown in Table 3, it obtains inferior performance, which verifies the benefit of eliminating the gradient conflict through stop-gradient.
| Variant | FID | SSIM | FID | SSIM | FID | SSIM | FID | SSIM |
|---|---|---|---|---|---|---|---|---|
| SGA w/o stop-gradient | 36.34 | 0.876 | 40.73 | 0.796 | 72.34 | 0.808 | 19.90 | 0.791 |
| SGA w/o double-norm | 33.42 | 0.861 | 34.42 | 0.811 | 55.08 | 0.828 | 15.95 | 0.809 |
| SGA w/o self-SGA | 31.56 | 0.917 | 34.26 | 0.842 | 55.69 | 0.839 | 16.36 | 0.821 |
Furthermore, we conduct an ablation study on the attention map normalization to validate the advantage of double normalization in our framework. Table 3 demonstrates that SGA with double normalization outperforms the variant with the classic softmax function. Although classic softmax can generate realistic images, it suffers from low outline preservation, i.e., a lower SSIM measure.
Based on the framework with stop-gradient and double normalization, we additionally ablate the contribution of self-SGA. Although our model already achieves excellent performance without self-SGA, there is still a clear-cut enhancement on most datasets after employing self-SGA, according to Table 3. Stacking SGA blocks helps the model not only integrate features effectively but also fine-tune a better representation with global awareness for coloring.
Extending the training schedule to 200 epochs, Fig. 9 shows that SGA still performs better with more epochs (29.71 in the 78th epoch) and collapses later than SCFT [scft], demonstrating its improved training stability for attention models in line-art colorization.
Additionally, to be more rigorous, we visualize the gradient distributions of the "SGA w/o stop-gradient" variant. Fig. 10 implies that the existence of gradient conflicts is a general phenomenon in the dot-product attention mechanism.
In this paper, we investigate the gradient conflict phenomenon in classic attention networks for line-art colorization. To eliminate the gradient conflict issue, we present a novel cross-modal attention mechanism, Stop-Gradient Attention (SGA), which clips the conflicting gradient through the stop-gradient operation. The stop-gradient operation unleashes the potential of the attention mechanism for reference-based line-art colorization. Extensive experiments on several image domains demonstrate that our simple technique significantly improves reference-based colorization performance with better training stability.
Acknowledgments: This research was funded in part by the Sichuan Science and Technology Program (Nos. 2021YFG0018, 2022YFG0038).