As billions of images are uploaded and shared every day [126amazing, rupprecht2018guide], image editing has become one of the most in-demand tasks on social media. However, to edit an image as desired, one may have to master professional software such as Adobe Photoshop. In contrast with manual editing, automatic image editing has recently attracted much interest in computer vision. This paper studies the problem of Sentence-based Image Editing (SIE) [dong2017semantic, nam2018text, li2020manigan], which uses natural language to guide image editing automatically. One main challenge for SIE is to build the cross-modal mapping from the query sentence to the pixels of the image. In the last decade, deep neural networks [lyu2020multi, lyu2019attend] have become the main solution to SIE, enabling generative models to produce pixel-level manipulations of an image.
Based on Generative Adversarial Networks (GANs) [zhu2019dm, zhang2018photographic, yin2019semantics, tan2019semantics, qiao2019learn, lao2019dual], recent works on SIE focus on combining sentence and image information. For example, AttnGAN [xu2018attngan] maps the query sentence and the image to be edited into a shared hidden space and minimizes a multi-modal similarity loss to improve the quality of text-to-image generation. TAGAN [nam2018text] provides word-level feedback to the generator through a fine-grained text discriminator. ManiGAN [li2020manigan] combines language and image with a three-stage network structure, which progressively generates images at three different scales. These methods show impressive results with short query sentences or phrases.
Nevertheless, when the query sentence is long and contains multiple attributes to be edited, existing methods can hardly produce effective editing for all of them. As the typical example in Fig. 1 shows, the query sentence “The bird has a black and yellow striped belly, black eye rings, a black crown, and yellow breasts” is used to guide the editing of a given bird image. Intuitively, the sentence contains several editable attributes: “black and yellow striped belly”, “black eye rings”, “black crown” and “yellow breasts”. The current state of the art, ManiGAN [li2020manigan], fails to edit the attribute “black eye rings”. The main reason for this failure is that existing methods only focus on sentence-level editing rather than on each attribute. Going further, we argue that there are three main obstacles for these GAN-based sentence-level editing methods: 1) they cannot parse sentences effectively, so the differences between attributes are indistinguishable; 2) they cannot build the attribute-pixel correspondence properly; 3) the sentence-level discriminator utilised in these methods is limited in detecting failed attribute editing.
To tackle these drawbacks, we aim to strengthen each editable attribute so as to attain an accurate SIE model. Concretely, inspired by contrastive training, we propose a novel Contrastive Attention Generative Adversarial Network (CA-GAN) for SIE. Our proposed CA-GAN contains three main components: 1) Sentence Parsing and Attribute Combination. To facilitate training for attribute-level editing, we first parse the query sentence based on POS tagging to ensure the attribute-object correspondence. Then we augment the query space by random attribute combinations, which prove to significantly highlight attribute-level information. 2) Contrastive Training using attention. Intuitively, based on the augmentation from random attribute combinations, different combinations yield different editings. Thus, we construct Contrastive Attention over different combinations in the GAN architecture. Our model can then enlarge the editing difference between any two attribute combinations whilst keeping the background invariant. 3) Attribute-level Discriminator. In the discriminator, we build an attribute-level discriminator that provides effective editing feedback on each attribute to the generator. With the proposed CA-GAN, the editable attributes in the sentence can be well distinguished via training, yielding an effective SIE model that emphasizes each attribute appropriately. To the best of our knowledge, this is the first work that proposes to separate attributes from long sentences to strengthen attribute editing. We evaluate the proposed CA-GAN on two benchmark image editing datasets, i.e., CUB and MS-COCO. The evaluation results show that our method can edit the attributes at the pixel level effectively and accurately.
2 Related Work
Sentence-based image editing. In recent years, based on Generative Adversarial Networks (GANs) [zhu2019dm, zhang2018photographic, yin2019semantics, tan2019semantics, qiao2019learn, lao2019dual], researchers have paid much attention to image generation or transformation from text or images, such as Text-to-Image Generation [nguyen2017plug, xu2018attngan, qiao2019mirrorgan, zhang2017stackgan, zhang2018stackgan++, ye2019unsupervised, huang2018multimodal, press2020emerging, yu2018singlegan]. To make the transformation controllable, Text-based Image Editing edits only the target area of the image according to a text description. Generally, the query text can be a word, a short phrase or a long sentence. Dong et al. proposed an encoder-decoder structure to edit images matched with a given text [dong2017semantic]. In order to keep the content irrelevant to the text unchanged in the original image, [vo2018paired] proposed to construct foreground and background distributions with different recognizers. Nam et al. disentangled different visual attributes by introducing a text-adaptive discriminator, which provides more detailed training feedback to the generator [nam2018text]. Li et al. adopted a multi-level network structure and can generate high-quality image content through the combination modules ACM and DCM [li2020manigan]. However, the generators of these methods ignore the difference between long sentences, which may contain multiple editable attributes, and single words or phrases. In this paper, we focus on Sentence-based Image Editing and propose to construct Contrastive Attention to enhance attribute editing.
Contrastive training. For a given anchor point in the data, the purpose of contrastive learning [yu2019multi, he2020momentum, chen2020simple] is to bring the anchor closer to a positive point and push it away from a negative point in the representation space, thus enhancing the consistency of the feature representation. In previous vision tasks [wu2018unsupervised, lee2021infomax, deng2020disentangled, kang2020contragan], the idea of contrastive learning has also been applied by exploring the relationship between positive and negative samples. It was also demonstrated in [park2020contrastive] that contrastive learning methods are effective in the task of image-to-image translation. Some works have also studied contrastive training in natural language processing [zhang2020unsupervised, he2020momentum]. CDL-GAN [zhou2021cdl] adds Consistent Contrastive Distance (CoCD) and Characteristic Contrastive Distance (ChCD) into a principled framework to improve GAN performance. CERT [fang2020cert] uses back-translation for data augmentation. BERT-CT [carlsson2020semantic] uses two individual encoders for contrastive learning. In this work, we argue that different attributes should be edited discriminatively, which leads to the idea of contrastive attention for SIE.
3.1 Overview
Given an image $I$ and a query sentence $S$, SIE aims to transform $I$, guided by $S$, into an edited image $I'$. Our Contrastive Attention Generative Adversarial Network (CA-GAN) is based on the popular three-stage editing architecture [xu2018attngan]. To be more specific, there are three stages in the main module, and each stage contains a generator and a discriminator. The three stages are trained at the same time and progressively generate images at three increasing scales. As shown in Fig. 2, CA-GAN contains three main components: (1) Sentence Parsing and Attribute Combination (Fig. 2(b)). The query sentence is parsed into multiple editable attributes using lexical rules based on POS tagging [bird2009natural]. Then, the attributes are randomly combined into two groups for augmentation. (2) Contrastive Training using attention (Fig. 2(c)). This module uses attention to distinguish different combinations, so that each attribute can be learned. (3) Attribute-level Discriminator (Fig. 2(d)). This module is designed to provide feedback on whether an attribute is edited well. We elaborate on these steps in the following.
3.2 Sentence Parsing and Attribute Combination
Some off-the-shelf methods can parse a sentence into multiple phrases, such as Topicrank [bougouin2013topicrank], Sentence Transformers [nikzad2021phraseformer] and BERT [devlin2018bert]. However, these methods suffer from two main problems: 1) they are task-specific and not designed for image editing; 2) they are large-scale networks with massive parameters. In contrast, we propose a lightweight parsing rule to effectively extract attributes from a sentence. Specifically, given a sentence, as shown in Fig. 2(c), we first use POS tagging [bird2009natural] (NLTK [loper2002nltk]) to label each word with a lexical property (e.g., noun, adjective). With the lexical properties, we then separate the attributes from the sentence in the form of “adjective-noun”. However, it is difficult to determine which of two neighboring attributes an adjective belongs to. This can be illustrated by the examples “bird with a black wing” → “[bird black], [wing]” and “yellow belly and wings” → “[yellow belly], [wings]”, where the adjective is wrongly categorized. To cope with this problem, we leverage two state values, one for noun words and one for adjective words, to assist the attribute separation based on the “noun-adjective” rule. If the word following a noun is “has” or “with”, the states are reset. If the word is a conjunction and both its previous and next words are adjectives, the states are reset as well. If and only if both states are set, all nouns and adjectives traversed so far are grouped into one attribute. With this simple rule, we can effectively divide a sentence into its attributes. We compare several sentence parsing methods with the proposed rule on several multi-attribute sentences. As shown in Tab. 1, our parsing effectively extracts editable attributes compared with the related methods.
The two query sentences in Tab. 1 are: (a) “a grey bird with webbed feet, a short and blunt orange bill, grey head and wings and has white eyes, a white stripe behind its eyes and white belly and breast”; (b) “the bird is black with a white belly and an orange bill”.

| Method | Attribute Extraction for (a) | Attribute Extraction for (b) |
| --- | --- | --- |
| Transformers [nikzad2021phraseformer] | [‘grey bird’, ‘feet’, ‘short’, ‘blunt orange bill’, ‘grey head’, ‘wings’, ‘white eyes’, ‘white stripe’, ‘eyes’, ‘white belly’, ‘breast’] | [‘bird’, ‘black’, ‘white belly’, ‘orange bill’] |
|  | [‘grey bird with webbed feet’, ‘blunt orange bill’, ‘grey head’, ‘wings’, ‘white eyes’, ‘white stripe’, ‘eyes’, ‘breast’] | [‘bird’, ‘orange bill’] |
|  | [‘orange bill’, ‘grey head wings’, ‘white belly’, ‘white eyes’, ‘white stripe’] | [‘white belly’, ‘orange bill’] |
|  | [‘grey bird’, ‘webbed feet’, ‘short blunt orange bill’, ‘grey head wings’, ‘white eyes’, ‘white stripe eyes’, ‘white belly breast’] | [‘bird black’, ‘white belly’, ‘orange bill’] |
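The noun-adjective grouping rule described above can be sketched in a few lines of Python. This is a minimal, illustrative sketch on pre-tagged tokens: the function name, the simplified tag set, and the exact reset conditions are our own simplification, not the paper's precise implementation (which uses full NLTK POS tags).

```python
# Minimal sketch of the noun-adjective attribute grouping rule.
# Tokens come pre-tagged with simplified POS tags ("ADJ", "NOUN", "CONJ", "OTHER");
# in practice the tags would come from nltk.pos_tag (assumption).

def parse_attributes(tagged_tokens):
    """Group consecutive adjectives with the noun(s) they modify."""
    attributes, current_adjs, current_nouns = [], [], []
    for word, tag in tagged_tokens:
        if tag == "ADJ":
            # A new adjective after a completed noun starts a new attribute.
            if current_nouns:
                attributes.append(current_adjs + current_nouns)
                current_adjs, current_nouns = [], []
            current_adjs.append(word)
        elif tag == "NOUN":
            current_nouns.append(word)
        elif tag == "CONJ":
            continue  # conjunctions keep the current group accumulating
        else:
            # Words like "has"/"with" separate the object from its attributes.
            if word in ("has", "with") and current_nouns:
                attributes.append(current_adjs + current_nouns)
                current_adjs, current_nouns = [], []
    if current_adjs or current_nouns:
        attributes.append(current_adjs + current_nouns)
    return [" ".join(a) for a in attributes]

tagged = [("the", "OTHER"), ("bird", "NOUN"), ("has", "OTHER"),
          ("a", "OTHER"), ("black", "ADJ"), ("wing", "NOUN"),
          ("and", "CONJ"), ("a", "OTHER"), ("yellow", "ADJ"), ("belly", "NOUN")]
print(parse_attributes(tagged))  # → ['bird', 'black wing', 'yellow belly']
```

Note how “has” closes the “bird” group before “black wing” begins, avoiding the “[bird black], [wing]” failure mode discussed above.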
In the real world, the attribute distribution is highly imbalanced. For example, in the CUB dataset [wah2011caltech], attributes about “belly” appear 34,899 times, while attributes about “eyering” appear only 3,125 times. This phenomenon leads to poor editing on the kinds of attributes with fewer occurrences. To this end, we propose to combine attributes randomly to augment the data. Specifically, in the training phase, we randomly combine the parsed attributes into two disjoint groups. By randomly combining attributes, we obtain more editing alternatives, and each attribute is learned without being limited by the distribution imbalance. To further study attribute-specific editing, we design a contrastive training strategy so that each kind of attribute is trained effectively.
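The random combination step can be sketched as follows; the function and the exact grouping scheme are illustrative assumptions (the paper only states that attributes are randomly combined into two groups per iteration).

```python
import random

def random_combinations(attributes, seed=None):
    """Split parsed attributes into two random, non-empty, disjoint groups.

    Illustrative sketch of the augmentation step: each training iteration
    sees a different pair of attribute combinations, so rare attributes are
    exercised in many different contexts.
    """
    rng = random.Random(seed)
    k = rng.randint(1, len(attributes) - 1)  # split point, both sides non-empty
    shuffled = attributes[:]
    rng.shuffle(shuffled)
    return shuffled[:k], shuffled[k:]

attrs = ["black crown", "yellow breast", "black eye rings", "striped belly"]
c1, c2 = random_combinations(attrs, seed=0)
```

Over many iterations, a rare attribute such as “black eye rings” appears in many distinct combinations, which counteracts the imbalance described above.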
3.3 Contrastive Training using attention
Within a sentence, using different attribute combinations to edit an image will yield different results. Contrastive training [he2020momentum] aims to learn a representation that pulls “positive” pairs together in a certain metric space and pushes “negative” pairs apart. Based on this observation, we can impose contrastive training on the network to control the editing difference between two attribute combinations. We implement our design based on the well-known AttnGAN [xu2018attngan], which calculates the spatial attention of the image w.r.t. each word for text-to-image generation. However, the proposed CA-GAN is quite different from AttnGAN in how the attention is constructed. Specifically, we construct Contrastive Attention over different attribute combinations in the proposed CA-GAN, and with contrastive training, the editing of each attribute can be enhanced.
The generator of CA-GAN (see Fig. 2) has two inputs: the image features extracted by a CNN and the features of the two attribute combinations encoded by an RNN. We construct cross-modal attention matrices using the Cross-Modal Attention Module (CMAM), as shown in Fig. 2(e), where each attention weight is computed from the similarity between the attribute-combination features and the image features over the channel dimension. From these matrices, we obtain two attention maps, one for each attribute combination.
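An assumed form of such a cross-modal attention map, in the AttnGAN style of softmax-normalized similarities between text features and image regions, can be sketched as follows; the feature shapes and the dot-product similarity are illustrative assumptions.

```python
import math

def cross_modal_attention(attr_feats, region_feats):
    """Sketch of a CMAM-style attention map (assumed AttnGAN-like form).

    attr_feats:   list of D-dim attribute-combination feature vectors
    region_feats: list of D-dim image-region feature vectors
    Returns one softmax distribution over regions per attribute combination.
    """
    def softmax(xs):
        m = max(xs)
        exps = [math.exp(x - m) for x in xs]
        s = sum(exps)
        return [e / s for e in exps]

    maps = []
    for a in attr_feats:
        # Dot-product similarity between this combination and every region.
        scores = [sum(ai * ri for ai, ri in zip(a, r)) for r in region_feats]
        maps.append(softmax(scores))
    return maps

attn = cross_modal_attention(
    attr_feats=[[1.0, 0.0], [0.0, 1.0]],                 # two attribute combinations
    region_feats=[[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]])   # three image regions
```

Each row of `attn` highlights the regions most similar to one attribute combination, which is the map later used to mask the image via the Hadamard product.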
The Contrastive Attention contains both spatial and channel attention maps. Because an attention map represents the area of the image to be edited by the attributes, we can use the attention matrices to obtain the attended features for the different attribute combinations by means of the Hadamard product. Thus, given the two combinations, we can easily obtain six kinds of attention-image pairs using Eq. (1) with the original image and the two edited images (one from each combination). This can be seen in Fig. 2(b). The six pairs are denoted as
: positive sample for the first editing attribute combination;
: negative sample for the first editing attribute combination;
: positive sample for the second editing attribute combination;
: negative sample for the second editing attribute combination;
: editing areas for the first attribute combination on the original image;
: editing areas for the second attribute combination on the original image.
The six attended images are fed into a pretrained VGG-16 to extract their features. Then, we construct the contrastive loss between the different features.
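A minimal sketch of such a contrastive objective, written here in an InfoNCE-like form over cosine similarities (the paper's exact loss may differ), is:

```python
import math

def cosine(u, v):
    du = math.sqrt(sum(x * x for x in u))
    dv = math.sqrt(sum(x * x for x in v))
    return sum(x * y for x, y in zip(u, v)) / (du * dv)

def contrastive_loss(anchor, positive, negative, tau=0.1):
    """InfoNCE-style sketch: pull the anchor feature toward its positive
    feature and away from its negative feature. `tau` is a temperature.
    This illustrates the principle, not the paper's exact formulation."""
    pos = math.exp(cosine(anchor, positive) / tau)
    neg = math.exp(cosine(anchor, negative) / tau)
    return -math.log(pos / (pos + neg))

# An anchor close to its positive yields a small loss.
loss = contrastive_loss([1.0, 0.0], [0.9, 0.1], [0.0, 1.0])
```

In CA-GAN the anchors, positives and negatives would be the VGG-16 features of the six attended images, so the edits of the two attribute combinations are pushed apart while each edit stays close to its own attended region.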
Through contrastive training, the generator can learn the distribution of each attribute and establish an accurate association between attributes and the image. In addition, to preserve text-independent background regions, we add the perceptual loss [johnson2016perceptual] to reduce the randomness in the generation process.
3.4 Attribute-level Discriminator
To encourage the generator to edit multiple attributes based on sentences, the discriminator should provide attribute-level training feedback to the generator. Previous work attempted to use a sentence-level discriminator [dong2017semantic] or a word-level discriminator [nam2018text, li2020manigan], but these cannot establish an exact connection between the image area and each attribute. For instance, in the sentence “the bird has black wings, a black head and a red belly”, when the attribute word “black” is passed through the discriminator, a sentence-level discriminator does not provide the exact area of the feature in the image, and a word-level discriminator localizes to both the “head” and “wing” regions. In order for the discriminator to provide feedback related to each attribute, we propose the attribute-level discriminator.
Our attribute-level discriminator has two inputs: the attribute combination features and the image features. We use an attention weight to represent the correlation between each attribute combination and the whole image.
Next, we sum the attribute-weighted image features along the channel dimension.
Finally, the attribute-level feedback is calculated with a Binary Cross-Entropy (BCE) loss.
By calculating BCE loss, the discriminator is able to provide attribute-level training feedback to the generator, thus benefiting the alignment between the different attribute features and visual features in the sentence.
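The attribute-level feedback can be sketched as one BCE term per attribute combination; the per-attribute score interface below is an illustrative assumption, not the paper's exact discriminator head.

```python
import math

def bce(p, y):
    """Binary cross-entropy for a single probability p and label y ∈ {0, 1}."""
    eps = 1e-7
    p = min(max(p, eps), 1 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def attribute_level_feedback(attr_scores, labels):
    """Sketch of the attribute-level discriminator loss: one BCE term per
    attribute, so the generator is penalized for *each* attribute it fails
    to edit, rather than for the sentence as a whole."""
    return sum(bce(p, y) for p, y in zip(attr_scores, labels)) / len(labels)

# Two attributes edited convincingly, one ("black eye rings", say) missed:
loss = attribute_level_feedback([0.9, 0.8, 0.1], [1, 1, 1])
```

A sentence-level discriminator would average the three attributes into one score; the per-attribute sum keeps the missed attribute's error visible in the gradient.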
3.5 Objective Function
The generator and discriminator are trained alternately by minimizing the generator loss and the discriminator loss, respectively. The generator loss of the whole network contains an unconditional and a conditional adversarial loss, the contrastive loss, the perceptual loss, and a text-image matching loss. Here, the real image is sampled from the original image distribution, the generated image is sampled from the training model distribution, and the loss weights are hyperparameters controlling the different terms. The matching loss calculates the matching score between image and text using a smoothing factor, the picture features corresponding to each word, and the feature of the whole sentence. The complete discriminator objective combines the adversarial losses with the attribute-level loss, weighted by an additional hyperparameter.
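As an illustration only, the generator objective can be viewed as a weighted sum of its terms. Which weight multiplies which loss is an assumption here, since the extraction lost the symbols; the paper only reports the four empirical values 0.7, 0.6, 1 and 0.9.

```python
def generator_loss(l_adv, l_contrastive, l_perceptual, l_match,
                   lam1=0.7, lam2=0.6, lam3=1.0):
    """Illustrative weighted sum of the generator objectives.
    The lambda names and their assignment to terms are assumptions;
    the fourth reported weight (0.9) is assumed to belong to the
    discriminator's attribute-level term and is not used here."""
    return l_adv + lam1 * l_contrastive + lam2 * l_perceptual + lam3 * l_match
```

With unit losses this evaluates to 1 + 0.7 + 0.6 + 1.0 = 3.3, simply showing how the weights trade off attribute fidelity, background preservation and matching.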
4.1 Dataset and implementation detail
Dataset: Our model is evaluated on the Caltech-UCSD Birds (CUB) [wah2011caltech] and MS COCO [lin2014microsoft] datasets, where the query sentences are the bird descriptions and image captions provided with the datasets. CUB contains 200 bird species with 11,788 images, each of which has 10 sentence descriptions. We pre-encode the sentences with a pretrained text encoder following AttnGAN [xu2018attngan]. The COCO dataset [lin2014microsoft] contains 82,783 training and 40,504 validation images, each of which has 5 corresponding text descriptions including words, phrases and sentences. Both datasets have images to be edited by query sentences with multiple attributes.
Implementation detail. CA-GAN is optimized by Adam [kingma2014adam], and the learning rate is empirically set to 0.0002. The model is trained for 600 and 120 epochs on the CUB and COCO datasets, respectively. The four loss hyperparameters are empirically set to 0.7, 0.6, 1 and 0.9.
Compared methods. The compared state-of-the-art approaches include SISGAN [dong2017semantic], TAGAN [nam2018text], DMIT [yu2019multi], Open-edit [liu2020open], and ManiGAN [li2020manigan]. Note that these methods do not consider attribute-level editing but focus on sentence-level editing.
Evaluation Metric. We use the Fréchet Inception Distance (FID) [heusel2017gans] and the Learned Perceptual Image Patch Similarity (LPIPS) [zhang2018unreasonable] as the evaluation metrics. FID calculates the distance between two multidimensional variable distributions, representing image generation quality. LPIPS represents the diversity of the generated images by calculating the distance between features extracted from an AlexNet pre-trained on ImageNet. We also report performance by human ranking on each dataset. We test editing accuracy (Acc.) and realism (Real) by randomly sampling 100 images under the same conditions and collecting more than 20 surveys from different participants. Specifically, for the COCO dataset, each pair of image and sentence is from the same category.
4.2 Comparisons with the state-of-the-arts
Qualitative comparison. Fig. 3 and Fig. 4 show the editing comparisons on the CUB and COCO datasets. It can be seen that, on CUB, the three methods SISGAN, TAGAN and DMIT are unable to generate bird contour information well and lead to large blurred areas. This can be particularly observed in Fig. 4, where the tree trunk color is changed, the black eye orbits are lost, and the orange head is modified. Although ManiGAN shows good image editing performance to some extent, it cannot edit multiple attribute regions well, and the background regions change. We believe this is because ManiGAN is not trained to split and compare attribute features, and it lacks an attribute-level discriminator to offer training feedback related to each attribute in the sentence.
Quantitative comparison. As shown in Table 2, compared with several existing state-of-the-art methods, our method achieves the best FID and LPIPS values on both CUB and COCO, implying that it produces high-quality edited images. Moreover, our method also obtains the highest values for Acc. and Real, indicating that people find our editing more accurate and our images more realistic.
|Ours||Ours w/o SPAC||Ours w/o CA||Ours w/o AD|
4.3 Ablation Studies
In this section, we evaluate the main modules in CA-GAN and analyze their impact. The results can be seen in Fig. 5(b) and Table 3. First, without the Sentence Parsing and Attribute Combination (w/o SPAC) module, we instead combine every two neighboring words in the sentence to construct an editable attribute. As shown in Fig. 5(b), in the absence of our rules, randomly dividing sentences may produce random editing with no relevance to the query sentence, achieving the worst FID and LPIPS. Second, we evaluate the effect of the Contrastive Attention (w/o CA) module by using the whole sentence as input and removing the contrastive attention module and the contrastive loss. The resulting degradation confirms our hypothesis that, by generating different editable information through sentence parsing and attribute combination, the differences between attributes are amplified by the contrastive attention, which helps the model focus on the corresponding attributes in a given sentence. Finally, we evaluate the effectiveness of the proposed Attribute-level Discriminator (w/o AD): without it, the model cannot effectively operate on the image content based on the sentence information. For example, in Fig. 5(b), the bird's torso shows blurring and artifacts, and the color of the feathers changes considerably. This indicates that, lacking attribute-level training feedback, the generator fails to decompose the different visual attributes and thus cannot establish the attribute-region connections needed to edit the images.
4.4 Visualization of contrastive attention and failure cases
In this section, we visualize the attention maps corresponding to the different attributes in the CMAM from the third stage in Fig. 5(a). We observe that, after sentence parsing and attribute combination, the model generates more accurate attention maps based on the attribute information, with more accurate positions, finer shapes, and better semantic consistency between the attributes and the editable content. We show some failure cases on COCO in Fig. 6. We find that the description semantics may be ambiguous when multiple categories are present, and the model may then fail to edit the image.
In this paper, we studied the task of Sentence-based Image Editing. In contrast to existing methods that cannot produce accurate editing in case of a query sentence with multiple editable attributes, we proposed to enhance the difference between attributes and attained much better performance consequently. Particularly, we developed a novel model called CA-GAN by designing a contrastive attention mechanism on Generative Adversarial Network. We first parsed attributes from the sentence with POS Tagging and generated different attribute combinations. Then, a contrastive attention module was built to enlarge the editing difference between the combinations. Last, we constructed an attribute discriminator to ensure the effective editing on each attribute. Extensive experiments show that our model can lead to effective editing for sentences with multiple attributes on CUB and COCO datasets.
This work was supported by the Natural Science Foundation of China (No. 61876121), Primary Research and Development Plan of Jiangsu Province (No. BE2017663), Natural Science Foundation of Jiangsu Province (19KJB520054), Graduate Research Innovation Program of Jiangsu Province (SJCX20-1119), Scientific Research Project of School of Suzhou Institute of Trade and Commerce (KY-ZRA1805), National Natural Science Foundation of China(No. 61876155), Jiangsu Science and Technology Programme (Natural Science Foundation of Jiangsu Province) (No. BE2020006-4), Key Program Special Fund in XJTLU (No. KSF-T-06).
This document provides supplementary material for the paper “Each Attribute Matters: Contrastive Attention for Sentence-based Image Editing”, published at the British Machine Vision Conference (BMVC) 2021. In this material, we provide further discussion and illustration of some details, and show more examples of SIE.
A.1 Evaluation Metrics
In our experiments, we use the Fréchet Inception Distance (FID) [heusel2017gans] and the Learned Perceptual Image Patch Similarity (LPIPS) [zhang2018unreasonable] as the evaluation metrics.
The FID of an edited image compared to its origin is evaluated by passing both through a pre-trained Inception-v3 [fu2019dual] and computing the distribution difference of the average-pooled features. FID can be computed by
$$\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \mathrm{Tr}\big(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\big),$$
where $\mu_r$ and $\mu_g$ represent the feature means of the real and the generated images, and $\Sigma_r$ and $\Sigma_g$ represent the covariance matrices of their features. The smaller the FID value, the closer the distribution of the generated images to that of the real images.
We also use LPIPS to calculate the perceptual distance between two images. Perceptual distance [zhang2018unreasonable] refers to the visual similarity of two images; its purpose is to evaluate similarity by imitating the human visual senses. We extract features from the $l$-th layer of the network and unit-normalize them in the channel dimension, denoted $\hat{y}_l$ and $\hat{y}_{0,l}$, where $H_l \times W_l$ is the feature size at layer $l$; the squared distance between unit-normalized features is equivalent to a cosine distance. LPIPS can be computed by
$$\mathrm{LPIPS}(x, x_0) = \sum_l \frac{1}{H_l W_l} \sum_{h,w} \big\lVert \hat{y}_l^{hw} - \hat{y}_{0,l}^{hw} \big\rVert_2^2.$$
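A simplified LPIPS-style distance can be sketched as below. The full metric also applies learned per-channel weights before the squared difference; these are omitted here as a deliberate simplification.

```python
import math

def unit_normalize(v):
    """Normalize a channel vector to unit length (zero vectors pass through)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def lpips_like(feats_a, feats_b):
    """LPIPS-style perceptual distance sketch: per layer, unit-normalize the
    channel features at each spatial position, accumulate squared differences,
    and average over spatial positions. Learned layer weights are omitted."""
    total = 0.0
    for layer_a, layer_b in zip(feats_a, feats_b):      # one entry per layer
        d = 0.0
        for pos_a, pos_b in zip(layer_a, layer_b):      # one entry per position
            ua, ub = unit_normalize(pos_a), unit_normalize(pos_b)
            d += sum((x - y) ** 2 for x, y in zip(ua, ub))
        total += d / len(layer_a)
    return total

# Identical features give zero distance:
d0 = lpips_like([[[1.0, 2.0]]], [[[1.0, 2.0]]])  # → 0.0
```

Because the features are unit-normalized first, only directional (perceptual) differences contribute, not raw magnitudes.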
A.2 Sentence Parsing Strategy
In this paper, we propose a strategy to effectively parse sentence into different attributes, thus to facilitate the subsequent data augmentation and the construction of contrastive learning.
After the POS tagging of a sentence, we record the lexical property of each word. To separate the different attributes in a sentence, we use the following 5-step strategy: 1) screen the words that can form attributes; 2) handle the adjective attribution in the “bird has” case: divide “bird” into its own attribute and reset both states to 0; 3) if the word is a noun other than “bird”, set the noun state to 1; 4) if the word is an adjective and is not followed by a conjunction, set the adjective state to 1; 5) when both states are 1, the attribute is divided. The detailed algorithm can be seen in Algorithm 1. Finally, we obtain the divided attributes.
A.3 Discussion of hyperparameters
The generator and discriminator are trained alternately by minimizing the generator loss and the discriminator loss, respectively. In the generator, the loss weights control the editing of the different attributes, the invariance of the background, and text-image matching. In the discriminator, a weight controls the detection of attribute-level information.
The proposed algorithm is governed by four hyperparameters: three are used in the generator to balance the generation of different attributes and to preserve irrelevant backgrounds. Our model is based on AttnGAN [xu2018attngan], so for the matching-loss hyperparameter we follow its initial value and do not adjust it. In the discriminator, one weight controls whether each attribute is present in the image. Table 4 shows the sensitivity analysis of the hyperparameters on the CUB dataset. As a rule of thumb, we start from 1 and calculate the FID and LPIPS values for each setting. We found that the model works better when the weights are in the range 0.5 to 1.