Faces à la Carte: Text-to-Face Generation via Attribute Disentanglement

06/13/2020 ∙ by Tianren Wang, et al. ∙ The University of Queensland

Text-to-Face (TTF) synthesis is a challenging task with great potential for diverse computer vision applications. Compared to Text-to-Image (TTI) synthesis tasks, the textual description of faces can be much more complicated and detailed due to the variety of facial attributes and the parsing of high dimensional abstract natural language. In this paper, we propose a Text-to-Face model that not only produces images in high resolution (1024x1024) with text-to-image consistency, but also outputs multiple diverse faces to cover a wide range of unspecified facial features in a natural way. By fine-tuning the multi-label classifier and image encoder, our model obtains the vectors and image embeddings which are used to transform the input noise vector sampled from the normal distribution. Afterwards, the transformed noise vector is fed into a pre-trained high-resolution image generator to produce a set of faces with the desired facial attributes. We refer to our model as TTF-HD. Experimental results show that TTF-HD generates high-quality faces with state-of-the-art performance.




I Introduction

With the advent of Generative Adversarial Networks (GAN) [1], image generation has made huge strides in terms of both image quality and diversity. However, the original GAN model [1] cannot generate images tailored to meet design specifications. To this end, many conditional GAN models have been proposed to fit different task scenarios [2, 3, 4, 5, 6, 7, 8]. Among these works, Text-to-Image (TTI) synthesis is a challenging yet less studied topic. TTI refers to generating a photo-realistic image which matches a given text description. As an inverse image captioning task, TTI aims to establish an interpretable mapping between image space and the text semantic space. TTI has huge potential and can be used in many applications including photo editing and computer-aided design. However, natural language is high-dimensional information which is often less specific but also much more abstract than images. Therefore, this research problem is quite challenging.

Just like TTI synthesis, the sub-topic of Text-to-Face (TTF) synthesis has practical value in areas such as crime investigation and biometric research. For example, the police often need professional artists to sketch pictures of suspects based on the descriptions of eyewitnesses. This task is time-consuming, requires great skill and often results in inferior images. Many police forces may not have access to such professionals. However, with a well-trained Text-to-Face model, we could quickly produce a wide diversity of high-quality photo-realistic pictures based simply on the descriptions of eyewitnesses. Moreover, TTF can be used to address the emerging issue of data scarcity arising from the growing ethical concerns regarding informed consent for the use of faces in biometrics research.

A major challenge of the TTF task is that the linkage between face images and their text descriptions is much looser than for, say, the bird and flower images commonly used in TTI research. A few sentences of description are hardly adequate to cover all the variations of human facial features. Also, for the same face image, different people may use quite different descriptions. This increases the challenge of finding mappings between these descriptions and the facial features. Therefore, in addition to image quality and text-to-image consistency, a TTF model should have the ability to produce a group of images with high diversity conditioned on the same text description. In a real-world application, a witness could choose the picture among these output images which they think is the closest to the appearance of the suspect. This feature is also important for biometric researchers to get sufficient data from rare ethnicities and demographics when synthesising ethical face datasets that do not require informed consent.

Therefore, to meet these demands, we propose a novel TTF framework that satisfies three requirements: 1) high image quality; 2) improved consistency between synthesised images and their descriptions; and 3) the ability to generate a group of diverse faces from the same text description.

To achieve these goals, we use a pre-trained BERT [9] multi-label model for natural language processing. This model outputs sparse text embeddings of length 40. We fine-tune a pre-trained MobileNets [10] model using CelebA’s [11] training data, where images have paired labels, and use it to predict the labels of input images. Next, we construct a feature space with 40 orthogonal axes based on the noise vectors and the predicted labels. After this operation, the input noise vectors can be moved along a specified axis to render output images which have the desired features. Finally, we use the state-of-the-art image generator, StyleGAN2 [12], which maps the noise vectors into a feature-disentangled latent space, to generate high-resolution images. As Fig.1 shows, the synthesised images match the features of the description with good diversity and image quality.

Our work has the following main contributions.

  • We propose a novel TTF-HD framework that comprises a text multi-label classifier, an image label encoder, and a feature-disentangled image generator to generate high-quality faces with a wide range of diversity.

  • In addition, we add a novel design to the framework: a 40-label orthogonal coordinate system to guide the trajectory of the input noise vector.

  • Last but not least, we use the state-of-the-art StyleGAN2 [12] as our generator to map the manipulated noise vectors into the disentangled feature space to generate our 1024×1024 high-resolution images.

The rest of this paper is organised as follows. In Section 2, we review the important works in TTI, TTF, and generator models. In Section 3, we describe our proposed framework in detail. In Section 4, experimental results are presented both qualitatively and quantitatively. An ablation study is also conducted to show the importance of the vector manipulation operations. In Section 5, we conclude our work by summarising our contributions and the limitations of the approach for future research.

Figure 2: TTF-HD diagram. The text is fed into the multi-label classifier, which outputs a text vector representing 40 facial attributes. The image generator first synthesises an image from a random noise vector. The image encoder then outputs the image embeddings of that image. The differentiated embedding is used to manipulate the original noise vector into an updated noise vector. Finally, the generator synthesises an image with the desired features from the updated noise vector.

II Related Works

II-A Text-to-image Synthesis

In the area of TTI, Reed et al. [6] first proposed a GAN-based model that includes a text encoder and an image generator and concatenates the text embedding to the noise vector as input. Unfortunately, the model failed to establish good mappings between the keywords and the corresponding image features. Besides, because the final results were generated directly from the concatenated vectors, the image quality was poor and the images could easily be discerned as fake. To address these two issues, StackGAN [7] proposed to generate images hierarchically by utilising two pairs of generators and discriminators. Later, Xu et al. proposed AttnGAN [8]. By introducing the attention mechanism, the model successfully matched the keywords with the corresponding image features. Their interpolation experiments indicated that the model could correctly render the image features according to the selected keywords. The model works remarkably well in translating bird and flower descriptions. However, such descriptions are mostly just one sentence long. If the descriptions have more sentences, the efficacy of the text encoder deteriorates because the attention map becomes harder to train.

II-B Text-to-face Synthesis

Compared to the number of works in TTI, the published works in TTF are far fewer. The main reason is that a face description has a much weaker connection to facial features compared to that of, say, bird or flower images. Typically, the descriptions of birds and flowers are mostly about the colour of feathers and petals. Descriptions of faces can be much more complicated with gender, age, ethnicity, pose, and other facial attributes. Moreover, most of the TTI models are trained with Oxford-102 [13], CUB [14], and COCO [15] which are not face image datasets. On the other hand, the only face dataset that is suitable is Face2text [16] which has just five thousand pairs of samples, which is not sufficient for training a satisfactory model.

Despite all the challenges mentioned above, there are still several inspiring works on text-to-face synthesis. In a project named T2F [17], Akanimax proposed to encode the text descriptions into a summary vector using an LSTM network. ProGAN [18] was adopted as the generator of the model. Unfortunately, the final output images exhibited poor image quality. Later, the author improved his work, which he named T2F 2.0, by replacing ProGAN with MSG-GAN [19]. As a result, the image quality and image-text consistency improved considerably, but the output showed low diversity in facial appearance. To address the data scarcity issue, O.R. Nasir et al. [20] proposed to utilise the labels of CelebA [11] to produce structured pseudo text descriptions automatically. In this way, the samples in the dataset are paired with sentences which contain the positive feature names separated by conjunctions and punctuation. The results are 64×64 pixel images showing a certain degree of diversity in appearance. The best output image quality so far is from [23], which also adopted the model structure of AttnGAN [8]. Therefore, this work has the same issues with text encoding mentioned previously.

II-C Feature-disentangled Latent Space

Conventionally, the generator produces random images from noise vectors sampled from a normal distribution. However, we want to control the rendering of the images in response to the feature labels. To do this, Chen et al. [24] proposed to disentangle the desired features by maximising the mutual information between the latent code of the desired features and the noise vector. In their experiments, they introduced a variational distribution to approximate the intractable posterior of the latent code. Changing the value of a single dimension of the latent code shows that it has learned interpretable information. However, the latent code in that work has only 3 or 4 dimensions, whereas we require 40 features, which is much more complicated. Later, Karras et al. [21] established a novel style-based generator architecture, named StyleGAN, which, unlike previous works, does not feed the noise vector directly into the generator. Instead, the input vector is mapped into an intermediate latent space through a non-linear network before being fed into the generator network. The non-linear network consists of eight fully connected layers. A benefit of such a setting is that the latent space does not have to support sampling according to any fixed distribution [21]. In other words, we have more freedom to combine the desired features.

III Proposed Method

Our proposed model, named TTF-HD, comprises a multi-label classifier, an image encoder, and a generator, as shown in Fig.2. Details are discussed in the following subsections.

III-A Multi-label Text Classification

To conduct the TTF task, it is of vital importance to have sufficient facial attribute labels to best describe a face. We propose to use the CelebA [11] dataset, which includes 40 facial attribute labels for each face. To map free-form natural language descriptions to the 40 facial attributes, we fine-tune a multi-label text classifier to get text embeddings of length 40. With these considerations, we adopt the state-of-the-art natural language processing model, Bidirectional Encoder Representations from Transformers (BERT) [9]. Since this is a 40-label classification task, we choose the large variant of the BERT model for its stronger fitting ability on high-dimensional training data. Some features have different names for opposites. For example, when training the classifier, the feature “age” could be represented by either “young” or “old”, where “young” would be a value close to 0 and “old” would be a value close to 1. If a feature isn’t specified, it is set to 0. This process is shown in Fig.3. Finally, the classifier outputs a text vector of length 40 for each description.

Figure 3: A possible classification result of the text classifier.

Note that the text classifier has one advantage over the traditional text encoders of previous works: there is no restriction on the length of the text descriptions. In previous works, the input to the text encoder is mostly limited to one or two sentences. Face descriptions, however, are longer than bird and flower descriptions, which makes traditional text encoders less appropriate.

III-B Image Multi-label Embeddings

In the proposed framework, an image encoder is required to predict the feature labels of the generated images. To do this, we fine-tune a MobileNet model [10] with the samples of CelebA [11]. The reason for choosing MobileNet is that it is a light-weight network which offers a good trade-off between accuracy and speed. With this model, we can obtain the image embeddings of the images generated from the noise vectors; these embeddings have the same length as the text vectors.

III-C Feature Axes

After training the image encoder, we can find the relationship between the noise vectors and the predicted feature labels by logistic regression. The noise vectors have length 512 and the feature vectors have length 40. The relationship (1) is therefore described by a coefficient matrix of dimension 512×40, with one column per facial attribute, which is the quantity to be solved.
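The fitting step can be illustrated with a minimal sketch in plain Python. It fits a single column of the coefficient matrix by logistic regression with batch gradient descent on toy data. The function name `fit_feature_axis`, the learning rate, the epoch count and the 8-dimensional toy vectors are assumptions made for illustration; the source does not state how the regression was solved.

```python
import math
import random


def fit_feature_axis(noise_vectors, labels, lr=0.5, epochs=200):
    """Fit one logistic-regression weight vector (one column of the axis matrix).

    noise_vectors: list of equal-length lists of floats.
    labels: list of 0/1 attribute labels predicted by the image encoder.
    Hypothetical helper: the optimiser settings are illustrative only.
    """
    dim = len(noise_vectors[0])
    n = len(noise_vectors)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for z, y in zip(noise_vectors, labels):
            logit = sum(wi * zi for wi, zi in zip(w, z)) + b
            p = 1.0 / (1.0 + math.exp(-logit))
            err = p - y  # gradient of the log loss with respect to the logit
            for i in range(dim):
                grad_w[i] += err * z[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Toy data: 8-dimensional noise stands in for the 512-dimensional vectors,
# and the attribute is (by construction) driven by coordinate 2 only.
rng = random.Random(0)
dim = 8
zs = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(400)]
ys = [1 if z[2] > 0 else 0 for z in zs]
axis, bias = fit_feature_axis(zs, ys)
```

Repeating the fit once per attribute would give the 40 columns of the matrix. In this toy setting the learned direction is dominated by the coordinate that actually drives the label.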

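This matrix needs to be orthogonalised because we need to disentangle all the attributes so that the noise vectors can move along a certain feature axis without affecting the others. By the Gram-Schmidt process, the projection operator is:

$$\mathrm{proj}_{u}(v) = \frac{\langle v, u \rangle}{\langle u, u \rangle}\, u \qquad (2)$$

where $v$ is the axis to be orthogonalised and $u$ is the reference axis. Then, we can obtain:

$$u_k = v_k - \sum_{j=1}^{k-1} \mathrm{proj}_{u_j}(v_k) \qquad (3)$$

In (3), $v_k$ is the $k$-th column of the matrix and $u_k$ is the corresponding orthogonalised axis. Each $u_k$ is then normalised to unit length, so that the matrix of feature axes has orthonormal columns.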

After these steps, we get the feature axes which are used to guide the update direction of the input noise vectors to obtain the desired features in the output images.
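The orthogonalisation described above can be sketched as follows. This is a generic Gram-Schmidt routine in plain Python, not the authors' code. Each axis is normalised as soon as it has been orthogonalised, which yields the same orthonormal axes as projecting first and normalising afterwards. The 16-dimensional vectors and the 5 axes are toy stand-ins for the 512 dimensions and 40 attributes.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gram_schmidt(axes):
    """Orthonormalise a list of vectors, processing them in the given order."""
    basis = []
    for v in axes:
        u = list(v)
        for q in basis:
            # q has unit length, so <u, q> / <q, q> reduces to <u, q>.
            coeff = dot(u, q)
            u = [ui - coeff * qi for ui, qi in zip(u, q)]
        norm = math.sqrt(dot(u, u))
        if norm < 1e-12:
            raise ValueError("axes are linearly dependent")
        basis.append([ui / norm for ui in u])
    return basis


rng = random.Random(1)
raw_axes = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(5)]
ortho_axes = gram_schmidt(raw_axes)
```

Because the axes are processed in order, the first axis keeps its direction and later axes lose whatever they shared with earlier ones, so the ordering of attributes affects the result.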

III-D Noise Vector Manipulation

Manipulating the noise vectors is vital to our work because it determines whether the output images will have the features described in the text. In the model diagram Fig. 2, this is the process of changing the random noise vector into its updated version by (4), in which a column vector determines the direction and magnitude of the movement along the feature axes.


To ensure that the model will produce an image of desired features no matter where the noise vectors are in the latent space, we introduce four operations.

Differentiation. As shown in Fig.2, the text classifier outputs a text embedding, and the image encoder predicts an image embedding for the image generated from the initial random vector. Intuitively, we could use the text embedding alone to guide the movement of the noise vectors along the feature axes. However, its values lie in [0, 1]. This means that the model cannot render features in opposite directions, say, young versus old, because there are no labels of negative value. To solve this, we guide the feature editing with differentiated embeddings, obtained by (5) by subtracting the image embedding from the text embedding.


In this way, the noise vectors can be moved in both the positive and the negative direction along the feature axes because the value range of the differentiated embeddings is [-1, 1]. For features which have a similar probability value in the text embeddings and the image embeddings, the values cancel out, so these features are not rendered again in the output images. This operation is shown in Fig.2. For each feature, depending on its probability value in the text embedding and in the image embedding, the movement can be positive, negative or cancelled out.

Note that to minimize interference of the unspecified features in the text descriptions, we will not apply the differentiation operation to such features and we keep their value as zero in the differentiated embeddings.
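A minimal sketch of the differentiation step and of the resulting movement along the feature axes is given below. The boolean `specified` mask, the function names and the tiny 3-attribute, 4-dimensional example are assumptions made for illustration; the source does not say how unspecified features are tracked in practice.

```python
def differentiate(text_vec, image_vec, specified):
    """Target minus current attribute probability, zeroed where unspecified."""
    return [t - e if s else 0.0
            for t, e, s in zip(text_vec, image_vec, specified)]


def move_along_axes(z, axes, weights, step=1.0):
    """Shift z by step * sum_k weights[k] * axes[k]."""
    moved = list(z)
    for axis, w in zip(axes, weights):
        moved = [m + step * w * a for m, a in zip(moved, axis)]
    return moved


# Attribute 0 is requested, attribute 1 is negated, attribute 2 is not mentioned.
text_vec = [1.0, 0.0, 0.0]
image_vec = [0.2, 0.9, 0.6]        # what the encoder sees in the initial image
specified = [True, True, False]    # hypothetical mask derived from the text

diff = differentiate(text_vec, image_vec, specified)

# Three orthonormal toy axes in a 4-dimensional noise space.
axes = [[1.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.0]]
z = [0.5, 0.5, 0.5, 0.5]
z_new = move_along_axes(z, axes, diff)
```

Here the requested attribute gets a positive push, the negated attribute a negative push, and the unmentioned attribute is left where the random sample put it, which is what preserves diversity across samples.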

Reweighting. In the differentiated embeddings, the labels whose values approach -1 or 1 correspond to the features that the text description specifies, in a negative or a positive way. Apart from these labels, there may be other labels with values strictly between -1 and 1 which interfere with the rendering of the desired features. Therefore, we need to give higher weights to the values of the specified features. To do this, we map the differentiated embeddings from their value range of [-1, 1] to a range of angles and then compute the tangent of every element of the mapped embeddings. As a result, values approaching the ends of the value range get a higher weight. In our scenario, the weighted value range is .
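The effect of the tangent mapping can be sketched as follows. The target angle range did not survive in this copy of the text, so the half-width `max_angle` (set here to π/3) is purely an assumption, as is the function name.

```python
import math


def reweight(diff, max_angle=math.pi / 3):
    """Scale values in [-1, 1] to [-max_angle, max_angle] and take the tangent.

    max_angle is a hypothetical parameter; it must stay below pi/2, where
    the tangent diverges.
    """
    if not 0.0 < max_angle < math.pi / 2:
        raise ValueError("max_angle must lie strictly between 0 and pi/2")
    return [math.tan(d * max_angle) for d in diff]


weights = reweight([1.0, 0.5, 0.1, -1.0])
```

Because the tangent is convex on the positive half of the range, a value of 1.0 receives more than twice the weight of a value of 0.5, so strongly specified features dominate weakly indicated ones while the sign of each entry is preserved.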

Figure 4: Images produced with single-sentence input. With less specified labels in the text, the model can generate samples with higher diversity.

Normalisation. As the noise vectors are sampled from a normal distribution, they are more likely to be sampled near the origin of the axes, where the probability density is high. However, the more steps we move the vectors along different feature axes, the larger the distance between the vectors and the origin may become, which leads to more artifacts in the generated images. That is why we need to renormalise the vectors after every movement along the axes. This distance is the norm of the noise vector. Therefore, for the noise vector, we get the renormalised vector with (6).
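A minimal sketch of such a renormalisation is shown below. It assumes the Euclidean norm and rescales the vector to a chosen target norm; both the choice of norm and the target value (the square root of the dimension, a typical norm for a standard normal sample) are assumptions, since the original equation is not available here.

```python
import math


def renormalise(z, target_norm):
    """Rescale z so that its Euclidean norm equals target_norm."""
    norm = math.sqrt(sum(x * x for x in z))
    if norm == 0.0:
        raise ValueError("cannot renormalise the zero vector")
    scale = target_norm / norm
    return [x * scale for x in z]


z_moved = [3.0, 4.0, 12.0]  # toy vector that has drifted away from the origin
z_fixed = renormalise(z_moved, target_norm=math.sqrt(len(z_moved)))
```

Rescaling changes only the length of the vector, so the direction reached by the feature edits is kept while the vector is pulled back to a region the generator has seen during training.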


Feature lock. To make the face morphing process more stable, we apply a feature lock step every time we move the vectors along a certain axis. In other words, the model only uses the axes along which the vectors have already been moved as the basis axes against which the following feature axis is disentangled. For the axes of attributes that are not specified in the textual descriptions, the movement direction and step size are not fixed, which ensures the diversity of the generated images. In this way, the noise vectors are locked only in terms of the features mentioned in the descriptions.
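One possible reading of the feature lock is sketched below: each new axis is orthogonalised only against the axes that have already been used, and is then added to the locked set. The function name and the 3-dimensional example are hypothetical, and the authors' actual procedure may differ.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def lock_and_orthogonalise(axis, locked):
    """Strip components along locked unit axes, normalise, then lock the result."""
    u = list(axis)
    for q in locked:
        coeff = dot(u, q)
        u = [ui - coeff * qi for ui, qi in zip(u, q)]
    norm = math.sqrt(dot(u, u))
    if norm < 1e-12:
        raise ValueError("axis lies in the span of the locked axes")
    u = [ui / norm for ui in u]
    locked.append(u)
    return u


locked = []
first = lock_and_orthogonalise([1.0, 1.0, 0.0], locked)   # first edited feature
second = lock_and_orthogonalise([1.0, 0.0, 1.0], locked)  # disentangled from it
```

Moving along `second` leaves the projection onto `first` unchanged, so a feature that has already been edited stays put while later features are applied.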

III-E High Resolution Generator

The generator we use is a pre-trained StyleGAN2 model [12]. While it still maps noise vectors sampled from the normal distribution to the intermediate latent space, StyleGAN2 reduces small artifacts by revisiting the structure of the network. With this generator, the model can not only synthesise high-resolution images but also render the desired features from the manipulated input vectors.

Figure 5: Image morphing process of each group in ablation study. (A) A group with all operations. (The default setting for TTF-HD) (B) A group with reweighting, differentiation, and normalisation operations. (C) A group with reweighting, differentiation operations. (D) A group with the reweighting operation. (E) Blank group. We fix the noise vector input of each group. The figure shows the morphing process from the random image on the left column to the final output on the right column.

IV Experimental Evaluation

Dataset. The dataset we use is CelebA [11], which contains over 200k face images. Each sample has a paired binary label vector of length 40. In addition, there is a paired text description corpus in which every description has almost 10 sentences. Some descriptions may contain redundant sentences, but every description includes all the features that the paired label vector indicates. We use this dataset to fine-tune the pre-trained multi-label text classifier and the pre-trained image encoder.

Experimental setting. In our evaluation experiments, we randomly choose 100 text descriptions. For each of them, the model randomly generates 10 images. Therefore, the test set has 1000 images in total. As the experiments show, there is significant image morphing when the noise vector moves by about 2 along a given feature-disentangled axis. Thus, we set the step size to 1.2, which multiplies the reweighted output of the differentiated vector. This gives a final weight for the movement along the axis of around 2.

IV-A Qualitative Evaluation

Image quality. Fig.1 also shows the paired descriptions in each group. We can see that most of the generated images are correctly rendered with described features.

Image diversity. To show the proposed method has great feature generalisation ability, we conduct the image synthesis conditioned on the single-sentence description. In other words, apart from the key features that the sentence refers to, the model should generalise the other features in the output. As Fig.4 shows, for each single-sentence description, the proposed model can produce images with high diversity.

IV-B Quantitative Evaluation

In this section, we use three metrics to evaluate the above three criteria respectively. They are Inception Score (IS) [25], which is used in many previous works; Learned Perceptual Image Patch Similarity (LPIPS) [22], which evaluates the diversity of the generated images; and Cosine Similarity (CS), which is widely used in natural language processing to evaluate the similarity of two chunks of a corpus. Due to the lack of source code for most works in the TTF area, such as T2F 2.0 [17], we compare our experimental results with the TTF implementation of AttnGAN [8].
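For reference, cosine similarity between two attribute vectors can be computed as in the sketch below. It assumes that the score compares the text vector of a description with the embedding predicted from a generated image, which the source does not spell out; the 4-dimensional vectors are toy values rather than results from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


text_vec = [1.0, 0.0, 1.0, 0.0]    # attributes requested by the description
image_vec = [0.9, 0.1, 0.8, 0.2]   # attributes predicted from a generated image
score = cosine_similarity(text_vec, image_vec)
```

The score depends only on the direction of the two vectors, so it rewards images whose predicted attributes follow the same pattern as the description regardless of overall confidence levels.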

Methods          IS               CS*      LPIPS
TTF-HD (ours)    1.117 ± 0.127    0.664    0.583 ± 0.002
AttnGAN          1.062 ± 0.051    0.511    -
*Maximum for each group.
Table I: Evaluation results of different models

Table I shows the evaluation results of different models. We can see that the proposed method outperforms AttnGAN [8], one of the state-of-the-art methods, in terms of image quality and text-to-image similarity.

IV-C Ablation Study

In Section 3, we propose four operations to manipulate the noise vector to get the desired features. In this subsection, we conduct the ablation study and discuss the effects of the different operations applied.

To conduct the ablation study, we have 5 experiment settings. We choose one face description and produce 100 random images under each experimental setting respectively. Then, we use the above three metrics to evaluate the effect of different operations.

Fig. 5 shows the morphing process of the generated images. We can see that with the four proposed manipulation operations, Group A finally obtains an output with all the desired features. For the other groups, the final morphed images all suffer from artifacts in the rendering of the face and the background. This is because, after too many moves along the feature axes, the noise vector has been moved to a low-density region of the latent space distribution, which also leads to a mode collapse problem.

Exp. Settings    IS               CS*      LPIPS
Group A          1.122 ± 0.043    0.754    0.634 ± 0.005
Group B          1.116 ± 0.080    0.739    0.608 ± 0.005
Group C          1.187 ± 0.062    0.762    0.603 ± 0.005
Group D          1.101 ± 0.095    0.683    0.521 ± 0.006
Group E          1.102 ± 0.033    0.706    0.532 ± 0.005
*Maximum for each group
Table II: Ablation study evaluation results

Table.II shows the quantitative evaluation metrics on different groups of TTF-HD. We can see that Group A has the best diversity score as well as the second-best performance in terms of IS and CS score. This suggests that applying all operations leads to a good trade-off between image quality, text-to-face similarity and diversity.

V Conclusion

In this paper, we set three main goals in the text-to-face image synthesis task: 1) High image resolution, 2) Good text-to-image consistency, and 3) High image diversity. To this end, we propose a model, named TTF-HD, comprising a multi-label text classifier, an image encoder, a high-resolution image generator, and feature-disentangled axes. From the qualitative and quantitative experiment results, we can see the generated images have good image quality, text-to-image similarity, and image diversity.

However, the model is still not entirely robust. In every batch, some images are far more consistent with the text descriptions than others. This is possibly caused by the insufficient accuracy of the text classifier and the image encoder due to the lack of training data. In addition, features in the latent space are still not well disentangled, so when the noise vector is moved along one feature axis, other features which are highly correlated with it may change too. These issues need to be addressed in future research.