The source code for our paper "DotFAN: A Domain-transferred Face Augmentation Network for Pose and Illumination Invariant Face Recognition"
The performance of a convolutional neural network (CNN) based face recognition model largely relies on the richness of labelled training data. Collecting a training set with large variations of a face identity under different poses and illumination changes, however, is very expensive, making the diversity of within-class face images a critical issue in practice. In this paper, we propose a 3D model-assisted domain-transferred face augmentation network (DotFAN) that can generate a series of variants of an input face based on the knowledge distilled from existing rich face datasets collected from other domains. DotFAN is structurally a conditional CycleGAN but has two additional subnetworks, namely a face-expert model (FEM) and a face shape regressor (FSR), for latent code control. While FSR aims to extract face attributes, FEM is designed to capture a face identity. With their aid, DotFAN can learn a disentangled face representation and effectively generate face images of various facial attributes while preserving the identity of augmented faces. Experiments show that DotFAN is beneficial for augmenting small face datasets to improve their within-class diversity so that a better face recognition model can be learned from the augmented dataset.
We regard the proposed DotFAN as a face augmentation approach in which any identity class, whether a minority class or not, can be enriched by synthesizing face samples based on the knowledge learned from rich face datasets in other domains via domain transfer. To this end, DotFAN first learns, from rich datasets, a disentangled facial representation through which the face information can be spanned by various face attribute codes. Then, exploiting this representation, DotFAN generates synthetic face samples neighboring the input faces in the sample space so that the diversity of each face-identity class can be significantly enhanced. As a result, the performance of a face recognition model trained on the enriched dataset can be improved as well.
Utilizing two auxiliary subnetworks, namely a face-expert model (FEM) [6, 26] and a face shape regressor (FSR), DotFAN operates intrinsically in a 3D model-assisted data-driven fashion. FEM is a purely data-driven subnetwork pretrained on a domain rich in face identities, whereas FSR is driven by a 3D face model and pretrained on another domain with rich poses and expressions. Hence, FEM ensures that synthesized variants of an input face are of the same identity as the input, while FSR, collaborating with the illumination code, enables the model to synthesize faces with various poses, lighting (shadow) conditions, and expressions. (Footnote 1: Although DotFAN can synthesize faces with various expressions, we do not particularly consider it in this work, as we do not find a suitable labeled training set with rich expressions.) In addition, inspired by FaceID-GAN, we use a 3D face model (e.g., 3DMM) to characterize face attributes with only hundreds of parameters. Thereby, the size of FSR, and that of its training set as well, is largely reduced, making it realizable with a light CNN. Furthermore, the loss terms related to FEM and FSR act as regularizers during the training stage. This design protects DotFAN from common issues in data-driven approaches, e.g., overfitting due to a small training dataset.
Moreover, DotFAN differs from FaceID-GAN for the following reasons. First, based on a 3-player game strategy, FaceID-GAN regards its face-expert model as an additional discriminator that needs to be trained jointly with its generator and discriminator in an adversarial training manner. Because its face-expert model assists its discriminator rather than its generator, FaceID-GAN guarantees only the upper bound of identity dissimilarity. This design also prevents its face-expert model from being pretrained and slows down the whole training process. Furthermore, since the face-expert model cannot be pretrained on rich-domain data, it is very difficult to transfer knowledge from a rich dataset to another dataset in an on-line learning manner. On the contrary, DotFAN regards its FEM as a regularizer to guarantee that the identity information is not altered by the generator. Accordingly, FEM can be pretrained on a rich dataset and play the role of an inspector in charge of overseeing identity preservation. This design not only carries out the identity-preserving face generation task, but also stabilizes and speeds up the training process by not intervening in the competition between the generator and the discriminator.
The main contributions of DotFAN are threefold.
We are the first to propose a domain-transferred face augmentation scheme that can easily transfer the knowledge distilled from a rich domain to an anemic domain, while preserving the identity of augmented faces in the target domain.
Through well disentangled facial representation learned from existing face data, DotFAN offers a unique unified framework that can incorporate prominent face attributes (pose, illumination, shape, expression) for face recognition and can be easily extended to other face related tasks.
DotFAN outperforms state-of-the-art methods by a significant margin in face recognition when only small-size training data are available. This makes it a powerful tool for low-shot learning applications.
Recently, various algorithms have been proposed to address the issue of small sample size with dramatic variations in facial attributes in face recognition [5, 24, 23, 29]. This section reviews works on GAN-based image-to-image translation, face generation, and face frontalization/rotation techniques related to face augmentation.
(A) GAN-based image-to-image translation:
GAN and its variants have been widely adopted in a variety of fields, including image super-resolution, image synthesis, image style transfer, and domain adaptation. DCGAN incorporates deep CNNs into GAN for unsupervised representation learning. DCGAN enables arithmetic operations in the feature space so that face synthesis can be controlled by manipulating attribute codes. The concept of generating images with a given condition has been adopted in succeeding works, such as Pix2pix and CycleGAN. Pix2pix requires pair-wise training data to derive the translation relationship between two domains, whereas CycleGAN relaxes this limitation and exploits unpaired training inputs to achieve domain-to-domain translation. After CycleGAN, StarGAN addresses the multi-domain image-to-image translation issue. With the aid of a multi-task learning setting and a domain classification loss, StarGAN’s discriminator minimizes only the classification error associated with a known label. As a result, the domain classifier in the discriminator can guide the generator to learn the differences among multiple domains. Recently, an attribute-guided face generation method based on a conditional CycleGAN was proposed. This method synthesizes a high-resolution face based on a low-resolution reference face and an attribute code extracted from another high-resolution face. Consequently, by regarding faces of the same identity as one sub-domain of faces, we deem that face augmentation can be formulated as a multi-domain image-to-image translation problem that can be solved with the aid of an attribute-guided face generation strategy.
(B) Face frontalization and rotation:
We regard the identity-preserving face rotation task as an inverse problem of the face frontalization technique used to synthesize a frontal face from a face image with arbitrary pose variation. Typical face frontalization and rotation methods, such as FFGAN and FaceID-GAN, synthesize a 2D face via 3D surface model manipulation, including pose angle control and facial expression control. Still, some designs utilize specialized sub-networks or loss terms to reach the goal. For example, based on TPGAN, the pose invariant module (PIM) contains an identity-preserving frontalization sub-network and a face recognition sub-network; another CNN-based design establishes a dense correspondence between paired non-frontal and frontal faces; and the face normalization model (FNM) involves a face-expert network, a pixel-wise loss, and face attention discriminators to generate a face with canonical view and neutral expression. Finally, some methods approach this issue by means of disentangled representations, such as DR-GAN and CAPG-GAN. The former utilizes an encoder-decoder structure to learn a disentangled representation for face rotation, whereas the latter adopts a two-discriminator framework to learn pose and identity information simultaneously.
(C) Data augmentation for face recognition:
To facilitate face recognition, there are several face normalization and data augmentation methods. Face normalization methods aim to align face images by removing the volatility resulting from illumination variations, changes of facial expressions, and different pose angles, whereas data augmentation methods attempt to increase the richness of face images, often in terms of pose angle and illumination conditions, for the training routine. To deal with illumination variations, conventional approaches utilize either physical models, e.g., Retinex theory, or a 3D reconstruction strategy to remove or correct the shadow on a 2D image [10, 33]. Moreover, to mitigate the influence of pose angles, two categories of methods have been proposed, namely pose-invariant face recognition methods and face rotation methods. While the former category focuses on learning pose-invariant features from a large-scale dataset [25, 2], the latter category, including face frontalization techniques, aims to learn the relationship between rotation angle and resulting face image via a generative model [35, 18, 37, 31, 16, 30]. Because face rotation methods are designed to increase the diversity of the viewpoints of face image data, they are also beneficial for augmentation tasks.
Based on these meticulous designs, DotFAN is implemented as a uni-generator conditional CycleGAN, involving an encoder-decoder framework and two sub-networks for learning disentangled attribute codes and triggered by several loss terms, such as cycle-consistency loss and domain classification loss, as will be elaborated later.
The proposed DotFAN is a framework to synthesize face images of one domain based on the knowledge, i.e., disentangled facial representation, learned from others. For a given input face, the generator of DotFAN is trained to synthesize a face based on an input attribute code comprising i) a general latent code extracted from the input face by the general facial encoder, ii) an identity code indicating the face identity, iii) an attribute code describing facial attributes including pose angle and facial expressions, and iv) an illumination code. Through this design, a face image can be embedded via a disentangled representation in a single attribute code. Fig. 2 depicts the flow diagram of DotFAN, where each component will be elaborated in the following subsections.
To obtain a disentangled representation, the attribute code used by DotFAN for generating face variants is derived collaboratively by a general facial encoder, a face-expert sub-network FEM, a shape-regression sub-network FSR, and an illumination code. FEM and FSR are two well pre-trained sub-networks. FEM learns to extract identity-aware features from faces (of each identity) with various head poses and facial expressions, whereas FSR aims to learn pose features based on a 3D model. The illumination code is a one-hot vector specifying either a label-free case (corresponding to data from CASIA) or one of the illumination conditions (associated with the selected Multi-PIE data).
(A) Face-Expert Model (FEM): FEM enables DotFAN to extract the face identity from an input source and to transplant it to synthesized face images. Though face identity extraction is conventionally considered as a classification problem and optimized by using a cross-entropy loss, recent methods, e.g., CosFace and ArcFace, propose to adopt angular information instead. ArcFace maps face features onto a unit hyper-sphere and adjusts between-class distances by using a pre-defined margin value so that a more discriminative feature representation can be obtained. Using ArcFace, FEM ensures not merely a fast training speed for learning face identity but also efficiency in optimizing the whole DotFAN network.
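To make the angular-margin idea concrete, the following is a minimal sketch of ArcFace-style logits in pure Python. It is not the FEM implementation: the function names, the toy two-class setup, and the `margin` and `scale` defaults are illustrative assumptions, and the sketch omits the handling that practical implementations add for angles near the upper end of the range.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean length (projection onto the unit hyper-sphere)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def arcface_logits(feature, class_weights, target, margin=0.5, scale=64.0):
    """Scaled cosine logits with an additive angular margin on the target class.

    margin and scale are illustrative defaults, not values taken from this paper.
    """
    f = l2_normalize(feature)
    logits = []
    for k, w in enumerate(class_weights):
        cos_theta = sum(a * b for a, b in zip(f, l2_normalize(w)))
        cos_theta = max(-1.0, min(1.0, cos_theta))  # guard against rounding error
        if k == target:
            # Enlarge the angle to the true class centre by `margin`,
            # which makes the true class harder to score highly.
            cos_theta = math.cos(math.acos(cos_theta) + margin)
        logits.append(scale * cos_theta)
    return logits

# Toy example: a 2-D feature close to the centre of class 0.
feature = [0.9, 0.1]
weights = [[1.0, 0.0], [0.0, 1.0]]
with_margin = arcface_logits(feature, weights, target=0)
without_margin = arcface_logits(feature, weights, target=0, margin=0.0)
```

The margin lowers only the target-class logit, so the network must pull features closer to their class centre to reach the same cross-entropy loss.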
(B) Face Shape Regressor (FSR): FSR aims to extract face attributes including face shape, pose, and expression. A fully data-driven approach requires learning a CNN model of high complexity to completely characterize the face attributes without a prior model, which implies the need for a large variety of labeled face samples for training, thereby running a high risk of overfitting. Instead, we use a face model-assisted CNN based on the widely adopted 3D Morphable Model (3DMM) to significantly reduce the model size (say, to a light CNN), as 3DMM can fairly accurately characterize the face attributes using only hundreds of parameters. We follow HPEN’s strategy to prepare ground-truth 3DMM parameters of a given face from the CASIA dataset. Then, we train FSR via the Weighted Parameter Distance Cost (WPDC) defined in Eq. (1), with a modified importance matrix, as shown in Eq. (2).
Here, distance-based weighting coefficients are assigned to the individual components of the 3DMM parameter vector, which includes a vectorized rotation matrix, a translation vector, a shape vector, and an expression vector. Note that 3DMM expresses a face as the mean face plus a linear combination of the PCA basis spanning shape information and a linear combination of the basis for facial expressions, each scaled by its own weighting vector. While training DotFAN, the component that represents facial shape is unchanged, and the components of the translation, rotation, and expression could be replaced by arbitrary values.
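The linear 3DMM model and a weighted parameter distance described above can be sketched as follows. This is a simplified illustration under stated assumptions: all function and variable names are hypothetical, the importance weights are treated as a plain diagonal (one weight per parameter), and the modified importance matrix used for FSR is not reproduced here.

```python
def reconstruct_shape(mean_face, shape_basis, expr_basis, alpha_shape, alpha_expr):
    """3DMM linear model: mean face + weighted shape basis + weighted expression basis.

    Each basis is a list of basis vectors, every vector as long as mean_face.
    """
    face = list(mean_face)
    for weight, basis_vec in zip(alpha_shape, shape_basis):
        face = [s + weight * b for s, b in zip(face, basis_vec)]
    for weight, basis_vec in zip(alpha_expr, expr_basis):
        face = [s + weight * b for s, b in zip(face, basis_vec)]
    return face

def weighted_parameter_distance(pred, target, importance):
    """Squared parameter error, with one (diagonal) importance weight per parameter."""
    return sum(w * (p - t) ** 2 for p, t, w in zip(pred, target, importance))

# Toy example with a 3-coordinate "face", one shape basis vector,
# and one expression basis vector (all values are made up).
mean_face = [0.0, 1.0, 2.0]
shape_basis = [[1.0, 0.0, 0.0]]
expr_basis = [[0.0, 0.0, 1.0]]
face = reconstruct_shape(mean_face, shape_basis, expr_basis, [2.0], [-1.0])
cost = weighted_parameter_distance([1.0, 2.0], [0.0, 0.0], [0.5, 0.25])
```

Keeping the shape weights fixed while replacing pose or expression parameters corresponds to editing only some entries of the parameter vector before the face is reconstructed.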
(C) General facial encoder and illumination code:
The general facial encoder is used to capture other features on a face that cannot be represented by the shape and identity codes. The illumination code is a one-hot vector specifying the lighting condition based on which our model synthesizes a face. Note that because CASIA has no shadow labels, the illumination code of a face from CASIA has its leading entries set to zero and only its final, label-free entry set to one; this means to skip shading and to generate a face with the same illumination setting and the same shadow as the input.
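A small sketch of such an illumination code is given below. The helper name and the number of lighting conditions used in the example are assumptions for illustration; only the layout (labelled conditions first, one trailing label-free slot) follows the description above.

```python
def illumination_code(num_conditions, label=None):
    """One-hot code with num_conditions lighting slots plus one trailing label-free slot.

    label=None marks a face without an illumination label (the CASIA case in the text).
    """
    code = [0] * (num_conditions + 1)
    if label is None:
        code[-1] = 1  # label-free: keep the lighting and shadow of the input
    else:
        if not 0 <= label < num_conditions:
            raise ValueError("illumination label out of range")
        code[label] = 1
    return code

# num_conditions=4 is a made-up value for the example.
casia_code = illumination_code(num_conditions=4)
labelled_code = illumination_code(num_conditions=4, label=2)
```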
The generator takes an attribute code as its input to synthesize a face. Described below are the loss terms composing the loss function of our generator.
(A) Cycle-consistency loss:
In our design, we adopt the cycle-consistency loss to retain face contents after performing two transformations dual to each other. The loss compares the input face with the face obtained by translating a synthetic face, derived according to an input attribute code, back to the original attribute code, and it is normalized by the number of pixels. This loss guarantees that our generator can learn the transformation relationship between any two dual attribute codes.
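The cycle-consistency idea can be sketched as below. This is only an illustration: images are flat pixel lists, the per-pixel absolute difference is an assumed choice of norm, and `toy_generator` is a hypothetical stand-in in which the "attribute code" is simply the mean brightness of the image.

```python
def l1_per_pixel(a, b):
    """Mean absolute difference between two images given as flat pixel lists."""
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)

def cycle_consistency_loss(generator, image, source_code, target_code):
    """Translate to target_code, translate back to source_code, compare with the input."""
    translated = generator(image, target_code)
    restored = generator(translated, source_code)
    return l1_per_pixel(image, restored)

def toy_generator(image, code):
    """Stand-in generator: shifts the image so that its mean brightness equals `code`."""
    mean = sum(image) / len(image)
    return [p - mean + code for p in image]

image = [0.2, 0.4, 0.6]  # mean brightness 0.4, used as the source code below

# An invertible generator restores the input, so the loss is (numerically) zero.
invertible_loss = cycle_consistency_loss(toy_generator, image, 0.4, 0.9)

# A generator that discards the content cannot restore it, so the loss is positive.
lossy_loss = cycle_consistency_loss(lambda img, code: [code] * len(img), image, 0.4, 0.9)
```

The second case shows why this term matters: without it, a generator could satisfy the target attribute code while throwing away the content of the input face.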
(B) Pose-symmetric loss:
Based on a common assumption that a human face is symmetrical, a face at a given pose angle and a face at the opposite angle should be mirror images of each other. Consequently, we design a pose-symmetric loss based on which DotFAN can learn to generate faces from either training sample. This pose-symmetric loss is evaluated with the aid of a face mask, which is defined as a function of the 3DMM parameters predicted by FSR and makes this loss term focus on the face region by filtering out the background.
Here, the synthetic face is generated from an attribute code whose pose code is mirrored, while the other three attribute codes are extracted from the input face. The reference is the horizontally flipped version of the input face. In sum, this term measures the norm of the difference between the synthetic face and the horizontally flipped input within a region of interest defined by the mask.
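A minimal sketch of a masked pose-symmetric comparison follows. The absolute-difference norm, the normalization by mask area, and all names and toy values are assumptions made for the example; images and masks are 2-D lists of rows.

```python
def hflip(image):
    """Mirror a 2-D image (list of rows) left to right."""
    return [row[::-1] for row in image]

def masked_l1(a, b, mask):
    """Absolute differences summed over pixels where mask is 1, divided by the mask area."""
    total = sum(m * abs(p - q)
                for row_a, row_b, row_m in zip(a, b, mask)
                for p, q, m in zip(row_a, row_b, row_m))
    area = sum(sum(row_m) for row_m in mask)
    return total / area

def pose_symmetric_loss(synth_mirrored_pose, image, mask):
    """Compare the face synthesized with the mirrored pose against the flipped input."""
    return masked_l1(synth_mirrored_pose, hflip(image), mask)

image = [[1, 2, 3],
         [4, 5, 6]]
mask = [[0, 1, 1],
        [0, 1, 1]]  # hypothetical face region; the left column counts as background

loss_perfect = pose_symmetric_loss(hflip(image), image, mask)
# Errors that fall only in the masked-out background column do not contribute.
loss_background = pose_symmetric_loss([[9, 2, 1], [9, 5, 4]], image, mask)
# An error inside the face region does contribute.
loss_face = pose_symmetric_loss([[3, 2, 2], [6, 5, 4]], image, mask)
```

Because the mask removes the background from the comparison, this term alone places no constraint on what is synthesized outside the face region.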
(C) Identity-Preserving Loss:
We adopt the following identity-preserving loss to ensure that the identity code of a synthesized face is identical to that of the input face.
(D) Pose-consistency loss:
This term guarantees that the pose and expression features extracted from a synthetic face are consistent with the pose code used to generate the synthetic face.
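The identity-preserving and pose-consistency terms share the same pattern: a frozen, pretrained sub-network re-examines the synthesized face, and its output is compared with a reference. The sketch below illustrates that pattern only; the squared Euclidean distance is an assumed choice, and the toy "faces" and stand-in extractors are hypothetical.

```python
def squared_l2(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def identity_preserving_loss(fem, source_face, synthesized_face):
    """Distance between the identity features of the input and of the synthesized face."""
    return squared_l2(fem(source_face), fem(synthesized_face))

def pose_consistency_loss(fsr, synthesized_face, target_pose_code):
    """Distance between the pose code re-extracted from the output and the requested one."""
    return squared_l2(fsr(synthesized_face), target_pose_code)

# Toy stand-ins: a "face" is a dict that directly stores its identity and pose vectors.
toy_fem = lambda face: face["identity"]
toy_fsr = lambda face: face["pose"]

source = {"identity": [1.0, 0.0], "pose": [0.0, 0.0]}
synthesized = {"identity": [1.0, 0.0], "pose": [0.5, 0.0]}

id_loss = identity_preserving_loss(toy_fem, source, synthesized)      # identity kept
pose_loss = pose_consistency_loss(toy_fsr, synthesized, [0.5, 0.0])   # requested pose met
pose_miss = pose_consistency_loss(toy_fsr, synthesized, [0.0, 0.0])   # requested pose missed
```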
By regarding faces of the same identity as one sub-domain of faces, the task of augmenting faces of different identities becomes a multi-domain image-to-image translation problem of the kind addressed in StarGAN. Hence, we exploit an adversarial loss to make augmented faces photo-realistic. To this end, we use the domain classification loss to verify whether a synthesized face is properly classified to the target domain label that specifies its illumination condition. In addition, in order to stabilize the training process, we adopt the loss design used in WGAN-GP.
Here, a trade-off factor weights the gradient penalty; the penalty is evaluated at a point uniformly sampled from the linear interpolation between the input face and the synthesized face; and the source output of the discriminator reflects a distribution over sources. The domain classification loss is evaluated against the ground-truth illumination code of the input face. In sum, the discriminator aims to produce probability distributions over both source and domain labels.
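For orientation, the two discriminator-side terms described above can be written in the standard WGAN-GP and StarGAN forms shown below. The notation is an assumption, since the symbols of the original equations are not available here: $x$ is an input face, $\tilde{x}$ a synthesized face, $\hat{x}$ the interpolated sample, $D_{src}$ and $D_{cls}$ the source and domain outputs of the discriminator, $\lambda_{gp}$ the gradient-penalty trade-off factor, and $l^{ill}_{x}$ the ground-truth illumination code of $x$.

```latex
\begin{align}
\hat{x} &= \epsilon\, x + (1-\epsilon)\, \tilde{x}, \qquad \epsilon \sim U[0,1], \\
\mathcal{L}_{adv} &= \mathbb{E}_{\tilde{x}}\big[D_{src}(\tilde{x})\big]
  - \mathbb{E}_{x}\big[D_{src}(x)\big]
  + \lambda_{gp}\, \mathbb{E}_{\hat{x}}\Big[\big(\lVert \nabla_{\hat{x}} D_{src}(\hat{x}) \rVert_2 - 1\big)^2\Big], \\
\mathcal{L}_{cls} &= \mathbb{E}_{x}\big[-\log D_{cls}\big(l^{ill}_{x} \mid x\big)\big].
\end{align}
```

In this form the discriminator minimizes both terms, while the generator is trained against the adversarial term and with a classification term evaluated on the synthesized face and its target illumination code.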
In order to optimize the generator and alleviate the training difficulty, we pretrained FSR and FEM with corresponding labels. Therefore, while training the generator and the discriminator, no additional label is needed. The full objective functions of DotFAN can be expressed as:
The two loss terms in the discriminator objective are equally weighted, whereas each term in the generator objective has its own weighting factor.
DotFAN is trained jointly on CMU Multi-PIE and CASIA. In Multi-PIE, each identity is captured under 20 different sorts of illumination and 15 different poses. We select images within a chosen range of pose angles and with illumination codes from 0 to 12 to form our first training set. From this training set, DotFAN learns the representative features for a wide range of pose angles, illumination conditions, and resulting shadows. Our second dataset is the whole CASIA set, in which each identity has about 50 images of different poses and expressions. Since CASIA contains a rich collection of face identities, it helps DotFAN learn features for representing identities.
To evaluate the performance of DotFAN on face synthesis, four additional datasets are used: LFW, IJB-A, SurveilFace-1, and SurveilFace-2. We evaluate the performance of DotFAN’s face frontalization on LFW and IJB-A. Besides, because faces in the two SurveilFace datasets are taken in uncontrolled real working environments, they are contaminated by strong backlight, motion blurs, extreme shadow conditions, or influences from various viewpoints. Hence, they mimic real-world conditions and thus are suitable for evaluating the face augmentation performance. The two SurveilFace sets are private data provided by a video surveillance provider. We will make them publicly available after removing personal labels.
We exploit CelebA to simulate the data augmentation process. CelebA provides face images annotated with 40 kinds of diverse binary facial attributes. We randomly select a fixed number of images of each face identity from CelebA to form our simulation set, called “sub-CelebA”, and conduct data augmentation experiments on both CelebA and sub-CelebA by using DotFAN.
Before training, we align the face images in Multi-PIE and CASIA by MTCNN. Structurally, our FEM is a ResNet-50 pretrained on MS-Celeb-1M, and FSR is implemented by a MobileNet pretrained on CASIA. To train DotFAN, each input face is resized to a fixed resolution. Both the generator and the discriminator are optimized with the Adam optimizer. The learning rate is held at its initial value for the first training epochs and begins to decay afterwards.
We verify the efficacy of DotFAN through the visual quality of i) face frontalization and ii) face rotation results.
(A) Face frontalization: First, we verify if the identity information extracted from a frontalized face, produced by DotFAN, is of the same class as the identity of a given source face. Following prior work, we measure the performance by using a face recognition model trained on MS-Celeb-1M. Next, we conduct frontalization experiments on LFW. Fig. 3 illustrates the frontalization results derived by different methods; meanwhile, Tables I and II show the comparison of face verification results of frontalized faces. This experiment set validates that i) compared with other methods, DotFAN achieves comparable visual quality in face frontalization, ii) shadows can be effectively removed by DotFAN, and iii) DotFAN outperforms the other methods in terms of verification accuracy, especially in the experiment on IJB-A shown in Table II, where DotFAN reports a much better true accept rate (TAR) at both FAR@0.001 and FAR@0.01 than existing approaches.
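For readers unfamiliar with the TAR-at-FAR metric used above, the sketch below shows one common way to compute it from similarity scores. It is not the evaluation code behind Tables I and II; the function name, the thresholding convention, and the toy scores are assumptions.

```python
def tar_at_far(genuine_scores, impostor_scores, far):
    """True accept rate at a threshold whose false accept rate does not exceed `far`.

    genuine_scores: similarity scores of same-identity pairs.
    impostor_scores: similarity scores of different-identity pairs.
    """
    ranked = sorted(impostor_scores, reverse=True)
    allowed = int(far * len(ranked))  # number of impostor pairs we may wrongly accept
    # A pair is accepted only if its score is strictly above the threshold.
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    accepted = sum(1 for s in genuine_scores if s > threshold)
    return accepted / len(genuine_scores)

# Toy scores (made up): 10 impostor pairs and 4 genuine pairs.
impostor = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
genuine = [0.92, 0.85, 0.99, 0.5]

tar_loose = tar_at_far(genuine, impostor, far=0.1)   # one false accept allowed
tar_strict = tar_at_far(genuine, impostor, far=0.0)  # no false accept allowed
```

Lowering the allowed FAR raises the threshold, so the TAR can only stay the same or drop, which is why results at FAR@0.001 are harder to achieve than those at FAR@0.01.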
(B) Face rotation: Fig. 4 demonstrates DotFAN’s capability in synthesizing faces of given attributes, including pose angles, facial expressions, and shadows, while retaining the associated identities. The source faces presented in the left-most column in Fig. 4 come from four datasets, i.e., CelebA, LFW, CFP, and SurveilFace. CelebA and LFW are two widely adopted face datasets; CFP contains images with extreme pose angles; and SurveilFace contains faces under varying illumination conditions and faces affected by motion blur. This experiment shows that DotFAN can stably synthesize visually pleasing face images based on 3DMM parameters describing 3D templates. Finally, Fig. 5 shows some synthesized faces with shadows assigned by four different illumination codes. Note that all synthesized faces presented in this paper are produced by the same DotFAN model; no further data-oriented fine-tuning is required.
To evaluate the comprehensiveness of domain-transferred augmentation by DotFAN, we first perform data augmentation on the same dataset by using DotFAN, FaceID-GAN, and StarGAN; then, we compare the recognition accuracy of different MobileFaceNet models, each trained on an augmented dataset, by testing them on LFW and SurveilFace. The StarGAN used in this experiment is trained on Multi-PIE, which is rich in illumination conditions; meanwhile, FaceID-GAN is trained on CASIA to learn pose and expression representations.
Table III. Sub-experiments: (a) Sub-CelebA(3); (b) Sub-CelebA(8); (c) Sub-CelebA(13); (d) CelebA (full CelebA dataset).
Table III summarizes the results of this experiment set. We interpret the results focusing on Sub-experiment (a). In Sub-experiment (a), we randomly select a fixed number of faces of each identity from CelebA to form the RAW training set, namely Sub-CelebA(3). A MobileFaceNet trained on raw Sub-CelebA(3) provides the baseline verification accuracy on LFW and the baseline true accept rate (TAR) at a fixed FAR on SurveilFace-1 and SurveilFace-2. After generating additional face images via DotFAN to double the size of the training set, the verification accuracy on LFW improves, and the TAR values on the SurveilFace datasets are all nearly doubled, as shown in the row named Proposed 1x. This experiment shows that DotFAN is effective in face data augmentation and outperforms StarGAN and FaceID-GAN significantly. Furthermore, when we augment additional faces to quadruple the size of the training set, i.e., Proposed 3x, we obtain only a minor improvement in verification accuracy compared to Proposed 1x. This reflects that the marginal benefit a model can extract from the data diminishes as the number of samples increases when there is information overlap among the data, as also reported in the literature.
Consequently, Table III and Fig. 6 reveal three remarkable points. First, although the improvement in verification accuracy decreases as the size of the raw training set increases, DotFAN achieves a significant performance gain when augmenting a small-size face training set, as demonstrated in all (RAW, Proposed 1x) data pairs. Second, the results obey the law of diminishing marginal utility in economics, as demonstrated in all (Proposed 1x, Proposed 3x) data pairs. That is, a 1x procedure is adequate to enrich a small dataset. Experiments also show that the Proposed 3x procedure seems to reach the upper bound of data richness. Third, by integrating attribute controls on pose angle, illumination condition, and facial expression with an identity-preserving design, DotFAN outperforms StarGAN and FaceID-GAN in domain-transferred face augmentation tasks.
In this section, we verify the effect brought by each loss term. Fig. 7 depicts the faces generated by using different combinations of loss terms. The top-most row shows faces generated by using the full generator loss described in Eq.(9), whereas the remaining rows respectively show synthetic results derived without one certain loss term.
As illustrated in Fig. 7(b), without the identity-preserving loss, DotFAN fails to preserve the identity information although other facial attributes can be successfully retained. By contrast, without the domain classification loss, DotFAN cannot control the illumination condition, and the resulting faces all share the same shadow (see Fig. 7(c)). These two rows evidence that both loss terms are indispensable in the DotFAN design. Moreover, Fig. 7(d) shows some unrealistic faces, e.g., a rectangular-shaped ear in the frontalized face; accordingly, the loss term removed in that row is important for photo-realistic synthesis. Finally, Fig. 7(e)–(f) show that the pose-consistency loss and the pose-symmetric loss are complementary to each other. As long as either of them functions, DotFAN can generate faces of different face angles. However, because the pose-symmetric loss is designed to learn only the mapping between a face and its mirrored counterpart by ignoring the background outside the face region, artifacts may occur in the background region if it works alone, as shown in Fig. 7(e).
We proposed a Domain-transferred Face Augmentation network (DotFAN) for generating a series of variants of an input face image based on the knowledge of disentangled facial representation distilled from huge datasets. DotFAN is designed as a conditional CycleGAN with two extra subnetworks to learn the disentangled facial representation and produce a normalized face so that it can effectively generate face images of various facial attributes while preserving identity of synthetic images. Moreover, we proposed a pose-symmetric loss through which DotFAN can synthesize a pair of pose-symmetric face images directly at once. Extensive experiments demonstrate the effectiveness of DotFAN in augmenting small-size face datasets and improving their within-subject diversity. As a result, a better face recognition model can be learned from an enriched training set derived by DotFAN.
Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 5187–5196.
Image-to-image translation with conditional adversarial networks. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 1125–1134.
In this section, we show i) architectures of DotFAN’s generator, general facial encoder, and discriminator, ii) face examples in the SurveilFace dataset, iii) DotFAN’s capability for disentangled face representation, and iv) face images generated by DotFAN’s data augmentation process.
Listed in Table S4, Table S5, and Table S6 are the network structures of DotFAN’s encoder, generator, and discriminator, respectively.
Specified below are the notations used in Tables S4-S6.
H: Height of the input image.
W: Width of the input image.
N: Number of output channels.
K: Kernel size.
S: Stride size.
P: Padding size.
BN: Batch normalization layer.
Demonstrated in Figure S8 are face examples of the SurveilFace datasets. The two SurveilFace datasets were collected from a working-place surveillance system. Hence, uncontrolled real working environments may result in face photos affected by various extreme conditions, such as strong backlight, motion blurs, extreme shadows, or unconstrained viewpoints. These two datasets mimic the real-world conditions and thus are suitable for evaluating the face augmentation performance.
Figure S9 exhibits synthesized faces to show DotFAN’s capability for disentangled face representation. This experiment aims to show that we can control the face synthesis result by manipulating our attribute code. In Figure S9, each row shows a sequence of faces. Each sequence was derived according to the convex combination, controlled by a scalar parameter, of two input attribute codes, i.e., that of the right-most face and that of the left-most face. In the first row, with the illumination condition being fixed, both the hairdo and the identity information vary smoothly with the scalar parameter. The second row and the third row of Figure S9 show the face interpolation results of controlled pose codes. Because both pose information and expression information are encoded into the pose code, these two sequences evidence that we can control the face synthesis by editing only a segment of the attribute code. Finally, the fourth row shows synthetic faces derived according to linearly interpolated illumination codes. Note that although the illumination code is a one-hot vector, DotFAN can still approximate a shadow that varies almost linearly.
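The interpolation used for each row can be sketched as a convex combination of two code vectors. The function names and the direction of the scalar parameter (which endpoint corresponds to a value of one) are assumptions; the attribute codes here are plain Python lists standing in for the real codes.

```python
def interpolate_codes(code_a, code_b, alpha):
    """Convex combination alpha * code_a + (1 - alpha) * code_b, with alpha in [0, 1]."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(code_a, code_b)]

def interpolation_sequence(code_a, code_b, steps):
    """Evenly spaced codes from code_b (alpha = 0) to code_a (alpha = 1)."""
    return [interpolate_codes(code_a, code_b, i / (steps - 1)) for i in range(steps)]

# Toy example: two one-hot codes, e.g. two illumination codes.
sequence = interpolation_sequence([1.0, 0.0], [0.0, 1.0], steps=3)
```

Interpolating only a slice of the full attribute code (for example, only the pose segment) while copying the remaining entries from one face corresponds to the second and third rows described above.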
Finally, demonstrated in Figure S10 are face augmentation examples derived by different methods.