The character customization system is an important component in Role-Playing Games (RPGs), where players are allowed to edit the facial appearance of their in-game characters according to their own preferences rather than using default templates. This paper proposes a method for automatically creating an in-game character from an input face photo. We formulate the above "artistic creation" process under a facial similarity measurement and parameter searching paradigm by solving an optimization problem over a large set of physically meaningful facial parameters. To effectively minimize the distance between the created face and the real one, two loss functions, i.e. a "discriminative loss" and a "facial content loss", are specifically designed. As the rendering process of a game engine is not differentiable, a generative network is further introduced as an "imitator" to imitate the physical behavior of the game engine, so that the proposed method can be implemented under a neural style transfer framework and the parameters can be optimized by gradient descent. Experimental results demonstrate that our method achieves a high degree of similarity between the input face photo and the created in-game character in terms of both global appearance and local details. Our method was deployed in a new game last year and has now been used by players over 1 million times.
Face-to-Parameter Translation for Game Character Auto-Creation. ICCV 2019
The character customization system is an important component in role-playing games (RPGs), where players are allowed to edit the profiles of their in-game characters according to their own preferences (e.g. a pop star or even themselves) rather than using default templates. In recent RPGs, to improve the player’s immersion, character customization systems are becoming more and more sophisticated. As a result, the character customization process turns out to be time-consuming and laborious for most players. For example, in “Grand Theft Auto Online” (https://www.rockstargames.com/GTAOnline), “Dark Souls III” (https://www.darksouls.jp), and “Justice” (https://n.163.com), to create an in-game character with a desired facial appearance according to a real face photo, a player has to spend several hours manually adjusting hundreds of parameters, even after considerable practice.
A standard workflow for creating a character’s face in RPGs begins with the configuration of a large set of facial parameters. A game engine then takes these user-specified parameters as inputs and generates the 3D face. Arguably, game character customization can be considered as a special case of a “monocular 3D face reconstruction” [2, 34, 37] or a “style transfer” [10, 11, 12] problem. Generating the semantic content and 3D structure of an image has long been a difficult task in computer vision. In recent years, thanks to the development of deep learning techniques, computers are now able to automatically produce images with new styles [12, 25, 16] and even generate 3D structures from a single facial image [34, 37, 39] by taking advantage of deep Convolutional Neural Networks (CNNs).
Unfortunately, the above methods cannot be directly applied in the game environment, for three reasons. First, these methods are not designed to generate parameterized characters, which is essential for most game engines, as they usually take in the customized parameters of a game character rather than images or 3D meshgrids. Second, these methods are not friendly to user interaction, as it is extremely difficult for most users to directly edit 3D meshgrids or rasterized images. Finally, the rendering process of a game engine given a set of user-specified parameters is not differentiable, which further restricts the applicability of deep learning methods in the game environment.
Considering the above problems, this paper proposes a method for the automatic creation of an in-game character according to a player’s input face photo, as shown in Fig. 1. We formulate the above “artistic creation” process under a facial similarity measurement and parameter searching paradigm by solving an optimization problem over a large set of facial parameters. Different from previous 3D face reconstruction approaches [2, 34, 37] that produce 3D face meshgrids, our method creates a 3D profile for a bone-driven model by predicting a set of facial parameters with clear physical significance. In our method, each parameter controls an individual attribute of a facial component, including its position, orientation, and scale. More importantly, our method supports additional user interaction on top of the creation results, where players are allowed to make further refinements to their profiles according to their needs. As the rendering process of a game engine is not differentiable, a generative network is designed as an “imitator” to imitate the physical behavior of the game engine, so that the proposed method can be implemented under a neural style transfer framework and the facial parameters can be optimized by gradient descent; we therefore refer to our method as a “Face-to-Parameter (F2P)” translation method.
As the facial parameter searching in our method is essentially a cross-domain image similarity measurement problem, we take advantage of deep CNNs and multi-task learning to design two kinds of loss functions, i.e. a “discriminative loss” and a “facial content loss” – the former corresponding to the similarity measurement of the global facial appearance and the latter focusing more on local details. Thanks to the all-CNN design, our model can be optimized in a unified end-to-end framework. In this way, the input photo can be effectively converted to a realistic in-game character by minimizing the distance between the created face and the real one. Our method has been deployed in a new game since Oct. 2018 and has now served players over 1 million times.
Our contributions are summarized as follows:
1) We propose an end-to-end approach for face-to-parameter translation and game character auto-creation. To the best of our knowledge, few previous works have addressed this topic.
2) As the rendering process of a game engine is not differentiable, we introduce an imitator by constructing a deep generative network to imitate the behavior of a game engine. In this way, the gradient can smoothly back-propagate to the input so that the facial parameters can be updated by gradient descent.
3) Two loss functions are specifically designed for the cross-domain facial similarity measurement. The proposed objective can be jointly optimized in a multi-task learning framework.
Neural style transfer: The style transfer from one image onto another has long been a challenging task in image processing [10, 11]. In recent years, Neural Style Transfer (NST) has made huge breakthroughs in style transfer tasks [10, 11, 12], where the integration of deep convolutional features makes it possible to explicitly separate the “content” and “style” of input images. Most recent NST models are designed to minimize the following objective:

$$\min_{x} \; \mathcal{L}_{content}(x, x_c) + \lambda \, \mathcal{L}_{style}(x, x_s),$$

where $\mathcal{L}_{content}$ and $\mathcal{L}_{style}$ correspond to the constraints on the image content and the image style, respectively, and $\lambda$ controls the balance of the above two objectives.
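As a concrete illustration of this objective, here is a minimal NumPy sketch with a squared-error content term and Gram-matrix style statistics in the spirit of Gatys et al.; the feature shapes are arbitrary stand-ins for CNN activations, not the actual networks used in this paper:

```python
import numpy as np

def gram_matrix(feat):
    # feat: (C, H, W) CNN feature map; the Gram matrix captures global
    # style statistics by correlating channels with each other
    c, h, w = feat.shape
    f = feat.reshape(c, h * w)
    return f @ f.T / (c * h * w)

def nst_objective(x_feat, content_feat, style_feat, lam=1.0):
    # L = L_content(x, x_c) + lambda * L_style(x, x_s)
    l_content = np.mean((x_feat - content_feat) ** 2)
    l_style = np.mean((gram_matrix(x_feat) - gram_matrix(style_feat)) ** 2)
    return l_content + lam * l_style
```

In a full NST pipeline, this objective would be minimized over the image pixels by gradient descent, with the feature maps recomputed from a fixed CNN at every step.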
Current NST methods can be divided into two groups: the global methods [12, 29, 21, 26, 36, 9, 19, 35, 5] and the local methods [25, 6, 27], where the former measures the style similarity based on the global feature statistics, while the latter performs the patch-level matching to better preserve the local details. To integrate both advantages of global and local methods, the hybrid method 
is proposed more recently. However, these methods are specifically designed for image-to-image translation rather than for a bone-driven 3D face model, and thus cannot be applied to in-game environments.
Monocular 3D face reconstruction: Monocular 3D face reconstruction aims to recover the 3D structure of a human face from a single 2D facial image. The traditional approaches of this group are the 3D morphable model (3DMM) and its variants [32, 3], where a 3D face model is first parameterized and then optimized to match a 2D facial image. In recent years, deep learning based face reconstruction methods have become able to achieve end-to-end reconstruction from a 2D image to 3D meshgrids [39, 34, 37, 8, 40, 38]. However, these methods are not friendly to user interaction, since it is not easy to edit 3D meshgrids directly, and their generated face parameters lack explicit physical meanings. A work similar to ours is Genova et al.’s “differentiable renderer”, in which a parameterized 3D face model is rendered directly by employing a differentiable rasterizer. In this paper, we introduce a more unified solution that differentiates and imitates a game engine, regardless of the type of its renderer and 3D model structure, by using a CNN model.
Generative Adversarial Network (GAN): In addition to the above approaches, GANs have made great progress in image generation [31, 33, 1, 30] and have shown great potential in image style transfer tasks [28, 44, 46, 20, 4]. An approach similar to ours is Tied Output Synthesis (TOS), which takes advantage of adversarial training to create parameterized avatars based on human photos. However, this method is designed to predict discrete attributes rather than continuous facial parameters. In addition, in RPGs, learning to directly predict a large set of 3D facial parameters from a 2D photo leads to parameter ambiguity, since the intrinsic correspondence from 2D to 3D is a “one-to-many” mapping. For this reason, instead of directly learning to predict continuous facial parameters, we frame the face generation under an NST framework by optimizing the input facial parameters to maximize the similarity between the created face and the real one.
Our model consists of an Imitator and a Feature Extractor: the former aims to imitate the behavior of a game engine by taking in the user-customized facial parameters and producing a “rendered” facial image, while the latter determines the feature space in which the facial similarity measurement is performed to optimize the facial parameters. The processing pipeline of our method is shown in Fig. 2.
We train a convolutional neural network as our imitator to fit the input-output relationship of a game engine, so as to make the character customization system differentiable. We adopt a network configuration similar to that of DCGAN for our imitator, which consists of eight transposed convolution layers. The architecture of our imitator is shown in Fig. 3. For simplicity, our imitator only fits the front view of the facial model given the corresponding facial customization parameters.
We frame the learning and prediction of the imitator as a standard deep-learning regression problem, where we aim to minimize the difference between the in-game rendered image and the generated one in their raw pixel space. The loss function for training the imitator is designed as follows:

$$\mathcal{L}_{G}(x) = \mathbb{E}_{x \sim u(x)} \big[ \, \| G(x) - \mathrm{Engine}(x) \|_1 \, \big],$$

where $x$ represents the input facial parameters, $G(x)$ represents the output of the imitator, and $\mathrm{Engine}(x)$ represents the rendering output of the game engine. We use the $l_1$ loss function rather than $l_2$, as $l_1$ encourages less blurring. The input parameters $x$ are sampled from a multidimensional uniform distribution $u(x)$. Finally, we aim to solve:

$$G^{\star} = \arg\min_{G} \; \mathcal{L}_{G}(x).$$
In the training process, we randomly generate 20,000 individual faces with their corresponding facial customization parameters by using the engine of the game “Justice”. 80% of the face samples are used for training and the rest are used for validation. Fig. 4 shows three examples of the “rendering” results of our imitator; the facial parameters of these images were created manually. As the training samples are generated randomly from a uniform distribution over the facial parameters, most characters may look strange (please see our supplementary material). Nevertheless, as we can still see from Fig. 4, the generated face image and the rendered ground truth share a high degree of similarity, even in regions with complex textures, e.g. the hair. This indicates that our imitator not only fits the training data in a low-dimensional face manifold but also learns to decouple the correlations between different facial parameters.
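The imitator’s regression objective can be illustrated on a toy scale. In this sketch, a fixed linear map stands in for the game engine and a learnable linear map stands in for the CNN imitator (the real imitator is a transposed-convolution network); the $l_1$ objective is minimized by subgradient descent, and all sizes and rates are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the game engine's rendering function
W_engine = rng.normal(size=(8, 4))
engine = lambda x: W_engine @ x

# The "imitator": a learnable map trained to match the engine's output
W = np.zeros((8, 4))
lr = 0.02
for _ in range(10000):
    x = rng.uniform(size=4)          # parameters ~ uniform distribution
    residual = W @ x - engine(x)     # G(x) - Engine(x)
    # subgradient step on the l1 loss || G(x) - Engine(x) ||_1
    W -= lr * np.outer(np.sign(residual), x)
```

After training, the learned map reproduces the “engine” output closely for new parameter samples, which is exactly the property that later makes gradient-based parameter search possible.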
Once we have obtained a well-trained imitator, the generation of the facial parameters essentially becomes a face similarity measurement problem. As the input face photo and the rendered game character belong to different image domains, to effectively measure the facial similarity, we design two kinds of loss functions as measurements in terms of both global facial appearance and local details. Instead of directly computing the losses in raw pixel space, we take advantage of neural style transfer frameworks and compute losses in a feature space learned by deep neural networks. The parameter generation can be considered as a searching process on the manifold of the imitator, on which we aim to find an optimal point that minimizes the distance between the imitator’s output and the reference face photo, as shown in Fig. 5.
We introduce a face recognition model as a measurement of the global appearance of the two faces, e.g. the shape of the face and the overall impression. We follow the idea of the perceptual distance, which has been widely applied in a variety of tasks, e.g. image style transfer [21, 24] and feature visualization, and assume that different portraits of the same person should have similar feature representations. To this end, we use a state-of-the-art face recognition model, “Light CNN-29 v2”, to extract the 256-d facial embeddings of the two facial images and then compute the cosine distance between them as their similarity. This loss is referred to as the “discriminative loss” since it predicts whether the face from a real photo and the one from the imitator belong to the same person. Denoting by $F_1$ the embedding extracted by the face recognition model, the discriminative loss of the above process is defined as follows:

$$\mathcal{L}_1(x, y_r) = 1 - \cos\langle F_1(y_r), F_1(G(x)) \rangle,$$

where $G(x)$ is the imitator’s output for facial parameters $x$, $y_r$ is the reference photo, and the cosine distance between two vectors $a$ and $b$ is defined as:

$$\cos\langle a, b \rangle = \frac{a^{T} b}{\|a\|_2 \, \|b\|_2}.$$
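A minimal sketch of the discriminative loss, with plain vectors standing in for the 256-d Light CNN embeddings (the embedding extractor itself is not included):

```python
import numpy as np

def cosine(a, b):
    # cos<a, b> = a^T b / (||a|| * ||b||)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def discriminative_loss(emb_photo, emb_rendered):
    # 1 - cos<F(y_r), F(G(x))>: zero when the embeddings agree,
    # approaching 2 when they point in opposite directions
    return 1.0 - cosine(emb_photo, emb_rendered)
```

Because the loss depends only on embedding direction, it measures identity-level similarity rather than pixel-level agreement, which is what allows it to bridge the photo and game-render domains.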
In addition to the discriminative loss, we further define a content loss by computing a pixel-wise error based on the facial features extracted from a face semantic segmentation model. The facial content loss can be regarded as a constraint on the shape and displacement of different face components in the two images, e.g. the eyes, mouth, and nose. As we care more about modeling facial content than everyday images, the face semantic segmentation network is specifically trained to extract facial image features instead of using off-the-shelf models pre-trained on the ImageNet dataset. We build our facial segmentation model on ResNet-50, removing its fully connected layers and increasing its output resolution from 1/32 to 1/8. We train this model on the well-known Helen face semantic segmentation dataset. To improve the position sensitivity of the facial semantic features, we further use the segmentation results (class-wise probability maps) as pixel-wise weights on the feature maps to construct a position-sensitive content loss. Denoting by $f$ the mapping from an input image to the facial semantic features and by $\omega$ the pixel-wise weights on the features (e.g. the eye-nose-mouth probability map), our facial content loss is defined as follows:

$$\mathcal{L}_2(x, y_r) = \big\| \, \omega(y_r) \odot f(y_r) - \omega(G(x)) \odot f(G(x)) \, \big\|_1,$$

where $G(x)$ is the imitator’s output and $y_r$ is the reference photo.
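A minimal sketch of this position-sensitive weighting, with synthetic feature maps and probability maps standing in for the segmentation network’s outputs (shapes are illustrative):

```python
import numpy as np

def facial_content_loss(feat_photo, feat_rendered, w_photo, w_rendered):
    # Pixel-wise l1 error between semantic features, with each feature map
    # weighted by a class-wise segmentation probability map
    # (e.g. the eye-nose-mouth map), so errors near key facial
    # components count for more than errors elsewhere.
    # feat_*: (C, H, W) features; w_*: (H, W) probability maps
    return float(np.mean(np.abs(w_photo[None] * feat_photo -
                                w_rendered[None] * feat_rendered)))
```

The broadcast of the (H, W) weight map over all C feature channels is what makes the loss sensitive to *where* facial components sit, not just to what features are present.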
The final loss function of our model can be written as a linear combination of the two objectives, the discriminative loss $\mathcal{L}_1$ and the facial content loss $\mathcal{L}_2$:

$$\mathcal{L}(x, y_r) = \mathcal{L}_1(x, y_r) + \alpha \, \mathcal{L}_2(x, y_r),$$

where the parameter $\alpha$ is used to balance the importance of the two tasks. An illustration of our feature extractor is shown in Fig. 6. We use the gradient descent method to solve the following optimization problem:

$$x^{\star} = \arg\min_{x} \; \mathcal{L}(x, y_r),$$

where $x$ represents the facial parameters to be optimized and $y_r$ represents an input reference face photo. A complete optimization process of our method is summarized as follows:
Stage I. Train the imitator $G$, the face recognition network $F_1$, and the face segmentation network $F_2$.
Stage II. Fix $G$, $F_1$, and $F_2$; initialize and update the facial parameters $x$ until reaching the maximum number of iterations:

$$x \leftarrow x - \mu \frac{\partial \mathcal{L}(x, y_r)}{\partial x} \quad (\mu \text{: learning rate}),$$

then project $x$ back to the valid range $[0, 1]$: $x \leftarrow \min(\max(x, 0), 1)$.
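Stage II amounts to projected gradient descent over the facial parameters. Here is a minimal sketch, using a numerical finite-difference gradient as a stand-in for back-propagation through the imitator and feature extractors, and a toy quadratic loss in place of the real objective:

```python
import numpy as np

def optimize_parameters(loss_fn, dim, lr=0.1, iters=50):
    # Initialize with the "average face" (0.5 everywhere), then run
    # projected gradient descent: x <- clip(x - mu * dL/dx, 0, 1)
    x = np.full(dim, 0.5)
    eps = 1e-4
    for _ in range(iters):
        grad = np.zeros(dim)
        for i in range(dim):  # finite-difference gradient (stand-in)
            d = np.zeros(dim); d[i] = eps
            grad[i] = (loss_fn(x + d) - loss_fn(x - d)) / (2 * eps)
        x = np.clip(x - lr * grad, 0.0, 1.0)  # project back to [0, 1]
    return x

# Toy loss with a known minimizer inside [0, 1]
target = np.array([0.2, 0.8, 0.5])
x_opt = optimize_parameters(lambda x: float(np.sum((x - target) ** 2)), dim=3)
```

In the full system, the gradient is obtained by back-propagating the combined loss through the fixed networks rather than by finite differences; the projection step is identical.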
Imitator: In our imitator, the convolution kernel size is set to 4×4, and the stride of each transposed convolution layer (except the first, which uses stride 1) is set to 2, so that the size of the feature maps is doubled after each layer. Batch Normalization and ReLU activation follow every convolution layer except the output layer. We train the imitator with the SGD optimizer; the learning rate is decayed by 10% per 50 epochs, and training stops after 500 epochs.
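The doubling behavior follows from the transposed-convolution output-size formula; here is a quick check against the layer sizes in Table 3 (assuming padding 1 for the stride-2 layers and padding 0 for the first layer — padding values the table itself does not list):

```python
def deconv_out(in_size, kernel=4, stride=2, padding=1):
    # transposed convolution: out = (in - 1) * stride - 2 * padding + kernel
    return (in_size - 1) * stride - 2 * padding + kernel

# Conv_1 maps a 1x1 input to 4x4 with stride 1 and no padding, then
# seven stride-2 layers double the resolution up to 512x512
size = deconv_out(1, kernel=4, stride=1, padding=0)  # first layer -> 4
sizes = [size]
for _ in range(7):
    size = deconv_out(size)
    sizes.append(size)
```

With kernel 4, stride 2, and padding 1, the formula simplifies to out = 2 · in, which reproduces the 4 → 8 → … → 512 progression in Table 3.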
Facial segmentation network: We use ResNet-50 as the backbone of our segmentation network, removing its fully connected layers and adding an additional convolution layer on top. Besides, to increase the output resolution, we change the stride from 2 to 1 at Conv_3 and Conv_4. Our model is pre-trained on ImageNet and then fine-tuned on the Helen face semantic segmentation dataset with the pixel-wise cross-entropy loss. We use the same training configuration as for our imitator, except for the learning rate.
Facial parameters: The dimension of the facial parameters is set to 264 for “male” and 310 for “female”. Among these parameters, 208 take continuous values (such as eyebrow length, width, and thickness) and the rest are discrete (such as hairstyle, eyebrow style, beard style, and lipstick style). The discrete parameters are encoded as one-hot vectors and concatenated with the continuous ones. Since one-hot encodings are difficult to optimize, we use the softmax function to smooth these discrete variables by the following transform:

$$x_i \leftarrow \frac{\exp(\beta x_i)}{\sum_{j=1}^{n} \exp(\beta x_j)},$$

where $n$ represents the dimension of a discrete parameter’s one-hot encoding and $\beta$ controls the degree of smoothness. We set $\beta$ relatively large to speed up optimization. We use an “average face” to initialize the facial parameters, i.e. we set all elements of the continuous part to 0.5 and those of the discrete part to 0. For a detailed description of our facial parameters, please refer to our supplementary material.
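The smoothing transform can be sketched directly; the smoothness coefficient used here is illustrative:

```python
import numpy as np

def smooth_discrete(x, beta=10.0):
    # softmax with coefficient beta: the output is a valid probability
    # vector, and a larger beta pushes it closer to a hard one-hot
    e = np.exp(beta * (x - np.max(x)))  # shift for numerical stability
    return e / np.sum(e)
```

Because the output always sums to 1 and preserves the ordering of the inputs, gradient descent can move freely through the relaxed space while the dominant style choice stays recoverable via argmax.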
Optimization: For the optimization in Stage II, we set the balance weight between the two losses to 0.01, the maximum number of iterations to 50, the learning rate to 10, and its decay rate to 20% per 5 iterations.
Face alignment: Face alignment is performed (using the dlib library) to align the input photo before it is fed into the feature extractor, and we use the rendered “average face” as the alignment reference.
We construct a celebrity dataset with 50 facial close-up photos to conduct our experiments. Fig. 7 shows some input photos and the generated facial parameters, from which the game engine can render an in-game character at multiple views that shares a high degree of similarity with the input photo. For more generated examples, please refer to our supplementary material.
The ablation studies are conducted on our dataset to analyze the importance of each component of the proposed framework, including 1) discriminative loss and 2) facial content loss.
1) Discriminative loss. We run our method on our dataset w/ and w/o the discriminative loss, and further adopt Gatys’ content loss as a baseline. We compute the similarity between each photo and the corresponding generated result by using the cosine distance on the output of the face recognition model, as shown in Fig. 8. We observe a noticeable similarity improvement when the discriminative loss is integrated.
2) Facial content loss. Fig. 9 shows a comparison of the generated faces w/ and w/o the help of the facial content loss. For a clearer view, the facial semantic maps and the edges of the facial components are extracted. In Fig. 9, the yellow pixels of the edge maps correspond to the edges of the reference photo and the red pixels correspond to the generated faces. We observe a better pixel-location correspondence between the input photo and the generated face when the facial content loss is applied.
3) Subjective evaluation. To quantitatively analyze the importance of the two losses, we follow the subjective evaluation method used by Wolf et al. Specifically, we first generate 50 groups of character auto-creation results with different configurations of the similarity measurement loss functions (discriminative loss only, facial content loss only, and both combined) on our dataset. We then ask 15 non-professional volunteers to select the best result in each group, where the three characters are presented in random order. The selection ratio of an output character is defined as the percentage of volunteers who select it as the best in its group, and the overall selection ratio is used to evaluate the quality of the results. The statistics are shown in Tab. 1, which indicates that both losses are beneficial to our method.
| Method | Global style | Local style | 3DMM-CNN | Ours |
|---|---|---|---|---|
| Fréchet Inception Distance | | | | |
| Time (run on TITAN Xp) | 22s | 43s | 15s | 16s |
We compare our method with two popular neural style transfer methods: the global style method and the local style method. Although these methods are not specifically designed for generating 3D characters, we still compare with them since they are similar to our approach in several aspects. First, these methods all measure the similarity of two images based on deep learning features. Second, the iterative optimization in these methods is likewise performed at the input of the networks. As shown in Fig. 10, it is difficult to generate vivid game characters by separating image styles from content and recombining them; the generated images are not exactly sampled from the game-character manifold, so it is hard to apply these methods in RPG environments. We also compare our method with a popular monocular 3D face reconstruction method, 3DMM-CNN, as shown on the right side of Fig. 10. Our auto-created game characters have a high similarity with the inputs, while the 3DMM method can only generate masks with similar facial outlines.
We adopt MS and the Fréchet Inception Distance (FID) as our metrics. For each test image, we randomly select an image from the imitator training set as its reference and compute the average MS and FID over the entire test set. The above operations are repeated 5 times to compute the final mean and standard deviation, as listed in Tab. 2. The runtime of each method is also recorded. Our method achieves higher style similarity and good speed compared with the other methods.
We further evaluate our method under different blurring and illumination conditions, and our method proves robust to these changes, as shown in Fig. 11. The last group gives a failure case of our method: since the facial content loss is defined on local features, our method is sensitive to pose changes.
Not limited to real photos, our method can also generate game characters from artistic portraits, including sketches and caricatures. Fig. 12 shows some examples of the generated results. Although these images come from a totally different distribution, we still obtain high-quality results, since our method measures the similarity based on facial semantics rather than raw pixels.
In this paper, we propose a method for the automatic creation of an in-game character based on an input face photo. We formulate the creation under a facial similarity measurement and parameter searching paradigm by solving an optimization problem over a large set of physically meaningful facial parameters. Experimental results demonstrate that our method achieves a high degree of similarity between the input face photo and the rendered in-game character, in terms of both global appearance and local details, and is robust under varying conditions.
The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
Arbitrary style transfer with deep feature reshuffle. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
Image-to-image translation with conditional adversarial networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research, 10:1755–1758, 2009.
MoFA: Model-based deep convolutional face autoencoder for unsupervised monocular reconstruction. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
Detailed configurations of our Imitator and Face Segmentation Network are listed in Table 3 and Table 4, respectively. For details of the Face Recognition Network, i.e. Light CNN-29 v2, please refer to Wu’s paper.
Specifically, for a Convolution / Deconvolution layer, the notation “n×k×k / s” denotes n filters of size k×k with stride s. For a MaxPool layer, “k×k / s” denotes a pooling window of size k×k with stride s. For a Bottleneck block, “n / s” denotes n planes with block stride s.
|Conv_1||Deconvolution + BN + ReLU||512x4x4 / 1||4x4|
|Conv_2||Deconvolution + BN + ReLU||512x4x4 / 2||8x8|
|Conv_3||Deconvolution + BN + ReLU||512x4x4 / 2||16x16|
|Conv_4||Deconvolution + BN + ReLU||256x4x4 / 2||32x32|
|Conv_5||Deconvolution + BN + ReLU||128x4x4 / 2||64x64|
|Conv_6||Deconvolution + BN + ReLU||64x4x4 / 2||128x128|
|Conv_7||Deconvolution + BN + ReLU||64x4x4 / 2||256x256|
|Conv_8||Deconvolution||3x4x4 / 2||512x512|
|Conv_1||Convolution + BN + ReLU||64x7x7 / 2|
|MaxPool||MaxPool||3x3 / 2|
|Conv_2||3 x Bottleneck||64 / 2|
|Conv_3||4 x Bottleneck||128 / 1|
|Conv_4||6 x Bottleneck||256 / 1|
|Conv_5||3 x Bottleneck||512 / 1|
|Conv_6||Convolution||11x1x1 / 1|
During the training process, we train our imitator with randomly generated game faces rather than regular ones, as shown in Fig. 16. In our experiments, we adopt two imitators to fit the female and male 3D models respectively, in order to auto-create characters of different genders.
Table 5 lists a detailed description of each facial parameter, where “Component” represents the facial part to which the parameters belong, “Controllers” represents the user-adjustable parameters of each facial part (one controller corresponds to one continuous parameter), and “# Controllers” represents the total number of controllers, i.e. 208. Besides, there are 102 additional discrete parameters for female characters (22 hair styles, 36 eyebrow styles, 19 lipstick styles, and 25 lipstick colors) and 56 for male characters (23 hair styles, 26 eyebrow styles, and 7 beard styles).
|Eyebrow||eyebrow-head||horizontal-offset, vertical-offset, slope, …||8||208|
|eyebrow-body||horizontal-offset, vertical-offset, slope, …||8|
|eyebrow-tail||horizontal-offset, vertical-offset, slope, …||8|
|Eye||whole||horizontal-offset, vertical-offset, slope, …||6|
|outside upper eyelid||horizontal-offset, vertical-offset, slope, …||9|
|inside upper eyelid||horizontal-offset, vertical-offset, slope, …||9|
|lower eyelid||horizontal-offset, vertical-offset, slope, …||9|
|inner corner||horizontal-offset, vertical-offset, slope, …||9|
|outer corner||horizontal-offset, vertical-offset, slope, …||9|
|Nose||whole||vertical-offset, front-back, slope||3|
|bridge||vertical-offset, front-back, slope, …||6|
|wing||horizontal-offset, vertical-offset, slope, …||9|
|tip||vertical-offset, front-back, slope, …||6|
|bottom||vertical-offset, front-back, slope, …||6|
|Mouth||whole||vertical-offset, front-back, slope||3|
|middle upper lip||vertical-offset, front-back, slope, …||6|
|outer upper lip||horizontal-offset, vertical-offset, slope, …||9|
|middle lower lip||vertical-offset, front-back, slope, …||6|
|outer lower lip||horizontal-offset, vertical-offset, slope, …||9|
|corner||horizontal-offset, vertical-offset, slope, …||9|
|Face||forehead||vertical-offset, front-back, slope, …||6|
|glabellum||vertical-offset, front-back, slope, …||6|
|cheekbone||horizontal-offset, vertical-offset, slope, …||5|
|risorius||horizontal-offset, vertical-offset, slope, …||5|
|cheek||horizontal-offset, vertical-offset, width, …||6|
|jaw||vertical-offset, front-back, slope, …||6|
|lower jaw||horizontal-offset, vertical-offset, slope, …||9|
|mandibular corner||horizontal-offset, vertical-offset, slope, …||9|
|outer jaw||horizontal-offset, vertical-offset, slope, …||9|