Our experience with communication systems is two-dimensional, mostly via video teleconferencing (e.g., Messenger), which includes both audio and video transmission. Recent studies on videoconferencing have shown that the more closely technology can simulate a face-to-face interaction, the more participants are able to focus, engage, and retain information. A more advanced level of communication with virtual reality (VR) via telepresence [25, 33, 34, 8, 21] will allow virtual presence at a distant location and a more authentic interaction. If successful, this new form of face-to-face interaction can reduce the time and financial commitments of travel and make sales meetings or family meetings more immersive, with a large impact on the environment and on the use of personal time.
Today most real-time systems for avatars in AR/VR are cartoon-like (e.g., Hyprsense, Loom AI); on the other hand, Hollywood has animated nearly uncanny digital humans as virtual avatars using advanced computer graphics technology and person-specific (PS) models (e.g., Siren). While some of these avatars can be driven in real time from cameras, building the PS model is an extremely time-consuming and hand-tuned process that prevents democratization of this technology. This paper makes progress in this direction by generating video-realistic avatars, transferring subtle facial expressions from the headset-mounted camera (HMC) images in a VR headset to a 3D talking head (see Fig. 1).
We build on recent work on codec avatars (CA) that learn a PS model from a Plenoptic study. Note that driving an avatar from HMC cameras is typically more challenging than driving it from regular cameras (e.g., iPhone) [28, 17, 41], due to the domain difference between IR cameras and the texture/shape of the avatar, variability in HMC images due to headset variability (e.g., camera location, IR LED illumination), the high distortion introduced by the near-camera views, and the partial visibility of the face (Fig. 2). Wei et al. proposed an end-to-end deep learning network for learning the mapping between the HMC images and the parameterized avatar. First, this model solves the unknown correspondence between HMC images and the avatar parameters in an unsupervised manner using an eleven-view HMC headset. Second, to animate the CA in real time from three HMC images (i.e., inference), Wei et al. learn an encoder network to regress from 3-view HMC images to the CA's parameters (Fig. 3).
While previous work has reported compelling photo-realistic facial expression transfer results, the existing method has limitations due to the PS nature of the approach. It is time-consuming, expensive and error-prone to capture sufficient statistical variability when collecting PS samples to learn a robust model. Doing so typically requires recording several sessions with variability across lighting, headsets and iconic changes (e.g., makeup, beard), which limits scalability. Our first contribution is to show that the proposed multi-identity architecture (MIA) is able to drive the shape and texture of the trained subjects with better accuracy and robustness than PS models, because it learns to marginalize the variability of nuisance parameters across subjects. In addition, we show how MIA is able to drive the shape of an untrained subject from HMCs using minimal personalized information (i.e., only the neutral shape). This is the basis for any type of stylized avatar. Our second and most important contribution is to show that MIA factorizes nuisance parameters such as camera parameters, facial aesthetic changes (e.g., beard, makeup) and environmental factors (e.g., lighting) from the facial motion (i.e., facial expression). This is critical because the encoder extracts from the HMC images only the information that is relevant to the final task, which is transferring subtle facial expressions, and marginalizes information that is not relevant (headset, facial appearance, environment). More importantly, the method discovers this factorization without any supervision (i.e., without expression correspondence across subjects). Moreover, if the PS texture decoder is available, the model can also transfer the appearance changes to new subjects in a robust manner with minimal training/fine-tuning.
Experimental results on a new dataset, captured across challenging environmental factors and a variety of headsets, show the benefits of our method.
2 Prior Work
While avatar generation has been a common topic of research in computer vision and computer graphics, photo-realistic avatars for VR remain a relatively unexplored problem. This section reviews two related areas: animating stylized and codec avatars, and 3D shape estimation.
2.1 Animating Stylized and Codec Avatars
Animating stylized avatars from video has a long history. For instance, one approach fits a generic 3DMM to the face and uses it to retarget the facial motion to 3D characters. To improve the accuracy, Chaudhuri et al. proposed to learn person-specific expression blendshapes and dynamic albedo maps from the input video of subjects. In another work, facial action unit intensity is estimated in a self-supervised manner by utilizing a differentiable rendering layer to fit the expression and retarget it to the character. In contrast, expression transfer from a VR headset [24, 8, 12, 19] is more challenging due to the partial visibility of the face in HMC images, the specific hardware, and the limited existing data.
CAs animate avatars by estimating the parameters of a PS shape and texture model from HMC images [18, 38, 5, 29] (see Fig. 3). One approach combines real and synthetic HMC images to reduce the domain gap between real HMC images in the IR spectrum and the rendered images used to train the encoder, thereby reducing the HMC-avatar domain gap. Wei et al. utilize a cycle-GAN to achieve accurate cycle consistency between 11-view HMC images and the CA. Then, they train a person-specific regressor from 3-view HMC images to the CA's parameters. Chu et al. propose modular CAs to allow more freedom in animating the eyes and mouth. In a different approach, Richard et al. animate the CA based on gaze direction and audio inputs. The aforementioned methods rely on PS models and are typically not robust to variations in headsets and environments.
2.2 3D Shape Estimation
Early approaches for model-based shape and texture estimation are based on the Active Shape Model (ASM) and the Active Appearance Model (AAM) [6, 20]. The ASM utilizes local appearance models to estimate shape variations of the global shape model. AAM methods, on the other hand, learn a joint holistic model of shape and appearance. The 3D Morphable Model (3DMM) provides a dense 3D representation for faces, as in the Basel Face Model and FaceWarehouse. In other work, a 3DMM is incorporated in end-to-end CNN training to discriminatively estimate the 3D shape of a face given a single input image. Tran et al. propose to learn a nonlinear 3DMM via a deep neural network from in-the-wild images; in this way, 3DMMs are capable of representing non-linear facial expressions. Similar to these works, we learn a non-linear discriminative 3DMM, but we extend it to learn the model from HMC images given a neutral 3D shape, and to align the expressions across subjects in an unsupervised manner. To the best of our knowledge, this is the first work that solves the correspondence of expressions across subjects in an unsupervised manner.
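For reference, the classical linear form of a 3DMM expresses a face shape as a mean shape plus a weighted sum of basis shapes. The sketch below illustrates only that linear combination; the function name `linear_3dmm` and the toy two-vertex data are hypothetical and are not taken from the Basel Face Model, FaceWarehouse, or any other cited model.

```python
def linear_3dmm(mean_shape, basis, coeffs):
    """Return mean_shape + sum_k coeffs[k] * basis[k].

    mean_shape: flat list of vertex coordinates [x0, y0, z0, x1, ...].
    basis: list of flat lists, one per basis shape (same length as mean_shape).
    coeffs: one scalar weight per basis shape.
    """
    if len(basis) != len(coeffs):
        raise ValueError("need one coefficient per basis shape")
    shape = list(mean_shape)
    for weight, component in zip(coeffs, basis):
        for j, value in enumerate(component):
            shape[j] += weight * value
    return shape


# Toy example: two vertices and two basis shapes (illustrative values only).
mean = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
basis = [
    [0.0, 1.0, 0.0, 0.0, 1.0, 0.0],   # moves both vertices along +y
    [1.0, 0.0, 0.0, -1.0, 0.0, 0.0],  # moves the two vertices toward each other along x
]
shape = linear_3dmm(mean, basis, [0.5, 0.25])
print(shape)
```

A nonlinear 3DMM replaces this fixed linear combination with a learned decoder network, which is what allows it to represent non-linear expression changes.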
3 Multi-Identity Model
This section describes the proposed multi-identity architecture (MIA) and augmentation techniques to robustify and generalize existing encoder models for driving CAs.
3.1 Multi-Identity Architecture (MIA)
Given 3-view HMC images of the eyes and mouth (see Fig. 1), our goal is to estimate the facial expression of a CA (shape+texture), and render it in an arbitrary view in VR. The MIA has three main parts (see Fig. 4): the backbone network, the 3D shape network, and the texture branch.
Backbone network: The backbone network (Fig. 4) is shared among subjects. Its goal is to factorize the expression from other nuisance factors such as lighting, background, or camera views, and to build an internal representation that is invariant to those factors. As we will show in the experiments section, MIA naturally finds that the best way to encode HMC images across subjects is to marginalize out person-specific factors in addition to the nuisance factors mentioned. This results in an embedding that preserves only expression, without the need to solve for expression correspondence among subjects.
3D shape network: MIA assumes that the neutral 3D shape of the test subject is given. (Extracting a neutral face from a single or a few phone-captured images is a well-studied problem [28, 17, 41], and a number of commercial solutions are available [9, 14].) This is the only information MIA needs to generalize the shape component of the network to untrained subjects. A network is trained to estimate the 3D shape from HMC images. It takes both the output of the backbone network and the neutral 3D shape as input, and estimates the person-specific 3D shape expression residual. The neutral 3D shape is used to re-inject person-specific information that was factored out by the backbone network. For instance, eye openness, which varies across identities, can be extracted from the neutral 3D shape of each subject. With this, the 3D shape of a subject is reconstructed as the sum of the subject's neutral 3D shape and the estimated expression residual (Eqn. 1).
The network is trained by minimizing the Euclidean distance between the estimated and target 3D shapes, weighted by a mask over the visible areas (Eqn. 2).
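The two relations for the shape branch (neutral shape plus predicted residual, and a mask-weighted Euclidean loss) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, vertices are plain `[x, y, z]` lists, and normalizing the loss by the sum of the mask weights is a choice made here, not something specified in the text.

```python
import math


def reconstruct_shape(neutral, residual):
    """Add the predicted expression residual to the subject's neutral shape."""
    return [[n + r for n, r in zip(nv, rv)] for nv, rv in zip(neutral, residual)]


def masked_shape_loss(estimated, target, mask):
    """Mask-weighted mean of per-vertex Euclidean distances.

    mask holds one non-negative weight per vertex (0 for areas not visible).
    Dividing by sum(mask) is an assumption of this sketch.
    """
    total = sum(w * math.dist(e, t) for e, t, w in zip(estimated, target, mask))
    return total / sum(mask)


# Toy example with three vertices (illustrative values only).
neutral = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
residual = [[0.0, 0.0, 1.0], [0.0, 0.0, 0.0], [3.0, 4.0, 0.0]]
target = [[0.0, 0.0, 1.0], [1.0, 0.0, 2.0], [0.0, 0.0, 0.0]]
mask = [1.0, 1.0, 0.0]  # the third vertex is treated as not visible

estimated = reconstruct_shape(neutral, residual)
loss = masked_shape_loss(estimated, target, mask)
print(estimated, loss)
```

Because the third vertex has zero weight, its large error does not contribute to the loss; only the 2-unit error on the second vertex does.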
Texture network: When the pre-trained PS texture decoder is available for each identity, our goal is to animate the CA from HMC images robustly and with minimal adaptation effort. In this paper, we presume that pre-trained decoders are available, but our work can be applied similarly to other PS models as well (e.g., [35, 16]). The decoder takes as input an expression parameter and a view vector, and generates a person-specific and view-specific texture that, together with the shape, can be used to render the avatar (Eqn. 3).
However, since each PS model is trained independently of all others, the structure of the latent expression space is not consistent across identities. We would like to utilize the shared backbone encoder across identities to encourage robustness via joint training. Inspired by multi-task learning techniques [23, 2], we additionally learn person-specific adaptation layers that transform the identity-consistent expression embedding produced by the backbone network to each identity's personalized latent space. Finally, to eliminate non-informative dimensions of the latent space, we apply PCA dimensionality reduction to each identity's latent space and fix it during training. Together, the adaptation layer, the fixed PCA basis, and the subject's average expression parameter are used to generate the PS expression parameters of each subject. Then, we use Eqn. 3 to generate the estimated texture from a given view. To guide the network, we minimize the Euclidean loss between the estimated and the target expression parameters and textures, where a weight mask restricts the loss to the areas visible in the HMC images and a scalar weight scales the texture loss.
Total Loss: The entire MIA network is trained end-to-end, optimizing the networks' parameters by minimizing the combination of the expression and texture losses and the weighted shape loss over all training subjects.
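The per-subject expression parameters and the combined objective can be sketched as below. This is a hedged illustration, not the exact formulation: it assumes the adaptation layer is a ReLU followed by a bias-free linear map, that the fixed PCA basis maps the reduced vector back to the subject's full latent space, and that the total loss is a plain sum over subjects. All function and argument names are hypothetical.

```python
def relu(v):
    """Element-wise rectified linear unit."""
    return [max(0.0, x) for x in v]


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def expression_params(features, adapt_weights, pca_basis, mean_params):
    """Map shared backbone features to one subject's latent expression parameters."""
    reduced = matvec(adapt_weights, relu(features))  # person-specific adaptation layer
    full = matvec(pca_basis, reduced)                # fixed PCA basis, back to full latent size
    return [m + x for m, x in zip(mean_params, full)]


def total_loss(shape_losses, expr_losses, tex_losses, w_shape, w_tex):
    """Combine per-subject losses; summing over subjects is an assumption here."""
    return sum(
        e + w_tex * t + w_shape * s
        for s, e, t in zip(shape_losses, expr_losses, tex_losses)
    )


# Toy example: 2-D shared features, 1-D reduced space, 3-D latent space.
z = expression_params(
    features=[1.0, -2.0],
    adapt_weights=[[2.0, 3.0]],
    pca_basis=[[1.0], [0.0], [-1.0]],
    mean_params=[0.5, 0.5, 0.5],
)
objective = total_loss([1.0, 2.0], [0.5, 0.5], [2.0, 4.0], w_shape=1.0, w_tex=0.5)
print(z, objective)
```

Keeping the backbone features shared and only the small `adapt_weights` person-specific is what lets the same embedding drive different subjects' decoders.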
3.2 3D Augmentation
Data augmentation is a widely practiced heuristic in many deep learning tasks. The main goal is to make the distribution of variations in the training data more similar to that in the test set. The most common 2D data augmentation techniques include scaling, color augmentation, and simple geometric transformations. However, a major source of variability in our task stems from headset factors, such as variations in camera placement and focus, as well as the slop of the headset relative to the face, which varies during usage. These variations are not easily modeled using standard augmentation techniques that do not take the 3D shape of the face into account. In this paper, we simulate headset-based variations by perturbing the 3D rotation and translation of the face shape in the training set, and use the perturbed shape to re-render augmented views of each HMC image on random backgrounds. Some examples are shown in Fig. 5. As demonstrated in the experiments section below, this simple augmentation technique substantially improves the robustness of our method to real-world variations.
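The geometric part of this augmentation, perturbing the rotation and translation of the face shape, can be sketched as a random rigid transform applied to the vertices. The sketch below covers only that step and omits re-rendering; the function names, the Euler-angle parameterization, and the default perturbation ranges are all assumptions made for illustration, since the text does not state them.

```python
import math
import random


def rotation_matrix(rx, ry, rz):
    """Rotation R = Rz @ Ry @ Rx for angles given in radians."""
    cx, sx = math.cos(rx), math.sin(rx)
    cy, sy = math.cos(ry), math.sin(ry)
    cz, sz = math.cos(rz), math.sin(rz)
    return [
        [cz * cy, cz * sy * sx - sz * cx, cz * sy * cx + sz * sx],
        [sz * cy, sz * sy * sx + cz * cx, sz * sy * cx - cz * sx],
        [-sy, cy * sx, cy * cx],
    ]


def perturb_shape(vertices, max_angle_deg=5.0, max_shift=3.0, rng=random):
    """Apply one random rigid transform (rotation + translation) to all vertices.

    The default ranges are hypothetical placeholders, not values from the paper.
    """
    angles = [math.radians(rng.uniform(-max_angle_deg, max_angle_deg)) for _ in range(3)]
    shift = [rng.uniform(-max_shift, max_shift) for _ in range(3)]
    rot = rotation_matrix(*angles)
    return [
        [sum(rot[i][k] * v[k] for k in range(3)) + shift[i] for i in range(3)]
        for v in vertices
    ]


# Toy example: three vertices, seeded generator for repeatability.
vertices = [[0.0, 0.0, 0.0], [10.0, 0.0, 0.0], [0.0, 20.0, 0.0]]
augmented = perturb_shape(vertices, rng=random.Random(0))
print(augmented)
```

Because the transform is rigid, distances between vertices are preserved: the face shape itself is unchanged, and only its pose relative to the cameras varies, which is the kind of variation headset slop and camera placement introduce.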
4 Experimental Results
This section reports experimental results and analysis on MIA. The first experiment shows how MIA can estimate accurate 3D shapes directly from HMC images of untrained subjects. In the second experiment, we evaluate the quality of MIA's texture prediction for identities with pre-trained avatars under challenging testing scenarios. In the third experiment, we show how MIA can incorporate new subjects with minimal training. In addition, we present further analysis of what MIA learns prior to and during adaptation.
Data: We used HMC captures of different subjects for training and separate HMC captures for testing. Training and testing HMC captures do not overlap. Each HMC capture is a video of multi-view HMC images that contains peak expressions, two sets of continuous range-of-motion, recitation of sentences, and several minutes of conversation. The HMC images are in the IR spectrum. During testing, only 3 views are available. For each subject, we have a pre-trained decoder that generates the PS texture for various expressions from arbitrary views. For more information on how to build the PS decoder, see the original decoder work and Eqn. 3.
Ground Truth: We use as ground truth the result of the method of Wei et al., which solves for the correspondence between 11-view HMC images and the CA parameters. Recall that the training data is captured with 11 views to achieve more precise correspondence between HMC and CA, while the testing data has only 3 views.
Baseline method: We compare MIA with the person-specific (PS) encoder of Wei et al. The PS encoder is trained with one HMC capture and uses a CNN architecture with the same number of parameters as ours.
Evaluation metrics: We report the average Euclidean error for the eyes, mouth and face areas separately, for both the 3D shape and the texture. The 3D shape errors are measured in millimeters and the texture errors in raw intensity values (i.e., 0-255). We report localized error metrics to better analyze failure modes. For example, the 3D shape error in the eyes captures openness and blinking errors, while in the mouth it captures deviations in lip shapes that are important for visual speech. Similarly, texture error in the eyes is typically due to errors in gaze direction, and in the mouth it corresponds to incorrect teeth and tongue estimation.
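A localized shape metric of this kind can be computed by averaging per-vertex Euclidean distances over the vertices that belong to each area. The following is a minimal sketch; the function name `region_errors` and the vertex index lists for each region are hypothetical, since the actual region definitions are not given in the text.

```python
import math


def region_errors(estimated, target, regions):
    """Average per-vertex Euclidean error for each named region.

    regions maps a region name to the list of vertex indices it covers.
    """
    return {
        name: sum(math.dist(estimated[i], target[i]) for i in idx) / len(idx)
        for name, idx in regions.items()
    }


# Toy example: four vertices split into two regions (illustrative values only).
estimated = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 3.0], [0.0, 4.0, 0.0]]
target = [[0.0, 0.0, 0.0], [1.0, 0.0, 2.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
regions = {"eyes": [0, 1], "mouth": [2, 3], "face": [0, 1, 2, 3]}

errors = region_errors(estimated, target, regions)
print(errors)
```

Reporting the regions separately shows, in this toy case, that the mouth error dominates even though the whole-face average looks moderate.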
Table 1: 3D shape errors for untrained subjects.
|Subject||3D Shape Error (mm)|

Table 2: Testing errors of MIA and PS on the test HMC captures.
|Test||Sample||Variations||Method||3D Shape Error||Texture Error|
|1||Fig. 2(b)||Headset||PS|
|2||Fig. 2(c)||Headset||PS|
|3||Fig. 2(d)||Headset||PS|
|4||Fig. 2(e)||Environment||PS|
Implementation details: In training, we use the Adam optimizer with a fixed batch size and initial learning rate, and decrease the learning rate by a constant factor at regular iteration intervals. The weights of the shape loss and the texture loss are set to the same value. We crop and resize the HMC images to focus on the face areas.
The backbone network consists of two residual networks, one for the eye images and another for the mouth images. Each network consists of a ResNet head module, five BottleNeck blocks, and a fully connected layer. Each BottleNeck block consists of ten convolutional layers. We add shortcut connections among the convolutional layers, and each layer is followed by ReLU and instance normalization layers. To extract the final identity-invariant features, we apply global average pooling and a fully connected layer to the activations of the last BottleNeck block. The 3D shape network consists of four fully connected layers, each followed by a leaky ReLU layer. To account for their different domains, we normalize the features extracted from the HMC images and from the neutral 3D shape by applying group normalization after concatenating them. Finally, for the texture network, we use a combination of a ReLU layer and a fully connected layer without bias.
4.1 Quantitative Evaluations
This section quantifies the performance of MIA using three experiments: (1) driving the 3D shape of untrained subjects; (2) robustness of shape and texture estimation for subjects with trained PS models; and (3) generalization of the learned features to new subjects.
Driving 3D Shape: The inputs to the shape generation network are the HMC images and the corresponding identity's neutral 3D shape. We train the network with 120 subjects using the loss function in Eqn. 2 as guidance. Fig. 6 shows the estimated 3D shape for extreme expression examples from six untrained subjects, along with their ground truth. Our 3D shape estimator captures the subtle details in expressions that are necessary for inferring social signals. Table 1 shows the 3D shape errors for the face, eyes and mouth areas over the whole sequence for ten untrained subjects. The error is less than 2mm in the face/eyes and 3mm in the mouth. Recall that MIA does not use any sample from the test subject other than the neutral shape, and has never seen any HMC images of these subjects during training. Fig. 7 shows testing results for one untrained subject over a wide range of expressions. Note that PS is not able to estimate the 3D shape for untrained subjects.
Comparing Table 1 with PS's results for the 3D shape error in Table 2 (different capture), we find that MIA outperforms PS, despite PS having access to subject-specific HMC images, and their target shapes, during training. We suspect the reason is that MIA learns to marginalize the extrinsic variability of the problem (i.e., environment, headset) from the 120 subjects that it is trained on, while PS tends to overfit to the specific HMC capture session used for training. More comparative results can be found in the video in the supplementary material.
Driving Full Avatars: In this experiment, we evaluate the ability of MIA to generate both shape and texture, and its robustness against extrinsic factors such as headset, environment and facial appearance variations. Here, data for the test subject is available during training, but from a different HMC capture. The selected subject was captured on five different dates; examples of the HMC images are shown in Fig. 2. These samples show large appearance variations due to facial hair, pose changes from the headset slop, and camera assembly differences across headsets; they also contain background variation due to the changing environment and overall lighting differences. We use one HMC capture (Fig. 2(a)) of the subject, together with HMC captures of other subjects, for training, and test on the remaining four HMC captures of that subject. Table 2 compares the testing errors of MIA against PS. On the test capture that is very similar to the training capture, PS performs better than MIA. However, its performance declines significantly when testing on the other captures, where variations in environment and facial appearance are more extreme. Note that the overall errors for MIA, for all areas of the 3D shape and texture, are more stable and are similarly low across all test captures. The first two rows of Fig. 10 show a visual comparison of the methods on the test HMC captures, where a significant reduction in expressive detail is noticeable in the results for PS. We refer the reader to the supplementary material for more results.
Adaptation to New Identities: We evaluated the generalization of MIA's feature extraction to new subjects on HMC captures of 6 subjects that MIA was not trained on. Each of the 6 subjects has more than one HMC capture exhibiting variations in extrinsic factors. We used the MIA network pre-trained on the other subjects (excluding the 6 test subjects), and fixed the shape generation network and the backbone network. For each new subject, we trained a new small texture network. During testing on HMC captures with variations, we used the newly trained texture estimation branch to estimate the texture parameters, and decoded both the texture and the 3D shape by utilizing Eqn. 3. Table 3 shows the overall errors for 3D shape and texture for different areas on the testing HMC captures of these subjects. MIA achieves lower errors for all areas with smaller variability, demonstrating the effectiveness of the features extracted from the fixed backbone network. The last three rows of Fig. 10 show visualizations of this case.
Table 3: Overall 3D shape and texture errors for adaptation to new identities.
|Method||3D Shape Error||Texture Error|
4.2 Ablation Study
3D augmentation layer: To analyze the advantage of the 3D augmentation (3D Aug) layer, we compare the errors of the PS model, MIA with 3D Aug trained with a single subject, MIA without 3D Aug trained with multiple subjects, and MIA with 3D Aug trained with multiple subjects. Fig. 8 shows the average errors for the four test captures in Table 2. Even using the 3D Aug layer with a single subject reduces errors slightly in comparison to PS. However, there is a much larger drop in errors when the 3D Aug layer is used with multiple subjects. This reduction is most significant in the mouth area, which shows that the combination of MIA and 3D Aug is effective.
Influence of number of subjects: We evaluate the influence of the number of training subjects on the performance of MIA during testing. We train MIA with different numbers of subjects, and test each model on ten untrained subjects for estimating the 3D shapes. Fig. 9 shows that the 3D shape errors decrease as the number of training subjects increases, especially for the mouth area.
4.3 Unsupervised Expression Correspondence
MIA implicitly learns to solve for correspondence across expressions in order to marginalize nuisance parameters (e.g., lighting). It naturally discovers that the best way to encode multi-identity HMC data is to find a latent space that contains only expression information. Fig. 11 illustrates this. The first column shows the input HMC images, and the second column shows the CA of the subject in the first column. The remaining columns show the CAs of other subjects driven from the HMC images in the first column; that is, the same features extracted from the HMC images are used to estimate, through each subject's own person-specific layers, a new expression parameter z with the same facial expression meaning in the latent space of each of the remaining subjects. As can be observed, MIA aligns the expression across all subjects in an unsupervised manner, creating a common expression-only space. Note in particular the mouth area in the second row of Fig. 11, which shows the same expression with different mouth interiors.
5 Conclusion
This paper proposes MIA to robustify and generalize existing PS methods for driving CAs. MIA learns to extract identity-invariant features related to facial expression while marginalizing nuisance factors (headset, environment, facial appearance) in an unsupervised manner. We show that MIA is able to drive the shape component for untrained subjects and that, if the PS texture decoder is available, MIA can drive CAs for new subjects with minimal training. Our experiments show that MIA outperforms PS methods both quantitatively and qualitatively. Although this may seem counter-intuitive at first, it is explained by the fact that MIA learns to marginalize the extrinsic variability of the data from many subjects.
References
- (2013) Facewarehouse: a 3d facial expression database for visual computing. IEEE Trans. Vis. Comput. Graph. 20 (3), pp. 413–425.
- Partially shared multi-task convolutional neural network with local constraint for face attribute learning. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 4290–4299.
- (2020) Personalized face modeling for improved face reconstruction and motion retargeting. In Eur. Conf. Comput. Vis., pp. 142–160.
- Joint face detection and facial motion retargeting for multiple faces. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 9719–9728.
- (2020) Expressive telepresence via modular codec avatars. In Eur. Conf. Comput. Vis., pp. 330–345.
- (1998) Active appearance models. In Eur. Conf. Comput. Vis., pp. 484–498.
- (1994) Active shape models: evaluation of a multi-resolution method for improving image search. In Brit. Mach. Vis. Conf., Vol. 1, pp. 327–336.
- (2019) Egoface: egocentric face performance capture and videorealistic reenactment. arXiv preprint arXiv:1905.10822.
- FacePlusPlus. https://www.faceplusplus.com/3dface/
- (2009) Video conferencing versus telephone calls for team work across hospitals: a qualitative study on simulated emergencies. Emergency Medicine 22.
- (2016) Deep residual learning for image recognition. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 770–778.
- Eyemotion: classifying facial expressions in VR using eye-tracking cameras. In IEEE Winter Conference on Applications of Computer Vision, pp. 1626–1635.
- (2017) Pose-invariant face alignment with a single cnn. In Int. Conf. Comput. Vis., pp. 3200–3209.
- KeenTools. https://keentools.io/
- (2012) Imagenet classification with deep convolutional neural networks. In Adv. Neural Inform. Process. Syst., pp. 1097–1105.
- (2020) Uncertainty-aware mesh decoder for high fidelity 3D face reconstruction. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 6100–6109.
- (2020) Towards high-fidelity 3d face reconstruction from in-the-wild images using graph convolutional networks. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 5891–5900.
- (2018) Deep appearance models for face rendering. ACM Trans. Graph. 37 (4), pp. 1–13.
- (2019) Realistic facial expression reconstruction for VR hmd users. IEEE Trans. Multimedia 22 (3), pp. 730–743.
- (2004) Active appearance models revisited. Int. J. Comput. Vis. 60 (2), pp. 135–164.
- (2018) PaGAN: real-time avatars using dynamic textures. ACM Trans. Graph. 37 (6), pp. 1–12.
- Rectified linear units improve restricted boltzmann machines. In Int. Conf. Machine Learning.
- (2020) High-resolution neural face swapping for visual effects. In Computer Graphics Forum, Vol. 39, pp. 173–184.
- (2016) High-fidelity facial and speech animation for vr hmds. ACM Trans. Graph. 35 (6), pp. 1–14.
- (2016) Holoportation: virtual 3d teleportation in real-time. In Proceedings of Symposium on User Interface Software and Technology, pp. 741–754.
- (2009) A 3d face model for pose and illumination invariant face recognition. In International Conference on Advanced Video and Signal Based Surveillance, pp. 296–301.
- (2020) Audio-and gaze-driven facial animation of codec avatars. In IEEE Winter Conference on Applications of Computer Vision.
- (2016) Adaptive 3d face reconstruction from unconstrained photo collections. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 4197–4206.
- (2020) The eyes have it: an integrated eye and face model for photorealistic facial animation. ACM Trans. Graph. 39 (4), pp. 91–1.
- (2014) CNN features off-the-shelf: an astounding baseline for recognition. In IEEE Conf. Comput. Vis. Pattern Recog. Worksh., pp. 806–813.
- (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
- (2020) Unsupervised learning facial parameter regressor for action unit intensity estimation via differentiable renderer. In Proceedings of ACM International Conference on Multimedia, pp. 2842–2851.
- (2019) Fml: face model learning from videos. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 10812–10822.
- (2016) Face2face: real-time face capture and reenactment of rgb videos. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 2387–2395.
- (2019) Towards high-fidelity nonlinear 3D face morphable model. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 1126–1135.
- (2019) On learning 3d face morphable model from in-the-wild images. IEEE Trans. Pattern Anal. Mach. Intell.
- (2016) Instance normalization: the missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022.
- (2019) VR facial animation via multiview image translation. ACM Trans. Graph. 38 (4), pp. 1–16.
- (2018) Group normalization. In Eur. Conf. Comput. Vis., pp. 3–19.
- (2015) Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853.
- (2020) ReDA: reinforced differentiable attribute for 3d face reconstruction. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 4958–4967.