Zhenan Sun

is this you? claim profile


Professor at Chinese Academy of Sciences

  • Cross-spectral Face Completion for NIR-VIS Heterogeneous Face Recognition

    Near infrared-visible (NIR-VIS) heterogeneous face recognition refers to the process of matching NIR to VIS face images. Current heterogeneous methods try to extend VIS face recognition methods to the NIR spectrum by synthesizing VIS images from NIR images. However, due to self-occlusion and sensing gap, NIR face images lose some visible lighting contents so that they are always incomplete compared to VIS face images. This paper models high resolution heterogeneous face synthesis as a complementary combination of two components, a texture inpainting component and pose correction component. The inpainting component synthesizes and inpaints VIS image textures from NIR image textures. The correction component maps any pose in NIR images to a frontal pose in VIS images, resulting in paired NIR and VIS textures. A warping procedure is developed to integrate the two components into an end-to-end deep network. A fine-grained discriminator and a wavelet-based discriminator are designed to supervise intra-class variance and visual quality respectively. One UV loss, two adversarial losses and one pixel loss are imposed to ensure synthesis results. We demonstrate that by attaching the correction component, we can simplify heterogeneous face synthesis from one-to-many unpaired image translation to one-to-one paired image translation, and minimize spectral and pose discrepancy during heterogeneous recognition. Extensive experimental results show that our network not only generates high-resolution VIS face images and but also facilitates the accuracy improvement of heterogeneous face recognition.

    02/10/2019 ∙ by Ran He, et al. ∙ 16 share

    read it

  • IntroVAE: Introspective Variational Autoencoders for Photographic Image Synthesis

    We present a novel introspective variational autoencoder (IntroVAE) model for synthesizing high-resolution photographic images. IntroVAE is capable of self-evaluating the quality of its generated samples and improving itself accordingly. Its inference and generator models are jointly trained in an introspective way. On one hand, the generator is required to reconstruct the input images from the noisy outputs of the inference model as normal VAEs. On the other hand, the inference model is encouraged to classify between the generated and real samples while the generator tries to fool it as GANs. These two famous generative frameworks are integrated in a simple yet efficient single-stream architecture that can be trained in a single stage. IntroVAE preserves the advantages of VAEs, such as stable training and nice latent manifold. Unlike most other hybrid models of VAEs and GANs, IntroVAE requires no extra discriminators, because the inference model itself serves as a discriminator to distinguish between the generated and real samples. Experiments demonstrate that our method produces high-resolution photo-realistic images (e.g., CELEBA images at 1024^2), which are comparable to or better than the state-of-the-art GANs.

    07/17/2018 ∙ by Huaibo Huang, et al. ∙ 10 share

    read it

  • Variational Capsules for Image Analysis and Synthesis

    A capsule is a group of neurons whose activity vector models different properties of the same entity. This paper extends the capsule to a generative version, named variational capsules (VCs). Each VC produces a latent variable for a specific entity, making it possible to integrate image analysis and image synthesis into a unified framework. Variational capsules model an image as a composition of entities in a probabilistic model. Different capsules' divergence with a specific prior distribution represents the presence of different entities, which can be applied in image analysis tasks such as classification. In addition, variational capsules encode multiple entities in a semantically-disentangling way. Diverse instantiations of capsules are related to various properties of the same entity, making it easy to generate diverse samples with fine-grained semantic attributes. Extensive experiments demonstrate that deep networks designed with variational capsules can not only achieve promising performance on image analysis tasks (including image classification and attribute prediction) but can also improve the diversity and controllability of image synthesis.

    07/11/2018 ∙ by Huaibo Huang, et al. ∙ 6 share

    read it

  • Global and Local Consistent Wavelet-domain Age Synthesis

    Age synthesis is a challenging task due to the complicated and non-linear transformation in human aging process. Aging information is usually reflected in local facial parts, such as wrinkles at the eye corners. However, these local facial parts contribute less in previous GAN based methods for age synthesis. To address this issue, we propose a Wavelet-domain Global and Local Consistent Age Generative Adversarial Network (WaveletGLCA-GAN), in which one global specific network and three local specific networks are integrated together to capture both global topology information and local texture details of human faces. Different from the most existing methods that modeling age synthesis in image-domain, we adopt wavelet transform to depict the textual information in frequency-domain. under the premise of preserving the identity information, age estimation network and face verification network are employed. Moreover, five types of losses are adopted: 1) adversarial loss aims to generate realistic wavelets; 2) identity preserving loss aims to better preserve identity information; 3) age preserving loss aims to enhance the accuracy of age synthesis; 4) pixel-wise loss aims to preserve the background information of the input face; 5) the total variation regularization aims to remove ghosting artifacts. Our method is evaluated on three face aging datasets, including CACD2000, Morph and FG-NET. Qualitative and quantitative experiments show the superiority of the proposed method over other state-of-the-arts.

    09/20/2018 ∙ by Peipei Li, et al. ∙ 2 share

    read it

  • Wasserstein CNN: Learning Invariant Features for NIR-VIS Face Recognition

    Heterogeneous face recognition (HFR) aims to match facial images acquired from different sensing modalities with mission-critical applications in forensics, security and commercial sectors. However, HFR is a much more challenging problem than traditional face recognition because of large intra-class variations of heterogeneous face images and limited training samples of cross-modality face image pairs. This paper proposes a novel approach namely Wasserstein CNN (convolutional neural networks, or WCNN for short) to learn invariant features between near-infrared and visual face images (i.e. NIR-VIS face recognition). The low-level layers of WCNN are trained with widely available face images in visual spectrum. The high-level layer is divided into three parts, i.e., NIR layer, VIS layer and NIR-VIS shared layer. The first two layers aims to learn modality-specific features and NIR-VIS shared layer is designed to learn modality-invariant feature subspace. Wasserstein distance is introduced into NIR-VIS shared layer to measure the dissimilarity between heterogeneous feature distributions. So W-CNN learning aims to achieve the minimization of Wasserstein distance between NIR distribution and VIS distribution for invariant deep feature representation of heterogeneous face images. To avoid the over-fitting problem on small-scale heterogeneous face data, a correlation prior is introduced on the fully-connected layers of WCNN network to reduce parameter space. This prior is implemented by a low-rank constraint in an end-to-end network. The joint formulation leads to an alternating minimization for deep feature representation at training stage and an efficient computation for heterogeneous data at testing stage. Extensive experiments on three challenging NIR-VIS face recognition databases demonstrate the significant superiority of Wasserstein CNN over state-of-the-art methods.

    08/08/2017 ∙ by Ran He, et al. ∙ 0 share

    read it

  • Recent Progress of Face Image Synthesis

    Face synthesis has been a fascinating yet challenging problem in computer vision and machine learning. Its main research effort is to design algorithms to generate photo-realistic face images via given semantic domain. It has been a crucial prepossessing step of main-stream face recognition approaches and an excellent test of AI ability to use complicated probability distributions. In this paper, we provide a comprehensive review of typical face synthesis works that involve traditional methods as well as advanced deep learning approaches. Particularly, Generative Adversarial Net (GAN) is highlighted to generate photo-realistic and identity preserving results. Furthermore, the public available databases and evaluation metrics are introduced in details. We end the review with discussing unsolved difficulties and promising directions for future research.

    06/15/2017 ∙ by Zhihe Lu, et al. ∙ 0 share

    read it

  • Deep Supervised Discrete Hashing

    With the rapid growth of image and video data on the web, hashing has been extensively studied for image or video search in recent years. Benefit from recent advances in deep learning, deep hashing methods have achieved promising results for image retrieval. However, there are some limitations of previous deep hashing methods (e.g., the semantic information is not fully exploited). In this paper, we develop a deep supervised discrete hashing algorithm based on the assumption that the learned binary codes should be ideal for classification. Both the pairwise label information and the classification information are used to learn the hash codes within one stream framework. We constrain the outputs of the last layer to be binary codes directly, which is rarely investigated in deep hashing algorithm. Because of the discrete nature of hash codes, an alternating minimization method is used to optimize the objective function. Experimental results have shown that our method outperforms current state-of-the-art methods on benchmark datasets.

    05/31/2017 ∙ by Qi Li, et al. ∙ 0 share

    read it

  • Combining Data-driven and Model-driven Methods for Robust Facial Landmark Detection

    Facial landmark detection is an important but challenging task for real-world computer vision applications. This paper proposes an accurate and robust approach for facial landmark detection by combining data-driven and model-driven methods. Firstly, a fully convolutional network (FCN) is trained to generate response maps of all facial landmark points. Such a data-driven method can make full use of holistic information in a facial image for global estimation of facial landmarks. Secondly, the maximum points in the response maps are fitted with a pre-trained point distribution model (PDM) to generate initial facial landmark shape. Such a model-driven method can correct the location errors of outliers by considering shape prior information. Thirdly, a weighted version of Regularized Landmark Mean-Shift (RLMS) is proposed to fine-tune facial landmark shapes iteratively. The weighting strategy is based on the confidence of convolutional response maps so that FCN is integrated into the framework of Constrained Local Model (CLM). Such an Estimation-Correction-Tuning process perfectly combines the global robustness advantage of data-driven method (FCN), outlier correction advantage of model-driven method (PDM) and non-parametric optimization advantage of RLMS. The experimental results demonstrate that the proposed approach outperforms state-of-the-art solutions on the 300-W dataset. Our approach is well-suited for face images with large poses, exaggerated expression, and occlusions.

    11/30/2016 ∙ by Hongwen Zhang, et al. ∙ 0 share

    read it

  • A Light CNN for Deep Face Representation with Noisy Labels

    Convolution neural network (CNN) has significantly pushed forward the development of face recognition and analysis techniques. Current CNN models tend to be deeper and larger to better fit large amounts of training data. When training data are from internet, their labels are often ambiguous and inaccurate. This paper presents a light CNN framework to learn a compact embedding on the large-scale face data with massive noisy labels. First, we introduce the concept of maxout activation into each convolutional layer of CNN, which results in a Max-Feature-Map (MFM). Different from Rectified Linear Unit that suppresses a neuron by a threshold (or bias), MFM suppresses a neuron by a competitive relationship. MFM can not only separate noisy signals and informative signals but also plays a role of feature selection. Second, a network of five convolution layers and four Network in Network (NIN) layers are implemented to reduce the number of parameters and improve performance. Lastly, a semantic bootstrapping method is accordingly designed to make the prediction of the models be better consistent with noisy labels. Experimental results show that the proposed framework can utilize large-scale noisy data to learn a light model in terms of both computational cost and storage space. The learnt single model with a 256-D representation achieves state-of-the-art results on five face benchmarks without fine-tuning. The light CNN model is released on https://github.com/AlfredXiangWu/face_verification_experiment.

    11/09/2015 ∙ by Xiang Wu, et al. ∙ 0 share

    read it

  • Learning Structured Ordinal Measures for Video based Face Recognition

    This paper presents a structured ordinal measure method for video-based face recognition that simultaneously learns ordinal filters and structured ordinal features. The problem is posed as a non-convex integer program problem that includes two parts. The first part learns stable ordinal filters to project video data into a large-margin ordinal space. The second seeks self-correcting and discrete codes by balancing the projected data and a rank-one ordinal matrix in a structured low-rank way. Unsupervised and supervised structures are considered for the ordinal matrix. In addition, as a complement to hierarchical structures, deep feature representations are integrated into our method to enhance coding stability. An alternating minimization method is employed to handle the discrete and low-rank constraints, yielding high-quality codes that capture prior structures well. Experimental results on three commonly used face video databases show that our method with a simple voting classifier can achieve state-of-the-art recognition rates using fewer features and samples.

    07/09/2015 ∙ by Ran He, et al. ∙ 0 share

    read it

  • Geometry Guided Adversarial Facial Expression Synthesis

    Facial expression synthesis has drawn much attention in the field of computer graphics and pattern recognition. It has been widely used in face animation and recognition. However, it is still challenging due to the high-level semantic presence of large and non-linear face geometry variations. This paper proposes a Geometry-Guided Generative Adversarial Network (G2-GAN) for photo-realistic and identity-preserving facial expression synthesis. We employ facial geometry (fiducial points) as a controllable condition to guide facial texture synthesis with specific expression. A pair of generative adversarial subnetworks are jointly trained towards opposite tasks: expression removal and expression synthesis. The paired networks form a mapping cycle between neutral expression and arbitrary expressions, which also facilitate other applications such as face transfer and expression invariant face recognition. Experimental results show that our method can generate compelling perceptual results on various facial expression synthesis databases. An expression invariant face recognition experiment is also performed to further show the advantages of our proposed method.

    12/10/2017 ∙ by Lingxiao Song, et al. ∙ 0 share

    read it