Yu-Wing Tai



Research Director of Youtu Lab at Tencent since 2017; Principal Researcher at SenseTime Group Limited, 2015-2016; Adjunct Associate Professor at the Department of Computer Science and Engineering, HKUST; Associate Professor at the Korea Advanced Institute of Science and Technology (KAIST) from July 2009 to August 2015; Assistant Professor at KAIST, 2009-2014; full-time student intern at Microsoft Research Asia (MSRA), 2007-2008. Received the Microsoft Research Asia Fellowship in 2007 and the KAIST 40th Anniversary Academic Award for Excellent Professor in 2011. PhD degree from the National University of Singapore in 2009.

  • LADN: Local Adversarial Disentangling Network for Facial Makeup and De-Makeup

    We propose a local adversarial disentangling network (LADN) for facial makeup and de-makeup. Central to our method are multiple and overlapping local adversarial discriminators in a content-style disentangling network for achieving local detail transfer between facial images, with the use of asymmetric loss functions for dramatic makeup styles with high-frequency details. Existing techniques do not demonstrate or fail to transfer high-frequency details in a global adversarial setting, or train a single local discriminator only to ensure image structure consistency and thus work only for relatively simple styles. Unlike others, our proposed local adversarial discriminators can distinguish whether the generated local image details are consistent with the corresponding regions in the given reference image in cross-image style transfer in an unsupervised setting. Incorporating these technical contributions, we achieve not only state-of-the-art results on conventional styles but also novel results involving complex and dramatic styles with high-frequency details covering large areas across multiple facial features. A carefully designed dataset of unpaired before and after makeup images will be released.

    04/25/2019 ∙ by Qiao Gu, et al. ∙ 20 share
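
The abstract above describes multiple overlapping local discriminators that compare corresponding regions of the generated and reference images. As a rough illustration only, the sketch below crops overlapping square patches around a list of centers; the function name `crop_local_patches`, the patch size, and the use of landmark-like centers are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def crop_local_patches(image, centers, size):
    """Crop one size x size patch per center, clamped to stay inside the image."""
    half = size // 2
    height, width = image.shape[:2]
    patches = []
    for row, col in centers:
        top = int(np.clip(row - half, 0, height - size))
        left = int(np.clip(col - half, 0, width - size))
        patches.append(image[top:top + size, left:left + size])
    return patches

# Hypothetical 16x16 single-channel image and three patch centers.
image = np.arange(16 * 16, dtype=float).reshape(16, 16)
centers = [(4, 4), (6, 6), (15, 15)]
patches = crop_local_patches(image, centers, size=8)
```

In a full system each patch pair (generated, reference) would be fed to its own discriminator; nearby centers such as `(4, 4)` and `(6, 6)` yield overlapping patches.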


  • Landmark Assisted CycleGAN for Cartoon Face Generation

    In this paper, we are interested in generating a cartoon face of a person by using unpaired training data between real faces and cartoon ones. A major challenge of this task is that the structures of real and cartoon faces are in two different domains, whose appearances differ greatly from each other. Without explicit correspondence, it is difficult to generate a high-quality cartoon face that captures the essential facial features of a person. To solve this problem, we propose landmark assisted CycleGAN, which utilizes face landmarks to define a landmark consistency loss and to guide the training of the local discriminator in CycleGAN. To enforce structural consistency in landmarks, we utilize a conditional generator and discriminator. Our approach is capable of generating high-quality cartoon faces, even indistinguishable from those drawn by artists, and substantially improves on the state of the art.

    07/02/2019 ∙ by Ruizheng Wu, et al. ∙ 16 share
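
The abstract above mentions a landmark consistency loss without giving its form. A minimal sketch, assuming the loss is the mean squared distance between landmarks of the input face and landmarks predicted on the generated face (the function name and the squared-distance choice are assumptions, and the landmark predictor itself is not shown):

```python
import numpy as np

def landmark_consistency_loss(input_landmarks, generated_landmarks):
    """Mean squared Euclidean distance between corresponding landmarks."""
    diff = np.asarray(input_landmarks, dtype=float) - np.asarray(generated_landmarks, dtype=float)
    return float(np.mean(np.sum(diff ** 2, axis=1)))

# Hypothetical (x, y) landmarks: two eyes and a mouth center.
real = np.array([[30.0, 40.0], [70.0, 40.0], [50.0, 80.0]])
shifted = real + np.array([3.0, 4.0])  # every landmark displaced by 5 pixels

loss_same = landmark_consistency_loss(real, real)
loss_shifted = landmark_consistency_loss(real, shifted)
```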


  • Learning Dual Convolutional Neural Networks for Low-Level Vision

    In this paper, we propose a general dual convolutional neural network (DualCNN) for low-level vision problems, e.g., super-resolution, edge-preserving filtering, deraining and dehazing. These problems usually involve the estimation of two components of the target signals: structures and details. Motivated by this, our proposed DualCNN consists of two parallel branches, which respectively recover the structures and details in an end-to-end manner. The recovered structures and details can generate the target signals according to the formation model for each particular application. The DualCNN is a flexible framework for low-level vision tasks and can be easily incorporated into existing CNNs. Experimental results show that the DualCNN can be effectively applied to numerous low-level vision tasks with favorable performance against the state-of-the-art methods.

    05/14/2018 ∙ by Jinshan Pan, et al. ∙ 6 share
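
To make the structure/detail split concrete, here is a small sketch that assumes the simplest additive formation model, target = structure + detail. A moving-average filter stands in for the learned structure branch; both the filter and the additive model are illustrative assumptions, not the networks described in the abstract.

```python
import numpy as np

def box_smooth(signal, radius):
    """Moving-average filter standing in for a learned structure branch."""
    kernel = np.ones(2 * radius + 1) / (2 * radius + 1)
    padded = np.pad(signal, radius, mode="edge")
    return np.convolve(padded, kernel, mode="valid")

signal = np.array([0.0, 0.1, 0.0, 1.0, 0.9, 1.0, 0.0, 0.1])
structure = box_smooth(signal, radius=1)  # coarse, smooth component
detail = signal - structure               # residual fine component
reconstructed = structure + detail        # additive formation model
```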


  • Physics-Based Generative Adversarial Models for Image Restoration and Beyond

    We present an algorithm to directly solve numerous image restoration problems (e.g., image deblurring, image dehazing, image deraining, etc.). These problems are highly ill-posed, and the common assumptions for existing methods are usually based on heuristic image priors. In this paper, we find that these problems can be solved by generative models with adversarial learning. However, the basic formulation of generative adversarial networks (GANs) does not generate realistic images, and some structures of the estimated images are usually not preserved well. Motivated by an interesting observation that the estimated results should be consistent with the observed inputs under the physics models, we propose a physics model constrained learning algorithm so that it can guide the estimation of the specific task in the conventional GAN framework. The proposed algorithm is trained in an end-to-end fashion and can be applied to a variety of image restoration and related low-level vision problems. Extensive experiments demonstrate that our method performs favorably against the state-of-the-art algorithms.

    08/02/2018 ∙ by Jinshan Pan, et al. ∙ 6 share
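
The key observation in the abstract above is that an estimate should reproduce the observed input when passed back through the physics model. A minimal sketch for the deblurring case, assuming a known 1D blur kernel and an L1 penalty (both are assumptions chosen for illustration; the abstract does not specify them):

```python
import numpy as np

def blur(signal, kernel):
    """Physics model for deblurring: observed = kernel convolved with sharp signal."""
    return np.convolve(signal, kernel, mode="same")

def physics_consistency(estimate, observed, kernel):
    """L1 distance between the re-blurred estimate and the observed input."""
    return float(np.mean(np.abs(blur(estimate, kernel) - observed)))

kernel = np.array([0.25, 0.5, 0.25])
sharp = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0])
observed = blur(sharp, kernel)

loss_correct = physics_consistency(sharp, observed, kernel)   # true sharp signal
loss_naive = physics_consistency(observed, observed, kernel)  # blurry input used as the estimate
```

In a GAN setting a term like this would be added to the adversarial loss so that the generator is penalized for outputs that are realistic but inconsistent with the input.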


  • Weakly and Semi Supervised Human Body Part Parsing via Pose-Guided Knowledge Transfer

    Human body part parsing, or human semantic part segmentation, is fundamental to many computer vision tasks. In conventional semantic segmentation methods, the ground truth segmentations are provided, and fully convolutional networks (FCN) are trained in an end-to-end scheme. Although these methods have demonstrated impressive results, their performance highly depends on the quantity and quality of training data. In this paper, we present a novel method to generate synthetic human part segmentation data using easily obtained human keypoint annotations. Our key idea is to exploit the anatomical similarity among humans to transfer the parsing results of one person to another person with a similar pose. Using these estimated results as additional training data, our semi-supervised model outperforms its strongly supervised counterpart by 6 mIOU on the PASCAL-Person-Part dataset, and we achieve state-of-the-art human parsing results. Our approach is general and can be readily extended to other object/animal parsing tasks, assuming that their anatomical similarity can be annotated by keypoints. The proposed model and accompanying source code are available at https://github.com/MVIG-SJTU/WSHP

    05/11/2018 ∙ by Hao-Shu Fang, et al. ∙ 2 share
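
Transferring parsing results "to another person with similar pose" requires some way to find similar poses from keypoints. The sketch below is one plausible approach, not the authors' procedure: it normalizes keypoints for translation and scale and picks the nearest annotated pose by Euclidean distance. All names and the four-keypoint skeleton are hypothetical.

```python
import numpy as np

def normalize_pose(keypoints):
    """Center keypoints and scale to unit norm so poses are comparable."""
    pts = np.asarray(keypoints, dtype=float)
    pts = pts - pts.mean(axis=0)
    scale = np.linalg.norm(pts)
    return pts / scale if scale > 0 else pts

def nearest_pose(query, database):
    """Index of the database pose closest to the query after normalization."""
    q = normalize_pose(query)
    distances = [np.linalg.norm(q - normalize_pose(p)) for p in database]
    return int(np.argmin(distances))

# Keypoints: head, neck, left hand, right hand (x, y).
arms_down = [(0, 2), (0, 1), (-1, 0), (1, 0)]
arms_up = [(0, 2), (0, 1), (-1, 2), (1, 2)]
# The query is the arms-up pose, scaled by 2 and translated by (5, 5).
query = [(5, 9), (5, 7), (3, 9), (7, 9)]

match = nearest_pose(query, [arms_down, arms_up])
```

The part segmentation of the matched person could then be warped onto the query person to serve as a synthetic label.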


  • Sketch-to-Image Generation Using Deep Contextual Completion

    When the input to pix2pix translation is a badly drawn sketch, the output follows the input edges due to the strict alignment imposed by the translation process. In this paper we propose sketch-to-image generation, where the output edges do not necessarily follow the input edges. We address the image generation problem using a novel joint image completion approach, where the sketch provides the image context for completing, or generating, the output image. We train a deep generative model to learn the joint distribution of a sketch and the corresponding image by using joint images. Our deep contextual completion approach has several advantages. First, the simple joint image representation allows for a simple and effective definition of losses in the same joint image-sketch space, which avoids complicated issues in cross-domain learning. Second, while the output is related to its input overall, the generated features exhibit more freedom in appearance and do not strictly align with the input features. Third, from the joint image's point of view, image and sketch are no different, so exactly the same deep joint image completion network can be used for image-to-sketch generation. Experiments on three different datasets show that the proposed approach generates more realistic images than state-of-the-art methods on challenging inputs and generalizes well on common categories.

    11/24/2017 ∙ by Yongyi Lu, et al. ∙ 0 share
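
The joint image idea can be pictured as follows. This sketch assumes the joint image is the sketch and the photo placed side by side, with the photo half masked out at test time so that a completion network must fill it in; the side-by-side layout and the function names are assumptions made for illustration.

```python
import numpy as np

def make_joint(sketch, image):
    """Place sketch (left) and image (right) side by side in one array."""
    return np.concatenate([sketch, image], axis=1)

def mask_image_half(joint):
    """Zero out the right half; the mask is 1 where pixels are known."""
    width = joint.shape[1] // 2
    mask = np.ones_like(joint)
    mask[:, width:] = 0.0
    return joint * mask, mask

sketch = np.ones((4, 4))
image = np.full((4, 4), 0.5)
joint = make_joint(sketch, image)
masked, mask = mask_image_half(joint)
```

Masking the left half instead would pose the reverse problem, image-to-sketch generation, with the same network.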


  • Deep Video Generation, Prediction and Completion of Human Action Sequences

    Current deep learning results on video generation are limited, while there are only a few first results on video prediction and no relevant significant results on video completion. This is due to the severe ill-posedness inherent in these three problems. In this paper, we focus on human action videos, and propose a general, two-stage deep framework to generate human action videos with no constraints or an arbitrary number of constraints, which uniformly addresses the three problems: video generation given no input frames, video prediction given the first few frames, and video completion given the first and last frames. To make the problem tractable, in the first stage we train a deep generative model that generates a human pose sequence from random noise. In the second stage, a skeleton-to-image network is trained, which is used to generate a human action video given the complete human pose sequence generated in the first stage. By introducing the two-stage strategy, we sidestep the original ill-posed problems while producing for the first time high-quality video generation/prediction/completion results of much longer duration. We present quantitative and qualitative evaluations to show that our two-stage approach outperforms state-of-the-art methods in video generation, prediction and completion. Our video result demonstration can be viewed at https://iamacewhite.github.io/supp/index.html

    11/23/2017 ∙ by Haoye Cai, et al. ∙ 0 share
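
The three tasks in the abstract above differ only in which frames are given as constraints. The following sketch encodes that as a boolean mask over frame indices; the function name, the task strings, and the `num_given` parameter are hypothetical and serve only to illustrate the unification.

```python
def constraint_mask(num_frames, task, num_given=2):
    """Return a list of booleans: True where a frame is supplied as a constraint."""
    mask = [False] * num_frames
    if task == "generation":
        pass  # no input frames at all
    elif task == "prediction":
        for i in range(min(num_given, num_frames)):
            mask[i] = True  # the first few frames are given
    elif task == "completion":
        if num_frames > 0:
            mask[0] = True   # first frame is given
            mask[-1] = True  # last frame is given
    else:
        raise ValueError(f"unknown task: {task}")
    return mask

generation = constraint_mask(6, "generation")
prediction = constraint_mask(6, "prediction", num_given=2)
completion = constraint_mask(6, "completion")
```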


  • Adversarial Attacks Beyond the Image Space

    Generating adversarial examples is an intriguing problem and an important way of understanding the working mechanism of deep neural networks. Recently, it has attracted a lot of attention in the computer vision community. Most existing approaches generate perturbations in image space, i.e., each pixel can be modified independently. However, it remains unclear whether these adversarial examples are authentic, in the sense that they correspond to actual changes in physical properties. This paper aims at exploring this topic in the contexts of object classification and visual question answering. The baselines are set to be several state-of-the-art deep neural networks which receive 2D input images. We augment these networks with a differentiable 3D rendering layer in front, so that a 3D scene (in physical space) is rendered into a 2D image (in image space), and then mapped to a prediction (in output space). There are two ways, direct and indirect, of attacking the physical parameters. The former back-propagates the gradients of error signals from output space to physical space directly, while the latter first constructs an adversary in image space, and then attempts to find the best solution in physical space that is rendered into this image. An important finding is that attacking physical space is much more difficult, as the direct method, compared with that used in image space, produces a much lower success rate and requires heavier perturbations to be added. On the other hand, the indirect method does not work, suggesting that adversaries generated in image space are inauthentic. By interpreting them in physical space, most of these adversaries can be filtered out, showing promise in defending against adversaries.

    11/20/2017 ∙ by Xiaohui Zeng, et al. ∙ 0 share
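
The direct attack described above pushes gradients from the output back through the renderer to a physical parameter. The toy sketch below shows that chain rule on a deliberately tiny setup: a "renderer" that multiplies per-pixel albedo by a single light intensity, and a linear classifier score. Both components are stand-ins invented for illustration; the paper's renderer and networks are far more complex.

```python
import numpy as np

def render(albedo, light):
    """Toy differentiable renderer: pixel value = albedo * light intensity."""
    return albedo * light

def score(image, weights):
    """Toy linear classifier score for the true class."""
    return float(weights @ image)

def direct_attack_step(albedo, light, weights, step):
    """One signed-gradient step on the physical parameter (light)."""
    # Chain rule: d(score)/d(light) = sum_i weights_i * albedo_i.
    grad_light = float(weights @ albedo)
    # Move against the gradient to lower the true-class score.
    return light - step * float(np.sign(grad_light))

albedo = np.array([0.2, 0.8, 0.5])
weights = np.array([1.0, -0.5, 2.0])
light = 1.0

before = score(render(albedo, light), weights)
attacked_light = direct_attack_step(albedo, light, weights, step=0.1)
after = score(render(albedo, attacked_light), weights)
```

Because one physical parameter changes every pixel at once, the attacker has far fewer degrees of freedom than in image space, which is consistent with the abstract's finding that physical-space attacks are harder.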


  • Image Dehazing using Bilinear Composition Loss Function

    In this paper, we introduce a bilinear composition loss function to address the problem of image dehazing. Previous methods in image dehazing use a two-stage approach which first estimates the transmission map, followed by clear image estimation. The drawback of a two-stage method is that it tends to boost local image artifacts such as noise, aliasing and blocking. This is especially the case for heavy haze images captured with a low-quality device. Our method is based on convolutional neural networks. Unique in our method is the bilinear composition loss function, which directly models the correlations between the transmission map, clear image, and atmospheric light. This allows errors to be back-propagated to each sub-network concurrently, while maintaining the composition constraint to avoid overfitting of each sub-network. We evaluate the effectiveness of our proposed method using both synthetic and real-world examples. Extensive experiments show that our method outperforms state-of-the-art methods, especially for haze images with severe noise and compression.

    10/01/2017 ∙ by Hui Yang, et al. ∙ 0 share
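
The abstract does not spell out the loss, so the following is only a sketch under a stated assumption: that the composition constraint is the standard atmospheric scattering model, I = J * t + A * (1 - t), where J is the clear image, t the transmission map, and A the atmospheric light. The model is bilinear because it multiplies pairs of unknowns (J with t, and A with t). The mean-squared-error form and the function names are assumptions.

```python
import numpy as np

def compose_hazy(clear, transmission, airlight):
    """Atmospheric scattering model: I = J * t + A * (1 - t)."""
    return clear * transmission + airlight * (1.0 - transmission)

def composition_loss(hazy, clear_est, trans_est, airlight_est):
    """Mean squared error between the observed hazy image and its recomposition."""
    recomposed = compose_hazy(clear_est, trans_est, airlight_est)
    return float(np.mean((hazy - recomposed) ** 2))

rng = np.random.default_rng(0)
clear = rng.uniform(0.0, 1.0, size=(4, 4, 3))
trans = rng.uniform(0.2, 0.9, size=(4, 4, 1))
airlight = np.array([0.9, 0.9, 0.9])
hazy = compose_hazy(clear, trans, airlight)

loss_true = composition_loss(hazy, clear, trans, airlight)
loss_wrong = composition_loss(hazy, clear, np.ones_like(trans), airlight)
```

Since the loss depends jointly on all three estimates, its gradient reaches each sub-network at once, which matches the abstract's point about concurrent back-propagation.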


  • Weakly- and Self-Supervised Learning for Content-Aware Deep Image Retargeting

    This paper proposes a weakly- and self-supervised deep convolutional neural network (WSSDCNN) for content-aware image retargeting. Our network takes a source image and a target aspect ratio, and then directly outputs a retargeted image. Retargeting is performed through a shift map, which is a pixel-wise mapping from the source to the target grid. Our method implicitly learns an attention map, which leads to a content-aware shift map for image retargeting. As a result, discriminative parts in an image are preserved, while background regions are adjusted seamlessly. In the training phase, pairs of an image and its image-level annotation are used to compute content and structure losses. We demonstrate the effectiveness of our proposed method for a retargeting application with insightful analyses.

    08/09/2017 ∙ by Donghyeon Cho, et al. ∙ 0 share
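
A shift map, as described above, tells each target pixel where to sample in the source. The sketch below applies a 1D shift map to a single image row with linear interpolation; the backward-sampling direction, the interpolation, and the function name are assumptions made for illustration, and the shift values are hand-picked rather than learned.

```python
import numpy as np

def retarget_row(row, shift):
    """Sample the source row at positions x + shift[x] for each target pixel x."""
    positions = np.arange(len(shift)) + shift
    return np.interp(positions, np.arange(len(row)), row)

source = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0])
# Target row has 4 pixels: keep the first two source pixels, then skip
# two "background" pixels by shifting the remaining samples by 2.
shift = np.array([0.0, 0.0, 2.0, 2.0])
target = retarget_row(source, shift)
```

A content-aware shift map would keep the shift constant across important regions (so they are copied intact) and let it change only in background regions.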


  • Conditional CycleGAN for Attribute Guided Face Image Generation

    State-of-the-art techniques in Generative Adversarial Networks (GANs) such as cycleGAN are able to learn the mapping of one image domain X to another image domain Y using unpaired image data. We extend the cycleGAN to Conditional cycleGAN such that the mapping from X to Y is subject to an attribute condition Z. Using face image generation as an application example, where X is a low-resolution face image, Y is a high-resolution face image, and Z is a set of attributes related to facial appearance (e.g., gender, hair color, smile), we present our method to incorporate Z into the network, such that the hallucinated high-resolution face image Y' satisfies not only the low-resolution constraint inherent in X, but also the attribute condition prescribed by Z. Using a face feature vector extracted from a face verification network as Z, we demonstrate the efficacy of our approach on identity-preserving face image super-resolution. Our approach is general and applicable to high-quality face image generation where specific facial attributes can be controlled easily in the automatically generated results.

    05/28/2017 ∙ by Yongyi Lu, et al. ∙ 0 share
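
The abstract does not say how Z enters the network. One common way to condition an image-to-image model, shown here purely as an assumption, is to tile the attribute vector over the spatial grid and concatenate it to the image as extra channels. The function name and the three example attributes are hypothetical.

```python
import numpy as np

def condition_on_attributes(image, attributes):
    """Tile an attribute vector over the image grid and append it as channels."""
    height, width, _ = image.shape
    tiled = np.broadcast_to(attributes, (height, width, attributes.shape[0]))
    return np.concatenate([image, tiled], axis=-1)

low_res = np.zeros((8, 8, 3))          # X: low-resolution RGB face
attrs = np.array([1.0, 0.0, 1.0])      # Z: e.g. gender, hair color, smile flags
conditioned = condition_on_attributes(low_res, attrs)
```

The generator would then take the 6-channel `conditioned` array as input, so every spatial location sees the same attribute values.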
