Yanwei Fu

Postdoctoral Associate at Disney Research, Assistant Professor at Fudan University

  • A Multi-task Neural Approach for Emotion Attribution, Classification and Summarization

    Emotional content is a crucial ingredient in user-generated videos. However, emotions are expressed only sparsely in such videos, which makes emotion analysis difficult. In this paper, we propose a new neural approach, the Bi-stream Emotion Attribution-Classification Network (BEAC-Net), that solves three related emotion analysis tasks in an integrated framework: emotion recognition, emotion attribution, and emotion-oriented summarization. BEAC-Net has two major constituents: an attribution network and a classification network. The attribution network extracts the main emotional segment that classification should focus on, mitigating the sparsity problem. The classification network processes both the extracted segment and the original video in a bi-stream architecture. We also contribute a new dataset for the emotion attribution task with human-annotated ground-truth labels for emotion segments. Experiments on two video datasets demonstrate the superior performance of the proposed framework and the complementary nature of the two classification streams.

    12/21/2018 ∙ by Guoyun Tu, et al.
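    The attribution-then-classification idea can be sketched in a few lines. This is an illustrative toy, not the paper's network: the function name, window size, and scores below are made up, and BEAC-Net learns the segment boundaries end-to-end rather than by exhaustive search.

```python
import numpy as np

def attribute_segment(frame_scores, win=8):
    """Return (start, end) of the contiguous window with the highest
    mean emotion score -- the segment the classifier should focus on."""
    n = len(frame_scores)
    win = min(win, n)
    # mean score of every contiguous window of length `win`
    means = [np.mean(frame_scores[i:i + win]) for i in range(n - win + 1)]
    start = int(np.argmax(means))
    return start, start + win

# toy per-frame emotion scores: the burst sits at frames 3..5
scores = np.array([0.1, 0.2, 0.1, 0.9, 0.8, 0.95, 0.2, 0.1, 0.1, 0.05])
start, end = attribute_segment(scores, win=3)
```

    In the bi-stream design, the frames in `[start, end)` would feed one stream while the full video feeds the other.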

  • Parsimonious Deep Learning: A Differential Inclusion Approach with Global Convergence

    Over-parameterization is ubiquitous in training neural networks, benefiting both optimization, by easing the search for global optima, and generalization, by reducing prediction error. However, compact networks are desired in many real-world applications, and directly training small networks may become trapped in local optima. In this paper, instead of pruning or distilling an over-parameterized model into a compact one, we propose a parsimonious learning approach based on differential inclusions of inverse scale spaces, which generates a family of models from simple to complex with better efficiency and interpretability than stochastic gradient descent in exploring the model space. It enjoys a simple discretization, the Split Linearized Bregman Iteration, with provable global convergence: from any initialization, the algorithmic iterates converge to a critical point of the empirical risk. The proposed method can be used to grow the complexity of neural networks progressively. Numerical experiments on MNIST, CIFAR-10/100, and ImageNet show that the method is promising for training large-scale models with favorable interpretability.

    05/23/2019 ∙ by Yanwei Fu, et al.
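    The inverse-scale-space idea can be illustrated with a plain (non-split, non-stochastic) Linearized Bregman Iteration on a toy sparse regression problem. All parameter values below are ours, not the paper's; the point is that important coefficients enter the model first, so early iterates are simple (sparse) and later ones more complex.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lbi(A, b, alpha=0.1, kappa=5.0, iters=200):
    """Linearized Bregman Iteration: a gradient step on the dual
    variable z, then a soft-threshold to get the sparse primal x."""
    z = np.zeros(A.shape[1])
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        z -= alpha * A.T @ (A @ x - b)      # gradient step on z
        x = kappa * soft_threshold(z, 1.0)  # sparse primal iterate
    return x

# toy problem with an exactly sparse solution
A = np.eye(5)
b = np.array([3.0, 0.0, 0.0, -2.0, 0.0])
x = lbi(A, b)
```

    Stopping the loop early yields the sparser models on the path; running longer yields denser, more complex ones.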

  • Pixel2Mesh: Generating 3D Mesh Models from Single RGB Images

    We propose an end-to-end deep learning architecture that produces a 3D shape in triangular mesh form from a single color image. Limited by the nature of deep neural networks, previous methods usually represent a 3D shape as a volume or point cloud, and it is non-trivial to convert these to the more ready-to-use mesh representation. Unlike existing methods, our network represents the 3D mesh in a graph-based convolutional neural network and produces correct geometry by progressively deforming an ellipsoid, leveraging perceptual features extracted from the input image. We adopt a coarse-to-fine strategy to keep the whole deformation procedure stable, and define various mesh-related losses that capture properties at different levels to guarantee visually appealing and physically accurate 3D geometry. Extensive experiments show that our method not only qualitatively produces mesh models with better details, but also achieves higher 3D shape estimation accuracy than the state of the art.

    04/05/2018 ∙ by Nanyang Wang, et al.
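    The core graph-convolution-on-a-mesh step can be sketched as follows. This is a minimal toy, assuming identity-like weights and a tetrahedron adjacency we made up; the real Pixel2Mesh also conditions each vertex update on perceptual image features.

```python
import numpy as np

def graph_conv(verts, adj, w_self, w_neigh):
    """One graph-convolution step on mesh vertices: mix each vertex's
    own features with the mean of its neighbors' features."""
    deg = adj.sum(axis=1, keepdims=True)
    neigh = (adj @ verts) / np.maximum(deg, 1.0)  # mean over neighbors
    return verts @ w_self + neigh @ w_neigh

# 4 vertices of a tetrahedron; every vertex neighbors every other
verts = np.array([[0.0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]])
adj = np.ones((4, 4)) - np.eye(4)
out = graph_conv(verts, adj, w_self=np.eye(3), w_neigh=0.5 * np.eye(3))
```

    Stacking such steps, with learned weights, progressively deforms the initial ellipsoid toward the target surface.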

  • Pixel2Mesh++: Multi-View 3D Mesh Generation via Deformation

    We study the problem of shape generation in 3D mesh representation from a few color images with known camera poses. While many previous works learn to hallucinate the shape directly from priors, we instead improve shape quality by leveraging cross-view information with a graph convolutional network. Rather than building a direct mapping from images to 3D shape, our model learns to predict a series of deformations that improve a coarse shape iteratively. Inspired by traditional multiple-view geometry methods, our network samples the area around the initial mesh's vertex locations and reasons about the optimal deformation using perceptual feature statistics built from multiple input images. Extensive experiments show that our model produces accurate 3D shapes that are not only visually plausible from the input viewpoints, but also well aligned to arbitrary viewpoints. With the help of its physically driven architecture, our model also generalizes across different semantic categories, numbers of input images, and qualities of mesh initialization.

    08/05/2019 ∙ by Chao Wen, et al.
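    The sample-and-score refinement of a single vertex can be sketched as below. This is an illustrative stand-in: the real model scores each hypothesis by perceptual feature statistics pooled across the input views, whereas here `score_fn` is just distance to a known target point we invented for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

def refine_vertex(v, score_fn, radius=0.1, n_samples=8):
    """Sample candidate positions around vertex v and keep the one
    with the lowest score; the current position is always a candidate,
    so the refined vertex never scores worse than the original."""
    cands = v + rng.uniform(-radius, radius, size=(n_samples, 3))
    cands = np.vstack([v, cands])
    scores = np.array([score_fn(c) for c in cands])
    return cands[int(np.argmin(scores))]

# stand-in score: distance to a known target surface point
target = np.array([0.05, 0.0, 0.0])
v_new = refine_vertex(np.zeros(3), lambda c: np.linalg.norm(c - target))
```

    Applying such a step to every vertex, and iterating, mirrors the coarse-to-fine deformation series described in the abstract.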

  • Detecting Tiny Moving Vehicles in Satellite Videos

    In recent years, satellite videos have been captured by moving satellite platforms. In contrast to consumer, movie, and common surveillance videos, a satellite video can record a snapshot of a city-scale scene. In the broad field of view of a satellite video, each moving target is very tiny, usually covering only a few pixels per frame. Even worse, noise is present in the video frames, since the background undergoes sub-pixel-level, uneven motion due to the motion of the satellite. We argue that this is a new type of computer vision task, since previous technologies are unable to detect such tiny vehicles efficiently. This paper proposes a novel framework that identifies small moving vehicles in satellite videos. In particular, we offer a novel detection algorithm based on local noise modeling: potential vehicle targets are separated from noise patterns using an exponential probability distribution. Subsequently, a multi-morphological-cue discrimination strategy further distinguishes correct vehicle targets from the remaining noise. Another significant contribution is a series of evaluation protocols for systematically measuring the performance of tiny moving vehicle detection. We manually annotate a satellite video and use it to test our algorithm under different evaluation criteria. The proposed algorithm is also compared with state-of-the-art baselines, demonstrating the advantages of our framework over these benchmarks.

    07/05/2018 ∙ by Wei Ao, et al.
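    The exponential noise-modeling step can be sketched in a few lines. This is a toy under our own assumptions (a known background, globally fitted rate, synthetic noise); the paper's model is local and is followed by the morphological-cue discrimination stage, which is omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)

def detect_candidates(frame, background, p_fp=1e-3):
    """Flag pixels whose background residual is too large to be noise,
    assuming residual magnitudes follow an exponential distribution."""
    resid = np.abs(frame - background)
    lam = 1.0 / resid.mean()        # MLE of the exponential rate
    thresh = -np.log(p_fp) / lam    # tail cut-off with P(X > t) = p_fp
    return resid > thresh

background = np.zeros((20, 20))
frame = rng.exponential(scale=1.0, size=(20, 20))  # pure noise
frame[5, 5] = 50.0                                 # one tiny bright vehicle
mask = detect_candidates(frame, background)
```

    The surviving candidate pixels would then be grouped and filtered by the morphological cues described in the abstract.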

  • S^2-LBI: Stochastic Split Linearized Bregman Iterations for Parsimonious Deep Learning

    This paper proposes a novel Stochastic Split Linearized Bregman Iteration (S^2-LBI) algorithm to efficiently train deep networks. S^2-LBI introduces an iterative regularization path with structural sparsity, combining the computational efficiency of LBI with model selection consistency in learning structural sparsity. The computed solution path intrinsically enables us to enlarge or simplify a network, which theoretically benefits from the dynamics of the S^2-LBI algorithm. Experimental results validate S^2-LBI on the MNIST and CIFAR-10 datasets. For example, on MNIST we can either grow a network with only 1.5K parameters (one convolutional layer of 5 filters and one FC layer) to achieve 98.40% recognition accuracy, or remove 82.5% of the parameters of the LeNet-5 network while still achieving 98.47% recognition accuracy. We also have learning results on ImageNet, which will be added in the next version of this report.

    04/24/2019 ∙ by Yanwei Fu, et al.
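    The stochastic ingredient can be illustrated by running the LBI update on minibatch gradients of a toy least-squares problem. Everything below (step sizes, batch size, the block-identity design matrix) is our own illustration; the actual S^2-LBI additionally uses a variable-splitting scheme over network weights, which this sketch omits.

```python
import numpy as np

rng = np.random.default_rng(1)

def s2_lbi_sketch(A, b, alpha=0.1, kappa=5.0, batch=4, iters=2000):
    """LBI with minibatch gradients: the sparse iterate x traces a
    regularization path from simple (sparse) to complex models."""
    m, n = A.shape
    z = np.zeros(n)
    x = np.zeros(n)
    for _ in range(iters):
        idx = rng.choice(m, size=batch, replace=False)  # minibatch rows
        grad = A[idx].T @ (A[idx] @ x - b[idx]) / batch
        z -= alpha * grad
        x = kappa * np.sign(z) * np.maximum(np.abs(z) - 1.0, 0.0)
    return x

A = np.vstack([np.eye(10)] * 4)      # 40 toy samples, 10 features
x_true = np.zeros(10)
x_true[0], x_true[5] = 3.0, -2.0
b = A @ x_true
x = s2_lbi_sketch(A, b)
```

    Truncating the loop early gives the smaller models on the path; the paper's point is that one solution path serves both growing and simplifying a network.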

  • A Fine-Grained Facial Expression Database for End-to-End Multi-Pose Facial Expression Recognition

    Recent research on facial expression recognition has made much progress thanks to the development of deep learning, but some typical challenges, such as the wide variety of facial expressions and poses, remain unresolved. To address these problems, we develop a new Facial Expression Recognition (FER) framework that incorporates facial poses into the image synthesis and classification process. There are two major novelties in this work. First, we create a new facial expression dataset of more than 200k images covering 119 persons, 4 poses, and 54 expressions. To our knowledge, this is the first dataset to label faces with subtle emotion changes for expression recognition, and the first that is large enough to validate the FER task on unbalanced poses, unbalanced expressions, and zero-shot subject IDs. Second, we propose a facial pose generative adversarial network (FaPE-GAN) that synthesizes new facial expression images to augment the dataset for training, and then learn a LightCNN-based Fa-Net model for expression classification. Finally, we advocate four novel learning tasks on this dataset. The experimental results validate the effectiveness of the proposed approach.

    07/25/2019 ∙ by Wenxuan Wang, et al.
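    The pose-rebalancing role of the GAN augmentation can be sketched with a tiny planning function. This is purely illustrative: the function, pose labels, and balancing-to-the-majority rule are our assumptions, not FaPE-GAN's actual training recipe.

```python
from collections import Counter

def augmentation_plan(pose_labels):
    """How many synthetic images a pose-conditioned generator should
    produce per pose so every pose reaches the majority count."""
    counts = Counter(pose_labels)
    target = max(counts.values())
    return {pose: target - n for pose, n in counts.items()}

plan = augmentation_plan(['front'] * 5 + ['left'] * 2 + ['right'] * 1)
```

    The generated images for under-represented poses then join the real data when training the expression classifier.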

  • AI Challenger : A Large-scale Dataset for Going Deeper in Image Understanding

    Significant progress has been achieved in computer vision by leveraging large-scale image datasets. However, large-scale datasets for complex computer vision tasks beyond classification remain limited. This paper proposes a large-scale dataset named AIC (AI Challenger), with three sub-datasets: human keypoint detection (HKD), a large-scale attribute dataset (LAD), and image Chinese captioning (ICC). In this dataset, we annotate class labels (LAD), keypoint coordinates (HKD), bounding boxes (HKD and LAD), attributes (LAD), and captions (ICC). These rich annotations bridge the semantic gap between low-level images and high-level concepts. The proposed dataset is an effective benchmark for evaluating and improving computational methods, and others can also use it as a resource to pre-train models for related tasks.

    11/17/2017 ∙ by Jiahong Wu, et al.
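    The annotation types listed above can be pictured as one record per sub-dataset. All field names, file names, and formats below are hypothetical, chosen only to show which annotation goes with which sub-dataset; the actual AIC schema may differ.

```python
# Hypothetical record shapes for the three sub-datasets.
hkd_record = {
    "image": "hkd_000001.jpg",
    "keypoints": [[120, 85], [130, 84]],  # (x, y) per body joint
    "bbox": [100, 60, 80, 200],           # x, y, width, height
}
lad_record = {
    "image": "lad_000001.jpg",
    "class_label": "dog",
    "bbox": [10, 20, 60, 50],
    "attributes": {"furry": 1, "has_tail": 1},
}
icc_record = {
    "image": "icc_000001.jpg",
    "captions": ["一只狗在草地上奔跑"],  # Chinese captions
}
```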

  • Left-Right Skip-DenseNets for Coarse-to-Fine Object Categorization

    Inspired by recent neuroscience studies on the left-right asymmetry of the brain in low- and high-spatial-frequency processing, we introduce a novel type of network, the left-right skip-densenet, for coarse-to-fine object categorization. This network enables both coarse and fine-grained classification in a single framework. We also propose, for the first time, a layer-skipping mechanism that learns a gating network to predict whether to skip certain layers at test time. This mechanism gives our network more flexibility and capacity for categorization tasks. Our network is evaluated on three widely used datasets; the results show that it is more promising for coarse-to-fine object categorization than the competitors.

    10/28/2017 ∙ by Changmao Cheng, et al.
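    The layer-skipping mechanism can be sketched with a forward pass that consults a gate before each layer. The toy below stands in for the learned gating network with a fixed skip set, and the "layers" are trivial functions rather than dense blocks.

```python
def forward_with_skips(x, layers, gate):
    """Apply layer i only when gate(x, i) keeps it; return the output
    and the indices of the layers actually executed."""
    used = []
    for i, layer in enumerate(layers):
        if gate(x, i):
            x = layer(x)
            used.append(i)
    return x, used

layers = [lambda v: v + 1, lambda v: v * 2, lambda v: v + 3]
skip = {1}  # pretend the learned gate decides to drop layer 1
out, used = forward_with_skips(1, layers, lambda v, i: i not in skip)
```

    In the paper's setting the gate is itself a small network trained so that easy (coarse) inputs use fewer layers than hard (fine-grained) ones.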

  • Multi-scale Deep Learning Architectures for Person Re-identification

    Person re-identification (re-id) aims to match people across non-overlapping camera views in public spaces. It is a challenging problem because many people captured in surveillance videos wear similar clothes; consequently, the differences in their appearance are often subtle and detectable only at the right locations and scales. Existing re-id models, particularly the recently proposed deep learning based ones, match people at a single scale. In contrast, this paper proposes a novel multi-scale deep learning model. Our model learns deep discriminative feature representations at different scales and automatically determines the most suitable scales for matching. The importance of different spatial locations for extracting discriminative features is also learned explicitly. Experiments demonstrate that the proposed model outperforms the state of the art on a number of benchmarks.

    09/15/2017 ∙ by Xuelin Qian, et al.
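    The "automatically determine the most suitable scales" step can be sketched as a softmax-weighted fusion of per-scale features. The feature vectors and scale scores below are toy values; in the model the per-scale features come from the network and the scale scores are learned.

```python
import numpy as np

def fuse_multiscale(feats, scale_logits):
    """Softmax the scale scores and return the weighted sum of the
    per-scale feature vectors, plus the weights themselves."""
    w = np.exp(scale_logits - np.max(scale_logits))  # stable softmax
    w = w / w.sum()
    return sum(wi * f for wi, f in zip(w, feats)), w

coarse = np.array([1.0, 0.0])   # toy coarse-scale feature
fine = np.array([0.0, 1.0])     # toy fine-scale feature
fused, weights = fuse_multiscale([coarse, fine], np.array([0.0, 0.0]))
```

    A query whose discriminative cues are fine-grained would, after training, receive a larger weight on the fine-scale stream.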

  • A Jointly Learned Deep Architecture for Facial Attribute Analysis and Face Detection in the Wild

    Facial attribute analysis in real-world scenarios is very challenging, mainly because of complex face variations. Existing works on face attribute analysis are mostly based on cropped and aligned face images; as a result, their attribute prediction capability relies heavily on the preprocessing performed by a face detector. To address this problem, we present a novel jointly learned deep architecture for both facial attribute analysis and face detection. Our framework can process natural images in the wild, and our experiments on the CelebA and LFWA datasets clearly show that state-of-the-art performance is obtained.

    07/27/2017 ∙ by Keke He, et al.
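    The joint-learning idea, one shared representation feeding both a detection head and an attribute head so neither depends on a separate cropping step, can be sketched as follows. The weights are random stand-ins and the "images" are flat vectors; this is a structural sketch, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
W_shared = rng.normal(size=(16, 8))  # shared trunk (random stand-in)
W_det = rng.normal(size=(8, 4))      # detection head: x, y, w, h of face box
W_attr = rng.normal(size=(8, 3))     # attribute head: 3 binary attributes

def forward(images):
    """Shared features feed both heads, so detection and attribute
    prediction are trained and run jointly on the uncropped input."""
    h = np.tanh(images @ W_shared)
    boxes = h @ W_det
    attrs = 1.0 / (1.0 + np.exp(-(h @ W_attr)))  # attribute probabilities
    return boxes, attrs

boxes, attrs = forward(rng.normal(size=(2, 16)))  # a batch of 2 fake images
```

    Training would sum a box-regression loss and an attribute-classification loss over the two heads, back-propagating through the shared trunk.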