Thomas Huang


Professor at University of Illinois at Urbana-Champaign, Leader of The Image Formation and Processing (IFP) group, Beckman Institute for Advanced Science and Technology

  • SG-One: Similarity Guidance Network for One-Shot Semantic Segmentation

    One-shot semantic segmentation is the challenging task of recognizing object regions from unseen categories with only one annotated example as supervision. In this paper, we propose a simple yet effective Similarity Guidance network to tackle the One-shot (SG-One) segmentation problem. We aim to predict the segmentation mask of a query image with reference to one densely labeled support image. To obtain a robust representative feature of the support image, we first propose a masked average pooling strategy that produces the guidance features using only the pixels belonging to the support image. We then use cosine similarity to build the relationship between the guidance features and the features of pixels from the query image. In this way, the possibilities embedded in the produced similarity maps can be adopted to guide the process of segmenting objects. Furthermore, our SG-One is a unified framework that can efficiently process both support and query images within one network and be learned in an end-to-end manner. We conduct extensive experiments on Pascal VOC 2012. In particular, our SG-One achieves the mIoU score of 46.3

    10/22/2018 ∙ by Xiaolin Zhang, et al. ∙ 46 share
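
The two operations named in this abstract, masked average pooling and cosine similarity, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the flat list-of-pixels layout, the function names, and the reading of the mask as the annotated object pixels of the support image are all hypothetical.

```python
import math

def masked_average_pooling(features, mask):
    """Average the feature vectors of the pixels selected by the mask.

    features: list of per-pixel feature vectors (hypothetical flat layout).
    mask: list of 0/1 labels, one per pixel (assumed: 1 marks annotated pixels).
    """
    selected = [f for f, m in zip(features, mask) if m]
    if not selected:
        raise ValueError("mask selects no pixels")
    dim = len(selected[0])
    return [sum(f[d] for f in selected) / len(selected) for d in range(dim)]

def cosine_similarity(a, b, eps=1e-8):
    """Cosine of the angle between a and b; eps guards against zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)

def similarity_map(query_features, guidance):
    """One similarity score per query pixel against the guidance vector."""
    return [cosine_similarity(q, guidance) for q in query_features]

# Toy support image: three pixels with 2-D features, the first two annotated.
support_features = [[1.0, 0.0], [3.0, 0.0], [0.0, 5.0]]
support_mask = [1, 1, 0]
guidance = masked_average_pooling(support_features, support_mask)

# Toy query image: the first pixel points the same way as the guidance vector.
query_features = [[4.0, 0.0], [0.0, 2.0]]
sim = similarity_map(query_features, guidance)
```

In this toy case the first query pixel scores close to 1 and the second scores 0, which is the kind of per-pixel map the abstract describes as guiding the segmentation.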


  • When AWGN-based Denoiser Meets Real Noises

    Discriminative learning-based image denoisers have achieved promising performance on synthetic noise such as additive Gaussian noise. However, their performance on images with real noise is often unsatisfactory. The main reason is that real noises are mostly spatially/channel-correlated and spatially/channel-variant. In contrast, the synthetic Additive White Gaussian Noise (AWGN) adopted in most previous work is pixel-independent. In this paper, we propose a novel approach to boost the performance of a real image denoiser that is trained only with synthetic pixel-independent noise data. First, we train a deep model that consists of a noise estimator and a denoiser with mixed AWGN and Random Value Impulse Noise (RVIN). We then investigate a Pixel-shuffle Down-sampling (PD) strategy to adapt the trained model to real noises. Extensive experiments demonstrate the effectiveness and generalization ability of the proposed approach. Notably, our method achieves state-of-the-art performance on real sRGB images in the DND benchmark. Codes are available at

    04/06/2019 ∙ by Yuqian Zhou, et al. ∙ 20 share
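
The abstract names Pixel-shuffle Down-sampling without defining it. A minimal sketch, assuming it means sampling the image on a strided grid so that neighbouring pixels (and their correlated noise) end up in different sub-images; the function name and the list-of-rows image layout are hypothetical.

```python
def pixel_shuffle_downsample(image, stride):
    """Split a 2-D image (list of rows) into stride*stride sub-images.

    Sub-image (i, j) keeps every `stride`-th pixel starting at row i, column j,
    so pixels that were adjacent in the input land in different sub-images.
    """
    return [
        [row[j::stride] for row in image[i::stride]]
        for i in range(stride)
        for j in range(stride)
    ]

image = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]
subs = pixel_shuffle_downsample(image, 2)
```

With stride 2 the 4x4 input yields four 2x2 sub-images. Under this reading, each sub-image would be passed to a denoiser trained on pixel-independent noise and the results interleaved back to full resolution.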


  • Slimmable Neural Networks

    We present a simple and general method to train a single neural network executable at different widths (number of channels in a layer), permitting instant and adaptive accuracy-efficiency trade-offs at runtime. Instead of training individual networks with different width configurations, we train a shared network with switchable batch normalization. At runtime, the network can adjust its width on the fly according to on-device benchmarks and resource constraints, rather than downloading and offloading different models. Our trained networks, named slimmable neural networks, achieve ImageNet classification accuracy similar to (and in many cases better than) that of individually trained MobileNet v1, MobileNet v2, ShuffleNet and ResNet-50 models at the respective widths. We also demonstrate better performance of slimmable models compared with individual ones across a wide range of applications, including COCO bounding-box object detection, instance segmentation and person keypoint detection, without tuning hyper-parameters. Lastly, we visualize and discuss the learned features of slimmable networks. Code and models are available at:

    12/21/2018 ∙ by Jiahui Yu, et al. ∙ 16 share
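
The idea of one shared network with switchable normalization can be illustrated with a single toy layer. This is a sketch under assumptions, not the released code: the class name is hypothetical, a narrower width is assumed to reuse the first output channels of the shared weights, and fixed per-width (mean, std) pairs stand in for learned batch-normalization statistics.

```python
class SlimmableLayer:
    """A linear layer with shared weights and one set of norm stats per width."""

    def __init__(self, weights, norm_stats):
        self.weights = weights        # shared: one row per output channel
        self.norm_stats = norm_stats  # switchable: width -> [(mean, std), ...]

    def forward(self, x, width):
        # Assumption: a narrower width uses the first `width` output channels.
        active = self.weights[:width]
        out = [sum(w * v for w, v in zip(row, x)) for row in active]
        # Switch to the normalization statistics that belong to this width.
        stats = self.norm_stats[width]
        return [(o - mean) / std for o, (mean, std) in zip(out, stats)]

layer = SlimmableLayer(
    weights=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]],
    norm_stats={
        2: [(0.0, 1.0), (0.0, 1.0)],
        4: [(1.0, 2.0)] * 4,
    },
)
x = [2.0, 4.0]
narrow = layer.forward(x, width=2)
full = layer.forward(x, width=4)
```

The same weight rows serve both widths; only the normalization statistics are swapped, which is the role the abstract assigns to switchable batch normalization.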


  • TS2C: Tight Box Mining with Surrounding Segmentation Context for Weakly Supervised Object Detection

    This work provides a simple approach to discover tight object bounding boxes with only image-level supervision, called Tight box mining with Surrounding Segmentation Context (TS2C). We observe that object candidates mined through current multiple instance learning methods are usually trapped in discriminative object parts rather than covering the entire object. TS2C leverages surrounding segmentation context derived from weakly-supervised segmentation to suppress such low-quality distracting candidates and boost the high-quality ones. Specifically, TS2C is developed based on two key properties of desirable bounding boxes: 1) high purity, meaning most pixels in the box have high object response, and 2) high completeness, meaning the box covers high object response pixels comprehensively. With such novel and computable criteria, tighter candidates can be discovered for learning a better object detector. With TS2C, we obtain 48.0 which are the new state-of-the-arts.

    07/13/2018 ∙ by Yunchao Wei, et al. ∙ 10 share
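
The two box properties can be made concrete with a small scoring function. This is one plausible reading of "purity" and "completeness" as plain ratios over a thresholded response map; the threshold, the formulas, and the function name are assumptions, and the abstract does not say how TS2C itself computes them from the surrounding context.

```python
def purity_and_completeness(response, box, threshold=0.5):
    """Score a box against a 2-D object response map.

    box = (top, left, bottom, right), with bottom and right exclusive.
    purity: share of the box area covered by high-response pixels.
    completeness: share of all high-response pixels that fall inside the box.
    """
    top, left, bottom, right = box
    inside_high = 0
    total_high = 0
    for r, row in enumerate(response):
        for c, value in enumerate(row):
            if value >= threshold:
                total_high += 1
                if top <= r < bottom and left <= c < right:
                    inside_high += 1
    area = (bottom - top) * (right - left)
    purity = inside_high / area if area else 0.0
    completeness = inside_high / total_high if total_high else 0.0
    return purity, completeness

# Toy response map: a 2x2 object in the middle of a 4x4 image.
response = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.9, 0.9, 0.0],
    [0.0, 0.9, 0.9, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
tight = purity_and_completeness(response, (1, 1, 3, 3))
part = purity_and_completeness(response, (1, 1, 2, 3))
loose = purity_and_completeness(response, (0, 0, 4, 4))
```

In this toy case a box around only part of the object scores high purity but low completeness, an oversized box scores the reverse, and only the tight box scores high on both.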


  • High Frequency Residual Learning for Multi-Scale Image Classification

    We present a novel high frequency residual learning framework, which leads to a highly efficient multi-scale network (MSNet) architecture for mobile and embedded vision problems. The architecture utilizes two networks: a low resolution network to efficiently approximate low frequency components and a high resolution network to learn high frequency residuals by reusing the upsampled low resolution features. With a classifier calibration module, MSNet can dynamically allocate computation resources during inference to achieve a better speed and accuracy trade-off. We evaluate our methods on the challenging ImageNet-1k dataset and observe consistent improvements over different base networks. On ResNet-18 and MobileNet with alpha=1.0, MSNet gains 1.5 over both architectures without increasing computations. On the more efficient MobileNet with alpha=0.25, our method gains 3.8 of computations.

    05/07/2019 ∙ by Bowen Cheng, et al. ∙ 10 share


  • Devil in the Details: Towards Accurate Single and Multiple Human Parsing

    Human parsing has received considerable interest due to its wide application potential. Nevertheless, it is still unclear how to develop an accurate human parsing system in an efficient and elegant way. In this paper, we identify several useful properties, including feature resolution, global context information and edge details, and perform rigorous analyses to reveal how to leverage them to benefit the human parsing task. The advantages of these properties result in a simple yet effective Context Embedding with Edge Perceiving (CE2P) framework for single human parsing. Our CE2P is end-to-end trainable and can be easily adopted for multiple human parsing. Benefiting from the superiority of CE2P, we achieved 1st place on all three human parsing benchmarks. Without any bells and whistles, we achieved 56.50% (mIoU), 45.31% (mean AP^r) and 33.34% (AP^p_0.5) on LIP, CIHP and MHP v2.0, which outperform the state of the art by more than 2.06%, 3.81% and 1.87%, respectively. We hope our CE2P will serve as a solid baseline and help ease future research in single/multiple human parsing. Code has been made available at <>.

    09/17/2018 ∙ by Ting Liu, et al. ∙ 8 share


  • Towards Instance-level Image-to-Image Translation

    Unpaired image-to-image translation is an emerging and challenging vision problem that aims to learn a mapping between unaligned image pairs in diverse domains. Recent advances in this field, like MUNIT and DRIT, mainly focus on first disentangling content and style/attribute from a given image, then directly adopting the global style to guide the model to synthesize new domain images. However, this kind of approach incurs severe contradictions if the target domain images are content-rich with multiple discrepant objects. In this paper, we present a simple yet effective instance-aware image-to-image translation approach (INIT), which applies fine-grained local (instance) and global styles to the target image spatially. The proposed INIT exhibits three important advantages: (1) the instance-level objective loss can help learn a more accurate reconstruction and incorporate diverse attributes of objects; (2) the styles used for local/global areas of the target domain come from the corresponding spatial regions in the source domain, which intuitively is a more reasonable mapping; (3) the joint training process can benefit both fine and coarse granularity and incorporates instance information to improve the quality of global translation. We also collect a large-scale benchmark for the new instance-level translation task. We observe that our synthetic images can even benefit real-world vision tasks like generic object detection.

    05/05/2019 ∙ by Zhiqiang Shen, et al. ∙ 8 share


  • Universally Slimmable Networks and Improved Training Techniques

    Slimmable networks are a family of neural networks that can instantly adjust their runtime width. The width can be chosen from a predefined set of widths to adaptively optimize accuracy-efficiency trade-offs at runtime. In this work, we propose a systematic approach to train universally slimmable networks (US-Nets), extending slimmable networks to execute at arbitrary width and generalizing to networks both with and without batch normalization layers. We further propose two improved training techniques for US-Nets, named the sandwich rule and inplace distillation, to enhance the training process and boost testing accuracy. We show improved performance of universally slimmable MobileNet v1 and MobileNet v2 on the ImageNet classification task, compared with individually trained ones and 4-switch slimmable network baselines. We also evaluate the proposed US-Nets and improved training techniques on image super-resolution and deep reinforcement learning. Extensive ablation experiments on these representative tasks demonstrate the effectiveness of our proposed methods. Our discovery opens up the possibility of directly evaluating the FLOPs-accuracy spectrum of network architectures. Code and models will be available at:

    03/12/2019 ∙ by Jiahui Yu, et al. ∙ 6 share
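
The abstract names the sandwich rule without defining it. A minimal sketch, assuming the rule means that every training step runs the largest width, the smallest width, and a few randomly sampled widths in between; the function name, the default of two random widths, and the uniform sampling are illustrative assumptions.

```python
import random

def sandwich_widths(min_width, max_width, num_random=2, rng=random):
    """Pick the widths to train in one step: both extremes plus random ones."""
    middle = [rng.uniform(min_width, max_width) for _ in range(num_random)]
    return [max_width, min_width] + middle

# A seeded generator keeps the example reproducible.
widths = sandwich_widths(0.25, 1.0, num_random=2, rng=random.Random(0))
```

Under this reading, a training loop would run a forward and backward pass at each returned width and accumulate the gradients before a single optimizer step.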


  • Self-produced Guidance for Weakly-supervised Object Localization

    Weakly supervised methods usually generate localization results based on attention maps produced by classification networks. However, the attention maps highlight only the most discriminative parts of the object, which are small and sparse. We propose to generate Self-produced Guidance (SPG) masks, which separate the foreground (the object of interest) from the background to provide the classification networks with spatial correlation information of pixels. A stagewise approach is proposed to incorporate high-confidence object regions to learn the SPG masks. The high-confidence regions within attention maps are used to progressively learn the SPG masks. The masks are then used as auxiliary pixel-level supervision to facilitate the training of classification networks. Extensive experiments on ILSVRC demonstrate that SPG is effective in producing high-quality object localization maps. Particularly, the proposed SPG achieves the Top-1 localization error rate of 43.83 validation set, which is a new state-of-the-art error rate.

    07/24/2018 ∙ by Xiaolin Zhang, et al. ∙ 4 share


  • Decoupled Classification Refinement: Hard False Positive Suppression for Object Detection

    In this paper, we analyze failure cases of state-of-the-art detectors and observe that most hard false positives result from classification rather than localization, and that they have a large negative impact on the performance of object detectors. We conjecture there are three factors: (1) the shared feature representation is not optimal due to the mismatched goals of feature learning for classification and localization; (2) multi-task learning helps, yet optimization of the multi-task loss may be sub-optimal for individual tasks; (3) a large receptive field for different scales leads to redundant context information for small objects. We demonstrate the potential of detector classification power with a simple, effective, and widely applicable Decoupled Classification Refinement (DCR) network. In particular, DCR places a separate classification network in parallel with the localization network (base detector). With ROI Pooling placed at an early stage of the classification network, we enforce an adaptive receptive field in DCR. During training, DCR samples hard false positives from the base detector and trains a strong classifier to refine classification results. During testing, DCR refines all boxes from the base detector. Experiments show competitive results on PASCAL VOC and COCO without any bells and whistles. Our codes are available at:

    10/05/2018 ∙ by Bowen Cheng, et al. ∙ 4 share


  • Weakly Supervised Scene Parsing with Point-based Distance Metric Learning

    Semantic scene parsing suffers from the fact that pixel-level annotations are hard to collect. To tackle this issue, we propose Point-based Distance Metric Learning (PDML). PDML does not require densely annotated masks and only leverages several labeled points, which are much easier to obtain, to guide the training process. Concretely, we leverage the semantic relationship among the annotated points by encouraging the feature representations of the intra- and inter-category points to stay consistent, i.e. points within the same category should have more similar feature representations than those from different categories. We formulate this characteristic as a simple distance metric loss, which works together with the point-wise cross-entropy loss to optimize the deep neural networks. Furthermore, to fully exploit the limited annotations, distance metric learning is conducted across different training images rather than in a purely image-dependent manner. We conduct extensive experiments on two challenging scene parsing benchmarks, PASCAL-Context and ADE 20K, to validate the effectiveness of our PDML, and competitive mIoU scores are achieved.

    11/06/2018 ∙ by Rui Qian, et al. ∙ 4 share
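
The constraint that same-category points should sit closer in feature space than different-category points can be written as a margin loss over labeled points. The triplet-margin form below is a swapped-in, commonly used formulation chosen for illustration; the abstract does not give the exact loss, and the function names, the Euclidean distance, and the margin value are assumptions.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def point_metric_loss(features, labels, margin=1.0):
    """Mean hinge loss over (anchor, same-category, other-category) triplets.

    features: feature vectors of the annotated points; the points may come
    from different training images.
    labels: category of each point.
    """
    losses = []
    n = len(features)
    for a in range(n):
        for p in range(n):
            if p == a or labels[p] != labels[a]:
                continue  # positive must be a different point, same category
            for neg in range(n):
                if labels[neg] == labels[a]:
                    continue  # negative must come from another category
                d_pos = euclidean(features[a], features[p])
                d_neg = euclidean(features[a], features[neg])
                losses.append(max(0.0, d_pos - d_neg + margin))
    return sum(losses) / len(losses) if losses else 0.0

labels = ["sky", "sky", "road"]
# Well separated: the two "sky" points are close, the "road" point is far away.
loss_good = point_metric_loss([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0]], labels)
# Poorly separated: the "road" point sits between the two "sky" points.
loss_bad = point_metric_loss([[0.0, 0.0], [0.0, 3.0], [1.0, 0.0]], labels)
```

The loss is zero once every same-category pair is closer than every cross-category pair by at least the margin, and positive otherwise. In training it would be added to the point-wise cross-entropy loss that the abstract mentions.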
