Tae-Hyun Oh


Postdoctoral Associate at MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) since 2017; Visiting Scholar at Qatar Computing Research Institute since 2017; Research Intern at Microsoft Research, Redmond, 2016; Research Intern at Microsoft Research Asia, 2014-2015; Ph.D. in Computer Vision and Machine Learning at Korea Advanced Institute of Science and Technology (KAIST), 2012-2017

  • Speech2Face: Learning the Face Behind a Voice

    How much can we infer about a person's looks from the way they speak? In this paper, we study the task of reconstructing a facial image of a person from a short audio recording of that person speaking. We design and train a deep neural network to perform this task using millions of natural Internet/YouTube videos of people speaking. During training, our model learns voice-face correlations that allow it to produce images that capture various physical attributes of the speakers such as age, gender and ethnicity. This is done in a self-supervised manner, by utilizing the natural co-occurrence of faces and speech in Internet videos, without the need to model attributes explicitly. We evaluate and numerically quantify how--and in what manner--our Speech2Face reconstructions, obtained directly from audio, resemble the true face images of the speakers.

    05/23/2019 ∙ by Tae-Hyun Oh, et al.

  • Gradient-based Camera Exposure Control for Outdoor Mobile Platforms

    We introduce a novel method to automatically adjust camera exposure for image processing and computer vision applications of mobile robot platforms. Since most image processing algorithms heavily rely on low-level image features, which are largely based on local gradient information, we consider a gradient quantity to determine a proper exposure level, so that a camera is able to capture important image features robust to illumination conditions. We extend it to a multi-camera system and present a new control algorithm to achieve both brightness consistency between adjacent cameras and a proper exposure level for each camera. We implement our prototype system with off-the-shelf machine vision cameras and demonstrate the effectiveness of the proposed algorithms on practical applications: pedestrian detection, visual odometry, surround-view imaging, panoramic imaging, and stereo matching.

    08/24/2017 ∙ by Inwook Shim, et al.

  • Weakly- and Self-Supervised Learning for Content-Aware Deep Image Retargeting

    This paper proposes a weakly- and self-supervised deep convolutional neural network (WSSDCNN) for content-aware image retargeting. Our network takes a source image and a target aspect ratio, and then directly outputs a retargeted image. Retargeting is performed through a shift map, which is a pixel-wise mapping from the source to the target grid. Our method implicitly learns an attention map, which leads to a content-aware shift map for image retargeting. As a result, discriminative parts in an image are preserved, while background regions are adjusted seamlessly. In the training phase, pairs of an image and its image-level annotation are used to compute content and structure losses. We demonstrate the effectiveness of our proposed method for a retargeting application with insightful analyses.

    08/09/2017 ∙ by Donghyeon Cho, et al.

  • Co-domain Embedding using Deep Quadruplet Networks for Unseen Traffic Sign Recognition

    Recent advances in visual recognition show overarching success by virtue of large amounts of supervised data. However, the acquisition of a large supervised dataset is often challenging. This is also true for intelligent transportation applications, e.g., traffic sign recognition. For example, a model trained with data of one country may not be easily generalized to another country without much data. We propose a novel feature embedding scheme for unseen class classification when the representative class template is given. Traffic signs, unlike other objects, have official images. We perform co-domain embedding using a quadruple relationship from real and synthetic domains. Our quadruplet network fully utilizes the explicit pairwise similarity relationships among samples from different domains. We validate our method on three datasets with two experiments involving one-shot classification and feature generalization. The results show that the proposed method outperforms competing approaches on both seen and unseen classes.

    12/05/2017 ∙ by Junsik Kim, et al.

  • Textually Customized Video Summaries

    The best summary of a long video differs among different people due to its highly subjective nature. Even for the same person, the best summary may change with time or mood. In this paper, we introduce the task of generating customized video summaries through simple text. First, we train a deep architecture to effectively learn semantic embeddings of video frames by leveraging the abundance of image-caption data in a progressive and residual manner. Given a user-specific text description, our algorithm is able to select semantically relevant video segments and produce a temporally aligned video summary. In order to evaluate our textually customized video summaries, we conduct an experimental comparison with baseline methods that utilize ground-truth information. Despite the challenging baselines, our method still manages to show comparable or even superior performance. We also show that our method is able to generate semantically diverse video summaries by only utilizing the learned visual embeddings.

    02/06/2017 ∙ by Jinsoo Choi, et al.

  • Human Attention Estimation for Natural Images: An Automatic Gaze Refinement Approach

    Photo collections and their applications today attempt to reflect user interactions in various forms. Moreover, photo collections aim to capture the users' intention with minimum effort through applications capturing user intentions. Human interest regions in an image carry powerful information about the user's behavior and can be used in many photo applications. Research on human visual attention has been conducted in the form of gaze tracking and computational saliency models in the computer vision community, and has shown considerable progress. This paper presents an integration of implicit gaze estimation and a computational saliency model to effectively estimate human attention regions in images on the fly. Furthermore, our method estimates human attention via implicit calibration and incremental model updating without any active participation from the user. We also present extensive analysis and possible applications for personal photo collections.

    01/12/2016 ∙ by Jinsoo Choi, et al.

  • Pseudo-Bayesian Robust PCA: Algorithms and Analyses

    Commonly used in computer vision and other applications, robust PCA represents an algorithmic attempt to reduce the sensitivity of classical PCA to outliers. The basic idea is to learn a decomposition of some data matrix of interest into low rank and sparse components, the latter representing unwanted outliers. Although the resulting optimization problem is typically NP-hard, convex relaxations provide a computationally-expedient alternative with theoretical support. However, in practical regimes performance guarantees break down and a variety of non-convex alternatives, including Bayesian-inspired models, have been proposed to boost estimation quality. Unfortunately though, without additional a priori knowledge none of these methods can significantly expand the critical operational range such that exact principal subspace recovery is possible. Into this mix we propose a novel pseudo-Bayesian algorithm that explicitly compensates for design weaknesses in many existing non-convex approaches leading to state-of-the-art performance with a sound analytical foundation. Surprisingly, our algorithm can even outperform convex matrix completion despite the fact that the latter is provided with perfect knowledge of which entries are not corrupted.

    12/07/2015 ∙ by Tae-Hyun Oh, et al.

  • Fast Randomized Singular Value Thresholding for Low-rank Optimization

    Rank minimization can be converted into tractable surrogate problems, such as Nuclear Norm Minimization (NNM) and Weighted NNM (WNNM). The problems related to NNM, or WNNM, can be solved iteratively by applying a closed-form proximal operator, called Singular Value Thresholding (SVT), or Weighted SVT, but they suffer from the high computational cost of Singular Value Decomposition (SVD) at each iteration. We propose a fast and accurate approximation method for SVT, which we call fast randomized SVT (FRSVT), with which we avoid direct computation of SVD. The key idea is to extract an approximate basis for the range of the matrix from its compressed matrix. Given the basis, we compute partial singular values of the original matrix from the small factored matrix. In addition, by developing a range propagation method, our method further speeds up the extraction of the approximate basis at each iteration. Our theoretical analysis shows the relationship between the approximation bound of SVD and its effect on NNM via SVT. Along with the analysis, our empirical results quantitatively and qualitatively show that our approximation rarely harms the convergence of the host algorithms. We assess the efficiency and accuracy of the proposed method on various computer vision problems, e.g., subspace clustering, weather artifact removal, and simultaneous multi-image alignment and rectification.

    09/01/2015 ∙ by Tae-Hyun Oh, et al.

  • Partial Sum Minimization of Singular Values in Robust PCA: Algorithm and Applications

    Robust Principal Component Analysis (RPCA) via rank minimization is a powerful tool for recovering the underlying low-rank structure of clean data corrupted with sparse noise/outliers. In many low-level vision problems, not only is it known that the underlying structure of clean data is low-rank, but the exact rank of clean data is also known. Yet, when applying conventional rank minimization for those problems, the objective function is formulated in a way that does not fully utilize a priori target rank information about the problems. This observation motivates us to investigate whether there is a better alternative solution when using rank minimization. In this paper, instead of minimizing the nuclear norm, we propose to minimize the partial sum of singular values, which implicitly encourages the target rank constraint. Our experimental analyses show that, when the number of samples is deficient, our approach leads to a higher success rate than conventional rank minimization, while the solutions obtained by the two approaches are almost identical when the number of samples is more than sufficient. We apply our approach to various low-level vision problems, e.g., high dynamic range imaging, motion edge detection, photometric stereo, image alignment and recovery, and show that our results outperform those obtained by the conventional nuclear norm rank minimization method.

    03/04/2015 ∙ by Tae-Hyun Oh, et al.

  • Disjoint Multi-task Learning between Heterogeneous Human-centric Tasks

    Human behavior understanding is arguably one of the most important mid-level components in artificial intelligence. In order to efficiently make use of data, multi-task learning has been studied in diverse computer vision tasks including human behavior understanding. However, multi-task learning relies on task-specific datasets, and constructing such datasets can be cumbersome. It requires huge amounts of data, labeling efforts, statistical consideration, etc. In this paper, we leverage existing single-task datasets for human action classification and captioning data for efficient human behavior learning. Since the data in each dataset has respective heterogeneous annotations, traditional multi-task learning is not effective in this scenario. To this end, we propose a novel alternating directional optimization method to efficiently learn from the heterogeneous data. We demonstrate the effectiveness of our model and show performance improvements on both classification and sentence retrieval tasks in comparison to the models trained on each of the single-task datasets.

    02/14/2018 ∙ by Dong-Jin Kim, et al.

  • Learning-based Video Motion Magnification

    Video motion magnification techniques allow us to see small motions previously invisible to the naked eye, such as those of vibrating airplane wings, or swaying buildings under the influence of the wind. Because the motion is small, the magnification results are prone to noise or excessive blurring. The state of the art relies on hand-designed filters to extract motion representations that may not be optimal. In this paper, we seek to learn the filters directly from examples using deep convolutional neural networks. To make training tractable, we carefully design a synthetic dataset that captures small motion well, and use two-frame input for training. We show that the learned filters achieve high-quality results on real videos, with fewer ringing artifacts and better noise characteristics than previous methods. While our model is not trained with temporal filters, we found that the temporal filters can be used with our extracted representations up to a moderate magnification, enabling a frequency-based motion selection. Finally, we analyze the learned filters and show that they behave similarly to the derivative filters used in previous works.

    04/08/2018 ∙ by Tae-Hyun Oh, et al.
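
The Singular Value Thresholding (SVT) operator named in the FRSVT abstract, and the partial-sum variant suggested by the partial sum minimization abstract, can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: it uses a full SVD (exactly the cost FRSVT is designed to avoid), and the function names `svt` and `partial_svt`, the parameter `target_rank`, and the toy data are assumptions introduced here. The partial version assumes the proximal step leaves the largest `target_rank` singular values untouched and soft-thresholds only the rest.

```python
import numpy as np


def svt(Y, tau):
    """Soft-threshold every singular value of Y by tau (proximal step of the nuclear norm)."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)
    # Scale the columns of U by the shrunk singular values, then recombine.
    return (U * s_shrunk) @ Vt


def partial_svt(Y, tau, target_rank):
    """Keep the largest `target_rank` singular values as-is; soft-threshold the remainder.

    Assumed form of the proximal step for the partial sum of singular values.
    """
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    s_out = s.copy()
    s_out[target_rank:] = np.maximum(s[target_rank:] - tau, 0.0)
    return (U * s_out) @ Vt


if __name__ == "__main__":
    # Toy data (hypothetical): a rank-2 matrix plus small dense noise.
    rng = np.random.default_rng(0)
    low_rank = rng.standard_normal((20, 2)) @ rng.standard_normal((2, 15))
    Y = low_rank + 0.01 * rng.standard_normal((20, 15))

    X_svt = svt(Y, tau=1.0)
    X_psvt = partial_svt(Y, tau=1.0, target_rank=2)

    print("rank after SVT:        ", np.linalg.matrix_rank(X_svt))
    print("rank after partial SVT:", np.linalg.matrix_rank(X_psvt))
    print("error vs. low-rank part (SVT):        ", np.linalg.norm(X_svt - low_rank))
    print("error vs. low-rank part (partial SVT):", np.linalg.norm(X_psvt - low_rank))
```

Plain SVT shrinks the leading singular values along with the noise, whereas the partial version only penalizes the tail beyond the known target rank. That difference is the intuition behind using a priori rank information, as the partial sum minimization abstract describes.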