Justin Johnson

is this you? claim profile


Justin Johnson is a Ph.D. student in the Stanford Vision Lab, advised by Professor Fei-Fei Li. His primary interest is in the study of deep learning, especially as it pertains to computer vision.

The Stanford Vision Lab focuses on both computer vision and human vision and the relationship between the two. With reference to human vision, the lab studies the underlying neural mechanisms that allow the human visual system to perform the tasks that it does at the high level that it does. With reference to computer vision, the lab attempts to develop algorithms that perform various visual perception tasks such as object recognition and human motion recognition.

Since his PhD studies started in 2012, Johnson has published 12 research papers in the area of human and computer vision. He co-authored “Perceptual Losses for Real-Time Style Transfer and Super-Resolution,” which explores the benefits of training feed-forward transformation networks with perceptual loss functions in order to combine the benefits of both feed-forward image transformation tasks and optimized-based methods for image generation. The publication has been cited over 600 times since 2016.

Johnson graduated from the Illinois Mathematics and Science Academy in 2008 and went on to obtain a B.S. in Mathematics and Computer Science from Caltech in 2012 before starting his Ph.D. program at Stanford. He spent three consecutive summers as a software engineer intern at Google from 2011-2013 in Mountain View, Los Angeles, and San Francisco, CA, respectively. He has also done research as an intern at Yahoo in the summer of 2014 and at Facebook in Seattle in 2016.

  • On Network Design Spaces for Visual Recognition

    Over the past several years progress in designing better neural network architectures for visual recognition has been substantial. To help sustain this rate of progress, in this work we propose to reexamine the methodology for comparing network architectures. In particular, we introduce a new comparison paradigm of distribution estimates, in which network design spaces are compared by applying statistical techniques to populations of sampled models, while controlling for confounding factors like network complexity. Compared to current methodologies of comparing point and curve estimates of model families, distribution estimates paint a more complete picture of the entire design landscape. As a case study, we examine design spaces used in neural architecture search (NAS). We find significant statistical differences between recent NAS design space variants that have been largely overlooked. Furthermore, our analysis reveals that the design spaces for standard model families like ResNeXt can be comparable to the more complex ones used in recent NAS work. We hope these insights into distribution analysis will enable more robust progress toward discovering better networks for visual recognition.

    05/30/2019 ∙ by Ilija Radosavovic, et al. ∙ 16 share

    read it

  • Mesh R-CNN

    Rapid advances in 2D perception have led to systems that accurately detect objects in real-world images. However, these systems make predictions in 2D, ignoring the 3D structure of the world. Concurrently, advances in 3D shape prediction have mostly focused on synthetic benchmarks and isolated objects. We unify advances in these two areas. We propose a system that detects objects in real-world images and produces a triangle mesh giving the full 3D shape of each detected object. Our system, called Mesh R-CNN, augments Mask R-CNN with a mesh prediction branch that outputs meshes with varying topological structure by first predicting coarse voxel representations which are converted to meshes and refined with a graph convolution network operating over the mesh's vertices and edges. We validate our mesh prediction branch on ShapeNet, where we outperform prior work on single-image shape prediction. We then deploy our full Mesh R-CNN system on Pix3D, where we jointly detect objects and predict their 3D shapes.

    06/06/2019 ∙ by Georgia Gkioxari, et al. ∙ 5 share

    read it

  • PHYRE: A New Benchmark for Physical Reasoning

    Understanding and reasoning about physics is an important ability of intelligent agents. We develop the PHYRE benchmark for physical reasoning that contains a set of simple classical mechanics puzzles in a 2D physical environment. The benchmark is designed to encourage the development of learning algorithms that are sample-efficient and generalize well across puzzles. We test several modern learning algorithms on PHYRE and find that these algorithms fall short in solving the puzzles efficiently. We expect that PHYRE will encourage the development of novel sample-efficient agents that learn efficient but useful models of physics. For code and to play PHYRE for yourself, please visit https://player.phyre.ai.

    08/15/2019 ∙ by Anton Bakhtin, et al. ∙ 4 share

    read it

  • Love Thy Neighbors: Image Annotation by Exploiting Image Metadata

    Some images that are difficult to recognize on their own may become more clear in the context of a neighborhood of related images with similar social-network metadata. We build on this intuition to improve multilabel image annotation. Our model uses image metadata nonparametrically to generate neighborhoods of related images using Jaccard similarities, then uses a deep neural network to blend visual information from the image and its neighbors. Prior work typically models image metadata parametrically, in contrast, our nonparametric treatment allows our model to perform well even when the vocabulary of metadata changes between training and testing. We perform comprehensive experiments on the NUS-WIDE dataset, where we show that our model outperforms state-of-the-art methods for multilabel image annotation even when our model is forced to generalize to new types of metadata.

    08/30/2015 ∙ by Justin Johnson, et al. ∙ 0 share

    read it

  • DenseCap: Fully Convolutional Localization Networks for Dense Captioning

    We introduce the dense captioning task, which requires a computer vision system to both localize and describe salient regions in images in natural language. The dense captioning task generalizes object detection when the descriptions consist of a single word, and Image Captioning when one predicted region covers the full image. To address the localization and description task jointly we propose a Fully Convolutional Localization Network (FCLN) architecture that processes an image with a single, efficient forward pass, requires no external regions proposals, and can be trained end-to-end with a single round of optimization. The architecture is composed of a Convolutional Network, a novel dense localization layer, and Recurrent Neural Network language model that generates the label sequences. We evaluate our network on the Visual Genome dataset, which comprises 94,000 images and 4,100,000 region-grounded captions. We observe both speed and accuracy improvements over baselines based on current state of the art approaches in both generation and retrieval settings.

    11/24/2015 ∙ by Justin Johnson, et al. ∙ 0 share

    read it

  • Visualizing and Understanding Recurrent Networks

    Recurrent Neural Networks (RNNs), and specifically a variant with Long Short-Term Memory (LSTM), are enjoying renewed interest as a result of successful applications in a wide range of machine learning problems that involve sequential data. However, while LSTMs provide exceptional results in practice, the source of their performance and their limitations remain rather poorly understood. Using character-level language models as an interpretable testbed, we aim to bridge this gap by providing an analysis of their representations, predictions and error types. In particular, our experiments reveal the existence of interpretable cells that keep track of long-range dependencies such as line lengths, quotes and brackets. Moreover, our comparative analysis with finite horizon n-gram models traces the source of the LSTM improvements to long-range structural dependencies. Finally, we provide analysis of the remaining errors and suggests areas for further study.

    06/05/2015 ∙ by Andrej Karpathy, et al. ∙ 0 share

    read it

  • Characterizing and Improving Stability in Neural Style Transfer

    Recent progress in style transfer on images has focused on improving the quality of stylized images and speed of methods. However, real-time methods are highly unstable resulting in visible flickering when applied to videos. In this work we characterize the instability of these methods by examining the solution set of the style transfer objective. We show that the trace of the Gram matrix representing style is inversely related to the stability of the method. Then, we present a recurrent convolutional network for real-time video style transfer which incorporates a temporal consistency loss and overcomes the instability of prior methods. Our networks can be applied at any resolution, do not re- quire optical flow at test time, and produce high quality, temporally consistent stylized videos in real-time.

    05/05/2017 ∙ by Agrim Gupta, et al. ∙ 0 share

    read it

  • Perceptual Losses for Real-Time Style Transfer and Super-Resolution

    We consider image transformation problems, where an input image is transformed into an output image. Recent methods for such problems typically train feed-forward convolutional neural networks using a per-pixel loss between the output and ground-truth images. Parallel work has shown that high-quality images can be generated by defining and optimizing perceptual loss functions based on high-level features extracted from pretrained networks. We combine the benefits of both approaches, and propose the use of perceptual loss functions for training feed-forward networks for image transformation tasks. We show results on image style transfer, where a feed-forward network is trained to solve the optimization problem proposed by Gatys et al in real-time. Compared to the optimization-based method, our network gives similar qualitative results but is three orders of magnitude faster. We also experiment with single-image super-resolution, where replacing a per-pixel loss with a perceptual loss gives visually pleasing results.

    03/27/2016 ∙ by Justin Johnson, et al. ∙ 0 share

    read it

  • Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations

    Despite progress in perceptual tasks such as image classification, computers still perform poorly on cognitive tasks such as image description and question answering. Cognition is core to tasks that involve not just recognizing, but reasoning about our visual world. However, models used to tackle the rich content in images for cognitive tasks are still being trained using the same datasets designed for perceptual tasks. To achieve success at cognitive tasks, models need to understand the interactions and relationships between objects in an image. When asked "What vehicle is the person riding?", computers will need to identify the objects in an image as well as the relationships riding(man, carriage) and pulling(horse, carriage) in order to answer correctly that "the person is riding a horse-drawn carriage". In this paper, we present the Visual Genome dataset to enable the modeling of such relationships. We collect dense annotations of objects, attributes, and relationships within each image to learn these models. Specifically, our dataset contains over 100K images where each image has an average of 21 objects, 18 attributes, and 18 pairwise relationships between objects. We canonicalize the objects, attributes, relationships, and noun phrases in region descriptions and questions answer pairs to WordNet synsets. Together, these annotations represent the densest and largest dataset of image descriptions, objects, attributes, relationships, and question answers.

    02/23/2016 ∙ by Ranjay Krishna, et al. ∙ 0 share

    read it

  • Inferring and Executing Programs for Visual Reasoning

    Existing methods for visual reasoning attempt to directly map inputs to outputs using black-box architectures without explicitly modeling the underlying reasoning processes. As a result, these black-box models often learn to exploit biases in the data rather than learning to perform visual reasoning. Inspired by module networks, this paper proposes a model for visual reasoning that consists of a program generator that constructs an explicit representation of the reasoning process to be performed, and an execution engine that executes the resulting program to produce an answer. Both the program generator and the execution engine are implemented by neural networks, and are trained using a combination of backpropagation and REINFORCE. Using the CLEVR benchmark for visual reasoning, we show that our model significantly outperforms strong baselines and generalizes better in a variety of settings.

    05/10/2017 ∙ by Justin Johnson, et al. ∙ 0 share

    read it

  • CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning

    When building artificial intelligence systems that can reason and answer questions about visual data, we need diagnostic tests to analyze our progress and discover shortcomings. Existing benchmarks for visual question answering can help, but have strong biases that models can exploit to correctly answer questions without reasoning. They also conflate multiple sources of error, making it hard to pinpoint model weaknesses. We present a diagnostic dataset that tests a range of visual reasoning abilities. It contains minimal biases and has detailed annotations describing the kind of reasoning each question requires. We use this dataset to analyze a variety of modern visual reasoning systems, providing novel insights into their abilities and limitations.

    12/20/2016 ∙ by Justin Johnson, et al. ∙ 0 share

    read it