PhotoShape: Photorealistic Materials for Large-Scale Shape Collections

09/26/2018 ∙ by Keunhong Park, et al.

Existing online 3D shape repositories contain thousands of 3D models but lack photorealistic appearance. We present an approach to automatically assign high-quality, realistic appearance models to large scale 3D shape collections. The key idea is to jointly leverage three types of online data -- shape collections, material collections, and photo collections, using the photos as reference to guide assignment of materials to shapes. By generating a large number of synthetic renderings, we train a convolutional neural network to classify materials in real photos, and employ 3D-2D alignment techniques to transfer materials to different parts of each shape model. Our system produces photorealistic, relightable, 3D shapes (PhotoShapes).




1. Introduction

3D shapes with photorealistic materials are of great importance for problems ranging from augmented reality to game design to e-commerce. Creating realistic 3D content is quite difficult however, and the vast majority of existing models are manually authored.

Even more difficult than producing the geometry, which an artist can author using CAD tools, is creating realistic relightable textures, as spatially varying reflectance models (SVBRDFs) are 6-dimensional. And while computer vision-based 3D reconstruction research has advanced considerably over the last few years, existing commercial tools produce raw geometry but lack relightable textures and hierarchical part segmentation. As a result, relatively few photorealistic relightable 3D shapes (PhotoShapes) exist, and even fewer are freely available online.

Our goal is to produce thousands of freely available PhotoShapes. To this end, we observe that the problem of creating PhotoShapes can be factored into three more tractable subproblems:

  1. we need thousands of good shape models.

  2. we need databases of high-quality spatially-varying material models.

  3. we need an assignment of materials to shape parts.

The first problem (shape models) is addressed in part by the existence of large model databases like ShapeNet [Chang et al., 2015] which captures thousands of chairs (and many other categories) spanning myriad shapes and styles. The second problem (appearance models) is addressed in part by existing BRDF/SVBRDF databases and other online material libraries, although many of these are not free (e.g., [Adobe Stock, 2018]), and these collections are not as extensive as we would like; we therefore contributed a number of high quality SVBRDFs to round out the collection. The third problem – assignment of shape to materials – is the focus of our paper.

Given a 3D shape of a chair with a set of parts (e.g., legs, seat, back) and a set of materials (e.g., different types of wood, plastic, leather, fabric, metal) how should we decide which materials to apply to each part? Whereas an artist would choose this assignment manually, we propose to automate this process by leveraging photos of chairs on the Internet. I.e., the goal is to use photos of real objects as reference to automate the assignment of materials to 3D shapes.

Note that this problem is different than transferring the texture of a reference photo to a 3D shape, e.g., [Wang et al., 2016a], as the latter requires generating missing texture for the parts of object surfaces not visible in the photo, and does not enable specular relighting (required for many applications). In addition to solving both of these problems, our approach produces “super-resolution” textures (based on the high-res appearance database) where you can zoom in far beyond the resolution afforded by the reference photo, to see fine wood-grain or stitch patterns up close. The caveat is that these textures are “hallucinated”, i.e., they are best matches from the database of textures rather than exact reproductions of the reference object. This means, for example, that while the overall look and appearance of the material is often matched well (e.g., “oak”, “black leather”) and the level of realism is high, the particular knot placement of the wood grain or the texture of the leather may differ significantly from the object in the photo.

Conceptually, we could solve this problem by comparing every reference photo with every 3D shape, textured with every material in the database, and rendered to every viewpoint and with different illuminations. The good matches (for each shape part) would yield our desired assignment of materials to shapes. Aside from the obvious scale and combinatorial complexity problems with such an approach, a key challenge is how to robustly compare two images where features differ in both shape (e.g., arm height in an office chair) and appearance (e.g., different wood grain). We leverage deep networks, trained on thousands of synthetic renderings of BRDFs applied to 3D shapes to produce robust classifiers that map patches of reference photos to material database instances. We then use fine-scale image alignment techniques and spatial aggregation (CRFs) to assign materials to parts of the shape.

Our fully automatic system is able to produce 2,000 PhotoShapes that very accurately reflect their exemplars, and 9,000 PhotoShapes which deviate slightly but are good representations of their exemplars. In total, our system produces 11,000 photorealistic, relightable 3D shapes.

2. Related Work

Material Capture and Representation

A widely used representation for opaque materials is the Bidirectional Reflectance Distribution Function (BRDF) and its spatially varying form (SVBRDF). Estimating BRDF parameters from images is a well studied problem, either in a lab setup [Matusik et al., 2003] or directly from images. The works of [Lombardi and Nishino, 2015; Oxholm and Nishino, 2016] optimize for the BRDF parameters given the shape or the illumination respectively. Chandraker et al. [Chandraker, 2014] study how motion cues can assist the estimation. More recent approaches use deep learning to estimate the material parameters [Georgoulis et al., 2017a] from reflectance maps [Rematas et al., 2016], from the image directly [Wang TY, 2017; Liu et al., 2017] or from an RGB-D sequence [Kim et al., 2017]. For SVBRDFs, [Aittala et al., 2013] introduces a system for easy capture of SVBRDF parameters. [Dong et al., 2014] infers the diffuse and specular albedo from a moving object. [Zhou et al., 2016] proposes a method for capturing SVBRDFs from a small number of views. The work of [Aittala et al., 2015] uses two photos of the same texture on a flat surface, one taken with a flash and one without. [Li et al., 2017] is able to estimate the diffuse albedo, the specularity, and the illumination of a flat texture using a self-augmented neural network.

Diffuse Textures

Diffuse texture maps are another technique for material representation. These textures can be directly applied to their corresponding 3D models [Debevec et al., 1996]. The work of [Kholgade et al., 2014] uses projective texturing followed by texture synthesis in 3D, after the manual alignment of a 3D object to a 2D image. [Huang et al., 2018] proposes an approach for detailed geometry and reflectance extraction from single photos using rough 3D proxies. [Diamanti et al., 2015] proposes an exemplar based synthesis approach that incorporates 3D cues such as normals and light direction. In a different manner, [Kopf et al., 2007] synthesizes solid textures that are optimized to match the statistics of a 2D image. Apart from their 3D applications, textures have been studied extensively in the image domain. Non-parametric texture synthesis methods [Efros and Leung, 1999; Efros and Freeman, 2001] are able to generate plausible textures from small patches. [Hertzmann et al., 2001] introduced a framework that transfers the texture effects that relate two images to a new one. Recently, deep learning approaches have boosted the quality of the produced textures, synthesizing either from exemplars [Gatys et al., 2015; Sendik and Cohen-Or, 2017] or semantic labels [Isola et al., 2017; Chen and Koltun, 2017].

Material Recognition

In many computer vision applications it is important to recognize the materials that appear in an image. The method of [Sharan et al., 2013] uses features based on human perception for material classification. [Schwartz and Nishino, 2015] proposes a method for discovering attributes suitable for material classification, while [Zhang et al., 2015] identifies materials based on their reflectance. Similarly, [Georgoulis et al., 2017b] classifies materials based on reflectance maps. [Bell et al., 2015] introduces a large dataset and a deep learning framework for material recognition in the wild (e.g. the Flickr Material Database [Sharan et al., 2014] contains 100 images per class, with the images not being representative of everyday scenes). Another dataset for surface material recognition is presented in [Xue et al., 2017], together with a classification network based on differentiable angular imaging. Moreover, [Wang et al., 2016b] introduces a light-field dataset for material recognition. [Cimpoi et al., 2014; Cimpoi et al., 2015] introduce a texture dataset and a deep learning approach for texture recognition and segmentation, but the texture classes are based on high level attributes.

Image and Shape Dataset Analysis

Wang et al. [Wang et al., 2013] propose a method to transfer image segmentation labels to 3D models by aligning the projections of the 3D shapes with annotated images. [Huang et al., 2015] performs single-view reconstruction by jointly analyzing image and shape collections. Starting from a 3D dataset with material annotations, [Jain et al., 2012] introduces a method for material suggestions based on material relation of object parts. Similarly, [Chen et al., 2015] proposes a framework for automatic assignment of materials and textures for indoor scenes based on a set of rules learnt from an annotated database. The work of [Izadinia et al., 2017] proposes a framework to infer the geometry of a room from a single image, but the appearance of the 3D models is estimated as the mean diffuse color of the image pixels. [Rematas et al., 2017] aligns 3D models with images to estimate reflectance maps (orientation dependent appearance), and then uses them for shading. Closer to our work is the method of [Wang et al., 2016a], which transfers textures from images to aligned shapes. However, the transferred textures consist of only a diffuse albedo and need to contain strong patterns. Our work attempts to alleviate this limitation by using rich multi-component representations that capture a large variety of materials.

3. Dataset

In this section we describe the three types of datasets that we used in our paper: shape, photo, and texture collections. For this work we have focused on chairs (including a variety of sofas, office chairs, stools etc.). Chairs have a diverse set of appearances and material combinations that make them appealing for our experiments.

3.1. 3D Shape Collections

The 3D models to be textured come from two free online CAD sources: ShapeNet [Chang et al., 2015] and Herman Miller [Herman-Miller, 2018]. In particular, we used 5,740 3D models from ShapeNet and 90 models from Herman Miller. ShapeNet is a large database of 3D models, containing thousands of 3D models across different categories. The furniture classes that are investigated in this paper are among the most populous, providing a good sampling of the “furniture” geometry. The Herman Miller database contains a small number of 3D models, but with higher quality meshes.

Our 3D models are in OBJ format and are segmented into parts. These parts do not always correspond to semantic groups like “chair leg” but are a byproduct of the 3D shape design process. Some shapes do include material information either as simple Blinn-Phong parameters or as textures, but such materials are usually low quality and inadequate for photorealistic rendering.


The initial 3D shapes vary in terms of geometric detail and quality. To ensure that the shape collection contains sufficient geometric quality we pre-process the database with the following steps. First, we manually remove the 3D models that do not belong to the aforementioned categories. Moreover, we remove unrealistic shapes and shapes with poor quality. Next, we remove duplicate vertices and enable smooth shading. Finally, all the models are resized to fit in a unit bounding cube.
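The final resizing step can be illustrated with a minimal sketch. The function name `fit_to_unit_cube` is hypothetical, and the choices to center the model at the origin and to scale uniformly (preserving aspect ratio) are assumptions; the text only states that models are resized to fit a unit bounding cube.

```python
def fit_to_unit_cube(vertices):
    """Translate and uniformly scale vertices so that their bounding box
    fits inside the cube [-0.5, 0.5]^3 (centering is an assumption)."""
    mins = [min(v[i] for v in vertices) for i in range(3)]
    maxs = [max(v[i] for v in vertices) for i in range(3)]
    center = [(lo + hi) / 2.0 for lo, hi in zip(mins, maxs)]
    # The longest bounding-box side determines the uniform scale factor.
    extent = max(hi - lo for lo, hi in zip(mins, maxs))
    if extent == 0:
        raise ValueError("degenerate mesh: all vertices coincide")
    return [tuple((v[i] - center[i]) / extent for i in range(3))
            for v in vertices]
```

Scaling by the longest side keeps the proportions of the model intact, so a tall stool stays tall after normalization.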

UV Map Generation

Most models lack UV mappings which are required for texturing. To estimate the UV maps we use Blender’s “Smart UV projection” algorithm [Vallet and Lévy, 2009] for each material segment.

3.2. Exemplars: Photographic References for PhotoShapes

We pair 3D shapes with photographic references which we call exemplars to guide the appearance of PhotoShapes. Our collection of exemplars consists of 40,927 product photos that were collected from a) the Herman Miller website, and b) image search engines (Google and Bing), similar to [Huang et al., 2015]. Specifically, we used 1,820 images from Herman Miller and 39,107 from the search engines. Product images are suitable for our task because objects of interest are uncluttered and easily segmented from the background (which is usually white). To ensure that images are unique, we remove duplicates by computing dense HOG features [Felzenszwalb et al., 2010] on each image and removing images with an L2 distance lower than 0.1. Moreover, a foreground mask is computed for each image using a simple pixel value threshold. The object is then tightly cropped with a square bounding box and resized to 1000x1000.
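The duplicate-removal step can be sketched as follows. This is only an illustration: it assumes the HOG descriptors have already been computed and are passed in as plain lists of floats, and the greedy keep-first ordering and the name `remove_near_duplicates` are assumptions not stated in the text. Only the L2 distance and the 0.1 threshold come from the description above.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def remove_near_duplicates(features, threshold=0.1):
    """Return indices of images to keep.

    An image is kept only if its descriptor is at least `threshold`
    away from every descriptor kept so far (greedy, first one wins).
    """
    kept = []
    for i, feat in enumerate(features):
        if all(l2_distance(feat, features[j]) >= threshold for j in kept):
            kept.append(i)
    return kept
```

This pairwise scan is quadratic in the number of images; for tens of thousands of photos an approximate nearest-neighbor index would be a natural replacement.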

3.3. Material Collection

Figure 2. Examples of materials from each class. Rendered with Blender.

Our goal is to provide realistic, physically based textures to a 3D model collection. Textures are represented as SVBRDFs with spatially varying diffuse, specular and roughness maps. They also contain geometric information via normal and height maps. This representation enables realistic reproduction of materials and seamless incorporation into any modern physically based renderer.

Figure 3. An overview of our system. (a) The input to our system is a collection of images, shapes, and materials (section 3), (b) we take the shape and image collections and correlate them in an alignment step (section 4), (c) we take each shape-image pair along with the finely aligned segmentation mask and predict the material of each part (section 5.2), (d) the output of our system is a large collection of richly textured 3D shapes (novel viewpoints are shown in the figure).

We collected SVBRDFs by scanning real surfaces using [Aittala et al., 2015] and by collecting synthetic textures from online repositories. Specifically, we manually scanned 33 materials in addition to the 34 materials provided in [Aittala et al., 2015]. We downloaded 68 materials from [Poliigon, 2018], 57 materials from VRay Materials [, 2018] and 238 materials from [Adobe Stock, 2018]. We also manually created 15 metals and plastics.

Materials scanned using [Aittala et al., 2015] were converted from their model (similar to BRDF Model A from [Brady et al., 2014]) to an anisotropic Beckmann model in order to render using Blender. All the other BRDFs were rendered using their designed BRDFs. Poliigon and V-Ray Materials are rendered using the GGX [Walter et al., 2007] model and Adobe Stock materials are rendered using the Disney Principled BRDF [Burley and Studios, [n. d.]]. In total, our database consists of 48 leathers, 154 fabrics, 105 woods, 86 metals, and 60 plastics. Examples of the materials are shown in Figure 2.

Normalizing Scale

Materials are scanned or created with an arbitrary and unknown scale. In order to use materials consistently, we manually assign a scale value to each material. This value is used as a scaling factor for the UV mappings during rendering.

Environment Maps

We also have a small set of 30 HDR environment maps from [Debevec et al., 1996], [zbyg, 2018], and [Adobe Stock, 2018]. We select environment maps that simulate studio-like lighting conditions and use them for all of our renderings.

4. Shape-Image Alignment

Figure 4. The top-5 shape and pose retrievals given an image (outlined).
Figure 5. The top-5 image retrievals (outlined) given a shape. Computed using a reverse-index.

Given a collection of uncorrelated 3D shapes and images of the same category, we wish to synthesize realistic textured versions of each 3D shape. To achieve this, we propose a system to extract appearance information from the images and transfer that information onto the shape collection. We pose the problem as a classification problem in which the goal is to assign a material model from our texture dataset to each 3D shape part. Our system is comprised of two parts:

  1. Coarse step: assigns to each shape a list of exemplars and associated camera poses

  2. Fine step: creates a pixel-wise alignment between shapes and exemplars

We use the following terminology throughout the paper: a shape is a 3D model obtained from an online shape collection. Each shape is by construction divided into object parts defining structural divisions (seat, arm, leg, etc.) and material parts defining which objects should share the same material.

In order to texture a 3D shape, we refer to a set of associated image exemplars and use them as a proxy for reasoning about plausible textures for the shape. For this to be possible we must first compute an association between the collection of 3D shapes and exemplar images. We call this task alignment and break it down into two steps: 1) a coarse step, and 2) a fine step.

Figure 6. We refine a coarsely aligned mesh segment ID map to better align with the image. The shape segmentation map projected onto the image (a) after coarse alignment; (b) after applying the flow field computed using SIFT Flow; (c) after applying our dense CRF.

4.1. Coarse Step: Shape to exemplar matching.

We seek a list of exemplars for each 3D shape, as well as the camera pose for every such shape-exemplar pair. We pose coarse alignment as an image retrieval problem, solving shape retrieval and pose estimation simultaneously. Similar approaches are taken in [Huang et al., 2015] and [Wang et al., 2016a]. For efficiency, we solve this by creating a reverse-index from exemplars to the top shape renderings. Inverting this index gives us our desired output.
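The reverse-index idea can be sketched in a few lines. The data layout (a nested dictionary of precomputed descriptor distances), the parameter `k`, and the function names are assumptions made for illustration; the text only says that each exemplar is mapped to its top shape renderings and that the index is then inverted.

```python
from collections import defaultdict


def build_reverse_index(distances, k):
    """Map each exemplar to its k closest (shape, pose) renderings.

    distances[exemplar_id][(shape_id, pose)] is a precomputed
    descriptor distance (smaller means more similar).
    """
    index = {}
    for exemplar_id, row in distances.items():
        ranked = sorted(row.items(), key=lambda item: item[1])[:k]
        index[exemplar_id] = [(shape, pose, dist)
                              for (shape, pose), dist in ranked]
    return index


def invert_index(reverse_index):
    """Turn exemplar -> renderings into shape -> (exemplar, pose, distance),
    with each shape's exemplars sorted from best to worst match."""
    by_shape = defaultdict(list)
    for exemplar_id, matches in reverse_index.items():
        for shape, pose, dist in matches:
            by_shape[shape].append((exemplar_id, pose, dist))
    for matches in by_shape.values():
        matches.sort(key=lambda m: m[2])
    return dict(by_shape)
```

Querying from the exemplar side means each photo is compared against the rendering set once, and the per-shape exemplar lists fall out of a single pass over the index.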

We render each shape from various viewpoints sampled from a sphere around the object. The camera is parameterized in spherical coordinates. 50 elevation values are uniformly sampled in and azimuth values sampled uniformly over . This results in 456 distinct viewpoints.
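As a rough illustration of the spherical camera parameterization, the sketch below enumerates viewpoints on a sphere around the object. The elevation range, the sample counts, the radius, and the z-up convention are all placeholders chosen for the example rather than the values used in the paper.

```python
import math


def sample_viewpoints(n_elevation, n_azimuth, elev_range, radius=1.0):
    """Enumerate camera positions on a sphere around the origin.

    Elevation is sampled uniformly over the closed interval elev_range
    (radians); azimuth is sampled uniformly over [0, 2*pi) so that the
    first and last samples do not coincide.
    """
    lo, hi = elev_range
    views = []
    for i in range(n_elevation):
        t = i / (n_elevation - 1) if n_elevation > 1 else 0.5
        elev = lo + t * (hi - lo)
        for j in range(n_azimuth):
            azim = 2.0 * math.pi * j / n_azimuth
            position = (radius * math.cos(elev) * math.cos(azim),
                        radius * math.cos(elev) * math.sin(azim),
                        radius * math.sin(elev))
            views.append((elev, azim, position))
    return views
```

The number of viewpoints is simply the product of the two sample counts, so the rendering cost per shape grows linearly in each.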

Distance Metric

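We require a distance metric $d$ in order to compare the compatibility of a rendering and an image. The alignment problem is then

$$\operatorname*{top}_{s \in S,\ \theta \in \Theta,\ \phi \in \Phi} \; d\big(f(I),\, f(R(s, \theta, \phi))\big),$$

where $\operatorname{top}$ is a function that returns the top match, $I$ is the image query, $R$ is the renderer, $S$ is the set of all shapes, $\Theta$ is the set of all elevations, $\Phi$ is the set of all azimuths, and $f$ is a feature descriptor.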

We use the HOG descriptor from [Felzenszwalb et al., 2010] as the feature descriptor. This feature descriptor has the benefit of low dimensionality, making computation and comparisons extremely efficient. We compute descriptors of size 100x100 with a bin size of 8, yielding 1352 dimensional features. The input image is blurred with a Gaussian filter in order to reduce texture effects.

During comparison, the rendering and the image are both cropped with a square bounding box around their foreground masks. This allows us to perform the coarse alignment with a simple spherical coordinate camera model forgoing focal length or translation parameters.
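A minimal sketch of the square crop is shown below, using a binary foreground mask stored as a list of rows. Centering the square on the tight bounding box, and allowing it to extend past the image border (so the caller must pad), are assumptions; the function name is hypothetical.

```python
def square_bbox(mask):
    """Return (top, left, side) of the smallest square that is centered
    on the tight bounding box of the foreground and contains it.

    mask is a list of rows of 0/1 values. The square may extend past the
    image border, in which case the caller is expected to pad.
    """
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    if not rows:
        raise ValueError("mask has no foreground pixels")
    top, left = rows[0], cols[0]
    height = rows[-1] - top + 1
    width = cols[-1] - left + 1
    side = max(height, width)
    # Grow the shorter dimension symmetrically around the tight box.
    top -= (side - height) // 2
    left -= (side - width) // 2
    return top, left, side
```

Applying the same crop to both the rendering and the photo removes differences in object scale and position before descriptors are compared.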

4.2. Fine Step: Segmentation Refinement

3D shapes are segmented into object and material parts by their authors upon construction. We assume that any parts of the shape which have the same material label share the same appearance. One may also use the object parts as supervision for this purpose; however, we find that these tend to be over segmented and lack the symmetry found in material segments (e.g. each leg of a chair may be assigned a different material). Given the coarse alignment we can compute a 2D material part labeling for the image by projecting the shape parts with the estimated camera pose. The effect of this is shown in Figure 6(a).

A naive projection of the coarsely aligned part mask is insufficient for associating the two modalities. The mask does not perfectly align with the exemplar and thin structures such as chair legs may have zero overlap. We use the coarse alignment as initialization and perform an additional refinement step in order to get a cleaner pixel-wise alignment of the projected part mask.


We use the approach from [Wang et al., 2016a] in which we compute a flow which warps the projected shape segment map onto the exemplar. The flow is computed by using the SIFT Flow algorithm [Liu et al., 2011] on the silhouettes of each map. We encode the vertical and horizontal pixel coordinates into the blue and green pixels of the silhouette image (as in Figure 3). This prevents the SIFT Flow step from overly distorting the mask. The resulting warped mask is shown in Figure 6(b).

Dense CRF

The SIFT Flow refinement results in an overlapping but noisy segmentation. We further clean the segmentation mask by using a dense CRF [Krähenbühl and Koltun, 2011] in the same manner as [Bell et al., 2015] (please see supplemental materials for details). The resulting part mask which we use for the rest of the system is shown in Figure 6(c). The aligned part mask enables us to share information between shapes and corresponding image exemplars.

4.3. Substance Segmentation

We first use the aligned image exemplar to infer types of materials, a.k.a. “substances”, for each part of the aligned object. In the next section, we will convert these substances into fine-grained SVBRDFs. We segment the image and label each pixel with a substance category using [Bell et al., 2015]. For our experiments we use the substances ’leather’, ’fabric’, ’metal’, ’wood’, and ’plastic’. Similar categories are mapped to a canonical category (e.g. ’carpet’ to ’fabric’). All other category probabilities are set to zero and the remainder are re-scaled to sum to 1.0. This process results in a substance mask that assigns each pixel a label from the set of substance labels. We compute a substance labeling of the shape by choosing, for each part, the substance label that has the most overlap with that part. This process may also yield a cleaner substance segmentation of the image. See supplementary material for an example.

This process assigns a substance label to each aligned 3D shape part computed from section 4.2.
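The two operations in this section, restricting the per-pixel probabilities to the five substances and voting for a label per part, can be sketched as below. The dictionary-based data layout and function names are hypothetical, the `CANONICAL` table lists only the one mapping mentioned in the text, and summing the probabilities of merged categories is an assumption.

```python
from collections import Counter

SUBSTANCES = ("leather", "fabric", "metal", "wood", "plastic")
# Only the carpet -> fabric mapping is given in the text; a full table
# would contain more entries.
CANONICAL = {"carpet": "fabric"}


def restrict_probabilities(probs):
    """Merge similar categories, drop all others, and renormalize.

    probs maps a category name to its probability for a single pixel.
    """
    merged = dict.fromkeys(SUBSTANCES, 0.0)
    for category, p in probs.items():
        category = CANONICAL.get(category, category)
        if category in merged:
            merged[category] += p
    total = sum(merged.values())
    if total == 0:
        return merged
    return {c: p / total for c, p in merged.items()}


def label_parts(part_mask, substance_mask):
    """Assign each part the substance label covering most of its pixels.

    Both masks are 2D lists of equal size; part_mask holds a part id per
    pixel, or None for background.
    """
    votes = {}
    for part_row, substance_row in zip(part_mask, substance_mask):
        for part, substance in zip(part_row, substance_row):
            if part is not None:
                votes.setdefault(part, Counter())[substance] += 1
    return {part: counter.most_common(1)[0][0]
            for part, counter in votes.items()}
```

Voting per part is what lets a clean part mask compensate for a noisy per-pixel substance prediction: a few mislabeled pixels inside a seat do not change the seat's label.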

5. Image Segment to SVBRDF

Our objective is to assign a plausible SVBRDF to each part of a 3D shape. One approach is to extract planar patches and optimize a texture as in [Wang et al., 2016a] with an SVBRDF regression method such as [Li et al., 2017]. However, we find that extracting patches from images yields low resolution, distorted textures which are difficult to analyze. Extracting local planar patches also loses global context which is useful for inferring glossiness.

We instead tackle this problem as a classification problem. The input is an image and a corresponding binary mask representing a single material. The output is a material label chosen from our collection of SVBRDFs. Ideally we would have a collection of real images with ground truth SVBRDF labelings. In practice, it is difficult to define such a task. We also found human judgment of reflectance (as in [Bell et al., 2013]) to be noisy and low quality. We therefore generate synthetic data where we know ground truth.

5.1. Synthesizing Training Data

Synthetic data has been shown to be effective for training or augmenting models that generalize to real world applications [Su et al., 2015; Richter et al., 2016]. We therefore sidestep the difficulty of collecting ground truth by creating our own. Given our 3D shape and material databases, we can create a large amount of training data by applying different materials to shapes and rendering to a range of camera viewpoints under different illuminations.

Camera Pose Prior

We find that there is a strong bias in camera poses in real images (e.g., chairs are rarely photographed from below). We thus uniformly sample from the distribution of camera poses obtained in the coarse alignment step.

Substance Prior

Substances do not occur randomly in objects. Legs of chairs are usually not made of leather and a sofa is usually not upholstered with metal. We leverage the shape substance labelings in section 4.3 to enforce a substance prior. Instead of selecting a completely random material, we condition on the substance category and sample a material from that category.

Texture Scale Normalization

Different tessellation and UV mappings can arbitrarily change the rendered scale of our textures. We normalize the UV scale for each mesh segment by computing a density from the local UV-space surface area of the mesh and the local world-space surface area. The UV coordinates for the segment are then scaled according to this density. This method assumes little or no distortion in the UV mapping.
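One plausible way to realize this normalization is sketched below. The specific formula is an assumption: the density is taken here as the ratio of UV-space to world-space area, and UV coordinates are scaled by the square root of the inverse ratio so that one unit of UV area covers one unit of world area. The exact definition used by the authors may differ, and the function names are hypothetical.

```python
import math


def triangle_area_3d(a, b, c):
    """Area of a triangle in world space via the cross product."""
    ab = [b[i] - a[i] for i in range(3)]
    ac = [c[i] - a[i] for i in range(3)]
    cross = (ab[1] * ac[2] - ab[2] * ac[1],
             ab[2] * ac[0] - ab[0] * ac[2],
             ab[0] * ac[1] - ab[1] * ac[0])
    return 0.5 * math.sqrt(sum(x * x for x in cross))


def triangle_area_2d(a, b, c):
    """Area of a triangle in UV space."""
    return 0.5 * abs((b[0] - a[0]) * (c[1] - a[1])
                     - (c[0] - a[0]) * (b[1] - a[1]))


def normalize_uv_scale(triangles):
    """Rescale the UVs of one mesh segment so that its total UV area
    equals its total world-space area (assumed normalization).

    triangles is a list of (world_triangle, uv_triangle) pairs.
    """
    world_area = sum(triangle_area_3d(*world) for world, _ in triangles)
    uv_area = sum(triangle_area_2d(*uv) for _, uv in triangles)
    if world_area == 0 or uv_area == 0:
        raise ValueError("segment has zero area")
    # Areas scale with the square of a length scale factor.
    scale = math.sqrt(world_area / uv_area)
    return [tuple((u * scale, v * scale) for u, v in uv)
            for _, uv in triangles]
```

Because a single scalar is applied per segment, this only equalizes the average texel density, which is why the approach relies on the UV mapping having little distortion.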

Randomized Rendering

To generate a single random rendering, we uniformly sample a shape-exemplar pair computed in section 4.1. Given a pair, we 1) sample a camera pose from the distribution computed above, 2) assign a random material (SVBRDF) to each shape part conditioned on the substance label computed in section 4.3, and 3) select a random environment map.

To improve classifier robustness, we further augment our data as follows: 1) randomly jitter the azimuth and elevation, 2) randomly select a field of view, 3) randomly select a camera distance, 4) randomly scale the radiance of the environment map, and 5) randomly scale, rotate, and translate the UV mappings. We use Blender to generate 156,262 renderings (examples in supplementary material).
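The sampling procedure for one rendering can be sketched as a function that returns a configuration dictionary. Everything about the interface is hypothetical: the argument names, the representation of the pose distribution as a list of observed poses, and the `jitter_deg` value are placeholders, and only the pose jitter is shown out of the five augmentations.

```python
import random


def sample_render_config(pairs, poses, part_substances,
                         materials_by_substance, env_maps, rng,
                         jitter_deg=5.0):
    """Draw one random rendering configuration.

    pairs: list of (shape_id, exemplar_id) from coarse alignment.
    poses: list of (elevation, azimuth) in degrees observed during
        coarse alignment, used as an empirical pose distribution.
    part_substances: shape_id -> {part name: substance label}.
    materials_by_substance: substance label -> list of material ids.
    jitter_deg: placeholder jitter range, not the paper's value.
    """
    shape_id, exemplar_id = rng.choice(pairs)
    elevation, azimuth = rng.choice(poses)
    # Materials are drawn at random, but only within each part's substance.
    materials = {part: rng.choice(materials_by_substance[substance])
                 for part, substance in part_substances[shape_id].items()}
    return {
        "shape": shape_id,
        "exemplar": exemplar_id,
        "elevation": elevation + rng.uniform(-jitter_deg, jitter_deg),
        "azimuth": (azimuth + rng.uniform(-jitter_deg, jitter_deg)) % 360.0,
        "materials": materials,
        "environment": rng.choice(env_maps),
    }
```

Passing in a seeded `random.Random` instance makes a batch of configurations reproducible, which is convenient when renderings have to be regenerated.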

5.2. Material Classification

Our synthetic dataset is generated by conditioning on substances. We would like to be able to generate PhotoShapes with more accurate fine grained materials (e.g., a specific ’oak’, ’cherry’, ’maple’, instead of just ’wood’). Although material assignments were only conditioned on substance categories, the renderings of our synthetic dataset from section 5.1 contain ground truth labels for which specific material is rendered at each pixel.

We directly use the renderings and their ground truth materials to train a classifier which predicts which materials are present in a specified image. We experimented with other methods such as color histogram matching but found that such brute-force matching approaches are not practical when comparing a large number of materials. Our classifier is a feed-forward neural network which is efficient even for a large number of materials.

The input to our classifier is an image exemplar concatenated with a binary segmentation mask. The input mask represents the portion of the image that is to be classified. For example, if the mask had non-zero values only within the fabric upholstery (as in Figure 3(c)) the desired output would be a blue fabric. The output of our classifier is a vector $\mathbf{x}$ with one score per class, which becomes a discrete probability mass function when a soft-max operator is applied. We optimize this class labeling using a cross entropy loss

$$\mathcal{L}_m(\mathbf{x}, y) = -\log \frac{\exp(x_y)}{\sum_j \exp(x_j)},$$

where $y$ is the ground truth label.
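For reference, the soft-max and cross entropy computations on a single score vector can be written as follows. This is the standard textbook formulation in plain Python rather than the authors' training code; subtracting the maximum score before exponentiating is a common stability trick and is not mentioned in the text.

```python
import math


def softmax(scores):
    """Convert raw class scores into a probability mass function."""
    m = max(scores)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(scores, label):
    """Negative log soft-max probability of the ground truth class."""
    m = max(scores)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum_exp - scores[label]
```

The loss is zero only when all probability mass sits on the ground truth class, and it equals the logarithm of the number of classes when the scores are uniform.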

5.2.1. Multitask Learning

We find that naively training a network only on material class information generalizes poorly to real images. Intuitively this makes sense since we have given the system no information about the relative affinity between different materials e.g., it is less wrong to classify beech wood as cherry wood than it is to classify it as a black leather. We introduce an auxiliary task to our network in order to regularize our feature space and to teach the network the relative affinity between materials.

Substance Classification

Mis-classifying a wood material as leather is more detrimental than classifying it as a different wood. As such, we add an additional fully connected layer to our network which predicts the substance category of the input (wood, leather, plastic, etc.). The ground truth labels come directly from annotations of our material dataset. Substance prediction is also a classification task and is optimized using a cross entropy loss in a similar fashion to the material loss:

$$\mathcal{L}_s(\mathbf{z}, y_s) = -\log \frac{\exp(z_{y_s})}{\sum_j \exp(z_j)},$$

where $\mathbf{z}$ is the vector of predicted substance scores and $y_s$ is the ground truth label.

Combining Losses

The most straightforward way to optimize our objective is to compute a weighted sum of the material and substance loss functions for some fixed weighting term. However, we found it more efficient to use the uncertainty weighted multitask loss from [Kendall et al., 2018], which defines a weighting based on learned homoscedastic (task-dependent and not data-dependent) uncertainties, one per task.

In practice, we optimize for the log variance of each task rather than the variance itself, as it avoids a possible divide by zero and is more numerically stable.

We initialize .
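To make the two weighting schemes concrete, the sketch below evaluates both on scalar loss values. The uncertainty-weighted form is an assumption based on the classification case in [Kendall et al., 2018], written in terms of a log variance per task; the exact constants in the authors' formulation may differ, and in training the log variances would be learnable parameters updated by the optimizer rather than fixed inputs.

```python
import math


def weighted_sum_loss(loss_material, loss_substance, weight):
    """Fixed weighting: one hand-tuned scalar balances the two tasks."""
    return loss_material + weight * loss_substance


def uncertainty_weighted_loss(loss_material, loss_substance,
                              log_var_material, log_var_substance):
    """Assumed uncertainty weighting with s = log(sigma^2) per task:

        exp(-s_m) * L_m + exp(-s_s) * L_s + 0.5 * (s_m + s_s)

    exp(-s) is always positive and finite, so there is no division by a
    variance that could reach zero.
    """
    precision_m = math.exp(-log_var_material)
    precision_s = math.exp(-log_var_substance)
    regularizer = 0.5 * (log_var_material + log_var_substance)
    return (precision_m * loss_material
            + precision_s * loss_substance
            + regularizer)
```

The regularizer term is what prevents the trivial solution of driving every log variance to infinity in order to suppress the task losses.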

5.2.2. Model Architecture and Training

We use ResNet-34 [He et al., 2016] with weights initialized to those pre-trained on ImageNet. We add a 4th alpha channel to the input layer of the network which represents the segment of the image we wish to classify. The associated filters of the alpha channel are initialized with a random Gaussian. The output of the network is a score for each of the categories, which map to the materials and a background class.

We train the network with stochastic gradient descent (SGD) with a fixed learning rate of 0.0001 until convergence (about 100 epochs). We perform standard data augmentations: random rotations, random crops, random aspect ratio changes, random horizontal flips, and random chromatic/brightness/contrast shifts.

5.2.3. Pretraining on Real Images

Although our network trained only on synthetic renderings generalizes fairly well to real images, we found that having natural image supervision helps the network learn a more robust model. We pre-train our network on the dataset of [Bell et al., 2015] using their ground truth region polygons to generate input masks. Our data mostly has white backgrounds and we therefore interleave whole image inputs with cropped inputs with white backgrounds.

We randomly split the OpenSurfaces dataset into training and validation sets at a ratio of 9:1. We fine-tune the network initialized to weights pretrained on ImageNet [Deng et al., 2009] and train using only the substance task until convergence. The network reaches top-1 validation precision on our pretraining validation set.

Figure 7. The top-3 material predictions of our material classification network for the input shown on the left. (a) shows predictions when trained without substance supervision, (b) shows predictions when trained with substance supervision.

5.2.4. Inference

Given an image we take the segmentation mask computed in section 4.2 and infer a material for each segment. We find that we are able to improve material prediction performance by weighting the material prediction by the confidence of the corresponding substance category prediction.
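One simple way to combine the two predictions is sketched below. Multiplying each material probability by the probability of its substance and renormalizing is an assumption about what "weighting" means here, and the function names and dictionary layout are hypothetical.

```python
def weight_by_substance(material_probs, substance_probs, material_substance):
    """Reweight material probabilities by the predicted probability of
    the substance each material belongs to, then renormalize.

    material_substance maps a material id to its substance label.
    """
    weighted = {m: p * substance_probs[material_substance[m]]
                for m, p in material_probs.items()}
    total = sum(weighted.values())
    if total == 0:
        return weighted
    return {m: p / total for m, p in weighted.items()}


def predict_material(material_probs, substance_probs, material_substance):
    """Return the material with the highest reweighted probability."""
    weighted = weight_by_substance(material_probs, substance_probs,
                                   material_substance)
    return max(weighted, key=weighted.get)
```

For example, a leather material that narrowly wins the raw material prediction loses to an oak once the network is confident that the segment is wood.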

5.3. Generating PhotoShapes

The final objective is to create a collection of photorealistic, relightable 3D shapes. Consider the collection of all aligned shape-exemplar pairs computed in section 4 as PhotoShapes candidates. These candidates become PhotoShapes when each of their material parts is assigned a relightable texture. The latter step is done simply by applying section 5.2 to each photo-aligned material part (using the mask from section 4.2). For our experiments, we took the top matching exemplars for each shape (ordered by HOG distance) and discarded any pairs whose HOG distance exceeded a threshold. We then applied our material classifier to the remaining candidates. This process yields 29,133 PhotoShapes. Examples are in Figure 15.
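The candidate selection step can be sketched as a small filter. The parameters `top_k` and `max_distance` are placeholders for values not given here, and treating the threshold as inclusive is an assumption.

```python
def select_candidates(matches_by_shape, top_k, max_distance):
    """Keep the best exemplars per shape and drop poor alignments.

    matches_by_shape maps a shape id to a list of
    (exemplar_id, hog_distance) pairs. Returns a flat list of
    (shape_id, exemplar_id, hog_distance) candidates.
    """
    candidates = []
    for shape_id, matches in matches_by_shape.items():
        best = sorted(matches, key=lambda m: m[1])[:top_k]
        candidates.extend((shape_id, exemplar_id, dist)
                          for exemplar_id, dist in best
                          if dist <= max_distance)
    return candidates
```

A shape whose best exemplar is still far away contributes no candidates at all, which trades collection size for alignment quality.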

6. Experiments

In this section, we describe implementation details, evaluations, comparisons to prior work, limitations, and applications.

6.1. Baselines

Built-in Textures

The simplest baseline is to compare with the materials that come by default from each shape model. Most of the shapes in our dataset lack any meaningful material properties. Besides the very few with high quality textures, most shapes have either arbitrary colors or low resolution textures.

No Material Classification

We evaluate a version of our method without the material classification step. We assign random materials given the substance computed in section 4.3. This is akin to our training data generation process.

Color Matching

A naive method of matching materials is to render a shape with all possible materials and compare the resulting color distribution with the exemplar image. Our experiments show that such brute-force approaches are too slow for large-scale applications (rendering thousands of objects with hundreds of materials takes a long time). Such comparisons also have difficulty accounting for complicated textures and non-uniform lighting effects such as shadows and specularities. We show a comparison in Figure 8.
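A minimal sketch of a median-color variant of this baseline is given below. It assumes each candidate material has already been rendered and reduced to a list of RGB pixels; the function names and the pixel values are hypothetical and serve only to show how highlights can pull the match toward a lighter material.

```python
from statistics import median


def median_color(pixels):
    """Per-channel median of a list of (r, g, b) pixels."""
    return tuple(median(channel) for channel in zip(*pixels))


def match_by_median_color(exemplar_pixels, material_renderings):
    """Return the material whose rendering's median color is closest to the exemplar's."""
    target = median_color(exemplar_pixels)

    def distance(name):
        candidate = median_color(material_renderings[name])
        # Squared Euclidean distance in RGB space.
        return sum((a - b) ** 2 for a, b in zip(target, candidate))

    return min(material_renderings, key=distance)


# A dark glossy exemplar: a few bright highlight pixels raise the median.
exemplar = [(40, 40, 40), (60, 60, 60), (200, 200, 200), (220, 220, 220), (90, 90, 90)]
renderings = {
    "dark_metal": [(40, 40, 40), (45, 45, 45), (50, 50, 50)],
    "light_diffuse": [(95, 95, 95), (100, 100, 100), (105, 105, 105)],
}
match = match_by_median_color(exemplar, renderings)
```

Here the highlights shift the exemplar's median toward gray, so the lighter diffuse material is retrieved even though the surface is dark.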

Projective Texturing

Another approach for appearance modeling is projective texturing. If the alignment between the 3D model and the image is good [Debevec et al., 1996; Kholgade et al., 2014], the image can be used as a texture for the 3D model. However, achieving good alignment is very difficult, even with manual intervention. A further approach is reflectance maps [Rematas et al., 2017], which model the appearance of an object as “baked” material and illumination properties. Figure 13 shows the results of projective texturing and reflectance map shading compared to our approach.

Figure 8. Matching by median color fails to account for specular reflections and fetches a diffuse, lighter material. We include the color histogram of each image on the upper left with the median color (dotted lines).

6.2. Results

We show a sample of our results in Figure 15 (please see supplementary materials for more), comparing the built-in materials, our results with no classifier, and our full results. We also show the benefit of using high-resolution relightable textures in Figure 12.

We also show results for categories other than chairs in Figure 10. Note that these results are produced without additional training. Results may be improved by adding relevant materials (e.g., rubber for tires) and further training the classifier.

Figure 9. Our method is able to produce different plausible materials (bottom) for the same shape (left) given different exemplar images (top, outlined).
Figure 10. Without any additional training, our pipeline produces plausible results for motorcycles (top left), pillows (top right), coffee tables (bottom left) and cabinets (bottom right). A green outline indicates the exemplar followed by our result to the right. The motorcycle tires are assigned metallic and leather materials as we lack a rubber material in our dataset. The knobs of the cabinet were missed by the fine alignment step due to the small size and sharp color variation of the chrome.
Figure 11. Representative examples of our categorization of results. (a) Good models are good representations of the original exemplar. (b) Acceptable models have slight differences but are overall plausible. (c) Failures, for which we identify the following modes: (i) material mis-classification, (ii) material mis-classification caused by ambiguous appearance (e.g., plastic sometimes looks like leather), (iii) color mismatch, (iv) over-segmented mesh causing material discontinuity, (v) under-segmented mesh making it impossible to assign correct materials, and (vi) mis-alignment of the exemplar and shape (includes retrieving the wrong object).

6.3. User Study

We conducted a user study on Amazon Mechanical Turk to evaluate the performance of our method. We showed users an image and asked them to specify whether they thought the image was a “real photograph” or “generated by a computer”. We compared four conditions: built-in textures, our pipeline without the material classifier, our full method, and real photographs. We performed the study on the ShapeNet and Herman Miller datasets separately. For ShapeNet results, we sampled 1000 result renderings uniformly at random. For Herman Miller results we sampled 500. Results are shown in Table 1.

Method                  ShapeNet (Real / Fake)    Herman Miller (Real / Fake)
Built-in                32% / 68%                 37% / 63%
Ours (No Classifier)    41% / 59%                 43% / 57%
Ours                    47% / 53%                 51% / 49%
Photographs             81% / 19%                 83% / 17%

Table 1. User Study. The ratio of users who thought our results were real photographs.

6.4. Shape to PhotoShapes Conversion Rate

We also manually evaluated our results. We sort the PhotoShapes produced by our pipeline into three categories: good, acceptable, and failure. Good results represent their exemplar image almost exactly. Acceptable results have slight differences from the exemplar but are good representations. We ignore shapes with low mesh quality (very low polygon count, holes, incorrect normals) and bad exemplars (watermarks, transparent materials, etc.). Everything else is considered a failure. We identify four main failure modes: (1) material mis-classification, e.g., metal instead of wood, (2) color or pattern mismatch, (3) failures due to the under- or over-segmentation of meshes, and (4) incorrect shape retrievals or mis-alignments. Representative examples of each category are shown in Figure 11.

We evaluate our two input shape collections (ShapeNet and Herman Miller) separately. We generate 28,432 PhotoShapes for ShapeNet and 701 PhotoShapes for Herman Miller shapes. Since manually sorting all results is not feasible, we evaluate quality on a random sample. We uniformly sample 1,322 PhotoShapes from ShapeNet and 243 PhotoShapes from Herman Miller at random and categorize them into the aforementioned classes. We found that most failure cases were due to mis-alignments and under-segmentations of the mesh. Table 2 shows the division between each category, and Table 3 shows a finer division between different failure modes. Our shapes from Herman Miller have considerably higher mesh quality, resulting in a higher success rate due to a lower number of mis-alignments.

Extrapolating from our categorization, our system was able to produce around 2,100 ‘good’ PhotoShapes for ShapeNet and 262 ‘good’ PhotoShapes for Herman Miller. Including ‘acceptable’ results, we generated 11,000 PhotoShapes for ShapeNet and 475 PhotoShapes for Herman Miller.

We also report input shape coverage (projected numbers shown in parentheses). Our annotations show that 14.93% (856) of ShapeNet shapes and 68.42% (69) of Herman Miller shapes had at least one ‘good’ PhotoShape. When we include ‘acceptable’ results, the success rate is 64.25% (3,687) for ShapeNet and 91.23% (92) for Herman Miller.

While a success rate like 64% may not sound impressive, it is very significant in the context of the problem that we are trying to solve, i.e., generating a large dataset of high quality PhotoShapes. In concrete terms, we have generated over 3,500 photorealistic, relightable 3D shapes of chairs. It is not critical that we texture every shape, as many of these (particularly in ShapeNet) were artist-created and may not correspond to real furniture for which photo exemplars exist.

Dataset          Good      Acceptable    Failure    Good+Acc
ShapeNet         6.08%     32.76%        61.15%     38.85%
Herman Miller    37.45%    30.45%        32.10%     67.90%

Table 2. Our generated PhotoShapes divided into good, acceptable, and failure categories. We also show the sum of the good and acceptable classes. See Table 3 for a more detailed division of failures.

Dataset          align     subst     color     und-seg    ov-seg
ShapeNet         40.18%    27.93%    11.86%    17.98%     2.04%
Herman Miller    14.10%    38.46%    32.05%    14.10%     1.28%

Table 3. Different modes of failure. ‘align’ refers to mis-alignment errors, ‘subst’ refers to substance mis-classification errors, ‘color’ refers to incorrect color or pattern predictions, ‘und-seg’ refers to errors due to the under-segmentation of the shape, and ‘ov-seg’ refers to awkward results due to models being over-segmented. See Figure 11 for examples.

Figure 12. Some of our results with close-ups. By modeling materials with SVBRDFs, we are able to reproduce appearance in great detail, even if the exemplar image has low resolution.
Figure 13. Comparisons of producing a novel view of an exemplar image using projective texturing, reflectance maps, and then our method.

6.5. Material Classifier Evaluation

Data Split

We split our synthetic rendering dataset into a training set and a validation set. Since our materials are our prediction categories, we perform the split on the set of shapes and environment maps that are used to generate our renderings. We set aside 10% of our shapes and environment maps as validation and use the rest for training. This process yields 156,511 renderings for training and 15,872 renderings for validation.
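The split over shapes and environment maps can be sketched as follows. The sketch assumes that a rendering belongs to the validation set only when both its shape and its environment map are held out, and that renderings mixing a training shape with a held-out environment map (or the reverse) are not used; this handling, the function names, and the record layout are assumptions made for illustration.

```python
import random


def split_ids(ids, val_fraction, rng):
    """Shuffle ids and hold out a fraction of them for validation."""
    ids = sorted(ids)  # sort first so the shuffle is reproducible
    rng.shuffle(ids)
    n_val = max(1, round(len(ids) * val_fraction))
    return set(ids[n_val:]), set(ids[:n_val])


def split_renderings(renderings, val_fraction=0.1, seed=0):
    """renderings: list of dicts with 'shape' and 'envmap' keys."""
    rng = random.Random(seed)
    train_shapes, val_shapes = split_ids({r["shape"] for r in renderings}, val_fraction, rng)
    train_envs, val_envs = split_ids({r["envmap"] for r in renderings}, val_fraction, rng)

    train = [r for r in renderings
             if r["shape"] in train_shapes and r["envmap"] in train_envs]
    val = [r for r in renderings
           if r["shape"] in val_shapes and r["envmap"] in val_envs]
    return train, val


# Hypothetical dataset: 10 shapes rendered under 10 environment maps.
renderings = [
    {"shape": f"shape_{s}", "envmap": f"env_{e}"}
    for s in range(10)
    for e in range(10)
]
train, val = split_renderings(renderings)
```

Splitting on shapes and environment maps rather than on individual renderings keeps every material class present in both sets while ensuring the validation geometry and lighting are unseen.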

Ablation Study

We evaluate our network with the following metrics: a) the top-1 material class precision, b) top-5 material class precision, c) top-1 substance class precision of the substance output layer, and d) the top-1 substance class precision of the substance predicted by aggregating the confidence of each material class by substance category.
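As an illustration of metrics (a), (b), and (d), the sketch below checks whether the true label is among the top-k predictions and derives a substance prediction by summing material confidences per substance category. The function names, the mapping, and the example probabilities are hypothetical.

```python
def top_k_hit(scores, true_label, k):
    """True if true_label is among the k highest-scoring labels."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return true_label in ranked[:k]


def substance_from_materials(material_probs, material_to_substance):
    """Sum material confidences per substance and return the most confident substance."""
    totals = {}
    for name, p in material_probs.items():
        substance = material_to_substance[name]
        totals[substance] = totals.get(substance, 0.0) + p
    return max(totals, key=totals.get)


# Hypothetical prediction for one sample whose true material is "oak".
material_probs = {"black_leather": 0.35, "oak": 0.25, "walnut": 0.22, "cherry": 0.18}
material_to_substance = {
    "black_leather": "leather",
    "oak": "wood",
    "walnut": "wood",
    "cherry": "wood",
}

hit_at_1 = top_k_hit(material_probs, "oak", k=1)
hit_at_2 = top_k_hit(material_probs, "oak", k=2)
implied_substance = substance_from_materials(material_probs, material_to_substance)
```

Averaging the hit indicator over the validation set gives the top-k precision. In this example the top-1 material is wrong, yet the aggregated confidence still points to the correct substance, which is why the implied substance precision can exceed the material precision.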

We compare four different versions of our network: 1) trained only with material class supervision, 2) trained with material and substance class supervision, 3) pretrained on OpenSurfaces and then trained only with material class supervision, and 4) pretrained on OpenSurfaces and then trained with material and substance class supervision. The results for these metrics are shown in Table 4.

Our additional substance categorization task significantly boosts the validation accuracy of our classifier. We also find that material predictions are qualitatively more robust when trained with substance supervision as shown in Figure 7.

pretrain    sub    mtl@1     mtl@5     sub@1     sub-mtl@1
N           N      33.87%    61.95%    -         71.38%
N           Y      37.17%    64.31%    75.50%    76.58%
Y           N      37.59%    64.69%    -         72.37%
Y           Y      37.34%    65.45%    75.51%    76.60%

Table 4. Ablation Study. We compare different versions of our model on our synthetic validation set. We try all combinations of whether or not we pretrain on OpenSurfaces (pretrain) and whether an additional substance task is present (sub). (a) mtl@1 is the top-1 validation precision, (b) mtl@5 is the top-5 precision, (c) sub@1 is the top-1 substance precision of the substance task layer prediction, and (d) sub-mtl@1 is the top-1 precision of the substance prediction implied by the material prediction.

6.6. Image-Based SVBRDF Retrieval

Finding or designing a suitable material for a 3D scene may be time consuming. An application of our work is retrieving a BRDF based on an image. A user-specified region of an image may be used as input to our classifier in order to produce a ranking of BRDFs.

Despite being provided very little natural image supervision (only pretraining on OpenSurfaces), our material classifier is able to generalize surprisingly well to general photographs that do not have white backgrounds. Examples of predictions on such images are shown in Figure 14.

Figure 14. Without any additional training, our network is able to make reasonable predictions on natural images. The input image and the colored outline of the segmentation mask are shown on the left. The top BRDF retrieval for each segment is shown on the right.
Figure 15. Selected Results. (a) shows the input shape, (b) the exemplar image, (c) a rendering with the default materials that come with the shape, (d) rendering with materials sampled conditioned on substance category, (e) renderings of our final PhotoShapes. Please see supplementary materials for more results.

7. Discussion

7.1. Limitations and Future Work

Our results are limited in part by the variety of materials available – we cannot reconstruct the wheels of the motorcycle in Figure 10 because we do not have a rubber tire material. Our results could be improved by expanding the material database or by exploring methods to augment the current materials. This may be done by synthetically varying color and glossiness, or by synthesis of novel SVBRDFs.

Our work relies heavily on reliable matching of photos and shapes. Most of our failures come from mis-alignments or under-segmentations of the input shape. Adding more exemplar images and filtering low-quality shapes would yield better results. Incorporating more sophisticated alignment and segmentation methods is an interesting topic for future work.

7.2. Conclusion

We have presented a framework that assigns high quality relightable textures to a collection of 3D models with limited material information. The textures come from a large database and the material-to-3D assignment is performed with the guidance of real images to ensure plausible material configurations, and yields thousands of high quality PhotoShapes.


We thank Samsung Scholarship, the Allen Institute for Artificial Intelligence, Intel, Google, and the National Science Foundation (IIS-1538618) for supporting this research. We thank Dustin Schwenk for his help with the user study.


  • Adobe Stock [2018] Adobe Stock. 2018. Adobe Stock: Stock photos, royalty-free images, graphics, vectors and videos. (2018).
  • Aittala et al. [2013] Miika Aittala, Tim Weyrich, and Jaakko Lehtinen. 2013. Practical SVBRDF Capture in the Frequency Domain. ACM Trans. Graph. 32, 4 (2013).
  • Aittala et al. [2015] Miika Aittala, Tim Weyrich, and Jaakko Lehtinen. 2015. Two-shot SVBRDF Capture for Stationary Materials. ACM Trans. Graph. 34, 4 (2015).
  • Bell et al. [2013] Sean Bell, Paul Upchurch, Noah Snavely, and Kavita Bala. 2013. OpenSurfaces: A Richly Annotated Catalog of Surface Appearance. ACM Trans. on Graphics 32, 4 (2013).
  • Bell et al. [2015] Sean Bell, Paul Upchurch, Noah Snavely, and Kavita Bala. 2015. Material Recognition in the Wild with the Materials in Context Database. In CVPR.
  • Brady et al. [2014] Adam Brady, Jason Lawrence, Pieter Peers, and Westley Weimer. 2014. genBRDF: discovering new analytic BRDFs with genetic programming. ACM Transactions on Graphics 33, 4 (2014).
  • Burley and Studios [n. d.] Brent Burley and Walt Disney Animation Studios. [n. d.]. Physically-based shading at Disney.
  • Chandraker [2014] Manmohan Chandraker. 2014. On Shape and Material Recovery from Motion. In ECCV. Springer International Publishing, Cham.
  • Chang et al. [2015] Angel X. Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, Jianxiong Xiao, Li Yi, and Fisher Yu. 2015. ShapeNet: An Information-Rich 3D Model Repository. Technical Report arXiv:1512.03012 [cs.GR]. Stanford University — Princeton University — Toyota Technological Institute at Chicago.
  • Chen et al. [2015] Kang Chen, Kun Xu, Yizhou Yu, Tian-Yi Wang, and Shi-Min Hu. 2015. Magic Decorator: Automatic Material Suggestion for Indoor Digital Scenes. ACM Trans. Graph. 34, 6 (2015).
  • Chen and Koltun [2017] Qifeng Chen and Vladlen Koltun. 2017. Photographic Image Synthesis with Cascaded Refinement Networks. In ICCV.
  • Cimpoi et al. [2014] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi. 2014. Describing Textures in the Wild. In CVPR.
  • Cimpoi et al. [2015] M. Cimpoi, S. Maji, and A. Vedaldi. 2015. Deep filter banks for texture recognition and segmentation. In CVPR.
  • Debevec et al. [1996] Paul E. Debevec, Camillo J. Taylor, and Jitendra Malik. 1996. Modeling and Rendering Architecture from Photographs: A Hybrid Geometry- and Image-based Approach. In SIGGRAPH.
  • Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: A large-scale hierarchical image database. In CVPR.
  • Diamanti et al. [2015] Olga Diamanti, Connelly Barnes, Sylvain Paris, Eli Shechtman, and Olga Sorkine-Hornung. 2015. Synthesis of Complex Image Appearance from Limited Exemplars. ACM Transactions on Graphics 34, 2 (2015).
  • Dong et al. [2014] Yue Dong, Guojun Chen, Pieter Peers, Jiawan Zhang, and Xin Tong. 2014. Appearance-from-motion: Recovering Spatially Varying Surface Reflectance Under Unknown Lighting. ACM Trans. Graph. 33, 6 (2014).
  • Efros and Freeman [2001] Alexei A. Efros and William T. Freeman. 2001. Image Quilting for Texture Synthesis and Transfer. In SIGGRAPH.
  • Efros and Leung [1999] Alexei A. Efros and Thomas K. Leung. 1999. Texture Synthesis by Non-Parametric Sampling. In ICCV.
  • Felzenszwalb et al. [2010] Pedro F. Felzenszwalb, Ross B. Girshick, David McAllester, and Deva Ramanan. 2010. Object Detection with Discriminatively Trained Part-Based Models. IEEE TPAMI 32, 9 (2010).
  • Gatys et al. [2015] L. A. Gatys, A. S. Ecker, and M. Bethge. 2015. Texture Synthesis Using Convolutional Neural Networks. In NIPS.
  • Georgoulis et al. [2017a] Stamatios Georgoulis, Konstantinos Rematas, Tobias Ritschel, Efstratios Gavves, Mario Fritz, Luc Van Gool, and Tinne Tuytelaars. 2017a. Reflectance and Natural Illumination from Single-Material Specular Objects Using Deep Learning. IEEE TPAMI (2017).
  • Georgoulis et al. [2017b] Stamatios Georgoulis, Vincent Vanweddingen, Marc Proesmans, and Luc Van Gool. 2017b. Material Classification under Natural Illumination Using Reflectance Maps. In WACV.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR.
  • Herman-Miller [2018] Herman-Miller. 2018. Herman Miller 3D Models. (2018).
  • Hertzmann et al. [2001] Aaron Hertzmann, Charles E. Jacobs, Nuria Oliver, Brian Curless, and David H. Salesin. 2001. Image Analogies. In SIGGRAPH.
  • Huang et al. [2018] Hui Huang, Ke Xie, Lin Ma, Dani Lischinski, Minglun Gong, Xin Tong, and Daniel Cohen-Or. 2018. Appearance Modeling via Proxy-to-Image Alignment. ACM Trans. Graph. 37, 1 (2018).
  • Huang et al. [2015] Qixing Huang, Hai Wang, and Vladlen Koltun. 2015. Single-view Reconstruction via Joint Analysis of Image and Shape Collections. ACM Trans. Graph. 34, 4 (2015).
  • Isola et al. [2017] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. 2017. Image-to-Image Translation with Conditional Adversarial Networks. In CVPR.
  • Izadinia et al. [2017] Hamid Izadinia, Qi Shan, and Steven M Seitz. 2017. IM2CAD. In CVPR.
  • Jain et al. [2012] Arjun Jain, Thorsten Thormählen, Tobias Ritschel, and Hans-Peter Seidel. 2012. Material Memex: Automatic Material Suggestions for 3D Objects. ACM Trans. Graph. (Proc. SIGGRAPH Asia 2012) 31, 5 (2012).
  • Kendall et al. [2018] Alex Kendall, Yarin Gal, and Roberto Cipolla. 2018. Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics. In CVPR.
  • Kholgade et al. [2014] Natasha Kholgade, Tomas Simon, Alexei Efros, and Yaser Sheikh. 2014. 3D Object Manipulation in a Single Photograph using Stock 3D Models. ACM Trans. Graph. 33, 4 (2014).
  • Kim et al. [2017] Kihwan Kim, Jinwei Gu, Stephen Tyree, Pavlo Molchanov, Matthias Nießner, and Jan Kautz. 2017. A Lightweight Approach for On-the-Fly Reflectance Estimation. In ICCV.
  • Kopf et al. [2007] Johannes Kopf, Chi-Wing Fu, Daniel Cohen-Or, Oliver Deussen, Dani Lischinski, and Tien-Tsin Wong. 2007. Solid Texture Synthesis from 2D Exemplars. ACM Trans. Graph. 26, 3 (2007).
  • Krähenbühl and Koltun [2011] Philipp Krähenbühl and Vladlen Koltun. 2011. Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials. In NIPS.
  • Li et al. [2017] Xiao Li, Yue Dong, Pieter Peers, and Xin Tong. 2017. Modeling Surface Appearance from a Single Photograph Using Self-augmented Convolutional Neural Networks. ACM Trans. Graph. 36, 4 (2017).
  • Liu et al. [2011] C. Liu, J. Yuen, and A. Torralba. 2011. SIFT Flow: Dense Correspondence across Scenes and Its Applications. IEEE TPAMI 33, 5 (2011).
  • Liu et al. [2017] Guilin Liu, Duygu Ceylan, Ersin Yumer, Jimei Yang, and Jyh-Ming Lien. 2017. Material Editing using a Physically Based Rendering Network. In ICCV.
  • Lombardi and Nishino [2015] Stephen Lombardi and Ko Nishino. 2015. Reflectance and Illumination Recovery in the Wild. IEEE TPAMI (2015).
  • Matusik et al. [2003] Wojciech Matusik, Hanspeter Pfister, Matt Brand, and Leonard McMillan. 2003. A Data-Driven Reflectance Model. ACM Trans. Graph. 22, 3 (2003).
  • Oxholm and Nishino [2016] Geoffrey Oxholm and Ko Nishino. 2016. Shape and Reflectance Estimation in the Wild. IEEE TPAMI (2016).
  • Poliigon [2018] Poliigon. 2018. A library of textures, materials and HDR’s for artists that want photorealism. (2018).
  • Rematas et al. [2017] Konstantinos Rematas, Chuong Nguyen, Tobias Ritschel, Mario Fritz, and Tinne Tuytelaars. 2017. Novel Views of Objects from a Single Image. TPAMI (2017).
  • Rematas et al. [2016] Konstantinos Rematas, Tobias Ritschel, Mario Fritz, Efstratios Gavves, and Tinne Tuytelaars. 2016. Deep Reflectance Maps. In CVPR.
  • Richter et al. [2016] Stephan R Richter, Vibhav Vineet, Stefan Roth, and Vladlen Koltun. 2016. Playing for data: Ground truth from computer games. In ECCV. Springer.
  • Schwartz and Nishino [2015] Gabriel Schwartz and Ko Nishino. 2015. Automatically discovering local visual material attributes. CVPR (2015).
  • Sendik and Cohen-Or [2017] Omry Sendik and Daniel Cohen-Or. 2017. Deep Correlations for Texture Synthesis. ACM Trans. Graph. 36, 4 (2017).
  • Sharan et al. [2013] Lavanya Sharan, Ce Liu, Ruth Rosenholtz, and Edward H. Adelson. 2013. Recognizing materials using perceptually inspired features. International Journal of Computer Vision 108, 3 (2013).
  • Sharan et al. [2014] Lavanya Sharan, Ruth Rosenholtz, and Edward H. Adelson. 2014. Accuracy and speed of material categorization in real-world images. Journal of Vision 14, 10 (2014).
  • Su et al. [2015] Hao Su, Charles R. Qi, Yangyan Li, and Leonidas J. Guibas. 2015. Render for CNN: Viewpoint Estimation in Images Using CNNs Trained with Rendered 3D Model Views. In ICCV.
  • Vallet and Lévy [2009] Bruno Vallet and Bruno Lévy. 2009. What you seam is what you get. Technical Report. INRIA - ALICE Project Team.
  • [2018] 2018. - Your ultimate V-Ray material resource. (2018).
  • Walter et al. [2007] Bruce Walter, Stephen R. Marschner, Hongsong Li, and Kenneth E. Torrance. 2007. Microfacet Models for Refraction Through Rough Surfaces. In Proceedings of the 18th Eurographics Conference on Rendering Techniques (EGSR’07).
  • Wang et al. [2016b] Ting-Chun Wang, Jun-Yan Zhu, Ebi Hiroaki, Manmohan Chandraker, Alexei Efros, and Ravi Ramamoorthi. 2016b. A 4D light-field dataset and CNN architectures for material recognition. In ECCV.
  • Wang et al. [2016a] Tuanfeng Y. Wang, Hao Su, Qixing Huang, Jingwei Huang, Leonidas Guibas, and Niloy J. Mitra. 2016a. Unsupervised Texture Transfer from Images to Model Collections. ACM Trans. Graph. 35, 6, Article 177 (Nov. 2016).
  • Wang et al. [2013] Yunhai Wang, Minglun Gong, Tianhua Wang, Daniel Cohen-Or, Hao Zhang, and Baoquan Chen. 2013. Projective Analysis for 3D Shape Segmentation. ACM Trans. Graph. 32, 6 (2013).
  • Wang et al. [2017] Tuanfeng Y. Wang, Tobias Ritschel, and Niloy J. Mitra. 2017. Joint Material and Illumination Estimation from Photo Sets in the Wild. Eurographics (2017).
  • Xue et al. [2017] Jia Xue, Hang Zhang, Kristin J. Dana, and Ko Nishino. 2017. Differential Angular Imaging for Material Recognition. CVPR (2017).
  • zbyg [2018] zbyg. 2018. HDRi Pack. (2018).
  • Zhang et al. [2015] Hang Zhang, Kristin Dana, and Ko Nishino. 2015. Reflectance Hashing for Material Recognition. In CVPR.
  • Zhou et al. [2016] Zhiming Zhou, Guojun Chen, Yue Dong, David Wipf, Yong Yu, John Snyder, and Xin Tong. 2016. Sparse-as-possible SVBRDF Acquisition. ACM Trans. Graph. 35, 6 (2016).