This paper describes a system for inserting objects from a source image into a target image. The system corrects the shading of both the source object and the target image so that the object looks as though it was originally part of the target image (Figure 1). Our system extends the state of the art in two ways. First, we do not require any 3D information for the object being inserted; it is represented as an image fragment. Second, the user experience of authoring images is greatly simplified: a user just selects an object in one image and places it in another.
There is a significant literature dealing with methods to insert computer generated objects into real images, which we review below. The key problem here is to adjust the illumination of the object to be consistent with that of the scene. But rendering the object is straightforward, because it is a CG object. The system has a representation of geometry and material properties created by some modeling program.
There is also a significant literature on methods to compose images from fragments of other images. The key problem here is to avoid inconsistent illumination fields, which seem to be a strong cue that the target image isn’t real. This is handled by managing the choice of fragments that are composed. Ideally, a system would relight fragments to avoid inconsistency. The natural strategy is to try to recover a comprehensive geometric and material representation for the object from the image fragment. Doing so is one of the main open challenges in computer vision; as we show, current methods produce marginal relighting results.
The core observation is that one does not need to relight an image with fastidious physical accuracy to fool people. This view is supported both by the perception literature, reviewed briefly below, and by our user studies. Instead, we build a model with four separate components that can be estimated effectively: a reflectance term, a rough base shape, a parametric shading residual layer and a geometric detail layer. The reflectance term encodes object material; the base shape encodes coarse-scale shading by reshading; and the two detail layers encode higher-frequency components left out by the coarse shape. The base shape and the detail layers together encode the shading of an object. The model and the estimation procedures are described in section 3, and the resulting relighting system in section 4.
The model is not intended to be physically accurate, but it does give better re-rendering mean-squared error (MSE) than the state-of-the-art shape reconstruction method SIRFS Barron:2012B on the MIT intrinsic image dataset and on a new dataset. The model also performs well in user studies. A set of extensive user studies shows that subjects preferred our results over those of SIRFS by a margin of 20%, and over those of Karsch et al. Karsch:2011 by a margin of 14%, indicating that the model is a promising alternative to existing methods of object insertion.
2 Related work
To insert an object into a scene, a method must (a) determine the appearance of the object, when in the scene and (b) determine the effect of the object on the scene’s appearance. Different scenarios compel different approaches to these problems. For the cases of interest to us, each problem is attacked by recovering some form of geometric and photometric model of object and scene, then predicting appearance with these models. Relevant methods are linked by using the same compositing approach.
Compositing: In each of the cases of interest, we will have multiple estimates of most pixels, and must come up with a single image. Usually, there is $I$ (the target image of the scene the object must be inserted into), $E$ (a rendered image of a model of the scene without the inserted object), and $R$ (a rendered image of a model of the scene with an inserted object). Generally, we wish to composite these estimates to use the most reliable estimate at each pixel. We assume we have an object matte $M$ (which is 0 at pixels where the object is absent, and in $(0, 1]$ otherwise). Debevec et al. describe a now-standard method that preserves the original image as much as possible Debevec:1998:RSO . In this method, the final composite image $C$ is obtained by:

$$C = M \odot R + (1 - M) \odot (I + R - E) \tag{1}$$
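The compositing step can be sketched in a few lines of NumPy. This is a minimal illustration that assumes the usual differential-rendering form of the composite (the object rendering inside the matte; outside it, the target image corrected by the difference between the two renderings). The function and argument names are hypothetical and are not taken from the paper.

```python
import numpy as np

def composite(target, empty_render, object_render, matte):
    """Differential-rendering composite: C = M*R + (1 - M)*(I + R - E).

    target        -- I, the original photograph of the scene
    empty_render  -- E, rendering of the scene model without the object
    object_render -- R, rendering of the scene model with the object
    matte         -- M, 0 where the object is absent, in (0, 1] elsewhere
    """
    I = np.asarray(target, dtype=float)
    E = np.asarray(empty_render, dtype=float)
    R = np.asarray(object_render, dtype=float)
    M = np.asarray(matte, dtype=float)
    if M.ndim == I.ndim - 1:
        # Broadcast a single-channel matte over the colour channels.
        M = M[..., None]
    # Inside the matte use the rendered object; outside, keep the photograph
    # and add only the change the object causes (shadows, interreflections).
    return M * R + (1.0 - M) * (I + R - E)
```

Outside the matte, pixels where the two renderings agree are returned unchanged from the photograph, which is why this form preserves the original image as much as possible.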
Methods differ by how the object (resp. scene) model is estimated. Great simplifications are available if either scene or object is CG, or either scene or object is readily accessible (i.e. one can obtain many images, under a wide range of conditions; contrast an object or scene depicted in a legacy image). Symmetry means we can consider only cases where the scene is real (as in, not CG). We ignore the case where both scene and object are CG, which is rendering.
CG object into accessible scene: If the object is CG, then its geometric and photometric parameters are known. Strategies then differ depending on what is known about the scene. In one case of great practical importance, one has detailed access to the real scene into which a CG object must be inserted. Debevec et al. have shown methods that use reference objects to estimate the illumination environment at various points in the scene Debevec:1998:RSO . From this estimate, one can relight the object. If the scene illumination field needs to be corrected to account for the object’s effects, then one uses a geometric and photometric model of the scene.
CG object into legacy image: If the scene is a legacy image, there are significant difficulties. One must estimate sufficient geometric and photometric information from the image to be able to compute the object shading and any effects on the scene illumination. The legacy image is the target image $I$. Karsch et al. use a simple geometric model of a room as a box, estimate albedo using the standard Color Retinex algorithm Grosse:2009 , and estimate luminaires by looking for bright image patches Karsch:2011 . This model is corrected by user interaction. The luminaire parameters are then adjusted to make the rendered scene similar to $I$ in norm. Finally, the model is rendered to produce $E$, the rendering of the scene without the object. The object is then inserted using a geometric modeler. A simple renderer produces the object matte $M$ from this information; $R$, the rendering of the scene with the object, is obtained with a physically-based renderer. These images are composited as above.
The original method requires user interaction. A later paper by Karsch et al. describes an entirely automatic method for recovering a scene model Karsch:2014 . Scene geometry is estimated using the single-image depth method of Karsch:depth2012 . Visible luminaires are obtained by looking for very bright patches. The effect of out-of-view luminaires is estimated using a matcher. Albedo is again estimated using standard methods Grosse:2009 . Once the scene model has been recovered, the pipeline is the same as that in Karsch:2011 .
Accessible real objects reduce to the case of CG objects. One uses a variety of standard methods to build geometric and photometric models of the object, and proceeds as above. Modeling methods are reviewed in Hartley2004 ; photometricstereo:2008 ; Furukawa2015 . There is a highly developed literature for the very important case of faces, recently reviewed in Ghosh:2011:MFC ; Fyffe:2013:DHF ; Fyffe:facetopology2017 . An alternative, which we believe has not been explored in the literature, would be to recover an illumination cone for the fragment from multiple images. An illumination cone is a representation of all images of an object in a fixed configuration, as lighting changes. This cone is known to lie close to a low-dimensional space Basri:2003 : an illumination model with 9 spherical harmonic coefficients can account for up to 98% of shading variation, suggesting that quite low-dimensional image-based reshading methods are available. We use illumination cone methods in relighting objects, but have not explored the illumination cone as a modelling strategy.
Objects from legacy images into other legacy images are the primary topic of this paper. Here one wishes to cut a fragment, representing an object, out of one image and insert it into another, obtaining realistic results. Early work focused on resolving blending issues, and relied on the artist’s discretion not to insert an incompatible fragment into an image Burt83amultiresolution ; Perez:2003:poisson ; Agarwala:2004:montage . The artist’s work can be simplified by building a dictionary of fragments, organized by illumination estimates. This means that, for a particular scene, the artist can see and choose from compatible fragments Lalonde:2007:clipart ; Chen:2009:sketch2photo ; Liao:CVPR15 . However, there is no means to relight a fragment or correct the shading of a scene.
One natural strategy would be to recover a full geometric and photometric model of the object from the legacy image. This strategy is not available, because doing so is one of the major open problems in computer vision. Current methods are still unable to recover accurate shape from a single image; even inferring a satisfactory approximate shape is difficult. Methods that recover shape from shading (SfS) are unstable and inaccurate, particularly in the absence of reliable albedo and illumination Zhang:1999:survey ; Durou:2008:survey . Barron and Malik have shown significant improvements from jointly estimating shape, illumination and albedo Barron:2012B . We compare to that method here, though the method is not intended to cope with objects that have complex geometric mesostructure. An alternative would be to recover shape from contour, that is, to recover a surface that is smooth and meets normal constraints along the contour Johnston:2002:lumo ; Prasad:2006:SVR . These methods yield surface reconstructions that lack any detail. Our method revolves around correcting shading predictions made by a variant of the shape-from-contour method.
Human tolerance of shading inaccuracy: Evidence shows that the human visual system can tolerate a certain degree of shading inaccuracy. For example, observers find it hard to spot inconsistent shadow directions in a single image Ostrovsky:2005 as long as gross shading is correct. Highlights are important material cues for humans Beck:1981:HPG , but observers are not perturbed if the highlight is somewhat in the wrong place (see Berzh:2005:HGP , experiment 3). The alternative physics theory Cavanagh:2005 argues that the brain employs a set of rules that are convenient, but not strictly physical, when interpreting a scene from the retinal image. When these rules are violated, a perception alarm is fired, or recognition is negatively affected Tarr:1999 . Otherwise, the scene “looks right”. This means that humans may tolerate a considerable degree of estimation error, as long as it is of the right kind. Our results, which are clearly neither canonical nor physical but are effective at fooling human viewers, support this notion.
We note a poorly understood asymmetry between scene and object here, caused by the way in which methods are used. Typically, the object that will be inserted is a crucial part of the target image, and will command much visual attention from the observer. This means that humans may react differently to errors in the depiction of object and scene. We are not aware of guidelines in the literature as to what is tolerable here. It is usual to build less scrupulous scene models than object models, and doing so seems not to present difficulties (e.g. Karsch:2011 ; Karsch:2014 ; Debevec:1998:RSO ).
3 The shading model
We have an image fragment that is shaded by some illumination field. The fragment represents an object which might have complex surface material. There may be gloss, specular or mesostructural effects. We wish to adjust the shading over the fragment to look as though the illumination field has changed, while preserving the apparent material effects. We expect that errors in gross shading effects – for example, which side of an object illuminated from the side is bright – are likely to be apparent to people. We expect that inconsistencies between gross shading and small shadows (say, from mesostructural bumps) may be difficult to spot. We expect to be able to recover only a very simple geometric and photometric model of the object from the fragment.
These observations motivate our approximate shading model. We decompose shading into three components: a smooth component captured by shading a coarse geometric shape model $h$ (the “coarse shape”), and two shading detail layers, a parametric residual layer $S_p$ and a geometric detail layer $S_g$. The relighting of the object is given by the smooth shading plus a weighted combination of the two detail layers:

$$S(h, S_p, S_g) = \mathrm{shade}(h, L) + w_p S_p + w_g S_g \tag{2}$$

where $L$ is an illumination field, $w_p$ and $w_g$ are scalar weights, and $\mathrm{shade}(\cdot)$ can be any shading (rendering) procedure that works for the illumination and scene representation.
By shading the coarse shape, we can obtain a smooth shading that captures directional long-scale illumination effects. These make it look as though the overall illumination direction has changed. The two detail layers account for the appearance complexity of the object. They encode mid and high frequencies of the appearance that are not successfully predicted by the albedo and shape alone. We see a loose correspondence between the mid frequency layer ($S_p$) and small shape features (silhouettes, creases and folds, etc.), and between the high frequency layer ($S_g$) and material properties. Separating the two allows an artist to adjust each weight ($w_p$ and $w_g$ in equation 2) independently.
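The structure of the model is easy to see in code. The sketch below is only illustrative: the text allows any shading procedure, and here a single distant light with Lambertian shading stands in for it as an assumption. The names `shade_directional` and `relight`, and the use of a normal map to represent the coarse shape, are hypothetical.

```python
import numpy as np

def shade_directional(normals, light_dir):
    """Stand-in shading procedure (an assumption, not the paper's renderer):
    Lambertian shading of unit normals under one distant light."""
    l = np.asarray(light_dir, dtype=float)
    l = l / np.linalg.norm(l)
    # Clamp back-facing pixels to zero.
    return np.clip(normals @ l, 0.0, None)

def relight(normals, light_dir, S_p, S_g, w_p=1.0, w_g=1.0):
    """Smooth shading of the coarse shape plus the two weighted detail layers."""
    return shade_directional(normals, light_dir) + w_p * S_p + w_g * S_g
```

Changing `light_dir` moves only the smooth term, while `w_p` and `w_g` scale the mid and high frequency detail independently of the lighting.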
3.1 Estimating coarse shape
We need a shape representation capable of capturing gross shading effects. For example, a vertical cylinder with light from the left will be light on left, dark on right. Moving the light to the right will cause it to become dark on left, light on right. We also want our reconstruction to be consistent with generic view and generic lighting assumptions. This implies that the outline should not shift too much if the view shifts. It also implies relighting the shape with a directional light should not produce large or curious cast shadows. In turn, there should not be large bumps in the shape that are concealed by the view direction (i.e. that extend toward the viewer). Large bumps in shape concealed by the view direction are characteristic of current shape from shading methods (Fig. 13), but can create large cast shadows in the scene.
We use a simple shape from contour (SFC) method with a modification to ensure a stable outline (Fig. 2). First, we create a normal field by constraining normals on the object boundary to be perpendicular to the view direction, and interpolate them from the boundary to the interior region, similar to Johnston’s Lumo technique Johnston:2002:lumo . Let $N$ be the normal field, and let $\Omega$ and $\partial\Omega$ be the sets of pixels in the object and on its boundary, respectively. We compute $N$ by the following optimization:

$$\min_{N} \sum_{i \in \Omega} \|\nabla N_i\|^2 + (\|N_i\| - 1)^2 \quad \text{subject to } N_i^z = 0 \;\; \forall i \in \partial\Omega \tag{3}$$

We then reconstruct an approximate height field $h$ from the normal by minimizing:

$$\min_{h} \sum_{i \in \Omega} \left\| \nabla h_i - \left( \frac{-N_i^x}{\max(\epsilon, N_i^z)}, \frac{-N_i^y}{\max(\epsilon, N_i^z)} \right) \right\|^2 \tag{4}$$

subject to $h_i = 0$ for boundary pixels (stable outline). The threshold $\epsilon$ avoids numerical issues near the boundary for exact reconstruction (Wu et al. Wu:2008:SFS ), and forces the reconstructed object to have a crease along its boundary. This crease is very useful for supporting a generic view direction, as it allows a slight change of view direction without exposing the back of the object and causing self-occlusion. The reconstructed height field is then flipped to make a symmetric full 3D shape. Our model is a simple shape model in the spirit of puffballs Adelson:puffball , but in contrast has a crease along its boundary.
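The two steps (interpolate normals inward from the outline, then integrate them into a height field that is pinned to zero on the outline) can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' implementation: it uses plain Jacobi iterations in place of a proper solver, estimates outward boundary normals from the gradient of the binary mask, and assumes a target height gradient of `-N_x / max(eps, N_z)` and `-N_y / max(eps, N_z)`. The function name, the default `eps`, and the iteration count are all hypothetical.

```python
import numpy as np

def _neighbours(a):
    """Four axis-aligned neighbours of every pixel (zero padding outside the image)."""
    p = np.pad(a, [(1, 1), (1, 1)] + [(0, 0)] * (a.ndim - 2))
    return p[:-2, 1:-1], p[2:, 1:-1], p[1:-1, :-2], p[1:-1, 2:]

def coarse_shape(mask, eps=0.1, iters=2000):
    """Return (normals, height) for a binary object mask."""
    mask = np.asarray(mask, dtype=bool)
    m = mask.astype(float)
    # Boundary pixels are object pixels with at least one background neighbour.
    boundary = mask & (sum(_neighbours(m)) < 4)
    interior = mask & ~boundary

    # Boundary normals lie in the image plane (zero z) and point out of the mask.
    # Very thin structures can yield a zero mask gradient; they are not handled here.
    gy, gx = np.gradient(m)
    outward = -np.stack([gx, gy], axis=-1)
    outward /= np.maximum(np.linalg.norm(outward, axis=-1, keepdims=True), 1e-8)
    N = np.zeros(mask.shape + (3,))
    N[interior, 2] = 1.0
    N[boundary, :2] = outward[boundary]

    # Diffuse the normals inward, renormalising after every sweep.
    for _ in range(iters):
        avg = sum(_neighbours(N)) / 4.0
        avg /= np.maximum(np.linalg.norm(avg, axis=-1, keepdims=True), 1e-8)
        N[interior] = avg[interior]

    # Target gradient of the height field, thresholded where N_z is small.
    nz = np.maximum(eps, N[..., 2])
    gx_t, gy_t = -N[..., 0] / nz, -N[..., 1] / nz
    div = np.gradient(gx_t, axis=1) + np.gradient(gy_t, axis=0)

    # Least-squares integration: solve laplacian(h) = div with h = 0 on the outline.
    h = np.zeros(mask.shape)
    for _ in range(iters):
        h = np.where(interior, (sum(_neighbours(h)) - div) / 4.0, 0.0)
    return N, h
```

Because the threshold caps the slope at the outline, the height rises steeply but finitely from zero there, which is the crease the text describes. A production version would likely replace the Jacobi loops with a sparse linear solve.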
3.2 Parametric shading residual
The coarse shape can recover gross changes in shading caused by lighting. However, it cannot represent finer detail. We use shading detail maps to represent this detail. We define the shading detail maps as a representation of the residual incurred by fitting the object shading estimate with some model. We use two shading detail maps in our model: a parametric shading residual that encodes object-level features (silhouettes, creases and folds, etc.), and a geometric detail layer that encodes fine-scale material effects.
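First, we use a standard color Retinex algorithm Grosse:2009 to get an initial albedo estimate $\rho$ and shading estimate $S$ from the input image $I$: $I = \rho S$. We then use a parametric illumination model $L(\theta)$ to shade the coarse shape $h$ and compute the residual by solving:

$$\hat{\theta} = \arg\min_{\theta} \left\| S - \mathrm{shade}(h, L(\theta)) \right\|^2 \tag{5}$$

The optimized illumination $L(\hat{\theta})$ is substituted to obtain the parametric shading residual:

$$S_p = S - \mathrm{shade}(h, L(\hat{\theta})) \tag{6}$$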
Many parametric illumination models are possible (e.g., spherical harmonics). We used a mixture of 5 point sources in all of our experiments. The lights are parametrized by the position and intensity of each point source, forming a 20-dimensional representation.
An alternative is to use spherical harmonic lighting models. Figure 3 compares the parametrically fitted shading image and the derived shading residual obtained with a 9-coefficient spherical harmonics representation against those obtained with the point lights. Note that the shading image produced by a spherical harmonic model is smoother than that produced by a point source model. Our coarse shape has relatively smooth normals, and SH lighting models do not account for occlusion well. In contrast, point source shading falls off faster as a normal swings, so our point source model can produce shading that changes more abruptly (e.g. the dragon in the third column of Figure 3). In all of our experiments, we use our point source model to derive the parametric shading residual.
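For the spherical harmonic alternative, the fit is a linear least-squares problem, which makes it a convenient illustration of how a parametric shading residual is obtained. The sketch below is not the point-source model used in the experiments (that fit is nonlinear in the light positions). It assumes the standard 9-term real spherical harmonic basis evaluated on unit normals and ignores occlusion; the function names are hypothetical.

```python
import numpy as np

def sh_basis(normals):
    """First nine real spherical harmonic basis functions evaluated at unit normals."""
    x, y, z = normals[..., 0], normals[..., 1], normals[..., 2]
    return np.stack([
        0.282095 * np.ones_like(x),
        0.488603 * y, 0.488603 * z, 0.488603 * x,
        1.092548 * x * y, 1.092548 * y * z,
        0.315392 * (3.0 * z ** 2 - 1.0),
        1.092548 * x * z, 0.546274 * (x ** 2 - y ** 2),
    ], axis=-1)

def shading_residual(shading, normals, mask):
    """Fit 9 lighting coefficients to the shading estimate and return what is left over.

    shading -- shading estimate (e.g. from an intrinsic image method), H x W
    normals -- unit normals of the coarse shape, H x W x 3
    mask    -- boolean object mask, H x W
    """
    B = sh_basis(normals)[mask]                      # one row of basis values per pixel
    coeffs, *_ = np.linalg.lstsq(B, shading[mask], rcond=None)
    residual = np.zeros_like(shading, dtype=float)
    residual[mask] = shading[mask] - B @ coeffs      # detail the smooth fit cannot explain
    return coeffs, residual
```

Smooth, directional shading is absorbed by the nine coefficients, so the residual retains mostly localized features such as creases and folds.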
Note that the use of five point light sources is an empirical design choice. The five point sources are initialized to broadly cover light from the five major directions (frontal and four sides). A larger number of point sources could be used, or other types of light, such as area lights or spherical harmonics. What we demonstrate in the comparison in Figure 3, in the case of the statue horse and the statue dragon, is that spherical harmonics and point sources fit the input shading slightly differently and, more importantly, that the resultant parametric shading details still look characteristically very similar (Figure 3 (d) and (e)).
Figure 4 upper right row shows an example of the best fit coarse shading and the resultant parametric shading detail. Note that the directional shading is effectively removed, leaving shading cues of object level features. The bottom right row shows the geometric detail extraction pipeline.
3.3 Geometric detail
The parametric shading residual is computed by a global shape and illumination parametrization, and contains all the shading detail missed by the shape. Now we wish to compute another layer that contains only fine-scale detail. We use a technique from Liao et al. Liao:2013:geometricdetail that extracts fine-scale geometric detail. The algorithm reconstructs local image patches with an L1 regularized patch dictionary learned from data (called a local patch-based non-parametric filtering) and computes the reconstruction residual as the geometric detail. The resultant geometric detail represents high frequency shading signal caused by local surface geometry like bumps and grooves and is insensitive to gross shading and higher-level object features. See the characteristics of the geometric detail we obtained from the horse image (bottom row of Fig. 4) and compare it with the parametric shading residual term.
The filtering procedure uses a set of shading patches learned from smooth shading images to reconstruct an input shading image. In the experiment we use the same empirical parameter settings used in Liao:2013:geometricdetail : dictionary size of 500 patches and patch size . The two hyper-parameters should correlate with one another and should vary subject to input image size (image size in our experiments are around ).
How many detail layers are necessary? We choose two shading detail layers for a balance of representational power and ease of editing. Having two detail layers allows a user to adjust mid (e.g. muscle) and high (e.g. wrinkle) frequency shading components separately, which is a straightforward design choice. However, better solutions should exist. One possible idea is to extend the two-layer structure to a Laplacian-pyramid-like decomposition, in which a larger number of detail layers is defined, each covering a particular band of the frequency domain. Another possible idea is to replace the current image-based shading detail representation with illumination-independent “detail normal maps”, which would have interesting implications for the relighting effect. Yet another possibility is to manually define new shading detail maps in the current framework that capture other types of high frequency shading phenomena.
4 The relighting system
We now describe a system to take a fragment from an image and insert it into a new scene, relighting it and the scene as required. The system combines interactive scene modeling, physically-based rendering and image-based detail composition (Fig. 5).
4.1 Modeling and rendering
We build a sparse mesh object from the height field computed by the method of section 3.1 (by pixel-grid triangulation and mesh simplification Xia:2009 ). The target scene can be an existing 3D scene, or can be recovered from an image. In the latter case, we use the approach of Karsch et al. Karsch:2011 to recover a 3D model of the scene from the image. The method asks a user to specify indoor scene room boundaries, and then automatically recovers interior illumination parameters that minimize re-rendering error. The recovered illumination does a fairly good job of producing consistent shading on the inserted object; adjustment of the illumination intensity is sometimes needed in practice. Given the models of the scene and the object, the user then places the object model into the scene, adjusts its scale and orientation, and makes sure the view direction is roughly the same as that of the fragment in the original image. The model is then rendered with a physically-based renderer. Finally, the residual shading fields (the parametric shading residual and the geometric detail) are composited into the rendered image. For all the results in the paper, we use Blender (http://blender.org) for modeling and LuxRender (http://luxrender.net) for physically-based rendering. All target scenes were constructed using the technique of Karsch et al. Karsch:2011 if not otherwise stated.
To create the mesh object from the height field, we flip the mesh along the contour plane to create a closed 3D mesh. However, the flipped shape model is thin along the base and this can cause light leaks and/or skinny lateral shadows. We use a simple user-controllable back extrusion procedure to handle such cases. Specifically, the back extrusion asks a user to manually select a distance to extrude the back side of the mesh (since it was flipped and symmetric) to ensure full contact of the object bottom with the supporting surface. The extruded back is eased in the camera’s viewing direction to make sure it is invisible.
Our shape model assumes an orthographic camera, while most rendering systems use a perspective camera. This will cause texture distortion during texture mapping. Since we expect the focal point to be (a) far from the camera, and (b) largely frontal, we can use a simple easing operation to avoid self-occlusion and restore the texture field: write for a vertex on the model, for the coordinate of the vertex in the image plane, for the focal point, we replace with . If the camera is orthographic, there is no change in vertex position, and for cameras that are distant along the z-axis compared to the and axes, the shift is small.
We now composite the rendered scene with the two detail maps and the original scene to produce the final result (Fig. 6). First, we composite the two shading detail images (the parametric shading residual $S_p$ and the geometric detail $S_g$) with the shading field of the rendered image:

$$R' = \rho \, (S + w_p S_p + w_g S_g) \tag{7}$$

where $S = R / \rho$ (equivalently denoted as $\mathrm{shade}(h, L)$ in equation 2) is the shading field, $R$ is the rendered image, and $\rho$ is the albedo. The weights $w_p$ and $w_g$ can be automatically determined by regression (section 5.1) or manually adjusted by an artist with a slider control (Fig. 5 detail composition). Compositing the two detail layers improves the visual realism of the rendered object (Fig. 7). A controlled user study (section 5.3, task 1) showed that users consistently prefer composition results with more detail layers applied. Note that the compositing procedure can tolerate a certain degree of albedo-shading misclassification, because the albedo and the shading are composited back (equation 7). Misclassification of albedo and shading would, however, affect the detail manipulation results in principle (when $w_p$ and $w_g$ are not equal to 1), i.e., color instead of the shading signal gets magnified or smoothed.
Second, we use standard techniques (e.g. Debevec:1998:RSO ; Karsch:2011 ) to composite the detail-composited rendering with the original image of the target scene. This produces the final result. Write $I$ for the target image, $E$ for the empty rendered scene without the inserted object, $R$ for the rendered scene with the inserted object, and $M$ for the object matte (0 where no object is present, and in $(0, 1]$ otherwise); the final composite image is obtained by equation 1.
5 Evaluation
Our assumption is that the approximate shading model can capture the major effects of an illumination change on an object in a new environment and generate a visually plausible image. To evaluate the performance, we compare our representation with state-of-the-art shape reconstructions by Barron and Malik Barron:2012B on a re-rendering metric (Section 5.1). We then evaluate the effect of different base shape acquisition methods on the relighting performance (Section 5.2). We also conduct an extensive set of user studies to evaluate the realism of our relighting results versus that of Barron and Malik Barron:2012B , Karsch et al. Karsch:2011 and real scenes (Section 5.3). The evaluation results show that our representation is a promising alternative to the existing methods for object relighting.
5.1 Re-rendering error
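The re-rendering metric measures the error of relighting an estimated shape. On a canonical shape representation (a depth field), the metric is defined as

$$\mathrm{MSE} = \frac{1}{n} \left\| I^* - k \, \hat{\rho} \, \mathrm{shade}(\hat{h}, L) \right\|^2 \tag{8}$$

where $\hat{\rho}$ and $\hat{h}$ are the estimated albedo and depth, $I^*$ is the re-rendering with the ground truth shape $h^*$ and albedo $\rho^*$: $I^* = \rho^* \, \mathrm{shade}(h^*, L)$, $n$ is the number of pixels, and $k$ is a scaling factor that minimizes the squared error.
With our model, write $S_c$, $S_p$ and $S_g$ for the coarse shading, the parametric shading detail and the geometric detail, respectively, and replace the corresponding part of Equation 2 with

$$S(\mathbf{w}) = [S_c, S_p, S_g] \, \mathbf{w}$$

for some choice of weight vector $\mathbf{w}$. The re-rendering metric is:

$$\mathrm{MSE} = \frac{1}{n} \left\| I^* - k \, \hat{\rho} \, [S_c, S_p, S_g] \, \mathbf{w} \right\|^2 \tag{9}$$

That is, the rendering of the canonical shape is replaced by our approximate shading model (Equation 2).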
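The scaling factor that minimizes the squared error has a closed form, so the metric is cheap to compute. The sketch below assumes a plain mean over all pixels and a single global scale; the function name is hypothetical, and the prediction is assumed to be non-zero somewhere so that the division is defined.

```python
import numpy as np

def rerender_mse(prediction, target):
    """Mean squared error between target and k * prediction, with k chosen optimally.

    Returns (mse, k), where k = <p, t> / <p, p> minimises ||k p - t||^2.
    """
    p = np.asarray(prediction, dtype=float).ravel()
    t = np.asarray(target, dtype=float).ravel()
    k = (p @ t) / (p @ p)            # closed-form least-squares scale
    return float(np.mean((k * p - t) ** 2)), float(k)
```

Because of the free scale, a prediction that differs from the target only by a global brightness factor has zero error under this metric.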
We offer three methods to select the weights. An oracle could determine the values by the least squares fit that leads to the best MSE. Regression could offer a value based on past experience: we learn a simple linear regression model to predict the weights from illumination. Lastly, an artist could manually choose the weights, as demonstrated in our relighting system (Sec. 4.2).
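The oracle setting can be sketched as an ordinary least-squares problem. This is an assumption about how such a fit might be set up, not the authors' code: the three layers, multiplied by the albedo, form the columns of a design matrix, and the global scale is absorbed into the fitted weights. The function and argument names are hypothetical.

```python
import numpy as np

def oracle_weights(albedo, S_c, S_p, S_g, target, mask):
    """Least-squares weights for coarse shading, parametric residual and geometric detail.

    All image arguments are H x W arrays; mask is a boolean H x W object mask.
    Returns (weights, mse) evaluated over the masked pixels.
    """
    # One column per layer, one row per object pixel.
    A = np.stack([(albedo * S)[mask] for S in (S_c, S_p, S_g)], axis=-1)
    t = target[mask]
    w, *_ = np.linalg.lstsq(A, t, rcond=None)
    mse = float(np.mean((A @ w - t) ** 2))
    return w, mse
```

A regression variant would fit the same three weights for many training illuminations and then predict them for a new illumination, rather than reading them off the ground truth image.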
|B + M Barron:2012B||0.0172||0.0372|
Experiment We run the evaluation on the augmented MIT Intrinsic image dataset Barron:2012B . To generate the target images, we re-render each of the 20 objects under 20 randomized monochrome spherical harmonic illuminations, forming a $20 \times 20$ image set. We then measure the re-rendering error of our model and of Barron and Malik’s reconstructions. For our method, we compare models built from the Natural Illumination dataset and the Lab Illumination dataset separately. The models are built (a) in the default setting, (b) using Barron and Malik’s shading estimation, and (c) using the ground truth shading. See Table 1 for the results. To learn the linear regression model, we draw for each object its 100 nearest neighbors in illumination space from the other 380 data points, and fit its weights by least squares.
Table 1 displays our experiment result. The result shows that when the shape estimation of Barron and Malik is accurate (on the “Natural” Illumination dataset, a synthetic dataset by the same shading model used in their optimization), our approximate shading performs less well. This is reasonable because a perfect shape is supposed to produce zero error in the re-rendering metric. This is also acceptable because the dataset images are not real. On the real image set (the “Lab” Illumination dataset, taken in lab environment with strong shadows), the shape estimation of Barron and Malik becomes inaccurate, and our approximate shading model can produce lower error with both regressed weights and oracle setting. With better detail layers (when ground truth shading is used to derive them), our model achieves significantly lower errors.
5.2 The effect of shape estimates
The model relies on the rough base shape to generate coarse-scale shading according to lighting environment. For practical consideration we have chosen an internally-smooth SFC representation. This representation captures no object-specific creases or folds, and can be quite different from the true depth. In this section, we investigate the effect of alternative estimates of the rough base shape.
We consider three alternative shape representations: (a) Kinect depth, (b) depth from SIRFS, and (c) depth from SIRFS-stereo. These are fair baselines. The Kinect is a standard consumer depth sensor. The SIRFS and SIRFS-stereo algorithms are described in Barron and Malik Barron:2012B ; the latter was used to reconstruct the ground-truth depth of the augmented MIT intrinsic image dataset using multiple differently illuminated images.
We also need a new dataset for this evaluation because existing intrinsic image datasets (e.g. the MIT intrinsic image dataset) do not have Kinect depths, and the NYU RGBD dataset has very low resolution, which is undesirable for our relighting purpose. Our new dataset contains 10 example objects. Each example has one diffuse image under normal lighting (with no strong directional light) to allow estimation of the model, one aligned depth image captured by a Kinect-II, and 10 additional images taken under the same camera pose but different lighting conditions. For each of these 10 images, a diffuse light probe is placed nearby and photographed to recover the ground truth lighting coefficients Ramamoorthi:2001:SH . Figure 8 shows a few examples from the new dataset, along with one example’s 10 lighting images and their estimated lighting.
Experiment In the experiment, we build four models for each object using (a) the SFC base shape, (b) the Kinect base shape, (c) the SIRFS base shape and (d) the SIRFS-stereo base shape. For each option, the rest of the model (albedo and the two detail layers) is derived as described in section 3. We then relight each object with its four models under the 10 lighting conditions. So for each object and each lighting, we have one ground truth image from the dataset, and four relighting results that differ only because of the base shape. Comparing the relighting results to the ground truth image gives us the re-rendering MSE (eq. 9) for each model; the averaged errors are displayed in Table 2. In the table, the oracle setting is the same as that described in section 5.1. The default setting is different from the regression method in section 5.1. Instead, we use the default weight 1 for both detail layer weights, and then compute the MSE up to a free scaling factor. Figure 9 visualizes the overall and per-instance errors of Table 2.
Discussion Our SFC base shape is comparable to the Kinect and SIRFS shape methods in terms of re-rendering error in the default setting, and is slightly better in the oracle setting. This is an important result: our simple base shape estimate is not the major performance barrier in this task, since more complex estimates of the base shape do not outperform it. There are two possible explanations. First, the alternative shapes (Kinect and SIRFS) are not that good either, so the simple and stable SFC shape representation has a certain advantage. Second, the detail layers are moderately successful at compensating for missing details, so the smooth SFC shape only needs to take care of the smooth shading. On the other hand, the SIRFS-stereo shape gives significantly higher performance, which means the current model (with SFC base shape) still has considerable room for improvement.
The Kinect shape does not work particularly well. This is because we used the Kinect-II, which is a time-of-flight depth sensor rather than the structured light sensor (Kinect-I). The time-of-flight sensor works best for foreground/background separation (i.e., for hand/body tracking), but it has very low depth disparity on objects. Kinect-I might give better depth disparity but we did not use it because it is out dated and it has very low RGB resolution. A more accurate object-level depth sensor is certainly going to boost the relighting performance of our model from the current state, just like the stereo shape did. And note we only need the depth sensor to acquire depth in the coarse scale, because we still have the detail layers for the rest.
In Figure 9, it shows the backpack has much lower errors than the average errors. This is because it is a black backpack, so the overall image values are small, which proportionally affects the errors, because the re-rendering MSE is dependent on the target image (if we scale a target image by a factor of two, the errors is to increase by about the same factor). Therefore, it does not make much sense to compare errors across objects or datasets. Instead, it is the relative errors that matter, as is shown in Figure 9.
It is worth noting that MSE is not geared toward visual realism of relighting results (non-linearity of visual perception; image features take little weights, etc.). So, we further conduct a set of user studies to measure the subjective user ratings to our relighting results.
5.3 User study
In the study, each subject is shown a series of two-alternative forced choice tests and, for each pair, chooses the image he/she feels is the most realistic. We tested five different tasks: (1) our method against real images, (2) our method against Barron and Malik Barron:2012B , (3) our method against Karsch et al. Karsch:2011 , and (4) our method with a controlled number of detail layers. The last task shows that both detail layers help make the result more visually realistic. The other three tasks show the advantage of our result over that of Barron and Malik Barron:2012B and Karsch et al. Karsch:2011 . Figure 10 shows example image pairs used in the user study.
Experiment For each task, we created 10 different insertion results using a particular method (ours, Barron and Malik, or Karsch et al.). For the results of Barron and Malik, we ensured the same object was inserted at roughly the same location as in our results. This was not the case for the results of Karsch et al., as synthetic models were not all available for the objects we chose. We also collected 10 real scenes (similar to the ones with insertion) for the tasks involving real images. Each subject viewed all 10 pairs of images for one and only one of the five tasks. For the 10 results by our method, the detail layer weights were manually selected (it is hard to apply the regression model of Section 5.1 to the real scene illuminations), while the other two methods do not have such options.
We polled 100-200 subjects using Mechanical Turk for each task. In an attempt to avoid inattentive subjects, each task also included four “qualification” image pairs (a cartoon picture next to a real image). Subjects who incorrectly chose any of the four cartoon pictures as realistic were removed from our findings (6 in total, leaving 294 studies with usable data). At the end of the study, we showed subjects two additional image pairs: a pair containing rendered spheres (one physically plausible, the other not), and a pair containing line drawings of a scene (one with proper vanishing point perspective, the other not). For each pair, subjects chose the image they felt looked most realistic. Then, each subject completed a brief questionnaire, listing demographics, expertise, and voluntary comments.
These answers allowed us to separate subjects into subpopulations: male/female, age group, whether or not the subject correctly identified both the physically accurate sphere and the proper-perspective line drawing at the end of the study (passed/failed perspective-shading (p-s) tests), and expert/non-expert (subjects were classified as experts only if they passed the perspective-shading tests and indicated that they had expertise in art/graphics). We also attempted to quantify any learning effects by grouping responses into the first half (first five images shown to a subject) and the second half (last five images shown).
Results and discussion In the first task, the user study results show that subjects confused our insertion result with a real image in 44% of 1040 viewed pairs (task 1, see Table 3); an optimal result would be 50%. We also achieve better confusion rates than the insertion results of Barron and Malik Barron:2012B (42%), and perform well ahead of the method of Barron and Malik in a head-to-head comparison (task 2, Fig. 11 left), as well as in a head-to-head comparison with the method of Karsch et al. Karsch:2011 (task 3, Fig. 11 right). In the last task, users consistently preferred insertion results with more detail layers applied (Figure 12).
Table 3 demonstrates how well images containing inserted objects (using either our method or that of Barron and Malik) hold up against real images (task 1). We observe better confusion rates (i.e., our method is confused with real images more often than the method of Barron and Malik) overall and in each subpopulation except for the population who failed the perspective and shading tests in the questionnaire.
We also compared our method and the method of Barron and Malik head-to-head by asking subjects to choose a more realistic image when shown two similar results side-by-side (Fig. 10 middle column). Figure 11 summarizes our findings. Overall, users chose our method as more realistic in a side-by-side comparison on average 60% of the time in 1000 trials. In all subject subpopulations, our method was preferred by a large margin to the method of Barron and Malik; each subpopulation was at least two standard deviations away from being “at chance” (50% – see the red bars and black dotted line in Fig. 11). Most interestingly, the expert subpopulation preferred our method by an even greater margin (66%), indicating that our method may appear more realistic to those who are good judges of realism.
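To make the "standard deviations away from chance" comparison concrete, the following sketch computes how far an observed preference rate lies from 50% in units of the binomial standard deviation. It assumes every trial is an independent two-alternative choice; how the standard deviations in Fig. 11 were actually computed is not stated here, and the function name `deviations_from_chance` is hypothetical.

```python
import math


def deviations_from_chance(preferred: int, trials: int, chance: float = 0.5) -> float:
    """Distance of the observed preference rate from chance, in standard deviations.

    Assumption: trials are independent Bernoulli draws, so under the chance
    hypothesis the rate has standard deviation sqrt(chance * (1 - chance) / trials).
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0.0 < chance < 1.0:
        raise ValueError("chance must lie strictly between 0 and 1")

    rate = preferred / trials
    std = math.sqrt(chance * (1.0 - chance) / trials)
    return (rate - chance) / std


# Illustrative use with the overall figures quoted above (60% of 1000 trials).
z = deviations_from_chance(600, 1000)
```

Under this independence assumption, the overall 60% preference over 1000 trials sits more than six standard deviations above chance; smaller subpopulations have fewer trials and therefore wider intervals.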
Karsch et al. Karsch:2011 performed a similar study to evaluate their 3D synthetic object insertion technique, in which subjects were shown similar pairs of images, except the inserted objects were synthetic models. In their study, subjects chose the insertion results only 34% of the time, much lower than the two insertion methods in this study: a full 10 points lower than our method and 8 points lower than the method of Barron and Malik. While the two studies were not conducted in identical settings, the results are nonetheless intriguing. We postulate that this large difference is due to the nature of the objects being inserted: we use real image fragments that were formed under real geometry, complex material and lighting, sensor noise, and so on; they use 3D models for which photorealism can be extremely difficult to model. By inserting image fragments instead of 3D models, we gain photorealism in a data-driven manner (Fig. 10 C1 versus C2). This postulation is supported by our comparison in task 3. For all but one subpopulation, our results were preferred by a large margin (Fig. 11 right).
Figure 12 displays the result of the last task. It demonstrates that the model produces better insertion results as more detail layers are applied. Overall, users preferred insertion results with detail 1 over those without detail composition in 61% of 764 viewed image pairs, and preferred results with both detail layers over those with only detail 1 in 65% of 756 viewed pairs. Consistent results were observed in all subpopulations.
6 Conclusion and future work
We have presented an approximate shading model and the accompanying algorithms for building the corresponding object model from a single image, and a relighting system that supports image-to-image object insertion with an interactive interface for user control. The shading model is based on psychological findings about human visual perception, and therefore differs from existing physically-based shading approaches. In the end, the model is a hybrid of physically-based graphics rendering and image-based detail composition. The detail components enable the object model to accommodate surface mesostructure, for which a 3D mesh representation is limited, and allow the visual effect of such fine-scale structure to be modeled and transferred directly from image to image under illumination change.
The system can be improved in several directions. First, the object is constrained to be relit from roughly the same camera direction. A slight view angle perturbation is manageable, but a larger view angle change would expose the back of the object, because the object model is constructed from a single view and not as a 3D model. Integrating geometric stereo techniques to recover a full 3D object model would be an interesting next step. Second, the core part of the model, the two detail layers, can be further improved. One important direction is to extend them to a larger number of components for finer modeling granularity and control. It remains an open question how to define such a detail series in a more principled way than the current handcrafted approach. Another possible direction of future work is to focus on the very important case of the human face. We can make use of category-specific priors for better intrinsic decomposition, exploit a parametric face model for better depth recovery, and, possibly, learn more powerful shape or material extraction models from data, if a suitable face dataset is publicly available.
Another direction is the integration of the widely adopted deep neural networks into our system. We have seen exciting improvements in intrinsic decomposition narihira2015direct ; DBLP:KimPSL16 ; lettry2016darn , where both end-to-end trained convolutional neural networks and adversarial training with a CNN-based generator/discriminator network have proved effective. We have also seen the reconstruction of face normal or depth, or joint reconstruction of face normal, albedo and illumination, using the convolutional encoder-decoder framework Trigeorgis_2017_CVPR ; Richardson_2017_CVPR ; Shu_2017_CVPR . In addition, Soltani et al. Soltani_2017_CVPR propose to synthesize 3D shapes from depth maps and silhouettes using a deep generative network. Deshpande et al. Deshpande_2017_CVPR propose to generate diverse color fields from grayscale images using variational autoencoders. The key to the above-mentioned problems can be viewed as an analysis (i.e. representation learning) process followed by a synthesis (i.e. generation) process, and deep neural networks are the most convenient computational infrastructure and toolset for this problem to date. Yet the key component of our system, the relighting process, has not been seen in neural network-based solutions so far. It remains an open question whether deep neural networks will be a viable approach. The challenges may include the requirement of high resolution visual outputs, the trade-offs between diversity and physical correctness, and the network’s adaptability to unseen geometry or illumination patterns.
Acknowledgements DAF is supported in part by Division of Information and Intelligent Systems (US) (IIS 09-16014), Division of Information and Intelligent Systems (IIS-1421521) and Office of Naval Research (N00014-10-10934). ZL is supported in part by NSFC grant No. 61602406, ZJNSF Grant No. Q15F020006 and a special fund from Alibaba – Zhejiang University Joint Institute of Frontier Technologies.
- (1) A. Agarwala, M. Dontcheva, M. Agrawala, S. Drucker, A. Colburn, B. Curless, D. Salesin, and M. Cohen. Interactive digital photomontage. ACM Trans. Graph., 23(3):294–302, 2004.
- (2) A. Arsalan Soltani, H. Huang, J. Wu, T. D. Kulkarni, and J. B. Tenenbaum. Synthesizing 3d shapes via modeling multi-view depth maps and silhouettes with deep generative networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
- (3) J. T. Barron and J. Malik. Color constancy, intrinsic images, and shape estimation. ECCV, 2012.
- (4) R. Basri and D. Jacobs. Lambertian reflectance and linear subspaces. PAMI, 2003.
- (5) J. Beck and S. Prazdny. Highlights and the perception of glossiness. Perception and Psychophysics, 1981.
- (6) J. Berzhanskaya, G. Swaminathan, J. Beck, and E. Mingolla. Remote effects of highlights on gloss perception. Perception, 2005.
- (7) P. J. Burt and E. H. Adelson. A multiresolution spline with application to image mosaics. ACM Trans. Graph., 2(4):217–236, 1983.
- (8) P. Cavanagh. The artist as neuroscientist. Nature, pages 301–307, 2005.
- (9) T. Chen, M.-M. Cheng, P. Tan, A. Shamir, and S.-M. Hu. Sketch2photo: internet image montage. ACM Trans. Graph., 28(5):124:1–124:10, Dec. 2009.
- (10) P. Debevec. Rendering synthetic objects into real scenes: bridging traditional and image-based graphics with global illumination and high dynamic range photography. In SIGGRAPH’98, pages 189–198. ACM, 1998.
- (11) A. Deshpande, J. Lu, M.-C. Yeh, M. Jin Chong, and D. Forsyth. Learning diverse image colorization. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
- (12) J.-D. Durou, M. Falcone, and M. Sagona. Numerical methods for shape-from-shading: A new survey with benchmarks. Comput. Vis. Image Underst., 109(1):22–43, 2008.
- (13) C. H. Esteban, G. Vogiatzis, and R. Cipolla. Multiview photometric stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(3):548–554, 2008.
- (14) Y. Furukawa and C. Hernandez. Multi-view Stereo: A Tutorial. Foundations and Trends in Computer Graphics and Vision, 2015.
- (15) G. Fyffe, A. Jones, O. Alexander, R. Ichikari, P. Graham, K. Nagano, J. Busch, and P. Debevec. Driving high-resolution facial blendshapes with video performance capture. In ACM SIGGRAPH 2013 Talks, SIGGRAPH ’13, pages 33:1–33:1, New York, NY, USA, 2013. ACM.
- (16) G. Fyffe, K. Nagano, L. Huynh, S. Saito, J. Busch, A. Jones, H. Li, and P. Debevec. Multi-View Stereo on Consistent Face Topology. Computer Graphics Forum, 2017.
- (17) A. Ghosh, G. Fyffe, B. Tunwattanapong, J. Busch, X. Yu, and P. Debevec. Multiview face capture using polarized spherical gradient illumination. ACM Trans. Graph., 30(6):129:1–129:10, Dec. 2011.
- (18) R. Grosse, M. K. Johnson, E. H. Adelson, and W. T. Freeman. Ground-truth dataset and baseline evaluations for intrinsic image algorithms. In ICCV, pages 2335–2342, 2009.
- (19) R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, ISBN: 0521540518, second edition, 2004.
- (20) S. F. Johnston. Lumo: illumination for cel animation. In NPAR ’02, 2002.
- (21) K. Karsch, V. Hedau, D. Forsyth, and D. Hoiem. Rendering synthetic objects into legacy photographs. ACM Trans. Graph (SIGGRAPH Asia), 30(6):157:1–157:12, 2011.
- (22) K. Karsch, C. Liu, S. B. Kang, and N. England. Depth extraction from video using non-parametric sampling. In ECCV, 2012.
- (23) K. Karsch, K. Sunkavalli, S. Hadap, N. Carr, H. Jin, R. Fonte, M. Sittig, and D. Forsyth. Automatic scene inference for 3d object compositing. ACM Trans. Graph., 33(3):32:1–32:15, June 2014.
- (24) S. Kim, K. Park, K. Sohn, and S. Lin. Unified depth prediction and intrinsic image decomposition from a single image via joint convolutional neural fields. In European Conference on Computer Vision, 2016.
- (25) J.-F. Lalonde, D. Hoiem, A. A. Efros, C. Rother, J. Winn, and A. Criminisi. Photo clip art. ACM Transactions on Graphics (SIGGRAPH 2007), 26(3):3, August 2007.
- (26) L. Lettry, K. Vanhoey, and L. Van Gool. Darn: a deep adversial residual network for intrinsic image decomposition. arXiv preprint arXiv:1612.07899, 2016.
- (27) Z. Liao, K. Karsch, and D. Forsyth. An approximate shading model for object relighting. In CVPR, 2015.
- (28) Z. Liao, J. Rock, Y. Wang, and D. Forsyth. Non-parametric filtering for geometric detail extraction and material representation. In CVPR, pages 963–970, 2013.
- (29) T. Narihira, M. Maire, and S. X. Yu. Direct intrinsics: Learning albedo-shading decomposition by convolutional regression. In Proceedings of the IEEE International Conference on Computer Vision, pages 2992–2992, 2015.
- (30) Y. Ostrovsky, P. Cavanagh, and P. Sinha. Perceiving illumination inconsistencies in scenes. Perception, 34:1301–1314, 2005.
- (31) P. Pérez, M. Gangnet, and A. Blake. Poisson image editing. ACM Trans. Graph., 22(3):313–318, July 2003.
- (32) M. Prasad and A. Fitzgibbon. Single view reconstruction of curved surfaces. In CVPR, pages 1345–1354, 2006.
- (33) R. Ramamoorthi and P. Hanrahan. An efficient representation for irradiance environment maps. In Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH’01, pages 497–500, 2001.
- (34) E. Richardson, M. Sela, R. Or-El, and R. Kimmel. Learning detailed face reconstruction from a single image. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
- (35) Z. Shu, E. Yumer, S. Hadap, K. Sunkavalli, E. Shechtman, and D. Samaras. Neural face editing with intrinsic image disentangling. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
- (36) M. J. Tarr, D. Kersten, and H. H. Bülthoff. Why the visual recognition system might encode the effects of illumination. Vision Research, 38:2259 – 2275, 1998.
- (37) G. Trigeorgis, P. Snape, I. Kokkinos, and S. Zafeiriou. Face normals “in-the-wild” using fully convolutional networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
- (38) N. Twarog, M. Tappen, and E. Adelson. Playing with puffball: simple scale-invariant inflation for use in vision and graphics. In Proceedings of the ACM Symposium on Applied Perception, SAP’12, pages 45–54, 2012.
- (39) T.-P. Wu, J. Sun, C.-K. Tang, and H.-Y. Shum. Interactive normal reconstruction from a single image. ACM Trans. Graph., 27(5):119:1–119:9, Dec. 2008.
- (40) T. Xia, B. Liao, and Y. Yu. Patch-based image vectorization with automatic curvilinear feature alignment. ACM Trans. Graph., 28(5):115:1–115:10, Dec. 2009.
- (41) R. Zhang, P.-S. Tsai, J. Cryer, and M. Shah. Shape-from-shading: a survey. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 21(8):690–706, 1999.