SSN: Soft Shadow Network for Image Compositing

07/16/2020 ∙ by Yichen Sheng, et al.

In image compositing tasks, objects from different sources are put together to form a new image. Artists often increase realism by adding object shadows to match the scene geometry and lighting. However, creating realistic soft shadows requires skill and is time-consuming. We introduce a Soft Shadow Network (SSN) to generate convincing soft shadows for 2D object cutouts automatically. SSN takes an object cutout mask as input and thus is agnostic to image types such as painting and vector art. Although inferring the 3D shape of an object from its silhouette can be ambiguous, humans can readily grasp the 3D geometry from a 2D projection when the object is shown in an iconic view. We follow this intuition and train the SSN to render soft shadows for objects' iconic views. To train our model, we design an efficient pipeline to produce diverse soft shadow training data using 3D object models. Our pipeline first computes a set of soft shadow bases by sampling hard shadows. During training, environment lighting maps that cover a wide spectrum of possible configurations are used to calculate the soft shadow ground truth from the shadow bases. This enables our model to see complex lighting patterns and to learn the interaction between lights and 3D geometry. In addition, we propose an inverse shadow map representation, which focuses training on the shadow area and leads to much faster convergence and better performance. We show that our model produces realistic soft shadow details for objects of different shapes. A user study shows that SSN-generated shadows are often indistinguishable from shadows calculated by physics-based rendering. Our SSN produces a shadow in real time and allows interactive shadow manipulation. We develop a simple user interface, and a second user study shows that amateur users can easily use our tool to generate soft shadows matching a reference shadow.


1 Introduction

Image compositing is an important and powerful means for image editing, where elements from different sources are put together to create a new image. One of the challenging tasks in image compositing is synthesizing shadows for objects, as the properties of the shadows constitute strong visual cues for the scene geometry and lighting. Manually creating a convincing shadow from a 2D object mask requires a significant amount of expertise and effort, because the shadow generation process involves a complex interaction between the object geometry and the light sources, especially for area lights and soft shadows. In this work, we ease image compositing by introducing a Soft Shadow Network (SSN) that helps users generate a convincing soft shadow given a 2D object mask and a potentially complex light configuration that can be easily created.

Figure 2: An example of image compositing using our Soft Shadow Network. The user wants to composite some cartoon figures (left) onto a background photo (middle). To make the composite look more realistic, the user uses our shadow generation tool to synthesize shadows that match the direction, intensity, and softness of the shadow of the miniature figure in the background image. It only takes a couple of minutes to achieve the desired shadow effect. Best viewed zoomed-in.

Existing algorithms working with 3D objects often aim at speed while sacrificing shadow quality. Many are limited to a single area light source or rely on global illumination, which generates soft shadows as one of many effects. Moreover, previous soft shadow generation methods require the 3D shape of the object, which is not available in 2D image compositing. Although calculating accurate shadows requires 3D information about the object and the scene, we speculate that the essential 3D information for soft shadow generation can often be inferred from 2D object cutouts. The strong 3D shape and pose cues of common objects make the soft shadow predictable. Therefore, it should be possible to train a model on a wide variety of 3D scenes and lighting conditions and then use it to predict soft shadows for 2D binary masks extracted as cutouts from images. Moreover, the human perceptual system is sensitive to sharp gradient changes (edges and hard shadows) and is more forgiving of subtle variations in intensity such as soft shadows. While it would be difficult to train such a network on all possible scenes and lighting conditions, we hypothesize that broad coverage will produce perceptually realistic shadows for objects such as humans in their common views.

We introduce the Soft Shadow Network (SSN): a deep neural network that generates approximate soft shadows for a 2D binary mask image and an input image-based lighting. Our model's objective is to generate approximate soft shadows that can be used for interactive 2D image composition. The SSN has been trained on 3D models of various shapes with millions of randomly sampled complex lighting patterns. Rendering soft shadows for complex lighting patterns can be time-consuming, which would throttle the training process. Therefore, we propose an efficient data pipeline to render complex soft shadows on the fly.

For each 3D object model, we first sample a few iconic view angles and compute the corresponding cutout masks. Given an object view, a set of soft shadow bases is pre-computed by sampling the hard shadows corresponding to an area light grid in the image-based lighting map. The bases are used to generate soft shadows given a randomly generated coefficient map, which is a low-resolution version of the environment lighting map represented as a Gaussian mixture.

We also propose an inverse shadow representation to make the training process easier. A shadow map represents shadow areas using low-intensity values. We observe that this representation makes the network training considerably harder, as the brightness in the rest of the map can vary significantly for different lighting inputs. Therefore, we invert the shadow map so that the bright area becomes the shadow and the rest of the map is zero. This simple transformation significantly improves the training convergence and keeps the model training focused on the shadow regions.

Our SSN can produce realistic soft shadows for 2D object cutouts. Since our model only relies on object masks, it is agnostic to image types such as painting, cartoon, or vector art. In Fig. 1, we show animated shadows predicted by SSN for objects of various shapes in different image types. SSN produces smooth transitions with changing lightmaps as well as realistic shading details on the shadow map, especially near the object-ground contact points.

A perceptual user study confirms that in many cases the soft shadows generated by SSN are visually indistinguishable from the ground truth soft shadows generated for 3D models by a physics-based renderer. Moreover, we demonstrate our approach as an interactive tool that allows for real-time shadow manipulation, with a system response of about 20 ms in our implementation. As confirmed by a second user study, photo editors can effortlessly incorporate a cutout with desirable soft shadows into an existing image in a couple of minutes by using our tool (see Fig. 2).

In summary, our key contribution is a soft shadow generation network that can produce convincing soft shadows with complex light configurations. We develop a method to generate diverse training data of soft shadows and lighting maps on the fly, and we propose an inverse map representation to improve the training. Our SSN enables accurate, easy, real-time soft shadow generation, which can be useful for many image compositing tasks.

2 Related Work

We relate our work to soft shadow generation and image composition. Shadow synthesis is one of the oldest topics in computer graphics and we refer the reader to the state of the art report [woo1990survey] as well as to the recent book [eisemann2016real]. We limit the review to soft shadow generation [soft-survey] and to recent algorithms for image composition, relighting, and manipulation.

Soft Shadows: One common method for soft shadow generation from a single area light source is to approximate it by summing multiple hard shadows, and the common algorithms build on the seminal hard-shadow map generation [Crow77]. For example, the work of [assarsson2003geometry] uses volumetric shadows that exploit object silhouettes of the shadow caster. So-called silhouette maps [shadowSil] also use the object silhouette to augment the information in the shadow map for faster hard shadow computation, and they were also used for texture magnification [sen2004silhouette]. Many of the single light source methods aim at real-time performance and sacrifice the precision of the soft shadow calculation for speed [brabec2002single, EGWR:EGWR03:208-218]. Contrary to silhouette-based approaches requiring full 3D object geometry, our algorithm estimates soft shadows from only a 2D binary mask of the object using deep learning. Our algorithm also aims at fast generation, but instead of sampling the input geometry, it estimates the soft shadow from the input binary mask and the image-based light.

Agrawala et al. [agrawala2000efficient] introduced two algorithms: layered attenuation maps that are precomputed, and coherence-based raytracing that uses the maps for soft shadow generation. Guennebaud et al. [guennebaud2006real] project the object geometry back to the area light to estimate the coverage. This method uses a single shadow map to approximate soft shadows, and it suffers from soft shadow overestimation. This drawback has been eliminated by [schwarz2007bitmask, HQASSM], which use multi-resolution sampling.

A class of algorithms generates soft shadows by image filtering that can be expressed as a convolution [soler1998fast]. The variance [donnelly2006variance] and convolution [annen2007convolution] shadow maps are methods that sample the shadow in linear time. Most of the methods that use precomputed soft shadows are unable to handle dynamic scenes, a drawback addressed by the method of [Annen08]. Our method uses an arbitrary image light map instead of a single light source and is also suitable for dynamically changing scenes with varying light conditions that can be modeled interactively.

Global illumination algorithms provide soft shadows implicitly. Close to our method for fast soft shadow calculation during training are the point-sampling approaches used in importance sampling [Agarwal03, Ostromoukhov04]. We trade off shadow accuracy for efficient model training.

Image Relighting and Synthesis: Our method belongs to the category of deep generative models [goodfellow2014generative, kingma2013auto] which can perform image synthesis and manipulation via semantic control [10.1145/3306346.3323023, brock2018large, odena2017conditional] or user guidance such as sketches and painting [isola2017image, park2019semantic].

Among them, deep image harmonization and relighting methods [shu2017portrait, sun2019single, tsai2017deep] learn to adapt the subject's appearance to match the target lighting space. This line of work focuses mainly on the harmonization of the subject's appearance such as color, texture, or lighting style [sun2019single, sunkavalli2010multi, tao2010error]. Soft shadow synthesis for 2D image compositing has been understudied.

Shadow generation and harmonization can be achieved by estimating the environment lighting and the scene geometry from multiple views [10.1145/3306346.3323013]. Given a single image, Hold-Geoffroy et al. [hold2017deep] and Gardner et al. [gardner2017learning] proposed methods to estimate lighting maps for 3D object compositing. However, neither the multi-view information nor the object 3D models are available in general 2D image compositing. Potentially, 3D reconstruction methods from a single image [fan2017point, hassner2006example, saito2019pifu] can close this gap, but they typically require a complex model architecture design for 3D representation and may not be suitable for time-critical applications such as interactive image editing.

Our work aims to provide a highly controllable method for realistic soft shadow generation, which is an essential element for a convincing composite photo. Our method is trained to implicitly infer object 3D information from a 2D cutout mask regardless of the object's appearance and thus can be applied in general image compositing tasks for different image types.

3 Overview

Figure 3: System Overview: During the training phase we train the SSN on a wide variety of 3D objects under different lighting conditions. Each 3D object is viewed from multiple common views, and its 2D mask and hard shadows are computed based on a sampling grid. Hard shadows are processed to become a set of shadow bases for efficient soft shadow computation during training. During the inference step, the user inputs a 2D mask (for example a cutout from an existing image) and an image lightmap (either interactively or from a predefined set). The SSN then estimates a soft shadow.

The Soft Shadow Network (SSN) is designed to generate fast visually plausible soft shadows given 2D binary masks of 3D objects. The targeted application is image compositing, and the overall pipeline of our method is shown in Fig. 3.

The system works in two phases: the first phase trains an encoder/decoder deep neural network to generate soft shadows given 2D binary masks generated from 3D objects and complex image-based light maps. The second phase is the inference that generates soft shadows for an input 2D binary mask, obtained, for example, as a cutout from an input image. The soft shadow is generated from a user-defined or existing image-based light represented as a 2D image.

The training phase (Fig. 3 left) takes as an input a set of 3D objects: we used 186 objects (66 humans and 120 general objects such as airplanes, beds, benches, bottles, and cars). Each object is viewed from 15 iconic angles, and the generated 2D binary mask is used for training (see Sect. 4.1).

We need to generate soft shadow data for each 3D object. Although we could use a physics-based renderer to generate images of soft shadows, it would be time-consuming, and it would require an extremely large number of soft shadow samples to cover all possible soft shadow combinations.

We propose a dynamic soft shadow generation method (Sect. 4.3) that only needs to render the "cheap" hard shadows on the GPU once before training (on a modern GPU, one hard shadow can be rendered in several milliseconds). The soft shadow is approximated on the fly from the shadow bases and the environment light maps (ELMs) randomly generated during training.

To cover a large space of possible lighting conditions, we use Environment Light Maps (ELMs) for lighting. The ELMs are generated procedurally as a combination of 2D Gaussians (a Gaussian mixture) with varying position, kernel size, and intensity. During training, we randomly sample up to 50 2D Gaussians with different kernel sizes and intensities and add them up to form a training environment lightmap sample. For a converged model, we sampled millions of different ELMs, and therefore rendered millions of different soft shadows on the fly.

The 2D masks and the soft shadows are then used as input to train the SSN as described in Sect. 5. We use a variant of the U-Net [ronneberger2015u] encoder/decoder network with additional data injected into the bottleneck part of the network.

The inference phase (Fig. 3 right) is aimed at fast soft shadow generation for image compositing. In a typical scenario, the user selects a part of an image and wants to paste it into another image with soft shadows. The lighting can either be provided or painted with a simple GUI. The resulting ELM and the extracted silhouette are then passed to the SSN, which predicts the corresponding soft shadow.

4 Shadow Generation

The input to this step is a set of 3D objects, and the output is a set of pairs: binary masks of the 3D objects and approximate yet high-quality soft shadows of the objects cast on a planar surface (floor) under a large number of environment light maps (ELMs).

4.1 3D Objects and Masks

The target of our method is image compositing, and among the most common objects used are cutouts of humans. Therefore, 3D models of humans are prevalent in our dataset.

Let us denote the 3D geometries by $O_i$, where $i = 1, \dots, 186$. In our dataset, we used 66 human characters and 120 general objects (such as airplanes, bathtubs, beds, benches, bottles, and cars). Note that shadow generation requires only the 3D geometry, without textures or any additional information. Each $O_i$ is normalized and its min-max box is placed in a canonical position, with the center of the min-max box at the origin of the coordinate system and its projection in the middle of the image.

Each $O_i$ is then used to generate fifteen masks denoted by $M_i^v$, where the lower index $i$ denotes the corresponding object and the upper index $v$ identifies the corresponding view. Each object is rotated five times around the vertical axis and is displayed from three common view (pitch) angles. This gives a total of $186 \times 15 = 2{,}790$ unique masks (see Fig. 4 for examples).
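For concreteness, the following sketch enumerates one possible set of iconic views per object. The specific yaw and pitch values are illustrative assumptions; the paper only states that five rotations around the vertical axis are combined with three common view angles.

```python
import itertools

# Hypothetical view sampling: 5 rotations about the vertical axis combined
# with 3 common camera pitch angles gives 15 iconic views per object.
# The specific angle values below are illustrative assumptions.
YAW_ANGLES_DEG = [-60, -30, 0, 30, 60]   # rotations around the up axis
PITCH_ANGLES_DEG = [5, 15, 30]           # common "iconic" camera pitches

def iconic_views():
    """Yield (yaw, pitch) pairs describing the 15 views used per 3D object."""
    for yaw, pitch in itertools.product(YAW_ANGLES_DEG, PITCH_ANGLES_DEG):
        yield yaw, pitch

if __name__ == "__main__":
    views = list(iconic_views())
    print(len(views), "views per object")       # 15
    print(186 * len(views), "masks in total")   # 186 objects x 15 views = 2790
```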

(a) A 3D object model (b) Sample bases for a specific camera rotation and pitch angle
Figure 4: Base example: for each view of each 3D object, we generate 5x16 shadow bases (only 3x16 are shown here). During training, we reduce the shadow sampling problem to an environment map pattern generation problem, since these shadow bases can always be weighted to approximate any kind of soft shadow.

4.2 Environment Light Maps

The second input to the SSN training phase is the soft shadows (see Fig. 3) that are generated from the 3D geometry of each $O_i$ by using image-based lighting (IBL) with HDR image maps.

We use a single light source represented as a 2D Gaussian function

$L_k(x, y) = I_k \, G_{\sigma_k}(x - x_k, y - y_k)$,  (1)

where $G_{\sigma_k}$ is a 2D Gaussian function whose radius $\sigma_k$ controls the sharpness of the light and $I_k$ is the maximum intensity (scaling factor). Each IBL can be described as a sum of individual lights

$E(x, y) = \sum_{k=1}^{n} L_k(x, y)$,  (2)

where $(x_k, y_k)$ is the actual position of the $k$-th light source. We assume the IBL to be normalized (Tab. 1 summarizes the parameters).

Our goal is to provide a wide variety of environment light maps that mimic complex natural or human-made lighting configurations so that the SSN can generalize well to arbitrary environment light maps. We generate the environment light maps by randomly sampling each variable in Eqn. (1). Please note that environment light maps composed of even a small number of lights produce a very wide range of soft shadows.
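A minimal sketch of how such a random ELM could be assembled as a Gaussian mixture is shown below. The map resolution, parameter ranges, and normalization choice are our assumptions for illustration, not the exact values used by the authors.

```python
import numpy as np

def random_elm(height=16, width=32, max_lights=50, rng=None):
    """Sample a random environment light map as a sum of 2D Gaussian lights.

    The resolution and parameter ranges are illustrative assumptions; the paper
    randomly samples the number of lights, their positions, intensities, and
    sharpness, and normalizes the resulting IBL.
    """
    rng = rng or np.random.default_rng()
    ys, xs = np.mgrid[0:height, 0:width]
    elm = np.zeros((height, width), dtype=np.float32)
    n_lights = rng.integers(1, max_lights + 1)
    for _ in range(n_lights):
        cy, cx = rng.uniform(0, height), rng.uniform(0, width)   # light position
        sigma = rng.uniform(0.5, 6.0)                            # light size / sharpness
        intensity = rng.uniform(0.1, 1.0)                        # peak intensity
        elm += intensity * np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))
    return elm / max(float(elm.max()), 1e-8)                     # normalize the IBL
```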

The ranges of the environment light map parameters are shown in Table 1. The overall number of possible lights is vast. However, large Gaussian kernel sizes lead to environment light maps with very small variability because the maps become covered by large Gaussians. To account for the variability, we randomly sample the environment light map from Eqn. (2) on the fly during SSN training and we monitor the actual loss to decide when to stop (Sect. 5.2). In our experiments, we used millions of environment light maps, and Fig. 5 shows samples from the generated data together with a comparison to physically-based rendered soft shadows.

Table 1: Ranges of the environment light map parameters: the number of lights, light location, light intensity, and light sharpness. We use random samples from this space during SSN training.
Figure 5: Comparison to ground truth: the ELM (left column) is used to generate soft shadows with a physics-based renderer (Mitsuba) (middle). The right column shows soft shadows generated by our method.

4.3 Soft Shadows

Although we could use a physics-based renderer to generate physically-correct soft shadows, the rendering time for the vast number of images required would be prohibitive. Instead, we use a simple method of summing hard shadows generated by a GPU-based renderer, leveraging the linearity of light. Our method can generate much more diverse soft shadows than naively sampling a few soft shadows from selected directions.

We prepare our shadow bases once during the dataset generation stage and assume that each pixel in the environment light map casts a hard shadow. For each non-overlapping patch in the environment light map, we sample the hard shadows cast by the pixels included in the patch and sum this group of shadows into a soft shadow base, which is used during the training stage.

Each model silhouette mask has a set of soft shadow bases. During training, the soft shadow is rendered simply by weighting the soft shadow bases with the environment light map, as sketched below.
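The sketch below illustrates the two steps under simplifying assumptions: hard shadows are grouped into per-patch bases once, and a soft shadow is then obtained as an ELM-weighted sum of those bases. The array layout, patch size, and function names are hypothetical; they are not taken from the authors' implementation.

```python
import numpy as np

def build_shadow_bases(hard_shadows, patch=4):
    """Group per-pixel hard shadows into per-patch soft shadow bases.

    hard_shadows: array of shape (H_elm, W_elm, H_img, W_img), one hard shadow
    per ELM pixel (an illustrative layout; the paper precomputes these on the GPU,
    and the ELM dimensions are assumed divisible by the patch size).
    Returns bases of shape (H_elm // patch, W_elm // patch, H_img, W_img).
    """
    he, we, h, w = hard_shadows.shape
    bases = hard_shadows.reshape(he // patch, patch, we // patch, patch, h, w)
    return bases.sum(axis=(1, 3))

def soft_shadow_from_bases(bases, elm_patches):
    """Approximate a soft shadow by weighting the bases with the patch-resolution ELM."""
    return np.tensordot(elm_patches, bases, axes=([0, 1], [0, 1]))
```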

5 Learning to Render Soft Shadows

We want the model to learn a function $f$ that takes the model cutout mask $M$ and the environment light map $E$ as input and predicts the soft shadow $S$ cast on the ground plane:

$S = f(M, E)$.  (3)

During training, the input environment light map described in Sect. 4.2 is generated randomly to ensure the generalization ability of our model. Another important observation is that by inverting the training ground truth shadow value $S$:

$\tilde{S} = 1 - S$,  (4)

the model converges much faster. This transformation neither adds nor removes any shadow information, yet the convergence speed and training performance improve significantly. In the original shadow domain, the network has to predict the intensity of every pixel, which varies with the overall brightness of the lighting. In the inverse domain, most of the map is exactly zero, so the model only needs to learn the values in the small shadowed region.
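A minimal sketch of the inverse shadow representation, assuming the shadow map is normalized to [0, 1] with 1 meaning fully lit; the function names are ours.

```python
import torch

def to_inverse_shadow(shadow_map: torch.Tensor) -> torch.Tensor:
    """Convert a conventional shadow map (shadow = dark, lit = 1) into the
    inverse representation (shadow = bright, everything else zero), Eq. (4)."""
    return 1.0 - shadow_map

def from_inverse_shadow(inv_shadow_map: torch.Tensor) -> torch.Tensor:
    """Map a predicted inverse shadow back to a conventional shadow map that
    can be multiplied onto the background during compositing."""
    return 1.0 - inv_shadow_map.clamp(0.0, 1.0)
```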

5.1 Network Architecture

Figure 6: SSN Architecture: the source mask goes through several convolution layers. Spatial dimensions are compressed while the number of channels increases. At the bottleneck, the light source image is compressed and shared globally. The tiled light feature block is concatenated with the last feature block from the encoder. By gradually upsampling the features bilinearly and passing them through convolution layers, the spatial dimension grows back. Skip links pass spatial information to the decoder at each stage. Finally, a loss is computed to minimize the difference between the soft shadow generated by the shadow bases and the output of the decoder.

The SSN architecture is shown in Fig. 6, and the overall design is inspired by the U-Net [ronneberger2015u], except that we inject light source information into the bottleneck. Both the encoder and decoder are fully convolutional. Our network takes a mask as input and processes it with a series of 3x3 convolution layers. Each encoding stage follows a convolution, group normalization, ReLU pattern. The convolution layers gradually compress the spatial dimension of the features while keeping the number of channels large enough to retain useful information. At the bottleneck, we first compress the environment lightmap and share it globally across spatial locations. The decoder follows a bilinear upsampling, convolution, group normalization, ReLU pattern. We add skip links from the corresponding encoder activations to the decoder at each stage. The final activations from the decoder pass through a final convolution layer that shrinks the channel dimension to produce the inverse shadow map.

5.2 Loss Function and Training

The loss we use when optimizing the inverse shadow map is a per-pixel distance between our predicted inverse shadow map $\hat{\tilde{S}}$ and the training ground truth $\tilde{S}$. Note that the training ground truth is approximated by our shadow bases instead of being rendered by a physics-based rendering engine:

$\mathcal{L} = \big\| \hat{\tilde{S}} - \tilde{S} \big\|$.  (5)

During training, given a mask as input, we randomly sample the environment light map from Eqn. (2) using our Gaussian light representation and render the corresponding soft shadow on the fly. This training routine helps our model generalize well to diverse lighting conditions, and the inverse shadow representation helps the network converge faster.
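A sketch of one such training step is shown below. It assumes the sketch network above (or any model with a (mask, elm) signature), per-patch shadow bases as in Sect. 4.3, and an L1 per-pixel loss; the choice of L1, the normalization of the basis-weighted shadow to [0, 1], and the helper names are assumptions.

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, mask, bases, sample_elm_fn):
    """One illustrative training step: sample a random ELM, synthesize the
    ground-truth soft shadow from the precomputed bases, and regress the
    inverse shadow map with a per-pixel loss (L1 is our assumption)."""
    elm = sample_elm_fn()                                          # random Gaussian-mixture ELM
    # Weight the per-patch bases by the ELM; assumed normalized so that
    # the result is a shadow map in [0, 1] with 1 meaning unshadowed.
    gt_soft = torch.tensordot(elm, bases, dims=([0, 1], [0, 1]))
    gt_inverse = 1.0 - gt_soft                                     # Eq. (4): inverse shadow domain
    pred_inverse = model(mask[None, None], elm[None])              # add batch/channel dims
    loss = F.l1_loss(pred_inverse.squeeze(), gt_inverse)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```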

During the inference stage, the input mask goes through the encoder. At the bottleneck, the input environment lightmap needs to be converted into the same format used during training. Since our shadow is generated on the ground plane, we crop the upper half of the environment light map, as the bottom half is not used. We then resize the map to the bottleneck resolution and insert it into the bottleneck. The environment light map features, concatenated with the encoder feature blocks, go through the decoder, and our network outputs a soft shadow prediction.
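The lightmap preprocessing at inference time could look like the sketch below. The target resolution is a placeholder (the paper's exact value is not reproduced here), and bilinear resizing is our assumption.

```python
import torch
import torch.nn.functional as F

def prepare_lightmap(elm: torch.Tensor, target_hw=(8, 16)) -> torch.Tensor:
    """Crop the upper half of the environment light map (the lower half does not
    contribute to shadows on the ground plane) and resize it to the bottleneck
    resolution. The target resolution here is an assumption."""
    upper = elm[: elm.shape[0] // 2, :]        # keep the upper hemisphere
    upper = upper[None, None]                  # add batch/channel dims
    small = F.interpolate(upper, size=target_hw, mode="bilinear", align_corners=False)
    return small[0, 0]
```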

6 Results and Evaluation

Training: We implement our deep neural network model using PyTorch [paszke2017automatic]. All SSN results have been generated on a desktop computer equipped with an Intel Xeon W-2145 CPU running at 3.70 GHz, and we used three NVIDIA GeForce GTX TITAN X GPUs for training. The total batch size was 72 and we used the Adam optimizer [kingma2013auto], decreasing the learning rate every 30 epochs. For each epoch, we run over the whole dataset 10 times since we need to sample enough environment maps for each mask input. Our model converged after 80 epochs and the overall training time was about 40 hours. The average time for soft shadow inference was about 20 ms.
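For reference, the described optimization schedule could be configured as below. The base learning rate and decay factor are placeholders, since the paper's exact values are not reproduced here, and the model is stood in by a trivial module.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(1, 1, 3, padding=1)  # placeholder for the SSN network
# Adam with a step decay every 30 epochs; lr and gamma are placeholder values.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(80):                # the model converged after about 80 epochs
    # ... run over the whole dataset 10 times, sampling fresh ELMs for each mask ...
    scheduler.step()
```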

We show various examples of static scenes throughout this paper, and all of them have been generated with the GUI from Fig. 9.

An example in Fig. 1 shows that our method generates smooth shadow transitions for dynamically changing lighting conditions.

Fig. 2 shows an example of matching the lighting of an existing input scene by inserting multiple 2D masks. It is important to note that once the lighting has been created for one cutout, it can be reused for others; adding multiple cutouts to an image is therefore simple.

Several examples generated by the users are shown in Fig. 10 and discussed below.

Quantitative Shadow Evaluation: Five human models with diverse poses, shapes, and clothing were sampled from the Daz3D website, and we rendered 15 masks for each model. Using the lightmap generation method described above, we randomly sampled 260 different lightmaps for testing; each test case was generated by randomly selecting a silhouette and a lightmap.

Metrics: We used three metrics to evaluate the testing performance of SSN: 1) RMSE, 2) RMSE-s [10.1145/3306346.3323008], and 3) zero-normalized cross-correlation (ZNCC). Since the exposure of the rendered image may vary between rendering implementations, we use the scale-invariant metrics RMSE-s and ZNCC in addition to RMSE. Note that all measurements are computed in the inverse shadow domain. Table 2 shows the results; the small errors indicate that the SSN-generated shadows closely match the ground truth.

Method | RMSE       | RMSE-s     | ZNCC
SSN    | 0.03100709 | 0.02824657 | 0.86131977
Table 2: Quantitative shadow analysis.
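The three metrics can be computed as sketched below. The exact RMSE-s formulation (fitting a single best scale before computing RMSE) is our interpretation of the scale-invariant variant, not necessarily the formulation of [10.1145/3306346.3323008].

```python
import numpy as np

def rmse(pred, gt):
    """Root mean squared error between two (inverse) shadow maps."""
    return float(np.sqrt(np.mean((pred - gt) ** 2)))

def rmse_s(pred, gt):
    """Scale-invariant RMSE: fit the single scale that best aligns the prediction
    to the ground truth before computing RMSE (our interpretation of RMSE-s)."""
    scale = np.dot(pred.ravel(), gt.ravel()) / max(np.dot(pred.ravel(), pred.ravel()), 1e-12)
    return rmse(scale * pred, gt)

def zncc(pred, gt):
    """Zero-normalized cross-correlation between two shadow maps."""
    p = (pred - pred.mean()) / (pred.std() + 1e-12)
    g = (gt - gt.mean()) / (gt.std() + 1e-12)
    return float(np.mean(p * g))
```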

Qualitative Evaluation: We performed two perceptual user studies. The first measured the perceived level of realism of shadows generated by the SSN; the second tested the ease of use of the shadow generator.

Perceived Realism (user study 1): We generated two sets of images with soft shadows. One set, called MTR, was generated from the 3D objects by rendering them in the Mitsuba renderer, a physics-based renderer, and was considered the ground truth when using enough samples. The second set, called SSN, used binary masks of the same objects from MTR and estimated the shadows with the SSN. Both sets have the same number of images, resulting in 40 pairs. In both cases we used 3D objects that were not present in the training or validation sets of the SSN; the presented objects were unknown to the SSN. The image light maps used were designed to cover a wide variety of shadows, ranging from a single shadow or two shadows to very subtle shapes and intensities. Fig. 7 shows examples of the image pairs used in our study, and the supplementary material includes all of them.

Figure 7: Perceptual evaluation: samples of images from our perceptual evaluation. Output generated from 3D objects rendered by Mitsuba (left) and output generated by SSN from binary masks (right). We attempted to cover a wide variety of shadows.
Figure 8: p-value distributions: per-question t-test p-values from the realism study (user study 1). The questions fall into four p-value ranges (10, 5, 4, and 21 questions, respectively), and for more than half of the questions there is no significant difference from the Mitsuba ground truth.

The perceptual study used a two-alternative forced-choice (2AFC) design. To validate the perceived realism of the rendered images, we showed pairs of images in random order and random position (left-right) to multiple users and asked the participants of an online study which of the two images contains the fake shadow.

The study was answered by 180 participants (57.4% male, 25.5% female, 17.1% did not identify). We discarded all replies that were too short (under 3 minutes) or did not complete all the questions. We also discarded all answers of users who always clicked on the same side. Each image pair was viewed by 93 valid users. In general, the users had difficulty distinguishing SSN-generated shadows from the ground truth: the average accuracy was 59.315% with a standard deviation of 0.105. The per-question t-tests shown in Fig. 8 indicate that more than half of the predictions do not differ significantly from the Mitsuba ground truth.

Ease-of-Use (user study 2): In the second study, human subjects were asked to recreate soft shadows using a simple interactive application. The application (see Fig. 9) shows three windows: the left is the input, the middle is a 2D mask with no shadows, and the right is an interactive window that allows adding, modifying, and deleting lights (see Eqn. (2)). The user interactively modified the IBL and observed the generated soft shadows while attempting to recreate the input image.

Figure 9: GUI for the user study 2: Users were asked to recreate the shadow from the input image by adding and deleting Gaussian lights and varying their parameters.

We recruited 8 subjects (4 males and 4 females) with the following age distribution: 50-59 (1), 40-49 (1), 20-29 (4), 10-19 (2). Three subjects reported experience in computer graphics. We asked the users "how easy was it to use the tool to recreate the soft shadows," and they responded on a scale from 1 (difficult) to 5 (extremely easy). The average response was 4.125 with a standard deviation of 0.95, indicating that our tool can be used for soft shadow image compositing by inexperienced users. An example in Fig. 10 shows the input on the left and some of the user-generated results on the right. We also collected free-form responses, some of which were: "it would be good to know how many lights are needed", "it is intuitive, but it takes some time to tweak the sliders", "fast feedback, but changing one shadow affects the others", and "axis labels would help". The supplementary material includes all user-generated results. The results of our second user study indicate that our approach is suitable for fast and intuitive creation of soft shadows from 2D masks.

Figure 10: Results of the second user study: The input image (left) and the user-generated outputs (right) with the corresponding ELMs (supplementary material includes all results).

7 Conclusion and Future Work

We introduced the Soft Shadow Network (SSN) to approximate soft shadows given 2D masks, with the objective of image compositing. Naively generating soft shadow data is expensive, and a model trained on such a dataset cannot generalize well due to the limited diversity of soft shadows. The key ingredients of our algorithm are a high-quality soft shadow approximation combined with fast ELM sampling, which allows for fast training and better generalization. We also contribute the inverse shadow domain, which significantly improves the convergence rate and overall performance. We show that our method can be used with a wide variety of lighting. Our first user study confirms that the generated shadows approximate the ground truth, and the second user study shows that users can quickly and intuitively generate soft shadows even without any computer graphics experience.

However, our method also has several limitations. First, the shadow is generated only on a planar receiver. While this is usually sufficient for image compositing, it would likely not produce suitable shadows in more complex scenes. Second, we focused on common objects used in image compositing (humans and other common objects such as cars and bottles), and our method would not work for object types that are absent from the training set and significantly dissimilar from it.

There are several possible avenues for future work. We focus only on shadow generation, and the placed 2D cutout usually does not reflect the scene lighting. It would be interesting to combine our shadow generator with relighting methods for image compositing and to expand our method to include self-shadowing. Another future direction is exploring the generalization limits of SSN across object categories.

References