“When you light a candle, you also cast a shadow,” wrote Ursula K. Le Guin in A Wizard of Earthsea.
When some objects block the light, shadows are formed. And when we see a shadow, we also know that there must be some objects that create or cast the shadow. Shadows are light-deficient regions in a scene, due to light occlusion, but they carry the shape of the light-occluding objects, as they are projections of these objects onto the physical world. In this work, we are interested in a new problem, i.e., finding shadows together with their associated objects.
Our goal in this work is to leverage the remarkable computation capability of deep neural networks to address the new problem of associating shadows and objects, which we call instance shadow detection. That is, we want to detect the shadow instances in images, together with the associated object that casts each shadow.
Being able to find shadow-object associations has the potential to benefit various applications. For example, for privacy protection, when we remove humans and cars from photos, we can remove the objects and their associated shadows together. In a recent work on removing objects from images for privacy protection, the shadows are simply left behind. Also, when we edit photos, say by scaling or translating objects, we can naturally manipulate objects and their associated shadows simultaneously. Further, shadow-object associations give hints about the light direction in the scene, supporting applications such as relighting.
To approach the problem of instance shadow detection, first, we prepare a new dataset called SOBA, named after Shadow OBject Association. SOBA contains 3,623 pairs of shadow-object associations over 1,000 photos, each with three masks (see Figures 1 (c)-(e)): (i) shadow instance mask, where we label each shadow instance with a unique color; (ii) shadow-object association mask, where we label each shadow-object pair with a corresponding unique color; and (iii) object instance mask, which is (ii) minus (i). In general, there are two types of shadows: (i) cast shadows, formed on background objects, usually ground, as the projections of the light-occluding objects, and (ii) self shadows, formed on the side of the light-occluding objects opposite to the direct light (see Figure 1(a)). In this work, we consider mainly cast shadows, which are object projections, since self shadows are already on the associated objects. See also Figure 2 for example images in our SOBA dataset.
Next, we design an end-to-end framework called LISA, named after Light-guided Instance Shadow-object Association, to find the individual shadow and object instances, the shadow-object associations, and the light direction in each shadow-object association. From these predictions, we then use a simple yet effective method to pair the predicted shadow and object instances and to match them with the predicted shadow-object associations.
Third, to quantitatively measure and evaluate the performance of the instance shadow detection results, we formulate a new evaluation metric called SOAP, named after Shadow-Object Average Precision. Finally, we perform a series of experiments to show the effectiveness of our method and demonstrate its applicability to light direction estimation and photo editing.
2 Related Work
Early methods made use of physical illumination and color models, and analyzed the spectral and geometrical properties of shadows. Later, machine learning methods were explored to detect shadows by modeling shadows based on handcrafted features, e.g., texture [52, 42, 12, 44], color [23, 42, 12, 44], T-junction, and edge [23, 52, 19], then by using various classifiers, e.g., [23, 52] and SVM [12, 19, 42, 44], to differentiate shadows and non-shadows. However, physical models and handcrafted features have limited feature representation capability, so they are not robust in general situations.
Later, convolutional neural networks (CNNs) were introduced to detect shadows. Khan et al. and Shen et al. used CNNs to learn high-level features and optimization methods to detect shadows. Vicente et al. trained a fully-connected network to predict a shadow probability map, then locally refined the shadows via a patch-CNN.
More recently, end-to-end networks were designed to detect shadows. Nguyen et al. built a conditional generative adversarial network with a sensitivity parameter to stabilize the network training. Hu et al. [16, 18] and Zhu et al. explored the spatial context via the direction-aware spatial context module and recurrent attention residual module, respectively. Wang et al. and Ding et al. jointly detected and removed shadows by using multiple networks or a multi-branch network. To improve the detection performance, Le et al. proposed to generate more training samples, while Zheng et al. combined the strengths of multiple methods to explicitly revise the results. This work explores a new problem in detecting shadows, namely instance shadow detection. Unlike general shadow detection, which finds only a single mask for all shadows in an image, we design a deep architecture to find not just the individual shadows but also their associated objects.
Besides, this work relates to the emerging problem of instance segmentation, where the goal is to label pixels of individual foreground objects in the input image. Overall, there are two major approaches to the problem: proposal-based and proposal-free approaches.
The proposal-based approach generally uses object detectors to propose candidates and classifies the candidates to find object instances, e.g., MNC, DeepMask, InstanceFCN, and SharpMask. Later, FCIS jointly detects and segments the object instances using a fully convolutional network. BAIS models the object shapes and segments the object instances in a boundary-aware manner. MaskLab uses a network with three outputs for box detection, semantic segmentation, and direction prediction, while methods based on Mask R-CNN, e.g., [30, 3, 33], achieve strong performance by simultaneously detecting the object instances and predicting the segmentation masks.
3 SOBA (Shadow OBject Association) Dataset
We collected 1,000 images from the ADE20K [50, 51], SBU [15, 43, 45], ISTD, and Microsoft COCO datasets, and also from the Internet using keyword search with shadow plus animal, people, car, athletic meeting, zoo, street, etc. Then, we coarsely label the images to produce the shadow instance masks and shadow-object association masks, and refine them using Apple Pencil; see Figures 1 (c) & (e). Next, we obtain the object instance masks (see Figure 1 (d)) by subtracting each shadow instance mask from the associated shadow-object association mask. Overall, there are 3,623 pairs of shadow-object instances in the dataset images, and we randomly split the images into a training set (840 images, 2,999 pairs) and a testing set (160 images, 624 pairs); see Figure 2 for some examples.
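The mask subtraction described above can be sketched as follows. This is only a minimal illustration, not the annotation tooling used for SOBA: it assumes binary masks stored as nested lists of 0/1 values, and the function name `object_mask_from_association` is hypothetical.

```python
def object_mask_from_association(association_mask, shadow_mask):
    """Return the object mask: pixels in the association mask but not in the shadow mask."""
    if len(association_mask) != len(shadow_mask):
        raise ValueError("masks must have the same height")
    object_mask = []
    for assoc_row, shadow_row in zip(association_mask, shadow_mask):
        if len(assoc_row) != len(shadow_row):
            raise ValueError("masks must have the same width")
        # Keep a pixel only if it belongs to the pair and is not shadow.
        object_mask.append([int(a and not s) for a, s in zip(assoc_row, shadow_row)])
    return object_mask


# Toy 3x4 example (assumed data): the pair covers the left three columns,
# and the shadow occupies the bottom row of that region.
association = [[1, 1, 1, 0],
               [1, 1, 1, 0],
               [1, 1, 1, 0]]
shadow = [[0, 0, 0, 0],
          [0, 0, 0, 0],
          [1, 1, 1, 0]]
obj = object_mask_from_association(association, shadow)
```

In this toy case, the resulting object mask keeps the top two rows of the pair and drops the shadow row.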
Figure 3 shows some statistical properties of the SOBA dataset. From the histogram shown on the left, we can see that SOBA has a diverse number of shadow-object pairs per image, with around 3.62 pairs per image on average. Also, it contains many challenging cases: 7% of the images have nine or more shadow-object pairs per image. On the other hand, the histogram shown on the right reveals the proportion of image space (horizontal axis) occupied, respectively, by the shadow and object instances in the dataset images. From the plot, we can see that most shadows and objects occupy relatively small areas of the whole image, which makes them challenging to detect.
4.1 Overall Network Architecture of LISA
Compared with shadow detection, the challenges of instance shadow detection are that we have to predict shadow instances rather than just a single mask for all the shadows in the input image. Also, we have to find object instances in the input image and pair them up with the shadow instances. To meet these challenges, we design an end-to-end framework called LISA, named after Light-guided Instance Shadow-object Association. Overall, as shown in Figure 5, LISA takes a single image as input and predicts
(i) a box of each shadow/object instance,
(ii) a mask of each shadow/object instance,
(iii) a box of each shadow-object association (pair), and
(iv) the light direction for each shadow-object association.
Figure 4 shows a set of example outputs. In particular, LISA predicts the light direction and uses it as guidance to find shadow-object associations, since the light direction is usually consistent with the shadow-object associations.
Figure 5 shows the architecture of LISA, which begins by using a convolutional neural network (ConvNet) to extract semantic features from the input image. Here, we use the feature pyramid network as the backbone ConvNet. Then, we design a two-branch architecture: the top branch predicts the box and mask for each shadow/object instance, and the bottom branch predicts the box for each shadow-object association and the associated light direction.
In detail, the top branch starts with the instance region proposal network (RPN) to find region proposals, which are regions with a high probability of containing the shadow/object instances. Then, we adopt RoIAlign to extract features for each proposal and leverage the box and mask heads to predict the boxes and masks for the shadow and object instances by minimizing the loss between the prediction results and the supervision signals from the training data. Please refer to Mask R-CNN for details. On the other hand, the bottom branch adopts an association RPN to generate region proposals for the shadow-object associations, then uses RoIAlign to extract features for each proposal and adopts the box head to produce the bounding boxes of the shadow-object associations. After obtaining the associations, we can then efficiently obtain the masks of the shadow-object associations by combining the shadow and object masks predicted from the top branch. Note that the parameters in the box head are learned by minimizing the loss between the boxes of the predicted shadow-object associations and the ground-truth associations.
Besides, we design a light direction head in parallel with the box head of the bottom branch to predict an angle that represents the estimated light direction from shadow to object in each association pair. Note that we compute the ground-truth angle of the light direction by

$$\theta_g = \operatorname{atan2}(y_o - y_s,\; x_o - x_s),$$

where $(x_s, y_s)$ and $(x_o, y_o)$ are the 2D coordinates of the shadow and object instance centroids in the ground-truth image, and $\operatorname{atan2}$ is a variation of the $\arctan$ function to avoid anomaly and output a full-range polar angle in $(-180^\circ, 180^\circ]$. By jointly optimizing the predictions of the light direction and shadow-object association in LISA, we can improve the overall performance of instance shadow detection; see the experimental results in Section 5.
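As a hedged illustration of the ground-truth angle described above, the sketch below computes the centroids of two binary masks and the angle of the vector from the shadow centroid to the object centroid. The nested-list mask representation, the output in degrees, and the function names `mask_centroid` and `light_angle_degrees` are assumptions made for this example, not details taken from the LISA implementation.

```python
import math


def mask_centroid(mask):
    """Return the (x, y) centroid of the nonzero pixels of a nested-list binary mask."""
    xs, ys = [], []
    for y, row in enumerate(mask):
        for x, value in enumerate(row):
            if value:
                xs.append(x)
                ys.append(y)
    if not xs:
        raise ValueError("mask is empty")
    return sum(xs) / len(xs), sum(ys) / len(ys)


def light_angle_degrees(shadow_centroid, object_centroid):
    """Angle (in degrees) of the vector pointing from the shadow to the object."""
    x_s, y_s = shadow_centroid
    x_o, y_o = object_centroid
    # atan2 handles x_o == x_s and returns a full-range angle in (-180, 180].
    return math.degrees(math.atan2(y_o - y_s, x_o - x_s))


# Assumed toy masks: the shadow sits at pixel (0, 0), the object at pixel (2, 2).
shadow_mask = [[1, 0, 0], [0, 0, 0], [0, 0, 0]]
object_mask = [[0, 0, 0], [0, 0, 0], [0, 0, 1]]
angle = light_angle_degrees(mask_centroid(shadow_mask), mask_centroid(object_mask))
```

Note that with image coordinates, where the y axis points downward, a positive angle corresponds to a clockwise rotation on screen; the sign convention would need to match whatever is used when training the light direction head.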
4.2 Pairing up Shadow and Object Instances
The raw predictions of LISA include shadow instances, object instances, shadow-object associations, and a light direction predicted per association. Note that the predicted shadow and object instances are not paired, whereas the predicted shadow-object associations are not separated into shadows and objects. Also, some of these predictions may not be correct, and they may also contradict one another. Hence, we have to analyze these predictions, pair up the predicted shadow and object instances, and match them with the predicted shadow-object associations, so that we can find and output the final paired shadow and object instances.
Figure 6 illustrates the procedure, where we first find candidate shadow-object associations (see Figure 6 (b)) by (i) computing the shortest distance between the bounding boxes of every pair of shadow and object instances, and (ii) regarding a pair as a candidate association if the computed distance is smaller than a threshold, which is empirically set as the height of the associated shadow instance. After that, we construct the bounding box $B_i$ for the $i$-th candidate pair (see Figure 6 (c)) by merging the bounding boxes of the associated shadow and object instances. Given $(x_s^1, y_s^1)$ and $(x_s^2, y_s^2)$ as the lower-left and upper-right corners of the shadow instance bounding box, and $(x_o^1, y_o^1)$ and $(x_o^2, y_o^2)$ as the lower-left and upper-right corners of the object instance bounding box, the corners of the merged bounding box $B_i$ are given by

$$\big(\min(x_s^1, x_o^1),\ \min(y_s^1, y_o^1)\big) \quad \text{and} \quad \big(\max(x_s^2, x_o^2),\ \max(y_s^2, y_o^2)\big).$$
In the end, as illustrated in Figure 6 (d), we compute the Intersection over Union (IoU) between the merged boxes and the shadow-object association boxes predicted independently in LISA (see Figure 5), and select those with the highest IoUs as the final shadow-object associations. Then, for each of these associations, we can get back the associated shadow instance and object instance, and pair them as the final outputs; see Figure 6 (e).
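The pairing procedure can be sketched in a few lines of code. The sketch below is an illustration under stated assumptions, not the authors' implementation: boxes are `(x1, y1, x2, y2)` tuples with `x1 <= x2` and `y1 <= y2`, and "selecting those with the highest IoUs" is interpreted as picking, for each predicted association box, the merged candidate box that overlaps it most. All function names are hypothetical.

```python
import math


def box_distance(a, b):
    """Shortest distance between two axis-aligned boxes (0 if they touch or overlap)."""
    dx = max(a[0] - b[2], b[0] - a[2], 0.0)
    dy = max(a[1] - b[3], b[1] - a[3], 0.0)
    return math.hypot(dx, dy)


def merge_boxes(a, b):
    """Smallest box that encloses both input boxes."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def box_iou(a, b):
    """Intersection over Union of two boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def pair_instances(shadow_boxes, object_boxes, association_boxes):
    """Return (shadow_index, object_index, iou) for each matched association box."""
    candidates = []
    for si, s in enumerate(shadow_boxes):
        threshold = s[3] - s[1]  # height of the shadow box, as in the text
        for oi, o in enumerate(object_boxes):
            if box_distance(s, o) < threshold:
                candidates.append((si, oi, merge_boxes(s, o)))
    pairs = []
    for assoc in association_boxes:
        # Assumed selection rule: best-overlapping candidate per association box.
        best = max(candidates, key=lambda c: box_iou(c[2], assoc), default=None)
        if best is not None and box_iou(best[2], assoc) > 0:
            pairs.append((best[0], best[1], box_iou(best[2], assoc)))
    return pairs


# Assumed toy boxes: one shadow, one nearby object, one distant object.
shadows = [(0, 0, 4, 2)]
objects = [(3, 2, 5, 8), (20, 0, 22, 6)]
associations = [(0, 0, 5, 8)]
pairs = pair_instances(shadows, objects, associations)
```

Here only the nearby object passes the distance test, and its merged box coincides with the predicted association box, so the shadow is paired with object 0. A fuller version would also prevent one candidate from being matched to several association boxes, which this sketch does not attempt.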
4.3 Training Strategies
We optimize LISA by jointly minimizing the instance box loss, instance mask loss, association box loss, light direction loss (see Figure 5), and the losses of the instance RPN and association RPN. The loss functions of boxes, masks, and RPNs follow the formulations in Mask R-CNN, whereas the light direction loss is formulated as a smooth $L_1$ loss, as follows:

$$L_{\text{light}}(\theta_p, \theta_g) = \begin{cases} 0.5\,(\theta_p - \theta_g)^2, & \text{if } |\theta_p - \theta_g| < 1, \\ |\theta_p - \theta_g| - 0.5, & \text{otherwise}, \end{cases}$$

where $\theta_p$ and $\theta_g$ are the predicted and ground-truth angles of the light direction, respectively.
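For concreteness, the light direction loss can be written as a short function. This sketch assumes the standard smooth L1 definition from Fast R-CNN (quadratic below an absolute difference of 1, linear above it) and applies it to the raw angle difference; it does not wrap angles around the ±180° boundary, since the text does not describe such handling. The function name is hypothetical.

```python
def smooth_l1_light_loss(predicted_angle, ground_truth_angle):
    """Smooth L1 loss between a predicted and a ground-truth light direction angle."""
    diff = abs(predicted_angle - ground_truth_angle)
    if diff < 1.0:
        # Quadratic region: small errors are penalized gently.
        return 0.5 * diff * diff
    # Linear region: large errors grow linearly, limiting the effect of outliers.
    return diff - 0.5


small_error = smooth_l1_light_loss(10.0, 10.5)   # inside the quadratic region
large_error = smooth_l1_light_loss(30.0, 10.0)   # inside the linear region
```

The two regions meet continuously at a difference of 1, where both branches evaluate to 0.5.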
We use weights trained on ImageNet to initialize the parameters of the backbone network, and train our framework on two GeForce GTX 1080 Ti GPUs (four images per GPU) for 40k training iterations. We set the base learning rate to 1e-4, adopt a warm-up strategy to linearly increase the learning rate to 1e-3 during the first 1,000 iterations, keep the learning rate at 1e-3 afterwards, and stop the training after 40k iterations. We re-scale the input images such that the longer side is less than 1,333 and the shorter side is less than 800, without changing the image aspect ratio. Lastly, we randomly apply horizontal flips to the images for data augmentation.
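The learning-rate schedule and the image re-scaling rule can be illustrated as below. This is a minimal sketch under assumptions: the function names and default arguments are hypothetical, the warm-up is taken to be a straight linear interpolation between the two stated rates, and the size limits are treated as inclusive upper bounds.

```python
def learning_rate(iteration, base_lr=1e-4, target_lr=1e-3, warmup_iters=1000):
    """Linear warm-up from base_lr to target_lr, then a constant rate."""
    if iteration >= warmup_iters:
        return target_lr
    alpha = iteration / warmup_iters  # fraction of the warm-up completed
    return base_lr + alpha * (target_lr - base_lr)


def resize_scale(width, height, max_long=1333, max_short=800):
    """Largest aspect-preserving scale keeping both sides within their limits."""
    long_side, short_side = max(width, height), min(width, height)
    return min(max_long / long_side, max_short / short_side)


lr_start = learning_rate(0)        # base learning rate
lr_mid = learning_rate(500)        # halfway through the warm-up
lr_after = learning_rate(5000)     # constant rate after the warm-up
scale = resize_scale(1600, 1200)   # limited here by the shorter side
```

For a 1600×1200 image, the shorter-side limit is the binding one, so the image is scaled by 800/1200.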
5.1 Evaluation Metrics
Existing metrics evaluate instance segmentation results by looking at object instances individually. Our problem involves multiple types of instances: shadows, objects, and their associations. Hence, we formulate a new metric called the Shadow-Object Average Precision (SOAP) by adopting the same formulation as the traditional average precision (AP) with the intersection over union (IoU) but further considering a sample as true positive (an output shadow-object association), if it satisfies the following three conditions:
(i) the IoU between the predicted shadow instance and ground-truth shadow instance is no less than $\tau$;
(ii) the IoU between the predicted object instance and ground-truth object instance is no less than $\tau$; and
(iii) the IoU between the predicted and ground-truth shadow-object associations is no less than $\tau$.
Following Microsoft COCO, we report the evaluation results by setting $\tau$ as 0.5 (SOAP$_{50}$) or 0.75 (SOAP$_{75}$), and also report the average over multiple $\tau$ values [0.5:0.05:0.95] (SOAP). Moreover, since we can obtain the bounding boxes as well as the masks for the shadow instances, object instances, and shadow-object associations, we further report SOAP$_{50}$, SOAP$_{75}$, and SOAP in terms of both bounding boxes and masks.
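The three-condition test for a true positive can be sketched as follows. This covers only the per-sample check, not the full average-precision computation, and it rests on assumptions: masks are represented as sets of `(x, y)` pixel coordinates, each prediction is a dictionary with the hypothetical keys `"shadow"`, `"object"`, and `"association"`, and the function names are illustrative.

```python
def mask_iou(mask_a, mask_b):
    """IoU of two binary masks given as sets of (x, y) pixel coordinates."""
    union = len(mask_a | mask_b)
    return len(mask_a & mask_b) / union if union else 0.0


def is_true_positive(pred, gt, tau):
    """True if shadow, object, and association IoUs are all no less than tau."""
    return all(mask_iou(pred[key], gt[key]) >= tau
               for key in ("shadow", "object", "association"))


def soap_thresholds():
    """IoU thresholds 0.5, 0.55, ..., 0.95 used when averaging SOAP."""
    return [round(0.5 + 0.05 * i, 2) for i in range(10)]


# Assumed toy sample: the predicted shadow misses one of four shadow pixels.
gt_shadow = {(0, 0), (1, 0), (2, 0), (3, 0)}
gt_object = {(0, 1), (1, 1)}
gt = {"shadow": gt_shadow, "object": gt_object, "association": gt_shadow | gt_object}
pred_shadow = {(0, 0), (1, 0), (2, 0)}
pred = {"shadow": pred_shadow, "object": gt_object,
        "association": pred_shadow | gt_object}

tp_at_50 = is_true_positive(pred, gt, 0.5)
tp_at_80 = is_true_positive(pred, gt, 0.8)
```

In this toy sample the shadow IoU is 0.75, so the pair counts as a true positive at thresholds up to 0.75 but not at 0.8, even though the object mask is perfect. This shows how the weakest of the three overlaps decides the outcome.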
|Method|box SOAP$_{50}$|box SOAP$_{75}$|box SOAP|
|---|---|---|---|
|Our full pipeline|50.5|16.4|21.8|

|Method|mask SOAP$_{50}$|mask SOAP$_{75}$|mask SOAP|
|---|---|---|---|
|Our full pipeline|50.9|14.4|21.6|
To evaluate the LISA framework, we set up (i) Baseline 1, which adopts only the top branch of LISA to predict bounding boxes and masks of the shadow and object instances, then merges them to form shadow-object associations based on the proximity between the shadow and object instances; and (ii) Baseline 2, which removes the light direction head in LISA when predicting the shadow-object associations, but still adopts the procedure to pair-and-match the shadow and object instances (Section 4.2).
Tables 1 and 2 report the quantitative comparison results in terms of the bounding boxes and masks in the final detected shadow-object associations. Comparing different rows in the results, we can see that Baseline 2 clearly improves over Baseline 1, demonstrating that we can obtain better shadow-object associations in our deep end-to-end framework by also predicting the shadow-object associations independently, then pairing the shadow and object instances and matching them with the predicted associations. Moreover, by further predicting the light direction and taking it as guidance to jointly optimize the framework, our full pipeline LISA achieves the best performance on all the evaluation metrics.
Figure 7 shows visual comparison results for Baseline 1, Baseline 2, and our full pipeline. The first column shows the input images, whereas the second, third, and fourth columns show the results produced by the two baselines and our full pipeline. By comparing Baseline 1 with Baseline 2, we can see that further learning to detect the shadow-object associations independently in the deep framework helps to discover more shadow-object pairs, as shown in the third and fourth rows in Figure 7. Moreover, after taking the light direction as guidance (Baseline 2 vs. full pipeline), our method improves the performance in various challenging cases, e.g., when there is a large but irrelevant shadow region nearby (see the first row), when there are multiple shadow instances connected with a single object instance (see the second row), when the centers of the shadow and object instances are far from each other (see the third row), and when there are multiple shadow regions near a single object instance (see the last row). Please see Figure 8 and the supplemental material for more instance shadow detection results produced by our method on various types of images and objects.
Below, we present application scenarios to demonstrate the applicability of the results produced by our method.
Light direction estimation.
First, instance shadow detection helps to estimate the light direction in a single 2D image, and we connect the centers of the bounding boxes of the shadow and object instances in each shadow-object association pair as the estimated light direction. Figure 9 shows some example results, where for each photo, we estimate the light direction and render a virtual red post with a simulated shadow on the ground based on the estimated light direction. From the results, we can see that the virtual shadows with the red posts look consistent with the real shadows cast by other objects, thus demonstrating the applicability of our detection results.
Another application of instance shadow detection is photo editing, where we can remove not only the object instances but also their associated shadows. For privacy protection, Uittenbogaard et al. present a method to automatically remove specific objects in street-view photos; see Figure 10 (c) for a result, where it successfully removes the vehicle. However, the shadow cast by the vehicle remains on the ground. With the help of our instance shadow detection result (Figure 10 (b)), we can remove the vehicle together with its shadow, as shown in Figure 10 (d).
Further, we can more efficiently transfer an object together with its shadow from one photo to another. Figure 11 presents an example, where we cut the motorcycle with its shadow from (b) and paste them into (a) at a smaller size. Clearly, if we simply paste the motorcycle and shadow into (a), the shadow is not consistent with the real shadows in the target photo; see (c). Since instance shadow detection outputs individual masks for both object and shadow instances, as well as light directions, we can achieve light-aware photo editing by making use of the estimated light direction in both photos to adjust the shadow images when transferring the motorcycle from one photo to the other; see (d).
7 Conclusions and Limitations
In this paper, we presented instance shadow detection, which aims to find shadow instances and object instances, and pair them up. Also, we presented three technical contributions to approach the problem. First, we prepared SOBA, a new dataset of 1,000 images and 3,623 pairs of shadow-object associations, where we provide the input photos together with a set of three instance masks. Second, we developed LISA, an end-to-end deep framework, to predict boxes and masks of individual shadow and object instances, as well as boxes of shadow-object associations and the associated light directions; from these predictions, we further pair up the shadow and object instances and match them with the predicted shadow-object associations and light directions to produce the output shadow-object pairs. Third, we formulated SOAP, a new evaluation metric for quantitatively measuring the instance shadow detection results, enabling us to perform various experiments to compare with baseline frameworks. Finally, we demonstrated the applicability of our results to light direction estimation and photo editing.
As the first attempt to detect shadow-object instances, we admit that there are many possible methods that can be explored to improve the detection performance. Besides methodologies, we did not consider the overlap between shadow instances associated with different objects. Also, we did not consider cast shadows formed on some other object instances. There are many open problems and unexplored situations for instance shadow detection.
In the future, we plan to first improve the performance of instance shadow detection by simultaneously leveraging multiple training data from the current datasets prepared for shadow detection and instance segmentation. By exploring semi- or weakly-supervised methods to learn to detect instance shadows, we could combine the strengths and knowledge from various data to better the performance of instance shadow detection. Last, we will also explore more applications based on the shadow-object association results.
- [1] (2017) Pixelwise instance segmentation with a dynamically instantiated network. In CVPR, pp. 441–450.
- [2] (2017) Deep watershed transform for instance segmentation. In CVPR, pp. 5221–5229.
- [3] (2019) Hybrid task cascade for instance segmentation. In CVPR, pp. 4974–4983.
- [4] (2018) MaskLab: instance segmentation by refining object detection with semantic and direction features. In CVPR, pp. 4013–4022.
- [5] (2019) TensorMask: a foundation for dense object segmentation. arXiv preprint arXiv:1903.12174.
- [6] (2016) Instance-aware semantic segmentation via multi-task network cascades. In CVPR, pp. 3150–3158.
- [7] (2009) ImageNet: a large-scale hierarchical image database. In CVPR, pp. 248–255.
- [8] (2019) ARGAN: attentive recurrent generative adversarial network for shadow detection and removal. In ICCV, pp. 10213–10222.
- [9] (2019) SSAP: single-shot instance segmentation with affinity pyramid. In ICCV, pp. 642–651.
- [10] (2015) Fast R-CNN. In ICCV, pp. 1440–1448.
- [11] (2017) Accurate, large minibatch SGD: training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677.
- [12] (2011) Single-image shadow detection and removal using paired regions. In CVPR, pp. 2033–2040.
- [13] (2017) Boundary-aware instance segmentation. In CVPR, pp. 5696–5704.
- [14] (2017) Mask R-CNN. In ICCV, pp. 2961–2969.
- [15] (2019) Large scale shadow annotation and detection using lazy annotation and stacked CNNs. IEEE Transactions on Pattern Analysis and Machine Intelligence. Note: to appear.
- [16] (2019) Direction-aware spatial context features for shadow detection and removal. IEEE Transactions on Pattern Analysis and Machine Intelligence. Note: to appear.
- [17] (2019) Mask-ShadowGAN: learning to remove shadows from unpaired data. In ICCV, pp. 2472–2481.
- [18] (2018) Direction-aware spatial context features for shadow detection. In CVPR, pp. 7454–7462.
- [19] (2011) What characterizes a shadow boundary under the sun and sky?. In ICCV, pp. 898–905.
- [20] (2014) Automatic feature learning for robust shadow detection. In CVPR, pp. 1939–1946.
- [21] (2016) Automatic shadow detection and removal from a single image. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (3), pp. 431–446.
- [22] (2017) InstanceCut: from edges to instances with multicut. In CVPR, pp. 5008–5017.
- [23] (2010) Detecting ground shadows in outdoor consumer photographs. In ECCV, pp. 322–335.
- [24] (2019) Shadow removal via shadow image decomposition. In ICCV, pp. 8578–8587.
- [25] (2018) A+D Net: training a shadow detector with adversarial shadow attenuation. In ECCV, pp. 662–678.
- [26] (2017) Fully convolutional instance-aware semantic segmentation. In CVPR, pp. 2359–2367.
- [27] (2017) Feature pyramid networks for object detection. In CVPR, pp. 2117–2125.
- [28] (2014) Microsoft COCO: common objects in context. In ECCV, pp. 740–755.
- [29] (2017) SGN: sequential grouping networks for instance segmentation. In CVPR, pp. 3496–3504.
- [30] (2018) Path aggregation network for instance segmentation. In CVPR, pp. 8759–8768.
- [31] (2017) Shadow detection with conditional generative adversarial networks. In ICCV, pp. 4510–4518.
- [32] (2011) Illumination estimation and cast shadow detection through a higher-order graphical model. In CVPR, pp. 673–680.
- [33] (2018) MegDet: a large mini-batch object detector. In CVPR, pp. 6181–6189.
- [34] (2015) Learning to segment object candidates. In NeurIPS, pp. 1990–1998.
- [35] (2016) Learning to refine object segments. In ECCV, pp. 75–91.
- [36] (2017) DeshadowNet: a multi-context embedding deep network for shadow removal. In CVPR, pp. 4067–4075.
- [37] (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In NeurIPS, pp. 91–99.
- [38] (2004) Cast shadow segmentation using invariant color features. Computer Vision and Image Understanding 95 (2), pp. 238–259.
- [39] (2015) Shadow optimization from structured deep edge detection. In CVPR, pp. 2067–2074.
- [40] (2016) New spectrum ratio properties and features for shadow detection. Pattern Recognition 51, pp. 85–96.
- [41] (2019) Privacy protection in street-view panoramas using depth and multi-view imagery. In CVPR, pp. 10581–10590.
- [42] (2015) Leave-one-out kernel optimization for shadow detection. In ICCV, pp. 3388–3396.
- [43] (2016) Noisy label recovery for shadow detection in unfamiliar domains. In CVPR, pp. 3783–3792.
- [44] (2018) Leave-one-out kernel optimization for shadow detection and removal. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (3), pp. 682–695.
- [45] (2016) Large-scale training of shadow detectors with noisily-annotated shadow examples. In ECCV, pp. 816–832.
- [46] (2018) Stacked conditional generative adversarial networks for jointly learning shadow detection and shadow removal. In CVPR, pp. 1788–1797.
- [47] (2019) Detectron2. Note: https://github.com/facebookresearch/detectron2
- [48] (2017) Aggregated residual transformations for deep neural networks. In CVPR, pp. 1492–1500.
- [49] (2019) Distraction-aware shadow detection. In CVPR, pp. 5167–5176.
- [50] (2017) Scene parsing through ADE20K dataset. In CVPR, pp. 633–641.
- [51] (2019) Semantic understanding of scenes through the ADE20K dataset. International Journal of Computer Vision 127 (3), pp. 302–321.
- [52] (2010) Learning to recognize shadows in monochromatic natural images. In CVPR, pp. 223–230.
- [53] (2018) Bidirectional feature pyramid network with recurrent attention residual modules for shadow detection. In ECCV, pp. 121–136.