Revisiting Shadow Detection: A New Benchmark Dataset for Complex World

11/16/2019 · by Xiaowei Hu, et al.

Shadow detection in general photos is a nontrivial problem, due to the complexity of the real world. Though recent shadow detectors have already achieved remarkable performance on various benchmark data, their performance is still limited for general real-world situations. In this work, we collected shadow images for multiple scenarios and compiled a new dataset of 10,500 shadow images, each with labeled ground-truth mask, for supporting shadow detection in the complex world. Our dataset covers a rich variety of scene categories, with diverse shadow sizes, locations, contrasts, and types. Further, we comprehensively analyze the complexity of the dataset, present a fast shadow detection network with a detail enhancement module to harvest shadow details, and demonstrate the effectiveness of our method to detect shadows in general situations.







1 Introduction

Shadows are formed in the 3D spatial volume behind objects that occlude the light. The appearance of a shadow generally depends not only on the shape of the occluding object, but also on the direction and strength of the light that shines on the object, and on the geometry of the background object on which the shadow is cast. Moreover, in general real photos, we often observe multiple shadows cast by multiple objects and lights, and these shadows may lie on or span multiple background objects in the scene. Hence, shadow detection can be a very complicated problem in general situations.

From the research literature of computer vision and image processing [3, 44], it is known that the presence of shadows degrades the performance of many recognition tasks, from object detection and tracking [4, 29] to person re-identification [1], for example. Also, knowledge of the shadows in a scene can help to estimate the light conditions [23, 32] and scene geometry [19, 31]. Thus, shadow detection has long been a fundamental problem.

At present, the de facto approach to detecting shadows [41, 21, 16, 25, 42, 7, 14, 47, 12] is based on deep neural networks, which have demonstrated notable performance on various benchmark data [11, 12, 39, 41, 42, 50]. However, existing datasets contain mainly shadows cast by single or few separate objects, so they do not adequately model the complexity of shadows in the real world; see Figures 1 (a) & (b). Though recent methods [47, 51] have achieved nearly-saturated performance on these benchmarks, with a balanced error rate (BER) of less than 4% on the SBU [39, 41, 12] and ISTD [42] datasets, their performance is rather limited when they are used to detect shadows in various types of real-world situations; see Section 5. Also, current datasets contain mainly cast shadows with few self shadows, further limiting shadow detection performance in general situations. Note that when an object occludes the light, self shadows are regions on the object that do not receive direct light, while cast shadows are projections of the object onto other background objects.

Figure 1: Example shadow images and masks in ISTD [42], SBU [12, 39, 41], and our shadow detection dataset.

In this work, we prepare a new dataset to support shadow detection in complex real-world situations. Our dataset contains 10,500 shadow images, each with a labeled ground-truth mask. Apart from its size, the dataset has three main advantages compared with existing data. First, the shadow images are collected from diverse scenes, e.g., cities, buildings, satellite maps, and roads, which are general and challenging situations that existing data do not exhibit. Second, our dataset includes both cast shadows on background objects and self shadows on occluding objects. Third, besides the training and testing sets, our dataset provides a validation set for tuning training parameters and performing ablation studies for deep models, which helps to reduce the risk of overfitting. We will publicly release the dataset upon the publication of this work.

Besides, we design a fast shadow detection network called FSDNet by adopting the direction-aware spatial context module [14, 16] to aggregate global information from high-level feature maps and formulating a detail enhancement module to harvest shadow details in low-level feature maps. Also, we perform a comprehensive statistical analysis on our dataset to study its complexity, and evaluate the performance of various shadow detectors and FSDNet on the data. Experimental results show that FSDNet performs favorably against the state-of-the-art methods; in particular, it has only 4M parameters, so it achieves real-time performance while detecting shadows with good quality.

Figure 2: Example shadow images and shadow masks for categories (i) to (v) in our dataset; see Section 3.1 for details.

2 Related Work

Shadow detection on single images has been widely studied in computer vision research. Early methods focus on illumination models or machine learning algorithms that explore various hand-crafted shadow features, e.g., geometrical properties [34, 32], spectrum ratios [37], color [22, 38, 11, 40], texture [50, 38, 11, 40], edges [22, 50, 17], and T-junctions [22]. These features, however, have limited capability to distinguish between shadow and non-shadow regions, so approaches based on them often fail to detect shadows in general real-world environments.

Later, methods based on features learned by deep convolutional neural networks (CNNs) demonstrated remarkably improved performance on various benchmarks, especially when large training data became available. Khan et al. [20] adopt CNNs to learn features at the super-pixel level and along object boundaries, then use a conditional random field to predict the shadow contours. Shen et al. [36] predict the structures of shadow edges with a structured CNN and adopt a global shadow optimization framework for shadow recovery. Vicente et al. [41] train a stacked CNN to detect shadows by recovering noisy shadow annotations. Nguyen et al. [30] introduce a sensitivity parameter into the loss function of a conditional generative adversarial network to address the imbalance between shadow and non-shadow labels.

More recently, Hu et al. [14, 16] aggregate global context features via two rounds of data translations and formulate direction-aware spatial context features to detect shadows. Wang et al. [42] jointly detect and remove shadows by stacking two conditional generative adversarial networks. Le et al. [25] adopt a shadow attenuation network to generate adversarial training samples for training a shadow detection network. Zhu et al. [51] formulate a recurrent attention residual module to selectively use global and local features in a bidirectional feature pyramid network. Zheng et al. [47] present a distraction-aware shadow detection network that explicitly revises the false negative and false positive regions found by other shadow detection methods. Ding et al. [7] jointly detect and remove shadows in a recurrent manner. While these methods have achieved high accuracy on current benchmarks [12, 39, 41], their performance is still limited in complex real environments; see the experiments in Section 5.2. Apart from shadow detection, recent works [21, 33, 15, 7, 24] explore deep learning methods to remove shadows, but training data for shadow removal also contains mainly shadows cast by a few objects.

3 Our Dataset

Existing datasets for shadow detection, i.e., UCF [50], UIUC [11], SBU [12, 39, 41], and ISTD [42], have been widely used in the past decade. Among them, the pioneering ones, i.e., UCF and UIUC, contain only 245 and 108 images, respectively, so deep models trained on them have limited generalization capability, as shown in [39, 41]. Of the more recent ones, SBU has 4,087 training images and 638 testing images, whereas ISTD has 1,330 training and 540 testing triples of shadow images, shadow-free images, and shadow masks. Notably, ISTD covers only 135 background scenes; and while SBU features a wider variety of scenes, both datasets provide mainly shadows cast by single or a few objects. In contrast, our dataset contains 10,500 shadow images, each with a labeled mask, featuring shadows in diverse situations; see Figure 1 for example images randomly picked from ISTD, SBU, and our new dataset.

3.1 Building the Dataset

To start, we collected shadow images from five different sources: (i) Shadow-ADE: 1,132 images from the ADE20K dataset [48, 49] with shadows cast mainly by buildings; (ii) Shadow-KITTI: 2,773 images from the KITTI dataset [9], featuring shadows of vehicles, trees, and objects along the roads; (iii) Shadow-MAP: 1,595 remote-sensing and street-view photos from Google Map; (iv) Shadow-USR: 2,445 images from the USR dataset [15] with mainly people and object shadows; and (v) Shadow-WEB: 2,555 Internet images found by keyword search with “complex shadow.”

Next, we hired a professional company for data labeling. To ensure data labeling quality and consistency, we manually labeled some shadow images, gave the labeled masks to the company as samples, and checked with the company to finalize the shadow masks. Figure 2 shows example shadow images and masks for the five categories of shadow images in our dataset. After that, we randomly split the images in each category into the training set, validation set, and testing set with a ratio of 7:1:2. So, we have 7,350 training images, 1,050 validation images, and 2,100 testing images in total. To the best of our knowledge, this is currently the largest shadow detection dataset with labeled shadow masks. Also, it is the first shadow detection dataset with a validation set, and it features a wide variety of real-world situations.
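The per-category 7:1:2 split described above can be sketched as follows; the file names and random seed are placeholders of ours, not part of the released dataset:

```python
import random

def split_dataset(image_paths, seed=0):
    """Randomly split one category's images into training, validation, and
    testing sets with a 7:1:2 ratio, as described for the dataset construction."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)
    n = len(paths)
    n_train = int(n * 0.7)
    n_val = int(n * 0.1)
    return paths[:n_train], paths[n_train:n_train + n_val], paths[n_train + n_val:]

# Splitting per category keeps each scene type represented in all three sets.
train, val, test = split_dataset([f"img_{i:04d}.png" for i in range(1000)])
```

Applying this to each of the five categories and pooling the splits yields the 7,350 / 1,050 / 2,100 totals reported above.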

Figure 3: Analysis of the shadow area proportion in different datasets. The ISTD and SBU datasets contain mainly small shadows, while our dataset covers more diverse types of shadows with a wider range of sizes.

3.2 Dataset Complexity

To provide a comprehensive understanding of our dataset, we perform a series of statistical analyses on the shadow images and compare the results with the ISTD and SBU datasets in the following aspects.

Shadow area proportion.

First, we compute the proportion of pixels (range: [0,1]) occupied by shadows in each image. Figure 3 (left) shows histograms of the shadow area proportion for ISTD, SBU, and our dataset. From the histograms, we can see that most images in ISTD and SBU have relatively small shadow regions, while our dataset has more diverse shadow areas. Figure 3 (right) further shows histograms for the five scene categories in our dataset. Interestingly, the distribution of Shadow-WEB can serve as a good reference for general real-world shadows, since its images were obtained from the Internet. In contrast, Shadow-KITTI features mainly shadows in road scenes and Shadow-USR mainly shadows of people and objects, so the shadow areas in these categories vary less than in the others.
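The shadow area proportion statistic above is straightforward to compute from binary masks; a minimal sketch (the function names and bin count are ours):

```python
import numpy as np

def shadow_area_proportion(mask):
    """Fraction of pixels covered by shadow in a binary mask (1 = shadow)."""
    mask = np.asarray(mask, dtype=bool)
    return mask.mean()

def proportion_histogram(masks, bins=10):
    """Histogram of shadow area proportions over a dataset, as in Figure 3."""
    props = [shadow_area_proportion(m) for m in masks]
    hist, edges = np.histogram(props, bins=bins, range=(0.0, 1.0))
    return hist, edges
```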

Dataset        mean    std
ISTD           1.51    1.14
SBU            3.44    3.02
Overall        7.51    7.02
Shadow-ADE    10.09    7.95
Shadow-KITTI   9.63    5.92
Shadow-MAP     8.94    6.88
Shadow-USR     4.30    6.32
Shadow-WEB     6.25    6.95
Table 1: Mean and standard deviation of the number of separated shadow regions per image in ISTD [42], SBU [12, 39, 41], and our dataset (overall and per category).

Number of shadows per image.

Next, we group connected shadow pixels and count the number of separated shadows per image in ISTD, SBU, and our dataset. To avoid the influence of noisy labels, we ignore shadow regions whose area is less than 0.05% of the whole image. Table 1 reports the resulting statistics: ISTD and SBU have only around 1.51 and 3.44 shadow regions per image, respectively, while our dataset has far more shadow regions per image on average, revealing its complexity. Note also that there are more than ten separate shadow regions per image in Shadow-ADE, showing the challenge of detecting shadows in this category.
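The region-counting procedure above can be sketched with a simple flood fill and the 0.05% area threshold; the choice of 4-connectivity is an assumption on our part:

```python
import numpy as np

def count_shadow_regions(mask, min_area_ratio=0.0005):
    """Count separated shadow regions via a 4-connected flood fill, ignoring
    regions smaller than 0.05% of the image (noisy labels), as in Table 1."""
    mask = np.asarray(mask, dtype=bool)
    h, w = mask.shape
    seen = np.zeros_like(mask)
    min_area = min_area_ratio * mask.size
    count = 0
    for sy in range(h):
        for sx in range(w):
            if mask[sy, sx] and not seen[sy, sx]:
                # Flood-fill one connected region and measure its area.
                stack, area = [(sy, sx)], 0
                seen[sy, sx] = True
                while stack:
                    y, x = stack.pop()
                    area += 1
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            stack.append((ny, nx))
                if area >= min_area:
                    count += 1
    return count
```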

Figure 4: Shadow location distributions. Lighter (darker) colors indicate larger (smaller) chances of having shadows.
Figure 5: Color contrast distributions of different datasets.
Figure 6:

Illustration of our fast shadow detection network (FSDNet). Note that the height of the boxes indicates the size of the associated feature maps. BN and IRB denote batch normalization and inverted residual bottleneck, respectively.

Shadow location distribution.

Further, we study shadow locations in image space by resizing all shadow masks to 512×512 and summing them up per dataset, obtaining a per-pixel probability of shadow occurrence. Figure 4 shows the results for the three datasets, revealing that shadows in our dataset cover a wider spatial range, except for the top regions, which often correspond to the sky. In contrast, shadows in ISTD lie mainly in the middle of the image, while shadows in SBU lie mainly at the bottom.
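The location statistic above can be sketched as follows; nearest-neighbour resizing is used here only so the example needs no image library, since the paper does not specify the resizing method:

```python
import numpy as np

def shadow_location_map(masks, size=512):
    """Per-pixel shadow occurrence probability: resize every binary mask to
    size x size (nearest neighbour) and average over the dataset, as in Figure 4."""
    acc = np.zeros((size, size), dtype=np.float64)
    for m in masks:
        m = np.asarray(m, dtype=np.float64)
        # Nearest-neighbour index maps for rows and columns.
        ys = np.arange(size) * m.shape[0] // size
        xs = np.arange(size) * m.shape[1] // size
        acc += m[np.ix_(ys, xs)]
    return acc / len(masks)
```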

Color contrast distribution.

Real-world shadows are often soft rather than entirely dark, so the color contrast between shadow and non-shadow regions may not be high. Here, we follow [26, 45] to measure the distance between the color histograms of the shadow and non-shadow regions in each image. Figure 5 plots the color contrast distribution for images in the three datasets, where a contrast value (horizontal axis) of one means high color contrast, and vice versa. From the results, we can see that the color contrast in our dataset is lower than in both ISTD and SBU, so it is more challenging to detect the (softer) shadows in our dataset.
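A sketch of this contrast measurement; the chi-square histogram distance and the bin count used here are assumptions of ours standing in for the exact measure of [26, 45]:

```python
import numpy as np

def color_contrast(image, mask, bins=16):
    """Chi-square distance between the color histograms of the shadow and
    non-shadow regions of an H x W x 3 image; near 1 means high contrast,
    near 0 means low contrast."""
    image = np.asarray(image, dtype=np.float64)
    mask = np.asarray(mask, dtype=bool)

    def hist(pixels):
        h, _ = np.histogramdd(pixels.reshape(-1, 3), bins=(bins,) * 3,
                              range=((0, 256),) * 3)
        return h.ravel() / max(h.sum(), 1.0)   # normalize to a distribution

    hs, hn = hist(image[mask]), hist(image[~mask])
    denom = hs + hn
    denom[denom == 0] = 1.0                    # avoid division by zero
    return 0.5 * np.sum((hs - hn) ** 2 / denom)
```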

Shadow detection performance.

Last, to reveal the complexity of our dataset with respect to ISTD and SBU, we compare the performance of two recent shadow detectors, DSDNet [47] and BDRAR [51], on the three datasets. DSDNet and BDRAR achieve 2.17 and 2.69 balanced error rate (BER) on ISTD, 3.45 and 3.64 BER on SBU, but only 8.27 and 9.18 BER on our dataset, respectively, showing that it is more challenging to detect shadows in our dataset.


Overall, our dataset not only contains far more shadow images and covers a richer variety of scene categories, but also features more diverse shadow area proportions, separated shadows, and shadow locations, as well as lower color contrast between shadow and non-shadow regions. All of this means that our dataset is more challenging and complex for shadow detection, as further evidenced by the experimental results in Table 3.

3.3 Evaluation Metrics

Balanced error rate (BER) [41] is a common metric for evaluating shadow detection performance, where shadow and non-shadow regions contribute equally to the overall score regardless of their relative areas:

BER = (1 − (1/2)(TP/(TP+FN) + TN/(TN+FP))) × 100,

where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives, respectively. To compute these values, we first quantize the predicted shadow mask into a binary mask, then compare it with the ground-truth mask. A lower BER value indicates a better detection result.
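The BER computation can be sketched directly from the formula; the 0.5 quantization threshold is an assumption:

```python
import numpy as np

def balanced_error_rate(pred, gt, threshold=0.5):
    """BER in percent: shadow and non-shadow pixels weighted equally,
    regardless of their relative areas. Lower is better."""
    pred = np.asarray(pred) >= threshold      # quantize to a binary mask
    gt = np.asarray(gt).astype(bool)
    tp = np.sum(pred & gt)
    tn = np.sum(~pred & ~gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    shadow_acc = tp / max(tp + fn, 1)         # accuracy on shadow pixels
    nonshadow_acc = tn / max(tn + fp, 1)      # accuracy on non-shadow pixels
    return (1.0 - 0.5 * (shadow_acc + nonshadow_acc)) * 100.0
```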

BER is designed for evaluating binary predictions, but recent deep neural networks [16, 51, 42, 25, 47, 7] predict shadow masks with continuous values, which indicate the probability of a pixel being inside a shadow. Hence, we also use the F_β^ω-measure [28], which evaluates continuous predictions by extending TP, TN, FP, and FN to

TP = (1 − E)·G,  TN = (1 − E)·(1 − G),  FP = E·(1 − G),  FN = E·G,

where G is the ground-truth mask, M is the predicted shadow mask, and E = |G − M| is the pixel-wise error. Further, the F_β^ω-measure introduces an error importance based on dependency and location constraints, so it can balance the weighted precision and recall values; please see [28] for more details. Overall, a larger F_β^ω indicates a better result.
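For illustration, a simplified sketch of the continuous extension; it deliberately omits the dependency- and location-based error weighting of the full F_β^ω-measure in [28], and β² = 1 is an assumption:

```python
import numpy as np

def soft_fbeta(pred, gt, beta2=1.0):
    """Simplified continuous F-measure: extends TP/FP/FN to real values via
    the pixel-wise error E = |G - M|, without the omega weighting of [28].
    Higher is better."""
    M = np.asarray(pred, dtype=np.float64)
    G = np.asarray(gt, dtype=np.float64)
    E = np.abs(G - M)
    tp = np.sum((1 - E) * G)
    fp = np.sum(E * (1 - G))
    fn = np.sum(E * G)
    precision = tp / max(tp + fp, 1e-12)
    recall = tp / max(tp + fn, 1e-12)
    return (1 + beta2) * precision * recall / max(beta2 * precision + recall, 1e-12)
```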

4 Methodology

Method                      Overall        Shadow-ADE     Shadow-KITTI   Shadow-MAP     Shadow-USR     Shadow-WEB
basic                       84.32  10.94   74.81  13.92   87.05   9.35   82.51  11.15   88.78   5.17   82.43  10.72
basic + DSC                 84.84  10.74   75.52  13.77   87.30   9.10   82.81  10.90   89.90   4.90   82.71  10.28
FSDNet w/o DEM              85.88   9.78   76.58  12.74   88.59   7.89   84.61   9.54   89.70   4.95   84.17   9.42
FSDNet (our full pipeline)  86.12   9.58   76.85  12.49   88.70   7.82   84.86   9.29   90.00   4.62   84.50   8.98
Table 2: Ablation study on the validation set of our dataset; each cell reports F_β^ω (left) and BER (right).

Network architecture.

Figure 6 shows the overall architecture of our fast shadow detection network (FSDNet), which takes a shadow image as input and outputs a shadow mask in an end-to-end manner. First, we use MobileNetV2 [35] as the backbone, with a series of inverted residual bottlenecks (IRBs) to extract feature maps at multiple scales. Each IRB contains a 1×1 convolution, a 3×3 depthwise convolution [2], and another 1×1 convolution, with a skip connection to add the input and output feature maps. It also adopts batch normalization [18] after each convolution and ReLU6 [13] after the first two convolutions. Second, we employ the direction-aware spatial context (DSC) module [14, 16] after the last convolutional layer of the backbone to harvest DSC features, which carry global context information that helps to recognize shadows.
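The IRB structure described above can be sketched in PyTorch; the expansion factor and the stride-1, equal-channel setting are assumptions chosen so that the skip connection applies:

```python
import torch
import torch.nn as nn

class InvertedResidualBottleneck(nn.Module):
    """Sketch of an IRB as described: 1x1 conv -> 3x3 depthwise conv -> 1x1 conv,
    batch normalization after every conv, ReLU6 after the first two, and a skip
    connection adding the input and output feature maps."""
    def __init__(self, channels, expansion=6):
        super().__init__()
        hidden = channels * expansion
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            # groups=hidden makes this a depthwise convolution
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return x + self.block(x)   # residual skip connection
```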

Third, the low-level feature maps of the backbone contain rich fine details that help discover shadow boundaries and tiny shadows. We therefore formulate the detail enhancement module (DEM) to harvest shadow details in low-level feature maps where the distance between the DSC feature and the low-level feature is large. Last, we concatenate the DEM-refined low-level feature, the mid-level feature, and the high-level feature, then use a series of convolution layers to predict the output shadow mask; see Figure 6 for details.

Figure 7: The detail enhancement module (DEM).

Detail enhancement module.

Figure 7 illustrates the structure of the detail enhancement module (DEM), which takes the low-level feature and the DSC feature as inputs. First, we reduce the number of feature channels of the DSC feature with a 1×1 convolution and upsample it to the size of the low-level feature. Then, we calculate a gate map to measure the importance of the detail structures: it records the distance between the low-level feature and the DSC feature, normalized to [0,1] by a logarithm function. Following [8], we introduce a learnable parameter to adjust the scale of the gate map. In the end, we multiply the gate map with the input low-level feature to enhance the spatial details and produce the refined low-level feature. Note that this module introduces only a few parameters (a 1×1 convolution and the learnable scale), so its computing time is negligible.
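A PyTorch sketch of the DEM as described; the channel-wise L1 distance and the max-based normalization inside the logarithm are assumptions of ours, since the text does not fully specify them:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DetailEnhancementModule(nn.Module):
    """Sketch of the DEM: reduce the DSC feature with a 1x1 conv, upsample it
    to the low-level feature's size, build a gate map from their distance
    (log-normalized to [0,1]), scale it by a learnable parameter, and multiply
    it with the low-level feature."""
    def __init__(self, low_channels, dsc_channels):
        super().__init__()
        self.reduce = nn.Conv2d(dsc_channels, low_channels, 1)
        self.alpha = nn.Parameter(torch.ones(1))   # learnable gate scale

    def forward(self, low, dsc):
        dsc = self.reduce(dsc)
        dsc = F.interpolate(dsc, size=low.shape[2:], mode='bilinear',
                            align_corners=False)
        dist = torch.abs(low - dsc).mean(dim=1, keepdim=True)  # assumed L1 distance
        g = torch.log1p(dist)
        gate = g / g.max().clamp(min=1e-6)         # normalize to [0,1]
        return low * (self.alpha * gate)           # refined low-level feature
```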

Training strategies.

We initialize the backbone with the weights of MobileNetV2 trained on ImageNet [5] for classification, and initialize the parameters of the other layers with random noise. We optimize the network with stochastic gradient descent (momentum 0.9, weight decay 0.0005), minimizing the loss between the ground-truth and predicted shadow masks. We set the initial learning rate to 0.005, reduce it with the poly strategy [27] using a power of 0.9, and stop training after 50k iterations. We implement the network in PyTorch, train it on a GeForce GTX 1080 Ti GPU with a mini-batch size of six, and horizontally flip the images for data augmentation.
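The poly learning-rate schedule with the stated settings can be sketched as:

```python
def poly_lr(base_lr, iteration, max_iter=50000, power=0.9):
    """Poly learning-rate schedule [27] with the paper's settings:
    base_lr = 0.005, power = 0.9, training stopped after 50k iterations."""
    return base_lr * (1 - iteration / max_iter) ** power
```

At each iteration, the optimizer's learning rate is set to `poly_lr(0.005, it)`, decaying smoothly from 0.005 to zero over training.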

5 Experimental Results

5.1 Evaluation on the Network Design

We perform experiments on the validation set of our data to evaluate the effectiveness of the major components in FSDNet. First, we build a “basic” model using the last layer of the backbone network to directly predict the shadows. This model is built by removing the DSC module, DEM, and skip connections of the low- and middle-level features in the overall architecture shown in Figure 6. Then, we add back the DSC module to aggregate global features and report the results as “basic+DSC” in Table 2. Further, we consider the low-level features refined by the DEM and set up another network architecture, namely “FSDNet w/o DEM,” by removing the DEM from the whole architecture and directly concatenating the low-, middle-, and high-level features. From the quantitative evaluation results shown in Table 2, we can see that the major components help improve the results and contribute to the full pipeline.

Method            FPS    Params (M)   Overall        shadow-ADE     shadow-KITTI   shadow-MAP     shadow-USR     shadow-WEB
FSDNet (ours)     77.52    4.40       86.27   8.65   79.04  10.14   88.78   8.10   83.22   9.95   90.86   4.40   84.27   9.75
DSDNet [47]       21.96   58.16       82.59   8.27   74.75   9.80   86.52   7.92   78.56   9.59   86.41   4.03   80.64   9.20
BDRAR [51]        18.20   42.46       83.51   9.18   76.53  10.47   85.35   9.73   79.77  10.56   88.56   4.88   82.09  10.09
A+D Net [25]      68.02   54.41       83.04  12.43   73.64  15.89   88.83   9.06   78.46  13.72   86.63   6.78   80.32  14.34
DSC [14, 16]       4.95   79.03       82.76   8.65   75.26  10.49   87.03   7.58   79.26   9.56   85.34   4.53   81.16   9.92
R³Net [6]         26.43   56.16       81.36   8.86   73.32  10.18   84.95   8.20   76.46  10.80   86.03   4.97   79.61  10.21
MirrorNet [45]    16.01  127.77       78.29  13.39   69.83  15.20   79.92  12.77   74.22  14.03   85.12   7.08   76.26  15.30
PSPNet [46]       12.21   65.47       84.93  10.65   76.76  12.38   88.12   9.48   81.14  12.65   90.42   5.68   82.20  12.75
Table 3: Comparing with state-of-the-art methods; each category cell reports F_β^ω (left) and BER (right). We trained all methods on our training set and tested them on our testing set without using any post-processing method such as CRF. Note that "FPS" stands for "frames per second," evaluated on a single GeForce GTX 1080 Ti GPU with a batch size of one and an image size of 512×512.
(a) Input image
(b) Ground truth
(c) FSDNet (ours)
(d) DSDNet [47]
(e) BDRAR [51]
(f) A+D Net [25]
(g) DSC [14, 16]
Figure 8: Visual comparison of the shadow masks produced by our method and by other shadow detection methods.
(a) Input image
(b) Ground truth
(c) FSDNet (ours)
(d) DSDNet [47]
(e) BDRAR [51]
(f) A+D Net [25]
(g) DSC [14, 16]
Figure 9: More visual comparison results (continue from Figure 8).

5.2 Comparison with the State-of-the-art

Comparison with recent shadow detection methods.

We compare FSDNet with four recent shadow detection methods: DSDNet [47], BDRAR [51], A+D Net [25], and DSC [14, 16]. We re-train each model on the training set of our dataset and evaluate it on our testing set. For a fair comparison, we set the input image size to 512×512.

Table 3 reports the overall quantitative comparison, where our method performs favorably against all the other methods on the five scene categories of our dataset in terms of F_β^ω, and achieves comparable performance with DSDNet on BER. Note that DSDNet requires the results of BDRAR, A+D Net, and DSC to discover false positives and false negatives during training, whereas we train FSDNet using only the given ground-truth masks. Furthermore, our network has only 4M parameters and processes 77.52 frames per second (see the FPS column in Table 3) on a single GeForce GTX 1080 Ti GPU. In particular, it is faster and more accurate than the recent real-time shadow detector A+D Net.

On the other hand, we show visual comparisons in Figures 8 and 9, where the results produced by our method are more consistent with the ground truths. Other methods may mis-recognize black regions as shadows, e.g., the green door in the first row and the trees in the last three rows of Figure 8 and the last two rows of Figure 9, or fail to find unobvious shadows, e.g., the shadow across backgrounds of different colors in the first row of Figure 9. However, our method may also fail to detect some extremely tiny shadows of trees and buildings; see Figure 8 and the last two rows of Figure 9. In the future, we plan to further enhance the DEM by considering patch-based methods to process image regions with detailed structures at high resolutions.

Comparison with other networks.

Deep network architectures designed for saliency detection, mirror detection, and semantic segmentation can also be used for shadow detection, if we re-train their models on shadow detection datasets. We take three recent works on saliency detection (R³Net [6]), mirror detection (MirrorNet [45]), and semantic segmentation (PSPNet [46]), re-train their models on the training set of our dataset, and then test them on our testing set. The last three rows of Table 3 show their quantitative results. Our method still performs favorably against these deep models in both accuracy and speed.

(a) Original image
(b) Result of DeepUPE [43]
(c) Histogram equalization over
the whole image
(d) Histogram equalization over
the shadow regions
Figure 10: Object detection results on different inputs.

5.3 Application

As described earlier in Section 1, shadows can degrade object detection performance. Figure 10 (a) shows an example, where the objects are detected by Google Cloud Vision [10]; note that the people under the grandstand were missed due to the presence of self shadows. Adopting a recent underexposed photo enhancement method, i.e., DeepUPE [43], or a simple histogram equalization over the whole image enhances the input, but the improvement in object detection is still limited; see Figures 10 (b) & (c). If we instead apply histogram equalization only on the shadows, with the help of our detected shadow mask, we can largely improve the visibility of the people in the shadow regions and also improve the object detection performance, as demonstrated in Figure 10 (d). In the future, we will explore shadow-mask-guided photo enhancement to improve both visual quality and the performance of high-level computer vision tasks.
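The mask-guided enhancement of Figure 10 (d) amounts to equalizing the intensity histogram only within the detected shadow region; a single-channel sketch (extending it to color images, e.g., via the luminance channel, is an assumption):

```python
import numpy as np

def equalize_shadow_region(gray, mask):
    """Histogram-equalize only the pixels inside the detected shadow mask,
    leaving non-shadow pixels untouched, for a single-channel uint8 image."""
    gray = np.asarray(gray, dtype=np.uint8)
    mask = np.asarray(mask, dtype=bool)
    out = gray.copy()
    vals = gray[mask]
    if vals.size == 0:
        return out
    # Build the CDF of intensities inside the shadow region only.
    hist = np.bincount(vals, minlength=256).astype(np.float64)
    cdf = np.cumsum(hist) / vals.size
    # Remap shadow pixels through the CDF; non-shadow pixels are unchanged.
    out[mask] = np.round(cdf[vals] * 255).astype(np.uint8)
    return out
```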

6 Conclusion

This paper revisits the problem of shadow detection, with a specific aim to handle general real-world situations. We collected and prepared a new benchmark dataset of 10,500 shadow images, each with a labeled shadow mask, from five different categories of sources. Compared with existing shadow detection datasets, our dataset provides images of diverse scene categories, features both cast and self shadows, and introduces a validation set to reduce the risk of overfitting. We show the complexity of our dataset by analyzing the shadow area proportion, number of shadows per image, shadow location distribution, and color contrast between shadow and non-shadow regions. Moreover, we design a novel and fast deep neural network architecture (FSDNet) and formulate a detail enhancement module to bring in more shadow details from the low-level features. Compared with state-of-the-art methods, our network performs favorably in both accuracy and speed.

From the results, we can see that no method (including ours) achieves performance on our new dataset as high as what has been achieved on the existing datasets. In the future, we plan to explore shadow detection performance across different image categories, and to design robust techniques to improve shadow detection, particularly for recovering shadow details and shadow boundaries.


  • [1] E. Bekele, W. E. Lawson, Z. Horne, and S. Khemlani (2018) Implementing a robust explanatory bias in a person re-identification network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 2165–2172. Cited by: §1.
  • [2] F. Chollet (2017) Xception: deep learning with depthwise separable convolutions. In CVPR, pp. 1251–1258. Cited by: §4.
  • [3] V. Chondagar, H. Pandya, M. Panchal, R. Patel, D. Sevak, and K. Jani (2015) A review: shadow detection and removal. International Journal of Computer Science and Information Technologies 6 (6), pp. 5536–5541. Cited by: §1.
  • [4] R. Cucchiara, C. Grana, M. Piccardi, and A. Prati (2003) Detecting moving objects, ghosts, and shadows in video streams. IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (10), pp. 1337–1342. Cited by: §1.
  • [5] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In CVPR, pp. 248–255. Cited by: §4.
  • [6] Z. Deng, X. Hu, L. Zhu, X. Xu, J. Qin, G. Han, and P. Heng (2018) R³Net: recurrent residual refinement network for saliency detection. In IJCAI, pp. 684–690. Cited by: §5.2, Table 3.
  • [7] B. Ding, C. Long, L. Zhang, and C. Xiao (2019) ARGAN: attentive recurrent generative adversarial network for shadow detection and removal. In ICCV, pp. 10213–10222. Cited by: §1, §2, §3.3.
  • [8] J. Fu, J. Liu, Y. Wang, Y. Li, Y. Bao, J. Tang, and H. Lu (2019) Adaptive context network for scene parsing. In ICCV, pp. 6748–6757. Cited by: §4.
  • [9] A. Geiger, P. Lenz, and R. Urtasun (2012) Are we ready for autonomous driving? The KITTI vision benchmark suite. In CVPR, pp. 3354–3361. Cited by: §3.1.
  • [10] Google Cloud Vision (2019). Cited by: §5.3.
  • [11] R. Guo, Q. Dai, and D. Hoiem (2011) Single-image shadow detection and removal using paired regions. In CVPR, pp. 2033–2040. Cited by: §1, §2, §3.
  • [12] L. Hou, T. F. Y. Vicente, M. Hoai, and D. Samaras (2019) Large scale shadow annotation and detection using lazy annotation and stacked CNNs. IEEE Transactions on Pattern Analysis and Machine Intelligence. Note: to appear. Cited by: Figure 1, §1, §2, Table 1, §3.
  • [13] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. Cited by: §4.
  • [14] X. Hu, C. Fu, L. Zhu, J. Qin, and P. Heng (2019) Direction-aware spatial context features for shadow detection and removal. IEEE Transactions on Pattern Analysis and Machine Intelligence. Note: to appear Cited by: §1, §1, §2, §4, 7(g), 8(g), §5.2, Table 3.
  • [15] X. Hu, Y. Jiang, C. Fu, and P. Heng (2019) Mask-ShadowGAN: learning to remove shadows from unpaired data. In ICCV, pp. 2472–2481. Cited by: §2, §3.1.
  • [16] X. Hu, L. Zhu, C. Fu, J. Qin, and P. Heng (2018) Direction-aware spatial context features for shadow detection. In CVPR, pp. 7454–7462. Cited by: §1, §1, §2, §3.3, §4, 7(g), 8(g), §5.2, Table 3.
  • [17] X. Huang, G. Hua, J. Tumblin, and L. Williams (2011) What characterizes a shadow boundary under the sun and sky?. In ICCV, pp. 898–905. Cited by: §2.
  • [18] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §4.
  • [19] K. Karsch, V. Hedau, D. Forsyth, and D. Hoiem (2011) Rendering synthetic objects into legacy photographs. ACM Transactions on Graphics (SIGGRAPH Asia) 30 (6), pp. 157:1–157:12. Cited by: §1.
  • [20] S. H. Khan, M. Bennamoun, F. Sohel, and R. Togneri (2014) Automatic feature learning for robust shadow detection. In CVPR, pp. 1939–1946. Cited by: §2.
  • [21] S. H. Khan, M. Bennamoun, F. Sohel, and R. Togneri (2016) Automatic shadow detection and removal from a single image. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (3), pp. 431–446. Cited by: §1, §2.
  • [22] J. Lalonde, A. A. Efros, and S. G. Narasimhan (2010) Detecting ground shadows in outdoor consumer photographs. In ECCV, pp. 322–335. Cited by: §2.
  • [23] J. Lalonde, A. A. Efros, and S. G. Narasimhan (2009) Estimating natural illumination from a single outdoor image. In ICCV, pp. 183–190. Cited by: §1.
  • [24] H. Le and D. Samaras (2019) Shadow removal via shadow image decomposition. In ICCV, pp. 8578–8587. Cited by: §2.
  • [25] H. Le, T. F. Y. Vicente, V. Nguyen, M. Hoai, and D. Samaras (2018) A+D Net: training a shadow detector with adversarial shadow attenuation. In ECCV, pp. 662–678. Cited by: §1, §2, §3.3, 7(f), 8(f), §5.2, Table 3.
  • [26] Y. Li, X. Hou, C. Koch, J. M. Rehg, and A. L. Yuille (2014) The secrets of salient object segmentation. In CVPR, pp. 280–287. Cited by: §3.2.
  • [27] W. Liu, A. Rabinovich, and A. C. Berg (2015) ParseNet: looking wider to see better. arXiv preprint arXiv:1506.04579. Cited by: §4.
  • [28] R. Margolin, L. Zelnik-Manor, and A. Tal (2014) How to evaluate foreground maps? In CVPR, pp. 248–255. Cited by: §3.3.
  • [29] S. Nadimi and B. Bhanu (2004) Physical models for moving shadow and object detection in video. IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (8), pp. 1079–1087. Cited by: §1.
  • [30] V. Nguyen, T. F. Y. Vicente, M. Zhao, M. Hoai, and D. Samaras (2017) Shadow detection with conditional generative adversarial networks. In ICCV, pp. 4510–4518. Cited by: §2.
  • [31] T. Okabe, I. Sato, and Y. Sato (2009) Attached shadow coding: estimating surface normals from shadows under unknown reflectance and lighting conditions. In ICCV, pp. 1693–1700. Cited by: §1.
  • [32] A. Panagopoulos, C. Wang, D. Samaras, and N. Paragios (2011) Illumination estimation and cast shadow detection through a higher-order graphical model. In CVPR, pp. 673–680. Cited by: §1, §2.
  • [33] L. Qu, J. Tian, S. He, Y. Tang, and R. W. H. Lau (2017) DeshadowNet: a multi-context embedding deep network for shadow removal. In CVPR, pp. 4067–4075. Cited by: §2.
  • [34] E. Salvador, A. Cavallaro, and T. Ebrahimi (2004) Cast shadow segmentation using invariant color features. Computer Vision and Image Understanding 95 (2), pp. 238–259. Cited by: §2.
  • [35] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) MobileNetV2: inverted residuals and linear bottlenecks. In CVPR, pp. 4510–4520. Cited by: §4.
  • [36] L. Shen, T. Wee Chua, and K. Leman (2015) Shadow optimization from structured deep edge detection. In CVPR, pp. 2067–2074. Cited by: §2.
  • [37] J. Tian, X. Qi, L. Qu, and Y. Tang (2016) New spectrum ratio properties and features for shadow detection. Pattern Recognition 51, pp. 85–96. Cited by: §2.
  • [38] T. F. Y. Vicente, M. Hoai, and D. Samaras (2015) Leave-one-out kernel optimization for shadow detection. In ICCV, pp. 3388–3396. Cited by: §2.
  • [39] T. F. Y. Vicente, M. Hoai, and D. Samaras (2016) Noisy label recovery for shadow detection in unfamiliar domains. In CVPR, pp. 3783–3792. Cited by: Figure 1, §1, §2, Table 1, §3.
  • [40] T. F. Y. Vicente, M. Hoai, and D. Samaras (2018) Leave-one-out kernel optimization for shadow detection and removal. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (3), pp. 682–695. Cited by: §2.
  • [41] T. F. Y. Vicente, L. Hou, C. Yu, M. Hoai, and D. Samaras (2016) Large-scale training of shadow detectors with noisily-annotated shadow examples. In ECCV, pp. 816–832. Cited by: Figure 1, §1, §2, §2, §3.3, Table 1, §3.
  • [42] J. Wang, X. Li, and J. Yang (2018) Stacked conditional generative adversarial networks for jointly learning shadow detection and shadow removal. In CVPR, pp. 1788–1797. Cited by: Figure 1, §1, §2, §3.3, Table 1, §3.
  • [43] R. Wang, Q. Zhang, C. Fu, X. Shen, W. Zheng, and J. Jia (2019) Underexposed photo enhancement using deep illumination estimation. In CVPR, pp. 6849–6857. Cited by: 9(b), §5.3.
  • [44] L. Xu, F. Qi, R. Jiang, Y. Hao, and G. Wu (2006) Shadow detection and removal in real images: a survey. CVLAB, Shanghai Jiao Tong University, China. Cited by: §1.
  • [45] X. Yang, H. Mei, K. Xu, X. Wei, B. Yin, and R. W. H. Lau (2019) Where is my mirror? In ICCV, pp. 8809–8818. Cited by: §3.2, §5.2, Table 3.
  • [46] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia (2017) Pyramid scene parsing network. In CVPR, pp. 2881–2890. Cited by: §5.2, Table 3.
  • [47] Q. Zheng, X. Qiao, Y. Cao, and R. W. H. Lau (2019) Distraction-aware shadow detection. In CVPR, pp. 5167–5176. Cited by: §1, §2, §3.2, §3.3, 7(d), 8(d), §5.2, Table 3.
  • [48] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba (2017) Scene parsing through ADE20K dataset. In CVPR, pp. 633–641. Cited by: §3.1.
  • [49] B. Zhou, H. Zhao, X. Puig, T. Xiao, S. Fidler, A. Barriuso, and A. Torralba (2019) Semantic understanding of scenes through the ADE20K dataset. International Journal of Computer Vision 127 (3), pp. 302–321. Cited by: §3.1.
  • [50] J. Zhu, K. G. G. Samuel, S. Z. Masood, and M. F. Tappen (2010) Learning to recognize shadows in monochromatic natural images. In CVPR, pp. 223–230. Cited by: §1, §2, §3.
  • [51] L. Zhu, Z. Deng, X. Hu, C. Fu, X. Xu, J. Qin, and P. Heng (2018) Bidirectional feature pyramid network with recurrent attention residual modules for shadow detection. In ECCV, pp. 121–136. Cited by: §1, §2, §3.2, §3.3, 7(e), 8(e), §5.2, Table 3.