
Segmenting Sky Pixels in Images

Outdoor scene parsing models are often trained on ideal datasets and produce quality results. However, this leads to a discrepancy when they are applied to the real world. The quality of scene parsing, particularly sky classification, decreases for night-time images, images involving varying weather conditions, and scene changes due to seasonal weather. This project approaches these challenges by using a state-of-the-art model in conjunction with non-ideal datasets: SkyFinder and a subset of the SUN database containing the sky object. We focus specifically on sky segmentation, the task of separating sky from not-sky pixels, and on improving an existing state-of-the-art model, RefineNet. As a result of our efforts, we see an improvement of 10-15% over prior methods on the SkyFinder dataset. We also improve on an off-the-shelf model in terms of average mIOU by nearly 35%. We analyze our trained models w.r.t. two aspects, time of day and weather, and find that, in spite of facing the same challenges as prior methods, our trained models significantly outperform them.



1 Introduction

Sky segmentation is a part of scene parsing, in which algorithms seek to label or identify objects in images. However, because they are trained on ideal datasets, these algorithms face difficulty in non-ideal conditions [12]. As deep learning methods become more involved in real-world applications, it becomes apparent that off-the-shelf methods are not always effective, and their reliability comes into question. Inspired by Mihail et al. [12], who compared existing sky segmentation methods and sought to bring attention to the problem of ideal datasets, we approach the challenges they mentioned using existing models. The challenges of outdoor scene parsing lie in the time of day, the time of year, and varying weather conditions. Their SkyFinder dataset allowed us to pursue these challenges and obtain improved results. This work highlights the importance of this problem and offers an improved model to aid sky segmentation in challenging real-world conditions. We adopted a state-of-the-art segmentation model [8] for the task of pixel-level sky detection. Our task differs from semantic segmentation in that we are interested in only one object, the sky. Unlike other objects, sky can be difficult to segment under poor lighting (night time) and in weather conditions where even humans are likely to fail (e.g., dense fog, thunderstorms). We attempt to address this problem in this work.

Our contributions are as follows. First, we evaluated an off-the-shelf state-of-the-art model [8] to demonstrate that existing models fail for different weather conditions, times of day, and other transient attributes [12]. Second, we fine-tuned the RefineNet-Res101-Cityscapes model on the SkyFinder dataset and obtained a 12.26 percentage-point improvement over the off-the-shelf model in terms of misclassification rate (MCR), from an average MCR of 18.93% to 6.67% (table 1). Third, taking advantage of the large size of the existing dataset, we trained [8] solely on SkyFinder, which further improved accuracy and outperformed all baseline methods. Fourth, to study the cross-dataset performance of our trained models, we selected the subset of the SUN [23] dataset with the 'sky' label. We both fine-tuned and trained the RefineNet-Res101 model on this subset (which we refer to as the SUN-Sky dataset in this paper), evaluated across models trained on the SkyFinder and SUN-Sky datasets, and report the results. Fifth, following [12], we investigate the effect of time of day, weather conditions, and transient attributes on the performance of our model. Sixth, we compare our analysis with [12] in terms of MCR and report the impact of weather and time of day w.r.t. mIOU scores. Finally, we determine the impact of image degradations such as motion blur and Gaussian noise on our model's performance and report results on the robustness of our approach.

The rest of this paper is organized as follows. In section 2, we present a brief overview of existing works in this area. In section 3, we describe details of our plan including utilized datasets, and our models. Section 4 discusses the performance metrics we used to evaluate the trained models. Section 5 presents experimental findings on Sky segmentation task for both datasets and analysis of different attributes, times of day, weather conditions and noise on this task, followed by discussions in section 6.

Figure 2: SUN-Sky dataset. Sample images from the various locations described in the SUN dataset with sky.

2 Related Work

Scene parsing, i.e., assigning each pixel of an input image to one of the object labels [12], has evolved considerably since its beginnings. Scene labeling methods [6], [9], [16], [18], [19] mainly use local appearance information for objects learned from training data [12]. Although this task was previously addressed using hand-engineered features with a classifier, recent methods learn features using deep neural networks. Convolutional networks gave rise to fully convolutional networks, which moved away from pixel-level algorithms to whole-image-at-a-time methods [10]. Afterward, the introduction of skip layers led to deep residual learning [4]. Recently, residual nets have become the backbone of scene parsing algorithms such as RefineNet [8] and PSPNet [26].

Specifically, the sky segmentation task can be helpful for a diverse variety of applications such as stylizing images using sky replacement [20], obstacle avoidance [2], and [13], since the sky tells a lot about weather conditions. Current applications of sky segmentation range from personal to public use. It is often used in scene parsing [5, 17], horizon estimation, and geolocation. Other applications include weather classification [11], image editing [7, 15], and weather estimation [21, 14]. [1] and [24] worked on weather recognition and used cameras as weather sensors. Weather detection has also been used for image search [15], where one can search outdoor scene images based on weather-related attributes.

Dev [3] used color-based methods for segmentation of sky/cloud images, whereas [25] proposed a deep learning approach for segmenting sky. Like [12], we used the same baseline methods (Hoiem [5], Tighe [17], and Lu [11]) to compare our results. Hoiem uses a single image to produce an estimate of scene geometry for three classes (ground, sky, and vertical regions) by learning the underlying geometry of an image via appearance-based models. Tighe combines region-level features with SVM-based sliding window detectors for parsing an image into different class labels, including 'sky'. Lu classifies an input image into two classes: sunny or cloudy. Their work uses sky detection as an important cue for weather classification. To detect sky, they used random forests to produce seed patches for sky and non-sky, and then used graph cut to segment sky regions.

[12] uses their sky detector for reporting results, which we also used for our comparisons on the SkyFinder dataset. Here we investigate the effectiveness of existing state-of-the-art segmentation methods for this specific problem. To select among the best contenders for this task, we evaluated the off-the-shelf RefineNet [8] and PSPNet [26] methods on the SkyFinder dataset, and chose [8] for our further experiments because it outperforms [26] by a large margin on this dataset. We further examine in depth the challenges (such as weather conditions, night-time images, and noisy images) that remain even when robust methods are used.

Method                         Split 1          Split 2          Split 3          Avg.
                               mIOU(%)  MCR(%)  mIOU(%)  MCR(%)  mIOU(%)  MCR(%)  mIOU(%)  MCR(%)
Hoiem                          -        21.28   -        20.68   -        26.24   -        22.73
Lu                             -        25.38   -        21.67   -        23.32   -        20.38
Tighe                          -        17.48   -        20.33   -        31.58   -        26.21
RefineNet-Res101-Cityscapes    49.31    17.12   46.31    18.00   49.87    21.68   48.5     18.93
RefineNet-Res101-SkyFinder-FT  79.48    5.17    72.18    7.48    85.2     7.37    78.95    6.67
RefineNet-Res101-SkyFinder     87.07    5.08    73.84    7.00    88.05    5.65    82.99    5.90
Table 1: Results from all three testing splits. MCR results for the top 3 baselines are taken from the SkyFinder metadata in order to compare our pixel-level sky detector with their methods. We also report mIOU scores. RefineNet-Res101-Cityscapes is the off-the-shelf model trained on Cityscapes. RefineNet-Res101-SkyFinder-FT shows results when we fine-tuned the RefineNet-Res101-Cityscapes model on the SkyFinder dataset. Finally, training RefineNet-Res101 from scratch on the SkyFinder dataset (last row) gives the best results. For RefineNet, all numbers are reported with the model evaluated at a test scale of 0.8.

3 Approach

Our approach is based on adopting semantic segmentation methods for the task of pixel-level sky detection. We used two datasets for our work: SkyFinder [12] and SUN-Sky [23]. Our approach outperforms baseline methods on the task of sky segmentation. We evaluated the generalization capability of our models by performing cross-dataset evaluation. We also studied the influence of factors such as transient attributes, weather conditions, and noise on our model's performance.

3.1 Datasets

3.1.1 SkyFinder dataset

The SkyFinder dataset is a subset of the Archive of Many Outdoor Scenes (AMOS) dataset. Because the full SkyFinder dataset was not available, we used the 45 of the 53 cameras that were shared. This amounted to roughly 60K-70K images, with an average of about 1500 images per camera. These images were of varying dimensions, quality, time of day, season, and weather. Some cameras contained images indicating that the camera was experiencing technical difficulties or repairs. These few images were removed to focus on the challenges we wished to address. We then resized the images to a uniform range of dimensions, which also made test evaluation faster.

A single segmentation map is associated with each camera. This is possible because each camera had to be stationary for at least a year in order to be included in the dataset. See figure 1 for sample images from this dataset.

Figure 3: Improvement in night-time images. Column 1 (Orig.) shows original image examples, column 2 (Gt.) shows ground truths, and the last three columns show results for a) off-the-shelf RefineNet-Res101-Cityscapes, b) RefineNet-Res101-SkyFinder-FT, and c) RefineNet-Res101-SkyFinder, respectively.
Figure 4: Improvement in weather-obscured images. Column 1 (Orig.) shows original image examples, column 2 (Gt.) shows ground truths, and the last three columns show results for a) off-the-shelf RefineNet-Res101-Cityscapes, b) RefineNet-Res101-SkyFinder-FT, and c) RefineNet-Res101-SkyFinder, respectively.

3.1.2 SUN dataset

The Scene UNderstanding (SUN) dataset comprises a multitude of different scenes and the objects that make up those scenes [22]. It is constantly being updated through community effort, adding to its already large size. We primarily looked at this dataset for its "sky" object label, which allowed us to compare against the SkyFinder dataset and results. Unlike SkyFinder, images are not grouped by camera; instead, they are grouped by scene, such as airport terminal or church. Also, SkyFinder focuses on images taken from stationary cameras, whereas SUN has a variety of images from a variety of viewpoints. For the purposes of this research, we focused on the subsection of the SUN database that had the object "sky" labeled (about 20,000 images). We resized those images to a common range of dimensions for improved test evaluation speed. In what follows, we refer to this subset as the SUN-Sky dataset. See figure 2 for a few sample images from this dataset.

Method                       mIOU(%)  MCR(%)
RefineNet-Res101-Cityscapes  61.69    8.4
RefineNet-Res101-SUNdb-FT    83.1     3.7
RefineNet-Res101-SUNdb       82.36    4.17
Table 2: Performance results of SUNdb Finetune (RefineNet-Res101-SUNdb-FT) and SUNdb ImageNet (RefineNet-Res101-SUNdb).

3.2 Sky Segmentation

3.2.1 RefineNet

The model we used, RefineNet, was created by Lin [8]. This model seeks to retain detail throughout the reconstruction of the image and its segmented output, unlike its predecessor and backbone, ResNet.

3.2.2 Off-the-Shelf RefineNet

To establish a baseline with RefineNet, we used RefineNet's Res101 model trained on Cityscapes, a dataset of European cities for scene segmentation. Cityscapes is an ideal dataset for the sky class, and as a result the model does well in ideal conditions in the SkyFinder dataset. After setting up the model on a single Titan X GPU, we ran each of the 45 cameras through the model and evaluated solely on its sky classification ability, i.e., all other classes were treated as non-sky. We refer to this baseline model as RefineNet-Res101-Cityscapes in the following.
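The reduction from a multi-class prediction to a sky / non-sky decision can be sketched as follows. This is a minimal illustration rather than the code used in the experiments: it assumes the model outputs a label map in the Cityscapes train-ID convention, where sky is class 10, and the helper name `to_sky_mask` is hypothetical.

```python
import numpy as np

# Assumption: label maps use Cityscapes train IDs, where "sky" is class 10.
SKY_TRAIN_ID = 10

def to_sky_mask(label_map: np.ndarray, sky_id: int = SKY_TRAIN_ID) -> np.ndarray:
    """Collapse a multi-class label map into a boolean sky / non-sky mask."""
    return label_map == sky_id

# Toy 2x3 prediction: 10 = sky, 2 = building, 8 = vegetation.
pred = np.array([[10, 10, 2],
                 [8, 2, 2]])
sky = to_sky_mask(pred)  # True only where the predicted class is sky
```

Every class other than sky maps to False, so the downstream metrics only ever see a binary mask.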

3.2.3 Finetuning

To obtain a proof of concept before running on the entire dataset, we focused on a smaller subset. We took 10 random cameras and broke them into a train-val-test split. From each camera we took 75 images for training, 25 images for validation, and between 175 and 300 images for testing and evaluation.

Following the success of fine-tuning the model on the subset, we fine-tuned the RefineNet-Res101-Cityscapes model on the SkyFinder dataset. Being unable to find the same train-val-test split as Mihail [12], we split the dataset into our own train-val-test splits. However, to keep our experiments as consistent as possible with [12], we used the same number of test cameras. Hence, our split consisted of 13 cameras for testing, 4 cameras for validation, and the remaining cameras for training. We then shuffled the cameras among the sets while keeping the same number of cameras in each of the training, validation, and testing sets, and repeated the fine-tuning process two more times for a total of three fine-tuning trials. We used a learning rate of 5e-5 in each instance and trained the model for 10 epochs. After 10 epochs, the validation accuracy started leveling out. Thus, for consistency, we report all results for models trained for 10 epochs. We refer to our fine-tuned models as RefineNet-Res101-SkyFinder-FT in our results.
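A camera-level split of this kind can be sketched as below. This is only an illustration under stated assumptions: the camera IDs, the seeds, and the function name `split_cameras` are hypothetical, and the actual assignment of cameras to splits in the experiments is not specified in the text.

```python
import random

def split_cameras(camera_ids, n_test=13, n_val=4, seed=0):
    """Shuffle cameras and split them so that no camera appears in two sets."""
    ids = list(camera_ids)
    random.Random(seed).shuffle(ids)
    test = ids[:n_test]
    val = ids[n_test:n_test + n_val]
    train = ids[n_test + n_val:]
    return train, val, test

# Hypothetical camera IDs 0..44 standing in for the 45 SkyFinder cameras.
cameras = range(45)
# Three independent shuffles, one per trial (seeds are arbitrary).
splits = [split_cameras(cameras, seed=s) for s in range(3)]
train, val, test = splits[0]
```

Splitting by camera rather than by image keeps every image of a test camera unseen during training, which matters here because all images from one camera share a single segmentation map.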

3.2.4 Training with SkyFinder dataset

Finally, to take advantage of the large size of the dataset, we trained RefineNet-Res101 from scratch: Res101 was initialized with a pre-trained ImageNet model, but RefineNet was trained solely on the SkyFinder dataset. We used the same three train-val-test splits as mentioned above (to allow for fair comparison) and trained at a learning rate of 5e-4 for over 10 epochs. Because the learning curve flattened, we used the model at epoch 10 for testing in both instances. We refer to these models as RefineNet-Res101-SkyFinder in our results.

3.2.5 Training with Sun-Sky dataset

We broke down the sky-labeled subset of the SUN dataset into a randomly shuffled 60-20-20 split. After resizing the images so that training and testing run quickly, we generated the ground truth segmentation masks by keeping only the sky class and treating all other classes as non-sky. To train and evaluate, we followed a process similar to the one used with the SkyFinder dataset. Training entails fine-tuning from the RefineNet-Res101-Cityscapes model and training a model initialized from ImageNet-Res101. Evaluation consists of calculating the average MCR and mIOU over the dataset. We fine-tuned the RefineNet-Res101-Cityscapes model on the SUN-Sky dataset for 10 epochs at a learning rate of 5e-4, and subsequently for 10 more epochs at a learning rate of 5e-5. For our second model, we initialized from ImageNet-Res101 and trained on the same split for 10 epochs at a learning rate of 5e-4. As with the previous model, we then trained for another 10 epochs at the lower learning rate of 5e-5.

Table 2 shows quantitative results on the SUN-Sky dataset for both the fine-tuned model (RefineNet-Res101-SUNdb-FT) and the trained model (RefineNet-Res101-SUNdb). See fig. 14 for qualitative results on this dataset.

Hour  Split 1 (mIOU MCR)  Split 2 (mIOU MCR)  Split 3 (mIOU MCR)  Avg. (mIOU MCR)
0 82.36 5.54 80.64 6.27 85.37 7.46 82.79 6.42
1 76.28 8.25 75.40 7.69 84.88 7.34 78.85 7.76
2 76.12 9.75 76.26 8.31 86.27 6.75 79.55 8.27
3 70.41 7.21 69.99 6.28 87.96 5.97 76.12 6.48
4 71.66 6.65 69.83 5.71 88.26 5.06 76.58 5.81
5 77.33 5.16 71.04 4.84 90.82 3.09 79.73 4.36
6 76.48 5.06 74.20 5.85 91.74 3.94 80.80 4.95
7 76.66 3.65 71.05 8.71 89.62 5.75 79.11 6.04
8 72.05 3.31 69.53 4.15 93.23 3.46 78.27 3.64
9 72.47 3.06 70.65 3.34 93.93 2.94 79.02 3.12
10 74.54 2.44 73.21 2.89 94.51 2.69 80.75 2.68
11 70.97 2.57 68.90 2.82 94.17 2.89 78.01 2.76
12 74.08 2.61 72.69 2.92 94.57 2.57 80.44 2.70
13 74.04 2.12 71.78 2.61 94.60 2.49 80.14 2.41
14 72.58 2.41 70.38 2.70 94.17 2.81 79.04 2.64
15 73.25 2.40 71.08 2.85 94.19 2.84 79.50 2.70
16 71.69 2.59 69.78 2.86 92.77 3.44 78.08 2.96
17 71.73 2.85 69.63 3.13 91.76 3.73 77.71 3.24
18 70.12 3.33 67.50 3.51 90.41 4.37 76.01 3.73
19 69.00 3.96 66.53 4.42 89.38 5.00 74.97 4.46
20 69.43 5.74 64.92 11.19 85.27 8.08 73.21 8.34
21 67.11 5.70 64.85 6.79 86.80 5.88 72.92 6.12
22 68.27 6.58 64.70 6.09 86.50 5.49 73.16 6.05
23 70.86 7.17 62.64 6.48 83.62 6.89 72.37 6.85
Avg. 72.89 4.77 70.30 6.45 90.20 4.62 77.8 5.25
Table 3: Breakdown of sky segmentation results w.r.t. hour of the day on the SkyFinder dataset. For each split and for the average, the two columns are mIOU and MCR. Numbers are in percent.

4 Evaluation

To evaluate the accuracy of our results, we used both the misclassification rate (MCR) defined in Mihail [12] and the mean intersection over union (mIOU). Using both metrics allowed us to compare our results with those in Mihail [12], which focused primarily on MCR, and to measure how well the segmentation outputs overlap with the ground truth.
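The two metrics can be sketched for a single binary mask as follows. This is a minimal sketch under assumptions, not the evaluation code used here: MCR is taken to be the fraction of pixels whose sky / non-sky label is wrong, IOU is computed for the sky class only, and the convention for an empty union is a choice made for this example. The function names `mcr` and `sky_iou` are hypothetical.

```python
import numpy as np

def mcr(pred, gt):
    """Misclassification rate: fraction of pixels labeled incorrectly."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    return float(np.mean(pred != gt))

def sky_iou(pred, gt):
    """Intersection over union of the sky class for one image."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # assumption: no sky in either mask counts as a perfect match
    return float(np.logical_and(pred, gt).sum() / union)

# Toy masks (1 = sky): two of the eight pixels are wrong.
gt = [[1, 1, 0, 0],
      [1, 1, 0, 0]]
pred = [[1, 1, 1, 0],
        [1, 0, 0, 0]]
print(mcr(pred, gt), sky_iou(pred, gt))  # 0.25 0.6

# An image with no sky in the ground truth and one stray sky pixel predicted.
empty_gt = np.zeros((2, 4), dtype=bool)
stray_pred = np.zeros((2, 4), dtype=bool)
stray_pred[0, 0] = True
print(mcr(stray_pred, empty_gt), sky_iou(stray_pred, empty_gt))  # 0.125 0.0
```

The second case shows why the two metrics can disagree: when the ground truth contains no sky, a single false positive drives the sky IOU to 0 while the MCR stays small. Averaging the per-image IOU over a test set would give one possible mIOU; whether the averaging is over images or over classes is an assumption not settled by the text.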


5 Results

In this section, we discuss our experiments on sky segmentation and study the impact of conditions such as time of day, weather, transient attributes, and noise on the performance of our model. Although [12] carried out a similar study, we extend their work and also report mIOU in our experiments.

5.1 SkyFinder dataset

We evaluated RefineNet on the SkyFinder dataset in three settings: off-the-shelf, fine-tuned, and trained from scratch. RefineNet-Res101-Cityscapes denotes the off-the-shelf RefineNet model, initialized from Res101 and trained on Cityscapes. We fine-tuned this off-the-shelf model on the SkyFinder dataset for all three splits and refer to the result as RefineNet-Res101-SkyFinder-FT. RefineNet initialized from ImageNet pre-trained Res101 and trained on SkyFinder from scratch is referred to as RefineNet-Res101-SkyFinder. We find that RefineNet-Res101-SkyFinder outperforms the fine-tuned model, most likely because it takes fuller advantage of the large size of this dataset. For comparison with the baseline methods (Hoiem, Lu, and Tighe) used in [12], we report the MCR scores for all three test splits, along with the average performance of all methods, in table 1. The MCR scores for these baselines on our test sets are taken from the metadata provided with the SkyFinder dataset, which records the results of evaluating these methods.

For the results in table 1, we evaluated RefineNet at a test scale of 0.8 (the default setting), which performs better than evaluation at full scale (scale = 1.0). Please refer to tables 1 and 3 for comparison. We find that RefineNet-Res101-SkyFinder outperforms all baseline methods in terms of both mIOU and MCR. For qualitative evaluation, the images are selected from a few cameras of the first test set and do not include visual results from Hoiem, Lu, or Tighe. While Mihail created their own ensemble by combining the three methods, its results were not reported for individual images.

The analysis of time of day and weather was performed with full-scale test evaluation.

5.1.1 Performance on Camera 10917

First, some background on SkyFinder's Camera 10917: this camera views a location in which no sky is visible. It depicts a quaint village and trees, but no sky. For the first two splits, this camera was not part of the training data. As a result, we consistently observed IOU values of 0 on it, which degrades performance for test splits 1 and 2 because the model had not previously seen images with no sky. However, the model performs decently in terms of MCR (below 10%). We believe that IOU is uninformative in a case such as this: with no sky pixels in the ground truth, the intersection is empty, so any predicted sky pixel yields an IOU of 0. We therefore pay particular attention to MCR for this camera. For split 3, this camera was used for training our models, so the model performs better on that test set.

5.1.2 Performance for night time

Fig. 3 shows the visual improvement of RefineNet-Res101-SkyFinder on night-time images compared with the off-the-shelf RefineNet-Res101-Cityscapes and the fine-tuned RefineNet-Res101-SkyFinder-FT, which helps explain why our trained models also beat the baseline methods in terms of MCR.

5.1.3 Performance for weather obscured images

In section 5.1.5, our results suggest that sky segmentation at night is more challenging than during daytime hours. Nevertheless, our final model trained on SkyFinder still improves in terms of mIOU over our established baseline models (RefineNet-Res101-Cityscapes and RefineNet-Res101-SkyFinder-FT). Fig. 4 shows that even when images are obscured by dense fog, RefineNet-Res101-SkyFinder performs reasonably well.

Figure 5: Performance analysis of RefineNet-Res101-SkyFinder in terms of mean intersection over union w.r.t. time of day (row 1) and weather conditions (row 2).
Figure 6: Performance analysis on time of day. The x-axis shows the hour of the day and the y-axis shows the misclassification rate (MCR) averaged over the three test splits used in our experiment.
Figure 7: Qualitative results for performance analysis w.r.t. time of day on the SkyFinder dataset using RefineNet-Res101-SkyFinder. Each example shows the input image, the ground truth (GT), and the prediction (Pred.). The first row shows success cases (mIOU > 0.8) for hours 0 (night), 12 (noon), and 18 (evening). The last row shows failure cases (mIOU < 0.5) for hours 6 and 0. Our model never failed for images around noon.
Weather  Split 1 (mIOU MCR)  Split 2 (mIOU MCR)  Split 3 (mIOU MCR)  Avg. (mIOU MCR)
clear 59.88 2.96 66.48 3.54 92.61 4.21 72.99 3.57
cloudy 79.92 3.61 77.02 4.19 90.71 4.07 82.55 3.96
fog 77.82 7.18 71.72 7.75 86.22 5.58 78.59 6.83
hazy 75.74 6.72 72.32 6.71 89.31 4.49 79.12 5.97
mostly cloudy 73.12 3.44 69.10 4.89 90.40 4.06 77.54 4.13
partly cloudy 77.46 4.18 75.79 6.10 89.98 4.61 81.08 4.96
rain 61.88 3.94 58.00 4.32 89.79 4.65 69.89 4.30
sleet 47.66 4.17 50.00 4.75 92.57 4.51 63.41 4.48
snow 29.50 1.42 33.70 4.48 89.03 6.83 50.74 4.24
tstorms 84.50 3.88 80.75 3.32 89.09 4.58 84.78 3.93
unknown 83.36 5.76 68.27 5.29 86.17 3.84 79.27 4.96
blanks 83.09 7.08 83.84 6.38 78.93 8.94 81.95 7.47
Avg. 69.49 4.82 67.25 6.77 88.73 5.03 75.16 5.54
Table 4: Breakdown of sky segmentation results w.r.t. weather on the SkyFinder dataset. For each split and for the average, the two columns are mIOU and MCR. Numbers are in percent.

5.1.4 Performance analysis for different day times

We also wanted to see how the trained model performs at different times of day, similar to [12]. As we use three test splits for reporting our results, we sorted each of them by hour of the day. For each sorted split, we evaluated the respective model on its test set and computed mIOU and MCR. We then averaged over the splits (see table 3). Note that when evaluating RefineNet-Res101-SkyFinder on each of its respective test sets, we used scale = 1.0 (i.e., full scale at test time). We find a pattern for RefineNet similar to the one discussed in [12]: the model achieves good performance in terms of MCR during the daytime, and MCR increases at the start (early morning) and end (night time) of the day. In terms of mIOU, we see a decrease in performance towards the end of the day, but overall the performance is fairly consistent. Fig. 5 illustrates mIOU at different hours of the day. To compare RefineNet with the results in the baseline paper, we fetched the MCR scores for each split by hour and report the results provided by [12]. Fig. 6 shows that although our trained model follows the same pattern, our pixel-level sky detector performs significantly better than all three baseline methods.

Please see figure 7 for qualitative results when we tested RefineNet-Res101-SkyFinder on the test split sorted by time. Row 1 shows success cases, whereas row 2 shows failure cases for different times of day.
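The per-hour aggregation described above can be sketched as follows. This is an illustrative sketch only: the record layout, the function names, and the toy scores are hypothetical, and the averaging order (per hour within a split, then across splits) is an assumption consistent with the description.

```python
from collections import defaultdict
from statistics import mean

def per_hour_mean(records):
    """records: iterable of (hour, score) pairs for one test split."""
    buckets = defaultdict(list)
    for hour, score in records:
        buckets[hour].append(score)
    return {hour: mean(scores) for hour, scores in buckets.items()}

def average_over_splits(per_split):
    """Average per-hour scores over splits, using hours present in every split."""
    hours = set.intersection(*(set(d) for d in per_split))
    return {h: mean(d[h] for d in per_split) for h in sorted(hours)}

# Hypothetical per-image MCR values (percent), not taken from the experiments.
split_a = [(0, 6.0), (0, 8.0), (12, 2.0)]
split_b = [(0, 5.0), (12, 3.0), (12, 5.0)]
table = average_over_splits([per_hour_mean(split_a), per_hour_mean(split_b)])
print(table)  # {0: 6.0, 12: 3.0}
```

The same grouping works for any per-image attribute stored in the dataset metadata; only the key in each record changes.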

Figure 8: Performance analysis on different weather conditions. The x-axis shows the weather conditions present in the SkyFinder dataset and the y-axis shows the misclassification rate (MCR) averaged over the three test splits used in our experiment.

5.1.5 Performance analysis for weather conditions

Like [12], we aim to evaluate how well a semantic segmentation model trained as a sky classifier performs under different weather conditions. We split the test set according to the weather conditions provided in the SkyFinder metadata and evaluated RefineNet-Res101-SkyFinder on each subset. We ran the same experiment for all three test splits and then computed the average scores for both mIOU and MCR (see table 4 for quantitative results). Looking at the average mIOU scores, we find that our model struggles with a few weather conditions such as snow and sleet. On the other hand, our model performs well even when the weather is cloudy, partly cloudy, or foggy. Interestingly, the model performs very well during thunderstorms in terms of both mIOU and MCR.

The SkyFinder dataset has some images without weather labels; we evaluated our model on them and report the results. We observe that mIOU is lower than expected for clear sky. This is because a large share of the clear-sky images (1200+) come from camera 10917, where no sky is visible, which lowers the mIOU at test time for splits 1 and 2. The split 3 model was trained on such images and therefore learns to tell when there is no sky in the image (92.61% mIOU for clear sky). Fig. 5 shows mIOU scores averaged over all three splits for RefineNet-Res101-SkyFinder. Fig. 8 compares MCR scores by weather across methods, and RefineNet significantly outperforms the baselines. The MCR scores for the baseline methods are taken from the SkyFinder metadata. This figure reports the average MCR score over all test splits. Please see figure 9 for qualitative results.

Figure 9: Qualitative results for performance analysis w.r.t. weather on the SkyFinder dataset using RefineNet-Res101-SkyFinder. For each weather type, we show the input image, the ground truth (GT), and the prediction (Pred.), respectively. The first four rows show success cases (mIOU > 0.8) for each weather type, in the order: clear, cloudy, fog, hazy, misc, mostly cloudy, partly cloudy, rain, sleet, snow, tstorms, and unknown. The last row shows failure cases (mIOU < 0.5) where the weather type was tstorms, clear, and fog.

5.1.6 Performance analysis w.r.t transient attributes

Inspired by [12], we investigated how much transient attributes affect the performance of our trained model. We selected images from one of our test splits for four transient attributes: gloomy, clouds, cold, and night. For each attribute, we selected images with a high presence of the attribute (value above 0.8) and images with a low chance of its presence (value below 0.2). We then tested RefineNet-Res101-SkyFinder on these subsets of images (two per attribute: low and high) and report the performance.
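The selection of high-presence and low-presence subsets can be sketched as follows. This is a minimal illustration assuming each image carries a transient-attribute score in [0, 1]; the image IDs, the scores, and the function name `split_by_attribute` are hypothetical.

```python
def split_by_attribute(scores, high=0.8, low=0.2):
    """scores: dict mapping image id -> attribute score in [0, 1].

    Returns (present, absent): ids scoring above `high` and below `low`.
    Images with scores in between are left out of both subsets.
    """
    present = [k for k, v in scores.items() if v > high]
    absent = [k for k, v in scores.items() if v < low]
    return present, absent

# Hypothetical "night" scores for five images.
night = {"img_a": 0.95, "img_b": 0.10, "img_c": 0.50, "img_d": 0.81, "img_e": 0.05}
high_night, low_night = split_by_attribute(night)
print(high_night, low_night)  # ['img_a', 'img_d'] ['img_b', 'img_e']
```

Discarding the ambiguous middle range keeps the two subsets clearly separated, so a difference in MCR or mIOU between them can be attributed to the attribute with more confidence.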

As expected, the model performs well on images where the sky is not gloomy, cloudy, cold, or night time. The model is most robust to clouds and performs well (in terms of both mIOU and MCR) even when clouds are largely present in the image. In terms of MCR, the model's performance is worst for highly gloomy images (MCR = 10.84) and for night-time images (MCR = 10.25). In terms of mIOU, our trained model performs slightly better on gloomy images than on night-time images (see table 5, last row). We also report MCR scores from the baseline methods for comparison and find that Tighe performs worst among the baselines when these transient attributes are strongly present. Lu is the most robust of the baselines, but our method significantly outperforms all of them in this experiment (see fig. 10). Figure 11 shows the histogram of MCR scores when each of these four transient attributes is high and low in the input images. See figure 12 for qualitative results of this experiment.

Each cell gives mIOU / MCR in percent; "0.8" denotes high presence of the attribute and "0.2" low presence.
Method                 gloomy 0.8     gloomy 0.2     clouds 0.8     clouds 0.2     cold 0.8       cold 0.2       night 0.8      night 0.2
Hoiem                  - / 40.70      - / 19.70      - / 18.47      - / 18.39      - / 24.93      - / 19.23      - / 41.57      - / 18.46
Lu                     - / 36.42      - / 14.88      - / 12.74      - / 8.35       - / 16.44      - / 14.98      - / 36.60      - / 10.03
Tighe                  - / 50.53      - / 13.36      - / 56.21      - / 23.44      - / 50.65      - / 17.83      - / 48.14      - / 31.05
RNet-Res101-SkyFinder  84.16 / 10.84  92.97 / 1.94   93.09 / 4.35   93.65 / 2.93   90.84 / 5.35   92.19 / 2.90   83.26 / 10.25  94.36 / 2.72
Table 5: Performance analysis on transient attributes (high thresholded and low thresholded values).
Figure 10: Comparison of RefineNet-Res101-SkyFinder with baseline methods (Tighe et al., Hoiem et al., and Lu et al.) in terms of MCR given high values (0.8) and low values (0.2) of transient attributes (gloomy, clouds, cold, and night). Row 1 shows the comparison when each of the four attributes has a high value, i.e., above 0.8, whereas row 2 shows results when a very low value is observed for each attribute.
Figure 11: Frequency distribution of MCR with RefineNet-Res101-SkyFinder given transient attributes (gloomy, clouds, cold, and night) using low and high threshold values (0.2 and 0.8, respectively).

Figure 12: Qualitative results for absence (below a threshold, i.e., 0.2) and presence (above a threshold, i.e., 0.8) of transient attributes (gloomy, clouds, cold, and night) on the SkyFinder dataset when using the RefineNet-Res101-SkyFinder model. Each example shows the input image, the ground truth (GT), and the prediction. Row 1 shows selected examples for clouds with threshold 0.2, row 2 shows results for clouds using the high threshold 0.8, rows 3 and 4 show results for gloomy, rows 5 and 6 show results for cold, and the last two rows show results for night, following the same order as the first two rows, i.e., absence and presence of each transient attribute, respectively.

5.1.7 Performance analysis w.r.t noisy images

We are also interested in the impact of noise on the task of sky segmentation. We conducted this experiment because images can be noisy for various reasons in real settings as well. For this purpose, we selected 50 images per camera from one of the test splits (650 images in total) and added different types of noise to them: motion blur, Gaussian noise, Poisson noise, salt & pepper noise, and speckle noise. We first evaluated performance without adding any noise and then compared these results with those on the noisy images (see table 6). We find that our trained model is robust to Poisson noise and highly sensitive to salt & pepper noise. Motion blur also affects mIOU (a drop of approx. 3.5%). Similarly, Gaussian noise and speckle noise hurt performance (in terms of both mIOU and MCR). See figure 13 for sample images after adding different types of noise to the input images.
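The five degradations can be sketched with NumPy as below. This is only an illustration: the noise parameters (`sigma`, `amount`, `peak`, `length`) and the blur direction are hypothetical, since the settings used in the experiment are not stated, and images are assumed to be float arrays in [0, 1] with shape (height, width, channels).

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

def gaussian_noise(img, sigma=0.05):
    """Additive zero-mean Gaussian noise."""
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)

def poisson_noise(img, peak=255.0):
    """Signal-dependent Poisson (shot) noise at a given peak intensity."""
    return np.clip(rng.poisson(img * peak) / peak, 0.0, 1.0)

def salt_and_pepper(img, amount=0.05):
    """Set a random fraction of pixels to black (pepper) or white (salt)."""
    out = img.copy()
    u = rng.random(img.shape[:2])
    out[u < amount / 2] = 0.0
    out[u > 1 - amount / 2] = 1.0
    return out

def speckle_noise(img, sigma=0.1):
    """Multiplicative noise: img + img * n, with n zero-mean Gaussian."""
    return np.clip(img + img * rng.normal(0.0, sigma, img.shape), 0.0, 1.0)

def motion_blur(img, length=9):
    """Horizontal motion blur: average each pixel with its row neighbours."""
    kernel = np.ones(length) / length
    blur_row = lambda row: np.convolve(row, kernel, mode="same")
    return np.apply_along_axis(blur_row, 1, img)

# Toy mid-gray image standing in for a test image.
img = np.full((16, 32, 3), 0.5)
noisy = {fn.__name__: fn(img) for fn in
         (gaussian_noise, poisson_noise, salt_and_pepper, speckle_noise, motion_blur)}
```

Each degraded copy keeps the original shape and value range, so it can be fed to the same model and scored against the same ground-truth mask as the clean image.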

Metric   Original   Motion blur   Gaussian   Poisson   Salt & Pepper   Speckle
mIOU     89.92      86.31         85.26      88.67     81.77           84.70
MCR      5.08       6.65          7.60       5.73      9.32            7.91
Table 6: Performance on the SkyFinder dataset when the input image is noisy. Column 1 shows results on images without any added noise, whereas the remaining columns show results after adding motion blur, Gaussian noise, Poisson noise, salt & pepper noise, and speckle noise to the images, respectively.

Panels (left to right): Motion blur, Gaussian, Poisson, Salt & Pepper, Speckle
Figure 13: Example images for the different types of noise added to the selected subset of images from the SkyFinder dataset.
Columns (left to right): Orig., Gt., a, b, c
Figure 14: Segmentation improvement across the RefineNet-Res101-Cityscapes model (a), the RefineNet-Res101-SUN-FT model (b), and the RefineNet-Res101-SUN model (c).

5.2 SUN-Sky dataset

The SUN-Sky dataset is a subset of outdoor-scene images from the SUN dataset [23] in which sky is present. Following the approach we used for the SkyFinder dataset, we first evaluate the RefineNet-Res101-Cityscapes model on the SUN-Sky dataset, then fine-tune the same model on this dataset (referred to as RefineNet-Res101-SUNdb-FT), and lastly train RefineNet-Res101 on the SUN-Sky dataset from scratch (referred to as RefineNet-Res101-SUNdb). We randomly split the images into training, validation, and test sets in a 60%-20%-20% ratio. When we evaluate these three models on our test set, we find that the off-the-shelf model performs better on this dataset (mIOU = 61.69, MCR = 8.4) than on the SkyFinder dataset (mIOU = 48.5, MCR = 18.93), which suggests that SkyFinder is more challenging than SUN-Sky. Interestingly, the fine-tuned model outperforms the model trained solely on SUN-Sky by a slight margin (in terms of both mIOU and MCR), which is likely because the SUN-Sky dataset is not large enough. Overall, both the fine-tuned and the trained models clearly outperform the off-the-shelf RefineNet-Res101-Cityscapes model on this dataset (see Table 2).
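A random 60%-20%-20% split of this kind can be sketched as below. The function name `split_dataset`, the fixed seed, and the file names are hypothetical; the actual splitting procedure may differ in how it shuffles or rounds.

```python
import random

def split_dataset(items, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle `items` and split them into train/validation/test lists."""
    items = list(items)
    random.Random(seed).shuffle(items)   # seeded for a reproducible split
    n_train = round(len(items) * train_frac)
    n_val = round(len(items) * val_frac)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]       # the remainder goes to the test set
    return train, val, test

# Hypothetical file names standing in for SUN-Sky images.
names = [f"img_{i:04d}.jpg" for i in range(100)]
train, val, test = split_dataset(names)
print(len(train), len(val), len(test))
```

Fixing the seed keeps the three subsets identical across runs, so the off-the-shelf, fine-tuned, and trained models are compared on the same test images.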

5.3 Cross-dataset evaluation

To assess the generalization power of our models, we evaluated them across datasets, i.e., models trained on SkyFinder were evaluated on the SUN-Sky dataset and vice versa. Table 7 shows results for both the fine-tuned and the trained model on each dataset and on the other dataset. We find that on SkyFinder, likely due to its large size, training the model from scratch gives better performance than fine-tuning (mIOU of 83.00 vs. 79.00). Interestingly, our fine-tuned models perform better than the models trained from scratch when evaluated across datasets. In terms of MCR, the model fine-tuned on SkyFinder generalizes better than the model fine-tuned on SUN-Sky (6.67 vs. 11.74). Note that for the SkyFinder dataset, the results from all models are averaged over all test splits.

Model               SkyFinder              SUN-Sky
                    mIOU (%)   MCR (%)     mIOU (%)   MCR (%)
RNet-SkyFinder-FT   79.00      6.69        71.70*     6.67*
RNet-SkyFinder      83.00      5.89        71.57      7.24
RNet-SUNdb-FT       71.49*     11.74*      83.10      3.70
RNet-SUNdb          70.42      12.37       82.36      4.17
Table 7: Sky segmentation results on the two datasets, where each model is trained or fine-tuned on one dataset and tested on both. The best results on each dataset are obtained by a model trained on that dataset (RNet-SkyFinder on SkyFinder and RNet-SUNdb-FT on SUN-Sky). Entries marked with * are the best results for that dataset in cross-dataset evaluation. All numbers are percentages.
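For reference, the two metrics in the table can be computed for binary sky masks roughly as follows. This sketch assumes that mIOU is the mean intersection-over-union of the sky and not-sky classes, that MCR is the fraction of misclassified pixels, and that per-image scores are averaged over a dataset; `sky_metrics`, `evaluate`, and the toy threshold "model" are hypothetical names used only for illustration.

```python
import numpy as np

def sky_metrics(pred, gt):
    """Return (mIOU, MCR) in percent for boolean masks where True = sky."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    ious = []
    for cls in (True, False):            # sky, then not-sky
        p, g = pred == cls, gt == cls
        union = np.logical_or(p, g).sum()
        if union > 0:                    # skip a class absent from both masks
            ious.append(np.logical_and(p, g).sum() / union)
    miou = 100.0 * float(np.mean(ious))
    mcr = 100.0 * float(np.mean(pred != gt))
    return miou, mcr

def evaluate(predict, dataset):
    """Average per-image (mIOU, MCR) of `predict` over (image, gt) pairs."""
    scores = [sky_metrics(predict(image), gt) for image, gt in dataset]
    miou, mcr = np.mean(scores, axis=0)
    return float(miou), float(mcr)

# Toy example: one wrong pixel out of four.
gt = np.array([[1, 1], [0, 0]], dtype=bool)
pred = np.array([[1, 0], [0, 0]], dtype=bool)
miou, mcr = sky_metrics(pred, gt)

# Cross-dataset use: apply one predictor to a dataset it was not trained on.
brightness_model = lambda image: image > 0.5   # stand-in for a trained network
other_dataset = [(np.array([[0.9, 0.8], [0.1, 0.2]]), gt)]
cross_miou, cross_mcr = evaluate(brightness_model, other_dataset)
```

Running `evaluate` for every model on every dataset yields a grid of the same shape as Table 7, with the diagonal blocks giving within-dataset results and the off-diagonal blocks giving cross-dataset results.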

6 Discussion and Conclusion

We first direct our focus to the original results of RefineNet’s Res101 model trained on Cityscapes in Table 1. Both results reflect the pretrained model’s inability to properly handle non-ideal images, particularly night images, as addressed earlier. After fine-tuning the model on the SkyFinder dataset, we note a drastic improvement in mIOU and a decrease of just over 10% in MCR. This is a result of including the non-ideal images. After training the model on the SkyFinder dataset, we see a less drastic improvement in one split, but an improvement nonetheless.

Beyond the averages in Table 1, focusing on a single camera in one of the data splits reveals a drastic improvement. Camera 858 consists of 200 images and was not seen by the fine-tuned or ImageNet-initialized model during training. The model trained on Cityscapes achieved an mIOU of 69.21% and an MCR of 18.16% on this camera. After fine-tuning the model, the mIOU jumped to 91.41% and the MCR dropped to 4.69%. Finally, after training on only 28 cameras of the data (again, not including camera 858) and initializing from ImageNet, the results improved slightly further: the mIOU increased to 95.76% and the MCR dropped to 2.30%. Other cameras show similar rates of improvement. Some cameras, however, prove difficult to segment for all methods and show a smaller rate of improvement.

The importance of these results lies in the ability to use these models in real-world applications. Using the original pre-trained model would result in poor quality segmentation outside of the ideal circumstances. Off-the-shelf methods must be modified in order to be used most effectively in the real world.

To put these results in context, we also incorporate the results that Mihail et al. report for their baseline methods. Their own model, which achieved an MCR of 12.96% on their own testing split, was not evaluated on our test splits, so we do not report its MCR there. In spite of this, our results further support the idea that existing models remain effective, so long as they have been modified to suit the task’s needs.

We also demonstrated that, overall, even state-of-the-art models struggle with challenging conditions such as night time, variation in weather, and other transient attributes. Nevertheless, our trained models still perform much better than prior methods in terms of MCR.

Following this work, we intend to look at other scene parsing models: evaluate them off the shelf, fine-tune them, and possibly train them on the SkyFinder dataset in order to compare the results with those above. We also plan to develop our own end-to-end sky segmentation model for comparison. To improve general results, the dataset may need cleaning, such as removing timestamps. Other future directions include using sky segmentation for applications such as weather classification and weather forecasting. Code, our trained models, and data will be made available for further exploration of this area.

References


  • [1] W.-T. Chu, X.-Y. Zheng, and D.-S. Ding. Camera as weather sensor: Estimating weather information from single images. Journal of Visual Communication and Image Representation, 46:233–249, 2017.
  • [2] G. De Croon, C. De Wagter, B. Remes, and R. Ruijsink. Sky segmentation approach to obstacle avoidance. In Aerospace Conference, 2011 IEEE, pages 1–16. IEEE, 2011.
  • [3] S. Dev, Y. H. Lee, and S. Winkler. Color-based segmentation of sky/cloud images from ground-based cameras. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 10(1):231–242, 2017.
  • [4] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition, 2016.
  • [5] D. Hoiem, A. A. Efros, and M. Hebert. Geometric context from a single image, 2005.
  • [6] D. Hoiem, A. A. Efros, and M. Hebert. Geometric context from a single image. In Computer Vision, 2005. ICCV 2005. Tenth IEEE International Conference on, volume 1, pages 654–661. IEEE, 2005.
  • [7] P.-Y. Laffont, Z. Ren, X. Tao, C. Qian, and J. Hays. Transient attributes for high-level understanding and editing of outdoor scenes, 2014.
  • [8] G. Lin, A. Milan, C. Shen, and I. Reid. Refinenet: Multi-path refinement networks for high-resolution semantic segmentation, 2016.
  • [9] C. Liu, J. Yuen, and A. Torralba. Nonparametric scene parsing: Label transfer via dense scene alignment. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 1972–1979. IEEE, 2009.
  • [10] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation, 2015.
  • [11] C. Lu, D. Lin, J. Jia, and C. K. Tang. Two-class weather classification, 2014.
  • [12] R. P. Mihail, S. Workman, Z. Bessinger, and N. Jacobs. Sky segmentation in the wild: An empirical study. In Applications of Computer Vision (WACV), 2016 IEEE Winter Conference on, pages 1–6. IEEE, 2016.
  • [13] M. Roser and F. Moosmann. Classification of weather situations on single color images. In Intelligent Vehicles Symposium, 2008 IEEE, pages 798–803. IEEE, 2008.
  • [14] L. Tao, Y. Luo, and X. Zheng. Weather recognition based on images captured by vision system in vehicle, 2009.
  • [15] L. Tao, L. Yuan, and J. Sun. Skyfinder: attribute-based sky image search. In ACM Transactions on Graphics (TOG), volume 28, page 68. ACM, 2009.
  • [16] J. Tighe and S. Lazebnik. Superparsing: scalable nonparametric image parsing with superpixels. Computer Vision–ECCV 2010, pages 352–365, 2010.
  • [17] J. Tighe and S. Lazebnik. Superparsing: scalable nonparametric image parsing with superpixels, 2010.
  • [18] J. Tighe and S. Lazebnik. Finding things: Image parsing with regions and per-exemplar detectors. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3001–3008, 2013.
  • [19] J. Tighe, M. Niethammer, and S. Lazebnik. Scene parsing with object instances and occlusion ordering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3748–3755, 2014.
  • [20] Y.-H. Tsai, X. Shen, Z. Lin, K. Sunkavalli, and M.-H. Yang. Sky is not the limit: Semantic-aware sky replacement. ACM Transactions on Graphics (TOG), 35(4):149, 2016.
  • [21] W.-T. Chu, X.-Y. Zheng, and D.-S. Ding. Camera as weather sensor: Estimating weather information from single images. Journal of Visual Communication and Image Representation, 46(1):233–249, 2017.
  • [22] J. Xiao, J. Hays, K. Ehinger, A. Oliva, and A. Torralba. Sun database: Large-scale scene recognition from abbey to zoo, 2010.
  • [23] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In Computer vision and pattern recognition (CVPR), 2010 IEEE conference on, pages 3485–3492. IEEE, 2010.
  • [24] X. Yan, Y. Luo, and X. Zheng. Weather recognition based on images captured by vision system in vehicle. Advances in Neural Networks–ISNN 2009, pages 390–398, 2009.
  • [25] A. P. Yazdanpanah, E. E. Regentova, A. K. Mandava, T. Ahmad, and G. Bebis. Sky segmentation by fusing clustering with neural networks. In International Symposium on Visual Computing, pages 663–672. Springer, 2013.
  • [26] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network, 2017.