Bloom intensity corresponds to the number of flowers present in orchards during the early growing season. Climate and bloom intensity information are crucial to guide the processes of pruning and thinning, which directly impact fruit load, size, coloration, and taste [1, 2]. Accurate estimates of bloom intensity can also benefit packing houses, since early crop-load estimation greatly contributes to optimizing postharvest handling and storage processes.
Visual inspection is still the dominant approach for bloom intensity estimation in orchards, a technique that is time-consuming, labor-intensive and prone to errors. Since only a limited sample of trees is inspected, the extrapolation to the entire orchard relies heavily on the grower’s experience. Moreover, it does not provide information about the spatial variability in the orchard, although the benefits of precision agriculture practices are well known.
These limitations, added to the short-lived nature of flower appearance until petal fall, make an automated method highly desirable. Multiple automated computer vision systems have been proposed to solve this problem, but most of these methods rely on hand-engineered features, making their overall performance acceptable only under relatively controlled environments (e.g. at night with artificial illumination). Their applicability is in most cases species-specific and highly vulnerable to variations in lighting conditions and to occlusions by leaves, stems or other flowers.
In the last decade, deep learning approaches based on convolutional neural networks (CNNs) led to substantial improvements in the state of the art of many computer vision tasks. Recent works have adapted CNN architectures to agricultural applications such as fruit quantification, classification of crops, and plant identification from leaf vein patterns. To the best of our knowledge, our previous work on apple flower detection was the first to employ CNNs for flower detection. In that work, we combined superpixel-based region proposals with a classification network to detect apple flowers. Limitations of that approach are intrinsic to the inaccuracies of superpixel segmentation and the network architecture.
In the present work, we provide the following contributions for automated flower segmentation:
A novel technique for flower identification that is i) automated, ii) robust to clutter and changes in illumination, and iii) generalizable to multiple species. Using as a starting point a fully convolutional network (FCN) pre-trained on a large multi-class dataset, we describe an effective fine-tuning procedure that adapts this model for fine pixel-wise flower segmentation. Our final method evaluates high-resolution images, each covering a full tree, in seconds. Although the task comparison is not one-to-one, human workers may need minutes to count the number of flowers per tree.
A feasible procedure for evaluating high-resolution images with deep FCNs on commercial GPUs. Fully convolutional computations require GPU memory that grows with the number of image pixels, which quickly exceeds the capacity of commercial GPUs for high-resolution images. We employ an image partitioning mechanism with partially overlapping windows, which reduces artifacts introduced by artificial boundaries when evaluating disjoint image regions.
Release of an annotated dataset with pixel-accurate labels for flower segmentation on high-resolution images. We believe this can greatly benefit the community, since annotation is a very time-consuming yet critical task for both training and evaluation of segmentation models.
2 Related Work
Previous attempts at automating bloom intensity estimation were mostly based on color thresholding, such as the works described in [14, 15] and the work of Thorp and Dierig. Despite differences in terms of the color space used for analysis (e.g. HSL and RGB), all these methods fail when applied in uncontrolled environments. Apart from size filtering, no morphological feature is taken into account, such that thresholding parameters have to be adjusted in case of changes in illumination, camera position or flowering density. Even strategies using aerial multispectral images, such as that of Horton et al., rely solely on color information for image processing.
Our previous work employed the Clarifai CNN architecture to classify individual superpixels composing an image. That method substantially outperformed color-based approaches, especially in terms of generalization to datasets composed of different flower species and acquired in uncontrolled environments. However, existing superpixel algorithms rely solely on local context information, which is the main source of imprecision in scenarios where flowers and the surrounding background present similar colors.
While early attempts at autonomous fruit detection also relied on hand-engineered features (e.g. color, texture, shape), recent works have been exploring more advanced computer vision techniques. One example is the work of Hung et al., which combines sparse autoencoders and support vector machines (SVM) for segmenting leaves, almonds, trunks, ground and sky. The approaches described by Bargoti and Underwood and by Chen et al. for fruit detection share some similarities with our method for flower segmentation. Bargoti and Underwood introduce a Faster R-CNN trained for the detection of mangoes, almonds and apples on trees. The method introduced by Chen et al. for counting apples and oranges employs a fully convolutional network (FCN) to perform fruit segmentation and a convolutional network to estimate fruit count.
End-to-end fully convolutional networks have been replacing traditional fully connected architectures for image segmentation tasks. Conventional architectures such as the AlexNet and VGG networks are very effective for image classification but provide coarse outputs for image segmentation tasks. This is a consequence of the image downsampling introduced by the max-pooling and striding operations performed by these networks, which allow the extraction of learned hierarchical features at the cost of pixel-level precision.
Different strategies have been proposed to alleviate the effects of downsampling, including the use of deconvolution layers [21, 25], and encoder-decoder architectures with skip layer connections [26, 27]. The DeepLab model is one of the most successful approaches for semantic image segmentation using deep learning. By combining the ResNet-101 model with atrous convolutions and spatial pyramid pooling, it significantly reduces the downsampling rate and achieves state-of-the-art performance on challenging semantic segmentation datasets such as PASCAL VOC and COCO.
In addition to the changes in CNN architecture, the authors of DeepLab also employ a dense CRF model to produce fine-grained segmentations. Although it provides visually appealing segmentations, this refinement model relies on parameters that have to be optimized by means of supervised grid-search. In previous work, we introduced a generic post-processing module that can be coupled to the output of any CNN to refine segmentations without the need for dataset-specific tuning. Called region growing refinement (RGR), this algorithm uses the score maps available from the CNN to divide the image into three regions: high-confidence background, high-confidence object, and uncertainty. By means of appearance-based region growing, pixels within the uncertainty region are classified based on initial seeds randomly sampled from the high-confidence regions.
3 Our Approach
In this section, we first describe the pre-training and fine-tuning procedures carried out to obtain a CNN highly sensitive to flowers. Subsequently, we describe the sequence of operations that our pipeline performs to segment flowers in an image.
3.1 Network training
One of the largest datasets available for semantic segmentation, the COCO dataset was recently augmented by Caesar et al. into the COCO-Stuff dataset. This dataset includes pixel-level annotations of classes such as grass, leaves, tree and flowers, which are relevant for our application. In the same work, the authors also discuss the performance of modern semantic segmentation methods on COCO-Stuff, with a DeepLab-based model outperforming the standard FCN. Thus, we opted for the publicly available DeepLab-ResNet model pre-trained on the COCO-Stuff dataset as the starting point for our pipeline. Rather than fine-tuning the dense CRF model used in the original DeepLab work, we opt for the generic RGR algorithm as a post-processing module to obtain fine-grained segmentations.
The base model was originally designed for segmentation within the COCO-Stuff classes. To adapt its architecture for our binary flower segmentation task, we perform procedures known as network surgery and fine-tuning. The surgery procedure is analogous to the pruning of undesired branches in trees: out of the original classification branches, we preserve only the weights and connections responsible for the segmentation of classes of interest.
We first considered an architecture preserving only the flower classification branch, followed by a sigmoid classification unit. However, without the normalization induced by the model’s original softmax layer, the scores generated by the transferred flower branch are unbounded and the final sigmoid easily saturates. To alleviate the learning difficulties caused by such a poor initialization, we opted for tuning a model with two branches, under the hypothesis that a second branch would allow the network to learn a background representation that properly normalizes the predictions generated by the foreground (flower) branch.
We have observed experimentally that nearby leaves represent one of the main sources of misclassification for flower segmentation. Moreover, predictions for the class leaf presented the highest activations when applying the pre-trained model to our training dataset. For these reasons, we opt for this branch together with the one associated with flowers to initialize our two-branch flower segmentation network.
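The surgery step can be pictured as selecting, from the per-class parameters of the final classifier, only the entries of the classes to keep. The sketch below is an illustration under stated assumptions, not the procedure used with the actual model: it treats the classifier as one weight vector and one bias per class, and the class list, the parameter values, and the function name `keep_branches` are hypothetical.

```python
from typing import List, Sequence, Tuple


def keep_branches(
    weights: Sequence[Sequence[float]],
    biases: Sequence[float],
    class_names: Sequence[str],
    keep: Sequence[str],
) -> Tuple[List[List[float]], List[float]]:
    """Return the weights and biases of the classes listed in `keep`, in that order."""
    indices = [class_names.index(name) for name in keep]
    kept_weights = [list(weights[i]) for i in indices]
    kept_biases = [biases[i] for i in indices]
    return kept_weights, kept_biases


# Hypothetical 4-class classifier with 2 input features per class.
CLASS_NAMES = ["grass", "leaves", "tree", "flower"]
WEIGHTS = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
BIASES = [0.0, 0.1, 0.2, 0.3]

# Two-branch initialization: foreground from "flower", background from "leaves".
two_branch_weights, two_branch_biases = keep_branches(
    WEIGHTS, BIASES, CLASS_NAMES, keep=["flower", "leaves"]
)
```

In a real network the same selection would be applied along the output-channel dimension of the last convolutional layer.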
The adapted architecture was then fine-tuned using the training set described in Section 4, which contains images of apple trees. For our experiments, the procedure was carried out using the Caffe framework, with an initial learning rate that decays polynomially as a function of the iteration number. Aiming at scale robustness, our fine-tuning procedure employs the same strategy used for model pre-training, where each training portrait is evaluated at five different scales of its original resolution.
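As an illustration of a polynomially decaying learning rate, the sketch below assumes that the schedule follows Caffe's built-in "poly" policy, base_lr * (1 - iter / max_iter) ^ power. The base learning rate, the number of iterations, and the exponent are placeholders chosen for the example, not the settings used in our experiments.

```python
def poly_learning_rate(
    base_lr: float, iteration: int, max_iter: int, power: float = 0.9
) -> float:
    """Learning rate under a "poly" policy: base_lr * (1 - iteration / max_iter) ** power."""
    if not 0 <= iteration <= max_iter:
        raise ValueError("iteration must lie in [0, max_iter]")
    return base_lr * (1.0 - iteration / max_iter) ** power


# Placeholder values for illustration only.
BASE_LR = 1e-3
MAX_ITER = 10_000

# Learning rate at the start, halfway through, and at the end of training.
schedule = [poly_learning_rate(BASE_LR, it, MAX_ITER) for it in (0, 5_000, 10_000)]
```

The rate starts at the base value and reaches zero at the last iteration, decaying almost linearly when the exponent is close to one.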
While the validation set has pixel-accurate annotations obtained using the procedure described in Section 4, the training set was annotated using the less precise but quicker superpixel-based procedure described in our previous work. Only a small fraction of the total image area in this dataset contains flowers. To compensate for this imbalance, we augmented portraits containing flowers by mirroring them with respect to the vertical and horizontal axes. Following the original network parameterization, the training images were split into fixed-size portraits before augmentation.
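The augmentation described above can be sketched as follows. This is a minimal illustration on nested lists, assuming that each mirrored copy is added as a separate training portrait and that a portrait "contains flowers" when its mask has at least one positive pixel; the function names are hypothetical.

```python
from typing import List, Tuple

Grid = List[List[int]]


def mirror_vertical_axis(grid: Grid) -> Grid:
    """Mirror with respect to the vertical axis (left-right flip)."""
    return [row[::-1] for row in grid]


def mirror_horizontal_axis(grid: Grid) -> Grid:
    """Mirror with respect to the horizontal axis (top-bottom flip)."""
    return [list(row) for row in grid[::-1]]


def augment_if_flowers(image: Grid, mask: Grid) -> List[Tuple[Grid, Grid]]:
    """Return the original pair, plus two mirrored pairs when the mask has flower pixels."""
    pairs = [(image, mask)]
    if any(any(row) for row in mask):
        pairs.append((mirror_vertical_axis(image), mirror_vertical_axis(mask)))
        pairs.append((mirror_horizontal_axis(image), mirror_horizontal_axis(mask)))
    return pairs


# Toy 2x2 portrait: one flower pixel in the top-right corner.
toy_image = [[1, 2], [3, 4]]
toy_mask = [[0, 1], [0, 0]]
augmented = augment_if_flowers(toy_image, toy_mask)
```

The image and its mask are always flipped together, so the labels stay aligned with the pixels they describe.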
3.2 Segmentation pipeline
The method we propose for fruit flower segmentation consists of three main operations: 1) divide a high resolution image into smaller patches, in a sliding window manner; 2) evaluate each patch using our fine-tuned CNN; 3) apply the refinement algorithm on the obtained scoremaps to compute the final segmentation mask. These steps are described in detail below. In our description, we make reference to Algorithm 1 and Figure 1.
1) Step 1 - Sliding window: As mentioned above, the adopted CNN architecture either crops or resizes input images to fixed-size portraits. Since our datasets are composed of high-resolution images of different sizes (see Section 4), we emulate a sliding window approach to avoid resampling artifacts. More specifically, we split each input image into a set of fixed-size portraits. Cropping non-overlapping portraits from the original image introduces artificial boundaries that compromise the detection quality. For this reason, in our approach each portrait overlaps a fixed percentage of the area of each immediate neighbor. When the scoremaps are fused, the results corresponding to the overlapping pixels are discarded. Figure 2 illustrates this process for a pair of subsequent portraits. The scores obtained for each portrait are depicted as a heatmap, where blue is associated with lower scores and red with higher scores.
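One way to generate overlapping portraits is to tile the image with disjoint core regions and extend each core by a padding margin on every side. The sketch below follows this idea under simplifying assumptions: the core size and padding are hypothetical parameters, and portraits at the image border are clipped, so they are smaller than interior portraits rather than having a fixed size.

```python
from typing import List, Tuple

AxisWindow = Tuple[int, int, int, int]  # (start, stop, core_start, core_stop)


def axis_windows(length: int, core: int, pad: int) -> List[AxisWindow]:
    """Split one image axis into padded windows whose cores tile the axis exactly once."""
    windows = []
    for core_start in range(0, length, core):
        core_stop = min(core_start + core, length)
        start = max(core_start - pad, 0)      # extend to the left/top, clipped
        stop = min(core_stop + pad, length)   # extend to the right/bottom, clipped
        windows.append((start, stop, core_start, core_stop))
    return windows


def portrait_windows(
    height: int, width: int, core: int, pad: int
) -> List[Tuple[AxisWindow, AxisWindow]]:
    """Return (row_window, col_window) pairs covering the whole image."""
    rows = axis_windows(height, core, pad)
    cols = axis_windows(width, core, pad)
    return [(r, c) for r in rows for c in cols]


# Toy example: a 10x10 image, cores of 4 pixels, 1 pixel of padding.
example_axis = axis_windows(10, 4, 1)
example_grid = portrait_windows(10, 10, 4, 1)
```

Each pixel belongs to exactly one core, while neighboring portraits share the padded margins that give the network context near the artificial boundaries.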
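Table 1: Dataset acquisition conditions.
|Dataset||Weather||Background unit||Camera||Camera support|
|AppleA||Sunny||No||Canon EOS 60D||Hand-held|
|AppleB||Sunny||Yes||GoPro HERO5||Utility vehicle|
Tables 2 and 3: Statistics of the HSV color components (columns: H, S [%], V [%]).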
2) Step 2 - CNN prediction: We evaluate each portrait in parallel with our fine-tuned network for flower identification. The CNN is equivalent to a function that maps each input portrait into two pixel-dense scoremaps: the first represents the pixel-wise likelihood that pixels in the portrait belong to the foreground (i.e., flower), while the second corresponds to the pixel-wise background likelihood. The heatmaps in Figures 3(a) and (b) are examples of scoremaps computed for a given portrait.
3) Step 3 - Fusion and refinement: After evaluating each portrait, we generate two global scoremaps, one for the foreground and one for the background, by combining the predictions obtained for all portraits. After discarding its padding pixels, the scores of each portrait are copied to the pixel coordinates that the portrait occupies in the input image, such that both global scoremaps have the same resolution as the input image. As illustrated in Figure 2, the padded areas of each portrait (outside the red box) are discarded during fusion. For every pixel in the image, a single prediction score is obtained from exactly one portrait, such that artifacts introduced by artificial boundaries are avoided.
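The fusion step can be sketched as copying the unpadded part of every portrait into a global scoremap. The code below is a simplified illustration on nested lists; the tuple layout used to describe a portrait (scores, top-left corner, and core box in image coordinates) is an assumption made for this example, and the function would be called once per class.

```python
from typing import List, Sequence, Tuple

ScoreMap = List[List[float]]
# (scores, top, left, (core_row_start, core_row_stop, core_col_start, core_col_stop))
Portrait = Tuple[ScoreMap, int, int, Tuple[int, int, int, int]]


def fuse_scoremaps(height: int, width: int, portraits: Sequence[Portrait]) -> ScoreMap:
    """Build a global scoremap, keeping only the core (unpadded) scores of each portrait."""
    fused = [[0.0] * width for _ in range(height)]
    for scores, top, left, (r0, r1, c0, c1) in portraits:
        for r in range(r0, r1):
            for c in range(c0, c1):
                # Convert image coordinates to coordinates inside the portrait.
                fused[r][c] = scores[r - top][c - left]
    return fused


# Toy 1x6 image covered by two portraits of width 4 that overlap on columns 2-3.
# The value 9.0 marks scores computed on padding pixels, which must be discarded.
left_portrait = ([[1.0, 1.0, 1.0, 9.0]], 0, 0, (0, 1, 0, 3))
right_portrait = ([[9.0, 2.0, 2.0, 2.0]], 0, 2, (0, 1, 3, 6))
fused_example = fuse_scoremaps(1, 6, [left_portrait, right_portrait])
```

Because the core boxes are disjoint and cover the image, every pixel of the fused map receives its score from exactly one portrait.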
After fusion, the foreground and background scoremaps are normalized using a softmax function applied at each pixel of the input image. With this formulation, the normalized foreground and background scores of each pixel add to one, i.e. they correspond to the probability that the pixel belongs to the corresponding class.
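For two classes, the softmax normalization at a pixel with foreground score s_f and background score s_b yields exp(s_f) / (exp(s_f) + exp(s_b)) for the foreground and the complementary value for the background. The sketch below applies this pixel by pixel to nested lists; the function names are illustrative, and subtracting the larger score before exponentiation is a standard numerical safeguard added here, not a detail taken from our implementation.

```python
import math
from typing import List, Tuple

ScoreMap = List[List[float]]


def softmax_pair(score_fg: float, score_bg: float) -> Tuple[float, float]:
    """Two-class softmax; subtracting the maximum avoids overflow in exp()."""
    largest = max(score_fg, score_bg)
    exp_fg = math.exp(score_fg - largest)
    exp_bg = math.exp(score_bg - largest)
    total = exp_fg + exp_bg
    return exp_fg / total, exp_bg / total


def normalize_scoremaps(fg: ScoreMap, bg: ScoreMap) -> Tuple[ScoreMap, ScoreMap]:
    """Convert raw foreground/background scores into per-pixel probabilities."""
    prob_fg, prob_bg = [], []
    for fg_row, bg_row in zip(fg, bg):
        pairs = [softmax_pair(f, b) for f, b in zip(fg_row, bg_row)]
        prob_fg.append([p[0] for p in pairs])
        prob_bg.append([p[1] for p in pairs])
    return prob_fg, prob_bg


# Toy 1x3 scoremaps with arbitrary raw scores.
prob_fg, prob_bg = normalize_scoremaps([[2.0, 0.0, -1.0]], [[0.0, 0.0, 3.0]])
```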
As Figure 3(c) shows, the predictions obtained directly from the CNN are coarse in terms of adherence to actual flower boundaries. Therefore, rather than directly thresholding the normalized foreground scoremap, this scoremap and the input image are fed to the RGR refinement module. For our application, the refinement algorithm relies on two high-confidence classification regions, one for the background and one for the foreground, which are defined by comparing the foreground scores against a high-confidence background threshold and a high-confidence foreground threshold, respectively. Using the high-confidence regions as starting points, the RGR algorithm performs multiple Monte Carlo region growing steps that group similar pixels into clusters. Afterwards, it performs majority voting to classify each cluster according to the presence of flowers. Each pixel within a cluster contributes a positive vote if its score is larger than a classification threshold. As detailed in Section 5, this parameter can be empirically tuned according to the dataset under consideration. Based on a grid-search optimization on our training dataset, we selected a single value of this threshold for all our experiments and kept the two high-confidence thresholds fixed.
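The two thresholding operations involved in the refinement can be sketched as below. This is not an implementation of RGR: the region growing itself is omitted, the inequality directions are assumptions (background where the foreground probability falls below the background threshold, foreground where it exceeds the foreground threshold), a simple majority is assumed for the vote, and the threshold values are placeholders.

```python
from typing import List, Sequence

BACKGROUND, UNCERTAIN, FOREGROUND = 0, 1, 2


def confidence_regions(
    prob_fg: Sequence[Sequence[float]], tau_b: float, tau_f: float
) -> List[List[int]]:
    """Label each pixel as high-confidence background, high-confidence foreground, or uncertain."""
    regions = []
    for row in prob_fg:
        labels = []
        for p in row:
            if p < tau_b:
                labels.append(BACKGROUND)
            elif p > tau_f:
                labels.append(FOREGROUND)
            else:
                labels.append(UNCERTAIN)
        regions.append(labels)
    return regions


def cluster_is_flower(cluster_scores: Sequence[float], tau_0: float) -> bool:
    """Majority vote: a pixel votes positively when its score exceeds tau_0."""
    positive_votes = sum(1 for score in cluster_scores if score > tau_0)
    return positive_votes > len(cluster_scores) / 2


# Placeholder thresholds, for illustration only.
regions_example = confidence_regions([[0.05, 0.5, 0.95]], tau_b=0.1, tau_f=0.9)
vote_example = cluster_is_flower([0.9, 0.8, 0.2], tau_0=0.5)
```

Raising the voting threshold makes clusters harder to accept as flowers, which trades recall for precision.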
4 Datasets
We evaluate our method on four datasets that we created and made publicly available: AppleA, AppleB, Peach, and Pear. As summarized in Table 1, images from different fruit flower species were collected in diverse uncontrolled environments and under different angles of capture.
Both the AppleA and AppleB datasets are composed of images of apple trees, which were collected in a USDA orchard on a sunny day. In both datasets, the trees are supported with trellises and planted in rows. AppleA is a collection of images acquired using a hand-held camera. From this collection, we randomly selected a subset of images to build the training set used to train the CNN. The testing set for which we report results in Section 5 was randomly selected from the remaining images.
This dataset contains flowers that vary greatly in terms of size, clutter, and occlusion by leaves and branches. On average, flowers compose only a small fraction of the total image area within this dataset, which is otherwise largely occupied by leaves.
Differently from AppleA, for the AppleB dataset, a utility vehicle equipped with a background unit was used for imaging, such that trees in other rows are not visible in the images. Figure 4 illustrates the utility vehicle used for image acquisition, and Figures 6 and 7 illustrate the differences between datasets AppleA and AppleB.
The Peach and Pear datasets differ both in terms of species and acquisition conditions, therefore representing adequate scenarios for evaluating the generalization capabilities of the proposed method. Both datasets contain images acquired on an overcast day and without a background unit. Compared to the AppleA dataset, images composing these datasets present significantly lower saturation and value means. Tables 2 and 3 summarize the differences among datasets in terms of the means and interquartile ranges of the HSV color components.
Regarding the flower characteristics, apple blossoms are typically white, with hue components spread across the whole spectrum (high interquartile range) and low saturation mean. Flowers composing the AppleB dataset present higher brightness, while peach flowers show a pink hue, with higher saturation and lower value means. Moreover, pear flowers are slightly different in terms of color (greener) and morphology, as illustrated in Figure 9.
Image annotation for segmentation tasks is a laborious and time-consuming activity. Labels must be accurate at pixel-level, otherwise both supervised training and the evaluation of segmentation techniques are compromised. Most existing annotation tools rely on approximating segmentations as polygons, which provide ground truth images that frequently lack accurate adherence to real object boundaries.
We opted for a labeling procedure that combines freehand annotations and RGR refinement. Using a tablet, the user draws traces on regions of the image that contain flowers, indicating as well hard negative examples when necessary. These traces indicate high-confidence segmentation points, which are used as reference by RGR to segment the remaining parts of the image. Figure 5 shows an example of a ground truth segmentation obtained using this procedure. We will make the annotation tool publicly available as future work.
5 Experiments and Results
We aim at a method capable of accurate multi-species flower detection, regardless of image acquisition conditions and without the need for dataset-specific training or pre-processing. To verify that our method satisfies all these requirements, we performed experiments on the four different datasets described in Section 4 while only using the AppleA dataset for training.
We adopt as the main baseline our previous model, which substantially outperformed existing methods by employing the Clarifai CNN architecture to classify individual superpixels. We therefore refer to that model as Sppx+Clarifai and to our new method as DeepLab+RGR. We also compare our results against an HSV-based method that segments images based only on HSV color information and size filtering according to threshold values optimized using grid-search.
All three methods were tuned using the AppleA training dataset, with differences in the pipeline for transfer learning. For the three unseen datasets, Sppx+Clarifai relies on a pre-processing step that enhances contrast and removes the different backgrounds present in the images. Our new method DeepLab+RGR does not require any pre-processing. Instead, it employs the same pipeline regardless of the dataset, requiring only adjustments in portrait size. As summarized in Table 1, images composing the AppleA dataset have a higher resolution than images in the other three datasets. Thus, we split images in these datasets into portraits of a different size than the portraits used for AppleA.
The quantitative analysis of segmentation accuracy relies on precision, recall, and intersection-over-union (IoU) metrics computed at pixel-level, instead of the superpixel-wise metrics used in our previous work. Table 4 summarizes the results obtained by each method on the different datasets.
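For reference, the pixel-level metrics follow the usual definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), and IoU = TP / (TP + FP + FN), where TP, FP and FN count true positive, false positive and false negative pixels. The sketch below computes them for binary masks stored as nested lists; returning zero when a denominator is empty is a convention chosen for this example.

```python
from typing import Sequence, Tuple

Mask = Sequence[Sequence[int]]


def pixel_metrics(prediction: Mask, ground_truth: Mask) -> Tuple[float, float, float]:
    """Return (precision, recall, IoU) computed over all pixels of two binary masks."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(prediction, ground_truth):
        for pred, truth in zip(pred_row, truth_row):
            if pred and truth:
                tp += 1
            elif pred and not truth:
                fp += 1
            elif truth and not pred:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return precision, recall, iou


# Toy 2x3 masks: one true positive, one false positive, one false negative.
predicted_mask = [[1, 1, 0], [0, 0, 0]]
annotated_mask = [[1, 0, 1], [0, 0, 0]]
precision, recall, iou = pixel_metrics(predicted_mask, annotated_mask)
```

IoU penalizes both kinds of error at once, which is why it is never larger than either precision or recall.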
Our new model outperforms the baseline methods for all datasets evaluated, especially in terms of generalization to unseen datasets. By combining a deeper CNN architecture and the RGR refinement module, DeepLab+RGR improves both precision and recall rates on the AppleA validation set. Figure 6 provides a qualitative example of flower detection accuracy in this dataset.
As Figure 7 illustrates, images composing the AppleB dataset present a higher number of flower buds and illumination changes, especially in terms of sunlight reflection by leaves. Despite the larger variance in comparison to the previous dataset, DeepLab+RGR maintains a high level of performance on this dataset (Table 4).
Results obtained for the Peach dataset demonstrate the limitation of color-based methods and two important generalization characteristics of our model. The HSV-based method is incapable of detecting peach flowers, since their pink color is very different from the white apple blossoms used for training. Our method, on the other hand, remains accurate, indicating that it can properly detect even flowers that differ to a great extent from apple flowers in terms of color. Moreover, images composing this dataset are characterized by a cloudy sky and hence poorer illumination. Most false negatives correspond to flower buds, due to the lack of such examples in the training dataset. As illustrated in Figure 8, poor superpixel segmentation leads the Sppx+Clarifai approach to incorrectly classify parts of the sky as flowers. This problem is overcome by our new model, which greatly increases precision.
Furthermore, the high recall rate provided by DeepLab+RGR in the Pear dataset demonstrates its robustness to slight variations in both flower morphology and color. As shown in Figure 9, similar to the Peach dataset, these images also present a cloudy background. In addition, their background is characterized by a high level of clutter caused by the presence of a large number of branches. These high texture components compromise the background removal model used by Sppx+Clarifai. Still, the DeepLab+RGR method provides a very accurate detection of flowers, with high precision.
The results obtained by our method for the AppleB, Peach and Pear datasets can be further improved by adjusting the threshold used for final classification and refinement. As summarized in Figure 10, increasing this threshold improves the performance on AppleB, bringing both recall and precision to similar levels. For the Peach dataset, decreasing it increases the recall rate. Such an adjustment can be carried out quickly through a simple interactive procedure, where the threshold is chosen according to its visual impact on the segmentation of a single image.
In terms of inference time, the current implementation of our algorithm, running on an Intel Xeon™ CPU E5-2620 v3 @ 2.40GHz (62GB) with a Quadro P6000 GPU, evaluates each high-resolution image composing our datasets in seconds. Part of this time is spent saving portraits as individual files and loading their corresponding prediction scores, a process that can be simplified by generating portraits directly within the neural network framework.
6 Conclusion
We have presented a novel automated approach for flower detection, which exploits state-of-the-art deep learning techniques for semantic image segmentation. The applicability of our method was demonstrated by its high flower segmentation accuracy across datasets that vary in terms of illumination conditions, background composition, image resolution, flower density and flower species. Without any supervised fine-tuning or image pre-processing, our model trained using only images of apple flowers succeeded in generalizing to peach and pear flowers, which are noticeably different in terms of color and morphology.
In the future, we intend to further improve the generalization capabilities of our model by training and evaluating it on multi-species flower datasets. We ultimately aim at a completely autonomous system capable of online bloom intensity estimation. The current implementation of our model can evaluate high-resolution images of complete trees an order of magnitude faster than human workers. While in this work we are not creating maps of flowers at the block level, this method will scale well for precision agricultural applications such as predicting thinning spray treatments and timing.
-  C. Forshey, “Chemical fruit thinning of apples,” New York State Agricultural Experiment Station, 1986.
-  H. Link, “Significance of flower and fruit thinning on fruit quality,” Plant growth regulation, vol. 31, no. 1-2, pp. 17–26, 2000.
-  A. Gongal, A. Silwal, S. Amatya, M. Karkee, Q. Zhang, and K. Lewis, “Apple crop-load estimation with over-the-row machine vision system,” Computers and Electronics in Agriculture, vol. 120, pp. 26–35, 2016.
-  N. Zhang, M. Wang, and N. Wang, “Precision agriculture – a worldwide overview,” Computers and Electronics in Agriculture, vol. 36, no. 2-3, pp. 113–132, 2002.
-  K. Kapach, E. Barnea, R. Mairon, Y. Edan, and O. Ben-Shahar, “Computer vision for fruit harvesting robots–state of the art and challenges ahead,” International Journal of Computational Vision and Robotics, vol. 3, no. 1-2, pp. 4–34, 2012.
-  A. Gongal, S. Amatya, M. Karkee, Q. Zhang, and K. Lewis, “Sensors and systems for fruit detection and localization: A review,” Computers and Electronics in Agriculture, vol. 116, pp. 8–19, 2015.
-  Y. Guo, Y. Liu, A. Oerlemans, S. Lao, S. Wu, and M. S. Lew, “Deep learning for visual understanding: A review,” Neurocomputing, vol. 187, pp. 27–48, 2016.
-  S. W. Chen, S. S. Shivakumar, S. Dcunha, J. Das, E. Okon, C. Qu, C. J. Taylor, and V. Kumar, “Counting apples and oranges with deep learning: a data-driven approach,” IEEE Robotics and Automation Letters, vol. 2, no. 2, pp. 781–788, 2017.
-  M. Dyrmann, A. K. Mortensen, H. S. Midtiby, and R. N. Jørgensen, “Pixel-wise classification of weeds and crops in images by using a fully convolutional neural network,” in Proceedings of the International Conference on Agricultural Engineering, Aarhus, Denmark, 2016, pp. 26–29.
-  G. L. Grinblat, L. C. Uzal, M. G. Larese, and P. M. Granitto, “Deep learning for plant identification using vein morphological patterns,” Computers and Electronics in Agriculture, vol. 127, pp. 418–424, 2016.
-  P. A. Dias, A. Tabb, and H. Medeiros, “Apple flower detection using deep convolutional networks,” Computers in Industry, vol. 99, pp. 17–28, Aug. 2018.
-  L. C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PP, no. 99, pp. 1–1, 2018.
-  P. A. Dias, A. Tabb, and H. Medeiros, “Data from: Multi-species fruit flower detection using a refined semantic segmentation network,” 2018. [Online]. Available: http://dx.doi.org/10.15482/USDA.ADC/1423466
-  A. D. Aggelopoulou, D. Bochtis, S. Fountas, K. C. Swain, T. A. Gemtos, and G. D. Nanos, “Yield prediction in apple orchards based on image processing,” Precision Agriculture, vol. 12, no. 3, pp. 448–456, 2011.
-  M. Hočevar, B. Širok, T. Godeša, and M. Stopar, “Flowering estimation in apple orchards by image analysis,” Precision Agriculture, vol. 15, no. 4, pp. 466–478, 2014.
-  K. R. Thorp and D. A. Dierig, “Color image segmentation approach to monitor flowering in lesquerella,” Industrial Crops and Products, vol. 34, no. 1, pp. 1150–1159, 2011.
-  R. Horton, E. Cano, D. Bulanon, and E. Fallahi, “Peach Flower Monitoring Using Aerial Multispectral Imaging,” 2016 ASABE International Meeting, 2016.
-  M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in European Conference on Computer Vision. Springer, 2014, pp. 818–833.
-  C. Hung, J. Nieto, Z. Taylor, J. Underwood, and S. Sukkarieh, “Orchard fruit segmentation using multi-spectral feature learning,” IEEE International Conference on Intelligent Robots and Systems, pp. 5314–5320, 2013.
-  S. Bargoti and J. P. Underwood, “Image Segmentation for Fruit Detection and Yield Estimation in Apple Orchards,” Journal of Field Robotics, vol. 34, no. 6, pp. 1039–1060, 2017.
-  J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 07-12-June, 2015, pp. 3431–3440.
-  A. Garcia-Garcia, S. Orts-Escolano, S. Oprea, V. Villena-Martinez, P. Martinez-Gonzalez, and J. Garcia-Rodriguez, “A survey on deep learning techniques for image and video semantic segmentation,” Applied Soft Computing, vol. 70, pp. 41–65, 2018.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” Advances In Neural Information Processing Systems, pp. 1–9, 2012.
-  K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” CoRR (Presented at International Conference on Learning Representations, 2015), vol. abs/1409.1556, 2014.
-  H. Noh, S. Hong, and B. Han, “Learning deconvolution network for semantic segmentation,” in Proceedings of the IEEE International Conference on Computer Vision, vol. 2015 Inter, 2015, pp. 1520–1528.
-  V. Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet: A deep convolutional encoder-decoder architecture for scene segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
-  B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik, “Hypercolumns for object segmentation and fine-grained localization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 07-12-June, 2015, pp. 447–456.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
-  M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The PASCAL Visual Object Classes challenge: A retrospective,” International journal of computer vision, vol. 111, no. 1, pp. 98–136, 2015.
-  T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects in context,” in European Conference on Computer Vision. Springer, 2014, pp. 740–755.
-  P. Krähenbühl and V. Koltun, “Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials,” in Advances in neural information processing systems, 2011, pp. 109–117.
-  P. A. Dias and H. Medeiros, “Semantic Segmentation Refinement by Monte Carlo Region Growing of High Confidence Detections,” in ArXiv, 2018. [Online]. Available: https://arxiv.org/abs/1802.07789
-  H. Caesar, J. Uijlings, and V. Ferrari, “COCO-Stuff: Thing and Stuff Classes in Context,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
-  R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587.
-  Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in Proceedings of the 22nd ACM International Conference on Multimedia. New York, NY, USA: ACM, 2014, pp. 675–678.