Material recognition is an inherently challenging problem, primarily due to the large variation in appearance between different instances of a given material and between different materials. There has been significant recent progress in terms of accuracy on benchmark datasets. Most methods proposed to achieve such results, however, essentially treat material recognition as object recognition with different categories. They often use large image patches that cover parts or whole objects and scenes, as one would when performing object recognition, which inevitably mix visual cues of materials and other image context, mainly of objects.
Material recognition is fundamentally different from object recognition. Adelson alludes to this difference in his discussion of “things” vs. “stuff”. While materials underlie both “things” and “stuff,” the key difference between them highlights the critical difference between objects and materials. Similar to “stuff”, and unlike “things” (objects), materials may not necessarily be recognized by having a particular shape. A cup, an object with a typical cylindrical shape, is often made of ceramic, a material. The fact that an object is a cup can be used as a cue to recognize the material as ceramic, and likewise the presence of ceramic may suggest that an object might be a cup. “Cup” and “ceramic” are not, however, interchangeable. Not all ceramic “things” are cups, and relying on shape cues to recognize ceramic, which past methods inevitably do, is a fundamentally limited approach.
To avoid this uncontrolled dependence on context, materials need to be recognized locally (i.e., without seeing the objects they make up or the scenes in which they lie), before other “global” context, including separately recognized objects and scenes, can help eliminate the remaining uncertainty arising from strictly local appearance cues. Recognition of materials from local appearance cues becomes even more essential when recognizing “stuff” such as towels, water, and bushes that do not have canonical shapes.
Mixing context and material categories during material recognition implicitly relies on the underlying recognition framework to disentangle these concepts. Bell et al. come to the conclusion that, “Training on a dataset which includes the surrounding context is crucial for real-world material classification,” but when simply given large patches as input and materials as output, we may only speculate as to the actual importance of context. In fact, Fig. 10 of their paper clearly shows that materials are recognized by identifying the actual objects they make up (e.g., “mirror,” which is actually an object, is identified as a material by finding mirrors, “leather” is recognized by finding sofas and ottomans, and “fabric” is identified by recognizing pillows).
On the other hand, there have been recent attempts to study the separation of materials from other image context. Hu et al. briefly investigated the correlations between objects and materials. Schwartz and Nishino [12, 13] proposed to recognize materials from small local image patches that are taken from inside the boundaries of object regions and therefore contain no shape cues. They achieve per-pixel recognition based on visual material traits (e.g., “smooth,” “hard,” and “organic”) as discriminative internal representations that transcend material categories. Although these methods successfully demonstrate the power of local material recognition, they do not show the integration of rich image context for combined reasoning of materials in images.
Global image context, including both what the object is and where it is, can provide rich information to narrow down what things and stuff are made of. A city street, for example, is likely to contain asphalt, rubber, glass, and metal materials. Figure 2 shows examples of actual correlations between materials and context given ground-truth objects from the MS COCO database, along with global context in the form of place category predictions from the MIT Places CNN. The challenge in exploiting these contextual cues is that if we were to simply attempt to obtain exhaustive annotations for materials, objects, and places, we would be searching a product space with an extremely large number of combinations.
In this paper, we introduce a material recognition framework that fully leverages local appearance and global contextual cues to recognize materials at each pixel in an image. By relying on accurate global context that may be extracted from any image, we avoid having to obtain annotations for the full product space of materials, objects, and places. Materials are an inherently local property, and as such we aim to produce dense per-pixel material category predictions. We separate materials from objects and other context by training a full-resolution dense-output material CNN on small local image patches. We then introduce explicit accurate context cues, in the form of object and place category predictions, to higher levels of the network. This allows us to provide accurate context rather than the implicit and uncertain context present in large patches, and also avoids the tradeoff between spatial resolution and context observed in previous work.
We group the context categories into a logical hierarchy and investigate the effect of hierarchical context level on material recognition. We find that each additional form of context we introduce provides an independent increase in material recognition accuracy, and that finer-grained context is better for material recognition. We also investigate the ideal level, in terms of the hierarchical levels of the CNN recognition framework, at which to introduce context. Intuitively, objects and places are high-level concepts. Our results agree with this, as we find that context is best used when we introduce it at the highest level of the network.
Our results show that the explicit separation and (re-)integration of image context significantly improves local material recognition accuracy. We quantitatively evaluate the accuracy of our method and find that it outperforms previous approaches that implicitly mix materials with object and place context. On a recently introduced comprehensive and diverse material database, we confirm that our method achieves state-of-the-art accuracy with significantly less training data compared to past methods.
2 Related Work
Material recognition is usually done at an image patch level. These patches are, however, relatively large in most existing works: they span a significant area of the scene, sometimes even the entire image, covering parts of or entire objects. Sharan et al. introduced the earliest form of such classification with the Flickr Materials Database (FMD). In the FMD, the image patch is the entire image, and each image contains a single primary material of interest, similar to image classification. Recently, Bell et al. demonstrated per-pixel material classification using large-scale annotated training data, the Materials in Context (MINC) dataset, and a combination of CNN and CRF models for classification. Their method uses a large image patch for each pixel, roughly a quarter of the entire image, which inevitably mixes object or place context into material appearance. This naturally leads to a reliance on recognizing the object to recognize the material, which would fundamentally necessitate extremely large training data that spans the product space of materials and objects (and places). It is also important to point out that the dataset is highly biased, as it is predominantly sourced from professional real-estate photographs. Wang et al. also demonstrate accurate dense per-pixel material predictions using 4D light field images. Zhang et al. have recently shown impressive performance on the FMD, but their results focus only on single patch predictions. These methods mix materials and context interchangeably throughout the recognition pipeline, when they would be better used in a factorized form (as we show).
Dense prediction, outputting a value or category prediction for each pixel, has been extensively studied in the context of object recognition and object semantic segmentation. Object recognition datasets, such as ImageNet or MS COCO, often contain many (80-10,000) categories. Despite this, state-of-the-art semantic segmentation methods such as DeepLab focus on only a small subset of coarse-grained categories. A notable and relevant exception is the recent ADE20k dataset, scene parsing challenge, and associated models. The dataset contains many fully-segmented images, and the challenge defines a set of 150 categories for semantic segmentation. We are not merely performing semantic segmentation; we instead aim to produce dense material predictions. For this, we find the ADE20k models to be ideal sources of per-pixel object category context information.
The use of context as a means to reduce ambiguity, whether in materials or other cases, appears promising. Hu et al. showed that a simple addition of object category predictions as features could potentially improve material recognition. On an unrelated topic, Iizuka et al. use scene place category predictions to improve the accuracy of greyscale image colorization. Our work, in contrast to these previous methods, takes advantage of multiple sources of context and investigates the ideal granularity of context categories. Within the framework of a Convolutional Neural Network (CNN), we evaluate how the hierarchical level at which we introduce context influences the accuracy of the corresponding material predictions.
3 Local Material Recognition
We aim to leverage scene context, such as objects and places, to improve dense per-pixel material recognition. Our first step is to ensure that, when recognizing materials, we are in fact dealing with just materials and not an implicit fusion of materials and context. Schwartz and Nishino [12, 13] have proposed to achieve such a separation by recognizing materials using only small local image patches inside the boundary of objects as input, thereby avoiding any influence from context derived from object shape or other global features. Equally important, they have recently introduced a dataset aimed at local material recognition, with carefully-selected categories chosen from a material hierarchy, and material annotations that respect object boundaries. The taxonomy of materials is based on the canonical categorization defined in materials science, and material regions are carefully segmented for images sourced from a variety of databases including the PASCAL VOC database, the Microsoft COCO database, the FMD, and the ImageNet database. Although the total number of images (about 3000) is smaller than in past datasets, the clean separation of materials and other context (i.e., objects and places), the principled material category definitions that avoid mixing objects and materials, and the additional care taken to minimize bias in types of images make it ideal for studying local material recognition. We are able to extract more than 200,000 image patches free of lurking object context (e.g., object boundaries), which makes it a sufficiently large-scale dataset for training a local material recognition classifier.
Our goal is to integrate materials with context that may be partially global (objects with large spatial extent but defined boundaries) or fully global (scene place categories, one per image). The frameworks introduced in [12, 13] are only able to make dense pixel-wise predictions in a sliding window fashion, and do not offer any logical point at which to introduce global context. To address this, we build a fully-convolutional CNN architecture, based on the VGG-16 network of Simonyan and Zisserman, with modifications that enable us to output dense full-resolution material predictions with integrated global context. Bell et al. have previously investigated a similar architecture for material recognition. They, however, rely on large (24% image size) training patches and a loosely-defined set of material categories (e.g., carpet as a material) that collectively mix up materials and objects. In contrast, all of our training is done with small local material image patches. Section 5.1 describes the model architecture.
4 Distributions of Materials and Context
We have an intuitive understanding that, if one knows an object is, for example, made of metal, then it may be a knife or a car, but probably not a piece of clothing. Likewise, if we know an object is a cup, then it is likely made of glass, plastic, or ceramic. We can quantitatively evaluate the informative nature of context, such as object and place categories, by computing the conditional probability distributions of materials given each possible category of context. If our intuition is correct, then these distributions should be discriminative (e.g., have a low entropy relative to the corresponding discrete uniform distribution).
4.1 Object Context
Figure 2: The conditional distributions of materials given ground-truth object categories (top row) and predicted places (bottom row) are highly discriminative. Many context categories exhibit only a small set of materials. Some outliers are inevitable as the ground-truth COCO segmentation masks do not perfectly conform to actual object boundaries in the image. Places do not offer the same very strong divisions of materials as objects do. Their distributions are, however, still valuable as shown both by their entropy and the resulting material recognition accuracy based on place context.
We can get an initial idea as to how discriminative context is by using ground-truth object masks and corresponding materials to compute the conditional distribution $p(m \mid o)$, where $m$ is the material category and $o$ is the object category. We use the local materials database of Schwartz and Nishino, as it includes images from databases that contain object category map annotations (particularly, MS COCO). To compute the conditional probabilities, we take each image with material annotations and find the object exhibiting each material as indicated by the COCO ground truth. Figure 2 shows conditional material probabilities for a few selected object categories. The entropy for the discrete uniform distribution over 16 categories is 2.77 (in nats), and as shown in Figure 2 the entropy given true object categories is much lower.
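As a rough illustration of this computation, the sketch below estimates the conditional distribution of materials given a context category from co-occurrence counts and measures its entropy with the natural logarithm. The function names and the toy observations are hypothetical and are not taken from the database; only the reference value for a uniform distribution over 16 categories comes from the text.

```python
import math
from collections import Counter, defaultdict

def conditional_material_distributions(pairs):
    """Estimate p(material | context) from (context, material) observations."""
    counts = defaultdict(Counter)
    for context, material in pairs:
        counts[context][material] += 1
    dists = {}
    for context, material_counts in counts.items():
        total = sum(material_counts.values())
        dists[context] = {m: c / total for m, c in material_counts.items()}
    return dists

def entropy(probabilities):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)

# Hypothetical toy observations (illustrative only, not real annotations).
pairs = [("cup", "ceramic"), ("cup", "ceramic"), ("cup", "glass"), ("cup", "plastic"),
         ("car", "metal"), ("car", "metal"), ("car", "glass"), ("car", "rubber")]
dists = conditional_material_distributions(pairs)

uniform_entropy = entropy([1 / 16] * 16)      # reference: uniform over 16 materials
cup_entropy = entropy(dists["cup"].values())  # entropy of p(material | cup)
```

A context category is informative to the extent that its conditional entropy falls below the uniform reference value.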
4.2 Place Context
Objects are defined somewhat locally (at the level of groups of pixels, but still globally compared to the local material appearance we model) and tend to exhibit only a small set of materials. In contrast, places are single scene-wide properties and can encompass many objects and materials. Despite this, we expect that places can still provide useful cues to disambiguate local materials. Ceramic and paper, for example, are often both flat white surfaces. Without seeing a specular highlight, it may be difficult to distinguish the two given only a small local patch. If, however, we know that the image patch originates from an image of a classroom, it is more likely that the patch contains paper.
We can evaluate the discriminative power of places by using predictions from the MIT Places CNN. Figure 2 contains examples of the conditional distributions for a few choices of place category. While they are not uniformly as discriminative as object categories, they still do provide some useful cues. Botanical gardens, for example, tend to contain plants as one would expect, and images of crosswalks contain asphalt, metal, and rubber (roads, cars).
At least in the case of objects, and perhaps places as well, the conditional distributions are so discriminative as to suggest that we might simply multiply these distributions with the predictions of an existing material recognition model and achieve improved accuracy. We initially investigated this applied to a material recognition CNN as a baseline for comparisons. We, however, found that simple multiplication made a negligible difference in the accuracy. This is due to the fact that many of the mispredictions of materials we might hope to correct are too strongly-predicted for simple multiplication to have any effect.
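The effect described here can be seen in a small numerical sketch. The code below multiplies a per-class prediction by a context-conditioned distribution and renormalizes; the three classes and all probability values are hypothetical, chosen only to contrast a confident misprediction with an uncertain one.

```python
def apply_prior(prediction, prior):
    """Multiply per-class probabilities by a prior and renormalize to sum to 1."""
    product = [p * q for p, q in zip(prediction, prior)]
    total = sum(product)
    return [v / total for v in product]

def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=lambda i: values[i])

# Hypothetical class order: (concrete, water, foliage).
prior_given_context = [0.10, 0.60, 0.30]  # context favors "water"

confident_wrong = [0.97, 0.02, 0.01]      # network strongly predicts "concrete"
uncertain_wrong = [0.45, 0.40, 0.15]      # network weakly predicts "concrete"

confident_after = apply_prior(confident_wrong, prior_given_context)
uncertain_after = apply_prior(uncertain_wrong, prior_given_context)
```

In this toy case the uncertain prediction flips to "water" while the confident one remains "concrete", which matches the observation that strongly predicted errors are not corrected by multiplication alone.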
5 Local Materials and Global Context
While the context cues from objects and places are highly discriminative, we cannot simply treat them as a prior on material occurrence and multiply them with a model’s prediction. The model must instead have the context available during training, so that the context may influence the material predictions. Such an observation is consistent with the general idea of leveraging top-down feedback with bottom-up recognition, for instance, as demonstrated for object detection and for human pose estimation by Carreira et al. Here, we are obtaining the top-down information in the form of object and place context. We treat the set of predicted context category probabilities, obtained from state-of-the-art networks for scene recognition and object semantic segmentation, as an additional feature in a dense per-pixel material CNN. By concatenating these probabilities with the high-level features in the network prior to output, we may take full advantage of the strong material recognition cues available in global image context.
5.1 Context Integration Network
As shown in Figure 3, our model is based on the VGG-16 architecture of Simonyan and Zisserman, with a few fundamental modifications to enable dense prediction from local material image patches with added global image context. To enable dense prediction, we train a fully-convolutional form of the network where all fully-connected layers are replaced with convolutions. We also add a series of upsampling layers (output-strided convolution, also called “deconvolution”) to match the input and output resolutions.
The level of downsampling in the original VGG-16 network is not compatible with local material recognition. The minimum patch size is constrained by the downsampling factor, and we would like to use small image patches to maximize the separation between local materials and global context. We find px patches represent an appropriate trade-off between eliminating all non-local information (single pixel “patches”) and the large patches used by previous methods. While extremely small objects may still appear in such patches, relying on the local materials database ensures that these objects are not labeled and will not create any undesired dependence on implicit context. To avoid too much downsampling, we remove the last set of pooling and filtering layers from the network. The remaining top two pooling layers are also removed and the corresponding layers above are replaced with dilated convolutions. In order to train on datasets where densely-segmented material ground truth may not always be available, we compute the softmax loss function at each pixel. The loss function is only evaluated at pixels where there is a known material. In this way we are able to take advantage of segmented material images without requiring completely dense annotation.
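A per-pixel softmax loss that skips unlabeled pixels can be sketched as follows. This is a minimal plain-Python illustration rather than the training code used for the model; the `IGNORE` label value and the flat list-of-pixels layout are assumptions made for the example.

```python
import math

IGNORE = -1  # hypothetical label marking pixels without a material annotation

def masked_softmax_loss(logits, labels):
    """Mean softmax cross-entropy over labeled pixels only.

    logits: list of per-pixel score lists (one score per material category).
    labels: list of per-pixel category indices, IGNORE where unannotated.
    """
    total, count = 0.0, 0
    for scores, label in zip(logits, labels):
        if label == IGNORE:
            continue  # unlabeled pixels contribute nothing to the loss
        peak = max(scores)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_norm - scores[label]
        count += 1
    return total / count if count else 0.0

# Three pixels, two categories; the middle pixel has no annotation.
example_logits = [[0.0, 0.0], [10.0, 0.0], [0.0, 5.0]]
example_labels = [0, IGNORE, 1]
example_loss = masked_softmax_loss(example_logits, example_labels)
```

Because the loss is averaged only over annotated pixels, changing the scores at an unlabeled pixel leaves the value unchanged.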
We leverage other contextual cues, namely object recognition and place recognition results, by running state-of-the-art classifiers separately and then integrating the results into the material recognition pipeline. This enables the complete separation of global image context recognition from local material recognition, which avoids requiring prohibitively large numbers of samples of the product space of materials, objects, and places as training data. It also allows us to use any source of object and place context without being limited to specific requirements for integration.
Specifically, for global image context integration, we treat estimated context category probabilities as additional features and concatenate them with existing features in the network. For places, we only have a single set of category probabilities for the entire image. We replicate these values across the image to match the image dimensions at the level where the context is introduced. This is similar to the colorization work of Iizuka et al. We, however, introduce per-pixel context in the form of object semantic segmentation and find quantitative support for the spatial resolution and introduction level in the network of the added context.
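The replication and concatenation step can be illustrated with nested lists standing in for feature maps. The sketch below is an assumption-laden toy: the function name, the H x W x C list layout, and the example values are all hypothetical, and a real network would perform the same operation on tensors.

```python
def add_context(features, object_probs, place_probs):
    """Concatenate context probabilities onto per-pixel feature vectors.

    features:     H x W grid of feature lists (high-level network features).
    object_probs: H x W grid of per-pixel object category probabilities.
    place_probs:  one list of place probabilities for the whole image,
                  replicated at every pixel.
    """
    return [
        [feat + obj + list(place_probs) for feat, obj in zip(feat_row, obj_row)]
        for feat_row, obj_row in zip(features, object_probs)
    ]

# Hypothetical 2 x 2 map with 3 feature channels, 2 object and 2 place categories.
features = [[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
            [[0.7, 0.8, 0.9], [1.0, 1.1, 1.2]]]
object_probs = [[[0.9, 0.1], [0.2, 0.8]],
                [[0.5, 0.5], [0.3, 0.7]]]
place_probs = [0.25, 0.75]

fused = add_context(features, object_probs, place_probs)
```

Each output pixel carries its own features and object probabilities followed by the same image-wide place probabilities.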
5.2 Extracting Per-Pixel Global Context
We leverage global image context in the form of per-pixel object category predictions (semantic segmentation) and single scene-wide place category predictions. For places, we use the MIT Places CNN. A key feature of their places database is that the categories are hierarchically organized. As we will show below when we investigate the effects of place category granularity, the number of context categories is important: fine-grained context provides more useful information than simple, higher-level context groups. In contrast, most existing object semantic segmentation methods train their models on only a relatively small set of high-level object categories. A notable exception is the ADE20k dataset, which contains over 2,000 object categories. The MIT Scene Parsing Challenge, which relies on this dataset, selects 150 categories for semantic segmentation. We use these categories and trained models for our per-pixel object context.
5.3 Hierarchical Place Context
As part of their SUN database for scene and object recognition, Xiao et al. define a hierarchy of place categories. This hierarchy raises the question of whether any particular context granularity is more or less useful for material recognition. On one hand, having an extremely fine set of place categories might mean that few training examples would appear from certain places. At the other extreme, the coarsest division of places could only provide very general cues as to which materials may be present.
To evaluate the importance of place granularity, we compute material recognition accuracy scores using only place context at each level of the SUN places hierarchy. We adapt their hierarchy to the place categories recognized by the MIT Places CNN and treat nodes within each level of the hierarchy as place categories. The highest level is the simple division of indoors vs. outdoors, mid-level categories deal with distinctions such as commercial and residential buildings, or mountains and forests, and the lowest level includes smaller groups such as entertainment or religious places. Results in Table 1 show that accuracy increases with place category granularity: more detailed place categories provide more discriminative information for material recognition. Computing the entropy of the conditional material distributions for the set of place categories at each hierarchy level supports these results.
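One simple way to obtain probabilities for the nodes of a coarser hierarchy level is to sum the fine-grained place probabilities that fall under each node. The sketch below assumes this summation scheme, which the text does not specify, and uses a hypothetical three-place hierarchy.

```python
def coarsen(place_probs, parent_of):
    """Sum fine-grained place probabilities into their parent categories.

    place_probs: dict mapping fine-grained place name to probability.
    parent_of:   dict mapping fine-grained place name to its parent node
                 at the desired hierarchy level (hypothetical mapping).
    """
    coarse = {}
    for place, probability in place_probs.items():
        parent = parent_of[place]
        coarse[parent] = coarse.get(parent, 0.0) + probability
    return coarse

# Hypothetical hierarchy and predictions, for illustration only.
parent_of = {"kitchen": "indoor", "classroom": "indoor", "forest": "outdoor"}
fine_probs = {"kitchen": 0.5, "classroom": 0.25, "forest": 0.25}

coarse_probs = coarsen(fine_probs, parent_of)
```

The coarse distribution still sums to one, but it has fewer categories and therefore carries less specific information about which materials are likely to be present.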
5.4 Integration Level and Spatial Resolution
Existing methods for integration of context, such as that of , find that adding context at the highest levels of the network results in a successful integration. This may make intuitive sense, as we want the context to directly inform existing category predictions at the end of the network. We show quantitatively that higher levels are indeed better suited for the addition of context. Additionally, as our method relies on object context that varies spatially (in contrast with single place context values across the entire image), we also investigate the effect of the object context’s spatial resolution. As the context introduction level and spatial resolution are related (due to pooling layers), any observed change in accuracy could be caused by either the drop in spatial resolution or the change in level. We vary the effective resolution of the context by fixing the level and actual resolution (the number of pixels in the context map) then downsampling and upsampling the context.
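Reducing the effective resolution of a context map while keeping its pixel count can be sketched as a downsample followed by an upsample. The version below uses block averaging and nearest-neighbor replication, which are assumptions for illustration; the text does not state which resampling filters were used.

```python
def reduce_effective_resolution(context_map, factor):
    """Block-average then nearest-neighbor upsample, keeping the map's size.

    context_map: H x W grid of values (e.g., one object category's probability).
    factor:      side length of the square blocks that share one value.
    """
    height, width = len(context_map), len(context_map[0])
    out = [[0.0] * width for _ in range(height)]
    for top in range(0, height, factor):
        for left in range(0, width, factor):
            rows = range(top, min(top + factor, height))
            cols = range(left, min(left + factor, width))
            mean = sum(context_map[r][c] for r in rows for c in cols) / (len(rows) * len(cols))
            for r in rows:
                for c in cols:
                    out[r][c] = mean  # every pixel in the block gets the block mean
    return out

# Hypothetical 4 x 4 context map reduced to an effective 2 x 2 resolution.
context_map = [[1.0, 3.0, 0.0, 0.0],
               [5.0, 7.0, 0.0, 0.0],
               [0.0, 0.0, 2.0, 2.0],
               [0.0, 0.0, 2.0, 2.0]]
blurred = reduce_effective_resolution(context_map, 2)
```

The output has the same number of pixels as the input, so it can be introduced at the same network level, while its spatial detail is limited to one value per block.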
Results in Table 2 show that the highest level is indeed the ideal place at which to introduce global context. If context is introduced at lower levels, the network is free to overfit to the context, and poor accuracy results. For this experiment we randomly initialized the weights of the network; thus the accuracy values are not comparable with results in later sections. If the network were initialized from pre-trained weights, as in our final results, then the accuracy would be artificially reduced as we introduced context at lower levels, because the added context would invalidate the pre-trained weights above it. To separate the effects of spatial resolution and introduction level, we trained the same network with a reduced effective spatial resolution for the object context, with context introduced after upsampling. Accuracy was essentially unaffected (71.4% with 16× downsampling), showing that it is indeed the hierarchical network level and not the spatial resolution that determines the accuracy.
| Context | Accuracy | Mean Class Accuracy |
|---|---|---|
| Places + Objects | 73.0% | 72.5% |
Table 3 contains a breakdown of the contributions for each form of global context. Most importantly, the contributions from objects and places are similar and the combination of the two outperforms either individual context source. This suggests that the objects and places are providing unique sources of information and are both critical to the accurate recognition of materials. A cup, for example, may be made of glass or plastic. If, however, you are in a bar, then it is much more likely to be made of glass.
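The tables report both accuracy and mean class accuracy. The sketch below shows how the two can differ, assuming that mean class accuracy is the unweighted average of per-class accuracies (a common definition that the text does not spell out). The labels are hypothetical.

```python
def accuracy_metrics(true_labels, predicted_labels):
    """Return (overall accuracy, mean of per-class accuracies)."""
    per_class_total, per_class_correct = {}, {}
    for truth, prediction in zip(true_labels, predicted_labels):
        per_class_total[truth] = per_class_total.get(truth, 0) + 1
        per_class_correct[truth] = per_class_correct.get(truth, 0) + (truth == prediction)
    overall = sum(per_class_correct.values()) / len(true_labels)
    mean_class = sum(
        per_class_correct[c] / per_class_total[c] for c in per_class_total
    ) / len(per_class_total)
    return overall, mean_class

# Hypothetical imbalanced example: 8 fabric pixels, 2 glass pixels.
true_labels = ["fabric"] * 8 + ["glass"] * 2
predicted_labels = ["fabric"] * 8 + ["fabric", "glass"]

overall, mean_class = accuracy_metrics(true_labels, predicted_labels)
```

Here overall accuracy is 0.9 while mean class accuracy is 0.75, because the rare class is weighted equally with the frequent one.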
6 Dense Material Recognition
| Method | Accuracy | Mean Class Accuracy |
|---|---|---|
| VGG-16 + Upsampling | 66.7% | 55.1% |
| MINC, no retraining | 60.4% | 67.5% |
| MINC, retrained | 72.8% | 70.1% |
| Ours (Places + Objects) | 73.0% | 72.5% |
Results in Table 4 show that our approach outperforms the large-patch-based (VGG-16 + CRF) method of Bell et al. on the local materials database, despite the fact that their model was both trained on millions of patches and fine-tuned on local material images. We evaluated all of the MINC-trained models (AlexNet, GoogleNet and VGG-16) and found VGG-16 to be the most accurate for this comparison. As a baseline, we train the full VGG-16 architecture (with large patches) on the local materials database. As expected, it is unable to take full advantage of the implicitly available context. This can be viewed as the baseline case in which the MINC model is trained only on the local materials database.
The MINC model’s accuracy is also significantly lower on the local materials database compared to their own database, even after the final category layer was re-trained for the local materials categories. These results suggest that the local materials database contains more diverse and challenging images. A large part of the MINC database comes from real-estate photographs and is thus inevitably biased in materials. The fact that MINC with no retraining exhibits lower mean-class accuracy even on a smaller number of categories further supports this.
We attempted to generate an approximate comparison on the MINC database by splitting the segmented MINC test images and training on one portion. Due, however, to the disproportionately small amount of data that could be extracted (only 7000 segments, roughly equivalent to 63,000 patches, whereas the MINC method uses 2.5 million patches), a fair comparison of the methods on this database could not be achieved. For reference, our method still achieves 68.5%, which is a smaller drop in accuracy than MINC when comparing cross-dataset performance of the two models. This further shows the bias inherent in the MINC database. Although the vast single-click training data that MINC is able to leverage certainly is an advantage of not separating material appearance from surrounding context, these numbers and our rigorous comparative experimental results summarized in Table 4 clearly show that our framework outperforms the MINC model with significantly less training data on a much more comprehensive and diverse dataset. Please see our supplemental material for the full implementation and training details of our final model.
We can readily see in Figure 4 that the context helps disambiguate materials that may be difficult to recognize from only local information. When metal does not exhibit specular highlights or reflections, as is the case with the airplane body, the flat white surface offers little in the way of local recognition cues. Knowing either that the scene is an airport or that the current pixel belongs to a plane removes this ambiguity. Likewise, in the natural scene with elephants, the combination of high-frequency waves and specular reflection causes the water to appear like concrete. Scene context makes it clear that concrete would be unlikely in this case. In general, the predictions are accurate subject to the limitations of the training data. Skin is not a material in the local materials dataset, and thus skin is often classified as the surrounding fabric. Sky is not a material and the predictions for sky are determined largely by context (e.g., metal at airports, water over the ocean). Additional qualitative examples in Figure 5 show that the combination of local materials and global context results in accurate material predictions in the face of local ambiguity (both in the context and in the local appearance). Our supplemental material contains further examples of dense material predictions from our framework.
7 Conclusion
Our results show that we can successfully separate materials from their surrounding context and combine those materials with highly-discriminative forms of global context. Such a combination outperforms previous methods which implicitly rely on context being available in a large input image patch. Additionally, we performed a detailed investigation into the ideal granularity for context in material recognition as well as the hierarchical level and spatial resolution at which the context should be introduced into a CNN framework.
The experimental results conclusively demonstrate that material recognition based on the explicit integration of local appearance and global context achieves state-of-the-art accuracy on a comprehensive and diverse dataset with less training data. We believe these results also suggest similar approaches to bottom-up top-down integration for other recognition tasks, which we are interested in exploring in future work.
This work was supported by the Office of Naval Research grant N00014-16-1-2158 (N00014-14-1-0316) and N00014-17-1-2406, and the National Science Foundation award IIS-1421094. The Titan X used for part of this research was donated by the NVIDIA Corporation.
References
-  Matbase: Chemical, Mechanical, Physical and Environmental Properties of Materials. http://www.matbase.com.
-  E. H. Adelson. On Seeing Stuff: The Perception of Materials by Humans and Machines. In SPIE, pages 1–12, 2001.
-  S. Bell, P. Upchurch, N. Snavely, and K. Bala. Material Recognition in the Wild with the Materials in Context Database. In CVPR, 2015.
-  J. Carreira, P. Agrawal, K. Fragkiadaki, and J. Malik. Human Pose Estimation with Iterative Error Feedback. arXiv, abs/1507.06550, 2015.
-  L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. arXiv, abs/1606.00915, 2016.
-  B. Epshtein, I. Lifshitz, and S. Ullman. Image Interpretation by a Single Bottom-Up Top-Down Cycle. PNAS, 105(38), 2008.
-  M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes (VOC) Challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
-  D. Hu, L. Bo, and X. Ren. Toward Robust Material Recognition for Everyday Objects. In BMVC, pages 48.1–48.11, 2011.
-  S. Iizuka, E. Simo-Serra, and H. Ishikawa. Let there be Color!: Joint End-to-end Learning of Global and Local Image Priors for Automatic Image Colorization with Simultaneous Classification. In SIGGRAPH, volume 35, pages 110:1–110:11, 2016.
-  T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common Objects in Context. In ECCV, 2014.
-  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
-  G. Schwartz and K. Nishino. Visual Material Traits: Recognizing Per-Pixel Material Context. In Color and Photometry in Computer Vision (Workshop held in conjunction with ICCV’13), 2013.
-  G. Schwartz and K. Nishino. Automatically Discovering Local Visual Material Attributes. In CVPR, pages 3565–3573, 2015.
-  G. Schwartz and K. Nishino. Discovering Perceptual Attributes in a Deep Local Material Recognition Network. arXiv, abs/1604.01345, 2016.
-  L. Sharan, R. Rosenholtz, and E. Adelson. Material Perception: What Can You See in a Brief Glance? Journal of Vision, 9(8):784, 2009.
-  K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. In ICLR, pages 1–14, 2015.
-  T.-C. Wang, J.-Y. Zhu, H. Ebi, M. Chandraker, A. A. Efros, and R. Ramamoorthi. A 4D Light-Field Dataset and CNN Architectures for Material Recognition. In ECCV, 2016.
-  J. Xiao, J. Hays, K. Ehinger, A. Oliva, and A. Torralba. SUN Database: Large-scale Scene Recognition from Abbey to Zoo. In CVPR, 2010.
-  Y. Zhang, M. Ozay, X. Liu, and T. Okatani. Integrating Deep Features for Material Recognition. arXiv, abs/1511.06522, 2015.
-  B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning Deep Features for Scene Recognition using Places Database. In NIPS, 2014.
-  B. Z. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba. Semantic Understanding of Scenes through ADE20K Dataset. arXiv, abs/1608.05442, 2016.