Code release for "Differential Angular Imaging for Material Recognition", CVPR 2017.
Material recognition for real-world outdoor surfaces has become increasingly important for computer vision to support its operation "in the wild." Computational surface modeling that underlies material recognition has transitioned from reflectance modeling using in-lab controlled radiometric measurements to image-based representations based on internet-mined images of materials captured in the scene. We propose to take a middle-ground approach for material recognition that takes advantage of both rich radiometric cues and flexible image capture. We realize this by developing a framework for differential angular imaging, where small angular variations in image capture provide an enhanced appearance representation and significant recognition improvement. We build a large-scale material database, Ground Terrain in Outdoor Scenes (GTOS) database, geared towards real use for autonomous agents. The database consists of over 30,000 images covering 40 classes of outdoor ground terrain under varying weather and lighting conditions. We develop a novel approach for material recognition called a Differential Angular Imaging Network (DAIN) to fully leverage this large dataset. With this novel network architecture, we extract characteristics of materials encoded in the angular and spatial gradients of their appearance. Our results show that DAIN achieves recognition performance that surpasses single view or coarsely quantized multiview images. These results demonstrate the effectiveness of differential angular imaging as a means for flexible, in-place material recognition.
Real world scenes consist of surfaces made of numerous materials, such as wood, marble, dirt, metal, ceramic and fabric, which contribute to the rich visual variation we find in images. Material recognition has become an active area of research in recent years with the goal of providing detailed material information for applications such as autonomous agents and human-machine systems.
Modeling the apparent or latent characteristic appearance of different materials is essential to robustly recognize them in images. Early studies of material appearance modeling largely concentrated on comprehensive lab-based measurements using dome systems, robots, or gonioreflectometers collecting measurements that are dense in angular space (such as BRDF, BTF). These reflectance-based studies have the advantage of capturing intrinsic invariant properties of the surface, which enables fine-grained material recognition [47, 27, 33, 42]. The inflexibility of lab-based image capture, however, prevents widespread use in real world scenes, especially in the important class of outdoor scenes. A fundamentally different approach to reflectance modeling is image-based modeling where surfaces are captured with a single-view image in-scene or “in-the-wild.” Recent studies of image-based material recognition use single-view internet-mined images to train classifiers [1, 20, 7, 28] and can be applied to arbitrary images casually taken without the need for multiview reflectance information. In these methods, however, recognition is typically based more on context than intrinsic material appearance properties except for a few purely local methods [34, 35].
Between comprehensive in-lab imaging and internet-mined images, we take an advantageous middle-ground. We capture in-scene appearance but use controlled viewpoint angles. These measurements provide a sampling of the full reflectance function. This leads to a very basic question: how do multiple viewing angles help in material recognition? Prior work used differential camera motion or object motion for shape reconstruction [2, 3, 43]; here we consider a novel question: do small changes in viewing angles, that is, differential changes, result in significant increases in recognition performance? Prior work has shown the power of angular filtering to complement spatial filtering in material recognition. These methods, however, rely on a mirror-based camera to capture a slice of the BRDF or a lightfield camera to achieve multiple differential viewpoint variations, which limits their application due to the need for specialized imaging equipment. We instead propose to capture surfaces with differential changes in viewing angles with an ordinary camera and compute discrete approximations of angular gradients from them. We present an approach called angular differential imaging that augments image capture at a particular viewing angle with a differential viewpoint. Contrast this method with lab-based reflectance measurements that often quantize the angular space, measuring with domes or positioning devices with large angular spacing. These coarse-quantized measurements have limited use in approximating angular gradients. Angular differential imaging can be implemented with a small-baseline stereo camera or a moving camera (e.g. handheld). We demonstrate that differential angular imaging provides key information about material reflectance properties while maintaining the flexibility of convenient in-scene appearance capture.
Table 1: Comparison of material datasets by samples, classes, views, illumination, in-scene capture, scene image, camera parameters, and year.
To capture material appearance in a manner that preserves the convenience of image-based methods and the important angular information of reflectance-based methods, we assemble a comprehensive, first-of-its-kind, outdoor material database that includes multiple viewpoints and multiple illumination directions (partial BRDF sampling), multiple weather conditions, a large set of surface material classes surpassing existing comparable datasets, multiple physical instances per surface class (to capture intra-class variability) and differential viewpoints to support the framework of differential angular imaging. We concentrate on outdoor scenes because of the limited availability of reflectance databases for outdoor surfaces. We also concentrate on materials from ground terrain in outdoor scenes (GTOS) for applicability in numerous applications such as automated driving, robot navigation, photometric stereo and shape reconstruction. The 40 surface classes include ground terrain such as grass, gravel, asphalt, concrete, black ice, snow, moss, mud and sand (see Figure 2).
We build a recognition algorithm that leverages the strength of deep learning and differential angular imaging. The resulting method takes two image streams as input, the original image and a differential image, as illustrated in Figure 1. We optimize the two-stream configuration for material recognition performance and call the resulting network the differential angular imaging network (DAIN).
We make three significant contributions in this paper: 1) Introduction of differential angular imaging as a middle-ground between reflectance-based and image-based material recognition; 2) Collection of the GTOS database, made publicly available, with over 30,000 in-scene outdoor images capturing angular reflectance samples with scene context over a large set of material classes; 3) The development of DAIN, a material recognition network with state-of-the-art performance in comprehensive comparative validation.
More recently, features learned with deep neural networks have outperformed these methods for texture recognition. Cimpoi et al. achieve state-of-the-art results on FMD and KTH-TIPS2.
The success of deep learning methods in object recognition has also translated to the problem of material recognition, the classification and segmentation of material categories in arbitrary images. Bell et al. achieve per-pixel material category labeling by retraining the then state-of-the-art object recognition network on a large dataset of material appearance. This method relies on large image patches that include object and scene context to recognize materials. In contrast, Schwartz and Nishino [34, 35] learn material appearance models from small image patches extracted inside object boundaries to decouple contextual information from material appearance. To achieve accurate local material recognition, they introduced intermediate material appearance representations based on their intrinsic properties (e.g., “smooth” and “metallic”).
In addition to the apparent appearance, materials can be discerned by their radiometric properties, namely the bidirectional reflectance distribution function (BRDF) and the bidirectional texture function (BTF), which essentially encode the spatial and angular appearance variations of surfaces. Materials often exhibit unique characteristics in their reflectance, offering detailed cues to recognize subtle variations among them (e.g., different types of metal and paint). Reflectance measurements, however, necessitate elaborate image capture systems, such as a gonioreflectometer [30, 45], robotic arm, or a dome with cameras and light sources [12, 27, 42]. Recently, Zhang et al. introduced the use of a one-shot reflectance field capture for material recognition. They adapt the parabolic mirror-based camera developed by Dana and Wang to capture the reflected radiance for a given light source direction in a single shot, which they refer to as a reflectance disk. More recently, Zhang et al. showed that the reflectance disks contain sufficient information to accurately predict the kinetic friction coefficient of surfaces. These results demonstrate that the angular appearance variation of materials and their gradients encode rich cues for their recognition. Similarly, Wang et al. use a light field camera and combine angular and spatial filtering for material recognition. In strong alignment with these recent advances in material recognition, we build a framework of spatial and angular appearance filtering. In sharp contrast to past methods, however, we use image information from standard cameras instead of a multilens array as in Lytro. We explore the difference between using a large viewing angle range (with samples coarsely quantized in angle space) and using differential changes in angles, which can easily be captured by a two-camera system or small motions of a single ordinary camera.
Deep learning has achieved major success in object classification [23, 4, 19], segmentation [17, 22, 32], and material recognition [7, 49, 26, 50]. In our goal of combining spatial and angular image information to account for texture and reflectance, we are particularly motivated by the two-stream fusion framework [15, 37], which achieves state-of-the-art results on the UCF101 action recognition dataset.
Datasets to measure reflectance of real world surfaces have a long history of lab-based measurements including: CUReT database, KTH-TIPS database by Hayman et al., MERL Reflectance Database, UBO2014 BTF Database, UTIA BRDF Database, Drexel Texture Database and IC-CERTH Fabric Database. In many of these datasets, dense reflectance angles are captured with special image capture equipment. Some of these datasets have limited instances/samples per surface category (different physical samples representing the same class for intraclass variability) or have few surface categories, and all are obtained from indoor measurements where the sample is removed from the scene. More recent datasets capture materials and texture in-scene (a.k.a. in-situ, or in-the-wild). A motivation for moving to in-scene capture is to build algorithms and methods that are more relevant to real-world applications. These recent databases are internet-mined and contain a single view of the scene under a single illumination direction. Examples include the Flickr Materials Database by Sharan et al. and the Material in Context Database by Bell et al. Recently, DeGol et al. released the GeoMat Database with 19 material categories from outdoor sites; each category has between 3 and 26 physical surface instances, with 8 to 12 viewpoints per surface. The viewpoints in this dataset are irregularly sampled in angle space.
We present a new measurement method called differential angular imaging where a surface is imaged from a particular viewing angle $v$ and then from an additional viewpoint $v + \delta$. The motivation for this differential change in viewpoint is improved computation of the angular gradient of intensity $\partial I_v / \partial v$. Intensity gradients are the basic building block of image features and it is well known that discrete approximations to derivatives have limitations. In particular, spatial gradients of intensities for an image $I$ are approximated by $I(x + \Delta) - I(x)$, and this approximation is most reasonable at low spatial frequencies and when $\Delta$ is small. For angular gradients of reflectance, the discrete approximation to the derivative is a subtraction with respect to the viewing angle. Angular gradients are approximated by $I_{v+\delta} - I_v$, and this approximation requires a small $\delta$. Consequently, differential angular imaging provides more accurate angular gradients.
The differential images as shown in Figures 1 and 2 have several characteristics. First, the differential image reveals the gradients in BRDF/BTF at the particular viewpoint. Second, relief texture is also observable in the differential image due to non-planar surface structure. Finally, the differential images are sparse. This sparsity can provide a computational advantage within the network. (Note that the two images are aligned with a global affine transformation before subtraction.)
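The subtraction step and the sparsity property can be illustrated with a minimal sketch. It assumes the two views are already aligned, represents grayscale images as nested lists, and takes the difference as the second view minus the first; the function names and the sign convention are illustrative assumptions, not taken from the released code.

```python
def differential_image(img_v, img_v_delta):
    """Pixelwise difference of two aligned grayscale images (lists of rows).

    Assumption: the images were aligned beforehand (e.g. by a global affine
    transformation) and the difference is taken as second view minus first.
    """
    same_rows = len(img_v) == len(img_v_delta)
    if not same_rows or any(len(a) != len(b) for a, b in zip(img_v, img_v_delta)):
        raise ValueError("images must have the same shape")
    return [[b - a for a, b in zip(row_v, row_d)]
            for row_v, row_d in zip(img_v, img_v_delta)]


def sparsity(diff, tol=0):
    """Fraction of pixels whose absolute difference is at most `tol`."""
    total = sum(len(row) for row in diff)
    near_zero = sum(1 for row in diff for x in row if abs(x) <= tol)
    return near_zero / total


# Toy 3x3 example: only two pixels change between the two viewpoints.
img_v = [[10, 10, 10], [10, 50, 10], [10, 10, 10]]
img_v_delta = [[10, 10, 10], [10, 42, 10], [10, 10, 12]]
diff = differential_image(img_v, img_v_delta)
print(diff, sparsity(diff))
```

In this toy case seven of the nine difference pixels are zero, which mirrors the sparsity noted above.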
We collect the GTOS database, a first-of-its-kind in-scene material reflectance database, to investigate the use of spatial and angular reflectance information of outdoor ground terrain for material recognition. We capture reflectance systematically by imaging a set of viewing angles comprising a partial BRDF with a mobile exploration robot. Differential angular images are obtained by measuring each base angle together with a small differential angle variation, resulting in 18 viewing directions per sample as shown in Figure 3 (b). Example surface classes are depicted in Figure 3 (a). The class names are (in order of top-left to bottom-right): cement, asphalt, painted asphalt, brick, soil, muddy stone, mud, mud-puddle, grass, dry leaves, leaves, asphalt-puddle, mulch, metal grating, plastic, sand, stone, artificial turf, aluminum, limestone, painted turf, pebbles, roots, moss, loose asphalt-stone, asphalt-stone, cloth, paper, plastic cover, shale, painted cover, stone-brick, sandpaper, steel, dry grass, rusty cover, glass, stone-cement, icy mud, and snow. The surface classes mostly have between 4 and 14 instances (samples of intra-class variability) and each instance is imaged not only under multiple viewing directions but also under multiple natural light illumination conditions. As illustrated in Figure 1, sample appearance depends on the weather condition and the time of day. To capture this variation, we image the same region under different weather conditions (cloudy dry, cloudy wet, sunny morning, and sunny afternoon). We capture the samples with 3 different exposure times to enable high dynamic range imaging. Additionally, we image a mirrored sphere to capture the environment lighting of the natural sky. In addition to surface images, we capture a scene image to show the global context. The robot measurement device is depicted in Figure 4.
Although the database measurements were obtained with robotic positioning for precise angular measurements, our recognition results are based on subsets of these measurements so that an articulated arm would not be required for an in-field system. The total number of surface images in the database is 34,243. As shown in Table 1, this is the most extensive outdoor in-scene multiview material database to date.
Consider the problem of in-scene material recognition with images from multiple viewing directions (multiview). We develop a two-stream convolutional neural network to fully leverage differential angular imaging for material recognition. The differential image sparsely encodes reflectance angular gradients as well as surface relief texture. The spatial variation of image intensity remains an important recognition cue and so our method integrates these two streams of information. A CNN is used on both streams of the network and the outputs are then combined for the final prediction result. The combination method and the layer at which the combination takes place lead to variations of the architecture.
We employ the ImageNet pre-trained VGG-M model as the prediction unit (labeled CNN in Figure 5). The first input branch is the image at a specific viewing direction. The second input branch is the differential image. The first method of combination, shown in Figure 5 (a), is a simple averaging of the output prediction vectors obtained by the two branches. The second method combines the two branches at the intermediate layers of the CNN, i.e. the feature maps output at an intermediate layer are combined and passed forward to the higher layers of the CNN, as shown in Figure 5 (b). We empirically find that combining feature maps generated by the Conv5 layer after ReLU performs best. A third method (see Figure 5 (c)) is a hybrid of the two architectures that preserves the original CNN path for the original image by combining the layer feature maps for both streams and by combining the prediction outputs for both streams. This approach is the best performing architecture of the three methods and we call it the differential angular imaging network (DAIN).
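The prediction-level combination used in the first and third methods can be sketched as follows. This is only an illustration of averaging two prediction vectors; the function names and the toy score values are hypothetical, and the actual network outputs and class count are not reproduced here.

```python
def average_predictions(pred_a, pred_b):
    """Average two per-class prediction vectors of equal length."""
    if len(pred_a) != len(pred_b):
        raise ValueError("prediction vectors must have the same length")
    return [(a + b) / 2 for a, b in zip(pred_a, pred_b)]


def predict_class(scores):
    """Index of the highest score (first index wins on ties)."""
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical scores over three classes from the two paths of the network.
pred_image_stream = [0.5, 0.25, 0.25]
pred_second_stream = [0.0, 0.75, 0.25]
fused = average_predictions(pred_image_stream, pred_second_stream)
print(fused, predict_class(fused))
```

In the hybrid design described above, the second vector would come from the path that carries the combined feature maps, while the first comes from the preserved original-image path.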
For combining feature maps at layer $l$, consider feature maps $x^a$ and $x^b$ from the two branches that have width $W$, height $H$, and feature channel depth $D$. The output feature map $y$ will have the same dimensions $W \times H \times D$. We can combine feature maps by: (1) Sum: pointwise sum of $x^a$ and $x^b$, and (2) Max: pointwise maximum of $x^a$ and $x^b$. In Section 6 we evaluate the performance of these methods of combining lower layer feature maps.
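The Sum and Max rules can be written as a short sketch on nested lists indexed as height, width, channel. The function name `combine_feature_maps` and the toy values are assumptions for illustration; a real implementation would operate on framework tensors.

```python
def combine_feature_maps(map_a, map_b, method="sum"):
    """Pointwise combination of two feature maps with identical shape.

    Feature maps are nested lists indexed as [row][column][channel].
    """
    ops = {"sum": lambda x, y: x + y, "max": max}
    if method not in ops:
        raise ValueError("method must be 'sum' or 'max'")
    op = ops[method]
    if len(map_a) != len(map_b):
        raise ValueError("feature maps must have the same height")
    out = []
    for row_a, row_b in zip(map_a, map_b):
        if len(row_a) != len(row_b):
            raise ValueError("feature maps must have the same width")
        out_row = []
        for px_a, px_b in zip(row_a, row_b):
            if len(px_a) != len(px_b):
                raise ValueError("feature maps must have the same depth")
            out_row.append([op(x, y) for x, y in zip(px_a, px_b)])
        out.append(out_row)
    return out


# Toy maps with height 1, width 2, depth 2.
map_a = [[[1.0, 0.0], [2.0, 3.0]]]
map_b = [[[0.5, 4.0], [2.0, 1.0]]]
summed = combine_feature_maps(map_a, map_b, "sum")
maxed = combine_feature_maps(map_a, map_b, "max")
print(summed, maxed)
```

Both rules keep the output at the same width, height, and depth as the inputs, so the combined map can be passed to the higher layers unchanged.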
Our GTOS database has multiple viewing directions on an arc (a partial BRDF sampling) as well as differential images for each viewing direction. We evaluate our recognition network in two modes: (1) Single view DAIN, with inputs consisting of the image and the differential image at a single viewing angle; (2) Multi view DAIN, with inputs consisting of the images and differential images at several viewing angles. For our GTOS database, these are viewing angles separated by a fixed angular step, representing a range of viewing angles. We empirically determine that four viewpoints are sufficient for recognition. For a baseline comparison we also consider non-differential versions: Single View, with only the image for a single viewing direction, and Multi View, with the images from several viewing directions.
To incorporate multi view information in DAIN we use three methods: (1) voting (use the predictions from each view to vote), (2) pooling (pointwise maximum of the combined feature maps across viewpoints), (3) 3D filter + pooling (following prior work, use a learned filter bank to convolve the multi view feature maps). See Figure 6. After 3D filtering, pooling is used (pointwise maximum across viewpoints). The computational expense of this third method is significantly higher due to learning the filter weights.
In this section, we evaluate the DAIN framework for material recognition and compare the results on GTOS with several state-of-the-art algorithms. The first evaluation determines which structure of the two stream networks from Figure 5 works best on the GTOS dataset, leading to the choice in (c) as the DAIN architecture. The second evaluation considers recognition performance with different variations of DAIN recognition. The third experimental evaluation compares against three other state-of-the-art approaches on our GTOS dataset, concluding that multiview DAIN works best. Finally, we apply DAIN to a lightfield dataset to show performance on another multiview material dataset.
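The first two multi view rules can be sketched as below. Voting is shown as a majority vote over per-view predicted classes, which is one plausible reading of the description (ties resolved toward the smallest class index is an added assumption); pooling is a pointwise maximum over flattened per-view feature vectors. All names and values are hypothetical.

```python
from collections import Counter


def vote(per_view_scores):
    """Majority vote over the per-view predicted classes.

    Assumption: ties are broken in favor of the smallest class index.
    """
    predictions = [max(range(len(s)), key=s.__getitem__) for s in per_view_scores]
    counts = Counter(predictions)
    top = max(counts.values())
    return min(cls for cls, n in counts.items() if n == top)


def pool_views(per_view_features):
    """Pointwise maximum across viewpoints of equal-length feature vectors."""
    lengths = {len(f) for f in per_view_features}
    if len(lengths) != 1:
        raise ValueError("all views must have the same feature length")
    return [max(values) for values in zip(*per_view_features)]


# Four views, three classes: views 1 and 3 agree on class 1.
scores = [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.2, 0.7]]
voted = vote(scores)

# Three views with flattened three-element feature vectors.
features = [[1, 5, 2], [3, 0, 2], [2, 2, 9]]
pooled = pool_views(features)
print(voted, pooled)
```

The third method would apply a learned filter bank across the view dimension before the same pointwise maximum, which is where the extra training cost arises.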
We design 5 training and testing splits by assigning about 70% of the ground terrain surfaces of each class to training and the remaining 30% to testing. Note that, to ensure that there is no overlap between training and testing sets, if one sample is in the training set, all views and illumination conditions for that sample are in the training set.
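A split that keeps every physical sample entirely on one side can be sketched as follows. The function name, the sample identifiers, and the rounding rule are assumptions for illustration; the released splits may have been constructed differently.

```python
import random


def split_by_sample(samples_by_class, train_fraction=0.7, seed=0):
    """Assign whole physical samples to train or test, class by class.

    Because the split is made on sample IDs, all views and illumination
    conditions of a sample land on the same side of the split.
    """
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(samples_by_class):
        ids = sorted(samples_by_class[cls])
        rng.shuffle(ids)
        n_train = max(1, round(train_fraction * len(ids)))
        if len(ids) > 1:
            n_train = min(n_train, len(ids) - 1)  # keep at least one test sample
        train.extend(ids[:n_train])
        test.extend(ids[n_train:])
    return train, test


# Hypothetical sample IDs for two classes.
samples = {
    "asphalt": ["asphalt_%d" % i for i in range(10)],
    "grass": ["grass_%d" % i for i in range(4)],
}
train_ids, test_ids = split_by_sample(samples, seed=1)
print(len(train_ids), len(test_ids))
```

Images are then assigned to a set by looking up the sample they belong to; different seeds give different splits.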
Each input image from our GTOS database is resized to 240 × 240. Before training a two branch network, we first fine-tune the VGG-M model separately with original and differential images with batch size 196, dropout rate 0.5, and momentum 0.9. We employ an augmentation method that stretches training images horizontally and vertically, with optional horizontal mirror flips applied with 50% probability. The images are randomly cropped into 224 × 224 material patches. All images are pre-processed by subtracting a per color channel mean and normalizing for unit variance. The learning rate for the last fully connected layer is set to 10 times that of the other layers. We first fine-tune only the last fully connected layer for 5 epochs; then we fine-tune all the fully connected layers for 5 epochs. Finally we fine-tune all the layers, decreasing the learning rate by a factor of 0.1 when the training accuracy saturates. Since the snow class only has 2 samples, we omit it from experiments.
For the two branch network, we employ the fine-tuned two-branch VGG-M model with batch size 64 and a learning rate that is reduced by a factor of 0.1 when the training accuracy saturates. We augment training data by randomly stretching training images horizontally and vertically, and also by horizontal mirror flips. The images are randomly cropped to 224 × 224 material patches. We first backpropagate only to the feature map combination layer for 3 epochs, then fine-tune all layers. We employ the same augmentation method for the multiview images of each material surface. We randomly select the first viewpoint image, then the subsequent viewpoint images are selected for experiments.
Table 2 shows the mean classification accuracy of the three different two-branch combination methods depicted in Figure 5. Inputs are single view images and single view differential images. Combining the two streams at the final prediction layer (77% accuracy) is compared with the intermediate layer combination (74.8%) and the hybrid approach in Figure 5 (c) (79.4%), which we choose as the differential angular imaging network. The combination method used is Sum and the feature maps are obtained from Conv5 layers after ReLU.
We evaluate DAIN recognition performance for single view input (and differential image) and for multiview input from the GTOS database. Additionally, we compare the results to recognition using a standard CNN without a differential image stream. For all multiview experimental results we use four viewpoints, with the starting viewpoint chosen at random (and the corresponding differential input). Table 3 shows the resulting recognition rates (with standard deviation over 5 splits shown as a subscript). The first three rows show the accuracy without differential angular imaging, using both single view and multiview input. Notice that the recognition performance for these non-DAIN results is generally lower than the DAIN recognition rates in the rest of the table. The middle three rows show the recognition results for single view DAIN. For combining feature maps we evaluate both Sum and Max, which have comparable results. Notice that single view DAIN achieves better recognition accuracy than multiview CNN with voting (79.4% vs. 78.1%). This is an important result indicating the power of using the differential image. Instead of four separated viewpoints, a single viewpoint and its differential image achieve better recognition. These results provide design cues for building imaging systems tailored to material recognition. We also evaluate whether using inputs from the two viewpoints directly (i.e. the two images themselves) is comparable to using one image and the differential image. Interestingly, the differential image as input has an advantage (79.4% over 77.5%). The last three rows of Table 3 show that recognition performance using multiview DAIN beats the performance of both single view DAIN and CNN methods with no differential image stream. We evaluate different ways to combine the multiview image set including voting, pooling, and the 3D filter+pooling illustrated in Figure 6.
The CNN module of our DAIN network can be replaced by other state-of-the-art deep learning methods to further improve results. To demonstrate this, we change the CNN module in a single view DAIN (Sum) to the ImageNet pre-trained ResNet-50 model on split1. Combining feature maps generated from the Res4 layer (the fourth residual unit) after ReLU with training batch size 196, the recognition rate improves from 77.5% to 83.0%.
Table 4 shows the recognition rates for multiview DAIN, which outperforms three other multi-view classification methods: FV+CNN, FV-N+CNN+N3D, and MVCNN. The table shows recognition rates for a single split of the GTOS database with images resized to 240 × 240. All experiments are based on the same pre-trained VGG-M model. We use the same fine-tuning and training procedure as in the MVCNN experiment. For FV-N+CNN+N3D applied to GTOS, 10 samples (out of 606) failed to yield geometry information with the provided method, and we removed these samples from the experiment. The patch size in the original method is 100 × 100, but the accuracy for this patch size on GTOS was only 43%, so we use 240 × 240. We implement FV-N+CNN+N3D with a linear mapping instead of a homogeneous kernel map for SVM training to save memory with this larger patch size.
We tested our multiview DAIN (Sum + pooling) method on a recent 4D light field (Lytro) dataset. ResNet-50 is used as the CNN module. The recognition accuracy with full images on 5 splits is 83.0%. Note that a subset of the lightfield data is used to mimic the differential imaging process, so these results should not be interpreted as a comparison of our algorithm to prior work on that dataset.
The Lytro dataset has 49 views, from the 7 × 7 lenslet array, where each lenslet corresponds to a different viewing direction. Indexing into this array, we employ four of these viewpoints as the 4 views in multiview DAIN, and four further viewpoints as the corresponding differential views. This is an approximation of multiview DAIN; the lightfield dataset does not capture the range of viewing angles needed to exactly emulate multiple viewpoints and small angle variations of these viewpoints. Instead of using all viewpoints as in prior work, we obtain comparable recognition accuracy with only 8 viewpoints.
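Table 2: Mean classification accuracy (%) of the two-branch combination methods in Figure 5.
| Method | Final Layer Combination | Intermediate Layer Combination | DAIN |
|---|---|---|---|
| Accuracy | 77.0 | 74.8 | 79.4 |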
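Table 3: Recognition accuracy (%) on GTOS for CNN baselines and DAIN variants.
| Method | First input | Second input | Accuracy |
|---|---|---|---|
| single view CNN | image | - | 74.3 |
| multiview CNN, voting | multiview images | - | 78.1 |
| multiview CNN, 3D filter | multiview images | - | 74.8 |
| single view DAIN (Sum) | image | second-viewpoint image | 77.5 |
| single view DAIN (Sum) | image | differential image | 79.4 |
| single view DAIN (Max) | image | differential image | 79.0 |
| multiview DAIN (Sum/voting) | multiview images | multiview differential images | 80.0 |
| multiview DAIN (Sum/pooling) | multiview images | multiview differential images | 81.2 |
| multiview DAIN (3D filter/pooling) | multiview images | multiview differential images | 81.1 |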
In summary, there are three main contributions of this work: 1) Differential Angular Imaging for a sparse spatial distribution of angular gradients that provides key cues for material recognition; 2) The GTOS Dataset with ground terrain imaged by systematic in-scene measurement of partial reflectance instead of in-lab reflectance measurements. The database contains 34,243 images with 40 surface classes, 18 viewing directions, 4 illumination conditions, 3 exposure settings per sample and several instances/samples per class. 3) We develop and evaluate an architecture for using differential angular imaging, showing superior results for differential inputs as compared to original images. Our work in measuring and modeling outdoor surfaces has important implications for applications such as robot navigation (determining control parameters based on current ground terrain) and automatic driving (determining road conditions by partial real time reflectance measurements). We believe our database and methods will provide a sound foundation for in-depth studies on material recognition in the wild.
This work was supported by National Science Foundation award IIS-1421134. A GPU used for this research was donated by the NVIDIA Corporation. Thanks to Di Zhu, Hansi Liu, Lingyi Xu, and Yueyang Chen for help with data collection.
Computer Vision and Pattern Recognition (CVPR), 2015.
Simultaneous estimation of near IR BRDF and fine-scale surface geometry. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.