
Self-Supervised Material and Texture Representation Learning for Remote Sensing Tasks

Self-supervised learning aims to learn image feature representations without the usage of manually annotated labels. It is often used as a precursor step to obtain useful initial network weights which contribute to faster convergence and superior performance of downstream tasks. While self-supervision allows one to reduce the domain gap between supervised and unsupervised learning without the usage of labels, the self-supervised objective still requires a strong inductive bias toward downstream tasks for effective transfer learning. In this work, we present our material and texture based self-supervision method named MATTER (MATerial and TExture Representation Learning), which is inspired by classical material and texture methods. Material and texture can effectively describe any surface, including its tactile properties, color, and specularity. By extension, effective representation of material and texture can describe other semantic classes strongly associated with said material and texture. MATTER leverages multi-temporal, spatially aligned remote sensing imagery over unchanged regions to learn invariance to illumination and viewing angle as a mechanism to achieve consistency of material and texture representation. We show that our self-supervision pre-training method allows for up to 24.22% and 6.33% performance increase, as well as faster convergence, on change detection, land cover classification, and semantic segmentation tasks.



1 Introduction

Automated understanding of remote sensing imagery has been a long-standing goal of the computer vision community. Its broad applicability has driven research and development in construction phase detection

[cohen2016rapid], infrastructure mapping [duan2016towards, lafarge2006automatic, voigt2016global, nayak2002use], land use monitoring [foody2003remote], post natural disaster damage assessment [skakun2014flood, van2000remote, gillespie2007assessment], urban 3D reconstruction [facciolo2017automatic, leotta2019urban], population migration prediction [chen2020nighttime], and climate change tracking [rolnick2019tackling]. Most of those methods require some degree of annotation effort, which is often expensive and/or time consuming. Satellite imagery is increasingly plentiful and accessible, with hundreds of satellites collecting images on a daily basis [wmo_oscar_list_of_all_satellites, drusch2012sentinel, torres2012gmes, roy2014landsat]. However, annotating land cover, change, or similar labels often requires domain knowledge and/or extreme attention to detail, as labels in remote sensing imagery cover more numerous and smaller objects seen from unfamiliar viewpoints. As a result, annotators require more domain expertise compared to standard benchmark datasets such as Pascal VOC or MS-COCO [lin2014microsoft].

Recent work in self-supervised learning aims to alleviate the requirement of labeled data by either detecting self-applied transformations, such as color or rotation change, or implicit metadata information, such as temporal order or geographical location. Those objectives are often achieved using contrastive learning methods [he2020momentum, khosla2020supervised, chen2020simple], in which the distance between feature representations of original and transformed images is minimized. More advanced contrastive methods use triplet loss [schultz2004learning, chechik2010large] or quadruplet loss [chen2017beyond], which also include negative examples from which the distance between feature representations is maximized. Despite filling a significant need in the remote sensing domain, these approaches have yet to be thoroughly investigated. Even methods that utilize contrastive approaches, such as SeCo [Manas_2021_ICCV] and the work of Ayush et al. [ayush2021geography], which learn seasonal change invariance or geographic-location consistency, still show weaker transferability to downstream task learning, as demonstrated by the inferior performance and convergence speeds shown in Tabs. 1 and 3.
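To make the triplet objective mentioned above concrete, here is a minimal sketch of a triplet loss on embedding vectors; the margin value and Euclidean distance are illustrative assumptions, not the exact choices of any cited method.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Triplet loss: push the positive at least `margin` closer to the
    anchor than the negative. Inputs are 1-D embedding vectors."""
    d_pos = np.linalg.norm(anchor - positive)   # distance to positive example
    d_neg = np.linalg.norm(anchor - negative)   # distance to negative example
    return max(0.0, d_pos - d_neg + margin)     # hinge on the distance gap
```

The loss is zero once the negative is sufficiently farther from the anchor than the positive, so only "hard" triplets contribute gradients.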

Instead, we hypothesize that material and texture have a strong inductive bias toward most downstream remote sensing tasks, with pre-training of surface representation improving performance and convergence speeds (measured in epochs) for those tasks. Consider the task of change detection in remote sensing imagery: when a semantic class changes (i.e., soil to building, or forest to soil), change can also be expected in materials and texture, demonstrating the high correlation between material and texture and the change detection task. We show the effectiveness of our self-supervised pre-trained features in both raw and fine-tuned forms, obtaining state-of-the-art (SOTA) performance in change detection (unsupervised and fine-tuned), land cover segmentation (fine-tuned), and land cover classification (fine-tuned).

Here, we propose a novel self-supervised material and texture representation learning method which is inspired by classical and modern texton filter banks [leung2001representing, zhu2005textons, shotton2006textonboost]. Textons [leung2001representing, julesz1981textons, malik1999textons] refer to the description of micro-structures in images, often used to describe material and texture consistency [leung2001representing, wang2004hybrid, cula2001compact, dana2018computational, zhang2015reflectance]. Note that the literature has only loosely defined what material, structures, texture, and surface refer to. Here, we define material as any single or combination of elements (soil, concrete, vegetation, etc.) corresponding to some multi-spectral signature, structures as gradients in intensity, texture as the spatial distribution of structures, and surface as the combination of material and texture. Note that we define the physical surface, rather than the geometric or algebraic surface, as described by its material and textural properties. By extension, we aim to jointly describe combinations of materials and textures in a single objective. For example, within a given image patch, a mixture of grass and concrete should be represented differently than patches with grass or concrete separately. In this example, the grass-concrete mixture may be associated with both the grass and concrete material classes. To that end, we learn surface representations that describe the affinity, represented as residuals [jegou2010aggregating], to all pre-defined surface classes, represented as clusters. We achieve this by contrastively learning the similarity between the residuals of multi-temporal, spatially aligned imagery of unchanged regions to obtain consistent material and texture representations, regardless of illumination or viewing angle. This framework acts as a pre-training stage for downstream remote sensing tasks.

Overall, our contributions are: 1) We present a novel material and texture based approach for self-supervised pre-training to generate features with high inductive bias for downstream remote sensing tasks. We propose a texture refinement network to amplify low-level features and adapt residual cluster learning to characterize mixed material and texture patches in a self-supervised, contrastive learning framework. 2) We achieve SOTA performance on unsupervised and supervised change detection, semantic segmentation, and land cover classification using our pre-trained network. 3) We provide our curated multi-temporal, spatially aligned, and atmospherically corrected remote sensing imagery dataset, collected over unchanged regions and used for self-supervised learning.¹ (¹Code and dataset will be released upon publication.)

2 Related Work

2.1 Downstream Remote Sensing Tasks

The main downstream tasks we investigate in this work are change detection, land cover segmentation, and land cover classification. The problem of change detection in satellite imagery has been thoroughly investigated over time [coppin1996digital, peng2019end, chen2020spatial, jiang2020pga, papadomanolaki2019detecting, chen2020dasnet, chen2019deep, sakurada2020weakly]. Notable examples include Daudt et al. [daudt2018fully], which predicts change by minimizing feature differences at every layer of the network from a given image pair input, and Chen et al. [chen2019deep], which utilizes a spatial-temporal attention mechanism to detect anomalies in sequences of images. Land cover segmentation and classification have also seen a surge in interest, with growing repositories of annotated datasets [demir2018deepglobe, sumbul2019bigearthnet, alemohammad2020landcovernet, helber2019eurosat, van2018spacenet] and methods [akiva2021h2o, van2018spacenet, rudner2019multi3net, azimi2019skyscapes, tan2020vecroad, gupta2021rescuenet]. H2O-Net [akiva2021h2o] synthesizes multi-spectral bands and uses self-sampled points to generate pseudo-ground truth for flood and permanent water segmentation. VecRoad [tan2020vecroad] casts the problem of road segmentation as iterative graph exploration. Multi3Net [rudner2019multi3net] learns fusion of multi-temporal, multi-spectral features from high resolution imagery to jointly predict pixels of floods and buildings.

2.2 Self-Supervision

In order to effectively utilize large amounts of unlabeled data, recent methods have focused on obtaining good feature representations without explicit annotation effort. This is done by deriving information from the data itself or learning sub-tasks within data instances without changing the overall objective. The first is often used when high-confidence labels can be obtained and trained on, similar to [ayush2021geography, akiva2021h2o], where the method infers weak supervision about input images through provided meta-data or classical methods. The second, and more common, approach leverages metric learning objectives to learn generalizable features for the same data instance or class. Recent methods involve learning invariance to color and geometric transformations [jing2018self, misra2020self, caron2018deep], temporal ordering [behrmann2021unsupervised, fernando2017self], sub-patch relative location prediction [doersch2015unsupervised], frame interpolation, colorization [zhang2016colorful, deshpande2017learning, larsson2016learning], patch and background filling [wang2021removing], and point cloud reconstruction [xie2020pointcontrast].

Figure 1: (Left) MATTER: anchor, positive, and negative images x^a, x^p, and x^n are densely windowed into crops which are fed to the encoder, producing output features f^a, f^p, and f^n. Crops are also fed to the Texture Refinement Network (shown in blue), which amplifies the activation of low-level features to increase their impact in deeper layers. The encoder's output is then fed to the Surface Residual Encoder to generate patch-wise cumulative residuals, which represent the affinity between the input data and all learned clusters. The residual vector between a feature output and cluster c_k is denoted r_k. The output learned residuals, cluster weights, and number of clusters are denoted e, w, and K respectively.
(Right) Simplified example of the contrastive objective: residuals from learned clusters are extracted and averaged for all crops as representations of the correlation between inputs and all clusters. Best viewed in zoom and color.

More relevant to the remote sensing domain, SeCo [Manas_2021_ICCV] has taken a step toward utilizing the potential in the abundance of satellite imagery by contrastively learning seasonal invariance as a pre-text self-supervision task. It then fine-tunes the pre-trained network on downstream tasks such as change detection and land cover classification. Ayush et al. [ayush2021geography] also proposes a self-supervised approach enforcing geographical-location consistency as a pre-training objective used for downstream tasks such as land-cover segmentation and classification. While both methods show improved results on benchmark datasets when compared to random weights initialization, we show that their inductive bias is still significantly weaker than that of our material and texture consistency based pre-trained weights, which learn an illumination and viewing angle invariance to achieve consistency of material and texture representation.

2.3 Material and Texture Identification

Early material and texture recognition methods relied on hand crafted filter banks, whose combinatorial output are also referred to as textons [leung2001representing], to encode statistical representations of image patches [leung2001representing, cula2001recognition, dana1999reflectance, zhu1998filters, valkealahti1998reduced, brilakis2005material, brilakis2008shape]. Later works investigated the use of clustering and inter-patch statistics as replacement for pre-defined filter banks [varma2003texture, varma2008statistical], at the cost of defining the feature space it operates in. Most notable feature spaces include color intensity [chen2009wld], texture homogeneity [leung2001representing, guo2012discriminative, mao2014active], multi-resolution features [ojala2002multiresolution, sifre2013rotation], and feature curvature [liu2010exploring, sharma2012local]

. More recent, deep learning approaches have translated the problem of texture representation to focus on explicit identification of materials through texture encoding

[zhang2017deep, chen2021deep, zhu2021learning], differential angular imaging [xue2017differential]

, 3D surface variation estimation

[degol2016geometry], auxiliary tactile property [schwartz2013visual], and radiometric properties estimation such as the bidirectional reflectance distribution function (BRDF) [wang2009material, liu2013discriminative, chen2009robust] and the bidirectional texture function (BTF) [weinmann2014material]. Those methods seek to learn low-level features that are key to material classification and segmentation. Some methods choose to add skip connections [lin2017refinenet, yu2018bisenet, zhu2019asymmetric] to supply low-level features in deep layers, and others choose explicit concatenation of texture related information [schwartz2013visual, li2020improving]. Many of these elements are meant to reduce the receptive field or increase impact of low-level features of the network while keeping it sufficiently deep. FV-CNN [cimpoi2015deep] aims to generate texture descriptions of densely sampled windows. Since the features describe regions removed from global spatial information, it explicitly constrains the receptive field of the network to the size of the window. DeepTEN [zhang2017deep] learns residual representations of material images in an end-to-end pipeline using material labels. Our approach combines elements from FV-CNN and DeepTEN in two ways. First, we densely sample windows and refine low level features as receptive field constraints. Then, we contrastively learn implicit surface residual representations without the usage of material labels or auxiliary information. To our knowledge, we are the first to employ self-supervised material and texture based objectives for pre-training steps.

3 Methodology

The goal of MATerial and TExture Representation Learning (MATTER) is to learn a feature extractor that generates illumination and viewing angle invariant material and texture representations from given multi-temporal satellite imagery sampled over unchanged regions. To train our model, we utilize our self-collected dataset described in Sec. 4.1, which samples multi-temporal imagery of rural and remote regions, in which little to no change is assumed between every consecutive pair of sampled images. See Fig. 1 for an overview of our approach.

Given an anchor reference image x^a sampled over an unchanged region, we obtain a positive, temporally succeeding image x^p over the same region, and a negative image x^n sampled over a different region, where C, H, and W correspond to the number of channel bands, height, and width of the input images. We tile all images into equally sized, corresponding patches of size c×c, with spatially aligned reference and positive patches, p^a and p^p, and negative patches, p^n, randomly sampled from regions other than the reference region. The usage of densely sampled crops aims to restrict the receptive field by removing features from the global spatial context, and to prevent the model from learning higher-level features ineffective in describing surfaces. We study the effects of receptive-field variation in Sec. 5.1.
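The dense tiling step described above can be sketched as follows; the non-overlapping layout and divisibility of H and W by the crop size are simplifying assumptions for illustration.

```python
import numpy as np

def tile_patches(image, c):
    """Tile a (C, H, W) image into non-overlapping (C, c, c) patches in
    row-major order. Assumes H and W are divisible by c for simplicity."""
    C, H, W = image.shape
    return (image
            .reshape(C, H // c, c, W // c, c)   # split both spatial axes
            .transpose(1, 3, 0, 2, 4)           # (rows, cols, C, c, c)
            .reshape(-1, C, c, c))              # flatten patch grid
```

Because anchor and positive images are spatially aligned, tiling both with the same `c` yields corresponding patch pairs by index.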

To learn material- and texture-centric features, we present the Texture Refinement Network (TeRN) (Sec. 3.1) and the patch-wise Surface Residual Encoder (Sec. 3.2). TeRN aims to amplify the activation of lower-level features essential for texture representation (as seen in Fig. 3), and the Surface Residual Encoder is our patch-wise adaptation of Deep-TEN [zhang2017deep] to learn surface-based residual representations. We train our network to minimize the feature distance of positive patch pairs, p^a and p^p, and maximize the feature distance of negative patch pairs, p^a and p^n, where the features are the learned residual representations. For our learning objective, we use the Noise Contrastive Estimation loss

L = -log( exp(sim(e^a, e^p)/τ) / Σ_{p_i ∈ P} exp(sim(e^a, e_i)/τ) ),

where e_i is the output feature of input patch p_i, sim(·,·) is a similarity measure, τ is a temperature parameter, and P is the set of positive and negative patches.
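The contrastive objective can be sketched in a few lines; using cosine similarity and this particular temperature value are assumptions for illustration, not necessarily the authors' exact choices.

```python
import numpy as np

def info_nce(anchor, positive, negatives, tau=0.1):
    """Noise Contrastive Estimation loss for one anchor feature (a sketch).
    The candidate set is the positive plus all negatives, as in the text."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Similarity of the anchor to the positive (index 0) and all negatives.
    logits = np.array([cos(anchor, positive)] +
                      [cos(anchor, n) for n in negatives]) / tau
    logits -= logits.max()                      # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                    # positive sits at index 0
```

Minimizing this loss pulls the anchor's residual representation toward the positive patch and away from the negatives, which is exactly the multi-temporal consistency objective described above.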

Figure 2: The Texture Refinement Network (TeRN) assigns convolution weights based on the cosine similarity between the kernel's center pixel and its neighbors, divided by the standard deviation of that kernel. We then convolve the features to refine the low-level features essential for texture- and material-centric learning. The symbols ⊛ and ⊙ correspond to the convolution and element-wise multiplication operations. Best viewed in zoom.

3.1 Texture Refinement Network

Capturing texture details is difficult in low-resolution images, and is especially challenging for satellite images with low contrast. As a result, texture is less visible and has less impact on the final extracted features. We address this challenge by using our Texture Refinement Network (TeRN) to refine lower-level texture features and increase their impact in deeper layers. TeRN utilizes the recently introduced pixel-adaptive convolution layer [su2019pixel], in which the convolution kernel weights are a function of the features locally confined by the kernel. Here, our kernel considers the corresponding local pixels in the original image as follows: given a kernel k centered at location (i, j), we calculate the cosine similarity between the pixel x_ij and each of its neighboring pixels x_mn within k. The output matrix is then divided by the squared standard deviation of all pixels within k, denoted σ_k²:

w_mn = sim(x_ij, x_mn) / σ_k²,

where sim(·,·) is the cosine similarity. The output matrix of those operations describes both the similarity of the center pixel to its surroundings and the intensity gradients within the kernel. As previously defined, texture is the spatial distribution of structures, which are represented as intensity gradients. Since we want to emphasize texture, we explicitly polarize the feature activation in regions with high variance or low similarity, with kernel weights decreasing under high variance and/or low cosine similarity. When convolved over our low-level features, this highlights edges and encourages representation consistency for pixels with similar material signatures, as seen in Fig. 3. The described operation constitutes a single kernel location of a single refinement layer. We define a single refinement layer as the repetition of this operation over all image locations, and construct an n-layer refinement network in which each layer may use different kernel sizes, dilations, and strides. Since the network has deterministically defined weights, it has no learned parameters. A base TeRN kernel, and its integration in the overall network, are visually depicted in Fig. 2, and sample refined features in Fig. 3.
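A single kernel location of this deterministic weighting can be sketched as below; treating the standard deviation as one scalar per patch, and the small epsilon guards, are our assumptions for a runnable illustration.

```python
import numpy as np

def tern_kernel_weights(patch):
    """Deterministic TeRN-style weights for one kernel location (a sketch).

    `patch` is a (k, k, C) neighborhood from the original image. Each weight
    is the cosine similarity of a neighbor to the center pixel, divided by
    the squared standard deviation of the patch, so weights shrink in
    high-variance or dissimilar regions.
    """
    k = patch.shape[0]
    center = patch[k // 2, k // 2]
    norms = np.linalg.norm(patch, axis=-1) * np.linalg.norm(center) + 1e-8
    cos = patch @ center / norms                 # cosine similarity map
    return cos / (patch.std() ** 2 + 1e-8)       # polarize by local variance
```

Sliding this computation over every location, with varying kernel sizes and dilations per layer, yields the parameter-free refinement layers described in the text.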

Figure 3: Qualitative results of our Texture Refinement Network (TeRN); panels show the input image, raw features, and refined features. Similarly textured pixels obtain similar feature-activation intensity in the refined output. Notice how the building in the second row obtains similar activation throughout the concrete building pixel locations compared to the raw features output. Best viewed in zoom and color.

3.2 Learning Consistency of Surface Residuals

The task of residual encoding is tightly related to classical k-means clustering [lloyd1982least] and bag-of-words [joachims1998text], in which a hard cluster assignment is learned based on a data instance's proximity to cluster centers. Given the cluster centers, the residual is calculated as the distance of any data instance from its corresponding cluster center. In practice, we can use the residual to measure how similar a given data instance is to its assigned cluster, and to all other clusters. Our method adapts the work presented in Deep-TEN [zhang2017deep] to learn patch-wise residual encodings without explicit hand-crafted clustering through a differentiable pipeline. Traditionally, in Deep-TEN [zhang2017deep] and other classical and deep clustering methods [cheng1995mean, macqueen1967some, he2020momentum, caron2018deep, chen2020simple], the objective is to cluster image-wise inputs to corresponding class-wise cluster centers. In contrast, we employ a patch-wise approach. A given patch containing some material and texture may be associated with multiple clusters (e.g., if a patch captures multiple material elements), so it requires a soft representation depicting an affinity to all learned clusters, not only to its closest cluster.

Consequently, we learn residuals of small patches and enforce multi-temporal consistency between corresponding patch residuals, imposing similarity of cluster affinity. Given an output feature vector f_i for some crop i, and a set of learned cluster centers {c_1, ..., c_K}, we can find the residual corresponding to the feature vector and learned cluster center c_k using r_ik = f_i - c_k. We repeat this for all cluster centers and take the weighted average of residuals from each cluster to obtain the cumulative residual vector,

e_i = Σ_{k=1}^{K} w_k r_ik,

with learned cluster weights w_k. By combining the residuals of a given crop, we represent its affinity with all learned clusters. When maximizing or minimizing similarity between residuals, we effectively enforce a consistent cluster affinity between input crops.
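The cumulative residual computation can be sketched as follows; in MATTER the centers and weights are learned end-to-end, so the fixed arrays here are stand-ins for illustration.

```python
import numpy as np

def cumulative_residual(f, centers, weights):
    """Cumulative surface residual for one crop (a sketch of the idea).

    f: (D,) crop feature; centers: (K, D) cluster centers; weights: (K,)
    learned cluster weights. Returns the weighted average of the
    per-cluster residuals r_k = f - c_k.
    """
    residuals = f[None, :] - centers          # (K, D): one residual per cluster
    w = weights / weights.sum()               # normalize the learned weights
    return (w[:, None] * residuals).sum(0)    # (D,) cumulative residual
```

Because the output mixes residuals to every center, two crops with the same mixture of materials produce similar cumulative residuals even if neither sits close to a single cluster, which is the soft-affinity property motivated above.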

4 Experiments

4.1 Self-Supervised Pre-Training

Pre-Training Dataset.

To train our self-supervised task, we collect a large amount of freely available, orthorectified, atmospherically corrected Sentinel-2 imagery of regions with limited human development. Regions of interest were manually selected to cover a variety of climates. Given spatial and temporal ranges, we use the PyStac library [pystac] to fetch imagery from the AWS Sentinel-2 catalog closest to our points of interest. Imagery within the spatial-temporal constraints containing over 20% cloud cover and less than 80% data coverage was removed. A maximum of 100 images meeting these constraints were collected per region. The collected images were divided into 14,857 tiles of size 1096×1096 for training. The resultant dataset contains 27 regions of interest spanning 1217 km² over three years. We provide all points of interest (Lat., Long.) in the supplementary material, and will release the dataset upon publication.

Dataset: OSCD [8518015]

Method                                   | Sup. | Precision (%) | Recall (%) | F-1 (%)
Full Supervision
U-Net [ronneberger2015u] (random)        |  F   | 70.53 | 19.17 | 29.44
U-Net [ronneberger2015u] (ImageNet)      |  F   | 70.42 | 25.12 | 36.20
MoCo-v2 [he2020momentum]                 | S+F  | 64.49 | 30.94 | 40.71
SeCo [Manas_2021_ICCV]                   | S+F  | 65.47 | 38.06 | 46.94
DeepLab-v3 [chen2017deeplab] (ImageNet)  |  F   | 51.63 | 51.06 | 53.04
Ours (fine-tuned)                        | S+F  | 61.80 | 57.13 | 59.37
Self-Supervision only
VCA [malila1980change]                   |  -   |  9.92 | 20.77 | 13.43
MoCo-v2 [he2020momentum]                 |  S   | 29.21 | 11.92 | 16.93
SeCo [Manas_2021_ICCV]                   |  S   | 74.70 | 15.20 | 25.26
Ours                                     |  S   | 37.52 | 72.65 | 49.48

Table 1: Precision, recall, and F-1 (%) accuracies (higher is better) of the "change" class on the Onera Satellite Change Detection (OSCD) dataset validation set [8518015]. F and S represent full and self-supervision respectively; S+F refers to self-supervised pre-training followed by fully supervised fine-tuning. Random and ImageNet denote the type of backbone weight initialization the method uses.

Implementation Details.

We adopt a standard ResNet-34 backbone, with TeRN inserted after the first layer, and the Surface Residual Encoder as the output layer. TeRN is constructed with 10 blocks, each containing three layers of kernel size and dilations of 1-1-2. For the Surface Residual Encoder, we use . We use training patch size of , batch size of 32, learning rate of , momentum of , and weight decay of for training. For the Noise Contrastive Estimation loss, we use a temperature scaling of . We pre-train the network for 110,000 iterations or until convergence. Note that the self-supervised baselines SeCo [Manas_2021_ICCV] and Ayush et al. [ayush2021geography] use 1 million and 543,435 images respectively for pre-training, while we use only 14,857 images.

4.2 Change Detection

Implementation Details.

This task is evaluated on the Onera Satellite Change Detection (OSCD) dataset [8518015] in two ways: self-supervised, and supervised fine-tuning. The self-supervised approach utilizes only the pre-trained backbone to extract patch-wise residual features from both images, with each patch representing its center pixel. We calculate the Euclidean distance between corresponding residual features as a change metric, which is thresholded using Otsu thresholding [otsu1979threshold] to predict change pixels where the residual distance is large. For the fine-tuned approach, we feed image-wise inputs to a DeepLab-v3 [chen2017deeplab] network with skip connections and our pre-trained backbone, fine-tuning the decoder for 30 epochs while freezing the backbone's weights. We use channel-wise concatenations of image pairs as input to the network, with the output features optimized using the cross-entropy loss and ground truth change masks. For evaluation, we report precision, recall, and F-1 score of the "change" class in Tab. 1. We use batch-size of , learning rate of , momentum of , and weight decay of . For the self-supervised baseline methods, we use the publicly available model weights and follow the same self-supervised change-prediction pipeline described above. The fully supervised baselines follow the same steps as our fine-tuned approach, without the pre-trained weight initialization.
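The unsupervised change-detection step above (residual distance plus Otsu thresholding) can be sketched end-to-end; the feature arrays here stand in for MATTER's patch-wise residual representations, and the histogram-based Otsu implementation is a standard one, not the authors' code.

```python
import numpy as np

def otsu_threshold(values, bins=256):
    """Otsu's method: pick the threshold maximizing between-class variance."""
    hist, edges = np.histogram(values, bins=bins)
    p = hist.astype(float) / hist.sum()
    centers = 0.5 * (edges[:-1] + edges[1:])
    w0 = np.cumsum(p)                         # probability of the low class
    w1 = 1.0 - w0                             # probability of the high class
    mu = np.cumsum(p * centers)               # cumulative class mean
    mu_t = mu[-1]                             # total mean
    with np.errstate(divide="ignore", invalid="ignore"):
        var_between = (mu_t * w0 - mu) ** 2 / (w0 * w1)
    return centers[np.nanargmax(var_between)]

def detect_change(feat_t0, feat_t1):
    """Binary change map from two (H, W, D) feature grids: per-location
    Euclidean distance between dates, thresholded with Otsu."""
    dist = np.linalg.norm(feat_t0 - feat_t1, axis=-1)
    return dist > otsu_threshold(dist.ravel())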

Results Discussion.

In Tab. 1 and Fig. 6 we compare our method to SOTA baselines for both self-supervised and fine-tuned approaches. We present common semantic segmentation networks initialized with weights that are random, or pre-trained with ImageNet [krizhevsky2012imagenet], MoCo-v2 [he2020momentum], or SeCo [Manas_2021_ICCV]. We hypothesized that change in material and texture corresponds to actual change in the scene; hence, by learning good material and texture representations and comparing representations of image pairs, we can reliably locate regions of change. As evidenced by Tab. 1, our self-supervised approach learns sufficiently good material and texture representations to outperform other fine-tuned methods, surpassing self-supervised SeCo by 24.22% and fine-tuned SeCo by 2.08%. When considering our fine-tuned method, we outperform our baselines even further, with a 12.43% performance increase over our self-supervision-based baseline and a 6.33% performance increase over the fully supervised baseline. Additionally, we show that the inductive bias of material and texture representation toward the task of change detection is significant, as evidenced by the quicker convergence speed (measured in epochs): our method converges within only 30 epochs, compared to the 100 epochs reported by SeCo.

Dataset: BigEarthNet [sumbul2019bigearthnet]

Method                               | Sup. | Fine-Tune Epochs | mAP (%)
Inception-v2 [szegedy2016rethinking] |  F   |  -  | 48.23
InDomain [neumann2019domain]         | S+F  |  90 | 69.70
S-CNN [sumbul2019bigearthnet]        |  F   |  -  | 69.93
ResNet-50 [he2016deep] (random)      |  F   |  -  | 78.98
ResNet-50 [he2016deep] (ImageNet)    |  F   |  -  | 86.74
MoCo-v2 [he2020momentum]             | S+F  | 100 | 86.05
SeCo [Manas_2021_ICCV]               | S+F  | 100 | 87.81
Ours (fine-tuned)                    | S+F  |  24 | 87.98

Table 2: Mean average precision (higher is better) on the BigEarthNet land cover multi-label classification dataset validation set [sumbul2019bigearthnet]. F and S represent full and self-supervision respectively; S+F refers to self-supervised pre-training followed by fully supervised fine-tuning.

4.3 Land Cover Classification

Implementation Details.

We evaluate our pre-trained backbone on the BigEarthNet [sumbul2019bigearthnet] dataset for the task of multi-label land cover classification. The dataset provides 590,326 multi-spectral images annotated with multiple land-cover labels, split into train and validation sets (95%/5%). We fine-tune a classifier head added to our frozen pre-trained backbone network for 24 epochs using the given ground truth labels. We use an SGD optimizer, batch-size of , learning rate of , momentum of , and weight decay of . For performance, we report the mean average precision over all 19 classes.

Results Discussion.

Tab. 2 reports the mean average precision of the baselines and our method after fine-tuning. While our method outperforms the best baseline by only 0.18%, it converges within 24 epochs, significantly faster than our best-performing baseline, which reports convergence within 100 epochs.

Dataset: SpaceNet [van2018spacenet]

Method                                  | Sup. | Fine-Tune Epochs | mIoU (%)
DeepLab-v3 [chen2017deeplab] (random)   |  F   |  -  | 69.44
DeepLab-v3 [chen2017deeplab] (ImageNet) |  F   |  -  | 72.22
MoCo-v2 [he2020momentum]                | S+F  | 100 | 78.05
Ayush et al. [ayush2021geography]       | S+F  | 100 | 78.51
Ours (fine-tuned)                       | S+F  |  24 | 81.12

Table 3: Mean intersection over union (higher is better) on the SpaceNet building segmentation dataset validation set [van2018spacenet]. F and S represent full and self-supervision respectively; S+F refers to self-supervised pre-training followed by fully supervised fine-tuning. Random and ImageNet denote the type of backbone weight initialization the method uses.
Figure 4: Qualitative results of our method on the SpaceNet dataset [van2018spacenet]; panels show the input image, ground truth, DeepLab [chen2017deeplab], and ours. Cyan, magenta, gray, and red colors represent true positive, false positive, true negative, and false negative respectively. Best viewed in zoom and color.

4.4 Semantic Segmentation

Implementation Details.

We use the SpaceNet building segmentation dataset for this task. The dataset provides 10,593 multi-spectral images labeled with pixel-wise building/no-building masks, split into train and validation sets (90%/10%). We use a DeepLab-v3 [chen2017deeplab] network with skip connections and our frozen pre-trained backbone, and fine-tune it for 24 epochs with batch-size of , learning rate of , momentum of , and weight decay of . We report the mean intersection over union (mIoU) of the best-performing epoch in Tab. 3. The fully supervised baselines follow the same steps as our fine-tuned approach, without the pre-trained weight initialization.
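For reference, the mIoU metric reported here can be sketched in a few lines; this is the standard class-averaged definition, not the authors' evaluation code.

```python
import numpy as np

def mean_iou(pred, gt, num_classes=2):
    """Mean intersection-over-union between integer label maps,
    averaged over the classes present in the union."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:                      # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))
```

For the binary building/no-building task, `num_classes=2` averages the building and background IoUs.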

Results Discussion.

Tab. 3 and Fig. 4 compare the quantitative and qualitative results of the baselines and our method. As baselines, we report Ayush et al. [ayush2021geography] and MoCo-v2 [he2020momentum], which use PSANet [zhao2018psanet] with backbones pre-trained on a geography consistency objective. We also report the performance of DeepLab-v3 initialized with random and ImageNet [krizhevsky2012imagenet] pre-trained weights. As shown in Tab. 3, our method requires significantly fewer epochs to obtain superior performance on the SpaceNet building segmentation dataset. We outperform our self-supervised baseline by 2.61% and our fully supervised baseline by 8.90%, while reducing convergence time by 76%.

5 Results

As evidenced by our qualitative and quantitative results, our method provides both superior performance and faster convergence (measured in epochs) on the evaluated downstream tasks. This shows that material and texture are strongly associated with common remote sensing downstream tasks, and that the ability to represent material and texture effectively improves performance on those tasks. Since quantitatively measuring the ability to represent material without material labels is difficult, we analyze and showcase qualitative texture and material results in the form of visual word maps (pixel-wise cluster assignments). We also discuss limitations, running time, pseudo-code, and additional qualitative results in the supplementary material.

Visual Word Maps Generation.

In order to measure how well our approach describes materials and textures, we qualitatively evaluate the visual word maps (pixel-wise cluster assignments) generated by our method. Ideally, similar materials and textures should map to the same clusters, without over- or under-grouping of pixels. We visually compare classical textons, a patch-wise backbone, and our method in Fig. 5. The patch-wise backbone has the same base architecture as MATTER, but without the TeRN and surface residual encoding modules. Both methods were trained on the same dataset, with the same hyperparameters and number of iterations, as described in Sec. 4.1. The textons and patch-wise backbone approaches generate two extreme cases: over-sensitivity and under-sensitivity to changes in material and texture. Since textons operate on raw intensity values, the inter-material variance is small, making them highly sensitive to small texture changes. This can be seen in the texton-generated visual word map, in which small irregularities on the road result in mappings to different visual words. The patch-wise backbone, on the other hand, loses crucial low-level details essential for texture representation even with the receptive field constrained through patched inputs, as indicated by the grouping of clearly different textures into a single visual word. In contrast, as demonstrated in Fig. 5, our Texture Refinement Network and Surface Residual Encoder boost the impact of low-level features, generating surface-based visual word maps. Our method retains texture-essential features and generalizes surface representation, which translates to superior surface-based visual word maps.
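A visual word map is simply a pixel-wise nearest-cluster assignment: each pixel's feature vector is mapped to the index of its closest cluster center (visual word). The numpy sketch below shows only this assignment step, assuming the cluster centers have already been learned; it is an illustration, not the paper's implementation.

```python
import numpy as np

def visual_word_map(features, centers):
    """Assign every pixel to its nearest cluster center (visual word).
    features: (H, W, D) per-pixel feature map
    centers:  (K, D) learned cluster centers
    returns:  (H, W) integer map of visual word indices"""
    h, w, d = features.shape
    flat = features.reshape(-1, d)                                   # (H*W, D)
    # Squared Euclidean distance to every center via broadcasting.
    dists = ((flat[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # (H*W, K)
    return dists.argmin(axis=1).reshape(h, w)
```

Coloring each index randomly then yields maps like those in Fig. 5, where pixels sharing a visual word ideally share material and texture.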

Image Textons Patch-wise Backbone Ours
Figure 5: Qualitative evaluation of our generated material and texture based visual word maps. It can be seen that our method provides more descriptive surface-based features that are not highly sensitive to small texture irregularities like textons, or under-sensitive to structure changes like the patch-wise backbone. Best viewed in zoom and color. Colors are random.
Train Crop \ Inference Crop     7      9      11     13     15     17     19     21
5                               46.68  47.38  46.94  46.94  45.74  44.51  43.74  42.95
7                               48.52  49.48  49.01  49.02  47.76  46.64  45.92  44.69
9                               48.58  47.60  48.02  47.83  46.51  45.57  45.45  43.27
11                              48.98  47.83  47.32  46.65  45.64  44.51  44.16  42.45
13                              47.46  47.14  46.35  46.99  44.65  43.79  43.09  41.61
15                              47.63  47.15  47.30  46.10  45.55  44.68  44.10  41.85
17                              46.74  46.81  46.49  45.92  44.69  43.60  43.19  41.09
Table 4: Receptive field constraint analysis. F-1 score (%) on the unsupervised change detection task. Reported values are for the “change” class with respect to training and inference crop sizes (without fine-tuning). The method benefits from a smaller receptive field, achieving superior performance with smaller train and inference crop sizes.
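The F-1 score used in Tab. 4 (and in the ablation below) is the harmonic mean of precision and recall for the positive (“change”) class. A minimal sketch, again for illustration rather than the authors' evaluation code:

```python
import numpy as np

def f1_score(pred, gt, positive=1):
    """F-1 for one binary class: harmonic mean of precision and recall."""
    pred, gt = np.asarray(pred), np.asarray(gt)
    tp = int(np.sum((pred == positive) & (gt == positive)))
    fp = int(np.sum((pred == positive) & (gt != positive)))
    fn = int(np.sum((pred != positive) & (gt == positive)))
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    return 2 * precision * recall / max(precision + recall, 1e-12)
```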
Image 1 Image 2 Ground Truth DeepLab [chen2017deeplab] Difference of Residuals Ours () Ours ()
Figure 6: Qualitative results of our method on the Onera Satellite Change Detection (OSCD) dataset [8518015]. Our self-supervised model alone is capable of detecting change purely by inferring changes of material and texture. The fine-tuned model is able to utilize the pre-trained material and texture based weights and achieves significantly better results than models with ImageNet weight initialization. Cyan, magenta, gray, and red colors represent true positive, false positive, true negative, and false negative respectively. Best viewed in zoom and color.

5.1 Ablation Study

Constraining Receptive Field.

In Tab. 4 we study the effects of varying receptive field constraints on our method. As mentioned before, as the receptive field increases, the impact of low-level features diminishes, along with the quality of the material and texture representation. Unlike traditional methods, which resort to smaller networks to reduce the receptive field, we explicitly constrain the method by feeding crops to the network, removing them from any global context. Recall that the objective of our method is to learn representations of material and of the spatial distribution of micro-structures, which depend largely on low-level features that are diminished in larger receptive field methods. In practice, the largest possible receptive field of our network during training is pixels, which is significantly smaller than the receptive fields of ResNet-50, ResNet-101, and ResNet-152, with sizes of 483, 1027, and 1507 pixels respectively. Tab. 4 shows that the method in fact benefits from the removal of global spatial context and from smaller receptive fields, which help it learn better representations of material and texture and achieve better performance on the unsupervised change detection task. Our best results are achieved with a train crop size of , and an inference crop size of , while the worst performance occurs with the largest training and inference receptive fields.
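The receptive-field constraint described above amounts to tiling each image into small non-overlapping crops and feeding every crop to the network independently, so no crop carries global context. A minimal sketch of this tiling, assuming square crops and image dimensions divisible by the crop size:

```python
import numpy as np

def tile_crops(image, k):
    """Split an (H, W, C) image into non-overlapping (k, k, C) crops.
    The network sees each crop independently, so its effective
    receptive field can never exceed k x k pixels."""
    h, w, c = image.shape
    assert h % k == 0 and w % k == 0, "image dims must be divisible by k"
    return (image.reshape(h // k, k, w // k, k, c)   # split rows and columns
                 .transpose(0, 2, 1, 3, 4)           # group by (block_row, block_col)
                 .reshape(-1, k, k, c))              # flatten to a batch of crops
```

Varying `k` at train and inference time reproduces the sweep over crop sizes reported in Tab. 4.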

Impact of Modules.

In Tab. 5 we study the impact of each module in our proposed method. We evaluate performance on the self-supervised change detection task (F-1 score of the “change” class) as the ablation metric, since it transfers strongly to material and texture representation learning. We consider all possible network combinations of the patch-wise backbone, TeRN, and the Surface Residual encoder. The patch-wise backbone corresponds to the network fed with patch-wise inputs, without the TeRN or Surface Residual encoder. We then selectively add TeRN and the Surface Residual Encoder to the network and record its performance. Every network combination was trained and evaluated with the same hyperparameters and procedure described in Sec. 4.1 and 4.2. Each module provides an incremental performance boost, with the best performance achieved when both modules are present in the network.

Patch-wise Backbone Texture Refinement Surface Residual F-1 Score (%)
Table 5: Ablation study. F-1 score of the “change” class on the Onera Satellite Change Detection dataset using the self-supervised approach, with respect to the modules used.

6 Conclusion

In this work, we present MATTER, a novel self-supervised method that learns material and texture based representations for multi-temporal, spatially aligned satellite imagery. By utilizing patch-wise inputs and our refinement network, we constrain the receptive field and enhance texture-essential features. These are then mapped to residuals of learned clusters as an affinity measurement, which represents the material and texture composition of the sampled patch. Through our self-supervision pipeline, MATTER learns discriminative features for various material and texture surfaces, which are shown to have a strong correlation to change (a change of surface implies actual change), or can be used as pre-trained weights for other remote sensing tasks.