Land Cover Mapping in Limited Labels Scenario: A Survey

03/03/2021 ∙ by Rahul Ghosh, et al. ∙ University of Minnesota ∙ University of Pittsburgh

Land cover mapping is essential for monitoring global environmental change and managing natural resources. Unfortunately, traditional classification models are plagued by limited training data available in existing land cover products and data heterogeneity over space and time. In this survey, we provide a structured and comprehensive overview of challenges in land cover mapping and machine learning methods used to address these problems. We also discuss the gaps and opportunities that exist for advancing research in this promising direction.


1 Introduction

Global demand for land resources to support human livelihoods and well-being through food, fiber, energy and living space will continue to grow in response to population expansion and socioeconomic development. This poses a great challenge to human society, given the increasing competition for land from the need to maintain other essential ecosystem services. Addressing this challenge will require timely information on land use and land cover (LULC) changes, e.g., the conversion of forest to farmland or plantations, the loss of productive cropland due to urbanization, and the degradation of soil due to inappropriate management practices.

Advances in Earth observation technologies have led to the acquisition of vast amounts of accurate, timely, and reliable remote sensing (RS) data that can be used for monitoring changes on the Earth system at a global scale. These rich datasets provide tremendous potential to study the nature and impact of global changes due to natural processes and human activities that shape the physical landscape and environmental quality of our planet.

Supervised classification methods, especially recent deep learning approaches, have achieved significant success in commercial applications in the Natural Language Processing (NLP) and Computer Vision (CV) domains, where large training datasets are available. Given the promise of these approaches, there is growing interest in using these techniques for automated land cover mapping at large scale through the analysis of RS data [73]. However, off-the-shelf supervised classification methods have limited utility due to challenges that are unique to environmental applications. Supervised machine learning algorithms, e.g., advanced deep neural networks, require sufficient labeled training instances that are representative of the test data. Such training data is often scarce in land cover applications given the high manual labor and material cost of manual labeling (e.g., visual inspection) and field study. This is further exacerbated by the high-dimensional nature of spatio-temporal remote sensing data. Moreover, land covers commonly show much variability across space and time, e.g., the same crop can look different in different years and in different regions due to variability in weather conditions and farming practices. Additionally, the availability of multiple RS data sources acquired at different spatial and temporal resolutions, and of other heterogeneous data, e.g., elevation, thermal anomalies, and night-time light intensity, poses unique algorithmic challenges that need to be addressed. The goal of this survey is to summarize recent developments in ML methods that are being used to address these challenges in the context of LULC. We focus on scenarios with a broad range of limited labels, e.g., scarce labels or imperfect labels, which also make it challenging to model the heterogeneity, the large number of classes, and the high-dimensional spatio-temporal data. Additionally, we discuss the gaps and opportunities that exist for advancing research in this promising direction.

We organize the paper as follows. Section 2 describes different challenges that can be addressed using the latest ML techniques for learning with limited data. Then, Section 3 discusses ML methods that researchers are using to address these challenges. Table 1 categorizes the work surveyed in this paper by challenge and approach. We hope that this survey will also be valuable for domain scientists who are interested in exploring the use of ML to enhance the mapping of target land covers. Although the focus of this survey is largely on land cover applications, the ideas carry over to a wide range of spatio-temporal data.

2 Challenges

In this section, we provide an overview of a diverse set of challenges that have been encountered in land cover mapping.

2.1 Heterogeneity in space and time

One major challenge in land cover mapping is the rich variety of land cover characteristics in different regions and at different times, i.e., data heterogeneity. For example, crops can be planted under different soil types, precipitation, and other weather conditions. Farmers also differ in their land management practices (e.g., conservation tillage, low phosphorus application), crop rotation plans, and local social exchanges. All these factors can result in heterogeneity across different places and different years.

Traditional ML methods that learn a single model of target land covers over the entire population of training data fail to account for this heterogeneity of the data in space and time, thus resulting in poor prediction performance when applied across diverse regions and times. Additionally, supervised learning algorithms require sufficient availability of labeled instances for training, which are expected to be adequately representative of the test data. However, training labels are often only available for specific regions and years due to the high associated costs of obtaining gold-standard labels. This results in poor prediction performance when the learned model is generalized to a larger region or different years.

2.2 Large number of classes

Land cover products commonly involve a large number of land cover classes. For example, the USDA Cropland Data Layer [4] defines over 150 land cover classes for the US. Accurately mapping all these land covers requires access to training data for all the classes. However, some land cover classes can be very rare. This scenario is common in fine-grained land cover mapping, e.g., crop classification, where we may not have labeled instances of many rare crops. Moreover, training labels are often only available for specific regions and years. Hence, unseen/novel land cover classes may appear when the trained model is applied to other regions or other years. Traditional ML models are designed to classify over a fixed set of classes and cannot identify new classes that are not included in the training set.

2.3 Paucity of labeled data

Most land cover labels are created by domain experts through visual inspection or field study. Given the substantial human labor and material cost required in the labeling process, it is often difficult to obtain large training datasets for land cover mapping. This poses a serious challenge for effectively training advanced machine learning models. Recent deep learning models, which have shown high performance in a variety of commercial applications, usually involve a large number of parameters and thus require large amounts of labeled training data.

This issue can be further exacerbated by the severe skewness of land covers. In particular, there can be far fewer samples for certain minority land cover classes than for dominant classes such as forest, grassland and barren land. Although some minority classes have very few training samples, detecting these classes can be critical for understanding the impact of human practices and land cover changes on the environment. For example, only a small number of farmers plant cover crops in the winter, so it is challenging to train a cover crop classifier using such limited data. However, planting cover crops is highly encouraged by the government since they can help maintain soil nutrients and prevent water contamination. Hence, the capacity to monitor the extent of cover crops is urgently needed.

2.4 Imperfect labels

Training labels provided in many existing land cover products are noisy for a number of reasons. First, inconsistent standards may be used in the annotation process. In a prior study of plantations, [39] divided Landsat images into 20 × 20 km grid cells and visually scanned each cell for multiple types of plantations with the assistance of forest gain and loss information. Such an annotation process results in less than 80% precision according to their independent assessment. Moreover, when applied to map larger regions, this approach may require multiple observers to delineate tree crop plantations, and observers are likely to be inconsistent with one another. Second, much labeling information is only available at a much coarser scale, e.g., in the form of bounding boxes (approximate location), image-level labels (whether the input image contains objects), or annotated polygons over the region of interest. Furthermore, many land cover products only provide labels for objects of interest, and the negative samples are left unmarked. Lastly, although some proxy labels are available, e.g., active fire for wildfire mapping and night-time light intensity for urban area mapping, they often either over-estimate or under-estimate target land covers.

2.5 High-dimensional spatio-temporal data

Many RS datasets contain reflectance values from multiple spectral bands and thus are high dimensional. Given the power of deep learning models in automatically extracting hidden features, researchers have been using these models in learning new data representations from RS data. Research along this direction is of great importance since it may provide new opportunities to understand the relation between spectral features and land covers. In particular, deep learning models have the promise to extract new vegetation indices from multi-spectral reflectance values. Such indices can be potentially more robust and representative compared with existing vegetation indices, such as Normalized Difference Vegetation Index (NDVI), which are defined based on domain knowledge. Moreover, the learned representation, together with the latest interpretation methods, can provide useful insights about the change of spectral data in response to different land covers.

Besides interpretability, the obtained representations can also be used for a variety of downstream applications. For example, these representations can be used to measure the similarity amongst different land covers. Also, one can gather representations learned from different regions and use their distance to better quantify spatial heterogeneity.

3 ML Methods

3.1 Zero-shot Learning

Zero-shot learning (ZSL) methods have been proposed for classification in a special scenario with two sets of classes: seen classes, for which labeled instances are present, and unseen classes, for which there are no labeled instances. These two sets of classes are disjoint, and the aim is to learn a classifier that can assign the instances in the testing set to the unseen classes. To capture the relationship between seen and unseen classes, ZSL methods create a real-valued semantic space where each class (both seen and unseen) has a corresponding vector representation, which is also referred to as the class prototype. To create this semantic space, some form of auxiliary information is often necessary to describe the relation amongst the seen and unseen classes. The class prototypes along with the training data are then used to learn the zero-shot classifier. ZSL methods have been used in several land cover studies given their ability to automatically classify rare or unseen classes. Existing ZSL methods can be categorized based on the type of semantic space and the methods used to learn the zero-shot classifier [61].

Recent literature has demonstrated the capacity of ZSL to leverage additional text data to extract a semantic space for unseen land cover classes [29, 14, 51, 40]. For example, [29] introduced the idea of zero-shot scene classification using high spatial resolution (HSR) images. They first constructed a semantic space for embedding class prototypes using the word2vec model [35] trained on Wikipedia. They then created a semantic directed graph over both seen and unseen classes and assigned each edge weight as the distance between the corresponding pair of classes based on their class prototypes. After establishing the relationship between the seen and unseen classes, they fine-tuned a GoogLeNet model [54] on the training data. The combination of GoogLeNet and the semantic graph enables producing the probability of a new instance belonging to each of the seen or unseen classes. In another work, [51] extended ZSL methods to fine-grained tree classification by using a combination of label embeddings learned from text data via word2vec, visual embeddings of trees, and hierarchy embeddings based on the scientific classification of tree species. Such a combination allows creating a semantic space that leverages the benefits of both domain knowledge and learning-based methods.

An extension of this idea in the context of land-cover mapping is presented in [23]. Instead of using auxiliary text data, authors in this work create the semantic space using sequences of land cover changes collected from other regions in the world. They have shown the effectiveness of this method in identifying deforestation and forest fires without using extra text data. Additionally, [5] used the human-defined attributes as semantic information for knowledge transfer between seen classes and unseen classes and used a projection method to learn the zero-shot classifier. Specifically, they projected image features and label attributes to the same space and train a zero-shot classifier, which can then classify the unseen classes by finding the closest class-attribute in the projected space.
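The projection idea described above can be illustrated with a minimal sketch. It assumes a ridge-regression projection from image features to class-attribute vectors and a nearest-attribute decision rule; the function names, the regularization value, and the linear form of the projection are illustrative assumptions, not the method of [5].

```python
import numpy as np

def fit_projection(features, class_attrs, labels, reg=1.0):
    """Ridge regression from image features to the attribute vector of each sample's class.

    features: (n, d) features of seen-class samples
    class_attrs: (n_seen_classes, a) attribute vectors (class prototypes)
    labels: (n,) integer indices into class_attrs
    """
    targets = class_attrs[labels]                         # (n, a)
    d = features.shape[1]
    gram = features.T @ features + reg * np.eye(d)        # (d, d), regularized
    return np.linalg.solve(gram, features.T @ targets)    # (d, a) projection matrix

def predict_unseen(features, projection, unseen_attrs):
    """Project features into attribute space and pick the closest unseen-class prototype."""
    projected = features @ projection                     # (n, a)
    dists = np.linalg.norm(
        projected[:, None, :] - unseen_attrs[None, :, :], axis=2
    )                                                     # (n, n_unseen_classes)
    return dists.argmin(axis=1)
```

The projection is fit on seen classes only; unseen classes enter solely through their attribute vectors at prediction time, which is what lets the classifier label classes with no training instances.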

Although notable developments of ZSL models have been published for various CV tasks, their applications to RS are still at a nascent stage, with the focus on small regions and small sets of target land covers. Moreover, most existing methods create the label-semantic space using the word2vec model, which was originally designed for text embedding and is thus unable to fully capture land cover characteristics. Hence, there is a need for new mechanisms to create a semantic space that captures the properties of the classes and also includes domain knowledge about the hierarchy of these classes. One potential approach is the combination-based method of ZSL, where the idea is to construct the classifier for unseen classes by combining classifiers for the basic elements that constitute the classes. For example, in crop classification, one can create individual models to learn the basic elements of a crop cycle, such as sowing of seeds, growing season, harvesting period, and texture, and combine them to build the semantic space guided by domain knowledge.

Figure 1: Weak supervision taxonomy.

3.2 Weakly supervised Learning

Weakly supervised learning (WSL) is an area of machine learning covering a variety of studies that attempt to construct predictive models by learning with different forms of weak supervision. In particular, there are three types of weak supervision [72]: incomplete, inexact and inaccurate. An illustration of these three types along with the learning methodologies used in each of them is shown in Fig. 1. Several research works have demonstrated the benefits of such techniques on RS data for handling the challenges posed by weak supervision.

ML methods that handle incomplete supervision, e.g., when only target land covers are labeled, fall under the Positive-Unlabeled (PU) Learning family [3]. [62] proposed a scribble-based deep learning method for road surface extraction. Specifically, using scribble-based PU road labels from crowd sources such as GPS or OpenStreetMap (OSM) [15], negative samples (non-road pixels) are identified using buffers around the scribbles. These non-road pixels along with the scribble-based road labels are used to train a semantic segmentation model. Another scenario of incomplete supervision arises when, along with a few labeled samples, there exists a large amount of unlabeled data. Semi-supervised learning [58] methods are typically used to handle this scenario. [17] adopted the Mean Teacher model [56], a popular semi-supervised technique, to train their backbone ResNet [18] model using a few manually annotated samples and unlabeled samples for economic scale prediction.
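The core mechanics of the Mean Teacher technique mentioned above can be sketched as follows. In Mean Teacher, the teacher's weights are an exponential moving average of the student's weights, and a consistency term penalizes disagreement between student and teacher predictions on the same (labeled or unlabeled) inputs. The dictionary-of-arrays parameter layout, the function names, and the value of `alpha` are illustrative assumptions, not details taken from [17].

```python
import numpy as np

def ema_update(teacher, student, alpha=0.99):
    """Move each teacher parameter toward the student via an exponential moving average."""
    return {
        name: alpha * teacher[name] + (1.0 - alpha) * student[name]
        for name in teacher
    }

def consistency_loss(student_probs, teacher_probs):
    """Mean squared difference between student and teacher predictions.

    This term needs no labels, so it can be computed on unlabeled samples.
    """
    return float(np.mean((student_probs - teacher_probs) ** 2))
```

During training, the student would minimize a supervised loss on the few labeled samples plus a weighted consistency loss on all samples, and `ema_update` would be called after each student step.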

The availability of large-scale crowd sourced information offers opportunities for building ML models for land cover mapping tasks for which expert-annotated labels are not available. However, crowd sourced data are often plagued by noise and thus require the ability to deal with inaccurate supervision. [48] showed results on classifying satellite images using weak labels extracted from Wikipedia. Similarly, [52] used crowd sourced GPS data for road extraction from aerial imagery. Apart from crowd sourced labels, other satellite products such as night-time lights also form a noisy proxy for several tasks such as urbanization and economic development. In their study, [16] showed the usability of night-time lights as a data rich proxy for identifying economic development from satellite imagery.

For pixel-wise land cover mapping tasks, coarse-level labels or image-level labels are often much easier to collect than pixel-level labels. Several methods have been proposed to leverage coarse-level labels for pixel-wise land cover mapping [68, 46, 60]. In particular, [46] discussed the challenges and opportunities of using coarse-level and image-level weak labels for land cover mapping. Many of these approaches are inspired by image segmentation approaches using image-level labels [67, 11].

Although most existing works have focused on a single source of weak supervision, combining independent sources of weak supervision in a multi-task framework can encourage the models for each source to learn from one another. For instance, for mapping urban areas, the different sources of weak supervision can be night-time lights (coarse labels) and crowd-sourced OSM labels (noisy labels).

3.3 Self-supervised Learning

One common approach for dealing with limited availability of labeled datasets is to pre-train an ML model on existing large labeled datasets for a related problem, and then refine it using a small number of labeled samples for the problem of interest. For example, models for image recognition are first trained using large-scale datasets like ImageNet [9] and are then fine-tuned on the limited-size dataset for the downstream task [20]. However, such approaches cannot be readily used for RS for two key reasons: a) the spectral bands available in RS satellite data are much richer than the RGB data in ImageNet (e.g., Sentinel-2 data has 13 spectral bands); b) large-scale labeled data for land covers are not available.

Self-supervised learning (SSL) [26] is an alternative approach that learns feature representations from unlabeled images. Specifically, SSL methods aim to pre-train ML models using "pretext" learning tasks related to the target classes before fine-tuning the models using a small number of labeled observations. The central idea is to propose various pretext tasks for the network to solve, in the hope that the network will learn important feature representations related to specific problems, such as image classification, segmentation, and object detection. Some popular self-supervised methods in CV include inpainting patches [38], image colorization [70], and solving image jigsaw puzzles [37].

Recently, researchers have started using SSL for land cover mapping. One popular approach is to pre-train the ML model by optimizing the reconstruction performance on unlabeled data. For example, [27] introduced the reconstruction loss for each hidden layer of a stacked convolutional autoencoder for unsupervised training of the model. Specifically, they used a symmetric encoder-decoder framework where the feed-forward encoder segments the satellite images and the decoder network reconstructs the original input back from the compressed representations of the encoder. This method has shown improved predictive performance on RS datasets that have small annotated training data.

Many existing works have also adapted pretext tasks that are widely used in CV to RS applications and have shown success in improving performance. For example, [49] adopted the inpainting pretext task and showed its superiority over other autoencoder-based self-supervised methods in learning representations for segmenting land covers in several applications. In addition to the inpainting task, the authors proposed an adversarial training scheme, where an adversarial network, called the coach network, is tasked with gradually increasing the difficulty of the inpainting task so that the semantic inpainting model does not overfit to a single type of corruption. In [55], the authors compared image inpainting, image jigsaw and contrastive learning [6] based pretext tasks for RS land cover classification and presented results on three datasets. Similarly, [59] used an auto-encoder to predict the RGB channels given the other channels of the data as input. For the target classification task, the model has two independent branches which take in the spectral bands and the RGB channels, respectively.

Another line of thought for pretext tasks, which is commonly used in the NLP domain, is representation learning based on context similarity. The central idea is that words in similar contexts should have similar representations [35]. By redefining context as spatial neighborhoods, Tile2Vec [22] applied this idea to the RS domain, encouraging tiles that are close together to have more similar representations than tiles that are far apart.
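The spatial-context idea can be written as a triplet objective over an anchor tile, a nearby tile, and a distant tile. The sketch below assumes embeddings have already been computed by some encoder; the function name and the margin value are illustrative assumptions rather than the exact configuration of [22].

```python
import numpy as np

def triplet_loss(anchor, neighbor, distant, margin=1.0):
    """Hinge loss that pulls neighboring tiles together and pushes distant tiles apart.

    anchor, neighbor, distant: (n, d) embeddings of tile triplets.
    The loss is zero once the distant tile is at least `margin` farther
    from the anchor than the neighbor tile is.
    """
    d_pos = np.linalg.norm(anchor - neighbor, axis=1)   # anchor to nearby tile
    d_neg = np.linalg.norm(anchor - distant, axis=1)    # anchor to far-away tile
    return float(np.mean(np.maximum(0.0, d_pos - d_neg + margin)))
```

No labels are involved: the only supervision is geographic proximity, which is what makes this a self-supervised pretext task.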

SSL can also be used to extract shared information across multiple data sources. For example, [53] proposed a method that creates different views of a given area and uses the InfoNCE loss to enforce consistency between the two views. To create two views of a multispectral image, standard data-augmentation techniques such as channel dropout, random flips, rotation, cropping, and jittering of the brightness and contrast are applied randomly. The consistency is enforced between the representations from different levels of a ResNet model. The authors show that the representations learned in this manner outperform weights obtained from fully supervised pre-training on ImageNet. Moreover, this multiview framework allows the fusion of multiple data sources for joint representation learning.
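As a rough illustration of the InfoNCE objective, the sketch below treats row i of the two embedding matrices as two views of the same area (a positive pair) and all other rows as negatives. The cosine-similarity scoring, the temperature value, and the function name are assumptions made for this sketch; [53] applies the loss across several levels of a ResNet, which is not reproduced here.

```python
import numpy as np

def info_nce(view_a, view_b, temperature=0.1):
    """InfoNCE loss for paired embeddings; row i of each view is a positive pair."""
    a = view_a / np.linalg.norm(view_a, axis=1, keepdims=True)
    b = view_b / np.linalg.norm(view_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature                        # (n, n) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # shift for numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))             # positives lie on the diagonal
```

The loss is small when each embedding is most similar to its own second view and large when it is more similar to the views of other areas.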

While much of the current research has focused on adopting ideas that have been successful in CV and NLP, there is also huge promise for designing new pretext learning tasks specific to land cover characteristics. One promising direction is to use domain knowledge, e.g., crop phenology, to define pre-training objectives. For example, neural networks can be pre-trained to predict several phenology-related values, including the slope of the NDVI curve, the maximum NDVI value, etc.
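A minimal sketch of how such phenology-related pre-training targets might be derived from an NDVI time series is given below. The choice of targets (maximum NDVI, day of peak, steepest green-up slope) and the function name are illustrative assumptions; a network would then be pre-trained to regress these values from the raw spectral inputs.

```python
import numpy as np

def phenology_targets(ndvi, days):
    """Derive simple phenology targets from per-pixel NDVI time series.

    ndvi: (n_pixels, n_dates) NDVI values
    days: (n_dates,) acquisition days, strictly increasing
    """
    peak_idx = ndvi.argmax(axis=1)
    slopes = np.diff(ndvi, axis=1) / np.diff(days)   # NDVI change per day between dates
    return {
        "max_ndvi": ndvi.max(axis=1),                # peak greenness
        "peak_day": days[peak_idx],                  # day on which the peak occurs
        "max_greenup_slope": slopes.max(axis=1),     # steepest increase of the curve
    }
```

Because these targets are computed from the unlabeled imagery itself, they can be generated for any region without manual annotation.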

ZSL Weakly-supervised Learning SSL TL Meta Learning Feature Extraction
Heterogeneity in space and time  [23]  [57] [24][57][21]
Large number of classes  [5]  [40]  [29]  [14]  [51]  [10]  [10]
Paucity of labeled data  [65]  [55]  [59]  [49]  [27]  [42] [12][34][47][43][7] [2][10] [69][44][31] [65][12][2] [10][33][42]
Imperfect labels [52][60][62][41][67] [68][46][11] [16] [64][36]
High-dimensional spatio-temporal data [8] [17][48] [65][53][22] [19][13][25] [8] [65][13][8] [1][32][66]
Table 1: Literature classified by challenge (rows) and method (columns).

3.4 Transfer Learning

Transfer learning (TL) is a class of methods that aim to improve an ML model on a target domain by transferring information from a related source domain. When there is limited labeled training data for the target domain, the existence of large-scale datasets in a separate but related source domain provides the opportunity to transfer the knowledge gained from the source domain to the target task. TL has been extensively used in the ML domain, where models trained on large-scale datasets such as ImageNet have been successfully used as a starting point for solving the target task.

Several studies have also shown the benefit of transferring the knowledge from large-scale CV datasets to RS data. In particular, many of these methods aim to deal with the paucity of labeled data [34, 19, 36, 7, 47]. For example, [19] present a comprehensive comparison of several network models pre-trained on large-scale datasets like ImageNet and Places [71], and then fine-tuned on multiple RS benchmark datasets. Several other studies have shown the benefit of pre-training on ImageNet datasets in target tasks like road safety mapping [36], airplane detection [7] and other land cover mapping datasets [47].

Another approach to TL is to pretrain the ML model using data rich proxy labels. This approach has several benefits: 1. Proxy labels can be very specific to the target class labels of interest (e.g., night time lights can be a good proxy for infrastructure available or level of poverty); 2. Transfer can be done meaningfully as they are more closely related to the target task than ImageNet. For instance, [64] pre-train a ResNet model first using ImageNet dataset followed by night-time lights. The authors show that this method can achieve good performance when fine-tuned on limited labels of poverty prediction. As another example, [2] used the publicly available xView [28] dataset, which is one of the largest and most diverse publicly available overhead imagery datasets for object detection. They also showed the success of transferring the models learned using xView as the source dataset to the target task of poverty prediction.

Several studies have also explored domain adaptation techniques for knowledge transfer. Domain adaptation aims to extract feature representations that are invariant across source and target domains and can thus potentially alleviate data heterogeneity. [43] used a domain adaptation method to transfer knowledge from the electro-optical (EO) domain, which has a large number of labels available, to the synthetic aperture radar (SAR) image domain, where labelling can be difficult. They used two encoder networks to extract a shared, invariant cross-domain embedding space such that the discrepancy between the distributions of encoded EO and SAR data is minimized in this latent space. [24] utilized a cyclic generative adversarial network (GAN) model to perform domain adaptation between source and target domains, which enables them to address the heterogeneity in croplands and grasslands over different regions and different years. Along similar lines, [21] proposed a learning-based domain adaptation technique using an adversarial loss to align the marginal data distributions of a source region and a target region. A few labelled samples from the target domain are then used to align class-specific data distributions between the two domains, based on the contrastive semantic alignment loss.
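To make the notion of aligning marginal distributions concrete, the sketch below computes one very simple discrepancy between source and target embeddings: the squared distance between their means (equivalent to maximum mean discrepancy with a linear kernel). This is a swapped-in illustration chosen for brevity; it is not the adversarial, cyclic GAN, or contrastive semantic alignment loss used in the works cited above, and the function name is hypothetical.

```python
import numpy as np

def mean_embedding_gap(source_emb, target_emb):
    """Squared distance between the mean source and mean target embeddings.

    source_emb: (n_source, d), target_emb: (n_target, d).
    A value near zero means the two domains agree at least in their
    first moment in the shared embedding space.
    """
    diff = source_emb.mean(axis=0) - target_emb.mean(axis=0)
    return float(diff @ diff)
```

In a training loop, a term like this would be added to the source-domain classification loss so that the encoder is pushed toward embeddings on which the two domains are harder to tell apart.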

Additionally, TL has been used to extract meaningful feature representations which can then be used for unsupervised grouping of land cover scenes. [13] used ImageNet and RSI-CB256 [30] benchmark dataset to train a land cover classification model. This model is then used as a feature extractor and the extracted features from RS images can be used for clustering.

3.5 Meta Learning

Meta learning is a strategy that automatically learns how to adapt models quickly to different tasks. In meta learning, a machine learning model gains experience from a distribution of related tasks and uses this experience to improve its future learning performance. Meta learning is performed in two levels: a) inner level where the model tries to solve a single task, and b) outer level where the meta model is updated using the experience gained by solving different tasks in the inner level. This enables the model to quickly learn a new task.
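The two-level structure can be sketched on a toy problem. The example below assumes a scalar linear model y = w x with a squared-error loss and uses a first-order approximation of the outer gradient (second derivatives through the inner update are ignored), so it illustrates the inner/outer loop rather than full MAML. All names and learning rates are illustrative assumptions.

```python
import numpy as np

def loss_and_grad(w, x, y):
    """Mean squared error of the scalar model y = w * x, and its gradient in w."""
    err = w * x - y
    return float(np.mean(err ** 2)), float(2.0 * np.mean(err * x))

def first_order_meta_step(w_meta, tasks, inner_lr=0.1, outer_lr=0.05):
    """One outer update; each task is (x_support, y_support, x_query, y_query)."""
    meta_grad = 0.0
    for x_s, y_s, x_q, y_q in tasks:
        # Inner level: adapt the shared initialization to a single task.
        _, g_support = loss_and_grad(w_meta, x_s, y_s)
        w_task = w_meta - inner_lr * g_support
        # Evaluate the adapted model on held-out data from the same task.
        _, g_query = loss_and_grad(w_task, x_q, y_q)
        meta_grad += g_query
    # Outer level: update the shared initialization using all tasks.
    return w_meta - outer_lr * meta_grad / len(tasks)
```

With two tasks whose true slopes are 1 and 3, repeated outer steps move the initialization toward a value between the two, from which either task can be reached with a single inner step.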

Several works have shown the benefit of using meta learning methods for handling the paucity of labeled data in land cover mapping [69, 44]. [44] used model-agnostic meta learning (MAML) for the problem of inductive transfer learning, where the generalization is induced by a few labeled examples in the target domain. They show results on the Sen12MS dataset [45] and the DeepGlobe challenge. [31] proposed a novel meta learning framework for few-shot RS scene classification. Specifically, they generated different tasks sampled from a task family and learned a metric space that is tuned to perform well on new tasks. [69] proposed a three-stage meta learning strategy for few-shot classification of aerial scene images. First, a feature extractor is trained using the base set and a standard cross-entropy loss. Second, in the meta-training stage, the meta learning classifier is trained over a set of episodes by comparing the query features with the mean of the support features. Finally, in the meta-test stage, the meta learning classifier is tested on a set of novel classes. [8] proposed Model-Agnostic Meta-Ensemble Zero-shot Learning (MAME-ZSL), which exploits the MAML++ algorithm for the detection of unseen classes in land cover mapping.

The majority of the works above follow the MAML direction of meta learning. The use of other meta learning algorithms, such as metric-based meta learning, where the central idea is to automatically learn a similarity measure between two instances, is still rare. If the similarity between the instances of target and source tasks can be successfully captured, the target task can be solved more effectively by exploiting its similarity to a particular source task. For instance, consider the identification of cashew plantations as the target task, and the identification of corn crops and of palm oil plantations as two source tasks. The target task is more similar to the palm oil source task (both being plantations) than to the corn source task. Thus, when identifying cashew plantations, the source task of identifying palm oil plantations will be more beneficial. Another popular form of metric-based meta learning is prototypical networks [50], where a prototype feature vector is defined for every class using instance embeddings. Here, there is an opportunity to incorporate domain knowledge about the classes to create such prototype vectors.
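The prototype idea can be sketched in a few lines: each class prototype is the mean embedding of its support samples, and a query is assigned to the class with the nearest prototype. The sketch assumes embeddings have already been produced by some encoder and uses squared Euclidean distance; the function names are illustrative.

```python
import numpy as np

def class_prototypes(support_emb, support_labels, n_classes):
    """Prototype of each class = mean embedding of its support samples."""
    return np.stack(
        [support_emb[support_labels == c].mean(axis=0) for c in range(n_classes)]
    )

def classify_queries(query_emb, prototypes):
    """Assign each query to the class whose prototype is nearest."""
    sq_dists = ((query_emb[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
    return sq_dists.argmin(axis=1)
```

Domain knowledge could enter at the prototype step, for example by adjusting or initializing prototypes with known class attributes instead of relying on the support mean alone.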

3.6 Feature extraction methods

RS products often contain reflectance values from multiple spectral bands, e.g., MODIS has 7 spectral bands, Sentinel 2A has 13 spectral bands, and some hyper-spectral products have even more. The reflectance value on each spectral band can be sensitive to different atmospheric and ground conditions, e.g., clouds and leaf condition. Environmental scientists have previously created multiple indices based on their domain knowledge, such as NDVI, the Urban Index (UI), and the Normalized Difference Built-up Index (NDBI). These indices are essentially combinations of multiple spectral bands and have proven effective in identifying specific land covers. However, their performance can be significantly degraded when distinguishing multiple similar land covers, e.g., multiple crops can have similar NDVI patterns if they are planted and harvested on close dates.
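As a concrete example of such band combinations, NDVI is the normalized difference of near-infrared and red reflectance, and NDBI is the normalized difference of shortwave-infrared and near-infrared reflectance. The sketch below assumes reflectance arrays of equal shape; the small `eps` term guarding against division by zero is an implementation choice for this sketch, not part of the index definitions.

```python
import numpy as np

def normalized_difference(band_a, band_b, eps=1e-12):
    """(a - b) / (a + b), the common form shared by NDVI and NDBI."""
    return (band_a - band_b) / (band_a + band_b + eps)

def ndvi(nir, red):
    """Normalized Difference Vegetation Index: high for dense green vegetation."""
    return normalized_difference(nir, red)

def ndbi(swir, nir):
    """Normalized Difference Built-up Index: high for built-up surfaces."""
    return normalized_difference(swir, nir)
```

Each index collapses the spectrum to a single hand-designed contrast between two bands, which is why crops with similar red and near-infrared behavior are hard to separate using NDVI alone.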

The advent of new data-driven feature extraction methods has provided opportunities to extract more robust features from multiple spectral bands. Moreover, state-of-the-art deep learning structures, such as convolutional neural networks and recurrent neural networks, also enable incorporating spatial and temporal context into the data representation. The feature representations extracted using these advanced techniques have shown improved performance in several applications. For example, prior work [10] has used a meta feature extractor based on Convolutional Neural Networks (CNNs) that extracts representative features and enables learning with far fewer training samples. Moreover, such feature extraction can also be conducted using unlabeled data [42].

Land cover changes are commonly driven by natural processes or human interventions, which have specific temporal and spatial patterns, e.g., deforestation tends to continue in nearby areas. This requires new feature extraction methods that incorporate spatial and temporal relationships consistent with changes in weather conditions, human activities and their interactions. Additionally, feature extraction methods that can interpret such data dependencies will be highly valuable for better understanding land cover changes.

4 Discussions, Future Directions and Concluding Remarks

In this survey, we reviewed key challenges faced in land cover mapping and machine learning methods for addressing them. Given the advances in machine learning and deep learning, we anticipate that these techniques will play an important role in future work on monitoring land cover changes. In Table 1, we summarized the relevant literature by challenge and by method. Note that many boxes in Table 1 have no or very few entries, and many of these represent opportunities for future work. For example, SSL has been widely used for pre-training ML models so that they need far fewer training samples for fine-tuning. There is also great promise in using SSL to extract new representations that can be used for downstream applications, such as clustering spatial regions.

Another promising direction is to combine RS with other data sources, such as street-level images, to jointly study land covers. This would require new multi-view and representation learning methods to handle data from different views and at different spatial and temporal resolutions [25]. Additionally, many environmental processes, e.g., the growth of crops in response to fertilizer, soil and water conditions, cannot be directly observed through RS data. Hence, it is important to combine RS data with knowledge of the underlying physical and biochemical processes. For example, emerging research on physics-guided machine learning [63] shows great potential in this direction.

References

  • [1] A. Albert et al. (2017) Using convolutional networks and satellite imagery to identify patterns in urban environments at a large scale. In SIGKDD, Cited by: Table 1.
  • [2] K. Ayush et al. (2020) Generating interpretable poverty maps using object detection in satellite images. arXiv:2002.01612. Cited by: §3.4, Table 1.
  • [3] J. Bekker and J. Davis (2020) Learning from positive and unlabeled data: a survey. Machine Learning. Cited by: §3.2.
  • [4] (2021) USDA cropland data layer. Note: https://www.nass.usda.gov/Research_and_Science/Cropland/SARS1a.php Cited by: §2.2.
  • [5] H. Chen et al. (2019) Generalized zero-shot vehicle detection in remote sensing imagery via coarse-to-fine framework. In IJCAI, Cited by: §3.1, Table 1.
  • [6] T. Chen et al. (2020) A simple framework for contrastive learning of visual representations. In ICML, Cited by: §3.3.
  • [7] Z. Chen et al. (2018) End-to-end airplane detection using transfer learning in remote sensing images. Remote Sensing. Cited by: §3.4, Table 1.
  • [8] K. Demertzis and L. Iliadis (2020) GeoAI: a model-agnostic meta-ensemble zero-shot learning method for hyperspectral image analysis and classification. Algorithms. Cited by: §3.5, Table 1.
  • [9] J. Deng et al. (2009) Imagenet: a large-scale hierarchical image database. In CVPR, Cited by: §3.3.
  • [10] J. Deng et al. (2020) Few-shot object detection on remote sensing images. arXiv:2006.07826. Cited by: §3.6, Table 1.
  • [11] K. Fu et al. (2018) WSF-net: weakly supervised feature-fusion network for binary segmentation in remote sensing image. Remote Sensing. Cited by: §3.2, Table 1.
  • [12] K. K. Gadiraju et al. (2020) Comparative analysis of deep transfer learning performance on crop classification. In SIGSPATIAL Workshops, Cited by: Table 1.
  • [13] R. S. Gargees and G. J. Scott (2019) Deep feature clustering for remote sensing imagery land cover analysis. GRSL. Cited by: §3.4, Table 1.
  • [14] R. Gui et al. (2018) A generalized zero-shot learning framework for polsar land cover classification. Remote Sensing. Cited by: §3.1, Table 1.
  • [15] M. Haklay and P. Weber (2008) Openstreetmap: user-generated street maps. IEEE Pervasive computing. Cited by: §3.2.
  • [16] S. Han et al. (2020) Learning to score economic development from satellite imagery. In SIGKDD, Cited by: §3.2, Table 1.
  • [17] S. Han et al. (2020) Lightweight and robust representation of economic scales from satellite imagery. In AAAI, Cited by: §3.2, Table 1.
  • [18] K. He et al. (2016) Deep residual learning for image recognition. In CVPR, Cited by: §3.2.
  • [19] F. Hu et al. (2015) Transferring deep convolutional neural networks for the scene classification of high-resolution remote sensing imagery. Remote Sensing. Cited by: §3.4, Table 1.
  • [20] M. Huh et al. (2016) What makes imagenet good for transfer learning? arXiv:1608.08614. Cited by: §3.3.
  • [21] K. A. Islam et al. (2019) Semi-supervised adversarial domain adaptation for seagrass detection in multispectral images. In ICDM, Cited by: §3.4, Table 1.
  • [22] N. Jean et al. (2019) Tile2vec: unsupervised representation learning for spatially distributed data. In AAAI, Cited by: §3.3, Table 1.
  • [23] X. Jia et al. (2017) Incremental dual-memory lstm in land cover prediction. In SIGKDD, Cited by: §3.1, Table 1.
  • [24] X. Jia et al. (2019) Classifying heterogeneous sequential data by cyclic domain adaptation: an application in land cover detection. In SDM, Cited by: §3.4, Table 1.
  • [25] X. Jia et al. (2019) Recurrent generative networks for multi-resolution satellite data: an application in cropland monitoring. In IJCAI, Cited by: Table 1, §4.
  • [26] L. Jing and Y. Tian (2020) Self-supervised visual feature learning with deep neural networks: a survey. TPAMI. Cited by: §3.3.
  • [27] R. Kemker et al. (2018) Low-shot learning for the semantic segmentation of remote sensing imagery. TGRS. Cited by: §3.3, Table 1.
  • [28] D. Lam et al. (2018) Xview: objects in context in overhead imagery. arXiv:1802.07856. Cited by: §3.4.
  • [29] A. Li et al. (2017) Zero-shot scene classification for high spatial resolution remote sensing images. TGRS. Cited by: §3.1, Table 1.
  • [30] H. Li et al. (2017) RSI-cb: a large scale remote sensing image classification benchmark via crowdsource data. arXiv:1705.10450. Cited by: §3.4.
  • [31] H. Li et al. (2020) RS-metanet: deep meta metric learning for few-shot remote sensing scene classification. arXiv:2009.13364. Cited by: §3.5, Table 1.
  • [32] X. Liu et al. (2019) Siamese convolutional neural networks for remote sensing scene classification. GRSL. Cited by: Table 1.
  • [33] X. Lu et al. (2017) JM-net and cluster-svm for aerial scene classification. In IJCAI, Cited by: Table 1.
  • [34] D. Marmanis et al. (2015) Deep learning earth observation classification using imagenet pretrained networks. GRSL. Cited by: §3.4, Table 1.
  • [35] T. Mikolov et al. (2013) Distributed representations of words and phrases and their compositionality. NeurIPS. Cited by: §3.1, §3.3.
  • [36] A. Najjar et al. (2017) Combining satellite imagery and open data to map road safety. In AAAI, Cited by: §3.4, Table 1.
  • [37] M. Noroozi and P. Favaro (2016) Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, Cited by: §3.3.
  • [38] D. Pathak et al. (2016) Context encoders: feature learning by inpainting. In CVPR, Cited by: §3.3.
  • [39] R. Petersen et al. (2016) Mapping tree plantations with multispectral imagery: preliminary results for seven tropical countries. WRI. Cited by: §2.4.
  • [40] B. Pradhan et al. (2020) Unseen land cover classification from high-resolution orthophotos using integration of zero-shot learning and convolutional neural networks. Remote Sensing. Cited by: §3.1, Table 1.
  • [41] R. Qiao et al. (2020) Simple weakly supervised deep learning pipeline for detecting individual red-attacked trees in vhr remote sensing images. Remote Sensing Letters. Cited by: Table 1.
  • [42] A. Romero et al. (2015) Unsupervised deep feature extraction for remote sensing image classification. TGRS. Cited by: §3.6, Table 1.
  • [43] M. Rostami et al. (2019) Deep transfer learning for few-shot sar image classification. Remote Sensing. Cited by: §3.4, Table 1.
  • [44] M. Rußwurm et al. (2020) Meta-learning for few-shot land cover classification. In CVPR Workshops, Cited by: §3.5, Table 1.
  • [45] M. Schmitt et al. (2019) SEN12MS–a curated dataset of georeferenced multi-spectral sentinel-1/2 imagery for deep learning and data fusion. arXiv:1906.07789. Cited by: §3.5.
  • [46] M. Schmitt et al. (2020) Weakly supervised semantic segmentation of satellite images for land cover mapping–challenges and opportunities. arXiv:2002.08254. Cited by: §3.2, Table 1.
  • [47] G. J. Scott et al. (2017) Training deep convolutional neural networks for land–cover classification of high-resolution imagery. GRSL. Cited by: §3.4, Table 1.
  • [48] E. Sheehan et al. (2018) Learning to interpret satellite images using wikipedia. arXiv. Cited by: §3.2, Table 1.
  • [49] S. Singh et al. (2018) Self-supervised feature learning for semantic segmentation of overhead imagery. In BMVC, Cited by: §3.3, Table 1.
  • [50] J. Snell, K. Swersky, and R. S. Zemel (2017) Prototypical networks for few-shot learning. arXiv:1703.05175. Cited by: §3.5.
  • [51] G. Sumbul et al. (2017) Fine-grained object recognition and zero-shot learning in remote sensing imagery. TGRS. Cited by: §3.1, Table 1.
  • [52] T. Sun et al. (2019) Leveraging crowdsourced gps data for road extraction from aerial imagery. In CVPR, Cited by: §3.2, Table 1.
  • [53] A. M. Swope et al. (2019) Representation learning for remote sensing: an unsupervised sensor fusion approach. Cited by: §3.3, Table 1.
  • [54] C. Szegedy et al. (2015) Going deeper with convolutions. In CVPR, Cited by: §3.1.
  • [55] C. Tao et al. (2020) Remote sensing image scene classification with self-supervised paradigm under limited labeled samples. GRSL. Cited by: §3.3, Table 1.
  • [56] A. Tarvainen and H. Valpola (2017) Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. arXiv:1703.01780. Cited by: §3.2.
  • [57] X. Tong et al. (2020) Land-cover classification with high-resolution remote sensing images using transferable deep models. RSE. Cited by: Table 1.
  • [58] J. E. Van Engelen and H. H. Hoos (2020) A survey on semi-supervised learning. Machine Learning. Cited by: §3.2.
  • [59] S. Vincenzi et al. (2020) The color out of space: learning self-supervised representations for earth observation imagery. arXiv:2006.12119. Cited by: §3.3, Table 1.
  • [60] S. Wang et al. (2020) Weakly supervised deep learning for segmentation of remote sensing imagery. Remote Sensing. Cited by: §3.2, Table 1.
  • [61] W. Wang et al. (2019) A survey of zero-shot learning: settings, methods, and applications. TIST. Cited by: §3.1.
  • [62] Y. Wei et al. (2020) Scribble-based weakly supervised deep learning for road surface extraction from remote sensing images. arXiv:2010.13106. Cited by: §3.2, Table 1.
  • [63] J. Willard et al. (2020) Integrating physics-based modeling with machine learning: a survey. arXiv:2003.04919. Cited by: §4.
  • [64] M. Xie et al. (2016) Transfer learning from deep features for remote sensing and poverty mapping. In AAAI, Cited by: §3.4, Table 1.
  • [65] Y. Xie et al. (2018) An unsupervised augmentation framework for deep learning based geospatial object detection: a summary of results. In SIGSPATIAL, Cited by: Table 1.
  • [66] X. Yang et al. (2020) Tensor canonical correlation analysis networks for multi-view remote sensing scene recognition. TKDE. Cited by: Table 1.
  • [67] F. Zhang et al. (2016) Weakly supervised learning based on coupled convolutional neural networks for aircraft detection. TGRS. Cited by: §3.2, Table 1.
  • [68] L. Zhang et al. (2019) Hierarchical weakly supervised learning for residential area semantic segmentation in remote sensing images. GRSL. Cited by: §3.2, Table 1.
  • [69] P. Zhang et al. (2021) Few-shot classification of aerial scene images via meta-learning. Remote Sensing. Cited by: §3.5, Table 1.
  • [70] R. Zhang et al. (2016) Colorful image colorization. In ECCV, Cited by: §3.3.
  • [71] B. Zhou et al. (2014) Learning deep features for scene recognition using places database. Cited by: §3.4.
  • [72] Z. Zhou (2018) A brief introduction to weakly supervised learning. National science review. Cited by: §3.2.
  • [73] X. X. Zhu et al. (2017) Deep learning in remote sensing: a comprehensive review and list of resources. GRSM. Cited by: §1.