Machine learning is useful for inferring comprehensive geospatial information from sparsely labelled data. This is applicable to a wide range of uses, from vegetation height mapping (Lang et al., 2019) to building footprint detection (Zhu et al., 2021). In particular, learning from geospatial data is crucial to better understanding, mitigating, and responding to climate change, with applications ranging from hurricane forecasting (Boussioux et al., 2021) to methane detection (Kumar et al., 2020). Geospatial data is especially useful to better understand agricultural practices; data such as agricultural land use or yield is extremely incomplete, especially when considered on a global scale, and machine learning is critical in helping fill the gaps in the data. A complete picture of global agricultural practices is vital to mitigate and adapt to the effects of climate change, including by assessing food security in the event of extreme weather, more rapidly responding to food crises, and increasing productive land without sacrificing carbon sinks.
Certain parts of the world collect plentiful field-level agricultural data, but many regions are extremely data-sparse (with this data imbalance reflecting a eurocentric and amerocentric bias as in other labeled datasets in machine learning (Shankar et al., 2017)). While previous work has investigated transfer learning from data-rich areas to improve performance in data-sparse areas (Wang et al., 2018; Rußwurm et al., 2020), geospatial datasets (and agricultural data in particular) are rich in metadata that can inform transfer learning algorithms by enabling models to learn useful context between datapoints, such as the relative geographic locations of datapoints or the higher-level category of the class label (Turkoglu et al., 2021).
We propose a new method for passing such auxiliary information to the model to improve overall performance and equitable generalization. Specifically, we build on previous work applying Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017) to geospatial data. Meta-learning aims to learn a model that can quickly learn a new task from a small amount of new data by optimizing over many training tasks. When using geospatial data, tasks are created by partitioning samples based on agro-ecological (Rußwurm et al., 2020) or political (Tseng et al., 2021a, b) boundaries.
We summarize the main contributions of this paper below (code and data are available at github.com/nasaharvest/timl):
We introduce Task-Informed Meta-Learning (TIML), an algorithm designed to augment MAML by incorporating task metadata and removing memorized tasks.
We show that TIML improves performance for both regression (yield estimation) and classification (crop type classification) tasks across a diversity of neural network architectures.
We highlight TIML’s ability to learn from very few positive labels and to perform well on tasks where other transfer-learned models do poorly.
While we motivate our method and focus our experiments on agricultural applications, we highlight that TIML is not specific to agriculture and could be applied to any meta-learning problem that includes task-specific metadata, such as classification of geo-tagged images (Mac Aodha et al., 2019) or species distribution modelling (Beery et al., 2021).
2 Related Work
2.1 Transfer learning for remote sensing
In prior work using machine learning for remote sensing data, there have been numerous efforts to learn from data-rich geographies and transfer the resulting model to data-sparse regions or underrepresented classes. Wang et al. (2018) found that training a yield estimation model using data from Argentina boosted the performance of that model in Brazil. In other studies, the source and target tasks are not geographically defined. For example, Jean et al. (2016) used transfer learning to improve performance on a data-sparse task (wealth estimation) by first learning a data-rich task (nighttime light estimation).
Prior work has also used multi-task learning to improve model performance for data-sparse tasks. Kerner et al. (2020) trained a multi-task model where one task classifies crops in the data-sparse target region (Togo) and the second task classifies crops elsewhere in the world, and showed that augmenting the data-sparse task with global data improved the model performance in Togo. Chang et al. (2019) trained a multi-task neural network to simultaneously classify forest cover type and regress forest structure variables such as biomass to improve model performance on both tasks.
Rußwurm et al. (2020) introduced the idea of geographically defined tasks for meta-learning and applied it to remote sensing data (specifically for land cover classification). Tseng et al. (2021a) and Tseng et al. (2021b) used meta-learning for agricultural land cover classification.
These approaches often fail to capture important metadata and expert knowledge about the data-sparse tasks of focus, such as their geographic location relative to the pre-training data or the high-level category of crops being classified. This metadata can be useful for learning relationships between samples or tasks that can improve classification performance—for example, remote sensing observations of maize will be more similar between Kenya and Mali than between Kenya and France (Figure 1). In this work, we consider the metadata inherent to agricultural classification tasks, such as the spatial relations between tasks, and how this can inform the model’s predictions.
Meta-learning, or learning to learn, consists of learning a function for a task given other example tasks (Thrun and Pratt, 1998). Recent work in this area has focused on few-shot learning, i.e., learning a function for a task given few training datapoints (Snell et al., 2017; Ravi and Larochelle, 2017). In particular, model-agnostic meta-learning (MAML) (Finn et al., 2017) is a few-shot meta-learning algorithm that uses the other example tasks to learn a set of initial weights that can rapidly generalize to a new task. MAML can be used with any neural network architecture.
A more complex variant to few-shot generalization is few-shot dataset generalization, where a single model is used to learn from multiple datasets (as opposed to tasks, which can be drawn from the same dataset) (Triantafillou et al., 2020). In this regime, one solution is to learn an encoder which can modulate the distribution of weights learned by MAML depending on the dataset being considered (Vuorio et al., 2019; Triantafillou et al., 2021).
In this work, we consider how the MAML weights may be modulated even when all the tasks are drawn from the same dataset. In particular, we consider the special case when there is task-level metadata that can inform the modulation of the MAML weights for specific tasks.
Our proposed approach, Task-Informed Meta-Learning (TIML), builds on Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017). Specifically, we modulate the meta-weights learned by MAML depending on the task of interest. This allows different weight initializations to be learned for different tasks, so that the model can cover a wide distribution of tasks. MAML learns a set of model weights $\theta$ that is close to optimal for each of a variety of different tasks, allowing the optimal weights for a specific task to be reached with little data and/or few gradient steps. These initial weights are updated by fine-tuning them on a training task (inner loop training), yielding updated weights $\theta'$. A gradient for $\theta$ is then computed with respect to the loss of the updated model, $\mathcal{L}(f_{\theta'})$. This gradient is then used to update $\theta$ (outer loop training).
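The inner and outer loops described above can be illustrated with a minimal sketch. It assumes a toy one-parameter model `y = w * x`, a mean squared error loss, a single inner gradient step, and the same samples for the inner and outer loss; none of these choices come from the paper, and the function names and learning rates are hypothetical.

```python
import random


def inner_update(w, xs, ys, inner_lr):
    """Inner loop: one gradient step on a task's mean squared error."""
    n = len(xs)
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
    return w - inner_lr * grad


def maml_outer_gradient(w, xs, ys, inner_lr):
    """Gradient of the post-update loss with respect to the initial weight w."""
    n = len(xs)
    mean_x2 = sum(x * x for x in xs) / n
    w_prime = inner_update(w, xs, ys, inner_lr)
    # Gradient of the loss with respect to the updated weight w'.
    grad_w_prime = sum(2 * (w_prime * x - y) * x for x, y in zip(xs, ys)) / n
    # Chain rule through the inner step: dw'/dw = 1 - 2 * inner_lr * mean(x^2).
    return grad_w_prime * (1 - 2 * inner_lr * mean_x2)


def train_maml(task_slopes, steps=500, inner_lr=0.1, outer_lr=0.05, seed=0):
    """Outer loop: update the shared initial weight using sampled tasks."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        slope = rng.choice(task_slopes)  # each task is a different slope
        xs = [rng.uniform(-1, 1) for _ in range(10)]
        ys = [slope * x for x in xs]
        w -= outer_lr * maml_outer_gradient(w, xs, ys, inner_lr)
    return w


# Hypothetical tasks with slopes 1, 2 and 3; the learned initialization
# should settle between them so that each task is one small step away.
meta_w = train_maml([1.0, 2.0, 3.0])
print(meta_w)
```

In practice the gradient through the inner step is obtained by automatic differentiation rather than by hand, but the structure of the two loops is the same.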
In the following sections, we first describe how we implemented MAML in a geospatial context (Section 3.1) (focusing on how tasks are constructed) before describing the TIML method in more detail (Section 3.2).
3.1 MAML in a geospatial context
As in previous work applying meta-learning to geospatial data (Tseng et al., 2021b; Wang et al., 2020), we define tasks spatially. Specifically, given a particular task we use political boundaries (counties or countries) to separate a single dataset into many different tasks. The intuition for this is that agricultural practices (and land use) are influenced by the policies and cultural practices of a region, which are often defined along political boundaries (e.g., teff is a popular crop grown in Ethiopia and Eritrea but not in bordering countries). This makes political territories useful units when conducting spatial analysis of a region (e.g., Kimenyi et al., 2014). In addition, defining tasks in this way allows for region-specific tasks to be defined. For example, some crops may only be grown in certain parts of the world (e.g., cacao is typically grown within 20 degrees of the equator), or data collection efforts for specific crops may only have occurred in certain areas. Spatially defined tasks mean that the model can be trained to identify these regional crops when it is looking at that region, and not elsewhere.
3.2 Task-Informed Meta-Learning
We build on model-agnostic meta-learning (Finn et al., 2017), considering the case where there is additional task-specific information that could inform the model, such as the spatial relationships between tasks. Information such as the spatial coordinates of a task remains static for all datapoints in the task, so is not useful to differentiate positive and negative instances within tasks. However, it may be useful to condition the model prior to inner loop training.
We introduce Task-Informed Meta-Learning (TIML) (Algorithm 1), which modulates the hidden vectors in the meta-model based on embeddings calculated using task information. We encode the task-specific information into a set of vectors, two for each hidden layer to be modulated in the meta-model: $\gamma$ and $\beta$. We use feature-wise linear modulation (FiLM; Perez et al., 2018) to modulate the hidden vector outputs of the meta-model using these task encodings. Given a hidden vector output $h$, we compute the Hadamard product of $\gamma$ and $h$ and add $\beta$ to calculate the modulated vector $h'$, which is passed to the next layer in the network:

$$h' = \gamma \odot h + \beta$$
These task embeddings are updated in the outer loop during training. This means that when the meta-model is being fine-tuned for a specific task, the embeddings remain constant for all datapoints in that task. We illustrate this in Figure 2.
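As a small worked example of the feature-wise linear modulation described above, the sketch below scales and shifts a hidden vector elementwise. The function name `film` and the example values are hypothetical; in TIML the scale and shift vectors would come from the task encoder rather than being written by hand.

```python
def film(hidden, gamma, beta):
    """Feature-wise linear modulation: elementwise scale, then shift."""
    if not (len(hidden) == len(gamma) == len(beta)):
        raise ValueError("hidden, gamma and beta must have the same length")
    return [g * h + b for h, g, b in zip(hidden, gamma, beta)]


# Hypothetical hidden vector and task embeddings for one layer.
hidden = [1.0, -2.0, 0.5]
gamma = [2.0, 0.0, 1.0]   # scale: a zero entry suppresses that feature
beta = [0.5, 1.0, -0.5]   # shift

modulated = film(hidden, gamma, beta)
print(modulated)  # [2.5, 1.0, 0.0]
```

Because the scale and shift are the same for every datapoint in a task, the modulation conditions the whole network on the task without adding per-sample inputs.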
We use a task encoder to learn the embeddings. This encoder consists of linear blocks, where each block contains a linear layer with a GeLU activation (Hendrycks and Gimpel, 2016), group normalization (Wu and He, 2018) and dropout (Srivastava et al., 2014). The task information is encoded into a hidden task vector. Independent blocks are then used to generate an embedding for each hidden vector in the classifier to be modulated.
3.3 Forgetful Meta-Learning
Due to the spatial imbalance of data globally (de Vries et al., 2019; Shankar et al., 2017), spatially partitioned meta-learning tasks may not be geographically well distributed. They can also be semantically imbalanced. In the CropHarvest dataset (Tseng et al., 2021b), a large fraction of the tasks are crop vs. non-crop tasks, reflecting the large number of binary crop vs. non-crop datapoints in the dataset (65.8% of all instances only have crop vs. non-crop labels). We find that in such settings the model is liable to memorize many similar tasks to the detriment of its ability to learn more difficult or rarer tasks, thus hurting generalization performance for the fine-tuning tasks. This issue is not specific to geospatial or agricultural datasets and could occur for any dataset with imbalanced classes or task difficulty.
We take advantage of the large number of similar tasks in the geospatial data setting to introduce a simple method to prevent memorization of certain tasks: removing training tasks the model has memorized, where memorization is defined as having exceeded a performance threshold for a task over a number of consecutive epochs. We call this method “forgetfulness.” Reducing the training set size before training has previously been explored (Ohno-Machado et al., 1998; Han et al., 2021) to reduce training time; we instead do it dynamically to improve performance.
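The removal rule above can be sketched as a small bookkeeping class. This is an illustration under assumptions: the class name `ForgetfulTaskPool`, the threshold of 0.95 and the patience of 3 epochs are hypothetical, and the score could be any per-task performance metric.

```python
class ForgetfulTaskPool:
    """Tracks training tasks and drops those the model has memorized."""

    def __init__(self, task_ids, threshold, patience):
        self.active = set(task_ids)
        self.threshold = threshold  # score above which an epoch counts
        self.patience = patience    # consecutive epochs needed for removal
        self.streak = {task_id: 0 for task_id in task_ids}

    def update(self, task_id, score):
        """Record one epoch's score; return True if the task was just removed."""
        if task_id not in self.active:
            return False
        if score > self.threshold:
            self.streak[task_id] += 1
        else:
            self.streak[task_id] = 0  # the run of high scores was broken
        if self.streak[task_id] >= self.patience:
            self.active.discard(task_id)
            return True
        return False


pool = ForgetfulTaskPool(["easy", "hard"], threshold=0.95, patience=3)
epoch_scores = [
    {"easy": 0.97, "hard": 0.60},
    {"easy": 0.98, "hard": 0.70},
    {"easy": 0.99, "hard": 0.96},
]
for scores in epoch_scores:
    for task_id, score in scores.items():
        pool.update(task_id, score)

print(sorted(pool.active))  # ['hard']
```

Resetting the streak whenever the score drops is what makes the criterion depend on consecutive epochs rather than on the total number of high-scoring epochs.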
TIML is designed for transfer learning regimes which can be structured in terms of meta-learning (learning from many different tasks). In addition, TIML expects metadata that inform how the model should consider each task and remain constant for all datapoints within the task. We consider two datasets well suited to this regime, one for classification and one for regression, to demonstrate the suitability of TIML in both contexts.
4.1 Crop Type Classification
Up-to-date cropland maps are critical to understanding the climate impacts of agriculture (Song et al., 2021). Crop type classification consists of predicting whether or not a given instance contains a crop of interest. Specifically, given a remote sensing-derived pixel time series for a specific latitude and longitude and a crop of interest, the goal is to output a binary value describing whether the crop of interest is being grown at that pixel location.
4.1.1 Data description
We use the CropHarvest dataset (Tseng et al., 2021b). This dataset consists of 90,480 globally distributed datapoints with the associated satellite pixel time series for each point. Of these datapoints, 30,899 (34.2%) contain multi-class agricultural labels; the remaining datapoints contain binary “crop” or “non-crop” labels. Each datapoint is accompanied by a pixel time series from 4 remote sensing products: Sentinel-2 L1C optical observations, Sentinel-1 synthetic aperture radar observations, ERA5 climatology data (precipitation and temperature), and topography (slope and elevation) from a Digital Elevation Model (DEM). The time series includes 1 year of data at monthly timesteps.
4.1.2 Task construction for meta-learning
As with the CropHarvest benchmarks, we defined tasks spatially using bounding boxes for countries drawn by Natural Earth (Patterson and Kelso, ). Tasks consist of binary classification of pixels as either crop vs. non-crop or a specific crop type vs. rest. This yielded 525 tasks, which were randomly split into training and validation tasks. Three evaluation tasks (described in Section 4.1.3) were withheld from the initial training. For each evaluation task, we fine-tuned the model on that task’s training data before evaluating the model on that task’s test data.
Task information is encoded in a 13-dimensional vector. Three dimensions are used to encode spatial information, consisting of latitude and longitude transformed to $[\cos(\text{lat}) \cdot \cos(\text{lon}),\ \cos(\text{lat}) \cdot \sin(\text{lon}),\ \sin(\text{lat})]$. This transforms the spatial information from spherical to Cartesian coordinates, ensuring that transformed values at the extreme longitudes are close to each other. The remaining 10 dimensions are used to communicate the type of task the model is being asked to learn. This consists of a one-hot encoding of crop categories from the UN Food and Agriculture Organization (FAO) indicative crop classification (World Programme for the Census of Agriculture, 2020), with an additional class for non-crop. For crop vs. non-crop tasks, positive examples are given the same value across all the crop type categories.
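The construction of the task information vector can be sketched as follows. The spherical-to-Cartesian transform follows the description above; the function names, the example coordinates and the index chosen for the category are hypothetical, and the sketch only covers the one-hot case (not the crop vs. non-crop encoding).

```python
import math


def encode_location(lat_deg, lon_deg):
    """Map latitude and longitude (degrees) to a point on the unit sphere."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return [
        math.cos(lat) * math.cos(lon),
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
    ]


def one_hot(index, size):
    """Vector of zeros with a single one at the given index."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def task_vector(lat_deg, lon_deg, category_index, num_categories=10):
    """Concatenate the 3 spatial dimensions with the category encoding."""
    return encode_location(lat_deg, lon_deg) + one_hot(category_index, num_categories)


# Two points on either side of the antimeridian are 359.8 degrees apart in
# raw longitude but close together after the transform.
east = encode_location(0.0, 179.9)
west = encode_location(0.0, -179.9)
gap = math.sqrt(sum((a - b) ** 2 for a, b in zip(east, west)))
print(gap)

vector = task_vector(-1.3, 36.8, category_index=0)
print(len(vector))  # 13
```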
The CropHarvest dataset is accompanied by 3 evaluation tasks which test the ability of a pre-trained model to learn from a small number of in-distribution datapoints in a variety of agroecologies. We describe each task and the accompanying training data below.
Togo crop vs. non-crop: The goal of this task is to classify datapoints as crop or non-crop in Togo. The training set consists of 1,319 datapoints and the test set consists of 306 datapoints – 106 (35%) positive and 200 (65%) negative – sampled from random locations within the country.
The two other evaluation tasks consist of classifying a specific crop. Thus, “rest” below includes all other crop and non-crop classes. For both tasks, entire polygons delineating a field (as opposed to single pixels within a field) were collected, allowing evaluation across the polygons. However, during training, only the polygon centroids were used.
Kenya maize vs. rest: The training set consists of 1,345 imbalanced samples (266 positive and 1,079 negative samples). The test set consists of 45 polygons containing 575 (64%) positive and 323 (36%) negative pixels.
Brazil coffee vs. rest: The training set consists of 794 imbalanced samples (21 positive and 773 negative samples). The test set consists of 66 polygons containing 174,026 (25%) positive and 508,533 (75%) negative pixels.
4.2 Yield Estimation
Accurate and timely yield estimates are a key input to food security forecasts (Becker-Reshef et al., 2020), and help clarify how food production can be sustainably managed (Lark et al., 2020). Yield estimation is a regression task which consists of estimating the yield (the amount of crop harvested per unit of land) of an area, given remote sensing data of that area. Specifically, we estimate soybean yield in the top soybean-producing states in the United States.
4.2.1 Data description
We recreate the yield prediction dataset originally collected by You et al. (2017). This dataset consists of county-level soybean yields for the 11 US states accounting for over 75% of national soybean production from 2009 to 2015. MODIS reflectance (Vermote, 2015) and temperature data (Wan et al., 2015) are used to construct the remote sensing inputs. Since counties cover large areas, inputting the raw satellite data to the model would create extremely high-dimensional inputs. To handle this, You et al. (2017) assumed permutation invariance, meaning the positions of farmland pixels in a county do not affect yield, since they only indicate the positions of cropland. This allows all cropland pixels in a county (based on the MODIS land cover map (Friedl et al., 2010)) to be mapped to a histogram of pixel values, significantly reducing the dimensionality of the input. This histogram is the predictor for soybean yields in each county. Since the original You et al. (2017) paper was released, the MODIS data products have been updated to a newer version. Therefore, our histograms are similar but not identical to those in You et al. (2017). We note that this dataset was also released by Yeh et al. (2021), but we chose to recreate the dataset from You et al. (2017) to maintain the temporal validation originally used (as opposed to the random split used in Yeh et al. (2021)).
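The permutation-invariance assumption can be made concrete with a short sketch that maps a set of pixel values to a normalized histogram. The number of bins, the value range and the normalization are assumptions made for illustration, not the settings used to build the dataset.

```python
def pixel_histogram(pixels, num_bins, low, high):
    """Normalized histogram of pixel values; the order of pixels is ignored."""
    counts = [0] * num_bins
    width = (high - low) / num_bins
    for value in pixels:
        if low <= value <= high:  # values outside the range are skipped
            index = min(int((value - low) / width), num_bins - 1)
            counts[index] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * num_bins
    return [count / total for count in counts]


# Hypothetical reflectance values for the cropland pixels of one county.
pixels = [0.1, 0.4, 0.35, 0.8]
hist = pixel_histogram(pixels, num_bins=4, low=0.0, high=1.0)
print(hist)  # [0.25, 0.5, 0.0, 0.25]

# Reordering the pixels leaves the histogram unchanged.
same = pixel_histogram(list(reversed(pixels)), num_bins=4, low=0.0, high=1.0)
print(same == hist)  # True
```

The input size now depends on the number of bins rather than on the number of pixels in the county, which is the dimensionality reduction the paragraph describes.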
4.2.2 Task construction for meta-learning
We define tasks to be individual counties, with task pairs consisting of histograms and yields for different years.
As with the crop type classification task (Section 4.1), we use 3 dimensions in the task information vector to encode transformed latitude and longitude values describing the location of the county. In addition, we include a one-hot encoding communicating which state the county is in to the model based on the intuition that agricultural practices vary enough across states that it may help the model to have this difference explicitly communicated.
We use temporal validation; specifically, for each evaluation year, we train a model using all the data prior to that year, and evaluate the performance of the model for the unseen year.
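A minimal sketch of this temporal validation scheme is shown below. The record layout (dictionaries with `county`, `year` and `yield` keys), the example values and the choice of evaluation years are hypothetical.

```python
def temporal_splits(records, eval_years):
    """Yield (year, train, test): train on earlier years, test on that year."""
    for year in eval_years:
        train = [r for r in records if r["year"] < year]
        test = [r for r in records if r["year"] == year]
        yield year, train, test


# Hypothetical yields for a single county over four years.
records = [
    {"county": "A", "year": year, "yield": 40.0 + (year - 2009)}
    for year in range(2009, 2013)
]

splits = list(temporal_splits(records, eval_years=[2011, 2012]))
for year, train, test in splits:
    print(year, [r["year"] for r in train], [r["year"] for r in test])
```

Unlike a random split, no record from the evaluation year or later ever appears in the training set for that year.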
5.1 Crop Type Classification
|no task info or encoder||0.848|
|no task info or encoder||0.652|
Table 1: Results on the CropHarvest evaluation tasks. All results are averaged from 10 runs and reported with the accompanying standard error. We report the area under the receiver operating characteristic curve (AUC ROC) and the F1 score using a threshold of 0.5 to classify a prediction as the positive or negative class. We highlight the first and second best metrics for each task. TIML achieves the highest F1 score of any model on the Brazil task and the best AUC ROC and F1 scores when averaged across the 3 tasks. We highlight the improvement of TIML relative to other transfer-learning models, showing it is able to leverage task structure to significantly increase performance on the CropHarvest dataset.
We evaluate TIML by training it on the CropHarvest dataset and fine-tuning it on the evaluation tasks, as was done for the benchmark results released with the dataset in Tseng et al. (2021b). MAML (and by extension, TIML) can be applied to any neural network architecture. We use the same base classifier and hyperparameters as in Tseng et al. (2021b): a 1-layer LSTM model followed by a linear classifier.
We perform 3 ablations to understand the effects of different components of TIML on overall model performance:
No forgetfulness: A TIML model trained without forgetfulness; no tasks are removed in the training loop.
No encoder: A TIML model with no encoder. The task information is instead appended to every raw input timestep and passed directly to the classifier.
No task information or encoder: No task information passed to the model at all. This model is effectively a normal MAML model, trained with forgetfulness.
We compare the TIML architecture to 4 baselines. As with TIML, we fine-tune these models on each benchmark task’s training data and then evaluate them on the task’s test data:
MAML: A model-agnostic meta-learning classifier without the task information.
Crop pre-training: A classifier pre-trained to classify all data as crop or non-crop (without task metadata), then re-trained on each test task.
No pre-training: A randomly initialized classifier, which is not pre-trained on the global CropHarvest dataset but instead is trained directly on the test task training data.
In addition, we trained a Random Forest baseline implemented using scikit-learn (Pedregosa et al., 2011) with the default hyperparameters.
5.2 Yield Estimation
We apply TIML to the original network architectures used by You et al. (2017) – a 1-layer LSTM and a CNN-based regressor. In addition to the remote sensing input, the Deep Gaussian Process baseline model (described in Section 5.2.1) receives as input the year of each training point. We therefore append the year to each timestep of the input to the TIML LSTM, so that the model has comparable inputs to the Deep Gaussian Process. The CNN-based models receive only the remotely sensed data as input.
We compared TIML to 2 baselines: the Deep Gaussian Process models (proposed by You et al. (2017) alongside the yield estimation dataset) and standard MAML.
5.2.1 Deep Gaussian Process
To train a Deep Gaussian Process, a deep learning model is first trained to estimate yield given the remote sensing dataset described above. The final hidden vector of the model (for each input) is used as input to a Gaussian process whose kernel function is conditioned on both the location of the datapoint (defined by its latitude and longitude) and the year of the datapoint.
We include baselines with and without a Gaussian process (i.e., using the outputs of the deep learning models directly instead of passing the final hidden vectors to a Gaussian process). We note that this implementation of Deep Gaussian Processes differs from Damianou and Lawrence (2013).
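To give a sense of what a kernel conditioned on location and year can look like, the sketch below uses a squared-exponential form with separate length scales for space and time. The functional form and the parameter names `sigma`, `r_loc` and `r_year` are assumptions made for illustration; they are not a statement of the exact kernel used by You et al. (2017).

```python
import math


def location_year_kernel(loc_a, year_a, loc_b, year_b,
                         sigma=1.0, r_loc=1.0, r_year=1.0):
    """Similarity of two datapoints based on their location and year."""
    dist_loc_sq = sum((p - q) ** 2 for p, q in zip(loc_a, loc_b))
    dist_year_sq = (year_a - year_b) ** 2
    exponent = -dist_loc_sq / (2 * r_loc ** 2) - dist_year_sq / (2 * r_year ** 2)
    return sigma ** 2 * math.exp(exponent)


# Hypothetical (latitude, longitude) pairs for three counties.
here = (41.0, -93.0)
near = (41.5, -93.5)
far = (35.0, -85.0)

k_same = location_year_kernel(here, 2012, here, 2012)
k_near = location_year_kernel(here, 2012, near, 2012)
k_far = location_year_kernel(here, 2012, far, 2012)
k_other_year = location_year_kernel(here, 2012, here, 2014)
print(k_same, k_near, k_far, k_other_year)
```

With a kernel of this kind, predictions for a county are pulled towards the residuals of nearby counties and adjacent years, which is how spatial and temporal context enters the Gaussian process step.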
As noted in Section 4.2.1, the MODIS datasets have been updated since the original Deep Gaussian Process models were run. We therefore retrain them to obtain our baseline results. We use the same hyperparameters as You et al. (2017), with the addition of early stopping when training. You et al. (2017)’s original results are included for comparison.
Additional implementation details for the crop classification and yield estimation datasets are available in Appendix A.
|(You et al., 2017)|
|LSTM + GP||5.77||6.23||5.96||5.70||5.49||5.83|
|CNN + GP||5.70||5.68||5.83||4.89||5.67||5.55|
6 Results & Discussion
6.1 Crop Type Classification
We show the model results for TIML, its ablations and all baseline models when trained on the CropHarvest dataset in Table 1. Like Tseng et al. (2021b), we report the AUC ROC score and the F1 score calculated using a threshold of 0.5. Overall, TIML is the best performing algorithm on the CropHarvest dataset, achieving the highest F1 and AUC ROC scores when averaged across all tasks. TIML is consistently among the best performing algorithms on every task. In particular, TIML is the only transfer learning model that outperforms a randomly-initialized model in the challenging Brazil task, where there are only 21 positive datapoints.
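For reference, the F1 score at a fixed decision threshold can be computed as in the sketch below. Treating a probability exactly equal to the threshold as a positive prediction is an assumption, and the example probabilities and labels are hypothetical.

```python
def f1_at_threshold(probabilities, labels, threshold=0.5):
    """F1 score after converting probabilities to binary predictions."""
    predictions = [1 if p >= threshold else 0 for p in probabilities]
    pairs = list(zip(predictions, labels))
    true_pos = sum(1 for pred, label in pairs if pred == 1 and label == 1)
    false_pos = sum(1 for pred, label in pairs if pred == 1 and label == 0)
    false_neg = sum(1 for pred, label in pairs if pred == 0 and label == 1)
    if true_pos == 0:
        return 0.0  # avoids division by zero when nothing is correctly positive
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


probabilities = [0.9, 0.6, 0.4, 0.2]
labels = [1, 0, 1, 0]
score = f1_at_threshold(probabilities, labels)
print(score)  # 0.5
```

Unlike AUC ROC, this metric depends on the chosen threshold, which is why both are reported.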
6.1.1 Effects of transfer learning
Standard transfer learning from the global dataset is not guaranteed to confer advantages to the model. For example, first training using MAML or crop pre-training results in lower performance on the Brazil task compared to an LSTM initialized with random weights. We hypothesize this may be due to the difference in distribution of the Brazil task data relative to the other tasks the models are trained on. TIML is the only model to see significant improvements in performance compared to the randomly initialized model, suggesting conditioning the model with prior, domain-specific information about the tasks can help to model the diversity of samples in the CropHarvest dataset.
Forgetfulness – when coupled with task information – improves model performance in more challenging tasks without penalizing performance elsewhere. Training TIML with forgetfulness significantly boosts performance in the Brazil task without substantially impacting performance on the other tasks, and yields significantly higher mean F1 and AUC ROC scores when measured across all tasks. However, training TIML forgetfully without the task information (TIML with no task information or encoder) yields comparable results to the baseline MAML model trained without forgetfulness. We therefore hypothesize that task information provides useful context around which tasks are being kept and forgotten during training, allowing TIML to learn from more difficult tasks in the “forgetful” regime without forgetting easier tasks it has already learned.
6.1.3 Effect of task information
Including task information in the model improves performance, both when it is concatenated to the input data and when it is passed to the model through TIML. However, there are significant differences in performance depending on how this information is passed to the model: passing the task information directly to the classifier (TIML with no encoder) yields mixed results (lower AUC ROC in Kenya, and lower F1 scores in Brazil and Togo compared to MAML or crop pre-training models). Training TIML with the encoder significantly boosts performance in these regimes, yielding the highest mean AUC ROC and F1 scores. We hypothesize that because the task information remains static for all datapoints in a task, it is challenging for the model to learn from it during the inner loop optimization; the encoder architecture allows it to be optimized only in the outer loop.
6.1.4 Effects of fine-tuning dataset sizes
TIML excels at learning from small dataset sizes. We plot the performance of the models as a function of training set size in Figure 3 for the Kenya and Togo evaluation tasks (the Brazil task has only 21 positive examples, and is therefore already in the small-dataset size regime). In both the Togo and Kenya tasks, the TIML model is amongst the most performant algorithms (as measured by AUC ROC score) for all subset sizes. We highlight that for the smallest sample size (20 fine-tuning samples), TIML is the best performing algorithm for both the Togo and Kenya evaluation tasks.
6.2 Yield Estimation
We share yield estimation results in Table 2. Like You et al. (2017), we report the RMSE score averaged across all counties and use temporal validation to evaluate the models. The TIML and MAML LSTM receive the year as input (to reflect the data available to the Deep Gaussian Process), but the TIML and MAML CNN do not.
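The RMSE used here can be computed as in the short sketch below; the example yields are hypothetical values chosen so the arithmetic is easy to follow.

```python
import math


def rmse(predicted, observed):
    """Root mean squared error between predicted and observed values."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Errors of 3 and -4 give sqrt((9 + 16) / 2) = sqrt(12.5).
error = rmse([50.0, 40.0], [47.0, 44.0])
print(error)
```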
For both the LSTM and CNN architectures, TIML is the most performant model. This is the case even though the Deep Gaussian Process is much more memory intensive, since it requires all predictions and hidden vectors (for the training and test data) to be computed together for the Gaussian process modelling step; this may be infeasible for larger datasets. TIML requires substantially less memory since it considers each county independently.
It is also worth noting that while TIML achieves the best result of all models, MAML performs significantly worse than all other models. This suggests that in some contexts, the task information is necessary for meta-learning to work.
In conclusion, we introduce task-informed meta-learning (TIML), a method for conditioning the model with prior information about a specific task. Specifically, the task information is encoded into a set of vectors which are used to modulate the weights learned by a MAML learner prior to task-specific fine-tuning. In addition, we introduce the concept of “forgetful” meta-learning, which can improve meta-learning performance when there are many similar tasks to learn from. We apply TIML to a range of tasks (classification and regression) and a range of model architectures (RNNs and CNNs), demonstrating its utility in a variety of regimes (including those with very few data points, and regimes in which standard MAML fails completely). Although we focus on agriculture-related tasks, TIML is not specific to agriculture and can be applied to any meta-learning setup with task-level metadata.
- How to train your MAML. In International Conference on Learning Representations (ICLR), Cited by: Appendix A.
- Learn2learn: a library for Meta-Learning research. Cited by: Appendix A.
- Strengthening agricultural decisions in countries at risk of food insecurity: the geoglam crop monitor for early warning. Remote Sensing of Environment. Cited by: §4.2.
- Species distribution modeling for machine learning practitioners: a review. In ACM SIGCAS Conference on Computing and Sustainable Societies, Cited by: §1.
- Hurricane forecasting: a novel multimodal machine learning framework. In NeurIPS 2021 Workshop on Tackling Climate Change with Machine Learning, Cited by: §1.
- Chimera: a multi-task recurrent convolutional neural network for forest classification and structural estimation. Remote Sensing 11 (7), pp. 768. Cited by: §2.1.
- Task-robust model-agnostic meta-learning. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §3.3.
- Deep Gaussian processes. In Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics (AISTATS), Cited by: §5.2.1.
- Does object recognition work for everyone?. In , Cited by: §3.3.
- Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning (ICML), Cited by: §1, §2.2, §3.2, §3.
- MODIS collection 5 global land cover: algorithm refinements and characterization of new datasets. Remote Sensing of Environment. Cited by: §4.2.1.
- SlimML: removing non-critical input data in large-scale iterative machine learning. IEEE Transactions on Knowledge and Data Engineering. Cited by: §3.3.
- Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415. Cited by: §3.2.
- Task agnostic meta-learning for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §3.3.
- Combining satellite imagery and machine learning to predict poverty. Science. Cited by: §2.1.
- Rapid response crop maps in data sparse regions. In ACM SIGKDD Conference on Data Mining and Knowledge Discovery Workshops, Cited by: §2.1.
- The impact of conflict and political instability on agricultural investments in Mali and Nigeria. Brookings Institute. Cited by: §3.1.
- Deep remote sensing methods for methane detection in overhead hyperspectral imagery. In 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), Cited by: §1.
- Country-wide high-resolution vegetation height mapping with Sentinel-2. Remote Sensing of Environment. Cited by: §1.
- Cropland expansion in the United States produces marginal yields at high costs to wildlife. Nature Communications. Cited by: §4.2.
- Presence-Only Geographical Priors for Fine-Grained Image Classification. In International Conference on Computer Vision (ICCV), Cited by: §1.
- Improving machine learning performance by removing redundant cases in medical data sets. Proceedings. AMIA Symposium. Cited by: §3.3.
- PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: Appendix A.
- Natural Earth. Note: https://www.naturalearthdata.com/ Cited by: §4.1.2.
- Scikit-learn: machine learning in Python. Journal of Machine Learning Research. Cited by: §5.1.2.
- FiLM: visual reasoning with a general conditioning layer. In Proceedings of the AAAI Conference on Artificial Intelligence, Cited by: §3.2.
- (2020) Programme, concepts and definitions. In World Programme for the Census of Agriculture, Note: http://www.fao.org/3/a-i4913e.pdf Cited by: §4.1.2.
- Optimization as a model for few-shot learning. In International Conference on Learning Representations (ICLR), Cited by: §2.2.
- Meta-learning for few-shot land cover classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Cited by: §1, §1, §2.1.
- No classification without representation: assessing geodiversity issues in open data sets for the developing world. In NIPS 2017 workshop: Machine Learning for the Developing World, Cited by: §1, §3.3.
- Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §2.2.
- Massive soybean expansion in South America since 2000 and implications for conservation. Nature Sustainability. Cited by: §4.1.
- Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research. Cited by: §3.2.
- Learning to learn. Springer Science & Business Media. Cited by: §2.2.
- Learning a universal template for few-shot dataset generalization. In Proceedings of the 38th International Conference on Machine Learning (ICML), Cited by: §2.2.
- Meta-dataset: a dataset of datasets for learning to learn from few examples. In International Conference on Learning Representations (ICLR), Cited by: §2.2.
- Learning to predict crop type from heterogeneous sparse labels using meta-learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Cited by: §1, §2.1.
- CropHarvest: a global satellite dataset for crop type classification. In Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, Cited by: Appendix A, Figure 4, §1, Figure 1, §2.1, §3.1, §3.3, §4.1.1, §5.1, §6.1.
- Crop mapping from image time series: deep learning with multi-scale label hierarchies. Remote Sensing of Environment. Cited by: §1.
- Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research. Cited by: Figure 1.
- MODIS/Terra surface reflectance 8-day L3 global 500m SIN grid v006. NASA EOSDIS Land Processes DAAC. Cited by: §4.2.1.
- Multimodal model-agnostic meta-learning via task-aware modulation. In Neural Information Processing Systems (NeurIPS), Cited by: §2.2.
- MODIS/Aqua land surface temperature/emissivity 8-day L3 global 1km SIN grid v006. NASA EOSDIS Land Processes DAAC. Cited by: §4.2.1.
- Deep transfer learning for crop yield prediction with remote sensing data. In Proceedings of the 1st ACM SIGCAS Conference on Computing and Sustainable Societies (COMPASS), Cited by: §1, §2.1.
- Weakly supervised deep learning for segmentation of remote sensing imagery. Remote Sensing. Cited by: §3.1.
- Group normalization. Cited by: §3.2.
- SustainBench: benchmarks for monitoring the sustainable development goals with machine learning. In Thirty-fifth Conference on Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, Cited by: §4.2.1.
- Deep Gaussian process for crop yield prediction based on remote sensing data. Proceedings of the AAAI Conference on Artificial Intelligence. Cited by: §4.2.1, §5.2.1, §5.2.1, §5.2, Table 2, §6.2.
- MAP-net: multiple attending path neural network for building footprint extraction from remote sensed imagery. IEEE Transactions on Geoscience and Remote Sensing. Cited by: §1.
Appendix A Implementation Details
We implement TIML in PyTorch (Paszke et al., 2019), using the learn2learn library (Arnold et al., 2020). All MAML and TIML models are trained using the same optimizer hyperparameters. Specifically, we use an inner loop learning rate of . We use an Adam optimizer on the outer loop (for both the classifier and the encoder), with a cosine annealing learning rate schedule (as per Antoniou et al. (2019)). For both the classifier and the encoder, we use an initial learning rate of and a minimum learning rate of .
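As an illustration of the cosine annealing schedule mentioned above, the following is a minimal sketch of its closed form in plain Python. The learning rate values `LR_INIT` and `LR_MIN` are hypothetical placeholders and not the values used in our experiments, and setting the annealing period equal to the number of training epochs is an assumption made for illustration. In PyTorch, a schedule of this shape is provided by `torch.optim.lr_scheduler.CosineAnnealingLR`.

```python
import math


def cosine_annealing_lr(step, total_steps, lr_init, lr_min):
    """Closed-form cosine annealing from lr_init (step 0) to lr_min (step total_steps)."""
    cos_term = 1.0 + math.cos(math.pi * step / total_steps)
    return lr_min + 0.5 * (lr_init - lr_min) * cos_term


# Hypothetical placeholder values, NOT the settings used in the paper.
LR_INIT = 1e-3
LR_MIN = 1e-5
# Assumption: the schedule is annealed over the full training run.
TOTAL_EPOCHS = 1000

schedule = [
    cosine_annealing_lr(epoch, TOTAL_EPOCHS, LR_INIT, LR_MIN)
    for epoch in range(TOTAL_EPOCHS + 1)
]
```

The learning rate starts at `LR_INIT`, passes through the midpoint of the two values halfway through training, and ends at `LR_MIN`.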
When fine-tuning, we use the same learning rate as the inner loop learning rate () for all models with the exception of the yield-estimation standard-MAML CNN. The standard-MAML CNN experienced an exploding loss using this learning rate, so we reduced the learning rate to when fine-tuning it.
Both MAML and TIML are trained for 1000 epochs; we select the model checkpoint with the best performance on the validation set (consisting of 10% of the training tasks, up to a maximum of 50 tasks).
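The validation split described above can be sketched as follows. The function name, the random selection of validation tasks, and rounding the 10% share down to an integer are assumptions made for illustration; only the 10% share and the cap of 50 tasks come from the text.

```python
import random


def split_validation_tasks(task_ids, percent=10, max_val_tasks=50, seed=0):
    """Hold out `percent`% of the tasks for validation, capped at `max_val_tasks`."""
    tasks = list(task_ids)
    # Assumption: validation tasks are chosen uniformly at random.
    random.Random(seed).shuffle(tasks)
    # Integer arithmetic avoids floating-point rounding surprises (rounds down).
    n_val = min(max_val_tasks, len(tasks) * percent // 100)
    return tasks[n_val:], tasks[:n_val]


train_small, val_small = split_validation_tasks(range(200))
train_large, val_large = split_validation_tasks(range(1000))
```

With 200 tasks the 10% rule applies, while with 1000 tasks the cap of 50 validation tasks takes effect.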
For the crop type classification dataset, all LSTM-based classifiers were fine-tuned on the evaluation tasks for 250 gradient steps with batches containing 10 positive and 10 negative examples (as in Tseng et al. (2021b)). We show the variety of agro-ecologies represented in the crop type classification evaluation tasks in Figure 4.
For the yield estimation dataset, all models were fine-tuned on each county for 15 gradient steps, with batches of size 10. We use fewer fine-tuning steps than for the crop type classification dataset because much less data is available for each county than for the crop type classification evaluation tasks. Some counties did not have any fine-tuning data available; the results for these zero-shot counties are shared in Appendix B.
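A balanced batch of 10 positive and 10 negative examples could be drawn as in the sketch below. The function name and the fallback to sampling with replacement when a class has fewer examples than requested are assumptions made for illustration, not a description of the original implementation.

```python
import random


def sample_balanced_batch(labels, n_pos=10, n_neg=10, rng=None):
    """Return indices forming a batch with n_pos positive and n_neg negative examples."""
    rng = rng or random.Random()
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]

    def pick(pool, k):
        # Assumption: sample with replacement only if the class is too small.
        # An empty pool raises an error, since no balanced batch exists.
        if len(pool) >= k:
            return rng.sample(pool, k)
        return rng.choices(pool, k=k)

    batch = pick(pos, n_pos) + pick(neg, n_neg)
    rng.shuffle(batch)
    return batch


# Toy task with few positives, as is common for rare crop types.
toy_labels = [1] * 5 + [0] * 100
toy_batch = sample_balanced_batch(toy_labels, rng=random.Random(0))
```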
We use the following thresholds to define task-memorization:
- Crop type classification: an AUC ROC of 0.95 or above
- Yield estimation: an RMSE of 4 or less
In both cases, a training task was forgotten if it met the threshold for forgetfulness continuously over the last 20 epochs.
For crop type classification, we note that the training batches were balanced to contain 10 positive and 10 negative examples, making AUC ROC an appropriate metric.
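The memorization criterion above can be expressed as a check over a task's per-epoch metric history. This is a minimal sketch: the function and argument names are hypothetical, and treating a history shorter than 20 epochs as not yet memorized is an assumption.

```python
def is_memorized(metric_history, threshold, higher_is_better, window=20):
    """True if the metric met the threshold in each of the last `window` epochs."""
    # Assumption: a task with fewer than `window` recorded epochs is never memorized.
    if len(metric_history) < window:
        return False
    recent = metric_history[-window:]
    if higher_is_better:
        # e.g. AUC ROC of 0.95 or above for crop type classification
        return all(m >= threshold for m in recent)
    # e.g. RMSE of 4 or less for yield estimation
    return all(m <= threshold for m in recent)


auc_history = [0.5] * 5 + [0.96] * 20
rmse_history = [3.5] * 19 + [4.1]

crop_task_memorized = is_memorized(auc_history, 0.95, higher_is_better=True)
yield_task_memorized = is_memorized(rmse_history, 4.0, higher_is_better=False)
```

Here the crop type task meets the threshold in all of its last 20 epochs, whereas the yield task exceeds the RMSE threshold in its final epoch and is therefore retained.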
A.2 Task augmentation for geospatial MAML
Defining tasks according to their geospatial boundaries allows for a form of weak task augmentation, by including nearby datapoints which are not explicitly within the boundary. For example, using a rectangular bounding box instead of a polygon when defining a political boundary includes nearby points which may not be inside the polygon. Similarly, for the yield estimation dataset we include nearby counties in tasks for MAML and TIML.
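The difference between assigning datapoints by polygon and by bounding box can be illustrated with a small, self-contained sketch. The triangular boundary, the points, and all function names are hypothetical; the point-in-polygon test uses standard ray casting and does not treat points lying exactly on the boundary specially.

```python
def bounding_box(polygon):
    """Axis-aligned bounding box (min_x, min_y, max_x, max_y) of a polygon."""
    xs, ys = zip(*polygon)
    return min(xs), min(ys), max(xs), max(ys)


def in_bounding_box(point, bbox):
    x, y = point
    min_x, min_y, max_x, max_y = bbox
    return min_x <= x <= max_x and min_y <= y <= max_y


def in_polygon(point, polygon):
    """Ray-casting test: count edge crossings of a ray going in the +x direction."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal line through y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical triangular boundary and datapoints in (lon, lat).
boundary = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
points = [(1.0, 1.0), (3.0, 3.0), (5.0, 1.0)]

strict_task = [p for p in points if in_polygon(p, boundary)]
augmented_task = [p for p in points if in_bounding_box(p, bounding_box(boundary))]
```

The point (3.0, 3.0) lies outside the polygon but inside its bounding box, so it is included only in the augmented task.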
Appendix B Zero-shot learning
For the yield estimation task, some counties did not appear in the training data but were present in the evaluation data (e.g. if the first year of data for a county is 2011, then there will be no training data for that county for the evaluation year 2011).
For these counties, the model is therefore evaluated in a zero-shot learning regime (the county is not present when training the meta-model, or during fine-tuning).
We highlight that very few counties are in this zero-shot regime, but include these results for completeness.
|LSTM + TIML||8.99||12.93||17.19||9.97||11.22|
|CNN + TIML||10.44||7.02||9.81||7.25||11.89|