TIML: Task-Informed Meta-Learning for Agriculture

02/04/2022
by Gabriel Tseng, et al.

Labeled datasets for agriculture are extremely spatially imbalanced. When developing algorithms for data-sparse regions, a natural approach is to use transfer learning from data-rich regions. While standard transfer learning approaches typically leverage only direct inputs and outputs, geospatial imagery and agricultural data are rich in metadata that can inform transfer learning algorithms, such as the spatial coordinates of data-points or the class of task being learned. We build on previous work exploring the use of meta-learning for agricultural contexts in data-sparse regions and introduce task-informed meta-learning (TIML), an augmentation to model-agnostic meta-learning which takes advantage of task-specific metadata. We apply TIML to crop type classification and yield estimation, and find that TIML significantly improves performance compared to a range of benchmarks in both contexts, across a diversity of model architectures. While we focus on tasks from agriculture, TIML could offer benefits to any meta-learning setup with task-specific metadata, such as classification of geo-tagged images and species distribution modelling.


1 Introduction

Machine learning is useful for inferring comprehensive geospatial information from sparsely labelled data. This is applicable to a wide range of uses, from vegetation height mapping (Lang et al., 2019) to building footprint detection (Zhu et al., 2021). In particular, learning from geospatial data is crucial to better understanding, mitigating, and responding to climate change, with applications ranging from hurricane forecasting (Boussioux et al., 2021) to methane detection (Kumar et al., 2020). Geospatial data is especially useful to better understand agricultural practices; data such as agricultural land use or yield is extremely incomplete, especially when considered on a global scale, and machine learning is critical in helping fill the gaps in the data. A complete picture of global agricultural practices is vital to mitigate and adapt to the effects of climate change, including by assessing food security in the event of extreme weather, more rapidly responding to food crises, and increasing productive land without sacrificing carbon sinks.

Certain parts of the world collect plentiful field-level agricultural data, but many regions are extremely data-sparse (with this data imbalance reflecting a eurocentric and amerocentric bias as in other labeled datasets in machine learning (Shankar et al., 2017)). While previous work has investigated transfer learning from data-rich areas to improve performance in data-sparse areas (Wang et al., 2018; Rußwurm et al., 2020), geospatial datasets (and agricultural data in particular) are rich in metadata that can inform transfer learning algorithms by enabling models to learn useful context between datapoints, such as the relative geographic locations of datapoints or the higher-level category of the class label (Turkoglu et al., 2021).

We propose a new method for passing such auxiliary information to the model to improve overall performance and equitable generalization. Specifically, we build on previous work applying Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017) to geospatial data. Meta-learning aims to learn a model that can quickly learn a new task from a small amount of new data by optimizing over many training tasks. When using geospatial data, tasks are created by partitioning samples based on agro-ecological (Rußwurm et al., 2020) or political (Tseng et al., 2021a, b) boundaries.

We summarize the main contributions of this paper below (see github.com/nasaharvest/timl for code & data):

  • We introduce Task-Informed Meta-Learning (TIML), an algorithm designed to augment MAML by incorporating task metadata and removing memorized tasks.

  • We show that TIML improves performance for both regression (yield estimation) and classification (crop type classification) tasks across a diversity of neural network architectures.

  • We highlight TIML’s ability to learn from very few positive labels and to perform well on tasks where other transfer-learned models do poorly.

While we motivate our method and focus our experiments on agricultural applications, we highlight that TIML is not specific to agriculture and could be applied to any meta-learning problem that includes task-specific metadata, such as classification of geo-tagged images (Mac Aodha et al., 2019) or species distribution modelling (Beery et al., 2021).

2 Related Work

Figure 1: Remote sensing data from the CropHarvest dataset (Tseng et al., 2021b), collapsed to two dimensions using t-SNE (van der Maaten and Hinton, 2008) and colored according to the continent in which the datapoint is located. Datapoints appear to cluster according to their continent, suggesting that datapoints from the same geographic region share similarities that could be learned by a model. This clustering provides the intuition for the TIML method (that nearby tasks will be more informative during fine-tuning than far-away tasks).

2.1 Transfer learning for remote sensing

In prior work using machine learning for remote sensing data, there have been numerous efforts to learn from data-rich geographies and transfer the resulting model to data-sparse regions or underrepresented classes. Wang et al. (2018) found that training a yield estimation model using data from Argentina boosted the performance of that model in Brazil. In other studies, the source and target tasks are not geographically defined. For example, Jean et al. (2016) used transfer learning to improve performance on a data-sparse task (wealth estimation) by first learning a data-rich task (nighttime light estimation).

Prior work has also used multi-task learning to improve model performance for data-sparse tasks. Kerner et al. (2020) trained a multi-task model where one task classifies crops in the data-sparse target region (Togo) and the second task classifies crops elsewhere in the world, and showed that augmenting the data-sparse task with global data improved the model performance in Togo.

Chang et al. (2019) trained a multi-task neural network to simultaneously classify forest cover type and regress forest structure variables such as biomass to improve model performance on both tasks.

Rußwurm et al. (2020) introduced the idea of geographically defined tasks for meta-learning and applied it to remote sensing data (specifically for land cover classification). Tseng et al. (2021a) and Tseng et al. (2021b) used meta-learning for agricultural land cover classification.

These approaches often fail to capture important metadata and expert knowledge about the data-sparse tasks of focus, such as their geographic location relative to the pre-training data or the high-level category of crops being classified. This metadata can be useful for learning relationships between samples or tasks that can improve classification performance—for example, remote sensing observations of maize will be more similar between Kenya and Mali than between Kenya and France (Figure 1). In this work, we consider the metadata inherent to agricultural classification tasks, such as the spatial relations between tasks, and how this can inform the model’s predictions.

2.2 Meta-learning

Meta-learning, or learning to learn, consists of learning a function for a task given other example tasks (Thrun and Pratt, 1998). Recent work in this area has focused on few-shot learning, i.e., learning a function for a task given few training datapoints (Snell et al., 2017; Ravi and Larochelle, 2017). In particular, model-agnostic meta-learning (MAML) (Finn et al., 2017) is a few-shot meta-learning algorithm that uses the other example tasks to learn a set of initial weights that can rapidly generalize to a new task. MAML can be used with any neural network architecture.

A more complex variant of few-shot generalization is few-shot dataset generalization, where a single model is used to learn from multiple datasets (as opposed to tasks, which can be drawn from the same dataset) (Triantafillou et al., 2020). In this regime, one solution is to learn an encoder which modulates the distribution of weights learned by MAML depending on the dataset being considered (Vuorio et al., 2019; Triantafillou et al., 2021).

In this work, we consider how the MAML weights may be modulated even when all the tasks are drawn from the same dataset. In particular, we consider the special case when there is task-level metadata that can inform the modulation of the MAML weights for specific tasks.

3 Methods

Figure 2: An illustration of the encoder, and the modulation of the MAML learner's hidden vectors using the encoder's output. We highlight the differing optimization regimes for the encoder and the MAML learner – the encoder's output remains static through the MAML learner's inner loop optimization.

Our proposed approach, Task-Informed Meta-Learning (TIML), builds on Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017). Specifically, we modulate the meta-weights learned by MAML depending on the task of interest. This allows a different weight initialization to be learned for each task, letting the model fit a wide distribution of tasks. MAML learns a set of model weights $\theta$ which is close to optimal for each of a variety of different tasks, allowing the optimal weights for a specific task to be reached with little data and/or few gradient steps. These initial weights are updated by fine-tuning them on a training task $\mathcal{T}_i$ (inner loop training), yielding updated weights $\theta'_i$. A gradient for $\theta$ is then computed with respect to the loss of the updated model, $\mathcal{L}_{\mathcal{T}_i}(f_{\theta'_i})$. This gradient is then used to update $\theta$ (outer loop training).
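To make the two optimization loops concrete, below is a minimal sketch of one MAML outer-loop step in PyTorch. It is an illustration under assumed task and model formats (the function name `maml_step` and the task structure are ours), not the implementation from our codebase, which uses the learn2learn library (Appendix A).

```python
import torch
import torch.nn as nn

def maml_step(model: nn.Module, tasks, inner_lr: float = 0.01, inner_steps: int = 1):
    """One MAML outer-loop step over a batch of tasks (illustrative sketch).

    Each task is a ((x_support, y_support), (x_query, y_query)) pair of
    tensors. The inner loop adapts a functional copy of the meta-weights
    theta on the support set; the outer loop accumulates the query loss of
    the adapted weights theta'_i, which stays differentiable w.r.t. theta.
    """
    loss_fn = nn.MSELoss()  # swap for a classification loss as needed
    meta_loss = 0.0
    for (x_s, y_s), (x_q, y_q) in tasks:
        # Inner loop: theta -> theta'_i via gradient descent on the support set.
        fast_weights = dict(model.named_parameters())
        for _ in range(inner_steps):
            preds = torch.func.functional_call(model, fast_weights, (x_s,))
            grads = torch.autograd.grad(
                loss_fn(preds, y_s), list(fast_weights.values()), create_graph=True
            )
            fast_weights = {
                name: weight - inner_lr * grad
                for (name, weight), grad in zip(fast_weights.items(), grads)
            }
        # Outer loop: evaluate the adapted weights theta'_i on the query set.
        query_preds = torch.func.functional_call(model, fast_weights, (x_q,))
        meta_loss = meta_loss + loss_fn(query_preds, y_q)
    return meta_loss / len(tasks)
```

Calling `backward()` on the returned loss and stepping an optimizer over `model.parameters()` performs the outer-loop update; TIML additionally conditions each forward pass on task embeddings that are held fixed through the inner loop, as described in Section 3.2.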

In the following sections, we first describe how we implemented MAML in a geospatial context (Section 3.1) (focusing on how tasks are constructed) before describing the TIML method in more detail (Section 3.2).

3.1 MAML in a geospatial context

As in previous work applying meta-learning to geospatial data (Tseng et al., 2021b; Wang et al., 2020), we define tasks spatially. Specifically, given a particular task we use political boundaries (counties or countries) to separate a single dataset into many different tasks. The intuition for this is that agricultural practices (and land use) are influenced by the policies and cultural practices of a region, which are often defined along political boundaries (e.g., teff is a popular crop grown in Ethiopia and Eritrea but not bordering countries). This makes political territories useful units when conducting spatial analysis of a region (e.g. (Kimenyi et al., 2014)). In addition, defining tasks in this way allows for region-specific tasks to be defined. For example, some crops may only be grown in certain parts of the world (e.g., cacao is typically grown within 20 degrees of the equator), or data collection efforts for specific crops may only have occurred in certain areas. Spatially defined tasks mean that the model can be trained to identify these regional crops when it is looking at that region, and not elsewhere.
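To make this task construction concrete, a minimal sketch of such a spatial partition is shown below; the datapoint fields and bounding-box format are assumptions for this example.

```python
from collections import defaultdict

def build_spatial_tasks(datapoints, country_bboxes):
    """Group labelled pixels into per-country tasks (an illustrative sketch).

    datapoints: iterable of dicts with 'lat', 'lon' and 'label' keys.
    country_bboxes: {country_name: (min_lat, min_lon, max_lat, max_lon)}.
    A point falling inside several (overlapping) bounding boxes will
    contribute to several tasks.
    """
    tasks = defaultdict(list)
    for point in datapoints:
        for country, (lat0, lon0, lat1, lon1) in country_bboxes.items():
            if lat0 <= point["lat"] <= lat1 and lon0 <= point["lon"] <= lon1:
                tasks[country].append(point)
    return tasks
```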

3.2 Task-Informed Meta-Learning

We build on model-agnostic meta-learning (Finn et al., 2017), considering the case where there is additional task-specific information that could inform the model, such as the spatial relationships between tasks. Information such as the spatial coordinates of a task remains static for all datapoints in the task, so is not useful to differentiate positive and negative instances within tasks. However, it may be useful to condition the model prior to inner loop training.

1:  Require: $p(\mathcal{T})$: distribution over tasks
2:  Require: $\alpha$, $\beta$: step size hyperparameters
3:  Randomly initialize meta-model weights $\theta$ and task-encoder weights $\psi$
4:  while not done do
5:     Sample batch of tasks $\mathcal{T}_i \sim p(\mathcal{T})$ with task information $m_{\mathcal{T}_i}$
6:     for all $\mathcal{T}_i$ do
7:        Generate task embeddings $e_{\mathcal{T}_i} = g_{\psi}(m_{\mathcal{T}_i})$
8:        Evaluate $\nabla_{\theta} \mathcal{L}_{\mathcal{T}_i}(f_{\theta, e_{\mathcal{T}_i}})$ with respect to K examples
9:        Compute adapted meta parameters with gradient descent: $\theta'_i = \theta - \alpha \nabla_{\theta} \mathcal{L}_{\mathcal{T}_i}(f_{\theta, e_{\mathcal{T}_i}})$
10:     end for
11:     Update $\theta \leftarrow \theta - \beta \nabla_{\theta} \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i}(f_{\theta'_i, e_{\mathcal{T}_i}})$
12:     Update $\psi \leftarrow \psi - \beta \nabla_{\psi} \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i}(f_{\theta'_i, e_{\mathcal{T}_i}})$
13:  end while
Algorithm 1 Task-Informed Meta-Learning

We introduce Task-Informed Meta-Learning (TIML) (Algorithm 1), which modulates the hidden vectors in the meta-model based on embeddings calculated using task information. We encode the task-specific information into a set of vectors – two for each hidden layer to be modulated in the meta-model, $\gamma$ and $\beta$. We use feature-wise linear modulation (FiLM (Perez et al., 2018)) to modulate the hidden vector outputs of the meta-model using these task encodings. Given a hidden vector output $h$, we compute the Hadamard product of $\gamma$ and $h$ and add $\beta$ to calculate the modulated vector $h'$, which is passed to the next layer in the network:

$$h' = (\gamma \odot h) + \beta \qquad (1)$$

These task embeddings are updated in the outer loop during training. This means that when the meta-model is being fine-tuned for a specific task, the embeddings remain constant for all datapoints in that task. We illustrate this in Figure 2.
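In code, the modulation of Equation 1 is a single elementwise affine transform applied to a layer's activations; a minimal sketch follows (the function name is ours):

```python
import torch

def film_modulate(hidden: torch.Tensor, gamma: torch.Tensor, beta: torch.Tensor):
    """Feature-wise linear modulation (FiLM) of a hidden vector.

    hidden: (batch, d) activations from one layer of the meta-model.
    gamma, beta: (d,) task embeddings from the task encoder. They are
    constant for every datapoint in a task, so they broadcast over the
    batch dimension and are only updated in the outer loop.
    """
    return gamma * hidden + beta  # Hadamard product plus shift (Equation 1)
```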

Task encoder

We use a task encoder to learn the embeddings. This encoder consists of linear blocks, where each block contains a linear layer with a GeLU activation (Hendrycks and Gimpel, 2016), group normalization (Wu and He, 2018) and dropout (Srivastava et al., 2014). The task information is encoded into a hidden task vector. Independent blocks are then used to generate an embedding for each hidden vector in the classifier to be modulated.
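A sketch of such an encoder in PyTorch is shown below; the layer sizes, number of normalization groups, and dropout rate are illustrative rather than the exact hyperparameters used.

```python
import torch.nn as nn

class TaskEncoder(nn.Module):
    """Sketch of the task encoder (the sizes here are illustrative).

    A shared linear block maps the task-information vector to a hidden
    task vector; independent heads then emit a (gamma, beta) pair for
    each hidden layer of the classifier to be modulated.
    """

    def __init__(self, task_info_dim=13, hidden_dim=64,
                 modulated_dims=(128,), num_groups=4, dropout=0.2):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(task_info_dim, hidden_dim),
            nn.GELU(),
            nn.GroupNorm(num_groups, hidden_dim),
            nn.Dropout(dropout),
        )
        # One independent head per modulated hidden vector; each emits
        # gamma and beta concatenated, hence the 2 * d outputs.
        self.heads = nn.ModuleList(
            nn.Linear(hidden_dim, 2 * d) for d in modulated_dims
        )

    def forward(self, task_info):
        h = self.shared(task_info)  # (batch, hidden_dim)
        return [head(h).chunk(2, dim=-1) for head in self.heads]  # [(gamma, beta), ...]
```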

3.3 Forgetful Meta-Learning

Due to the spatial imbalance of data globally (de Vries et al., 2019; Shankar et al., 2017), spatially partitioned meta-learning tasks may not be geographically well distributed. They can also be semantically imbalanced: in the CropHarvest dataset (Tseng et al., 2021b), a large fraction of the tasks are crop vs. non-crop tasks, reflecting the large number of binary crop vs. non-crop datapoints in the dataset (65.8% of all instances only have crop vs. non-crop labels). We find that in such settings the model is liable to memorize many similar tasks to the detriment of its ability to learn more difficult or rarer tasks, hurting generalization performance on the fine-tuning tasks. This issue is not limited to geospatial or agricultural datasets and could occur for any dataset with imbalanced classes or task difficulty.

Although complex meta-learning methods have been designed to optimize for performance on highly challenging tasks (Jamal and Qi, 2019; Collins et al., 2020), we take advantage of the large number of similar tasks in the geospatial data setting to introduce a simple method to prevent memorization of certain tasks: removing training tasks the model has memorized, where memorization is defined as having exceeded a performance threshold for a task over a continuous set of epochs. We call this method "forgetfulness". Reducing the training set size before training has previously been explored to reduce training time (Ohno-Machado et al., 1998; Han et al., 2021); we do it dynamically during training to improve performance.
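A minimal sketch of this bookkeeping, using the thresholds and 20-epoch window reported in Appendix A.1 (the class and method names are ours):

```python
from collections import defaultdict, deque

class ForgetfulnessTracker:
    """Sketch of 'forgetful' task filtering (thresholds from Appendix A.1).

    A task counts as memorized -- and is removed from the training pool --
    once its metric has beaten the threshold for `patience` consecutive
    epochs (e.g. AUC ROC >= 0.95, or RMSE <= 4 with higher_is_better=False).
    """

    def __init__(self, threshold=0.95, patience=20, higher_is_better=True):
        self.threshold = threshold
        self.patience = patience
        self.higher_is_better = higher_is_better
        self.history = defaultdict(lambda: deque(maxlen=patience))

    def update(self, task_id, metric):
        self.history[task_id].append(metric)

    def is_memorized(self, task_id):
        scores = self.history[task_id]
        if len(scores) < self.patience:
            return False
        if self.higher_is_better:
            return all(s >= self.threshold for s in scores)
        return all(s <= self.threshold for s in scores)

# Each epoch: tracker.update(task, metric) for every task, then
# train_tasks = [t for t in train_tasks if not tracker.is_memorized(t)]
```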

4 Datasets

TIML is designed for transfer learning regimes which can be structured in terms of meta-learning (learning from many different tasks). In addition, TIML expects metadata that informs how the model should consider each task and remains constant for all datapoints within the task. We consider two datasets well suited to this regime – one regression task and one classification task – to demonstrate the suitability of TIML in both contexts.

4.1 Crop Type Classification

Up to date cropland maps are critical to understanding the climate impacts of agriculture (Song et al., 2021). Crop type classification consists of predicting whether or not a given instance contains a crop of interest. Specifically, given a remote sensing-derived pixel time series for a specific latitude and longitude and a crop of interest, the goal is to output a binary value describing whether the crop of interest is being grown at that pixel location.

4.1.1 Data description

We use the CropHarvest dataset (Tseng et al., 2021b). This dataset consists of 90,480 globally distributed datapoints with the associated satellite pixel time series for each point. Of these datapoints, 30,899 (34.2%) contain multi-class agricultural labels; the remaining datapoints contain binary “crop” or “non-crop” labels. Each datapoint is accompanied by a pixel time series from 4 remote sensing products: Sentinel-2 L1C optical observations, Sentinel-1 synthetic aperture radar observations, ERA5 climatology data (precipitation and temperature), and topography (slope and elevation) from a Digital Elevation Model (DEM). The time series includes 1 year of data at monthly timesteps.

4.1.2 Task construction for meta-learning

As with the CropHarvest benchmarks, we defined tasks spatially using bounding boxes for countries drawn by Natural Earth (Patterson and Kelso). Tasks consist of binary classification of pixels as either crop vs. non-crop or a specific crop type vs. rest. This yielded 525 tasks, which were randomly split into training and validation tasks. Three evaluation tasks (described in Section 4.1.3) were withheld from the initial training. For each evaluation task, we fine-tuned the model on that task's training data before evaluating the model on that task's test data.

Task Information

Task information is encoded in a 13-dimensional vector. Three dimensions are used to encode spatial information, consisting of latitude and longitude transformed to $[\cos(\text{lat})\cos(\text{lon}), \cos(\text{lat})\sin(\text{lon}), \sin(\text{lat})]$. This transforms the spatial information from spherical to Cartesian coordinates, ensuring that transformed values at the extreme longitudes are close to each other. The remaining 10 dimensions are used to communicate the type of task the model is being asked to learn. This consists of a one-hot encoding of crop categories from the UN Food and Agriculture Organization (FAO) indicative crop classification (FAO, 2020), with an additional class for non-crop. For crop vs. non-crop tasks, positive examples are given a uniform value across all the crop type categories.
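A sketch of this encoding is below; the exact uniform value assigned for crop vs. non-crop tasks is an assumption in this example.

```python
import math
import numpy as np

def task_info_vector(lat_deg, lon_deg, category_index=None, num_categories=10):
    """Build the 13-dimensional task-information vector (a sketch).

    The first 3 dimensions map (lat, lon) from spherical to Cartesian
    coordinates on the unit sphere, so that longitudes near +180 and -180
    degrees map to nearby points. The remaining `num_categories` dimensions
    one-hot encode the task's crop category (the last slot is reserved for
    non-crop here).
    """
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    spatial = np.array([
        math.cos(lat) * math.cos(lon),
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
    ])
    task_type = np.zeros(num_categories)
    if category_index is None:
        # Crop vs. non-crop task: spread a uniform value over the crop
        # categories (the specific value is an assumption in this sketch).
        task_type[:-1] = 1.0 / (num_categories - 1)
    else:
        task_type[category_index] = 1.0
    return np.concatenate([spatial, task_type])  # shape (13,)
```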

4.1.3 Evaluation

The CropHarvest dataset is accompanied by 3 evaluation tasks which test the ability of a pre-trained model to learn from a small number of in-distribution datapoints across a variety of agroecologies. We describe each task and the accompanying training data below.

Togo crop vs. non-crop: The goal of this task is to classify datapoints as crop or non-crop in Togo. The training set consists of 1,319 datapoints and the test set consists of 306 datapoints – 106 (35%) positive and 200 (65%) negative – sampled from random locations within the country.

The two other evaluation tasks consist of classifying a specific crop. Thus, “rest” below includes all other crop and non-crop classes. For both tasks, entire polygons delineating a field (as opposed to single pixels within a field) were collected, allowing evaluation across the polygons. However, during training, only the polygon centroids were used.

Kenya maize vs. rest: The training set consists of 1,345 imbalanced samples (266 positive and 1,079 negative samples). The test set consists of 45 polygons containing 575 (64%) positive and 323 (36%) negative pixels.

Brazil coffee vs. rest: The training set consists of 794 imbalanced samples (21 positive and 773 negative samples). The test set consists of 66 polygons containing 174,026 (25%) positive and 508,533 (75%) negative pixels.

4.2 Yield Estimation

Accurate and timely yield estimates are a key input to food security forecasts (Becker-Reshef et al., 2020), and to better understand how food production can be sustainably managed (Lark et al., 2020). Yield estimation is a regression task which consists of estimating the yield – the amount of crop harvested per unit of land – of an area, given remote sensing data of that area. Specifically, we estimate soybean yield in the top soybean-producing states in the United States.

4.2.1 Data description

We recreate the yield prediction dataset originally collected by You et al. (2017). This dataset consists of county-level soybean yields for the 11 US states accounting for over 75% of national soybean production from 2009 to 2015. MODIS reflectance (Vermote, 2015) and temperature data (Wan et al., 2015) are used to construct the remote sensing inputs. Since counties cover large areas, inputting the raw satellite data to the model would create extremely high-dimensional inputs. To handle this, You et al. (2017) assumed permutation invariance: because the positions of cropland pixels within a county only indicate where crops are grown, they are assumed not to affect the county's yield. This allows all cropland pixels in a county (identified using the MODIS land cover map (Friedl et al., 2010)) to be mapped to a histogram of pixel values, significantly reducing the dimensionality of the input. These histograms are the model inputs for predicting soybean yields in each county. Since the original You et al. (2017) paper was released, the MODIS data products have been updated from version 5.1 to version 6.0. Therefore, our histograms are similar but not identical to those in You et al. (2017). We note that this dataset was also released by Yeh et al. (2021), but we chose to recreate the dataset from You et al. (2017) since we wanted to maintain the temporal validation originally used (as opposed to the random split used in Yeh et al. (2021)).
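A minimal sketch of this histogram construction follows; the bin count and value range are illustrative.

```python
import numpy as np

def county_histograms(pixels, num_bins=32, value_range=(0.0, 1.0)):
    """Sketch of the permutation-invariant histogram features.

    pixels: array of shape (num_cropland_pixels, timesteps, bands) holding
    the surface reflectance / temperature values of one county's cropland
    pixels. Returns an array of shape (timesteps, bands, num_bins): one
    normalized histogram per band and timestep, discarding pixel positions
    entirely.
    """
    _, timesteps, bands = pixels.shape
    hists = np.zeros((timesteps, bands, num_bins))
    for t in range(timesteps):
        for b in range(bands):
            counts, _ = np.histogram(pixels[:, t, b], bins=num_bins, range=value_range)
            hists[t, b] = counts / max(counts.sum(), 1)  # normalize per histogram
    return hists
```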

4.2.2 Task construction for meta-learning

We define tasks to be individual counties, with each task's training pairs consisting of the histograms and yields of that county for different years.

Task Information

As with the crop type classification task (Section 4.1), we use 3 dimensions in the task information vector to encode transformed latitude and longitude values describing the location of the county. In addition, we include a one-hot encoding of the state the county is in, based on the intuition that agricultural practices vary enough across states that explicitly communicating this difference may help the model.

4.2.3 Evaluation

We use temporal validation; specifically, for each year from 2011 to 2015, we train a model using all the data prior to that year, and evaluate the performance of the model for the unseen year.
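This protocol amounts to a simple loop; a minimal sketch, where `train_and_eval` stands in for model training and evaluation:

```python
def temporal_validation(data_by_year, train_and_eval, eval_years=range(2011, 2016)):
    """Sketch of the temporal validation protocol.

    For each evaluation year, a model is trained from scratch on all
    earlier years and evaluated on the held-out year. `train_and_eval`
    is any callable (train_data, test_data) -> metric.
    """
    results = {}
    for year in eval_years:
        train = [ex for y, exs in data_by_year.items() if y < year for ex in exs]
        results[year] = train_and_eval(train, data_by_year[year])
    return results
```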

5 Experiments

5.1 Crop Type Classification

Model                          Kenya   Brazil   Togo    Mean

AUC ROC

Random Forest                    –       –       –     0.803
No pre-training                  –       –       –     0.700
Crop pre-training                –       –       –     0.801
MAML                             –       –       –     0.843
TIML                             –       –       –       –
      no forgetfulness           –       –       –     0.850
      no encoder                 –       –       –       –
      no task info or encoder    –       –       –     0.848

F1 score

Random Forest                    –       –       –     0.441
No pre-training                  –       –       –       –
Crop pre-training                –       –       –     0.613
MAML                             –       –       –     0.652
TIML                             –       –       –       –
      no forgetfulness           –       –       –     0.724
      no encoder                 –       –       –     0.691
      no task info or encoder    –       –       –     0.652
Table 1: Results for the crop type classification evaluation tasks. All results are averaged from 10 runs and reported with the accompanying standard error. We report the area under the receiver operating characteristic curve (AUC ROC) and the F1 score using a threshold of 0.5 to classify a prediction as the positive or negative class. We highlight the first and second best metrics for each task. TIML achieves the highest F1 score of any model on the Brazil task and the best AUC ROC and F1 scores when averaged across the 3 tasks. We highlight the improvement of TIML relative to other transfer-learning models, showing it is able to leverage task structure to significantly increase performance on the CropHarvest dataset.

We evaluate TIML by training it on the CropHarvest dataset and fine-tuning it on the evaluation tasks, as was done for the benchmark results released with the dataset in Tseng et al. (2021b). MAML (and by extension, TIML) can be applied to any neural network architecture. We use the same base classifier and hyperparameters as in Tseng et al. (2021b): a 1-layer LSTM model followed by a linear classifier.

5.1.1 Ablations

We perform 3 ablations to understand the effects of different components of TIML on overall model performance:

  • No forgetfulness: A TIML model trained without forgetfulness; no tasks are removed in the training loop.

  • No encoder: A TIML model with no encoder. The task information is instead appended to every raw input timestep and passed directly to the classifier.

  • No task information or encoder: No task information passed to the model at all. This model is effectively a normal MAML model, trained with forgetfulness.

5.1.2 Baselines

We compare the TIML architecture to 4 baselines. As with TIML, we fine-tune these models on each benchmark task’s training data and then evaluate them on the task’s test data:

  • MAML: A model-agnostic meta-learning classifier without the task information.

  • Crop pre-training: A classifier pre-trained to classify all data as crop or non-crop (without task metadata), then re-trained on each test task.

  • No pre-training: A randomly initialized classifier, which is not pre-trained on the global CropHarvest dataset but instead is trained directly on the test task training data.

In addition, we trained a Random Forest baseline implemented using scikit-learn (Pedregosa et al., 2011) with the default hyperparameters.

5.2 Yield Estimation

We apply TIML to the original network architectures used by You et al. (2017) – a 1-layer LSTM and a CNN-based regressor. In addition to the remote sensing input, the Deep Gaussian Process baseline model (described in Section 5.2.1) receives as input the year of each training point. We therefore append the year to each timestep of the input to the TIML LSTM, so that the model has comparable inputs to the Deep Gaussian Process. The CNN-based models receive only the remotely sensed data as input.

5.2.1 Baselines

We compared TIML to 2 baselines: the Deep Gaussian Process models (proposed by You et al. (2017) alongside the yield estimation dataset) and standard MAML.

Deep Gaussian Process

To train a Deep Gaussian Process, a deep learning model is first trained to estimate yield given the remote sensing dataset described above. The final hidden vector $h$ of the model (for each input) is used as input to a Gaussian process:

$$y \sim \mathcal{GP}(m(h), k(x, x')) \qquad (2)$$

where the mean function $m$ is computed from $h$, and the kernel function $k$ is conditioned on both the location of the datapoint (defined by its latitude and longitude), $l$, and the year of the datapoint, $t$, with $x = (l, t)$:

$$k(x, x') = \sigma^2 \exp\left(-\frac{\lVert l - l' \rVert^2}{2 r_l^2} - \frac{(t - t')^2}{2 r_t^2}\right) + \sigma_e^2 \, \delta(x, x') \qquad (3)$$

We include baselines with and without a Gaussian process (i.e., using the outputs of the deep learning models directly instead of passing the final hidden vectors to a Gaussian process). We note that this implementation of Deep Gaussian Processes differs from Damianou and Lawrence (2013).
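For intuition, a stand-in for this baseline can be sketched with scikit-learn's Gaussian process regressor. This is an approximation for illustration only (the original baseline uses its own GP implementation and kernel), and the feature layout is an assumption.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def deep_gp_predict(h_train, loc_train, year_train, y_train,
                    h_test, loc_test, year_test):
    """Stand-in sketch for the Deep Gaussian Process baseline.

    The network's final hidden vectors are concatenated with each
    datapoint's (lat, lon) location and year, and a GP is fit on top.
    Note that all training and test features must be held in memory
    together, which is the memory cost discussed in Section 6.2.
    """
    X_train = np.concatenate([h_train, loc_train, year_train[:, None]], axis=1)
    X_test = np.concatenate([h_test, loc_test, year_test[:, None]], axis=1)
    gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
    gp.fit(X_train, y_train)
    return gp.predict(X_test)
```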

As noted in Section 4.2.1, the MODIS datasets have been updated since the original Deep Gaussian Process models were run. We therefore retrain them to obtain our baseline results. We use the same hyperparameters as You et al. (2017), with the addition of early stopping when training. You et al. (2017)’s original results are included for comparison.

Additional implementation details for the crop classification and yield estimation datasets are available in Appendix A.

Model           2011    2012    2013    2014    2015    Mean
LSTM              –       –       –       –       –     6.22
      + GP        –       –       –       –       –       –
      + MAML      –       –       –       –       –    30.06
      + TIML      –       –       –       –       –       –
CNN               –       –       –       –       –     5.96
      + GP        –       –       –       –       –     5.81
      + MAML      –       –       –       –       –     9.79
      + TIML      –       –       –       –       –     5.69
(You et al., 2017)
LSTM + GP       5.77    6.23    5.96    5.70    5.49    5.83
CNN + GP        5.70    5.68    5.83    4.89    5.67    5.55
Table 2: The RMSE of county-level model performance for the yield estimation task. We use temporal validation to evaluate the models: for each year, models are trained with data up to that year and evaluated with that year's data. All results are averaged over 10 runs, with the standard error reported. We highlight the first and second best metrics for each year. For completeness, we include the results reported by You et al. (2017), but highlight that these results were obtained on the MODIS 5.1 dataset (whilst all other models were trained on the MODIS 6.0 dataset) and are the result of 2 runs, compared to 10 runs for all other models. TIML improves on the Deep Gaussian Process models for both architectures, even though MAML performs significantly worse than all other models. This suggests that in some cases, the task information is necessary for meta-learning to work.

6 Results & Discussion

(a) Kenya: Maize vs. Rest
(b) Togo: Crop vs. Non Crop
Figure 3: Results of TIML and the benchmark models when trained on a subset of the evaluation training datasets for the crop type classification task. Specifically, we plot results for (a) the Kenya maize vs. rest evaluation task and (b) the Togo crop vs. non-crop evaluation task. All results are averaged from 10 runs and reported with standard error bars. For both tasks, the subset is balanced so that it contains an equal number of positive and negative samples. TIML is the best performing model in Kenya, and – alongside the crop vs. non-crop pretrained model – is among the best performing models in Togo for all subset sizes, indicating TIML's ability to learn from limited dataset sizes. We highlight that for the smallest training sample size, consisting of 20 samples, TIML is the best performing algorithm.

6.1 Crop Type Classification

We show the model results for TIML, its ablations and all baseline models when trained on the CropHarvest dataset in Table 1. Like Tseng et al. (2021b), we report the AUC ROC score and the F1 score calculated using a threshold of 0.5. Overall, TIML is the best performing algorithm on the CropHarvest dataset, achieving the highest F1 and AUC ROC scores when averaged across all tasks. TIML is consistently the best performing algorithm on every task. In particular, TIML is the only transfer learning model that outperforms a randomly-initialized model in the challenging Brazil task, where there are only 26 positive datapoints.

6.1.1 Effects of transfer learning

Standard transfer learning from the global dataset is not guaranteed to confer advantages to the model. For example, first training using MAML or crop pre-training results in lower performance on the Brazil task compared to an LSTM initialized with random weights. We hypothesize this may be due to the difference in distribution of the Brazil task data relative to the other tasks the models are trained on. TIML is the only model to see significant improvements in performance compared to the randomly initialized model, suggesting conditioning the model with prior, domain-specific information about the tasks can help to model the diversity of samples in the CropHarvest dataset.

6.1.2 Forgetfulness

Forgetfulness – when coupled with task information – improves model performance in more challenging tasks without penalizing performance elsewhere. Training TIML with forgetfulness significantly boosts performance in the Brazil task without substantially impacting performance on the other tasks, and yields significantly higher mean F1 and AUC ROC scores when measured across all tasks. However, training TIML forgetfully without the task information (TIML with no task information or encoder) yields comparable results to the baseline MAML model trained without forgetfulness. We therefore hypothesize that task information provides useful context around which tasks are being kept and forgotten during training, allowing TIML to learn from more difficult tasks in the “forgetful” regime without forgetting easier tasks it has already learned.

6.1.3 Effect of task information

Including task information in the model improves performance, both when it is concatenated to the input data and when it is passed to the model through TIML. However, there are significant differences in performance depending on how this information is passed to the model: passing the task information directly to the classifier (TIML with no encoder) yields mixed results (lower AUC ROC in Kenya, and lower F1 scores in Brazil and Togo compared to MAML or crop-pretraining models). Training TIML with the encoder significantly boosts performance in these regimes, yielding the highest mean AUC ROC and F1 scores. We hypothesize that because the task information remains static for all datapoints in a task, it is challenging for the model to learn from it during the inner loop optimization – the encoder architecture ensures the task information is only optimized in the outer loop.

6.1.4 Effects of fine-tuning dataset sizes

TIML excels at learning from small dataset sizes. We plot the performance of the models as a function of training set size in Figure 3 for the Kenya and Togo evaluation tasks (the Brazil task has only 26 positive examples, and is therefore already in the small-dataset size regime). In both the Togo and Kenya tasks, the TIML model is amongst the most performant algorithms (as measured by AUC ROC score) for all subset sizes. We highlight that for the smallest sample size (20 fine-tuning samples), TIML is the best performing algorithm for both the Togo and Kenya evaluation tasks.

6.2 Yield Estimation

We share yield estimation results in Table 2. Like You et al. (2017), we report the RMSE score averaged across all counties and use temporal validation to evaluate the models. The TIML and MAML LSTM receive the year as input (to reflect the data available to the Deep Gaussian Process), but the TIML and MAML CNN do not.

For both the LSTM and CNN architectures, TIML is the most performant model. This is the case even though the Deep Gaussian Process is much more memory intensive, since it requires all predictions and hidden vectors (for the training and test data) to be computed together for the Gaussian process modelling step; this may be infeasible for larger datasets. TIML requires substantially less memory since it considers each county independently.

It is also worth noting that while TIML achieves the best result of all models, MAML performs significantly worse than all other models. This suggests that in some contexts, the task information is necessary for meta-learning to work.

7 Conclusion

In conclusion, we introduce task-informed meta-learning (TIML), a method for conditioning the model with prior information about a specific task. Specifically, the task information is encoded into a set of vectors which are used to modulate the weights learned by a MAML learner prior to task-specific fine-tuning. In addition, we introduce the concept of “forgetful” meta-learning, which can improve meta-learning performance when there are many similar tasks to learn from. We apply TIML to a range of tasks (classification and regression) and a range of model architectures (RNNs and CNNs), demonstrating its utility in a variety of regimes (including those with very few data points, and regimes in which standard MAML fails completely). Although we focus on agriculture-related tasks, TIML is not specific to agriculture and can be applied to any meta-learning setup with task-level metadata.

References

  • A. Antoniou, H. Edwards, and A. Storkey (2019) How to train your MAML. In International Conference on Learning Representations (ICLR).
  • S. M. R. Arnold, P. Mahajan, D. Datta, I. Bunner, and K. S. Zarkias (2020) learn2learn: a library for meta-learning research. arXiv preprint arXiv:2008.12284.
  • I. Becker-Reshef, C. J. Justice, B. Barker, M. L. Humber, F. Rembold, R. Bonifacio, M. Zappacosta, M. Budde, T. Magadzire, C. Shitote, J. Pound, A. Constantino, C. Nakalembe, K. Mwangi, S. Sobue, T. Newby, A. Whitcraft, I. Jarvis, and J. Verdin (2020) Strengthening agricultural decisions in countries at risk of food insecurity: the GEOGLAM crop monitor for early warning. Remote Sensing of Environment.
  • S. Beery, E. Cole, J. Parker, P. Perona, and K. Winner (2021) Species distribution modeling for machine learning practitioners: a review. In ACM SIGCAS Conference on Computing and Sustainable Societies.
  • L. Boussioux, C. Zeng, D. Bertsimas, and T. J. Guenais (2021) Hurricane forecasting: a novel multimodal machine learning framework. In NeurIPS 2021 Workshop on Tackling Climate Change with Machine Learning.
  • T. Chang, B. P. Rasmussen, B. G. Dickson, and L. J. Zachmann (2019) Chimera: a multi-task recurrent convolutional neural network for forest classification and structural estimation. Remote Sensing 11 (7), pp. 768.
  • L. Collins, A. Mokhtari, and S. Shakkottai (2020) Task-robust model-agnostic meta-learning. In Advances in Neural Information Processing Systems (NeurIPS).
  • A. Damianou and N. D. Lawrence (2013) Deep Gaussian processes. In Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics (AISTATS).
  • T. de Vries, I. Misra, C. Wang, and L. van der Maaten (2019) Does object recognition work for everyone?. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops.
  • FAO (2020) Programme, concepts and definitions. In World Programme for the Census of Agriculture. http://www.fao.org/3/a-i4913e.pdf
  • C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning (ICML).
  • M. A. Friedl, D. Sulla-Menashe, B. Tan, A. Schneider, N. Ramankutty, A. Sibley, and X. Huang (2010) MODIS Collection 5 global land cover: algorithm refinements and characterization of new datasets. Remote Sensing of Environment.
  • R. Han, C. H. Liu, S. Li, L. Y. Chen, G. Wang, J. Tang, and J. Ye (2021) SlimML: removing non-critical input data in large-scale iterative machine learning. IEEE Transactions on Knowledge and Data Engineering.
  • D. Hendrycks and K. Gimpel (2016) Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415.
  • M. A. Jamal and G. Qi (2019) Task agnostic meta-learning for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  • N. Jean, M. Burke, M. Xie, W. M. Davis, D. B. Lobell, and S. Ermon (2016) Combining satellite imagery and machine learning to predict poverty. Science.
  • H. Kerner, G. Tseng, I. Becker-Reshef, C. Nakalembe, B. Barker, B. Munshell, M. Paliyam, and M. Hosseini (2020) Rapid response crop maps in data sparse regions. In ACM SIGKDD Conference on Data Mining and Knowledge Discovery Workshops.
  • M. Kimenyi, J. Adibe, M. Djiré, A. J. Jirgi, A. Kergna, T. T. Deressa, J. E. Pugliese, and A. Westbury (2014) The impact of conflict and political instability on agricultural investments in Mali and Nigeria. Brookings Institute.
  • S. Kumar, C. Torres, O. Ulutan, A. Ayasse, D. Roberts, and B. S. Manjunath (2020) Deep remote sensing methods for methane detection in overhead hyperspectral imagery. In 2020 IEEE Winter Conference on Applications of Computer Vision (WACV).
  • N. Lang, K. Schindler, and J. D. Wegner (2019) Country-wide high-resolution vegetation height mapping with Sentinel-2. Remote Sensing of Environment.
  • T. J. Lark, S. A. Spawn, M. Bougie, and H. K. Gibbs (2020) Cropland expansion in the United States produces marginal yields at high costs to wildlife. Nature Communications.
  • O. Mac Aodha, E. Cole, and P. Perona (2019) Presence-only geographical priors for fine-grained image classification. In International Conference on Computer Vision (ICCV).
  • L. Ohno-Machado, H. S. Fraser, and A. Ohrn (1998) Improving machine learning performance by removing redundant cases in medical data sets. Proceedings, AMIA Symposium.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems (NeurIPS).
  • T. Patterson and N. V. Kelso. Natural Earth. https://www.naturalearthdata.com/
  • F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research.
  • E. Perez, F. Strub, H. De Vries, V. Dumoulin, and A. Courville (2018) FiLM: visual reasoning with a general conditioning layer. In Proceedings of the AAAI Conference on Artificial Intelligence.
  • S. Ravi and H. Larochelle (2017) Optimization as a model for few-shot learning. In International Conference on Learning Representations (ICLR).
  • M. Rußwurm, S. Wang, M. Korner, and D. Lobell (2020) Meta-learning for few-shot land cover classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops.
  • S. Shankar, Y. Halpern, E. Breck, J. Atwood, J. Wilson, and D. Sculley (2017) No classification without representation: assessing geodiversity issues in open data sets for the developing world. In NIPS 2017 Workshop: Machine Learning for the Developing World.
  • J. Snell, K. Swersky, and R. Zemel (2017) Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems (NeurIPS).
  • X. Song, M. C. Hansen, P. Potapov, B. Adusei, J. Pickering, M. Adami, A. Lima, V. Zalles, S. V. Stehman, C. M. Di Bella, M. C. Conde, E. J. Copati, L. B. Fernandes, A. Hernandez-Serna, S. M. Jantz, A. H. Pickens, S. Turubanova, and A. Tyukavina (2021) Massive soybean expansion in South America since 2000 and implications for conservation. Nature Sustainability.
  • N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research.
  • S. Thrun and L. Pratt (1998) Learning to learn. Springer Science & Business Media.
  • E. Triantafillou, H. Larochelle, R. Zemel, and V. Dumoulin (2021) Learning a universal template for few-shot dataset generalization. In Proceedings of the 38th International Conference on Machine Learning (ICML).
  • E. Triantafillou, T. Zhu, V. Dumoulin, P. Lamblin, U. Evci, K. Xu, R. Goroshin, C. Gelada, K. Swersky, P. Manzagol, and H. Larochelle (2020) Meta-Dataset: a dataset of datasets for learning to learn from few examples. In International Conference on Learning Representations (ICLR).
  • G. Tseng, H. Kerner, C. Nakalembe, and I. Becker-Reshef (2021a) Learning to predict crop type from heterogeneous sparse labels using meta-learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops.
  • G. Tseng, I. Zvonkov, C. Nakalembe, and H. Kerner (2021b) CropHarvest: a global satellite dataset for crop type classification. In Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track.
  • M. O. Turkoglu, S. D'Aronco, G. Perich, F. Liebisch, C. Streit, K. Schindler, and J. D. Wegner (2021) Crop mapping from image time series: deep learning with multi-scale label hierarchies. Remote Sensing of Environment.
  • L. J. P. van der Maaten and G. E. Hinton (2008) Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research.
  • E. Vermote (2015) MODIS/Terra surface reflectance 8-day L3 global 500m SIN grid V006. NASA EOSDIS Land Processes DAAC.
  • R. Vuorio, S. Sun, H. Hu, and J. J. Lim (2019) Multimodal model-agnostic meta-learning via task-aware modulation. In Neural Information Processing Systems (NeurIPS).
  • Z. Wan, S. Hook, and G. Hulley (2015) MODIS/Aqua land surface temperature/emissivity 8-day L3 global 1km SIN grid V006. NASA EOSDIS Land Processes DAAC.
  • A. X. Wang, C. Tran, N. Desai, D. Lobell, and S. Ermon (2018) Deep transfer learning for crop yield prediction with remote sensing data. In Proceedings of the 1st ACM SIGCAS Conference on Computing and Sustainable Societies (COMPASS).
  • S. Wang, W. Chen, S. M. Xie, G. Azzari, and D. Lobell (2020) Weakly supervised deep learning for segmentation of remote sensing imagery. Remote Sensing.
  • Y. Wu and K. He (2018) Group normalization. arXiv preprint arXiv:1803.08494.
  • C. Yeh, C. Meng, S. Wang, A. Driscoll, E. Rozi, P. Liu, J. Lee, M. Burke, D. B. Lobell, and S. Ermon (2021) SustainBench: benchmarks for monitoring the sustainable development goals with machine learning. In Thirty-fifth Conference on Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track.
  • J. You, X. Li, M. Low, D. Lobell, and S. Ermon (2017) Deep Gaussian process for crop yield prediction based on remote sensing data. In Proceedings of the AAAI Conference on Artificial Intelligence.
  • Q. Zhu, C. Liao, H. Hu, X. Mei, and H. Li (2021) MAP-Net: multiple attending path neural network for building footprint extraction from remote sensed imagery. IEEE Transactions on Geoscience and Remote Sensing.

Appendix A Implementation Details

We implement TIML in PyTorch (Paszke et al., 2019), using the learn2learn library (Arnold et al., 2020). All MAML and TIML models are trained using the same optimizer hyperparameters. Specifically, we use an inner loop learning rate of . We use an Adam optimizer on the outer loop (for both the classifier and the encoder), with a Cosine Annealing Learning rate (as per Antoniou et al. (2019)). For both the classifier and encoder, we use an initial learning rate of and a minimum learning rate of .

When fine-tuning, we use the same learning rate as the inner loop learning rate () for all models with the exception of the yield-estimation standard-MAML CNN. The standard-MAML CNN experienced an exploding loss using this learning rate, so we reduced the learning rate to when fine-tuning it.

Both MAML and TIML are trained for 1000 epochs; we selected the model checkpoint with the best performance on the validation set (consisting of 10% of the training tasks, up to a maximum of 50 tasks).

For the crop type classification dataset, all LSTM-based classifiers were fine-tuned on the evaluation tasks for 250 gradient steps with batches containing 10 positive and 10 negative examples (as in Tseng et al. (2021b)). We show the variety of agro-ecologies represented in the crop type classification evaluation tasks in Figure 4.

For the yield estimation dataset, all models were fine-tuned on each county for 15 gradient steps, with batches of size 10. The reduced number of fine-tuning steps relative to the crop classification dataset is due to the much smaller amount of data available for each county (compared to the crop classification evaluation tasks). Some counties did not have any fine-tuning data available – the results for these zero-shot counties are shared in Appendix B.

A.1 Forgetfulness

We use the following thresholds to define task-memorization:

  • Crop Type Classification: An AUC ROC of 0.95 or above

  • Yield Estimation: An RMSE of 4 or less

In both cases, a training task was forgotten if it met the threshold for forgetfulness continuously over the last 20 epochs.

For the crop type classification, we note that the training batches were balanced to contain 10 positive and 10 negative examples, making AUC ROC appropriate.

A.2 Task augmentation for geospatial MAML

Defining tasks according to their geospatial boundaries allows for a form of weak task augmentation, by including nearby datapoints which are not explicitly within the boundary. For example, using a rectangular bounding box instead of a polygon when defining a political boundary includes nearby points which may not be inside the polygon. Similarly, for the yield estimation dataset we include nearby counties in tasks for MAML and TIML.
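A minimal sketch of this augmentation for point datasets follows; the buffer size and data format are assumptions for this example.

```python
def augmented_task_points(datapoints, bbox, buffer_deg=1.0):
    """Sketch of weak task augmentation via an expanded bounding box.

    Points falling within `buffer_deg` degrees of the task's bounding box
    are included in the task even if they lie outside the political
    boundary itself (the buffer size here is illustrative).
    """
    lat0, lon0, lat1, lon1 = bbox
    return [
        p for p in datapoints
        if lat0 - buffer_deg <= p["lat"] <= lat1 + buffer_deg
        and lon0 - buffer_deg <= p["lon"] <= lon1 + buffer_deg
    ]
```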

Appendix B Zero-shot learning

For the yield estimation task, some counties did not appear in the training data but were present in the evaluation data (e.g., if the first year of data for a county is 2011, then there will be no training data for that county for the evaluation year 2011).

For these counties, the model is therefore evaluated in a zero-shot learning regime (the county is not present when training the meta-model, or during fine-tuning).

We record the results of the yield model in a zero-shot learning regime below in Table 3. These results are included in the overall results reported in Table 2.

We highlight that very few counties are in this zero-shot regime, but include these results for completeness.

(a) Togo
(b) Kenya
(c) Brazil
Figure 4: Example 1 km × 1 km satellite images of the evaluation regions, demonstrating the variety in field sizes and agroecologies being evaluated. (Images were obtained from Google Earth Pro basemaps composed primarily of high-resolution Maxar images, and are reproduced with permission from Tseng et al. (2021b).)
Model 2011 2012 2013 2014 2015
# counties 7 9 5 6 5
LSTM + TIML 8.99 12.93 17.19 9.97 11.22
CNN + TIML 10.44 7.02 9.81 7.25 11.89
Table 3: Zero-shot learning results: RMSE of the TIML model when measured only on counties not present during training (or fine-tuning). We note that these results were obtained with no training data about the county, in a zero-shot learning regime. The number of counties being tested is additionally recorded.