Controlled abstention neural networks for identifying skillful predictions for classification problems

04/16/2021
by   Elizabeth A. Barnes, et al.

The earth system is exceedingly complex and often chaotic in nature, making prediction incredibly challenging: we cannot expect to make perfect predictions all of the time. Instead, we look for specific states of the system that lead to more predictable behavior than others, often termed "forecasts of opportunity." When these opportunities are not present, scientists need prediction systems that are capable of saying "I don't know." We introduce a novel loss function, termed the "NotWrong loss", that allows neural networks to identify forecasts of opportunity for classification problems. The NotWrong loss introduces an abstention class that allows the network to identify the more confident samples and abstain (say "I don't know") on the less confident samples. The abstention loss is designed to abstain on a user-defined fraction of the samples via a PID controller. Unlike many machine learning methods used to reject samples post-training, the NotWrong loss is applied during training to preferentially learn from the more confident samples. We show that the NotWrong loss outperforms other existing loss functions for multiple climate use cases. The implementation of the proposed loss function is straightforward in most network architectures designed for classification as it only requires the addition of an abstention class to the output layer and modification of the loss function.


Plain Language Summary

The earth system is exceedingly complex and often chaotic in nature, making prediction incredibly challenging: we cannot expect to make perfect predictions all of the time. Instead, we can look for specific states of the system that lead to more predictable behavior than others, often termed “forecasts of opportunity”. When these opportunities are not present, scientists need prediction systems that are capable of saying “I don’t know.” We present a method for teaching neural networks, a type of machine learning tool, to say “I don’t know” for classification problems. By doing so, the neural network focuses less on the predictions it identifies as problematic and focuses more on the predictions where its confidence is high. In the end, this leads to better predictions.

1 Introduction

The earth system is exceedingly complex and often chaotic in nature, making prediction incredibly challenging: we cannot expect to make perfect predictions all of the time. Instead, we look for specific states of the system that lead to more predictable behavior than others, often termed “forecasts of opportunity” (Mariotti et al., 2020; Albers and Newman, 2019; Mayer and Barnes, 2020; Barnes et al., 2020). When skillful forecast opportunities are not present, scientists need prediction systems that are capable of saying “I don’t know.” While the concept of forecasts of opportunity stems from weather and climate prediction, the idea is more general. For example, a forecast of opportunity framework may be beneficial when certain predictors are only helpful under certain circumstances. Additionally, if certain predictions (labels) are more predictable than others, or if there is unstructured noise in the training data, a system that can say “I don’t know” may identify the more skillful predictions when they occur.

Many approaches to identify skillful forecasts of opportunity already exist. For example, retrospective analysis of a forecast can provide a sense of the physical circumstances that lead to forecast successes or busts (e.g., Rodwell et al., 2013), and the ensemble spread can give a sense of uncertainty in numerical weather prediction systems (e.g., Van Schaeybroeck and Vannitsem, 2016). Albers and Newman (2019) used a linear inverse modeling approach to identify confident subseasonal predictions and showed that these more confident predictions were indeed more skillful. Recently, Mayer and Barnes (2020) and Barnes et al. (2020) suggested that machine learning, specifically neural networks, may be a useful tool to identify forecasts of opportunity for subseasonal-to-seasonal climate predictions. Specifically, a classification network is first trained, then the predicted probabilities are ordered from largest to smallest, and a selection of predictions with the highest probabilities are identified as possible forecasts of opportunity. While Mayer and Barnes (2020) and Barnes et al. (2020) show that this approach works well for classification tasks (i.e., predicting a specific category), where the network is already tasked with predicting a probability, it is less clear how one might apply this methodology to regression tasks (i.e., predicting a continuous quantity).

Most of the current machine learning approaches used to identify forecasts of opportunity, including those described above, are applied post-training: the network is first trained, and then the model confidence is assessed. Instead, here we lean heavily on work by Thulasidasan et al. (2019) and Thulasidasan (2020) to further explore a deep learning abstention loss function for classification tasks that teaches the network to say “I don’t know” (abstain) on certain samples during training. The resulting controlled abstention network (CAN) preferentially learns from the samples in which it has more confidence and abstains on samples in which it has less confidence. The CAN is designed to abstain on a user-defined fraction of the samples via a PID controller, which ultimately leads to more accurate predictions than our baseline classification approach. While alternative methods have recently been suggested for abstention (rejection) during training (Geifman and El-Yaniv, 2019a, 2019b), the CAN approach can be easily implemented in almost any network architecture designed for classification, as it only requires the addition of an abstention class to the output layer and modification of the training loss.

We demonstrate the behavior of the CAN for three use cases based on synthetic climate data where the correct answer is known. The first use case explores the utility of the CAN in situations where certain classes (labels) are more predictable than others. The second use case explores the ability of the CAN to learn in the presence of unstructured noise, that is, when there is no way to tell a priori whether the sample is predictable or not. The third use case is modeled loosely after global teleconnections associated with the El Niño Southern Oscillation (e.g., McPhaden et al., 2006; Yeh et al., 2018) and explores the utility of the CAN in identifying forecasts of opportunity for climate prediction applications.

Section 2 introduces the synthetic climate data and the general neural network architecture. Section 3 discusses the baseline loss function and the CAN in detail, and Section 4 presents the results. Section 5 provides additional discussion comparing the approach with previous methods, and Section 6 presents conclusions.

2 Data and use cases

2.1 Synthetic climate data

To demonstrate the utility of the controlled abstention network (CAN), we use the synthetic benchmark data set introduced by Mamalakis et al. (2021). While Mamalakis et al. (2021) provide an extensive description of this data, we give a brief overview here. The data set consists of input fields $x_i$ and an output series $y_i$ (where $i$ denotes the sample), where the output is a function of the input. The input fields represent monthly anomalous global sea surface temperatures (SSTs) generated from a multivariate normal distribution with a correlation matrix estimated from observed SST fields (https://psl.noaa.gov/data/gridded/data.cobe2.html). Each input sample consists of one map of SST anomalies, denoted $x_i$. Mamalakis et al. (2021) then define the global response to sample $i$ as the sum of local, nonlinear responses. Specifically,

$$y_i = \sum_{g=1}^{G} F_g\left(x_{i,g}\right) \qquad \text{(1)}$$

where $g$ represents the grid point (of $G$ total) and $F_g$ is defined locally at each grid point by a piecewise linear function. The slopes (where the segment index runs from 1 to the number of piecewise linear segments, set here to 5) of each local function are chosen randomly from a multivariate normal distribution with a correlation matrix, once again, estimated from observed SST fields.

In the end, this data set consists of input maps of SSTs with spatial correlations indicative of observed SSTs, but where each input map is independent of the others. The response $y_i$ then represents the sum of contributions from each grid point across the globe, where each contribution is a nonlinear (specifically, piecewise linear) function of the SST value at that grid point. To speed up training, we reduce the number of grid points (pixels) from that used by Mamalakis et al. (2021) to 60 longitudes and 15 latitudes, for a total of 900 grid points per input map. An example input map is shown in Fig. 1.
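To make the construction concrete, the sketch below generates the global response as a sum of local piecewise linear functions of the SST anomalies. The node locations, slopes, and seeds are illustrative stand-ins: the actual benchmark draws spatially correlated slopes from observed SST statistics, which is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_grid, n_nodes = 900, 6                      # 60 x 15 grid points; 5 linear segments

# Illustrative stand-ins for the local piecewise linear functions F_g: node
# locations and node values at each grid point. (The benchmark draws the slopes
# from a correlated multivariate normal; independent draws are used here only
# to keep the sketch self-contained.)
x_nodes = np.tile(np.linspace(-3.0, 3.0, n_nodes), (n_grid, 1))
slopes = rng.normal(size=(n_grid, n_nodes - 1))
y_nodes = np.concatenate(
    [np.zeros((n_grid, 1)), np.cumsum(slopes * np.diff(x_nodes, axis=1), axis=1)],
    axis=1,
)

def global_response(sst_map):
    """y_i = sum over grid points g of F_g(x_{i,g}); np.interp holds the outer
    node values constant outside the node range (a simplification)."""
    x = sst_map.ravel()
    return sum(np.interp(x[g], x_nodes[g], y_nodes[g]) for g in range(n_grid))

sample = rng.normal(size=(15, 60))            # one synthetic SST anomaly map
print(global_response(sample))
```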

2.2 Experimental design

We modify the synthetic climate data to make it suitable for classification by assigning each input map (i.e., sample) to one of ten classes (see Fig. 1). The ten classes are determined by binning the $y$ values into deciles. For all use cases explored, we task a neural network with ingesting a sample input map and predicting the correct class.
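A minimal sketch of the decile binning, assuming the full vector of $y$ values is available:

```python
import numpy as np

def decile_labels(y):
    """Assign each sample to one of ten classes by binning y into deciles."""
    edges = np.quantile(y, np.linspace(0.1, 0.9, 9))   # nine interior decile edges
    return np.digitize(y, edges)                        # integer labels 0..9

y = np.random.default_rng(0).normal(size=10_000)
labels = decile_labels(y)
print(np.bincount(labels))                              # roughly 1,000 samples per class
```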

2.2.1 badClasses

For the first use case, badClasses, we modify the data set such that all samples in classes 4 and 5 are “corrupted”; they are randomly assigned an incorrect class. Over the entire data set, 20% of the samples are assigned incorrect labels and 80% of the samples retain their correct labels. In this situation, we would like the CAN to identify the samples associated with classes 4 and 5 and abstain on them, since any perceived relationship is unreliable.
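A minimal sketch of this corruption procedure; the helper and stand-in labels below are hypothetical, but they follow the rule of reassigning a randomly chosen incorrect class:

```python
import numpy as np

def assign_wrong_labels(labels, idx, n_classes=10, seed=0):
    """Replace the label of each sample in idx with a randomly chosen incorrect class."""
    rng = np.random.default_rng(seed)
    labels = labels.copy()
    for i in idx:
        wrong_choices = np.delete(np.arange(n_classes), labels[i])  # exclude the true class
        labels[i] = rng.choice(wrong_choices)
    return labels

# badClasses: corrupt every sample whose true class is 4 or 5 (~20% of the data).
rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=18_000)        # stand-in decile labels
bad_idx = np.where(np.isin(labels, [4, 5]))[0]
bad_class_labels = assign_wrong_labels(labels, bad_idx)
```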

2.2.2 mixedLabels

For the mixedLabels use case, we modify the data set such that 5% of all samples (regardless of class) are corrupted by assigning a randomly chosen incorrect label, while 95% of the samples retain their correct label. Unlike badClasses, in this case there is no systematic relationship between the input maps and whether a sample is corrupted. For mixedLabels, we would like the CAN to learn to abstain on the corrupted samples by identifying them as those that do not behave like the majority of the samples. If the CAN does this, it should learn the reliable samples better. For the testing set, the CAN will not know which samples have been corrupted, but it should achieve better accuracy because it has reduced the chance of learning spurious relationships during training.
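The same hypothetical helper from the badClasses sketch can be reused to corrupt a random 5% of samples regardless of class:

```python
import numpy as np

# mixedLabels: corrupt a random 5% of all samples, regardless of class.
rng = np.random.default_rng(1)
labels = rng.integers(0, 10, size=18_000)                 # stand-in decile labels
corrupt_idx = rng.choice(labels.size, size=int(0.05 * labels.size), replace=False)
mixed_labels = assign_wrong_labels(labels, corrupt_idx)   # helper from the badClasses sketch
```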

2.2.3 fooENSO

For our last use case, fooENSO, we modify the data to loosely reflect forecasts of opportunity related to teleconnections associated with the El Niño Southern Oscillation (ENSO). Warm ENSO events (El Niño events) have long been known to impact global temperatures and precipitation (e.g., McPhaden et al., 2006; Yeh et al., 2018), at times leading to skillful forecasts on subseasonal-to-seasonal time scales (e.g., Johnson et al., 2014). To mimic this behavior with our synthetic data set, we average the anomalous synthetic SSTs in the ENSO region within the equatorial eastern Pacific (dashed white box in the map in Fig. 1). When the average value in this box is larger than 0.5 (29% of the samples), we leave the sample as is to reflect an opportunity where a strong El Niño may lead to more predictable behavior of the global climate system. Samples where the average value is less than 0.5 represent “noisy” samples; therefore, we corrupt 50% of these noisy samples by assigning them a randomly chosen incorrect label. As a result, 35% of all samples are corrupt, and 29% of all samples (i.e. those associated with a strong El Niño) retain their correct label. We note that one could corrupt all of the noisy samples, instead of just 50% of them, and the conclusions are the same (see Supp. Fig. S1). With such a setup, we would like the CAN to identify strong synthetic El Niño samples (i.e. large values within the ENSO box, Fig. 1) as reliable samples to learn, while abstaining on the other samples that may have unreliable labels.
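A sketch of the fooENSO corruption, reusing the hypothetical assign_wrong_labels helper from above. The box indices for the equatorial eastern Pacific are illustrative placeholders, not the exact box shown in Fig. 1:

```python
import numpy as np

# Hypothetical indices for the equatorial eastern Pacific box on the 15 x 60 grid;
# the paper's exact box (dashed white outline in Fig. 1) is not specified here.
ENSO_BOX = (slice(6, 9), slice(40, 50))

def fooenso_corrupt(sst_maps, labels, threshold=0.5, seed=0):
    """Leave strong synthetic El Nino samples untouched; corrupt 50% of the rest."""
    rng = np.random.default_rng(seed)
    enso_index = sst_maps[:, ENSO_BOX[0], ENSO_BOX[1]].mean(axis=(1, 2))
    strong_nino = enso_index > threshold                    # the "tranquil" samples
    noisy_idx = np.where(~strong_nino)[0]
    corrupt_idx = rng.choice(noisy_idx, size=noisy_idx.size // 2, replace=False)
    return assign_wrong_labels(labels, corrupt_idx, seed=seed), strong_nino
```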

Figure 1:

General CAN network architecture used for all use cases. A map of synthetic sea-surface temperatures is fed into a fully connected network and tasked with classifying each sample into one of ten classes. An additional class, the abstention class, is included in the output layer for training with abstention. The number of hidden layers varies between two and three depending on the use case.

2.3 Network architecture and training

We train a fully connected feed-forward network with two to three hidden layers using Python 3.7.9 and TensorFlow 2.4. For badClasses and mixedLabels, the network has two hidden layers with 50 and 25 units, respectively. For fooENSO, we train a network with three hidden layers with 500, 250, and 20 units, respectively, to demonstrate the utility of the CAN for deeper networks. Additional use cases are provided in Supp. Fig. S1. For the baseline ANN, the output layer consists of 10 units, representing each of the 10 classes (Fig. 1). For the CAN, we add an additional output unit to represent the abstention class. This will be discussed further in Section 3.

We train with a ReLU (rectified linear unit) activation function on the hidden layers and a softmax layer at the output. The softmax layer ensures that the sum of all predicted likelihoods is 1.0 for each sample. The network is trained with a learning rate of 0.001 and batch size of 32. For badClasses and mixedLabels, we train on 8,000 samples, validate on 5,000 samples, and test on 5,000 samples. While we could train on a much larger data set, we have intentionally kept the sample size relatively small to demonstrate the utility of the CAN when the sample size is relatively low — as is the case for many geoscience applications. For fooENSO, we train on 32,000 samples, validate on 5,000 samples, and test on 5,000 samples. The number of samples is increased for this use case to accommodate the increase in complexity of the network. That said, additional use cases with smaller training sizes are shown in Supp. Fig. S1. All quantities and figures are computed from the testing data unless otherwise specified.
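A sketch of the architecture in TensorFlow/Keras. The layer sizes, learning rate, and batch size follow the description above; the optimizer (Adam) is an assumption, since only the learning rate is specified:

```python
import tensorflow as tf

N_CLASSES = 10

def build_network(n_inputs=900, hidden=(50, 25), abstain=True):
    """Fully connected classifier; the CAN simply appends one abstention unit."""
    n_outputs = N_CLASSES + 1 if abstain else N_CLASSES
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.InputLayer(input_shape=(n_inputs,)))
    for units in hidden:
        model.add(tf.keras.layers.Dense(units, activation="relu"))
    model.add(tf.keras.layers.Dense(n_outputs, activation="softmax"))
    return model

# Baseline ANN: ten output classes and a standard cross-entropy loss.
baseline = build_network(abstain=False)
baseline.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),  # optimizer assumed
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
```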

We employ early stopping to automatically determine the optimal number of epochs to train. Specifically, the network stops training when the validation accuracy stops increasing, with a patience of 30 epochs. The network with the best validation accuracy is saved. For the CAN specifically, we select the best-performing network from epochs where the validation abstention fraction is within 0.1 of the user-chosen abstention setpoint. For all examples shown here, 50 different networks are trained for each configuration (i.e., baseline ANN and CAN) by varying the randomly initialized weights.
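A sketch of how this model-selection rule could be implemented as a Keras callback. The logged metric name val_abstention_fraction is hypothetical and would need to be supplied by a custom metric:

```python
import numpy as np
import tensorflow as tf

class BestWithinSetpoint(tf.keras.callbacks.Callback):
    """Keep the weights from the epoch with the best validation accuracy among
    epochs whose validation abstention fraction is within `tol` of the setpoint.
    Assumes a metric named 'val_abstention_fraction' is logged each epoch."""

    def __init__(self, setpoint, tol=0.1):
        super().__init__()
        self.setpoint, self.tol = setpoint, tol
        self.best_acc, self.best_weights = -np.inf, None

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        acc = logs.get("val_accuracy", -np.inf)
        frac = logs.get("val_abstention_fraction", np.inf)
        if abs(frac - self.setpoint) <= self.tol and acc > self.best_acc:
            self.best_acc, self.best_weights = acc, self.model.get_weights()

    def on_train_end(self, logs=None):
        if self.best_weights is not None:            # restore the selected network
            self.model.set_weights(self.best_weights)
```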

3 Methods

3.1 Baseline networks

The baseline artificial neural network (baseline ANN) is identical to the CAN architecture (Fig. 1) with the following exceptions: it does not include an abstention class, and it uses a standard cross-entropy loss function (Géron, 2019, p. 149), which we define as

$$\mathcal{L}_i = -\log\left(p_{c_i}\right) \qquad \text{(2)}$$

where $i$ denotes the sample with true label $c_i$ (one of the ten classes), and $p_{c_i}$ is the likelihood assigned to the correct class for sample $i$. From this point on we drop the subscript $i$ for readability.

Once the baseline ANN is trained, we invoke abstention on the least certain predictions by thresholding the ANN-predicted likelihoods (i.e., the output of the network). Specifically, the class prediction by the network for a single sample is defined as the class with the highest likelihood. We then sort these winning likelihoods to threshold the ANN predictions. For example, we abstain on 20% of the samples by throwing out the 20% smallest winning likelihoods (i.e., the 20% least confident predictions). This thresholding approach for abstention has been shown to be very powerful on its own (e.g., Mayer and Barnes, 2020; Barnes et al., 2020) and will serve as a comparison for the CAN.
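A minimal sketch of this post-hoc thresholding, given the baseline ANN's predicted probabilities and the true labels:

```python
import numpy as np

def abstain_by_threshold(probs, labels, abstain_frac=0.2):
    """Post-hoc abstention for the baseline ANN: discard the least confident
    predictions (smallest winning likelihoods) and score the remainder."""
    winning = probs.max(axis=1)                         # confidence of the predicted class
    keep = winning >= np.quantile(winning, abstain_frac)
    preds = probs.argmax(axis=1)
    coverage = keep.mean()
    accuracy = (preds[keep] == labels[keep]).mean()
    return coverage, accuracy
```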

Throughout this paper, we use “coverage” to denote the fraction of samples for which the network makes a prediction and “abstention” to refer to the fraction of samples for which the network does not make a prediction. Thus, the percent coverage is always 100% minus the percent abstention. For the baseline approach, abstention and coverage are computed post-training based on the predicted winning likelihoods, while for the CAN these quantities are determined during the training itself (see next section).

3.2 Controlled Abstention Network (CAN)

3.2.1 NotWrong loss

We next introduce the abstention loss function for the controlled abstention network (CAN). Unlike the baseline ANN, the CAN architecture allows the network to assign a label of “abstain.” The abstention loss is designed to penalize the network for abstention but to penalize it even more for getting the prediction wrong. For this reason, we have named it the NotWrong loss and define it as

$$\mathcal{L}_{\text{NotWrong}} = -q\,\log\left(p_c + p_a\right) + \alpha \log\left(\frac{1}{q}\right) \qquad \text{(3)}$$

where $p_a$ is the likelihood assigned to the abstention class, $\alpha$ is a non-negative weight, and $q$ is the likelihood assigned to not abstaining,

$$q = 1 - p_a \qquad \text{(4)}$$

The first term in Eq. 3 represents the likelihood of not getting the prediction wrong, that is, the sum of the likelihood of getting it correct plus the likelihood of abstaining, while the second term is a penalty for abstaining that is weighted by $\alpha$. Without this penalty, the network would abstain on every sample to minimize the loss.

Like the abstention loss introduced in Thulasidasan et al. (2019), the NotWrong loss has the property that, during gradient descent, the network continues to learn on the abstained samples, although to a lesser extent. A proof of this is provided in the supplemental material. This feature of the loss allows the network to move samples in and out of abstention during training while it continues to learn the non-abstained samples better (as will be shown).
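A sketch of the NotWrong loss as reconstructed in Eqs. 3 and 4, written as a custom TensorFlow loss. The conventions (abstention class in the last output unit, one-hot labels over the ten real classes) and the wiring of alpha as a tf.Variable are assumptions for illustration:

```python
import tensorflow as tf

def make_notwrong_loss(alpha_var):
    """NotWrong abstention loss (a sketch of Eq. 3). Assumes the softmax output
    places the abstention class in the final unit and y_true is one-hot over the
    ten real classes. alpha_var is a non-trainable tf.Variable so the PID
    controller can adjust the abstention penalty between batches."""
    def notwrong_loss(y_true, y_pred):
        eps = tf.keras.backend.epsilon()
        p_abstain = y_pred[:, -1]                                     # p_a
        p_correct = tf.reduce_sum(y_true * y_pred[:, :-1], axis=-1)   # p_c
        q = 1.0 - p_abstain                                           # likelihood of not abstaining
        not_wrong_term = -q * tf.math.log(p_correct + p_abstain + eps)  # reward being "not wrong"
        abstain_penalty = alpha_var * tf.math.log(1.0 / (q + eps))      # discourage abstaining
        return not_wrong_term + abstain_penalty
    return notwrong_loss

alpha = tf.Variable(0.5, trainable=False, dtype=tf.float32)
loss_fn = make_notwrong_loss(alpha)    # e.g., model.compile(..., loss=loss_fn)
```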

3.2.2 PID controller

Figure 2: Example of CAN metrics during training of the badClasses experiment with abstention setpoint of 0.7 and a network random seed of 19.

The parameter $\alpha$ in Eq. 3 determines how much the network is penalized for abstaining. $\alpha$ can be adaptively modified throughout training so that the network abstains on a specified fraction of the training samples. Inspired by the success reported in Thulasidasan (2020, Chapter 4), we implement a discrete-time PID controller (velocity algorithm) to modulate $\alpha$ throughout training (e.g., Visioli, 2006, Eq. 1.38).

Thulasidasan (2020) solely explores low abstention setpoints (e.g., 10%) and evaluates the PID terms batch by batch. For our applications, however, we need the algorithm to work well for a broad range of abstention setpoints (e.g., from 10% to 90%). With a high abstention setpoint, say 90%, and a batch size of 32, only about 3 samples on average would be covered per batch, which leads to unstable behavior. Because of this, we evaluate the PID terms over 6 consecutive batches (192 samples); this strategy leads to more stable behavior of the abstention fraction without impeding training. An example where the PID controller modulates $\alpha$ to control the abstention fraction during training is shown in Fig. 2a,b.
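A sketch of a velocity-form discrete PID controller for $\alpha$. The gains are illustrative, not the values used in the paper, and the controller is assumed to be called once every 6 batches with the observed abstention fraction:

```python
import tensorflow as tf

class AlphaPIDController:
    """Velocity-form discrete PID controller that nudges alpha so the training
    abstention fraction tracks the user-chosen setpoint. Gains are illustrative."""

    def __init__(self, alpha_var, setpoint, kp=1.0, ki=0.5, kd=0.1):
        self.alpha, self.setpoint = alpha_var, setpoint
        self.kp, self.ki, self.kd = kp, ki, kd
        self.e_prev, self.e_prev2 = 0.0, 0.0

    def update(self, abstention_fraction):
        # Abstaining more than the setpoint -> positive error -> raise alpha
        # (a stronger penalty for abstaining), and vice versa.
        e = abstention_fraction - self.setpoint
        d_alpha = (self.kp * (e - self.e_prev)
                   + self.ki * e
                   + self.kd * (e - 2.0 * self.e_prev + self.e_prev2))
        self.alpha.assign(tf.maximum(self.alpha + d_alpha, 0.0))   # keep alpha non-negative
        self.e_prev2, self.e_prev = self.e_prev, e

alpha = tf.Variable(0.5, trainable=False, dtype=tf.float32)
controller = AlphaPIDController(alpha, setpoint=0.7)
# controller.update(observed_abstention_fraction)  # called once every 6 batches
```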

4 Results

4.1 General performance

Figure 3: CAN testing accuracy as a function of percent coverage for 50 different simulations of each of the three use cases. Pink shading denotes the full range of accuracies of the baseline ANN, while the solid pink line denotes their median. Dots denote results for each CAN simulation; colors denote different abstention setpoints. Panel (b) includes an additional gray line that shows the full range of results from the ORACLE simulation.

For each use case, we train 50 different baseline ANNs and 50 different CANs for setpoints ranging from 0.05 to 0.95. Accuracies as a function of percent coverage for the three use cases are shown in Fig. 3. As coverage decreases (abstention increases), accuracy increases for both the baseline ANNs and the CANs, demonstrating that more confident predictions are more accurate. That said, for all three use cases, the CAN exhibits higher accuracies than the baseline ANN for most coverages. This is further visualized in Fig. 4, where we plot the maximum difference in accuracy between the CAN and the best baseline ANN for the same coverage. For most abstention setpoints the CAN improves upon the baseline ANN, by as much as 0.045 (i.e., 4.5 percentage points) in accuracy.

For the mixedLabels use case, Fig. 3b shows an additional gray shaded line that represents an additional set of 50 simulations we term ORACLE (see also Thulasidasan, 2020). In this setup, we play an all-knowing oracle and remove all of the corrupted samples from the training and validation sets prior to training with the baseline ANN approach. Accuracies are then evaluated on the same testing set as the CAN. Thus, ORACLE represents an upper bound on what we could hope to expect from the CAN. For coverage fractions between 40% and 80%, the best CAN models achieve accuracies similar to ORACLE, suggesting that the abstention process has done a nearly ideal job of abstaining on the corrupted samples.

Figure 4: Maximum difference in testing accuracy between the CAN and the best baseline ANN for various abstention setpoints. Positive values imply higher accuracies of the CAN compared to the best baseline ANN.

4.2 Strategies of the CAN

Figure 5: (a,b) Testing accuracy computed over the tranquil samples for the fooENSO and badClasses experiments. Shading denotes the full range of accuracies from the baseline ANN, while the solid lines denote the median. (c,d) Fraction of covered samples that are tranquil samples. Gray dashed lines denote the maximum fraction possible given the experimental setup.

There are multiple strategies that the CAN may employ to improve upon the baseline ANN: (1) the CAN may do a better job identifying the non-corrupted samples and abstaining on the corrupted samples, (2) the CAN may do a better job learning the relationship between the non-corrupted inputs and their labels/true classes, and (3) a combination of #1 and #2. Fig. 5 explores these strategies for the badClasses and fooENSO use cases.

Beginning with the fooENSO use case (Fig. 5a,c), we once again plot the accuracy as a function of coverage, but focus only on the tranquil samples this time. We use the identifier tranquil to denote the group of samples that should be identifiable by the network as non-corrupt (i.e. all samples that exhibit a strong El Niño) and not tranquil to denote the rest. Fig. 5c shows that while the baseline ANN and the CAN cover similar fractions of tranquil samples, the CAN is better at learning the relationship between the inputs and outputs of the tranquil samples (Fig. 5a). Thus, for this use case, the CAN outperforms the baseline ANN by taking approach #2. For the badClasses use case (Fig. 5b,d), the CAN appears to outperform the baseline ANN by employing a combination of approaches #1 and #2. That is, the CAN does a significantly better job identifying the tranquil samples (i.e. samples that do not belong to classes 4 and 5; Fig. 5d). It also does a better job learning the relationship between the inputs and outputs of the tranquil samples.

Figure 6: Layer-wise relevance propagation (LRP) heatmap averaged over testing samples for the fooENSO experiment using an abstention setpoint of 0.7. The heatmap shows the mean over all 50 simulations for samples that are correctly assigned a label of 1. We use the LRP-z rule but set all negative relevances to zero and scale each heatmap by the sum of the positive relevances over the map prior to averaging.

Neural network explainability methods have recently attracted the attention of the climate science community as a way to assist scientists in identifying new forecasts of opportunity (e.g., Toms et al., 2020; Barnes et al., 2020; Mayer and Barnes, 2020). One particular method, layer-wise relevance propagation (LRP), produces a heatmap that approximates the regions of the input most relevant to a neural network’s output according to a set of propagation rules (e.g., Bach et al., 2015; Montavon et al., 2017; Mamalakis et al., 2021). Fig. 6 shows the average fooENSO LRP heatmap over all 50 simulations for samples that were correctly assigned a label of 1 by the CAN. Large relevance within the ENSO region (dashed white box) demonstrates that the CAN is indeed using this region to identify successful predictions. Other regions are also relevant (non-zero), since the correct label is determined by the sum of contributions across the globe (see Section 2.2.3). In this way, the CAN architecture may be paired with explainability methods to identify the mechanisms behind forecasts of opportunity within data sets (e.g., Barnes et al., 2020). Likewise, one could also use explainability methods to explore the reasons behind the CAN’s abstention (e.g., Thulasidasan et al., 2019).
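A sketch of the heatmap post-processing described in the Fig. 6 caption (computing the LRP-z relevances themselves requires an attribution library and is not shown):

```python
import numpy as np

def composite_relevance(heatmaps):
    """Average LRP heatmaps as in Fig. 6: zero out negative relevances, normalize
    each map by the sum of its positive relevances, then average over samples."""
    r = np.clip(heatmaps, 0.0, None)                 # keep positive relevance only
    totals = r.sum(axis=(1, 2), keepdims=True)
    r = np.divide(r, totals, out=np.zeros_like(r), where=totals > 0)
    return r.mean(axis=0)                            # mean over samples/simulations
```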

5 Discussion

As discussed in the introduction, Thulasidasan et al. (2019) and Thulasidasan (2020) introduced their own deep abstention classifier (DAC) loss function, defined as

$$\mathcal{L}_{\text{DAC}} = -q\,\log\left(\frac{p_c}{q}\right) + \alpha \log\left(\frac{1}{q}\right) \qquad \text{(5)}$$

where all notation is the same as in Eqs. 3 and 4. Our NotWrong loss and the DAC loss are very similar. The important difference is that the quantity inside the first log in the DAC loss represents the likelihood of getting the prediction correct, while in the NotWrong loss it represents the likelihood of getting the prediction not wrong.
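For comparison, a sketch of the DAC loss under the same conventions (and the same caveats) as the NotWrong sketch in Section 3.2.1:

```python
import tensorflow as tf

def make_dac_loss(alpha_var):
    """DAC loss (a sketch of Eq. 5): abstention class in the last unit,
    y_true one-hot over the ten real classes, alpha_var a tf.Variable."""
    def dac_loss(y_true, y_pred):
        eps = tf.keras.backend.epsilon()
        p_abstain = y_pred[:, -1]
        p_correct = tf.reduce_sum(y_true * y_pred[:, :-1], axis=-1)
        q = 1.0 - p_abstain                                       # likelihood of not abstaining
        correct_term = -q * tf.math.log(p_correct / (q + eps) + eps)
        abstain_penalty = alpha_var * tf.math.log(1.0 / (q + eps))
        return correct_term + abstain_penalty
    return dac_loss
```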

While both the DAC loss and the NotWrong loss improve accuracies over the baseline ANN, for the use cases explored here we find that the NotWrong loss outperforms the DAC loss, as shown in Supp. Fig. S2. We believe that this is because the NotWrong loss puts more energy into learning the correct answer via larger negative derivatives of the loss with respect to the likelihood of the correct class ($p_c$). This is demonstrated in Supp. Fig. S3, where we display the derivative of the loss with respect to $p_c$ for different value ranges of $p_c$ and $p_a$. Pink shading in Supp. Fig. S3c represents regions of phase space where the derivative of the NotWrong loss is more negative than that of the DAC loss, and we find this region of phase space most representative of the use cases explored here (not shown). Future work will explore this behavior further.

6 Conclusions

The ability to say “I don’t know” is an important skill for any scientist.

In the context of prediction with deep learning, the identification of uncertain (unpredictable) samples is often approached post-training. In this paper, we explore an alternative: a deep learning loss function that can abstain during training for classification problems. We introduce a new abstention loss for classification that focuses on getting the answer not wrong, rather than getting it right. The controlled abstention network (CAN) trained with this loss preferentially learns from the more confident samples, and it ultimately outperforms both the baseline ANN approach and the abstention loss of Thulasidasan et al. (2019) and Thulasidasan (2020) for the climate use cases explored here.

An additional benefit of the CAN abstention loss is its simplicity: it is straightforward to implement in almost any network architecture, as it only requires adding an abstention class to the output layer and modifying the training loss. The abstention loss framework has the potential to help deep learning algorithms identify skillful forecasts, which ultimately improves performance on the samples with predictability.

Acknowledgements.
This work was funded, in part, by the NSF AI Institute for Research on Trustworthy AI in Weather, Climate, and Coastal Oceanography (AI2ES) under NSF grant ICER-2019758. Once published, the code and data will be made available to the community via the Mountain Scholar permanent data repository with a permanent DOI and via Zenodo.

References

  • Albers, J. R., & Newman, M. (2019). A priori identification of skillful extratropical subseasonal forecasts. Geophysical Research Letters, 46(21), 12527–12536. https://doi.org/10.1029/2019GL085270
  • Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K.-R., & Samek, W. (2015). On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE, 10(7), e0130140. https://doi.org/10.1371/journal.pone.0130140
  • Barnes, E. A., Mayer, K., Toms, B., Martin, Z., & Gordon, E. (2020). Identifying opportunities for skillful weather prediction with interpretable neural networks. arXiv, 2012.07830.
  • Geifman, Y., & El-Yaniv, R. (2019a). SelectiveNet: A deep neural network with an integrated reject option. arXiv, 1901.09192.
  • Geifman, Y., & El-Yaniv, R. (2019b). Uncertainty estimation and its applications in deep neural networks. Ph.D. dissertation, Computer Science Department, Technion.
  • Géron, A. (2019). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow. O'Reilly UK Ltd.
  • Johnson, N. C., Collins, D. C., Feldstein, S. B., L'Heureux, M. L., & Riddle, E. E. (2014). Skillful wintertime North American temperature forecasts out to 4 weeks based on the state of ENSO and the MJO. Weather and Forecasting, 29(1), 23–38. https://doi.org/10.1175/WAF-D-13-00102.1
  • Mamalakis, A., Ebert-Uphoff, I., & Barnes, E. A. (2021). Neural network attribution methods for problems in geoscience: A novel synthetic benchmark dataset. arXiv, 2103.10005.
  • Mariotti, A., Baggett, C., Barnes, E. A., Becker, E., Butler, A., Collins, D. C., … Albers, J. (2020). Windows of opportunity for skillful forecasts subseasonal to seasonal and beyond. Bulletin of the American Meteorological Society, 101(5), E608–E625. https://doi.org/10.1175/BAMS-D-18-0326.1
  • Mayer, K. J., & Barnes, E. A. (2020). Subseasonal forecasts of opportunity identified by an interpretable neural network. Earth and Space Science Open Archive. https://doi.org/10.1002/essoar.10505448.1
  • McPhaden, M. J., Zebiak, S. E., & Glantz, M. H. (2006). ENSO as an integrating concept in earth science. Science, 314(5806), 1740–1745. https://doi.org/10.1126/science.1132588
  • Montavon, G., Lapuschkin, S., Binder, A., Samek, W., & Müller, K.-R. (2017). Explaining nonlinear classification decisions with deep Taylor decomposition. Pattern Recognition, 65, 211–222. https://doi.org/10.1016/j.patcog.2016.11.008
  • Rodwell, M. J., Magnusson, L., Bauer, P., Bechtold, P., Bonavita, M., Cardinali, C., … Wedi, N. (2013). Characteristics of occasional poor medium-range weather forecasts for Europe. Bulletin of the American Meteorological Society, 94(9), 1393–1405. https://doi.org/10.1175/BAMS-D-12-00099.1
  • Thulasidasan, S. (2020). Deep learning with abstention: Algorithms for robust training and predictive uncertainty. Ph.D. dissertation, University of Washington. http://hdl.handle.net/1773/45781
  • Thulasidasan, S., Bhattacharya, T., Bilmes, J., Chennupati, G., & Mohd-Yusof, J. (2019). Combating label noise in deep learning using abstention. In K. Chaudhuri & R. Salakhutdinov (Eds.), Proceedings of the 36th International Conference on Machine Learning (Vol. 97, pp. 6234–6243). PMLR. http://proceedings.mlr.press/v97/thulasidasan19a.html
  • Toms, B. A., Barnes, E. A., & Ebert-Uphoff, I. (2020). Physically interpretable neural networks for the geosciences: Applications to earth system variability. Journal of Advances in Modeling Earth Systems. https://doi.org/10.1029/2019MS002002
  • Van Schaeybroeck, B., & Vannitsem, S. (2016). A probabilistic approach to forecast the uncertainty with ensemble spread. Monthly Weather Review, 144(1), 451–468. https://doi.org/10.1175/MWR-D-14-00312.1
  • Visioli, A. (2006). Practical PID Control. Springer London.
  • Yeh, S.-W., Cai, W., Min, S.-K., McPhaden, M. J., Dommenget, D., Dewitte, B., … Kug, J.-S. (2018). ENSO atmospheric teleconnections and their response to greenhouse gas forcing. Reviews of Geophysics, 56(1), 185–206. https://doi.org/10.1002/2017rg000568