1 Introduction
The massive amount of data obtained from field observations and computer simulations has resulted in heightening an interest in the application of machine learning (ML) models and artificial intelligence to interpret observational data, detect patterns, and derive operational decisions related to management of water resources. Although scientists have been exploiting ML tools for image recognition
[1], development of autonomous vehicles [2], inferring physical and chemical phenomena [3], and many more, their application for hydrological predictions is still limited.Selecting an appropriate ML model is a challenging problem and the model architecture needs to be designed with an optimal set of hyperparameters. For example, in deep learning models, we have to adjust, among others, the number of hidden layers and nodes per layer, the type of activation functions, and the learning rate. Once these hyperparameters have been set, a very largescale global optimization problem has to be solved in which we fit the learning model to the data by optimizing the network’s weights. Stochastic optimization methods such as stochastic gradient descent
[4, 5] or the ADAM optimizer [6] have been successfully used for this task. In practice, the hyperparameters are often tuned “by hand”, i.e., a set of hyperparameters is chosen, perhaps at random or based on previous experience, and then the ML model’s weights are optimized. If the model does not perform as desired (e.g., in terms of accuracy, or time needed for making predictions), another hyperparameter set is used, and so on. One of the drawbacks of this approach is that training the model can be computationally expensive and, depending on the dataset, may require from a few minutes to several hours of compute time. Thus, if the process is not automated, the scientist is likely to spend a lot of their valuable research time on thinking about which parameters to choose and waiting for the results to come back. Moreover, the final hyperparameters chosen in this fashion are usually not optimal.In this article, we focus on the problem of ML model selection and hyperparameter optimization (HPO) for a pressing application problem of predicting daily groundwater levels in California (CA), United States, for groundwater wells that are affected by streamflow and climatic changes. Although there are approaches other than ML, such as traditional time series forecasting like ARIMA and regression methods, they are not suitable for making daily predictions due to the nonstationary nature of the groundwater levels.
Usually, ML models are used when we have massive amounts of data, requiring high performance computing (HPC) to train and use the models. Local groundwater management agencies, who are the target groups of our methodology, however, usually do not have access to HPC facilities, and the amount of available data to train ML models is relatively low (few years of time series data). Our goal is therefore to identify and optimize the hyperparameters of deep learning models that are lightweight enough to be trained and run on a laptop or simple desktop computer.
To date, only a few systematic approaches have been developed to automate HPO in learning models. This includes the widelyused random and grid search approaches [7]. However, the grid search does not scale well with the number of hyperparameters in the model. Ilievski et al. [8]
use a deterministic surrogate model algorithm based on radial basis functions to tune the hyperparameters of deep neural networks. The authors show the performance of the method on image data only, and it is unclear if the results can be generalized to time series data. Snoek et al.
[9] developed a Gaussian process (GP) based algorithm for HPO, and the authors investigate how the choice of the GP kernel and the GP’s own hyperparameters can influence the performance of the ML model. Bergstra [10] used Bayesian methods for HPO. Bayesian optimization has also been used by [11]to train support vector machines and convolutional neural networks (CNNs) for large data sets. The authors train their learning model on a subset of the data in order to decrease the amount of required training time. The developers of AutoKeras
[12]also used Bayesian optimization with a new type of kernel for the GP model, and a new acquisition function that enables the optimization of network morphism. The disadvantages of Bayesian optimization and GPs, in general, are the underlying assumptions that are made regarding the distributions of the errors and the covariance structure, which is usually not available a priori. Moreover, GPs do not scale very well with the number of observations because optimizing the GP’s own hyperparameters becomes computationally expensive. Another approach to HPO is the use of genetic algorithms. For example, the Multinode Evolutionary Neural Networks for Deep Learning
[13]tool uses a genetic algorithm and support vector machines for tuning the hyperparameters of CNNs. Evolutionary algorithms are known to require hundreds to thousands of function evaluations to find a solution. In our setting, this is not feasible to do. The authors of HyperNOMAD
[14] extended the NOMAD algorithm [15] to automatically optimize the hyperparameters and the learning process of deep neural networks using the mesh adaptive direct search. The authors show the performance of HyperNOMAD on image data. Finally, surrogate models have been used for the automated HPO of deep learning models by [16]. Their strategy is based on exploiting high performance computers and asynchronous computations to evaluate a large number of model architectures, which is in our setting not applicable.In this article, we investigate which ML models are best suited for the task of learning and predicting timeseries data and which HPO methods lead to the best model performance fastest. Here, we take into account the accuracy of the models in terms of their ability to repredict observed groundwater measurements as well as the time it takes to find the best hyperparameters for each model. We analyze the performance of a long shortterm memory recurrent neural network (LSTM), a multilayer perceptron (MLP), a simple recurrent neural network (RNN), and a convolutional neural net (CNN).
In order to find the best hyperparameters, we pose the problem as a bilevel blackbox optimization problem since the objective function (model performance) is not given in closed form. In fact, the objective function evaluation requires solving an often computationally expensive optimization subproblem (the model fitting problem). Our goal is to find the best hyperparameters within as few expensive evaluations as possible. To this end, we use two types of adaptive, derivativefree surrogate model algorithms, namely one based on radial basis functions (RBFs) and one based on GPs. We compare the surrogate model approaches for HPO to the widely used method of randomly sampling the hyperparameter space. We demonstrate the application of the surrogate model approaches for a groundwater prediction application. Here, we consider two cases: one, in which we use a single groundwater observation well, and the second, in which we use multiple groundwater observation wells for model training and predictions.
The remainder of this paper is organized as follows. In Section 2, we introduce the mathematical problem description of HPO. We describe the details of the surrogate model based optimization algorithms in Section 3. In Section 4, we briefly review the learning models we use in our numerical experiments. In Section 5, we provide the details of the groundwater prediction problem, and we describe the setup and the results of our numerical experiments. Finally, Section 6 concludes the paper and outlines future research directions. Supplemental materials contain additional data and results of numerical experiments.
2 Mathematical Problem Description of Hyperparameter Optimization
The problem of finding the best parameters of a learning model can be stated as a bilevel optimization problem where we optimize the hyperparameters (model architecture), , at the upper level and at the lower level we fit the model to the data. We divide the dataset into training data and testing data
. At the upper level, we optimize a loss function
over the hyperparameters for :(1) 
Here, the weights are the optimized weights obtained from solving the lower level optimization problem over given the current network architecture :
(2) 
The domain in which the hyperparameters live is defined as:
(3) 
where denotes a discrete set of values that can be assumed by parameter . The domain over which the weights are optimized is continuous. We use the notation in (2) to indicate that we optimize over the parameters given a set of parameters and the training data set .
Our assumption that all hyperparameters can only assume a limited number of values is motivated by practical considerations. Some parameters, such as the number of network layers and the number of nodes per layer can only be integers. Other parameters, such as the dropout rate can range between 0 and 1. However, in practice, only a finite number of options is considered for all hyperparameters due to limited resources and an assumption that very small changes in these hyperparameters are not likely to affect the performance of the learning model significantly. Therefore, we also make the assumption here that we only have a limited number of choices for all hyperparameters. If we allowed some variables to live in the continuous space, the number of possible solutions increases and it would likely lead to a higher number of required optimization iterations to find the optimal hyperparameters. This, in turn, will increase the total time to solution, which would be an undesirable effect for our target application.
The loss functions and in equations (1) and (2
) may be the same, for example, a root mean squared error between the data (actual value) and the fitted model (estimated values), but they may also be different depending on the goal of the analysis. For example, the upper level function could also include a regularization term to encourage model simplicity or low training and prediction times. In order to obtain the objective function value
for a certain architecture at the upper level, we have to solve the fitting problem (2) at the lower level. The lower level problem is usually a difficult largescale global optimization problem and stochastic optimizers are commonly used to solve it. Due to the stochasticity of the optimization method, solving problem (2) twice for the same hyperparameters but with different random number seeds will in general lead to two different solutions. Therefore, the upper level problem can be considered as a stochastic optimization problem. In order to obtain a good estimate of for a given , the lower level problem should be solved several times, and thus instead of minimizing at the upper level, we minimize its expected value whereis a random variable that encodes the randomness due to the lower level optimizer. Depending on the application, it might also be appropriate to minimize the worst case loss. The final optimal solution
as well as the best hyperparameter choice depend on .Solving the lower level optimization problem is in general timeconsuming and, depending on the network architecture (defined by
), may require many minutes to hours (a large number of layers, nodes, and epochs lead to long compute times at the lower level). This compute time requirement restricts how many times the lower level problem can be solved for each hyperparameter set in order to obtain statistically significant point estimates of the expectation value.
Due to the problem structure described above, there is no algebraic description of the upper level available (blackbox) and neither is gradient information. The loss function is generally multimodal with several local and global optima. Therefore, HPO requires an efficient, gradientfree, global optimization method.
3 Surrogate Models for Efficient Hyperparameter Optimization (HPO)
Surrogate modelbased derivativefree optimization algorithms have been developed to efficiently solve computationally expensive blackbox problems. A surrogate model approximates the costly objective function [17]: , where denotes the difference between the true objective and the surrogate approximation. Here we omit the dependence of on for ease of notation and because we are only interested in approximating the relationship between and the corresponding value of , in which the optimization over the ’s is implied.
Most surrogate model optimization algorithms have been developed for problems with continuous parameters, which are constrained only by lower and upper bounds, see, e.g., [18, 19, 20, 21]. However, there are some algorithms for problems with other characteristics such as computationally expensivetocompute constraints [22, 23, 24], integer constraints on all parameters [25], mixedinteger parameters [26, 27, 28, 29], multiple conflicting objectives [30, 31], hidden constraints [32, 33], and multiple levels of objective function fidelity [34, 35].
Many difficult optimization problems have been tackled successfully using surrogate model methods, e.g., in the structure optimization of nanomaterials [36], in cloud simulations [37], in climate modeling [38]; in watershed water quality management [22], and in aerodynamic design [39].
In this article, we will use a surrogate model approach for optimizing the hyperparameters of deep learning models. We assume that all parameters are restricted to only a finite number of possible values as discussed in Section 2. We implement and compare two different approaches: (a) a radial basis function (RBF) as surrogate and a weighted score to iteratively select sample points (similar to SOI [25]); (b) a Gaussian process (GP) model and an expected improvement [21] acquisition function, which we optimized over an integer lattice, using a genetic algorithm to iteratively generate a new sample point (a new set of hyperparameters ).
Global surrogate model algorithms generally follow the same structure: (1) Generate an initial experimental design and evaluate the expensive objective function at the selected points; (2) Use all inputoutput data pairs to compute the parameters of the surrogate model; (3) Optimize a computationally cheap auxiliary function on the surrogate model to determine the next point in the parameter space where the expensive objective will be evaluated. We iterate between steps (2) and (3) until a stopping criterion has been met. Natural stopping criteria include a maximum number of allowed function evaluations, a maximum CPU time, or a maximum number of failed iterative improvement trials. In step (2), different types of surrogate models can be used, such as GP models [40], RBFs [41, 42], or polynomial regression models [43]. Different methods have been developed for iteratively selecting new sample points. For example, [21] maximized an expected improvement criterion, [19] used a stochastic sampling approach, and [18] minimized a “bumpiness” measure. In this article, we compare the performance of the expected improvement criterion and the stochastic sampling when all parameters are assumed to be integer values. The pseudocode of the hyperparameter optimization method is shown in Algorithm 1, and the individual steps are described in more detail in the following subsections.
3.1 Step 1: Preparing the Hyperparameters for Optimization
When preparing the hyperparameters in Step 1, we have to take into account that they may live on very different ranges. For example, the dropout rate can range between 0 and 1. If we only consider the values as the finite set of possible values, we map these values to . On the other hand, the batch size may vary between 50 and 550 in steps of 50 ({50, 100, 150, …, 550}). In this case, we map these values to , and so on, thus decreasing the difference in parameter ranges^{1}^{1}1Note that we usually use a similar approach in continuous optimization where we scale the parameters to the unit hypercube, which improves the surrogate models and eliminates difficulties when sampling by perturbation.. We denote the mapped set of possible integervalued vectors by . To evaluate the loss function, we map the parameters back to their original values, e.g., by multiplying with 0.1 and 50, respectively, for the two examples above.
3.2 Step 2: Initial Experimental Design
In Step 2, we generate randomly selected initial points from at which we will evaluate the upper level objective function. These points satisfy the integer constraints by definition. For each point in the initial design, we solve the lower level problem times and compute the upper level objective function as the average of evaluations. These evaluations can be done in parallel to reduce the required evaluation time. In general, can be increased., depending on the expected function evaluation time.
3.3 Steps 49: The Adaptive Sampling Loop
During the adaptive sampling loop, we compute the parameters of the surrogate model using all previously evaluated points (which is different from trust region type methods that may use a subset of the evaluated points). Generally, any surrogate model can be used in the algorithm. We consider RBFs and GPs in this article. In the following, we describe the details of these surrogate models and the associated adaptive sampling methods.
Radial Basis Functions and Stochastic Sampling
Our first approach (“RBF approach”) to the HPO of learning models is based on RBF models with a cubic kernel. These models have previously been shown to perform well for various optimization problems including problems with integer constraints, see e.g., [36, 22, 44]. RBFs have the general form
(4) 
where , are already evaluated points, is the RBF function with a cubic kernel (other kernels are possible, see e.g., [18]), denotes the Euclidean norm, and is a linear polynomial tail. The parameters , and are determined by solving a linear system of equations:
(5) 
where the elements of the matrix are , , is a matrix with all entries 0 of appropriate dimension, and
(6) 
The matrix in (5) is invertible if and only if rank [41], where denotes the problem dimension. Here, is the optimal weight vector (lower level solution) corresponding to the th evaluated parameter vector .
In order to identify the next sample point, we find the point that has the lowest objective function value. Then, we generate a large number of candidate points by perturbing each value of the best point found so far by either adding or subtracting a value randomly chosen from . Larger perturbation values can be chosen, but in our target application, the parameter ranges are relatively small, and, thus, a maximum perturbation of 2 is reasonable to prevent creating too many candidate points lying outside the upper and lower bounds. However, if the perturbed point falls outside of , we reflect it over the corresponding boundary to the inside of . The points created by perturbation enable a local search around . In addition, we generate a second set of candidate points by randomly selecting another points from . These points represent our global search.
In order to select the next evaluation point from these two sets of candidate points, we follow the ideas of [19] in which two scores are computed for each point, namely the distance to the already evaluated points (distance score) and the objective function value as predicted by the RBF surrogate (RBF score). We compute a weighted sum of both scores and the candidate point that achieves the best weighted sum will become the next evaluation point (for details, see the online supplement Section A). By placing a large weight on the distance score, we give more weight (i.e., importance) to preferentially sample points, which are far away from already sampled points (global search). If the weight for the RBF score is large, we preferentially sample points with low predicted objective function values. This represents a local search, because points with low predicted objective function values are usually in the vicinity of the best point found so far. Note that candidates that have already been evaluated previously are discarded. Thus, if we did not have a stopping criterion and sampling continued, the globally optimal solution would eventually be found, following a simple counting argument.
Gaussian Process Models and Expected Improvement
Our second approach (GP approach) to the HPO of learning models is based on GP models and expected improvement sampling [21] over an integer lattice. According to a literature review, GP (or kriging) models have been widely used as surrogate models to approximate expensivetocompute objective functions. By definition, GP models not only provide a prediction of the function values at unsampled points, but also an estimate of the associated uncertainty. In kriging, we treat the function we want to approximate like the realization of a stochastic process, and write the surrogate as
(7) 
Here, represents the mean of the stochastic process and is distributed as . Thus, the term represents a deviation from the underlying global model (the mean). In kriging, we assume a correlation between the errors, which is based on the distance to the other points. At an unsampled point , the kriging prediction is treated like a realization of a random variable
that is normally distributed with mean
and variance
. The correlation between two random variables and is defined as(8) 
where determines how fast the correlation decreases in the th dimension, and reflects the smoothness of the function in the th dimension (we use for all ). We denote by the matrix whose th element is given by (8). The kriging parameters , and are estimated by maximum likelihood. Then, the prediction at a new point is defined as
(9) 
where is a vector of ones of appropriate dimension and as in (6) and
(10) 
The corresponding mean squared error is
(11) 
with
(12) 
and
(13) 
From the properties of the GP model, we can now define an expected improvement function which we use to iteratively select a new sample point . We denote by a random variable that represents our uncertainty about the function value. Its mean and variance at a point are defined by the kriging predictor, namely and . We denote by the best function value we have found so far during the upper level optimization. If the value is less than , we have an improvement . The expected value of this improvement is computed as
(14) 
where
(15) 
and and are the normal cumulative distribution and density functions, respectively, and is the squared root of the mean squared error.
In order to select the next evaluation point , we have to maximize the expected improvement. Since the maximum argument will not necessarily fall onto a point that satisfies the integer constraints, we have to maximize the function over an integer lattice. Therefore, we use a genetic algorithm (see [45] for introducing genetic algorithms), which returns an integerfeasible point at which we do the next expensive function evaluation.
3.4 Algorithm Parameters
Our algorithm has several parameters that can be adjusted and may impact the results of our simulations. In our numerical experiments, we use points as the number of initial experimental design points. Generally, we would like to use a large number, but the more initial points we use, the fewer points can be acquired iteratively (as the bottleneck is usually the number of allowed evaluations). For each hyperparameter vector, we solve the lower level problem times to obtain an estimate of the expected value of the outer objective function. In each iteration, we generate candidate points, with . If the number of hyperparameters and the number of values each hyperparameter can assume are small, we can alternatively generate candidate points by completely enumerating all possible solutions for which we would then compute the weighted scores or the expected improvement values, and choose the best candidate, but in our HPO setting, this is not applicable due to the huge number of possible solutions (see also Table 2 in Section 5.3).
The weights for the selection criteria in the RBF approach cycle through the pattern . In the first iteration, the weight is 0; in the second iteration, the weight is 0.25, and so on, and after the weight 1 was used, it will be set back to 0 in the next iteration. This cycling enables the sampling to transition from searching locally to globally. As a stopping criterion, we use a maximum number of allowed function evaluations, i.e., a maximum number of network architectures. We set this number to 50.
4 Review of Learning Models
In this section, we provide a brief review of the different deep learning models that we are using in our numerical study and their associated hyperparameters. These hyperparameters determine the model architecture, and thus how well the model performs in terms of minimizing the loss.
The number of layers, the number of nodes in each layer, the type of activation functions, and the type of optimizer are examples of hyperparameters. Each node in a network layer may have a different activation function. The Rectified Linear Unit (
ReLU) activation function, , is commonly used. The learning model corresponding to a given architecture () is trained to find the weight vector () that minimizes the loss. The optimizers used for this task share the idea of backpropagation
[46]which is an iterative method based on gradient descent and the chain rule. Therefore, the number of iterations (=the number of epochs) and the batch size (=the number of samples in the gradient update) are additional hyperparameters that must be determined. Among many existing optimizers, ADAM
[6], Adamax, and RMSprop
[47] are widely used. Each of these optimizer has their own hyperparameters such as the learning rate and the decay rate. Finally, a dropout rate () can be set to help prevent overfitting at each layer except for the output layer. The dropout rate means that randomly selected (number of nodes in the layer) nodes are ignored during training.4.1 Multilayer Perceptron Model (MLP)
The MLP [48]
is a feedforward neural network that consists of an input layer, at least one hidden layer, and an output layer. It can model nonlinear patterns by introducing multiple layers and nonlinear activation functions that transform input signals. It has been primarily used for classification and function estimation problems
[49]. MLPs have been used for, e.g., forecasting drought [50], estimating evapotranspiration [51], and predicting water flow [52]. Unlike other statistical models, the MLP does not make any assumptions regarding the prior probability density distribution of the training data
[53].4.2 Convolutional Neural Networks (CNN)
The CNN [54] is also a feedforward neural network that applies a sliding window filter on the input dataset, and has three main characteristics, namely local connectivity, shared weight, and spatial or temporal subsampling. The local connectivity means that the weighted linear combination is restricted to within the filter (not fully connected). All local fields share the same weights. The weighted values are transformed with the activation function
, and the function returns a feature map. The subsampling is an optional step to extract other features in the feature map by reducing its size. An example method of subsampling is to pool the data to replace the input by a summary statistic such as the maximum (maxpooling) or mean (avgpooling) value. The filters can also be used to extract features. The filter can be applied to both the input and the feature map to make deeper neural networks. Finally, the extracted features are fully connected to the output layer. Note that additional hidden layers can be constructed between the final feature layer and the output layer. CNN specific hyperparameters that must be optimized in addition to the ones mentioned at the beginning of Section
4 include the number of filters, the filter size, whether or not to apply subsampling, and the method and the size of the subsampling.4.3 Simple Recurrent Neural Networks (RNN)
Recurrent Neural Networks (RNNs) can represent dynamic temporal behavior by using a directed graph structure between nodes to store sequence representations. The decision an RNN sees at time affects the decision at time . Unlike feedforward neural networks, which take the entire sequence input as is, the RNN first takes in the output generated from the previous sequence steps (memory) before taking the input step. In the simplest RNN, the memory is carried forward through the hidden state, which is a function of the product of the input at time and a weight matrix that is added to a product of the hidden state at time and another hidden weight matrix. The output is a function of the product of the last hidden state and a weight matrix. The activation function in the RNN is a hyperbolic tangent function that allows an increase or decrease in the value of the state. The weight matrices are adjusted by error backpropagation. A drawback of RNNs is that often, despite choosing the optimal set of hyperparameters, errors flowing “backward in time” in a fully connected RNN may either vanish or blow up, which leads to oscillating weights or long training times. RNNs have been used in applications like speech recognition [58]
[59], and time series forecasting in hydrology [60, 61], where system memory and context are critical to prediction.4.4 Long ShortTerm Memory Recurrent Neural Networks (LSTM)
Long shortterm memory networks [62] are an improvement of RNNs and are able to learn longterm dependencies. LSTMs have a structure that allows them to represent temporal behavior and remember previous behavior through a special arrangement of modules. Besides a hidden state, LSTMs also have a cell state that runs through the entire chain and is responsible for storing the long term dependencies. The information flow between the cell state, the hidden state, and the input state is controlled through different types of gates. A simple LSTM unit contains an input gate, an output gate, and a forget gate. The forget gate [63]
enables the network to overcome the potentially unbounded growth of cell states when a continuous time series is used. The forget gate is a sigmoid activation function that decides how much information from the previous cell state should be passed to the next. The output from the forget gate is a function of the input, the previous hidden states with their corresponding weight matrices, and a bias vector. It enables resetting the memory blocks once their contents are out of date.
The input gate decides how to update the cell state. The hyperbolic tangent activation function generates new candidate solutions using the weighted input and hidden state and a bias vector. The new cell state is generated from the weighted sum of the previous cell state and the input state. The output gate generates the new hidden state from the new cell state and the input carrying the most recent information.
5 Numerical Experiments
We designed numerical experiments to answer two main questions: (1) Which learning model is best suited for making accurate predictions of groundwater levels? (2) Which HPO method leads to the best results fastest? In order to answer these questions, we use our HPO methods to train and test the learning models on historical observations and we compare the performance of the four learning models to each other. We use the optimized model architectures to predict the future groundwater levels and we compare these predictions to “future” data, i.e., groundwater data that was not used for optimizing the hyperparameters or training the models.
In the following, we describe our groundwater application problem, the data used in the numerical experiments, and the setup and results of the experiments. Additional results and data are provided in the online supplement.
5.1 Description of the Groundwater Prediction Application
In California (CA), United States, groundwater has always been an important resource for agriculture, drinking, and other beneficial purposes. An unprecedented multiyear drought endured in the recent past (20122016) led to regulations on longterm water conservation and management of groundwater. Our groundwater application is motivated by the 2014 Sustainable Groundwater Management Act (SGMA) [68] in CA, which requires local agencies to manage groundwater to reduce undesirable impacts such as groundwater overdraft leading to unsustainable lowering of groundwater levels.
However, this situation is not unique to CA, and drought severity is expected to increase over many regions with changes in climate [69] increasing the agricultural dependencies on groundwater [70]. There is therefore an increased urgency for time series forecasting of groundwater depths.
Groundwater depths are typically estimated using mechanistic multiscale, multiphysics simulation models such as MODFLOW, HydroGeoSphere, OpenGeoSys, and TOUGH [71, 72, 73]. However, these models require extensive characterization of hydrostratigraphic properties and accurate quantification of boundary conditions, including recharge sources, climate variability as well as changes in pumping [74]. Such properties, boundary conditions, and parameters are not always known apriori, and some of these parameters may even need to be estimated by solving an inverse problem. Inverse estimation of parameters requires running simulation models repeatedly until an optimal solution is reached, thereby increasing the computational costs substantially [75]. Moreover, these models are not desirable when multiple future climate scenarios and the corresponding groundwater predictions are needed to make actionable decisions for sustainable groundwater management. Thus, it is appealing to use purely datadriven and machine learning methods that can predict the groundwater level changes for a broad spectrum of managementrelated decision making (e.g., SGMA). Datainformed models as the ones discussed in this article are computationally inexpensive. They can be continually updated as new observations are obtained, and therefore will enable decision makers to respond to changing groundwater levels in a timely manner. Moreover, in some cases datainformed learning models have shown better performance than processinformed simulations, possibly due to deficiencies in model structure from incomplete process representations [76].
In the groundwater modeling context, artificial neural networks (ANNs) have previously been used for both modeling and prediction purposes (see e.g., [77]). Most of these applications employ feedforward networks to predict groundwater level change, changes in water quality, or to optimize pumping well design. More recently, RNNs have been used to predict groundwater levels (e.g., [61, 78]
). While LSTMs are more commonly applied in the field of speech recognition, natural language processing and computer vision, their application is still limited in the field of hydrologic sciences. For example, LSTMs have recently been used to predict changes of river discharge from precipitation across the Continental United States using publicly available datasets
[79]. [78] use an LSTM to predict groundwater levels in the Hetao Irrigation District in China. The authors did not use an automated optimization procedure for tuning the hyperparameters and manually tuned the parameters by trial and error. The authors trained their model on 12 years of monthly data (water diversion, evapotranspiration, precipitation, temperature, groundwater depth, time) and show that their LSTM model is able to make accurate predictions for two years into the future.In this article, we focus in particular on the Butte County watershed in CA. This area belongs to CA’s Central Valley, which is one of the largest agricultural lands in the world and produces over 250 crops a year [80]. Butte County is considered a highpriority watershed by SGMA and it is therefore vital to develop tools for groundwater predictions that will enable actionable and timely decision support for the planning and management of groundwater resources in this region in the future.
5.2 Observation Data in Butte County, CA, USA
Our study focuses on the Butte County watershed in CA. At this location, we have highfrequency (daily) observations of groundwater levels available for the duration of approximately eight years (20102018). We train our learning models on historic daily observations of temperature, precipitation, streamflow, groundwater levels, as well as features such as the week and month of the year that the measurement was taken.
We obtained the groundwater level measurements for particular well sites in Butte County from the California Natural Resources Agency under their SGMA data viewer portal for 20102018. The wells are located within a radial distance from each other. We retrieved the daily averaged streamflow discharge at a nearby river site (Butte Creek) from the California Data Exchange Center (CDEC) from the station code BCD: Butte Creek Durham. The daily averaged precipitation and temperature were also obtained from the CDEC website at the Chico weather station. These stations are in close proximity to the groundwater wells and are located within the same Butte Creek basin.
Since there were gaps in the observations for temperature, precipitation, and groundwater levels (e.g., which may happen due to faulty sensors or bad data quality), we imputed the missing values using the R package
imputeTS “Time Series Missing Value Imputation” [81]. This package provides stateoftheart imputation algorithms along with plotting functions for univariate time series missing data statistics. We use the na.seadec function with the Kalman Smoothing and State Space Models algorithm. This algorithm removes the seasonal component from the time series, performs imputation on the deseasonalized time series, and then adds the seasonal component again. To create an uninterrupted time series (due to some faulty data entries at the end of the hydrological year, i.e., end of September), we used the na.locf algorithm (”Imputation by Last Observation Carried Forward”) of the imputeTS package.In the following study, we consider two scenarios, namely a“single well case” and “multiwell case”. In the single well case, we train the model on the groundwater levels measured only at a single well and we predict only for this one well. In the multiwell case, we use the groundwater level data from three different wells to train the learning models, and we make predictions for all three wells. Note that we do not train a separate model for each well, but rather we have one model for all three wells. Table 1 shows a summary of the data sources we use in our study. The time series for all variables are shown in the online supplement in Section B. The time series for the groundwater levels at all wells reflect the significant decline in the amount of available groundwater between 2010 and 2016.
5.3 Setup of the Numerical Experiments
We conducted our numerical experiments with python () on Ubuntu 16.04 with Intel^{®} Xeon(R) CPU E31245 v6 @ 3.70GHz 8, and 31.2 GiB memory. We use the Keras package [82]
with the TensorFlow
[83] backend for our deep learning architectures. For the genetic algorithm in the GP approach we use the DEAP python package [84]. In the genetic algorithm we use 100 generations with 100 individuals each and we use a crossover probability of 0.75. For the GP itself, we use the implementation in Sklearn [85].Preparing the Data
We divide our observation data into two sets – one for training the models () and one for testing the models (), i.e., comparing the model’s groundwater predictions to the true observations. In our setting, there is a dependency of the number of daily observations used for training and testing the models on a hyperparameter that we are optimizing. This dependency is described in more detail in the following. The model output is the groundwater level for the next day and the inputs are the values of all the variables at the current day and all values at some previous days. More specifically, we define the number of previous days whose data are used as input as a lag number, . The lag number means that the variable values of the previous days are used. This lag thus determines the size of and . To illustrate it with an example, suppose that there are a total 5 days of observations and for each day we have measurements of 6 variables. We denote , , as the day of the measurement. If the lag , each sample consists of two days of observations (the current and the previous day) and therefore has 26 =12 variable values. The total number of samples we obtain is thus four: ({d(1), d(2)}, {d(2),d(3)}, {d(3),d(4)}, {d(4),d(5)}). If the lag , we use the measurements for each variable at the two previous days in addition to the data for the current day, and thus each sample consists of three days and has 36=18 variable values. The total number of samples we have is then three: ({d(1),d(2),d(3)}, {d(2),d(3),d(4)}, {d(3),d(4),d(5)}). Thus, as is, the lag would determine the total number of samples for our model input. However, we divide the dataset such that the number of training samples is the same regardless of the lag. In other words, the number of samples in is always the same and the dimension of each sample (= the number of days the number of variables in each day) varies depending on the lag. The samples that were not used for defining constitute the test set . Therefore, the lag determines how much historical information will be used in each sample. By setting the lag as a hyperparameter, the optimization aims to not only find the best network architecture, but also the optimal amount of historical information in each sample that leads to minimizing the outer objective function.
In order to make future groundwater predictions, we use the given values (here from the dataset ) of temperature, precipitation, and streamflow discharge and predict the corresponding groundwater levels with the trained models. Because the input groundwater data in are assumed to be unknown, we use the predicted groundwater level value from the model. Therefore, the input groundwater data are dynamically obtained as we make predictions for each additional day.
Before using the variable values for training and testing our learning models, we normalize all variables’ values to [0,1]. The reason is that different observed variables affect the groundwater level in specific ways and their values are generally on different ranges. Scaling them to [0,1] enables us to better capture the variables’ interactions within the learning models. For the groundwater data, this is, however, not straightfroward. The recent drought in CA has caused the groundwater levels to drop to unprecedented low levels. Thus, for scaling purposes, it is difficult to assign a maximum and a minimum groundwater level as future groundwater levels may go below or above recorded values. We therefore assigned a minimum groundwater level of 0, which here refers to the mean sea level and it is a fair assumption that the water level can never drop below 0 even though this value has never been observed (unless it is located near the sea coast). For the upper bound, we use the maximum observed groundwater level. The reason is that over the past, we have observed a steadily decreasing trend in the amount of groundwater. Although wet years may cause the groundwater levels to go above the maximum recorded values, it is a good choice as it reflects the reality observed so far. Another possible upper limit could be the ground surface elevation (GSE), as groundwater cannot go above the ground. However, the GSE is also a dynamic parameter, as several places in CA suffer from land subsidence due to groundwater withdrawal. For the other variables (temperature, precipitation, streamflow discharge) the scaling is trivial as we have full information of the minimum and maximum values of the whole time series (past observations as well as future predictions). Note that this scaling does not mean that the predicted groundwater levels will be below the chosen maximum, and the future groundwater level predictions can go beyond their historical maximum levels depending on the input values of the other variables.
Replications of Inner and Outer Optimizations and Hyperparameters
The training of the models in the inner optimization and the search for the optimal hyperparameter in the outer optimization both have stochastic elements. For performing the inner largescale optimization for a given hyperparameter set, we use stochastic optimizers. Therefore, given a specific hyperparameter set , we solve the inner problem five times and we define the inner optimization’s loss function as the average over these five trials (as an approximation of ). The outer optimization also has stochastic elements (in the initial experimental design and in the adaptive sampling methods). Therefore, in order to obtain an average outer optimization performance, we repeat the HPO for each model with each algorithm five times. We allow each HPO method to try 50 different hyperparameter sets.
The hyperparameters we optimize as well as the possible (unscaled) values for each hyperparameter are summarized in Table 2. We keep the learning and the decay rates at their default values of and , respectively. Note that we restricted the possible number of lags for RNN and LSTM to 60 as both models experience problems with backpropagation when larger lags are used. From a practical perspective, allowing a lag of 365 days seems reasonable as this would allow the model to use the information from a whole year worth of data, and thus recognize seasonality and correlations between the time of the year and the groundwater levels. However, for RNN and also for LSTM, a maximum lag of 365 days was too large, causing problems in training and thus problems in the optimization. We could run the LSTM and RNN five times successfully by restricting the maximum lag to 60 days for the single well case. Although the structure of the networks and their detailed mechanics are different (see Section 4), the number of input/output nodes are formed in the same way. The number of input nodes are the same as the number of variables in a sample. For the earlier example with five days and lag , all models would have 12 input nodes. The number of output nodes are the same as the number of wells whose groundwater data we are learning, i.e., one for the single well case and three for the multiwell case.
For LSTM, we use the ReLu activation function in all the deeper layers except the outer layer. For training the LSTM for a given hyperparameter set, we use the RMSprop optimizer. In the MLP, we also use the ReLu activation function for all nodes except those in the output layer. The nodes of the output layer have the linear activation function because we are dealing with realvalued outputs. We use ADAM for the lower level optimization. During the construction of the CNN, we add a fully connected hidden layer with 50 nodes between the last convolution layer and the output layer. For simplicity, we do not apply the subsampling option here. The activation function and the lower level optimizer for the CNN and the RNN are the same as for the MLP. We leave the dropout rate of the recurrent step fixed at 0. Even with our simplifying assumptions that all layers in the model have the same hyperparameter values for the dropout rate, the number of nodes, the number of filters, the filter size, and the large step sizes (e.g., steps of 50 for the number of epochs), we can see from Table 2 (bottom) that the number of possible hyperparameter combinations ranges from hundred thousands to millions. Trying all of these combinations is computationally infeasible, especially in our setting where optimal hyperparameters have to be determined with limited compute resources.
5.4 Numerical Results
In the following subsections, we compare the results of our numerical experiments obtained with LSTM, MLP, CNN, and RNN for the single well case. For the multiwell case, we compare only LSTM, MLP, and CNN. We did not include the RNN model for the multiwell case as it suffers from the problems with backpropagation mentioned earlier. It therefore often fails to fit a model to a given architecture. Also, as we will see for the single well results, the RNN performed worst among all learning models. Therefore, we do not include the RNN in the multiwell analysis.
We compare the performance of our two optimization approaches (the RBF approach and the GP approach) with a random sampling (RS) approach for HPO. We illustrate our results in the form of progress plots, time series plots, wall clock time plots, and plots of the distributions of the optimal hyperparameters found in each of the five trials. In addition, we illustrate the models’ predictive performance for data in the future that was not used during optimizing the hyperparameters or during training the models. This “outofsample prediction” allows us to gain insights into the models’ extrapolation performance. We report additional results including tables with the optimal hyperparameter values for each model and each optimization strategy as well as the corresponding losses (and the resulting groundwater prediction errors) in the online supplement in Section C. Section C of the online supplement also contains additional analysis plots.
5.5 Results for Predicting Groundwater at a Single Well in Butte County, CA
Comparison of Mean Squared Errors
In Figure 1 we show the distribution of the optimal outer objective function values over five trials for all models and all optimization methods after the optimization has finished. The final mean squared errors are fairly low for all learning models and all optimization approaches. Figure 1 shows that there is a noticeable difference between the performance of the types of learning models, with MLP generally leading to the smallest (best) final mean squared errors. Also the variability of the results of all HPO methods is lowest for MLP, which indicates that the MLP performance is less sensitive to the choice of the HPO method. On the other hand, the results of RNN are worst for all HPO methods. This may be related to the fact that all HPO methods have difficulties in finding optimal solutions or that the RNN is simply not able to capture the groundwater data better than what we found. Figure 1 also indicates that using the GP method leads except for LSTM to the highest errors for each model, i.e., in some cases even RS performs better than the GP approach. The RBF approach on the other hand leads except for the LSTM to the lowest errors. For LSTM, the RBF approach reaches similarly good values as the GP approach, but the variability in the results is slightly higher.
Analysis of Optimized Hyperparameters
We compare the distributions of the final hyperparameter values over all five trials obtained with the different HPO methods. Figure 1 showed that the MLP optimized with the RBF approach leads overall to the best results. Figure 2
shows the MLP hyperparameters. Here, the colored bars represent the mean and the black lines show the standard deviation over the five solutions for the MLP. We see that generally all HPO methods lead to a more complex MLP network, the RBF method leading to approximately 5 layers with approximately 35 nodes each. For the RBF method, the lag is large, using almost a full year of past data. The dropout rate and the optimal number of epochs for the RBF method are similar to those obtained with the GP. Note that larger dropout rates discourage overfitting of the models to the data. We observe the largest differences between the HPO methods for the batch size, with GP favoring large batches and RBF using significantly smaller batches.
The distributions of the optimal hyperparameters for LSTM, CNN, and RNN are shown in the online supplement in Section C.1.1, in Figures 3, 4, and 5, respectively. For the CNN, we observe a similar behavior as for the MLP: the optimal RBF solutions have a large number of layers, filters, and lags. For LSTM, all HPO methods prefer smaller, less complex networks, and for the RNN, the HPO methods return networks of different complexities with generally larger standard deviations on the optimal hyperparameter values.
Analysis of HPO Convergence
Besides the final solutions obtained with the different HPO methods, we are also interested in how fast the methods converge. For this purpose, we use progress plots that illustrate how quickly the average and the standard deviation of the outer objective function values over five trials decrease as the number of function evaluations increases. Ideally, the graphs should drop off fast, which means that improved network architectures were found within very few hyperparameter evaluations. The plot of the standard deviation reflects how robust each method is. Small standard deviation envelopes indicate that all five trials with the HPO methods converged to an approximately equally good solution. We show these convergence plots in Figures 36.
For the LSTM (Figures 3) we can see that the GP approach leads to faster convergence and more robust solutions (narrow standard deviation bands) than the RBF and the RS approaches (for LSTM, the GP approach leads to the best results, see also Figure 1). The standard deviation associated with the random sampling remains the largest throughout. This can be expected as the random sampling does not have a systematic search approach that allows it to intensify the search locally around good solutions.
For MLP, on the other hand, the RBF performed best and the convergence plot (Figure 4) reflects that the RBF approach also converged faster than the GP and the RS. For MLP, both GP and RS have wider standard deviation bands than RBF throughout indicating less robustness. The standard deviations obtained with RS are smaller for MLP than for LSTM. Since the RS is lacking any systematic search for improvements, this may be an indicator that there are several similarly good solutions for MLP.
For CNN (Figure 5), we can see significant differences in the convergence behavior between the three HPO methods. The RBF approach leads to the best solution fastest and has smaller standard deviations than GP and RS. We notice that the search with the GP gets stuck after about 12 function evaluations and it gets stuck in different solutions as is reflected by the width of the standard deviation envelopes.
Finally, for RNN (Figures 6), we observe the overall worst performance. All HPO models make only slow progress towards improved solutions and the five trials with the RBF method show the largest variability, but also the average best performance.
Analysis of Compute Times
As training larger, more complex networks requires more compute time, we also plot the cumulative function evaluation time for each network type and each HPO method in order to compare the wall clock time to solution. Here, we only compare the amount of time spent on function evaluations and we assume that the HPO methods’ own computational overheads are negligible. We show the average and the standard deviations of the compute times over all five trials in Figures 710. Comparing the figures, we see that optimizing the hyperparameters of the MLP (Figure 8) is fastest for all HPO methods (within 5 hours) even though the optimal MLP architectures were fairly complex.
For the LSTM (Figure 7), we can see that the compute times of all three HPO methods are approximately equal. Since the final optimized networks are smaller compared to MLP for all HPO methods and the number of epochs is approximately the same (compare Figure 3 in Section C.1.1 in the online supplement), we conclude that on average all HPO methods tried networks of similar complexity. The compute times for the CNN (Figure 9) are significantly larger than for all other network types (RBF needs on average over 30 hours for the CNN compared to less than 8 hours for all other network types). For RNN (Figure 10), we observe large variations between the times for the individual optimization trials. RBF may take from 3 to 11 hours and RS takes between 7 hours and 25 hours. Thus, if the goal is to train learning models on a laptop or simple desktop computer to make predictions for future groundwater levels in a timely manner, one may prefer using the MLP or the LSTM in terms of expected compute time.
Analysis of Performance on Testing Data
We illustrate in Figures 11 and 12 how well the optimized MLP and LSTM models capture the testing time series data (for CNN and RNN results, see the online supplement, Section C.1.2, Figures 6 and 7, respectively). As described earlier, we train the models on a subset of the historical observations (approximately beginning 2010end 2013) and we test their performance on another subset (approximately beginning 2014end 2015) to compute the outer objective function values. In Figures 11 and 12, we show the training data and for each optimization trial, we show the predictions of the optimized models for the testing data. The closer the predictions (red graphs) are to the true data (blue), the better. Figure 11 shows that the MLP models make more robust predictions than the LSTM models (Figure 12). All predictions for MLP lie closer to the true data than for the LSTM for all HPO methods, and RS leads to the largest variability in testing performance for both models. This agrees with our observations from Figure 1 which showed that, overall, MLP yielded the smallest mean squared errors.
OutofSample Future Groundwater Predictions
Finally, we are interested in how well the optimized learning models will predict groundwater data that was not used during the HPO. We retrain the models with the optimized hyperparameters on the whole dataset (20102016) and we predict the groundwater levels from February 20, 2016 to February 28, 2018. We illustrate these “outofsample” predictions for each of the five trials for MLP and LSTM in Figures 13 and 14, respectively. We see that both models have trouble accurately predicting future data, LSTM more so than MLP. Both models are able to keep the trend in the data (decreasing groundwater levels in the summer, increasing levels in the winter), but the MLP predictions appear more robust and they do not have the large fluctuations that can be observed for the LSTM models. In particular, the large fluctuation predicted by the LSTM indicate large and unphysical drawdowns of the groundwater in the winter months. Also, while MLP tends to overestimate the amount of available groundwater (by up to 7), the LSTM generally underpredicts the amount of available water. The outofsample predictions for CNN and RNN are provided in the online supplement in Section C.1.3 in Figures 8 and 9, respectively.
5.6 Results for Predicting Groundwater at Multiple Wells in Butte County, CA
Watersheds are large and generally have more than a single well from which groundwater is pumped or with the help of which the amount of available groundwater is monitored. Using multiple wells will allow us to better assess the average health of a watershed, and thus to avoid letting local fluctuations drive policies and decisions. Therefore, within Butte County, we selected three additional wells. Ideally, we do not want to train and optimize a separate learning model for each well, but rather we want one learning model that can be trained on and used for predicting the data at multiple wells simultaneously. We train and optimize the hyperparameters as before with the only difference that we now have three groundwater input data streams and three nodes in each output layer. We use these data to optimize the hyperparameters of MLP, LSTM, and CNN. For reasons of space considerations, we present detailed results only for the MLP. The LSTM and CNN results are in the online supplement in Section C.2.
Comparison of Mean Squared Errors
We show a comparison of the distributions of the final mean squared errors for all three learning models and all HPO methods obtained from five optimization trials in Figure 15. As expected, the optimal mean squared errors are larger for the multiwell case than for the single well case as more time series must be fit and one set of hyperparameters may not be optimal for the data of all three wells. Overall, the variability in the solutions is larger for each model and each HPO method than for the single well case. However, as for the single well case, the MLP optimized with the RBF approach leads to the lowest errors. RS performs reasonably well for the MLP and the CNN, but for the LSTM, we observe significantly larger errors than with the RBF or the GP approach. Thus, in terms of performance consistency, one should opt for an HPO method that systematically explores the search space rather than the RS method.
Analysis of Optimized Hyperparameters
We show the distribution of the optimized hyperparameters only for the MLP (Figure 16). The results for LSTM and CNN are in the online supplement in Section C.2.1 in Figures 10 and 11, respectively. For the MLP, the RBF method chooses again a large complex network (5 layers and on average over 30 nodes per layer) and a large lag. In contrast to the single well case, the optimal batch size obtained with the RBF is large. In comparison to the optimal number of epochs and dropout rate in the single well case, the optimal choices in the multiwell case are smaller. All HPO methods lead to approximately the same values for the number of nodes, layers, and the batch size, and the largest difference can be observed for the dropout rate and the number of epochs.
Analysis of HPO Convergence
As for the single well case, we illustrate how fast each HPO method converges. We show the progress plots on the example of MLP in Figure 17 (results for LSTM and CNN are in the online supplement in Section C.2.2 in Figures 12 an 13, respectively). Similarly to the single well progress plots, we see that the RBF method converges fastest. The standard deviation envelopes around the average convergence are for all three HPO methods approximately equal.
Analysis of Compute Times
The cumulative computation time for the MLP over all function evaluations is shown in Figure 18. All HPO methods require approximately the same time and the HPO is completed within 4 hours. The CNN again requires the largest amount of time to solution (up to 30 hours, see Figure 17 in Section C.2.4 in the online supplement) and the HPO for the LSTM completes within 10 hours (see Figure 16 in Section C.2.4 in the online supplement). Thus, we conclude that the MLP optimized with the RBF should be the learning model and HPO method of choice as it attains the best solutions within the lowest computation time.
Analysis of Performance on Testing Data
In Figure 19, we show the predictions obtained with the optimized MLP models for all three wells for the testing data. All three HPO methods are able to capture the seasonality in the data and all three HPO methods appear to predict lowerthantrue groundwater levels for well A and they predict somewhat higherthantrue levels for well B. For well C, the predictions are fairly accurate for all three HPO methods. The CNN predictions (Figure 14 in Section C.2.3 in the online supplement) show a similar pattern. LSTM (Figure 15 in Section C.2.3 in the online supplement) optimized with RBF and GP tends to overpredict the groundwater levels for all three wells. For the multiwell case, we do not have additional observations for all three wells for making “outofsample” predictions.
6 Summary and Directions of Future Research
Our work presented in this article is motivated by the need for computationally lightweight approximation models that can run on a laptop or simple desktop computer and that can reasonably accurately predict groundwater levels with the overarching goal to enable timely and reliable water resources management decisions.
In this article, we studied the applicability of several purely datainformed learning models whose hyperparameters were optimized with three different methods for predicting the groundwater level. We used four deep learning models, namely convolutional neural networks (CNNs), recurrent neural networks (RNNs), multilayer perceptron (MLP), and long shortterm memory networks (LSTMs). We formulated a bilevel optimization problem for finding the optimal hyperparameters of the learning models. At the upper level, we have a pureinteger stochastic problem, and at the lower level, we have a largescale global optimization problem. We solved the hyperparameter optimization (HPO) problem with three different methods. We used two surrogate model based optimization methods and a random sampling method. One surrogate model approach is based on radial basis functions (RBFs) and an adaptive sampling method that uses perturbations of the best point found so far to determine the next sample point. These perturbations are implemented such that the new sample point automatically satisfies the integer constraints. The second surrogate model approach uses a Gaussian process (GP) model, and iterative sample points are chosen by maximizing the expected improvement over an integer lattice.
In our numerical experiments, we trained the learning models on daily observation data of groundwater levels and meteorological variables at, respectively, one (“single well case”) and three (“multiwell case”) groundwater monitoring wells in Butte County, California, USA. Our study shows that with the right HPO method, different types of deep learning models, including CNN, LSTM, and MLP (and to a limited extend also RNN), can be tuned to make accurate forecasts of daily groundwater levels. The main difference between the performance of the methods is the total optimization time and the associated optimal model complexity. We showed that the surrogate model based HPO methods lead to better performing network architectures than randomly sampling the hyperparameter space. Our results show that the “simplest” learning model, the MLP, optimized with the RBF method, generally performs best and its optimization time is lowest. The RNN is the worst performing model and the most cumbersome to use as it often fails to fit a model for an architecture due to its problems during backpropagation, and therefore it often does not return an objective function value. For the CNN, we found that its performance as well as the variability of the performance depend on the HPO method used, and the RBF approach leads to the most robust results. CNN requires considerably more computation time than other models. The LSTM is not able to achieve as good performance as the MLP. Our numerical experiments show that the GP approach is most successful for finding the optimal LSTM hyperparameters for the single and the multiwell cases. Random sampling is less successful for the LSTM and usually has the largest variability in the results.
Finally, we used the optimized learning models to make predictions of groundwater levels, beyond our data for training and testing the models for the single well case study. We found that the inaccuracy of all models is considerably higher than for the testing dataset. The MLP is able to capture the seasonal trend in groundwater levels, but tends to overpredict the true value. The LSTM predictions also follow the same trend, but often yield unphysical jumps of groundwater levels. The CNN captures the seasonality well, and similarly to the MLP, overpredicts the groundwater level. The RNN’s predictions suffer from the same unphysical jumps as the LSTM’s.
Thus, our study showed that if one chooses purely datainformed models for learning and predicting groundwater levels, MLPs optimized with the RBF approach for HPO should be used. The timetooptimizedsolution is lowest and the solution accuracy as well as prediction accuracy are highest. Our study showed that the “simplest” learning model, i.e., the MLP, leads to the best results.
Directions of future research include applying the RBFoptimized MLP model for groundwater predictions at other watersheds in California and other parts of the United States. This will help us assess how well our method generalizes to watersheds with different characteristics (e.g., geology, groundwater use, weather patterns). Furthermore, we will study the sensitivity of the model performance to different input variables. For example, in watersheds that are heavily used for agricultural purposes, we expect that groundwater pumping data could further improve the model predictions. This analysis can shed light on which data should be collected and thus where the necessary data acquisition infrastructure investments should be made. Similarly, we will study how sensitive the model predictions are to the amount of training data points, i.e., how sensitive is the model performance to using daily data versus monthly averages and how does the learning model performance change when fewer years of data are used in training. This future research direction also includes applying our method for learning model optimization on other time series data such as predicting water quality.
Acknowledgements
This work was supported by Laboratory Directed Research and Development (LDRD) funding from Berkeley Lab, provided by the Director, Office of Science, of the U.S. Department of Energy under Contract No. DEAC0205CH11231.
References
 [1] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(436444), 2015.
 [2] M. Kuderer, S. Gulati, and W. Burgard. Learning driving styles for autonomous vehicles from demonstration. In 2015 IEEE International Conference on Robotics and Automation (ICRA), pages 2641–2646, 2015.

[3]
S. Rudy, A. Alla, S.L. Brunton, and J.N. Kutz.
Datadriven identification of parametric partial differential equations.
SIAM Journal on Applied Dynamical Systems, 18(2):643–660, 2019.  [4] L. Bottou. Largescale machine learning with stochastic gradient descent. In 19th International Conference on Computational Statistics, pages 177–186, 2010.
 [5] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.
 [6] D.P. Kingma and J.L. Ba. ADAM: A method for stochastic optimization. In ICLR 2015, 2015.
 [7] J. Bergstra and Y. Bengio. Random search for hyperparameter optimization. The Journal of Machine Learning Research, 13(1):281–305, 2012.
 [8] I. Ilievski, T. Akhtar, J. Feng, and C.A. Shoemaker. Efficient hyperparameter optimization of deep learning algorithms using deterministic RBF surrogates. In Proceedings of the ThirtyFirst AAAI Conference on Artificial Intelligence, 2017.
 [9] J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, 2012.
 [10] J. Bergstra, D. Yamins, and D.D. Cox. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In Proceedings of the 30th International Conference on Machine Learning, 2013.
 [11] A. Klein, S. Falkner, S. Bartels, P. Hennig, and F. Hutter. Fast Bayesian optimization of machine learning hyperparameters on large datasets. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS) 2017, Fort Lauderdale, Florida, USA, volume 54, 2017.
 [12] H. Jin, Q. Song, and X. Hu. AutoKeras: An efficient neural architecture search system. arXiv preprint arXiv:1806.10282 [cs.LG], 2019.
 [13] S.R. Young, D.C. Rose, T.P. Karnowski, S.H. Lim, and R.M. Patton. Optimizing deep learning hyperparameters through an evolutionary algorithm. In MLHPC ’15 Proceedings of the Workshop on Machine Learning in HighPerformance Computing Environments, volume Article No.4, 2015.
 [14] D. Lakhmiri, S. Le Digabel, and C. Tribes. HyperNOMAD: Hyperparameter optimization of deep neural networks using mesh adaptive direct search. arXiv preprint arXiv:1907.01698 [cs.LG], July 2019.
 [15] S. Le Digabel. Algorithm 909: NOMAD: Nonlinear optimization with the MADS algorithm. ACM Transactions on Mathematical Software, 37, 2011.
 [16] P. Balaprakash, M. Salim, T.D. Uram, V. Vishwanath, and S.M. Wild. Deephyper: Asynchronous hyperparameter search for deep neural networks. In 2018 IEEE 25th International Conference on High Performance Computing (HiPC), pages 42–51, 2018.
 [17] A.J. Booker, J.E. Dennis Jr, P.D. Frank, D.B. Serafini, V. Torczon, and M.W. Trosset. A rigorous framework for optimization of expensive functions by surrogates. Structural Multidisciplinary Optimization, 17:1–13, 1999.
 [18] H.M. Gutmann. A radial basis function method for global optimization. Journal of Global Optimization, 19:201–227, 2001.
 [19] R.G. Regis and C.A. Shoemaker. A stochastic radial basis function method for the global optimization of expensive functions. INFORMS Journal on Computing, 19:497–509, 2007.
 [20] K. Holmström. An adaptive radial basis algorithm (ARBF) for expensive blackbox global optimization. Journal of Global Optimization, 41:447–464, 2008.
 [21] D.R. Jones, M. Schonlau, and W.J. Welch. Efficient global optimization of expensive blackbox functions. Journal of Global Optimization, 13:455–492, 1998.
 [22] J. Müller and J. Woodbury. GOSAC: global optimization with surrogate approximation of constraints. Journal of Global Optimization, doi:10.1007/s108980170496y, 2017.
 [23] L. Nuñez, R.G. Regis, and K. Varela. Accelerated random search for constrained global optimization assisted by radial basis function surrogates. Journal of Computational and Applied Mathematics, 340:276–295, 2018.
 [24] R.G. Regis. Stochastic radial basis function algorithms for largescale optimization involving expensive blackbox objective and constraint functions. Computers & Operations Research, 38:837–853, 2011.
 [25] J. Müller, C.A. Shoemaker, and R. Piché. SOI: a surrogate model algorithm for expensive nonlinear integer programming problems including global optimization applications. Journal of Global Optimization, 59:865–889, 2013.
 [26] E. Davis and M. Ierapetritou. Kriging based method for the solution of mixedinteger nonlinear programs containing blackbox functions. Journal of Global Optimization, 43:191–205, 2009.
 [27] K. Holmström. An adaptive radial basis algorithm (ARBF) for expensive blackbox mixedinteger global optimization. Journal of Global Optimization, 9:311–339, 2008.
 [28] J. Müller. MISO: Mixed integer surrogate optimization framework. Optimization and Engineering, 17(1):177–203, 2015.
 [29] J. Müller, C.A. Shoemaker, and R. Piché. SOMI: A surrogate model algorithm for computationally expensive nonlinear mixedinteger blackbox global optimization problems. Computers and Operations Research, 40:1383–1400, 2013.
 [30] R. Datta and R.G. Regis. A surrogateassisted evolution strategy for constrained multiobjective optimization. Expert Systems with Applications, 57:270–284, 2016.
 [31] J. Müller. SOCEMO: Surrogate Optimization of Computationally Expensive Multiobjective Problems. INFORMS Journal on Computing, 29(4):581–596, 2017.
 [32] H.K.H. Lee, R.B. Gramacy, C. Linkletter, and G.A. Gray. Optimization subject to hidden constraints via statistical emulation. Pacific Journal of Optimization, 7:467–478, 2011.
 [33] J. Müller and M. Day. Surrogate optimization of computationally expensive blackbox problems with hidden constraints. INFORMS Journal on Computing, Articles in Advance 05 Jul 2019(https://doi.org/10.1287/ijoc.2018.0864), 2019.
 [34] J. Müller. An algorithmic framework for the optimization of computationally expensive bifidelity blackbox problems”. INFOR: Information Systems and Operational Research, DOI: 10.1080/03155986.2019.1607810, 2019.
 [35] A.I.J. Forrester, A. Sóbester, and A.J. Keane. Multifidelity optimization via surrogate modelling. Proceedings of the Royal Society, 463:3251–3269, 2007.
 [36] O. Karslıoğlu, M. Gehlmann, J. Müller, S. Nemšàk, J. Sethian, A. Kaduwela, H. Bluhm, and C. Fadley. An efficient algorithm for automatic structure optimization in Xray standingwave experiments. Journal of Electron Spectroscopy and Related Phenomena, 230:10–20, 2019.
 [37] W. Langhans, J. Müller, and W. Collins. Optimization of the Eddy‐Diffusivity/Mass‐Flux shallow cumulus and boundary‐layer parameterization using surrogate models. Journal of Advances in Modeling Earth Systems, 11:402–416, 2019.
 [38] J. Müller, R. Paudel, C.A. Shoemaker, J. Woodbury, Y. Wang, and N. Mahowald. CH4 parameter estimation in CLM4.5bgc using surrogate global optimization. Geoscientific Model Development Discussions, 8:141–207, 2015.
 [39] D. Toal and A. Keane. Efficient multipoint aerodynamic design optimization via cokriging. Journal of Aircraft, 48(5):1685–1695, 2011.
 [40] G. Matheron. Principles of geostatistics. Economic Geology, 58:1246–1266, 1963.
 [41] M.J.D. Powell. Advances in Numerical Analysis, vol. 2: wavelets, subdivision algorithms and radial basis functions. Oxford University Press, Oxford, pp. 105210, chapter The Theory of Radial Basis Function Approximation in 1990. Oxford University Press, London, 1992.
 [42] M.J.D. Powell. Recent Research at Cambridge on Radial Basis Functions. New Developments in Approximation Theory, pp. 215232. Birkhäuser, Basel, 1999.
 [43] R.H. Myers and D.C. Montgomery. Response Surface Methodology, Process and Product Optimization using Designed Experiments. WileyInterscience Publication, 1995.
 [44] S.M. Wild and C.A. Shoemaker. Global convergence of radial basis function trustregion algorithms for derivativefree optimization. SIAM Review, 55:349–371, 2013.
 [45] M. Mitchell. An Introduction to Genetic Algorithms. Cambridge, MA: MIT Press, 1996.
 [46] D.E. Rumelhart, G.E. Hinton, R.J. Williams, et al. Learning representations by backpropagating errors. Cognitive modeling, 5(3):1, 1988.
 [47] G. Hinton, N. Srivastava, and K. Swersky. Neural networks for machine learning. lecture 6a, overview of minibatch gradient descent. lecture notes, 2012.

[48]
C.M. Bishop et al.
Neural networks for pattern recognition
. Oxford University Press, 1995.  [49] S. Trenn. Multilayer perceptrons: Approximation order and necessary number of hidden units. IEEE transactions on neural networks, 19(5):836–844, 2008.
 [50] Z. Ali, I. Hussain, M. Faisal, H.M. Nazir, T. Hussain, M.Y. Shad, A. Mohamd Shoukry, and S. Hussain Gani. Forecasting drought using multilayer perceptron artificial neural network model. Advances in Meteorology, ID 5681308:9 pages, 2017.
 [51] H. Tabari and P.H. Talaee. Multilayer perceptron for reference evapotranspiration estimation in a semiarid region. Neural Computing and Applications, 23(2):341–348, 2013.
 [52] P. Araujo, G. Astray, J.A. FerrerioLage, J.C. Mejuto, J.A. RodriguezSuarez, and B. Soto. Multilayer perceptron neural network for flow prediction. Journal of Environmental Monitoring, 13(1):35–41, 2011.
 [53] M.W. Gardner and S.R. Dorling. Artificial neural networks (the multilayer perceptron) – a review of applications in the atmospheric sciences. Atmospheric Environment, 32(1415):2627–2636, 1998.
 [54] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al. Gradientbased learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 [55] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
 [56] Z. Cui, W. Chen, and Y. Chen. Multiscale convolutional neural networks for time series classification. arXiv preprint arXiv:1603.06995, 2016.
 [57] A. Borovykh, S. Bohte, and C.W. Oosterlee. Conditional time series forecasting with convolutional neural networks. arXiv preprint arXiv:1703.04691, 2017.
 [58] T. Mikolov, M. Karafiát, L. Burget, J. Černockỳ, and S. Khudanpur. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association, 2010.
 [59] I. Sutskever, J. Martens, and G.E. Hinton. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML11), pages 1017–1024, 2011.
 [60] Y.M. Chiang, L.C. Chang, and F.J. Chang. Comparison of staticfeedforward and dynamicfeedback neural networks for rainfall–runoff modeling. Journal of Hydrology, 290(34):297–311, 2004.
 [61] P. Coulibaly, F. Anctil, R. Aravena, and B. Bobée. Artificial neural network modeling of water table depth fluctuations. Water Resources Research, 37(4):885–896, 2001.
 [62] S. Hochreiter and J. Schmidhuber. Long shortterm memory. Neural Computing, 9:1735–1780, 1997.
 [63] F.A. Gers, J. Schmidhuber, and F. Cummins. Learning to forget: continual prediction with LSTM. Neural Computing, 12:2451–2471, 2000.
 [64] M. Sundermeyer, R. Schluter, and H. Ney. LSTM neural networks for language modeling. In Proceedings of the 12th Annual Conference of the International Speech Communication Association, Portland, Oregon, USA, pages 601–608, 2012.
 [65] A. Graves, A. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural networks. In Proceedins of the 2013 International Conference on Acoustics, Speech, and Signal Processing, 2013.
 [66] X. Ma, Z. Tao, Y. Wang, H. Yu, and Y. Wang. Long shortterm memory neural network for traffic speed prediction using remote microwave sensor data. Transportation Research Part C: Emerging Technologies, 54:187–197, 2015.
 [67] D. Hsu. Multiperiod time series modeling with sparsity via Bayesian variational inference. arXiv preprint arXiv:1707.00666v3, 2018.
 [68] California Department of Water Resources. SGMA groundwater management.
 [69] B.I. Cook, J.S. Mankin, and K.J. Anchukaitis. Climate change and drought: From past to future. Current Climate Change Reports, 4(2):164–179, 2018.
 [70] M Taylor. Liquid Assets: Improving management of the state’s groundwater resources. Technical report, Legislative Analyst’s Office, 2010.
 [71] C.D. Langevin, J.D. Hughes, E.R. Banta, R.G. Niswonger, S. Panday, and A.M. Provost. Documentation for the MODFLOW 6 groundwater flow model. Technical report, US Geological Survey, 2017.
 [72] C.I. Steefel, C.A.J. Appelo, B. Arora, D. Jacques, T. Kalbacher, O. Kolditz, V. Lagneau, P.C. Lichtner, K.U. Mayer, J.C.L. Meeussen, et al. Reactive transport codes for subsurface environmental simulation. Computational Geosciences, 19(3):445–478, 2015.
 [73] T. Xu, N. Spycher, E. Sonnenthal, G. Zhang, L. Zheng, and K. Pruess. TOUGHREACT version 2.0: A simulator for subsurface reactive transport under nonisothermal multiphase flow conditions. Computers & Geosciences, 37(6):763–774, 2011.
 [74] S. Sahoo, T.A. Russo, J. Elliott, and I. Foster. Machine learning algorithms for modeling groundwater level changes in agricultural regions of the US. Water Resources Research, 53(5):3878–3895, 2017.
 [75] B. Arora, B.P. Mohanty, and J.T. McGuire. Inverse estimation of parameters for multidomain flow models in soil columns with different macropore densities. Water Resources Research, 47(4):2010WR009451, 2011.
 [76] F. Karandish and J. Šimunek. A comparison of numerical and machinelearning modeling of soil water content with limited input data. Journal of Hydrology, 543:892–909, 2016.
 [77] I.N. Daliakopoulos, P. Coulibaly, and I.K. Tsanis. Groundwater level forecasting using artificial neural networks. Journal of Hydrology, 309(14):229–240, 2005.
 [78] J Zhang, Y. Zhu, X. Zhang, M. Ye, and J. Yang. Developing a Long ShortTerm Memory (LSTM) based model for predicting water table depth in agricultural areas. Journal of Hydrology, 561(918929), 2018.
 [79] F. Kratzert, D. Klotz, C. Brenner, K. Schulz, and M. Herrnegger. Rainfall–runoff modelling using long shortterm memory (LSTM) networks. Hydrology and Earth System Sciences, 22(11):6005–6022, 2018.
 [80] C.C. Faunt. Groundwater availability of the Central Valley aquifer, California. Professional paper 1766, 225 p., U.S. Geological Survey, 2009.
 [81] S. Moritz and T. BartzBeielstein. imputeTS: Time series missing value imputation in R. 2017.
 [82] F. Chollet. keras. GitHub repository, https://github.com/fchollet/keras, 2015.
 [83] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G.S. Corrado, A. Davis, J. Dean, M. Devin, et al. Tensorflow: Largescale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
 [84] F. FélixAntoine, F.M. De Rainville, M.A. Gardner, C. Gagné, and M. Parizeau. DEAP: Evolutionary algorithms made easy. Journal of Machine Learning Research, 13:2171–2175, 2012.
 [85] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikitlearn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
Comments
There are no comments yet.