Introduction
As environmental scientists, we have ever-increasing volumes of observational data at our disposal. Movements such as the Internet of Things (IoT), exemplified in the context of weather data by the Met Office’s Weather Observation Website (kirk2021weather), have enabled near real-time collection and sharing of environmental data by low-cost sensing equipment around the world. The implications of this are numerous: where once the challenge was to collect sufficient data for specific modelling problems (often hindered by expense), the challenge now is often to maximise the utility of the high volumes of data that we already have. The rise of ‘data science’ can be viewed, to some extent, as a response to this shift in challenges.
In the case of weather modelling and forecasting, it is likely that harnessing the ever-growing network of IoT-type environmental sensor data, in addition to the observations provided by traditional official weather stations, will facilitate the development of higher-precision, finer-scale models which can serve more specific predictions to stakeholders (bell2015good; chapman2017can). Linked to this, a benefit of IoT-type sensor data is that these observations have the potential to be more representative of the weather experienced by the device owners themselves (e.g. due to private weather stations being located at homes), rather than representative of remote rural locations (as tends to be the case for official weather stations). The adoption of data from these unofficial private weather stations and IoT-type environmental sensors could therefore enable models to provide more specific, personalised weather information at hyperlocal scales.
However, while crowdsourcing weather data can greatly increase the number of observations being made, and the number of unique locations which are observed, it also opens the door to data quality issues owing to the low-cost, low-maintenance nature of unofficial weather stations compared to official weather stations. The traditional way of addressing data quality issues is to have some form of manually-guided, rules-based quality control procedure to subjectively approve or deny the inclusion of each sensor’s observations into downstream models. While a manually-guided procedure may seem the best approach in terms of having complete hands-on control at an individual observation level, such an approach will tend to suffer from scalability issues as the number of sensors increases, and is difficult to achieve consistently through space and time.
As the number of sensors enters or exceeds the thousands, it becomes necessary to automate aspects of the quality control procedure in order to keep up with the scale of the task. Common approaches include statistical time-series analysis or rule-based outlier detection algorithms to help identify sensors that are producing data of questionable quality, which can then be excluded from input into subsequent models. Here we propose a unified approach whereby detection of outliers is achieved as part of a downstream statistical data model itself: in this case a Bayesian deep neural network based spatiotemporal interpolator of crowdsourced temperature observations collected by the Met Office’s Weather Observation Website, with a mixture model or mixture density network architecture to enable automatic identification and correction of outliers as part of the modelling process.
In this paper we proceed by briefly providing some background on IoT sensor data and its potential benefits for environmental modelling applications, as well as an overview of existing methods for outlier detection. We then introduce our deep mixture model approach for spatiotemporal interpolation with simultaneous probabilistic outlier detection, using an example dataset composed of surface temperature observations collected by the Met Office’s Weather Observation Website. By adopting a mixture model approach, we incorporate our knowledge about data issues into the data model through our choice of probability distributions, which provide our likelihood function. This, in combination with our Bayesian approach, allows us to quantify both aleatoric and epistemic uncertainties (uncertainty about the data and uncertainty about the fit of the model respectively) in order to provide a well-calibrated posterior predictive distribution. Bayesian deep learning frameworks (here we use TensorFlow Probability) allow us to combine the above benefits of Bayesian statistical modelling with the flexibility and scalability of deep neural networks.
We assess the performance of our model on held-out test data, finding our approach to be successful in filtering outliers in order to provide ‘clean’ spatiotemporal interpolation that is free from outlier-induced anomalies. In addition, because our probabilistic approach (including epistemic uncertainty via Monte Carlo dropout as a Bayesian approximation) provides a well-calibrated predictive distribution rather than single point predictions, it provides useful information for downstream applications and decision making.
Background
IoT sensor data in environmental modelling
Within the field of environmental modelling, the concept of ‘models of everywhere’ (beven2007towards) has been proposed. This is a concept which stems specifically from hydrology but is applicable across environmental sciences. The concept aims to “change the nature of the modelling process, from one in which general model structures are used in particular catchment applications to one in which modelling becomes a learning process about places” (beven2012modelling). This idea is driven by the need to constrain uncertainty in the modelling process in order to support policy setting and decision making (blair2019models). The concept is a reaction to the shortfalls of the use of ‘generic models’, in which spatially-discretised (gridded) predictions are likely to fail to provide well-calibrated probabilistic predictions for the specific locations or areas (not grid cells) which are of interest to stakeholder decision making (beven2015hyperresolution). However, the issue is not simply one of scale, and increasing the resolution of imperfect models does not solve the problem of what beven2015hyperresolution term ‘hyperresolution ignorance’, in that uncertainty about parameters will still exist, and a model outputting at finer scales will not necessarily be providing more information. This is a topical issue for weather forecasting too, as numerical weather prediction models continue to increase in resolution.
Technological challenges such as limited computational power have slowed the adoption of the ‘models of everywhere’ concept, but blair2019data propose that data science, including cloud computing infrastructure, may provide the means to make the ‘models of everywhere’ concept a reality by using data mining techniques to combine information from remote sensing and in-situ earth monitoring systems in data-driven models. This includes live-updating IoT sensors (atzori2010internet; nundloll2019design) such as the unofficial weather stations which provide data to the Met Office’s Weather Observation Website (kirk2021weather). These provide a greater number of observations from more numerous unique sites than traditional monitoring systems (e.g. traditional weather stations), and the combination of increasingly dense observations (by IoT sensors) and machine learning may allow data-driven models to supersede alternative modelling approaches, such as ‘generic’ physics-based models. However, IoT sensors have greater potential for data quality issues than traditional monitoring systems owing to their lower cost, less stringent maintenance, and greater numbers. Therefore, in order to maximise the benefits of IoT sensor data for environmental modelling, issues of data quality have to be addressed as part of the solution.
We propose that the approach we present here, which uses Bayesian deep learning to combine information from remote sensing and in-situ earth monitoring in order to provide specific and well-calibrated predictions for any point within the extent of observed space and time, does satisfy the ideals behind the ‘models of everywhere’ concept. As such, it can be viewed as an example of the kind of large-scale data-driven environmental modelling that is likely to become more feasible as computing power continues to increase, putting ‘models of everywhere’ at our fingertips.
Outlier detection
There are many possible approaches to outlier detection, ranging from fully-manual data checking, to manually designed rule-based filters, to statistical and machine learning based systems, which may include both supervised and unsupervised learning (with supervised learning having the downside that it requires the creation of manually labelled training datasets in advance; e.g. nesa2018outlier). For a full review of outlier detection techniques we refer the reader to wang2019progress, who provide a general review of developments in outlier detection since the year 2000. In addition, ayadi2017outlier provide a review of techniques specifically for wireless sensor networks, including a comparison of the respective pros and cons of statistical, nearest-neighbour, artificial intelligence, clustering and classification based approaches (although these categories have some overlap).
napoly2018development propose a combination of rule-based and z-score thresholds for outlier detection in crowdsourced air temperature data. This approach has been adopted by other authors (e.g. venter2020hyperlocal; zumwald2021mapping), but it is not the approach we take.
The approach we adopt for this study is a regression approach using a deep neural network mixture model, or mixture density network (bishop1994mixture), through which we represent the conditional distribution of reported temperature values as a mixture of a Gaussian and a Uniform distribution, with parameters learned by our deep neural network. We explain the full details of the approach in subsequent sections, but in brief terms, our approach incorporates outlier detection into the spatiotemporal modelling process itself, by having the neural network learn the probability that an observation is an outlier (whose values are best explained as having been generated by the Uniform distribution) as an unsupervised subtask to the overall supervised spatiotemporal modelling task. The benefit of this holistic approach is that it allows the user to incorporate knowledge about data issues into the data model itself through the use of suitable probability distributions, and makes for more seamless model checking when compared to a two-stage procedure of separate outlier detection followed by data modelling.
Method
Dataset
We demonstrate our approach using surface air temperature data from the Met Office’s Weather Observation Website archives (kirk2021weather). These data contain observations from 1893 unique IoT-type weather stations (Figure 1), from which we have taken a continuous 14-day window from 2020/10/26 to 2020/11/09 to use as our dataset in this study. The data provide our target variable, surface air temperature in degrees Celsius, as well as spatiotemporal location information in the form of British National Grid (BNG) Easting and Northing, and a timestamp. Collectively these Weather Observation Website weather stations record 8000 observations per hour on average, which equates to about four observations per site per hour, although this varies by site. Each sensor records observations at different intervals, rather than synchronously at set times, so that collectively the observations provide good coverage across continuous time (Figure 2).
In addition to using the Weather Observation Website data, we also make use of gridded UK elevation data as covariate or auxiliary information in order to help inform the spatiotemporal interpolation. The data used come from NASA’s Shuttle Radar Topography Mission (SRTM; farr2007shuttle) and are accessed via the raster package in the R programming language. The elevation data are rasterised with a grid size of 528 by 927 metres (longer latitudinally than longitudinally), resulting in a 0.66 million grid cell elevation dataset covering the UK and Ireland.
For input into our model, we extract terrain elevation images centred on each observation (in the case of training) or location to be predicted. The images extracted have a resolution of 32x32 grid cells with a grid cell size of 500m (we use bilinear interpolation so that the image resolution is not locked to the overall digital elevation model resolution). These images provide auxiliary information, from which the convolutional layers of our deep neural network learn to extract useful contextual covariates (e.g. as explained in kirkwood2020deep) for the task of spatiotemporal interpolation of surface air temperature data. Illustrative examples could include slopes facing the sun that warm faster, or valleys that channel cool air from cold mountainous areas. There are likely to be many such complex interactions between the landscape and surface air temperatures, and by providing elevation data as images to our deep neural network we allow them to be learned from data. Further details of the preparation of our dataset for model training, evaluation, and testing are provided in the section ‘Practical setup’.
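The patch extraction described above can be sketched in plain Python as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: the function names `bilinear` and `extract_patch` are hypothetical, the grid is assumed to be a list of rows with coordinates measured in metres from the grid origin, and edge handling by clamping is our own choice.

```python
import math


def bilinear(grid, x, y):
    """Bilinearly interpolate `grid` (a list of rows) at fractional column x, row y."""
    n_rows, n_cols = len(grid), len(grid[0])
    # Clamp to the grid extent so that patches near the edge remain defined (assumption).
    x = min(max(x, 0.0), n_cols - 1.0)
    y = min(max(y, 0.0), n_rows - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, n_cols - 1), min(y0 + 1, n_rows - 1)
    fx, fy = x - x0, y - y0
    top = grid[y0][x0] * (1 - fx) + grid[y0][x1] * fx
    bottom = grid[y1][x0] * (1 - fx) + grid[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def extract_patch(grid, cell_w, cell_h, centre_x, centre_y, size=32, step=500.0):
    """Sample a size x size patch every `step` metres around a centre point.

    cell_w, cell_h: native grid cell size in metres (e.g. 528 and 927).
    centre_x, centre_y: metres from the grid origin (hypothetical convention).
    """
    half = (size - 1) / 2.0
    patch = []
    for i in range(size):
        row = []
        for j in range(size):
            px = centre_x + (j - half) * step
            py = centre_y + (i - half) * step
            # Convert metres to fractional grid indices before interpolating.
            row.append(bilinear(grid, px / cell_w, py / cell_h))
        patch.append(row)
    return patch
```

Because the patch is resampled by interpolation, its 500 m cell size is decoupled from the native 528 by 927 metre cells of the elevation grid.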
Mixture model concept
We design our model to address three considerations: 1) The capacity to represent our target phenomenon (a spatiotemporally varying temperature distribution in this case), under the assumption that outliers can be objectively identified and excluded. 2) The capacity to successfully identify outliers. 3) A means by which to achieve both 1 and 2 simultaneously within a single probabilistic data model.
At the heart of our model is a two-part mixture probability distribution whose individual component distributions $p_1$ and $p_2$ represent the two classes of observation that we judge to exist within our dataset, as evidenced by exploratory visualisation of the data (Figure 2). These are 1) the ‘true’ signal distribution of our target phenomenon, which we assume here is a Gaussian distribution as is common for temperature measurements, and 2) the outlier distribution, for which we choose a Uniform distribution ‘catch-all’ that can account for the generation of spurious observations by biased or faulty weather stations. It is worth noting that the selection of these distributions is a modelling choice, and that different target variables are likely to warrant the use of different distributions in the model output, from which the likelihood is derived (the probability of the data given the model).
We then introduce the parameter $\pi$: the probability that an individual data point comes from the “true” Gaussian distribution of temperature. Equivalently, $1-\pi$ is the probability that a data point is spurious and therefore comes from the Uniform distribution. More formally, let $y(s,t)$ denote the temperature at location $s$ and time point $t$. The probability distribution of $y(s,t)$ is defined as:
$$p\big(y(s,t)\big) = \pi(s,t)\, p_1\big(y(s,t)\big) + \big(1-\pi(s,t)\big)\, p_2\big(y(s,t)\big) \tag{1}$$
$$p_1\big(y(s,t)\big) = \mathcal{N}\big(y(s,t);\ \mu(s,t),\ \sigma^2(s,t)\big) \tag{2}$$
$$p_2\big(y(s,t)\big) = \mathrm{Unif}\big(y(s,t);\ \mu(s,t)-50,\ \mu(s,t)+50\big) \tag{3}$$
The “true” temperature distribution is therefore assumed Normal with mean $\mu(s,t)$ and variance $\sigma^2(s,t)$, while the spurious observations are centred at the “true” mean but are allowed to vary uniformly around this mean. This range of 100 °C was chosen from exploratory data analysis and was deemed sufficient to capture the outliers in the data.
A perhaps more intuitive way of interpreting this model is to introduce a latent binary variable, $z(s,t)$, where $\Pr\big(z(s,t)=1\big) = \pi(s,t)$ and $\Pr\big(z(s,t)=0\big) = 1-\pi(s,t)$. The probability model for temperature conditional on $z(s,t)$ is then:
$$p\big(y(s,t) \mid z(s,t)\big) = z(s,t)\, p_1\big(y(s,t)\big) + \big(1-z(s,t)\big)\, p_2\big(y(s,t)\big) \tag{4}$$
$$p_1\big(y(s,t)\big) = \mathcal{N}\big(y(s,t);\ \mu(s,t),\ \sigma^2(s,t)\big) \tag{5}$$
$$p_2\big(y(s,t)\big) = \mathrm{Unif}\big(y(s,t);\ \mu(s,t)-50,\ \mu(s,t)+50\big) \tag{6}$$
We can think of $z(s,t)$ as the result of a ‘coin toss’ where at any given location $s$ and time point $t$, we can get a spurious observation with probability $1-\pi(s,t)$. Note that $\pi(s,t)$ varies with space and time, in order to flexibly capture the flawed data points (as opposed to assuming a constant $\pi$).
Note further that the fact that $\mu(s,t)$ appears in the model for the spurious data points, i.e. the Uniform distribution, allows some information from such data points to be utilised. This is based on a belief that on average, the flawed data are centred on the true mean, such that negative and positive biases cancel each other out (though in practice this may well be optimistic; bell2015good). Note that any flawed data point which is much further from the mean than the $\mathcal{N}\big(\mu(s,t), \sigma^2(s,t)\big)$ distribution implies will be “absorbed” by the Uniform distribution, so that the Normal part of the model can be interpreted as the model for the true temperature process. As such, predictions from the $\mathcal{N}\big(\mu(s,t), \sigma^2(s,t)\big)$ part, after the model is implemented, can be seen as “corrected”.
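To make the mixture concrete, the following minimal sketch evaluates a Gaussian plus Uniform mixture density for a single observation, using only the Python standard library. The names `pi_true` (the probability of the Gaussian component) and `HALF_WIDTH` are our own, and the Uniform component is assumed to span 50 degrees either side of the mean, matching the 100 degree range stated above. The `outlier_responsibility` function is an illustrative addition that applies Bayes' rule to a single observation; it is not claimed to be how the paper's network produces its outlier probabilities.

```python
import math

HALF_WIDTH = 50.0  # assumed: Uniform component spans mu +/- 50, a 100 degree range


def normal_pdf(y, mu, sigma):
    """Density of the 'true' Gaussian component."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def uniform_pdf(y, mu, half_width=HALF_WIDTH):
    """Density of the 'catch-all' Uniform component, centred on the Gaussian mean."""
    return 1.0 / (2.0 * half_width) if abs(y - mu) <= half_width else 0.0


def mixture_pdf(y, mu, sigma, pi_true):
    """Mixture density: pi_true * Gaussian + (1 - pi_true) * Uniform."""
    return pi_true * normal_pdf(y, mu, sigma) + (1.0 - pi_true) * uniform_pdf(y, mu)


def outlier_responsibility(y, mu, sigma, pi_true):
    """Posterior probability (Bayes' rule) that y came from the Uniform component."""
    density = mixture_pdf(y, mu, sigma, pi_true)
    if density == 0.0:
        return float("nan")  # y lies outside the support of both components
    return (1.0 - pi_true) * uniform_pdf(y, mu) / density
```

With a mean of 10 degrees and a standard deviation of 1 degree, an observation of 30 degrees has negligible Gaussian density, so almost all of its probability mass is attributed to the Uniform component, which is the "absorption" behaviour described in the text.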
Network architecture
The parameters of our mixture distribution are $\mu(s,t)$, $\sigma(s,t)$ and $\pi(s,t)$. We therefore require that our model has the capacity to learn to optimise these parameters in relation to space and time so that predictions from (2) are a reasonable representation of the real data generating processes at location $s$ and time $t$ (as we assess through model checking against held-out test data).
To achieve this, our neural network architecture consists of two halves, which we term the signal network and the outlier network. The signal network is tasked with learning the parameters, $\mu(s,t)$ and $\sigma(s,t)$, of our ‘true’ Gaussian distribution, which are conditioned on the space and time variables that we provide as inputs to the model (the details of which we explain in subsequent sections). The outlier network meanwhile is simply tasked with learning $1-\pi(s,t)$, or in other words the probability that an observation is an outlier, which is conditioned on site ID (which we provide one-hot encoded) and time. One-hot encoding means representing our n site IDs as n separate predictor variables, to which we assign the value 1 only if an observation corresponds to that site, otherwise a value of 0 is assigned. The one-hot encoding approach allows us to input categorical variables into the neural network in a sensible way. We provide site IDs (rather than more general spatial variables such as easting and northing) to the outlier network because it has no need to learn generalisable patterns: its sole purpose is to identify outliers probabilistically during the training phase, and this ability is improved by making the task as simple as possible. Overfitting is not a concern since the outlier network serves no purpose in the spatiotemporal interpolation beyond the training stage.
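The one-hot encoding of site IDs described above can be illustrated with a short sketch. The function name `one_hot_sites` and the example site labels are hypothetical; in practice a deep learning framework's own utilities would typically be used instead.

```python
def one_hot_sites(site_ids):
    """Map each site ID to a 0/1 indicator vector with one column per unique site."""
    vocabulary = sorted(set(site_ids))
    index = {site: k for k, site in enumerate(vocabulary)}
    rows = []
    for site in site_ids:
        row = [0] * len(vocabulary)
        row[index[site]] = 1  # exactly one column is switched on per observation
        rows.append(row)
    return vocabulary, rows


# Four observations from three hypothetical sites.
vocabulary, encoded = one_hot_sites(["site_b", "site_a", "site_b", "site_c"])
```

Each row contains a single 1, so a linear layer applied to this input effectively learns one free parameter per site.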
From the perspective of deep neural networks as “black boxes”, we can view our signal and outlier networks simply as function approximators that learn to provide optimal values of their respective output parameters, such that
$$\mu(s,t) = f^{\text{signal}}_{\mu}(s,t) \tag{7}$$
$$\sigma(s,t) = f^{\text{signal}}_{\sigma}(s,t) \tag{8}$$
and
$$1-\pi(s,t) = f^{\text{outlier}}(\text{site}, t) \tag{9}$$
However, we have designed the architecture of the two branches (signal network and outlier network) in line with their specific goals. The architectures of each branch, and the specific space and time variables that they take as inputs, are explained in the following paragraphs, accompanied by Figure 3 as a visual aid.
Our signal network architecture (Figure 3) is designed for terrain-aware interpolation, which it achieves through the combination of a convolutional branch to derive relevant terrain features from gridded auxiliary information (e.g. terrain elevation, satellite imagery), and a fully-connected branch for interpolation in space and time. The combined effect is to achieve spatiotemporal interpolation in a hybrid space that includes local terrain context so that, for example, the differences between valleys and hilltops (and anything relevant about their orientations) can be recognised. Unlike more traditional geostatistical approaches, which might offer the model predefined derivatives from terrain analysis as input features, our deep learning approach allows these derivatives to be learned optimally for the task at hand via trainable convolutional filtering of raw terrain elevation grids (behrens2018multi; padarian2019using; wadoux2019using; kirkwood2020deep; kirkwood2020bayesian).
For its location input (input B in Figure 3) our signal network receives easting, northing, and elevation as spatial location information (all in metres), and continuous time and time of day as temporal location information (in minutes). To provide a cyclic representation of time of day to the network (to aid learning of the diurnal cycle), we transform our minute-of-the-day variable into position on a circle defined by the two dimensions $\sin(2\pi m / M)$ and $\cos(2\pi m / M)$, where $m$ is the specific minute of the day, $M = 1440$ is the total number of minutes in a day, and $\pi$ here denotes the circle constant. It is important that our signal network is able to generalise well to unobserved locations, and so overfitting must be avoided. In aid of this, and in line with the Bayesian interpretation of our model (which we discuss in the next section), we run our signal network with a dropout rate of 0.5 on all hidden layers (or spatial dropout in the case of convolutional layers).
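The cyclic encoding of time of day can be sketched as below. This is a minimal illustration assuming the sine and cosine mapping onto the unit circle; the function name `encode_minute_of_day` is our own.

```python
import math

MINUTES_PER_DAY = 24 * 60  # 1440


def encode_minute_of_day(minute):
    """Map a minute of the day onto the unit circle as a (sin, cos) pair."""
    angle = 2.0 * math.pi * (minute % MINUTES_PER_DAY) / MINUTES_PER_DAY
    return math.sin(angle), math.cos(angle)


midnight = encode_minute_of_day(0)
last_minute = encode_minute_of_day(MINUTES_PER_DAY - 1)
# Distance between 23:59 and 00:00 in the encoded space is small,
# whereas the raw values 1439 and 0 are far apart.
gap = math.dist(midnight, last_minute)
```

The benefit of this representation is that the network sees 23:59 and 00:00 as neighbours, which a raw minute count would not convey.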
In contrast, generalisation is not a concern for our outlier network (Figure 3), whose sole task is to model the probability that each training observation is an outlier. To make this task as simple as possible, we provide the outlier network directly with one-hot encoded site IDs, as well as continuous time, such that the outlier network provides outlier probabilities as a linear function of site ID plus a (site-tailored) nonlinear function of continuous time (eq. 10), facilitated by passing continuous time through a single hidden layer.
(10) 
For full layerbylayer details of our neural network architecture, we encourage readers to view our code for this study at https://github.com/charliekirkwood/deepoutliers.
Bayesian inference
With the parameters and architecture of our model established, we would like to use the Bayesian framework to learn a posterior distribution for all trainable parameters given the data, $\mathbf{y}$, on which we will train the model. The parameters that control the probability distribution of temperature are $\mu(s,t)$, $\sigma(s,t)$ and $\pi(s,t)$, but these are of course themselves functions of the weights within the entire neural network, which we collectively refer to as $\mathbf{w}$. By Bayes’ rule we can obtain this posterior distribution over the weights given the data as
$$p(\mathbf{w} \mid \mathbf{y}) = \frac{p(\mathbf{y} \mid \mathbf{w})\, p(\mathbf{w})}{p(\mathbf{y})} \tag{11}$$
so that $p(\mathbf{w} \mid \mathbf{y})$ is proportional to the likelihood of the data given the weights, $p(\mathbf{y} \mid \mathbf{w})$, multiplied by our prior distribution over the weights, $p(\mathbf{w})$. Assuming independence of temperature values given $\mu(s,t)$, $\sigma(s,t)$ and $\pi(s,t)$, the likelihood of our mixture model is:
$$p(\mathbf{y} \mid \mathbf{w}) = \prod_{s,t} p\big(y(s,t)\big) \tag{12}$$
where $p\big(y(s,t)\big)$ is given in equation (1).
Here, we adopt a prior distribution for $\mathbf{w}$ by utilising Monte Carlo Dropout as suggested by gal2016dropout. The prior is defined by assuming that a particular “fixed” weight $\theta_i$ in the network can be randomly “dropped out”, by introducing a set of Bernoulli random variables $b_i \sim \mathrm{Bernoulli}(1-p)$. An individual weight is then defined as
$$w_i = \theta_i\, b_i \tag{13}$$
so that $w_i = \theta_i$ with probability $1-p$ and $w_i = 0$ with probability $p$. The fixed weights $\theta_i$ are learned by stochastic gradient descent during training, whereas the dropout rate $p$ is considered a hyperparameter of the network and is fixed a priori. Equation (13) means that the weights are probabilistic in nature, so that stochastic forward passes can be used in a Monte Carlo setting to provide an approximate posterior distribution for $\mathbf{w}$.
The particular setup assumes that $p$ is fixed a priori, preferably by tuning it. It is however possible to estimate it automatically using ‘Concrete Dropout’ (gal2017concrete), or to explore one of the number of other approaches to Bayesian inference in neural networks that have been proposed (e.g. mackay1995probable; graves2011practical; neal2012bayesian; heek2019bayesian). At present, Bayesian inference in deep neural networks, with their extreme dimensionality (a modest 696 114 trainable parameters in our case), remains a challenge and an ongoing topic of research.
After obtaining the posterior distribution $p(\mathbf{w} \mid \mathbf{y})$ (the practicalities of which we discuss in the next section), we are in a position to compute the posterior predictive distribution for any point in space and time. To obtain robust predictions of the phenomenon of interest, we can set $\pi(s,t) = 1$ (i.e. exclude the uniform distribution component) and thus generate predictions exclusively from the ‘true’ Gaussian distribution (2). Specifically, we can obtain samples from the posterior predictive distribution of any $y(s,t)$ (both observed and not):
$$p\big(y(s,t) \mid \mathbf{y}\big) = \int \mathcal{N}\big(y(s,t);\ \mu(s,t),\ \sigma^2(s,t)\big)\, p(\mathbf{w} \mid \mathbf{y})\, d\mathbf{w} \tag{14}$$
where $\mu(s,t)$ and $\sigma(s,t)$ are the outputs of the signal network under weights $\mathbf{w}$.
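Sampling from the posterior predictive distribution with Monte Carlo dropout can be sketched with a toy network in plain Python. Everything here is an assumption for illustration: the one-hidden-layer architecture, the `WEIGHTS` values, the function names, and the use of unit-level dropout with inverted scaling (a common implementation convention, not something the text specifies). The study itself uses a much larger network built with TensorFlow Probability.

```python
import math
import random

# Hypothetical "fixed" weights of a tiny network with four hidden units.
WEIGHTS = {
    "w1": [0.5, -1.0, 1.5, 0.8], "b1": [0.0, 0.1, -0.2, 0.3],
    "w_mu": [2.0, -1.0, 0.5, 1.0], "b_mu": 8.0,
    "w_sigma": [0.1, 0.1, -0.1, 0.05], "b_sigma": 0.0,
}


def stochastic_forward(x, weights, drop_rate, rng):
    """One forward pass with dropout left switched on; returns (mu, sigma)."""
    hidden = []
    for w, b in zip(weights["w1"], weights["b1"]):
        h = math.tanh(w * x + b)
        keep = rng.random() >= drop_rate  # Bernoulli mask on each hidden unit
        hidden.append(h / (1.0 - drop_rate) if keep else 0.0)
    mu = sum(w * h for w, h in zip(weights["w_mu"], hidden)) + weights["b_mu"]
    log_sigma = sum(w * h for w, h in zip(weights["w_sigma"], hidden)) + weights["b_sigma"]
    return mu, math.exp(log_sigma)  # exponentiate so that sigma is positive


def posterior_predictive_samples(x, weights, n_samples=1000, drop_rate=0.5, seed=0):
    """Draw samples of the mean and of y using only the Gaussian component."""
    rng = random.Random(seed)
    mus, ys = [], []
    for _ in range(n_samples):
        mu, sigma = stochastic_forward(x, weights, drop_rate, rng)
        mus.append(mu)                   # spread of these reflects epistemic uncertainty
        ys.append(rng.gauss(mu, sigma))  # adds aleatoric noise on top
    return mus, ys
```

Each stochastic forward pass corresponds to one draw of the weights, and drawing a Gaussian sample from each pass gives a Monte Carlo approximation to the integral over the weights.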
Practical setup
We use Weather Observation Website surface air temperature observations from a fourteen-day period from 26/10/2020 to 09/11/2020 for this study. This period was selected for containing interesting weather patterns (as evident even in the simple time series of observations; Figure 2), including storm Aiden, which passed over the UK on the 31st of October 2020. We randomly subsampled the observations from this period to a single observation per site per hour (where available), which provides 417141 observations in total. We then split this dataset by site ID into 10 folds containing approximately equal numbers of unique sites (about 145 unique sites per fold). We split our folds in this site-wise manner in order to assess the fit of the model at sites unseen during training, and therefore to assess the ability of the model to interpolate to new spatial locations throughout the period of observed time.
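A site-wise split of this kind can be sketched as follows. The function names, the random seed, and the zero-based fold numbering (folds 0 to 7 for training, 8 for evaluation, 9 for testing) are our own assumptions; the grouping of observations by site is the point being illustrated.

```python
import random


def assign_site_folds(site_ids, n_folds=10, seed=42):
    """Assign every unique site to exactly one fold, so no site spans two folds."""
    sites = sorted(set(site_ids))
    random.Random(seed).shuffle(sites)
    return {site: k % n_folds for k, site in enumerate(sites)}


def split_observations(observations, fold_of_site):
    """Split (site_id, payload) pairs into train, evaluation and test sets by fold."""
    splits = {"train": [], "eval": [], "test": []}
    for site_id, payload in observations:
        fold = fold_of_site[site_id]
        name = "train" if fold < 8 else ("eval" if fold == 8 else "test")
        splits[name].append((site_id, payload))
    return splits
```

Splitting by site rather than by observation ensures that test performance reflects interpolation to genuinely unseen locations, rather than to unseen times at sites the model has already learned.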
We assigned data folds one to eight to be used for training, with fold nine providing an evaluation set for hyperparameter tuning, and fold ten providing a held-out test set for assessing the performance of the final trained model at locations unseen by the model. Running on a single GPU workstation (with an Nvidia RTX 3070), our neural network trains at one epoch every 3 seconds, so that training for 600 epochs takes about 30 minutes.
Results and discussion
Our approach has not required the manual labelling of outliers in the training data, but we can see from the output of the model (specifically the parameter $\pi$, which controls the mixing of the Gaussian and Uniform distributions) that observations that visually appear to be outliers have been assigned a high probability of being outliers generated by the Uniform distribution (see for example Figure 4, in which sites are coloured by the average predicted outlier probability of their observations). On the basis of this qualitative assessment, we have confidence that predictions generated by our neural network’s Gaussian output distribution are a clean (outlier-free) representation of the true surface air temperature. We also find this to be evident in the clean look of maps generated by the model (using only the Gaussian distribution for prediction), which do not contain the localised bright or dark spots that could be expected if the model had incorrectly fitted to outlier observations. All subsequent reporting of results, and their discussion, is made on the basis of using only the Gaussian distribution for prediction, so that all predictions are ‘outlier-filtered’.
In terms of the quantitative performance of the model as assessed on held-out test data (from sites unseen by the model during training), we find that our deep learning approach to spatiotemporal interpolation provides a good degree of predictive skill in both a deterministic and probabilistic sense (Figure 5). In a deterministic sense (Figure 5A), the mean of the predictive distribution provides an $R^2$ of 0.90 and a root mean square error (RMSE) of 1.15 degrees Celsius. Probabilistically, our model achieves a continuous ranked probability score of 0.6 (Figure 5B), and the predictive distribution has good calibration, with held-out test observations falling within the 95% prediction interval 92.7% of the time. We can see from the quantile-quantile plot (Figure 5C) and prediction-interval coverage plot (Figure 5D) that the probabilistic calibration of the predictive distribution performs well across the range of predicted quantiles, although we do see a slight underdispersion in the tails (i.e. beyond a 90% prediction interval). This may be attributable to limitations of our Monte Carlo dropout approach to approximate Bayesian inference, in that our posterior distribution is fundamentally centred about a single optimum, rather than composed of diverse samples from separate local optima as in full Bayesian inference via MCMC sampling methods (or other proposed approximations such as the ‘deep ensembles’ approach; lakshminarayanan2016simple), which may reduce the diversity and coverage of the posterior. However, as we assess here on held-out test data, this underdispersion, if present, appears to be minimal and not overly concerning, especially given that some of our test observations are outliers themselves, which means that perfect calibration (using predictions from our Gaussian distribution alone) cannot be expected.
Overall the performance metrics indicate that our deep mixture density network approach to outlier-filtered spatiotemporal interpolation is doing a good job of providing accurate and trustworthy predictions of historic surface air temperatures for locations in space which have not been observed. It provides a statistical hindcast which is likely to be both computationally cheaper and better calibrated than numerical hindcast alternatives. When run over a long duration, our approach should also provide high quality probabilistic climatology estimates at any unobserved location, which may be useful for planning purposes.
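The kinds of metric reported above (RMSE, continuous ranked probability score and prediction-interval coverage) can be sketched using the Python standard library. This is a simplified illustration that assumes each prediction is a single Gaussian with mean `mu` and standard deviation `sigma`, for which the CRPS has a well-known closed form; the model's actual predictive distribution averages over many dropout samples, so the study's figures would be computed differently. The function names are our own.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard Normal, used for its cdf, pdf and quantiles


def rmse(y_true, y_pred):
    """Root mean square error of point predictions."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true))


def crps_gaussian(y, mu, sigma):
    """Closed-form CRPS of a Gaussian forecast N(mu, sigma^2) for observation y."""
    z = (y - mu) / sigma
    return sigma * (
        z * (2.0 * STD_NORMAL.cdf(z) - 1.0)
        + 2.0 * STD_NORMAL.pdf(z)
        - 1.0 / math.sqrt(math.pi)
    )


def interval_coverage(y_true, mus, sigmas, level=0.95):
    """Fraction of observations inside the central prediction interval."""
    q = STD_NORMAL.inv_cdf(0.5 + level / 2.0)  # about 1.96 for a 95% interval
    hits = sum(1 for y, m, s in zip(y_true, mus, sigmas) if abs(y - m) <= q * s)
    return hits / len(y_true)
```

For a well-calibrated forecast, `interval_coverage` at the 95% level should be close to 0.95; a lower value, such as the 92.7% reported here, indicates slight underdispersion.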
Turning to the maps produced by our model, we can see that our deep learning approach produces detailed predictions which take account of surface topography. For any snapshot in time, we can obtain a map of the predicted mean (average value of $\mu$; Figure 6), the average aleatoric uncertainty (average value of $\sigma$; Figure 7), the epistemic uncertainty of the mean (standard deviation of the posterior distribution of $\mu$; Figure 8), and the total uncertainty (standard deviation of the posterior predictive distribution; Figure 9). Maps of any desired predictive quantiles, or other statistics of the posterior predictive distribution, can also be produced. In all such maps we can see that our deep learning approach produces predictions and predictive uncertainties that are highly spatially specific. In combination with the high quality of probabilistic calibration achieved (e.g. Figure 5), this indicates that our model is producing a predictive distribution that is both sharp and well-calibrated, which are the ideals for probabilistic predictions and forecasts as proposed by gneiting2007probabilistic.
Additionally, we can sample from the posterior distribution to generate simulated realisations of surface air temperature fields for any snapshot of time within the observed period. We can generate these either with the Gaussian output distribution active in order to achieve samples of the predictive distribution itself (including aleatoric uncertainty; Figure 10), or by sampling from only the posterior distribution of the mean (without independent noise from the Gaussian) in order to view alternative hypotheses for the mean temperature field at a given time (i.e. the epistemic uncertainty; Figure 11). These simulated realisations help to convey the uncertainty in the model, by offering different explanations for plausible data generating processes.
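The mapped quantities described above can be computed from a set of stochastic forward passes at one location, as in the sketch below. It assumes lists `mus` and `sigmas` holding the Gaussian mean and standard deviation from each pass, and combines them with the law of total variance. Note that the text describes the aleatoric map as the average value of the standard deviation, whereas this sketch uses the root of the mean variance so that the three terms add up exactly; that substitution is our own choice, as is the function name.

```python
import math
from statistics import fmean, pvariance


def decompose_uncertainty(mus, sigmas):
    """Split predictive uncertainty into aleatoric and epistemic parts.

    Law of total variance: Var(y) = E[sigma^2] + Var(mu), where the expectation
    and variance are taken over stochastic forward passes.
    """
    aleatoric_var = fmean([s * s for s in sigmas])  # average noise variance
    epistemic_var = pvariance(mus)                  # spread of the predicted means
    return {
        "mean": fmean(mus),
        "aleatoric_sd": math.sqrt(aleatoric_var),
        "epistemic_sd": math.sqrt(epistemic_var),
        "total_sd": math.sqrt(aleatoric_var + epistemic_var),
    }


summary = decompose_uncertainty(mus=[9.0, 11.0], sigmas=[1.0, 1.0])
```

Applying this at every grid cell for a given time would yield maps analogous to those of the mean, aleatoric, epistemic and total uncertainty.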
To visualise the output of the model through time, rather than purely in space, we can compare samples from the model (again with and without aleatoric uncertainty included) to observations recorded at a held-out test site as time series (Figure 12). As is indicated by the overall model fit and calibration metrics (Figure 5), the predictive performance for held-out test sites is good. We can see in the time series of samples from the model that samples of the mean track the observations quite closely (but do not track noise in the observations). Meanwhile, samples from the posterior predictive distribution, with aleatoric uncertainty included, do a good job of covering the distribution of observations, including noise. Both epistemic and aleatoric uncertainty vary through time (and space, as we saw in Figure 7 and Figure 8). Animations of the model output (perhaps the best way to view spatiotemporal model output) are available to view at https://github.com/charliekirkwood/animations.
The role of the model we present here, as a spatiotemporal interpolator of weather data, is similar to the role that would traditionally be filled by numerical hindcasting (e.g. palmer2004development). This is where the same physics-based numerical weather prediction models used for forecasting are fitted retrospectively to historic weather observations, to provide a ‘best fit’ of historic weather conditions. In order to provide an indication of uncertainty, ensemble hindcasts can also be run, but it is generally the case that numerical weather prediction ensembles are underdispersive in relation to observations (e.g. gneiting2005calibrated). By providing well-calibrated spatiotemporal interpolations, our deep learning approach may have the potential to provide a probabilistically superior (and computationally cheaper) alternative to numerical hindcasting, despite our model having no notion of the physical equations that govern atmospheric dynamics (i.e. Navier-Stokes; kimura2002numerical). The level of detail of spatial structure captured by the model will be limited by a combination of the resolution of auxiliary information (gridded terrain elevation data in this case) and the spatial density of observations, but the model remains free to provide predictions for any point in space. The resolution, or spatial precision, of our proposed approach can naturally improve as the density of observations, and the resolution of auxiliary information, increases.
It is interesting to observe the difference between samples from our model with and without aleatoric uncertainty, i.e. the independent noise provided by the Gaussian output distribution (e.g. by comparing the top and bottom of Figure 12, or comparing Figure 10 with Figure 11). As can be seen in Figure 12, the independent noise of our Gaussian output distribution is required in order to provide well-calibrated coverage in relation to observations (at least in our setup, in which independent noise is a part of the model). Without this aleatoric uncertainty included, the distribution over our plausible mean functions would be underdispersive in relation to the observations. This has parallels in the setup of numerical weather prediction and hindcasting (e.g. rawlins2007met; rougier2013model; bauer2015quiet), in which ensembles tend to be underdispersive in part for the same reason: while these numerical ensemble members do capture epistemic uncertainty in initial conditions (rougier2013intractable) and perhaps across model parameters (leutbecher2008ensemble), they tend not to model aleatoric uncertainty. In order to achieve well-calibrated numerical weather forecasts, statistical post-processing must therefore be used, such as Bayesian model averaging in which individual ensemble members are ‘dressed’ with suitably scaled Gaussian noise (raftery2005using), thus effectively transforming the underdispersed ensemble at the bottom of Figure 12 to the well-calibrated ensemble at the top of Figure 12. It is perhaps another strength of our Bayesian deep learning approach that ‘ensemble’ predictions of both forms (with and without aleatoric uncertainty) can be generated equally easily by sampling from the same model, and that our full posterior predictive distribution (which includes aleatoric uncertainty) is innately well-calibrated and requires no subsequent post-processing.
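The effect of ‘dressing’ an underdispersive ensemble with Gaussian noise can be illustrated with a small synthetic experiment. The sketch below is not Bayesian model averaging as in raftery2005using, and it is not the study's own code: it simply adds independent noise of a known, assumed standard deviation to each ensemble member and compares the empirical coverage of a 90% central interval before and after. All names and parameter values (NOISE_SD, MEAN_SD, the ensemble size) are assumptions chosen for illustration.

```python
import random

def central_interval(samples, level=0.9):
    """Empirical central interval taken from the sorted ensemble members."""
    ordered = sorted(samples)
    last = len(ordered) - 1
    lower = ordered[int((1 - level) / 2 * last)]
    upper = ordered[int((1 + level) / 2 * last)]
    return lower, upper

def coverage(observations, ensembles, level=0.9):
    """Fraction of observations that fall inside their ensemble's interval."""
    hits = 0
    for obs, members in zip(observations, ensembles):
        lower, upper = central_interval(members, level)
        hits += lower <= obs <= upper
    return hits / len(observations)

rng = random.Random(0)
NOISE_SD = 1.0   # assumed aleatoric (observation) noise
MEAN_SD = 0.3    # assumed epistemic spread of plausible means
N_TIMES, N_MEMBERS = 500, 200

observations, raw, dressed = [], [], []
for _ in range(N_TIMES):
    centre = rng.uniform(0.0, 20.0)              # centre of belief about the mean
    true_mean = rng.gauss(centre, MEAN_SD)       # the unknown true mean
    observations.append(rng.gauss(true_mean, NOISE_SD))
    members = [rng.gauss(centre, MEAN_SD) for _ in range(N_MEMBERS)]
    raw.append(members)                          # samples of the mean only
    dressed.append([m + rng.gauss(0.0, NOISE_SD) for m in members])

raw_coverage = coverage(observations, raw)
dressed_coverage = coverage(observations, dressed)
```

In this synthetic setting the undressed ensemble covers far fewer than 90% of the observations, while the dressed ensemble comes close to the nominal level, mirroring the contrast between the bottom and top of Figure 12.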
Conclusions
We have presented a deep learning approach that provides well-calibrated, outlier-corrected spatiotemporal interpolation of crowdsourced weather observations. Our deep mixture density network approach to outlier classification unifies outlier detection and correction as part of a single probabilistic data modelling process, which provides a more streamlined modelling and model-checking workflow compared to alternative two-stage techniques (in which outlier detection and filtering is performed separately prior to data modelling).
Our unified approach allows us to generate high-fidelity spatiotemporal predictions from historic crowdsourced weather observations through a single probabilistic data model (our Bayesian deep neural network). The ultimate functionality is therefore similar to that of numerical hindcasting or reanalysis, but our approach is likely to be computationally cheaper and our predictions are innately well-calibrated, requiring no post-processing. Because we provide a full predictive distribution, the uncertainty of predictions is fully quantified, making our output useful to decision makers. The predictive uncertainty can also be viewed and mapped as its two separate components: aleatoric uncertainty, or irreducible uncertainty in the data, and epistemic uncertainty, or reducible uncertainty about our state of knowledge. In addition, our predictions can be provided at any point in space and time, therefore catering for hyperlocal scales, and so may be viewed as satisfying the requirements of a ‘models of everywhere’ approach to harnessing Internet of Things (IoT) type weather observations. On the basis of all these benefits, we consider our approach to have potentially powerful applications for quality control, data assimilation and climatological studies that maximise the utility of IoT data for environmental modelling applications in an increasingly data-rich world.
Acknowledgements
We acknowledge funding from the UK’s Engineering and Physical Sciences Research Council (EPSRC project ref: 2071900) and from the UK Met Office, by which CK’s PhD studentship is funded.
Data availability
The code to reproduce this study is available at https://github.com/charliekirkwood/wowpaper and includes functions to download NASA’s SRTM elevation data via the raster package in R. Data from the Met Office’s Weather Observation Website can be downloaded from https://wow.metoffice.gov.uk/