massivedatans
Big Data vs. complex physical models - a scalable nested sampling inference algorithm for many data sets
The data torrent unleashed by current and upcoming instruments requires scalable analysis methods. Machine Learning approaches scale well. However, separating the instrument measurement from the physical effects of interest, dealing with variable errors, and deriving parameter uncertainties is usually an afterthought. Classic forward-folding analyses with Markov Chain Monte Carlo or Nested Sampling enable parameter estimation and model comparison, even for complex and slow-to-evaluate physical models. However, these approaches require independent runs for each data set, implying an unfeasible number of model evaluations in the Big Data regime. Here we present a new algorithm, collaborative nested sampling, for deriving parameter probability distributions for each observation. Importantly, in our method the number of physical model evaluations scales sub-linearly with the number of data sets, and we make no assumptions about homogeneous errors, Gaussianity, the form of the model or heterogeneity/completeness of the observations. Collaborative nested sampling has immediate application in speeding up analyses of large surveys, integral-field-unit observations, and Monte Carlo simulations.
Big Data has arrived in astronomy Feigelson2012 ; Zhang2015a ; Mickaelian2016 ; Kremer2017 . In the previous century it was common to analyse a few dozen objects in detail. For instance, one would use Markov Chain Monte Carlo to forward fold a physical model and constrain its parameters. This would be repeated for each member of the sample. However, current and upcoming instruments provide a wealth of data (millions of independent sources) where it becomes computationally difficult to follow the same approach, even though it is embarrassingly parallel. Currently, much effort is put into studying and applying machine learning algorithms such as (deep learning) neural networks or random forests for the analysis of massive data sets. This can work well if the measurement errors are homogeneous, but typically these methods make it difficult to insert existing physical knowledge into the analysis, to deal with variable errors and missing data points, and generally to separate the instrument measurement process from the physical effects of interest. Furthermore, we would like to derive probability density distributions of physical parameters for each object, and perform model comparison between physical effects and source classes.
In this work I show how nested sampling can be used to analyse many data sets simultaneously. The key insight is that nested sampling allows effective sharing of evaluation points across data sets, requiring far fewer model evaluations than if the data sets were analysed individually. I only assume that the model can be split into two components: a slow-to-evaluate physical model which produces a prediction in observable space, and a fast-to-compute comparison of that prediction with the individual data sets (e.g. the likelihood of a probability distribution). Otherwise, the user is free to choose arbitrary physical models and likelihoods. In §3, I present as an example the line fitting of a hypothetical many-object spectroscopic survey. A more advanced example could include broadened H/O/C line emissions from various ionisation states under red noise errors, without modification of our algorithm.
Nested sampling Skilling2004 is a global parameter space exploration algorithm, which zooms in from the entire volume towards the best-fit models by steadily increasing the likelihood threshold. In the process it produces parameter posterior probability distributions and computes the integral over the parameter space. Assume that the parameter space is a d-dimensional cube. A number N of live points are placed randomly in the parameter space (in general, following the prior; for most problems one can assume uniform sampling with appropriate stretching of the parameter space under the inverse cumulative of the prior distributions). Their likelihood is evaluated. Each point represents 1/N of the entire volume. The live point with the lowest likelihood, L_min, is then removed, implying the removal of the space with likelihood below L_min and shrinkage of the volume to N/(N+1) of its previous size, on average. A new random live point is drawn, with the requirement that its likelihood must be above L_min. This replacement procedure is iterated, shrinking the volume exponentially. Each removed ("dead") point and its likelihood L_i is stored. The integral over the parameter space can then be approximated by Z ≈ Σ_i L_i w_i, where w_i is the volume removed at the i-th iteration. At a late stage in the algorithm the volume probed is tiny and the likelihood increase is negligible, so that the weights of the remaining live points become small. Then the iterative procedure can be stopped (the algorithm has converged). The posterior probability distribution of the parameters is approximated as importance samples of weight L_i w_i / Z at the dead point locations, and can be resampled into a set of points with equal weights, for posterior analyses similar to those with Markov Chains. More details on the convergence and error estimates can be found in skilling2009nested .
Efficient general solutions exist for drawing a new point above a likelihood threshold in low dimensions. The idea is to draw only in the neighbourhood of the current live points, which already fulfil the likelihood threshold. The best-known algorithm in astrophysics and cosmology is multinest Shaw2007 ; Feroz2009 . There, the contours traced out by the points are clustered into ellipses, and new points are drawn from the ellipses.
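The procedure described above can be sketched in a few lines. This is a minimal toy implementation on a 1-D problem with a uniform prior, using plain rejection sampling for the replacement draw (region methods such as multinest or RadFriends make that step efficient); all names and numbers are illustrative, not the paper's code.

```python
# Minimal nested sampling loop on a toy 1-D Gaussian likelihood (illustrative).
import numpy as np

rng = np.random.default_rng(42)

def loglike(theta):
    # toy likelihood: a Gaussian bump centred at 0.5 with width 0.1
    return -0.5 * ((theta - 0.5) / 0.1) ** 2

N = 100                          # number of live points
live = rng.uniform(size=N)       # drawn from the (uniform) prior on [0, 1]
live_L = loglike(live)

Z = 0.0                          # evidence accumulator
V = 1.0                          # current prior volume estimate
dead_L = []                      # stored dead-point likelihoods
for _ in range(600):
    i = np.argmin(live_L)        # worst live point defines the threshold L_min
    Lmin = live_L[i]
    Vnext = V * N / (N + 1.0)    # expected shrinkage of the remaining volume
    w = V - Vnext                # volume shed this iteration, on average
    Z += np.exp(Lmin) * w
    dead_L.append(Lmin)
    V = Vnext
    # replacement draw above the threshold (plain rejection sampling here)
    while True:
        candidate = rng.uniform()
        if loglike(candidate) > Lmin:
            break
    live[i] = candidate
    live_L[i] = loglike(candidate)
# the remaining live points contribute roughly V times their mean likelihood
Z += V * np.mean(np.exp(live_L))
```

For this toy problem the true integral is about 0.1·sqrt(2π) ≈ 0.25, which the loop recovers up to the usual Monte Carlo scatter.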
To avoid accidentally cutting away too much of the parameter space, the tightest-fitting ellipses are enlarged by an empirical (problem-specific) factor. Another algorithm is RadFriends buchner2014statistical , which defines the neighbourhood as all points within a radius r of an existing live point. By randomly leaving out a portion of the live points, and determining their distance to the remaining live points, the largest nearest-neighbour radius is determined. The worst-case analysis through bootstrapping cross-validation over multiple rounds makes RadFriends robust, independent of contour shapes and free of tuning parameters. Figure 1 illustrates the generated regions. RadFriends is efficient if one chooses a standardised Euclidean metric (i.e. normalises by the standard deviation of the live points along each axis). I use this algorithm in this work, although any other method can be substituted.
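The bootstrapped radius determination can be sketched as follows — a simplified reading of the RadFriends idea (leave out a random portion of the live points, measure how far the left-out points are from the kept ones, take the worst case), with the standardised Euclidean metric; function names and the number of rounds are illustrative.

```python
# Sketch of a RadFriends-style region: bootstrap the nearest-neighbour radius.
import numpy as np

def radfriends_radius(live, rounds=10, rng=None):
    rng = rng if rng is not None else np.random.default_rng(1)
    std = live.std(axis=0) + 1e-12      # standardise each axis
    u = live / std
    n = len(u)
    radius = 0.0
    for _ in range(rounds):
        mask = rng.random(n) < 0.5      # leave out roughly half the points
        if mask.all() or not mask.any():
            continue
        left_out, kept = u[mask], u[~mask]
        # distance of each left-out point to its nearest kept point
        d = np.sqrt(((left_out[:, None, :] - kept[None, :, :]) ** 2).sum(axis=2))
        radius = max(radius, float(d.min(axis=1).max()))
    return radius, std

def in_region(point, live, radius, std):
    # a candidate is inside if it lies within `radius` of any live point
    d = np.sqrt((((point - live) / std) ** 2).sum(axis=1))
    return bool((d <= radius).any())
```

New candidate points are then drawn from inside this union of balls and pass the region test before the (slow) likelihood is evaluated.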
Consider two independent nested sampling runs on different data sets, but initialised to the same random number generator state. Initially, points are generated from across the entire parameter space, typically giving bad fits. If the data sets are somewhat similar, this phase of zooming in on the relevant parameter space will be the same for the two runs. Importantly, while the exact likelihood value will differ for the same point, the ordering of the points will be similar. In other words, for both runs, the worst-fitting point to be removed is the same. The next key insight is that new points can be drawn efficiently from a contour which is the union of the likelihood contours from both runs. Ideally, the point can be accepted by both runs, keeping the runs similar (black points in Figure 2). When a point is shared, the (slow) predicting model has to be evaluated only once, speeding up the run. The model prediction is then compared against the data to produce a likelihood for each data set, an operation which I presume to be fast (e.g. simply computing ln L_k = -1/2 Σ_j ((m_j - d_kj) / σ_kj)² + const, where m_j, d_kj and σ_kj are the predictions, measurements and errors in data space respectively for data set k).
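This split — one shared (slow) model prediction, compared cheaply against every data set — can be sketched as below. The Gaussian line model stands in for the slow physical model; all names and shapes are illustrative.

```python
# One slow model evaluation, many fast Gaussian likelihood comparisons.
import numpy as np

def predict(theta, x):
    # stand-in for the slow physical model: a Gaussian line profile
    A, width, pos = theta
    return A * np.exp(-0.5 * ((x - pos) / width) ** 2)

def loglikes(theta, x, data, errors):
    """Gaussian log-likelihoods of one prediction against M data sets.
    data, errors: arrays of shape (M, npix)."""
    m = predict(theta, x)                        # evaluated once, shared
    chi2 = (((data - m) / errors) ** 2).sum(axis=1)
    norm = np.log(2 * np.pi * errors ** 2).sum(axis=1)
    return -0.5 * (chi2 + norm)                  # one value per data set
```

The comparison step is vectorised over data sets, so its cost stays negligible next to the model prediction, as the method requires.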
What if the point can be accepted by only one run? It cannot simply be rejected, otherwise the uniform sampling property of nested sampling is broken. Instead, accepted points are stored in queues, one for each run/data set. Once both runs have a non-empty queue, the first accepted point is removed from each queue and replaces the dead point of each data set. Joint sampling also helps even if a point is not useful right away. If a point was only accepted by one run, but the following point is accepted by both runs, the latter becomes a live point immediately for one run, but can later also become a live point for the other run (if it satisfies the likelihood threshold at that later iteration). This technique allows sustained sharing of points, decreasing the number of unique live points and increasing the speed-up.
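The queue bookkeeping can be sketched as a small class: every proposed point is offered to all runs, each run whose threshold it beats appends it to its queue, and nested sampling advances once every queue is non-empty. This is an illustrative toy, not the paper's implementation.

```python
# Toy sketch of per-run queues that let one evaluation serve several data sets.
from collections import deque

class RunQueues:
    def __init__(self, n_runs):
        self.queues = [deque() for _ in range(n_runs)]

    def offer(self, point, loglikes, thresholds):
        """Offer one proposed point to all runs; fill the queues of those
        whose current likelihood threshold it exceeds."""
        accepted_by = []
        for k, (L, Lmin) in enumerate(zip(loglikes, thresholds)):
            if L > Lmin:
                self.queues[k].append((point, L))
                accepted_by.append(k)
        return accepted_by

    def ready(self):
        # nested sampling can advance once every run has a replacement waiting
        return all(len(q) > 0 for q in self.queues)

    def pop_all(self):
        # each run replaces its dead point with its first queued point
        return [q.popleft() for q in self.queues]
```

Note that a point accepted by only one run stays in that run's queue and can still become a live point a few iterations later, which is the sustained sharing described above.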
At a later stage of the algorithm, the contours may diverge significantly and no longer share any live points. This is because the best-fit parameters of the data sets will differ. Then, the nested sampling runs continue as in the classic case, without speed-up, falling back to a linear scaling. The more different the data sets are, the earlier this happens. The run is longer for data sets with high signal-to-noise, making the algorithm most efficient when most observations are near the detection limit. This is typically the case in surveys (a consequence of power-law distributions).
I now describe the algorithm for simultaneously analysing M data sets, where M is a large number. The algorithm components are the nested sampling integrator, the constrained sampler and the likelihood function, as in classic nested sampling, except that they work on M data sets. The constrained sampler behaves substantially differently in our algorithm.
The likelihood function receives a single parameter vector, and information on which data sets to consider. It calls the physical model with the parameter vector to compute a prediction in data space. The physical model may perform complex and slow numerical computations/simulations at this point.
Finally, the prediction is compared with the individual data sets to produce a likelihood for each considered data set. The likelihood at this point can be Gaussian (see above), Poisson (comparing the predicted counts to observed counts), a red noise Gaussian process, or any other probability distribution as appropriate for the instrument. In any case, this computation must be fast compared with producing the model prediction for any performance gains to be realised.
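For count data, for instance, the comparison step could be a Poisson log-likelihood of predicted counts against observed counts; a minimal sketch (the helper name is mine, not from the paper):

```python
# Poisson log-likelihood of observed counts given predicted counts (sketch).
import numpy as np
from math import lgamma

def poisson_loglike(mu, counts):
    """Sum over bins of log Poisson(counts | mu)."""
    mu = np.clip(np.asarray(mu, dtype=float), 1e-30, None)  # guard log(0)
    counts = np.asarray(counts, dtype=float)
    logfact = np.array([lgamma(c + 1.0) for c in counts])   # log(counts!)
    return float((counts * np.log(mu) - mu - logfact).sum())
```

Swapping this in for the Gaussian comparison requires no change to the rest of the algorithm, since only the per-data-set comparison differs.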
The integrator essentially deals with each run individually as in standard nested sampling, keeping track of the volume at the current iteration, and storing the live points and their weights for each data set individually. It calls the constrained sampler (see below), which holds the live points, to receive the next dead point (for all data sets simultaneously). The integrator must also test for convergence, and advance further only those runs which have not yet converged. Here I use the standard criterion that the estimated remaining nested sampling error is negligible (from the last equation in skilling2009nested ). Once all runs have terminated, the integral estimate and posterior samples for each data set are returned, giving the user the same output as e.g. a multinest analysis.
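A common form of this termination test compares the largest possible remaining contribution, the current volume times the highest live-point likelihood, against the accumulated evidence; a sketch in log space (the tolerance value is illustrative):

```python
# Per-run convergence check: is the possible remaining evidence negligible?
import numpy as np

def converged(logZ, logV, live_logL, tol=1e-3):
    """True if V * max(L_live) is below tol * Z, i.e. continuing the run
    cannot change the evidence estimate appreciably."""
    log_remainder = logV + np.max(live_logL)   # log of the remaining bound
    return bool(log_remainder < logZ + np.log(tol))
```

The integrator applies such a check to each run separately, so high signal-to-noise data sets keep iterating while the rest are already finished.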
The sampler initially draws the live points and stores their likelihoods in an array of size N × M (one entry per live point and data set). Sequential IDs are assigned to live points and the mapping between live point IDs and data set indices is stored. The integrator informs the sampler when it should remove the lowest likelihood point and replace it. The integrator also informs the sampler when some data sets have finished and can be discarded from further consideration, in which case the sampler works as if they had never participated.
The main task of the constrained sampler is to do joint draws under the likelihood constraints to replace the lowest likelihood point in each of the data sets. For this, M initially empty queues are introduced (see Figure 3). First, it is attempted to draw from the joint contour over all data sets (superset draw), i.e. letting RadFriends define a region based on all the unique live points. From this region a point is drawn which exceeds the likelihood threshold of at least one data set. Some data sets will accept it and the corresponding queues are filled. If this fails to fill all queues after several (e.g. 10) attempts, a focussed draw is done. In that case, only the data sets with empty queues are considered, the region is constructed from their live points, and the likelihood is only evaluated for these data sets. For example, in the illustration of Figure 3, only Data Set 3 would be considered. Once all queues have at least one entry, nested sampling can advance: for each data set, the first queue entry is removed and replaces the dead live point. In Figure 3 this is illustrated by the queues pushing out the lowest live points. These dead points are returned to the integrator.
Storing queue entries is only useful if they can replace live points in the upcoming iterations. Playing nested sampling forward, this implies that to be accepted into the queue at position k, a point must have a likelihood higher than k points from the run's live points and the earlier queue entries combined. In other words, the first entry must merely beat a single existing live point; the second entry must beat both a live point and either another live point or the first queue entry (which will become a live point in the next iteration).
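Under this reading, the acceptance rule reduces to an order-statistics check on the pooled likelihoods; a sketch (illustrative, assuming the position-k interpretation above):

```python
# Acceptance rule for entering a run's queue at position len(queue_L) + 1.
import numpy as np

def accept_into_queue(candidate_L, live_L, queue_L):
    """The candidate must beat the k lowest likelihoods among the run's
    live points and the earlier queue entries, where k = len(queue_L)."""
    pool = np.sort(np.concatenate([np.asarray(live_L, dtype=float),
                                   np.asarray(queue_L, dtype=float)]))
    k = len(queue_L)               # entries already ahead of the candidate
    return bool(candidate_L > pool[k])
```

With an empty queue this is just "beat the lowest live point"; with one entry queued, the candidate must exceed the second-lowest of the pooled set, matching the description above.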
It can occur that between two groups of data sets the live points are no longer shared, i.e. the live point sets are disjoint. For example, one may have a dichotomy between broad and narrow line objects, and the contours identify some of the data sets with the former class, some with the latter. Then it is not useful to consider all live points when defining the region, because it introduces unnecessary multi-modality. Instead, subsets of data sets which share live points can be identified (see Figure 4), and these subsets can be processed independently. Algorithms for identifying connected subsets of graphs are well-known. The necessary graph can be constructed with nodes corresponding to the data sets and nodes corresponding to the live points, connected according to the current live point memberships. This introduces some computational overhead, especially when the numbers of data sets and live points are large. However, one can lazily defer graph construction and maintenance until they are needed. A few simple checks can trivially rule out disjoint subsets: if there are fewer unique live points across all data sets than twice the number of live points per run, some must be shared between any two groups and there are no disjoint subsets. We can also track any live point which has been accepted by all data sets. Any such superpoint, until one data set removes it as a dead point, guarantees that there are no disconnected subsets.
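The connected-subset identification on the bipartite graph of data sets and live points can be sketched with a standard union-find structure; the membership format below is illustrative.

```python
# Group data sets into connected subsets via their shared live points.
class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:                       # path halving
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def dataset_groups(memberships):
    """memberships: dict mapping data set id -> set of live point IDs it holds.
    Returns the connected groups of data set ids."""
    uf = UnionFind()
    for ds, points in memberships.items():
        for p in points:
            # tag nodes so data set ids and point IDs cannot collide
            uf.union(('ds', ds), ('pt', p))
    groups = {}
    for ds in memberships:
        groups.setdefault(uf.find(('ds', ds)), []).append(ds)
    return sorted(sorted(g) for g in groups.values())
```

Two data sets end up in the same group exactly when a chain of shared live points connects them, which is the condition for processing them together.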
The described analysis of the divergence of contours effectively clusters data sets by similarity. Interestingly, the similarity is defined neither in data space nor by distances in parameter space, but only through the likelihood contours in parameter space. This is important because clustering in data space can be non-trivial for data with varying errors and completeness, and clustering in parameter space would scale poorly with model dimensionality because it is metric dependent. Instead, the likelihood ordering that makes nested sampling unique is taken advantage of. The clustering improves the efficiency of region draws, as it eliminates the space between clusters, and improves superset draws because unrelated data sets, which cannot benefit, are not pulled in.
A simple example problem illustrates the use and scaling of the algorithm. We consider a hypothetical spectroscopic survey which collected spectra over a fixed wavelength range. We look for a Gaussian line of fixed rest-frame wavelength (but randomly shifted) and fixed standard deviation. The amplitudes vary with a powerlaw distribution above a minimum value, in units of the Gaussian noise standard deviation. We generate a large random data set and analyse the first M data sets simultaneously, for a range of M, to understand the scaling of the algorithm. Figure 5 presents some high and low signal-to-noise examples of the simulated data set.
The parameter space of the analysis has three dimensions: the amplitude, width and location of a single Gaussian line, with log-uniform, log-uniform and uniform priors respectively. The Gaussian line is our "slow-to-compute" physical model. The likelihood function is a simple product of Gaussian errors. A more elaborate example would include physical modelling of an ionised outflow emitting multiple lines with Doppler broadening and red detector noise, without necessitating any modification of the presented algorithm.
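A mock-survey generator in the spirit of this toy problem could look as follows. All numbers (wavelength grid, width range, power-law index, minimum amplitude) are illustrative stand-ins, not the paper's exact values.

```python
# Generate a mock survey of noisy spectra with power-law line amplitudes.
import numpy as np

def simulate_survey(M, rng=None, alpha=3.0, A_min=1.0, noise=1.0):
    rng = rng if rng is not None else np.random.default_rng(0)
    x = np.linspace(400.0, 800.0, 200)          # wavelength grid (illustrative)
    # power-law amplitudes p(A) ~ A^-alpha above A_min, via inverse CDF
    A = A_min * (1.0 - rng.random(M)) ** (-1.0 / (alpha - 1.0))
    pos = rng.uniform(x[0], x[-1], size=M)      # randomly shifted line centre
    width = rng.uniform(2.0, 10.0, size=M)      # line width (illustrative range)
    model = A[:, None] * np.exp(
        -0.5 * ((x[None, :] - pos[:, None]) / width[:, None]) ** 2)
    data = model + rng.normal(scale=noise, size=model.shape)
    return x, data, dict(A=A, pos=pos, width=width)
```

Because the amplitudes follow a power law, most simulated objects sit near the detection limit, which is exactly the regime where the algorithm shares points most effectively.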
Figure 6 shows the number of model evaluations necessary for analysing M data sets. The algorithm scales much better than the baseline linear scaling, i.e. analysing the data sets individually one by one. For instance, analysing the full set of observations takes only twenty times the model evaluations of a single observation, a substantial speed-up. Indeed, the algorithm scales in this problem much better than the naive linear cost of parallel analyses.
We can now plot the posterior distributions of the found line locations. Figure 7 demonstrates the wide variety of uncertainties. The spectra of Figure 5 are shown in the same colours. For many, the line could be identified and characterised with small uncertainties (cyan, black), for others, the method remains unsure (pink, yellow, magenta, gray). Figure 8 shows that the input redshift distribution is correctly recovered.
After parameter estimation we can consider model comparison: is the line significantly detected? For this, let us consider the Bayes factor B = Z_1 / Z_0, where Z_1 is the integral computed by nested sampling under the single-line model, and Z_0 is the same for the null hypothesis (no line). The latter can be computed analytically, because the null model has no free parameters: Z_0 = Π_j (2π σ_j²)^{-1/2} exp(-d_j² / (2σ_j²)), i.e. the Gaussian likelihood of the data under a zero prediction. Figure 9 shows in black the derived Bayes factors. To define a lower threshold for significant detections, we Monte Carlo simulate a data set of spectra without signal, and derive B values for them. This can be done rapidly with the presented algorithm. The red histogram in Figure 9 shows the resulting Bayes factors. Choosing a high quantile of the B values in this signal-free data set as a threshold, those objects in the "real" data with a Bayes factor above it can be securely claimed to have a line, with a small false-detection fraction.
A scalable algorithm for analysing massive data sets with arbitrarily detailed physical models and complex, inhomogeneous noise properties was presented. The algorithm brings to the Big Data regime parameter estimation with uncertainties, classification of objects and distinction between physical processes.
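The Bayes-factor calibration against signal-free simulations described above amounts to a quantile cut; a sketch, with the quantile value illustrative:

```python
# Calibrate a detection threshold from null (signal-free) simulations.
import numpy as np

def detection_threshold(null_log_bayes, quantile=0.99):
    """Cut such that only ~(1 - quantile) of pure-noise data sets would be
    falsely flagged as detections."""
    return float(np.quantile(np.asarray(null_log_bayes, dtype=float), quantile))

def detections(log_bayes, threshold):
    """Boolean mask of data sets claimed as significant detections."""
    return np.asarray(log_bayes, dtype=float) > threshold
```

Since the null simulations are themselves a large homogeneous batch, they can be analysed with the same collaborative runs, which is what makes this calibration cheap.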
The key insight in this work is to take advantage of a property specific to nested sampling: the contours at a given iteration are sub-volumes which can look similar across similar data sets, and it is permitted to draw new points from the union of contours. (As in classic nested sampling, the volume shrinkage estimates are valid on average; multiple runs can test whether this leads to additional scatter in the integral estimate. In practice, single runs already give correct uncertainties for many problems.) These joint draws drastically reduce the number of unique model evaluations, in particular at the beginning of the nested sampling run. The same approach cannot be followed with Markov Chain proposals: there, the next proposal depends on the current point, and differing acceptances prohibit a later joint proposal. The joint sampling suggests Collaborative Nested Sampling as the name of the algorithm.
The algorithm has some overhead related to the management of live points, in particular to determine the unique set of live points across a dynamically selected subgroup of data sets. The memory usage also grows if big data sets have to be held on the same machine. If only chunks of a certain size are manageable, the analysis can be split into chunks of that size and run in parallel across multiple machines. In that case, one can take advantage of the scaling of the algorithm up to the chunk size.
Applications: The algorithm can be applied immediately to any existing large data set, such as spectra from the Sloan Digital Sky Survey Eisenstein2011SDSS , which inspired the example presented here. Low signal-to-noise data and exhaustive searches for lines with multiple solutions are handled naturally by the algorithm. Aside from surveys with large data sets, individual integral-field-unit observations, where each pixel contains a spectrum, can be analysed with the algorithm, also permitting the fitting of complex ionisation or radiative transfer models. There the optimal speed-up case for the algorithm is satisfied, because often many pixels will share lines with similar positions and widths. Compared to existing approaches, nested sampling naturally allows model comparisons (e.g. detection of additional lines) and multiple fit solutions.
As another example, eROSITA Predehl2014eRosita requires the source classification and characterisation of 3 million point sources in its all-sky survey Kolodzig2012 . The desire to use existing physical models, the complex detector response and non-Gaussianity of count data make standard machine learning approaches difficult to apply. Furthermore, standard fitting techniques can fail to correctly converge and individual visual inspection in the Big Data regime is impractical.
Even in the analysis of single objects the presented algorithm can help. One might test the correctness of selecting a more complex model, e.g. based on Bayes Factors, as in the toy example presented. Large Monte Carlo simulations of a null hypothesis model can be quickly analysed with the presented method, with a model evaluation cost that is essentially independent of the number of generated data sets, i.e. comparable to the single-object analysis.
I thank Surangkhana Rukdee and Frederik Beaujean for reading the manuscript.
I acknowledge support from the CONICYT-Chile grants Basal-CATA PFB-06/2007, FONDECYT Postdoctorados 3160439 and the Ministry of Economy, Development, and Tourism’s Millennium Science Initiative through grant IC120009, awarded to The Millennium Institute of Astrophysics, MAS. This research was supported by the DFG cluster of excellence “Origin and Structure of the Universe”.