Modeling the Gaia Color-Magnitude Diagram with Bayesian Neural Flows to Constrain Distance Estimates

by   Miles D. Cranmer, et al.
Princeton University

We demonstrate an algorithm for learning a flexible color-magnitude diagram from noisy parallax and photometry measurements using a normalizing flow, a deep neural network capable of learning an arbitrary multi-dimensional probability distribution. We present a catalog of 640M photometric distance posteriors to nearby stars derived from this data-driven model using Gaia DR2 photometry and parallaxes. Dust estimation and dereddening is done iteratively inside the model and without prior distance information, using the Bayestar map. The signal-to-noise (precision) of distance measurements improves on average by more than 48 the accuracy of distances have improved over other models, especially in the noisy-parallax regime. Applications are discussed, including significantly improved Milky Way disk separation and substructure detection. We conclude with a discussion of future work, which exploits the normalizing flow architecture to allow us to exactly marginalize over missing photometry, enabling the inclusion of many surveys without losing coverage.



page 8

page 9

page 10

page 11

page 12

page 13


Empirical analysis of the variability in the flow-density relationship for smart motorways

The fundamental diagram is an assumed functional relationship between tr...

Inter-Mobile-Device Distance Estimation using Network Localization Algorithms for Digital Contact Logging Applications

Mobile applications are being developed for automated logging of contact...

Improved Modeling of Persistence Diagram

High-dimensional reduction methods are powerful tools for describing the...

A Physics-Informed Deep Learning Paradigm for Traffic State Estimation and Fundamental Diagram Discovery

Traffic state estimation (TSE) bifurcates into two main categories, mode...

A statistical model of tristimulus measurements within and between OLED displays

We present an empirical model for noises in color measurements from OLED...

Sparse 3D Topological Graphs for Micro-Aerial Vehicle Planning

Micro-Aerial Vehicles (MAVs) have the advantage of moving freely in 3D s...

Code Repositories


XD vs normalizing flows demo

view repo
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Gaia’s precise astrometry has impacted the astrophysics community in countless ways. While Gaia data is predominantly useful for mapping out the Milky Way and its bulk properties, for example in Bovy (2017); it has been used for calibration of standard candles like the Red Clump as in Hawkins et al. (2017); Huber et al. (2017), and vice versa — using the Red Clump to improve Gaia parallax calibration — Hall et al. (2019); and mapping out the interstellar medium to create dust maps of the Milky Way, as in Green et al. (2019). Gaia’s astrometric data has been incredibly useful for the study of substructures in the Milky Way and its halo, for example, stellar streams, as in Price-Whelan & Bonaca (2018); Malhan et al. (2018); Brown et al. (2018); Koposov et al. (2019). The precise positional and proper motion estimates even allow for automated algorithms to be used to detect stellar streams, as described in Malhan & Ibata (2018). Recently the astrometric data has been used to observe possible dynamical evidence for dark matter in stellar streams in Bonaca et al. (2019), although this stream was at a distance where the parallax measurements couldn’t be reliably used besides a foreground filter. Producing more accurate astrometry for the Gaia DR2 dataset would impact many current and future use cases. Gaia parallaxes greatly degrade in quality at a distance starting at about , with only million sources having greater than 10 signal-to-noise (SNR, defined here as parallax over parallax uncertainty), hampering studies of structure in the galactic halo such as stellar streams.

We can make reasonable distance estimates from these uncertain parallaxes by relying on distance priors such as those derived in Bailer-Jones (2015); Bailer-Jones et al. (2018), or by exploiting patterns in stellar photometry, either using theoretical Color-Magnitude Diagrams (CMDs) for stars or learning them as in Leistedt & Hogg (2017); Anderson et al. (2018). If we were to rely on purely theoretical models for estimating distances from photometry, we would be subject to systematic errors from the models themselves. Distance prior-derived distances are also heavily prior-dominated, and of limited use for substructure studies, also demonstrated in Section 4.3

. By denoising in a model-independent fashion — training a flexible machine learning algorithm to model the most common photometric measurements for stars, hence acting as a prior for distance estimates — we avoid difficulties from modeling the theoretical photometry, and naturally learn over top of the Gaia systematic errors. Using a very flexible model for a color-magnitude population density would reduce potential model-dependent bias from entering distance estimates.

One way of building very flexible models is by using machine learning models. Typically in astronomy, machine learning is thought of purely as an approach to classification or regression, which is predicting an output scalar or vector value based on a set of input parameters, such as the simplest example: creating a line of best fit. The extension of this is to fitting quadratic functions, and then higher-order polynomials, and other simple models, which can be generalized to multi-input, multi-output using additional parameters. One generalization of these simple machine learning models which are fit to data is a popular technique called deep learning, which is a flexible approach capable of fitting very complex surfaces over input data, relying typically on stochastic gradient descent to fit millions of parameters. Deep learning can also be applied to many additional optimization problems than just regression, such as density estimation, which we will use it for in this paper. Density estimation is the problem of fitting a function that models an unknown probability distribution and can be approached with deep learning using a model called a normalizing flow.

Deep learning can be thought of as a recursive generalized linear regression — you repeatedly compute linear regression (fitting a hyperplane) on an input, following each regression with an element-wise nonlinearity (such as converting negative values to zero). In the case of normalizing flows, as we will see, we also apply a mask over the linear regression weights to make the function have a triangular Jacobian matrix. We use such a normalizing flow in this paper to model photometric measurements of Gaia stars.

One common criticism of deep learning models targets their lack of interpretability and potential over-flexibility. Generally, it is recommended that deep learning should only be used when it gives you large performance gains or lets you model a relation that would be extremely difficult to represent with classical machine learning models. As discussed in this paper, due to the large size of the Gaia DR2 dataset, and the non-Gaussian contours of the density of stars on a CMD, a deep neural network is a useful model for its scalability and flexibility, so we choose it over traditional machine learning models.

2 Data

Our dataset is a slice of the Gaia DR2 catalog. For technical papers describing the 2nd data release used in this paper, consult Brown et al. (2018); Lindegren et al. (2018); Riello et al. (2018). We make use of the following features: bp_rp, bp_g, phot_g_mean_mag, ra, dec, parallax, along with their corresponding uncertainties, and features to calculate the renormalized unit weight error (RUWE): astrometric_chi2_al and astrometric_n_good_obs_al. We choose to not apply any filters based on parallax or parallax uncertainty, meaning we take all noisy measurements (although we do filter using RUWE, discussed later), including negative parallaxes. We exclude all stars below -30 deg declination, since the Bayestar dust map, which we use from Green et al. (2018) in the package Green (2018), does not extend there. We also exclude stars lying in holes of this dust map. We include all stars during training, even the low-SNR parallaxes. However, we make cuts during training such that stars in high-density regions of the sky are excluded to avoid stars with bad goodness-of-fit for parallax. We do this by requiring that the renormalized unit-weight error of measured parallaxes, discussed on, is less than 1.4.

We apply a single global parallax offset of 0.029 mas from Lindegren et al. (2018) to all of the Gaia data. While the true parallax offset is conditional on several variables, we believe that training with a constant parallax offset makes it easier to apply different post-processing offsets to the finished model. A more complex parallax offset can also be used during evaluation.

3 Model

We wish to train a model that takes bp_rp, bp_g, phot_g_mean_mag, ra, dec, parallax from the Gaia DR2 catalog, along with their uncertainties, and produces a Bayesian posterior over distance. Our model is similar in strategy, but not in terms of the actual density model, to Anderson et al. (2018). Anderson et al. (2018) built a model on Gaia DR1 parallaxes cross-matched with 2MASS photometry, and applies the “Extreme Deconvolution” algorithm from Bovy Jo et al. (2011)

to build a Gaussian Mixture Model (GMM) over color-magnitude values. This GMM is then used as a prior for the same data, to build a final parallax probability distribution for every data point. The model described in this paper is similar to this GMM model but differs by using a normalizing flow to model density in photometric space. Various other changes are made, such as that we propagate uncertainty from the dust map, both into our training (down-weighting photometry from uncertain dust estimates) and during evaluation (sampling the dust map).

Normalizing flows are not yet popular for density estimation in astrophysics or the natural sciences in general, although they are being used for some likelihood-free inference applications derived from Papamakarios et al. (2018) in Cosmology and Particle Physics, for example in the “MadMiner” package for LHC data Brehmer et al. (2019), and “PyDELFI” for Cosmology Alsing et al. (2019). This paper represents the first application that we are aware of for normalizing flows being applied to learning a stellar CMD.

Our model makes several assumptions which largely follow those made in Anderson et al. (2018):

  • The fundamental assumption of the model is that for a given star, there will be other stars with similar photometry, as is assumed for any regression-type optimization problem.

  • We model the contents of Gaia DR2, not the Milky Way. Since parallax measurements are noisier for stars in the halo, which also tend to be lower metallicity, the CMD model will bias to metallicities in the disk. However, we lessen the effect of this by training on all stars, rather than just high-SNR.

  • The model assumes that the expected color-magnitude relation is unchanging with respect to location in the Milky Way, in that we do not include (right ascension and declination) as a parameter in our CMD.

  • This model makes no further physical assumptions about stellar structure or the color-magnitude relation. This is a trade-off because while it does avoid potential inaccuracies of physical models, it also misses the many successes of these models. Future work will attempt to combine physical models with data-driven to help regularize the training.

  • We assume that the Bayestar dust map of Green et al. (2018) produces accurate dust posteriors along each line of sight. We further assume that for training, the median dust estimate will not bias the model during training. However, we incorporate the full dust posterior for evaluation.

  • We make the assumption that the Gaia catalog has no parallax bias relative to any parameters (such as proper motion, or a specific sky coordinate such as a high-density region).

A potentially problematic assumption in our model is the fact that the dust map we use, from Green et al. (2018), makes use of stellar models to estimate dust, meaning that there are implicitly some stellar models being introduced in our distance estimates. In the future, it would be desirable to combine the learning of a dust map with the learning of a CMD so we could make completely model-independent estimates, but for now, this is our approach. The assumptions for the model described in our paper are different from those in Anderson et al. (2018) in that we do not model the color-magnitude relation using 128 Gaussians; rather, we use a deep normalizing flow which has more flexibility in describing the joint prior. The path of stellar evolution has many sharp turns and discontinuities, so, while the spread of stars about a single isochrone may be Gaussian-like due to a Gaussian spread in ages and metallicities, the overall CMD is better modelled with a highly flexible model that can express sharp turns, which is where deep learning can be extremely helpful, as GMMs perform poorly at modeling the contours of the CMD. We also deal with a significantly larger dataset, using the majority of DR2 (640M stars) versus a selection of DR1 (1.4M stars). We also incorporate the entirety of the samples from Green et al. (2018) rather than median dust estimates during evaluation.

The crux of our model is a very flexible “normalizing flow” neural network. A good introduction on this family of model can be found at This architecture models the joint posterior density for Gaia magnitudes, as

where is the absolute (dereddened) G-band mean magnitude from Gaia DR2, corresponding to column phot_g_mean_mag, and and similarly for the absolute integrated BP and RP mean magnitudes. This prior density models the dashed lines in the graphical model in fig. 1.

Figure 1: A Bayesian graphical model for distance estimates in our model. Gaia data is used in combination with a normalizing flow model of the color-magnitude diagram to generate a Bayesian posterior over distance. The dashed lines, which represent the three-dimensional color-magnitude diagram, are those learned by the normalizing flow described in this paper. The red paths are the update step: the current expectation value for distance () is used to update the absolute -magnitude for each star as well as the dust value integrated along the line of sight. This iteration occurs 5 times during training, and 10 times during evaluation. An optional prior can be applied during training and evaluation that is conditional on .

This posterior models the density of stars in Gaia DR2 and weights stars observed in Gaia DR2 by

where we define as

which is a metric for the signal-to-noise ratio of data points, much like when calculating a mean from uncertain measurements, one weights each measurement by the inverse variance. These

values are the variance of estimates in each of the quantities. These variances incorporate both the uncertainty of the dust map as well as the intrinsic uncertainty in the Gaia parallax and distance prior.

During training, we sample 32 parallaxes for each Gaia DR2 data point from the truncated normal distribution:


using the DR2 parallax value for , and

equal to the DR2 standard deviation in the parallax value estimate. As stated in

Hogg (2018), a likelihood for Gaia parallaxes treats the individual measurements as drawn from . Therefore this truncated distribution acts as a minimal distance prior on the Gaia parallax likelihood, which states that we can only have positive distances.

3.1 Model architecture

A normalizing flow can roughly be thought of as a Gaussian that has been parametrically warped by an invertible neural network. This is a smooth invertible mapping between probability distributions: . Our specific model relies on the “Masked Autoregressive Flow” variant, described in Papamakarios et al. (2017) for density estimators, which uses the following mappings:

for and . This equation is similar to the common matrix multiply followed by vector addition that is found in neural networks, but with a particular mask applied. This is done so that the Jacobian of this mapping is triangular — and hence the determinant is easy to calculate as the product of elements along the diagonal. Recall that if you would like to change variables from to via a smooth function , for probability distributions one must write:

This mapping is invertible, and the inverse also has a triangular Jacobian. The fact that the Jacobian is easy to calculate lets us normalize the transformation between probability distributions. The inverse is

which transforms from our data variables () to a latent space (

) where we set the Gaussian. This transform is what we compute to calculate the probability of given data (though we use a reparametrized version, so it doesn’t need to be done sequentially). This transform can be repeated to form a complex flow. Autoregressive models can also be used to exactly marginalize over inputs in the case of missing data (such as missing photometry), as discussed in Section 


Our model uses a sequence of blocks of MADE BatchNorm Reverse, where MADE

is the Masked Autoencoder for Distribution Estimation model defined in

Germain et al. (2015), Reverse is the reversing layer found in Dinh et al. (2016), and BatchNorm

is a batch norm-like layer (typically used in convolutional neural networks) also defined for normalizing flow models in

Dinh et al. (2016)

. We use the PyTorch code found at as the template of our codebase, which we alter for our usecase.

The MADE model essentially applies three densely-connected neural network layers with a mask (hence, a masked autoencoder) applied to the weights at each layer to satisfy properties of the neural flow. It forms an autoencoder such that each of the output units only depends on the preceding input units (). One can think of this transform as parametrizing an arbitrary bijective vector field, where the vectors show the flow of points from distribution to distribution. The BatchNorm

is the equivalent of a batch normalization for normalizing flows, and

Reverse permutes the order of the probability variables since the MADE’s mask treats each slightly differently (e.g., depends on , whereas only on ). Hence the BatchNorm and Reverse layers help regularize training. Defining one block as MADE BatchNorm Reverse, our model has a probability distribution with variables. We then specify a hidden dimension size in each MADE of and use sequential blocks.

3.2 Dust estimation

We build the Bayestar dust map from Green et al. (2018) into the model with an iterative approach using the software package dustmap from Green (2018). We use extinction values based on a blackbody integrated over an dust model from O’Donnell (1994), using the extinction package from Barbary (2016), which gives us conversions from the Bayestar dust map to Gaia DR2 bands:

In the future, we would like to expand this graphical model with an estimate for the temperature and other stellar parameters of each star, and incorporate a map of . We would also like to fit a separate graphical model which learns the best extinction to maximize likelihood over the data.

Training and evaluation take different approaches due to the computational expense. During training, we calculate a distance by dividing each of the 32 parallax samples from eq. 1. We calculate the mean of these distance samples to get the current best-estimate for a distance. For every best-estimate distance, we query Bayestar at the center ra and dec position. We convert this reddening into each of the Gaia bands, giving us estimates for and which gives us and . Then, using each of the 32 samples for parallax, we convert the estimate into 32 samples for . Next, using the current model for , we calculate the probability of each tuple. These likelihood values are treated as weights, and the weights are used to calculate a new best-estimate for distance via a weighted sum:


where each of the is a distance sample. This is then fed back into the loop and a new reddening is found using Bayestar.

This iteration is repeated 5 times during training, due to the computational expense, but 10 times during evaluation. Once the final is given, we calculate a best-estimate value for the dust. We use this to get final estimates for the dereddened . We then calculate using the raw parallax samples and dereddened . Note that we do not use to calculate the final since this could create a feedback loop for the probability density and create unphysical artifacts, which we experimentally observed. A final is then found by averaging the . The standard deviation of the is used to calculate :

where is the uncertainty in the dust map for band at the given distance and sky location. We can add these variances because and are linearly related. We could also choose to estimate by multiplying each of the weights in eq. 2 by , though we found using it created artifacts in the density which did not go away with further training.

Using and the dust map we calculate the dereddened: and . We also calculate the uncertainty due to these colors:

Finally, we are left with and , along with a measure of the combined uncertainty of the color-magnitude point

. Our loss function for this point is then:

We sum this over all stars in the DR2 catalog, after calculating the best-estimate color-magnitude points, and minimize.

This algorithm learns a very flexible prior on dereddened tuples for stars in Gaia DR2, which can then be combined with a distance prior and raw parallax measurements to generate a Bayesian distance estimate for every star in Gaia. This process is done iteratively until the dust estimate converges.

3.3 Model Optimization

We conduct a hyperparameter search for the normalizing flow over 80 different models, finding the best model using Bayesian optimization with summed-log-likelihood as an optimization metric. The model is trained on the entirety of Gaia DR2 above

, completing several passes over the data with a mini-batch size of 2048. We found this mini-batch size balances accuracy: much smaller batch sizes led to the creation of artifacts in the density map, and much larger batch sizes resulted in early convergence to less accurate models.

During model selection, we randomly initialize the model weights using Xavier initialization (see Glorot & Bengio 2010

) with a normal distribution, and train for one pass over Gaia DR2, recording the likelihood over each random one million-star subset. We explore the (learning rate, layers, hidden units) space such that the model fits in a 16 GB GPU, and record the likelihood as a function of these variables and fractional epochs. We then pass these measurements to a Gaussian Process with a radial basis function kernel. We select the next model architecture by maximizing likelihood plus the uncertainty in the likelihood. In total, we explore 80 different architectures and find that the best loss was for a model with 35 layers of 500 hidden units each. We also tried models that used mixtures of normalizing flows but found these underperformed. We also found that using the distance prior from

Bailer-Jones et al. (2018) during the estimate of gave the noisy measurements too much weight, since for very noisy measurements, as , the distance prior and will be equal, but will be finite, so very noisy measurements will have a large effect on the CMD.

4 Results

We first demonstrate the ability of this model to estimate the true color-magnitude diagram of noisy Gaia data on simulated data in Section 4.1. Next, in Section 4.2 onwards, we present the results on real data. The trained normalizing flow is visualized in fig. 3, which shows a clear main sequence and giant branch, with some color in the log plot indicating it has learned weight at the white dwarf part of the CMD as well, unprecedented in previous attempts using full CMD GMM models. We will release the catalog, which is described in Table 1, and code from links on

4.1 Simulation

We simulate a basic Gaia dataset of 30M stars drawn from simple analytic distributions in distance and color space. We apply a known noise-free dust map that is a 2D Gaussian over that is uniform along each line of sight. We distribute the stars uniformly over each line of sight, (though the distances follow a specific prior, see below). We fit a line in space on the high-SNR Gaia data and use this to sample colors. We randomly sample values from the high-SNR Gaia data and project onto the line to get a and value. We then randomly perturb these colors using a Gaussian to create the truth CMD shown in fig. 1(a). The distances are randomly sampled from the Bailer-Jones distance prior with :

These are used to create the , , and observed bands at Earth, followed by reddening according to , for our dust map, and mapped to fixed extinction conversions for each band. Next, we use the formula:

to create a parallax measurement error for all the stars. We then sample the observed parallax,

, for all stars from a Gaussian distributed with

as the standard deviation. This transformation results in the adjusted parallax distribution shown in fig. 1(d). Using as a simple distance estimate, we can visualize the noisy reddened CMD in fig. 1(c).

(a) The true CMD.
(b) The reconstructed CMD with our model.
(c) The observed noisy CMD under dust and noisy parallax estimates.
(d) The distribution of true and observed parallaxes in our simulation.
Figure 2: A test of our normalizing flow reconstruction of the CMD, using a simulated Gaia dataset and a simple fake CMD in plot (a). The reconstructed version of this is shown in (b), after heavy reddening and noise in the parallax measurements (c).

We train a model on this with blocks of hidden nodes with the same block scheme (MADE, BatchNorm, Reverse) as our real Gaia model for a few epochs, allowing the model to use the true dust map as part of its iterative dust estimation scheme (though recall it still needs to estimate accurate distances to calculate the true dust). We then numerically integrate this CMD over the dimension and visualize it in the same space as fig. 1(a) in fig. 1(b). As can be seen, the reconstructed CMD, without any hyperparameter tuning or extensive training, and noisy reddened data, is very close to the original.

4.2 Data Products

After training the model on the Gaia dataset, we apply it back to the data to generate a catalog. We choose not to use the Bailer-Jones distance priors, since we found they harmed the accuracy of stellar distance estimates in the halo as they are very prior dominated (which can be seen in fig. 6

). Instead, we use a constant distance prior over non-negative distances. The catalog contains the mean and standard deviation in the Bayesian posterior estimate for each of the DR2 sources. We plan on adding another catalog that includes 100 quantiles of the posteriors. Alternatively, one can run our code, which will be added to, to sample from the entire posteriors in our trained model. These will be described in the data documentation.

Number Gaia Value (if relevant) Model Value
Total Catalog Entries 640,875,169
Fraction of Stars with
Fraction of Stars with
Fraction of Stars with
Fraction of Stars with
Fraction of Stars with
Fraction of Stars with
Mean SNR Improvement of the points with 48.6%
Negative Fraction of Parallaxes 0
Mean SNR of the points with 1.388
Average Distance of the points with 5.376 kpc
Average Distance (weighted by SNR) of the points with 5.845 kpc
Table 1: SNR is defined using distance: , for the model described in this paper, and parallax: , for Gaia. This choice is made because distance is the fundamental astrometric measurement of this model, and parallax is the fundamental astrometric measurement of the Gaia catalog, and both results are given as the parameters of a Gaussian.

A visualization of the normalizing flow marginalized over can be seen in terms of probability and log-probability density in fig. 3.

Figure 3: The trained normalized flow, representing a color-magnitude diagram, is visualized here as a probability density in the space of and , marginalized over . Plot (a) shows probability and plot (b) shows log-probability.

The tightening of the CMD from applying this prior to the data can be seen in fig. 3(a) through fig. 3(c). The tightening is used as a visual metric for the improvement in the distance estimates. As seen in Table 1, more stars have higher signal-to-noise ratios, with only 18.9% of stars having SNR less than 1.0 versus the Gaia catalog which has 46.1%. The mean SNR improvement of those stars which do not have negative parallaxes is 48.6%.

Figure 4: Comparison of Gaia (left) and model (right) color-magnitude diagrams of all stars, after denoising the distance estimates using our color-magnitude diagram as a prior. The left plots use inverse parallax as distance to calculate , and the right plots demonstrate the tightening of the CMD around the main and giant branches with our model. The logarithm of the normalized number density is represented by color. Stars with negative parallaxes are excluded in the left plot, but included in the right plot, which makes the tightening of the CMD more impressive. For both, only Gaia stars with with RUWE and dec are included. The plots in (a) have no filter on the stars plotted, the plots in (b) have a filter of applied, and the plots in (c) have a filter of applied.

4.3 Application Examples

Here we describe several potential applications of this dataset: obtaining an estimate of the distance to M67, filtering foreground stars, and substructure detection in three spatial dimensions. First, we visualize the improved distance estimates to stars in the GD-1 stream. We make use of the reduction from Price-Whelan & Bonaca (2018) to select the stars in a strip of the DR2 sky, and then compare the distance estimates from Gaia parallaxes alone with distance estimates from Gaia parallaxes sampled using the Bailer-Jones distance prior, and finally our model distance. As seen in fig. 7 compared to fig. 5 (raw Gaia) and fig. 6 (distance priors), distance estimates to stars in GD-1 are greatly improved with this photometric model. By using the distance estimates from our model, one can perform kinematic searches for substructure farther than without requiring associated stars be part of an isochrone with the same metallicity and age. In other words, this enables generic filtering of foreground stars in Gaia data, and also adds a third distance dimension for clustering stellar substructures, such as with the STREAMFINDER algorithm in Malhan & Ibata (2018).

Figure 5: As an example of distance estimates without our model, the distances to candidate members of the GD-1 stellar stream are shown in black dots on the foreground stars as a colored density, shown using distance as the inverse of parallax on the y-axis. Compare to fig. 6 and fig. 7 as an example of foreground separation without prior knowledge of stellar properties. This series of plots demonstrates that the distance prior estimates alone are too prior dominated to be useful for studying individual clusters of halo stars.
Figure 6: Distances to GD-1 candidate members shown in black dots on the foreground stars as a colored density, shown using distance priors from Bailer-Jones et al. (2018) with Gaia parallaxes. Compare to fig. 5 and fig. 7.
Figure 7: Distances to GD-1 candidate members shown in black dots on the foreground stars as a colored density, shown using the expectation values of our model. Comparing to fig. 5 and fig. 6, we see that the black dots have moved to bluer contours, indicating that the signal-to-noise of the GD-1 stars in the declination-distance space improves by orders of magnitude, and the typical distance of a candidate star is closer to the estimated distance to GD-1 of . With a median value of and a median uncertainty of , these estimates are also more accurate to the true distance of GD-1 at about 7-10 kpc.

Finally, we give an example of estimating the distance to the cluster M67, using full Bayesian posteriors from our model. The cluster M67 is at (Yakut et al. 2009) from Earth and was also used as a demonstration cluster in Anderson et al. (2018). We do this with our model without distance priors (a flat prior over non-negative distances). First, we filter Gaia DR2 sources to the cone centered about , =, with radius . Since the estimate of inverted Gaia parallaxes is reasonable by itself (excluding negative parallaxes), we filter down to only parallaxes with SNR less than some bound, shown as the columns in fig. 8. to demonstrate that our model improves the precision and accuracy of noisy parallaxes. The top row of fig. 8 shows the inverse parallaxes. The middle row in fig. 8 has plots for the distribution of distances from Gaia parallaxes sampled using the Bailer-Jones et al. (2018) distance prior, and then in the bottom row of fig. 8, we show the distribution of the model-derived distances presented in this paper.

Figure 8: A histogram of Gaia-derived distance estimates to stars within of the direction to the M67 cluster, showing only those stars with noisy parallaxes under some SNR threshold. The known distance to M67 is within the black lines between 800 and 900 . The orange and green Gaussians overplotted show a two-component Bayesian GMM fitted to model the M67 cluster and background stars, respectively. Our model predicts the distances shown in the bottom row of plots, which are the most accurate and precise, as summarized in Table 2.

We then fit a two-component Bayesian GMM to every distance distribution, modeling the cluster distance with one component and the background stars with the other. These are overplotted in fig. 8. The model estimates for each SNR are shown in table 2.

Model M67 Center Estimate (kpc) Spread (kpc)
Inverse Parallax with SNR (and )
Inverse Parallax with SNR (and )
Bailer-Jones Prior with SNR
Bailer-Jones Prior with SNR
Normalizing Flow with SNR
Normalizing Flow with SNR
Table 2: This plot shows estimates of the distance to the M67 cluster using various models on noisy parallax data, for different noise cutoffs. SNR is defined here with parallax: .

As is evident in table 2, our normalizing flow model improves precision and accuracy to the M67 cluster of the very low-quality parallaxes: not only is the spread in the Gaussian component almost half of that with the Bailer-Jones distance priors, but the estimate derived from SNR parallaxes is much closer to the final estimate of kpc using our model. Also, since the distance priors rely on other surveys to estimate stellar density in every direction, they implicitly contain models for this particular cluster, meaning it is easier for the distance prior model in this example. The distance prior strategy fails in the example in fig. 6, with distance estimates to stars in the GD-1 stellar stream: the estimates are completely off the accepted value range. When there are anomalous stars, the prior-derived distance estimates become much less useful, and a photometry model is necessary, as shown giving much more reasonable distances in fig. 7.

5 Future work

The Masked Autoencoder architecture from Germain et al. (2015) and Papamakarios et al. (2017) which we use in our technique models a joint posterior in using conditional probabilities:

What this means is that we can exactly marginalize over through for some without additional computation by excluding them from the joint posterior, as long as we fix a hierarchy of that maximizes. In other words, the marginalized probability is:

which uses the same normalizing flow model. While we cannot marginalize over an arbitrary without using an ensemble model, if we have photometric bands with different coverage, we can fit a color-magnitude density that simultaneously models many different survey bands, and still train and evaluate the color-magnitude probability with limited information. E.g., if we were to model Gaia and AllWISE bands simultaneously in a color-magnitude density, and we order Gaia AllWISE, then we can exactly evaluate the color-magnitude density for a given Gaia photometry marginalized over AllWISE bands, but not vice versa. This technique, also used in Alsing et al. (2019), is powerful and would be useful to exploit for future distance estimates to maximize coverage while using all available surveys. In the future, we would also like to make use of maps for , as well as simultaneously model stellar parameters, rather than use extinction conversions based on a simple blackbody.

6 Conclusion

We have demonstrated an algorithm for learning a flexible probability distribution in Gaia color-magnitude space from noisy parallax and photometry measurements using a normalizing flow. These deep neural networks, capable of learning arbitrary multi-dimensional probability distributions, have been shown in this paper to be capable of modeling CMDs well, and work at predicting CMDs accurately in an iterative dust estimation scheme. We have also presented a catalog of 640M photometric distance posteriors derived from this data-driven model using Gaia DR2 photometry and parallaxes to learn a prior in Gaia color space. Overall, the signal-to-noise of distance measurements in this catalog improves on average by 48% over the raw Gaia data, including only the non-negative Gaia parallaxes, and we also demonstrate how the accuracy of distances have improved over other models. Applications are discussed for this catalog, including significantly improved Milky Way disk separation and substructure detection.

We also will maintain a GitHub repository at where we will post links to the distance catalog along with future versions, host code of the normalizing flow for photometry data, and answer questions about the paper and implementing the algorithm.

7 Acknowledgments

Miles Cranmer would like to thank David W. Hogg for the initial project idea of improving Gaia distances with a generative neural network model, Johann Brehmer and Thomas Kipf for the idea of creating this model with a normalizing flow, Wolfgang Kerzendorf for feedback on calculating extinction conversions, Gregory Green for help with his dustmap package, George Papamakarios for the idea of marginalizing over input to the normalizing flow using Papamakarios et al. (2017), and Iain Murray for comments on an early draft. This work made use of Astropy (Robitaille et al. 2013), PyTorch (Paszke et al. 2017), Scikit-Learn (Pedregosa et al. 2011), among other scientific packages mentioned in the paper.