1 Introduction
Gaia’s precise astrometry has impacted the astrophysics community in countless ways. Gaia data is predominantly useful for mapping out the Milky Way and its bulk properties, as in Bovy (2017), but it has also been used to calibrate standard candles like the Red Clump, as in Hawkins et al. (2017); Huber et al. (2017) (and, conversely, the Red Clump has been used to improve the Gaia parallax calibration; Hall et al. 2019), and to map out the interstellar medium and create dust maps of the Milky Way, as in Green et al. (2019). Gaia’s astrometric data has been incredibly useful for the study of substructures in the Milky Way and its halo, for example stellar streams, as in Price-Whelan & Bonaca (2018); Malhan et al. (2018); Brown et al. (2018); Koposov et al. (2019). The precise positional and proper motion estimates even allow automated algorithms to detect stellar streams, as described in Malhan & Ibata (2018). Recently, the astrometric data has been used to observe possible dynamical evidence for dark matter in stellar streams in Bonaca et al. (2019), although this stream was at a distance where the parallax measurements could not be used reliably beyond a foreground filter. Producing more accurate astrometry for the Gaia DR2 dataset would therefore impact many current and future use cases. Gaia parallaxes greatly degrade in quality beyond a distance of about , with only million sources having a signal-to-noise ratio (SNR, defined here as parallax over parallax uncertainty) greater than 10, hampering studies of structure in the galactic halo such as stellar streams.
We can make reasonable distance estimates from these uncertain parallaxes by relying on distance priors such as those derived in Bailer-Jones (2015); Bailer-Jones et al. (2018), or by exploiting patterns in stellar photometry, either using theoretical Color-Magnitude Diagrams (CMDs) or learning them from data as in Leistedt & Hogg (2017); Anderson et al. (2018). If we were to rely on purely theoretical models for estimating distances from photometry, we would be subject to systematic errors from the models themselves. Prior-derived distances are also heavily prior-dominated and of limited use for substructure studies, as demonstrated in Section 4.3. By denoising in a model-independent fashion, training a flexible machine learning algorithm to model the most common photometric measurements for stars so that it acts as a prior for distance estimates, we avoid the difficulties of modeling theoretical photometry and naturally learn over top of the Gaia systematic errors. Using a very flexible model for the color-magnitude population density reduces the potential for model-dependent bias to enter the distance estimates.
One way of building very flexible models is with machine learning. Typically in astronomy, machine learning is thought of purely as an approach to classification or regression: predicting an output scalar or vector value from a set of input parameters, the simplest example being a line of best fit. This extends to fitting quadratic functions, higher-order polynomials, and other simple models, which can be generalized to multi-input, multi-output problems using additional parameters. One generalization of these simple machine learning models is the popular technique of deep learning, a flexible approach capable of fitting very complex surfaces over input data, typically relying on stochastic gradient descent to fit millions of parameters. Deep learning can also be applied to optimization problems beyond regression, such as density estimation, which is how we use it in this paper. Density estimation is the problem of fitting a function that models an unknown probability distribution, and it can be approached with deep learning using a model called a normalizing flow.
Deep learning can be thought of as a recursive generalized linear regression: one repeatedly computes a linear regression (fitting a hyperplane) on an input, following each regression with an element-wise nonlinearity (such as converting negative values to zero). In the case of normalizing flows, as we will see, we also apply a mask over the linear regression weights so that the function has a triangular Jacobian matrix. We use such a normalizing flow in this paper to model photometric measurements of Gaia stars.
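To make the masking idea concrete, here is a toy sketch (our own illustration, not the paper’s code) of a masked autoregressive affine step: because each output coordinate depends only on earlier coordinates, the Jacobian is triangular, its log-determinant is a simple sum, and the map inverts exactly. The conditioner functions `mu` and `alpha` are arbitrary stand-ins for learned networks.

```python
import numpy as np

# Toy autoregressive affine transform: x_i depends only on z_i and x_{1:i-1},
# so the Jacobian is triangular. mu/alpha are hypothetical stand-ins for the
# masked neural networks used in a real flow.
def mu(x_prev):    return 0.5 * x_prev.sum()
def alpha(x_prev): return 0.1 * x_prev.sum()

def forward(z):
    """Latent z -> data x, computed one coordinate at a time."""
    x = np.zeros_like(z)
    for i in range(len(z)):
        x[i] = z[i] * np.exp(alpha(x[:i])) + mu(x[:i])
    return x

def inverse(x):
    """Data x -> latent z; each z_i needs only x_{1:i-1}, so no recursion."""
    return np.array([(x[i] - mu(x[:i])) * np.exp(-alpha(x[:i]))
                     for i in range(len(x))])

z = np.array([0.3, -1.2, 0.7])
x_data = forward(z)
# Triangular Jacobian: the inverse's log-determinant is just -sum_i alpha_i,
# with no matrix determinant required.
log_det = -sum(alpha(x_data[:i]) for i in range(len(x_data)))
assert np.allclose(inverse(x_data), z)   # exact invertibility
```

The same structure, stacked many times with learned conditioners, is what the normalizing flow in this paper uses.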
One common criticism of deep learning models targets their lack of interpretability and potential over-flexibility. Generally, deep learning is recommended only when it gives large performance gains or models a relation that would be extremely difficult to represent with classical machine learning models. As discussed in this paper, the large size of the Gaia DR2 dataset and the non-Gaussian contours of the density of stars on a CMD make a deep neural network a useful model for its scalability and flexibility, so we choose it over traditional machine learning models.
2 Data
Our dataset is a slice of the Gaia DR2 catalog. For technical papers describing the second data release used in this paper, consult Brown et al. (2018); Lindegren et al. (2018); Riello et al. (2018). We make use of the following features: bp_rp, bp_g, phot_g_mean_mag, ra, dec, parallax, along with their corresponding uncertainties, and the features needed to calculate the renormalized unit weight error (RUWE): astrometric_chi2_al and astrometric_n_good_obs_al. We choose not to apply any filters based on parallax or parallax uncertainty, meaning we take all noisy measurements, including negative parallaxes (although we do filter using RUWE, discussed below). We exclude all stars below −30° declination, since the Bayestar dust map of Green et al. (2018), which we query with the package of Green (2018), does not extend there. We also exclude stars lying in holes of this dust map. We include all stars during training, even those with low-SNR parallaxes. However, we make cuts during training such that stars with a bad goodness-of-fit for parallax, which tend to lie in high-density regions of the sky, are excluded. We do this by requiring that the renormalized unit weight error of the measured parallaxes, discussed at https://www.cosmos.esa.int/web/gaia/dr2knownissues#AstrometryConsiderations, is less than 1.4.
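The RUWE cut above can be sketched from the two catalog columns it depends on. This is a hedged illustration: the unit weight error follows the standard Gaia definition, but the normalization factor `u0` is in practice interpolated from the DPAC lookup table as a function of G magnitude and color, and is stubbed as 1.0 here.

```python
import numpy as np

# Sketch of the RUWE filter (u0 = 1.0 is a placeholder; the real u0 comes
# from the Gaia DPAC lookup table as a function of G and BP-RP color).
def ruwe(astrometric_chi2_al, astrometric_n_good_obs_al, u0=1.0):
    # unit weight error: sqrt(chi2 / (N - 5)), 5 astrometric parameters
    uwe = np.sqrt(astrometric_chi2_al / (astrometric_n_good_obs_al - 5.0))
    return uwe / u0

chi2 = np.array([200.0, 5000.0])
n_obs = np.array([205, 205])
keep = ruwe(chi2, n_obs) < 1.4   # the cut used in this paper
```

For these toy values, the first source passes the cut and the second (with a large astrometric chi-squared) is rejected.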
We apply a single global parallax offset of 0.029 mas from Lindegren et al. (2018) to all of the Gaia data. While the true parallax offset is conditional on several variables, we believe that training with a constant parallax offset makes it easier to apply different postprocessing offsets to the finished model. A more complex parallax offset can also be used during evaluation.
3 Model
We wish to train a model that takes bp_rp, bp_g, phot_g_mean_mag, ra, dec, parallax from the Gaia DR2 catalog, along with their uncertainties, and produces a Bayesian posterior over distance. Our model is similar in strategy, though not in the actual density model, to Anderson et al. (2018), who built a model on Gaia DR1 parallaxes cross-matched with 2MASS photometry and applied the “Extreme Deconvolution” algorithm of Bovy et al. (2011) to build a Gaussian Mixture Model (GMM) over color-magnitude values. This GMM is then used as a prior on the same data to build a final parallax probability distribution for every data point. The model described in this paper is similar to this GMM model but differs by using a normalizing flow to model density in photometric space. Various other changes are made; for example, we propagate uncertainty from the dust map both into our training (downweighting photometry from uncertain dust estimates) and during evaluation (sampling the dust map).
Normalizing flows are not yet popular for density estimation in astrophysics, or the natural sciences in general, although they are being used for some likelihood-free inference applications derived from Papamakarios et al. (2018) in cosmology and particle physics, for example in the “MadMiner” package for LHC data (Brehmer et al. 2019) and “PyDELFI” for cosmology (Alsing et al. 2019). This paper represents, to our knowledge, the first application of normalizing flows to learning a stellar CMD.
Our model makes several assumptions which largely follow those made in Anderson et al. (2018):

The fundamental assumption of the model is that for a given star, there will be other stars with similar photometry, as is assumed in any regression-type optimization problem.

We model the contents of Gaia DR2, not the Milky Way. Since parallax measurements are noisier for stars in the halo, which also tend to have lower metallicity, the CMD model will be biased toward the metallicities of the disk. However, we lessen this effect by training on all stars, rather than only those with high-SNR parallaxes.

The model assumes that the expected color-magnitude relation is unchanging with respect to location in the Milky Way, in that we do not include sky position (right ascension and declination) as a parameter in our CMD.

This model makes no further physical assumptions about stellar structure or the color-magnitude relation. This is a trade-off: while it avoids potential inaccuracies of physical models, it also misses their many successes. Future work will attempt to combine physical models with data-driven ones to help regularize the training.

We assume that the Bayestar dust map of Green et al. (2018) produces accurate dust posteriors along each line of sight. We further assume that using the median dust estimate will not bias the model during training. However, we incorporate the full dust posterior during evaluation.

We assume that the Gaia catalog has no parallax bias relative to any parameters (such as proper motion, or a specific sky coordinate such as a high-density region).
A potentially problematic assumption in our model is that the dust map we use, from Green et al. (2018), itself relies on stellar models to estimate dust, meaning that some stellar models are implicitly introduced into our distance estimates. In the future, it would be desirable to combine the learning of a dust map with the learning of a CMD so we could make completely model-independent estimates, but for now, this is our approach. The assumptions of the model described in our paper differ from those in Anderson et al. (2018) in that we do not model the color-magnitude relation using 128 Gaussians; rather, we use a deep normalizing flow, which has more flexibility in describing the joint prior. The path of stellar evolution has many sharp turns and discontinuities, so, while the spread of stars about a single isochrone may be Gaussian-like due to a Gaussian spread in ages and metallicities, the overall CMD is better modeled with a highly flexible model that can express sharp turns; this is where deep learning can be extremely helpful, as GMMs perform poorly at modeling the contours of the CMD. We also deal with a significantly larger dataset, using the majority of DR2 (640M stars) versus a selection of DR1 (1.4M stars), and we incorporate the entirety of the samples from Green et al. (2018), rather than median dust estimates, during evaluation.
The crux of our model is a very flexible “normalizing flow” neural network. A good introduction to this family of models can be found at http://akosiorek.github.io/ml/2018/04/03/norm_flows.html. This architecture models the joint density over Gaia absolute magnitudes, $p(M_G, M_{BP}, M_{RP})$, where $M_G$ is the absolute (dereddened) G-band mean magnitude from Gaia DR2, corresponding to column phot_g_mean_mag, and $M_{BP}$ and $M_{RP}$ are similarly the absolute integrated BP and RP mean magnitudes. This prior density models the dashed lines in the graphical model in fig. 1.
This density models the distribution of stars in Gaia DR2, and we weight the stars observed in Gaia DR2 by a factor that acts as a metric for the signal-to-noise ratio of the data points, much as, when calculating a mean from uncertain measurements, one weights each measurement by its inverse variance. The values entering this weight are the variances of the estimates of each of the quantities; these variances incorporate both the uncertainty of the dust map and the intrinsic uncertainty in the Gaia parallax and distance prior. During training, we sample 32 parallaxes for each Gaia DR2 data point from the truncated normal distribution:
\[ \varpi_s \sim \mathcal{N}_{>0}(\varpi, \sigma_\varpi^2), \tag{1} \]
using the DR2 parallax value for $\varpi$, and $\sigma_\varpi$ equal to the DR2 standard deviation in the parallax value estimate. As stated in Hogg (2018), a likelihood for Gaia parallaxes treats the individual measurements as drawn from $\mathcal{N}(\varpi, \sigma_\varpi^2)$. This truncated distribution therefore acts as a minimal distance prior on the Gaia parallax likelihood, stating only that distances must be positive.
3.1 Model architecture
A normalizing flow can roughly be thought of as a Gaussian that has been parametrically warped by an invertible neural network: a smooth invertible mapping between probability distributions, $z \mapsto x = f(z)$. Our specific model relies on the “Masked Autoregressive Flow” (MAF) variant, described in Papamakarios et al. (2017) for density estimators, which uses the mapping
\[ x_i = z_i \exp(\alpha_i) + \mu_i, \]
for $\mu_i = f_{\mu_i}(x_1, \ldots, x_{i-1})$ and $\alpha_i = f_{\alpha_i}(x_1, \ldots, x_{i-1})$. This equation is similar to the common matrix multiply followed by vector addition found in neural networks, but with a particular mask applied. This is done so that the Jacobian of the mapping is triangular, and hence its determinant is easy to calculate as the product of the elements along the diagonal. Recall that to change variables from $z$ to $x$ via a smooth function $f$, for probability distributions one must write:
\[ p_x(x) = p_z\!\left(f^{-1}(x)\right) \left|\det \frac{\partial f^{-1}}{\partial x}\right|. \]
This mapping is invertible, and the inverse also has a triangular Jacobian. The fact that the Jacobian is easy to calculate lets us normalize the transformation between probability distributions. The inverse is
\[ z_i = (x_i - \mu_i) \exp(-\alpha_i), \]
which transforms from our data variables ($x$) to a latent space ($z$) where we set the Gaussian. This transform is what we compute to calculate the probability of given data (though we use a reparametrized version, so it does not need to be done sequentially). This transform can be repeated to form a complex flow. Autoregressive models can also be used to exactly marginalize over inputs in the case of missing data (such as missing photometry), as discussed in Section
5.
Our model uses a sequence of blocks of MADE → BatchNorm → Reverse, where MADE is the Masked Autoencoder for Distribution Estimation model defined in Germain et al. (2015), Reverse is the reversing layer found in Dinh et al. (2016), and BatchNorm is a batch-norm-like layer (typically used in convolutional neural networks) also defined for normalizing flow models in Dinh et al. (2016). We use the PyTorch code found at https://github.com/ikostrikov/pytorchflows/blob/master/flows.py as the template of our codebase, which we alter for our use case. The MADE model essentially applies three densely-connected neural network layers, with a mask applied to the weights at each layer (hence, a masked autoencoder) to satisfy the properties of the normalizing flow. It forms an autoencoder such that each output unit depends only on the preceding input units ($x_1, \ldots, x_{i-1}$). One can think of this transform as parametrizing an arbitrary bijective vector field, where the vectors show the flow of points from distribution to distribution. The BatchNorm is the equivalent of batch normalization for normalizing flows, and Reverse permutes the order of the probability variables, since the MADE mask treats each variable slightly differently (e.g., the last variable depends on all preceding ones, whereas the first depends on none). Hence the BatchNorm and Reverse layers help regularize training. Defining one block as MADE → BatchNorm → Reverse, our model has a probability distribution with variables. We then specify a hidden dimension size in each MADE of and use sequential blocks.
3.2 Dust estimation
We build the Bayestar dust map from Green et al. (2018) into the model with an iterative approach, using the software package dustmaps from Green (2018). We use extinction values based on a blackbody integrated over a dust model from O’Donnell (1994), using the extinction package from Barbary (2016), which gives us conversions from the Bayestar dust map to the Gaia DR2 bands:
In the future, we would like to expand this graphical model with an estimate of the temperature and other stellar parameters of each star, and to incorporate a map of $R_V$. We would also like to fit a separate graphical model which learns the best extinction conversions to maximize the likelihood over the data.
Training and evaluation take different approaches due to the computational expense. During training, we calculate distances by inverting each of the 32 parallax samples from eq. (1). We take the mean of these distance samples as the current best estimate of the distance. For every best-estimate distance, we query Bayestar at the central ra and dec position. We convert this reddening into each of the Gaia bands, giving us extinction estimates and hence the dereddened colors. Then, using each of the 32 parallax samples, we convert the estimate into 32 samples of the absolute magnitude. Next, using the current model, we calculate the probability of each tuple. These likelihood values are treated as weights, and the weights are used to calculate a new best estimate of the distance via a weighted sum:
\[ \hat{r} = \frac{\sum_s w_s\, r_s}{\sum_s w_s}, \tag{2} \]
where each of the $r_s$ is a distance sample and $w_s$ its likelihood weight. This is then fed back into the loop, and a new reddening is found using Bayestar.
This iteration is repeated 5 times during training, due to the computational expense, but 10 times during evaluation. Once the final distance is given, we calculate a best-estimate value for the dust. We use this to get final estimates of the dereddened photometry. We then calculate the absolute magnitudes using the raw parallax samples and the dereddened photometry. Note that we do not use the likelihood weights to calculate these final values, since this could create a feedback loop for the probability density and create unphysical artifacts, which we observed experimentally. A final value is then found by averaging the samples, and the standard deviation of the samples is used to calculate the uncertainty:
where the added term is the uncertainty in the dust map for each band at the given distance and sky location. We can add these variances because the magnitudes and extinctions are linearly related. We could also choose to estimate this uncertainty using the weights in eq. (2), though we found doing so created artifacts in the density which did not go away with further training.
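The likelihood-weighted distance update of eq. (2) can be sketched as follows. This is a minimal illustration of our own: parallax samples are inverted to distance samples, each is scored by the current CMD model, and the scores act as weights; `toy_likelihood` is a hypothetical stand-in for the learned flow density.

```python
import numpy as np

# Sketch of the eq. (2)-style weighted distance update (illustrative only).
rng = np.random.default_rng(1)
parallax_samples = np.abs(rng.normal(0.5, 0.2, size=32))  # mas, kept positive
r = 1.0 / parallax_samples                                # distances in kpc

def toy_likelihood(r):
    # stand-in for the flow density evaluated at the photometry implied by r
    return np.exp(-0.5 * ((r - 2.0) / 0.5) ** 2)

w = toy_likelihood(r)                       # likelihoods used as weights
r_best = np.sum(w * r) / np.sum(w)          # weighted mean, as in eq. (2)
```

The weighted mean necessarily lies between the smallest and largest distance samples, and samples whose implied photometry is unlikely under the current model contribute little.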
Using the final distance and the dust map, we calculate the dereddened colors, along with the uncertainties in these colors. Finally, we are left with a best-estimate color-magnitude point, together with a measure of its combined uncertainty. Our loss function for this point is its weighted negative log-likelihood under the flow, which we sum over all stars in the DR2 catalog, after calculating the best-estimate color-magnitude points, and minimize.
This algorithm learns a very flexible prior on dereddened color-magnitude tuples for stars in Gaia DR2, which can then be combined with a distance prior and the raw parallax measurements to generate a Bayesian distance estimate for every star in Gaia. The whole process is run iteratively until the dust estimate converges.
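The parallax sampling that feeds this loop (eq. 1) can be sketched with simple rejection sampling. This is a minimal illustration under our own assumptions, not the paper’s implementation:

```python
import numpy as np

# Draw n samples from a normal centered on the catalog parallax, truncated
# to positive values, via rejection sampling (illustrative only).
def sample_truncated_parallax(parallax, parallax_error, n=32, rng=None):
    rng = rng or np.random.default_rng(0)
    samples = np.empty(0)
    while samples.size < n:
        draw = rng.normal(parallax, parallax_error, size=4 * n)
        samples = np.concatenate([samples, draw[draw > 0]])
    return samples[:n]

# A noisy, low-SNR measurement: parallax 0.1 mas with 0.3 mas uncertainty.
s = sample_truncated_parallax(parallax=0.1, parallax_error=0.3)
```

All returned samples are strictly positive, so inverting them always yields finite positive distances, matching the minimal prior described above.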
3.3 Model Optimization
We conduct a hyperparameter search for the normalizing flow over 80 different models, finding the best model using Bayesian optimization with the summed log-likelihood as the optimization metric. The model is trained on the entirety of Gaia DR2 above −30° declination, completing several passes over the data with a minibatch size of 2048. We found this minibatch size balances accuracy: much smaller batch sizes led to artifacts in the density map, and much larger batch sizes resulted in early convergence to less accurate models. During model selection, we randomly initialize the model weights using Xavier initialization (see Glorot & Bengio 2010) with a normal distribution, and train for one pass over Gaia DR2, recording the likelihood over each random one-million-star subset. We explore the (learning rate, layers, hidden units) space such that the model fits in a 16 GB GPU, and record the likelihood as a function of these variables and fractional epochs. We then pass these measurements to a Gaussian Process with a radial basis function kernel, and select the next model architecture by maximizing the likelihood plus the uncertainty in the likelihood. In total, we explore 80 different architectures and find that the best loss was for a model with 35 layers of 500 hidden units each. We also tried models that used mixtures of normalizing flows, but found these underperformed. Finally, we found that using the distance prior from Bailer-Jones et al. (2018) during the distance estimate gave the noisy measurements too much weight: as the parallax uncertainty grows, the posterior approaches the distance prior itself, yet remains finite, so very noisy measurements would have a large effect on the CMD.
4 Results
We first demonstrate the ability of this model to estimate the true color-magnitude diagram of noisy Gaia data on simulated data in Section 4.1. Next, from Section 4.2 onwards, we present the results on real data. The trained normalizing flow is visualized in fig. 3, which shows a clear main sequence and giant branch; some color in the log plot indicates that it has also learned weight in the white dwarf part of the CMD, which previous attempts using full-CMD GMM models have not captured. We will release the catalog, which is described in Table 1, and code from links on https://github.com/MilesCranmer/public_CMD_normalizing_flow.
4.1 Simulation
We simulate a basic Gaia dataset of 30M stars drawn from simple analytic distributions in distance and color space. We apply a known noise-free dust map that is a 2D Gaussian on the sky and uniform along each line of sight. We distribute the stars uniformly over each line of sight (though the distances follow a specific prior; see below). We fit a line in color-magnitude space to the high-SNR Gaia data and use this to sample colors: we randomly sample values from the high-SNR Gaia data, project onto the line to get color and magnitude values, and then randomly perturb these colors with a Gaussian to create the truth CMD shown in fig. 1(a). The distances are randomly sampled from the Bailer-Jones distance prior with scale length :
These are used to create the observed G, BP, and RP bands at Earth, followed by reddening according to our dust map, mapped through fixed extinction conversions for each band. Next, we use a formula to create a parallax measurement error for all the stars, and then sample the observed parallax for each star from a Gaussian with this value as the standard deviation. This results in the adjusted parallax distribution shown in fig. 1(d). Using the inverted parallax as a simple distance estimate, we can visualize the noisy reddened CMD in fig. 1(c). We train a model on this simulation, with blocks of hidden nodes in the same block scheme (MADE, BatchNorm, Reverse) as our real Gaia model, for a few epochs, allowing the model to use the true dust map as part of its iterative dust estimation scheme (though recall it still needs to estimate accurate distances to query the true dust). We then numerically integrate this CMD over the magnitude dimension and visualize it in the same space as fig. 1(a) in fig. 1(b). As can be seen, the reconstructed CMD, without any hyperparameter tuning or extensive training, and from noisy reddened data, is very close to the original.
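The simulated distance and parallax generation can be sketched as follows, with illustrative parameters of our own rather than the paper’s exact setup. A Bailer-Jones-style prior $p(r) \propto r^2 \exp(-r/L)$ is a Gamma distribution with shape 3 and scale $L$, so it can be sampled directly; Gaussian measurement noise then produces the familiar negative observed parallaxes.

```python
import numpy as np

# Toy simulated catalog: distances from p(r) ~ r^2 exp(-r/L) (a Gamma(3, L)
# distribution), true parallaxes as 1/r, then Gaussian measurement noise.
# L and sigma are illustrative values, not the paper's.
rng = np.random.default_rng(2)
L = 1.35                                    # kpc, toy scale length
r_true = rng.gamma(shape=3.0, scale=L, size=30_000)
parallax_true = 1.0 / r_true                # mas, with r in kpc
sigma = 0.3                                 # toy constant parallax error, mas
parallax_obs = rng.normal(parallax_true, sigma)
```

The Gamma mean is shape × scale = 3L, and a noticeable fraction of observed parallaxes come out negative, which is exactly why the model must handle them rather than cut them.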
4.2 Data Products
After training the model on the Gaia dataset, we apply it back to the data to generate a catalog. We choose not to use the Bailer-Jones distance priors, since we found they harmed the accuracy of stellar distance estimates in the halo, being very prior-dominated (as can be seen in fig. 6). Instead, we use a constant distance prior over non-negative distances. The catalog contains the mean and standard deviation of the Bayesian posterior estimate for each of the DR2 sources. We plan on adding another catalog that includes 100 quantiles of the posteriors. Alternatively, one can run our code, which will be added to https://github.com/MilesCranmer/public_CMD_normalizing_flow, to sample from the full posteriors of our trained model. These will be described in the data documentation.
Number  Gaia Value (if relevant)  Model Value

Total Catalog Entries  640,875,169  
Fraction of Stars with  
Fraction of Stars with  
Fraction of Stars with  
Fraction of Stars with  
Fraction of Stars with  
Fraction of Stars with  
SNR  
SNR  
SNR  
SNR  
SNR  
Mean SNR Improvement of the points with  48.6%  
Negative Fraction of Parallaxes  0  
Mean SNR of the points with  1.388  
Average Distance of the points with  5.376 kpc  
Average Distance (weighted by SNR) of the points with  5.845 kpc 
A visualization of the normalizing flow marginalized over can be seen in terms of probability and logprobability density in fig. 3.
The tightening of the CMD from applying this prior to the data can be seen in fig. 3(a) through fig. 3(c); this tightening serves as a visual metric for the improvement in the distance estimates. As seen in Table 1, more stars have high signal-to-noise ratios: only 18.9% of stars have SNR less than 1.0, versus 46.1% in the raw Gaia catalog. The mean SNR improvement among the stars that do not have negative parallaxes is 48.6%.
4.3 Application Examples
Here we describe several potential applications of this dataset: obtaining an estimate of the distance to M67, filtering foreground stars, and substructure detection in three spatial dimensions. First, we visualize the improved distance estimates to stars in the GD-1 stream. We make use of the reduction from Price-Whelan & Bonaca (2018) to select the stars in a strip of the DR2 sky, and then compare the distance estimates from Gaia parallaxes alone, distance estimates from Gaia parallaxes sampled using the Bailer-Jones distance prior, and finally our model distances. As seen in fig. 7, compared to fig. 5 (raw Gaia) and fig. 6 (distance priors), distance estimates to stars in GD-1 are greatly improved with this photometric model. By using the distance estimates from our model, one can perform kinematic searches for substructure farther out, without requiring that associated stars be part of an isochrone with the same metallicity and age. In other words, this enables generic filtering of foreground stars in Gaia data, and also adds a third, distance, dimension for clustering stellar substructures, such as with the STREAMFINDER algorithm of Malhan & Ibata (2018).
Finally, we give an example of estimating the distance to the cluster M67, using the full Bayesian posteriors from our model. The cluster M67 lies at a well-measured distance from Earth (Yakut et al. 2009) and was also used as a demonstration cluster in Anderson et al. (2018). We do this with our model without distance priors (a flat prior over non-negative distances). First, we filter the Gaia DR2 sources to a cone centered on the cluster position. Since inverted Gaia parallaxes are a reasonable estimate by themselves (excluding negative parallaxes), we filter down to only parallaxes with SNR less than some bound, shown as the columns in fig. 8, to demonstrate that our model improves the precision and accuracy of noisy parallaxes. The top row of fig. 8 shows the inverse parallaxes. The middle row of fig. 8 shows the distribution of distances from Gaia parallaxes sampled using the Bailer-Jones et al. (2018) distance prior, and the bottom row of fig. 8 shows the distribution of the model-derived distances presented in this paper.
We then fit a twocomponent Bayesian GMM to every distance distribution, modeling the cluster distance with one component and the background stars with the other. These are overplotted in fig. 8. The model estimates for each SNR are shown in table 2.
Model  M67 Center Estimate (kpc)  Spread (kpc) 

Inverse Parallax with SNR (and )  
Inverse Parallax with SNR (and )  
BailerJones Prior with SNR  
BailerJones Prior with SNR  
Normalizing Flow with SNR  
Normalizing Flow with SNR 
As is evident in table 2, our normalizing flow model improves both the precision and the accuracy of distances to the M67 cluster for the very low-quality parallaxes: not only is the spread in the Gaussian component almost half of that obtained with the Bailer-Jones distance priors, but the estimate derived from the lowest-SNR parallaxes is much closer to our model’s final estimate. Also, since the distance priors rely on other surveys to estimate stellar density in every direction, they implicitly contain models for this particular cluster, which makes this example easier for the distance-prior model. The distance-prior strategy fails in the example of fig. 6, on distance estimates to stars in the GD-1 stellar stream: the estimates fall completely outside the accepted value range. When there are anomalous stars, the prior-derived distance estimates become much less useful, and a photometry model is necessary; fig. 7 shows it giving much more reasonable distances.
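The two-component mixture fit used above can be sketched with a plain EM loop (the paper fits a Bayesian GMM; this is a simplified stand-in, and all numbers here are synthetic, not the paper’s measurements):

```python
import numpy as np

# Two-component 1-D Gaussian mixture via EM: one narrow "cluster" component
# and one broad "background" component, on synthetic distances (kpc).
rng = np.random.default_rng(3)
d = np.concatenate([rng.normal(0.9, 0.05, 500),    # toy cluster at ~0.9 kpc
                    rng.normal(3.0, 1.0, 500)])    # diffuse background

mu = np.array([0.5, 2.0])      # initial means
sig = np.array([0.5, 0.5])     # initial spreads
pi = np.array([0.5, 0.5])      # initial mixture weights
for _ in range(200):
    # E-step: responsibility of each component for each star
    pdf = pi * np.exp(-0.5 * ((d[:, None] - mu) / sig) ** 2) / sig
    resp = pdf / pdf.sum(axis=1, keepdims=True)
    # M-step: update weights, means, spreads
    nk = resp.sum(axis=0)
    pi = nk / len(d)
    mu = (resp * d[:, None]).sum(axis=0) / nk
    sig = np.sqrt((resp * (d[:, None] - mu) ** 2).sum(axis=0) / nk)

cluster = np.argmin(mu)        # the narrow, nearby component
```

The component with the smaller mean recovers the synthetic cluster center, while the other absorbs the background, mirroring how the cluster distance and spread are read off in table 2.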
5 Future work
The Masked Autoencoder architecture from Germain et al. (2015) and Papamakarios et al. (2017), which we use in our technique, models a joint density via the chain rule of conditional probabilities:
\[ p(x_1, \ldots, x_D) = \prod_{i=1}^{D} p(x_i \mid x_1, \ldots, x_{i-1}). \]
This means that we can exactly marginalize over the trailing variables $x_{k+1}, \ldots, x_D$ for any $k$, without additional computation, simply by excluding them from the joint density, as long as we fix an ordering of the variables that places the ones we wish to keep first. In other words, the marginalized probability is:
\[ p(x_1, \ldots, x_k) = \prod_{i=1}^{k} p(x_i \mid x_1, \ldots, x_{i-1}), \]
which uses the same normalizing flow model. While we cannot marginalize over an arbitrary subset of variables without using an ensemble model, if we have photometric bands with different coverage, we can fit a color-magnitude density that simultaneously models many different survey bands, and still train and evaluate the color-magnitude probability with limited information. E.g., if we were to model Gaia and AllWISE bands simultaneously in a color-magnitude density, ordering the Gaia bands before the AllWISE bands, then we could exactly evaluate the color-magnitude density for given Gaia photometry marginalized over the AllWISE bands, but not vice versa. This technique, also used in Alsing et al. (2019), is powerful, and would be useful to exploit for future distance estimates to maximize coverage while using all available surveys. In the future, we would also like to make use of maps of $R_V$, as well as simultaneously model stellar parameters, rather than use extinction conversions based on a simple blackbody.
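The exact-marginalization property can be verified on a toy autoregressive factorization; the Gaussian conditionals below are stand-ins of our own for the flow’s learned conditionals:

```python
import numpy as np

# p(x1, x2, x3) = p(x1) p(x2|x1) p(x3|x2, x1). Dropping the trailing
# conditional gives the exact marginal p(x1, x2), because the omitted
# conditional integrates to one over x3.
LOG2PI = np.log(2 * np.pi)
def logp1(x1):         return -0.5 * x1**2 - 0.5 * LOG2PI
def logp2(x2, x1):     return -0.5 * (x2 - 0.5 * x1)**2 - 0.5 * LOG2PI
def logp3(x3, x2, x1): return -0.5 * (x3 - x2)**2 - 0.5 * LOG2PI

x1, x2 = 0.2, -0.4
log_marginal = logp1(x1) + logp2(x2, x1)   # marginal over x3, for free
# Numerical check: integrate the full joint over x3 on a fine grid.
x3 = np.linspace(-10.0, 10.0, 20001)
joint = np.exp(logp1(x1) + logp2(x2, x1) + logp3(x3, x2, x1))
integral = joint.sum() * (x3[1] - x3[0])
```

The brute-force integral over the last variable matches the chain-rule marginal, which is exactly what lets a survey with missing bands be evaluated under the same trained flow.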
6 Conclusion
We have demonstrated an algorithm for learning a flexible probability distribution in Gaia color-magnitude space from noisy parallax and photometry measurements using a normalizing flow. These deep neural networks, capable of learning arbitrary multidimensional probability distributions, have been shown in this paper to model CMDs well and to predict CMDs accurately within an iterative dust estimation scheme. We have also presented a catalog of 640M photometric distance posteriors derived from this data-driven model, using Gaia DR2 photometry and parallaxes to learn a prior in Gaia color space. Overall, the signal-to-noise of the distance measurements in this catalog improves on average by 48% over the raw Gaia data (counting only the non-negative Gaia parallaxes), and we also demonstrate how the accuracy of the distances has improved over other models. Applications of this catalog are discussed, including significantly improved Milky Way disk separation and substructure detection.
We also will maintain a GitHub repository at https://github.com/MilesCranmer/public_CMD_normalizing_flow where we will post links to the distance catalog along with future versions, host code of the normalizing flow for photometry data, and answer questions about the paper and implementing the algorithm.
7 Acknowledgments
Miles Cranmer would like to thank David W. Hogg for the initial project idea of improving Gaia distances with a generative neural network model, Johann Brehmer and Thomas Kipf for the idea of creating this model with a normalizing flow, Wolfgang Kerzendorf for feedback on calculating extinction conversions, Gregory Green for help with his dustmap package, George Papamakarios for the idea of marginalizing over input to the normalizing flow using Papamakarios et al. (2017), and Iain Murray for comments on an early draft. This work made use of Astropy (Robitaille et al. 2013), PyTorch (Paszke et al. 2017), ScikitLearn (Pedregosa et al. 2011), among other scientific packages mentioned in the paper.
References
 Alsing et al. (2019) Alsing, J., Charnock, T., Feeney, S., & Wandelt, B. 2019, MNRAS, stz1960, doi: 10.1093/mnras/stz1960
 Anderson et al. (2018) Anderson, L., Hogg, D. W., Leistedt, B., Price-Whelan, A. M., & Bovy, J. 2018, AJ, 156, 145, doi: 10.3847/1538-3881/aad7bf
 Bailer-Jones (2015) Bailer-Jones, C. A. L. 2015, PASP, 127, 994, doi: 10.1086/683116
 Bailer-Jones et al. (2018) Bailer-Jones, C. A. L., Rybizki, J., Fouesneau, M., Mantelet, G., & Andrae, R. 2018, AJ, 156, 58, doi: 10.3847/1538-3881/aacb21
 Barbary (2016) Barbary, K. 2016, extinction v0.3.0, doi: 10.5281/zenodo.804967
 Bonaca et al. (2019) Bonaca, A., Hogg, D. W., Price-Whelan, A. M., & Conroy, C. 2019, ApJ, 880, 38, doi: 10.3847/1538-4357/ab2873
 Bovy (2017) Bovy, J. 2017, MNRAS, 470, 1360, doi: 10.1093/mnras/stx1277
 Bovy et al. (2011) Bovy, J., Hogg, D. W., & Roweis, S. T. 2011, Annals of Applied Statistics, 5, 1657, doi: 10.1214/10-AOAS439
 Brehmer et al. (2019) Brehmer, J., Kling, F., Espejo, I., & Cranmer, K. 2019, arXiv eprints, arXiv:1907.10621
 Brown et al. (2018) Brown, A. G. A., Vallenari, A., Prusti, T., et al. 2018, A&A, 616, A1, doi: 10.1051/0004-6361/201833051
 Dinh et al. (2016) Dinh, L., Sohl-Dickstein, J., & Bengio, S. 2016, arXiv:1605.08803 [cs, stat]
 Germain et al. (2015) Germain, M., Gregor, K., Murray, I., & Larochelle, H. 2015, arXiv:1502.03509 [cs, stat]
 Glorot & Bengio (2010) Glorot, X., & Bengio, Y. 2010, in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 249–256
 Green (2018) Green, G. 2018, The Journal of Open Source Software, 3, 695, doi: 10.21105/joss.00695
 Green et al. (2019) Green, G. M., Schlafly, E. F., Zucker, C., Speagle, J. S., & Finkbeiner, D. P. 2019, arXiv eprints, arXiv:1905.02734
 Green et al. (2018) Green, G. M., Schlafly, E. F., Finkbeiner, D., et al. 2018, MNRAS, 478, 651, doi: 10.1093/mnras/sty1008
 Hall et al. (2019) Hall, O. J., Davies, G. R., Elsworth, Y. P., et al. 2019, MNRAS, 486, 3569, doi: 10.1093/mnras/stz1092
 Hawkins et al. (2017) Hawkins, K., Leistedt, B., Bovy, J., & Hogg, D. W. 2017, MNRAS, 471, 722, doi: 10.1093/mnras/stx1655
 Hogg (2018) Hogg, D. W. 2018, ArXiv eprints, 1804, arXiv:1804.07766
 Huber et al. (2017) Huber, D., Zinn, J., Bojsen-Hansen, M., et al. 2017, ApJ, 844, 102, doi: 10.3847/1538-4357/aa75ca
 Koposov et al. (2019) Koposov, S. E., Belokurov, V., Li, T. S., et al. 2019, MNRAS, 485, 4726, doi: 10.1093/mnras/stz457
 Leistedt & Hogg (2017) Leistedt, B., & Hogg, D. W. 2017, AJ, 154, 222, doi: 10.3847/1538-3881/aa91d5
 Lindegren et al. (2018) Lindegren, L., Hernández, J., Bombrun, A., et al. 2018, A&A, 616, A2, doi: 10.1051/0004-6361/201832727
 Malhan & Ibata (2018) Malhan, K., & Ibata, R. A. 2018, MNRAS, 477, 4063, doi: 10.1093/mnras/sty912
 Malhan et al. (2018) Malhan, K., Ibata, R. A., & Martin, N. F. 2018, MNRAS, 481, 3442, doi: 10.1093/mnras/sty2474
 O’Donnell (1994) O’Donnell, J. E. 1994, ApJ, 422, 158, doi: 10.1086/173713
 Papamakarios et al. (2017) Papamakarios, G., Pavlakou, T., & Murray, I. 2017, arXiv eprints, arXiv:1705.07057
 Papamakarios et al. (2018) Papamakarios, G., Sterratt, D. C., & Murray, I. 2018, arXiv:1805.07226 [cs, stat]
 Paszke et al. (2017) Paszke, A., Gross, S., Chintala, S., et al. 2017, in NIPS Autodiff Workshop
 Pedregosa et al. (2011) Pedregosa, F., Varoquaux, G., Gramfort, A., et al. 2011, Journal of Machine Learning Research, 12, 2825
 Price-Whelan & Bonaca (2018) Price-Whelan, A. M., & Bonaca, A. 2018, ApJ, 863, L20, doi: 10.3847/2041-8213/aad7b5
 Riello et al. (2018) Riello, M., De Angeli, F., Evans, D. W., et al. 2018, A&A, 616, A3, doi: 10.1051/0004-6361/201832712
 Robitaille et al. (2013) Robitaille, T. P., Tollerud, E. J., Greenfield, P., et al. 2013, A&A, 558, A33, doi: 10.1051/0004-6361/201322068
 Yakut et al. (2009) Yakut, K., Zima, W., Kalomeni, B., et al. 2009, A&A, 503, 165, doi: 10.1051/0004-6361/200911918