Model Comparison of Dark Energy models Using Deep Network

07/01/2019 ∙ by Shi-Yu Li, et al. ∙ Beijing Normal University 0

This work uses the combination of the variational auto-encoder and the generative adversarial network to compare different dark energy models in the light of the observations, e.g., the distance modulus from SNIa. The network finds the analytical variational approximation to the true posterior of the latent parameters of the models, yielding consistent model comparison results to those derived by the standard Bayesian method which suffers from the computationally expensive integral over the parameters in the product of the likelihood and the prior. The parallel computation nature of the network together with the stochastic gradient descent optimization technique lead to an efficient way to comparison the physical models given a set of observations. The converged network also provides interpolation to dataset which is useful for data reconstruction.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

It is well known that the predictions of the universe from CDM are in perfect concordance with the observations of the Cosmic Microwave Background (CMB)(Aghanim et al. 2018), Supernova Type-Ia(Betoule et al. 2014), and the Baryon Acoustic Oscillations (BAO)(Alam et al. 2017), making CDM the standard paradigm in cosmology. Such a successful model, however, still has its own theoretical problems which are known as fine tuning and cosmic coincidence(Sahni 2002; Peebles & Ratra 2003). More over, a few observations such as the Hubble parameter at high redshift(Delubac et al. 2015) and the linear redshift-space distortions(Macaulay et al. 2013) have shown tensions with CDM. All of these motivate the research of the universe that allows time-evolving dark energy. People developed different evolving scalar fields to describe the evolution of the dark energy, such as the canonical scalar fields(Caldwell et al. 1998), the phantom fields(Caldwell et al. 2003; Elizalde et al. 2004; Scherrer & Sen 2008). Various parametrisations of evolving dark energy that broadly describes a large number of scalar field dark energy models are also proposed, such as the CPL (Chevallier & Polarski 2001) and GCG (Thakur et al. 2012). Given a specific model and a set of cosmological data, one can study the evolution of the dark energy conveniently.

Then a question of model choice naturally arises with the development of various dark energy models. A variety of methods such as -test, Akaike information criterion (AIC)(Penny et al. 2006), Mallows , Bayesian information criterion (BIC)(Penny et al. 2006), minimum description length (MDL) (Rissanen 1978), Bayesian model averaging are proposed to select a good or useful model in the light of the observations( ). MacKay 1992 strongly recommend to use Bayesian evidence to assign preferences to alternative models since the evidence is the Bayesian’s transportable quantity between models, and the popular easy-to-use AIC and BIC as well as MDL methods are all approximations to the Bayesian evidence (Penny et al. 2006). The Bayesian evidence for model selection has been applied to the study of cosmology for a long time (Trotta 2008; Martin et al. 2011; Lonappan et al. 2018; Basilakos et al. 2018), and recently a detailed study of Bayesian evidence for a large class of cosmological models taking into account around 21 different dark energy models are performed by Lonappan et al. 2018

. Although the Bayesian evidence remains the preferred method compared with information criterions, a full Bayesian inference for model selection is very computationally expensive and often suffers from multi-modal posteriors and parameter degeneracies, which lead to a large time consumption to obtain the final result.

The variational auto-encoder (VAE) and the generative adversarial network (GAN) which build upon the variational Bayes theory provide an efficient way to tackle with the model selection problem. The VAE (Kingma & Welling 2014

) has the ability to approximate the generative process (generate the observed data given the model parameters) and the inference process (infer the model parameters given the observations) which allow one to interpolate between the observed values, thus is useful in the reconstruction problem. The GAN with semi-supervised learning (

Goodfellow et al. 2014; Salimans et al. 2016

) has the ability to learn effectively the distribution of the data, and assign probabilities to different models where the data may come from. Thus the combination of VAE and GAN brings us a novel and convenient way to do data reconstruction and model selection at the same time. Since the variational method provides an analytical approximation of the posterior, it is possible to use the fast gradient descent method to find the constrain of the parameters rather than using the Monte Carlo Markov Chain which may suffer from a low acceptance ratio if the posterior is ill-posed. Moreover, the variational method benefits from the natural parallelization of the network computation which can be accelerated by the GPU cards.

In this article, we use the VAE-GAN network to learn the distribution of the distance moduli in the CDM, CDM and CPL universe, then feed the observations of the SNIa to the network to reconstruct the dark energy and discriminate the most probable model. The statistical background of the VAE and the GAN is briefly reviewed in Section 2, the model structure is described at the end of this section. In Section 3, two toy models are created to test the reconstruction and model discrimination ability of the network. Section 4 describes the observables used in this work, the generation of the training set is introduced. Section 5 reports and discusses the results of the data reconstruction and the model comparison given by the network, some prospects that extend the current work follows the discussion.

2 The VAE-GAN Network

The VAE-GAN network proposed by Larsen (Larsen et al. 2016) combines a variational auto-encoder with a generative adversarial network, aiming to use the learned feature representations in the GAN discriminator as the basis for the VAE reconstruction objective, results in a better capture of the data distribution, improving the quality of the inference and the generative process of the network in the light of data. This section briefly reviews the background of the VAE and GAN and then introduce the method to do model selection and data reconstruction using VAE-GAN.

2.1 The Variational Autoencoder

A VAE (Kingma & Welling 2014) consist of an encoder and a decoder . The decoder mimics the generative process of a model or a natural phenomenon once given the model parameters or the latent variables , yielding the likelihood distribution of the data . The encoder approximates the inverse process that given a set of observations it infers the posterior distribution of the model parameters or the latent variables .

The optimal and are obtained by maximizing the lower bound of the marginal likelihood of the observations via variational bayes (Penny et al. 2006; Kingma & Welling 2014),

(1)

Here, the first item

is the Kullback-Leibler divergence which measures the difference between two distributions.

is the prior distribution of the latent variables. is the likelihood of the data. The marginal likelihood equals to its lower bound if and only if the approximate posterior is the same with the true posterior . Eq.1 implies that the variational optimal encoder and decoder should constrain the posterior close to the prior while keeping the likelihood as large as possible.

2.2 The Generative Adversarial Network

A GAN (Goodfellow et al. 2014) consists of a generator and a discriminator . The generator functions similarly as the decoder that it maps the latent variables to the data space , the difference is that the mapping is determinant and the sampling process happens only at the latent space. The discriminator assigns probability to to tell how possible it is the real data (not produced by the generator). The optimal and are obtained by searching the Nash equilibrium of the minmax game with the value function:

(2)

where the is the distribution of the real data. A small modification of the game (Salimans et al. 2016) allows

to classify

into one of K+1 possible classes, for example, to tell which one of the K classes of dark energy models is the most probable that is generated from, or is just the output of the generator thus belongs to the -th class.

(3)

Here is the label of the model. corresponds to the in Eq.2, giving the probability that is classified into real.

is the joint distribution of the real data and the model class.

is the probability that is classified to the right model .

2.3 Training Algorithm

The combination of VAE and GAN provides a convenient way to do data interpolation and model selection at the same time, once a set of optimal is obtained by optimizing Eq.1 and Eq.3. The basic logic of VAE-GAN network is shown in Figure 1. The observed data is fed to the encoder to find the posterior. Then is sampled from the posterior and fed to the decoder/generator to derive the reconstruction (or interpolation) of the input data . Finally, the discriminator/classifier extracts the useful features from the reconstruction to derive the probability that belongs to a certain model.

Figure 1: The structure of the VAE-GAN network (reproduced from Larsen et al. 2016 with an additional classifier described in Salimans et al. 2016).

Now the remaining question is that given a set of observations and their covariance as well as a set of model candidates , how to find the optimal . Since Eq. 1 and Eq.3

set constraints on functions, any flexible functions that have learning abilities can fit in this work. A possible choice is the convolutional neural network (CNN) which is good at representation learning and shift-invariant feature extraction. Suppose

are CNNs whose parameters are respectively. One can generate a batch of training samples from the model candidates and train the networks on this fixed data using stochastic gradient descent:

  1. Pick up from the training samples and retrieve the observed part , is the class label. Adding multivariate Gaussian random noise to yields ;

  2. Feed to the encoder to get the posterior , then calculate the KL divergence (corresponding to the first item in Eq.1

    ). Suppose the posterior is a multivariate Gaussian distribution with diagonal covariance,

    , and the prior is the standard normal distribution,

    , then the KL divergence can be analytically written as , here the square and sum operations are element-wise (Kingma & Welling 2014). Find the gradient of the KL divergence with respect to the parameter of the encoder,

    (4)
  3. Sample from the posterior and feed it to the generator to obtain the reconstruction . The observed part of the reconstruction together with and give the negative log likelihood (corresponding to the second item in Eq.1), where is the misfit that is broadly used in the model regression problems. The is the normalization constant of the likelihood, having the value of , here is the dimension of the covariance . Since likelihood depends on both the encoder and the generator, its gradient provides update to ,

    (5)
  4. Sample from the prior and feed to to generate a new sample . Feed to the discriminator

    to obtain the logits

    which can be interpreted into probabilities, e.g., , is the -th element of the logits . Substituting the probabilities to Eq.3 yields,

    (6)

    The gradient of provides the modification to ,

    (7)
  5. Update the parameters of the encoder, the generator and the discriminator using a learning rate of ,

    (8)

The training process can be easily generalized to mini-batch training to obtain a faster convergence rate. Several training techniques that stabilize or accelerate the training process are also applicable in this problem (Radford et al. 2015; Salimans et al. 2016; Sønderby et al. 2017; Mescheder et al. 2018).

The encoder consists of four 1-D convolutional layers and two dense layers. Each layer is followed by a batch renormalization layer (Ioffe 2017

) and an activation layer with Leaky Rectified Linear Unit (a simple variant to ReLU

Nair & Hinton 2010), except the last layer which acts as the output. The input of the encoder has a size of 580 which is the number of the distance modulus in SNIa Union2.1 dataset later introduced in Section 4

. The dimension of the latent variable should be no less than the number of the parameters in the physics models used in the problem, and we set it 20. The size of the convolutional kernel is fixed to 7 and the stride is 4 except for the input layer whose convolutional kernel and stride are of size 69 and 1 respectively. The generator consists of one dense layer and four 1-D fractional convolutional layers. The sizes of the convolutional kernel and the stride are the same as the encoder (because it is the inverse process of the encoder), except that the output of the last layer has a dimension of 2048. The discriminator consists of four 1-D convolutional layers and one dense layers. The configuration is similar to the encoder, except that the size of the input and the output are 2048 and

respectively.

3 Tests On Toy Models

This section creates two toy models to test the data reconstruction and the model comparison ability of the network.

Model 1,

(9)

Model 2,

(10)

Model 1 and Model 2 have similar distributions as shown Figure 2. Given the observations which are generated by the underlying model on with an error matrix , we would like to fit the two toy models to the observations to tell which one is most possible to be the true model, and interpolate the data with the model at , for example, evenly stay in the interval with .

Figure 2: The distribution of the outputs of the toy models.

First we concatenate and sort and , and call the new one . Then sample from the priors of the toy models and generate the training samples (Note that it depends on the model that which set of parameters should be used). Here 12800 samples for each model are generated as the training dataset. Finally the training set together with the observation error are fed to the network. Once the training converges, one can put the observations into the network to tell which toy model is most probable and get the interpolation, see Figure 3. In this task, the discriminator has a classification accuracy of almost 1. It assigns a probability of 97% to the parabolic model (Model 1), which is indeed the case.

Figure 3: Reconstruction of the data. The red dots represent the observed data with noise characterized by . The blue line shows the true model where the observations are generated from. The black line is the reconstruction of the data by the network given the observations.

4 The Dataset

4.1 The Observations

The observations are from the Union2.1 Supernova Type-Ia compilation (Suzuki et al. 2012) which contains 580 SNIa. Union2.1 provides the distance modulus with their covariance matrix. Let denote the redshift of the 580 SNIa, and denote the measured distance moduli, denote the covariance of the distance moduli with systematics.

4.2 The Training Set

We study the model comparison problem among three dark energy models: 1) ( CDM); 2) (CDM); 3) (CPL), given a set of observations of distance modulus at different redshift. The expansion rate of a spatially flat FRW universe is determined by the matter and the dark energy,

(11)

The luminosity distance is closely related to the hubble expansion rate (Eq.12), and the distance modulus is given by Eq.13.

(12)
(13)

For each dark energy model, 12800 samples are generated at the redshift , given the priors of the parameters as,

(14)

has 1468 elements evenly located in the interval, . The samples are fixed as the training set.

5 Results & Discussions

Figure 4: Reconstruction of the distance modulus by the network.

Figure 4 shows the reconstruction of the distance modulus given by the network. Given the observed distance modulus , the discriminator assigns probability to each model with,

(15)

Giving the conclusion that the CDM is slightly more favoured than the other two models while CPL is the least favoured in the light of the observations. This result is consistent with the one derived by the Bayesian evidence method in Lonappan et al. 2018 which find the log evidence of each model to be . These evidences can be interpreted into probabilities 66.2%, 20.7%, 13.1%, respectively, given the non-informative prior .

The classification accuracy on the three models with the observation error is reported as 47.9%. The accuracy is subjectively low, though, it is not unexpected. The integration operations in Eq. 11 and Eq. 12 which act as low-pass filters smooth out the local high frequency features that are useful for model comparison. Thus the convolutional kernel in the network need to search for useful low frequency features which are less informative once the output distributions of the models overlap with each other. This is the case we meet in this problem. If one use to represent the model’s prediction of the distribution of the data (it is the evidence of the model), then the theoretical optimal discriminator assigns each model with a probability of . Once drops in the overlapped region where , the discriminator loses its ability to discriminate the models confidently.

Note that CDM is a special case of CDM while the later is a special case of CPL model, that means there is always a region overlapped among the data distributions of the three models. If the measurements of the distance moduli are accurate enough, the region covered by is negligible compared to , and the later is negligible compared to . Thus randomly generated by has an extremely low probability to drop in the region of and reversely a high probability is assigned to that it comes from if it drops in the region of , this situation is also fit to the comparison between and . Then the discriminator has a great confidence to tell which model comes from.

Figure 5: The normalized distributions of the projections of the training data to the 1-st and the 4-th principal components (PC). The red dashed line represents the projection of the Union 2.1 to the PCs. (a) The distribution of projection on the 1-st PC with no observation errors; (b) The distribution of the projection on the 1-st PC with the covariance matrix from the Union 2.1; (c) The distribution of projection on the 4-th PC with no observation errors; (d) The distribution of the projection on the 4-th PC with the covariance matrix from the Union 2.1. (a) and (c) share the same set of PCs, while (b) and (d) share another set of PCs.

Figure 5 is an illustration of the discussion above. The left column shows the normalized histograms of projections of the training samples to its 1-st and 4-th principal components (PC) with no observation errors. The upper left panel reveals that the low frequency part (1-st PC) of the model contributes little to the model discrimination, because the projection of the Union 2.1 data to the 1st PC drops in the region where all the model have similar probabilities. The lower left panel shows that the high frequency part (4-th PC) of the model is useful for model discrimination, since the projection of the Union 2.1 data to the 4-th PC locates in the region where the model have obviously different probabilities.

The discriminator degrades, however, if a non-zero observation error involves in the problem. The overlapped region expands due to the errors so that it is not negligible anymore. An extreme limit is that the errors become infinity, thus the distribution of the three models become the same so that the discriminator can only make a random guess which model is true. In this situation the accuracy degrades to . The finite observation errors lead to a non-negligible intersection where the discriminator lose the ability to tell confidently which model the data comes from. This is illustrated at the bottom row of Figure 5. The lower right panel shows the distribution of 4-th PC scores of the training samples with the covariance matrix of Union 2.1. The projection of the Union 2.1 data to the 4-th PC now locates in the region where the different models have the similar probabilities. This explains why the network yields a result that is in good concordance with the standard Bayesian analysis but has a subjectively low classification accuracy - the model classification accuracy is intrinsically determined by the nested structure of the three models as well as the observation noise. The variational network successfully learn the posterior distribution and the likelihood distribution to produce a consistent result.

Although this work uses the distance modulus for model comparison and data reconstruction, it is easy to extend the scenario to Hubble parameters or other dataset. The framework should be further considered to include not only but also its -th derivatives , e.g., both the angular diameter distance and the hubble parameter measured by BAO, to let the encoder and the discriminator benefit from different datasets. Another improvement of the framework is to allow the reconstruction of the data to implicitly include a model averaging process which will enhance the generalization of the reconstruction, for example, extend the VAE-GAN to its more powerful variant CVAE-GAN (Bao et al. 2017). These are left to the future works.

Acknowledgements.
This work was funded by the National Science Foundation of China (Grants No.11573006, 11528306), National Key R&D Program of China (2017YFA0402600), the 13th Five-year Informatization Plan of Chinese Academy of Sciences, Grant No. XXH13505-04.

References

  • Aghanim et al. (2018) Aghanim, N., Akrami, Y., Ashdown, M., et al. 2018, arXiv preprint arXiv:1807.06209
  • Alam et al. (2017) Alam, S., Ata, M., Bailey, S., et al. 2017, Monthly Notices of the Royal Astronomical Society, 470, 2617
  • Bao et al. (2017)

    Bao, J., Chen, D., Wen, F., Li, H., & Hua, G. 2017, in Proceedings of the IEEE International Conference on Computer Vision, 2745

  • Basilakos et al. (2018) Basilakos, S., et al. 2018, The European Physical Journal C, 78, 889
  • Betoule et al. (2014) Betoule, M., Kessler, R., Guy, J., et al. 2014, Astronomy & Astrophysics, 568, A22
  • Caldwell et al. (1998) Caldwell, R. R., Dave, R., & Steinhardt, P. J. 1998, Physical Review Letters, 80, 1582
  • Caldwell et al. (2003) Caldwell, R. R., Kamionkowski, M., & Weinberg, N. N. 2003, Physical Review Letters, 91, 071301
  • Chevallier & Polarski (2001) Chevallier, M., & Polarski, D. 2001, International Journal of Modern Physics D, 10, 213
  • Delubac et al. (2015) Delubac, T., Bautista, J. E., Rich, J., et al. 2015, Astronomy & Astrophysics, 574, A59
  • Elizalde et al. (2004) Elizalde, E., Nojiri, S., & Odintsov, S. D. 2004, Physical Review D, 70, 043539
  • Goodfellow et al. (2014) Goodfellow, I., Pouget-Abadie, J., Mirza, M., et al. 2014, in Advances in Neural Information Processing Systems 27, ed. Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, & K. Q. Weinberger (Curran Associates, Inc.), 2672
  • Ioffe (2017) Ioffe, S. 2017, in Advances in neural information processing systems, 1945
  • Kingma & Welling (2014) Kingma, D. P., & Welling, M. 2014, in Second International Conference on Learning Representations, ICLR
  • Larsen et al. (2016)

    Larsen, A. B. L., S?nderby, S. K., Larochelle, H., & Winther, O. 2016, in Proceedings of Machine Learning Research, Vol. 48, Proceedings of The 33rd International Conference on Machine Learning, ed. M. F. Balcan & K. Q. Weinberger (New York, New York, USA: PMLR), 1558

  • Lonappan et al. (2018) Lonappan, A. I., Kumar, S., Dinda, B. R., Sen, A. A., et al. 2018, Physical Review D, 97, 043524
  • Macaulay et al. (2013) Macaulay, E., Wehus, I. K., & Eriksen, H. K. 2013, Physical review letters, 111, 161301
  • MacKay (1992) MacKay, D. J. 1992, Neural computation, 4, 415
  • Martin et al. (2011) Martin, J., Ringeval, C., & Trotta, R. 2011, Physical Review D, 83, 063524
  • Mescheder et al. (2018) Mescheder, L., Nowozin, S., & Geiger, A. 2018, in International Conference on Machine Learning (ICML)
  • Nair & Hinton (2010) Nair, V., & Hinton, G. E. 2010, in Proceedings of the 27th international conference on machine learning (ICML-10), 807
  • Peebles & Ratra (2003) Peebles, P. J. E., & Ratra, B. 2003, Reviews of modern physics, 75, 559
  • Penny et al. (2006) Penny, W., Mattout, J., & Trujillo-Barreto, N. 2006, Statistical Parametric Mapping: The analysis of functional brain images. London: Elsevier
  • Radford et al. (2015) Radford, A., Metz, L., & Chintala, S. 2015, arXiv preprint arXiv:1511.06434
  • Rissanen (1978) Rissanen, J. 1978, Automatica, 14, 465
  • Sahni (2002) Sahni, V. 2002, Classical and Quantum Gravity, 19, 3435
  • Salimans et al. (2016) Salimans, T., Goodfellow, I., Zaremba, W., et al. 2016, in Advances in Neural Information Processing Systems 29, ed. D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Curran Associates, Inc.), 2234
  • Scherrer & Sen (2008) Scherrer, R. J., & Sen, A. 2008, Physical Review D, 78, 067303
  • Sønderby et al. (2017) Sønderby, C. K., Caballero, J., Theis, L., Shi, W., & Huszár, F. 2017
  • Suzuki et al. (2012) Suzuki, N., Rubin, D., Lidman, C., et al. 2012, The Astrophysical Journal, 746, 85
  • Thakur et al. (2012) Thakur, S., Nautiyal, A., Sen, A. A., & Seshadri, T. 2012, Monthly Notices of the Royal Astronomical Society, 427, 988
  • Trotta (2008) Trotta, R. 2008, Contemporary Physics, 49, 71