1 Introduction
It is well known that the predictions of the universe from CDM are in perfect concordance with the observations of the Cosmic Microwave Background (CMB)(Aghanim et al. 2018), Supernova TypeIa(Betoule et al. 2014), and the Baryon Acoustic Oscillations (BAO)(Alam et al. 2017), making CDM the standard paradigm in cosmology. Such a successful model, however, still has its own theoretical problems which are known as fine tuning and cosmic coincidence(Sahni 2002; Peebles & Ratra 2003). More over, a few observations such as the Hubble parameter at high redshift(Delubac et al. 2015) and the linear redshiftspace distortions(Macaulay et al. 2013) have shown tensions with CDM. All of these motivate the research of the universe that allows timeevolving dark energy. People developed different evolving scalar fields to describe the evolution of the dark energy, such as the canonical scalar fields(Caldwell et al. 1998), the phantom fields(Caldwell et al. 2003; Elizalde et al. 2004; Scherrer & Sen 2008). Various parametrisations of evolving dark energy that broadly describes a large number of scalar field dark energy models are also proposed, such as the CPL (Chevallier & Polarski 2001) and GCG (Thakur et al. 2012). Given a specific model and a set of cosmological data, one can study the evolution of the dark energy conveniently.
Then a question of model choice naturally arises with the development of various dark energy models. A variety of methods such as test, Akaike information criterion (AIC)(Penny et al. 2006), Mallows , Bayesian information criterion (BIC)(Penny et al. 2006), minimum description length (MDL) (Rissanen 1978), Bayesian model averaging are proposed to select a good or useful model in the light of the observations( ). MacKay 1992 strongly recommend to use Bayesian evidence to assign preferences to alternative models since the evidence is the Bayesian’s transportable quantity between models, and the popular easytouse AIC and BIC as well as MDL methods are all approximations to the Bayesian evidence (Penny et al. 2006). The Bayesian evidence for model selection has been applied to the study of cosmology for a long time (Trotta 2008; Martin et al. 2011; Lonappan et al. 2018; Basilakos et al. 2018), and recently a detailed study of Bayesian evidence for a large class of cosmological models taking into account around 21 different dark energy models are performed by Lonappan et al. 2018
. Although the Bayesian evidence remains the preferred method compared with information criterions, a full Bayesian inference for model selection is very computationally expensive and often suffers from multimodal posteriors and parameter degeneracies, which lead to a large time consumption to obtain the final result.
The variational autoencoder (VAE) and the generative adversarial network (GAN) which build upon the variational Bayes theory provide an efficient way to tackle with the model selection problem. The VAE (Kingma & Welling 2014
) has the ability to approximate the generative process (generate the observed data given the model parameters) and the inference process (infer the model parameters given the observations) which allow one to interpolate between the observed values, thus is useful in the reconstruction problem. The GAN with semisupervised learning (
Goodfellow et al. 2014; Salimans et al. 2016) has the ability to learn effectively the distribution of the data, and assign probabilities to different models where the data may come from. Thus the combination of VAE and GAN brings us a novel and convenient way to do data reconstruction and model selection at the same time. Since the variational method provides an analytical approximation of the posterior, it is possible to use the fast gradient descent method to find the constrain of the parameters rather than using the Monte Carlo Markov Chain which may suffer from a low acceptance ratio if the posterior is illposed. Moreover, the variational method benefits from the natural parallelization of the network computation which can be accelerated by the GPU cards.
In this article, we use the VAEGAN network to learn the distribution of the distance moduli in the CDM, CDM and CPL universe, then feed the observations of the SNIa to the network to reconstruct the dark energy and discriminate the most probable model. The statistical background of the VAE and the GAN is briefly reviewed in Section 2, the model structure is described at the end of this section. In Section 3, two toy models are created to test the reconstruction and model discrimination ability of the network. Section 4 describes the observables used in this work, the generation of the training set is introduced. Section 5 reports and discusses the results of the data reconstruction and the model comparison given by the network, some prospects that extend the current work follows the discussion.
2 The VAEGAN Network
The VAEGAN network proposed by Larsen (Larsen et al. 2016) combines a variational autoencoder with a generative adversarial network, aiming to use the learned feature representations in the GAN discriminator as the basis for the VAE reconstruction objective, results in a better capture of the data distribution, improving the quality of the inference and the generative process of the network in the light of data. This section briefly reviews the background of the VAE and GAN and then introduce the method to do model selection and data reconstruction using VAEGAN.
2.1 The Variational Autoencoder
A VAE (Kingma & Welling 2014) consist of an encoder and a decoder . The decoder mimics the generative process of a model or a natural phenomenon once given the model parameters or the latent variables , yielding the likelihood distribution of the data . The encoder approximates the inverse process that given a set of observations it infers the posterior distribution of the model parameters or the latent variables .
The optimal and are obtained by maximizing the lower bound of the marginal likelihood of the observations via variational bayes (Penny et al. 2006; Kingma & Welling 2014),
(1) 
Here, the first item
is the KullbackLeibler divergence which measures the difference between two distributions.
is the prior distribution of the latent variables. is the likelihood of the data. The marginal likelihood equals to its lower bound if and only if the approximate posterior is the same with the true posterior . Eq.1 implies that the variational optimal encoder and decoder should constrain the posterior close to the prior while keeping the likelihood as large as possible.2.2 The Generative Adversarial Network
A GAN (Goodfellow et al. 2014) consists of a generator and a discriminator . The generator functions similarly as the decoder that it maps the latent variables to the data space , the difference is that the mapping is determinant and the sampling process happens only at the latent space. The discriminator assigns probability to to tell how possible it is the real data (not produced by the generator). The optimal and are obtained by searching the Nash equilibrium of the minmax game with the value function:
(2) 
where the is the distribution of the real data. A small modification of the game (Salimans et al. 2016) allows
to classify
into one of K+1 possible classes, for example, to tell which one of the K classes of dark energy models is the most probable that is generated from, or is just the output of the generator thus belongs to the th class.(3) 
Here is the label of the model. corresponds to the in Eq.2, giving the probability that is classified into real.
is the joint distribution of the real data and the model class.
is the probability that is classified to the right model .2.3 Training Algorithm
The combination of VAE and GAN provides a convenient way to do data interpolation and model selection at the same time, once a set of optimal is obtained by optimizing Eq.1 and Eq.3. The basic logic of VAEGAN network is shown in Figure 1. The observed data is fed to the encoder to find the posterior. Then is sampled from the posterior and fed to the decoder/generator to derive the reconstruction (or interpolation) of the input data . Finally, the discriminator/classifier extracts the useful features from the reconstruction to derive the probability that belongs to a certain model.
Now the remaining question is that given a set of observations and their covariance as well as a set of model candidates , how to find the optimal . Since Eq. 1 and Eq.3
set constraints on functions, any flexible functions that have learning abilities can fit in this work. A possible choice is the convolutional neural network (CNN) which is good at representation learning and shiftinvariant feature extraction. Suppose
are CNNs whose parameters are respectively. One can generate a batch of training samples from the model candidates and train the networks on this fixed data using stochastic gradient descent:
Pick up from the training samples and retrieve the observed part , is the class label. Adding multivariate Gaussian random noise to yields ;

Feed to the encoder to get the posterior , then calculate the KL divergence (corresponding to the first item in Eq.1
). Suppose the posterior is a multivariate Gaussian distribution with diagonal covariance,
, and the prior is the standard normal distribution,
, then the KL divergence can be analytically written as , here the square and sum operations are elementwise (Kingma & Welling 2014). Find the gradient of the KL divergence with respect to the parameter of the encoder,(4) 
Sample from the posterior and feed it to the generator to obtain the reconstruction . The observed part of the reconstruction together with and give the negative log likelihood (corresponding to the second item in Eq.1), where is the misfit that is broadly used in the model regression problems. The is the normalization constant of the likelihood, having the value of , here is the dimension of the covariance . Since likelihood depends on both the encoder and the generator, its gradient provides update to ,
(5) 
Update the parameters of the encoder, the generator and the discriminator using a learning rate of ,
(8)
The training process can be easily generalized to minibatch training to obtain a faster convergence rate. Several training techniques that stabilize or accelerate the training process are also applicable in this problem (Radford et al. 2015; Salimans et al. 2016; Sønderby et al. 2017; Mescheder et al. 2018).
The encoder consists of four 1D convolutional layers and two dense layers. Each layer is followed by a batch renormalization layer (Ioffe 2017
) and an activation layer with Leaky Rectified Linear Unit (a simple variant to ReLU
Nair & Hinton 2010), except the last layer which acts as the output. The input of the encoder has a size of 580 which is the number of the distance modulus in SNIa Union2.1 dataset later introduced in Section 4. The dimension of the latent variable should be no less than the number of the parameters in the physics models used in the problem, and we set it 20. The size of the convolutional kernel is fixed to 7 and the stride is 4 except for the input layer whose convolutional kernel and stride are of size 69 and 1 respectively. The generator consists of one dense layer and four 1D fractional convolutional layers. The sizes of the convolutional kernel and the stride are the same as the encoder (because it is the inverse process of the encoder), except that the output of the last layer has a dimension of 2048. The discriminator consists of four 1D convolutional layers and one dense layers. The configuration is similar to the encoder, except that the size of the input and the output are 2048 and
respectively.3 Tests On Toy Models
This section creates two toy models to test the data reconstruction and the model comparison ability of the network.
Model 1,
(9) 
Model 2,
(10) 
Model 1 and Model 2 have similar distributions as shown Figure 2. Given the observations which are generated by the underlying model on with an error matrix , we would like to fit the two toy models to the observations to tell which one is most possible to be the true model, and interpolate the data with the model at , for example, evenly stay in the interval with .
First we concatenate and sort and , and call the new one . Then sample from the priors of the toy models and generate the training samples (Note that it depends on the model that which set of parameters should be used). Here 12800 samples for each model are generated as the training dataset. Finally the training set together with the observation error are fed to the network. Once the training converges, one can put the observations into the network to tell which toy model is most probable and get the interpolation, see Figure 3. In this task, the discriminator has a classification accuracy of almost 1. It assigns a probability of 97% to the parabolic model (Model 1), which is indeed the case.
4 The Dataset
4.1 The Observations
The observations are from the Union2.1 Supernova TypeIa compilation (Suzuki et al. 2012) which contains 580 SNIa. Union2.1 provides the distance modulus with their covariance matrix. Let denote the redshift of the 580 SNIa, and denote the measured distance moduli, denote the covariance of the distance moduli with systematics.
4.2 The Training Set
We study the model comparison problem among three dark energy models: 1) ( CDM); 2) (CDM); 3) (CPL), given a set of observations of distance modulus at different redshift. The expansion rate of a spatially flat FRW universe is determined by the matter and the dark energy,
(11) 
The luminosity distance is closely related to the hubble expansion rate (Eq.12), and the distance modulus is given by Eq.13.
(12) 
(13) 
For each dark energy model, 12800 samples are generated at the redshift , given the priors of the parameters as,
(14) 
has 1468 elements evenly located in the interval, . The samples are fixed as the training set.
5 Results & Discussions
Figure 4 shows the reconstruction of the distance modulus given by the network. Given the observed distance modulus , the discriminator assigns probability to each model with,
(15) 
Giving the conclusion that the CDM is slightly more favoured than the other two models while CPL is the least favoured in the light of the observations. This result is consistent with the one derived by the Bayesian evidence method in Lonappan et al. 2018 which find the log evidence of each model to be . These evidences can be interpreted into probabilities 66.2%, 20.7%, 13.1%, respectively, given the noninformative prior .
The classification accuracy on the three models with the observation error is reported as 47.9%. The accuracy is subjectively low, though, it is not unexpected. The integration operations in Eq. 11 and Eq. 12 which act as lowpass filters smooth out the local high frequency features that are useful for model comparison. Thus the convolutional kernel in the network need to search for useful low frequency features which are less informative once the output distributions of the models overlap with each other. This is the case we meet in this problem. If one use to represent the model’s prediction of the distribution of the data (it is the evidence of the model), then the theoretical optimal discriminator assigns each model with a probability of . Once drops in the overlapped region where , the discriminator loses its ability to discriminate the models confidently.
Note that CDM is a special case of CDM while the later is a special case of CPL model, that means there is always a region overlapped among the data distributions of the three models. If the measurements of the distance moduli are accurate enough, the region covered by is negligible compared to , and the later is negligible compared to . Thus randomly generated by has an extremely low probability to drop in the region of and reversely a high probability is assigned to that it comes from if it drops in the region of , this situation is also fit to the comparison between and . Then the discriminator has a great confidence to tell which model comes from.
Figure 5 is an illustration of the discussion above. The left column shows the normalized histograms of projections of the training samples to its 1st and 4th principal components (PC) with no observation errors. The upper left panel reveals that the low frequency part (1st PC) of the model contributes little to the model discrimination, because the projection of the Union 2.1 data to the 1st PC drops in the region where all the model have similar probabilities. The lower left panel shows that the high frequency part (4th PC) of the model is useful for model discrimination, since the projection of the Union 2.1 data to the 4th PC locates in the region where the model have obviously different probabilities.
The discriminator degrades, however, if a nonzero observation error involves in the problem. The overlapped region expands due to the errors so that it is not negligible anymore. An extreme limit is that the errors become infinity, thus the distribution of the three models become the same so that the discriminator can only make a random guess which model is true. In this situation the accuracy degrades to . The finite observation errors lead to a nonnegligible intersection where the discriminator lose the ability to tell confidently which model the data comes from. This is illustrated at the bottom row of Figure 5. The lower right panel shows the distribution of 4th PC scores of the training samples with the covariance matrix of Union 2.1. The projection of the Union 2.1 data to the 4th PC now locates in the region where the different models have the similar probabilities. This explains why the network yields a result that is in good concordance with the standard Bayesian analysis but has a subjectively low classification accuracy  the model classification accuracy is intrinsically determined by the nested structure of the three models as well as the observation noise. The variational network successfully learn the posterior distribution and the likelihood distribution to produce a consistent result.
Although this work uses the distance modulus for model comparison and data reconstruction, it is easy to extend the scenario to Hubble parameters or other dataset. The framework should be further considered to include not only but also its th derivatives , e.g., both the angular diameter distance and the hubble parameter measured by BAO, to let the encoder and the discriminator benefit from different datasets. Another improvement of the framework is to allow the reconstruction of the data to implicitly include a model averaging process which will enhance the generalization of the reconstruction, for example, extend the VAEGAN to its more powerful variant CVAEGAN (Bao et al. 2017). These are left to the future works.
Acknowledgements.
This work was funded by the National Science Foundation of China (Grants No.11573006, 11528306), National Key R&D Program of China (2017YFA0402600), the 13th Fiveyear Informatization Plan of Chinese Academy of Sciences, Grant No. XXH1350504.References
 Aghanim et al. (2018) Aghanim, N., Akrami, Y., Ashdown, M., et al. 2018, arXiv preprint arXiv:1807.06209
 Alam et al. (2017) Alam, S., Ata, M., Bailey, S., et al. 2017, Monthly Notices of the Royal Astronomical Society, 470, 2617

Bao et al. (2017)
Bao, J., Chen, D., Wen, F., Li, H., & Hua, G. 2017, in Proceedings of the IEEE International Conference on Computer Vision, 2745
 Basilakos et al. (2018) Basilakos, S., et al. 2018, The European Physical Journal C, 78, 889
 Betoule et al. (2014) Betoule, M., Kessler, R., Guy, J., et al. 2014, Astronomy & Astrophysics, 568, A22
 Caldwell et al. (1998) Caldwell, R. R., Dave, R., & Steinhardt, P. J. 1998, Physical Review Letters, 80, 1582
 Caldwell et al. (2003) Caldwell, R. R., Kamionkowski, M., & Weinberg, N. N. 2003, Physical Review Letters, 91, 071301
 Chevallier & Polarski (2001) Chevallier, M., & Polarski, D. 2001, International Journal of Modern Physics D, 10, 213
 Delubac et al. (2015) Delubac, T., Bautista, J. E., Rich, J., et al. 2015, Astronomy & Astrophysics, 574, A59
 Elizalde et al. (2004) Elizalde, E., Nojiri, S., & Odintsov, S. D. 2004, Physical Review D, 70, 043539
 Goodfellow et al. (2014) Goodfellow, I., PougetAbadie, J., Mirza, M., et al. 2014, in Advances in Neural Information Processing Systems 27, ed. Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, & K. Q. Weinberger (Curran Associates, Inc.), 2672
 Ioffe (2017) Ioffe, S. 2017, in Advances in neural information processing systems, 1945
 Kingma & Welling (2014) Kingma, D. P., & Welling, M. 2014, in Second International Conference on Learning Representations, ICLR

Larsen et al. (2016)
Larsen, A. B. L., S?nderby, S. K., Larochelle, H., & Winther, O. 2016, in Proceedings of Machine Learning Research, Vol. 48, Proceedings of The 33rd International Conference on Machine Learning, ed. M. F. Balcan & K. Q. Weinberger (New York, New York, USA: PMLR), 1558
 Lonappan et al. (2018) Lonappan, A. I., Kumar, S., Dinda, B. R., Sen, A. A., et al. 2018, Physical Review D, 97, 043524
 Macaulay et al. (2013) Macaulay, E., Wehus, I. K., & Eriksen, H. K. 2013, Physical review letters, 111, 161301
 MacKay (1992) MacKay, D. J. 1992, Neural computation, 4, 415
 Martin et al. (2011) Martin, J., Ringeval, C., & Trotta, R. 2011, Physical Review D, 83, 063524
 Mescheder et al. (2018) Mescheder, L., Nowozin, S., & Geiger, A. 2018, in International Conference on Machine Learning (ICML)
 Nair & Hinton (2010) Nair, V., & Hinton, G. E. 2010, in Proceedings of the 27th international conference on machine learning (ICML10), 807
 Peebles & Ratra (2003) Peebles, P. J. E., & Ratra, B. 2003, Reviews of modern physics, 75, 559
 Penny et al. (2006) Penny, W., Mattout, J., & TrujilloBarreto, N. 2006, Statistical Parametric Mapping: The analysis of functional brain images. London: Elsevier
 Radford et al. (2015) Radford, A., Metz, L., & Chintala, S. 2015, arXiv preprint arXiv:1511.06434
 Rissanen (1978) Rissanen, J. 1978, Automatica, 14, 465
 Sahni (2002) Sahni, V. 2002, Classical and Quantum Gravity, 19, 3435
 Salimans et al. (2016) Salimans, T., Goodfellow, I., Zaremba, W., et al. 2016, in Advances in Neural Information Processing Systems 29, ed. D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Curran Associates, Inc.), 2234
 Scherrer & Sen (2008) Scherrer, R. J., & Sen, A. 2008, Physical Review D, 78, 067303
 Sønderby et al. (2017) Sønderby, C. K., Caballero, J., Theis, L., Shi, W., & Huszár, F. 2017
 Suzuki et al. (2012) Suzuki, N., Rubin, D., Lidman, C., et al. 2012, The Astrophysical Journal, 746, 85
 Thakur et al. (2012) Thakur, S., Nautiyal, A., Sen, A. A., & Seshadri, T. 2012, Monthly Notices of the Royal Astronomical Society, 427, 988
 Trotta (2008) Trotta, R. 2008, Contemporary Physics, 49, 71
Comments
There are no comments yet.