Over the last few years there has been a shift in the way machine learning models are trained. Most approaches, ranging from supervised and multi-task learning to probabilistic methods, such as variational inference, are nowadays based on gradient-based optimization methods. In particular, they often rely on stochastic gradient descent, due to its simplicity and data scalability.
In this context, a common practice before the learning step is to standardize the data, or more generally, to apply a data scaling (a.k.a. feature scaling) transformation (Han et al., 2011, Chapter 3.5.2). Empirical results show that data scaling often alleviates numerical issues and yields a more accurate learning process. However, despite the broad use of data scaling methods, little to no attention has been paid to answering basic questions such as: Why do these methods improve the learning process? When should one method be chosen over the others?
In this work, we aim to shed some light on the effect that data scaling methods, such as standardization and normalization, have on gradient-based learning algorithms. To this end, we rely on black-box variational inference (BBVI) (Ranganath et al., 2014) as a use case of gradient-based learning algorithms, and on the exponential family due to its well-studied theoretical properties. These properties allow us to analyze how the gradients behave as a function of the scaling factor, or more precisely, to find the relationship between Lipschitz constants of the gradient of the objective function (in our example, the log-likelihood) evaluated using the original data and the scaled (e.g., standardized) data.
Our analysis shows that both standardization and normalization usually result in smoother optimization landscapes, which are often easier to optimize (Santurkar et al., 2018, Section 3.1). However, our study also shows that there is no general rule for deciding whether or not to apply data scaling, nor which data scaling method will improve the optimization process.
Fortunately, our framework allows us to derive a novel algorithm, Lipschitz standardization, whose scaling factor yields the desired (estimator of the) Lipschitz constant for the gradient of the function to be optimized. Moreover, since data scaling is only suitable for continuous variables, we extend our algorithm to handle discrete data through optimally scaled auxiliary continuous variables.
Lipschitz standardization satisfies two important properties. First, it allows us to scale the data optimally for a given learning rate, thus leading to faster and more accurate (overall) learning. Second, it favors fairer learning across the different variables in the data, i.e., it encourages all the variables in the data to be learnt accurately. The latter property is of particular interest when dealing, e.g., with multivariate Bayesian inference or multi-task learning problems, where a subset of the dimensions (e.g., tasks or observed variables) often dominates the learning process (Yu et al., 2020).
Our experiments on synthetic and real-world data, as well as on different generative models, show the effectiveness of the proposed Lipschitz standardization. It not only leads to fairer and more accurate learning than existing approaches, but also mitigates numerical instabilities, improving the robustness of the learning process.
2 Problem Statement
In this section we introduce the notation we adopt throughout the paper, as well as the framework we choose to evaluate different data transformation approaches.
Within existing approaches, we here consider only data scaling transformations of the form x̃ = x/ω, where x and x̃ are respectively the original and the scaled data, and ω is the scaling factor. There are two main reasons for this choice: i) they preserve important properties of the data distribution, such as its domain and tails; and ii) they are broadly used in practice (Han et al., 2011). Note that, as shifting the data may violate distributional restrictions (e.g., non-negativity), we assume without loss of generality that the data may have already been shifted and that its likelihood is selected afterwards. Specifically, we focus on two broadly used data scaling methods:
standardization, which sets ω to the empirical standard deviation of the observed data, and normalization, which sets ω to its absolute maximum value.
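As a concrete reference, the two scaling rules above can be sketched in a few lines. This is a minimal illustration of ours (the function name `scale` and the synthetic data are not from the paper):

```python
import numpy as np

def scale(x, method="std"):
    """Return (x_tilde, omega) with x_tilde = x / omega.

    method="std": omega is the empirical standard deviation (standardization).
    method="max": omega is the absolute maximum value (normalization).
    Note that, as in the paper, no shift is applied here.
    """
    x = np.asarray(x, dtype=float)
    if method == "std":
        omega = x.std()
    elif method == "max":
        omega = np.abs(x).max()
    else:
        raise ValueError(f"unknown method: {method}")
    return x / omega, omega

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=1000)
x_std, w_std = scale(x, "std")   # scaled data has unit standard deviation
x_max, w_max = scale(x, "max")   # scaled data lies in [-1, 1]
```

Either transform is inverted exactly by multiplying back by ω, which is what allows learnt parameters to be mapped back to the original data scale.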
In order to analyze the effect of data scaling on gradient-based learning, we rely on a large class of probabilistic models as a use case. Following Hoffman et al. (2013), we consider the fairly simple, but general, graphical model depicted in Figure 1. Thus, the joint distribution over the observed variables x (the data), the set of local latent variables z, and the set of global latent variables β, can be written as p(x, z, β) = p(β) ∏_{n=1}^{N} p(z_n | β) p(x_n | z_n, β).
Here, each observation x_n corresponds to a D-dimensional feature vector. For simplicity, we further assume that the likelihood factorizes as p(x_n | z_n, β) = ∏_{d=1}^{D} p(x_nd | z_n, β), where the latent variables z_n and β determine the likelihood parameters for each observation.
Furthermore, and without loss of generality, we consider black-box variational inference (BBVI) (Ranganath et al., 2014) to approximate the posterior distribution of the latent variables, p(z, β | x). BBVI uses a mean-field variational distribution family of the form q(z, β) = q(β; λ) ∏_n q(z_n; φ_n), where φ = {φ_n} and λ are the local and global variational parameters, respectively.
More in detail, BBVI relies on (stochastic) gradient ascent to find the parameters that maximise the evidence lower bound (ELBO), or equivalently, that minimise the Kullback-Leibler divergence from the variational distribution to the posterior (Blei et al., 2017), which is given by
Thus, the gradient ascent algorithm updates the global parameters as λ^(t+1) = λ^(t) + ρ ∇_λ ELBO, and the local ones as φ^(t+1) = φ^(t) + ρ ∇_φ ELBO, where t is the current step of the algorithm and ρ the learning rate.
Importantly, to disentangle the contributions of the data and the model parameters to the gradient evaluation, one can apply the chain rule with respect to the likelihood parameters in the computation of the log-likelihood term of the gradient, thus making explicit the effect of data scaling on the gradient-based learning process. To this end, we restrict ourselves to the exponential family (Bishop, 2006, Chapter 2.4) since, as shown in the next section, it exhibits nice theoretical properties.
2.1 Scaling the exponential family
From now on, we consider each dimension in the observed data to be modeled using a member of the exponential family. In other words, we assume all likelihood functions to be of the form p(x | η) = h(x) exp(⟨η, t(x)⟩ − A(η)), where η are the natural parameters (a function of the latent variables), t(x) are the sufficient statistics, h(x) is the base measure, and A(η) is the log-partition function. Note that η and t(x) are vectors of the same size, and that we here remove the sub-indexes to avoid cluttering the notation.
Working with the exponential family lets us draw the following relationships (Appendix A):
Proposition 2.1. Let p(x; η) be a density function of the exponential family with variable x and parameters η. Assume a scaling function s_ω(x) = x/ω such that for any ω it defines the function (and random variable) x̃ = s_ω(x). If
(a) for every ω, the function s_ω is bijective;
(b) the base measure factorises; and
(c) every sufficient statistic factorises under the scaling function.
Then, by defining such that , where is the pairwise product and , we have that
In words, Proposition 2.1 provides a relationship between the log-likelihood functions, as well as their first- and second-order partial derivatives, of the original and the scaled data. Importantly, although requirements (a)-(c) may look restrictive at first, as reported in Table 1, many commonly used distributions fulfil such properties. It is also worth mentioning that, in the case of the log-normal distribution, the scaling function acts on the logarithm of the data instead of on the data itself.
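For intuition, the Gaussian case of Proposition 2.1 can be checked numerically: scaling the data by 1/ω rescales the natural parameters as η₁ → ωη₁ and η₂ → ω²η₂, and shifts the log-likelihood by log ω. The following sketch is our own check, not code from the paper:

```python
import numpy as np

def gauss_logpdf(x, mu, sigma):
    # log-density of a normal distribution N(mu, sigma^2)
    return -0.5 * np.log(2 * np.pi) - np.log(sigma) - 0.5 * ((x - mu) / sigma) ** 2

mu, sigma, omega = 3.0, 2.0, 5.0
x = np.linspace(-4.0, 8.0, 7)

# natural parameters of the original Gaussian ...
eta1, eta2 = mu / sigma**2, -1.0 / (2 * sigma**2)
# ... and of the scaled one: c_1(omega) = omega, c_2(omega) = omega^2
eta1_s, eta2_s = omega * eta1, omega**2 * eta2

# recover (mu_s, sigma_s) from the scaled natural parameters
sigma_s = np.sqrt(-1.0 / (2 * eta2_s))
mu_s = eta1_s * sigma_s**2

# change of variables: log p_scaled(x / omega) = log p(x) + log(omega)
lhs = gauss_logpdf(x / omega, mu_s, sigma_s)
rhs = gauss_logpdf(x, mu, sigma) + np.log(omega)
```

The recovered scaled parameters are exactly μ/ω and σ/ω, matching the intuition that dividing Gaussian data by ω divides its mean and standard deviation by ω.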
2.2 The learning pipeline
At this point, we consider it important to clarify the complete flow from the moment the data are given until the model parameters are inferred. This process can be described by the simple diagram
where i) is the data transformation function; ii) is the result of the inference model, where we recall that the natural parameters are a function of the latent variables; and iii) is the function that recovers the original parameters from the scaled ones, as shown in Proposition 2.1.
3 Why does standardization work?
Standardization and normalization are widely used in statistics and classical machine learning, since they tend to improve the accuracy of the resulting models (Juszczak et al., 2002; Milligan & Cooper, 1988). However, there is a priori no way to say which one to use.
As an example, in distance-based machine learning methods, e.g. clustering algorithms, the effectiveness of data standardization and normalization may be explained by the fact that they bring all the data mass into a similar range, making the distance between points comparable among dimensions (Aksoy & Haralick, 2001). In this context, Milligan & Cooper (1988) reported that data normalization empirically works better than standardization. Juszczak et al. (2002) backed such conclusion, yet they concluded, as Gnanadesikan et al. (1995) did, that worry-free scaling methods do not exist and further research is needed.
In other approaches, such as maximum likelihood or variational inference, the distance argument becomes less convincing, since explicit distances between points are no longer evaluated. (In the Bayesian framework, one may also argue that standardization eases the prior selection process, even for those random variables only indirectly related to the data, improving the overall performance of the algorithm.)
An alternative argument is that these methods improve numerical stability by moving the data, and thus the model parameters, to a well-behaved part of the real line. This is sensible, since computers struggle to represent very small and very large numbers, which would inevitably affect the learning process.
Next, we introduce what is, to the best of our knowledge, a novel line of reasoning to explain the effectiveness of standardization and normalization in gradient-based methods. In a similar fashion as Santurkar et al. (2018) showed that batch normalization (Ioffe & Szegedy, 2015) smooths the optimization landscape of the loss function with respect to the neural network parameters, we show that standardization and normalization often smooth out the optimization landscape of log-likelihood-based objective functions.
3.1 “Standardizing” the optimization landscape
We say that a real-valued function f is L-smooth if it is twice differentiable and its first derivative is Lipschitz continuous with constant L. That is, if for any θ₁, θ₂ in its domain, we have |f′(θ₁) − f′(θ₂)| ≤ L |θ₁ − θ₂|.
Let us now denote the logarithms of the likelihood functions (consider, e.g., the distributions in Table 1) of the original and the scaled data by ℓ and ℓ̃, respectively. Let us also assume that the optimization landscape for the i-th natural parameter, η_i, of the original log-likelihood is L_i-smooth. Then, using Proposition 2.1, we can obtain a relationship between the Lipschitz constant of the scaled log-likelihood and that of the original one,
where the two (scaled) parameter vectors differ only in the i-th component. This implies that the optimization landscape of the i-th natural parameter of the scaled log-likelihood is L̃_i-smooth, with L̃_i given by the relation above.
Note that, in Eq. 3.1, values of the ratio L̃_i/L_i smaller than one imply that the optimization landscape is smoother for the scaled data than for the original data. Thus, under reasonably small values of this ratio, one may expect an improvement in the learning process when using the scaled data. In contrast, when the ratio is greater than one, scaling the data will “sharpen” the optimization landscape.
Table 1 shows that, for common likelihood functions such as the (log-)normal and gamma distributions, the scaled log-likelihood is smoother than the original one precisely when the data are multiplied by a factor smaller than one. For such distributions, data standardization (normalization) is performed with a scaling factor smaller than 1 only when the empirical standard deviation (maximum value) of the original data is larger than one, which is common in real data. Note though that not all distributions in Table 1 smooth out in this case. For example, the inverse-gamma distribution smooths out in the opposite case.
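As an illustration of this effect for the Gaussian likelihood (our own sketch; for the Gaussian, the gradient with respect to the first natural parameter is Lipschitz with constant Nσ², which follows from the log-partition function), scaling the data by 1/ω divides this Lipschitz constant by ω²:

```python
import numpy as np

def lipschitz_eta1(data):
    # For a Gaussian likelihood, the second derivative of the log-likelihood
    # w.r.t. the first natural parameter is -N * sigma^2, so its gradient is
    # Lipschitz with constant N * sigma^2 (estimated via the empirical variance).
    return data.size * data.var()

rng = np.random.default_rng(1)
x = rng.normal(2.0, 3.0, size=500)   # empirical std > 1, as is common in real data

omega = x.std()                      # standardization factor
L_orig = lipschitz_eta1(x)
L_scaled = lipschitz_eta1(x / omega) # equals L_orig / omega^2: a smoother landscape
```

Since the empirical standard deviation here exceeds one, standardization multiplies the data by a factor smaller than one and the landscape smooths out, in line with the discussion above.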
3.2 Lipschitz constant and learning rate
A question that promptly arises is whether smaller Lipschitz constants are always better in terms of gradient-based optimization. However, it is relatively easy to find examples where normalization works worse than standardization, as shown in Figure 3, even though its Lipschitz constants are usually smaller.
One could argue that the problem comes from numerical issues, since normalization may result in transformed data values that are very close to zero (e.g., in heavy-tailed distributions). However, this does not explain common cases where the scaling factors for both standardization and normalization are of the same order of magnitude. An alternative explanation is that too small values of the Lipschitz constant may lead to over-smoothed optimization landscapes, thus deteriorating the learning process (e.g., due to negligible gradients).
In the context of gradient-based non-convex optimization, Nesterov (2018) showed that, for a differentiable function whose first derivative is Lipschitz continuous with constant L, the optimal gradient step size is 1/L. Unfortunately, when standardization or normalization is used, whether the resulting Lipschitz constant is close to optimal for the chosen learning rate boils down to luck. The same holds for obtaining similar Lipschitz constants across dimensions, which is important in multivariate problems to prevent the learning process from being dominated by a subset of the dimensions.
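The role of the 1/L step size is easy to visualize on a toy L-smooth objective (a quadratic, chosen by us purely for illustration): the step 1/L converges fastest, a much smaller step crawls, and a step above 2/L diverges:

```python
# Gradient descent on f(theta) = (L/2) * (theta - theta_star)^2,
# whose gradient is Lipschitz continuous with constant L.
L = 25.0
theta_star = 3.0

def grad(theta):
    return L * (theta - theta_star)

def run(step, n_iters=20, theta0=0.0):
    theta = theta0
    for _ in range(n_iters):
        theta = theta - step * grad(theta)
    return theta

theta_opt = run(1.0 / L)     # optimal step 1/L: exact in one iteration on a quadratic
theta_slow = run(0.1 / L)    # too-small step: slow convergence (over-smoothed regime)
theta_div = run(2.2 / L)     # step above 2/L: diverges on this objective
```

The same fixed learning rate can thus be near-optimal, too timid, or unstable depending on the Lipschitz constant, which is exactly what data scaling implicitly changes.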
4 Can we do better?
In this section, we propose a novel data scaling algorithm capable of controlling the Lipschitz constant of the derivative for a given log-likelihood function. We remark that this idea can be extrapolated to other objective functions beyond the ELBO and likelihood function without loss of generality. Intuitively, this algorithm moves the data into a region of the domain where the Lipschitz constant of the optimization landscape is close to the optimum.
Unfortunately, the existence of the Lipschitz constant is not guaranteed (e.g., when the second derivative is unbounded). Even when it exists, computing this constant is, in general, not straightforward. Therefore, we instead make use of a local estimator of the Lipschitz constant.
Specifically, we use the second derivative of the log-likelihood as a local estimator of the Lipschitz constant. This choice becomes clear when we recall the mean value theorem (adapted to our setting).
Theorem 4.1 (Mean Value Theorem).
Let f be a real-valued function that is twice differentiable with respect to θ. Then, for any two values θ₁ and θ₂, there exists θ′ between them such that f′(θ₁) − f′(θ₂) = f″(θ′)(θ₁ − θ₂).
By using the above result, we can obtain an estimator of the (local) Lipschitz constant as the absolute value of the second derivative evaluated at η̂, where η̂ is the empirical estimator of the natural parameters for the given distribution. For example, in the case of the normal distribution, with μ̂ and σ̂² being the empirical mean and variance of the observed data, we have η̂₁ = μ̂/σ̂² and η̂₂ = −1/(2σ̂²).
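For this Gaussian example, the estimators have closed forms through the second derivatives of the log-partition function A(η) = −η₁²/(4η₂) − ½ log(−2η₂). The sketch below is our own, with a finite-difference sanity check of the closed form:

```python
import numpy as np

def log_partition(e1, e2):
    # log-partition A(eta) of the univariate Gaussian in natural parameters
    return -e1**2 / (4 * e2) - 0.5 * np.log(-2 * e2)

mu_hat, var_hat, N = 1.5, 4.0, 100
e1, e2 = mu_hat / var_hat, -1.0 / (2 * var_hat)

# closed-form second derivatives of A; the local Lipschitz estimators are N * |A''|
d2A_e1 = var_hat                                   # d^2 A / d eta1^2 = -1/(2 eta2)
d2A_e2 = 2 * var_hat**2 + 4 * var_hat * mu_hat**2  # d^2 A / d eta2^2
L1_hat = N * d2A_e1
L2_hat = N * d2A_e2

# finite-difference sanity check of d2A_e2
h = 1e-4
fd = (log_partition(e1, e2 + h) - 2 * log_partition(e1, e2)
      + log_partition(e1, e2 - h)) / h**2
```

Note that the estimators grow with both the sample size and the empirical moments, which is why different dimensions of a dataset can end up with wildly different smoothness.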
With a tractable estimator of the Lipschitz constant, and the relationship between original and scaled data given in Proposition 2.1, it is now possible to build an efficient algorithm to control the smoothness of the log-likelihood.
Note that the scaling factor controls the Lipschitz constants of all the natural parameters of the log-likelihood, so we adopt as criterion that the sum of these constants amounts to a given objective value, L*. This can be expressed as an optimization problem, i.e.,
where L̃_i is the (estimated) Lipschitz constant of the scaled data with respect to the i-th parameter.
For many distributions, it is possible to find a unique closed-form solution of the above optimization problem, as the following proposition showcases.
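As a concrete instance (our own sketch under the Gaussian example above, with hypothetical estimates L₁, L₂), the natural parameters scale as η₁ → ωη₁ and η₂ → ω²η₂, so the criterion becomes L₁/ω² + L₂/ω⁴ = L*, a quadratic in u = 1/ω² with a unique positive root:

```python
import numpy as np

def lipschitz_omega_gaussian(L1, L2, L_target):
    """Scaling factor omega such that L1/omega^2 + L2/omega^4 = L_target.

    With u = 1/omega^2 the constraint is the quadratic
    L2*u^2 + L1*u - L_target = 0, whose positive root is unique.
    """
    u = (-L1 + np.sqrt(L1**2 + 4 * L2 * L_target)) / (2 * L2)
    return 1.0 / np.sqrt(u)

L1, L2, L_target = 400.0, 6800.0, 1.0     # hypothetical Lipschitz estimates
omega = lipschitz_omega_gaussian(L1, L2, L_target)
residual = L1 / omega**2 + L2 / omega**4 - L_target
```

The closed form makes the scaling step essentially free compared to the cost of the learning algorithm itself.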
Remark. Alternatively, we may solve the optimization problem in Eq. 11 using a gradient descent algorithm where, again, we can take advantage of Proposition 2.1 to recompute the estimator without re-evaluating the log-likelihood at every step of the optimization. As a result, while we have centered our analysis on BBVI with the exponential family, we remark that the optimization problem in Eq. 11 can be solved for any objective function (beyond the log-likelihood), as long as we can compute an estimator of its Lipschitz constant in a similar fashion as in Eq. 10; that is, as long as the objective function is twice differentiable with respect to the parameters we aim to learn.
4.1 Faster and fairer learning
Note that Lipschitz constants play an important role in the optimization process. On one hand, having a close-to-optimal Lipschitz constant translates to faster convergence. On the other hand, sharing the same constant across dimensions ensures that each dimension is learnt at a similar rate, thus avoiding unfair learning – see, e.g., Figure 2 for a visual example of unfair learning.
Note that the Lipschitz criterion introduced in Eq. 11 enables selecting the Lipschitz constant for each dimension. We also remark that, while for each dimension there is a unique optimal learning rate, in practice, only a single learning rate is set.
Following the previous notation, if we now let {ω_d} be the set of scaling factors for all dimensions of the data, we can rewrite the Lipschitz criterion per dimension, where L̃_{i,d} is the Lipschitz constant of the scaled data with respect to the i-th parameter of dimension d, and L*_d is the target Lipschitz constant for the d-th dimension of the data. Thus, in order for fair learning across dimensions to happen, we enforce all the per-dimension targets to be equal, i.e., L*_d = L* for all d. Since in our use case the optimization is not performed over the natural parameters but over the variational parameters, we apply the chain rule, i.e.,
so that the Lipschitz condition can now be approximated accordingly, and therefore the per-dimension Lipschitz condition is given by:
5 Discrete data
Previously, we introduced the idea of making the optimization landscape equally smooth for every data dimension. Up to this point, our algorithm only applies to continuous data and, thus, continuous likelihood functions. However, real-world datasets often contain a diverse mix of continuous and discrete data types, and thus likelihoods.
Next, we propose an approach to deal with discrete data, so that they can also be scaled to be Lipschitz-optimal and, thus, learnt fairly alongside the continuous dimensions. Specifically, this approach can be summarized in three steps: i) transform the discrete dimensions into continuous data; ii) fit the now-continuous dimensions with a set of Gamma distributions, which we can Lipschitz-standardize; and iii) estimate the parameters of the original discrete distribution from the learnt continuous one. In the following, we refer to this approach as the Gamma trick.
Discrete likelihoods can be separated into two types: those with a single degree of freedom, that is, a single uni-dimensional parameter, and those with more than one degree of freedom. In this paper, we consider the Bernoulli, Poisson and categorical distributions, since they appear in many real-world applications.
From this point onward, and without loss of generality, we assume discrete data to take values in a subset of the natural numbers. The reason is that, when using the Gamma trick, shifting the data away from zero causes the estimator of its first parameter to decrease rapidly, which makes it easier to fulfil the constraint in Eq. 4.1 that is necessary to find the optimal scaling factor. For further details, see Appendix C.
In order to turn the original data into continuous form, we apply a transformation of the form x + u, where u is a continuous noise variable that should: i) have as support a subset of the unit interval with non-zero measure such that, when added to the data, the original value remains identifiable; and ii) preserve the original shape of the data as much as possible. As an example, in our experiments, u is sampled from a distribution satisfying both properties.
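A minimal sketch of this transformation (our own; the Beta noise parameters are an arbitrary choice satisfying properties i) and ii), not the paper's exact setting) shows that the discrete values remain exactly recoverable:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.poisson(4.0, size=1000)          # discrete (count) data

# Noise with support inside the unit interval keeps x identifiable via the
# floor; a Beta draw concentrated near zero barely perturbs the data shape.
u = rng.beta(1.1, 30.0, size=x.shape)
x_cont = x + u                           # continuous surrogate of the data

x_recovered = np.floor(x_cont).astype(int)
```

In the full pipeline, the continuous surrogate would additionally be shifted away from zero and fitted with a Gamma likelihood before applying Lipschitz standardization.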
Recall that in Eq. 6 we presented the learning pipeline that data follow. For the case when the Gamma trick is applied, this pipeline can be re-written as
where is the expected value of the discrete variable, is the mean of the auxiliary continuous variable, and is computed using the de-standardized natural parameters , i.e., it is a function of the natural parameters, .
For example, distributions with one degree of freedom, such as the Bernoulli and Poisson distributions, are characterised by their expected value. Hence, to recover the original distributional parameters after learning, it is enough to compute the mean of the auxiliary continuous distribution, i.e.,
and recover the original parameters using the average value of the original discrete data. Specifically, for a fixed mean, we obtain the likelihood parameters by matching this mean to the success probability of the Bernoulli and to the rate of the Poisson distribution.
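For instance, in the following sketch of ours (with uniform noise and a shift of one, both illustrative choices rather than the paper's exact setting), the Bernoulli parameter is recovered by subtracting the known shift and noise mean from the mean of the continuous surrogate:

```python
import numpy as np

rng = np.random.default_rng(3)
p_true = 0.3
x = rng.binomial(1, p_true, size=20000)   # Bernoulli data

u = rng.uniform(0.0, 0.5, size=x.shape)   # noise supported inside the unit interval
x_cont = x + 1 + u                        # shift away from zero, then add noise

# In expectation, E[x_cont] = p + 1 + E[u]; after learning a continuous model,
# only its mean is needed to recover the Bernoulli parameter.
p_hat = x_cont.mean() - 1.0 - 0.25        # subtract the shift and the noise mean
```

The same moment-matching step applies to the Poisson rate, since both distributions are fully characterized by their mean.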
Next, we focus on the categorical distribution as an example of a distribution with several degrees of freedom. In particular, it is characterized by a vector parameter representing the probability of being assigned to each class. By using a one-hot representation of the data and treating each class (now a binary variable) as an independent Bernoulli dimension, we can learn each class individually at the price of increasing the number of dimensions. We refer to this approximation as the Bernoulli trick. Finally, we can recover the original parameters by ensuring that their means add up to one. That is,
In order to obtain a Gamma-trick version of the categorical distribution, it suffices to apply the Gamma trick to each of the Bernoulli distributions, thus being able to scale the data as in the continuous case. Note, however, that when using the Gamma trick on a K-dimensional categorical distribution, we increase the number of dimensions of the data. To account for this in our Lipschitz criterion, and in a similar way as in Eq. 11, we set the objective for each auxiliary variable to L*/K, such that, when added up, the auxiliary variables preserve the same objective as the rest of the (non-categorical) dimensions.
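The Bernoulli trick and its renormalisation step can be sketched as follows (our own illustration; here the per-class means are fitted directly rather than via BBVI):

```python
import numpy as np

rng = np.random.default_rng(4)
probs = np.array([0.2, 0.5, 0.3])
x = rng.choice(3, size=10000, p=probs)   # categorical data with K = 3 classes

onehot = np.eye(3)[x]                    # K binary columns, one per class
p_bern = onehot.mean(axis=0)             # each column fitted as an independent Bernoulli

# After learning each class independently, the means need not sum to one,
# so a valid categorical parameter is recovered by renormalising them.
p_cat = p_bern / p_bern.sum()
```

Treating the K columns independently is what makes each of them amenable to the Gamma trick, at the cost of losing the sum-to-one constraint during learning.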
The combination of the Gamma trick and the Lipschitz criterion constitutes the full algorithm we propose in this work, which we name Lipschitz standardization.
6 Experiments
In this section, we first study separately the effect of the building blocks of Lipschitz standardization (the Lipschitz criterion and the Gamma trick) using synthetic data. Afterwards, we evaluate their performance on different real datasets and models (i.e., mixture model, matrix factorization, and variational autoencoder). Refer to Appendix D for further details on the experimental setup.
Methods comparison. Throughout this section, we compare the performance of the following data scaling methods: i) id, which does not scale the original data; ii) max, which normalizes the data; iii) std, which standardizes the data; iv) ours-none, which applies the Lipschitz criterion only to continuous dimensions; v) ours-bern, which applies the Lipschitz criterion to continuous dimensions and the Bernoulli trick to the categorical (and thus binary) dimensions; and vi) ours-gamma, which is the proposed Lipschitz standardization algorithm, combining the Gamma trick for all discrete variables with the Lipschitz criterion.
6.1 Synthetic data
Experimental setup. We use data generated from a mixture model. Then, given the generated data, we rely on BBVI to approximate the posterior distribution of the latent variables, i.e., the mixture component assignments and parameters. In a first setting, we assume all dimensions to be continuous. In a second setting, different types of discrete variables are appended to the continuous data from the former setting. In both settings, we further distinguish between cases where all dimensions contain information about the clusters and cases in which some dimensions are uninformative about the clustering.
Metric. We compute the Kullback-Leibler (KL) divergence from the actual log-marginal distribution of the data to the one inferred by the model. We present our results in terms of a modified version of the Relative Percent Difference (RPD) that lies in a bounded interval, thus comparing the KL obtained by two different methods.
Results. Table 2 shows the results for the continuous setting (here, we omit the log-normal case since it is equivalent to the normal one) and compares the performance of our method, Lipschitz standardization, with the rest of the approaches. Here, KL-RPD values smaller than one mean that our method outperforms the competing approach, and KL-RPD values greater than one mean that the competing method results in more accurate learning. Hence, we can observe that ours-gamma outperforms the rest of the algorithms, with the difference becoming even larger as we consider more complex settings, such as learning with uninformative variables, which in turn act as a source of noise.
[Figure 3 caption: Missing imputation error over different datasets and missing percentages.]
Furthermore, Table 3 shows similar results when discrete dimensions are appended to the continuous data. In particular, in all but one case ours-gamma performs significantly better than learning using the original data, i.e., than the id approach (which was the second-best method in Table 2). Note that, in this toy example, both ours-bern and ours-gamma provide similar results, with ours-bern being slightly better when there are categorical dimensions in the data. An exception to these results is the setup with the informative Bernoulli, where the Gamma trick performs significantly worse than the id approach. We believe that these results are due to the particular learning dynamics of the mixture model and the reduced dimensionality of the data. However, as we will show in the experiments with real-world data, the Gamma trick becomes necessary for fair learning when dealing with more complex datasets.
In order to better understand the previous results, we depict in Figure 2 a setting extracted from Table 3 where unfair learning can be easily observed. In this figure, we observe that, for the id approach, the likelihood of the categorical dimension even decreases as we perform more steps of the gradient descent algorithm. In other words, when BBVI is run on the original data, learning the continuous dimension cannot be achieved without “forgetting” the discrete variable. In contrast, this behaviour does not appear when the data is scaled using Lipschitz standardization, which results in a similar rate of convergence, and thus fair learning, for both dimensions.
6.2 Real-world data
Experimental setup. We use six different datasets from the UCI repository (Dua & Graff, 2017) and apply BBVI to solve a missing-data imputation task. To show the flexibility of the proposed Lipschitz standardization, we consider three different generative models: i) a mixture model; ii) matrix factorization; and iii) a (vanilla) Variational Auto-Encoder (VAE) (Diederik et al., 2014). Additionally, we pick a likelihood for each dimension based on its observable properties (e.g., containing only positive real data).
Methods. Here, we compare (when applicable) the same data scaling methods as in the synthetic experiments, with the only difference being that we now consider the std and id methods as a single one, std/id, since data is standardized beforehand to provide a fair initialization across all methods. In short, we now consider: i) max; ii) std/id; iii) ours-none; iv) ours-bern; and v) ours-gamma.
Metric. Analogously to Nazabal et al. (2018), missing imputation error is used as the evaluation metric. Specifically, normalized mean squared error is used for numerical variables and error rate for nominal variables, with the final result being the average across all dimensions.
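A sketch of this metric (our own implementation; the exact normalisation of the numerical error is our assumption, here the observed range of each dimension):

```python
import numpy as np

def imputation_error(x_true, x_imp, nominal):
    """Average per-dimension imputation error.

    Nominal dimensions use the error rate; numerical dimensions use the
    root-mean-squared error normalised by the observed range (an assumption).
    """
    errors = []
    for d in range(x_true.shape[1]):
        t, i = x_true[:, d], x_imp[:, d]
        if nominal[d]:
            errors.append(np.mean(t != i))                       # error rate
        else:
            spread = t.max() - t.min()
            errors.append(np.sqrt(np.mean((t - i) ** 2)) / spread)
    return float(np.mean(errors))

x_true = np.array([[0.0, 1.0], [1.0, 3.0], [0.0, 5.0]])
x_imp  = np.array([[0.0, 1.0], [0.0, 3.0], [0.0, 5.0]])
err = imputation_error(x_true, x_imp, nominal=[True, False])
```

Averaging the normalized per-dimension errors keeps numerical and nominal dimensions on comparable scales in the final score.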
Results. Figure 3 shows the results, divided by dataset and percentage of missing data. They show that, for all these datasets and models, applying Lipschitz standardization results in an imputation error at least as good as that of the best competing method. Furthermore, it can be observed that in highly heterogeneous datasets, such as Adult and defaultCredit, and in datasets with exclusively nominal dimensions, i.e., letter and Breast, Lipschitz standardization (i.e., ours-gamma) clearly outperforms the rest of the methods, to the extent of beating the state-of-the-art results achieved by Nazabal et al. (2018) (see, e.g., Breast).
It is also worth mentioning that, in the case of defaultCredit, outliers were removed from the box-plot in Figure 3 to provide a clearer visualization of the results. In this particular dataset, all methods but Lipschitz standardization failed at some point, obtaining large imputation errors, as shown in Appendix E. This poor performance may be the result of unfair learning due to the presence of discrete dimensions. The ability of Lipschitz standardization to scale all the dimensions, including the discrete ones, makes it the most robust approach among the ones considered here.
As a final remark, we show in Table 4 the scaling factors provided by each of the considered data-scaling methods for the continuous dimensions of the Wine dataset. Here, we observe that the optimal Lipschitz scaling factors may differ significantly from those resulting from standardization and normalization. Importantly, no general rule seems to determine whether the scaling factor for Lipschitz standardization is larger or smaller than for the other methods.
7 Conclusions
In this work, we have studied the effect that data scaling approaches, such as standardization and normalization, have on models learnt via gradient-based approaches. Our analysis has shown that no general rule exists for choosing between existing data scaling methods, but it has fortunately allowed us to introduce Lipschitz standardization, a novel algorithm that scales each (continuous or discrete) dimension of the data such that they all converge at a similar and close-to-optimal rate. Our experiments on real-world data show that our algorithm results in accurate, fair, and robust learning, with worst-case performance at least as good as that of existing approaches. In fact, our experiments show that Lipschitz standardization becomes particularly necessary as the data become more complex, e.g., highly heterogeneous.
We leave for future research questions such as implementing Lipschitz standardization in a probabilistic programming pipeline, or applying it in the context of other learning paradigms, such as multi-task learning. Finally, we would also like to study whether the Lipschitz criterion proposed in this paper can be adapted to be applied directly to the gradients, instead of to the data as a pre-processing step.
- Aksoy & Haralick (2001) Aksoy, S. and Haralick, R. M. Feature normalization and likelihood-based similarity measures for image retrieval. Pattern Recognition Letters, 22(5):563–582, 2001.
- Bishop (2006) Bishop, C. M. Pattern recognition and machine learning. springer, 2006.
- Blei et al. (2017) Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. Variational inference: A review for statisticians. Journal of the American statistical Association, 112(518):859–877, 2017.
- Diederik et al. (2014) Diederik, P. K., Welling, M., et al. Auto-encoding variational bayes. In Proceedings of the International Conference on Learning Representations (ICLR), volume 1, 2014.
- Dua & Graff (2017) Dua, D. and Graff, C. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
- Gnanadesikan et al. (1995) Gnanadesikan, R., Kettenring, J. R., and Tsao, S. L. Weighting and selection of variables for cluster analysis. Journal of Classification, 12(1):113–136, 1995.
- Han et al. (2011) Han, J., Pei, J., and Kamber, M. Data mining: concepts and techniques. Elsevier, 2011.
- Hoffman et al. (2013) Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013.
- Ioffe & Szegedy (2015) Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
- Juszczak et al. (2002) Juszczak, P., Tax, D., and Duin, R. P. Feature scaling in support vector data description. In Proc. asci, pp. 95–102. Citeseer, 2002.
- Milligan & Cooper (1988) Milligan, G. W. and Cooper, M. C. A study of standardization of variables in cluster analysis. Journal of Classification, 5(2):181–204, 1988.
- Nazabal et al. (2018) Nazabal, A., Olmos, P. M., Ghahramani, Z., and Valera, I. Handling incomplete heterogeneous data using vaes. arXiv preprint arXiv:1807.03653, 2018.
- Nesterov (2018) Nesterov, Y. Lectures on convex optimization, volume 137. Springer, 2018.
- Ranganath et al. (2014) Ranganath, R., Gerrish, S., and Blei, D. Black box variational inference. Artificial Intelligence and Statistics, pp. 814–822, 2014.
- Santurkar et al. (2018) Santurkar, S., Tsipras, D., Ilyas, A., and Madry, A. How does batch normalization help optimization? In Advances in Neural Information Processing Systems, pp. 2483–2493, 2018.
- Yu et al. (2020) Yu, T., Kumar, S., Gupta, A., Levine, S., Hausman, K., and Finn, C. Gradient surgery for multi-task learning. arXiv preprint arXiv:2001.06782, 2020.
Appendix A Proof of Proposition 2.1
Let $p(x; \boldsymbol{\eta}) = h(x) \exp\left(\langle \boldsymbol{\eta}, T(x) \rangle - A(\boldsymbol{\eta})\right)$ be a density function of the exponential family, with variable $x$, natural parameters $\boldsymbol{\eta}$, base measure $h$, sufficient statistics $T$, and log-partition function $A$. Assume a scaling function $\omega$ such that for any $x$ it defines the function (and random variable) $\tilde{x} = \omega \circ x$. If
for every the function is bijective;
the base measure factorises, ; and
every sufficient statistics factorises as , where either or .
Then, by defining such that , where is the pairwise product and , we have that
First, we explicitly write the scaled version of the log-likelihood function:
Next, using the same conditions again, we compute the value of the log-partition term:
Result 2 is a consequence of the first result and the chain rule:
Finally, result 3 comes as a direct consequence of applying the chain rule yet again:
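The first result can also be checked numerically for a concrete member of the family. The sketch below uses the Gaussian, whose sufficient statistics $(x, x^2)$ factorise under linear scaling; the symbols `mu`, `sigma`, the scaling factor `w`, and the rescaling $f(\omega) = (1/\omega, 1/\omega^2)$ of the natural parameters are our own illustration rather than the paper's notation.

```python
import numpy as np

def normal_loglik(x, mu, sigma):
    # Gaussian log-density written out explicitly.
    return -0.5 * np.log(2.0 * np.pi * sigma**2) - (x - mu)**2 / (2.0 * sigma**2)

# Scaling the data as x_tilde = w * x and rescaling the natural parameters
# eta = (mu/sigma^2, -1/(2 sigma^2)) by f(w) = (1/w, 1/w^2) is equivalent to
# evaluating a Gaussian with mean w * mu and scale w * sigma.
mu, sigma, w = 2.0, 1.5, 0.1
x = np.linspace(-3.0, 6.0, 7)

ll_orig = normal_loglik(x, mu, sigma)
ll_scaled = normal_loglik(w * x, w * mu, w * sigma)

# The scaled log-likelihood differs from the original only by a constant
# (here -log(w)), so the landscape is shifted, not reshaped.
assert np.allclose(ll_scaled, ll_orig - np.log(w))
```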
Appendix B Proof of Proposition 4.1
Let $\hat{L}_i$ be the Lipschitz constant associated with the $i$-th natural parameter, and $L^*$ the objective value. Then, if it exists, the solution of the optimization problem in Equation 11 is unique and has the following closed-form solutions:
Recall that we want to solve the optimization problem in Equation 11, that is,
where $\tilde{L}_i$ is the Lipschitz constant of the $i$-th partial derivative of the scaled log-likelihood.
For simplicity, let us denote by $\hat{L}_i$ the estimator of the (local) Lipschitz constant for the original data (which is a function of the second partial derivatives of the log-likelihood), rather than the actual Lipschitz constant. The important assumptions here are that all $\hat{L}_i$ are non-negative real numbers and $L^*$ is positive.
All we need to do is solve Equation 26 for each distribution.
This gives four different solutions. Using the fact that, by construction, the scaling factor is positive, we can reduce them to a single solution.
First, as it has to be positive, the negative roots can be discarded. Moreover, as the Lipschitz estimators are non-negative, the remaining spurious root can be discarded as well, resulting in a single solution:
In order for this solution to exist, three conditions need to hold. Specifically:
since it is positive by construction.
which holds as long as .
Therefore, the (log-)normal distribution has a unique solution if and only if the condition above holds, in which case it is given by
Therefore there are two solutions. Again, using the positivity constraint, we have that the only valid solution is
which exists if and only if and .
These solutions only exist under the previous condition, and the positive one is the only valid one.
Appendix C Proof of the statement about Gamma trick noise
Section 5 introduced the concept of the Gamma trick, which acts as an approximation for discrete distributions. Moreover, it was explained that the trick is beneficial if the original variable is somewhat far from zero.
This statement, which may at first look arbitrary, is justified by the following result: the second derivative of a Gamma log-likelihood with respect to the first natural parameter rapidly decreases as the data move away from zero.
For the case of a distribution in the exponential family, it is a well-known result that $\nabla^2_{\boldsymbol{\eta}} \log p(x; \boldsymbol{\eta}) = -\nabla^2_{\boldsymbol{\eta}} A(\boldsymbol{\eta}) = -\operatorname{Cov}\left[T(x)\right]$.
Therefore, in the case of the Gamma distribution with shape $\alpha$, we have that $\partial^2_{\eta_1} \log p(x; \boldsymbol{\eta}) = -\operatorname{Var}\left[\log x\right] = -\psi'(\alpha)$, where $\psi'$ denotes the trigamma function.
Figure 4 shows a plot of the above expression as a function of the shape. It is easy to observe that, as the shape grows, the value of (our approximation to) the Lipschitz constant drastically decreases.
Finally, assuming that discrete data are natural numbers implies that the mode is at least one, which in practice means that the shape parameter is bigger than one, thus ensuring that the value of (our approximation to) the Lipschitz constant is negligible.
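This decay can be checked numerically: for a Gamma with shape $\alpha$, the curvature term above equals the trigamma function $\psi'(\alpha)$, available in SciPy as `polygamma(1, alpha)`. The concrete shape values below are illustrative only.

```python
import numpy as np
from scipy.special import polygamma

# Second derivative of the Gamma log-partition w.r.t. the first natural
# parameter eta1 = alpha - 1: the trigamma function psi'(alpha).
alphas = np.array([1.0, 2.0, 5.0, 10.0, 50.0])
curvature = polygamma(1, alphas)

# The curvature decays quickly as the shape grows, i.e., as the data
# move away from zero, making the Gamma-trick noise term negligible.
assert np.all(np.diff(curvature) < 0)      # strictly decreasing in alpha
assert curvature[-1] < 0.05 * curvature[0]
```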
Appendix D Experiments details
Here we give a deeper description of the models appearing in the experiments.
where the model parameters are $D$-dimensional vectors and the latent cluster assignments are one-hot encodings of size $K$.
To ensure that the parameters fulfil the domain restrictions of each particular distribution, the following transformations are performed after applying the link function:
- Greater than a given constant:
- Smaller than a given constant:
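These domain constraints can be implemented with softplus-based maps, in line with the softplus transformations mentioned for the other models. The helper names and the generic constant `c` below are ours; the exact constants used in the paper are not reproduced here.

```python
import numpy as np

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return np.logaddexp(0.0, x)

def greater_than(raw, c):
    # Map an unconstrained value to the open interval (c, +inf).
    return c + softplus(raw)

def smaller_than(raw, c):
    # Map an unconstrained value to the open interval (-inf, c).
    return c - softplus(raw)

raw = np.array([-5.0, 0.0, 5.0])
assert np.all(greater_than(raw, 0.0) > 0.0)
assert np.all(smaller_than(raw, 1.0) < 1.0)
```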
When it comes to the real-world examples, the only parameter that has to be specified for this model is the number of clusters, $K$, whose value depends on whether the dataset is Breast, Wine, or Spam.
In order to implement the discrete latent parameters such that they can be trained with automatic differentiation, the latent categorical distribution is implemented using a Gumbel-Softmax distribution whose temperature is updated every 20 epochs as:
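A minimal NumPy sketch of the Gumbel-Softmax relaxation is shown below; a PyTorch implementation would use `torch.distributions.RelaxedOneHotCategorical` instead. The multiplicative decay factor in the annealing loop is assumed purely for illustration, as the paper's exact update rule is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax_sample(logits, temperature):
    # Draw one relaxed one-hot sample from the categorical given by logits.
    u = rng.uniform(1e-9, 1.0 - 1e-9, size=logits.shape)
    g = -np.log(-np.log(u))            # standard Gumbel noise
    z = (logits + g) / temperature
    z = z - z.max()                    # stabilize the softmax
    e = np.exp(z)
    return e / e.sum()

logits = np.log(np.array([0.2, 0.5, 0.3]))
temperature = 1.0
for epoch in range(100):
    # Hypothetical schedule: halve the temperature every 20 epochs.
    if epoch > 0 and epoch % 20 == 0:
        temperature *= 0.5
    sample = gumbel_softmax_sample(logits, temperature)

# Samples live on the probability simplex and become near one-hot as the
# temperature decreases.
assert np.isclose(sample.sum(), 1.0) and np.all(sample >= 0.0)
```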
Matrix factorization. Similar to the mixture model, the matrix factorization model follows the same graphical model and is (almost) fully described by:
There are some details worth noting. First, the variance of the local parameters is shared among instances and learnt as a deterministic parameter. Likewise, only the first parameter of each distribution is learnt following this scheme; the remaining parameters are learnt using gradient descent as deterministic parameters.
The same transformations as in the mixture model are applied to the parameters in order to fulfil their particular domain requirements.
When it comes to the real-world examples, the only parameter that has to be specified for this model is the latent size, which we set automatically as half the number of dimensions of each dataset (before applying any trick that may increase the dimensionality).
Variational Auto-Encoder. We follow the structure of a vanilla VAE with the following components:
We assume normal latent variables with a standard normal as prior.
Hidden layers have 256 neurons.
The latent size is set as a fraction of the (original) number of dimensions of the data.
All intermediate layers use dropout to avoid overfitting (e.g., in Breast, which has a small number of instances).
Layers are initialized using a Xavier uniform policy.
Depending on the dataset, the model can become unstable in the final stages of training. Therefore, we halve the learning rate when we reach half of the total epochs, and once again three quarters of the way through training.
Specifics about the encoder:
As we have to avoid using the missing data (since they form our test set), we implement an input-dropout layer as in Nazabal et al. (2018).
In order to guarantee a common input (and thus a common, well-behaved neural network) across all data scaling methods, we place a batch-normalization layer at the beginning of the encoder. Note that this does not interfere with the goal of this work, which concerns the evaluation of the loss function.
In order to obtain the distributional parameters of the latent Gaussian, we pass the output of the encoder through two linear layers, one for the mean and another for the log-scale; the latter is transformed into the scale via a softplus function.
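The input-dropout layer above can be sketched with plain arrays: following the spirit of Nazabal et al. (2018), missing entries are zeroed out before entering the encoder, so the network never sees the held-out values. The mask convention and the function name below are ours.

```python
import numpy as np

def input_dropout(x, observed_mask):
    # Zero out missing entries so the encoder only sees observed values.
    return x * observed_mask

x = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
mask = np.array([[1.0, 0.0, 1.0],
                 [0.0, 1.0, 1.0]])   # 0 marks a missing (held-out) entry

x_in = input_dropout(x, mask)
assert np.all(x_in[mask == 0.0] == 0.0)              # missing entries hidden
assert np.all(x_in[mask == 1.0] == x[mask == 1.0])   # observed entries kept
```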
Specifics about the decoder:
The decoder output size is set to the number of parameters to learn, each of which is transformed with a softplus function where needed to fulfil its distributional restrictions, as done for the other models.
D.2 Experimental setup
For the experiments with only continuous distributions, we consider a variety of distributional parameters for each distribution, trying to cover as many plausible scenarios as possible. We then take every possible combination of distributions, generate samples, and train a mixture model with different seeds. The mixture probabilities are assumed uniform, and the distributional parameters considered can be seen in Tables 5 and 6.
All experiments are trained using Adam with a fixed learning rate and PyTorch's default options. For the synthetic experiments we do not use stochastic gradient descent, i.e., the batch size is set to the full dataset size.