# Lipschitz standardization for robust multivariate learning

Current trends in machine learning rely on out-of-the-box gradient-based approaches. With the aim of mitigating numerical errors and to improve the convergence of the learning process, a common empirical practice is to standardize or normalize the data. However, there is a lack of theoretical analysis regarding why and when these methods result in an improvement of the learning process. In this work, we first study these methods in the context of black-box variational inference, specifically analyzing the effect that scaling the data has on the smoothness of the optimization landscape. Our analysis shows that no general rule applies in order to decide which of the existing data scaling methods, or even if they, will improve the learning process. Second, we highlight the issues that arise when dealing with multivariate data, due to the discrepancy in smoothness of the likelihood functions for different variables, and the inability to scale discrete data. Finally, we propose a novel Lipschitz standardization, and its extension for discrete data, which overcomes the aforementioned limitations. Specifically, as backed by our experiments, Lipschitz standardization i) favors a fairer learning across different variables in the data; and ii) results in faster and more accurate learning.

## Authors

• 2 publications
• 20 publications
• ### Provable Smoothness Guarantees for Black-Box Variational Inference

Black-box variational inference tries to approximate a complex target di...
01/24/2019 ∙ by Justin Domke, et al. ∙ 0

• ### Lipschitz Optimisation for Lipschitz Interpolation

Techniques known as Nonlinear Set Membership prediction, Kinky Inference...
02/28/2017 ∙ by Jan-Peter Calliess, et al. ∙ 0

• ### Fast Black-box Variational Inference through Stochastic Trust-Region Optimization

We introduce TrustVI, a fast second-order algorithm for black-box variat...
06/07/2017 ∙ by Jeffrey Regier, et al. ∙ 0

• ### Analysis of Gradient Clipping and Adaptive Scaling with a Relaxed Smoothness Condition

We provide a theoretical explanation for the fast convergence of gradien...
05/28/2019 ∙ by Jingzhao Zhang, et al. ∙ 0

• ### Provable Bregman-divergence based Methods for Nonconvex and Non-Lipschitz Problems

The (global) Lipschitz smoothness condition is crucial in establishing t...
04/22/2019 ∙ by Qiuwei Li, et al. ∙ 0

• ### Boosting Black Box Variational Inference

Approximating a probability density in a tractable manner is a central t...
06/06/2018 ∙ by Francesco Locatello, et al. ∙ 0

• ### Comparative analysis of SVIT and Skype features in e-Learning process

The article discusses capabilities of the SVIT and Skype programs in ter...
02/22/2018 ∙ by K. S. Malakhov, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Over the last few years there has been a shift in the way machine learning models are trained. Most approaches, ranging from supervised and multi-task learning to probabilistic methods, such as variational inference, are nowadays based on gradient-based optimization methods. In particular, they often rely on stochastic gradient descent, due to its simplicity and data scalability.

In this context, a common practice before the learning step is to standardize—or more generally, apply a data scaling (a.k.a., feature scaling) transformation to— the data (Han et al., 2011, Chapter 3.5.2). Empirical results show that data scaling often alleviates numerical issues and yields more accurate results on the learning process. However, despite the broad use of data scaling methods, little to no attention has been paid into answering basic questions such as: Why do these methods improve the learning process? When should I choose one method over the others?

In this work, we aim to shed some light on the effect that data scaling methods, such as standardization and normalization, have on gradient-based learning algorithms. To this end, we rely on black-box variational inference (BBVI) (Ranganath et al., 2014) as a use case of gradient-based learning algorithms, and on the exponential family due to its well-studied theoretical properties. These properties allow us to analyze how the gradients behave as a function of the scaling factor, or more precisely, to find the relationship between Lipschitz constants of the gradient of the objective function (in our example, the log-likelihood) evaluated using the original data and the scaled (e.g., standardized) data.

Our analysis shows that both, standardization and normalization, usually result in smoother optimization landscapes, which are often easier to optimize (Santurkar et al., 2018, Section 3.1). However, our study also shows that there is no general rule in order to decide whether or not to apply data scaling, nor which data scaling method will improve the optimization process.

Fortunately, our framework allows us to derive a novel algorithm, Lipschitz standardization

, whose scaling factor results in the desired (estimator of the) Lipschitz constant of the function gradient to be optimized. Moreover, since data scaling is only suitable for continuous variables, we extend our algorithm to handle discrete data using optimally scaled continuous variables.

Lipschitz standardization fulfils two important properties. First, it allows us to scale the data optimally for a given learning rate, thus leading to faster and more accurate (overall) learning processes. Second, it favors a fairer learning across different variables in the data, i.e., it eases that all the variables in the data are accurately learnt. The latter property is thus of particular interest when dealing, e.g., with multivariate Bayesian inference or multi-task learning problems, where a subset of the dimensions (e.g., tasks or observed variables) often dominate the learning process

(Yu et al., 2020).

Our experiments on synthetic and real-world data, as well as on different generative models, show the effectiveness of the proposed Lipschitz standardization. Lipschitz standardization does not only lead to fairer and more accurate learning than existing approaches, but it also mitigates numerical instabilities, improving the robustness of the learning process.

## 2 Problem Statement

In this section we introduce to the reader the notation we adopt throughout the paper, as well the framework we choose to evaluate different data transformation approaches.

Within existing approaches, we here consider only data scaling transformations of the form , where and are respectively the original and the scaled data, and is the scaling factor. There are two main reasons for this choice: i) they preserve important properties of the data distribution, such as its domain and tails; and ii) they are broadly used in practice (Han et al., 2011). Note that as shifting the data, , may violate distributional restrictions (e.g., non-negativity), we assume without loss of generality that the data may have been already shifted and its likelihood is selected afterwards. Specifically, we focus on two broadly used data scaling methods:

 Standardization: ~x =x/std, (1) Normalization: ~x =x/max, (2)

where and

denote the values of, respectively, the empirical standard deviation and absolute maximum value of the observed data.

In order to analyze the effect of data scaling in gradient-based learning, we rely here on a large class of probabilistic models as a use case. Following Hoffman et al. (2013), we consider the fairly simple —but general— graphical model depicted in Figure 1

. Thus, the joint distribution over the observed variables (the data)

, the set of local latent variables , and the set of global latent variables , can be written as

 p(X,Z,β)=p(β)N∏n=1p(xn|zn,β)p(zn). (3)

Here, each observation

corresponds to a D-dimensional feature vector

. For simplicity, we further assume that it factorizes as

 p(xn|zn,β)=D∏d=1pd(xnd|ηnd), (4)

where the latent variables and determine the likelihood parameters for each observation .

Furthermore, and without loss of generality, we consider black-box variational inference (BBVI) (Ranganath et al., 2014) to approximate the posterior distribution of the latent variables, . BBVI uses a mean-field variational distribution family of the form , where and are the local and global variational parameters, respectively. More in detail, BBVI relies on (stochastic) gradient ascent to find the parameters that maximise the evidence lower bound (ELBO),111

Or equivalently, that minimise the Kullback-Leibler divergence from

to  (Blei et al., 2017) which is given by

 L(X,γ,ϕ)=Eq[logp(X|Z,β)]−KL(q(Z,β)∥p(Z,β)).

Thus, the gradient ascent algorithm updates the global parameters as , and the local as , where is the current step of the algorithm.

Importantly, to disentangle the contribution of the data and the model parameters on the gradient evaluation, one can apply the chain rule with respect to the likelihood parameters

in the computation of the log-likelihood term of the gradient, i.e., ; and thus, making explicit the effect of data scaling, e.g., , on the gradient-based learning process. To this end, we restrict ourselves to the exponential family (Bishop, 2006, Chapter 2.4) since, as shown in the next section, it exhibits nice theoretical properties.

### 2.1 Scaling the exponential family

From now on, we consider each dimension in the observed data to be modeled using a member of the exponential family. In other words, we assume here all likelihood functions to be of the form

 p(x|η)=h(x)exp[η⊤T(x)−A(η)], (5)

where are the natural parameters and a function of the latent variables are the sufficient statistics, is the base measure, and is the log-partition function. Note that and are vectors of size , and that we here remove the sub-indexes to avoid cluttering the notation.

Working with the exponential family let us draw the following relationships (Appendix A):

###### Proposition 2.1.

Let be a density function of the exponential family with variable and parameters . Assume a scaling function such that for any

it defines the function (and random variable)

. If

1. for every the function is bijective;

2. the base measure factorises, ; and

3. every sufficient statistics factorises as , where either or .

Then, by defining such that , where is the pairwise product and , we have that

1. ,

2. .

In words, Proposition 2.1 provide a relationship between the log-likelihood functions, as well as their first and second-order partial derivatives, of the original and the scaled data. Importantly, although requirements (a)-(c) may look restrictive at first, as reported in Table 1

, many commonly used distributions fulfil such properties. It also is worth-mentioning that in the case of the log-normal distribution the scaling function is

### 2.2 The learning pipeline

At this point, we consider important to clarify which is the current flow since the data is given, until the model parameters are inferred. This process can be described as a simple diagram given by

 xi)→˜xii)−→˜ηiii)−−→η, (6)

where i) is the data transformation function; ii) is the result from the inference model, where we recall that the natural paramaters are a function of the latent variables and ; and iii) is the function that recovers the original parameters based on the scaled ones, as shown in Proposition 2.1.

## 3 Why does standardization work?

Standardization and normalization are widely used in statistics and classical machine learning, since they tend to improve the accuracy of the resulting models (Juszczak et al., 2002; Milligan & Cooper, 1988). However, there is a priori no way to say which one to use.

As an example, in distance-based machine learning methods, e.g. clustering algorithms, the effectiveness of data standardization and normalization may be explained by the fact that they bring all the data mass into a similar range, making the distance between points comparable among dimensions (Aksoy & Haralick, 2001). In this context, Milligan & Cooper (1988) reported that data normalization empirically works better than standardization. Juszczak et al. (2002) backed such conclusion, yet they concluded, as Gnanadesikan et al. (1995) did, that worry-free scaling methods do not exist and further research is needed.

In other approaches such as maximum likelihood or variational inference,222In the Bayesian framework, one may also argue that standardization eases the prior selection process (even for those random variables indirectly related with the data), improving the overall performance of the algorithm. the distance argument becomes less convincing, since explicit distance between points is no longer evaluated.

An alternative argument is that these methods improve numerical stability by moving the data, and thus the model parameters, to a well-behaved part of the real space. This is sensible since computers struggle to work with tiny —and large— numbers and that would inevitably affect the learning process.

Next, we introduce, to the best of our knowledge, a novel reasoning to explain the effectiveness of standardization and normalization in gradient-based methods. In a similar fashion as Ioffe & Szegedy (2015)

showed that batch normalization smooths the optimization landscape of the loss function with respect to the neural network parameters, we show that standardization and normalization often smooth out the optimization landscape of the log-likelihood-based objective functions.

### 3.1 “Standardizing” the optimization landscape

We say that a real-valued function is -smooth, if it is twice differentiable on , and its first derivative is Lipschitz continuous with constant . That is, if for any , we have

 |g′(a)−g′(b)|≤L|a−b|. (7)

Let us now denote the logarithm of likelihood functions (consider, e.g., the distributions in Table 1) of the original and the scaled data, respectively, by and . Let us also assume that the optimization landscape for the -th natural parameter, , of the original log-likelihood is -smooth. Then, using Proposition 2.1, we can obtain a relationship between the Lipschitz constant of the scaled log-likelihood with respect to the original one,

 |∂˜ηiℓ(˜x;˜η=˜a)−∂˜ηiℓ(˜x;˜η=˜b)| =|fi(ω)| |∂ηiℓ(x;η=a)−∂ηiℓ(x;η=b)| ≤|fi(ω)|Li |ai−bi| =fi(ω)2Li |˜ai−˜bi|, (8)

where are two different (scaled) parameter vectors that differ only in the -th component, i.e., . This implies that the optimization landscape of the -th natural parameter of the scaled log-likelihood is -smooth, with .

Note that, in Eq. 3.1, values of smaller than one imply that the optimization landscape is smoother for the scaled data than for the original one. Thus, under reasonably small values of , one may expect an improvement in the learning process when using the scaled data. In contrast, when is greater than one, scaling the data will “sharpen” the optimization landscape.

Table 1

shows that for common likelihood functions, such as (log-)normal and gamma distributions,

is smaller than one if and only if is also smaller than one. For such distributions, data standardization (normalization) is performed with a scaling factor smaller than 1 only when the empirical standard deviation (maximum value) of the original data is larger than one, which is common in real data. Note though that not all distributions in Table 1 smooth out when . For example, the inverse-gamma distribution smooths out in the opposite case.

### 3.2 Lipschitz constant and learning rate

A question that promptly arises is whether smaller Lipschitz constants are always better in terms of gradient-based optimization. However, it is relatively easy to find examples where normalization works worse than standardization, as shown in Figure 3, even though its Lipschitz constants are usually smaller.

One could argue that the problem comes out of numerical issues, since normalization may result in transformed data values that are very close to zero (e.g., in heavy-tailed distributions). This does not explain common cases where scaling factors for both, standardization and normalization, are in the same order of magnitude. An alternative explanation is that too small values of may lead to over-smoothed optimization landscapes, thus deteriorating the learning process (e.g., due to negligible gradients).

In the context of gradient-based non-convex optimization, Nesterov (2018) showed that for the case of a differentiable function whose first derivative is Lipschitz continuous with constant , the optimal gradient step size is . Unfortunately, when standardization or normalization are used, whether the Lipschitz constant is near to the optimal boils down to luck. The same holds for obtaining similar Lipschitz constants across dimensions, which is important in multivariate problems to avoid that the learning process is dominated by a subset of the dimensions.

## 4 Can we do better?

In this section, we propose a novel data scaling algorithm capable of controlling the Lipschitz constant of the derivative for a given log-likelihood function. We remark that this idea can be extrapolated to other objective functions beyond the ELBO and likelihood function without loss of generality. Intuitively, this algorithm moves the data into a region of the domain where the Lipschitz constant of the optimization landscape is close to the optimum.

Unfortunately, the existence of the Lipschitz constant is not ensured (e.g., when the second derivative is unbounded). Even if it exists, the actual computation of this constant is, in general, not straight-forward. Therefore, we instead make use of a local estimator of the Lipschitz constant.

Specifically, we use the second derivative of the log-likelihood, , as an estimator of the Lipschitz constant, . This choice becomes clear when we recall the mean value theorem (adapted to our settings).

###### Theorem 4.1 (Mean Value Theorem).

Let be a twice-differentiable real-valued function with respect to . Then, for any two values and , there exists such that

 g′(a)−g′(b)=g′′(c)(a−b). (9)

Thus, by taking the maximum value of the second derivative in Eq. 9, we obtain the same inequality as in Eq. 7. That is, , where .

By using the above result, we can obtain an estimator of the (local) Lipschitz as

 Li≈maxn ∣∣∂2ηilogp(xn;ˆη)∣∣, (10)

where is the empirical estimator of the natural parameters for the given distribution. For example, in the case of , with and

being the empirical mean and variance of the observed data

, we have and .

With a tractable estimator of the Lipschitz constant and its relationship between original and scaled data, given in Proposition 2.1, it is now possible to build an efficient algorithm to control the smoothness of the log-likelihood.

Note that the scaling factor controls the Lipschitz constants of all the natural parameters of the log-likelihood, so we adopt as criterion that the sum of these constants amounts to a given objective value, . This can be expressed as an optimization problem, i.e.,

 ω∗=argminω(I∑i=1˜Li−L∗)2, (11)

where is the Lipschitz constant of the scaled data with respect to the th-parameter.

For many distributions, it is possible to find a unique closed-form solution of the above optimization problem, as the following proposition showcases.

###### Proposition 4.1.

Let be the Lipschitz constant of the -th natural parameter, and the objective value. Then, if it exists, the solution of the optimization problem in Eq. 11 is unique and has the following closed-form solutions (Appendix B):

• (Log-)normal:

 ω∗= ⎷−L1+√−L21+4L2L∗2L2⟺L2>0.
• Gamma:

 ω∗=√L∗−L1L2⟺L10.
• Exponential:

 ω∗=√L∗L1⟺L1>0.

Remark. Alternatively, we may solve the optimization problem in Eq. 11 by using a gradient descent algorithm where, again, we can take advantage of Proposition 2.1 to recompute the estimator without re-evaluating the log-likelihood at every step of the optimization algorithm. As a result, while we have centered our analysis in BBVI using the exponential family, we remark here that the optimization problem in Eq. 11 can be solved for any objective function (beyond the log-likelihood), as long as we can compute its estimator of the Lipschitz constant in a similar fashion as in Eq. 10. That is, as long as the objective function is twice differentiable with respect to the parameters we aim to learn.

### 4.1 Faster and fairer learning

Note that Lipschitz constants play an important role in the optimization process. On one hand, having a close-to-optimal Lipschitz constant translates to faster convergence. On the other hand, sharing the same constant across dimensions ensures that each dimension is learnt at a similar rate, thus avoiding unfair learning – see, e.g., Figure 2 for a visual example of unfair learning.

Note that the Lipschitz criterion introduced in Eq. 11 enables selecting the Lipschitz constant for each dimension. We also remark that, while for each dimension there is a unique optimal learning rate, in practice, only a single learning rate is set.

Following previous notation, if we now let be the set of scaling factors for all dimensions in the data, we can rewrite the Lipschitz criterion as where is the Lipschitz constant of the scaled data with respect to the th-parameter, , and is the target Lipschitz constant for the -th dimension of the data. Thus, in order for fair learning across dimensions to happen, we enforce all the per-dimension Lipschitz constants to be equal, i.e., for all . Since in our use-case the optimization is not performed over but , we apply the chain rule, i.e.,

 ∂ϕℓ(˜x;˜η)=D∑d=1Id∑i=1∂ϕ˜ηdi ∂˜ηdiℓ(˜xd;˜ηd), (12)

so that the Lipschitz condition can be now approximated as , and therefore, per-dimension Lipschitz condition is given by:

 Id∑i=1˜Ldi=L∗d=1Dα∀d∈{1,2,…,D}. (13)

## 5 Discrete data

Previously, we introduced the idea of making the optimization landscape equally smooth for every data dimension. Up to this point, our algorithm only applies to continuous data, and thus, likelihood functions. However, real-world datasets are often characterized by the diversity of both continuous and discrete data types —and thus likelihoods— that they present.

Next, we propose an approach to deal with discrete data, so that they can also be scaled to be Lipschitz optimal, and thus, to be learnt fairly with the rest of continuous dimensions. Specifically, this approach can be summarized in three steps: i) transform the discrete dimensions to continuous data; ii) use a set of Gamma distributions to fit the now continuous dimensions, which we can Lipschitz standardize; and iii) estimate the parameters of the original discrete distribution using the learnt continuous distribution. In the following we will refer to this approach as Gamma trick.

Discrete likelihoods can be separated in two different types: those with a single degree of freedom —that is, a single uni-dimensional parameter—, and those with more than a degree of freedom. In this paper, we consider the Bernoulli, Poisson and categorical distributions, since they appear in many real-world applications.

From this point onward, and without loss of generality, we assume discrete data to be a subset of the natural numbers. The reason underneath is that, when using the Gamma trick, shifting the data away from zero causes the estimator of its first parameter to rapidly decrease, which makes easier to fulfil the constraint in Eq. 4.1, necessary to find the optimal scaling factor. For further details see Appendix C.

In order to turn the original data into continuous form we apply a transformation of the form , where is a continuous noise variable that should: i) have as support a subset of the unit interval with non-zero measure such that, when added to , the original value is identifiable; and ii) preserve the original shape of the data as much as possible. As an example, in our experiments, is sampled from a distribution.

Recall that in Eq. 6 we presented the learning pipeline that data follow. For the case when the Gamma trick is applied, this pipeline can be re-written as

 x→¯¯¯x→˜x→˜η→¯¯¯η→¯¯¯μ→μ→η, (14)

where is the expected value of the discrete variable, is the mean of the auxiliary continuous variable, and is computed using the de-standardized natural parameters , i.e., it is a function of the natural parameters, .

For example, distributions with one degree of freedom, such as the Bernoulli and Poisson distributions, are characterised by their expected value. Hence, to recover the original distributional parameters after learning, it is enough to compute the mean of the auxiliary continuous distribution, i.e.,

 μ=¯¯¯μ−E[ϵ], (15)

and recover the original parameters using the average value of the original discrete data. Specifically, for a fixed , we obtain the likelihood parameters as for the Bernoulli, and as for the Poisson distribution.

Next, we focus on the categorical distribution as an example of distribution with various degrees of freedom. In particular, it is characterized by a -dimensional vector parameter,

, representing the probability of being assigned to each class. By using a one-hot representation of the data and treating each class (now binary variable) as an independent Bernoulli dimension, we can learn each class individually by the price of increasing the number of dimensions. We refer to this approximation as

Bernoulli trick. Finally, we can recover the original parameters by ensuring that their means add up to one. That is,

 πk=μk∑Ki=1μi∀k∈{1,2,…,K}, (16)

where .

In order to obtain a Gamma trick version for the categorical distribution, it suffices to apply the Gamma trick to each of the Bernoulli distributions, thus being able to scale the data as in the continuous case. Note however that, when using the Gamma trick on a

-dimensional categorical distribution, we increase the number of dimensions by . To account for that in our Lipschitz criterion, and in a similar way as we did in Eq. 11, we set the objective for each auxiliary variable to , such that, when added up, it preserves the same objective as the rest of (non-categorical) dimensions, i.e., .

The combination of the Gamma trick and the Lipschitz criterion constitutes the full algorithm we propose in this work, which we name as Lipschitz standardization.

## 6 Experiments

In this section, we first study separately the effect of Lipschitz standardization building blocks (Lipschitz criterion and Gamma trick) using synthetic data. Afterwards, we evaluate their performance on different real datasets and models (i.e., mixture model, matrix factorization, and variational autoencoders). Refer to Appendix

D for further details on the experimental setup.

Methods comparison. Thorough this section, we compare the performance of the following data scaling methods: i) id, which does not scale the original data; ii) max, which normalizes the data; iii) std, which standardizes the data; iv) ours-none; which applies the Lipschitz criterion only to continuous dimensions; v) ours-bern, which applies the Lipschitz criterion to continuous dimensions and the Bernoulli trick to the categorical (an thus, binary) dimensions; and vi) ours-gamma, which is the proposed Lipschitz standardization algorithm, combining the Gamma trick, for all discrete variables, with Lipschitz criterion.

### 6.1 Synthetic data

Experimental setup. We use data generated from a mixture model. Then, given the generated data, we rely on BBVI to approximate the posterior distribution of the latent variables, i.e., the mixture component assignments and parameters. In a first setting, we assume all our dimensions to be continuous. In the second setting, different types of discrete variables are appended to the continuous data from the former setting. Besides, we distinguish in both settings between those cases where all dimensions contain information about the clusters and those in which some dimensions are uniformative about the clustering.

Metric. We compute the Kullback-Leibler divergence from the actual log-marginal of the data to the one inferred by the model, i.e., . We present our results in terms of a modified version of the Relative Percent Difference (RPD) that lies in the interval , thus comparing the KL given by two different methods.

Results. Table 2 shows the results for the continuous setting (here, we omit the log-normal case since it is equivalent to the normal one), and compare the performance of our method, Lipschitz standardization, with the rest of approaches. Here, KL-RPD values smaller than one mean that our method outperforms the competing approaches, and KL-RPD values greater than one mean that the competing methods results in more accurate learning. Hence, we can observe that ours-gamma outperforms the rest of algorithms, being the difference even larger as we consider more complex settings, such as learning with uninformative variables, which in turn act as a source of noise.

Furthermore, Table 3 shows similar results when discrete dimensions are appended to the continuous data. In particular, in all but one case ours-gamma performs significantly better than learning using the original data, i.e., than the id approach (which was the second best method in Table 2). Note that, in this toy example, both ours-bern and ours-gamma provide similar results, being ours-bern slightly better when there are categorical dimensions in the data. An exception to these results is the the setup with the informative Bernoulli, where the Gamma trick performs significantly worse than the id approach. We believe that these results are due to the particular learning dynamics in the mixture model, and the reduced dimensionality of the data. However, as we will show in the experiments with real-world data, the Gamma trick becomes necessary for fair learning when dealing with more complex datasets.

In order to better understand the previous results, we depict in Figure 2 a setting example, extracted from Table 3, where unfair learning can be easily observed. In this figure, we observe that, for the id approach, the likelihood for the categorical dimension even decreases as we perform more steps of the gradient descent algorithm. In other words, when the original data is fitted to the BBVI algorithm, learning the continuous dimension cannot be achieved without “forgetting” the discrete variable. In contrast, this behaviour does not appear when the data is scaled using Lipschitz standardization, which result in a similar rate of convergence, and thus fair learning, of both dimensions.

### 6.2 Real-world data

Experimental setup. We use six different datasets from the UCI repository (Dua & Graff, 2017) and apply BBVI to solve a missing-data imputation task. To show the flexibility of the proposed Lipschitz standardization, we consider three different generative models: i) mixture model; ii) matrix factorization; and (vanilla) Variational Auto-Encoder (VAE) (Diederik et al., 2014). Additionally, we pick a likelihood for each dimension based on its observable properties (e.g., containing only positive real data).

Methods. Here, we compare (when applicable) the same data scaling methods as in the synthetic experiments, with the only difference being that we now consider the std and id methods as a single one, std/id, since data is standardized beforehand to provide a fair initialization across all methods. In short, we now consider: i) max; ii) std/id; iii) ours-none; iv) ours-bern; and v) ours-gamma.

Metric. Analogously to Nazabal et al. (2018)

, missing imputation error is used as evaluation metric. Specifically, normalized mean squared error is used for numerical variables and error rate for nominal variables, being the final result the average across all dimensions.

Results. Figure 3 shows the results, divided by dataset and percentage of missing data. The results show that, for all these datasets and models, applying Lipschitz standardization results in at least as good imputation error as the best of the competing methods. Furthermore, it can be observed that in highly heterogeneous datasets, such as the Adult and defaultCredit, and datasets with exclusively nominal dimensions, i.e., letter and Breast, Lipschitz standardization (i.e, ours-gamma) clearly outperforms the rest of the methods, to the extent of beating the state-of-the-art results achieved in Nazabal et al. (2018) (see, e.g., Breast).

It is also worth-mentioning that, in the case of defaultCredit

, outliers were removed from the box-plot in Figure

3 to provide a more clear visualization of the results. In this particular dataset, all methods but Lipschitz standardization failed at some point, obtaining large imputation errors, as shown in Appendix E. This poor performance may be the result of unfair learning due to the presence of discrete dimensions. The ability of Lipschitz standardization to scale all the dimension, including the discrete ones, makes it the most robust approach among the ones considered here.

As a final remark, we show in Table 4 the set of scaling factors provided by each of the considered data-scaling methods for the continuous dimensions in the Wine dataset. Here, we observe that the optimal Lipschitz scaling factors may significantly differ from the ones resulting from standardization and normalization. Importantly, no general rule may seem to apply in regards of whether the scaling factor for the Lipschitz standardization is larger or smaller than for the other methods.

## 7 Conclusions

In this work, we have studied the effect that data scaling approaches, such as standardization and normalization, have on models when they are learnt via gradient-based approaches. Our analysis has shown that no general rule for choosing between existing data scaling methods exists, but fortunately allowed us to introduce Lipschitz standardization, a novel algorithm that scales each (continuous or discrete) dimension of the data such that they all converge at a similar and optimal rate. Our experiments on real world data show that our algorithm results in accurate, fair, and robust learning, where its performance in the worst is at least as good as extant approaches. In fact, our experiments show that Lipschitz standardization becomes particularly necessary as the data become more complex, e.g., highly heterogeneous.

We leave for future research questions such as implementing Lipschitz standardization in a probabilistic programming pipeline, or applying it the context of other learning paradigms, such as multi-task learning. Finally, we would also like to study whether the Lipschitz criterion proposed in this paper can be adapted to be applied directly to the gradients, instead of to the data as a pre-processing step.

## Appendix A Proof of Proposition 2.1

###### Proposition A.1.

Let be a density function of the exponential family with variable and parameters . Assume a scaling function such that for any it defines the function (and random variable) . If

1. for every the function is bijective;

2. the base measure factorises, ; and

3. every sufficient statistics factorises as , where either or .

Then, by defining such that , where is the pairwise product and , we have that

1. ,

2. .

###### Proof.

First, we explicitly write the scaled version of log-likelihood function:

 logp(˜xω,˜η) =logh(˜xω)+∑i˜ηiTi(˜xω)−A(˜η) =log(f0(ω)h(x))+∑iηifi(ω)(fi(ω)Ti(x)+gi(x))−A(˜η) =logh(x)+logf0(ω)+∑iηiTi(x)+∑i˜ηigi(ω)−A(˜η) (17)

where we have made use of conditions b and c to separate the scaling factor from the data itself.

Now we compute the value of using again the same conditions:

 A(˜η) =log∫h(˜xω)exp(˜η⊤T(˜xω))d˜xω=logf0(ω)+log∫h(x)exp(η⊤T(x))ωdx+∑i˜ηigi(ω) =A(η)+logf0(ω)+logω+∑i˜ηigi(ω) (18)

And by substituting this into Equation A we obtain result 1:

 logp(˜x,˜η) =logh(x)+logf0(ω)+∑iηiTi(x)+∑i˜ηigi(ω)−A(˜η) =logh(x)+logf0(ω)+∑iηiTi(x)+∑i˜ηigi(ω)−A(η)−logf0(ω)−logω−∑i˜ηigi(ω) =logp(x;η)−logω (19)

Result 2 is a consequence of the first result and the chain rule:

 ∂˜ηilogp(˜xω;˜η) =∂˜ηi[logp(x;η)−logω]=∂˜ηilogp(x;η) =∂˜ηiηi∂ηilogp(x;η)=fi(ω) ∂ηilogp(x;η) (20)

Finally, result 3 comes as a direct consequence of applying the chain rule yet again:

 ∂2˜ηilogp(˜xω;˜η) =∂˜ηi[∂˜ηilogp(˜xω;˜η)]=∂˜ηi[fi(ω) ∂ηilogp(x;η)] =fi(ω) ∂˜ηi[∂ηilogp(x;η)]=fi(ω) ∂˜ηiηi ∂ηi[∂ηilogp(x;η)] =fi(ω)2 ∂2ηilogp(x;η) (21)

## Appendix B Proof of Proposition 4.1

###### Proposition B.1.

Let be the Lipschitz constant of the -th natural parameter, and the objective value. Then, if it exists, the solution of the optimization problem in Equation 11 is unique and has the following closed-form solutions:

• (Log-)normal:

 ω∗= ⎷−L1+√−L21+4L2L∗2L2⟺L2>0 (22)
• Gamma:

 ω∗=√L∗−L1L2⟺L10 (23)
• Exponential:

 ω∗=√L∗L1⟺L1>0 (24)
###### Proof.

Recall that we want to solve the optimization problem in Equation 11, that is,

 ω∗=argminω(I∑i=1˜Li−L∗)2, (25)

where is the Lipschitz constant of the scaled log-likelihood th partial derivative.

For simplicity, let us denote by the estimator of the (local) Lipschitz constant for the original data (which is a function of the second partial derivatives of ), instead of the actual Lipschitz constant. The important assumptions here are that all are non-negative real numbers and is positive.

Equation 25 is minimized when . Thus, we can apply the third result of Proposition 2.1, obtaining:

 I∑i=1˜Li=I∑i=1fi(ω)2Li=L∗ (26)

All need to do is solve Equation 26 for each distribution.

• (Log-)normal:

 f1(ω)=ωf2(ω)=ω2
 ω2L1+ω4L2−L∗=0 ⇒u=ω2⇒uL1+u2L2−L∗=0 ⇒u=−L1±√L21+4L2L∗2L2 ⇒ω=± ⎷−L1±√L21+4L2L∗2L2 (27)

Which gives four different solutions. Using the fact that, by construction, , we can reduce it to a single solution.

First, as it has to be positive we have that .

Moreover, as and are non-negative we have that , we have that . Therefore we can also discard that solution, resulting in a single solution:

 ω= ⎷−L1+√L21+4L2L∗2L2.

In order for this solution to exists, three conditions need to hold. Specifically:

• since is positive by construction.

• which holds as long as .

Therefore, the (log-)normal distribution has a unique solution if and only if which is given by

 ω= ⎷−L1+√L21+4L2L∗2L2. (28)
• Gamma:

 f1(ω)=1f2(ω)=ω
 L1+ω2L2−L∗=0⇒ω=±√L∗−L1L2 (29)

Therefore there are two solutions. Again, using that , we have that the only valid solution is

 ω=√L∗−L1L2 (30)

which exists if and only if and .

• Exponential:

 f1(ω)=ω
 ω2L1=L∗⇒ω=±√L∗L1 (31)

These solutions only exists when and the positive one is the only valid one.

## Appendix C Proof of the statement about Gamma trick noise

In section 5 it was introduced the concept of Gamma trick, which acts as a approximation for discrete distributions. Moreover, it was explained that it is beneficial if the original variable is somewhat far from zero.

This statement, at first look arbitrary, it is justified by the following result: The second derivative of a Gamma log-likelihood with respect to the first natural parameter, , rapidly decreases as the data moves away from zero.

For the case of a distribution in the exponential family, it is a well-known result that .

Therefore, in the case of the Gamma distribution, we have that:

 ∂2η1logp(x;η) =∂η1[α−logβ+logΓ(α)+(1−α)ψ(α)] =∂η1[η1+1−log(−η2)+logΓ(η1+1)−η1ψ(η1+1)] =1+∂η1logΓ(η1+1)−ψ(η1+1)−η1ψ(1)(η1+1) =1+ψ(η1+1)−ψ(η1+1)−η1ψ(1)(η1+1) =1−ψ(1)(η1+1)=1−(α−1)ψ(1)(α) (32)

Figure 4 shows a plot of the above equation as a function of the shape . It is easy to observe that as the shape grows the value of (our approximation to) drastically decreases.

Finally, by supposing that discrete data are natural numbers, means that the mode is at least one, which in practice means that the value for is bigger than (usually close to ), thus ensuring that the value of (our approximation to) is negligible.

## Appendix D Experiments details

### d.1 Models

Here we give a deeper description of the models appearing in the experiments.

Mixture model. As described in Subsection 6.1, the mixture model follows the graphical model from Figure 1 and it is fully described by:

• Priors:

 p(πn)=U(K)p(β)=N(0K,IK)
• Posteriors:

 qϕ(zn)=Cat(πn)qϕ(β)=N(μ,Σ)

 η(zn,βd)=znβd

Where are -dimensional vectors and

are one-hot encodings of size

.

To ensure that the parameters fulfil the domain restriction of each particular distribution, the following transformations are performed after applying the linking function:

• Greater than :

 η′=softplus(η)+l+$1×10−15$
• Smaller than :

 η′=−(softplus(η)+u+$1×10−15$)

When it comes to the real-world examples the only parameter that have to be specified for this model is the number of clusters, . In particular, we use if the dataset is Breast, Wine, or spam, and otherwise.

In order to implement the discrete latent parameters such that they can be trained with automatic differentiation, the latent categorical distribution is implemented using a GumbelSoftmax distribution with a temperature that updates every 20 epochs as:

 temp=max(0.001,e−epoch⋅0.001)

Matrix factorization. Similar to the mixture model, the matrix factorization model follows the same graphical model and it is (almost) fully described by:

• Priors:

 p(μn)=N(0K,IK)p(β)=N(0K,IK)
• Posteriors:

 qϕ(zn)=N(μn,σ)qϕ(β)=N(μ,Σ)

 η(zn,βd)=znβd

There some details that have to be noted. First, the variance of the local parameters, , is shared among instances and learnt as a deterministic parameter. In the same way, only the first parameter, , of each distribution is learnt following this scheme. The remaining parameters are learnt using gradient descent as deterministic parameters.

The same transformations as in the mixture model are performed to the parameters in order to fulfil their particular domain requirements.

When it comes to the real-world examples the only parameter that have to be specified for this model is the latent size, . In particular, we set it automatically as half the number of dimensions of each dataset (before applying any trick that may increase the dimensionality).

Variational Auto-Encoder. We follow the structure of a vanilla VAE with the following components:

• Encoder: 3-layer neural network with hyperbolic tangents as activation functions.

• Decoder: 4-layer neural network with ReLU as activation functions.

General notes:

• We assume normal latent variables with a standard normal as prior.

• Hidden layers have 256 neurons.

• The latent size is set to the of the data (original) number of dimensions.

• All intermediate layer have a dropout of to avoid overfitting (e.g. in Breast, which has instances).

• Layers are initialized using a Xavier uniform policy.

• Depending on the dataset the model could be unstable in the final parts of training. Therefore, we halve the learning rate when reached half the total epochs, and once again when we are three fourths in the training process.

• As we have to avoid using the missing data (since it is going to be our test set), we implement an input-dropout layer as in Nazabal et al. (2018).

• In order to guarantee a common input (and thus, a common well-behaved neural net) across all data scaling methods, we put a batch-normalization layer at the beginning of the encoder. Note that this does not interfere with the goal of this work, which is about the evaluation of the loss function.

• In order to obtain the distributional parameters of , and , we pass the result of the encoder through two linear layers, one for the mean and another for the log-scale. The latter is transformed to the scale via a softplus function.

• The decoder output size is set to the number of parameters to learn. Each one being transformed accordingly with softplus functions to fulfil their distributional restrictions, as done for the other models.

### d.2 Experimental setup

Synthetic data.

For the experiments with only continuous distributions, we consider a variety of distributional parameters for each distribution taken into consideration, trying to cover as many plausible scenarios as possible. Then, we take every possible combination of distributions, generate samples and train a mixture model with different seeds. The probabilities for each model are supposed uniform and the distributional parameters considered can be seen in Tables 5 and 6.

All experiments are trained using Adam with a learning rate of

and default Pytorch’s options. For the synthetic experiments we do not use stochastic gradient descent, i.e., we set the batch size to

.