Wasserstein Gradients for the Temporal Evolution of Probability Distributions

09/10/2018 ∙ by Yaqing Chen, et al. ∙ 0

Many studies have been conducted on flows of probability measures, often in terms of gradient flows. We introduce here a novel approach for the modeling of the instantaneous evolution of empirically observed distribution flows over time with a data-analytic focus that has not yet been explored. The proposed model describes the observed flow of distributions on one-dimensional Euclidean space R over time based on the Wasserstein distance, utilizing derivatives of optimal transport maps over time. The resulting time dynamics of optimal transport maps are illustrated with time-varying distribution data that include yearly income distributions, the evolution of mortality over calendar years, and data on age-dependent height distributions of children from the longitudinal Zürich growth study.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

There exists a sizeable literature on flows of probability measures, often described in terms of gradient flows (Ambrosio et al. 2008; Santambrogio 2017). Given two probability measures and , one aims to construct a path from to , transporting the pile of mass corresponding to to that corresponding to , while minimizing the transport cost. The optimal transport from to attains the minimum transport cost as in the Monge-Kantorovich problem (Ambrosio 2003; Villani 2003, 2008). However, the statistical modeling of the instantaneous evolution of observed distributions that are indexed by time seems not to have been explored yet. Figure 1 shows an example of time-indexed densities, which correspond to demographic age-at-death distributions from 1947 to 2014 in the US, for females and males. Motivated by this and similar examples, we consider in this paper temporal flows for one-dimensional probability distributions.

Figure 1: Time-varying densities of age-at-death (in years) for the US from 1933 to 2014.

Our goal is to develop statistical models that reflect instantaneous evolution of such temporal flows of distributions, based on the Wasserstein distance, and corresponding optimal transport maps from one distribution to another, and to obtain estimates for the components of these models from data that are generated by each of the one-dimensional distributions. Specifically, we propose Wasserstein temporal gradients, which are limits of difference quotients of transport maps with respect to the time index. In the setting we consider here, the optimal transport map from

to with respect to Wasserstein distance corresponds to the probability transform . Here, and are two probability measures corresponding to times and , and and

are the cumulative distribution function (cdf) of

and the quantile function of

, respectively.

Recently, there has been intensive interest in comparing distributions with the Wasserstein distance, both in theory and applications (e.g. Bolstad et al. 2003; Bigot et al. 2017a, b; Cazelles et al. 2017; Galichon 2017), and in visualizations of distributions (e.g. Delicado and Vieu 2017). In the one-dimensional case that we consider here it is well known that the Wasserstein transport can also be expressed in terms of quantile functions (Hoeffding 1940; Zhang and Müller 2011; Chowdhury and Chaudhuri 2016).

The difference between the optimal transport and identity maps (corresponding to no transport) captures the transport direction and distance of each small element of mass from the starting probability measure to the target measure and can be used to quantify the change between the two probability measures. Since the optimal transport map from one probability distribution to itself is the identity map (if the corresponding cdf is strictly increasing), we propose to quantify the instantaneous change of a temporal distribution flow by the derivative of the optimal transport map with respect to time, . The proposed dynamics, Wasserstein temporal gradients and the corresponding estimators are introduced in Section 2. We study consistency and convergence rate of the estimates of the Wasserstein temporal gradients in Section 3. In Section 4, we discuss simulations, followed by data illustrations in Section 5 that feature longitudinal income, mortality and human growth data.

2 Modeling temporal distribution flows

2.1 Preliminaries

First we introduce the definition of -Wasserstein distance. For the space of continuous probability measures on with finite

-th moments,

where denotes the Borel -field on , denotes the Lebesgue measure on , and means is absolutely continuous with respect to , we define the Wasserstein distance (Villani 2008), also known as Mallow’s distance (Mallows 1972) or earth mover’s distance (Rubner et al. 2000), as follows.

Definition 1 (-Wasserstein distance).

Let and be two probability measures on in . The -Wasserstein distance between and is defined as

(1)

where the expectation is with respect to the joint distribution of

.

Denoting by and the quantile functions of and , respectively, it is well known (Cambanis et al. 1976) that this definition is equivalent to

(2)

which implies that the study of can be reduced to that of distances of quantile functions.

These definitions lead to the concept of optimal transport maps and geodesic paths. For two random variables

and defined on the same probability space , we define a transport map such that , whence

(3)

indicating that -Wasserstein distance corresponds to the minimum cost of the probability mass transport from to , which is attained when , yielding an order-preserving transport from to , where the cdf of is assumed to be strictly increasing.

Definition 2 (Optimal transport maps).

Let and be two probability measures in . The optimal transport map from to is defined as

Example 1.

Let and . The corresponding the density functions are shown in the left panel of Figure 2. The right panel demonstrates the optimal transport map from to , .


Figure 2: Optimal transport map between and .

As shown in Figure 2, the difference between the optimal transport map and the identity map reveals the difference between and . For any fixed , if , the mass in a neighborhood of will be moved to the right by the optimal transport from to , where satisfies that for all , . The continuity of , which is due to the continuity of and , ensures the existence of such neighborhoods . Analogously, if , then the mass in a neighborhood of will be moved to the left by the optimal transport from to .

2.2 Wasserstein temporal gradients

For the remainder of this paper, we focus on with , denoted by for simplicity. Consider a set of probability distributions on with compact support, say a subset of , indexed by time with corresponding cdfs and quantile functions . Given a time point and a small , the optimal transport map from to is of the form

(4)

To model the instantaneous change in a flow of distributions, recall that for a one-dimensional differentiable real-valued function, the instantaneous change is quantified by the derivative. We are aiming to generalize the notion of derivatives to the scenario of temporal distribution flows. Here, the difference between the optimal transport map and the identity map models the change of the distribution flow when increases by , noting that if the cdf of is strictly increasing. Assuming the existence of the derivative of optimal transport maps with respect to time, we introduce Wasserstein temporal gradients to model the instantaneous evolution of the distribution flows over time.

Definition 3 (Wasserstein temporal gradients).

The Wasserstein temporal gradient at time is

(5)
Example 2.

Let , for . Then the Wasserstein temporal gradient at is:

with further details in Appendix A.

Example 3.

For , let

be a Gaussian distribution

, truncated on the interval , with density function

(6)

where and are the density function and cdf of standard Gaussian distribution, respectively. Then the Wasserstein temporal gradient at is

where , with further details in Appendix A.

Simple calculations show that if is strictly increasing,

where and . Taking partial derivatives with respect to on both sides of implies

Thus,

(7)

Note that for a scalar valued differentiable function ,

Thus, comparing the actual flow for a given longitudinal trajectory with the optimal flow provides insights into how the rank of changes at each time . If , then , i.e., the rank of increases instantaneously at time ; If , then , i.e., the rank of decreases instantaneously at time .


Figure 3: Derivatives of densities with respect to time (DTG) v.s. Wasserstein temporal gradients (WTG) for with and at .

Compared with the alternative of considering the pointwise derivatives of densities with respect to time, which we refer to density temporal gradients (DTG), the Wasserstein temporal gradient (WTG) reveals different features of the evolution of distribution flows, as it captures the direction and speed of the mass flow in a neighborhood of . In contrast, the DTG characterizes how fast the density instantaneously tends to increase or decrease for any in the support of the current distribution. The interpretation of the WTG is more intuitive in terms of transportation of mass and it is better suited to reflect the salient features of density evolution.

An example is given in Figure 3, which schematically depicts the comparison between DTG and WTG for (as in Example 2) with and and . On the interval between the two points (-0.3156 and 1.134) at which the DTG is equal to zero as shown in the top left panel in Figure 3, the DTG is greater than zero, indicating that the density within this interval instantaneously tends to increase, as demonstrated by the red upward arrows in the bottom left panel. Outside of this interval, DTG is less than zero, indicating that the density of this area tends to decrease, as demonstrated by the blue downward arrows in the bottom left panel. The WTG is greater than zero on the left of the zero point (0.5683) and less than zero on the right, which implies that mass tends to move to the right in a small neighborhood of any and to the left in that of any , as indicated by the red right and blue left arrows in the bottom right panel, respectively. The absolute value of DTG quantifies how fast the density tends to change, and that of WTG how fast the mass tends to move. These changes are visualized through the arrows and their length in Figure 3.

2.3 Estimation

In practice, we only observe data generated by random distributions on a discrete grid. Consider pairs , where time takes values in and is a random distribution in ,. We aim to use observations on a dense grid in to obtain , a continuously indexed smooth distribution flow defined by the conditional Fréchet mean with respect to the Wasserstein distance, i.e.,

(8)

with compact domain , cdf and quantile function

. For this purpose of smoothing/interpolating sequences of distributions, we utilize local Fréchet regression

(Petersen and Müller 2018), which targets a localized Fréchet mean

(9)

Here is a weight function, , , , is a smoothing kernel, i.e., a density function symmetric around zero, and is a bandwidth.

An additional challenge is that are not directly observed, but must be inferred from a sample generated by . We then proceed as follows to estimate . First, for every with , estimate by the empirical cdf obtained from the data , i.e.,

(10)

Second, for every , compute the empirical quantile function as the left continuous inverse of the empirical cdf , i.e.,

Note that due to the compactness of , is well defined for .

Third, smooth the empirical quantile functions over by local Fréchet regression (Petersen and Müller 2018) on with respect to the Wasserstein distance. Use and to denote the cdf and quantile function of the distribution after smoothing, which is an estimate of in (8). Additional details are in Appendix B.

For any and small , the estimated optimal transport map from to is given by , and the estimate of the Wasserstein temporal gradient at can subsequently be obtained by means of difference quotients,

(11)

These estimates obviously depend on the choice of the span that assumes the role of a tuning parameter; this is typical for estimates of derivatives which nearly always require such a tuning parameter, and its choice will be discussed in Section 4.

3 Theory

For the theoretical developments, we focus on a slightly modified estimate,

(12)

where , with , where is the corresponding quantile function and is the corresponding probability measure, for all . Note that both numerator and denominator of this modified estimate are bounded away from 0 at the rate , which ensures some degree of stability for the estimates. All proofs and auxiliary results are in Appendix C. The following mild assumptions will be needed.

  1. The kernel in (B.2

    ) used in the local Fréchet regression is a probability density function, symmetric around zero. Defining

    , and are both finite.

  2. The density function of in (8) has a uniform lower bound, i.e., there exist constants and for each such that .

  3. The density function of in (8) has a uniform upper bound, i.e., there exists such that

Theorem 1.

Under (B0)–(B2), for all , if and , where is as in (B.2), the following holds for the estimates of the optimal transports from to :

(13)

Define . For sequences for a given constant with , is consistent for the true difference quotient , as follows.

Corollary 1.

Under (B0)–(B2), if and ,

We are now in a position to obtain the consistency of the proposed estimate (12) of the Wasserstein temporal gradient .

Proposition 1.

Assume functions are continuous for all . Under (B1), if and ,

If the derivative of the Wasserstein temporal gradient with respect to time also exists, we have the stronger result that if and ,

Then the convergence rate of estimate (12) for the Wasserstein temporal gradient as target is given by

Corollary 2.

Under (B0)–(B2), assuming are all continuous functions, and that the derivative of the Wasserstein temporal gradient with respect to time exists for all , the following holds for the estimate of the Wasserstein temporal gradient if and :

Choosing yields

Among bandwidth sequences with , the optimal sequences are with rates

Corollary 2 thus establishes the consistency and convergence rate for the empirical estimate of the Wasserstein temporal gradient (12).

4 Implementation and Simulations

An important auxiliary parameter for implementation of Wasserstein temporal gradients is the bandwidth , which is needed to construct the estimate by the local Fréchet regression as in (B.1) and (B.2). Suppose we have observations on a dense grid

. While one aims at a balance between bias and variance with an optimal choice of bandwidth rate

for the local Fréchet regression when estimating the Wasserstein temporal gradient in Corollary 2, this does not lead to a practical solution, since the constant at is unknown and choosing it as 1 often leads to severe undersmoothing, for example no point of the grid may be included in when is large and the kernel has compact support. Thus, an approach for practical bandwidth choice is needed.

For this we consider here (leave-one-out) cross-validation (CV). The optimal bandwidth is selected as the minimizer of the average squared Wasserstein distance between the empirical distribution with cdf in (10) and leave-one-out estimate of given by implementing local Fréchet regression on with bandwidth (with further details provided in Appendix B),

(14)

To assess the finite sample performance of the cross-validation selection
, we compared its performance in simulations with the optimal choice , when the target is known, where is the set of bandwidths that are considered, and the average integrated squared error (AISE) is

(15)

with Wasserstein temporal gradients for the target probability distribution flow defined in (8). The dependency of AISE on is through using in the local Fréchet regression when estimating (for details see Appendix B).

In the simulations we used with and 1001. The distributions are truncated on the interval (with density function as given in (6)), where , with , and and are independent. Note that the quantile function of the target is not that of with , replaced by the corresponding expectations, but rather is

for , where is the cdf of . Consequently, the target Wasserstein temporal gradients in (15) do not have a closed form and need to be computed numerically. The number of observations for the distribution at each were taken as , respectively. We considered the set of bandwidths , kernel , and in (B.2). Boxplots and medians of the AISEs corresponding to the optimal bandwidths chosen by AISE and CV in each of the 1000 Monte Carlo runs for and 1001 and , 100 and 1000 are in Figure 4 and Table 1, respectively. The main message is that CV performs satisfactorily, as it tracks the optimal choice closely.

Regarding the choice of , as we aim to use difference quotients to approximate derivatives, to reduce bias, we should choose as small as possible. A practical method to choose is to consider a decreasing series of values until the fitted results become stable and then stop. We implemented this approach and found that the estimates for small became stable quickly.

(a)
(b)
(c)
(d)
(e)
(f)
Figure 4: Boxplots of the AISEs corresponding to the optimal bandwidths chosen by AISE and CV in each run. (a)–(c) are for , and (d)–(f) are for . Each column corresponds to , 100, and 1000, respectively.
11 1001
10 100 1000 10 100 1000
AISE 1.518 1.331 1.305 0.8760 0.1955 0.03912
CV 1.746 1.349 1.320 0.9484 0.2082 0.03965
Table 1: Medians of the AISEs corresponding to the optimal bandwidths chosen by AISE and CV.

5 Applications

In this section, we will demonstrate the proposed Wasserstein gradients for time-dependent household income, human mortality and growth data. As mentioned before, the underlying densities are practically never known and need to be estimated from data that they generate. In the household income and mortality examples, the data are reported in the form of histograms, respectively liftable, as is often the case in data collections where data are rounded.

Our methods can be applied in a straightforward way to histogram data , where is a piecewise constant, nonnegative function on . When starting with histograms, to obtain the empirical cdf, instead of using (10), we simply integrate the available histograms ,

Throughout we use the parabolic kernel function .

5.1 Income data

Many studies have been conducted on income distribution and inequality (Jones 1997; Heathcote et al. 2010), since this is a major measure of economic equality/inequality. The evolution of income distributions over time is of particular interest as it provides quantification of the directions in which income inequality is evolving. The US Census Bureau provides histogram data of US household income over calendar years from 1994 to 2016, available at https://census.gov/data/tables/time-series/demo/income-poverty/cps-hinc/hinc-06.html. To make incomes of different years comparable, adjustments for inflation have been made, using the year 2000 as baseline for constant dollars.

Figure 5: Densities of US household incomes for 1994–2016, where “2010 pop00” and “2010 pop10” represent the distribution of 2010 based on the population census of 2000 and 2010, respectively, and “2013” and “2013 r” represent the distributions on previous and redesigned questionnaires, respectively.

We focus on incomes less than $. The data requires some preprocessing, as the width of the histogram bins changed between 2000 and 2001; for 2010, due to census changes, two datasets are available based on both census 2010 and 2000 populations; for 2013, two sets of data are also available and one of them is based on a redesigned questionnaire which has been used since then. To mitigate against these changes, which potentially introduce artificial variation, we divided the whole period into four parts: 1994–2000, 2001–2010, 2010–2013 and 2013–2016. Although another change of bin width occurred between 2008 and 2009, we keep the entire period 2001–2010 in order to cover the financial crisis of 2008 well within the time interval. The densities constructed by smoothing the histogram data are shown in Figure 5, where “2010 pop00”, “2010 pop10”, “2013” and “2013 r” represents the income distribution of 2010 based on the population census of 2000 and 2010, and of 2013 based the previous and redesigned questionnaires, respectively.

Figure 5 reveals not much variation in the income distributions over time except around 2008. The estimated Wasserstein temporal gradients (with bandwidth 1.2 years, chosen by cross-validation, see Section 4) demonstrate how the income of poor, middle-class and rich households evolves in detail for the four periods in Figure 6. Since the local Fréchet regression has increased variance near endpoints, the estimated Wasserstein temporal gradients on the two ends of each period are somewhat unreliable.

Figure 6: Estimates of the Wasserstein temporal gradients (in US dollars/year) for US household income distribution flows for 1994–2016. Positive values indicate increasing trend; negative values indicate decreasing trend. Note that due to census and questionnaire changes, the gradients for the years 2010 and 2013 are plotted for the versions before and after these changes.

It can be seen in Figure 6 that for the period 1994–2000, household incomes of around $) tend to decrease up to 1995, increase from 1996 to 1998, roughly keep the same in 1999 and decrease again in 2000, which is very different from what happens to the other income brackets. Relatively poor households tend to increase their income in this period except 2000. Households with incomes above $, tend to always increase their income during this first period. For the second period 2001–2010, the economic status of the lower and middle earners tends to be stable in the first four years, rise in 2005 and 2006, and then decline starting in 2007. Higher incomes tend to decline until 2002, and beginning in 2003, a divide manifests itself in the higher income levels: The lower tier of higher incomes tend to lose income, whereas the higher tier always gains income, except for 2008 and 2009. Note that in 2008 and 2009, all household incomes tend to decrease, coinciding with the financial crisis. For the last two periods, it can be seen that household incomes gradually recover from the crisis. While the top brackets always gain, the lower brackets do not recover until 2015.

5.2 Mortality data

The analysis of mortality data across countries and species has found some interest in demography and statistics (Carey et al. 1992; Chiou and Müller 2009; Ouellette and Bourbeau 2011; Hyndman et al. 2013; Shang and Hyndman 2017). Of special interest is how the distribution of age-of-death evolves over time. The Human Mortality Database (http://www.mortality.org) provides data of yearly life tables for 37 countries, from which the distributions of age-of-death in terms of histograms can be extracted.

We focus on age-at-death in the age interval [0,109] (in years) and take Russia, Sweden and the United States as three examples. The densities obtained by smoothing the histogram data for females and males separately are shown in Figures 7, 8 and 1, respectively. It can been seen that densities and the evolution flow patterns vary across these three countries, which is partly due to the different domains in terms of calendar years during which country-specific mortality has been recorded, which goes much further into the past for Sweden than for the other countries. Estimates of the Wasserstein temporal gradients have been obtained with the bandwidth 1.2, which was chosen by cross-validation for each of the three countries.

For Russia, the densities of age-at-death from 1959 to 2014 are in Figure 7. The distribution of age-at-death is seen to be quite different for males and females, especially for age-at-death above 20. The estimates of the Wasserstein temporal gradients from 1960 to 2013 for Russia are (also) shown in Figure 7.The first and last year have been removed due to boundary effects. It can be seen that between 1960–2000 the movement of mortality to higher ages and thus longer life was interrupted, with a lot of variation during this period, and it resumes only in the 2000s, where the greatest progress is made in reducing child mortality. Especially in the 1990s, there is a reversal in the trend to increased longevity, as the estimates of the Wasserstein temporal gradients are negative for those years, especially for mid-age males. Again in 2011 and 2012, children, especially girls, under 18 years experience a negative Wasserstein gradient in the younger age range.

For Sweden, as shown in Figure 8, the densities of age-at-death of females and males are quite similar, indicating increased longevity over the years. The Wasserstein temporal gradients for Sweden in Figure 8 indicate some volatility in the distribution of age-at-death for both females and males, especially before 1950.

Compared to Russia, the evolution of the age-at-death distributions in Sweden is more balanced, years where the distribution moves to the left (right) will be followed by another year where it moves to the right (left). The Wasserstein temporal gradients for Sweden vary in a much larger range than Russia which is partly due to the inclusion of early calendar years, where the variation of mortality from year to year can be seen to have been much larger, declining in more recent calendar years. For example, the top orange curve for 1774 demonstrates a big increasing trend in the life length for both females and males while the bottom orange curve for 1772 demonstrates a big decreasing trend.

For the US, the densities of age-at-death of females and males are somewhat similar. The estimates of the Wasserstein temporal gradients from 1934 to 2013 for the US in Figure 9 indicate that age-at-death distributions tend to move to the right in almost all years, which means increasing longevity. However, for several of the years since the 1980s, reversals can be found for both females and males. A major reversal can be found for the American males from young adults to middle age during 1986–1989 and for young females under 25 in 1988 as the light blue curves shown in Figure 9. These puzzling reversals have been attributed to drug use (e.g., Case and Deaton 2015).

Figure 7: Time-varying densities of age-at-death (in years) with females in the left column and males in the right column for Russia from 1959 to 2014 (top two); Estimates of the (unit-free) Wasserstein temporal gradients of the age-at-death (in years) distributions for Russia from 1960 to 2013 (bottom two), where positive values indicate increasing trend and negative values indicate decreasing trend.
Figure 8: Time-varying densities of age-at-death (in years) with females in the left column and males in the right column for Sweden from 1751 to 2014 (top two); Estimates of the (unit-free) Wasserstein temporal gradients for the distributions of age-at-death (in years) for Sweden from 1752 to 2013 (bottom two), where positive values indicate increasing trend; negative values indicate decreasing trend.
Figure 9: Estimates of the (unit-free) Wasserstein temporal gradients of the age-at-death (in years) distributions for the US from 1934 to 2013, with females on the left and males on the right. Positive values indicate increasing trend; negative values indicate decreasing trend.

5.3 Growth data

The growth of children is a topic of perennial interest, as it characterizes the health of a population (Jones and Bayley 1941; Prader et al. 1989; Gasser et al. 1984)

. To assess growth of children properly requires a longitudinal study. The Zürich Longitudinal Growth Study is a prime example, where the standing height of children was measured at 37 ages (not equidistant) between birth (age 0) and 20 years for 112 girls and 120 boys. Measurements were obtained at ages (in years) 0 (birth), 1/12 (1-month), 1/4 (3-month), 1/2 (6-month), 3/4 (9-month), 1, 1.5, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10.5, 11, 11.5, 12, 12.5, 13, 13.5, 14, 14.5, 15, 15.5, 16, 17, 18, 19, 20. Corresponding cross-sectional density functions of the distribution of standing height at each of these ages can be easily obtained by pooling the data from all children, with the resulting density estimates in Figure

10. As age goes up, a clear trend of moving to the right can be observed among the distributions of standing height of both girls and boys, which clearly reflects the children’s growth.

The estimates of the Wasserstein temporal gradients from age 1 to 19 are shown in Figure 10 for girls and boys, computed with the optimal bandwidth 1.2 chosen by cross-validation. For both girls and boys, the Wasserstein temporal gradients are relatively large before and including age 2, decrease through around age 10, and then fall after a minor increase during puberty. The gradients show that taller children gain more height until about 6 years when this effect attenuates but is still present, for both boys and girls.

Figure 10: Time-varying densities of standing height (in centimeters) for girls and boys, stratified by age (in years) (top two). Estimates of the Wasserstein temporal gradients (in centimeters/year) of the distributions of standing height (in centimeters) for girls and boys from age 1 to 19, stratified by age (in years) (bottom two). Positive values indicate increasing trend; negative values indicate decreasing trend.

At around age 13 the gradient reverses and shorter children grow more than taller ones. This effect is much stronger expressed for boys. It is likely due to the fact that many of the taller children at that age have completed their pubertal spurt after which they grow only little while the smaller ones tend to have the pubertal spurt ahead; they are small as they did not go through the spurt yet.

A Additional Details for the Examples

Derivation for Example 2.

Derivation for Example 3.

where and thus . ∎

B Implementation of Local Fréchet regression

To smooth the empirical quantile functions over by local Fréchet regression (Petersen and Müller 2018), given inputs . we construct weighted means of the empirical quantile functions,