An optimal transport approach to data compression in distributionally robust control

05/19/2020 ∙ by Filippo Fabiani, et al. ∙ University of Oxford

We consider the problem of controlling a stochastic linear time-invariant system using a behavioural approach based on the direct optimization of controllers over input-output pairs drawn from a large dataset. In order to improve the computational efficiency of controllers implemented online, we propose a method for compressing this large dataset to a smaller synthetic set of representative behaviours, using techniques based on optimal transport. Specifically, we choose our synthetic data by minimizing the Wasserstein distance between atomic distributions supported on the original dataset and on our synthetic one. We show that a distributionally robust optimal control computed using our synthetic dataset enjoys the same performance guarantees as one computed from the original dataset, at the price of an arbitrarily larger ambiguity set. Finally, we illustrate the robustness and control performance over the original and compressed datasets through numerical simulations.




I Introduction

Strongly influenced by current trends in nearby and complementary scientific areas, learning from data promises to become one of the quintessential problems for the systems-and-control community in the near future. The growing complexity of today's systems, which drastically limits traditional model-based control design, along with increasing data storage capacity and data availability, makes data-driven control approaches timely and attractive.

Specifically, recent works build upon a well-known result in subspace identification, the so-called Willems' lemma [1]. Essentially, under a specific technical condition, this result enables the recovery of a nonparametric minimal LTI realization of an unknown system from a matrix of input-output measurements: the column space of the data matrix spans all possible trajectories of the corresponding LTI system. In this context, the interpretation of Willems' lemma has generated two main research directions: data-dependent parametrization analysis and control [2, 3, 4], and data-enabled (robust) optimal control [5, 6, 7].

In this work, we deal with a specific issue raised within this latter approach. Specifically, since the crucial result in [1] establishes equivalence between a (possibly noisy) data matrix and a nonparametric LTI model under persistent excitation of the system input, which directly translates into a requirement on the length of the input-output observations, one may be induced to collect disproportionately large datasets. As a consequence, the real-time implementation of optimization-based (robust) control laws may lead to quite challenging online computations. This limits the field of possible applications, e.g., ruling out systems that require high sampling rates, or yields overly conservative performance, e.g., by choosing short control horizons to reduce the computational time.

In systems and control, data compression and dimensionality reduction have been widely used to design synthetic sets of samples that “best” capture the informative content of original, noise-corrupted datasets. Tracing back over the past decades, statistical and manifold learning [8, 9], principal component analysis and related variants [10, 11, 12], or, more generally, subspace identification approaches [13, 14, 15], represent only a few of the most widespread techniques. Conceptually, these methods tacitly neglect the stochastic nature of the noise itself by comparing sets of data on topological or Hilbert spaces. They then eliminate the components considered less significant, according to some criterion, and potentially associated with noise.

Breaking away from the literature, we propose an optimal transport approach [16, 17] for the synthesis of a compressed set of samples. By defining a metric on a probability space, the Wasserstein distance enables us to set up an LP that minimizes the distance between the discrete measure associated with the synthetic dataset and the empirical distribution of the noisy data. As a particular case of the family of variational Wasserstein problems [17, §9], each atom of the synthetic dataset identifies a specific barycentre for a cluster of the original samples, thereby seeking to maximize its informative content. In the context of data-enabled robust control of stochastic LTI systems [6], we show that the distributionally robust optimization problem, built upon a Wasserstein ball as ambiguity set (and hence admitting a convex reformulation), enjoys the same performance guarantees as with the original dataset, at the price of an arbitrarily larger radius that depends on the number of synthetic atoms adopted. Remarkably, the term accounting for the extra robustness vanishes as the number of atoms approaches the cardinality of the original dataset.

After briefly recalling some notions of optimal transport (§II), in §III we introduce the distributionally robust control problem addressed. The variational Wasserstein problem is then formalized and discussed in §IV, while in §V we compare the control performance obtained over the original and compressed datasets through a numerical case study.


The set of real and non-negative real numbers is represented by ℝ and ℝ≥0, respectively, while the set of natural numbers is denoted by ℕ. For vectors and , we denote . Given a matrix , its entry is denoted by , while its transpose as . The symbol denotes the inner product in the appropriate space, i.e., for , and for , . denotes the probability simplex, where is the -dimensional vector with all elements equal to . For any point , is the Dirac unit mass on . We denote the dual norm of an arbitrary norm on as . The conjugate function of is defined by .

II Background on Optimal Transport

We start by briefly recalling the definition of the Wasserstein distance for continuous measures, which will be crucial in the next section to formalize the control problem addressed. Afterwards, by focusing on discrete probability distributions, we define the associated discrete optimal transport problem.

Thus, let be an arbitrary space endowed with a metric , and be the set of Borel probability measures on .

Definition 1

([16, Ch. 7]) Given any , the -Wasserstein distance between two probability measures , is defined as


where denotes the set of all probability measures on that have marginals and , respectively.

It follows from [16, Th. 7.3] that, for any , defines a metric on . Roughly speaking, the decision variable of the infinite-dimensional optimization problem in (1) can be interpreted as the transportation plan for moving a mass distribution described by to another one described by , while is the transportation cost. Then, given any , we define the Wasserstein ball of radius , centred around the distribution , as .

II-A Discrete measures

Now, let us consider two families of and points in , respectively, i.e., and . The discrete probability measures with weights , , and locations in and read as and , where , . In this setting, the Wasserstein distance coincides with the optimal value of a network flow problem, since (1) translates into the following LP [17]:


Here, is the matrix of pairwise distances between points in the supports of and , raised to the power , defined as , . The decision variable in (2) represents the coupling matrix that specifies the amount of mass flowing from towards . For any and , the admissible couplings lie in the feasible set , which reads as


We note that the set of matrices in (3) is a convex polytope (also called transportation polytope), since it is bounded and defined by a set of equality constraints. Moreover, as long as is nonempty, we stress that the solution to the LP in (2), attained on the vertices of , may not be unique. Finally, since defines a metric, in the discrete setting (2) if and only if , and therefore the triangle inequality holds as , for some discrete measures , , .
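As a concrete illustration of the LP in (2) (our own sketch, not part of the paper; the function name `wasserstein_lp` and the one-dimensional supports are assumptions), the discrete Wasserstein distance can be computed with an off-the-shelf LP solver by flattening the coupling matrix and encoding the marginal constraints as equalities:

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein_lp(x, y, a, b, p=1):
    """Discrete p-Wasserstein distance between sum_i a_i * delta_{x_i}
    and sum_j b_j * delta_{y_j}, solved as the network-flow LP in (2)."""
    n, m = len(x), len(y)
    # matrix of pairwise distances raised to the power p
    D = np.abs(x[:, None] - y[None, :]) ** p
    # equality constraints on the flattened coupling T (row-major):
    # row sums equal a, column sums equal b
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0    # sum_j T_ij = a_i
    for j in range(m):
        A_eq[n + j, j::m] = 1.0             # sum_i T_ij = b_j
    res = linprog(D.ravel(), A_eq=A_eq, b_eq=np.concatenate([a, b]),
                  bounds=(0, None), method="highs")
    return res.fun ** (1.0 / p)

# two uniform measures on the real line: all mass travels a distance of 2
x = np.array([0.0, 1.0]); a = np.array([0.5, 0.5])
y = np.array([2.0, 3.0]); b = np.array([0.5, 0.5])
print(wasserstein_lp(x, y, a, b))  # 2.0
```

Note that the n + m marginal constraints are redundant by one (both sets sum to the total mass), which is harmless for the solver and reflects the non-uniqueness of vertex solutions discussed above.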

Throughout the paper, we will consider the so-called Kantorovich–Rubinstein distance, a special case of the Wasserstein distance obtained by setting in (2) (and hence we will omit the subscript). Moreover, we assume the metric to be induced by an arbitrary norm on .

III Data-enabled optimal control

First, we formulate the predictive control problem for stochastic LTI systems addressed. Then, after a brief digression on nonparametric models for deterministic LTI systems, we finally recall from [6] a distributionally robust reformulation of the optimal control problem, along with the main reasons that motivate us to consider a synthetic dataset.

III-A Constrained control of stochastic LTI systems

Let us consider the following discrete time, stochastic LTI system:


where , , , , and . Here, , , and are the state, control input, output and stochastic disturbance of the system at time , respectively. Throughout the paper, we assume the following.

Standing Assumption 1

The pair is controllable, while the pair is observable.

Then, we assume that the system matrices defining (4) are unknown, and that we have access to input-output measurements, , . Any is therefore affected by the realization of the uncertainty , drawn from an unknown probability distribution , supported on .

Along the lines of stochastic model predictive control [18], we consider a finite-horizon control problem over the time horizon for (4), where the main goal is to design a (possibly constrained) sequence of inputs, namely , while minimizing a given cost function . Specifically, due to the stochastic nature of the problem in question, we aim at solving the following optimization problem:


where is the -fold product distribution characterizing over the whole horizon . For the remainder, we adopt the same assumptions on the cost function postulated in [6], as formalized next.

Standing Assumption 2

For all , is a separable function, namely , where , are convex and continuous. In addition, is such that is a bounded set.

III-B Learning models in deterministic LTI systems

Recently, Willems' fundamental lemma [1, Th. 1] has been exploited (or has provided inspiration) to develop data-consistent, nonparametric models for unknown, deterministic LTI systems [6, 2, 4, 3]. Specifically, let us consider the system in (4) with , for all , and let us assume that we conduct several experiments of length to collect input-output measurements for different time shifts. The data can be organized within a matrix , where

and is defined similarly. Here, the first subscript refers to the time index of the top-left entry of a given matrix, the second refers to the number of block-rows, while the third refers to the number of columns. Note that both and have constant vector entries along the block anti-diagonals, and therefore belong to the class of Hankel matrices.

As a crucial assumption, Willems' fundamental lemma restricts the class of input sequences over the horizon to persistently exciting signals, defined as follows.

Definition 2

[1] A signal , observed over samples, is persistently exciting of order if the corresponding Hankel matrix has full rank .
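Definition 2 can be checked numerically; the following minimal sketch is our own (the function name `is_persistently_exciting` is an assumption), building the Hankel matrix column by column and testing its rank:

```python
import numpy as np

def is_persistently_exciting(u, L):
    """Definition 2: a signal u of shape (T, m) is persistently exciting
    of order L iff its Hankel matrix with L block-rows has full row rank m*L."""
    T, m = u.shape
    # column i stacks the L consecutive samples u[i], ..., u[i+L-1]
    H = np.column_stack([u[i:i + L].ravel() for i in range(T - L + 1)])
    return np.linalg.matrix_rank(H) == m * L

rng = np.random.default_rng(0)
u_rand = rng.uniform(-1, 1, size=(50, 2))   # random inputs are generically rich
u_const = np.ones((50, 1))                  # a constant input is not
print(is_persistently_exciting(u_rand, 5))  # True
print(is_persistently_exciting(u_const, 2)) # False
```

Consistently with the remark that follows, the Hankel matrix can only have full row rank m·L when the signal is sufficiently long, since it needs at least m·L columns.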

Therefore, it follows immediately that, to be persistently exciting of order , a signal must be sufficiently long, i.e., . Then, by relying on Definition 2, Willems' fundamental lemma is stated next.

Lemma 1

([1, Th. 1]) If the control signal is persistently exciting of order , then for any -long input-output trajectory of the deterministic version of the system in (4), , there exists such that


In other words, any linear combination of the columns of the data matrix is an input-output trajectory of length for the system or, equivalently, the column space of spans all possible trajectories of the system. Remarkably, the solution to (6) is in general not unique, i.e., given a certain dataset , there might exist several vectors that associate the data with a predefined input-output trajectory, . Thus, given the current time , we rearrange the available input-output dataset to restate the deterministic version of (5). Specifically, we split into , , and . While the former ( and ) are data matrices for the “forward” propagation (from onward) of the control sequence and output prediction, respectively, the latter ( and ) define the consistency constraints associated with less recent measurements, together with the sequences and . Thus, Lemma 1 enables the definition of a data-consistent, nonparametric model for a deterministic LTI system, which leads to the deterministic counterpart of (5):


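The past/future splitting above can be sketched numerically on a scalar deterministic system (our own illustration; the names `deepc_predict`, `Up`, `Yp`, `Uf`, `Yf` are assumptions): by Lemma 1, any vector g satisfying the consistency constraints reproduces the correct future output, so a minimum-norm least-squares solution suffices for prediction.

```python
import numpy as np

def deepc_predict(Up, Yp, Uf, Yf, u_ini, y_ini, u):
    """Solve [Up; Yp; Uf] g = [u_ini; y_ini; u] for a (minimum-norm) g,
    which exists by Lemma 1 under persistency of excitation, then read
    off the predicted output y = Yf g."""
    A = np.vstack([Up, Yp, Uf])
    rhs = np.concatenate([u_ini, y_ini, u])
    g, *_ = np.linalg.lstsq(A, rhs, rcond=None)
    return Yf @ g

# scalar system x+ = 0.9 x + u, y = x (unknown to the "controller")
rng = np.random.default_rng(1)
T, T_ini, N = 60, 2, 3
L = T_ini + N
u_d = rng.uniform(-1, 1, T)
x, y_list = 0.0, []
for t in range(T):
    y_list.append(x)
    x = 0.9 * x + u_d[t]
y_d = np.array(y_list)

# Hankel matrices with L block-rows, split into "past" and "future"
hankel = lambda s: np.column_stack([s[i:i + L] for i in range(T - L + 1)])
Hu, Hy = hankel(u_d), hankel(y_d)
Up, Uf, Yp, Yf = Hu[:T_ini], Hu[T_ini:], Hy[:T_ini], Hy[T_ini:]

# a fresh trajectory, started from a different initial state
x, u_t, y_list = 0.7, rng.uniform(-1, 1, L), []
for t in range(L):
    y_list.append(x)
    x = 0.9 * x + u_t[t]
y_t = np.array(y_list)

y_hat = deepc_predict(Up, Yp, Uf, Yf, u_t[:T_ini], y_t[:T_ini], u_t[T_ini:])
print(np.allclose(y_hat, y_t[T_ini:], atol=1e-6))  # True
```

The prediction is exact here because the random data input is persistently exciting and the initial-condition window is at least as long as the system lag.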
III-C A distributionally robust data-enabled control problem

Although not explicitly used, Lemma 1 provided the insight in [6] to design a robust DeePC algorithm to solve the predictive control problem in (5). The realization of the uncertain parameter , indeed, poses several challenges around the consistency constraints in (7), since it takes values according to the unknown distribution , directly affecting the output trajectories. However, after some manipulations, a distributionally robust, semi-infinite reformulation of (7) was proposed in [6], which reads as


where corresponds to the vector stacked with a constant term, and therefore the set coincides with the feasible set of (7) extended to , accordingly. Then, stacks all the objects in (7) affected by the realization of , i.e., , and , whose (unknown) probability distribution is denoted by , supported on . As in [19, 6], we next postulate a light-tailed assumption for the probability distribution , which is however automatically satisfied in case identifies a compact set.

Assumption 1

There exists some such that

Moreover, with we identify the effective measurements, i.e., realizations of the uncertain objects that also affect the constraints set , and is the empirical distribution of such measurements. Finally, is obtained by massaging and substituting vectors and .

By leveraging the results in [19], under Assumption 1, [6, Th. 4.2] shows that considering the Wasserstein ball as ambiguity set in (8) allows for a convex reformulation, whose solution upper bounds the out-of-sample performance, , with high confidence. Specifically, for some , the semi-infinite problem in (8) is upper-bounded by


Thus, by denoting as an optimal solution to (9), with subvector of , the control action provides probabilistic guarantees to obtain good control performance with respect to possible realizations of the stochastic trajectory associated with the ambiguity set .

However, due to the persistent excitation assumption, along with a possibly large historical dataset, the Hankel matrix gathering all the measurements might be disproportionately large, leading to quite challenging online computations. This severely limits the field of possible applications, ruling out all those critical systems that require high-frequency sampling. Moreover, since , one may also be induced to choose short control horizons , thereby obtaining exceedingly conservative control performance. These reasons motivate us to design a suitable procedure to compress the intrinsic information concealed within the whole dataset into a synthetic one.

IV Data compression as a variational Wasserstein problem

In this section, we design an offline, optimal transport-based procedure that allows us to compress the original dataset, i.e., to select a limited set of samples that “best” summarize the information content of . This clearly entails a lower computational burden in solving (9), hence making such a distributionally robust control synthesis more appealing for an online implementation. Moreover, we show that the optimal solution obtained by means of the synthetic dataset enjoys the same probabilistic guarantees on a Wasserstein ball with a slightly larger radius. This robust overestimate, which depends on the number of samples adopted, vanishes as the size of the synthetic dataset increases.

Thus, let the dimensions of be fixed, i.e., chosen so that the control input is persistently exciting of order as in Lemma 1, where , . The empirical distribution of such measurements is defined as , where and is the -th column of . In this context, our goal is to find a set of locations , with , whose empirical probability distribution , , is close to that of the original dataset. Hence, our problem can be formulated in an optimal transport fashion as


Such an optimization problem has a strong practical interpretation: an optimal solution to (10), , is generated by a synthetic set of samples (or atoms) defining the closest (in terms of Wasserstein distance, and hence transportation cost) empirical probability distribution, , that approximates the original one, . The nested optimization in (10) is a variational Wasserstein problem, representing a particular case of the Wasserstein barycentres problem [20]. Specifically, it falls directly within the class of k-means problems [21]. In addition to allowing for a semi-discrete setting, the advantages of adopting the Wasserstein distance as a metric to compare probability measures have been proved in many practical and theoretical problems, spanning from dictionary and statistical learning [22, 23] to vision and image processing [24, 25].

IV-A On the optimal transport problem (10)

Despite its appealing structure and strong practical interpretation, the variational Wasserstein problem in (10) is convex in each variable separately, i.e., in and , but not jointly. In practice, this may lead to computing locally optimal solutions, where every atom defining each optimal identifies a specific barycentre for a subset of samples in . Specifically, (10) corresponds to

where . Thus, for any fixed , every nonzero defines the quantity of a given sample of that is associated with the barycentre . It then follows immediately that every atom defining belongs to . In general, since each assumes continuous values, we note that every sample may be split among several barycentres , hence generating overlapping clusters.

Since it is the minimum of affine functions, the Wasserstein distance itself is not smooth in its arguments. To circumvent this problem, the cost function in (10) can be regularized by means of a strictly convex, weighted entropic term, i.e., , for some strictly positive coefficient , whose benefits are twofold [26]: i) the inner optimization problem in (10) admits a closed form, which translates into a matrix balancing problem, and ii) the Wasserstein distance becomes differentiable for any .
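The resulting matrix-balancing problem is solved by the classical Sinkhorn iterations [26]; the following minimal version is our own sketch (the function name `sinkhorn` and the tolerance choices are assumptions), alternating the two marginal-matching scaling updates:

```python
import numpy as np

def sinkhorn(a, b, D, eps, iters=1000):
    """Entropy-regularized transport: the optimal plan has the form
    diag(u) K diag(v) with K = exp(-D / eps); u and v are obtained by
    alternately matching the row and column marginals (matrix balancing)."""
    K = np.exp(-D / eps)
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)   # match column marginals
        u = a / (K @ v)     # match row marginals
    T = u[:, None] * K * v[None, :]
    return T, float(np.sum(T * D))

a = b = np.array([0.5, 0.5])
D = np.array([[2.0, 3.0], [1.0, 2.0]])   # pairwise ground costs
T, cost = sinkhorn(a, b, D, eps=0.05)
print(np.allclose(T.sum(axis=1), a))     # True: row marginals match
```

For small eps the regularized cost approaches the unregularized LP value, while larger eps yields smoother, faster-converging but biased plans.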

However, by exploiting the convexity in each variable separately, a possible solution algorithm is a typical ABCD (alternating block coordinate descent) method, whose main steps are summarized in Algorithm 1.

Initialization: Select
Iteration :
  • Compute

  • Compute

Algorithm 1 ABCD method
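For intuition, when the ground cost is the squared Euclidean distance and the synthetic weights are left free, the two alternating steps of Algorithm 1 reduce to the classical Lloyd (k-means) updates: the transport step sends each sample's mass entirely to its nearest atom, and the barycentre step moves each atom to its cluster mean. The sketch below is our own simplification under these assumptions (the paper uses the Kantorovich–Rubinstein cost, for which the barycentre step differs), and the name `compress_dataset` is hypothetical:

```python
import numpy as np

def compress_dataset(X, K, iters=20, seed=0):
    """Alternate the two blocks of (10) under a squared-Euclidean cost:
    (i) transport step: with free weights, the optimal plan assigns each
        sample entirely to its nearest atom;
    (ii) barycentre step: each atom moves to the mean of its cluster."""
    rng = np.random.default_rng(seed)
    N, _ = X.shape
    Y = X[rng.choice(N, K, replace=False)].copy()  # initialize atoms at data points
    for _ in range(iters):
        d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)                 # nearest-atom assignment
        for k in range(K):
            if np.any(labels == k):
                Y[k] = X[labels == k].mean(axis=0) # cluster barycentre
    weights = np.bincount(labels, minlength=K) / N
    return Y, weights

# two well-separated clusters of 50 noisy samples each
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.0, 0.1, (50, 2)), rng.normal(5.0, 0.1, (50, 2))])
Y, w = compress_dataset(X, K=2)
print(np.round(Y))   # atoms recover the two cluster centres
```

Note that this hard-assignment variant produces non-overlapping clusters, unlike the general case discussed above where mass can be split among several barycentres.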

IV-B Robust performance guarantees

In this context, the strategy is to solve the optimal transport problem in (10) to reformulate the robust optimization problem in (8) with a Wasserstein ball centred in and (possibly) different radius as ambiguity set, which reads as

Remark 1

As long as , (11) is defined on a lower dimensional space compared to (8). However, for simplicity’s sake, we will keep the same notation.

Next, we show that an optimal solution to the robust optimization problem in (11) upper bounds the out-of-sample performance with high confidence.

Proposition 1

Let and be given. Under Assumption 1, there exists some such that, for all ,

First, given any , let us define . Then, the triangle inequality for the Wasserstein metric ensures that

Furthermore, in view of Assumption 1, it follows from [19, Th. 3.4] that , and therefore, since is an empirical distribution, we obtain . This latter relation, which can be equivalently restated as , where , directly implies with probability , thus concluding the proof.

As a remark, we note that the Wasserstein distance between and can be made arbitrarily small by increasing the number of synthetic atoms , since as . In this case, we precisely recover the radius of the ambiguity set in [6, Th. 4.1]. Finally, although defined on a lower-dimensional space, the robust optimization problem in (11) can be manipulated to obtain a tractable convex reformulation equivalent to the one in (9), which can however be solved online in a lower computational time. This therefore paves the way to possibly longer control horizons and higher sampling frequencies, as shown next.

V Numerical simulation

Offline: Given , set , compute
Initialization: Set , and
Iteration :
  • Compute

  • Set , apply

  • Retrieve current measurements, update ,

Algorithm 2 Receding horizon robust synDeePC
TABLE I: Simulation parameters

We apply here the (syn)DeePC method presented in [6] and summarized in Algorithm 2 to a linearized model of a quadcopter in a receding-horizon fashion, comparing control and computational performance when considering the original dataset and the compressed one. The simulations are run in the Matlab environment on a laptop with a quad-core Intel Core i5 2.4 GHz CPU and 8 GB of RAM, with the main parameters summarized in Table I.

The linear model adopted is valid around a hover position, where the state vector is . Specifically, , and are the three spatial coordinates and the corresponding velocities, while , , and are the angular coordinates, with the corresponding rates. The control inputs are represented by four rotors, constrained to the set . By assuming full state measurement, we use the same state-space matrices adopted in [6], as well as the same original cost function, , where denotes the parametrized figure-8 trajectory. In this context, in Table I represents the temporal resolution with which the reference trajectory is sampled, ideally corresponding to the sampling time. Moreover, the columns of the matrix are filled by means of random inputs drawn from a uniform distribution on to guarantee the persistency of excitation, according to Definition 2.

Fig. 1: Wasserstein distance behaviour as the number of synthetic atoms grows.

Some a priori considerations on the choice of the parameter in Algorithm 2 can be made, e.g., in a data-driven fashion by evaluating the behaviour of the Wasserstein distance as varies. According to Fig. 1, the function seems to assume reasonable values for , leading to an offline step in Algorithm 2 taking less than six minutes. Interestingly, we note that the effect of local minima seems to prevent the Wasserstein distance from being monotonically decreasing for values of , i.e., , as evidenced in Fig. 1.

Fig. 2: Dynamical evolution of the controlled quadcopter while following a figure-8 trajectory.
Fig. 3: Spatial coordinate tracking errors.

In Figs. 2 and 3, we compare the trajectory-tracking performance of the quadrotor controlled by means of the DeePC with the full matrix (solid lines) and with the synthetic dataset, , obtained first by reducing the total number of samples by 50%, i.e., (solid-dashed lines), which also corresponds to a reduction of 70% when considering a longer control horizon, i.e., (dotted lines). As shown in Fig. 3, where the position errors of the spatial coordinates are illustrated, the performance of the robust controller computed by means of the compressed dataset does not deteriorate excessively compared with the one computed by means of , also exhibiting an almost overlapping behaviour as the control horizon increases. Moreover, from our numerical experience, step (S1) in Algorithm 2 with the compressed dataset takes approximately [s] on average, in contrast with the [s] required by the original dataset, see Fig. 4. On the other hand, a longer control horizon does not lead to a much higher computational time, i.e., [s], while considering the whole dataset would take around [s] to solve the DeePC optimization problem.

Fig. 4: Computational time over the whole trajectory tracking control problem.

VI Conclusion and Outlook

In data-driven robust control problems, optimal transport theory promises to be a key tool for the design of synthetic datasets able to provide both robust performance guarantees and reasonable computational time for real-time implementation. Specifically, we have investigated the benefits of adopting the Wasserstein metric to compress the informative content of a large dataset into a smaller one, also illustrating the performance of the robust controller obtained by means of the synthetic dataset compared to the original one. Future research directions will focus on the impact that the discrete measure adopted to compare the empirical distribution associated with the original data, i.e., the vector in (2), has on the robustness. Given the input-output structure of the gathered data, we will also investigate the possibility of using different distances to define the Wasserstein metric, as well as a jointly convex reformulation of (10).


  • [1] J. C. Willems, P. Rapisarda, I. Markovsky, and B. L. De Moor, “A note on persistency of excitation,” Systems & Control Letters, vol. 54, no. 4, pp. 325–329, 2005.
  • [2] C. De Persis and P. Tesi, “Formulas for data-driven control: Stabilization, optimality and robustness,” IEEE Transactions on Automatic Control, 2019.
  • [3] H. J. Van Waarde, J. Eising, H. L. Trentelman, and M. K. Camlibel, “Data informativity: a new perspective on data-driven analysis and control,” IEEE Transactions on Automatic Control, 2020.
  • [4] J. Berberich, A. Romer, C. W. Scherer, and F. Allgöwer, “Robust data-driven state-feedback design,” arXiv preprint arXiv:1909.04314, 2019.
  • [5] J. Coulson, J. Lygeros, and F. Dörfler, “Data-enabled predictive control: In the shallows of the DeePC.”
  • [6] ——, “Regularized and distributionally robust data-enabled predictive control,” in 2019 IEEE 58th Conference on Decision and Control (CDC), 2019, pp. 2696–2701.
  • [7] J. Berberich, J. Köhler, M. A. Müller, and F. Allgöwer, “Data-driven tracking MPC for changing setpoints,” arXiv preprint arXiv:1910.09443, 2019.
  • [8] S. T. Roweis and L. K. Saul, “Nonlinear dimensionality reduction by locally linear embedding,” Science, vol. 290, no. 5500, pp. 2323–2326, 2000.
  • [9] P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe, “Convexity, classification, and risk bounds,” Journal of the American Statistical Association, vol. 101, no. 473, pp. 138–156, 2006.
  • [10] J. Wang and S. J. Qin, “A new subspace identification approach based on principal component analysis,” Journal of process control, vol. 12, no. 8, pp. 841–855, 2002.
  • [11] ——, “Closed-loop subspace identification using the parity space,” Automatica, vol. 42, no. 2, pp. 315–320, 2006.
  • [12] B. Schölkopf, A. Smola, and K.-R. Müller, “Kernel principal component analysis,” in International Conference on Artificial Neural Networks.   Springer, 1997, pp. 583–588.
  • [13] P. Van Overschee and B. De Moor, “Closed loop subspace system identification,” in Proceedings of the 36th IEEE Conference on Decision and Control, vol. 2.   IEEE, 1997, pp. 1848–1853.
  • [14] T. McKelvey, H. Akçay, and L. Ljung, “Subspace-based multivariable system identification from frequency response data,” IEEE Transactions on Automatic Control, vol. 41, no. 7, pp. 960–979, 1996.
  • [15] M. Jansson and B. Wahlberg, “On consistency of subspace methods for system identification,” Automatica, vol. 34, no. 12, pp. 1507–1519, 1998.
  • [16] C. Villani, Topics in optimal transportation.   American Mathematical Society, 2003, no. 58.
  • [17] G. Peyré and M. Cuturi, “Computational optimal transport,” Foundations and Trends® in Machine Learning, vol. 11, no. 5-6, 2019.
  • [18] P. Hokayem, E. Cinquemani, D. Chatterjee, F. Ramponi, and J. Lygeros, “Stochastic receding horizon control with output feedback and bounded controls,” Automatica, vol. 48, no. 1, pp. 77–88, 2012.
  • [19] P. M. Esfahani and D. Kuhn, “Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations,” Mathematical Programming, vol. 171, no. 1-2, pp. 115–166, 2018.
  • [20] M. Cuturi and A. Doucet, “Fast computation of Wasserstein barycenters,” Journal of Machine Learning Research, 2014.
  • [21] M. K. Ng, “A note on constrained k-means algorithms,” Pattern Recognition, vol. 33, no. 3, pp. 515–519, 2000.
  • [22] A. Rolet, M. Cuturi, and G. Peyré, “Fast dictionary learning with a smoothed Wasserstein loss,” in Artificial Intelligence and Statistics, 2016, pp. 630–638.
  • [23] C. Frogner, C. Zhang, H. Mobahi, M. Araya, and T. A. Poggio, “Learning with a Wasserstein loss,” in Advances in Neural Information Processing Systems, 2015, pp. 2053–2061.
  • [24] M. Cuturi and G. Peyré, “A smoothed dual approach for variational Wasserstein problems,” SIAM Journal on Imaging Sciences, vol. 9, no. 1, pp. 320–343, 2016.
  • [25] J. Lellmann, D. A. Lorenz, C. Schonlieb, and T. Valkonen, “Imaging with Kantorovich–Rubinstein discrepancy,” SIAM Journal on Imaging Sciences, vol. 7, no. 4, pp. 2833–2859, 2014.
  • [26] M. Cuturi, “Sinkhorn distances: Lightspeed computation of optimal transport,” in Advances in neural information processing systems, 2013, pp. 2292–2300.