Pursuit of a Discriminative Representation for Multiple Subspaces via Sequential Games

We consider the problem of learning discriminative representations for data in a high-dimensional space with distribution supported on or around multiple low-dimensional linear subspaces. That is, we wish to compute a linear injective map of the data such that the features lie on multiple orthogonal subspaces. Instead of treating this learning problem using multiple PCAs, we cast it as a sequential game using the closed-loop transcription (CTRL) framework recently proposed for learning discriminative and generative representations for general low-dimensional submanifolds. We prove that the equilibrium solutions to the game indeed give correct representations. Our approach unifies classical methods of learning subspaces with modern deep learning practice, by showing that subspace learning problems may be provably solved using the modern toolkit of representation learning. In addition, our work provides the first theoretical justification for the CTRL framework, in the important case of linear subspaces. We support our theoretical findings with compelling empirical evidence. We also generalize the sequential game formulation to more general representation learning problems. Our code, including methods for easy reproduction of experimental results, is publicly available on GitHub.





Code Repositories

Official repository for the paper "Pursuit of a Discriminative Representation for Multiple Subspaces via Sequential Games" by Pai et al.

1 Motivation and Context

Learning representations of complex high-dimensional data with low underlying complexity is a central goal in machine learning, with applications to compression, sampling, out-of-distribution detection, and classification. For example, in the context of image data, one may perform clustering [40], and generate or detect fake images [23]. There are a number of recently popular methods for representation learning, several proposed in the context of image generation; one such example is generative adversarial networks (GANs) [18], giving promising results [27, 37]. Despite empirical successes, theoretical understanding of representation learning of high-dimensional data with low complexity is still in its infancy. Classical methods with theoretical guarantees, such as principal component analysis (PCA) [25], are divorced from modern methods such as GANs, whose justifications are mostly empirical and whose theoretical properties remain poorly understood [15, 13].

A challenge for our theoretical understanding is that high-dimensional data often has low-dimensional structure, such as belonging to multiple subspaces and even nonlinear manifolds [50, 34, 54, 42, 53, 52, 41, 33, 14]. This hypothesis can be difficult to account for theoretically.¹ In fact, our understanding of this setting, and knowledge of principled and generalizable solutions, is still incomplete, even in the case when the data lies on multiple linear subspaces [49] and the representation map is linear.

¹ One assumption which implicitly violates this hypothesis is the existence of a probability density for the data. For instance, the analysis in several prominent works on representation learning, such as [28, 15], critically requires this assumption to hold. Probability densities with respect to the Lebesgue measure on $\mathbb{R}^{d}$ do not exist if the underlying probability measure has support of Lebesgue measure zero, e.g., for lower-dimensional structures such as subspaces [26]. Thus, this assumption excludes low-dimensional data.

In this work, we aim to bridge this gap. More specifically, we propose a new theoretically principled formulation, based on sequential game theory and modern representation learning, for learning discriminative representations of data supported on multiple low-dimensional linear subspaces of a high-dimensional space. We explicitly characterize the learned representations in this framework. Our results show that classical subspace learning problems can be solved using modern deep learning tools, thus unifying the classical and modern perspectives on this class of problems.

1.1 Related Works

PCA and Autoencoding.

Principal component analysis (PCA) and its probabilistic versions [22, 45] are a classical tool for learning low-dimensional representations: one finds the best $\ell^{2}$-approximating subspace of a given dimension for the data. Thus, PCA can be viewed as seeking to learn the linear subspace structure of the data.
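To make this classical baseline concrete, the best approximating subspace can be read off from a truncated SVD. The following is a minimal numpy sketch (our own illustration; centering the data first is an assumption, matching the usual PCA convention):

```python
import numpy as np

def pca_subspace(X, r):
    """Return an orthonormal basis for the best r-dimensional
    approximating subspace of the (centered) columns of X."""
    Xc = X - X.mean(axis=1, keepdims=True)  # center the data
    U, _, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :r]  # the top-r left singular vectors span the PCA subspace
```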

Several generalizations of PCA exist. Generalized PCA (GPCA) [48] seeks to learn multiple linear subspace structure by clustering. Unlike PCA and this work, GPCA does not learn transformed representations of the data; here, we learn discriminative representations of multiple linear subspace structure. PCA has also been adapted to recover nonlinear structures in many ways [47], e.g., via principal curves [19] or autoencoders.

GANs.

Generative adversarial networks (GANs) are a recently popular representation learning method [18, 1]. GANs simultaneously learn a generator function, which maps low-dimensional noise to the data distribution, and a discriminator function, which maps the data to discriminative representations from which one can classify the data as authentic or synthetic with a simple predictor. The generator and discriminator are trained adversarially: the generator is trained to generate data which is distributionally close to the real data, in order to fool the discriminator, while the discriminator is simultaneously trained to identify discrepancies between the generator output and the empirical data.

While GANs enjoy empirical success (e.g., [27, 37]), their theoretical properties are less well developed, especially in the context of high-dimensional data with intrinsic structure. More specifically, the most prominent works of GAN analysis use the simplifying assumption of full-rank data [15], require explicit computation of objective functions which are intractable to even estimate using a finite sample [1, 55], or show that GANs have poor theoretical behavior, such as their training game not having Nash equilibria [13]. In this work, we adopt the more realistic assumption of low-dimensional data in a high-dimensional space, use explicit, closed-form objective functions which are more convenient to optimize (at least in the linear case), and demonstrate the existence of global equilibria of the training game corresponding to our method.

2 Preliminaries

2.1 Representation Learning

Suppose $X \in \mathbb{R}^{d \times n}$ is a data matrix that contains $n$ data points in $\mathbb{R}^{d}$. To model that the dimension of the support of the data is lower than the ambient dimension $d$, we consider data supported on a union of $k$ linear subspaces $\mathcal{S}_{1}, \dots, \mathcal{S}_{k} \subseteq \mathbb{R}^{d}$, each of dimension $d_{j} < d$. For each $j \in \{1, \dots, k\}$, let $X_{j}$ be the matrix of the $n_{j}$ columns of $X$ contained in $\mathcal{S}_{j}$, and let the class information, containing the assignment of each data point to its respective subspace index, be denoted by $\Pi$.

The goal is to learn an encoder mapping $f$, in some function class $\mathcal{F}$, such that $f$ takes values in $\mathbb{R}^{d_{z}}$. For representation learning, we want $f$ to be such that the representations $Z \doteq f(X)$ have better geometric properties than $X$, such as lying on orthogonal subspaces. Moreover, we want to learn an inverse or decoder mapping $g$ in some function class $\mathcal{G}$, such that, for each data point $x$, $g(f(x))$ is close to $x$.

2.2 Closed-Loop Transcription

To learn the encoder mapping $f$ and decoder mapping $g$, we use the Closed-Loop Transcription (CTRL) framework, a recent method proposed for representation learning of low-dimensional submanifolds in high-dimensional space [6]. This framework generalizes both GANs and autoencoders: $f$ has dual roles as an encoder and discriminator, while $g$ has dual roles as a decoder and a generator.

For the data matrix $X$, we define $Z \doteq f(X)$, $\hat{X} \doteq g(Z)$, and $\hat{Z} \doteq f(\hat{X})$, and use similar notations throughout. The training process follows a closed loop: starting with the data $X$ and its representations $Z$, the representations $\hat{Z}$ of the autoencoded data $\hat{X}$ are used to train $f$ and $g$. This approach has a crucial advantage over the GAN formulation: contrary to GANs [1, 55], since $Z$ and $\hat{Z}$ both live in the structured representation space $\mathbb{R}^{d_{z}}$, interpretable measures of representation quality and of the difference between $Z$ and $\hat{Z}$ exist and may be computed in closed form. These tractable measures are based on the paradigm of rate reduction [6, 51], which we briefly introduce here.

For $\epsilon > 0$ and a matrix of representations $Z \in \mathbb{R}^{d_{z} \times n}$, define

$$R(Z) \doteq \frac{1}{2} \log\det\left(I + \frac{d_{z}}{n\epsilon^{2}} Z Z^{\top}\right).$$

Suppose $Z$ is partitioned into submatrices $Z_{1}, \dots, Z_{k}$ by the class information $\Pi$, where $Z_{j} \in \mathbb{R}^{d_{z} \times n_{j}}$. Define

$$\Delta R(Z \mid \Pi) \doteq R(Z) - \sum_{j=1}^{k} \frac{n_{j}}{n} R(Z_{j}).$$

Finally, for any two matrices $Z_{1}, Z_{2}$ of equal size, define (overloading notation slightly)

$$\Delta R(Z_{1}, Z_{2}) \doteq R([Z_{1}, Z_{2}]) - \frac{1}{2} R(Z_{1}) - \frac{1}{2} R(Z_{2}).$$

The information-theoretic interpretations are as follows. We call $R(Z)$ the coding rate of $Z$; if the columns of $Z$ are sampled i.i.d. according to a zero-mean Gaussian vector, then $R(Z)$ is an estimate of the average number of bits required to encode the columns of $Z$ up to quantization error $\epsilon$ [5, 36]. Furthermore, $\Delta R(Z \mid \Pi)$ is called the rate reduction of $Z$ with respect to $\Pi$. If the columns of each $Z_{j}$ are sampled i.i.d. from a zero-mean multivariate Gaussian distribution, it approximates the average number of bits saved by encoding each representation matrix $Z_{j}$ separately rather than all together as $Z$. One should think of $\Delta R(Z \mid \Pi)$ as a measure of the expressiveness and discriminativeness of the representations [51]. Finally, $\Delta R(Z_{1}, Z_{2})$ should be thought of as a measure of difference² between the two distributions generating the columns of $Z_{1}$ and $Z_{2}$. For a more detailed discussion of rate reduction, see Appendix A.

² Note that $\Delta R(\cdot, \cdot)$ is not strictly a distance; for instance, $\Delta R(Z_{1}, Z_{2}) = 0$ does not imply that $Z_{1} = Z_{2}$ or that the distributions generating $Z_{1}$ and $Z_{2}$ are the same.

2.3 Game Theoretic Formulation

Now that we have measures in the representation space for the properties we want to encourage, we discuss how to train the encoder function $f$ and decoder function $g$.

Several methods, e.g., PCA [20], GANs [18], and the original CTRL formulation [6], can be viewed as learning the encoder (or discriminator) function and decoder (or generator) function by finding the Nash equilibria of an appropriate two-player simultaneous game between the encoder and decoder; we discuss this formulation further in Appendix B. In this work, we approach the problem from a different angle: we learn the encoder function $f$ and decoder function $g$ via an appropriate two-player sequential game between the encoder and the decoder, finding the so-called Stackelberg equilibria. We now cover the basics of sequential game theory; a more complete treatment is found in [2].

In a sequential game between the encoder (whose move corresponds to picking $f \in \mathcal{F}$) and the decoder (whose move corresponds to picking $g \in \mathcal{G}$), both players attempt to maximize their own objectives, the so-called utility functions $u_{\mathrm{enc}}$ and $u_{\mathrm{dec}}$ respectively, by making their moves one at a time. In our formulation, the encoder moves first, and then the decoder aims to invert the encoder.

The solution concept for sequential games, that is, the encoder and decoder that correspond to rational actions, is the Stackelberg equilibrium [16, 24]. In our context, $(f^{\star}, g^{\star})$ is a Stackelberg equilibrium if and only if

$$f^{\star} \in \operatorname*{argmax}_{f \in \mathcal{F}} \left[ \min_{g \in \operatorname*{argmax}_{g' \in \mathcal{G}} u_{\mathrm{dec}}(f, g')} u_{\mathrm{enc}}(f, g) \right] \quad \text{and} \quad g^{\star} \in \operatorname*{argmax}_{g \in \mathcal{G}} u_{\mathrm{dec}}(f^{\star}, g).$$

The sequential nature of the game is reflected in the definition of the equilibrium: the decoder, going second, may play to maximize $u_{\mathrm{dec}}$ with full knowledge of the encoder's play $f^{\star}$, while the encoder plays to maximize $u_{\mathrm{enc}}$ with only the knowledge that the decoder plays optimally.

While gradient descent-ascent (GDA) with equal learning rates for $f$ and $g$ does not suffice to learn Stackelberg equilibria in theory or in practice [24], GDA with lopsided learning rates [24] and GDMax [23, 24] converge to local notions of Stackelberg equilibria in theory. In practice, GDMax converges to Stackelberg equilibria in our examples. For more discussion of practical considerations, see Section 4.
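To illustrate GDMax itself, here is a minimal sketch on a toy quadratic game of our own construction (not the CTRL utilities): the follower is driven to (approximate) optimality between each single gradient step of the leader:

```python
import numpy as np

# Toy zero-sum game (our own example): the encoder plays x, the decoder plays y.
# Encoder utility u(x, y); decoder utility is -u(x, y).
def u(x, y):
    return -x**2 + 2 * x * y + 2 * y**2

def grad_x(x, y):  # ∂u/∂x, used by the leader (ascent on u)
    return -2 * x + 2 * y

def grad_y(x, y):  # ∂u/∂y; the follower ascends -u, i.e. descends u
    return 2 * x + 4 * y

def gdmax(x0=1.0, y0=0.0, lr=0.05, outer_steps=200, inner_steps=100):
    """GDMax: fully optimize the follower (decoder) between each
    single gradient step of the leader (encoder)."""
    x, y = x0, y0
    for _ in range(outer_steps):
        for _ in range(inner_steps):   # inner loop: decoder ascends -u
            y -= lr * grad_y(x, y)
        x += lr * grad_x(x, y)         # outer step: encoder ascends u
    return x, y
```

Here the follower's best response is $y = -x/2$, and the Stackelberg equilibrium of this toy game is $(0, 0)$, which GDMax reaches.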

3 Multiple-Subspace Pursuit via the CTRL Framework

With this background in place, we now introduce the closed-loop multi-subspace pursuit (CTRL-MSP) method. Recall that in Section 2.3 we discussed the idea of learning and as equilibria for a two-player sequential game. We now introduce the game in question.

Let $\mathcal{L}(\mathbb{R}^{d}, \mathbb{R}^{d_{z}})$ be the set of linear maps from $\mathbb{R}^{d}$ to $\mathbb{R}^{d_{z}}$. Let $\|\cdot\|_{F}$ be the usual Frobenius norm on matrices.

Definition 1 (CTRL-MSP Game).

The CTRL-MSP game is a two-player sequential game between:

  1. The encoder, moving first, choosing functions $f$ in the function class $\mathcal{F} \subseteq \mathcal{L}(\mathbb{R}^{d}, \mathbb{R}^{d_{z}})$, consisting of linear maps whose class representations $f(X_{j})$ are normalized in Frobenius norm, and having utility function

$$u_{\mathrm{enc}}(f, g) \doteq \Delta R(f(X) \mid \Pi) + \sum_{j=1}^{k} \Delta R\big(f(X_{j}),\, f(g(f(X_{j})))\big).$$

  2. The decoder, moving second, choosing functions $g$ in the function class $\mathcal{G} \doteq \mathcal{L}(\mathbb{R}^{d_{z}}, \mathbb{R}^{d})$, and having utility function

$$u_{\mathrm{dec}}(f, g) \doteq -\sum_{j=1}^{k} \Delta R\big(f(X_{j}),\, f(g(f(X_{j})))\big).$$

Thus, the encoder aims to maximize the discriminativeness of the representations $Z = f(X)$, as well as the total difference between the representations $Z_{j}$ and the closed-loop reconstructions $\hat{Z}_{j}$. The decoder aims to minimize the latter. Since $\Delta R(f(X) \mid \Pi)$ is not a function of $g$, this game may alternatively be posed as a zero-sum game: keeping the same encoder utility and setting $u_{\mathrm{dec}} \doteq -u_{\mathrm{enc}}$, the learned equilibria would be the same. Zero-sum games enjoy rich convergence guarantees, albeit requiring more regularity than is present here [2, 7, 16], but we do not invoke the zero-sum structure further in this work.
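To make the two utilities concrete, here is a sketch for linear encoders $f(x) = Fx$ and decoders $g(z) = Gz$, consistent with the description above: the encoder utility is the class-wise rate reduction plus the total closed-loop difference, and the decoder utility is the negative of that difference. The normalization constraints on the function classes are omitted, and `eps=0.5` is our own placeholder:

```python
import numpy as np

def R(Z, eps=0.5):
    """Coding rate of the representation matrix Z."""
    d, n = Z.shape
    return 0.5 * np.linalg.slogdet(np.eye(d) + (d / (n * eps**2)) * Z @ Z.T)[1]

def dR_pair(Z1, Z2, eps=0.5):
    """ΔR(Z1, Z2): difference measure between two representation matrices."""
    return R(np.hstack([Z1, Z2]), eps) - 0.5 * R(Z1, eps) - 0.5 * R(Z2, eps)

def utilities(F, G, X_classes, eps=0.5):
    """Encoder/decoder utilities of the CTRL-MSP game for linear f, g.

    X_classes: list of per-subspace data matrices X_j."""
    Z_classes = [F @ Xj for Xj in X_classes]
    Z = np.hstack(Z_classes)
    n = Z.shape[1]
    # ΔR(Z | Π): the expressive and discriminative term
    dR_classes = R(Z, eps) - sum(
        (Zj.shape[1] / n) * R(Zj, eps) for Zj in Z_classes)
    # total difference between representations and closed-loop reconstructions
    loop = sum(dR_pair(Zj, F @ (G @ Zj), eps) for Zj in Z_classes)
    return dR_classes + loop, -loop   # (u_enc, u_dec)
```

For instance, taking `G` to be the pseudoinverse of an injective `F` makes the closed-loop term vanish, so the decoder's utility is maximized at zero.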

Before we characterize the learned encoder and decoder in CTRL-MSP games, it is worthwhile to first discuss the qualitative properties we want them to have, and how we can achieve these properties quantitatively. Recall that we wish to learn an encoder-decoder pair $(f, g)$ such that the encoder is injective and discriminates between the data subspaces, and such that the pair is self-consistent. We study quantitative ways to measure these properties. As a notation, for two sets $A$ and $B$, a map $h \colon A \to B$, and a subset $A' \subseteq A$, we denote by $h(A')$ the image of $A'$ under $h$.

  1. To enforce the injectivity of the encoder, we aim to ensure that each $f(\mathcal{S}_{j})$ is a linear subspace of dimension equal to that of $\mathcal{S}_{j}$, and furthermore, we aim to enforce that $f(X_{j})$ should have no small nonzero singular values.

  2. To discriminate between different subspaces, we aim to enforce the subspace representations $f(\mathcal{S}_{j})$ to be orthogonal for different $j$.

  3. To enforce internal self-consistency, we aim to have the subspace representations $f(X_{j})$ and the autoencoded subspace representations $f(g(f(X_{j})))$ be equal.

We now explicitly characterize the Stackelberg equilibria in the CTRL-MSP game, and show how they connect to each of these properties. For a matrix $A$, let $\operatorname{col}(A)$ be the vector space spanned by the columns of $A$. Let $\sigma_{1}(A) \ge \sigma_{2}(A) \ge \cdots$ be the singular values of $A$ sorted in non-increasing order. Finally, for subspaces $V_{1}, \dots, V_{k}$, denote by $V_{1} + \cdots + V_{k}$ the sum vector space. With these notations, our key assumptions are summarized below:

Assumption 2 (Assumptions in CTRL-MSP Games).

  1. (Multiple classes) $k \ge 2$.

  2. (Informative data) For each $j$, $\operatorname{col}(X_{j}) = \mathcal{S}_{j}$.

  3. (Large enough representation space) $d_{z} \ge \sum_{j=1}^{k} d_{j}$.

  4. (Incoherent class data) $\dim\big(\operatorname{col}(X_{1}) + \cdots + \operatorname{col}(X_{k})\big) = \sum_{j=1}^{k} \dim\big(\operatorname{col}(X_{j})\big)$.³

  5. (High coding precision) The quantization error $\epsilon$ is sufficiently small relative to the dimensions and sample sizes of the data.

³ An intuitive understanding of condition 4 is that if we take a linearly independent set from each $\mathcal{S}_{j}$, the union of all these sets is still linearly independent.

Our main result is:

Theorem 3 (Stackelberg Equilibria of CTRL-MSP Games).

If Assumption 2 holds, then the CTRL-MSP game has the following properties:

  1. A Stackelberg equilibrium exists.

  2. Any Stackelberg equilibrium enjoys the following properties:

    1. (Injective encoder) For each $j$, we have that $f(\mathcal{S}_{j})$ is a linear subspace of dimension $d_{j}$. Further, the nonzero singular values $\sigma_{1}(f(X_{j})), \dots, \sigma_{d_{j}}(f(X_{j}))$ are characterized explicitly (see Appendix C); in particular, they are bounded away from zero and close to each other, so $f(X_{j})$ has no small nonzero singular values.

    2. (Discriminative encoder) The subspaces $f(\mathcal{S}_{1}), \dots, f(\mathcal{S}_{k})$ are orthogonal for distinct indices.

    3. (Consistent encoding and decoding) For each $j$, we have that $f(X_{j}) = f(g(f(X_{j})))$.

The proof of this theorem uses Theorem 6; the proofs of both are deferred to Appendix C. Once in the framework of Theorem 6, the main difficulty is the characterization of the maximizers of the rate reduction objective. This function is non-convex and challenging to analyze; we characterize its maximizers by carefully applying inequalities on the singular values of the representation matrices.

As the theorem indicates, the earlier checklist of desired quantitative properties is achieved by CTRL-MSP. That is, CTRL-MSP provably learns injective and discriminative representations of multiple-subspace structure.

For the special case $k = 1$, where we learn a single-subspace structure, it is possible to change the utility functions and function classes of CTRL-MSP to learn a different set of properties that more closely mirror PCA. In particular, a Stackelberg equilibrium encoder of this modified game does not render the covariance of the representations nearly isotropic; instead, it is a scaled isometry on the subspace, which ensures well-behaved injectivity. The details are left to Appendix D.

We now discuss an implication of the CTRL-MSP method. The original problem statement of learning discriminative representations for multiple subspace structure may be solved directly via orthogonalizing the representations of multiple PCAs. However, CTRL-MSP provides an alternative approach: simultaneously learning and representing the subspaces via a modern representation learning toolkit. This gives a unifying perspective on classical and modern representation learning, by showing that classical methods can be viewed as special cases of modern methods, and that they may be formulated to learn the same types of representations. A major benefit (discussed further in Section 5) is that the new formulation can be readily generalized to much broader families of structures, beyond subspaces to submanifolds, as compelling empirical evidence in [6] demonstrates.
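The classical alternative mentioned above, per-class PCA with orthogonalized representations, can be sketched directly. This is our own illustrative baseline, not the CTRL-MSP algorithm: each class's principal coordinates are written into a disjoint, hence orthogonal, block of the representation space.

```python
import numpy as np

def multi_pca_orthogonal(X_classes, dims):
    """Build a linear encoder F in R^{d_z x d}, d_z = sum(dims), mapping each
    class onto its own orthogonal coordinate block of the representation space."""
    d = X_classes[0].shape[0]
    F = np.zeros((sum(dims), d))
    offset = 0
    for Xj, dj in zip(X_classes, dims):
        U, _, _ = np.linalg.svd(Xj, full_matrices=False)
        # top-dj principal directions of class j, written into block j
        F[offset:offset + dj, :] = U[:, :dj].T
        offset += dj
    return F
```

When the class data are orthogonal to begin with, the resulting representations of different classes are exactly orthogonal; CTRL-MSP instead reaches representations with the same guarantees as an equilibrium of a learning game.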

4 Empirical Evaluation of CTRL-MSP

We demonstrate empirical convergence of CTRL-MSP to equilibria which satisfy the conclusions of Theorem 3, in both the benign case of nearly-orthogonal subspaces, and more correlated subspaces. We then demonstrate CTRL-MSP’s robustness to noise. For more details regarding the experimental setup, CTRL-MSP’s empirical properties, and experimental results clarifying the difference between CTRL-MSP and other popular representation learning algorithms, see Appendix E.

We fix baseline values of the ambient dimension $d$, the representation dimension $d_{z}$, the number of subspaces $k$, the subspace dimensions $d_{j}$, the sample sizes $n_{j}$, and the quantization error $\epsilon$; the exact values are given in Appendix E. Mini-batches are randomly sampled during optimization. To generate data, we fix a pool size $D$, and generate a single random matrix $U \in \mathbb{R}^{d \times D}$ with orthonormal columns using the QR decomposition. Then, for each subspace, we select $d_{j}$ random columns from $U$ uniformly without replacement to form a matrix $U_{j}$. We then generate random matrices $A_{j} \in \mathbb{R}^{d_{j} \times n_{j}}$, whose entries are i.i.d. standard normal random variables, and set $\tilde{A}_{j}$ to be the matrix whose columns are the normalized columns of $A_{j}$. We finally obtain $X_{j} \doteq U_{j} \tilde{A}_{j}$. If $D$ is small, the generated data is highly coherent across subspaces; if $D$ is large, the data are incoherent, since high-dimensional random vectors with near-independent entries are incoherent with high probability [50]. Thus, this data generation process allows us to test CTRL-MSP and other algorithms on data with various correlation structures.
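This generation process can be sketched as follows. The parameter names (pool size `D`, noise level `sigma`) are our own notation for the elided baseline values, which are given in Appendix E; `sigma` anticipates the noisy setting of Section 4.3:

```python
import numpy as np

def generate_union_of_subspaces(d, dims, ns, D, sigma=0.0, seed=0):
    """Sample data on a union of subspaces following the recipe above.

    d: ambient dimension; dims[j]: subspace dimensions; ns[j]: points per
    subspace; D: size of the shared orthonormal column pool; sigma:
    off-subspace noise level."""
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((d, D)))  # shared orthonormal pool
    X_classes = []
    for dj, nj in zip(dims, ns):
        cols = rng.choice(D, size=dj, replace=False)   # basis columns for class j
        A = rng.standard_normal((dj, nj))
        A /= np.linalg.norm(A, axis=0, keepdims=True)  # normalize coefficients
        X_classes.append(U[:, cols] @ A + sigma * rng.standard_normal((d, nj)))
    return X_classes
```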

In each experiment, we measure the success of CTRL-MSP in three qualitative ways. Below, let $\|\cdot\|_{2}$ denote the standard $\ell^{2}$ norm.

  1. We check that each column of $Z_{j}$ is strongly correlated (or coherent) with the other columns of $Z_{j}$, and nearly orthogonal to the columns of $Z_{\ell}$ for $\ell \ne j$. This may be checked by plotting the heatmap of pairwise absolute cosine similarities $|\langle z, z' \rangle| / (\|z\|_{2} \|z'\|_{2})$ between columns. We order our data points so that points from the same subspace are contiguous, so we aim for this heatmap to be perfectly block diagonal.

  2. We check that the spectra of each $Z_{j}$ are as described by our theoretical results; there are $d_{j}$ nonzero singular values, and they are close to each other.

  3. We check that $f(\mathcal{S}_{j}) = f(g(f(\mathcal{S}_{j})))$. Since these subspaces are spanned by the columns of $Z_{j}$ and $\hat{Z}_{j}$ respectively, we test this by plotting the distribution of the projection residual $\|z - P_{\hat{Z}_{j}} z\|_{2}$, where $P_{\hat{Z}_{j}}$ is the orthogonal projection onto $\operatorname{col}(\hat{Z}_{j})$, for all $z$ which are columns of $Z_{j}$; we aim for this value to be near zero for most or all $z$.
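The checks above reduce to a few lines of linear algebra; the following is a minimal sketch of the first and third (the second is a direct call to `np.linalg.svd`):

```python
import numpy as np

def cosine_similarity_heatmap(Z):
    """Pairwise absolute cosine similarities between the columns of Z."""
    Zn = Z / np.linalg.norm(Z, axis=0, keepdims=True)
    return np.abs(Zn.T @ Zn)

def projection_residuals(Z, Z_hat):
    """Norm of each column of Z after removing its projection onto col(Z_hat)."""
    Q, _ = np.linalg.qr(Z_hat)  # orthonormal basis for col(Z_hat)
    return np.linalg.norm(Z - Q @ (Q.T @ Z), axis=0)
```

With data ordered by subspace, a block-diagonal `cosine_similarity_heatmap(Z)` and near-zero `projection_residuals(Z_j, Z_hat_j)` correspond to the discriminativeness and self-consistency checks, respectively.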

4.1 Benign Subspaces

In the first experiment, we choose baseline parameters which initialize the subspaces as incoherent. Figure 1 demonstrates the success of CTRL-MSP in this scenario: the cosine similarity heatmap presents a strong block diagonal, the representations corresponding to each subspace have large singular values, and the representation subspaces and their closed-loop reconstructions are within a low projection distance of each other.

Figure 1: Behavior of CTRL-MSP on benign subspaces. (a) Heatmap of correlations of the original data $X$. (b) Heatmap of correlations of the learned representations $Z$. (c) Singular value spectra of the representation matrices of the three subspaces. (d) Subspace alignment: histogram of norms of projection residuals for all $z$ which are columns of $Z_{j}$.

4.2 Highly Correlated Subspaces

In the second experiment, we choose parameters which produce highly correlated subspaces. Yet, CTRL-MSP succeeds in this scenario: the cosine similarity heatmap presents a striking block diagonal structure, the representations corresponding to each subspace have large nonzero singular values, and the subspaces have low projection distance to each other (Figure 2). We remark that this situation is significantly more difficult than the previous scenario, yet the learned encoder and decoder achieve the same outcomes.

Figure 2: Behavior of CTRL-MSP on highly correlated subspaces, with the same plots as in Fig. 1.

4.3 Highly Correlated and Noisy Subspaces

In the third experiment, we tackle the more difficult case of highly correlated subspaces with added off-subspace noise, so that the data no longer satisfy the assumptions of our theoretical analysis. However, CTRL-MSP still largely succeeds: the cosine similarity heatmap presents a clearly visible block diagonal, the representations corresponding to each subspace have large nonzero singular values, and the subspaces have low projection distance to each other (Figure 3).

Figure 3: Behavior of CTRL-MSP on highly correlated and noisy subspaces, with the same plots as in Fig. 1.

4.4 Comparison with Other Representation Learning Methods

We now compare the performance of CTRL-MSP with other supervised representation learning methods, namely Conditional GAN (CGAN) [38], InfoGAN [4], and Conditional VAE (CVAE) [43].⁴ Since these methods do not guarantee discriminative, much less interpretable, representations, we do not compare representations. Instead, we test how well the methods are able to learn the linear structure in the data space, by looking at the correlation structure of the generated or reconstructed data, as in Figure 4. We use the highly correlated and noisy data configuration from before, and train CTRL-MSP for a small fraction of the epochs used by the other methods; details on all methods used are left to Appendix E.

⁴ The GAN implementations are adapted from PyTorch-GAN [35]; the CVAE implementation is adapted from PyTorch-VAE [44].

Figure 4: Comparison of generated or reconstructed data correlations. (a) Original $X$. (b) CTRL-MSP. (c) CGAN. (d) InfoGAN. (e) CVAE.

We observe that InfoGAN and CTRL-MSP both attempt to learn the linear structure, but CTRL-MSP performs better at detecting the separation between subspaces. InfoGAN generally pushes its reconstructions from different subspaces to be correlated, to the point where the first subspace is almost as correlated to the third subspace as it is to itself; this is not seen in CTRL-MSP. Thus, CTRL-MSP learns the linear structure comparably to or better than InfoGAN. Meanwhile, CGAN and CVAE degenerate completely, consistent with the theoretical analysis that VAEs may be unable to learn low-dimensional subspaces under certain conditions on the encoder [31]. In summary, the CGAN, InfoGAN, and CVAE architectures use fully nonlinear deep networks and many times the number of training epochs of CTRL-MSP, and yet the latter performs better in the case of low-dimensional subspaces and noise.

5 Generalization via CTRL-SG

We now generalize CTRL-MSP to representation learning scenarios more diverse than our task of learning multiple linear subspace structure. This generalized method, which we call closed-loop sequential games (CTRL-SG), builds on sequential game theory and the CTRL framework [6].

Our formulation is inspired by the two roles of the encoder in CTRL: being injective, and being discriminative with respect to the decoder. Meanwhile, the decoder aims to be compatible with the encoder. Injectivity is a binary property, so we instead quantify it via the expressiveness (or expansiveness) of the representations, measured by larger values of a function $\mathcal{E}(f)$. We quantify the compatibility of the decoder with the encoder via larger values of a function $\mathcal{C}(f, g)$, and the discriminative power of the encoder with respect to the decoder by lower values of $\mathcal{C}(f, g)$, or equivalently higher values of $-\mathcal{C}(f, g)$.

These choices are inspired by the GAN framework [18, 1], where the power of the discriminator with respect to the generator is quantified by the difference between the representations of the real and generated data. Similarly, in the CTRL framework, the encoder seeks to discriminate between the data and the autoencoded data (as well as between the data subspaces) by constructing different representations for each. This means that the encoder and the decoder are not compatible, so $\mathcal{C}(f, g)$ would be low and $-\mathcal{C}(f, g)$ would be high. Thus, the encoder aims to jointly maximize $\mathcal{E}(f)$ and minimize $\mathcal{C}(f, g)$; the decoder aims only to maximize $\mathcal{C}(f, g)$. Formalizing this yields the CTRL-SG framework.

Definition 4 (CTRL-SG Game).

The CTRL-SG game is a two-player sequential game between:

  1. The encoder, moving first, choosing functions $f$ in the function class $\mathcal{F}$, and having utility function $u_{\mathrm{enc}}(f, g) \doteq \mathcal{E}(f) - \mathcal{C}(f, g)$.

  2. The decoder, moving second, choosing functions $g$ in the function class $\mathcal{G}$, and having utility function $u_{\mathrm{dec}}(f, g) \doteq \mathcal{C}(f, g)$.

Definition 4 is a generalization of Definition 1. More precisely, the CTRL-MSP game is the CTRL-SG game with $\mathcal{F}$ and $\mathcal{G}$ as in Definition 1, $\mathcal{E}(f) \doteq \Delta R(f(X) \mid \Pi)$, and $\mathcal{C}(f, g) \doteq -\sum_{j=1}^{k} \Delta R\big(f(X_{j}), f(g(f(X_{j})))\big)$. With these notations, our assumptions are summarized below:

Assumption 5 (Assumptions in CTRL-SG Games).

  1. (Expressiveness can be maximized.) $\operatorname*{argmax}_{f \in \mathcal{F}} \mathcal{E}(f)$ is nonempty.

  2. (Compatibility can be maximized.) $\operatorname*{argmax}_{g \in \mathcal{G}} \mathcal{C}(f, g)$ is nonempty for every $f \in \mathcal{F}$.

  3. (The decoder can obtain equally good outcomes regardless of the encoder's play.) The function $f \mapsto \max_{g \in \mathcal{G}} \mathcal{C}(f, g)$ is constant.

We may generically characterize the Stackelberg equilibria of CTRL-SG games.

Theorem 6 (Stackelberg Equilibria of CTRL-SG).

If Assumption 5 holds, then the CTRL-SG game has the following properties:

  1. A Stackelberg equilibrium exists.

  2. Any Stackelberg equilibrium $(f^{\star}, g^{\star})$ satisfies $f^{\star} \in \operatorname*{argmax}_{f \in \mathcal{F}} \mathcal{E}(f)$ and $g^{\star} \in \operatorname*{argmax}_{g \in \mathcal{G}} \mathcal{C}(f^{\star}, g)$.

This generalized system allows us to use the CTRL framework for representation learning, to choose principled objective functions to encourage the desired representation, and then to explicitly characterize the optimal learned encoder and decoder for that algorithm. It also suggests principled optimization strategies and algorithms, such as GDMax [24], for obtaining these optimal functions.

This system differs somewhat from the original setting of learning from finite data presented all at once. Since it is a game-theoretic formulation, in principle, one may adapt it to learning contexts different from the ones developed here, e.g., semi-supervised learning and online/incremental learning.

6 Conclusion

In this work, we introduced the closed-loop multi-subspace pursuit (CTRL-MSP) framework for learning representations of multiple linear subspace structure. We explicitly characterized the Stackelberg equilibria of the associated CTRL-MSP game, and provided empirical support for the proved properties of the learned encoder and decoder. Finally, we introduced a generalization, CTRL-SG, for more general representation learning, and characterized the Stackelberg equilibria of the associated game.

There are several directions for future work. First, the current analysis of CTRL-MSP holds when the data lie perfectly on linear subspaces; it may be fruitful to study the conditions under which the addition of noise causes the learned encoder and decoder for CTRL-MSP to break down or degenerate. Also, it may be interesting to analyze cases where the data lies on more general non-linear manifolds. Regarding CTRL-SG, it is possible to use the framework for other kinds of representation learning problems in different contexts, and characterize the learned encoder and decoder similarly to this work.

In conclusion, CTRL-MSP exemplifies how classical subspace learning problems can be formulated as special cases of modern representation learning problems. In general, unifying classical and modern perspectives greatly contributes towards better understanding the behavior of modern algorithms on both classical and modern problems.

7 Acknowledgements

We thank Peter Tong and Xili Dai of UC Berkeley for insightful discussion regarding practical optimization strategies, as well as fair comparisons to other methods. Edgar would like to acknowledge support by the NSF under grants 2046874 and 2031895. Yi would like to acknowledge the support of ONR grants N00014-20-1-2002 and N00014-22-1-2102, and the joint Simons Foundation-NSF DMS grant #2031899.


  • [1] M. Arjovsky, S. Chintala, and L. Bottou (2017) Wasserstein GAN. arXiv. External Links: Document, Link Cited by: Appendix B, Appendix B, Appendix B, §1.1, §1.1, §2.2, §5.
  • [2] T. Başar and G. J. Olsder (1998) Dynamic Noncooperative Game Theory, 2nd Edition. edition, Society for Industrial and Applied Mathematics, . External Links: Document, Link, https://epubs.siam.org/doi/pdf/10.1137/1.9781611971132 Cited by: Appendix B, §2.3, §3.
  • [3] A. Buja and N. Eyuboglu (1992) Remarks on Parallel Analysis. Multivariate behavioral research 27 (4), pp. 509–540. Cited by: §D.2.
  • [4] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel (2016) Infogan: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets. Advances in neural information processing systems 29. Cited by: §4.4.
  • [5] T. M. Cover (1999) Elements of Information Theory. John Wiley & Sons. Cited by: Appendix A, §2.2.
  • [6] X. Dai, S. Tong, M. Li, Z. Wu, M. Psenka, K. H. R. Chan, P. Zhai, Y. Yu, X. Yuan, H. Shum, and Y. Ma (2022) CTRL: Closed-Loop Transcription to an LDR via Minimaxing Rate Reduction. Entropy 24 (4). External Links: Link, ISSN 1099-4300, Document Cited by: Appendix A, Appendix B, Appendix B, Appendix B, §2.2, §2.2, §2.3, §3, §5.
  • [7] C. Daskalakis and I. Panageas (2018) Last-Iterate Convergence: Zero-Sum Games and Constrained Min-Max Optimization. arXiv. External Links: Document, Link Cited by: §3.
  • [8] S. Diamond and S. Boyd (2016) CVXPY: A Python-Embedded Modeling Language for Convex Optimization. Journal of Machine Learning Research 17 (83), pp. 1–5. Cited by: 2nd item.
  • [9] E. Dobriban and A. B. Owen (2018) Deterministic parallel analysis: an improved method for selecting factors and principal components. Journal of the Royal Statistical Society. Series B (Methodological) 81 (1), pp. 163–183. Cited by: §D.2.
  • [10] E. Dobriban (2020) Permutation methods for factor analysis and PCA. The Annals of Statistics 48 (5), pp. 2824–2847. Cited by: §D.2.
  • [11] E. Dobriban (2021) Consistency of invariance-based randomization tests. arXiv preprint arXiv:2104.12260, Annals of Statistics, to appear. Cited by: §D.2.
  • [12]

    W. Falcon and The PyTorch Lightning team

    PyTorch Lightning. External Links: Document, Link Cited by: Appendix E.
  • [13] F. Farnia and A. Ozdaglar (2020) GANs May Have No Nash Equilibria. arXiv. External Links: Document, Link Cited by: Appendix B, §1.1, §1.
  • [14] C. Fefferman, S. Mitter, and H. Narayanan (2013) Testing the Manifold Hypothesis. arXiv. External Links: Document, Link Cited by: §1.
  • [15] S. Feizi, F. Farnia, T. Ginart, and D. Tse (2017) Understanding GANs: the LQG Setting. arXiv. External Links: Document, Link Cited by: Appendix B, §1.1, §1, footnote 1.
  • [16] T. Fiez, B. Chasnov, and L. J. Ratliff (2019) Convergence of Learning Dynamics in Stackelberg Games. arXiv. External Links: Document, Link Cited by: §2.3, §3.
  • [17] I. Gemp, B. McWilliams, C. Vernade, and T. Graepel (2020) EigenGame: PCA as a Nash Equilibrium. arXiv. External Links: Document, Link Cited by: Appendix B.
  • [18] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative Adversarial Networks. arXiv. External Links: Document, Link Cited by: Appendix B, Appendix B, §1.1, §1, §2.3, §5.
  • [19] T. Hastie and W. Stuetzle (1989) Principal Curves. Journal of the American Statistical Association 84 (406), pp. 502–516. Cited by: §1.1.
  • [20] Z. He, M. Kan, and S. Shan (2021) EigenGAN: Layer-Wise Eigen-Learning for GANs. In International Conference on Computer Vision (ICCV), Cited by: §2.3.
  • [21] D. Hong, Y. Sheng, and E. Dobriban (2020) Selecting the number of components in PCA via random signflips. arXiv preprint arXiv:2012.02985. Cited by: §D.2.
  • [22] H. Hotelling (1933) Analysis of a Complex of Statistical Variables into Principal Components. Journal of educational psychology 24 (6), pp. 417. Cited by: §1.1.
  • [23] H. Huang, P. S. Yu, and C. Wang (2018) An Introduction to Image Synthesis with Generative Adversarial Nets. arXiv. External Links: Document, Link Cited by: §1, §2.3.
  • [24] C. Jin, P. Netrapalli, and M. I. Jordan (2019) What is Local Optimality in Nonconvex-Nonconcave Minimax Optimization?. arXiv. External Links: Document, Link Cited by: Appendix E, §2.3, §2.3, §5.
  • [25] I. T. Jolliffe (2002) Principal Component Analysis. 2nd edition, Springer-Verlag. Cited by: §1.
  • [26] O. Kallenberg (2021) Foundations of Modern Probability. Probability Theory and Stochastic Modelling, Springer Cham. External Links: ISBN 9783030618704 Cited by: footnote 1.
  • [27] T. Karras, S. Laine, and T. Aila (2018) A Style-Based Generator Architecture for Generative Adversarial Networks. arXiv. External Links: Document, Link Cited by: §1.1, §1.
  • [28] D. P. Kingma and M. Welling (2013) Auto-Encoding Variational Bayes. arXiv. External Links: Document, Link Cited by: footnote 1.
  • [29] D. P. Kingma and J. Ba (2014) Adam: A Method for Stochastic Optimization. arXiv. External Links: Document, Link Cited by: Appendix E.
  • [30] M. Kochurov, R. Karimov, and S. Kozlukov (2020) Geoopt: Riemannian Optimization in PyTorch. arXiv. External Links: Document, Link Cited by: §D.2.
  • [31] F. Koehler, V. Mehta, A. Risteski, and C. Zhou (2021) Variational Autoencoders in the Presence of Low-dimensional Data: Landscape and Implicit Bias. arXiv preprint arXiv:2112.06868. Cited by: §4.4.
  • [32] M. A. Kramer (1991) Nonlinear principal component analysis using autoassociative neural networks. AIChE journal 37 (2), pp. 233–243. Cited by: Appendix B, Appendix B, §1.1.
  • [33] Y. Lau, Q. Qu, H. Kuo, P. Zhou, Y. Zhang, and J. Wright (2020) Short and Sparse Deconvolution — A Geometric Approach. In International Conference on Learning Representations, External Links: Link Cited by: §1.
  • [34] Y. Li and Y. Bresler (2018) Global Geometry of Multichannel Sparse Blind Deconvolution on the Sphere. In Advances in Neural Information Processing Systems, pp. 1132–1143. Cited by: §1.
  • [35] E. Linder-Norén (2018) PyTorch-GAN. GitHub. Note: https://github.com/eriklindernoren/PyTorch-GAN Cited by: Appendix E, footnote 4.
  • [36] Y. Ma, H. Derksen, W. Hong, and J. Wright (2007) Segmentation of Multivariate Mixed Data via Lossy Data Coding and Compression. IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (9), pp. 1546–1562. External Links: Document Cited by: Appendix A, Appendix A, Appendix A, Figure 10, §2.2.
  • [37] A. Mino and G. Spanakis (2018) LoGAN: Generating Logos with a Generative Adversarial Neural Network Conditioned on color. arXiv. External Links: Document, Link Cited by: §1.1, §1.
  • [38] M. Mirza and S. Osindero (2014) Conditional Generative Adversarial Nets. arXiv preprint arXiv:1411.1784. Cited by: §4.4.
  • [39] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: An Imperative Style, High-Performance Deep Learning Library. arXiv. External Links: Document, Link Cited by: Appendix E.
  • [40] V. Prasad, D. Das, and B. Bhowmick (2020) Variational Clustering: Leveraging Variational Autoencoders for Image Clustering. arXiv. External Links: Document, Link Cited by: §1.
  • [41] Q. Qu, Y. Zhai, X. Li, Y. Zhang, and Z. Zhu (2019) Geometric Analysis of Nonconvex Optimization Landscapes for Overcomplete Learning. In International Conference on Learning Representations, Cited by: §1.
  • [42] Y. Shen, Y. Xue, J. Zhang, K. B. Letaief, and V. Lau (2020) Complete Dictionary Learning via $\ell^p$-norm Maximization. arXiv preprint arXiv:2002.10043. Cited by: §1.
  • [43] K. Sohn, H. Lee, and X. Yan (2015) Learning Structured Output Representation using Deep Conditional Generative Models. In Advances in Neural Information Processing Systems, C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett (Eds.), Vol. 28, pp. . External Links: Link Cited by: §4.4.
  • [44] A. Subramanian (2020) PyTorch-VAE. GitHub. Note: https://github.com/AntixK/PyTorch-VAE Cited by: Appendix E, footnote 4.
  • [45] M. E. Tipping and C. M. Bishop (1999) Probabilistic Principal Component Analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 61 (3), pp. 611–622. Cited by: §1.1.
  • [46] S. Tong, X. Dai, Z. Wu, M. Li, B. Yi, and Y. Ma (2022) Incremental Learning of Structured Memory via Closed-Loop Transcription. arXiv. External Links: Document, Link Cited by: Appendix B.
  • [47] L. Van Der Maaten, E. Postma, and J. Van den Herik (2009) Dimensionality Reduction: A Comparative Review. Journal of Machine Learning Research 10 (66-71), pp. 13. Cited by: §1.1.
  • [48] R. Vidal, Y. Ma, and S. Sastry (2012) Generalized Principal Component Analysis (GPCA). arXiv. External Links: Document, Link Cited by: §1.1.
  • [49] R. Vidal, Y. Ma, and S. Sastry (2016) Generalized Principal Component Analysis. Springer Verlag. Cited by: §1.
  • [50] J. Wright and Y. Ma (2022) High-Dimensional Data Analysis with Low-Dimensional Models: Principles, Computation, and Applications. Cambridge University Press. Cited by: Appendix B, Appendix B, §1, §4.
  • [51] Y. Yu, K. H. R. Chan, C. You, C. Song, and Y. Ma (2020) Learning Diverse and Discriminative Representations via the Principle of Maximal Coding Rate Reduction. arXiv. External Links: Document, Link Cited by: Appendix A, Appendix A, Appendix A, §C.2, §C.2, §C.2, §C.2, §C.2, §2.2, §2.2.
  • [52] Y. Zhai, H. Mehta, Z. Zhou, and Y. Ma (2019) Understanding $\ell^4$-based Dictionary Learning: Interpretation, Stability, and Robustness. In International Conference on Learning Representations, Cited by: §1.
  • [53] Y. Zhai, Z. Yang, Z. Liao, J. Wright, and Y. Ma (2020) Complete Dictionary Learning via $\ell^4$-Norm Maximization over the Orthogonal Group. J. Mach. Learn. Res. 21 (165), pp. 1–68. Cited by: §1.
  • [54] Y. Zhang, H. Kuo, and J. Wright (2019) Structured Local Optima in Sparse Blind Deconvolution. IEEE Transactions on Information Theory 66 (1), pp. 419–452. Cited by: §1.
  • [55] B. Zhu, J. Jiao, and D. Tse (2020) Deconstructing generative adversarial networks. IEEE Transactions on Information Theory 66 (11), pp. 7155–7179. Cited by: §1.1, §2.2.


The appendix is organized as follows. In Appendix A we discuss the mathematical and information-theoretic foundations of rate reduction theory. In Appendix B we discuss popular representation learning algorithms, such as PCA, GANs, and CTRL, in terms of simultaneous game theory and the representation learning framework we developed. In Appendix C we give proofs of all theorems in the main body. In Appendix D we discuss a specialization of CTRL-MSP, which we call CTRL-SSP, to the case where the data lies on a single linear subspace; in this section we give mathematical justification and empirical support for CTRL-SSP. In Appendix E we provide more experimental details, a more thorough empirical evaluation of CTRL-MSP, and more detailed comparisons with other representation learning algorithms.

Appendix A Rate Reduction Functions

In this section, we discuss the rate reduction functions from Section 2.2, and provide more details on the rate reduction schema. Much of the exposition is motivated by [36, 51]. The perspective will be information-theoretic; basics of information theory are covered in [5].

Let $\bm{z}$ be a random vector taking values in $\mathbb{R}^d$, and let $R_{\bm{z}}(\cdot)$ be the rate distortion function of $\bm{z}$ with respect to the squared Euclidean distance distortion. Information-theoretically, $R_{\bm{z}}(\epsilon^2)$ is a measure of the coding rate of the data; that is, the average number of bits required to encode $\bm{z}$ such that the expected squared Euclidean distance between $\bm{z}$ and its encoding is at most $\epsilon^2$ (the argument of the function).

For a symmetric matrix $\bm{\Sigma}$, let $\lambda_{\min}(\bm{\Sigma})$ be the minimum eigenvalue of $\bm{\Sigma}$. If $\bm{z}$ is a multivariate Gaussian random vector with mean $\bm{\mu}$ and covariance $\bm{\Sigma}$, then for $\epsilon^2 / d \le \lambda_{\min}(\bm{\Sigma})$,
$$R_{\bm{z}}(\epsilon^2) = \frac{1}{2}\log\det(\bm{\Sigma}) + \frac{d}{2}\log\frac{d}{\epsilon^2}.$$

For larger $\epsilon$, the rate distortion function becomes more complicated, and can be found by reverse water-filling on the eigenvalues of $\bm{\Sigma}$. However, [36] proposes the following approximation to the rate distortion. For noise $\bm{w} \sim \mathcal{N}(\bm{0}, \tfrac{\epsilon^2}{d}\bm{I}_d)$ independent of $\bm{z}$, let
$$R_{\epsilon}(\bm{z}) \doteq I(\bm{z};\, \bm{z} + \bm{w}).$$

If $\bm{z} \sim \mathcal{N}(\bm{\mu}, \bm{\Sigma})$, then we may derive a closed-form expression for $R_\epsilon(\bm{z})$ for all $\epsilon > 0$. Since $\bm{z}$ and $\bm{w}$ are normally distributed, so is $\bm{z} + \bm{w}$, and
$$I(\bm{z};\, \bm{z} + \bm{w}) = h(\bm{z} + \bm{w}) - h(\bm{w}) = \frac{1}{2}\log\det\left(\bm{I}_d + \frac{d}{\epsilon^2}\bm{\Sigma}\right).$$
Therefore, we have the following closed-form expression, valid for all $\epsilon > 0$:
$$R_\epsilon(\bm{z}) = \frac{1}{2}\log\det\left(\bm{I}_d + \frac{d}{\epsilon^2}\bm{\Sigma}\right).$$
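As an illustration, the closed-form rate above is straightforward to evaluate numerically. The sketch below (rates in nats; the helper name `gaussian_rate` is ours, not from the paper's codebase) checks it on an isotropic Gaussian, where the log-determinant reduces to $d$ copies of a scalar logarithm:

```python
import numpy as np

def gaussian_rate(Sigma, eps):
    """Closed-form regularized coding rate (in nats) of a zero-mean Gaussian
    with covariance Sigma: (1/2) logdet(I + (d/eps^2) Sigma)."""
    d = Sigma.shape[0]
    return 0.5 * np.linalg.slogdet(np.eye(d) + (d / eps**2) * Sigma)[1]

# For an isotropic Gaussian with covariance sigma2 * I in d dimensions,
# the rate collapses to (d/2) * log(1 + d * sigma2 / eps^2).
d, sigma2, eps = 3, 1.0, 0.5
R = gaussian_rate(sigma2 * np.eye(d), eps)
expected = 0.5 * d * np.log(1 + d * sigma2 / eps**2)
assert np.isclose(R, expected)
```

The same function applies to any symmetric positive semidefinite covariance; `slogdet` is used instead of `det` for numerical stability.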

In information-theoretic terms, $R_\epsilon(\bm{z})$ is a regularized rate distortion measure, and corresponds to the expansiveness of the distribution of $\bm{z}$.

From this measure we can also define a measure of difference between the distributions of two possibly-correlated random vectors $\bm{z}_1, \bm{z}_2$. (This measure is not a true distance function; for starters, it can be zero for random variables with non-identical distributions.) It estimates the average number of bits saved by encoding $\bm{z}_1$ and $\bm{z}_2$ separately and independently, compared to encoding them together, say by encoding a mixture random variable $\bm{z}_{\mathrm{mix}}$ which equals $\bm{z}_1$ with probability $1/2$ and $\bm{z}_2$ with probability $1/2$. In this notation, we have
$$\Delta R_\epsilon(\bm{z}_1, \bm{z}_2) \doteq R_\epsilon(\bm{z}_{\mathrm{mix}}) - \frac{1}{2}R_\epsilon(\bm{z}_1) - \frac{1}{2}R_\epsilon(\bm{z}_2).$$

This measure of difference has several advantages over the Wasserstein or Jensen–Shannon distances. It is a principled measure of difference which is computable in closed form for the widely representative class of Gaussian distributions. In particular, due to this closed-form representation, it is much simpler to analyze the solutions of optimization problems involving these measures.

We may generalize the difference measure to several random vectors. Specifically, define probabilities $\pi_1, \dots, \pi_k > 0$ such that $\sum_{j=1}^{k}\pi_j = 1$, arranged in a vector $\bm{\pi}$, and let $\bm{z}_1, \dots, \bm{z}_k$ be random vectors. Define $\bm{z}_{\mathrm{mix}}$ to be the mixture random vector which equals $\bm{z}_j$ with probability $\pi_j$. Then the coding rate reduction is given by
$$\Delta R_\epsilon(\bm{z}_1, \dots, \bm{z}_k; \bm{\pi}) \doteq R_\epsilon(\bm{z}_{\mathrm{mix}}) - \sum_{j=1}^{k}\pi_j R_\epsilon(\bm{z}_j).$$

This is a measure of how expansive the distribution of each $\bm{z}_j$ is, and how different the distributions of the $\bm{z}_j$ are from each other; in some sense, it captures the expressiveness and discriminativeness of the distributions of the $\bm{z}_j$. More precisely, it was shown in [51] that, subject to rank and Frobenius norm constraints on the representations, this expression is maximized when the $\bm{z}_j$ are distributed on pairwise orthogonal subspaces, and each $\bm{z}_j$ has isotropic (or nearly isotropic) covariance on its subspace.
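The orthogonality property can be checked numerically with the standard finite-sample rate estimates. The following is a sketch with arbitrary choices of $\epsilon$ and sample sizes; `rate` and `rate_reduction` are illustrative helper names, not the paper's API:

```python
import numpy as np

def rate(Z, eps=0.5):
    # Finite-sample coding rate (nats): (1/2) logdet(I + d/(n eps^2) Z Z^T),
    # where Z is a d x n matrix of samples.
    d, n = Z.shape
    return 0.5 * np.linalg.slogdet(np.eye(d) + (d / (n * eps**2)) * Z @ Z.T)[1]

def rate_reduction(Zs, eps=0.5):
    # Delta R = R(all samples together) - sum_j (n_j / n) R(Z_j).
    Z = np.hstack(Zs)
    n = Z.shape[1]
    return rate(Z, eps) - sum(Zj.shape[1] / n * rate(Zj, eps) for Zj in Zs)

rng = np.random.default_rng(0)
n, d = 200, 4
# Two classes supported on orthogonal one-dimensional subspaces of R^4 ...
A = np.zeros((d, n)); A[0] = rng.standard_normal(n)
B = np.zeros((d, n)); B[1] = rng.standard_normal(n)
# ... versus two classes supported on the same subspace.
C = np.zeros((d, n)); C[0] = rng.standard_normal(n)

# Orthogonal supports yield a strictly larger coding rate reduction.
assert rate_reduction([A, B]) > rate_reduction([A, C])
```

Consistent with the cited result, the rate reduction is near zero when the classes share a subspace and large when their subspaces are orthogonal.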

In practice, we do not know the distribution of the data, and the features are not perfectly a mixture of Gaussians. Still, the mixture of Gaussians is often a reasonable model for lower-dimensional feature distributions [36, 51, 6], so we use the Gaussian form for the approximate coding rate.

Also, in practice we do not have access to any full distributions, and so we need to estimate all relevant quantities from a finite sample. For Gaussians, $R_\epsilon(\bm{z})$ is a function of $\bm{z}$ only through its covariance $\bm{\Sigma}$; in practice, this covariance is estimated from a finite sample $\bm{Z} = [\bm{z}^1, \dots, \bm{z}^n]$ as $\hat{\bm{\Sigma}} = \frac{1}{n}\bm{Z}\bm{Z}^{\top}$. This also allows us to estimate $R_\epsilon(\bm{z})$ from a finite sample. To estimate $\Delta R_\epsilon$, we also need to estimate $\bm{\pi}$. For this, we require finite-sample information telling us which samples correspond to which random vector $\bm{z}_j$. Denote by $n_j$ the number of samples in $\bm{Z}$ which correspond to $\bm{z}_j$. Then $\pi_j$ may be estimated via plug-in as $\hat{\pi}_j = n_j / n$.

This set of approximations yields the estimates $R$, $R^c$, and $\Delta R$, whose expressions for Gaussians are given in Section 2.2 (dropping the subscript $\epsilon$ and working with the natural logarithm instead of the base-2 logarithm).
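A short sketch of these plug-in estimates (illustrative variable names; rates in nats) confirms that the finite-sample coding rate is exactly the Gaussian closed form evaluated at the plug-in covariance $\hat{\bm{\Sigma}} = \frac{1}{n}\bm{Z}\bm{Z}^\top$:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, eps = 3, 500, 0.5
Z = rng.standard_normal((d, n))

# Plug-in covariance estimate (zero-mean model).
Sigma_hat = Z @ Z.T / n

# Closed-form Gaussian rate at the estimated covariance ...
R_gauss = 0.5 * np.linalg.slogdet(np.eye(d) + (d / eps**2) * Sigma_hat)[1]
# ... equals the finite-sample coding rate estimate used in practice.
R_sample = 0.5 * np.linalg.slogdet(np.eye(d) + (d / (n * eps**2)) * Z @ Z.T)[1]
assert np.isclose(R_gauss, R_sample)

# Class probabilities are estimated by plug-in from label counts.
labels = rng.integers(0, 2, size=n)
pi_hat = np.bincount(labels) / n
assert np.isclose(pi_hat.sum(), 1.0)
```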

Appendix B PCA, GAN, CTRL as Games

In the main body of the paper, we use the framework of learning an encoder function and decoder function via a two-player sequential game between the encoder and decoder. This differs from conventional formulations of other representation learning algorithms, including PCA, nonlinear PCA (autoencoding) [32], GANs [18, 1], and CTRL [6], as simultaneous games between the encoder (or discriminator) and decoder (or generator). For completeness, we briefly introduce aspects of simultaneous game theory (a more detailed introduction is again found in [2]), and then discuss each of these frameworks in terms of our general representation learning formulation as well as simultaneous game theory.

Simultaneous Game Theory.

In a simultaneous game between the encoder (playing $f$) and the decoder (playing $g$), both players make their moves at the same time, with no information about the other player's move. As in the sequential game framework, both players rationally attempt to maximize their utility functions $u_{\mathrm{enc}}$ and $u_{\mathrm{dec}}$, respectively. The solution concept for a simultaneous game is the so-called Nash equilibrium. Formally, $(f^\star, g^\star)$ is a Nash equilibrium if and only if
$$u_{\mathrm{enc}}(f^\star, g^\star) \ge u_{\mathrm{enc}}(f, g^\star) \ \ \text{for all } f \in \mathcal{F}, \qquad u_{\mathrm{dec}}(f^\star, g^\star) \ge u_{\mathrm{dec}}(f^\star, g) \ \ \text{for all } g \in \mathcal{G}.$$
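For intuition, a pure Nash equilibrium can be checked exhaustively in a small finite game. The following sketch uses a hypothetical 2×2 bimatrix game (payoffs chosen purely for illustration):

```python
import numpy as np

# Utility matrices for a 2x2 simultaneous game: entry [i, j] is the payoff
# when the row player plays strategy i and the column player plays strategy j.
U_row = np.array([[2, 0],
                  [3, 1]])
U_col = np.array([[2, 3],
                  [0, 1]])

def is_nash(i, j):
    # (i, j) is a pure Nash equilibrium iff neither player can improve
    # its own payoff by a unilateral deviation.
    return U_row[i, j] >= U_row[:, j].max() and U_col[i, j] >= U_col[i, :].max()

nash_points = [(i, j) for i in range(2) for j in range(2) if is_nash(i, j)]
assert nash_points == [(1, 1)]
```

Here strategy 1 dominates for both players, so the unique pure Nash equilibrium is at (1, 1), even though both players would prefer the payoff at (0, 0).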

PCA and Autoencoding.

PCA finds the best approximating subspace, which we denote $S^\star$, to the data $\bm{X} = [\bm{x}^1, \dots, \bm{x}^n] \subseteq \mathbb{R}^D$ in the following sense. Let $\mathbb{S}_d$ be the set of $d$-dimensional linear subspaces of $\mathbb{R}^D$. Then PCA solves the problem
$$S^\star \in \operatorname*{argmin}_{S \in \mathbb{S}_d} \sum_{i=1}^{n} \big\|\bm{x}^i - \mathcal{P}_S \bm{x}^i\big\|_2^2,$$
where $\mathcal{P}_S$ denotes the orthogonal projection onto $S$.

The solution is well-known and exists in closed form in terms of the SVD of the data matrix [50]. To formulate this problem in terms of a game between an encoder and decoder, we learn the orthogonal projection operator as a rank-$d$ composition of so-called semi-orthogonal linear maps. Let $d \le D$ be positive integers. For a linear map $f$, denote its transpose $f^\top$ as the linear map whose matrix representation is the transpose of the matrix representation of $f$. For a set $A$, let $\mathrm{id}_A$ be the identity map on $A$. Finally, define the classes of semi-orthogonal linear maps by
$$\mathcal{F} \doteq \{f \colon \mathbb{R}^D \to \mathbb{R}^d \text{ linear} \mid f \circ f^\top = \mathrm{id}_{\mathbb{R}^d}\}, \qquad \mathcal{G} \doteq \{g \colon \mathbb{R}^d \to \mathbb{R}^D \text{ linear} \mid g^\top \circ g = \mathrm{id}_{\mathbb{R}^d}\}.$$

We may then define a PCA game to learn the projection operator.

Definition 7 (PCA Game).

The PCA game is a two-player simultaneous game between:

  1. The encoder, choosing functions $f$ in the function class $\mathcal{F}$, and having utility function $u_{\mathrm{enc}}(f, g) \doteq -\sum_{i=1}^{n}\big\|\bm{x}^i - (g \circ f)(\bm{x}^i)\big\|_2^2$.

  2. The decoder, choosing functions $g$ in the function class $\mathcal{G}$, and having utility function $u_{\mathrm{dec}} \doteq u_{\mathrm{enc}}$.

There are alternate formulations of PCA as a game [17].
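As a numerical companion to the closed-form SVD solution mentioned above, a minimal sketch (variable names are illustrative; the comparison encoder is a random semi-orthogonal map):

```python
import numpy as np

rng = np.random.default_rng(0)
D, d, n = 5, 2, 100
X = rng.standard_normal((D, n))

# Closed-form PCA solution: the top-d left singular vectors of X.
U, _, _ = np.linalg.svd(X, full_matrices=False)
F = U[:, :d].T          # encoder, a semi-orthogonal d x D map
G = F.T                 # decoder, its transpose

# F is semi-orthogonal: F F^T = I_d.
assert np.allclose(F @ F.T, np.eye(d))

def recon_err(E):
    # Squared reconstruction error of the projection E^T E onto a d-dim subspace.
    return np.linalg.norm(X - E.T @ (E @ X))**2

# By Eckart-Young, the SVD solution does at least as well as any other
# semi-orthogonal encoder/decoder pair, e.g. a random one.
Q, _ = np.linalg.qr(rng.standard_normal((D, d)))
assert recon_err(F) <= recon_err(Q.T) + 1e-9
```

The utility of the PCA game is the negative of `recon_err`, so the SVD pair attains the cooperative optimum.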

Autoencoder games [32] may be formulated similarly to PCA games, perhaps with the same utility functions, but with less constrained function classes $\mathcal{F}$ and $\mathcal{G}$. In particular, these classes may include functions modeled by neural networks. This makes the Nash point analysis much more difficult.


GANs.

In the GAN framework [18, 1], $f$ is interpreted as a discriminator and $g$ is interpreted as a generator. Unlike the cooperative games of PCA and autoencoding, GANs train the generator and discriminator adversarially, in a zero-sum fashion (meaning $u_{\mathrm{gen}} = -u_{\mathrm{disc}}$). The generator attempts to fool the discriminator into treating generated data similarly to real data (by mapping to similar representations), while the discriminator seeks to discriminate between real and generated data. Let $\bm{Z}$ be a random matrix whose entries are i.i.d. (typically Gaussian) noise. Also, let $\widehat{\mathrm{dist}}(\cdot, \cdot)$ be any finite-sample estimate of the distance between the distributions which generate the columns of its first and second arguments (for concreteness, one may take $\widehat{\mathrm{dist}}$ to be an estimator of the Jensen–Shannon divergence or the Wasserstein distance). Then the GAN training game may be posed as follows.

Definition 8 (GAN Game).

The GAN game is a two-player simultaneous game between:

  1. The discriminator, choosing functions $f$ in the function class $\mathcal{F}$, and having utility function $u_{\mathrm{disc}}(f, g) \doteq \widehat{\mathrm{dist}}(f(\bm{X}), f(g(\bm{Z})))$.

  2. The generator, choosing functions $g$ in the function class $\mathcal{G}$, and having utility function $u_{\mathrm{gen}} \doteq -u_{\mathrm{disc}}$.

Despite having conceptually simple foundations, GANs have technical problems. The above two-player game may not have a Nash equilibrium [13], and even when $\mathcal{F}$ and $\mathcal{G}$ are very simple, e.g., classes of linear and quadratic functions, and one assumes that equilibria exist, GANs may not converge to them [15]. A further, very important, complexity is that the distances most commonly used for GANs, such as the Wasserstein distance and Jensen–Shannon divergence [1], are defined variationally and do not have a closed form for any non-trivial distributions (not even mixtures of Gaussians). Thus, one has to approximate the distance via another distance function which is more tractable [1]; this estimate becomes worse as the data dimension increases [50].
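The convergence failure can be seen already in the simplest adversarial setting. The following sketch (a standard textbook example, not taken from the cited works) runs simultaneous gradient play on the bilinear zero-sum utility $u(x, y) = xy$, whose unique equilibrium is the origin; the iterates spiral outward instead of converging:

```python
import numpy as np

# Simultaneous gradient play on the zero-sum bilinear game u(x, y) = x * y:
# the maximizing player ascends in x, the minimizing player descends in y.
lr = 0.1
x, y = 1.0, 1.0
r0 = np.hypot(x, y)          # initial distance to the equilibrium (0, 0)
for _ in range(100):
    gx, gy = y, x            # du/dx and du/dy
    x, y = x + lr * gx, y - lr * gy   # both players update at once

# The distance to the equilibrium grew: the dynamics spiral away.
assert np.hypot(x, y) > r0
```

Each simultaneous step multiplies the distance to the origin by $\sqrt{1 + \eta^2}$, so plain simultaneous gradient play diverges for any positive step size; this is one motivation for studying sequential (Stackelberg) formulations instead.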


CTRL.

In CTRL [6], the function $f$ is both an encoder and a discriminator, while $g$ is both a decoder and a generator. Closed-loop training compares the representations $f(\bm{X})$ with the representations $f(g(f(\bm{X})))$ of the autoencoded data $g(f(\bm{X}))$. More specifically, using the rate reduction difference measure discussed in Section 2.2 and expanded on in Appendix A, the difference between the distributions generating the representations of the data and of the autoencoded data is estimated by the class-wise difference measure $\sum_{j} \Delta R(f(\bm{X}_j), f(g(f(\bm{X}_j))))$. Similar to the GAN game, the encoder attempts to maximize this quantity, and the decoder attempts to minimize it. Unlike the GAN game, CTRL also aims to achieve structured representations; to do so, the encoder jointly attempts to maximize the representation expressiveness and discriminativeness of both the data and the autoencoded data, represented by an additive term $\Delta R(f(\bm{X})) + \Delta R(f(g(f(\bm{X}))))$. Formalizing this gives the CTRL game.

Definition 9 (CTRL Game).

The CTRL game is a two-player simultaneous game between:

  1. The encoder, choosing functions $f$ in the function class $\mathcal{F}$, and having utility function
$$u_{\mathrm{enc}}(f, g) \doteq \Delta R(f(\bm{X})) + \Delta R(f(g(f(\bm{X})))) + \sum_{j} \Delta R\big(f(\bm{X}_j),\, f(g(f(\bm{X}_j)))\big).$$

  2. The decoder, choosing functions $g$ in the function class $\mathcal{G}$, and having utility function $u_{\mathrm{dec}} \doteq -u_{\mathrm{enc}}$.

Theoretical analysis of this particular simultaneous game is still an open problem, though this formulation [6] and other closely related formulations [46] have achieved good empirical results.

Appendix C Proofs of Theorem 6 and Theorem 3

We prove the result from Section 5 first, then we show that it specializes to the result in Section 3.

C.1 Proof of Theorem 6


We show both consequences of the theorem at the same time, by first computing the equilibrium encoders $f^\star$, then computing the corresponding decoders $g^\star$. By Assumption 5.3, the function in question is constant; say equal to $c$. Then we have

By Assumption 5.1, this set is nonempty. Suppose $f^\star$ is in this set. Then

and by Assumption 5.2, this set is also nonempty. Thus, a Stackelberg equilibrium exists. If $(f^\star, g^\star)$ is a Stackelberg equilibrium, then by the first calculation, and