 # Random orthogonal matrices and the Cayley transform

Random orthogonal matrices play an important role in probability and statistics, arising in multivariate analysis, directional statistics, and models of physical systems, among other areas. Calculations involving random orthogonal matrices are complicated by their constrained support. Accordingly, we parametrize the Stiefel and Grassmann manifolds, represented as subsets of orthogonal matrices, in terms of Euclidean parameters using the Cayley transform. We derive the necessary Jacobian terms for change of variables formulas. Given a density defined on the Stiefel or Grassmann manifold, these allow us to specify the corresponding density for the Euclidean parameters, and vice versa. As an application, we describe and illustrate through examples a Markov chain Monte Carlo approach to simulating from distributions on the Stiefel and Grassmann manifolds. Finally, we establish an asymptotic independent normal approximation for the distribution of the Euclidean parameters which corresponds to the uniform distribution on the Stiefel manifold. This result contributes to the growing literature on normal approximations to the entries of random orthogonal matrices or transformations thereof.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Random orthogonal matrices play an important role in probability and statistics. They arise, for example, in multivariate analysis, directional statistics, and models of physical systems. The set of orthogonal matrices known as the Stiefel manifold, is a dimensional submanifold of There are two notable special cases: is equivalent to the unit hypersphere, while is equivalent to the orthogonal group Closely related to the Stiefel manifold is the Grassmann manifold the set of -dimensional linear subspaces of The Grassmann manifold has dimension Typically, points in the Grassmann manifold are thought of as equivalence classes of where two orthogonal matrices belong to the same class if they share the same column space or, equivalently, if one matrix can be obtained from the other through right multiplication by an element of In Section 3.2, we elaborate and expand upon the contributions of Shepard et al.  to provide another representation of the Grassmann manifold as the subset of orthogonal matrices having a symmetric positive definite (SPD) top block. In this article, we focus on orthogonal matrices having fewer columns than rows.

Both the Stiefel and Grassmann manifolds can be equipped with a uniform probability measure, also know as an invariant or Haar measure. The uniform distribution on is characterized by its invariance to left and right multiplication by orthogonal matrices: If then for all and Letting

be the function taking an orthogonal matrix to its column space, the uniform distribution

on is the pushforward measure of under In other words, the measure of is The uniform distributions on these manifolds have a long history in probability, as we discuss in Section 6, and in statistics, where they appear in foundational work on multivariate theory .

Non-uniform distributions on the Stiefel and Grassmann manifolds play an important role in modern statistical applications. They are used to model directions, axes, planes, rotations, and other data lying on compact Riemannian manifolds in the field of directional statistics . Also, statistical models having the Stiefel or Grassmann manifold as their parameter space are increasingly common [18, 19, 8]. In particular, Bayesian analyses of multivariate data often involve prior and posterior distributions on or Bayesian inference typically requires simulating from these posterior distributions, motivating the development of new Markov chain Monte Carlo (MCMC) methodology [19, 3, 40].

The challenge of performing calculations with random orthogonal matrices has motivated researchers to parametrize sets of square orthogonal matrices in terms of Euclidean parameters. We provide a few examples. In what Diaconis and Forrester 

identify as the earliest substantial mathematical contribution to modern random matrix theory,

Hurwitz 

parametrizes the special orthogonal and unitary groups using Euler angles and computes the volumes of their invariant measures. An implication of these computations is that the Euler angles of a uniformly distributed matrix follow independent beta distributions.

Anderson et al.  discuss the potential of various parametrizations in the simulation of uniformly distributed square orthogonal matrices. Other authors have made use of parametrizations of square orthogonal or rotation matrices in statistical applications [27, 9].

In contrast, the topic of parametrizing random orthogonal matrices having fewer columns than rows has received little attention. The recent work of Shepard et al.  extends four existing approaches to parametrizing square orthogonal matrices to the case when and to the scenario in which only the column space of the orthogonal matrix is of interest. Naturally, this latter scenario is closely related to the Grassmann manifold. The tools needed to use these parametrizations in a probabilistic setting are still largely missing.

In this article, we lay foundations for application of the Cayley parametrization of the Stiefel and Grassmann manifolds in a probabilistic setting. There are three main contributions. First, we elaborate and expand upon the work of Shepard et al.  to show that the Grassmann manifold can be represented by the subset of orthogonal matrices having an SPD top block and that this subset can be parametrized in terms of Euclidean elements using the Cayley transform. Next, we derive the necessary Jacobian terms for change of variables formulas. Given a density defined on or , these allow us to specify the corresponding density for the Euclidean parameters, and vice versa. As an application, we describe and illustrate through examples an approach to MCMC simulation from distributions on these sets. Finally, we establish an asymptotic independent normal approximation for the distribution of the Euclidean parameters which corresponds to the uniform distribution on the Stiefel manifold. This result contributes to the growing literature on normal approximations to the entries of random orthogonal matrices or transformations thereof.

Code to replicate the figures in this article and to simulate from distributions on and is available at https://github.com/michaeljauch/cayley.

## 2 Probability distributions on submanifolds of Rn

In this section, we introduce tools for defining and manipulating probability distributions on a

-dimensional submanifold of . In particular, we discuss the reference measure with respect to which we define densities on we make precise what it means for us to parametrize in terms of Euclidean parameters, and we state a change of variables formula applicable in this setting. Our formulation of these ideas follows that of Diaconis et al.  somewhat closely. This general discussion will form the basis for our handling of the specific cases in which is or

In order to specify probability distributions on in terms of density functions, we need a reference measure on that space, analogous to Lebesgue measure on As in Diaconis et al.  and Byrne and Girolami 

, we take the Hausdorff measure as our reference measure. Heuristically, the

-dimensional Hausdorff measure of is the -dimensional area of More formally, the -dimensional Hausdorff measure of is defined

 Hm(A) =limδ→0infA⊂∪iSidiam(Si)<δ∑iαm(% diam(Si)2)m

where the infimum is taken over countable coverings of with

 diam(Si)=sup{|x−y|:x,y∈Si}

and the volume of the unit ball in The -dimensional Hausdorff measure on coincides with up to a multiplicative constant.

Let be proportional to a density with respect to the -dimensional Hausdorff measure on Furthermore, suppose can be parametrized by a function from an open domain to satisfying the following conditions:

1. Almost all of is contained in the image of under so that

2. The function is injective on

3. The function is continuously differentiable on with the derivative matrix at

In this setting, we obtain a simple change of variables formula. Define the -dimensional Jacobian of at as Like the familiar Jacobian determinant, this term acts as a scaling factor in a change of variables formula. However, it is defined even when the derivative matrix is not square. For more details, see the discussion in Diaconis et al. . The change of variables formula is given in the following theorem, which is essentially a restatement of the main result of Traynor :

###### Theorem 2.1.

For all Borel subsets

 ∫AJmf(ϕ)Lm(dϕ)=Hm[f(A)]

and hence

 ∫Ag[f(ϕ)]Jmf(ϕ)Lm(dϕ)=∫f(A)g(y)Hm(dy).

Naturally, the change of variables formula has an interpretation in terms of random variables. Let

be a random element of whose distribution has a density proportional to Then when the distribution of has a density proportional to

## 3 The Cayley parametrizations

The Cayley transform, as introduced in Cayley 

, is a map from skew-symmetric matrices to special orthogonal matrices. Given

in the (original) Cayley transform of is the special orthogonal matrix

 Corig.(X)=(Ip+X)(Ip−X)−1.

We work instead with a modified version of the Cayley transform described in Shepard et al. . In this version, the Cayley transform of is the orthogonal matrix

 C(X)=(Ip+X)(Ip−X)−1Ip×k

where denotes the

matrix having the identity matrix as its top block and the remaining entries zero. The matrix

is invertible for any , so the Cayley transform is defined everywhere.

In this section, we parametrize (in the sense of Section 2) the sets and using the Cayley transform We are able to parametrize these distinct sets by restricting the domain of to distinct subsets of The third condition of the previous section requires that be continuously differentiable on its domain. We verify this condition by computing the derivative matrices of clearing a path for the statement of change of variables formulas in Section 4. We also state and discuss an important proposition which justifies our claim that the Grassmann manifold can be represented by the set

### 3.1 Cayley parametrization of the Stiefel manifold

To parametrize the Stiefel manifold the domain of is restricted to the subset

 DV={X=[B−ATA0p−k]∣∣∣B∈Rk×k∈Skew(k)A∈Rp−k×k}⊂Skew(p).

Let and set Partition so that is square. We can write the blocks of in terms of the blocks of as

 Q1 =(Ik−ATA+B)(Ik+ATA−B)−1 Q2 =2A(Ik+ATA−B)−1.

The matrix is the sum of a symmetric positive definite matrix and a skew-symmetric matrix and therefore nonsingular (see the appendix for a proof). This observation guarantees (again) that the Cayley transform is defined for all We can recover the matrices and from

 F =(Ik−Q1)(Ik+Q1)−1 (1) B =12(FT−F) (2) A =12Q2(Ik+F). (3)

We are now in a position to verify the first two conditions of Section 2. The image of under is the set

 IV={Q=[QT1QT2]T∈V(k,p)∣Ik+Q1 is nonsingular}

which has measure one with respect to The injectivity of on follows from the existence of the inverse mapping described in equations 1 - 3. All that remains to be verified is the third condition, that is continuously differentiable on its domain.

As the first step in computing the derivative matrix of we define a

-dimensional vector

containing each of the independent entries of Let be the -dimensional vector of independent entries of obtained by eliminating diagonal and supradiagonal elements from the vectorization The vector then contains each of the independent entries of Let be the matrix having as its corresponding vector of independent entries.

The Cayley transform can now be thought of as a function of

 C(φ) =(Ip+Xφ)(Ip−Xφ)−1Ip×k.

As a function of the Cayley transform is a bijection between the set and The inverse Cayley transform is defined in the obvious way as the map where and are computed from according to equations 1-3 and contains the independent entries of as before.

The next lemma provides an explicit linear map from to greatly simplifying our calculation of the derivative matrix The entries of belong to the set The construction of involves the commutation matrix and the matrix satisfying both of which are discussed in Magnus  and defined explicitly in Appendix B . Set and

###### Lemma 3.0.1.

The equation is satisfied by the matrix

 ΓV=[(ΘT1⊗ΘT1)˜Dk(Ip2−Kp,p)(ΘT1⊗ΘT2)].

With these pieces in place, we can now identify the derivative matrix

###### Proposition 3.1.

The Cayley transform is continuously differentiable on with derivative matrix

 DC(φ)=2[ITp×k(Ip−Xφ)−T⊗(Ip−Xφ)−1]ΓV.

The form of the derivative matrix reflects the composite structure of the Cayley transform as a function of The Kronecker product term arises from differentiating with respect to while the matrix arises from differentiating with respect to

### 3.2 The Cayley parametrization of the Grassmann manifold

While the definition of the Grassmann manifold as a collection of subspaces is fundamental, we often need a more concrete representation. One simple idea is to represent a subspace by having as its column space. However, the choice of is far from unique. Alternatively, the subspace can be represented uniquely by the orthogonal projection matrix onto as in Chikuse . We propose instead to represent by the subset of orthogonal matrices

 V+(k,p)={Q=[QT1QT2]T∈V(k,p)∣∣Q1≻0}.

As the next proposition makes precise, almost every element of can be represented uniquely by an element of Recall that we defined as the map which sends each element of to its column space.

###### Proposition 3.2.

The map is injective on and the image of under has measure one with respect to the uniform probability measure on

In turn, the set is amenable to parametrization by the Cayley transform. In this case, the domain of the Cayley transform is restricted to the subset

 DG={X=[0−ATA0p−k]∣∣∣A∈Rp−k×k,evali(ATA)∈[0,1) for 1≤i≤k}⊂Skew(p)

where the notation indicates the

th eigenvalue of the matrix

Let and set Again, partition so that is square. We can write the blocks of in terms of as

 Q1 =(Ik−ATA)(Ik+ATA)−1 Q2 =2A(Ik+ATA)−1.

We can recover the matrix from

 F =(Ik−Q1)(Ik+Q1)−1 (4) A =12Q2(Ik+F). (5)

We turn to verifying the conditions of Section 2. The first two conditions are satisfied as a consequence of the following proposition:

###### Proposition 3.3.

The Cayley transform is one-to-one.

Only the third condition, that is continuously differentiable on remains.

The process of computing the derivative matrix is the same as before. We define a -dimensional vector containing each of the independent entries of Let be the matrix having as its corresponding vector of independent entries. The Cayley transform can now be thought of as a function of

 C(ψ) =(Ip+Xψ)(Ip−Xψ)−1Ip×k.

As a function of the Cayley transform is a bijection between the set and The next lemma provides an explicit linear map from to in the form of a matrix

###### Lemma 3.0.2.

The equation is satisfied by the matrix

 ΓG=(Ip2−Kp,p)(ΘT1⊗ΘT2).

In the next proposition, we identify the derivative matrix

###### Proposition 3.4.

The Cayley transform is continuously differentiable on with derivative matrix

 DC(ψ)=2[ITp×k(Ip−Xψ)−T⊗(Ip−Xψ)−1]ΓG.

## 4 Change of variables formulas

We now state change of variables formulas for the Cayley parametrizations of and Given the results of the previous section, the formulas follow directly from Theorem 2.1. The -dimensional Jacobian of the Cayley transform at is equal to

 JdVC(φ) =∣∣DC(φ)TDC(φ)∣∣1/2 =∣∣22ΓTV(GV⊗HV)ΓV∣∣1/2

where

 GV =(Ip−Xφ)−1Ip×kITp×k(Ip−Xφ)−T HV =(Ip−Xφ)−T(Ip−Xφ)−1

and is defined as in Section 3.1. The equations above hold for the -dimensional Jacobian if we replace the symbols and with the symbols and respectively. In the supplementary material, we describe how to compute the Jacobian terms, taking advantage of their block structure. Let be proportional to a density with respect to the -dimensional Hausdorff measure on

###### Theorem 4.1.

(Change of Variables Formulas) For all Borel sets

 ∫AJdVC(φ)LdV(dφ)=HdV[C(A)] (6)

and hence

 (7)

If instead we have proportional to a density with respect to the -dimensional Hausdorff measure on the statement is true when we replace the symbols and with the symbols and respectively.

Similarly to Theorem 2.1, Theorem 4.1 has an interpretation in terms of random variables. Let the distribution of have a density proportional to Then when the distribution of has a density proportional to In particular, let so that Then when the distribution of has a density proportional to Analogous statements hold when is a random element of

## 5 Simulating from V(k,p) and V+(k,p)

Practical applications often require simulating a random orthogonal matrix whose distribution has a prescribed density For instance, Bayesian analyses of statistical models with an orthogonal matrix parameter yield posterior densities on the Stiefel manifold, and inference typically requires simulating from these densities. In many cases, generating independent samples is too challenging and MCMC methods are the most sensible option.

In this section, we present an MCMC approach to simulating from a distribution having a density on the set or which takes advantage of the Cayley parametrizations described in Section 3. The recent work of Pourzanjani et al.  explores a similar idea based on a Givens rotation parametrization of the Stiefel manifold. When it is not too computationally expensive, our approach may have certain advantages over existing methods. Unlike Hoff , it can be applied regardless of whether conditional distributions belong to a particular parametric family, and it is arguably simpler to implement than the approach of Byrne and Girolami . In statistical applications where interest lies in a subspace rather than a particular orthogonal basis, our representation of the Grassmann manifold by the set may suggest an appealing parametrization, with the MCMC approach of this section offering the machinery for Bayesian inference.

We illustrate the basic idea with the Stiefel manifold. (Simulating from involves analogous steps.) In order to simulate whose distribution has density on the Stiefel manifold we construct a Markov chain whose stationary distribution has density on the set Then we simply transform the realized Markov chain back to using the Cayley transform. By doing so, we avoid the difficulty of choosing and simulating from an efficient proposal distribution defined on the Stiefel manifold.

To make things more concrete, we describe the approach with the Metropolis-Hastings algorithm as our MCMC method. We start with an initial value for our chain and a density for the proposal given the previous value The Metropolis-Hastings algorithm for simulating from the distribution having density on proceeds as follows. For

1. Generate from

2. Compute the acceptance ratio

 r =g[C(φ′)]JdVC(φ′)g[C(φt)]JdVC(φt)q(φt|φ′)q(φ′|φt).
3. Sample If set Otherwise, set

For a broad class of and the orthogonal matrices approximate the distribution having density when is large enough.

In place of this simple Metropolis-Hastings algorithm, we can substitute other MCMC methods. Hamiltonian Monte Carlo (HMC) , with implementations in software such as Carpenter et al.  and Salvatier et al. , is a natural choice. For settings in which evaluation and automatic differentiation of the Jacobian term are not prohibitively expensive, the Cayley transform approach offers a relatively straightforward path to MCMC simulation on the sets and

### 5.1 Example: The uniform distribution on V(k,p) Figure 1: The top left panel is a histogram of the top left entry of simulated values of Q∼PV(3,5) plotted against the exact density. The bottom left panel is the analogous plot for Q∼PV(3,50). The top right panel is a histogram of the first entry, rescaled by √p/2, of φ=C−1(Q) when p=5 and k=3. The histogram is plotted against a standard normal density. The bottom right panel is the analogous plot for the case when p=50 and k=3.

Using the MCMC approach described above, we simulate from the uniform distribution on Specifically, we use HMC as implemented in Stan  to simulate Of course, there exist many algorithms for the simulation of independent, uniformly distributed orthogonal matrices. The uniform distribution only serves as a starting point for illustrating the proposed approach.

Figure 1 provides plots based on simulated values of The top row of the figure deals with the case and while the bottom row deals with the case and The histograms on the left show the top left entry of the simulated values of plotted against the exact density, given in Proposition 7.3 of Eaton 

. As we expect, there is close agreement between the histogram density estimate and the exact density. The histograms on the right show the first entry, rescaled by

of the simulated values of the vector These are plotted against a standard normal density. Theorem 6.1 tells us that the histogram density estimate and the standard normal density should agree when is large (both in an absolute sense and relative to k), which is what we observe in the plot on the bottom right. When is similar in magnitude to the standard normal density is a poor approximation, as we see in the plot on the top right.

### 5.2 Example: Bayesian inference for the spiked covariance model Figure 2: The plot in the top left overlays trace plots of the first principal angle calculated from a portion of each of the chains, while the plot in the top right shows the correlation between lagged values of the first principal angle. The black lines and dots correspond to our MCMC approach, while the gray lines and dots correspond to that of Hoff . The plots in the bottom half compare histogram approximations of the marginal posterior distributions of the principal angles.

Suppose the rows of an data matrix are independent samples from a mean zero multivariate normal population with covariance matrix The spiked covariance model, considered by Johnstone  and others, assumes the covariance matrix can be decomposed as with and where Under this model, the covariance matrix is partitioned into the low rank “signal” component and the isotropic “noise” component

Given priors for the unknown parameters, a conventional Bayesian analysis will approximate the posterior distribution by a Markov chain having the posterior as its stationary distribution. Inference for the trivially constrained parameters and is easily handled by standard MCMC approaches, so we treat these parameters as fixed and focus on inference for the orthogonal matrix parameter With a uniform prior for the posterior distribution is matrix Bingham , having density

 p(Q∣Y,σ2,Λ) ∝etr{[(Λ−1+Ik)−1/(2σ2)]QT[YTY]Q}.

We compare two MCMC approaches to simulating from the matrix Bingham distribution: the Gibbs sampling method of Hoff  and the Cayley transform approach (again, with HMC as implemented in Stan). The dimensions are chosen as and We set and choose a true value of uniformly from We then generate a data matrix according to the model We run each Markov chain for 12,000 steps and discard the first 2000 steps as burn-in. In order to summarize the high-dimensional posterior simulations in terms of lower dimensional quantities, we compute the principal angles between the columns of the simulated matrices and the corresponding columns of the posterior mode computed from the eigendecomposition For the principal angles are

 θj=cos−1(|qTjvj|∥qj∥∥vj∥)

where and are the th columns of and respectively.

Figure 2 displays the principal angles calculated from the two Markov chains. The plots in the bottom half of the figure compare histogram approximations of the marginal posterior distributions of the principal angles. There is considerable overlap, suggesting that the two chains have found their way to equivalent modes of their matrix Bingham stationary distribution. The plot in the top left of the figure overlays trace plots of the first principal angle calculated from a portion of each of the chains. The black line corresponds to our MCMC approach, while the gray line corresponds to that of Hoff . The plot in the top right shows the correlation between lagged values of the first principal angle. Again, the black dots correspond to our MCMC approach, while the gray dots correspond to that of Hoff . Together, the plots in the top half of the figure indicate that, compared to the approach of Hoff , the Cayley transform approach produces a Markov chain with less autocorrelation, reducing Monte Carlo error in the resulting posterior inferences.

## 6 An asymptotic independent normal approximation

In Section 4, we derived the density for the distribution of when is distributed uniformly on the Stiefel manifold. However, the expression involves the rather opaque Jacobian term Instead of analyzing the density function, we can gain insight into the distribution of by other means. The critical observation, evident in simulations, is the following: If with large and then, in some sense, the elements of

are approximately independent and normally distributed. Theorem

6.1 of this section provides a mathematical explanation for this empirical phenomenon.

In order to understand Theorem 6.1 and its broader context, it is helpful to review the literature relating to normal approximations to the entries of random orthogonal matrices or transformations thereof. For the sake of consistency and clarity, the notation and formulation of the relevant results have been modified slightly. Let be a sequence of random orthogonal matrices with each element uniform on The notation indicates that the number of columns may grow with the number of rows. For each let be the top left entry of (any other entry would also work). It has long been observed that is approximately normal when is large. The earliest work in this direction relates to the equivalence of ensembles in statistical mechanics and is due to Mehler , Maxwell [32, 33], and Borel . A theorem of Borel shows that as grows, where

is the cumulative distribution function of a standard normal random variable. Since then, a large and growing literature on this sort of normal approximation has emerged. A detailed history is given in

Diaconis and Freedman , while an overview is given in D’Aristotile et al.  and Jiang .

Much of this literature is devoted to normal approximations to the joint distribution of entries of random orthogonal matrices. Letting

be the first column of for each (again, any other column would also work), Stam  proved that the total variation distance between the distribution of the first coordinates of and the distribution of standard normal random variables converges to zero as gets large so long as Diaconis and Freedman  strengthened this result, showing that it holds for Diaconis et al.  prove that the total variation distance between the distribution of the top left block of and the distribution of independent standard normals goes to zero as if and for (Clearly, we must have for this result to make sense.) Their work drew attention to the problem of determining the largest orders of and such that the total variation distance goes to zero. Jiang  solved this problem, finding the largest orders to be Recent work has further explored this topic [45, 24].

Many authors have also considered transformations of random orthogonal matrices or notions of approximation not based on total variation distance. In the former category, D’Aristotile et al.  and Meckes  study the convergence of linear combinations of the entries of the matrices in the sequence to normality as Diaconis and Shahshahani , Stein , Johansson , and Rains  addresses the normality of traces of powers of random orthogonal and unitary matrices. In the latter category, Chatterjee and Meckes  and Jiang and Ma  consider probability metrics other than total variation distance. Jiang  also considers a notion of approximation other than total variation distance, and Theorem 3 in that work is particularly important in understanding our Theorem 6.1.

Theorem 3 of Jiang  tells us that the distribution can be approximated by the distribution of independent normals provided that is sufficiently large (both in an absolute sense and relative to ). The form of approximation in the theorem is weaker and likely less familiar than one based on total variation distance. Define the max norm of a matrix as the maximum of the absolute values of its entries. Jiang  shows that one can construct a sequence of pairs of random matrices with each pair defined on the same probability space such that

1. The entries of the matrix are independent standard normals;

2. The matrix is uniform on

3. The quantity in probability as grows provided that and this is the largest order of such that the result holds.

The coupling is constructed by letting be the result of the Gram-Schmidt orthogonalization procedure applied to

To better understand this type of approximation, which we refer to as ‘approximation in probability,’ consider the problem of simulating from the distribution of the random matrix on a computer with finite precision. One could simulate a matrix of independent standard normals, obtain using Gram-Schmidt, then multiply by to arrive at However, for a fixed machine precision and sufficiently large (again, both in an absolute sense and relative to ), the matrix would be indistinguishable from with high probability.

Our Theorem 6.1 establishes that the distribution of which we know to have a density proportional to can be approximated in probability by independent normals. (Since we now have a sequence of matrices of different dimensions, we denote the Cayley transform parametrizing by ) For each define the diagonal scale matrix

 Πp =[√p/2Ikp(kp−1)/200√pI(p−kp)kp],

and recall that the infinity norm of a vector is equal to the maximum of the absolute values of its entries.

###### Theorem 6.1.

One can construct a sequence of pairs of random vectors such that

1. The entries of the vector are independent standard normals;

2. The vector where

3. The quantity in probability as provided

The construction of the coupling is more elaborate than in Theorem 3 of Jiang . We first introduce a function which approximates the inverse Cayley transform. Given a matrix having a square top block and a bottom block , the vector contains the independent entries of obtained by eliminating diagonal and supradiagonal entries from while The approximate inverse Cayley transform is then

 ˜C−1p(M) =[˜bp(M)vec˜Ap(M)].

Now let be a matrix of independent standard normals and let be the result of applying the Gram-Schmidt orthogonalization procedure to It follows that Finally, set

 zp =Πp˜C−1p(p−1/2Zp) φp =C−1p(Qp).

The details of the proof of Theorem 6.1 appear in the appendix, but we provide a sketch here. Part (i) is straightforward to verify and (ii) is immediate. Part (iii) requires more work. The proof of (iii) involves verifying the following proposition, which involves a third random vector

###### Proposition 6.1.

(i) The quantity in probability as provided and (ii) the quantity in probability as provided

Part (iii) then follows from the proposition by the triangle inequality.

## Acknowledgements

This work was partially supported by grants from the National Science Foundation (DMS-1505136, IIS-1546130) and the United States Office of Naval Research (N00014-14-1-0245/N00014-16-1-2147).

## References

• Anderson et al.  T. W. Anderson, I. Olkin, and L. G. Underhill. Generation of Random Orthogonal Matrices. SIAM Journal on Scientific and Statistical Computing, 8(4):625–629, 1987.
• Borel  Émile Borel. Sur les principes de la théorie cinétique des gaz. Annales scientifiques de l’École Normale Supérieure, 23:9–32, 1906.
• Byrne and Girolami  Simon Byrne and Mark Girolami. Geodesic Monte Carlo on embedded manifolds. Scandinavian Journal of Statistics, 40(4):825–845, 2013.
• Carpenter et al.  Bob Carpenter, Andrew Gelman, Matthew D. Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. Stan : A Probabilistic Programming Language. Journal of Statistical Software, 76(1):1–32, 2017.
• Cayley  Arthur Cayley. Sur quelques propriétés des déterminants gauches. Journal für die reine und angewandte Mathematik, 32:119–123, 1846.
• Chatterjee and Meckes  Sourav Chatterjee and Elizabeth Meckes. Multivariate normal approximation using exchangeable pairs. Latin American Journal of Probability and Mathematical Statistics, 4:257–283, 2008.
• Chikuse  Yasuko Chikuse. Statistics on Special Manifolds. Springer New York, 2003.
• Cook et al.  R. Dennis Cook, Bing Li, and Francesca Chiaromonte.

Envelope models for parsimonious and efficient multivariate linear regression.

Statistica Sinica, 20(3):927–960, 2010.
• Cron and West  Andrew Cron and Mike West. Models of Random Sparse Eigenmatrices and Bayesian Analysis of Multivariate Structure. In

Statistical Analysis for High-Dimensional Data

, pages 125–153. Springer International Publishing, 2016.
• D’Aristotile et al.  Anthony D’Aristotile, Persi Diaconis, and Charles M. Newman. Brownian motion and the classical groups. Lecture Notes-Monograph Series, 41:97–116, 2003.
• Diaconis and Forrester  Persi Diaconis and Peter J. Forrester. Hurwitz and the origins of random matrix theory in mathematics. Random Matrices: Theory and Applications, 6(1), 2017.
• Diaconis and Freedman  Persi Diaconis and David Freedman. A dozen de Finetti-style results in search of a theory. Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, 23(2):397–423, 1987.
• Diaconis and Shahshahani  Persi Diaconis and Mehrdad Shahshahani. On the Eigenvalues of Random Matrices. Journal of Applied Probability, 31:49–62, 1994.
• Diaconis et al.  Persi Diaconis, Morris Eaton, and Steffen Lauritzen. Finite De Finetti Theorems in Linear Models and Multivariate Analysis. Scandinavian Journal of Statistics, 19(4):289–315, 1992.
• Diaconis et al.  Persi Diaconis, Susan Holmes, and Mehrdad Shahshahani. Sampling from a Manifold. In Advances in Modern Statistical Theory and Applications: A Festschrift in honor of Morris L. Eaton, pages 102–125. Institute of Mathematical Statistics, 2013.
• Eaton  Morris L. Eaton. Multivariate Statistics: a Vector Space Approach. John Wiley & Sons, 1983.
• Eaton  Morris L. Eaton. Group Invariance Applications in Statistics. In Regional Conference Series in Probability and Statistics, volume 1, pages i–133. Institute of Mathematical Statistics, 1989.
• Hoff  Peter D. Hoff.

Model Averaging and Dimension Selection for the Singular Value Decomposition.

Journal of the American Statistical Association, 102(478):674–685, 2007.
• Hoff  Peter D. Hoff. Simulation of the matrix Bingham-von Mises-Fisher Distribution, With Applications to Multivariate and Relational Data. Journal of Computational and Graphical Statistics, 18(2):438–456, 2009.
• Hubbard and Hubbard  J. H. Hubbard and B. B. Hubbard. Vector Calculus, Linear Algebra, and Differential Forms: A Unified Approach. Matrix Editions, 2009.
• Hurwitz  A. Hurwitz. Über die Erzeugung der invarianten durch integration. Nachrichten von der Gesellschaft der Wissenschaften zu Göttingen, Mathematisch-Physikalische Klasse, 1897:71–90, 1897.
• James  A. T. James. Normal Multivariate Analysis and the Orthogonal Group. Annals of Mathematical Statistics, 25(1):40–75, 1954.
• Jiang  Tiefeng Jiang. How many entries of a typical orthogonal matrix can be approximated by independent normals? Annals of Probability, 34(4):1497–1529, 2006.
• Jiang and Ma  Tiefeng Jiang and Yutao Ma. Distances between random orthogonal matrices and independent normals. arXiv:1704.05205v1, 2017.
• Johansson  Kurt Johansson. On Random Matrices from the Compact Classical Groups. Annals of Mathematics, 145(3):519–545, 1997.
• Johnstone  Iain M. Johnstone.

On the distribution of the largest eigenvalue in principal components analysis.

The Annals of Statistics, 29(2):295–327, 2001.
• León et al.  Carlos A. León, Jean-Claude Massé, and Louis-Paul Rivest. A Statistical Model for Random Rotations. Journal of Multivariate Analysis, 97(2):412–430, 2006.
• Magnus  Jan R. Magnus. Linear Structures. Oxford University Press, 1988.
• Magnus and Neudecker  Jan R. Magnus and H. Neudecker. The Commutation Matrix: Some Properties and Applications. The Annals of Statistics, 7(2):381–394, 1979.
• Magnus and Neudecker  Jan R. Magnus and Heinz Neudecker. Matrix differential calculus with applications in statistics and econometrics. Wiley Series in Probability and Mathematical Statistics, 1988.
• Mardia and Jupp  K. V. Mardia and P. E. Jupp. Directional Statistics. Wiley Series in Probability and Statistics, 2009.
• Maxwell  J. C. Maxwell. Theory of Heat. Longmans, London, 4th edition, 1875.
• Maxwell  J. C. Maxwell. On Boltzmann’s theorem on the average distribution of energy in a system of material points. Transactions of the Cambridge Philosophical Society, 12:547–575, 1878.
• Meckes  Elizabeth Meckes. Linear functions on the classical matrix groups. Transactions of the American Mathematical Society, 360(10):5355–5366, 2008.
• Mehler  F. G. Mehler. Ueber die Entwicklung einer Function von beliebig vielen. Variablen nach Laplaschen Functionen höherer Ordnung. Crelle’s Journal, 66:161–176, 1866.
• Neal  Radford M. Neal. MCMC Using Hamiltonian Dynamics. Handbook of Markov Chain Monte Carlo, 54:113–162, 2010.
• Neudecker  H. Neudecker. On Jacobians of transformations with skew-symmetric, strictly (lower) triangular or diagonal matrix arguments. Linear and Multilinear Algebra, 14(3):271–295, 1983.
• Pourzanjani et al.  Arya A. Pourzanjani, Richard M. Jiang, Brian Mitchell, Paul J. Atzberger, and Linda R. Petzold. General Bayesian Inference over the Stiefel Manifold via the Givens Transform. arXiv:1710.09443v2, 2017.
• Rains  E. M. Rains. High powers of random elements of compact Lie groups. Probability Theory and Related Fields, 107(2):219–241, 1997.
• Rao et al.  Vinayak Rao, Lizhen Lin, and David B. Dunson. Data augmentation for models based on rejection sampling. Biometrika, 103(2):319–335, 2016.
• Salvatier et al.  John Salvatier, Thomas V. Wiecki, and Christopher Fonnesbeck. Probabilistic programming in Python using PyMC3. PeerJ Computer Science, 2:e55, 2016.
• Shepard et al.  Ron Shepard, Scott R. Brozell, and Gergely Gidofalvi. The Representation and Parametrization of Orthogonal Matrices. The Journal of Physical Chemistry A, 119(28):7924–7939, 2015.
• Stam  A. J. Stam. Limit theorems for uniform distributions on spheres in high-dimensional Euclidean spaces. Journal of Applied Probability, 19:221–228, 1982.
• Stein  C. Stein. The accuracy of the normal approximation to the distribution of the traces of powers of random orthogonal matrices. Technical Report No. 470, Stanford University Department of Statistics, 1995.
• Stewart  Kathryn Stewart. Total variation approximation of random orthogonal matrices by Gaussian matrices. arXiv:1704.06641v2, 2018.
• Traynor  Tim Traynor. Change of variable for Hausdorff measure. Rendiconti dell’Istituto di Matematica dell’Università di Trieste, 1:327–347, 1993.
• van Handel  Ramon van Handel. Probability in high dimension. Technical report, Princeton University, 2016.

## Appendix A Proofs

### a.1 The sum of a symmetric positive definite matrix and skew-symmetric matrix is nonsingular

Let and be symmetric positive definite and skew-symmetric, respectively, of equal dimension. Their sum can be written

 Σ+S =Σ1/2(I+Σ−1/2SΣ−1/2)Σ1/2.

The matrix is skew-symmetric, which implies that is nonsingular. Because it can be written as the product of nonsingular matrices, the sum is nonsingular.

### a.2 Proof of Proposition 3.1

One can compute, following Magnus and Neudecker , that

 dQ=2(Ip−Xφ)−1dXφ(Ip−Xφ)−1Ip×k,

so that

 dvecQ=2[ITp×k(Ip−Xφ)−T⊗(Ip−X−1φ)]dvecXφ.

By Lemma 3.0.1, we have that Thus,

 dvecQ=2[ITp×k(Ip−Xφ)−T⊗(Ip−Xφ)−1]ΓVdφ.

Using the first identification table of Magnus and Neudecker , we identify the derivative matrix

 DCayV(φ)=2[ITp×k(Ip−Xφ)−T⊗(Ip−Xφ)−1]ΓV.

### a.3 Proof of Proposition 3.2

We first show that is injective on Let and suppose that i.e. the columns of and span the same subspace. There must exist such that so that Because the matrix is nonsingular, its left polar decomposition into the product of a symmetric positive definite matrix and an orthogonal matrix is unique (see, for example, Proposition 5.5 of Eaton ). We conclude that and Thus, is injective on

Next, we prove that the image of under has measure one with respect to the uniform probability measure on Define as the set

 VN(k,p) ={Q=[QT1QT2]T∈V(k,p):Q1 is nonsingular}.

The set