The QR decomposition for radial neural networks

July 6, 2021 · Iordan Ganev (Weizmann Institute of Science), Robin Walters (Northeastern University)

We provide a theoretical framework for neural networks in terms of the representation theory of quivers, thus revealing symmetries of the parameter space of neural networks. Exploiting these symmetries leads to a model compression algorithm for radial neural networks based on an analogue of the QR decomposition. A projected version of backpropagation on the original model matches usual backpropagation on the compressed model.

Code repository: QR-decomposition-radial-NNs, code accompanying the paper, arXiv:2107.02550 (https://github.com/ivganev/QR-decomposition-radial-NNs/).

1. Introduction

Recent work has shown that representation theory, the formal study of symmetry, provides the foundation for various innovative techniques in deep learning

[cohen2016group, kondor_generalization_2018, ravanbakhsh2017equivariance, cohen2016steerable]. Much of this previous work considers symmetries inherent to the input and output spaces, as well as distributions and functions that respect these symmetries. By contrast, in this paper, we expose a broad class of symmetries intrinsic to the parameter space of the neural networks themselves. We use these symmetries to devise a model compression algorithm that reduces the widths of the hidden layers, and hence the number of trainable parameters. Unlike representation-theoretic techniques in the setting of equivariant neural networks, our methods are applicable to deep learning models with non-symmetric domains and non-equivariant functions, and hence pertain to some degree to all neural networks.

Specifically, we formulate a theoretical framework for neural networks in terms of the theory of representations of quivers, a mathematical field with connections to symplectic geometry and Lie theory [kirillov2016quiver, nakajima1998quiver]. This approach builds on that of Armenta and Jodoin [armenta_representation_2020] and of Wood and Shawe-Taylor [wood_representation_1996], but is arguably simpler and encapsulates larger symmetry groups. Formally, a quiver is another name for a directed graph, and a representation

of a quiver is the assignment of a vector space to each vertex and a linear map to each edge, where the source and target of the linear map are the vector spaces assigned to the source and target vertices of the edge. Representations of quivers carry rich symmetry groups via change-of-basis transformations of the vector spaces assigned to the vertices.
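For instance (our notation, for illustration): consider the quiver with two vertices $1$ and $2$ and a single edge $1 \to 2$. A representation with dimension vector $(m, n)$ assigns $\mathbb{R}^m$ to vertex $1$, $\mathbb{R}^n$ to vertex $2$, and a matrix $W \in \mathbb{R}^{n \times m}$ to the edge; the change-of-basis group $GL_m \times GL_n$ acts by $(g_1, g_2) \cdot W = g_2\, W\, g_1^{-1}$, and two representations related by this action are isomorphic.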

Our starting point is to regard the vector space of parameters for a neural network (unless specified otherwise, the term 'neural network' exclusively refers to a multilayer perceptron, or MLP) with a given number of layers as a representation of a specific quiver, namely, the neural quiver, which has one vertex for each layer in the top row and a 'bias vertex' at the bottom. As such, the parameter space carries a change-of-basis symmetry group. Factoring out this symmetry leads to a reduced parameter space without affecting the feedforward function. The size of the symmetry group is determined by properties of the activation functions. We focus on the case of radial activation functions, as in

[weiler_general_2019, sabour2017dynamic, weiler20183d, weiler2018learning]; these interact favorably with certain QR decompositions, and, consequently, the resulting model compression is more significant than what is possible for the more common pointwise (also known as 'local') activations. We refer to neural networks with radial activation functions as radial neural networks.
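To fix ideas, here is a minimal Python sketch of one common form of radial map, $\rho(v) = h(\lVert v \rVert)\,v$ for a scalar profile $h$ (an assumed form used only for illustration; the precise definition appears in Section LABEL:subsec:radial-functions), together with a numerical check of the property we exploit: radial maps commute with orthogonal transformations.

    import numpy as np

    def radial_activation(v, h):
        # Radial map: rescale v by a function of its norm.
        return h(np.linalg.norm(v)) * v

    h = lambda r: 1.0 / (1.0 + np.exp(-r))             # a sigmoid profile, chosen for illustration

    rng = np.random.default_rng(0)
    v = rng.standard_normal(5)
    Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))   # a random orthogonal matrix

    # rho(Q v) = Q rho(v), since ||Q v|| = ||v||
    assert np.allclose(radial_activation(Q @ v, h), Q @ radial_activation(v, h))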

Given a radial neural network, our results produce a neural network with fewer neurons in each hidden layer and the same feedforward function, and hence the same loss for any batch of training data. Moreover, we prove that the value of the loss function after a step of gradient descent applied to the compressed model is the same as the value of the loss function after a step of

projected gradient descent

applied to the original model. As we explain, projected gradient descent is a version of gradient descent where one subtracts a truncation of the gradient, rather than the full gradient. Admittedly, training the original model often leads to better performance after fewer epochs; however, when the compression is significant enough, the compressed model takes less time per epoch to train and reaches local minima faster.

To state these results slightly more precisely, recall that the parameters of a neural network with layer widths consist of an matrix of weights for each layer , where we include the bias as an extra column. These are grouped into a tuple . We define the reduced widths recursively as for , with and . Note that for all .
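(A plausible reconstruction of this recursion, stated as an assumption rather than a quotation from the original: writing $n_0, \dots, n_L$ for the widths, set $n_0^{\mathrm{red}} = n_0$, then $n_i^{\mathrm{red}} = \min\big(n_i,\; n_{i-1}^{\mathrm{red}} + 1\big)$ for $1 \le i \le L-1$, and $n_L^{\mathrm{red}} = n_L$, the $+1$ accounting for the bias column. For instance, hypothetical widths $(1, 6, 7, 1)$ would reduce to $(1, 2, 3, 1)$.)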

Theorem 1.1 (Informal version of Theorems LABEL:thm:ascending and LABEL:thm:projGD).

Suppose a neural network has layers with widths , parameters , and radial activation functions. Let be the feedforward function of the network.

  1. There exists a reduced radial neural network with layer widths , parameters , and the same feedforward function .

  2. Training with gradient descent is an equivalent optimization problem to training with projected gradient descent.

This theorem can be interpreted as a model compression result: the reduced (or compressed) neural network has the same accuracy as the original neural network, and there is an explicit relationship between the gradient descent optimization problems for the two neural networks. We now state a simplified version of the algorithm used to compute the reduced neural network, which amounts to layer-by-layer computations of QR decompositions:

input : the tuple of weight matrices
for each layer (in order) do          // iterate through the layers
      Q, R ← QR-decomp( · )           // QR decomposition of the current (transformed) weight matrix
      ...                             // update the next weight matrix using the orthogonal factor Q
end for
...                                   // define the reduced weight matrices
return : the reduced tuple of weight matrices
Algorithm 1 QR Dimensional Reduction (simplified version of Algorithm LABEL:alg:QRU)
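As a concrete illustration of the layer-by-layer reduction, the following Python sketch treats the simpler no-bias setting of Section 4.1, with radial activations of the assumed form $\rho(v) = h(\lVert v \rVert)\,v$. It is a minimal reconstruction for illustration only, not the full Algorithm LABEL:alg:QRU, and all variable names are ours:

    import numpy as np

    def rho(v, h=lambda r: 1.0 / (1.0 + np.exp(-r))):
        # Radial activation: rescale v by a function of its norm, so rho
        # commutes with orthogonal matrices and with zero-padding.
        return h(np.linalg.norm(v)) * v

    def qr_reduce(weights):
        # Layer-by-layer QR reduction of a no-bias radial network.
        reduced = []
        r = weights[0].shape[1]                       # reduced width so far (input width)
        Q_prev = np.eye(r)
        for W in weights[:-1]:
            A = W @ Q_prev[:, :r]                     # transformed weights, shape (n_i, r)
            Q, R = np.linalg.qr(A, mode="complete")   # Q: (n_i, n_i), R: (n_i, r), upper triangular
            r = min(W.shape[0], r)                    # new reduced width
            reduced.append(R[:r, :])                  # keep only the (possibly) nonzero rows of R
            Q_prev = Q
        reduced.append(weights[-1] @ Q_prev[:, :r])   # last layer absorbs the final orthogonal factor
        return reduced

    def forward(weights, x):
        # Feedforward function: radial activations on hidden layers only.
        for W in weights[:-1]:
            x = rho(W @ x)
        return weights[-1] @ x

    rng = np.random.default_rng(0)
    widths = [2, 8, 16, 1]                            # hypothetical widths
    Ws = [rng.standard_normal((widths[i + 1], widths[i])) for i in range(len(widths) - 1)]
    Ws_red = qr_reduce(Ws)                            # reduced widths here: (2, 2, 2, 1)

    x = rng.standard_normal(widths[0])
    assert np.allclose(forward(Ws, x), forward(Ws_red, x))  # same feedforward function

In the with-bias setting, the same idea applies (roughly speaking) after absorbing each bias as an extra weight column and requiring the change of basis to fix the corresponding coordinate; Algorithm LABEL:alg:QRU handles this bookkeeping.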

We view this work as a step in the direction of improving learning algorithms by exploiting symmetry inherent to neural network parameter spaces. As such, we expect our framework and results to generalize in several ways, including: (1) further reductions of the hidden widths, (2) incorporating certain non-radial activation functions, (3) encapsulating neural networks beyond MLPs, such as convolutional, recurrent, and graph neural networks, (4) integration of regularization techniques. We postpone a detailed discussion of future directions to the end of the paper (Section 4.3).

Our contributions are as follows:

  1. We provide a theoretical framework for neural networks based on the representation theory of quivers.

  2. We derive a QR decomposition for radial neural networks and prove its compatibility with (projected) gradient descent.

  3. We implement a lossless model compression algorithm for radial neural networks.

1.1. Related work

Quiver representation theory and neural networks.

Armenta and Jodoin [armenta_representation_2020] establish an approach to understanding neural networks in terms of quiver representations. Our work generalizes their approach as it (1) encapsulates both pointwise and non-pointwise activation functions, (2) taps into larger symmetry groups, and (3) connects more naturally to gradient descent. Jeffreys and Lau [jeffreys_kahler_2021]

also place quiver varieties in the context of machine learning, and define a Kähler metric in order to perform gradient flow. Manin and Marcolli

[manin_homotopy_2020] advance the study of neural networks using homotopy theory, and the “partly oriented graphs” appearing in their work are generalizations of quivers. In comparison to the two aforementioned works, our approach is inspired by similar algebro-geometric and categorical perspectives, but our emphasis is on practical consequences for optimization techniques at the core of machine learning.

Although the study of neural networks in the context of quiver representations emerged recently, there are a number of precursors. One is the study and implementation of the “non-negative homogeneity” (also known as “positive scaling invariance”) property of ReLU activation functions

[dinh_sharp_2017, neyshabur_path-sgd_2015, meng_g-sgd_2019], which is a special case of the symmetry studied in this paper. Wood and Shawe-Taylor [wood_representation_1996] regard layers in a neural network as representations of finite groups and restrict their attention to the case of pointwise activation functions; by contrast, our framework captures Lie groups as well as non-pointwise activation functions. Our quiver approach to neural networks shares similarities with the “algebraic neural networks” of Parada-Mayorga and Ribeiro [parada-mayorga_algebraic_2020], and special cases of their formalism amount to representations of quivers over base rings beyond , such as the ring of polynomials . In a somewhat different data-analytic context, Chindris and Kline [chindris_simultaneous_2021] use quiver representation theory in order to untangle point clouds, though they do not use neural networks or machine learning.

Equivariant neural networks.

Previously, representation theory has been used to design neural networks which incorporate symmetry as an inductive bias. A variety of architectures, such as G-convolutions, steerable CNNs, and Clebsch–Gordan networks, are constrained by weight-sharing schemes to be equivariant or invariant to various symmetry groups [cohen2019gauge, weiler_general_2019, cohen2016group, chidester2018rotation, kondor_generalization_2018, bao2019equivariant, worrall2017harmonic, cohen2016steerable, weiler2018learning, dieleman2016cyclic, lang2020wigner, ravanbakhsh2017equivariance]. Our approach, in contrast, does not rely on symmetry of the input domain, output space, or mapping. Rather, our method exploits symmetry of the parameter space and thus applies more generally to domains with no obvious symmetry.

Model compression and weight pruning.

In general, model compression aims to exploit redundancy in neural networks; enormous models can be reduced and run on smaller systems with faster inference [bucilua2006model, cheng2017survey, frankle2018lottery, zhang2018systematic]. Whereas previous approaches to model compression are based on weight pruning, quantization, matrix factorization, or knowledge distillation, we take a fundamentally different approach by exploiting symmetries of neural network parameter spaces.

1.2. Organization of the paper

This paper is a contribution to the mathematical foundations of machine learning, and hence uses precise mathematical formalism throughout. At the same time, our results are motivated by expanding the applicability and performance of neural networks. We hope our work is accessible to both machine learning researchers and mathematicians – allowing the former to recognize the practical potential of representation theory, while giving the latter a glimpse into the rich structure underlying neural networks.

Section 2 consists of preliminary material. We review the necessary background on linear algebra and group theory in Section 2.1. We recall basic facts related to the QR decomposition in Section 2.2. Section LABEL:subsec:GD serves to establish notation used in the context of gradient descent, and to prove an interaction between gradient descent and orthogonal transformations (Proposition LABEL:prop:GD-orthtrans).

In Section LABEL:sec:nn-quiver, we delve into quiver representation theory. We first provide the basic definitions (Section LABEL:subsec:quiver-rep) before focusing on a specific quiver, called the neural quiver (Section LABEL:subsec:neuralquiver). Finally, we give a formulation of neural networks in terms of quiver representations (Definition LABEL:def:nn in Section LABEL:subsec:nn-defs), and explain the sense in which the backpropagation training algorithm can be regarded as taking place on the vector space of representations of the neural quiver.

Section LABEL:sec:QR-rnns contains the first set of main results of this paper. We recall the definitions of radial functions and radial neural networks (Section LABEL:subsec:radial-functions). In Section LABEL:subsec:QR-result, we introduce the notion of a reduced dimension vector for any dimension vector of the neural quiver. This reduced dimension vector features in a QR decomposition (Theorem LABEL:thm:ascending) for radial neural networks; we also provide an algorithm to compute the QR decomposition (Algorithm LABEL:alg:QRU).

Section LABEL:sec:proj-GD

contains the second set of main results of this paper, which are related to projected gradient descent. We first introduce an ‘interpolating space’ that relates the space of representations with a given dimension vector to the space of representations with the reduced dimension vector (Section

LABEL:subsec:interpolating). Using the interpolating space, we define a projected version of gradient descent (Definition LABEL:def:projGD in Section LABEL:subsec:projGD-QR) and state a result on the relationship between projected gradient descent and the QR decomposition for radial neural networks (Theorem LABEL:thm:projGD).

In Section 3, we summarize implementations that empirically verify our main results (Sections 3.1 and 3.2) and demonstrate that reduced models train faster (Section 3.3).

Section 4 is a discussion section that includes a version of our results for neural networks with no biases (Section 4.1), a generalization involving shifts in radial functions (Section 4.2), and an overview of future directions (Section 4.3).

Finally, in Appendix A, we include a formulation and generalization of our results using the language of category theory.

1.3. Acknowledgments

We would like to thank Avraham Aizenbud, Marco Antonio Armenta, Niklas Smedemark-Margulies, Jan-Willem van de Meent, and Rose Yu for insightful discussions, comments, and questions. Robin Walters is supported by a Postdoctoral Fellowship from the Roux Institute.

2. Preliminaries

In this section, we first review basic notation and concepts from linear algebra, group theory, and representation theory. We then state a version of the QR decomposition. Finally, we formulate a definition of the gradient descent map with respect to a loss function. References include [Petersen12thematrix], [quarteroni_numerical_2007], and [dummit_abstract_2003].

2.1. Linear algebra

For positive integers and , let or denote the vector space of by matrices with entries in , that is, the vector space of linear functions (also known as homomorphisms) from to . The general linear group consists of the set of invertible by matrices, with the operation of matrix multiplication. Equivalently, consists of all linear automorphisms of , with the operation of composition. Such automorphisms are given precisely by change-of-basis transformations. The unit is the identity by matrix, denoted . The orthogonal group is the subgroup of consisting of all matrices such that . Such a matrix is called an orthogonal transformation.

Let be a group. An action (or representation) of on the vector space is the data of an invertible by matrix for every group element , such that (1) for all , the matrix is the product of the matrices and , and (2) for the identity element , we have . In other words, an action amounts to a group homomorphism We often abbreviate as simply , for and . A function (non-linear in general) is invariant for the action of if for all and .
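For example, the function $f(x) = \lVert x \rVert$ on $\mathbb{R}^n$ is invariant for the standard action of the orthogonal group, since $\lVert Qx \rVert^2 = x^{\top} Q^{\top} Q\, x = x^{\top} x = \lVert x \rVert^2$ for every orthogonal $Q$; this invariance is the prototype for the radial functions studied later.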

Suppose . Then denotes the standard inclusion into the first coordinates:

The standard projection onto the first coordinates is defined as:

An affine map from to is one of the form for a matrix and a -dimensional vector . The set of affine maps from to can be identified with the vector space , as we now explain. First, given and as above, we form the matrix . Conversely, the affine map corresponding to is given by

where is defined as:

(2.1)
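(A sketch of the standard construction, with our choice of conventions: the affine map $x \mapsto Wx + b$, for $W \in \mathbb{R}^{n \times m}$ and $b \in \mathbb{R}^n$, corresponds to the block matrix $[\,b \mid W\,] \in \mathbb{R}^{n \times (m+1)}$, and the map in (2.1) is then $\iota(x) = (1, x_1, \dots, x_m) \in \mathbb{R}^{m+1}$, which appends a leading coordinate equal to $1$, so that $[\,b \mid W\,]\,\iota(x) = b + Wx$. Whether the bias column is placed first or last is a convention we have assumed here.)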

2.2. The QR decomposition

In this section, we recall the QR decomposition and note several relevant facts. For integers and , let denote the vector space of upper triangular by matrices.

Theorem 2.1 (QR Decomposition).

The following map is surjective:

In other words, any matrix can be written as the product of an orthogonal matrix and an upper-triangular matrix. When

, the last rows of any matrix in are zero, and the top rows form an upper-triangular by matrix. These observations lead to the following special case of the QR decomposition:

Corollary 2.2.

Suppose . The following map is surjective:

We make some remarks:

  1. There are several algorithms for computing the QR decomposition of a given matrix. One is Gram–Schmidt orthogonalization, and another is the method of Householder reflections. The latter has computational complexity in the case of a matrix with . The package numpy includes a function numpy.linalg.qr that computes the QR decomposition of a matrix using Householder reflections; see the usage sketch following these remarks.

  2. The QR decomposition is not unique in general, or, in other words, the map is not injective in general. For example, if , each fiber of contains a copy of .

  3. The QR decomposition is unique (in a certain sense) for invertible square matrices. To be precise, let denote the subset of consisting of upper triangular matrices with positive entries along the diagonal. If , then both and are subgroups of , and the multiplication map is bijective. However, the QR decomposition is not unique for non-invertible square matrices.
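A brief usage sketch of the NumPy routine mentioned in the first remark (illustrative only):

    import numpy as np

    A = np.random.default_rng(0).standard_normal((5, 3))

    # 'complete' mode returns an orthogonal Q (5 x 5) and an upper-triangular R (5 x 3)
    Q, R = np.linalg.qr(A, mode="complete")

    assert np.allclose(Q @ R, A)                # A = QR
    assert np.allclose(Q.T @ Q, np.eye(5))      # Q is orthogonal
    assert np.allclose(np.triu(R), R)           # R is upper triangular
    assert np.allclose(R[3:], 0)                # the last rows of R vanish, as in Corollary 2.2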

    Proof of Theorem LABEL:thm:projGD.

    The first claim follows easily from the definitions of , , and . The action of on is an orthogonal transformation, so the second claim follows from Proposition LABEL:prop:GD-orthtrans. For the last claim, we proceed by induction. The base case follows from Theorem LABEL:thm:ascending. For the induction step, we set

    Each belongs to , so . Moreover, . We compute:

    where the second equality uses the induction hypothesis, the third invokes the definition of , the fourth uses the relation between the gradient and orthogonal transformations, the fifth and sixth use Lemma LABEL:lem:repint above, and the last uses the definition of . ∎

    3. Experiments

    In addition to the theoretical results in this work, we provide an implementation of QRDimRed as described in Algorithm LABEL:alg:QRU. We perform experiments in order to (1) empirically validate that our implementation satisfies the claims of Theorems LABEL:thm:ascending and LABEL:thm:projGD and (2) quantify real-world performance. Our implementation is written in Python and uses the QR decomposition routine in NumPy [harris2020array]. We also implement a general class RadNet

    for radial neural networks using PyTorch

    [NEURIPS2019_9015]. Our code is available at https://github.com/ivganev/QR-decomposition-radial-NNs/.

    3.1. Empirical verification of Theorem LABEL:thm:ascending

    We verify the claim using a small model and synthetic data. We learn the function using samples for . We test on a radial neural network with layer widths and activation functions the radial shifted sigmoid . Applying QRDimRed gives a radial neural network with widths . Theorem LABEL:thm:ascending implies that the neural functions of and are equal. Over 10 random initializations of , the mean absolute error . Thus and agree up to machine precision.

    3.2. Empirical verification of Theorem LABEL:thm:projGD

    Similarly, we verify the conclusions of Theorem LABEL:thm:projGD using synthetic data. The claim is that training with objective by projected gradient descent coincides with training with objective by usual gradient descent. We verified this for 3000 epochs at learning rate 0.01. Over 10 random initializations of , the loss functions match up to machine precision with .

    3.3. Reduced model trains faster.

    The goal of model compression algorithms is usually to provide smaller trained models with faster inference but similar accuracy to larger trained models. Due to the relation between projected gradient descent of the full network and gradient descent of the reduced network , our method may produce a smaller model class which also trains faster without sacrificing accuracy.

    We test this hypothesis using a different set of synthetic data. We learn the function sending to using samples for . We test on a radial neural network with layer widths and activation functions the radial sigmoid . Applying QRDimRed gives a radial neural network with widths . We trained both models until the training loss was . Running on a test system with an Intel i5-8257U CPU at 1.40GHz and 8GB of RAM, and averaging over 10 random initializations, the reduced network trained in seconds and the original network trained in seconds, approximately times as long to reach the same loss.

    4. Discussion and conclusions

    In this section, we first consider how our framework specializes to the case of no bias, and then how it generalizes to shifts within the radial functions. Finally, we gather some conclusions of this work and discuss future directions.

    4.1. Special case: No-bias version

    We now consider neural networks with only linear maps between successive layers, rather than the more general setting of affine maps. In other words, there are no bias vectors.

    Let be a positive integer. The no-bias neural quiver is the following quiver with vertices:

    A representation of this quiver with dimension vector consists of a linear map from to for ; hence The corresponding no-bias reduced dimension vector is defined recursively by setting , then for , and finally . Since for all , we have an obvious inclusion and identify with its image in . Given functions , one adapts Procedure LABEL:procedure to define a radial neural network for every representation of , where the trainable parameters define linear maps rather than affine maps.
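    (A plausible reading of this no-bias recursion, again stated as an assumption rather than a quotation: $n_0^{\mathrm{red}} = n_0$, then $n_i^{\mathrm{red}} = \min\big(n_i,\, n_{i-1}^{\mathrm{red}}\big)$ for $1 \le i \le L-1$, and finally $n_L^{\mathrm{red}} = n_L$. It differs from the with-bias recursion only in dropping the $+1$ contributed by the bias column, and it satisfies $n_i^{\mathrm{red}} \le n_i$ for all $i$, so the inclusion mentioned above makes sense.)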

    Proposition 4.1.

    Theorem LABEL:thm:ascending holds with the neural quiver replaced by the no-bias neural quiver and the reduced dimension vector replaced by the no-bias reduced dimension vector .

    We illustrate an example of the reduction for no-bias radial neural networks in Figure 1. Versions of Algorithm LABEL:alg:QRU and Theorem LABEL:thm:projGD also hold in the no-bias case, where one uses projected gradient descent with respect to the subspace of representations having the lower left block of each equal to zero.

    Figure 1. Parameter reduction in 3 steps. Since the activation function is radial, it commutes with orthogonal transformations. This example has , , and no bias. The reduced dimension vector is . The number of trainable parameters reduces from 24 to 3.

    4.2. Generalization: Radial neural networks with shifts

    In this section we consider radial neural networks with an extra trainable parameter in each layer which shifts the radial function. Adding such parameters allows for more flexibility in the model, and (as shown in Theorem 4.3) the QR decomposition of Theorem LABEL:thm:ascending holds for such radial neural networks.

    Let be a function (we also assume it is piecewise differentiable, and exclude those for which the limit does not exist). For any and any , the corresponding shifted radial function on is given by:

    The following definition is a modification of Definition LABEL:def:nn.

    Definition 4.2.

    A radial neural network with shifts consists of the following data:

    1. Hyperparameters. A positive integer and a dimension vector for the neural quiver .

    2. Trainable parameters.

      1. A representation of the quiver with dimension vector . So, for , we have a matrix .

      2. A vector of shifts .

    3. Radial activation functions. A tuple , where . The activation function in the -th layer is given by for .

    The neural function of a radial neural network with shifts is defined as:

    where is the affine map corresponding to . The trainable parameters form the vector space , and the loss function of a batch of training data is defined as

    We have the gradient descent map:

    which updates the entries of both and . The group acts on as usual (see Section LABEL:subsec:radial-functions), and on trivially. The neural function is unchanged by this action. We conclude that the action on commutes with gradient descent .
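    (A sketch of the underlying argument; the precise statement is Proposition LABEL:prop:GD-orthtrans. If a group element $g$ acts by an orthogonal linear transformation of the parameter space and the loss is invariant, $\mathcal{L}(g \cdot \theta) = \mathcal{L}(\theta)$, then the chain rule gives $g^{\top} \nabla \mathcal{L}(g \cdot \theta) = \nabla \mathcal{L}(\theta)$, so $\nabla \mathcal{L}(g \cdot \theta) = g\, \nabla \mathcal{L}(\theta)$, and hence $g \cdot \theta - \eta\, \nabla \mathcal{L}(g \cdot \theta) = g \cdot \big( \theta - \eta\, \nabla \mathcal{L}(\theta) \big)$ for any learning rate $\eta$: a gradient descent step commutes with the action. The symbols $\theta$, $\mathcal{L}$, and $\eta$ are our notation.)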

    We now state a generalization of Theorem LABEL:thm:ascending for the case of radial neural networks with shifts. We omit a proof, as it uses the same techniques as the proof of Theorem LABEL:thm:ascending.

    Theorem 4.3.

    Let be a dimension vector for and fix functions as above. For any there exist:

    such that:

    1. The matrices are upper triangular.

    2. The following equality holds:

    3. The neural functions defined by and coincide:

    One can use the output of Algorithm LABEL:alg:QRU to obtain the , , and appearing in Theorem 4.3. Theorem LABEL:thm:projGD also generalizes to the setting of radial neural networks with shifts, using projected gradient descent with respect to the subspace of .

    4.3. Conclusions and future work

    In this paper, we have adopted the formalism of quiver representation theory in order to establish a theoretical framework for neural networks. While drawing inspiration from previous work, our approach is novel in that it (1) reveals a large group of symmetries of neural network parameter spaces, (2) leads to a version of the QR decomposition in the context of radial neural networks, and (3) precipitates an algorithm to reduce the widths of the hidden layers in such networks. Our main results, namely Theorems LABEL:thm:ascending and LABEL:thm:projGD, may potentially generalize in several ways, which we now describe.

    First, these theorems are only meaningful if for some (otherwise ). In particular, compression occurs for networks that have some consecutive layers of increasing width. Many networks, such as decoders, super-resolution mappings, and GANs, satisfy this criterion, but others do not. We expect our techniques to prove useful in weakening the assumptions necessary for a meaningful reduction, and hence in widening the applicability of our results.

    One shortcoming of our results is the necessity of projected gradient descent in Part 2 of Theorem LABEL:thm:projGD, rather than usual gradient descent. The following observations suggest how this shortcoming may be alleviated. First, preliminary experimental evidence suggests that our version of projected gradient descent is a first-order approximation of usual gradient descent. Second, it may be possible that, for a different choice of QR decomposition in Theorem LABEL:thm:ascending, the equation stated in Part 2 of Theorem LABEL:thm:projGD holds with usual gradient descent rather than projected gradient descent. This flexibility stems from the non-uniqueness of the QR decomposition of an matrix with (see Section 2.2).

    The hypothesis that the activation functions are radial is crucial for our results, as they commute with orthogonal matrices (Lemma LABEL:lem:radial-basics), and hence interact favorably with QR decompositions. However, similar dimensional-reduction procedures may well be possible for other activation functions which relate to other matrix decompositions.

    Our techniques may also lead to the incorporation of parameter space symmetries to improve generalization. In particular, there may be enhancements of our main results that incorporate regularization. For example, the loss function of a radial neural network with dimension vector remains invariant under the group action after adding an regularization term.
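    (For instance, assuming the regularizer is the squared $\ell_2$/Frobenius norm of the weight matrices and the symmetry acts by orthogonal change of basis on each hidden layer, invariance follows from $\lVert Q_i\, W_i\, \widehat{Q}_{i-1}^{-1} \rVert_F = \lVert W_i \rVert_F$ for orthogonal factors $Q_i$ and $\widehat{Q}_{i-1}$ (notation ours), since multiplying by an orthogonal matrix on either side preserves the Frobenius norm. The specific form of the action and of the regularizer here is an assumption made for illustration.)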

    The conceptual flexibility of quiver representation theory can encapsulate neural networks beyond MLPs, including convolutional neural networks (CNNs), equivariant neural networks, recurrent neural networks, graph neural networks, and others. For example, networks with skip connections, such as ResNet, require altering the neural quiver

    by adding extra edges encoding skip connections. On the other hand, the case of CNNs would require working with representations of the neural quiver over the ring of Laurent polynomials.

    Finally, from a more algebro-geometric perspective, our work naturally leads to certain quiver varieties (i.e. moduli spaces of quiver representations); these may reveal a novel approach to the manifold hypothesis, building on that of Armenta and Jodoin

    [armenta_representation_2020].

    References

     

    Weizmann Institute of Science, Rehovot, Israel. E-mail: iordan.ganev@weizmann.ac.il

    Northeastern University, Boston, Massachusetts, USA. E-mail: r.walters@northeastern.edu

    A. Appendix: Categorical formulation

    In this appendix, we summarize a category-theoretic approach toward the main results of the paper. While there is no substantial difference in the proofs, the language of category theory provides conceptual clarity that leads to generalizations of these results. References for category theory include [pierce1991basic, dummit_abstract_2003].

    a.1. The category of quiver representations

    In this section, we recall the category of representations of a quiver. Background references include [kirillov2016quiver, nakajima1998quiver].

    As in Section LABEL:subsec:quiver-rep, let be a quiver with source and target maps , and let be a dimension vector for . Recall that a representation of with dimension vector consists of a tuple of linear maps, where

    Let and be representations of the quiver , with dimension vectors and . A morphism of representations from to consists of the data of a linear map

    for every , subject to the condition that the following diagram commutes for every :

    The resulting category is known as the category of representations of .

    a.2. The category

    In this section, we define a certain subcategory of the category . Its objects are the same as the objects of , that is, representations of the neural quiver, while morphisms in are given by isometries.

    Let be a positive integer, and recall the neural quiver from Section LABEL:subsec:neuralquiver:

    As a reminder, the vertices in the top row are indexed from to , and we only consider dimension vectors whose value at the bias vertex is equal to . So a dimension vector for will refer to a tuple . Recall the isomorphism

    from Lemma LABEL:lem:neural-quiver. We denote a representation of as a tuple , where each belongs to . Let be a representation of with dimension vector . Tracing through the proof of Lemma LABEL:lem:neural-quiver and the definitions in Section A.1, we see that a morphism in consists of the following data:

    • a linear map , for , and

    • a scalar at the bias vertex,

    making the following diagram commute, for :

    Definition A.1.

    We define a subcategory of as follows. The objects of are the same as the objects of , that is, representations of . Let and be such representations, with dimension vectors and , respectively. A morphism in belongs to if the following hold:

    • and is the identity on ,

    • and is the identity on ,

    • for , the linear map is norm-preserving, i.e. for all , and

    • .

    Remark A.2.

    A norm-preserving map is called an isometry, which explains the notation .

    Lemma A.3.

    We have:

    1. The category is a well-defined subcategory of .

    2. Let be a morphism in . For , the linear map is injective.

    Proof.

    The claims follow from the facts that (1) the composition of two norm-preserving maps is norm-preserving, and (2) any linear norm-preserving map is injective (indeed, if a linear norm-preserving map sends a vector to zero, then that vector has norm zero and is therefore zero). ∎

    Definition A.4.

    Let be a morphism in , and let and be the dimension vectors. An orthogonal factorization of is an element of such that

    for . The correction term corresponding to an orthogonal factorization is:

    The correction term belongs to .

    Remark A.5.

    Orthogonal factorizations always exist, since any norm-preserving linear map can be written as the composition for some orthogonal .

    Remark A.6.

    Suppose is a morphism in . For , the map is an isomorphism if and only if . In this case, the choice of is unique and . Conversely, is not an isomorphism if and only if . In this case, there are choices for .

    Fix functions for . Hence we obtain radial functions for any . We group the into a tuple . Given , we attach a neural network (and hence a neural function) to every object in as in Procedure LABEL:procedure.

    Proposition A.7.

    Let and be representations of . Suppose there is a morphism in from to . Then the neural functions of the radial neural networks and coincide:

    Sketch of proof.

    The key is to show that, for , we have:

    where and are the affine map corresponding to and . The first step in the verification of this identity is to choose an orthogonal factorization of . The rest of the proof proceeds along the same lines as the proof of Equation LABEL:eqn:hWQi=QihR in Section LABEL:subsec:proof-QR. ∎

    a.3. Projected gradient descent: set-up

    In this section, we collect notation necessary to state results about projected gradient descent.

    Let and be dimension vectors for . We write if:

    • and ,

    • for .

    Consequently, if is a morphism in , then the dimension vectors satisfy . For , we make the following abbreviations, for :

    Using these maps, one defines an inclusion taking to the representation with

    Recall the functions . As in Procedure LABEL:procedure, these define activation functions (resp. ) for a representation with dimension (resp. ).

    Finally, we fix a batch of training data . Using the activation functions defined by , we have loss functions on and (see Section LABEL:subsec:nn-defs):

    There are resulting gradient descent maps on and given by:

    The verification of the following lemma is analogous to the proof of Part 1 of Lemma LABEL:lem:repint.

    Lemma A.8.

    We have that .

    a.4. The interpolating space

    We first define a space that interpolates between and . The discussion of this section is completely analogous to that in Sections LABEL:subsec:interpolating and LABEL:subsec:projGD-QR.

    Definition A.9.

    Let denote the subspace of consisting of those such that, for , we have:

    Just as in Section LABEL:subsec:interpolating, the space consists of representations for which the bottom left