 # Efficient Representation of Low-Dimensional Manifolds using Deep Networks

We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.


## 1 Introduction

Deep neural networks have achieved state-of-the-art results in a variety of tasks. This remarkable success is not fully explained, but one possibility is that their hierarchical, layered structure may allow them to capture the geometric regularities of commonplace data. We support this hypothesis by exploring ways that networks can handle input data that lie on or near a low-dimensional manifold. In many problems, for example face recognition, data lie on or near manifolds that are of much lower dimension than the input space Turk & Pentland (1991); Basri & Jacobs (2003); Lee et al. (2003), and that represent the intrinsic degrees of variation in the data.

Figure 1: We illustrate the embedding of a manifold by a deep network using the famous Swiss Roll example (top). Dots represent color coded input data. In the center, the data is divided into three parts using hidden units represented by the yellow and cyan planes. Each part is then approximated by a monotonic chain of linear segments. Additional hidden units, also depicted as planes, control the orientation of the next segments in the chain. A second layer of the network then flattens each chain into a 2D Euclidean plane, and assembles these into a common 2D representation (bottom).

We study the representational power of deep networks when applied to manifold data. We demonstrate that the initial layers of networks can take inputs that lie in a manifold in a high-dimensional space, approximate this manifold with piecewise linear functions, and economically output the coordinates of these points embedded in a low-dimensional Euclidean space. In fact, each new linear segment approximating the manifold can be represented by a single additional hidden unit, leading to a representation of manifold data that in some cases is nearly optimal in the number of parameters of the system. This means that subsequent layers of a deep network could build upon these early layers, operating in lower dimensional spaces that more naturally represent the input data. It is beyond the scope of this paper to study the problem of training networks to build these representations. However, our results describe novel representations that might be sought in existing networks, or that might suggest new architectures for networks. Moreover, we feel that these results provide intuitions about the role that individual units of a network can play in shaping the function that it is computing.

We first show how this embedding can be done efficiently for manifolds consisting of monotonic chains of linear segments. We then show how these primitives can be combined to form linear approximations for more complex manifolds. This process is illustrated in Figure 1. We further show that when the data lie sufficiently close to their linear approximation, the error in the embedding will be small. Our constructions will use a feed-forward network with rectified linear unit (RELU) activation. We consider fully connected layers, although the treatment of complex manifolds that are divided into pieces (e.g., of monotonic chains) will be modular, resulting in many zero weights.

## 2 Prior Work

Realistic learning problems, e.g., in vision applications and speech processing, involve high dimensional data. Such data is often governed by many fewer variables, producing manifold-like sub-structures in a high dimensional ambient space. A large number of dimensionality reduction techniques, such as principal component analysis (PCA) Pearson (1901), multi-dimensional scaling Young & Hamer (1987), Isomap Tenenbaum et al. (2000), and locally linear embedding (LLE) Roweis & Saul (2000), have been introduced. An underlying manifold assumption, which states that different classes lie in separate manifolds, has also guided the design of clustering and semi-supervised learning algorithms Nadler et al. (2005); Belkin & Niyogi (2003); Weston et al. (2008); Mobahi et al. (2009).

A number of recent papers examine properties of neural nets in light of this manifold assumption. Specifically, Rifai et al. (2011) trained a contractive auto-encoder to represent an atlas of manifold charts. Shaham et al. (2015) demonstrate that a 4-layer network can efficiently represent any function on a manifold through a trapezoidal wavelet decomposition. In both, each chart is represented independently, requiring the representation of an independent projection to map the input space onto each chart. We show that for monotonic chains we can reduce the size of the representation to near optimal by exploiting geometric relations between neighboring projection matrices, so that an additional chart requires only a single hidden unit.

Another family of networks attempts to learn a "semantic" distance metric for training pairs, often by using a siamese network Salakhutdinov & Hinton (2007); Chopra et al. (2005); R. Hadsell & LeCun (2006); Yi et al. (2014); Huang et al. (2015). These assume that the input space can be mapped non-linearly by a network to produce the desired distances in a lower dimensional feature space. Giryes et al. (2016) show that even a feed-forward neural network with random Gaussian weights embeds the input data in an output space while preserving distances between input items. They further suggest that training may improve the embedding quality.

Another outstanding question is to what extent deep networks can represent data or handle classification problems more efficiently than shallow networks with a single hidden layer. Earlier work showed that shallow networks are universal approximators Cybenko (1989). However, recent work demonstrates that deep networks can be exponentially more efficient in representing certain functions Bianchini & Scarselli (2014); Telgarsky (2015); Eldan & Shamir (2015); Delalleau & Bengio (2011); Montufar et al. (2014); Cohen et al. (2015). On the other hand, Ba & Caruana (2014) show empirically that in many practical cases a shallow network can be trained to mimic the behavior of a deep network. Our construction does not produce exponential gains, but does show that the early layers of a network can efficiently reduce the dimensionality of data that feeds into later layers.

## 3 Monotonic Chains of Affine Subspaces

Our aim in this paper is to construct networks that can perform dimensionality reduction for data that lies on or near a manifold. We focus on feed-forward networks with RELU activation, i.e., $[x]_+ = \max(x, 0)$. Clearly the outputs of such networks are continuous, non-negative piecewise linear functions of their input. It is therefore natural to ask whether they can embed piecewise-linear manifolds in a low-dimensional Euclidean space both accurately and efficiently. In this section we construct such efficient networks for a class of manifolds that we call monotonic chains of affine subspaces, which are defined shortly. These will serve as building blocks for handling more general chains, as well as other sets of data, which can be decomposed into monotonic chains. Handling these more complex cases will require deeper networks. In subsequent sections we discuss these more complex manifolds and show in addition that our networks can be used to approximate data that is on or near non-linear manifolds.

We will consider the case of data lying in a chain of linear segments, denoted $S = S_1 \cup S_2 \cup \dots \cup S_K$. Each segment $S_k$ ($1 \le k \le K$) in the chain is a portion of some $m$-dimensional affine subspace of $\mathbb{R}^d$, and the segments are connected to form a chain (Figure 2). We suppose that every two subsequent segments $S_k$ and $S_{k+1}$ intersect, and that the intersection lies in an $(m-1)$-dimensional affine subspace. We further assume that these chains can be flattened so that they may be represented in $\mathbb{R}^m$. Note that any curve on $S$ will be mapped to a curve of the same length in $\mathbb{R}^m$ on the flattened chain.

Figure 2: A continuous chain of linear segments (above) that can be flattened to lie in a single low-dimensional linear subspace (bottom).

We will next consider a special case of these chains which we call monotonic, and show that these can be handled using networks with two hidden layers.

Definition: We say that a chain of affine subspaces is monotonic (see Figure 3) when there exists a set of half-spaces $H_1, \dots, H_{K-1}$ such that $H_k$ is bounded by a hyperplane that contains the intersection of $S_k$ and $S_{k+1}$, and $S_{k+1}, \dots, S_K \subseteq H_k$ while $S_1, \dots, S_k \subseteq H_k^c$, where $H_k^c$ is the complement of $H_k$. Intuitively, each of the half-spaces divides the chain into two connected pieces at the boundary of each linear segment. We can consider each half-space to represent a hidden unit that is active (i.e., non-zero) over a subset of the regions. With a monotonic chain, the set of active units grows monotonically along the chain. Additionally, we can always define some units that are active over all the regions.

Figure 3: A monotonic chain. $S_k$ denotes the $k$'th segment in the chain. $H_k$ is a hyperplane that separates $S_1, \dots, S_k$ from $S_{k+1}, \dots, S_K$.

Below we show that monotonic chains can be embedded efficiently by networks with two layers of weights. These networks have $d$ units in the input layer, a hidden layer with $h$ units that encodes the structure of the manifold (where $h$ is a function of the manifold complexity), and an output layer with $m$ units. Denote the weights in the first layer by a matrix $A \in \mathbb{R}^{h \times d}$ and further use a bias vector $\mathbf{a}_0 \in \mathbb{R}^h$. The second layer of weights is captured by a matrix $B \in \mathbb{R}^{m \times h}$. The total number of weights in these two layers, counting the biases, is $h(d + m + 1)$. This two layer network maps a point $\mathbf{x} \in \mathbb{R}^d$ to the embedding space $\mathbb{R}^m$ through

$$\mathbf{u} = B[A\mathbf{x} + \mathbf{a}_0]_+$$

where $[\cdot]_+$ denotes the RELU operation. For now we do not use a bias or RELU in the second level, but those will be used later when we discuss more complex manifolds.
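The two-layer map above can be sketched in a few lines of NumPy. This is only an illustration: the function names `relu` and `embed` and the small weight matrices are hypothetical and chosen so the arithmetic can be followed by hand.

```python
import numpy as np

def relu(z):
    """Elementwise rectifier [z]_+ = max(z, 0)."""
    return np.maximum(z, 0.0)

def embed(X, A, a0, B):
    """Apply u = B [A x + a0]_+ to each row x of X.

    X: (n, d) inputs, A: (h, d), a0: (h,), B: (m, h). Returns (n, m).
    """
    hidden = relu(X @ A.T + a0)   # (n, h) hidden activations
    return hidden @ B.T           # (n, m) embedded coordinates

# Hypothetical weights with d = 3 inputs, h = 2 hidden units, m = 1 output.
A = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
a0 = np.array([0.0, -1.0])
B = np.array([[1.0, 2.0]])
X = np.array([[0.5, 0.0, 7.0],    # second unit inactive: 0.0 - 1.0 < 0
              [0.5, 3.0, 7.0]])   # second unit active:   3.0 - 1.0 = 2
U = embed(X, A, a0, B)            # rows: 0.5 and 0.5 + 2 * 2 = 4.5
```

The second hidden unit only contributes once its pre-activation is positive, which is the mechanism the construction relies on to change the linear map from one segment to the next.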

A simple example of a manifold that can be represented efficiently with a neural network occurs when the data lies in a single $m$-dimensional affine subspace of $\mathbb{R}^d$. Embedding can be done in this case with just one layer, with the matrix $A$ of size $m \times d$ containing in its rows a basis parallel to the affine space. RELU is not needed, but if required we can set the bias $\mathbf{a}_0$ accordingly to map all the feasible data points to non-negative coordinates.
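As a minimal sketch of this single-subspace case, the snippet below builds a random affine subspace, uses an orthonormal basis as the rows of the weight matrix, and checks that pairwise distances are preserved. The dimensions, the random data, and the helper `pairwise` are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 10, 3, 50

# Orthonormal basis (columns of Q) for a random m-dimensional direction space.
Q, _ = np.linalg.qr(rng.standard_normal((d, m)))   # Q: (d, m)
origin = rng.standard_normal(d)                     # offset of the affine subspace

coords = rng.standard_normal((n, m))                # intrinsic coordinates
X = origin + coords @ Q.T                           # points on the subspace, (n, d)

A = Q.T                                             # rows: basis parallel to the subspace
a0 = -(A @ origin)                                  # bias that sends `origin` to 0
# Optional shift so every coordinate is non-negative and a ReLU would be a no-op.
shift = -np.minimum((X @ A.T + a0).min(axis=0), 0.0)
U = X @ A.T + a0 + shift                            # embedded points, (n, m)

def pairwise(P):
    """Matrix of Euclidean distances between all rows of P."""
    diff = P[:, None, :] - P[None, :, :]
    return np.linalg.norm(diff, axis=-1)

max_distortion = np.abs(pairwise(X) - pairwise(U)).max()
```

Because the rows of `A` are orthonormal, the embedding recovers the intrinsic coordinates up to a translation, so `max_distortion` is at the level of floating-point round-off.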

A simple way to extend this example to handle chains is by encoding each linear segment separately. Such encoding will require units in addition to units that use RELU to separate each segment from the rest of the segments. A related representation was used, e.g., in Shaham et al. (2015). Below we show that monotonic chains can be encoded much more efficiently.

We next show how to construct the network (i.e., set the weights in , , and ) to encode monotonic chains. Below we use the notation to denote the matrix formed by the first rows of , is the vector containing the first entries of , and the matrix including the first columns of . Therefore will express the output of the network when only the first hidden units are used. These will be set to recover the intrinsic coordinates of points in the first segments in ; RELU ensures that subsequent hidden units do not affect the output for points in these segments.

For the construction we consider the pull-back of the standard basis of on the chain, producing a geodesic basis to the manifold that is expressed by a collection of column-orthogonal matrices . Each matrix provides an orthogonal basis for one of the segments.

We will construct the network inductively. Suppose . We set , , and set so that for all all the components of are non-negative. Clearly, is an orthogonal projection matrix and . This shows that the network projects the orthonormal basis for the first segment into , an orthonormal basis in . Next we will show that for all . This implies that , so there is no distortion in the projection. This will show that the network extends this basis throughout the monotonic chain in a consistent way.

Next, suppose we used units to construct , , and for the first segments. (For notational convenience we will next omit the superscript for these matrices and vectors, so , etc.) We will now use those to construct , , and . We do so by adding a node to the first hidden layer. The weights on the incoming edges to this node will be encoded by appending a row vector to and a scalar to , and the weights on the outgoing edges will be encoded by appending a column vector to . Our aim is to assign values to these vectors and scalar to extend the embedding to .

By induction we assume that any is embedded with no distortion to by

$$\tilde{\mathbf{u}} = B[A\tilde{\mathbf{x}} + \mathbf{a}_0]_+,$$

and that . By monotonicity we further assume that is dimensional and there exists a hyperplane with normal that contains this intersection with lying completely on the side of in the direction of , while lies on the opposite side of . We then set and set so that for any point . (This is well defined since is orthogonal to .)

To determine , we first rotate the bases (referred to as below) and by a common, matrix , i.e., and so that and with providing an orthogonal basis parallel to . (This is equivalent to rotating the coordinate system in the embedded space and then pulling-back to the manifold.) Note that by the induction assumption . We next aim to set so that . We note that

$$B^{(k)} A^{(k)} X^{(k)} = B^{(k)} A^{(k)} Y^{(k)} R^T = (BA + \mathbf{b}\mathbf{a}^T)\, Y^{(k)} R^T.$$

We aim to set so that . Consider this equality first for the common columns of and . These columns are parallel to , so that for , implying equality for any choice of . Consider next the left-most column of and , denoted respectively and , we get

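$$(BA + \mathbf{b}\mathbf{a}^T)\mathbf{v} = BA\mathbf{w}.$$

This is satisfied if we set

$$\mathbf{b} = \frac{1}{\mathbf{a}^T\mathbf{v}}\, BA(\mathbf{w} - \mathbf{v}).$$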

We have constructed so that the segments are embedded with consistent orientations. We now show that they are also translated properly by , to create a continuous embedding. Consider a point . Denote by its projection onto , so that for a scalar . Denoting the embedded coordinates of by ,

$$\mathbf{u} = B^{(k)}\left(A^{(k)}\mathbf{x} + \mathbf{a}_0^{(k)}\right).$$

We want to verify that as tends to 0 will coincide with the embedding of due to , i.e.,

$$\bar{\mathbf{u}} = B(A\bar{\mathbf{x}} + \mathbf{a}_0).$$

Due to the construction of , , and

$$\mathbf{u} = (BA + \mathbf{b}\mathbf{a}^T)\mathbf{x} + B\mathbf{a}_0 + a_0\mathbf{b}.$$

Replacing we obtain

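$$\mathbf{u} = (BA + \mathbf{b}\mathbf{a}^T)\bar{\mathbf{x}} + \beta(BA + \mathbf{b}\mathbf{a}^T)\mathbf{v} + B\mathbf{a}_0 + a_0\mathbf{b}.$$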

Since , and we get

$$\mathbf{u} = B(A\bar{\mathbf{x}} + \mathbf{a}_0) + \beta(BA + \mathbf{b}\mathbf{a}^T)\mathbf{v},$$

which coincides with when , implying that the embedding is extended continuously to . Note that by construction for all so RELU ensures that the embedding of these segments will not be affected by the additional unit.
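The inductive construction can be sketched for the simplest case of a 1D chain (a polyline, $m = 1$). The sketch assumes, for convenience, that all separating hyperplanes share one normal direction along which the polyline advances; this is a simplification made here, not a requirement of the construction. The function names and the example vertices are hypothetical. Each new unit uses the weight $\mathbf{b} = \frac{1}{\mathbf{a}^T\mathbf{v}} BA(\mathbf{w} - \mathbf{v})$ from the text, with $\mathbf{w}$ and $\mathbf{v}$ the unit directions of the previous and the new segment.

```python
import numpy as np

def build_chain_network(vertices, normal):
    """Weights (A, a0, B) that flatten a polyline to its arc length.

    vertices: (K+1, d) polyline corners; normal: direction shared by all
    separating hyperplanes (assumption: every segment advances along it).
    """
    P = np.asarray(vertices, dtype=float)
    normal = np.asarray(normal, dtype=float)
    normal = normal / np.linalg.norm(normal)
    dirs = np.diff(P, axis=0)
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)   # unit direction per segment
    if np.any(dirs @ normal <= 0):
        raise ValueError("polyline must advance along `normal` on every segment")

    A = dirs[:1].copy()                     # first unit: basis of the first segment
    a0 = np.array([-np.min(P @ dirs[0])])   # keeps the first unit active on the whole chain
    B = np.ones((1, 1))
    for k in range(1, len(dirs)):
        w, v = dirs[k - 1], dirs[k]         # previous and new segment directions
        b = (B @ A) @ (w - v) / (normal @ v)        # b = BA(w - v) / (a^T v)
        A = np.vstack([A, normal])                  # new row a^T
        a0 = np.append(a0, -(normal @ P[k]))        # hyperplane passes through the joint
        B = np.hstack([B, b.reshape(1, 1)])         # new column b
    return A, a0, B

def embed(X, A, a0, B):
    """u = B [A x + a0]_+ applied to each row of X."""
    return np.maximum(X @ A.T + a0, 0.0) @ B.T

vertices = [(0, 0), (1, 1), (2, 0), (3, 2)]
A, a0, B = build_chain_network(vertices, normal=(1, 0))

# Sample each segment and record the true arc length of every sample.
P = np.asarray(vertices, dtype=float)
seg_len = np.linalg.norm(np.diff(P, axis=0), axis=1)
starts = np.concatenate([[0.0], np.cumsum(seg_len)[:-1]])
t = np.linspace(0.0, 1.0, 11)
X = np.concatenate([P[k] + t[:, None] * (P[k + 1] - P[k]) for k in range(len(seg_len))])
arc = np.concatenate([starts[k] + t * seg_len[k] for k in range(len(seg_len))])
U = embed(X, A, a0, B)[:, 0]
```

For points on the polyline, `U` matches `arc` up to a constant offset, and the network uses one hidden unit for the first segment plus one per additional segment.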

Finally, we note that the proposed representation of monotonic chains with a neural network is very efficient and uses only a few parameters beyond the degrees of freedom needed to define such chains. In particular, the definition of a chain requires specifying basis vectors in for one linear segment (exploiting orthonormality these require parameters), with each additional segment specified by a 1D direction for the new segment (a unit vector in specified by parameters) and a direction in the previous segment to be replaced (specified by a unit vector in , i.e. parameters). The total number of degrees of freedom of a chain is therefore . This is the minimum possible number of parameters required to specify a monotonic chain. Our construction requires parameters. Specifically, note that for any choice of parameters

$$N \ge (K + m - 1)(d - m - 2).$$

We therefore obtain that

$$\frac{N'}{N} \le \left(1 + \frac{2}{K + m - 1}\right)\left(1 + \frac{2m + 3}{d - m - 2}\right).$$

Assuming we get

$$\frac{N'}{N} \lessapprox 1 + \frac{2m}{d - m}.$$

Since we normally expect that the dimension of the input space will be much greater than the dimension of the manifold, this ratio will be close to 1, which would be optimal.
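To get a feel for how close to 1 this ratio is, the two expressions above can be evaluated directly. The function names and the sample values of $K$, $m$, and $d$ below are hypothetical.

```python
def ratio_bound(K, m, d):
    """Upper bound on N'/N: (1 + 2/(K+m-1)) * (1 + (2m+3)/(d-m-2))."""
    return (1 + 2 / (K + m - 1)) * (1 + (2 * m + 3) / (d - m - 2))

def ratio_approx(m, d):
    """Approximate ratio 1 + 2m/(d-m)."""
    return 1 + 2 * m / (d - m)

# Hypothetical setting: a 2D manifold with 50 segments in a 10,000-dimensional space.
bound = ratio_bound(K=50, m=2, d=10_000)
approx = ratio_approx(m=2, d=10_000)
```

With these values both quantities stay within a few percent of 1, and the remaining gap in the first is dominated by the factor that depends on $K$.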

## 4 Error Analysis

We now consider points that do not lie exactly on the monotonic chain. We expect this to happen due to noise, or because we are approximating a non-linear manifold with piece-wise linear segments. Let be a point that is on the segment , but that is then perturbed by some small noise vector, , that is perpendicular to , to produce the point . Ideally, the network would represent using the coordinates of . In effect, the network would project all points onto the monotonic chain. We now analyze the error that can occur in this projection. Our analysis assumes that is small enough that and lie in the same region; that is, that they are both on the same side of all hyperplanes defined by the hidden units.

We first show in Section 4.1 that for an arbitrary monotonic chain, this error can be unbounded. While this sounds bad, we then show in Section 4.2 that this can only happen when the hyperplanes that separate the monotonic chain into segments must be poorly chosen, in some sense. We show that in many reasonable cases the error is bounded by times a small constant.

### 4.1 Worst-case error

To show that the error can be unbounded, we consider a simple case in which the piecewise linear manifold consists of three connected 1D line segments, and , with 2D vertices respectively of and , and , and and . is very large, and is very small (see Figure 4). Since three segments compose a 1D manifold, three hidden units defining three hyperplanes, and (lines) will be needed to represent the manifold. In addition, a single output unit will sum the results of these units to produce the geodesic distance from the origin to any point on the three segments.

Figure 4: In black, we show a 1D monotonic chain with three segments. In red, we show three hidden units that flatten this chain into a line. Note that each hidden unit corresponds to a hyperplane (in this case, a line) that separates the segments into two connected components. The third hyperplane must be almost parallel to the third segment. This leads to large errors for noisy points near $S_3$.

Using our construction in Section 3 we get the embedding with

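$$B = \left(1,\ \frac{1}{q_2},\ -\frac{1}{r_1}\left(2 + \frac{q_1}{q_2}\right)\right), \qquad A = \begin{pmatrix} 1 & 0 \\ q_1 & q_2 \\ r_1 & r_2 \end{pmatrix}, \qquad \mathbf{a}_0 = \begin{pmatrix} 0 \\ q_3 \\ r_3 \end{pmatrix}.$$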

Note that the first row of uses the standard orthogonal projection ; the two other rows of and separate the three segments with (1) and and set so that the separator goes through , and (2) , and , and set so that the separator goes through . It can be easily verified that in this setup points on the first segment , are mapped to , points , on the second segment are mapped to , and points , on the third segment are mapped to .

Ideally, we would want to be embedded to the same point as . Let . Clearly . It can be readily verified that, under these conditions, when then ; when then , and when then . Therefore, there is no error in embedding for . The error in embedding with is small and bounded (since , assuming is small and is large), while the error in embedding when can be huge since . In the next section we show that this can only happen when there is a large angle between a segment and the normal to the previous separating hyperplane.
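The amplification effect can be reproduced numerically on a simpler, hypothetical geometry (not the three-segment example above): a two-segment polyline in 2D whose second segment is either gently or steeply inclined relative to the separating line. The sketch assumes a shared separator normal and computes, for each segment, the effective linear map $BA$ of the constructed network and how much it scales a unit perturbation perpendicular to that segment. All names and vertex values are illustrative assumptions.

```python
import numpy as np

def effective_maps(vertices, normal):
    """Per-segment row vector g with u = g . x + const (polyline, m = 1)."""
    P = np.asarray(vertices, dtype=float)
    n = np.asarray(normal, dtype=float)
    n = n / np.linalg.norm(n)
    dirs = np.diff(P, axis=0)
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    g = dirs[0].copy()                      # BA for the first segment
    maps = [g.copy()]
    for k in range(1, len(dirs)):
        w, v = dirs[k - 1], dirs[k]
        b = g @ (w - v) / (n @ v)           # b = BA(w - v) / (a^T v)
        g = g + b * n                       # BA + b a^T
        maps.append(g.copy())
    return dirs, maps

def amplification(vertices, normal):
    """|BA . delta| for a unit delta perpendicular to each 2D segment."""
    dirs, maps = effective_maps(vertices, normal)
    return [abs(g @ np.array([-v[1], v[0]])) for v, g in zip(dirs, maps)]

gentle = amplification([(0, 0), (1, 0), (2, 0.2)], normal=(1, 0))
steep = amplification([(0, 0), (1, 0), (1.01, 10)], normal=(1, 0))
```

On the first segment the factor is zero in both cases, since the network applies an orthogonal projection there. On the second segment the factor is small for the gentle bend and grows by several orders of magnitude when the segment is nearly parallel to the separating line, because $\mathbf{a}^T\mathbf{v}$ in the denominator of $\mathbf{b}$ becomes tiny.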

### 4.2 Bounds on Error

To show that this noise can often be quite limited, we will consider a class of monotonic chains in which the total curvature between all segments is less than or equal to some angle . We denote the angle between and as . (This angle is well defined since and intersect in an -dimensional affine space.) As before, we will drop the subscript when it is , and just write . Specifically, we define so that (where and are defined as in Sec. 3, as vectors perpendicular to , and parallel to and , respectively), defining similarly for any . We then express our constraint on the curvature as .

Now let be a constant such that we can bound for any . To understand this, recall that is a unit vector normal to the hyperplane separating and . By saying this bound holds for all , we mean that we are able to choose the hyperplanes that divide the chain into segments so that the angle between the normal to each hyperplane and the following segment is not too big. We next bound the error in terms of and .

Let be as in the last section. We define the embedding error of by

$$E(\mathbf{p}) = \left(B^{(k)} A^{(k)} - X^{(k)T}\right)\mathbf{p},$$

where denotes the orthogonal projection to , as is used in Sec. 3. Noting that, by the construction of our network, (since is on ) and that (due to the orthonormality of ), we obtain

$$E(\mathbf{p}) = B^{(k)} A^{(k)} \boldsymbol{\delta}.$$

The magnitude of the error therefore is scaled at most by the maximal singular value of $B^{(k)} A^{(k)}$, denoted $\sigma_k$.

To bound we note that for (where, as before, we drop superscripts so that denotes . Therefore,

$$\sigma_k \le \sigma_{k-1} + |\mathbf{a}^T \mathbf{b}|,$$

where denotes the largest singular value of . Recall that and

$$\mathbf{b} = \frac{1}{\mathbf{a}^T\mathbf{v}}\, BA(\mathbf{w} - \mathbf{v}).$$

Note that . Therefore,

$$|\mathbf{a}^T \mathbf{b}| \le c\,\sigma_{k-1}\theta_{k-1},$$

from which we conclude that

$$\sigma_k \le \sigma_{k-1}(1 + c\,\theta_{k-1}).$$

Finally, note that , implying that . We therefore obtain

$$\sigma_k \le \prod_{j=1}^{k-1} (1 + c\,\theta_j).$$

Note that and so . Therefore,

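$$\sigma_k \le \left(1 + \frac{cT}{k-1}\right)^{k-1} \le e^{cT}.$$

We conclude that

$$\|E(\mathbf{p}_0 + \boldsymbol{\delta})\| \le e^{cT}\|\boldsymbol{\delta}\|.$$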
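The last two steps of this chain of inequalities (a product of factors with a fixed sum is largest when the factors are equal, and $(1 + x/n)^n \le e^x$) can be checked numerically. The values of $c$, $T$, and the number of angles below are hypothetical, and the angles are drawn at random and rescaled so that they sum to $T$.

```python
import numpy as np

rng = np.random.default_rng(1)
c, T, n = 1.5, np.pi / 2, 8          # hypothetical constant, total curvature, k - 1

theta = rng.random(n)
theta *= T / theta.sum()             # angles theta_j with sum equal to T

product = np.prod(1 + c * theta)     # prod_j (1 + c * theta_j)
uniform = (1 + c * T / n) ** n       # value when all angles are equal
limit = np.exp(c * T)                # e^{cT}
```

The three quantities satisfy `product <= uniform <= limit`, which is the ordering used to pass from the product bound to the exponential bound.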

Many monotonic chains can be divided into segments using hyperplanes in which $c$ is not too big, and may be as low as 1. For such manifolds, when a point is perturbed away from the manifold, its coordinates will not be changed by more than the magnitude of the perturbation times a small constant factor. For example, if and then . Note that rather than beginning at the start of the monotonic chain, we could "begin" in the middle, and work our way out. That is, provide an orthonormal basis for the middle segment and add hidden units to represent the chain from the central segment toward either end of the chain. This can reduce the total curvature from the starting point to either end by up to half. We further emphasize that this bound is not tight. For example, the bound for a single affine segment is 1, but since in this case the network encodes an orthogonal projection matrix the actual error is zero.

## 5 Combinations of Monotonic Chains

To handle non-monotonic chains and more general piecewise linear manifolds that can be flattened we show that we can use a network to divide the manifold into monotonic chains, embed each of these separately, and then stitch these embeddings together. Suppose we wish to flatten a non-monotonic chain that can be divided into monotonic chains, . Let , and denote the matrices and bias used to represent the hidden units that flatten , which has segments. We suppose that a set of hyperplanes (that is, a convex polytope) can be found that separate from the other chains. Let denote a matrix in which the rows represent the normals to these hyperplanes, oriented to point away from . We can concatenate these vertically, letting We next let where denotes an matrix containing all ones and is a very large constant. Note that has rows. So we can define , where the matrices are concatenated horizontally.

We now note that if:

$$\mathbf{u} = B'_l\left[A'_l\mathbf{x} + \mathbf{a}_{0l}\right]_+$$

when lies on , will contain the coordinates of embedded in , as before. When lies on a different monotonic chain, will be a vector with very small negative numbers. Applying RELU will therefore eliminate these numbers.

and therefore represent a module consisting of a two layer network that embeds one monotonic chain in while producing zero for other chains. We can then stitch these values together. First, we must rotate and translate each embedded chain so that each chain picks up where the previous one left off. Let denote the rotation of each chain, and let denote its appropriate translation. Then, for each chain, the appropriate coordinates are produced by

$$\left[R_l B'_l\left[A'_l\mathbf{x} + \mathbf{a}_{0l}\right]_+ + \mathbf{b}_{0l}\right]_+.$$

We can now concatenate these for all chains to produce the final network. We let , and be the vertical concatenation of all and and respectively, and let be the block-diagonal concatenation of all . The application of to will produce a vector with entries in which the entries give the embedded coordinates of and the rest of the entries are zero. We can now construct a third layer of the network to then stitch these monotonic chains together. Let denote a matrix of size obtained by concatenating horizontally identity matrices of size . We then describe the output of the network with the equation:

$$\mathbf{u} = C\left[B\left[A\mathbf{x} + \mathbf{a}_0\right]_+ + \mathbf{b}_0\right]_+.$$

Note, for example, that the first element of $\mathbf{u}$ is the sum of the first coordinates produced by each module in the first two layers. Each of these modules produces the appropriate coordinates for points in one monotonic chain, while producing 0 for points in all other monotonic chains.
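A minimal sketch of the stitching idea, under simplifying assumptions: two single-segment chains in 2D that meet at a joint, one gating hyperplane per module, rotation $R_l = 1$, and a hypothetical value for the large constant. Each module follows the form $[R_l B'_l[A'_l\mathbf{x} + \mathbf{a}_{0l}]_+ + \mathbf{b}_{0l}]_+$, and a third layer sums the modules. The geometry and all names are illustrative and not taken from the paper.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

BIG = 1e6                                   # stands in for the very large constant
joint = np.array([2.0, 0.0])
v1 = np.array([1.0, 0.0])                   # chain 1 runs from (0, 0) to (2, 0)
v2 = np.array([-2.0, 1.0]) / np.sqrt(5.0)   # chain 2 runs from (2, 0) to (0, 1)
n = np.array([1.0, 3.0]) / np.sqrt(10.0)    # separator: n.(x - joint) <= 0 on chain 1

def module(X, direction, start, gate_normal, gate_point, offset):
    """One chain module: chain unit plus a gating unit with weight -BIG."""
    A = np.vstack([direction, gate_normal])
    a0 = np.array([-(direction @ start), -(gate_normal @ gate_point)])
    B = np.array([[1.0, -BIG]])             # gate fires only outside the chain's region
    return relu(relu(X @ A.T + a0) @ B.T + offset)

def network(X):
    m1 = module(X, v1, np.zeros(2), n, joint, 0.0)   # chain 1 starts at coordinate 0
    m2 = module(X, v2, joint, -n, joint, 2.0)        # chain 2 continues at arc length 2
    C = np.hstack([np.eye(1), np.eye(1)])            # third layer sums the modules
    return np.hstack([m1, m2]) @ C.T

t = np.linspace(0.0, 1.9, 20)
on_chain1 = np.column_stack([t, np.zeros_like(t)])
s = np.linspace(0.1, 1.0, 10)
on_chain2 = joint + s[:, None] * np.array([-2.0, 1.0])
u1 = network(on_chain1)[:, 0]               # arc length along chain 1
u2 = network(on_chain2)[:, 0]               # 2 + arc length along chain 2
```

Away from the joint, each point activates exactly one module and the output is its arc length along the stitched chain. At the joint itself both modules respond and the sum returns 4 instead of 2, which is the overlap issue that max pooling addresses. Points extremely close to the joint are only suppressed if the constant is large relative to their distance from the separator.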

We note that this summation may result in wrong values if there is overlap between the regions (which will generally be of zero measure). This can be rectified by replacing the summation due to $C$ by max pooling, which allows overlap of any size. (Footnote 1: Note that max pooling can easily be implemented with RELU by adding a layer; namely, , where .) Together, all three layers will require units. If the network is fully connected, this requires weights.
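One standard way to write a pairwise maximum with rectifiers is $\max(a, b) = b + [a - b]_+$. The sketch below uses this identity for non-negative inputs, where $b = [b]_+$, so every term is the output of a RELU unit. This is an assumed formulation for illustration and may differ from the expression in the footnote.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def max_via_relu(a, b):
    """max(a, b) = [a - b]_+ + [b]_+ ; valid for b >= 0 (e.g. RELU outputs)."""
    return relu(a - b) + relu(b)

rng = np.random.default_rng(0)
a = rng.random(1000)                 # non-negative samples, like module outputs
b = rng.random(1000)
pooled = max_via_relu(a, b)
```

Since every module output already passes through a RELU, the non-negativity assumption holds in the stitched network.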

Note that the size of this network depends on how many regions are required () and how many hyperplanes each region needs to separate it from the rest of the manifold (). In the worst case, this can be quite large. Consider, for example, a 1D manifold that is a polyline that passes through every point with integer coordinates in . To separate any portion of this polyline from the rest will require regions that are not unbounded, and so for all . However, such manifolds are somewhat pathological. We expect that many manifolds can be divided appropriately using many fewer hyperplanes. We will show this for the example of a Swiss roll and a real world manifold of faces.

## 6 Deeper networks and Hierarchical Representations of Manifolds

We also note that the previously developed constructions can be applied recursively, producing a deeper network that progressively approximates data using linear subspaces of decreasing dimension. That is, we may first divide the data into a set of segments that each lie in a low dimensional subspace whose dimension is higher than the intrinsic dimension of the data. Then we may subdivide each segment into a set of subsegments of lower dimension, using a similar construction, and deeper layers of the network. These subsegments may represent the original data, or they may be further subdivided by additional layers, until we ultimately produce subsegments that represent the data.

We first illustrate this hierarchical approach with a simple example that requires only one extra layer in the hierarchy. Consider a monotonic chain of , -dimensional linear segments that collectively lie in a -dimensional linear subspace, , of a -dimensional space, with . We can construct the first hidden layer with units that are active over the entire monotonic chain, so that their gradient directions form an orthonormal basis for . The output of this layer will contain the coordinates in of points on the monotonic chain. These can form the input to two layers that then flatten the chain, as described in Section 3.

In Section 3 we had already shown how to flatten the manifold with two layers that take their input directly from the input space. Here we accomplish the same end with an extra layer. However, this construction, while using more layers, may also use fewer parameters. The construction in Section 3 required parameters. Our new construction will require parameters. Note that as increases, the number of parameters used in the first construction increases in proportion to , while in the second construction the parameters increase only in proportion to . Consequently, the second construction can be much more economical when is large and is small.

In much the same way, we could represent a manifold using a hierarchy of chains. The first layers can map a -dimensional chain to a linear -dimensional output space. The next layers can select an -dimensional chain that lies in this -dimensional space, and map it to an -dimensional space. This process can repeat indefinitely, but whether it is economical will depend on the structure of the manifold.

## 7 Experiments

In this section we provide examples of deep networks that illustrate the potential performance of the type of networks that we have described in this paper. We use two examples. First, we synthetically generate points on a "Swiss Roll". We know analytically that this 2D manifold can be flattened to lie in a 2D Euclidean space. Second, we make use of images rendered from a 3D face model under changing viewpoint. Though high dimensional, these images have only two true degrees of freedom as we alter the elevation and azimuth of the camera. Consequently, these images can be expected to lie near a 2D manifold.

As the focus of this paper is on the representational capacity of networks, we do not attempt to learn these networks, but rather construct them "by hand." We make use of prior knowledge of the intrinsic coordinates of each image to divide each set of images into segments. We then use PCA to fit linear subspaces to each segment, and use the constructions in this paper to build a corresponding neural network that will map these input points to a 2D Euclidean space.

Figure 5: These plots show the error in flattening the Swiss Roll. Relative error is constant in every segment, starting from zero for each monotonic chain and increasing with each segment. The absolute error (for display purposes it is normalized by the maximal distance from the Swiss Roll to its linear approximation) behaves similarly, but vanishes at the end points of each segment where the Swiss Roll and its linear approximation coincide.
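The PCA step can be sketched as follows. The Swiss Roll parameterization, the sample count, and the equal-width split of the roll into 14 pieces along its angular coordinate are assumptions made for illustration; the helper names are hypothetical.

```python
import numpy as np

def fit_affine_subspace(points, m):
    """PCA fit: return (mean, basis), basis being (d, m) with orthonormal columns."""
    mean = points.mean(axis=0)
    _, _, Vt = np.linalg.svd(points - mean, full_matrices=False)
    return mean, Vt[:m].T

def residual(points, mean, basis):
    """Distance of each point from the fitted affine subspace."""
    centered = points - mean
    return np.linalg.norm(centered - centered @ basis @ basis.T, axis=1)

# Hypothetical Swiss Roll sample: angle t and height are the intrinsic coordinates.
rng = np.random.default_rng(0)
t = rng.uniform(1.5 * np.pi, 4.5 * np.pi, 4000)
height = rng.uniform(0.0, 10.0, 4000)
roll = np.column_stack([t * np.cos(t), height, t * np.sin(t)])

# Use the known intrinsic coordinate t to split the roll into segments.
n_segments = 14
edges = np.linspace(t.min(), t.max(), n_segments + 1)
labels = np.clip(np.searchsorted(edges, t, side="right") - 1, 0, n_segments - 1)

fits = [fit_affine_subspace(roll[labels == k], 2) for k in range(n_segments)]
worst = max(residual(roll[labels == k], *fits[k]).max() for k in range(n_segments))
```

Each fitted `basis` supplies the orthonormal directions of one linear segment, and `worst` measures how far the sampled roll strays from its piecewise linear approximation.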

For the Swiss Roll, as shown in Figure 1, we use hidden units and their corresponding hyperplanes to divide the Roll into three monotonic chains. We divide each chain into segments, obtaining a total of 14 segments. Figure 1 shows the points that are input into the network, and the 2D representation that the network outputs. The points are color coded to allow the reader to identify corresponding points. In Figure 5 we further plot the absolute and relative error in embedding every point of the Swiss Roll due to the linear approximation used by the network. One can see that the Swiss Roll is unrolled almost perfectly. In fact, despite the relatively large angular extent of each monotonic chain (the three chains range from 126 to 166.5 degrees each in total curvature), the relative error does not exceed 2.5. (In fact, our bound for this case is very loose, amounting to 18.3 for .) The mean relative error is 0.98.

Figure 6: The output of a network that approximates images of a face using a monotonic chain. Each dot represents an image. They are coded by size to indicate elevation, and color to indicate azimuth. At four dots, we display the corresponding face images.

Next we construct a network to flatten a set of images of faces. We render faces with azimuth ranging from 0 to 35 degrees, and with elevation ranging from 0 to 6 degrees. We use the known viewing parameters to divide these into seven segments, and then construct a network. As described at the end of Section 4.2, we begin with an orthonormal basis for the middle segment of the chain and attach additional segments to both ends of this segment. The results are shown in Figure 6. The output does not form a perfect grid, in part because elevation and azimuth need not provide an orthonormal basis for this 2D manifold. However, we can see that the structure of these variables that describe the input is well-preserved in the output.

## 8 Discussion

The direct technical contribution of this work is to show that deep networks can represent data that lies on a low-dimensional manifold with great efficiency. In particular, when using a monotonic chain to approximate some component of the data, the addition of only a single neural unit can produce a new linear segment to approximate a region of the data. This suggests that deep networks may be very effective devices for such dimensionality reduction. It also may suggest new architectures for deep networks that encourage this type of dimensionality reduction.

We also feel that our work makes a larger point about the nature of deep networks. It has been shown by Montufar et al. (2014) that a deep network can divide the input space into a large number of regions in which the network computes piecewise linear functions. Indeed, the number of regions can be exponential in the number of parameters of the network. While this suggests a source of great power, it also suggests that there are very strong constraints on the set of regions that can be constructed, and the set of functions that can be computed. Not every pair of neighboring regions can compute arbitrarily different functions. Our work shows one way that a single unit can change the linear function that a network computes in two neighboring regions. We demonstrate that one unit can shape this function to follow a manifold that contains the data. We feel that this suggests interesting new directions for the study of deep networks.
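The constraint on neighboring regions can be made concrete in the simplest setting. As an illustrative assumption (the notation here is not taken from the preceding sections), consider a scalar-valued network with a single hidden layer of rectified linear units, weights $w_i$, biases $b_i$, and output coefficients $c_i$:

$$
f(x) = \sum_i c_i \max\left(0,\; w_i^\top x + b_i\right).
$$

Let $R^-$ and $R^+$ be two neighboring regions that differ only in the state of unit $k$, which is inactive in $R^-$ and active in $R^+$. Writing $f^-$ and $f^+$ for the affine functions computed in each region,

$$
f^+(x) - f^-(x) = c_k \left(w_k^\top x + b_k\right),
\qquad
\nabla f^+ - \nabla f^- = c_k\, w_k .
$$

The two functions therefore agree on the separating hyperplane $w_k^\top x + b_k = 0$, and their gradients can differ only along the normal $w_k$. Under these assumptions, the single free parameter $c_k$ is all that a unit contributes to the change between the two regions.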

## 9 Acknowledgements

This research is based upon work supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via IARPA R&D Contract No. 2014-14071600012. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.

This research is also based upon work supported by the Israel Binational Science Foundation Grant No. 2010331 and Israel Science Foundation Grant No. 1265/14.

The authors thank Angjoo Kanazawa and Shahar Kovalsky for their helpful comments.

## References

• Ba & Caruana (2014) Ba, Jimmy and Caruana, Rich. Do deep nets really need to be deep? In Advances in Neural Information Processing Systems, pp. 2654–2662, 2014.
• Basri & Jacobs (2003) Basri, Ronen and Jacobs, David W. Lambertian reflectance and linear subspaces. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 25(2):218–233, 2003.
• Belkin & Niyogi (2003) Belkin, M. and Niyogi, P. Laplacian eigenmaps for dimensionality reduction and data representation. Neural computation, 15(6):1373–1396, 2003.
• Bianchini & Scarselli (2014) Bianchini, M. and Scarselli, F. On the complexity of neural network classifiers: A comparison between shallow and deep architectures. IEEE Transactions on Neural Networks and Learning Systems, 25(8), 2014.
• Chopra et al. (2005) Chopra, S., Hadsell, R., and LeCun, Y. Learning a similarity metric discriminatively, with application to face verification. In CVPR, 2005.
• Cohen et al. (2015) Cohen, N., Sharir, O., and Shashua, A. On the expressive power of deep learning: A tensor analysis, 2015.
• Cybenko (1989) Cybenko, George. Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems, 2(4):303–314, 1989.
• Delalleau & Bengio (2011) Delalleau, O. and Bengio, Y. Shallow vs. deep sum-product networks. In NIPS, pp. 666–674, 2011.
• Eldan & Shamir (2015) Eldan, R. and Shamir, O. The power of depth for feedforward neural networks. ArXiv preprint, arXiv:1512.03965, 2015.
• Giryes et al. (2016) Giryes, R., Sapiro, G., and Bronstein, A. M. Deep neural networks with random Gaussian weights: A universal classification strategy? Forthcoming, 2016.
• Huang et al. (2015) Huang, R., Lang, F., and Shu, C. Nonlinear metric learning with deep convolutional neural network for face verification. In Yang, J. et al. (ed.), Biometric Recognition, volume 9428 of Lecture Notes in Computer Science, pp. 78–87. Springer, 2015.
• Lee et al. (2003) Lee, Kuang-Chih, Ho, Jeffrey, Yang, Ming-Hsuan, and Kriegman, David. Video-based face recognition using probabilistic appearance manifolds. In Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society Conference on, volume 1, pp. I–313. IEEE, 2003.
• Mobahi et al. (2009) Mobahi, H., Weston, J., and Collobert, R. Deep learning from temporal coherence in video. In ICML, 2009.
• Montufar et al. (2014) Montufar, G. F., Pascanu, R., Cho, K., and Bengio, Y. On the number of linear regions of deep neural networks. In NIPS, pp. 2924–2932, 2014.
• Nadler et al. (2005) Nadler, B., Lafon, S., Coifman, R. R., and Kevrekidis, I. G. Diffusion maps, spectral clustering and eigenfunctions of Fokker-Planck operators. In NIPS, volume 18, 2005.
• Pearson (1901) Pearson, K. On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2(11):559–572, 1901.
• R. Hadsell & LeCun (2006) Hadsell, R., Chopra, S., and LeCun, Y. Dimensionality reduction by learning an invariant mapping. In CVPR, 2006.
• Rifai et al. (2011) Rifai, S., Dauphin, Y. N., Vincent, P., Bengio, Y., and Muller, X. The manifold tangent classifier. In NIPS, pp. 2294–2302, 2011.
• Roweis & Saul (2000) Roweis, S. and Saul, L. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.
• Salakhutdinov & Hinton (2007) Salakhutdinov, R. and Hinton, G. Learning a nonlinear embedding by preserving class neighbourhood structure. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2007.
• Shaham et al. (2015) Shaham, U., Cloninger, A., and Coifman, R. R. Provable approximation properties for deep neural networks. ArXiv preprint, arXiv:1509.07385, 2015.
• Telgarsky (2015) Telgarsky, M. Representation benefits of deep feedforward networks. ArXiv preprint, arXiv:1509.08101, 2015.
• Tenenbaum et al. (2000) Tenenbaum, J. B., de Silva, V., and Langford, J. C. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319–2323, 2000.
• Turk & Pentland (1991) Turk, Matthew and Pentland, Alex. Eigenfaces for recognition. Journal of cognitive neuroscience, 3(1):71–86, 1991.
• Weston et al. (2008) Weston, Jason, Ratle, Frederic, and Collobert, Ronan. Deep learning via semi-supervised embedding. In ICML, 2008.
• Yi et al. (2014) Yi, D., Lei, Z., Liao, S., and Li, S. Z. Deep metric learning for person re-identification. In ICPR, 2014.
• Young & Hamer (1987) Young, F. W. and Hamer, R. M. Multidimensional Scaling: History, Theory and Applications. Erlbaum, New York, 1987.