The results achieved by deep neural networks for prediction tasks have been impressive in domains where data is structured and available in large amounts. In particular, convolutional neural networks (CNNs, LeCun et al., 1989) have been shown to effectively leverage the local stationarity of natural images at multiple scales thanks to convolutional operations, while also providing some translation invariance through pooling operations. Yet, the exact nature of this invariance and the characteristics of functional spaces where convolutional neural networks live are poorly understood; overall, these models are sometimes seen as clever engineering black boxes that have been designed with a lot of insight collected since they were introduced.
Understanding the inductive bias of these models is nevertheless a fundamental question. For instance, a better grasp of the geometry induced by convolutional representations may bring new intuition about their success, and lead to improved measures of model complexity. In turn, the issue of regularization may be solved by providing ways to control the variations of prediction functions in a principled manner. One meaningful way to study such variations is to consider the stability of model predictions to naturally occurring changes of input signals, such as translations and deformations.
Small deformations of natural signals often preserve their main characteristics, such as class labels (e.g., the same digit with different handwritings may correspond to the same images up to small deformations), and provide a much richer class of transformations than translations. The scattering transform (Mallat, 2012; Bruna and Mallat, 2013) is a recent attempt to characterize convolutional multilayer architectures based on wavelets. The theory provides an elegant characterization of invariance and stability properties of signals represented via the scattering operator, through a notion of Lipschitz stability to the action of diffeomorphisms. Nevertheless, these networks do not involve “learning” in the classical sense since the filters of the networks are pre-defined, and the resulting architecture differs significantly from the most used ones, which adapt filters to training data.
In this work, we study these theoretical properties for more standard convolutional architectures, from the point of view of positive definite kernels (Schölkopf and Smola, 2001). Specifically, we consider a functional space derived from a kernel for multi-dimensional signals that admits a multi-layer and convolutional structure based on the construction of convolutional kernel networks (CKNs) introduced by Mairal (2016); Mairal et al. (2014). The kernel representation follows standard convolutional architectures, with patch extraction, non-linear (kernel) mappings, and pooling operations. We show that our functional space contains a large class of CNNs with smooth homogeneous activation functions.
The main motivation for introducing a kernel framework is to study separately data representation and predictive models. On the one hand, we study the translation-invariance properties of the kernel representation and its stability to the action of diffeomorphisms, obtaining similar guarantees as the scattering transform (Mallat, 2012), while preserving signal information. When the kernel is appropriately designed, we also show how to obtain signal representations that are invariant to the action of any locally compact group of transformations, by modifying the construction of the kernel representation to become equivariant to the group action. On the other hand, we show that these stability results can be translated to predictive models by controlling their norm in the functional space, or simply the norm of the last layer in the case of CKNs (Mairal, 2016). With our kernel framework, the RKHS norm also acts as a measure of model complexity, thus controlling both stability and generalization, so that stability may lead to improved sample complexity. Finally, our work suggests that explicitly regularizing CNNs with the RKHS norm (or approximations thereof) can help obtain more stable models, a more practical question which we study in follow-up work (Bietti et al., 2018).
A short version of this paper was published at the Neural Information Processing Systems 2017 conference (Bietti and Mairal, 2017).
1.1 Summary of Main Results
Our work characterizes properties of deep convolutional models along two main directions.
The first goal is to study representation properties of such models, independently of training data. Given a deep convolutional architecture, we study signal preservation as well as invariance and stability properties.
The second goal focuses on learning aspects, by studying the complexity of learned models based on our representation. In particular, our construction relies on kernel methods, allowing us to define a corresponding functional space (the RKHS). We show that this functional space contains a class of CNNs with smooth homogeneous activations, and study the complexity of such models by considering their RKHS norm. This directly leads to statements on the generalization of such models, as well as on the invariance and stability properties of their predictions.
Finally, we show how some of our arguments extend to more traditional CNNs with generic and possibly non-smooth activations (such as ReLU or tanh).
Signal preservation, invariance and stability.
We tackle this first goal by defining a deep convolutional representation based on hierarchical kernels. We show that the representation preserves signal information and guarantees near-invariance to translations and stability to deformations in the following sense, defined by Mallat (2012): for signals $x : \mathbb{R}^d \to \mathbb{R}^{p_0}$ defined on the continuous domain $\mathbb{R}^d$, we say that a representation $\Phi(x)$ is stable to the action of diffeomorphisms if
$$\|\Phi(L_\tau x) - \Phi(x)\| \leq \left(C_1 \|\nabla \tau\|_\infty + C_2 \|\tau\|_\infty\right) \|x\|,$$
where $\tau : \mathbb{R}^d \to \mathbb{R}^d$ is a $C^1$-diffeomorphism, $L_\tau x(u) = x(u - \tau(u))$ its action operator, and the norms $\|\tau\|_\infty$ and $\|\nabla \tau\|_\infty$ characterize how large the translation and deformation components are, respectively (see Section 3 for formal definitions). The Jacobian $\nabla \tau$ quantifies the size of local deformations, so that the first term controls the stability of the representation. In the case of translations, the first term vanishes ($\nabla \tau = 0$), hence a small value of $C_2$ is desirable for translation invariance. We show that such signal preservation and stability properties are valid for the multilayer kernel representation defined in Section 2 by repeated application of patch extraction, kernel mapping, and pooling operators:
The representation can be discretized with no loss of information, by subsampling at each layer with a factor smaller than the patch size;
The translation invariance is controlled by a factor $C_2$ proportional to $1/\sigma_n$, where $\sigma_n$ represents the “resolution” of the last layer, and typically increases exponentially with depth;
The deformation stability is controlled by a factor $C_1$ which increases as $\kappa^{d+1}$, where $\kappa$ corresponds to the patch size at a given layer, that is, the size of the “receptive field” of a patch relative to the resolution of the previous layer.
These results suggest that a good way to obtain a stable representation that preserves signal information is to use the smallest possible patches at each layer (e.g., 3x3 for images) and perform pooling and downsampling at a factor smaller than the patch size, with as many layers as needed in order to reach a desired level of translation invariance $\sigma_n$. We show in Section 3.3 that the same invariance and stability guarantees hold when using kernel approximations as in CKNs, at the cost of losing signal information.
In Section 3.5, we show how to go beyond the translation group, by constructing similar representations that are invariant to the action of locally compact groups. This is achieved by modifying patch extraction and pooling operators so that they commute with the group action operator (this is known as equivariance).
Our second goal is to analyze the complexity of deep convolutional models by studying the functional space defined by our kernel representation, showing that certain classes of CNNs are contained in this space, and characterizing their norm.
The multi-layer kernel representation defined in Section 2 is constructed by using kernel mappings defined on local signal patches at each scale, which replace the linear mapping followed by a non-linearity in standard convolutional networks. Inspired by Zhang et al. (2017b), we show in Section 4.1 that when these kernel mappings come from a class of dot-product kernels, the corresponding RKHS contains functions of the form
$$z \mapsto \|z\| \, \sigma\!\left(\frac{\langle g, z \rangle}{\|z\|}\right),$$
for certain types of smooth activation functions $\sigma$, where $g$ and $z$ live in a particular Hilbert space. These behave like simple neural network functions on patches, up to homogenization. Note that if $\sigma$ were allowed to be homogeneous, such as for rectified linear units $\sigma(\alpha) = \max(\alpha, 0)$, homogenization would disappear. By considering multiple such functions at each layer, we construct a CNN in the RKHS of the full multi-layer kernel in Section 4.2. Denoting such a CNN by $f_\sigma$, we show that its RKHS norm can be bounded as
$$\|f_\sigma\|^2 \leq \|w_{n+1}\|^2 \, C_\sigma^2\Big(\|W_n\|_2^2 \, C_\sigma^2\big(\|W_{n-1}\|_2^2 \cdots C_\sigma^2\big(\|W_2\|_2^2 \, C_\sigma^2(\|W_1\|_F^2)\big) \cdots \big)\Big),$$
where $W_k$ are convolutional filter parameters at layer $k$, $w_{n+1}$ carries the parameters of a final linear fully connected layer, $C_\sigma$ is a function quantifying the complexity of the simple functions defined above depending on the choice of activation $\sigma$, and $\|W_k\|_2$, $\|W_k\|_F$ denote spectral and Frobenius norms, respectively (see Section 4.2 for details). This norm can then control generalization aspects through classical margin bounds, as well as the invariance and stability of model predictions. Indeed, by using the reproducing property $f(x) = \langle f, \Phi(x) \rangle$, this “linearization” lets us control stability properties of model predictions through $\|f\|$:
$$\text{for all signals } x \text{ and } x', \quad |f(x) - f(x')| \leq \|f\| \, \|\Phi(x) - \Phi(x')\|,$$
meaning that the prediction function $f$ will inherit the stability of $\Phi$ when $\|f\|$ is small.
The case of standard CNNs with generic activations.
When considering CNNs with generic, possibly non-smooth activations such as rectified linear units (ReLUs), the separation between a data-independent representation and a learned model is not always achievable in contrast to our kernel approach. In particular, the “representation” given by the last layer of a learned CNN is often considered by practitioners, but such a representation is data-dependent in that it is typically trained on a specific task and dataset, and does not preserve signal information.
Nevertheless, we obtain similar invariance and stability properties for the predictions of such models in Section 4.3, by considering a complexity measure given by the product of spectral norms of each linear convolutional mapping in a CNN. Unlike our study based on kernel methods, such results do not say anything about generalization; however, relevant generalization bounds based on similar quantities have been derived (though other quantities in addition to the product of spectral norms appear in the bounds, and these bounds do not directly apply to CNNs), e.g., by Bartlett et al. (2017); Neyshabur et al. (2018), making the relationship between generalization and stability clear in this context as well.
1.2 Related Work
Our work relies on image representations introduced in the context of convolutional kernel networks (Mairal, 2016; Mairal et al., 2014), which yield a sequence of spatial maps similar to traditional CNNs, but where each point on the maps is possibly infinite-dimensional and lives in a reproducing kernel Hilbert space (RKHS). The extension to signals with $d$ spatial dimensions is straightforward. Since computing the corresponding Gram matrix as in classical kernel machines is computationally impractical, CKNs provide an approximation scheme consisting of learning finite-dimensional subspaces of each RKHS’s layer, where the data is projected. The resulting architecture of CKNs resembles traditional CNNs with a subspace learning interpretation and different unsupervised learning principles.
Another major source of inspiration is the study of group-invariance and stability to the action of diffeomorphisms of scattering networks (Mallat, 2012), which introduced the main formalism and several proof techniques that were keys to our results. Our main effort was to extend them to more general CNN architectures and to the kernel framework, allowing us to provide a clear relationship between stability properties of the representation and generalization of learned CNN models. We note that an extension of scattering networks results to more general convolutional networks was previously given by Wiatowski and Bölcskei (2018); however, their guarantees on deformations do not improve on the inherent stability properties of the considered signal, and their study does not consider learning or generalization, by treating a convolutional architecture with fixed weights as a feature extractor. In contrast, our stability analysis shows the benefits of deep representations with a clear dependence on the choice of network architecture through the size of convolutional patches and pooling layers, and we study the implications for learned CNNs through notions of model complexity.
Invariance to groups of transformations was also studied for more classical convolutional neural networks from methodological and empirical points of view (Bruna et al., 2013; Cohen and Welling, 2016), and for shallow learned representations (Anselmi et al., 2016) or kernel methods (Haasdonk and Burkhardt, 2007; Mroueh et al., 2015; Raj et al., 2017). Our work provides a similar group-equivariant construction to (Cohen and Welling, 2016), while additionally relating it to stability. In particular, we show that in order to achieve group invariance, pooling on the group is only needed at the final layer, while deep architectures with pooling at multiple scales are mainly beneficial for stability. For the specific example of the roto-translation group (Sifre and Mallat, 2013), we show that our construction achieves invariance to rotations while maintaining stability to deformations on the translation group.
Note also that other techniques combining deep neural networks and kernels have been introduced earlier. Multilayer kernel machines were for instance introduced by Cho and Saul (2009); Schölkopf et al. (1998). Shallow kernels for images modeling local regions were also proposed by Schölkopf (1997), and a multilayer construction was proposed by Bo et al. (2011). More recently, different models based on kernels have been introduced by Anselmi et al. (2015); Daniely et al. (2016); Montavon et al. (2011) to gain some theoretical insight about classical multilayer neural networks, while kernels are used by Zhang et al. (2017b) to define convex models for two-layer convolutional networks. Theoretical and practical concerns for learning with multilayer kernels have been studied in Daniely et al. (2017, 2016); Steinwart et al. (2016); Zhang et al. (2016) in addition to CKNs. In particular, Daniely et al. (2017, 2016) study certain classes of dot-product kernels with random feature approximations, Steinwart et al. (2016) consider hierarchical Gaussian kernels with learned weights, and Zhang et al. (2016) study a convex formulation for learning a certain class of fully connected neural networks using a hierarchical kernel. In contrast to these works, our focus is on the kernel representation induced by the specific hierarchical kernel defined in CKNs and the geometry of the RKHS. Our characterization of CNNs and activation functions contained in the RKHS is similar to the work of Zhang et al. (2016, 2017b), but differs in several ways: we consider general homogeneous dot-product kernels, which yield desirable properties of kernel mappings for stability; we construct generic multi-layer CNNs with pooling in the RKHS, while Zhang et al. (2016) only consider fully-connected networks and Zhang et al. (2017b) are limited to two-layer convolutional networks with no pooling; we quantify the RKHS norm of a CNN depending on its parameters, in particular matrix norms, as a way to control stability and generalization, while Zhang et al. (2016, 2017b) consider models with constrained parameters, and focus on convex learning procedures.
1.3 Notation and Basic Mathematical Tools
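A positive definite kernel $K$ that operates on a set $\mathcal{X}$ implicitly defines a reproducing kernel Hilbert space $\mathcal{H}$ of functions from $\mathcal{X}$ to $\mathbb{R}$, along with a mapping $\varphi : \mathcal{X} \to \mathcal{H}$. A predictive model associates to every point $z$ in $\mathcal{X}$ a label in $\mathbb{R}$; it consists of a linear function $f$ in $\mathcal{H}$ such that $f(z) = \langle f, \varphi(z) \rangle_{\mathcal{H}}$, where $\varphi(z)$ is the data representation. Given now two points $z, z'$ in $\mathcal{X}$, Cauchy-Schwarz’s inequality allows us to control the variation of the predictive model $f$ according to the geometry induced by the Hilbert norm $\|\cdot\|_{\mathcal{H}}$:
$$|f(z) - f(z')| \leq \|f\|_{\mathcal{H}} \, \|\varphi(z) - \varphi(z')\|_{\mathcal{H}}.$$
This property implies that two points $z$ and $z'$ that are close to each other according to the RKHS norm should lead to similar predictions, when the model $f$ has small norm in $\mathcal{H}$.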
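The Cauchy-Schwarz bound above can be checked numerically. The following is a minimal sketch, assuming a Gaussian kernel on $\mathbb{R}^5$ and a function $f = \sum_i \alpha_i K(z_i, \cdot)$ built from random anchor points; the names `gaussian_kernel`, `anchors`, and `coeffs` are hypothetical and chosen only for this illustration.

```python
import numpy as np

def gaussian_kernel(a, b, alpha=1.0):
    """Gaussian kernel matrix K[i, j] = exp(-alpha/2 * ||a_i - b_j||^2)."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-0.5 * alpha * np.maximum(sq, 0.0))

rng = np.random.default_rng(0)
anchors = rng.normal(size=(20, 5))   # points z_i defining f (illustrative)
coeffs = rng.normal(size=20)         # coefficients alpha_i (illustrative)

# For f = sum_i alpha_i K(z_i, .), the RKHS norm satisfies ||f||^2 = alpha^T K alpha.
f_norm = np.sqrt(coeffs @ gaussian_kernel(anchors, anchors) @ coeffs)

def f(z):
    """Evaluate f at each row of z through the reproducing property."""
    return gaussian_kernel(z, anchors) @ coeffs

z = rng.normal(size=(100, 5))
z_prime = z + 0.1 * rng.normal(size=(100, 5))

# ||phi(z) - phi(z')||^2 = K(z, z) + K(z', z') - 2 K(z, z') = 2 - 2 K(z, z') here.
k_cross = np.exp(-0.5 * np.sum((z - z_prime) ** 2, axis=1))
feature_dist = np.sqrt(2.0 - 2.0 * k_cross)

lhs = np.abs(f(z) - f(z_prime))
rhs = f_norm * feature_dist
bound_holds = bool(np.all(lhs <= rhs + 1e-12))
```

The bound is data-independent: it only involves the norm of $f$ and the distance between the two representations, which is why controlling $\|f\|_{\mathcal{H}}$ transfers stability of the representation to predictions.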
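Then, we consider notation from signal processing similar to Mallat (2012). We call a signal $x$ a function in $L^2(\Omega, \mathcal{H})$, where the domain $\Omega$ represents spatial coordinates, and $\mathcal{H}$ is a Hilbert space, when $\|x\|_{L^2}^2 := \int_\Omega \|x(u)\|_{\mathcal{H}}^2 \, du < \infty$, where $du$ is the Lebesgue measure on $\Omega$. Given a linear operator $T : L^2(\Omega, \mathcal{H}) \to L^2(\Omega, \mathcal{H}')$, the operator norm is defined as $\|T\|_{L^2(\Omega, \mathcal{H}) \to L^2(\Omega, \mathcal{H}')} := \sup_{\|x\|_{L^2(\Omega, \mathcal{H})} \leq 1} \|T x\|_{L^2(\Omega, \mathcal{H}')}$. For the sake of clarity, we drop norm subscripts from now on, using the notation $\|\cdot\|$ for Hilbert space norms, $L^2$ norms, and operator norms, while $|\cdot|$ denotes the Euclidean norm on $\mathbb{R}^d$. We use cursive capital letters (e.g., $\mathcal{H}, \mathcal{P}$) to denote Hilbert spaces, and non-cursive ones for operators (e.g., $P, M, A$). Some useful mathematical tools are also presented in Appendix A.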
1.4 Organization of the Paper
The rest of the paper is structured as follows:
In Section 2, we introduce a multilayer convolutional kernel representation for continuous signals, based on a hierarchy of patch extraction, kernel mapping, and pooling operators. We present useful properties of this representation such as signal preservation, as well as ways to make it practical through discretization and kernel approximations in the context of CKNs.
In Section 3, we present our main results regarding stability and invariance, namely that the kernel representation introduced in Section 2 is near translation-invariant and stable to the action of diffeomorphisms. We then show in Section 3.3 that the same stability results apply in the presence of kernel approximations such as those of CKNs (Mairal, 2016), and describe a generic way to modify the multilayer construction in order to guarantee invariance to the action of any locally compact group of transformations in Section 3.5.
In Section 4, we study the functional spaces induced by our representation, showing that simple neural-network like functions with certain smooth activations are contained in the RKHS at intermediate layers, and that the RKHS of the full kernel induced by our representation contains a class of generic CNNs with smooth and homogeneous activations. We then present upper bounds on the RKHS norm of such CNNs, which serves as a measure of complexity, controlling both generalization and stability. Section 4.3 studies the stability for CNNs with generic activations such as rectified linear units, and discusses the link with generalization.
Finally, we discuss in Section 5 how the obtained stability results apply to the practical setting of learning prediction functions. In particular, we explain why the regularization used in CKNs provides a natural way to control stability, while a similar control is harder to achieve with generic CNNs.
2 Construction of the Multilayer Convolutional Kernel
We now present the multilayer convolutional kernel, which operates on signals with $d$ spatial dimensions. The construction follows closely that of convolutional kernel networks but is generalized to input signals defined on the continuous domain $\Omega = \mathbb{R}^d$. Dealing with continuous signals is indeed useful to characterize the stability properties of signal representations to small deformations, as done by Mallat (2012) in the context of the scattering transform. The issue of discretization on a discrete grid is addressed in Section 2.1.
In what follows, we consider signals $x_0$ that live in $L^2(\Omega, \mathcal{H}_0)$, where typically $\mathcal{H}_0 = \mathbb{R}^{p_0}$ (e.g., with $p_0 = 3$ and $d = 2$, the vector $x_0(u)$ in $\mathbb{R}^3$ may represent the RGB pixel value at location $u$ in $\Omega$). Then, we build a sequence of reproducing kernel Hilbert spaces $\mathcal{H}_1, \mathcal{H}_2, \ldots$ and transform $x_0$ into a sequence of “feature maps”, respectively denoted by $x_1$ in $L^2(\Omega, \mathcal{H}_1)$, $x_2$ in $L^2(\Omega, \mathcal{H}_2)$, etc. As depicted in Figure 1, a new map $x_k$ is built from the previous one $x_{k-1}$ by applying successively three operators that perform patch extraction ($P_k$), kernel mapping ($M_k$) to a new RKHS $\mathcal{H}_k$, and linear pooling ($A_k$), respectively. When going up in the hierarchy, the points $x_k(u)$ carry information from larger signal neighborhoods centered at $u$ in $\Omega$ with more invariance, as we formally show in Section 3.
Patch extraction operator.
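Given the layer $x_{k-1}$, we consider a patch shape $S_k$, defined as a compact centered subset of $\mathbb{R}^d$, e.g., a box, and we define the Hilbert space $\mathcal{P}_k := L^2(S_k, \mathcal{H}_{k-1})$ equipped with the norm $\|z\|^2 = \int_{S_k} \|z(u)\|^2 \, d\nu_k(u)$, where $d\nu_k$ is the normalized uniform measure on $S_k$, for every $z$ in $\mathcal{P}_k$. Specifically, we define the (linear) patch extraction operator $P_k : L^2(\Omega, \mathcal{H}_{k-1}) \to L^2(\Omega, \mathcal{P}_k)$ such that for all $u$ in $\Omega$,
$$P_k x_{k-1}(u) := \left(v \mapsto x_{k-1}(u + v)\right)_{v \in S_k} \in \mathcal{P}_k.$$
Note that by equipping $\mathcal{P}_k$ with a normalized measure, it is easy to show that the operator $P_k$ preserves the norm, that is, $\|P_k x_{k-1}\| = \|x_{k-1}\|$, and hence $P_k x_{k-1}$ is in $L^2(\Omega, \mathcal{P}_k)$.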
Kernel mapping operator.
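Then, we map each patch of $x_{k-1}$ to a RKHS $\mathcal{H}_k$ thanks to the kernel mapping $\varphi_k : \mathcal{P}_k \to \mathcal{H}_k$ associated to a positive definite kernel $K_k$ that operates on patches. It allows us to define the pointwise operator $M_k$ such that for all $u$ in $\Omega$,
$$M_k P_k x_{k-1}(u) := \varphi_k\left(P_k x_{k-1}(u)\right) \in \mathcal{H}_k.$$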
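In this paper, we consider homogeneous dot-product kernels $K_k$ operating on $\mathcal{P}_k$, defined in terms of a function $\kappa_k : [-1, 1] \to \mathbb{R}$ that satisfies the following constraints:
$$\kappa_k(u) = \sum_{j=0}^{+\infty} b_j u^j \quad \text{s.t.} \quad \forall j,\ b_j \geq 0, \quad \kappa_k(1) = 1, \quad \kappa_k'(1) = 1, \tag{A1}$$
assuming convergence of the series $\sum_j b_j$ and $\sum_j j b_j$. Then, we define the kernel $K_k$ by
$$K_k(z, z') = \|z\| \, \|z'\| \, \kappa_k\!\left(\frac{\langle z, z' \rangle}{\|z\| \, \|z'\|}\right), \tag{2}$$
if $z, z' \in \mathcal{P}_k \setminus \{0\}$, and $K_k(z, z') = 0$ if $z = 0$ or $z' = 0$. The kernel is positive definite since it admits a Maclaurin expansion with only non-negative coefficients (Schoenberg, 1942; Schölkopf and Smola, 2001). The condition $\kappa_k(1) = 1$ ensures that the RKHS mapping preserves the norm, that is, $\|\varphi_k(z)\| = K_k(z, z)^{1/2} = \|z\|$, and thus $\|M_k P_k x_{k-1}(u)\| = \|P_k x_{k-1}(u)\|$ for all $u$ in $\Omega$; as a consequence, $M_k P_k x_{k-1}$ is always in $L^2(\Omega, \mathcal{H}_k)$. The technical condition $\kappa_k'(1) = 1$, where $\kappa_k'$ is the first derivative of $\kappa_k$, ensures that the kernel mapping $\varphi_k$ is non-expansive, according to Lemma 2 below.
Lemma (Non-expansiveness of the kernel mappings). Consider a positive-definite kernel $K_k$ of the form (2) satisfying (A1) with RKHS mapping $\varphi_k : \mathcal{P}_k \to \mathcal{H}_k$. Then, $\varphi_k$ is non-expansive, that is, for all $z, z'$ in $\mathcal{P}_k$,
$$\|\varphi_k(z) - \varphi_k(z')\| \leq \|z - z'\|.$$
Moreover, we remark that the kernel $K_k$ is lower-bounded by the linear one:
$$K_k(z, z') \geq \langle z, z' \rangle. \tag{3}$$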
From the proof of the lemma, given in Appendix B, one may notice that the assumption $\kappa_k'(1) = 1$ is not critical and may be safely replaced by $\kappa_k'(1) \leq 1$. Then, the non-expansiveness property would be preserved. Yet, we have chosen a stronger constraint since it yields a few simplifications in the stability analysis, where we use the relation (3) that requires $\kappa_k'(1) = 1$. More generally, the kernel mapping is Lipschitz continuous with constant $\rho_k = \max(1, \sqrt{\kappa_k'(1)})$. Our stability results hold in a setting with $\rho_k > 1$, but with constants $\prod_k \rho_k$ that may grow exponentially with the number of layers.
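The role of the two conditions in (A1) can be made explicit with a short calculation. The following is a sketch of one possible argument, not necessarily the proof given in Appendix B; the shorthand $c := \kappa_k'(1)$ and the cosine $u$ between two non-zero patches are notation introduced here only for illustration.

```latex
% Sketch under (A1) with b_j >= 0 and kappa_k(1) = 1; c := kappa_k'(1) is assumed notation.
% For non-zero z, z', let u = <z, z'> / (||z|| ||z'||), which lies in [-1, 1].
\begin{align*}
\|\varphi_k(z) - \varphi_k(z')\|^2
  &= K_k(z, z) + K_k(z', z') - 2 K_k(z, z') \\
  &= \|z\|^2 + \|z'\|^2 - 2 \|z\| \|z'\| \, \kappa_k(u). \\
\intertext{Since $u^j \geq 1 + j(u - 1)$ on $[-1, 1]$ for every integer $j \geq 0$, and $b_j \geq 0$,}
\kappa_k(u) &\geq \kappa_k(1) + \kappa_k'(1)(u - 1) = 1 - c + c\,u, \\
\intertext{so that}
\|\varphi_k(z) - \varphi_k(z')\|^2
  &\leq (1 - c) \left(\|z\| - \|z'\|\right)^2 + c \, \|z - z'\|^2
  \leq \max(1, c) \, \|z - z'\|^2.
\end{align*}
% With c = 1: non-expansiveness, and kappa_k(u) >= u yields K_k(z, z') >= <z, z'>.
```

When one of the two patches is zero, the claim reduces to $\|\varphi_k(z)\| = \|z\|$, which follows from $\kappa_k(1) = 1$.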
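Examples of functions $\kappa_k$ that satisfy the properties (A1) are now given below:
| Kernel | $\kappa(\langle z, z' \rangle)$ |
|---|---|
| exponential | $e^{\langle z, z' \rangle - 1}$ |
| inverse polynomial | $\frac{1}{2 - \langle z, z' \rangle}$ |
| arc-cosine, degree 1 | $\frac{1}{\pi}\left(\sin(\theta) + (\pi - \theta)\cos(\theta)\right)$ with $\theta = \arccos(\langle z, z' \rangle)$ |
| Vovk’s, degree 3 | $\frac{1}{3}\left(\frac{1 - \langle z, z' \rangle^3}{1 - \langle z, z' \rangle}\right) = \frac{1}{3}\left(1 + \langle z, z' \rangle + \langle z, z' \rangle^2\right)$ |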
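We note that the inverse polynomial kernel was used by Zhang et al. (2016, 2017b) to build convex models of fully connected networks and two-layer convolutional neural networks, while the arc-cosine kernel appears in early deep kernel machines (Cho and Saul, 2009). Note that the homogeneous exponential kernel reduces to the Gaussian kernel for unit-norm vectors. Indeed, for all $z, z'$ such that $\|z\| = \|z'\| = 1$, we have
$$\kappa_{\exp}(\langle z, z' \rangle) = e^{\langle z, z' \rangle - 1} = e^{-\frac{1}{2}\|z - z'\|^2},$$
and thus, we may refer to kernel (2) with the function $\kappa_{\exp}$ as the homogeneous Gaussian kernel. The kernel $\kappa(\langle z, z' \rangle) = e^{\alpha(\langle z, z' \rangle - 1)} = e^{-\frac{\alpha}{2}\|z - z'\|^2}$ with $\alpha \neq 1$ may also be used here, but we choose $\alpha = 1$ for simplicity since $\kappa'(1) = \alpha$ (see discussion above).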
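The properties discussed above can be checked numerically. The sketch below assumes standard closed forms for four $\kappa$ functions (exponential, inverse polynomial, arc-cosine of degree 1, Vovk's of degree 3); the dictionary `KAPPAS` and the helper `homogeneous_kernel` are hypothetical names used for illustration. It verifies $\kappa(1) = 1$ and $\kappa'(1) \approx 1$ by a finite difference, then non-expansiveness, the lower bound (3), and the Gaussian identity on unit-norm vectors.

```python
import numpy as np

# Candidate kappa functions on [-1, 1] (assumed closed forms).
KAPPAS = {
    "exponential": lambda u: np.exp(u - 1.0),
    "inverse_polynomial": lambda u: 1.0 / (2.0 - u),
    "arc_cosine_1": lambda u: (np.sqrt(1.0 - u**2) + (np.pi - np.arccos(u)) * u) / np.pi,
    "vovk_3": lambda u: (1.0 + u + u**2) / 3.0,
}

def homogeneous_kernel(z, zp, kappa):
    """K(z, z') = ||z|| ||z'|| kappa(<z, z'> / (||z|| ||z'||)), and 0 if z or z' is zero."""
    nz, nzp = np.linalg.norm(z), np.linalg.norm(zp)
    if nz == 0.0 or nzp == 0.0:
        return 0.0
    u = np.clip(z @ zp / (nz * nzp), -1.0, 1.0)
    return nz * nzp * kappa(u)

# (A1): kappa(1) = 1 and kappa'(1) = 1 (backward finite difference at u = 1).
h = 1e-6
a1_ok = all(
    abs(kappa(1.0) - 1.0) < 1e-12
    and abs((kappa(1.0) - kappa(1.0 - h)) / h - 1.0) < 1e-2
    for kappa in KAPPAS.values()
)

rng = np.random.default_rng(0)
nonexpansive_ok, lower_bound_ok = True, True
for kappa in KAPPAS.values():
    for _ in range(200):
        z, zp = rng.normal(size=8), rng.normal(size=8)
        k_cross = homogeneous_kernel(z, zp, kappa)
        # Squared RKHS distance between phi(z) and phi(z').
        dist_sq = (homogeneous_kernel(z, z, kappa)
                   + homogeneous_kernel(zp, zp, kappa) - 2.0 * k_cross)
        nonexpansive_ok = nonexpansive_ok and bool(dist_sq <= np.sum((z - zp) ** 2) + 1e-9)
        lower_bound_ok = lower_bound_ok and bool(k_cross >= z @ zp - 1e-9)

# On the unit sphere, the exponential kappa coincides with a Gaussian kernel.
z, zp = rng.normal(size=8), rng.normal(size=8)
z, zp = z / np.linalg.norm(z), zp / np.linalg.norm(zp)
gaussian_ok = bool(np.isclose(
    homogeneous_kernel(z, zp, KAPPAS["exponential"]),
    np.exp(-0.5 * np.sum((z - zp) ** 2)),
))
```

The finite-difference tolerance is deliberately loose because the derivative of the arc-cosine function varies like a square root near $u = 1$.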
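Pooling operator.
The last step to build the layer $x_k$ consists of pooling neighboring values to achieve local shift-invariance. We apply a linear convolution operator $A_k$ with a Gaussian filter of scale $\sigma_k$, $h_{\sigma_k}(u) := \sigma_k^{-d} h(u / \sigma_k)$, where $h(u) = (2\pi)^{-d/2} \exp(-|u|^2 / 2)$. Then, for all $u$ in $\Omega$,
$$x_k(u) = A_k M_k P_k x_{k-1}(u) = \int_{\mathbb{R}^d} h_{\sigma_k}(u - v) \, M_k P_k x_{k-1}(v) \, dv \in \mathcal{H}_k, \tag{4}$$
where the integral is a Bochner integral (see Diestel and Uhl, 1977; Muandet et al., 2017). By applying Schur’s test to the integral operator $A_k$ (see Appendix A), we obtain that the operator norm $\|A_k\|$ is at most $1$. Thus, $x_k$ is in $L^2(\Omega, \mathcal{H}_k)$, with $\|x_k\| \leq \|M_k P_k x_{k-1}\|$. Note that a similar pooling operator is used in the scattering transform (Mallat, 2012).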
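The fact that Gaussian pooling does not increase the norm has a simple discrete analogue. The sketch below assumes a one-dimensional grid and a truncated Gaussian filter normalized to sum to one (a discretization choice made here, not taken from the text); `gaussian_filter` and `pool` are hypothetical helper names. With a non-negative filter of unit $\ell^1$ norm, Young's inequality gives $\|h * x\|_2 \leq \|x\|_2$.

```python
import numpy as np

def gaussian_filter(sigma, radius_factor=4.0):
    """Truncated discrete Gaussian filter, normalized to sum to one (assumed discretization)."""
    radius = int(np.ceil(radius_factor * sigma))
    t = np.arange(-radius, radius + 1)
    h = np.exp(-0.5 * (t / sigma) ** 2)
    return h / h.sum()

def pool(x, sigma):
    """Convolve each channel of x (shape: positions x channels) with the Gaussian filter."""
    h = gaussian_filter(sigma)
    return np.stack(
        [np.convolve(x[:, c], h, mode="full") for c in range(x.shape[1])], axis=1
    )

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 3))  # a random 3-channel signal on 256 positions

# Ratio ||A x|| / ||x|| for several pooling scales; each should be at most 1.
ratios = [np.linalg.norm(pool(x, s)) / np.linalg.norm(x) for s in (1.0, 2.0, 4.0, 8.0)]
```

Using `mode="full"` keeps the whole convolution output, so the inequality holds exactly rather than up to boundary effects.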
Multilayer construction and prediction layer.
Finally, we obtain a multilayer representation by composing multiple times the previous operators. In order to increase invariance with each layer and to increase the size of the receptive fields (that is, the neighborhood of the original signal considered in a given patch), the size of the patch $S_k$ and pooling scale $\sigma_k$ typically grow exponentially with $k$, with $\sigma_k$ and the patch size $\sup_{c \in S_k} |c|$ of the same order. With $n$ layers, the maps $x_n$ may then be written
$$x_n := A_n M_n P_n A_{n-1} M_{n-1} P_{n-1} \cdots A_1 M_1 P_1 x_0 \in L^2(\Omega, \mathcal{H}_n). \tag{5}$$
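It remains to define a kernel from this representation, that will play the same role as the “fully connected” layer of classical convolutional neural networks. For that purpose, we simply consider the following linear kernel defined for all $x_0, x_0'$ in $L^2(\Omega, \mathcal{H}_0)$ by using the corresponding feature maps $x_n, x_n'$ in $L^2(\Omega, \mathcal{H}_n)$ given by our multilayer construction (5):
$$\mathcal{K}_n(x_0, x_0') = \langle x_n, x_n' \rangle = \int_{u \in \Omega} \langle x_n(u), x_n'(u) \rangle \, du.$$
Then, the RKHS $\mathcal{H}_{\mathcal{K}_n}$ of $\mathcal{K}_n$ contains all functions of the form $f(x_0) = \langle w, x_n \rangle$ with $w$ in $L^2(\Omega, \mathcal{H}_n)$ (see Appendix A).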
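We note that one may also consider nonlinear kernels, such as a Gaussian kernel:
$$\mathcal{K}_n(x_0, x_0') = e^{-\frac{\alpha}{2} \|x_n - x_n'\|^2}.$$
Such kernels are then associated to a RKHS denoted by $\mathcal{H}_{n+1}$, along with a kernel mapping $\varphi_{n+1} : L^2(\Omega, \mathcal{H}_n) \to \mathcal{H}_{n+1}$ which we call prediction layer, so that the final representation is given by $\varphi_{n+1}(x_n)$ in $\mathcal{H}_{n+1}$. We note that $\varphi_{n+1}$ is non-expansive for the Gaussian kernel when $\alpha \leq 1$ (see Section B.1), and is simply an isometric linear mapping for the linear kernel. Then, we have the relation $\mathcal{K}_n(x_0, x_0') = \langle \varphi_{n+1}(x_n), \varphi_{n+1}(x_n') \rangle$ and in particular, the RKHS $\mathcal{H}_{\mathcal{K}_n}$ of $\mathcal{K}_n$ contains all functions of the form $f(x_0) = \langle w, \varphi_{n+1}(x_n) \rangle$ with $w$ in $\mathcal{H}_{n+1}$, see Appendix A.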
2.1 Signal Preservation and Discretization
In this section, we show that the multilayer kernel representation preserves all information about the signal at each layer, and besides, each feature map $x_k$ can be sampled on a discrete set with no loss of information. This suggests a natural approach for discretization which will be discussed after the following lemma, whose proof is given in Appendix C.
Lemma (Signal recovery from sampling). Assume that $\mathcal{H}_k$ contains all linear functions $z \mapsto \langle g, z \rangle$ with $g$ in $\mathcal{P}_k$ (this is true for all kernels $K_k$ described in the previous section, according to Corollary 15 in Section 4.1 later); then, the signal $x_{k-1}$ can be recovered from a sampling of $x_k = A_k M_k P_k x_{k-1}$ at discrete locations in a set $\Omega'$ as soon as $\Omega' + S_k = \Omega$ (i.e., the union of patches centered at these points covers $\Omega$). It follows that $x_k$ can be reconstructed from such a sampling.
The previous construction defines a kernel representation for general signals in $L^2(\Omega, \mathcal{H}_0)$, which is an abstract object defined for theoretical purposes. In practice, signals are discrete, and it is thus important to discuss the problem of discretization. For clarity, we limit the presentation to 1-dimensional signals ($\Omega = \mathbb{R}^d$ with $d = 1$), but the arguments can easily be extended to higher dimensions $d$ when using box-shaped patches. Notation from the previous section is preserved, but we add a bar on top of all discrete analogues of their continuous counterparts; e.g., $\bar{x}_k$ is a discrete feature map in $\ell^2(\mathbb{Z}, \bar{\mathcal{H}}_k)$ for some RKHS $\bar{\mathcal{H}}_k$.
Input signals $x_0$ and $\bar{x}_0$.
Discrete signals acquired by a physical device may be seen as local integrators of signals defined on a continuous domain (e.g., sensors from digital cameras integrate the pointwise distribution of photons in a spatial and temporal window). Then, consider a signal $x_0$ in $L^2(\Omega, \mathcal{H}_0)$ and $s_0$ a sampling interval. By defining $\bar{x}_0$ in $\ell^2(\mathbb{Z}, \mathcal{H}_0)$ such that $\bar{x}_0[n] = x_0(n s_0)$ for all $n$ in $\mathbb{Z}$, it is thus natural to assume that $x_0 = A_0 x$, where $A_0$ is a pooling operator (local integrator) applied to an original continuous signal $x$. The role of $A_0$ is to prevent aliasing and reduce high frequencies; typically, the scale $\sigma_0$ of $A_0$ should be of the same magnitude as $s_0$, which we choose to be $1$ without loss of generality. This natural assumption is kept later for the stability analysis.
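We now want to build discrete feature maps $\bar{x}_k$ in $\ell^2(\mathbb{Z}, \bar{\mathcal{H}}_k)$ at each layer $k$ involving subsampling with a factor $s_k$ with respect to $\bar{x}_{k-1}$. We now define the discrete analogues of the operators $P_k$ (patch extraction), $M_k$ (kernel mapping), and $A_k$ (pooling) as follows: for $n \in \mathbb{Z}$,
$$\bar{P}_k \bar{x}_{k-1}[n] := \frac{1}{\sqrt{e_k}} \left(\bar{x}_{k-1}[n], \bar{x}_{k-1}[n+1], \ldots, \bar{x}_{k-1}[n + e_k - 1]\right) \in \bar{\mathcal{P}}_k := \bar{\mathcal{H}}_{k-1}^{e_k},$$
$$\bar{M}_k \bar{P}_k \bar{x}_{k-1}[n] := \bar{\varphi}_k\left(\bar{P}_k \bar{x}_{k-1}[n]\right) \in \bar{\mathcal{H}}_k,$$
$$\bar{x}_k[n] = \bar{A}_k \bar{M}_k \bar{P}_k \bar{x}_{k-1}[n] := \frac{1}{\sqrt{s_k}} \sum_{m \in \mathbb{Z}} \bar{h}_k[n s_k - m] \, \bar{M}_k \bar{P}_k \bar{x}_{k-1}[m] \in \bar{\mathcal{H}}_k,$$
where (i) $\bar{P}_k$ extracts a patch of size $e_k$ starting at position $n$ in $\bar{x}_{k-1}[\cdot]$, which lives in the Hilbert space $\bar{\mathcal{P}}_k$ defined as the direct sum of $e_k$ times $\bar{\mathcal{H}}_{k-1}$; (ii) $\bar{M}_k$ is a kernel mapping identical to the continuous case, which preserves the norm, like $M_k$; (iii) $\bar{A}_k$ performs a convolution with a Gaussian filter and a subsampling operation with factor $s_k$. The next lemma shows that under mild assumptions, this construction preserves signal information.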
Lemma (Signal recovery with subsampling). Assume that $\bar{\mathcal{H}}_k$ contains the linear functions $z \mapsto \langle w, z \rangle$ for all $w$ in $\bar{\mathcal{P}}_k$ and that $e_k \geq s_k$. Then, $\bar{x}_{k-1}$ can be recovered from $\bar{x}_k$.
The proof is given in Appendix C. The result relies on recovering patches using linear “measurement” functions and deconvolution of the pooling operation. While such a deconvolution operation can be unstable, it may be possible to obtain more stable recovery mechanisms by also considering non-linear measurements, a question which we leave open.
Links between the parameters of the discrete and continuous models.
Due to subsampling, the patch sizes in the continuous and discrete models are related by a multiplicative factor. Specifically, a patch of size $e_k$ with discretization corresponds to a patch $S_k$ of diameter $e_k s_{k-1} s_{k-2} \cdots s_1$ in the continuous case. The same holds true for the scale parameter of the Gaussian pooling.
2.2 Practical Implementation via Convolutional Kernel Networks
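Besides discretization, convolutional kernel networks add two modifications to implement in practice the image representation we have described. First, they use feature maps with finite spatial support, which introduces border effects that we do not study (like Mallat, 2012), but which are negligible when dealing with large realistic images. Second, CKNs use finite-dimensional approximations of the kernel feature map. Typically, each RKHS’s mapping is approximated by performing a projection onto a subspace of finite dimension, which is a classical approach to make kernel methods work at large scale (Fine and Scheinberg, 2001; Smola and Schölkopf, 2000; Williams and Seeger, 2001). If we consider the kernel mapping $\varphi_k : \mathcal{P}_k \to \mathcal{H}_k$ at layer $k$, the orthogonal projection onto the finite-dimensional subspace $\mathcal{F}_k = \mathrm{span}(\varphi_k(z_1), \ldots, \varphi_k(z_{p_k})) \subseteq \mathcal{H}_k$, where the $z_i$’s are $p_k$ anchor points in $\mathcal{P}_k$, is given by the linear operator $\Pi_k : \mathcal{H}_k \to \mathcal{F}_k$ defined for $f$ in $\mathcal{H}_k$ by
$$\Pi_k f := \sum_{1 \leq i, j \leq p_k} (K_{ZZ}^{-1})_{ij} \, \langle \varphi_k(z_i), f \rangle \, \varphi_k(z_j),$$
where $K_{ZZ}^{-1}$ is the inverse (or pseudo-inverse) of the $p_k \times p_k$ kernel matrix $[K_k(z_i, z_j)]_{ij}$. As an orthogonal projection operator, $\Pi_k$ is non-expansive, i.e., $\|\Pi_k\| \leq 1$. We can then define the new approximate version $\tilde{M}_k$ of the kernel mapping operator $M_k$ by
$$\tilde{M}_k P_k x_{k-1}(u) := \Pi_k \varphi_k\left(P_k x_{k-1}(u)\right) \in \mathcal{F}_k.$$
Note that all points in the feature map $\tilde{M}_k P_k x_{k-1}$ lie in the $p_k$-dimensional space $\mathcal{F}_k \subseteq \mathcal{H}_k$, which allows us to represent each point $\tilde{M}_k P_k x_{k-1}(u)$ by the finite dimensional vector
$$\psi_k\left(P_k x_{k-1}(u)\right) := K_{ZZ}^{-1/2} K_Z\left(P_k x_{k-1}(u)\right) \in \mathbb{R}^{p_k},$$
with $K_Z(z) := (K_k(z_1, z), \ldots, K_k(z_{p_k}, z))^\top$; this finite-dimensional representation preserves the Hilbertian inner product and norm in $\mathcal{F}_k$ (we have $\langle \psi_k(z), \psi_k(z') \rangle_2 = \langle \Pi_k \varphi_k(z), \Pi_k \varphi_k(z') \rangle_{\mathcal{H}_k}$; see Mairal (2016) for details), so that $\|\psi_k(P_k x_{k-1}(u))\|_2^2 = \|\tilde{M}_k P_k x_{k-1}(u)\|_{\mathcal{H}_k}^2$.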
Such a finite-dimensional mapping is compatible with the multilayer construction, which builds $\mathcal{H}_k$ by manipulating points from $\mathcal{H}_{k-1}$. Here, the approximation provides points in $\mathcal{F}_k \subseteq \mathcal{H}_k$, which remain in $\mathcal{F}_k$ after pooling since $\mathcal{F}_k$ is a linear subspace. Eventually, the sequence of RKHSs $\{\mathcal{H}_k\}_{k \geq 0}$ is not affected by the finite-dimensional approximation. Besides, the stability results we will present next are preserved thanks to the non-expansiveness of the projection. In contrast, other kernel approximations such as random Fourier features (Rahimi and Recht, 2007) do not provide points in the RKHS (see Bach, 2017), and their effect on the functional space derived from the multilayer construction is unclear.
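The projection onto the span of anchor points can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the CKN implementation: anchors are drawn at random rather than learned, the kernel is the homogeneous exponential one, and the names `kernel_matrix`, `fit_projection`, and `psi` are hypothetical. The embedding computes $K_{ZZ}^{-1/2} K_Z(z)$ and the checks confirm that it reproduces the kernel exactly on the anchors and remains non-expansive.

```python
import numpy as np

def kernel_matrix(a, b):
    """Homogeneous exponential kernel between rows of a and rows of b (rows assumed non-zero)."""
    na = np.linalg.norm(a, axis=1)[:, None]
    nb = np.linalg.norm(b, axis=1)[None, :]
    cos = np.clip((a @ b.T) / (na * nb), -1.0, 1.0)
    return na * nb * np.exp(cos - 1.0)

def fit_projection(anchors, eps=1e-10):
    """Return K_ZZ^{-1/2} via an eigendecomposition (pseudo-inverse on tiny eigenvalues)."""
    evals, evecs = np.linalg.eigh(kernel_matrix(anchors, anchors))
    inv_sqrt = np.where(evals > eps, 1.0 / np.sqrt(np.maximum(evals, eps)), 0.0)
    return (evecs * inv_sqrt) @ evecs.T

def psi(patches, anchors, k_inv_sqrt):
    """Finite-dimensional embedding psi(z) = K_ZZ^{-1/2} K_Z(z), one row per patch."""
    return kernel_matrix(patches, anchors) @ k_inv_sqrt

rng = np.random.default_rng(0)
anchors = rng.normal(size=(16, 6))    # random anchor points (assumption; CKNs learn them)
k_inv_sqrt = fit_projection(anchors)
patches = rng.normal(size=(200, 6))
emb = psi(patches, anchors, k_inv_sqrt)

# On the anchors themselves the projection is the identity, so the kernel is reproduced.
emb_anchors = psi(anchors, anchors, k_inv_sqrt)
anchors_exact = bool(np.allclose(emb_anchors @ emb_anchors.T,
                                 kernel_matrix(anchors, anchors), atol=1e-8))

# The projection can only shrink norms: ||psi(z)|| <= ||phi(z)|| = ||z||.
norm_ok = bool(np.all(np.linalg.norm(emb, axis=1)
                      <= np.linalg.norm(patches, axis=1) + 1e-8))

# Non-expansiveness of the composed map z -> psi(z).
d_emb = np.linalg.norm(emb[:100] - emb[100:], axis=1)
d_in = np.linalg.norm(patches[:100] - patches[100:], axis=1)
nonexpansive_ok = bool(np.all(d_emb <= d_in + 1e-8))
```

Because the embedding lives in the RKHS subspace spanned by the anchors, both inequalities follow from the properties of an orthogonal projection combined with the non-expansiveness of the kernel mapping.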
It is then possible to derive theoretical results for the CKN model, which appears as a natural implementation of the kernel constructed previously; yet, we will also show in Section 4 that the results apply more broadly to CNNs that are contained in the functional space associated to the kernel. However, the stability of these CNNs depends on their RKHS norm, which is hard to control. In contrast, for CKNs, stability is typically controlled by the norm of the final prediction layer.
3 Stability to Deformations and Group Invariance
In this section, we study the translation invariance and the stability under the action of diffeomorphisms of the kernel representation described in Section 2 for continuous signals. In addition to translation invariance, it is desirable to have a representation that is stable to small local deformations. We describe such deformations using a -diffeomorphism , and let denote the linear operator defined by . We use a similar characterization of stability to the one introduced by Mallat (2012): the representation is stable under the action of diffeomorphisms if there exist two non-negative constants and such that
where $\nabla \tau$ is the Jacobian of $\tau$, $\|\nabla \tau\|_\infty := \sup_{u \in \Omega} \|\nabla \tau(u)\|$, and $\|\tau\|_\infty := \sup_{u \in \Omega} |\tau(u)|$. The quantity $\|\nabla \tau(u)\|$ measures the size of the deformation at a location $u$, and like Mallat (2012), we assume the regularity condition $\|\nabla \tau\|_\infty \leq 1/2$, which implies that the deformation is invertible (Allassonnière et al., 2007; Trouvé and Younes, 2005) and helps us avoid degenerate situations. In order to have a near-translation-invariant representation, we want $C_2$ to be small (a translation is a diffeomorphism with $\nabla \tau = 0$), and indeed we will show that $C_2$ is proportional to $1/\sigma_n$, where $\sigma_n$ is the scale of the last pooling layer, which typically increases exponentially with the number of layers $n$. When $\nabla \tau$ is non-zero, the diffeomorphism deviates from a translation, producing local deformations controlled by $\nabla \tau$.
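To make the two deformation sizes concrete, the following minimal sketch approximates $\sup_u |\tau(u)|$ and $\sup_u |\tau'(u)|$ for a one-dimensional deformation on a grid, and checks the regularity condition that the derivative stays below 1/2. The deformation `tau_example`, the helper `deformation_norms`, and the grid parameters are illustrative assumptions, not part of the original analysis.

```python
import math


def deformation_norms(tau, lo=0.0, hi=1.0, num=2001):
    """Approximate sup|tau| and sup|tau'| on a uniform 1-D grid over [lo, hi]."""
    step = (hi - lo) / (num - 1)
    grid = [lo + i * step for i in range(num)]
    values = [tau(u) for u in grid]
    sup_tau = max(abs(v) for v in values)
    # Forward finite differences stand in for the Jacobian in dimension one.
    sup_grad = max(abs(values[i + 1] - values[i]) / step for i in range(num - 1))
    return sup_tau, sup_grad


def tau_example(u, eps=0.05):
    """Hypothetical smooth deformation field: a small sinusoidal displacement."""
    return eps * math.sin(2.0 * math.pi * u)


sup_tau, sup_grad = deformation_norms(tau_example)
# Regularity condition assumed in the text: the Jacobian norm is at most 1/2.
satisfies_regularity = sup_grad <= 0.5
print(sup_tau, sup_grad, satisfies_regularity)
```

For this choice, the displacement has amplitude `eps` while its derivative has amplitude `2 * pi * eps`, so a deformation can be small in sup norm and still violate the regularity condition if it oscillates quickly.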
In order to study the stability of the representation (5), we assume that the input signal $x_0$ may be written as $x_0 = A_0 x$, where $A_0$ is an initial pooling operator at scale $\sigma_0$, which allows us to control the high frequencies of the signal in the first layer. As discussed previously in Section 2.1, this assumption is natural and compatible with any physical acquisition device. Note that $\sigma_0$ can be taken arbitrarily small, so that this assumption does not limit the generality of our results. Then, we are interested in understanding the stability of the representation
$$\Phi_n(x) := A_n M_n P_n A_{n-1} M_{n-1} P_{n-1} \cdots A_1 M_1 P_1 A_0 x.$$
We do not consider a prediction layer here for simplicity, but note that if we add one on top of $\Phi_n$, based on a linear or Gaussian kernel, then the stability of the full representation $\varphi_{n+1} \circ \Phi_n$ immediately follows from that of $\Phi_n$ thanks to the non-expansiveness of $\varphi_{n+1}$ (see Section 2). Then, we make an assumption that relates the scale of the pooling operator at layer $k-1$ with the diameter of the patch $S_k$: we assume that there exists $\kappa > 0$ such that for all $k \geq 1$,
$$\sup_{c \in S_k} |c| \leq \kappa \sigma_{k-1}. \tag{A2}$$
The scales $\sigma_k$ are typically exponentially increasing with the layers $k$, and characterize the “resolution” of each feature map. This assumption corresponds to considering patch sizes that are adapted to these intermediate resolutions. Moreover, the stability bounds we obtain hereafter increase with $\kappa$, which leads us to believe that small patch sizes lead to more stable representations, something which matches well the trend of using small, 3x3 convolution filters at each scale in modern deep architectures (e.g., Simonyan and Zisserman, 2014).
Finally, before presenting our stability results, we recall a few properties of the operators involved in the representation $\Phi_n$, which are heavily used in the analysis.
• Patch extraction operator: $P_k$ is linear and preserves the norm;
• Kernel mapping operator: $M_k$ preserves the norm and is non-expansive;
• Pooling operator: $A_k$ is linear and non-expansive, $\|A_k\| \leq 1$.
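As a short illustration of the last property, here is a one-line argument for the non-expansiveness of pooling. It assumes that pooling at layer $k$ is a convolution with a non-negative kernel $h_{\sigma_k}$ that integrates to one (the notation $h_{\sigma_k}$ is introduced here for the sketch); the bound is then Young's convolution inequality.

```latex
\|A_k x\|_{L^2} = \|h_{\sigma_k} * x\|_{L^2}
  \leq \|h_{\sigma_k}\|_{L^1} \, \|x\|_{L^2}
  = \|x\|_{L^2},
\qquad \text{since } h_{\sigma_k} \geq 0
\text{ and } \int h_{\sigma_k}(u) \, du = 1 .
```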
The rest of this section is organized as follows. We present the main stability results in Section 3.1, explain their compatibility with kernel approximations in Section 3.3, and provide numerical experiments demonstrating the stability of the kernel representation in Section 3.4. Finally, we introduce mechanisms to achieve invariance to any group of transformations in Section 3.5.
3.1 Stability Results and Translation Invariance
Here, we show that our kernel representation $\Phi_n$ satisfies the stability property (11), with a constant $C_2$ inversely proportional to $\sigma_n$, thereby achieving near-invariance to translations. The results are then extended to more general transformation groups in Section 3.5.
General bound for stability.
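The following result gives an upper bound on the quantity of interest, $\|\Phi_n(L_\tau x) - \Phi_n(x)\|$, in terms of the norm of various linear operators which control how $\tau$ affects each layer. An important object of study is the commutator of linear operators $A$ and $B$, which is denoted by $[A, B] = AB - BA$.
Proposition (Bound with operator norms). For any $x$ in $L^2(\Omega, \mathcal{H}_0)$, we have
$$\|\Phi_n(L_\tau x) - \Phi_n(x)\| \leq \left( \sum_{k=1}^{n} \|[P_k A_{k-1}, L_\tau]\| + \|[A_n, L_\tau]\| + \|L_\tau A_n - A_n\| \right) \|x\|.$$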
For translations $L_\tau x(u) = L_c x(u) = x(u - c)$, it is easy to see that patch extraction and pooling operators commute with $L_c$ (this is also known as covariance or equivariance to translations), so that we are left with the term $\|L_c A_n - A_n\|$, which should control translation invariance. For general diffeomorphisms $\tau$, we no longer have exact covariance, but we show below that commutators are stable to $\tau$, in the sense that $\|[P_k A_{k-1}, L_\tau]\|$ is controlled by $\|\nabla \tau\|_\infty$, while $\|L_\tau A_n - A_n\|$ is controlled by $\|\tau\|_\infty$ and decays with the pooling size $\sigma_n$.
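The commutation of pooling with translations can be checked numerically in a discrete, periodic setting. The sketch below is only an illustration under assumptions: signals are finite lists on a circle, pooling is a circular convolution with a sampled Gaussian, and the helper names (`gaussian_kernel`, `pool`, `shift`) are hypothetical.

```python
import math
import random


def gaussian_kernel(n, sigma):
    """Sampled Gaussian on a circle of n points, normalized to sum to one."""
    weights = [math.exp(-0.5 * (min(j, n - j) / sigma) ** 2) for j in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


def pool(x, sigma):
    """Circular convolution with the Gaussian kernel (discrete stand-in for pooling)."""
    n = len(x)
    h = gaussian_kernel(n, sigma)
    return [sum(h[j] * x[(i - j) % n] for j in range(n)) for i in range(n)]


def shift(x, c):
    """Circular translation by c samples: (L_c x)[i] = x[i - c]."""
    n = len(x)
    return [x[(i - c) % n] for i in range(n)]


def norm(x):
    return math.sqrt(sum(v * v for v in x))


rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(128)]

# Commutator of pooling and translation applied to x: pool-then-shift
# versus shift-then-pool. For a pure translation the two should agree.
shift_then_pool = pool(shift(x, 5), 4.0)
pool_then_shift = shift(pool(x, 4.0), 5)
commutator_norm = norm([a - b for a, b in zip(shift_then_pool, pool_then_shift)])
print(commutator_norm)
```

Replacing the constant shift by a position-dependent displacement breaks this exact equality, which is why the analysis needs to bound the commutators for general diffeomorphisms.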
Bound on $\|[P_k A_{k-1}, L_\tau]\|$.
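We note that $P_k z$ can be identified with $(L_c z)_{c \in S_k}$ isometrically for all $z$ in $L^2(\Omega, \mathcal{H}_{k-1})$, since $\|P_k z\|^2 = \int_{S_k} \|L_c z\|^2 \, d\nu_k(c)$ by Fubini’s theorem. Then,
$$\|P_k A_{k-1} L_\tau z - L_\tau P_k A_{k-1} z\|^2 = \int_{S_k} \|L_c A_{k-1} L_\tau z - L_\tau L_c A_{k-1} z\|^2 \, d\nu_k(c) \leq \sup_{c \in S_k} \|L_c A_{k-1} L_\tau z - L_\tau L_c A_{k-1} z\|^2,$$
so that $\|[P_k A_{k-1}, L_\tau]\| \leq \sup_{c \in S_k} \|[L_c A_{k-1}, L_\tau]\|$. The following result lets us bound the commutator $\|[L_c A_{k-1}, L_\tau]\|$ when $|c| \leq \kappa \sigma_{k-1}$, which is satisfied under assumption (A2).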
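Lemma (Stability of shifted pooling). Consider $A_\sigma$ the pooling operator with kernel $h_\sigma(u) = \sigma^{-d} h(u/\sigma)$. If $\|\nabla \tau\|_\infty \leq 1/2$, there exists a constant $C_1$ such that for any $\sigma$ and $|c| \leq \kappa \sigma$, we have
$$\|[L_c A_\sigma, L_\tau]\| \leq C_1 \|\nabla \tau\|_\infty,$$
where $C_1$ depends only on $h$ and $\kappa$. A similar result can be found in Lemma E.1 of Mallat (2012) for commutators of the form $[A_\sigma, L_\tau]$, but we extend it to handle integral operators $L_c A_\sigma$ with a shifted kernel. The proof (given in Appendix C.4) follows closely Mallat (2012) and relies on the fact that $[L_c A_\sigma, L_\tau]$ is an integral operator in order to bound its norm via Schur’s test. Note that $\kappa$ can be made larger, at the cost of an increase of the constant $C_1$ of the order $\kappa^{d+1}$.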
Bound on $\|L_\tau A_n - A_n\|$.
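We bound the operator norm $\|L_\tau A_n - A_n\|$ in terms of $\|\tau\|_\infty$ using the following result due to Mallat (2012, Lemma 2.11), with $\sigma = \sigma_n$:
Lemma (Translation invariance). If $\|\nabla \tau\|_\infty \leq 1/2$, we have
$$\|L_\tau A_\sigma - A_\sigma\| \leq \frac{C_2}{\sigma} \|\tau\|_\infty, \quad \text{with } C_2 = 2^d \cdot \|\nabla h\|_1.$$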
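Combining the bound with operator norms with the two lemmas above (the shifted-pooling lemma controls each of the $n + 1$ commutator terms under assumption (A2), and the translation-invariance lemma controls the last term), we obtain, for $\|\nabla \tau\|_\infty \leq 1/2$,
$$\|\Phi_n(L_\tau x) - \Phi_n(x)\| \leq \left( C_1 (1 + n) \|\nabla \tau\|_\infty + \frac{C_2}{\sigma_n} \|\tau\|_\infty \right) \|x\|. \tag{13}$$
This result matches the desired notion of stability in Eq. (11), with a translation-invariance factor that decays with $\sigma_n$. We discuss implications of our bound, and compare it with related work on stability in Section 3.2. We also note that our bound yields a worst-case guarantee on stability, in the sense that it holds for any signal $x$. In particular, making additional assumptions on the signal (e.g., smoothness) may lead to improved stability. The predictions for a specific model may also be more stable than what is obtained by applying (1) to our stability bound, for instance if the filters are smooth enough.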
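The decay of the translation term with the pooling scale can be observed in a small discrete experiment. The sketch below is an illustration under assumptions (periodic one-dimensional signals, sampled Gaussian pooling, hypothetical helper names); it measures the relative change of the pooled signal under a fixed shift for increasing pooling scales, and does not reproduce the constants of the lemma.

```python
import math
import random


def gaussian_kernel(n, sigma):
    """Sampled Gaussian on a circle of n points, normalized to sum to one."""
    weights = [math.exp(-0.5 * (min(j, n - j) / sigma) ** 2) for j in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


def pool(x, sigma):
    """Circular convolution with the Gaussian kernel (discrete stand-in for pooling)."""
    n = len(x)
    h = gaussian_kernel(n, sigma)
    return [sum(h[j] * x[(i - j) % n] for j in range(n)) for i in range(n)]


def shift(x, c):
    """Circular translation by c samples: (L_c x)[i] = x[i - c]."""
    n = len(x)
    return [x[(i - c) % n] for i in range(n)]


def norm(x):
    return math.sqrt(sum(v * v for v in x))


rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(256)]

shift_size = 2
ratios = {}
for sigma in (4.0, 8.0, 16.0):
    pooled = pool(x, sigma)
    moved = shift(pooled, shift_size)  # shifting the pooled signal
    # Relative size of (L_c A_sigma - A_sigma) x for this particular signal.
    ratios[sigma] = norm([a - b for a, b in zip(moved, pooled)]) / norm(x)

for sigma, ratio in ratios.items():
    print(sigma, ratio, shift_size / sigma)
```

The printed ratios shrink as the pooling scale grows and stay below `shift_size / sigma`, in line with a bound proportional to the shift size divided by the pooling scale.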
[Stability for Lipschitz non-linear mappings] While the previous results require non-expansive non-linear mappings , it is easy to extend the result to the following more general condition
3.2 Discussion of the Stability Bound (Theorem 3.1)
In this section, we discuss the implications of our stability bound (13), and compare it to related work on the stability of the scattering transform (Mallat, 2012) as well as the work of Wiatowski and Bölcskei (2018) on more general convolutional models.
Role of depth.
Our bound displays a linear dependence on the number of layers $n$ in the stability constant $C_1 (1 + n)$. We note that a dependence on a notion of depth (the number of layers $n$ here) also appears in Mallat (2012), with a factor equal to the maximal length of “scattering paths”, and with the same condition $\|\nabla \tau\|_\infty \leq 1/2$. Nevertheless, the number of layers is tightly linked to the patch sizes, and we now show how a deeper architecture can be beneficial for stability. Given a desired level of translation-invariance $\sigma_f$ and a given initial resolution $\sigma_0$, the above bound together with the discretization results of Section 2.1 suggest that one can obtain a stable representation that preserves signal information by taking small patches at each layer and subsampling with a factor equal to the patch size (assuming a patch size greater than one) until the desired level of invariance is reached: in this case we have $\sigma_f / \sigma_0 = O(\kappa^n)$, where $\kappa$ is of the order of the patch size, so that $n = O(\log(\sigma_f / \sigma_0) / \log(\kappa))$, and hence the stability constant $C_1 (1 + n)$ grows with $\kappa$ as $\kappa^{d+1} / \log(\kappa)$, explaining the benefit of small patches, and thus of deeper models.
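The trade-off between patch size and depth can be illustrated with a few numbers. The sketch below assumes the scalings $n \approx \log(\sigma_f / \sigma_0) / \log(\kappa)$ and $C_1 \propto \kappa^{d+1}$ (the order of growth mentioned for the shifted-pooling constant), with all hidden constants set to one; the function name and the chosen values are hypothetical and only meant to show the trend.

```python
import math


def stability_constant(patch_size, sigma_ratio, dim=2):
    """Return (n, kappa**(dim + 1) * (1 + n)) with hidden constants set to one.

    patch_size  : kappa, assumed of the order of the patch size (must be > 1)
    sigma_ratio : sigma_f / sigma_0, target invariance over initial resolution
    dim         : signal dimension d (2 for images)
    """
    n_layers = math.log(sigma_ratio) / math.log(patch_size)
    return n_layers, patch_size ** (dim + 1) * (1.0 + n_layers)


table = {kappa: stability_constant(kappa, sigma_ratio=64.0) for kappa in (2, 3, 5, 7)}
for kappa, (n_layers, constant) in table.items():
    print(kappa, round(n_layers, 2), round(constant, 1))
```

Under these assumptions, larger patches need fewer layers to reach the same invariance scale, yet the polynomial growth in the patch size dominates the logarithmic saving in depth, so the overall constant is smallest for the smallest patches.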
While the scattering representation preserves the norm of the input signals when the length of scattering paths goes to infinity, in our setting the norm may decrease with depth due to pooling layers. However, we show in Appendix C.5 that a part of the signal norm is still preserved, particularly for signals with high energy in the low frequencies, as is the case for natural images (e.g., Torralba and Oliva, 2003). This justifies that the bounded quantity in (13) is relevant and non-trivial. Nevertheless, we recall that despite a possible loss in norm, our (infinite-dimensional) representation preserves signal information, as discussed in Section 2.1.
Dependence on signal bandwidth.
We note that our stability result crucially relies on the assumption $\sigma_0 > 0$, which effectively limits its applicability to signals with frequencies bounded by $\lambda_0 \approx 1/\sigma_0$. While this assumption is realistic in practice for digital signals, our bound degrades as $\sigma_0$ approaches 0, since the number of layers $n$ grows as