I Introduction
Even though today’s signal processing systems have achieved an unprecedented complexity, a multitude of them have a very basic commonality: The application of a translation-invariant function to a large signal in a sliding fashion facilitates the dense computation of interesting output values for each possible spatial location. Consider filter-based signal denoising as an example: Here, each entry of the denoised output signal always depends on a fixed computation rule applied to only a limited number of samples within the input signal, or in other words, on a subsignal of the input signal. The computation rule is completely agnostic with regard to the actual position; it is merely important that the input samples are drawn accordingly from the input signal.
Of course, modern systems apply more sophisticated techniques than mere filtering. However, recently an architecture essentially made up of simple filtering building blocks has displayed advantages over any other approach in a wide variety of practical applications. Due to significant advances in the design of massively parallel processors and the availability of huge annotated data sets, deep artificial neural networks, which learn desired behavior by adapting their degrees of freedom to concrete sample data rather than being programmed explicitly, have become the de facto state of the art in the domains of signal restoration and signal classification.
The most important architecture for analyzing signals that possess a spatial structure, such as images where pixels are arranged on a two-dimensional grid, was inspired by findings on the dynamics of mammalian visual cortex [1]: Convolutional Neural Networks (CNNs) [2, 3, 4] respect the weight-sharing principle, hence convolution with trainable filters becomes the actual workhorse for data processing. This principle greatly reduces the network’s degrees of freedom, making it less susceptible to overfitting, and it incorporates a strong prior with respect to the spatial layout of the input data. In fact, this particular architecture has proven highly successful both for image restoration tasks [5, 6, 7] and pattern recognition problems [8, 9, 10].
If a CNN trained for object categorization is evaluated at each feasible image position, it is possible to assign class membership estimations to all the pixels in an image, yielding a semantic segmentation of a scene [11, 12, 13]. This representation is much more powerful than what can be gained from a strict conventional object detection approach which solely outputs bounding boxes of found object instances. Instead, it facilitates applications such as automated biological or medical image analysis [11, 14, 15] and dense vehicle environment perception [16]. While the computational complexity of a sophisticated classification system used in conjunction with a sliding window approach may seem excessive at first glance, the weight-sharing principle of a CNN can be exploited so that intermediate computation results can be shared among adjacent image patches, resulting in a speedup of several orders of magnitude. Although this was already realized for CNNs without pooling layers more than two decades ago [17], approaches that also account for pooling layers emerged only recently [14, 18, 19].
The approach of Giusti et al. [14] achieves fast scanning of entire images through the introduction of a fragmentation data structure. Here, the internal representations of a CNN are decomposed using a spatial reordering operation after each pooling layer, allowing the evaluation of convolutions on contiguous signals at all times. The intermediate signals are however inhomogeneous with respect to their dimensionality, leaving the possibility for the use of efficient tensor convolution routines unclear. Li et al. [18], on the other hand, propose enlarging the filter banks of convolutional layers by inserting vanishing entries at regular locations. These sparse filter banks require a cumbersome re-engineering of efficient convolution implementations, which may not be able to achieve maximum throughput on modern massively parallel processors. Sermanet et al. [19] use the same processing pipeline for patches and entire images, which incurs relaxations with accuracy loss effects where the actual impact is hard to predict.
All these approaches have in common that it is not inherently clear what they actually compute or if the result is even the desired one. Instead of a rigorous mathematical proof of correctness, only toy examples are available, illustrating the implementation of these approaches. This situation is especially unsatisfactory if, instead of pure convenience functions, systems subject to safety considerations should be realized where precise statements rather than only an empirical evaluation are required.
The key contributions of this paper are (i) the development of an original theory on subsignal compatible transformations as an exact characterization of functions that fulfill the invariants required for a sound sliding window approach, (ii) the proposition of a method for dense signal scanning provably without any accuracy loss that yields significant speedups due to homogeneous data structures and elimination of redundant computations and special case treatment, and (iii) the demonstration of how CNNs interconnect with the theory and how they can be exactly transformed from subsignal-based application to signal-based application without any necessary adjustments to the computationally most demanding tensor convolution. To the best of the authors’ knowledge, this is the first work to support claims on the correctness of dense signal scanning with CNNs by mathematically rigorous statements. Due to the generality of the results, the theoretical framework developed herein can also serve as a basis for analyzing related and emerging methods for signal processing based on translation-invariant functions applied in a sliding fashion.
The remainder of this paper is structured as follows. Section II presents an introduction to the CNN structure, fixes the notation and introduces what is meant by subsignals. Section III establishes the basics of the theory on subsignal compatible transformations and shows how the building blocks of CNNs fit into the theory. In the following Sect. IV, the theory is extended to functions applied in a strided fashion, which is particularly important for pooling operators evaluated on non-overlapping blocks. Section V provides a theoretical computational complexity analysis. Practical considerations for image processing and the results of experiments on real parallel processors are discussed in Sect. VI. The paper is concluded with a discussion of the results in Sect. VII.
II Prerequisites
This section begins with an introduction to the building blocks of a CNN. Then the notation used throughout the paper is established. The section concludes with the definition of the subsignal extraction operator and statements on its properties.
II-A Convolutional Neural Networks
CNNs are organized in a number of specialized layers [4]. Each layer receives input data from its predecessor, processes it, and sends the result to the next layer. The network’s output is then the output of the final layer. The training process consists of tuning the network’s degrees of freedom until the network produces the desired output given concrete input sample data [20]. After a network has been trained, it can be used as a predictor on previously unseen data in regression or classification tasks.
The different specialized layer types are given as follows. Convolutional layers respect the weight-sharing principle: They convolve their input with a trainable filter bank and add a trainable scalar bias to form the layer output. These layers fall into the class of subsignal compatible transformations detailed in Sect. III; a mathematical analysis of the involved computations is given in Sect. III-C.
Fully-connected layers are a special case of convolutional layers in that they carry out a convolution with unit spatial filter size. Mathematical treatment of these layers is hence superseded by the analysis of convolutional layers.
Nonlinearity layers independently pass each sample of a signal through a scalar transfer function. This prevents the entire network from forming a purely linear system and hence enhances the network’s representational capacity. Since these operations are agnostic with respect to any spatial structure, an analysis is straightforward and handled in Sect. III-C.
Finally, pooling layers strengthen a network’s invariance to small translations of the input data by evaluation of a fixed pooling kernel followed by a downsampling operation. For brevity of the presentation, only functions applied to non-overlapping blocks are considered here. Pooling requires an extension of the plain theory of subsignal compatible transformations, provided in Sect. IV.
This paper proves that CNNs can be transformed from subsignal-based application to signal-based application by transforming strided function evaluation into sliding function evaluation and inserting special helper layers, namely fragmentation, defragmentation, stuffing and trimming. This transformation is completely lossless: both subsignal-based application and signal-based application lead to the same results. Even after the transformation, CNNs can be further fine-tuned with standard optimization methods. An example of this process is given in Sect. VI.
II-B Notation
For the sake of simplicity, the mathematical analysis is restricted to vector-shaped signals. The generalization to more complex signals such as images is straightforward through application of the theory to the two independent spatial dimensions of images. This is briefly discussed in Sect. VI.
$\mathbb{N}_1 := \{1, 2, 3, \dots\}$ represents the positive natural numbers. If $M$ is a set and $q \in \mathbb{N}_1$, then $M^q$ denotes the set of all $q$-tuples with entries from $M$. The elements of $M^q$ are called signals, their entries are called samples. If $\xi \in M^q$ is a signal and $I \in \{1, \dots, q\}^K$ is an index list with $K$ entries, the formal sum $\sum_{\nu=1}^{K} \xi_{I_\nu} e_\nu^K$ is used for the element $\zeta \in M^K$ with $\zeta_\nu = \xi_{I_\nu}$ for all $\nu \in \{1, \dots, K\}$. For example, when $M$ equals the set of real numbers $\mathbb{R}$ and hence $M^K$ is the $K$-dimensional Euclidean space, then the formal sum corresponds to the linear combination of canonical basis vectors $e_\nu^K$ weighted with selected coordinates of the signal $\xi$.
For $\xi \in M^q$, $\dim_M(\xi) = q$ represents the dimensionality of $\xi$. This does not need to correspond exactly with the concept of dimensionality in the sense of linear algebra. If, for example, the samples are categorical data with several features, then $M^q$ is not a vector space over $M$. The theory presented in this paper requires algebraic structures such as vector spaces or analytic structures such as the real numbers only for certain examples. The bulk of the results hold for signals with samples from arbitrary sets.
If $M$ is a set and $c$ is a positive natural number, then $\cup_c(M) := \bigcup_{q=c}^{\infty} M^q$ is written for the set that contains all the signals of dimensionality greater than or equal to $c$ with samples from $M$. For example, if $\xi \in \cup_c(M)$ then there is a natural number $q \geq c$ so that $\xi = (\xi_1, \dots, \xi_q)$ with $\xi_\nu \in M$ for all $\nu \in \{1, \dots, q\}$. Note that $\cup_1(M)$ contains all non-empty signals with samples from $M$.
II-C Division of a Signal into Subsignals
A subsignal is a contiguous list of samples contained in a larger signal. First, the concept of extracting subsignals with a fixed number of samples from a given signal is formalized:
Definition 1
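Let $M$ be an arbitrary set and let $d \in \mathbb{N}_1$ denote a fixed subsignal dimensionality. Then the function $\operatorname{Subsignal}_d\colon \bigcup_{D=d}^{\infty} \left( M^D \times \{1, \dots, D - d + 1\} \right) \to M^d$,
$$(\xi, i) \mapsto \sum_{\nu=1}^{d} \xi_{i+\nu-1} \, e_\nu^d = (\xi_i, \xi_{i+1}, \dots, \xi_{i+d-1}),$$
is called the subsignal extraction operator. Here, $\xi$ is the input signal and $i$ denotes the subsignal index.
It is straightforward to verify that $\operatorname{Subsignal}_d$ is well-defined and actually returns all $D - d + 1$ possible contiguous subsignals of length $d$ from a given signal with $D$ samples (see Fig. 1). Note that for application of this operator, it must always be ensured that the requested subsignal index is within bounds, that is $1 \leq i \leq D - d + 1$ must hold to address a valid subsignal.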
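The subsignal extraction operator of Definition 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `subsignal`, the use of tuples for signals, and the choice to raise exceptions on invalid arguments are not taken from the paper. The subsignal index is kept 1-based to match the text.

```python
def subsignal(xi, d, i):
    """Return the i-th contiguous subsignal of length d (i is 1-based)."""
    D = len(xi)
    if not 1 <= d <= D:
        raise ValueError("subsignal length must satisfy 1 <= d <= len(xi)")
    if not 1 <= i <= D - d + 1:
        raise IndexError("subsignal index out of bounds")
    # Shift to Python's 0-based indexing and take d contiguous samples.
    return tuple(xi[i - 1:i - 1 + d])


xi = (10, 20, 30, 40, 50)          # signal with D = 5 samples
d = 3                              # subsignal dimensionality
# A signal with D samples has exactly D - d + 1 subsignals of length d.
all_subsignals = [subsignal(xi, d, i) for i in range(1, len(xi) - d + 2)]
```

Enumerating all valid indices in this way is exactly the subsignal-based view that the remainder of the theory seeks to replace by a single signal-based evaluation.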
Iterated extraction of subsignals of different length can be collapsed into one operator evaluation:
Lemma 1
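Let $M$ be a set and further let $d, e \in \mathbb{N}_1$, $d \leq e$, be two subsignal dimensionalities. Then for all signals $\xi \in M^D$ with $D \geq e$, all $i \in \{1, \dots, D - e + 1\}$ and all $j \in \{1, \dots, e - d + 1\}$ it is $\operatorname{Subsignal}_d(\operatorname{Subsignal}_e(\xi, i), j) = \operatorname{Subsignal}_d(\xi, i + j - 1)$.
Proof
The subsignal indices of the left-hand side are well within bounds. Since $1 \leq i + j - 1 \leq (D - e + 1) + (e - d + 1) - 1 = D - d + 1$ this also holds for the right-hand side. Now
$$\operatorname{Subsignal}_d(\operatorname{Subsignal}_e(\xi, i), j) = \sum_{\nu=1}^{d} \operatorname{Subsignal}_e(\xi, i)_{j+\nu-1} \, e_\nu^d \overset{(\ast)}{=} \sum_{\nu=1}^{d} \xi_{i+j+\nu-2} \, e_\nu^d = \operatorname{Subsignal}_d(\xi, i + j - 1),$$
where in the $(\ast)$ step $\operatorname{Subsignal}_e(\xi, i)_\lambda = \xi_{i+\lambda-1}$ with $\lambda = j + \nu - 1$ has been substituted.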
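Lemma 1 can also be checked numerically by brute force. The following sketch, with the hypothetical helper names `subsignal` and `check_lemma1`, compares iterated extraction against a single extraction with shifted index for all admissible index pairs. It is a sanity check on one signal, not a substitute for the proof.

```python
def subsignal(xi, d, i):
    """1-based extraction of the i-th contiguous subsignal of length d."""
    if not 1 <= i <= len(xi) - d + 1:
        raise IndexError("subsignal index out of bounds")
    return tuple(xi[i - 1:i - 1 + d])


def check_lemma1(xi, d, e):
    """Compare iterated extraction with one extraction at index i + j - 1."""
    D = len(xi)
    for i in range(1, D - e + 2):          # outer index, length-e subsignal
        for j in range(1, e - d + 2):      # inner index, length-d subsignal
            iterated = subsignal(subsignal(xi, e, i), d, j)
            collapsed = subsignal(xi, d, i + j - 1)
            if iterated != collapsed:
                return False
    return True


lemma_holds = check_lemma1(tuple(range(100, 110)), d=3, e=6)
```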
III Subsignal Compatible Transformations
This section introduces the concept of subsignal compatible transformations. These are functions that can be applied to an entire signal at once and then yield the same result as if they had been applied to each subsignal independently. It is shown that functions applied in a sliding fashion can be characterized as subsignal compatible transformations, and that the composition of subsignal compatible transformations is again a subsignal compatible transformation.
At the end of this section, CNNs without pooling layers are considered and it is demonstrated that these satisfy the requirements of subsignal compatible transformations. As a consequence, such networks can be applied to the whole input signal at once without having to handle individual subsignals. CNNs that do contain pooling layers require more theoretical preparations and are discussed in detail in Sect. IV.
Now the primary definition of this section:
Definition 2
Let $M$ and $N$ be sets, let $c \in \mathbb{N}_1$ be a positive natural number, and let $T\colon \bigcup_{q=c}^{\infty} M^q \to \bigcup_{q=1}^{\infty} N^q$ be a function. $T$ is then called a subsignal compatible transformation with dimensionality reduction constant $c$ if and only if these two properties hold:
(i) Dimensionality reduction property (DRP): $\dim_N(T(\xi)) = \dim_M(\xi) - c + 1$ for all signals $\xi$ in the domain of $T$.
(ii) Exchange property (XP): For all subsignal dimensionalities $d \in \mathbb{N}_1$, $d \geq c$, it holds that $T(\operatorname{Subsignal}_d(\xi, i)) = \operatorname{Subsignal}_{d-c+1}(T(\xi), i)$ for all signals $\xi$ with $\dim_M(\xi) \geq d$ and all $i \in \{1, \dots, \dim_M(\xi) - d + 1\}$, where $\operatorname{Subsignal}_d$ denotes the subsignal extraction operator from Definition 1.
The first property guarantees that $T$ reduces the dimensionality of its argument always by the same amount regardless of the concrete input. The second property states that if $T$ is applied to an individual subsignal, then this is the same as applying $T$ to the entire signal and afterwards extracting the appropriate samples from the resulting signal. Therefore, if the outcome of subsignal-based application of $T$ is to be determined for all feasible subsignals, it suffices to carry out signal-based application of $T$ on the entire input signal once, which prevents redundant computations. These concepts are illustrated in Fig. 2.
Note that the exchange property is well-defined: The dimensionality reduction property guarantees that the dimensionalities on both sides of the equation match. Further, the subsignal index is within bounds on both sides. This is trivial for the left-hand side, and can be seen for the right-hand side since $\dim_N(T(\xi)) - (d - c + 1) + 1 = \dim_M(\xi) - d + 1$.
An identity theorem for subsignal compatible transformations immediately follows:
Theorem 1
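Let $M$ and $N$ be sets and $T_1, T_2\colon \bigcup_{q=c}^{\infty} M^q \to \bigcup_{q=1}^{\infty} N^q$ two subsignal compatible transformations with dimensionality reduction constant $c \in \mathbb{N}_1$. If $T_1(\xi) = T_2(\xi)$ holds for all $\xi \in M^c$, then already $T_1 = T_2$.
Proof
Let $\xi$ be a signal with $\dim_M(\xi) \geq c$. For $i \in \{1, \dots, \dim_M(\xi) - c + 1\}$, applying the precondition (PC) and the exchange property where the subsignal dimensionality is set to $c$ yields: $T_1(\xi)_i \overset{\text{XP}}{=} T_1(\operatorname{Subsignal}_c(\xi, i)) \overset{\text{PC}}{=} T_2(\operatorname{Subsignal}_c(\xi, i)) \overset{\text{XP}}{=} T_2(\xi)_i$, where signals with a single sample are identified with that sample. Hence all samples of the transformed signals match, thus $T_1(\xi) = T_2(\xi)$ for all $\xi$ in the domain of $T_1$ and $T_2$.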
III-A Relationship between Functions Applied in a Sliding Fashion and Subsignal Compatible Transformations
Turning now to functions applied to a signal in a sliding fashion, this notion is first made precise:
Definition 3
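Let $M$ and $N$ be sets, let $c \in \mathbb{N}_1$ be a positive natural number and let $f\colon M^c \to N$ be a function. Then $\operatorname{Slide}_f\colon \bigcup_{q=c}^{\infty} M^q \to \bigcup_{q=1}^{\infty} N^q$,
$$\xi \mapsto \sum_{i=1}^{\dim_M(\xi) - c + 1} f(\operatorname{Subsignal}_c(\xi, i)) \, e_i^{\dim_M(\xi) - c + 1},$$
is the operator that applies $f$ in a sliding fashion to all the subsignals of length $c$ of the input signal and stores the result in a contiguous signal, so that the $i$-th output sample is $f$ evaluated on the $i$-th subsignal. The sliding window is always advanced by exactly one entry after each evaluation of $f$.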
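As a minimal sketch of Definition 3, the sliding application of a function can be written as follows. The name `slide` and the tuple representation are assumptions made for illustration. The window function used in the example, a plain sum over three samples, is likewise just an example and does not appear in the paper.

```python
def slide(f, c, xi):
    """Apply f to every contiguous window of c samples, advancing by one."""
    D = len(xi)
    if D < c:
        raise ValueError("signal must contain at least c samples")
    # The output has D - c + 1 samples, one per window position.
    return tuple(f(tuple(xi[i:i + c])) for i in range(D - c + 1))


xi = (1, 2, 3, 4, 5)
moving_sum = slide(sum, 3, xi)     # windows (1,2,3), (2,3,4), (3,4,5)
```

The output length of `slide` already exhibits the dimensionality reduction property with constant `c`.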
The next result states that functions applied in a sliding fashion are essentially the same as subsignal compatible transformations, and that the exchange property could be weakened to hold only for the case where the dimensionality reduction constant equals the subsignal dimensionality:
Theorem 2
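Let $M$ and $N$ be sets, let $c \in \mathbb{N}_1$ and let $T\colon \bigcup_{q=c}^{\infty} M^q \to \bigcup_{q=1}^{\infty} N^q$ be a function. Then the following are equivalent:
(1) $T$ is a subsignal compatible transformation with dimensionality reduction constant $c$.
(2) $T$ fulfills the dimensionality reduction property, and for all signals $\xi$ in the domain of $T$ and all $i \in \{1, \dots, \dim_M(\xi) - c + 1\}$ it holds that $T(\operatorname{Subsignal}_c(\xi, i)) = T(\xi)_i$.
(3) There is a unique function $f\colon M^c \to N$ with $T = \operatorname{Slide}_f$, where $\operatorname{Slide}_f$ applies $f$ in a sliding fashion as in Definition 3.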
Proof
1 $\Rightarrow$ 2: Trivial, since the dimensionality reduction property is fulfilled by definition, and the claimed condition is only the special case of the exchange property where the subsignal dimensionality equals the dimensionality reduction constant.
2 3: For showing existence, define , . For it is due to the dimensionality reduction property, therefore is welldefined. Now let and define . It is clear that . Now let , then the precondition (PC) implies , hence .
Considering uniqueness, suppose that there exist functions with . Let be arbitrary, then Definition 3 gives , therefore on .
3 1: Suppose there is a function with . inherently fulfills the dimensionality reduction property. Let , , be an arbitrary subsignal dimensionality and let be a signal. Further, let be an arbitrary subsignal index. Remembering that and using Lemma 1 gives
thus the exchange property is satisfied as well.
Therefore, for each subsignal compatible transformation there is a unique function that generates the transformation. This yields a succinct characterization which helps in deciding whether a given transformation fulfills the dimensionality reduction property and the exchange property. It is further clear that subsignal compatible transformation evaluations themselves can be parallelized since there is no data dependency between individual samples of the outcome.
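The exchange property of a function applied in a sliding fashion can be verified empirically. The sketch below uses hypothetical helper names (`subsignal`, `slide`, `exchange_property_holds`) and an arbitrary example window function `spread`. It compares subsignal-based application with signal-based application followed by extraction, for every admissible subsignal dimensionality and index of one test signal.

```python
def subsignal(xi, d, i):
    """1-based extraction of the i-th contiguous subsignal of length d."""
    return tuple(xi[i - 1:i - 1 + d])


def slide(f, c, xi):
    """Apply f to all windows of length c, advancing by one sample."""
    return tuple(f(subsignal(xi, c, i)) for i in range(1, len(xi) - c + 2))


def exchange_property_holds(f, c, xi, d):
    """Check T(subsignal) == subsignal(T(signal)) for T = slide(f, c, .)."""
    full = slide(f, c, xi)                       # one signal-based pass
    for i in range(1, len(xi) - d + 2):
        patch_result = slide(f, c, subsignal(xi, d, i))
        if patch_result != subsignal(full, d - c + 1, i):
            return False
    return True


def spread(window):
    """Example window function: range of the samples in the window."""
    return max(window) - min(window)


xi = (4, 8, 15, 16, 23, 42, 7)
c = 3
xp_ok = all(exchange_property_holds(spread, c, xi, d)
            for d in range(c, len(xi) + 1))
```

The signal-based pass evaluates `spread` once per output sample, whereas the patch-wise loop re-evaluates it for overlapping windows. This redundancy is what the exchange property allows one to remove.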
Reconsidering Fig. 2 it is now obvious that the operator introduced there is no more than the quotient of two samples evaluated in a sliding fashion. It seems plausible from this example that convolution is also a subsignal compatible transformation. This is proven rigorously in Sect. III-C.
Before discussing more theoretical properties, first an example of a transformation that is not subsignal compatible:
Example 1
Let denote the integers and consider the function , , which fulfills the dimensionality reduction property with dimensionality reduction constant . The exchange property is, however, not satisfied: Let and , then yields , but it is . Since unless vanishes, cannot be a subsignal compatible transformation.
III-B Composition of Subsignal Compatible Transformations
The composition of subsignal compatible transformations is again a subsignal compatible transformation, where the dimensionality reduction constant has to be adjusted:
Theorem 3
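Let $M$, $N$ and $P$ be sets and let $c_1, c_2 \in \mathbb{N}_1$. Suppose $T_1\colon \bigcup_{q=c_1}^{\infty} M^q \to \bigcup_{q=1}^{\infty} N^q$ is a subsignal compatible transformation with dimensionality reduction constant $c_1$, and $T_2\colon \bigcup_{q=c_2}^{\infty} N^q \to \bigcup_{q=1}^{\infty} P^q$ is a subsignal compatible transformation with dimensionality reduction constant $c_2$.
Define $c := c_1 + c_2 - 1$. Then $T\colon \bigcup_{q=c}^{\infty} M^q \to \bigcup_{q=1}^{\infty} P^q$, $\xi \mapsto T_2(T_1(\xi))$, is a subsignal compatible transformation with dimensionality reduction constant $c$.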
Proof
Note first that since and , hence indeed . Let be arbitrary for demonstrating that is welldefined. As because of , this yields and hence is welldefined. Further, using the dimensionality reduction property of , therefore . Thus is welldefined, and so is .
For all , the dimensionality reduction property of and now implies , therefore fulfills the dimensionality reduction property.
Let , , be arbitrary, and let and . Since both and satisfy the exchange property, it follows that , where and hold during the two respective applications of the exchange property. Therefore, also fulfills the exchange property.
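The adjusted dimensionality reduction constant of a composition can be illustrated with two functions applied in a sliding fashion. In the sketch below, `slide`, `f1`, `f2`, `composed` and `generator` are hypothetical names, and the concrete window functions are arbitrary examples. The composed transformation is compared with the sliding application of a single generating function on windows of length `c1 + c2 - 1`, which is what Theorem 2 and Theorem 3 together suggest should exist.

```python
def slide(f, c, xi):
    """Apply f to all windows of length c, advancing by one sample."""
    return tuple(f(tuple(xi[i:i + c])) for i in range(len(xi) - c + 1))


c1, c2 = 3, 2
f1 = sum                                   # first stage: sum of 3 samples


def f2(window):                            # second stage: forward difference
    return window[1] - window[0]


c = c1 + c2 - 1                            # constant of the composition


def composed(xi):
    return slide(f2, c2, slide(f1, c1, xi))


def generator(window):
    """Single function on c samples that generates the composition."""
    return f2(slide(f1, c1, window))       # inner slide yields c2 samples


xi = (1, 4, 9, 16, 25, 36, 49)
out = composed(xi)
```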
This result can be generalized immediately to compositions of more than two subsignal compatible transformations:
Corollary 1
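Let $n \in \mathbb{N}_1$, $n \geq 2$, and let $M_1, \dots, M_{n+1}$ be sets. For each $\lambda \in \{1, \dots, n\}$ let $T_\lambda\colon \bigcup_{q=c_\lambda}^{\infty} M_\lambda^q \to \bigcup_{q=1}^{\infty} M_{\lambda+1}^q$ be a subsignal compatible transformation with dimensionality reduction constant $c_\lambda \in \mathbb{N}_1$. Then the composed function $T\colon \bigcup_{q=c}^{\infty} M_1^q \to \bigcup_{q=1}^{\infty} M_{n+1}^q$, $\xi \mapsto (T_n \circ \dots \circ T_1)(\xi)$, is a subsignal compatible transformation with dimensionality reduction constant $c := \sum_{\lambda=1}^{n} c_\lambda - n + 1$.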
Proof
Define , and for each let , , be a function. Since , the claim follows when it is shown with induction for that is a subsignal compatible transformation with dimensionality reduction constant . While the situation is trivial, the induction step follows with Theorem 3.
III-C CNNs without Pooling Layers
To conclude this section, a demonstration is provided of how CNNs without any pooling layers fit in the theory developed so far. Since pooling layers require a nontrivial extension of the theory, they are detailed in Sect. IV.
Convolutional layers are the most substantial ingredient of CNNs, since the trainable degrees of freedom which facilitate adaptation of the network to a specific task are located here. In these layers, multi-channel input feature maps are convolved channel-wise with adjustable filter banks; the result is accumulated and an adjustable bias is added to yield the output feature map.
First, the introduction of the indexing rules for iterated structures to account for the multichannel nature of the occurring signals. Let be a set, positive natural numbers and a multichannel signal. It is then for indices , and moreover for indices and . This rule is extended naturally to sets written explicitly as products with more than two factors. Therefore, if for another number , then for example for indices and .
These rules become clearer if the multichannel convolution operation is considered. Suppose the samples are members of a ring , denotes the number of input channels, is the number of output channels, and equals the number of samples considered at any one time during convolution with the filter bank, or in other words the receptive field size of the convolutional layer. Then input signals or feature maps with samples have form , and filter banks can be represented by a tensor . Here must hold, that is the filter kernel should be smaller than the input signal.
The output feature map is then
for indices . Note that and , so that the result of their product is understood here as scalar product. The operation is welldefined since , which follows immediately through substitution of the extreme values of and .
This multichannel convolution operation is indeed a subsignal compatible transformation as shown explicitly here:
Example 2
Define and and consider ,
Since it is , hence is welldefined. For all and any follows
where was substituted in the () step. The multichannel convolution operation as defined above is hence in fact the application of in a sliding fashion. Therefore, Theorem 2 guarantees that is a subsignal compatible transformation with dimensionality reduction constant .
Since fully-connected layers are merely a special case of convolutional layers, these do not need any special treatment here. Addition of biases does not require any knowledge on the spatial structure of the convolution’s result and is therefore a trivial subsignal compatible transformation with dimensionality reduction constant $1$. Nonlinearity layers are nothing but the application of a scalar-valued function to all the samples of an input signal. Hence these layers also form subsignal compatible transformations with dimensionality reduction constant $1$ due to Theorem 2.
Furthermore, compositions of these operations can also be understood as subsignal compatible transformations with Corollary 1. As a consequence, the exchange property facilitates application of CNNs without pooling layers to an entire signal at once instead of each subsignal independently, all without incurring any accuracy loss. The next section will extend this result to CNNs that may also feature pooling layers.
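The statements of this section can be illustrated with a tiny network consisting of a multi-channel convolution, a bias and a nonlinearity, each realized as a function applied in a sliding fashion. This is a sketch under several assumptions: all names (`slide`, `make_conv`, `add_bias`, `relu`, `network`) are hypothetical, the weight layout `weights[o][t][ch]` is chosen for convenience, the filter is applied without flipping, and the rectifier is merely one possible transfer function.

```python
def slide(f, c, xi):
    """Apply f to all windows of length c, advancing by one sample."""
    return tuple(f(tuple(xi[i:i + c])) for i in range(len(xi) - c + 1))


def make_conv(weights):
    """weights[o][t][ch]: output channel o, filter tap t, input channel ch."""
    def f_conv(window):
        # Accumulate over all taps and input channels per output channel.
        return tuple(
            sum(w_tap[ch] * sample[ch]
                for w_tap, sample in zip(w_out, window)
                for ch in range(len(sample)))
            for w_out in weights)
    return f_conv


def add_bias(bias):
    # Window length 1: operates on a single multi-channel sample.
    return lambda window: tuple(v + b for v, b in zip(window[0], bias))


def relu(window):
    return tuple(max(v, 0) for v in window[0])


def network(xi, weights, bias, c):
    """Convolution (constant c), then bias and nonlinearity (constant 1)."""
    return slide(relu, 1, slide(add_bias(bias), 1,
                                slide(make_conv(weights), c, xi)))


xi = ((1, 0), (2, 1), (3, 0), (4, 1))      # 4 samples, 2 input channels
weights = (((1, 0), (0, 1)),)              # 1 output channel, 2 taps
bias = (-3,)
signal_based = network(xi, weights, bias, c=2)
# Subsignal-based application: evaluate the network on each patch alone.
patch_based = tuple(network(xi[i:i + 2], weights, bias, c=2)[0]
                    for i in range(len(xi) - 1))
```

Both evaluation modes produce identical outputs here, but the signal-based variant computes each convolution output exactly once.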
IV Pooling Layers and Functions Applied in a Strided Fashion
So far it has been shown how convolutional layers and nonlinearity layers of a CNN fit in the theoretical framework of subsignal compatible transformations. This section analyzes pooling layers which apply a pooling kernel to non-overlapping blocks of the input signal. This is equivalent to a function applied in a sliding fashion followed by a downsampling operation, which will here be referred to as the application of a function in a strided fashion.
The theory developed herein can of course also be applied to other functions than the pooling kernels encountered in ordinary CNNs. For example, multi-channel convolution in which the filter bank is advanced by the receptive field size is essentially the convolution function from Sect. III-C applied in a strided fashion. Application of convolution where the filter banks are advanced by more than one sample has however no benefit in terms of execution speed for signal-based application. This is discussed at the end of Sect. IV-B after having developed sufficient theory to analyze this notion.
This section demonstrates how these functions can be turned into efficiently computable subsignal compatible transformations using a data structure recently introduced as fragmentation by Giusti et al. [14]. Here, that proposed method is generalized and rigorously proven correct. As an added benefit of these results, the dynamics of the entire signal processing chain can also be accurately described, including the possibility of tracking down the position of each processed subsignal in the fragmentation data structure.
Moreover, the circumstances under which the fragment dimensionalities are guaranteed to always be homogeneous are analyzed. This is a desirable property as it facilitates the application of subsequent operations to signals which all have the same number of samples, rendering cumbersome handling of special cases obsolete and thus resulting in accelerated execution on massively parallel processors. For CNNs this means that conventional tensor convolutions can be used without any modifications whatsoever, which is especially beneficial if a highly optimized implementation is readily available.
First, a more precise statement on what the application of a function in a strided fashion means (see Fig. 3 for orientation):
Definition 4
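Let $M$ and $N$ be sets, let $k \in \mathbb{N}_1$ be a positive natural number and let $g\colon M^k \to N$ be a function. Then $\operatorname{Stride}_g\colon \bigcup_{q=1}^{\infty} M^{kq} \to \bigcup_{q=1}^{\infty} N^q$,
$$\xi \mapsto \sum_{i=1}^{\dim_M(\xi)/k} g(\operatorname{Subsignal}_k(\xi, k(i-1)+1)) \, e_i^{\dim_M(\xi)/k},$$
is the operator that applies $g$ in a strided fashion to signals where the number of samples is a multiple of $k$. The subsignal indices are chosen here so that all non-overlapping subsignals are fed through $g$, starting with the first valid subsignal.
Since it is $1 \leq k(i-1)+1 \leq \dim_M(\xi) - k + 1$ for all $i \in \{1, \dots, \dim_M(\xi)/k\}$, $\operatorname{Stride}_g$ is well-defined. Further, $\dim_N(\operatorname{Stride}_g(\xi)) = \dim_M(\xi)/k$ for all $\xi$ in the domain of $\operatorname{Stride}_g$. Since the input dimensionality is reduced here through division with a natural number rather than a subtraction, the dimensionality reduction property cannot be fulfilled unless $k = 1$. The situation in which $k = 1$ is, however, not particularly interesting since then $\operatorname{Stride}_g = \operatorname{Slide}_g$, which was already handled in Sect. III.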
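A short sketch may help to contrast strided with sliding application. The names `slide` and `stride` are assumptions for illustration. The example also checks the remark from the beginning of this section that strided application equals sliding application followed by downsampling, here by keeping every k-th sample.

```python
def slide(g, k, xi):
    """Apply g to all windows of length k, advancing by one sample."""
    return tuple(g(tuple(xi[i:i + k])) for i in range(len(xi) - k + 1))


def stride(g, k, xi):
    """Apply g to the non-overlapping blocks of k samples."""
    if len(xi) % k != 0:
        raise ValueError("number of samples must be a multiple of k")
    return tuple(g(tuple(xi[s:s + k])) for s in range(0, len(xi), k))


xi = (3, 1, 4, 1, 5, 9)
k = 2
strided = stride(max, k, xi)               # blocks (3,1), (4,1), (5,9)
downsampled = slide(max, k, xi)[::k]       # sliding, then keep every k-th
```

Note that the output length is `len(xi) // k` rather than `len(xi) - k + 1`, which is why the dimensionality reduction property fails for `k > 1`.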
Before continuing with fragmentation, first consider multichannel pooling kernels commonly encountered in CNNs:
Example 3
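Assume the goal is to process real-valued signals with $m \in \mathbb{N}_1$ channels, that is $M = N = \mathbb{R}^m$, where each channel should be processed independently of the others, and $k \in \mathbb{N}_1$ adjacent samples should be compressed into one output sample. Average pooling is then realized by the pooling kernel $g_{\mathrm{avg}}\colon M^k \to N$, $\xi \mapsto \frac{1}{k} \sum_{\nu=1}^{k} \xi_\nu$, which determines the channel-wise empirical mean value of the samples. Another example is max-pooling, where the maximum entry in each channel should be determined. This can be achieved with the pooling kernel $g_{\max}\colon M^k \to N$, $\xi \mapsto \left( \max_{\nu \in \{1, \dots, k\}} \xi_{\nu, \lambda} \right)_{\lambda = 1}^{m}$, where $\xi_{\nu, \lambda}$ denotes channel $\lambda$ of sample $\nu$.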
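The two pooling kernels of Example 3 might be realized as follows. The names `avg_pool`, `max_pool` and `stride` are hypothetical, and samples are represented as tuples of channel values, which is an assumption about the data layout rather than something prescribed by the paper.

```python
def stride(g, k, xi):
    """Apply g to the non-overlapping blocks of k samples."""
    if len(xi) % k != 0:
        raise ValueError("number of samples must be a multiple of k")
    return tuple(g(tuple(xi[s:s + k])) for s in range(0, len(xi), k))


def avg_pool(window):
    """Channel-wise empirical mean over the samples in the window."""
    k = len(window)
    return tuple(sum(channel) / k for channel in zip(*window))


def max_pool(window):
    """Channel-wise maximum over the samples in the window."""
    return tuple(max(channel) for channel in zip(*window))


# Four samples with two channels each, pooled in blocks of k = 2.
xi = ((1.0, 8.0), (3.0, 2.0), (5.0, 0.0), (7.0, 4.0))
pooled_avg = stride(avg_pool, 2, xi)
pooled_max = stride(max_pool, 2, xi)
```

Here `zip(*window)` regroups the window by channel, so that each channel is processed independently of the others.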
IV-A Fragmentation
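The fragmentation operator [14] performs a spatial reordering operation. Its precise analysis requires a recap of some elementary number theory. For all numbers $a \in \mathbb{N}$ and $b \in \mathbb{N}_1$, Euclidean division guarantees that there are unique numbers $\operatorname{div}(a, b) \in \mathbb{N}$ and $\operatorname{rem}(a, b) \in \{0, \dots, b - 1\}$ so that $a = \operatorname{div}(a, b) \cdot b + \operatorname{rem}(a, b)$. Here is a small collection of results on these operators for further reference: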
Proposition 1
It is and for all . Moreover, and for all and .
If the fragmentation operator is applied to a signal, it puts certain samples into individual fragments which can be grasped as signals themselves. If a collection of fragments is fragmented further, a larger collection of fragments results. The total number of samples is, however, left unchanged after these operations. For the sake of convenience, matrices are used here as concrete data structure for fragmented signals, where columns correspond to fragments and rows correspond to signal samples.
First, some notation needs to be defined. If is a set and , then denotes the set of all matrices with rows and columns with entries from . In the present context, this represents a collection of fragments where each signal has samples. For , and denote the number of rows and columns, respectively. Furthermore, is the entry in the th row and th column of where and . The transpose of is written as .
The vectorization operator [21] stacks all the columns of a matrix on top of another:
Definition 5
Let be a set and . The vectorization operator is characterized by for all indices and all matrices . The inverse vectorization operator is given by for all indices , and all vectors .
It can be verified directly that these two operators are well-defined permutations and inversely related to one another. With their help the fragmentation operator may now be defined:
Definition 6
Let be a set and . For arbitrary vector dimensionalities and numbers of input fragments the function ,
is called the fragmentation operator.
Here, equals the corresponding parameter from the application of a function in a strided fashion. is clearly welldefined, and the number of output fragments is . Next consider this operator that undoes the ordering of the fragmentation operator:
Definition 7
Let be a set, let , and let denote a vector dimensionality and a number of output fragments. Then ,
is called the defragmentation operator.
Note that is welldefined and the number of input fragments must equal . Fragmentation and defragmentation are inversely related, that is and . An illustration of the operations performed during fragmentation and defragmentation is depicted in Fig. 4.
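The idea behind fragmentation can be sketched for the special case of a single input fragment. This is not the matrix-valued operator of Definition 6: the names `fragment` and `defragment`, the representation of fragments as a list of tuples, and in particular the ordering of the output fragments are assumptions made for illustration. The example shows why the reordering is useful: after a pooling kernel has been applied in a sliding fashion, each fragment holds the result of strided application to the input shifted by a fixed offset.

```python
def slide(g, k, xi):
    """Apply g to all windows of length k, advancing by one sample."""
    return tuple(g(tuple(xi[i:i + k])) for i in range(len(xi) - k + 1))


def stride(g, k, xi):
    """Apply g to the non-overlapping blocks of k samples."""
    return tuple(g(tuple(xi[s:s + k])) for s in range(0, len(xi), k))


def fragment(xi, k):
    """Split one signal into k fragments; fragment j takes every k-th
    sample starting at offset j (assumed ordering)."""
    if len(xi) % k != 0:
        raise ValueError("number of samples must be a multiple of k")
    return [tuple(xi[j::k]) for j in range(k)]


def defragment(fragments):
    """Undo `fragment` by interleaving the fragments sample by sample."""
    return tuple(s for row in zip(*fragments) for s in row)


xi = (3, 1, 4, 1, 5, 9, 2)
k = 2
dense = slide(max, k, xi)                  # sliding max over 2 samples
frags = fragment(dense, k)
# Fragment j equals strided pooling of the input shifted by j samples.
shifted = [stride(max, k, xi[j:j + k * len(frags[j])]) for j in range(k)]
```

The total number of samples is unchanged by `fragment`, and all fragments have the same length whenever the input length is a multiple of `k`.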
Fragmentation is merely a certain reordering operation:
Lemma 2
Suppose that is a set, and . Then and . Further, for all indices and .
Proof
The dimensionality statements are obvious by the definition of . To prove the identity, let and . One yields
and the claim follows.
Similar properties are fulfilled by defragmentation:
Lemma 3
Let be a set. Let be positive natural numbers and let be an arbitrary fragmented signal. Then , , and for all indices , .
Proof
Completely analogous to Lemma 2.
As already outlined in Fig. 4, compositions of the fragmentation operator are equivalent to a single fragmentation with an adjusted parameterization:
Remark 1
Let be a set and . Then for all .
Proof
The claim follows through entrywise comparison between and using Lemma 2.
It follows immediately that fragmentation is a commutative operation:
Remark 2
If denotes a set, are natural numbers and is a fragmented signal, then .
Proof
Obvious with Remark 1 as multiplication in is commutative.
IV-B Relationship between Fragmentation, Functions Applied in a Strided Fashion and Subsignal Compatible Transformations
A bit more background is necessary before analyzing how functions applied in a strided fashion fit into the theory of subsignal compatible transformations. The outcome of a subsignal compatible transformation applied to a fragmented signal is defined naturally:
Definition 8
Let be sets and a subsignal compatible transformation with dimensionality reduction constant . Let be a fragmented signal with samples in each of the fragments, where holds. For let denote the individual fragments. The output of applied to is then defined as , that is is applied to all the fragments independently.
Since there is no data dependency between fragments, parallelization of subsignal compatible transformation evaluation over all output samples is straightforward. What follows is the formal introduction of the processing chain concept which captures and generalizes all the dynamics of a CNN, and two notions of its application to signal processing:
Definition 9
The collection of the following objects is called a processing chain: A fixed subsignal dimensionality , a number of layers , a sequence of sets , and for each subsignal compatible transformations with dimensionality reduction constant and functions where . The numbers for are called the stride products of the processing chain. This implies that . For , the operator ,
applies the processing chain in a strided fashion, and further ,
is the operator that applies the processing chain in a sliding fashion. Note that these two functions are not well-defined unless additional conditions are fulfilled, detailed below.