We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
The bread and butter of neural networks is affine transformations: a vector is received as input and is multiplied with a matrix to produce an output (to which a bias vector is usually added before passing the result through a nonlinearity). This is applicable to any type of input, be it an image, a sound clip or an unordered collection of features: whatever their dimensionality, their representation can always be flattened into a vector before the transformation.
Images, sound clips and many other similar kinds of data have an intrinsic structure. More formally, they share these important properties:
They are stored as multi-dimensional arrays.
They feature one or more axes for which ordering matters (e.g., width and height axes for an image, time axis for a sound clip).
One axis, called the channel axis, is used to access different views of the data (e.g., the red, green and blue channels of a color image, or the left and right channels of a stereo audio track).
These properties are not exploited when an affine transformation is applied; in fact, all the axes are treated in the same way and the topological information is not taken into account. Still, taking advantage of the implicit structure of the data may prove very handy in solving some tasks, like computer vision and speech recognition, and in these cases it would be best to preserve it. This is where discrete convolutions come into play.
A discrete convolution is a linear transformation that preserves this notion of ordering. It is sparse (only a few input units contribute to a given output unit) and reuses parameters (the same weights are applied to multiple locations in the input).
Figure 1.1 provides an example of a discrete convolution. The light blue grid is called the input feature map. To keep the drawing simple, a single input feature map is represented, but it is not uncommon to have multiple feature maps stacked one onto another (an example of this is what was referred to earlier as channels for images and sound clips). A kernel (shaded area) slides across the input feature map. At each location, the product between each element of the kernel and the input element it overlaps is computed and the results are summed up to obtain the output in the current location. The procedure can be repeated using different kernels to form as many output feature maps as desired (Figure 1.3). The final outputs of this procedure are called output feature maps. (While there is a distinction between convolution and cross-correlation from a signal processing perspective, the two become interchangeable when the kernel is learned. For the sake of simplicity and to stay consistent with most of the machine learning literature, the term convolution will be used in this guide.) If there are multiple input feature maps, the kernel will have to be 3-dimensional (or, equivalently, each one of the feature maps will be convolved with a distinct kernel), and the resulting feature maps will be summed up elementwise to produce the output feature map.
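The sliding-and-summing procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration for a single input feature map, unit strides and no padding; the function name `conv2d` and the example values are assumptions made for this sketch, not taken from the figures.

```python
def conv2d(x, w):
    """Slide kernel w over input x (single feature map, unit strides, no padding).

    As in the text, "convolution" here means cross-correlation:
    the kernel is not flipped before it is applied.
    """
    kh, kw = len(w), len(w[0])
    oh, ow = len(x) - kh + 1, len(x[0]) - kw + 1
    out = [[0] * ow for _ in range(oh)]
    for r in range(oh):
        for c in range(ow):
            # Multiply each kernel element with the input element it overlaps,
            # then sum the products to get the output at this location.
            out[r][c] = sum(
                w[a][b] * x[r + a][c + b]
                for a in range(kh)
                for b in range(kw)
            )
    return out


# Illustrative values only.
x = [[1, 2, 3, 0],
     [0, 1, 2, 3],
     [3, 0, 1, 2],
     [2, 3, 0, 1]]
w = [[1, 0, 0],
     [0, 1, 0],
     [0, 0, 1]]
y = conv2d(x, w)
print(y)  # [[3, 6], [0, 3]]
```

Each output unit depends only on the 3 x 3 patch under the kernel (sparsity), and the same nine weights are used at every location (parameter reuse).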
The convolution depicted in Figure 1.1 is an instance of a 2-D convolution, but it can be generalized to N-D convolutions. For instance, in a 3-D convolution, the kernel would be a cuboid and would slide across the height, width and depth of the input feature map.
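The collection of kernels defining a discrete convolution has a shape corresponding to some permutation of $(n, m, k_1, \ldots, k_N)$, where $n$ is the number of output feature maps, $m$ is the number of input feature maps, and $k_j$ is the kernel size along axis $j$.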
The following properties affect the output size $o_j$ of a convolutional layer along axis $j$:
$i_j$: input size along axis $j$,
$k_j$: kernel size along axis $j$,
$s_j$: stride (distance between two consecutive positions of the kernel) along axis $j$,
$p_j$: zero padding (number of zeros concatenated at the beginning and at the end of an axis) along axis $j$.
For instance, Figure 1.2 shows a $3 \times 3$ kernel applied to a $5 \times 5$ input padded with a $1 \times 1$ border of zeros using $2 \times 2$ strides.
Note that strides constitute a form of subsampling. As an alternative to being interpreted as a measure of how much the kernel is translated, strides can also be viewed as how much of the output is retained. For instance, moving the kernel by hops of two is equivalent to moving the kernel by hops of one but retaining only odd output elements (Figure 1.4).
In addition to discrete convolutions themselves, pooling operations make up another important building block in CNNs. Pooling operations reduce the size of feature maps by using some function to summarize subregions, such as taking the average or the maximum value.
Pooling works by sliding a window across the input and feeding the content of the window to a pooling function. In some sense, pooling works very much like a discrete convolution, but replaces the linear combination described by the kernel with some other function. Figure 1.5 provides an example for average pooling, and Figure 1.6 does the same for max pooling.
The following properties affect the output size $o_j$ of a pooling layer along axis $j$:
$i_j$: input size along axis $j$,
$k_j$: pooling window size along axis $j$,
$s_j$: stride (distance between two consecutive positions of the pooling window) along axis $j$.
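A minimal sketch of pooling as "a sliding window fed to a function" is given below. The helper names `pool2d` and `mean` are hypothetical, the window is assumed square, and the output size is computed by counting window placements, in the same way as for a non-padded convolution.

```python
def pool2d(x, size, stride, fn):
    """Apply pooling function fn to every size x size window of x."""
    # Number of window placements along each axis (no padding assumed).
    oh = (len(x) - size) // stride + 1
    ow = (len(x[0]) - size) // stride + 1
    return [
        [
            fn([x[r * stride + a][c * stride + b]
                for a in range(size)
                for b in range(size)])
            for c in range(ow)
        ]
        for r in range(oh)
    ]


def mean(values):
    return sum(values) / len(values)


# Illustrative values only.
x = [[1, 3, 2, 4],
     [5, 7, 6, 8],
     [9, 11, 10, 12],
     [13, 15, 14, 16]]
max_pooled = pool2d(x, size=2, stride=2, fn=max)
avg_pooled = pool2d(x, size=2, stride=2, fn=mean)
print(max_pooled)  # [[7, 8], [15, 16]]
print(avg_pooled)  # [[4.0, 5.0], [12.0, 13.0]]
```

Swapping `fn` is the only difference between max and average pooling here, which mirrors the remark that pooling replaces the kernel's linear combination with some other function.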
The simplest case to analyze is when the kernel just slides across every position of the input (i.e., $s = 1$ and $p = 0$). Figure 2.1 provides an example for $i = 4$ and $k = 3$.
One way of defining the output size in this case is by the number of possible placements of the kernel on the input. Let’s consider the width axis: the kernel starts on the leftmost part of the input feature map and slides by steps of one until it touches the right side of the input. The size of the output will be equal to the number of steps made, plus one, accounting for the initial position of the kernel (see panel (a)). The same logic applies for the height axis.
More formally, the following relationship can be inferred:
Relationship 1. For any $i$ and $k$, and for $s = 1$ and $p = 0$,
$$o = (i - k) + 1.$$
To factor in zero padding (i.e., only restricting to $s = 1$), let’s consider its effect on the effective input size: padding with $p$ zeros changes the effective input size from $i$ to $i + 2p$. In the general case, Relationship 1 can then be used to infer the following relationship:
Relationship 2. For any $i$, $k$ and $p$, and for $s = 1$,
$$o = (i - k) + 2p + 1.$$
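As a quick worked instance of the padded, unit-stride case, take an input size $i = 5$, a kernel size $k = 4$ and a zero padding $p = 2$ (values chosen here purely for the arithmetic). Counting kernel placements on the effective input of size $i + 2p = 9$ gives the output size $o$:

```latex
o = (i - k) + 2p + 1 = (5 - 4) + 2 \cdot 2 + 1 = 6
```

The output is larger than the input in this instance because the padding adds more positions than the kernel width removes.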
Figure 2.2 provides an example for $i = 5$, $k = 4$ and $p = 2$.
In practice, two specific instances of zero padding are used quite extensively because of their respective properties. Let’s discuss them in more detail.
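Having the output size be the same as the input size (i.e., $o = i$) can be a desirable property:
Relationship 3. For any $i$ and for $k$ odd ($k = 2n + 1$, $n \in \mathbb{N}$), $s = 1$ and $p = \lfloor k/2 \rfloor = n$,
$$o = i + 2\lfloor k/2 \rfloor - (k - 1) = i + 2n - 2n = i.$$
This is sometimes referred to as half (or same) padding. Figure 2.3 provides an example for $i = 5$, $k = 3$ and (therefore) $p = 1$.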
While convolving a kernel generally decreases the output size with respect to the input size, sometimes the opposite is required. This can be achieved with proper zero padding:
Relationship 4. For any $i$ and $k$, and for $p = k - 1$ and $s = 1$,
$$o = i + 2(k - 1) - (k - 1) = i + (k - 1).$$
This is sometimes referred to as full padding, because in this setting every possible partial or complete superimposition of the kernel on the input feature map is taken into account. Figure 2.4 provides an example for $i = 5$, $k = 3$ and (therefore) $p = 2$.
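The two special paddings can be checked numerically with a short sketch. The function name `unit_stride_output_size` is hypothetical, and the input and kernel sizes are illustrative; the formula is the unit-stride relationship $o = (i - k) + 2p + 1$.

```python
def unit_stride_output_size(i, k, p):
    """Output size for a stride-1 convolution: o = (i - k) + 2p + 1."""
    return (i - k) + 2 * p + 1


i, k = 5, 3  # illustrative sizes, k odd

# Half (same) padding: p = floor(k / 2) keeps the output size equal to i.
half = unit_stride_output_size(i, k, p=k // 2)

# Full padding: p = k - 1 grows the output to i + (k - 1).
full = unit_stride_output_size(i, k, p=k - 1)

print(half, full)  # 5 7
```

Half padding only preserves the size exactly when the kernel size is odd, which is why the relationship is stated for odd $k$.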
All relationships derived so far only apply for unit-strided convolutions. Incorporating non-unit strides requires another inference leap. To facilitate the analysis, let’s momentarily ignore zero padding (i.e., $s > 1$ and $p = 0$). Figure 2.5 provides an example for $i = 5$, $k = 3$ and $s = 2$.
Once again, the output size can be defined in terms of the number of possible placements of the kernel on the input. Let’s consider the width axis: the kernel starts as usual on the leftmost part of the input, but this time it slides by steps of size $s$ until it touches the right side of the input. The size of the output is again equal to the number of steps made, plus one, accounting for the initial position of the kernel (see panel (b)). The same logic applies for the height axis.
From this, the following relationship can be inferred:
Relationship 5. For any $i$, $k$ and $s$, and for $p = 0$,
$$o = \left\lfloor \frac{i - k}{s} \right\rfloor + 1.$$
The floor function accounts for the fact that sometimes the last possible step does not coincide with the kernel reaching the end of the input, i.e., some input units are left out (see Figure 2.7 for an example of such a case).
The most general case (convolving over a zero padded input using non-unit strides) can be derived by applying Relationship 5 on an effective input of size $i + 2p$, in analogy to what was done for Relationship 2:
Relationship 6. For any $i$, $k$, $p$ and $s$,
$$o = \left\lfloor \frac{i + 2p - k}{s} \right\rfloor + 1.$$
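The general relationship $o = \lfloor (i + 2p - k)/s \rfloor + 1$ can be cross-checked against a literal count of kernel placements. The sketch below assumes a single axis with input size `i`, kernel size `k`, stride `s` and padding `p`; both function names are hypothetical.

```python
def conv_output_size(i, k, s=1, p=0):
    """Closed form: o = floor((i + 2p - k) / s) + 1."""
    return (i + 2 * p - k) // s + 1


def count_placements(i, k, s=1, p=0):
    """Count kernel positions by sliding over the zero padded axis."""
    padded = i + 2 * p
    start, count = 0, 0
    while start + k <= padded:  # the kernel must fit entirely
        count += 1
        start += s
    return count


# The two agree whenever the kernel fits in the input (k <= i).
agree = all(
    conv_output_size(i, k, s, p) == count_placements(i, k, s, p)
    for i in range(1, 12)
    for k in range(1, i + 1)
    for s in range(1, 4)
    for p in range(0, 3)
)
print(agree)                          # True
print(conv_output_size(5, 3, s=2, p=1))  # 3
```

Integer division plays the role of the floor function, which is valid here because the numerator is non-negative in the range tested.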
As before, the floor function means that in some cases a convolution will produce the same output size for multiple input sizes. More specifically, if $i + 2p - k$ is a multiple of $s$, then any input size $j = i + a$, $a \in \{0, \ldots, s - 1\}$ will produce the same output size. Note that this ambiguity applies only for $s > 1$.
Figure 2.6 shows an example with $i = 5$, $k = 3$, $s = 2$ and $p = 1$, while Figure 2.7 provides an example for $i = 6$, $k = 3$, $s = 2$ and $p = 1$. Interestingly, despite having different input sizes these convolutions share the same output size. While this doesn’t affect the analysis for convolutions, this will complicate the analysis in the case of transposed convolutions.
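The ambiguity introduced by the floor function is easy to see by tabulating output sizes over a range of input sizes. This sketch assumes a kernel size of 3, a stride of 2 and a padding of 1 (illustrative values) and uses a hypothetical helper `conv_output_size` implementing $o = \lfloor (i + 2p - k)/s \rfloor + 1$.

```python
def conv_output_size(i, k, s, p):
    """o = floor((i + 2p - k) / s) + 1."""
    return (i + 2 * p - k) // s + 1


k, s, p = 3, 2, 1
sizes = {i: conv_output_size(i, k, s, p) for i in range(3, 9)}
print(sizes)  # {3: 2, 4: 2, 5: 3, 6: 3, 7: 4, 8: 4}
```

With a stride of 2, consecutive pairs of input sizes collapse onto the same output size, so the output size alone does not determine the input size.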
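Take for example the convolution represented in Figure 2.1. If the input and output were to be unrolled into vectors from left to right, top to bottom, the convolution could be represented as a sparse matrix $\mathbf{C}$ where the non-zero elements are the elements $w_{i,j}$ of the kernel (with $i$ and $j$ being the row and column of the kernel respectively):
$$\mathbf{C} = \left(\begin{array}{cccccccccccccccc}
w_{0,0} & w_{0,1} & w_{0,2} & 0 & w_{1,0} & w_{1,1} & w_{1,2} & 0 & w_{2,0} & w_{2,1} & w_{2,2} & 0 & 0 & 0 & 0 & 0 \\
0 & w_{0,0} & w_{0,1} & w_{0,2} & 0 & w_{1,0} & w_{1,1} & w_{1,2} & 0 & w_{2,0} & w_{2,1} & w_{2,2} & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & w_{0,0} & w_{0,1} & w_{0,2} & 0 & w_{1,0} & w_{1,1} & w_{1,2} & 0 & w_{2,0} & w_{2,1} & w_{2,2} & 0 \\
0 & 0 & 0 & 0 & 0 & w_{0,0} & w_{0,1} & w_{0,2} & 0 & w_{1,0} & w_{1,1} & w_{1,2} & 0 & w_{2,0} & w_{2,1} & w_{2,2}
\end{array}\right)$$
This linear operation takes the input matrix flattened as a 16-dimensional vector and produces a 4-dimensional vector that is later reshaped as the $2 \times 2$ output matrix.
Using this representation, the backward pass is easily obtained by transposing $\mathbf{C}$; in other words, the error is backpropagated by multiplying the loss with $\mathbf{C}^T$. This operation takes a 4-dimensional vector as input and produces a 16-dimensional vector as output, and its connectivity pattern is compatible with $\mathbf{C}$ by construction.
Notably, the kernel $\mathbf{w}$ defines both the matrices $\mathbf{C}$ and $\mathbf{C}^T$ used for the forward and backward passes.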
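The unrolling described above can be reproduced with a short sketch that builds the sparse matrix for a square kernel on a square input (unit strides, no padding) and applies it and its transpose. The helper names `conv_matrix`, `matvec` and `transpose`, and the kernel values, are assumptions made for illustration.

```python
def conv_matrix(w, i):
    """Unroll a k x k kernel w into the matrix of a convolution on an i x i input.

    Rows index flattened output units, columns index flattened input units
    (left to right, top to bottom). Unit strides and no padding are assumed.
    """
    k = len(w)
    o = i - k + 1
    C = [[0] * (i * i) for _ in range(o * o)]
    for r in range(o):
        for c in range(o):
            for a in range(k):
                for b in range(k):
                    C[r * o + c][(r + a) * i + (c + b)] = w[a][b]
    return C


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


w = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]   # illustrative kernel values
C = conv_matrix(w, i=4)                  # 4 rows, 16 columns
x = list(range(16))                      # flattened 4 x 4 input
y = matvec(C, x)                         # forward pass: 16 -> 4
g = matvec(transpose(C), [1, 0, 0, 0])   # transposed pass: 4 -> 16

print(C[0])  # [1, 2, 3, 0, 4, 5, 6, 0, 7, 8, 9, 0, 0, 0, 0, 0]
print(len(y), len(g))  # 4 16
```

Multiplying the transpose by a one-hot vector simply reads back one row of the matrix, which shows that the transposed pass reuses exactly the same connectivity pattern.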
Let’s now consider what would be required to go the other way around, i.e., map from a 4-dimensional space to a 16-dimensional space, while keeping the connectivity pattern of the convolution depicted in Figure 2.1. This operation is known as a transposed convolution.
Transposed convolutions, also called fractionally strided convolutions or deconvolutions, work by swapping the forward and backward passes of a convolution. (The term “deconvolution” is sometimes used in the literature, but we advocate against it on the grounds that a deconvolution is mathematically defined as the inverse of a convolution, which is different from a transposed convolution.) One way to put it is to note that the kernel defines a convolution, but whether it’s a direct convolution or a transposed convolution is determined by how the forward and backward passes are computed.
For instance, although the kernel $\mathbf{w}$ defines a convolution whose forward and backward passes are computed by multiplying with $\mathbf{C}$ and $\mathbf{C}^T$ respectively, it also defines a transposed convolution whose forward and backward passes are computed by multiplying with $\mathbf{C}^T$ and $(\mathbf{C}^T)^T = \mathbf{C}$ respectively. (The transposed convolution operation can be thought of as the gradient of some convolution with respect to its input, which is usually how transposed convolutions are implemented in practice.)
Finally note that it is always possible to emulate a transposed convolution with a direct convolution. The disadvantage is that it usually involves adding many columns and rows of zeros to the input, resulting in a much less efficient implementation.
Building on what has been introduced so far, this chapter will proceed somewhat backwards with respect to the convolution arithmetic chapter, deriving the properties of each transposed convolution by referring to the direct convolution with which it shares the kernel, and defining the equivalent direct convolution.
The simplest way to think about a transposed convolution on a given input is to imagine such an input as being the result of a direct convolution applied on some initial feature map. The transposed convolution can then be considered as the operation that allows one to recover the shape of this initial feature map. (Note that the transposed convolution is not guaranteed to recover the input itself, as it is not defined as the inverse of the convolution, but rather just returns a feature map that has the same width and height.)
Let’s consider the convolution of a $3 \times 3$ kernel on a $4 \times 4$ input with unitary stride and no padding (i.e., $i = 4$, $k = 3$, $s = 1$ and $p = 0$). As depicted in Figure 2.1, this produces a $2 \times 2$ output. The transpose of this convolution will then have an output of shape $4 \times 4$ when applied on a $2 \times 2$ input.
Another way to obtain the result of a transposed convolution is to apply an equivalent, but much less efficient, direct convolution. The example described so far could be tackled by convolving a $3 \times 3$ kernel over a $2 \times 2$ input padded with a $2 \times 2$ border of zeros using unit strides (i.e., $i' = 2$, $k' = k$, $s' = 1$ and $p' = 2$), as shown in Figure 4.1. Notably, the kernel’s and stride’s sizes remain the same, but the input of the transposed convolution is now zero padded. (Note that although equivalent to applying the transposed matrix, this visualization adds a lot of zero multiplications in the form of zero padding. This is done here for illustration purposes, but it is inefficient, and software implementations will normally not perform the useless zero multiplications.)
One way to understand the logic behind zero padding is to consider the connectivity pattern of the transposed convolution and use it to guide the design of the equivalent convolution. For example, the top left pixel of the input of the direct convolution only contributes to the top left pixel of the output, the top right pixel is only connected to the top right output pixel, and so on.
To maintain the same connectivity pattern in the equivalent convolution it is necessary to zero pad the input in such a way that the first (top-left) application of the kernel only touches the top-left pixel, i.e., the padding has to be equal to the size of the kernel minus one.
Proceeding in the same fashion, it is possible to make similar observations for the other elements of the image, giving rise to the following relationship:
Relationship 8. A convolution described by $s = 1$, $p = 0$ and $k$ has an associated transposed convolution described by $k' = k$, $s' = s$ and $p' = k - 1$ and its output size is
$$o' = i' + (k - 1).$$
Interestingly, this corresponds to a fully padded convolution with unit strides.
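The equivalence between the transpose of a non-padded, unit-stride convolution and a fully padded direct convolution can be verified numerically. The sketch below compares a direct application of the transposed matrix (written as a scatter) with a direct convolution over an input zero padded by the kernel size minus one. One detail is an addition of this sketch rather than something stated in the text: because "convolution" is implemented here as cross-correlation, the kernel has to be rotated by 180 degrees in the equivalent direct convolution. All function names and values are illustrative.

```python
def conv2d(x, w):
    """Cross-correlate square input x with square kernel w (stride 1, no padding)."""
    k = len(w)
    o = len(x) - k + 1
    return [[sum(w[a][b] * x[r + a][c + b] for a in range(k) for b in range(k))
             for c in range(o)] for r in range(o)]


def zero_pad(x, p):
    """Surround square input x with a border of p zeros."""
    n = len(x) + 2 * p
    out = [[0] * n for _ in range(n)]
    for r, row in enumerate(x):
        for c, v in enumerate(row):
            out[r + p][c + p] = v
    return out


def rotate180(w):
    return [row[::-1] for row in w[::-1]]


def transposed_conv2d(y, w, i):
    """Multiply by the transposed convolution matrix, written as a scatter:
    each y[r][c] contributes y[r][c] * w to the k x k patch it was computed from."""
    k = len(w)
    out = [[0] * i for _ in range(i)]
    for r in range(len(y)):
        for c in range(len(y)):
            for a in range(k):
                for b in range(k):
                    out[r + a][c + b] += y[r][c] * w[a][b]
    return out


w = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
y = [[1, 2], [3, 4]]                      # a 2 x 2 input to the transposed convolution

direct = transposed_conv2d(y, w, i=4)
equivalent = conv2d(zero_pad(y, p=len(w) - 1), rotate180(w))

print(direct == equivalent)  # True
print(len(direct))           # 4, i.e. 2 + (3 - 1)
```

The scatter form is closer to how implementations avoid the wasted multiplications by zero, while the padded form matches the illustration in the text.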
Knowing that the transpose of a non-padded convolution is equivalent to convolving a zero padded input, it would be reasonable to suppose that the transpose of a zero padded convolution is equivalent to convolving an input padded with fewer zeros.
It is indeed the case, as shown in Figure 4.2 for $i = 5$, $k = 4$ and $p = 2$.
Formally, the following relationship applies for zero padded convolutions:
Relationship 9. A convolution described by $s = 1$, $k$ and $p$ has an associated transposed convolution described by $k' = k$, $s' = s$ and $p' = k - p - 1$ and its output size is
$$o' = i' + (k - 1) - 2p.$$
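A short worked example may help tie the direct and transposed sizes together. Assume, for illustration, a unit-stride convolution with input size $i = 5$, kernel size $k = 4$ and padding $p = 2$. Its output size $o$ becomes the input size $i'$ of the transposed convolution, whose equivalent direct convolution uses a padding $p' = k - p - 1$:

```latex
\begin{aligned}
o  &= (i - k) + 2p + 1 = (5 - 4) + 4 + 1 = 6, \\
i' &= o = 6, \qquad p' = k - p - 1 = 4 - 2 - 1 = 1, \\
o' &= i' + (k - 1) - 2p = 6 + 3 - 4 = 5 = i.
\end{aligned}
```

The transposed convolution therefore recovers the shape of the original input, with less padding than the direct convolution used.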
By applying the same inductive reasoning as before, it is reasonable to expect that the equivalent convolution of the transpose of a half padded convolution is itself a half padded convolution, given that the output size of a half padded convolution is the same as its input size. Thus the following relation applies:
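Relationship 10. A convolution described by $k = 2n + 1$, $n \in \mathbb{N}$, $s = 1$ and $p = \lfloor k/2 \rfloor = n$ has an associated transposed convolution described by $k' = k$, $s' = s$ and $p' = p$ and its output size is
$$o' = i' + (k - 1) - 2p = i' + 2n - 2n = i'.$$
Figure 4.3 provides an example for $i = 5$, $k = 3$ and (therefore) $p = 1$.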
Knowing that the equivalent convolution of the transpose of a non-padded convolution involves full padding, it is unsurprising that the equivalent of the transpose of a fully padded convolution is a non-padded convolution:
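Relationship 11. A convolution described by $s = 1$, $k$ and $p = k - 1$ has an associated transposed convolution described by $k' = k$, $s' = s$ and $p' = 0$ and its output size is
$$o' = i' + (k - 1) - 2p = i' - (k - 1).$$
Figure 4.4 provides an example for $i = 5$, $k = 3$ and (therefore) $p = 2$.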
Using the same kind of inductive logic as for zero padded convolutions, one might expect that the transpose of a convolution with $s > 1$ involves an equivalent convolution with $s < 1$. As will be explained, this is a valid intuition, which is why transposed convolutions are sometimes called fractionally strided convolutions.
Figure 4.5 provides an example for $i = 5$, $k = 3$ and $s = 2$ which helps understand what fractional strides involve: zeros are inserted between input units, which makes the kernel move around at a slower pace than with unit strides. (Doing so is inefficient and real-world implementations avoid useless multiplications by zero, but conceptually it is how the transpose of a strided convolution can be thought of.)
For the moment, it will be assumed that the convolution is non-padded ($p = 0$) and that its input size $i$ is such that $i - k$ is a multiple of $s$. In that case, the following relationship holds:
Relationship 12. A convolution described by $p = 0$, $k$ and $s$ and whose input size is such that $i - k$ is a multiple of $s$, has an associated transposed convolution described by $\tilde{i}'$, $k' = k$, $s' = 1$ and $p' = k - 1$, where $\tilde{i}'$ is the size of the stretched input obtained by adding $s - 1$ zeros between each input unit, and its output size is
$$o' = s(i' - 1) + k.$$
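The "stretching" idea can be checked in one dimension. The sketch below inserts zeros between input units, pads the result with the kernel size minus one zeros on each side, and applies a unit-stride direct convolution; it then compares the result with a direct application of the transposed strided convolution written as a scatter. As an assumption of this sketch (convolution implemented as cross-correlation), the kernel is reversed in the equivalent direct convolution. Function names and values are hypothetical.

```python
def stretch(v, s):
    """Insert s - 1 zeros between consecutive units of v."""
    out = []
    for idx, value in enumerate(v):
        if idx > 0:
            out.extend([0] * (s - 1))
        out.append(value)
    return out


def conv1d(x, w):
    """1-D cross-correlation with unit stride and no padding."""
    k = len(w)
    return [sum(w[a] * x[r + a] for a in range(k)) for r in range(len(x) - k + 1)]


def transposed_conv1d(y, w, s):
    """Transpose of a non-padded, stride-s convolution, written as a scatter."""
    k = len(w)
    out = [0] * (s * (len(y) - 1) + k)
    for r, value in enumerate(y):
        for a in range(k):
            out[r * s + a] += value * w[a]
    return out


w = [1, 2, 3]      # kernel of size 3
s = 2
y = [1, 2]         # size-2 output of a size-3, stride-2 convolution on a size-5 input

stretched = stretch(y, s)                                   # [1, 0, 2]
padded = [0] * (len(w) - 1) + stretched + [0] * (len(w) - 1)
equivalent = conv1d(padded, w[::-1])
direct = transposed_conv1d(y, w, s)

print(stretched)             # [1, 0, 2]
print(direct)                # [1, 2, 5, 4, 6]
print(direct == equivalent)  # True
```

The output has size 2 * (2 - 1) + 3 = 5, matching the size of the input that the original strided convolution would have consumed.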
When the convolution’s input size $i$ is such that $i + 2p - k$ is a multiple of $s$, the analysis can be extended to the zero padded case by combining Relationship 9 and Relationship 12:
Relationship 13. A convolution described by $k$, $s$ and $p$ and whose input size $i$ is such that $i + 2p - k$ is a multiple of $s$ has an associated transposed convolution described by $\tilde{i}'$, $k' = k$, $s' = 1$ and $p' = k - p - 1$, where $\tilde{i}'$ is the size of the stretched input obtained by adding $s - 1$ zeros between each input unit, and its output size is
$$o' = s(i' - 1) + k - 2p.$$
Figure 4.6 provides an example for $i = 5$, $k = 3$, $s = 2$ and $p = 1$.
The constraint on the size of the input $i$ can be relaxed by introducing another parameter $a \in \{0, \ldots, s - 1\}$ that makes it possible to distinguish between the $s$ different cases that all lead to the same $i'$:
Relationship 14. A convolution described by $k$, $s$ and $p$ has an associated transposed convolution described by $a$, $\tilde{i}'$, $k' = k$, $s' = 1$ and $p' = k - p - 1$, where $\tilde{i}'$ is the size of the stretched input obtained by adding $s - 1$ zeros between each input unit, and $a = (i + 2p - k) \bmod s$ represents the number of zeros added to the bottom and right edges of the input, and its output size is
$$o' = s(i' - 1) + a + k - 2p.$$
Figure 4.7 provides an example for $i = 6$, $k = 3$, $s = 2$ and $p = 1$.
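A round-trip check makes the role of the extra parameter concrete. The sketch below computes the output size of a direct convolution, then feeds it to the transposed output size formula $o' = s(i' - 1) + a + k - 2p$, taking $a = (i + 2p - k) \bmod s$. Function names and the chosen kernel size, stride and padding are illustrative assumptions.

```python
def conv_output_size(i, k, s, p):
    """Direct convolution: o = floor((i + 2p - k) / s) + 1."""
    return (i + 2 * p - k) // s + 1


def transposed_output_size(i_t, k, s, p, a=0):
    """Transposed convolution: o' = s * (i' - 1) + a + k - 2p."""
    return s * (i_t - 1) + a + k - 2 * p


k, s, p = 3, 2, 1
recovered = {}
for i in range(3, 9):
    o = conv_output_size(i, k, s, p)
    a = (i + 2 * p - k) % s          # disambiguates inputs sharing the same o
    recovered[i] = transposed_output_size(o, k, s, p, a)

print(recovered)  # {3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8}
```

Without `a`, input sizes 5 and 6 would both map back to 5; the parameter restores the one-to-one correspondence between input shapes and transposed output shapes.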
Readers familiar with the deep learning literature may have noticed the term “dilated convolutions” (or “atrous convolutions”, from the French expression convolutions à trous) appear in recent papers. Here we attempt to provide an intuitive understanding of dilated convolutions. For a more in-depth description and to understand in what contexts they are applied, see Chen et al. (2014); Yu and Koltun (2015).
Dilated convolutions “inflate” the kernel by inserting spaces between the kernel elements. The dilation “rate” is controlled by an additional hyperparameter $d$. Implementations may vary, but there are usually $d - 1$ spaces inserted between kernel elements such that $d = 1$ corresponds to a regular convolution.
Dilated convolutions are used to cheaply increase the receptive field of output units without increasing the kernel size, which is especially effective when multiple dilated convolutions are stacked one after another. For a concrete example, see Oord et al. (2016), in which the proposed WaveNet model implements an autoregressive generative model for raw audio which uses dilated convolutions to condition new audio frames on a large context of past audio frames.
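The effect of stacking dilated convolutions on the receptive field can be sketched with a small calculation. The formula used here is a standard result rather than one stated in the text: for a stack of unit-stride convolutions, each layer with kernel size `k` and dilation `d` widens the receptive field along one axis by `(k - 1) * d`. The function name and the layer configurations are hypothetical and are not meant to describe any particular published model.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field along one axis of stacked unit-stride convolutions.

    Each layer with dilation d adds (kernel_size - 1) * d input positions.
    """
    field = 1
    for d in dilations:
        field += (kernel_size - 1) * d
    return field


plain = receptive_field(3, [1, 1, 1, 1])     # four regular layers
dilated = receptive_field(3, [1, 2, 4, 8])   # same depth, doubling dilation

print(plain, dilated)  # 9 31
```

Both stacks have the same number of weights per layer, yet the dilated one sees more than three times as much of the input, which is what "cheaply" refers to above.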
To understand the relationship tying the dilation rate $d$ and the output size $o$, it is useful to think of the impact of $d$ on the effective kernel size. A kernel of size $k$ dilated by a factor $d$ has an effective size
$$\hat{k} = k + (k - 1)(d - 1).$$
This can be combined with Relationship 6 to form the following relationship for dilated convolutions:
Relationship 15. For any $i$, $k$, $p$ and $s$, and for a dilation rate $d$,
$$o = \left\lfloor \frac{i + 2p - k - (k - 1)(d - 1)}{s} \right\rfloor + 1.$$
Figure 5.1 provides an example for $i = 7$, $k = 3$ and $d = 2$.
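The effective kernel size and the resulting output size can be checked by explicitly building a dilated kernel in one dimension. In this sketch, `dilate` inserts `d - 1` zeros between kernel elements, and `dilated_output_size` applies the general output size formula with the effective kernel size $k + (k - 1)(d - 1)$; both names and the example values are assumptions made for illustration.

```python
def dilate(w, d):
    """Insert d - 1 zeros between consecutive kernel elements."""
    out = []
    for idx, value in enumerate(w):
        if idx > 0:
            out.extend([0] * (d - 1))
        out.append(value)
    return out


def dilated_output_size(i, k, d, s=1, p=0):
    """o = floor((i + 2p - k - (k - 1)(d - 1)) / s) + 1."""
    effective_k = k + (k - 1) * (d - 1)
    return (i + 2 * p - effective_k) // s + 1


w = [1, 2, 3]
dilated = dilate(w, d=2)
size = dilated_output_size(i=7, k=len(w), d=2)

print(dilated)       # [1, 0, 2, 0, 3]
print(len(dilated))  # 5, the effective kernel size
print(size)          # 3
```

Setting `d=1` leaves the kernel unchanged and reduces the formula to the regular convolution case.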