KerCNNs: biologically inspired lateral connections for classification of corrupted images

10/18/2019 ∙ by Noemi Montobbio, et al.

The state of the art in many computer vision tasks is represented by Convolutional Neural Networks (CNNs). Although their hierarchical organization and local feature extraction are inspired by the structure of primate visual systems, the lack of lateral connections in such architectures critically distinguishes their analysis from biological object processing. The idea of enriching CNNs with recurrent lateral connections of convolutional type has been put into practice in recent years, in the form of learned recurrent kernels with no geometrical constraints. In the present work, we introduce biologically plausible lateral kernels encoding a notion of correlation between the feedforward filters of a CNN: at each layer, the associated kernel acts as a transition kernel on the space of activations. The lateral kernels are defined in terms of the filters, thus providing a parameter-free approach to assess the geometry of horizontal connections based on the feedforward structure. We then test this new architecture, which we call KerCNN, on a generalization task related to global shape analysis and pattern completion: once trained for performing basic image classification, the network is evaluated on corrupted testing images. The image perturbations examined are designed to undermine the recognition of the images via local features, thus requiring an integration of context information - which in biological vision is critically linked to lateral connectivity. Our KerCNNs turn out to be far more stable than CNNs and recurrent CNNs to such degradations, thus validating this biologically inspired approach to reinforce object recognition under challenging conditions.


1. Introduction

Convolutional Neural Networks (CNNs) are a powerful tool that provides outstanding performance on image classification tasks. Major advances have been made since their introduction in the 1980s (Fukushima, 1980), thanks to the availability of large-scale datasets, as well as efficient GPU implementations and new regularization schemes. A notable example in this respect is the dramatic improvement of the state-of-the-art performance reached by Krizhevsky et al. (2012) on the ImageNet 2012 classification benchmark. However, there is still little insight into how the learning process of these algorithms develops and how the features of the data are encoded in the network structure. Some visualization techniques have been developed, such as the “deconvolution”-based projection of activations onto the pixel space proposed by Zeiler and Fergus (2014), to identify the input stimuli that excite each feature map at a given layer of the network. Although this may provide some intuition on internal operations and simplify the diagnosis of the limitations of the models, there is still much to be understood about how exactly image information is coded in CNNs, and notably about how their functioning is related to human object processing. Indeed, although CNN models were initially inspired (Fukushima, 1980; LeCun et al., 1989) by the hierarchical model of the visual system of Hubel and Wiesel (1962), they display critical discrepancies w.r.t. biological vision in both structure and feature analysis.
In Baker et al. (2018) the authors show that, unlike in human vision, global shapes have surprisingly little impact on the classification output of the net: that is, CNNs turn out to learn mostly from local features. As such, CNN architectures are very unstable to small local perturbations, even when the global structure of the image is preserved and its content is still easily recognizable by a human observer. Along the same lines, it has been recently shown by Brendel and Bethge (2019) that a very good classification accuracy on the ImageNet dataset can be reached through a model that only relies on the occurrences of local features, with no information on their spatial location in the image.
Another key strand in unraveling the shortcomings in the internal processing of CNNs is the one related to “adversarial attacks”: it has been shown (Szegedy et al., 2014) that a CNN can be caused to completely misclassify an image when it is perturbed in a way imperceptible to humans, but specifically designed to maximize the prediction error. A similar study is presented in Nguyen et al. (2015), where the authors produce images that are unrecognizable to the human eye, but get labeled as one specific object class with high confidence by state-of-the-art CNNs.
Besides, although the overall convolutional architecture has much in common with the process of feature extraction carried out in the visual pathways, its structure implements a purely feedforward mechanism. On the contrary, the human visual system is well known to rely on both lateral (intra-layer) and feedback (top-down) recurrent connections for processes that are critical for object recognition, such as contour integration or figure-ground segregation (Gilbert et al., 1996; Grossberg and Mingolla, 1985; Neumann and Mingolla, 2001; Layton et al., 2014). In recent years, several models have been proposed in which CNN architectures were enriched with some recurrent mechanism inspired by biological visual systems. In Tang et al. (2018), pre-trained feedforward models were augmented with a Hopfield-like recurrent mechanism acting on the activation of the last layer, to improve their performance in pattern completion: partially visible objects converge to fixed attractor points dictated by the original whole objects. Liang and Hu (2015) introduced a “Recurrent CNN” architecture, where lateral connections of convolutional type are inserted in a regular feedforward CNN. A systematic analysis of the effect of adding lateral and/or feedback connections has been carried out by Spoerer et al. (2017), where the resulting architectures are trained and tested on a task of classification of cluttered digits. In Recurrent CNNs, lateral connections are learned, and no geometrical prior (apart from the ones given by the convolutional structure) is inserted. As such, these connections are determined by additional parameters that are completely independent of the feedforward architecture.

In this work, we propose to modify the classical CNN architecture by inserting lateral connections defined by structured kernels, containing precise geometric information specific to each layer. The new architecture will be referred to as KerCNN. The kernel associated to a convolutional layer implements a measure of correlation between neurons of that layer, inspired by the connectivity model of Montobbio et al. (2019a, b), and acts as a transition kernel on the corresponding activation. As will be discussed in Section 3.1, the lateral contribution is defined by an iterative update rule similar to the recurrent mechanism of Liang and Hu (2015), although carefully modified to implement a biologically plausible propagation of neural activity. Most importantly, the lateral kernels themselves are not learned, but rather they are constructed to allow diffusion in the metric defined by the learned filters. In particular, they establish a link between the geometrical properties of feedforward connections and horizontal connectivity, being defined as a function of the convolutional filters. This also implies that such kernels do not depend on any additional trainable parameters: therefore, their insertion does not increase the original network’s complexity in terms of number of parameters, which allows a fair comparison in performance.
The main point that we wish to make is that the insertion of these connections allows the networks to spontaneously implement perceptual mechanisms of global shape analysis and completion. Therefore, we shall examine the ability of the models to generalize an image classification task to data corrupted by a variety of different perturbations: these include occlusions (as in Tang et al., 2018), local contour disruption (as in Baker et al., 2018) and adversarial attacks via the Fast Gradient Sign Method (FGSM) of Goodfellow et al. (2015). We stress that the data perturbations are only inserted in the testing phase – that is, the models are not optimized to classify corrupted images.


We will fix a base 2-layer CNN model and modify it by inserting our structured lateral connections in one or both layers. We will then compare the performance of the base CNN with that of the different KerCNN models, obtained by varying the number of iterations of the update rule for each layer. We first present an extensive analysis of our results for the classical MNIST dataset (LeCun et al., 1998). As will be shown in Section 4.1, KerCNNs turn out to improve the base CNN’s classification accuracy on degraded images by up to 33 percentage points (cf. Table 1), while preserving the same performance on the original (not corrupted) testing images. We also compare the KerCNN models with the “RecCNN” obtained by adding recurrent connections to the base model as in Spoerer et al. (2017) – where the number of parameters of the networks is matched by decreasing the size of feedforward filters, to compensate for the additional recurrent parameters. In particular, for each task we inspect the performance of the best KerCNN and RecCNN architectures (i.e. the ones with the optimal number of iterations), and our results show that our biologically inspired model outperforms the recurrent one in practically all experiments, see Section 4.1.6. We will conclude the paper by giving a concise account of the same study carried out on different datasets, namely Kuzushiji-MNIST (Clanuwat et al., 2018), Fashion-MNIST (Xiao et al., 2017) and CIFAR-10 (Krizhevsky, 2009). The choice of the two MNIST-like datasets was driven by their being very homogeneous, allowing for a meaningful interpretation of our results in terms of the characterizing features of the images. On the other hand, our tests on CIFAR-10 show that this technique can be extended to richer datasets, and notably to natural images.
A noteworthy feature of our model is that it somewhat links two approaches to image treatment that are classically seen as opposites, namely “geometrical” methods and “data-driven” methods. The former rely on a priori assumptions based on mathematical modeling either of the structure of the data or of the task, e.g. variational techniques for inpainting (Ambrosio and Masnou, 2003; Bertalmio et al., 2000); the latter instead are designed to learn patterns and convenient representations from the statistics of a dataset through optimization of a loss function related to the task. In KerCNNs these two aspects coexist, since the metric structure that we define on each layer of the network is directly induced by the learned filters.

2. Preliminaries

In this section, we shall give an overview of some biological notions and computational methods that will be of interest throughout the paper. CNNs are a particular kind of deep neural network architecture, designed in analogy with information processing in biological visual systems (Fukushima, 1980; LeCun et al., 1989). In addition to the hierarchical organization typical of deep architectures, translation invariance is enforced in CNNs by local convolutional windows shifting over the spatial domain. This structure was inspired by the localized receptive profiles of neurons in the early visual areas, and by the approximate translation invariance in their tuning. Although the analogy with biological vision is strong, the feedforward mechanism implemented in CNNs is a simplified one, and it does not take into account all of the processes contributing towards the interpretation of a visual scene. In the following, we first describe some structures of the visual pathways with a focus on the primary visual cortex (V1), and review some mathematical models of vision; we then recall the main features of feedforward convolutional architectures, and finally outline the Recurrent CNN models of Liang and Hu (2015) and Spoerer et al. (2017).

2.1. Feedforward and lateral connectivity of V1

The primary visual cortex (V1) implements the first stage of cortical processing of a visual stimulus. It receives the retinal signal after a first subcortical processing stage, and it sends information to “higher” cortical areas performing further processing. These junctions form a connectivity of feedforward type, since they link the zones of the visual pathways in a sequential way, generating a hierarchy starting from the retina.
Through the above-mentioned connections, each visual neuron is linked to a specific domain $D$ of the retina which is referred to as its receptive field (RF). The reaction of a cell to a punctual luminous stimulation applied at a point $x \in D$ can be of excitatory or inhibitory type, with different modulation: this can be described by a function $\psi$, called the receptive profile (RP) of the cell, whose values are positive when the cell is excited and negative when it is inhibited. The RPs of certain types of visual neurons are shown to act, at least to a first approximation, as linear filters on the optic signal. This means that the response of the cell to a visual stimulus $I$, defined as a function on the retina, is given by the integral of $I$ against the profile of the neuron, computed over its receptive field $D$:

(1)   $r(I) = \int_D \psi(x)\, I(x)\, dx.$

In these cases, the shape of the RP contains information about the features that it extracts from a visual signal. For example, the local support of $\psi$ (the support of a function being defined as the closure of the set $\{x : \psi(x) \neq 0\}$) makes it sensitive to position, i.e. the neuron only responds to stimuli in a localized region of the image. Likewise, a receptive profile with an elongated shape will be sensitive to a certain orientation, i.e. it will respond strongly to stimuli consisting of bars aligned with this shape. This is the case for simple cells, a class of neurons of V1 showing orientation selectivity due to their strongly anisotropic RPs, first discovered by Hubel and Wiesel (1962).
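To make the linear-filter reading of Eq. (1) concrete, the following sketch computes the response of an idealized simple cell as the inner product of a stimulus with a Gabor-like receptive profile. The Gabor form and all parameter values are illustrative assumptions, not a claim about any specific physiological filter.

```python
# Minimal sketch of Eq. (1): the response of a linear simple cell is the
# inner product of the stimulus I with the receptive profile psi over the RF.
import numpy as np

def gabor_rp(size=11, theta=0.0, sigma=2.0, freq=0.25):
    """An idealized oriented receptive profile (Gabor filter, illustrative)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)    # rotated coordinates
    envelope = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    return envelope * np.cos(2 * np.pi * freq * xr)

def cell_response(stimulus_patch, rp):
    """Eq. (1): integrate the stimulus against the RP over the receptive field."""
    return float(np.sum(stimulus_patch * rp))

# A vertical bar excites the aligned cell far more than the orthogonal one.
bar = np.zeros((11, 11)); bar[:, 5] = 1.0
print(cell_response(bar, gabor_rp(theta=0.0)))         # strong (aligned)
print(cell_response(bar, gabor_rp(theta=np.pi / 2)))   # weak (orthogonal)
```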

The processing performed by the human visual system can efficiently group local items into extended contours, and segregate a path of elements from its background. This implies that the perception of a local edge element in a visual stimulus is influenced by the perception of the surrounding oriented components: this perceptual phenomenon has been described through the concept of association field (Field et al., 1993), characterizing the geometry of the mutual influences between local oriented elements in the perceived image, depending on their orientation and reciprocal position. These psychophysical experiments suggest that a local analysis is not sufficient to correctly interpret a visual scene: from the physiological point of view, this means that the activity of V1 neurons is not only influenced by the feedforward signal received from the preceding visual areas, but also by intracortical connections with surrounding V1 cells. In fact, the reciprocal influences described by association fields are thought to be neurally implemented in V1 through a kind of long-range connections referred to as lateral (or horizontal), whose orientation specificity and spatial extent are compatible with association fields. Indeed, V1 horizontal connections show facilitatory influences for cells that are similarly oriented; moreover, the connections departing from each neuron spread anisotropically, concentrating along the axis of its preferred orientation (see e.g. Bosking et al., 1997).

The functional architecture of V1 and the related perceptual phenomena have been described in a variety of mathematical models. The set of RPs of simple cells is typically represented by a bank of linear filters $\{\psi_p\}_{p \in \mathcal{G}}$, where $\mathcal{G}$ is a set of indices parameterizing the family. Each $p \in \mathcal{G}$ can be thought of as representing the features extracted by the filter $\psi_p$: in these terms, we can refer to $\mathcal{G}$ as the feature space associated to the bank of filters $\{\psi_p\}$. This set often has the product form $\mathcal{G} = \mathbb{R}^2 \times \mathcal{F}$, where the parameters $x \in \mathbb{R}^2$ determine the retinal location of the RF, while $f \in \mathcal{F}$ represents the other local image features extracted by each filter. This is the case for the work of Bressloff and Cowan (2003), where each V1 cell is labeled by a spatial index $x$ and a “feature index” $f$, and the evolution in time of the activity $a(x, f, t)$ of the neural population at $(x, f)$ is assumed to satisfy a Wilson-Cowan equation (Wilson and Cowan, 1972):

(2)   $\partial_t\, a(x,f,t) = -\alpha\, a(x,f,t) + \sigma\Big( h(x,f,t) + \int_{\mathcal{G}} \omega\big((x,f),(x',f')\big)\, a(x',f',t)\, dx'\, df' \Big).$

Here, $\sigma$ is a nonlinear activation function; $\alpha$ is a decay rate; $h(x,f,t)$ is the feedforward input corresponding to the response of the simple cells in presence of a visual stimulus, as in (1); and the kernel $\omega$ weights the strength of horizontal connections between $(x,f)$ and $(x',f')$. A possible way to obtain a measure of this connectivity is by means of differential geometry tools. A breakthrough idea in this direction has been that of viewing the feature space as a fiber bundle with base $\mathbb{R}^2$ and fiber $\mathcal{F}$. This approach first appeared in the works of Koenderink and van Doorn (1987) and Hoffman (1989). It was then further developed by Petitot and Tondut (1999) and Citti and Sarti (2006). In the latter work, the model is written in the Lie group $SE(2)$ of roto-translations of the plane, by requiring the invariance under roto-translations: here, the feature index explicitly represents a local orientation $\theta$. More generally, it can also contain information about other variables such as scale, curvature or even velocity (see e.g. Sarti et al., 2008; Abbasi-Sureshjani et al., 2018; Barbieri et al., 2014). Another strand of research is linked to statistics of natural images (see e.g. August and Zucker, 2000; Kruger, 1998; Sigman et al., 2001; Sanguinetti et al., 2010). In Sanguinetti et al. (2010), the statistics of edge co-occurrence in natural images are fitted to a Fokker-Planck kernel in $\mathbb{R}^2 \times S^1$; such kernel has been proposed as a connectivity weight to insert in (2) by Sarti and Citti (2015).

2.1.1. A kernel model for lateral connectivity

A different connectivity kernel, induced by a metric space structure associated to the RPs of simple cells, has been introduced in Montobbio et al. (2019a). The core of the model is the definition of a kernel describing the interactions between local elements, which determines a metric structure directly induced by the shape of the RPs of V1 simple cells. This local correlation kernel is then propagated through an iterative procedure, to yield a wider kernel modeling long-range connections.

The starting point is a family of filters $\{\psi_p\}_{p \in \mathcal{G}}$ modeling the set of V1 simple cells. The local connectivity of V1 is represented by the following generating kernel on $\mathcal{G}$:

(3)   $K(p, q) = \langle \psi_p, \psi_q \rangle_{L^2}, \qquad p, q \in \mathcal{G}.$

The kernel $K$ is constructed to provide a measure of correlation between RPs. In fact, if the filters are normalized to have squared $L^2$-norm equal to some number $c$, then expanding $\|\psi_p - \psi_q\|_{L^2}^2 = \|\psi_p\|_{L^2}^2 + \|\psi_q\|_{L^2}^2 - 2\langle \psi_p, \psi_q \rangle_{L^2}$ yields

$K(p, q) = c - \tfrac{1}{2}\, \|\psi_p - \psi_q\|_{L^2}^2.$

This means that $K$ expresses the correlation w.r.t. the $L^2$ distance between the filters. Note that this also defines a metric on the feature space $\mathcal{G}$.

The generating kernel has a local sense, since it only describes the reciprocal influences between simple cells with overlapping RFs. The action of $K$ is then iterated to model the long-range connectivity. Given a starting point $p \in \mathcal{G}$, the local kernel around it is first passed through a nonlinear activation function $\sigma$ and a normalization operator $N$ (see also Coifman and Lafon, 2006), thus defining:

(4)   $K^{(1)}_p := N\big( \sigma( K(p, \cdot) ) \big).$

The iterative procedure yielding the propagation is then given by

(5)   $K^{(n+1)}_p(q) = \int_{\mathcal{G}} K^{(n)}_p(r)\, K^{(1)}_r(q)\, d\mu(r).$

Here, $\mu$ is the spherical Hausdorff measure (Hausdorff, 1918) associated to the distance on $\mathcal{G}$.

The geometrical structure encoded in this kernel is shown in Montobbio et al. (2019a) to be compatible with the properties of V1 horizontal connections, and with the perceptual principles synthesized by association fields. Results are also shown for a bank of filters arising from an unsupervised learning algorithm: this shows that meaningful information about the geometry of horizontal connections can be recovered from numerically known filters, thus motivating the present work.


We conclude this section with an important remark on the action of the correlation kernel of Montobbio et al. (2019a) as an operator acting on functions defined on $\mathcal{G}$. Given a function $h : \mathcal{G} \to \mathbb{R}$, the action of the propagated kernel $K^{(n)}$ onto $h$ can then be expressed by

(6)   $(K^{(n)} h)(p) = \int_{\mathcal{G}} K^{(n)}_p(q)\, h(q)\, d\mu(q).$

Note that, by substituting Eq. (5) into Eq. (6), we get:

$K^{(n)} h = \underbrace{K^{(1)} \big( K^{(1)} \big( \cdots K^{(1)}}_{n \text{ times}} (h) \big)\big).$

This means that applying the $n$-th step kernel to $h$ is equivalent to applying $n$ times the local kernel to $h$. In the following, we will take as functions $h$ the activations obtained by mapping a signal to a feature space. In the case of V1, this signal is a retinal image $I$ and the activation in presence of $I$ is a function of the cortical coordinates $(x, f)$:

$h_I(x, f) = \sigma\big( \langle \psi_{(x,f)}, I \rangle \big),$

where $\sigma$ is a nonlinear activation function. Updating this activation through the connectivity kernel means taking into account the contextual influences in modeling the response of V1 to the image $I$.
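The semigroup property just described is easy to verify numerically. Below is a minimal discretized sketch, with a generic random row-stochastic matrix standing in for the local kernel $K^{(1)}$ (an illustrative assumption): applying the $n$-step kernel once coincides with applying the local kernel $n$ times.

```python
# Discretized sketch of Eqs. (5)-(6): on a finite feature space, the local
# kernel is a row-stochastic matrix K; K^n acts on an activation h by
# repeated application of K.
import numpy as np

rng = np.random.default_rng(0)
K = rng.random((50, 50))
K = K / K.sum(axis=1, keepdims=True)    # normalization, as in Eq. (4)

h = rng.random(50)                      # an activation on the feature space

n = 4
h_n = np.linalg.matrix_power(K, n) @ h  # applying the n-step kernel once ...
h_iter = h.copy()
for _ in range(n):                      # ... equals applying the local
    h_iter = K @ h_iter                 # kernel n times (remark after Eq. (6))
assert np.allclose(h_n, h_iter)
```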

2.2. CNNs for image classification

In order to fix the notation that will be used in later sections, we recall here the typical structure of a CNN, with a focus on image classification tasks. We refer to Rawat and Wang (2017) for an exhaustive review on this topic. As mentioned in Section 2.1, the processing of early visual areas is classically modeled as the mapping of an image to a feature space through a bank of filters with a localized support. The first convolutional layer of a CNN implements an analogous mechanism, typically defined as follows:

(7)   $a^{(1)}_{i,j,k} = \sigma\Big( \sum_{u,v,c} W^{(1)}_{u,v,c,k}\; I_{i+u,\, j+v,\, c} + b^{(1)}_k \Big),$

where $\sigma$ is a nonlinear activation function. A popular choice for it in recent literature is the Rectified Linear Unit (ReLU) $\sigma(r) = \max(r, 0)$ (Nair and Hinton, 2010; Krizhevsky et al., 2012). Here, the input image $I$ is an $h_0 \times w_0 \times c_0$ tensor, where $h_0$ and $w_0$ denote the height and width of the image in pixels, while $c_0$ is the number of channels: $c_0 = 1$ if $I$ is a grayscale image, $c_0 = 3$ if it is RGB. The bank of filters of the first layer is an $s_1 \times s_1 \times c_0 \times c_1$ tensor $W^{(1)}$: $s_1$ is the spatial size of the filters, $c_1$ is the number of filters and $c_0$ is the number of channels of each filter, matching the number of channels of the input. The convolution between $I$ and the filters gives an $h_1 \times w_1 \times c_1$ tensor, to which a bias vector $b^{(1)}$ is added along the third component, to obtain the output $a^{(1)}$ of the layer. Note that the number of filters defines the number of channels of the output of the layer. Written in a more compact notation, Eq. (7) reads:

$a^{(1)} = \sigma\big( W^{(1)} \ast I + b^{(1)} \big).$
The subsequent convolutional layers are defined similarly: for each layer $\ell$ we have a bank of filters $W^{(\ell)}$ defined by an $s_\ell \times s_\ell \times c_{\ell-1} \times c_\ell$ tensor, and a bias vector $b^{(\ell)}$ of length $c_\ell$. The number of channels of the filters varies across layers according to the number of channels of the inputs they receive. In particular, since the output of the $(\ell-1)$-th layer has $c_{\ell-1}$ channels, each of the filters applied to it must have $c_{\ell-1}$ channels as well. The activation of the $\ell$-th layer in terms of the output of the preceding layer is given by

$a^{(\ell)} = \sigma\big( W^{(\ell)} \ast a^{(\ell-1)} + b^{(\ell)} \big).$

Another layer that can optionally be interposed between convolutional layers consists of the application of a pooling operator: this performs a downsampling of its input over the spatial variables (i.e. the “depth” dimension remains unchanged), typically by taking the maximum or by averaging over small neighborhoods. For instance, if a pooling layer is applied to an activation of size $h \times w \times c$ over 2×2 squares, then the output will be an $\frac{h}{2} \times \frac{w}{2} \times c$ tensor. This downsampling operation reduces the dimensionality and introduces invariance to small shifts and distortions. The insertion of pooling layers has a neural motivation as well: the receptive fields of visual neurons tend to get wider and wider moving towards higher cortical layers, and subsampling the spatial dimension of a feature space is equivalent to taking filters with a wider support in the next layer.
The final layer of the network is typically fully connected: the output of the last convolutional layer, which is a tensor of size $h_M \times w_M \times c_M$, is “flattened” to a vector of length $q = h_M w_M c_M$ and transformed as follows:

$y = A\, \mathrm{flat}\big(a^{(M)}\big) + \beta,$

where $A$ is an $n \times q$ matrix of trainable weights and $\beta$ is a bias vector of length $n$. This yields a vector $y$ of length $n$ as output: in the case of multiclass classification, $n$ must be the number of classes. It is also not uncommon to have multiple fully connected layers, with nonlinear activation functions interposed between them – in this case, only the length of the last output vector needs to match the number of classes. A softmax function is then typically applied to the final output vector:

$\mathrm{softmax}(y)_j = \frac{e^{y_j}}{\sum_{k=1}^{n} e^{y_k}}, \qquad j = 1, \dots, n.$

The softmax function gives a vector whose entries are real numbers between 0 and 1 that sum to 1: this can be interpreted as a vector of probabilities, where each entry represents the “score” of the corresponding class.


The most common loss function for multiclass classification is the cross entropy between the output $y$ and the target vector $t$ containing the “true” probabilities associated to the input $I$:

(8)   $L(y, t) = -\sum_{j=1}^{n} t_j \log\big(\mathrm{softmax}(y)_j\big) = -\log\big(\mathrm{softmax}(y)_{j^\ast}\big),$

where $j^\ast$ is the correct class for $I$. The last equality holds since $t_{j^\ast} = 1$ and $t_j = 0$ for each $j \neq j^\ast$.
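As a concrete check of Eq. (8), the following small sketch (illustrative, not the paper’s code) evaluates the softmax and the resulting cross-entropy for a one-hot target.

```python
# Numeric sketch of the softmax and the cross-entropy loss of Eq. (8).
import numpy as np

def softmax(y):
    e = np.exp(y - y.max())            # shift for numerical stability
    return e / e.sum()

def cross_entropy(y, target_class):
    # With a one-hot target, the sum reduces to -log of the correct class score.
    return -np.log(softmax(y)[target_class])

y = np.array([2.0, 0.5, -1.0])         # raw network output
print(softmax(y))                      # probabilities summing to 1
print(cross_entropy(y, target_class=0))
```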

2.3. Recurrent CNNs (RecCNNs)

The human visual system, as outlined in Section 2.1, relies not only on a hierarchical transmission of signals, but also on a horizontal spreading of information. On the other hand, the sequential structure of a CNN implements a purely feedforward mechanism: the output of each layer only depends on the activation of the preceding layer. Recurrent CNNs (Liang and Hu, 2015; Spoerer et al., 2017), referred to as RecCNNs in the following, are a modification of this kind of architecture, where lateral connections of convolutional type are added to a regular CNN, yielding an equation analogous to (2). This means that the network includes not only connections from one layer to the next one, but also connections from a layer to itself, ruled by “horizontal” connectivity weights. As in (2), this is described through an evolution in time. The activation of the $\ell$-th hidden layer at time $t$, which we denote by $a^{(\ell)}_t$, is a function of:

  • $a^{(\ell-1)}_t$ (the output of the preceding layer at the same time step $t$);

  • $a^{(\ell)}_{t-1}$ (the output of the same layer at time $t-1$).

Specifically, following the notations introduced for CNNs, we have:

(9)   $a^{(\ell)}_t = \sigma\big( W^{(\ell)} \ast a^{(\ell-1)}_t + U^{(\ell)} \ast a^{(\ell)}_{t-1} + b^{(\ell)} \big)$

for all $t \geq 1$, where $a^{(0)}_t = I$ for all $t$, and $a^{(\ell)}_0 = \sigma\big( W^{(\ell)} \ast a^{(\ell-1)}_0 + b^{(\ell)} \big)$ for all $\ell$ (i.e. the first step is purely feedforward). Here, $U^{(\ell)}$ denotes the bank of convolutional filters defining the lateral connections at the $\ell$-th layer. Note that their introduction results in an additional set of parameters in the architecture w.r.t. a standard CNN.
Recurrent neural networks (RNNs) are often employed to process sequential inputs, e.g. audio recordings, video or text. In such cases, a new input is fed into the network at each time step. On the contrary, in RecCNNs the input image is static, i.e. it is kept fixed at each time step: the time variable only affects the processing.
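A minimal PyTorch sketch of a recurrent convolutional layer implementing Eq. (9) is given below. The class name, filter sizes and number of time steps are illustrative assumptions, not the implementation of Liang and Hu (2015) or Spoerer et al. (2017) (which also apply local response normalization, omitted here).

```python
# Sketch of the recurrent convolutional layer of Eq. (9): a learned lateral
# kernel U is convolved with the layer's own previous activation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RecConvLayer(nn.Module):
    def __init__(self, in_ch, out_ch, k_ff=3, k_rec=3, steps=3):
        super().__init__()
        self.ff = nn.Conv2d(in_ch, out_ch, k_ff, padding=k_ff // 2)   # feedforward W
        self.rec = nn.Conv2d(out_ch, out_ch, k_rec,                   # lateral U
                             padding=k_rec // 2, bias=False)          # (learned)
        self.steps = steps

    def forward(self, x):
        # The input is static: the same x is fed at every time step.
        h = F.relu(self.ff(x))                     # t = 0: purely feedforward
        for _ in range(self.steps - 1):
            h = F.relu(self.ff(x) + self.rec(h))   # t > 0: add lateral term
        return h

out = RecConvLayer(1, 16)(torch.randn(8, 1, 28, 28))   # -> (8, 16, 28, 28)
```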

3. Kernel CNNs (KerCNNs)

The lateral connections in RecCNNs are completely learned and independent of the feedforward ones. As such, the inclusion of these connections in a CNN increases its complexity in terms of trainable parameters. We propose a different modification of a CNN, obtained by introducing convolutional lateral connections with kernels constructed according to the connectivity model of Montobbio et al. (2019a, b) (see Section 2.1.1). We shall refer to this architecture as KerCNN. In the following, we first outline the proposed network architecture; we then introduce a testing framework to analyze the performance of the networks in three tasks of classification of corrupted images, focusing on types of image degradation where mechanisms of perceptual completion and global object analysis are required for correct classification.

3.1. Network architecture

The idea of the KerCNN architecture is to transpose the notion of connectivity of Montobbio et al. (2019a, b), and notably the propagation of Eq. (6), into the structure of a CNN. The lateral kernel associated to the $\ell$-th convolutional layer is defined as follows. First, a correlation kernel $\widetilde{K}^{(\ell)}$ is computed by taking the scalar product between the filters $W^{(\ell)}$:

(10)   $\widetilde{K}^{(\ell)}_{i,j}(\delta) = \varsigma\Big( \sum_{u,v,c} W^{(\ell)}_{u,\,v,\,c,\,i}\; W^{(\ell)}_{u+\delta_1,\, v+\delta_2,\, c,\, j} \Big),$

where $\varsigma$ is the sigmoidal activation function

$\varsigma(r) = \frac{1}{1 + e^{-r}}.$

The spatial indices in the sum are let vary as long as the product does not vanish. Therefore, if the size of the bank of filters is $s_\ell \times s_\ell \times c_{\ell-1} \times c_\ell$, then the size of the kernel is obtained as $(2 s_\ell - 1) \times (2 s_\ell - 1) \times c_\ell \times c_\ell$. The final kernel is then obtained as

$K^{(\ell)} = N\big( \widetilde{K}^{(\ell)} \big),$

where $N$ is the same normalization operator introduced by Coifman and Lafon (2006) and appearing in Montobbio et al. (2019a), see Eq. (4). Specifically, in the current case of a discrete, translation-invariant kernel $\widetilde{K}$, the operator reads:

$N\big( \widetilde{K} \big)_{i,j}(\delta) = \frac{\widetilde{K}_{i,j}(\delta)}{d_i},$

where

$d_i = \sum_{j} \sum_{\delta} \widetilde{K}_{i,j}(\delta).$
The update rule of a KerCNN layer is inspired by the iterative procedure outlined in the preceding section, designed to model the propagation of neural activity in V1:

(11)   $a^{(\ell)}_1 = \sigma\big( W^{(\ell)} \ast a^{(\ell-1)} + b^{(\ell)} \big), \qquad a^{(\ell)}_{t+1} = \frac{1}{2}\Big( K^{(\ell)} \ast a^{(\ell)}_t + a^{(\ell)}_t \Big), \quad t = 1, \dots, T_\ell - 1.$

The output of the $(\ell-1)$-th layer is first mapped to the $\ell$-th feature space through a feedforward step, yielding an activation $a^{(\ell)}_1$, which is then updated through convolution with the kernel $K^{(\ell)}$, as in (6). At each step, the new output is defined by averaging between the updated activation $K^{(\ell)} \ast a^{(\ell)}_t$ and the current activation $a^{(\ell)}_t$. Note again an analogy with Eq. (2). The same procedure is repeated, yielding a sequence of activations $a^{(\ell)}_1, \dots, a^{(\ell)}_{T_\ell}$, until a fixed stopping time $T_\ell$ is reached. Note that each layer has its own stopping time: this yields a different KerCNN architecture for each combination of the stopping times of the layers. If all stopping times are 1, the model coincides with the base CNN. We remark that convolutions with the kernel $K^{(\ell)}$ are taken with appropriate zero padding, so that the size of the activation is preserved at every iteration.
The intuitive idea here is that $K^{(\ell)}$ behaves like a “transition kernel” on the feature space of the $\ell$-th layer, slightly modifying its output according to the correlation between its filters: the activation of a filter encourages the activation of other filters highly correlated with it.
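The following PyTorch sketch assembles the two ingredients above into a single layer: the lateral kernel is computed from the feedforward weights as in Eq. (10) (with a plain sigmoid and a simple sum normalization standing in for $\varsigma$ and $N$ – both illustrative assumptions), and is applied via the update rule of Eq. (11). Since the kernel is a function of the feedforward weights, the layer adds no trainable parameters.

```python
# Sketch of a KerCNN layer: the lateral kernel is derived from the
# feedforward filters (Eq. (10)) and applied iteratively (Eq. (11)).
import torch
import torch.nn as nn
import torch.nn.functional as F

class KerConvLayer(nn.Module):
    def __init__(self, in_ch, out_ch, k=3, steps=3):
        super().__init__()
        self.ff = nn.Conv2d(in_ch, out_ch, k, padding=k // 2)
        self.k = k
        self.steps = steps          # stopping time T of the layer

    def lateral_kernel(self):
        W = self.ff.weight                       # (out_ch, in_ch, k, k)
        # Scalar products between all filter pairs at all relative shifts:
        # a (out_ch, out_ch, 2k-1, 2k-1) kernel, as in Eq. (10).
        K = torch.sigmoid(F.conv2d(W, W, padding=self.k - 1))
        # Normalize so that it acts as a transition kernel (sketch of N).
        return K / K.sum(dim=(1, 2, 3), keepdim=True)

    def forward(self, x):
        a = F.relu(self.ff(x))                   # feedforward step (t = 1)
        K = self.lateral_kernel()                # a function of the weights:
        for _ in range(self.steps - 1):          # no new parameters
            # Eq. (11); zero padding preserves the activation size.
            a = 0.5 * (F.conv2d(a, K, padding=self.k - 1) + a)
        return a

out = KerConvLayer(1, 16, steps=3)(torch.randn(8, 1, 28, 28))  # (8, 16, 28, 28)
```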

3.2. Task: stability to corrupted images

We will show that the insertion of such structured lateral connections improves the performance of a CNN in tasks related to perceptual mechanisms of global shape analysis and integration. In particular, we focus on classification of corrupted images. Given a labeled image dataset, each model is trained in a supervised way to perform classification. No corruption is applied to the images during the training phase. The actual experiment consists in analyzing the ability of the models to generalize the classification to the degraded images, by comparing their classification accuracy on corrupted testing images. We examine the following different kinds of image corruption.

  1. Gaussian patches occluding the image, similar to the ones in Tang et al. (2018).

  2. Disruption of local contours, in analogy with the study presented by Baker et al. (2018), obtained by subdividing the image into horizontal or vertical strips and by shifting each of these strips by a random number of pixels in $\{-d, \dots, d\}$.

  3. Adversarial attacks through the Fast Gradient Sign Method (FGSM) of Goodfellow et al. (2015). FGSM, one of the most popular attack methods, simply adjusts the input image by taking a gradient ascent step to maximize the loss function. Precisely, the perturbed image is obtained as

    (12)   $\tilde{I} = I + \varepsilon\, \mathrm{sign}\big( \nabla_I\, L(y(I), t) \big),$

    where $L$ is the loss function, as in Eq. (8). A sketch of this attack is given at the end of this section.

In all three cases, the amount of degradation can be quantified by one parameter: the standard deviation $\sigma$ of the Gaussian patches, the maximum displacement $d$ of the strips, and the step size $\varepsilon$ of the FGSM. The more stable a model is to these perturbations, the slower the drop in performance w.r.t. the degradation parameter.
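For reference, here is a minimal PyTorch sketch of the FGSM perturbation of Eq. (12); the function name and interface are illustrative.

```python
# One-step FGSM attack, as in Eq. (12).
import torch
import torch.nn.functional as F

def fgsm(model, image, label, eps):
    """Perturb `image` by one gradient-ascent step on the loss."""
    image = image.clone().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)   # cross entropy, Eq. (8)
    loss.backward()
    return (image + eps * image.grad.sign()).detach()
```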

4. Results

In this section, we first provide a complete analysis of the results obtained on the MNIST dataset (LeCun et al., 1998): we compare the performance of a 2-layer CNN model with the ones of the corresponding KerCNN and RecCNN models, for varying stopping times $T_1$ and $T_2$ and for different types and amounts of image degradation (as outlined in Section 3.2). We then give a more synthetic report on the same study carried out on the Kuzushiji-MNIST (Clanuwat et al., 2018), Fashion-MNIST (Xiao et al., 2017) and CIFAR-10 (Krizhevsky, 2009) datasets. All the experiments were implemented using PyTorch (Paszke et al., 2017).

4.1. MNIST

We start by considering the MNIST dataset (LeCun et al., 1998), consisting of 70000 labeled grayscale images of handwritten digits from 0 to 9: see the sample in Figure 1a. The default train-test split is 60000/10000. We retained a part of the images from the training set for validation-based early stopping, so that the final dataset used consisted of 50000 training samples, 10000 validation samples and 10000 testing samples. We trained the networks on the original training images, and we tested them on corrupted testing images, according to the three types of degradation mentioned above. Some examples are displayed in Figure 1b-d.

Figure 1. (a) A sample from the MNIST dataset. (b) A testing image corrupted by a Gaussian patch of increasing standard deviation $\sigma$. (c) A testing image corrupted by an increasing amount $d$ of local contour disruption. (d) Testing images perturbed by applying FGSM to the base CNN, with increasing values of $\varepsilon$. Below each image, we display the classified label, as well as the correct label (in brackets). Apart from the unperturbed one ($\varepsilon = 0$), all the images are misclassified by the CNN.
Figure 2. Our KerCNN model with structured lateral connections defined by kernels $K^{(1)}$ and $K^{(2)}$.

4.1.1. Base model

Our base model is a CNN with 2 hidden layers. We take 16 filters in the first convolutional layer and 16 filters in the second convolutional layer, each followed by ReLU activation and max pooling, and a fully connected last layer followed by softmax activation. The total number of trainable parameters is 7482. We then compare this model with the one obtained from it by inserting the structured lateral connections. See Figure 2 for a description of the model. The lateral kernels in this case have spatial size $2 s_\ell - 1$, where $s_\ell$ is the spatial size of the corresponding feedforward filters (see Section 3.1). We also analyze the performance of the model obtained from the CNN by inserting recurrent connections according to the RecCNN model, i.e. through the update rule (9). As said before, lateral connections given by the kernels $K^{(\ell)}$ do not introduce new parameters in the starting CNN. On the other hand, the insertion of learned lateral connections results in a model with more parameters than the base CNN: for example, the introduction of learned lateral kernels in the first layer of the base model would add 4096 new parameters to the original 7482. In the following, we consider a 7482-parameter version of the RecCNN, obtained by decreasing the size of feedforward filters in order to compensate for the extra recurrent parameters, as in Spoerer et al. (2017).
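A sketch of this base architecture in PyTorch is given below; the filter sizes and pooling windows are illustrative assumptions, since only the number of filters and the total parameter count are specified above.

```python
# Sketch of the 2-layer base CNN: 16 filters per convolutional layer, ReLU,
# max pooling, one fully connected layer. Filter sizes are assumptions.
import torch
import torch.nn as nn

class BaseCNN(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 16, kernel_size=5)
        self.conv2 = nn.Conv2d(16, 16, kernel_size=5)
        self.pool = nn.MaxPool2d(2)
        self.relu = nn.ReLU()
        self.fc = nn.LazyLinear(n_classes)    # flattened output -> class scores

    def forward(self, x):
        x = self.pool(self.relu(self.conv1(x)))
        x = self.pool(self.relu(self.conv2(x)))
        return self.fc(torch.flatten(x, 1))   # softmax is applied in the loss

logits = BaseCNN()(torch.randn(8, 1, 28, 28))   # -> (8, 10)
```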

4.1.2. Training details

All the models were trained with validation-based early stopping, for a maximum of 150 epochs. Adam optimizer was employed with the standard parameters indicated in Kingma and Ba (2015), a batch size of 50 and the Xavier initialization scheme (Glorot and Bengio, 2010); $L_2$ regularization was used as well. In the models including lateral connections (of any kind), recurrent dropout (Semeniuta et al., 2016) with .2 probability was applied to the “horizontal” contributions. In RecCNNs, local response normalization (LRN) was applied after recurrent convolutional layers as in Liang and Hu (2015) and Spoerer et al. (2017). The training and testing images were z-score normalized according to the mean and standard deviation computed across the whole training set. For each architecture (i.e. each combination of stopping times $T_1$ and $T_2$), 10 nets initialized with different random seeds were trained. The results displayed in the following are obtained by testing all 10 nets and averaging the classification accuracy over trials. Note that the testing itself introduces a further element of randomness over trials, since the perturbations are applied to the images at each evaluation, yielding possibly different results. Error bars (95% confidence intervals) are shown in the plots to keep track of the variability across initialization seeds and image perturbations.

4.1.3. Gaussian patches

We first consider testing images corrupted by occlusions in the form of Gaussian “bubbles” at random locations over the image, similar to the ones considered by Tang et al. (2018). Specifically, the image $\tilde{I}$ obtained by modifying the original input $I$ through a patch centered at $x_0$ was implemented as:

$\tilde{I}(x) = \big(1 - g_{x_0}(x)\big)\, I(x) + g_{x_0}(x)\, b, \qquad g_{x_0}(x) = e^{-\|x - x_0\|^2 / (2\sigma^2)},$

where $\sigma$ is the standard deviation of the patch and $b$ is the “background color”, chosen to be the value at the upper left angle of each image. See Figure 1b. The number of patches per image was kept fixed to 4. In the following, we show the results of comparing the classification accuracy of the CNN and KerCNN models for varying amounts of image degradation (i.e. standard deviation $\sigma$ of the Gaussian bubbles, expressed in pixels) and for different stopping times of KerCNN.
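A numpy sketch of this occlusion, under the blending form reconstructed above (function name and defaults are illustrative):

```python
# Gaussian-patch occlusion: each patch blends the image toward the
# background color with a Gaussian weight centered at a random location.
import numpy as np

def add_gaussian_patches(img, std, n_patches=4, rng=np.random.default_rng()):
    h, w = img.shape
    b = img[0, 0]                                  # "background color"
    yy, xx = np.mgrid[0:h, 0:w]
    out = img.astype(float).copy()
    for _ in range(n_patches):
        cy, cx = rng.integers(0, h), rng.integers(0, w)
        g = np.exp(-((yy - cy) ** 2 + (xx - cx) ** 2) / (2 * std ** 2))
        out = (1 - g) * out + g * b                # blend toward background
    return out
```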

Figure 3. Results for MNIST testing images corrupted through Gaussian patches, for KerCNN with lateral connections in the first (a), resp. second layer (b). Left plots: accuracy ($y$-axis) at increasing values of $\sigma$ ($x$-axis), for stopping time $T_1$ (a), resp. $T_2$ (b). Right plots: accuracy for increasing values of $T_1$ (a), resp. $T_2$ (b), for different values of degradation. Each curve refers to a value of $\sigma$, specified in red in correspondence of the curve.

We first examine the KerCNN defined by inserting lateral connections in the first layer of the base CNN. Figure 3a (left) shows its classification accuracy for varying values of standard deviation $\sigma$ of the Gaussian patches. The three graphs displayed refer to different stopping times $T_1$. The chance level accuracy (10%) is displayed as well (dashed blue line). For $T_1 = 1$, the model is the standard CNN with no lateral connections. The mean performance of these three nets on the original testing set ($\sigma = 0$) is almost identical (around 99%). On the other hand, for increasingly degraded images the performance drops dramatically for the CNN ($T_1 = 1$, blue curve), while decaying much more slowly for increasing values of $T_1$. Note that the difference in classification accuracy between the CNN and the best KerCNN reaches about 17 points (cf. Table 1). After reaching its optimal value ($T_1 = 2$ for $\sigma = 5$ and $T_1 = 3$ for greater values), the performance drops again by taking further steps. For the sake of legibility, we displayed in the left plot only the curves up to the optimal value of $T_1$. The behavior of classification accuracy w.r.t. $T_1$ can be best appreciated in the right plot of Figure 3a, displaying a curve for each value of the standard deviation $\sigma$: for every $\sigma$, the accuracy increases w.r.t. $T_1$ until a maximum is reached, and then decreases again.

We now analyze the performance of the KerCNN models with lateral connections:

  • only in the second layer;

  • in both layers.

Analogous to the preceding case, the optimal stopping time $T_2$ for the net with lateral connections in the second layer increases with the amount of degradation: it is smallest for the original images and grows for greater values of standard deviation. Figure 3b (left) plots the accuracy against the level of degradation: we display the curves up to the optimal value of $T_2$; the accuracy w.r.t. stopping times is plotted in Figure 3b (right), where each curve corresponds to a level of image degradation. The results show the same pattern as before, although with a smaller improvement between the base CNN and the model with optimal $T_2$.
It is interesting to note that the optimal number of iterations shifts towards higher values (for both layers) as the size of the occlusions increases. As mentioned before, the kernel $K^{(\ell)}$ can be thought of as an anisotropic transition kernel on the space of activations of the $\ell$-th layer. As such, the repeated application of the lateral contribution given by these kernels may be interpreted as a spreading of activation, around each spatial location, along those orientations that are most activated at that point.

Figure 4. Classification accuracy (color-coded) for KerCNN for all combinations of $(T_1, T_2)$, displayed for increasing values of $\sigma$. The maximum value of accuracy is marked by a red star onto the corresponding cell.

Intuitively, this “compensates” for the gaps in the activation caused by the occlusions: the wider the gap, the higher the number of iterations of the kernel needed for the image to be consistently completed.
We finally study the combinatorics of stopping times in the two layers: Figure 4 displays the results for different levels of image degradation. For each combination of $T_1$ and $T_2$, the mean accuracy over all trials (color-coded) is displayed. Note that the highest values of accuracy lie on a diagonal that shifts towards higher values of both $T_1$ and $T_2$ as the level of degradation increases. It is interesting to observe that, for higher levels of degradation, the optimal couple $(T_1, T_2)$, highlighted by a red star, is one involving lateral connections in both layers.

4.1.4. Local contour disruption

In Baker et al. (2018), evidence is provided that the feature extraction performed by deep CNNs mostly relies on local edge relations, rather than on global object shapes. Their experiments showed that, contrary to human vision, the networks’ performance was much more robust to global shape changes preserving local features than to a disruption of local contours preserving the global information. We hypothesized that the insertion of structured lateral connections in CNNs could make the models more robust to these local perturbations.

Figure 5. Results for MNIST testing images corrupted through local contour disruption, for KerCNN with lateral connections in the first (a), resp. second layer (b). Left plots: accuracy at increasing values of displacement $d$, for stopping time $T_1$ (a), resp. $T_2$ (b). Right plots: accuracy for increasing values of $T_1$ (a), resp. $T_2$ (b), for different values of degradation. Each curve refers to a value of $d$, displayed in red in correspondence of the curve.

To automatically create a “local scrambling” of pixel information, we subdivided the images into horizontal strips and shifted each of these strips by a number of pixels randomly picked in $\{-d, \dots, d\}$; we then repeated the procedure by subdividing the modified image into vertical strips and by shifting them as well. For a small displacement ($d = 1$), this produces a local degradation analogous to the one considered by Baker et al. (2018), where the local contours are corrupted but the connected components are preserved. For increasing values of $d$, the image is more and more disrupted, yet still roughly preserving its global structure. See Figure 1c. As before, we compare the classification accuracy of the models for an increasing amount of degradation, given in this case by the maximum displacement $d$, which was kept the same for both horizontal and vertical strips.
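A numpy sketch of this scrambling procedure (the strip width, the wrap-around shifting, and the function name are illustrative assumptions):

```python
# Local contour disruption: cut the image into strips, shift each strip by a
# random number of pixels in [-d, d]; first horizontal, then vertical strips.
import numpy as np

def shift_strips(img, d, strip_width=4, rng=np.random.default_rng()):
    out = img.copy()
    for axis in (0, 1):                          # horizontal, then vertical
        n = out.shape[axis]
        for start in range(0, n, strip_width):
            shift = int(rng.integers(-d, d + 1))
            sl = [slice(None)] * 2
            sl[axis] = slice(start, min(start + strip_width, n))
            # np.roll wraps pixels around; zero padding is another option.
            out[tuple(sl)] = np.roll(out[tuple(sl)], shift, axis=1 - axis)
    return out
```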

Figure 6. Classification accuracy (color-coded) for KerCNN for all combinations of $(T_1, T_2)$, displayed for increasing values of the displacement $d$. The maximum value of accuracy is marked by a red star onto the corresponding cell.

In the present experiments, $d$ varies from 0 to 4 pixels. In this case, the performance of the models at high degradation turns out to rise for increasing stopping times up to the largest value tested for the models with lateral connections in the second layer, while there is a peak in performance at an intermediate value of $T_1$ for the ones with lateral connections in the first layer: see Figure 5. A similar situation can be observed when analyzing the combinatorics of stopping times for the first and second layers, as shown in Figure 6: the optimal couple of values shifts towards the maximum as the displacement increases, and the best accuracy is reached at the maximal combination $(6, 6)$ above a certain amount of degradation.

4.1.5. Adversarial attacks

Finally, we tested the robustness of our model to adversarial attacks via FGSM. Figure 1d shows some examples of images obtained through (12) applied to the base CNN for MNIST, for increasing values of $\varepsilon$.

Figure 7. Results for MNIST testing images perturbed via FGSM, for KerCNN with lateral connections in the first (a), resp. second layer (b). Left plots: accuracy at increasing values of the FGSM parameter $\varepsilon$, for stopping time $T_1$ (a), resp. $T_2$ (b). Right plots: accuracy for increasing values of $T_1$ (a), resp. $T_2$ (b), for different values of degradation. Each curve refers to a value of $\varepsilon$, displayed in red in correspondence of the curve.

For sufficiently small $\varepsilon$, this perturbation results in an image that is almost identical to the original one to the human eye; however, these images are misclassified by the network.
Again, we first examine the performance of the models with lateral connections in one layer at a time, for varying $T_1$ and $T_2$ respectively. Figure 7 displays the classification accuracies of these models for varying $\varepsilon$ and $T_1$ (a), and for varying $\varepsilon$ and $T_2$ (b). As before, the left figure plots the accuracy against the amount of degradation, with a curve for each stopping time, while the right figure plots the accuracy against the stopping time, with a curve for each value of $\varepsilon$.

Figure 8. Classification accuracy (color-coded) for KerCNN for all combinations of $(T_1, T_2)$, displayed for increasing values of $\varepsilon$. The maximum value of accuracy is marked by a red star onto the corresponding cell.

Finally, Figure 8 displays the analysis of the combinatorics of $T_1$ and $T_2$. Similarly to the case of Gaussian patches, the highest accuracy values lie on a diagonal. However, while in that case the optimal combination was clearly located around a single spot, two peaks develop in the current case, corresponding to a high value of one of the two stopping times combined with a low value of the other.

We will summarize the main results obtained for all datasets in Table 1, showing the difference in mean percent accuracy between the base CNN and the optimal KerCNN model, along with the corresponding combination of stopping times $(T_1, T_2)$. A possible concern about our approach is the fact that we do not identify a combination that is optimal for all tasks, thus raising the issue of how to choose the stopping times when the amount of degradation is not known a priori. We nonetheless remark that, although the optimal combination of $(T_1, T_2)$ varies, the KerCNNs with stopping times greater than 1 outperform the base CNN in practically all the tasks.

4.1.6. Comparison with learned kernels

We now compare our model with the RecCNN architectures described above. Here, recurrent convolutional connections as described in Section 2.3 have been added in the first (resp. second) layer; the size of the feedforward filters of the second layer has been decreased to make the number of parameters match with the base CNN (as in Spoerer et al., 2017). The performance of these RecCNN models on the tasks examined before has been compared to that of the base CNN, as well as with the corresponding KerCNNs. In most experiments, the RecCNN model did not reach better accuracies than the base CNN on corrupted images, although in some cases a pattern similar to the one seen for KerCNNs could be observed: in such cases, the performance increased until an optimal stopping time. However, the improvement in accuracy w.r.t. the CNN turned out to be much smaller than the one obtained by KerCNN models. Moreover, the geometric content of these learned lateral kernels is not evident and the iterative steps taken according to (9) do not seem to implement a kind of propagation – a hint of this lies in the fact that the optimal stopping time for RecCNNs never depends on the amount of degradation of the testing images.
In Figure 9, we compare the accuracies of the KerCNN and RecCNN architectures for the corresponding optimal stopping times for each task. In all plots, the filled curves refer to KerCNN models, while the accuracy of RecCNNs is displayed by dashed curves. The color of each curve matches the one used for the corresponding stopping time in all the plots throughout the paper.
Note that, in Figure 9 (top), curves for KerCNN with two different stopping times are displayed. Although the KerCNN model with the stopping time that is optimal for large occlusions (orange curve) widely outperforms the optimal RecCNN for all values of standard deviation above 10, the RecCNN displays a higher accuracy with small occlusions. However, for these smaller patches the optimal stopping time for KerCNN is a smaller one (green curve), and this model outperforms the best RecCNN for all values of degradation. A similar situation can be observed in Figure 9 (middle) for local edge disruption, where curves for two stopping times are displayed for the KerCNN model.
To sum up, the KerCNN model clearly outperforms the corresponding RecCNN architecture, when comparing the two for their respective best stopping times, for almost all tasks examined. It is interesting to note that the only case in which RecCNNs show a higher accuracy than KerCNNs for some values of degradation (only for lateral connections in the first layer) is when the images are perturbed via FGSM with large values of $\varepsilon$. This suggests that, although the recurrent structure of RecCNNs may help improve the stability to “noise-like” perturbations, the absence of a geometric prior prevents them from implementing any mechanism of completion or contour integration.

Figure 9. Comparison between optimal KerCNN and optimal RecCNN. Top: Gaussian patches; middle: local edge disruption; bottom: adversarial attacks.

It is worth noting that, in the study carried out by Spoerer et al. (2017), the networks were trained and tested to recognize cluttered digits: in their experiments, RecCNNs significantly outperform the purely convolutional architectures, thus showing the benefits of recurrence in learning challenging tasks. On the other hand, our study shows that this does not extend to the case where the networks are facing nuisances for which they were not specifically optimized. For such a generalization task, our structured lateral connections inducing a geometric prior turn out to be much more effective.

4.2. Other datasets

In this last section, we provide a synthetic report of our results on some different datasets, namely Kuzushiji-MNIST (Clanuwat et al., 2018), Fashion-MNIST (Xiao et al., 2017) and CIFAR-10 (Krizhevsky, 2009). We then illustrate our results through a summary table, which exhibits the improvement in accuracy obtained with the optimal $(T_1, T_2)$ w.r.t. the base CNN as an index of effectiveness of KerCNNs.

4.2.1. Kuzushiji-MNIST and Fashion-MNIST

In order to analyze the effect of our lateral connections on different images while keeping most of our settings unchanged, we examined two MNIST-like datasets: the Kuzushiji-MNIST dataset, containing 10 phonetic letters of hiragana, one of the components of the Japanese writing system; and the Fashion-MNIST dataset, consisting of Zalando’s article images subdivided into 10 item categories (T-shirt/top, Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker, Bag, Ankle boot). Both datasets are made up of 70000 images of size 28×28, with the same training-testing split as in MNIST. Figure 10 displays, for each of these two datasets, some representatives of their 10 classes, as well as a few testing images corrupted by the three types of degradation examined. These have been implemented exactly as for MNIST, except for some changes in the range of degradation values considered (see Section 4.2.3).

Figure 10. Examples from the Kuzushiji-MNIST (left) and Fashion-MNIST (right) datasets. For each database, the images display: (a) A sample from the dataset. Each row corresponds to a class. (b) A testing image corrupted by a Gaussian patch of increasing standard deviation $\sigma$. (c) A testing image corrupted by an increasing amount $d$ of local contour disruption. (d) Testing images from different classes, perturbed by applying the FGSM to the base CNN with increasing values of $\varepsilon$. Below each image, we display the classified label, as well as the correct label (in brackets). Apart from the unperturbed image ($\varepsilon = 0$), all the images are misclassified by the CNN.
Again, we considered a CNN with 2 hidden layers as a base model; the architecture is the same, except for the number of filters of the second layer, which was set to 32 instead of 16, so that the total number of parameters becomes 14538. The training options were kept the same as before, except for the $L_2$ regularization parameter, which was set to a different value for Kuzushiji-MNIST. With these choices, the mean accuracy of the base CNN is 93.13% on Kuzushiji-MNIST and 89.86% on Fashion-MNIST.

4.2.2. CIFAR-10

The CIFAR-10 dataset consists of 60000 32×32 color images in 10 classes (0: Airplane, 1: Automobile, 2: Bird, 3: Cat, 4: Deer, 5: Dog, 6: Frog, 7: Horse, 8: Ship, 9: Truck). In contrast with MNIST-like datasets, CIFAR-10 poses the significantly harder problem of recognizing objects in natural scene images. The dataset includes 50000 training images and 10000 test images. We extracted 10000 images from the training set to use for validation-based early stopping – so that in our experiments the models were trained on 40000 samples, validated on 10000 samples and tested on 10000 samples. Figure 11 shows some examples of (original as well as perturbed) testing images from CIFAR-10. The perturbations have been applied to the images by simply extending the former methods to three channels. Our base model is a 2-layer CNN with the same architecture as before, but with 64 and 128 filters respectively in the first and second convolutional layers. Moreover, since the images are RGB, the filters of the first layer have three channels in this case. The models were trained with early stopping for a maximum of 300 epochs.

Figure 11. (a) A sample from the CIFAR-10 dataset. Each row corresponds to a class. (b) A testing image corrupted by a Gaussian patch of increasing standard deviation $\sigma$. (c) A testing image corrupted by an increasing amount $d$ of local contour disruption. (d) Testing images from different classes, perturbed by applying the FGSM to the base CNN with increasing values of $\varepsilon$. Below each image, we display the classified label, as well as the correct label (in brackets). Apart from the unperturbed image ($\varepsilon = 0$), all the images are misclassified by the CNN.

Stochastic gradient descent was employed with an initial learning rate of .01, which was automatically decreased by a factor of 10 when validation accuracy stopped increasing for 10 epochs. We used a batch size of 64 samples and $L_2$ regularization. Also, dropout with .5 probability was employed in the last layer. The rest of the settings were kept the same as for the other datasets. Due to the longer training times, the results displayed for each architecture are obtained by averaging over 3 networks, instead of 10, trained with different random seeds. Moreover, we let the stopping times vary in a smaller range. We remark that we are employing a rather small CNN (the total number of parameters is 214922), and no data augmentation is used. With these settings, the mean accuracy of the base CNN on CIFAR-10 is 75.64%. We stress that our aim is to determine the improvement brought by our lateral kernel: in order to better assess its effect, we thought it best to consider a simple network as a base model.

4.2.3. Results overview

Our results on all considered datasets are summarized in Table 1. For each dataset, the three row blocks correspond to the three types of perturbation examined. For each type of image degradation and each level of corruption, the table displays the mean percent accuracies of the base CNN and the best KerCNN, as well as their difference. The combination $(T_1, T_2)$ leading to the best KerCNN performance is shown next to the corresponding accuracy value. When the optimal performance is reached by the base CNN itself, the best KerCNN combination is displayed nonetheless.

MNIST
std 0 5 10 15 20 25 30
base CNN 99.05% 80.13% 46.84% 25.73% 18.28% 15.33% 14.10%
best KerCNN (1,2) 99.08% (2,1) 82.96% (3,1) 63.77% (3,2) 52.47% (3,2) 41.18% (3,2) 30.34% (3,1) 22.65%
difference +0.04% +2.83% +16.94% +26.74% +22.90% +15.01% +8.55%
Shift 0 1 2 3 4
base CNN 99.05% 94.39% 61.14% 28.83% 17.93%
best KerCNN (1,2) 99.08% (1,5) 96.89% (5,5) 85.45% (6,6) 62.11% (6,6) 41.28%
difference +0.03% +2.51% +24.32% +33.29% +23.35%
FGSM 0 .05 .1 .15 .2 .25
base CNN 99.05% 94.23% 74.54% 38.12% 13.82% 14.83%
best KerCNN (1,2) 99.08% (1,5) 95.71% (1,6) 86.89% (2,6) 69.20% (2,6) 44.37% (2,6) 20.16%
difference +0.04% +1.47% +12.35% +31.08% +30.55% +5.33%
Kuzushiji-MNIST
std 0 5 10 15 20 25 30
base CNN 93.13% 74.20% 39.67% 21.22% 14.81% 12.39% 11.41%
best KerCNN (1,2) 93.13% (2,1) 75.72% (3,3) 59.96% (3,3) 51.44% (3,3) 43.69% (3,3) 36.79% (3,3) 30.38%
difference +0.00% +1.53% +20.29% +30.22% +28.89% +24.40% +18.97%
Shift 0 1 2 3 4
base CNN 93.13% 85.06% 61.62% 42.15% 31.49%
best KerCNN (1,2) 93.13% (1,4) 87.91% (3,4) 73.60% (5,2) 59.15% (5,3) 47.48%
difference +0.00% +2.85% +11.99% +17.00% +16.00%
FGSM 0 .05 .1 .15 .2 .25
base CNN 93.13% 65.03% 28.15% 11.28% 6.36% 3.95%
best KerCNN (1,2) 93.13% (1,5) 74.08% (1,6) 48.91% (1,6) 25.63% (5,5) 13.74% (5,6) 7.76%
difference +0.00% +9.05% +20.76% +14.35% +7.38% +3.81%
Fashion-MNIST
std 0 5 10 15 20 25 30
base CNN 89.86% 72.03% 49.07% 32.47% 22.43% 17.13% 14.18%
best KerCNN (1, 3) 90.02% (3, 1) 73.37% (3, 2) 55.44% (3, 2) 43.03% (3, 2) 31.71% (4, 4) 25.55% (4, 4) 22.57%
difference +0.16% +1.33% +6.37% +10.55% +9.28% +8.42% +8.39%
Shift 0 1 2 3 4
base CNN 89.86% 77.27% 58.58% 44.33% 34.81%
best KerCNN (1,3) 90.02% (4,3) 83.69% (5,4) 72.18% (6,6) 66.43% (6,6) 60.87%
difference +0.16% +6.42% +13.61% +22.10% +26.06%
FGSM 0 .02 .04 .06 .08 .1
base CNN 89.86% 53.81% 31.49% 18.53% 13.01% 10.13%
best KerCNN (1,3) 90.02% (2,6) 70.48% (2,6) 54.83% (2,6) 42.78% (2,6) 32.84% (2,6) 25.57%
difference +0.16% +16.67% +23.34% +24.25% +19.82% +15.45%
CIFAR-10
std 0 5 10 15 20 25 30
base CNN 75.64% 58.22% 32.84% 22.89% 19.27% 17.97% 17.40%
best KerCNN (2,1) 75.57% (2,1) 58.08% (2,1) 32.90% (2,2) 23.57% (3,2) 20.53% (3,2) 19.33% (4,1) 18.89%
difference -0.07% -0.14% +0.06% +0.67% +1.26% +1.36% +1.49%
Shift 0 1 2 3 4
base CNN 75.64% 41.70% 27.70% 23.71% 21.91%
best KerCNN (2,1) 75.57% (4,4) 52.97% (4,4) 43.33% (4,4) 36.72% (4,4) 31.99%
difference -0.07% +11.27% +15.63% +13.02% +10.08%
FGSM 0 .005 .01 .015 .02 .025
base CNN 75.64% 42.90% 21.80% 10.83% 5.32% 2.86%
best KerCNN (2,1) 75.57% (2,3) 51.25% (3,4) 35.55% (4,4) 25.58% (4,4) 18.66% (4,4) 13.53%
difference -0.07% +8.35% +13.75% +14.75% +13.33% +10.67%
Table 1. Overview of the results for MNIST, Kuzushiji-MNIST, Fashion-MNIST and CIFAR-10. For each degradation value, the accuracy of the base CNN is compared with that of the best KerCNN. The optimal combination of stopping times is also shown in each case.

As regards Kuzushiji-MNIST, the best performance improvement for images occluded by Gaussian patches is comparable to the one obtained for MNIST. However, a greater contribution of the second layer’s kernel can be observed: the optimal combinations of stopping times display larger values for the second layer under this type of degradation. This may be due to the more frequent occurrence, w.r.t. MNIST, of complex patterns (such as crossings and loops) requiring a “higher order” analysis. On images subject to local displacement, on the other hand, the stopping times yielding the best accuracy are overall smaller, and the performance differences are significantly smaller than for MNIST. In fact, the abundance of small details in these characters makes this kind of perturbation far more disruptive than it is for MNIST’s digits: even a small displacement may completely destroy tiny yet characterizing features. Finally, the results for adversarial attacks with small values of ε are analogous to the ones obtained for digits, although with a faster decay in accuracy. Although a configuration different from MNIST is observed for ε ≥ .2, the accuracy values are around (or even below) chance level in these cases, which makes it somewhat pointless to speculate about them.

Let us now examine the results obtained on the Fashion-MNIST dataset. As for the images occluded by Gaussian patches, the slightly increased contribution of the second layer w.r.t. MNIST is again probably due to the heterogeneity of features characterizing these images, which include both extended contours and tiny, intricate line patterns. For this type of perturbation, the improvement provided by our lateral connections is more moderate than for the preceding datasets, reaching a maximum accuracy difference of about 10%. This may depend on such images being largely composed of “solid color” areas rather than lines. Intuitively, when an occlusion falls in the middle of one such area, it does not interrupt a curve or a contour: the activation values of filters sensitive to local orientation are therefore very low at these locations, and consequently the action of the kernel on them is less relevant. On the other hand, the perturbation obtained by shifting horizontal and vertical strips does not affect constant areas, while it consistently disrupts the image edges. Moreover, differently from Kuzushiji-MNIST’s characters, global shapes rather than local details are the main discriminative cue between Fashion-MNIST classes. This makes our lateral connections particularly well suited to handle this kind of perturbation. Indeed, a far greater improvement in the CNN performance can be observed w.r.t. Kuzushiji-MNIST in this case, especially for large values of the displacement: as an example, for a displacement of 4 pixels, the 35% accuracy obtained by the base CNN rises above 60% with the optimal KerCNN model. Finally, as for adversarial attacks, we considered values of ε varying in a smaller range, since the decay in performance on this dataset turned out to be much faster; namely, we took ε ∈ {.02, .04, .06, .08, .1}. Again, up to this rescaling, the results are analogous to the other datasets.

As for CIFAR-10, the performance of CNNs and KerCNNs on images corrupted by Gaussian patches is comparable for all values of the standard deviation, with a slight advantage for KerCNNs for occlusions large enough (std ≥ 15). In our view, such “insensitivity” of lateral kernels to this type of perturbation may be linked to the increased difficulty of dealing with color images; this aspect certainly requires further investigation. On the other hand, the improvement obtained by KerCNNs w.r.t. CNNs on images subject to edge disruption and adversarial attacks is still substantial (up to 15%). Note that the value of ε for adversarial attacks was allowed to vary in {.005, .01, .015, .02, .025} in this case, again due to the faster decay in accuracy w.r.t. the other datasets.

Overall, we believe these results are very promising, both in terms of the model’s effectiveness for image recognition under challenging conditions and in terms of its interpretation in the light of biological vision.

5. Conclusion

In this article we introduced KerCNN, a modification of a CNN architecture obtained by adding biologically inspired lateral connections. Such connections are determined by convolutional kernels iteratively applied to the output of each convolutional layer, and defined through a notion of correlation between the filters of that layer, as in the cortical connectivity model of Montobbio et al. (2019a, b). This establishes a link between the geometry of feedforward and lateral connections, since the latter are defined in terms of the former. Moreover, since the lateral kernels are a deterministic function of the convolutional filters, the number of parameters of the original CNN is left unchanged, thus allowing a fair comparison between a base CNN architecture and the KerCNNs obtained from it.
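As an illustration, here is a minimal PyTorch sketch of how such a parameter-free kernel could be assembled from the filters of a layer and iterated on its activations, under one plausible reading of the construction: the rectification and normalization steps below are our simplifying assumptions, not the paper’s exact definition, which is given earlier in the text.

```python
import torch
import torch.nn.functional as F

def lateral_kernel(W, pad):
    # W: feedforward filters of one layer, shape (C, C_in, k, k).
    # K[i, j] holds the cross-correlation of filter i with filter j at every
    # relative displacement up to `pad` pixels, so K has the shape
    # (C, C, 2*pad + 1, 2*pad + 1) of a convolutional kernel acting on the
    # layer's C activation maps.
    K = F.conv2d(F.pad(W, (pad, pad, pad, pad)), W)
    # Rectify and normalize so that K acts as a transition kernel on the
    # space of activations (this particular normalization is an assumption).
    K = K.clamp(min=0)
    K = K / K.sum(dim=(1, 2, 3), keepdim=True).clamp(min=1e-8)
    return K

def apply_lateral(a, K, n):
    # Iterate the parameter-free lateral kernel n times on the activation
    # a of shape (B, C, H, W); n plays the role of the stopping time.
    for _ in range(n):
        a = F.conv2d(a, K, padding=K.shape[-1] // 2)
    return a
```

In a KerCNN, apply_lateral would then be interposed after each convolutional layer, with the per-layer stopping times (the pairs reported in Table 1) setting the number of iterations.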

The models were compared on their ability to generalize a learned image classification task to unseen corrupted inputs. The types of perturbation applied to the images were chosen to disrupt discriminative local information, so as to “force” the networks to integrate context information in order to correctly recognize the corrupted input. The biological motivation for this testing framework is the close link between anatomical lateral connections and perceptual phenomena tied to global shape analysis. Our study revealed that inserting the proposed lateral connections into a 2-layer CNN critically enhanced its stability to all types of perturbation examined. Moreover, no such improvement was observed when introducing learned lateral kernels as in Liang and Hu (2015) and Spoerer et al. (2017). This suggests that the geometric information encoded in our lateral kernel plays a meaningful role in implementing mechanisms of pattern completion and contour integration, compensating for the missing information in the corrupted testing images. We remark that such mechanisms are “spontaneous”, in the sense that they are not enforced during the training stage: indeed, the networks were only trained to classify uncorrupted images.
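The adversarial perturbation used in these tests is the standard FGSM of Goodfellow et al. (2015); below is a minimal PyTorch sketch of the attack together with the clean-training/corrupted-testing protocol (helper names are ours).

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    # Fast Gradient Sign Method (Goodfellow et al., 2015): move each pixel
    # by eps in the direction that increases the classification loss.
    with torch.enable_grad():
        x = x.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x), y)
        grad, = torch.autograd.grad(loss, x)
    return (x + eps * grad.sign()).clamp(0.0, 1.0).detach()

def corrupted_accuracy(model, loader, corrupt):
    # Generalization test: `model` was trained on clean images only and is
    # evaluated on inputs degraded by `corrupt` (any of the perturbations;
    # for FGSM, `corrupt` closes over the base CNN).
    model.eval()
    correct = total = 0
    for x, y in loader:
        with torch.no_grad():
            pred = model(corrupt(x, y)).argmax(dim=1)
        correct += (pred == y).sum().item()
        total += y.numel()
    return 100.0 * correct / total
```

For instance, evaluating a KerCNN on attacks crafted against the base CNN, as in Table 1, would read corrupted_accuracy(ker_cnn, test_loader, lambda x, y: fgsm(base_cnn, x, y, eps=.1)).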

The main analysis was carried out on the MNIST dataset, and then extended to a few more image datasets. Notably, promising results were obtained on natural images from the CIFAR-10 dataset. As a future development, we intend to test our model on larger images and richer datasets. It would also be interesting to examine the connectivity kernels obtained for non-image data and for different tasks: as an example, the regularity enforced by our lateral kernels may be helpful in problems of sound source separation.

Another natural advancement would be to consider deeper architectures. Indeed, although the proposed architecture was motivated by a model of early visual areas, its flexibility could make it suitable for recovering patterns in higher-level processing as well. An analysis of the feature information encoded in the kernels associated with each layer may help gain better insight into the analysis carried out by the networks at each stage of their processing.

Acknowledgments

The authors have been supported by Horizon 2020 Project ref. 777822: GHAIA.

References

  • Abbasi-Sureshjani et al. (2018) Abbasi-Sureshjani, S., Favali, M., Citti, G., Sarti, A., and ter Haar Romeny, B. M. (2018). Curvature integration in a 5d kernel for extracting vessel connections in retinal images. IEEE Trans Image Process, 27:606–621.
  • Ambrosio and Masnou (2003) Ambrosio, L. and Masnou, S. (2003). A direct variational approach to a problem arising in image reconstruction. Interfaces and Free Boundaries, 5(1):63–81.
  • August and Zucker (2000) August, J. and Zucker, S. W. (2000). The curve indicator random field: Curve organization via edge correlation. In Boyer, K. and Sarkar, S., editors, Perceptual Organization for Artificial Vision Systems, volume 546 of The Kluwer International Series in Engineering and Computer Science. Springer, Boston, MA.
  • Baker et al. (2018) Baker, N., Lu, H., Erlikhman, G., and Kellman, P. J. (2018). Deep convolutional networks do not classify based on global object shape. PLOS Computational Biology, 14:1–43.
  • Barbieri et al. (2014) Barbieri, D., Cocci, G., Citti, G., and Sarti, A. (2014). A cortical-inspired geometry for contour perception and motion integration. J Math Imaging Vis, 49(3):511–529.
  • Bertalmio et al. (2000) Bertalmio, M., Sapiro, G., Caselles, V., and Ballester, C. (2000). Image inpainting. In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH ’00, pages 417–424, New York, NY, USA. ACM Press/Addison-Wesley Publishing Co.
  • Bosking et al. (1997) Bosking, W., Zhang, Y., Schofield, B., and Fitzpatrick, D. (1997). Orientation selectivity and the arrangement of horizontal connections in tree shrew striate cortex. J Neurosci, 17:2112–2127.
  • Brendel and Bethge (2019) Brendel, W. and Bethge, M. (2019). Approximating CNNs with bag-of-local-features models works surprisingly well on ImageNet. In International Conference on Learning Representations.
  • Bressloff and Cowan (2003) Bressloff, P. C. and Cowan, J. D. (2003). The functional geometry of local and long-range connections in a model of V1. J Physiol Paris, 97(2-3):221–236.
  • Citti and Sarti (2006) Citti, G. and Sarti, A. (2006). A cortical based model of perceptual completion in the roto-translation space. Journal of Mathematical Imaging and Vision, 24:307–326.
  • Clanuwat et al. (2018) Clanuwat, T., Bober-Irizar, M., Kitamoto, A., Lamb, A., Yamamoto, K., and Ha, D. (2018). Deep learning for classical Japanese literature. In Workshop on Machine Learning for Creativity and Design. NIPS.
  • Coifman and Lafon (2006) Coifman, R. R. and Lafon, S. (2006). Diffusion maps. Appl Comput Harmon Anal, 21(1):5–30.
  • Field et al. (1993) Field, D. J., Hayes, A., and Hess, R. F. (1993). Contour integration by the human visual system: evidence for a local association field. Vision Res, 33:173–193.
  • Fukushima (1980) Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36:193–202.
  • Gilbert et al. (1996) Gilbert, C. D., Das, A., Ito, M., Kapadia, M., and Westheimer, G. (1996). Spatial integration and cortical dynamics. In Proceedings of the National Academy of Sciences USA, volume 93, pages 615–622.
  • Glorot and Bengio (2010) Glorot, X. and Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In AISTATS, volume 9 of JMLR Proceedings, pages 249–256. JMLR.org.
  • Goodfellow et al. (2015) Goodfellow, I. J., Shlens, J., and Szegedy, C. (2015). Explaining and harnessing adversarial examples. In Proceedings of the ICLR.
  • Grossberg and Mingolla (1985) Grossberg, S. and Mingolla, E. (1985). Neural dynamics of perceptual grouping: Textures, boundaries, and emergent segmentations. Perception & Psychophysics, 38:141–171.
  • Hausdorff (1918) Hausdorff, F. (1918). Dimension und äusseres Mass. Mathematische Annalen, 79:157–179.
  • Hoffman (1989) Hoffman, W. (1989). The visual cortex is a contact bundle. Appl Math Comput, 32:137–167.
  • Hubel and Wiesel (1962) Hubel, D. H. and Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional architecture in the cat visual cortex. J Physiol (London), 160:106–154.
  • Kingma and Ba (2015) Kingma, D. P. and Ba, J. L. (2015). Adam: a method for stochastic optimization. In International Conference on Learning Representations, pages 1–13.
  • Koenderink and van Doorn (1987) Koenderink, J. J. and van Doorn, A. J. (1987). Representation of local geometry in the visual system. Biol. Cybern., 55(6):367–375.
  • Krizhevsky (2009) Krizhevsky, A. (2009). Learning multiple layers of features from tiny images. Technical report.
  • Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS’12, pages 1097–1105, USA. Curran Associates Inc.
  • Kruger (1998) Kruger, N. (1998). Collinearity and parallelism are statistically significant second order relations of complex cell responses. Neural Processing Letters, 8:117–129.
  • Layton et al. (2014) Layton, O. W., Mingolla, E., and Yazdanbakhsh, A. (2014). Neural dynamics of feedforward and feedback processing in figure-ground segregation. Front Psychol, 5(972).
  • LeCun et al. (1989) LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. (1989). Backpropagation applied to handwritten zip code recognition. Neural Comput, 1(4):541–551.
  • LeCun et al. (1998) LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. In Proceedings of the IEEE, volume 86, pages 2278–2324.
  • Liang and Hu (2015) Liang, M. and Hu, X. (2015). Recurrent convolutional neural network for object recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Montobbio et al. (2019a) Montobbio, N., Citti, G., and Sarti, A. (2019a). From receptive profiles to a metric model of V1. J Comput Neurosci, 46(3):257–277.
  • Montobbio et al. (2019b) Montobbio, N., Sarti, A., and Citti, G. (2019b). A metric model for the functional architecture of the visual cortex. under revision.
  • Nair and Hinton (2010) Nair, V. and Hinton, G. E. (2010). Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML’10, pages 807–814, USA. Omnipress.
  • Neumann and Mingolla (2001) Neumann, H. and Mingolla, E. (2001). Computational neural models of spatial integration in perceptual grouping. In Shipley, T. F. and Kellman, P. J., editors, From Fragments to Objects: Segmentation and Grouping in Vision, volume 130 of Advances in Psychology, pages 353–400.
  • Nguyen et al. (2015) Nguyen, A. M., Yosinski, J., and Clune, J. (2015). Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 427–436.
  • Paszke et al. (2017) Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. (2017). Automatic differentiation in pytorch. In NIPS-W.
  • Petitot and Tondut (1999) Petitot, J. and Tondut, Y. (1999). Vers une neuro-géométrie. Fibrations corticales, structures de contact et contours subjectifs modaux. In Mathématiques, Informatique et Sciences Humaines, volume 145, pages 5–101. CAMS, EHESS.
  • Rawat and Wang (2017) Rawat, W. and Wang, Z. (2017). Deep convolutional neural networks for image classification: A comprehensive review. Neural Computation, 29:2352–2449.
  • Sanguinetti et al. (2010) Sanguinetti, G., Citti, G., and Sarti, A. (2010). A model of natural image edge co-occurrence in the rototranslation group. J Vis, 10.
  • Sarti and Citti (2015) Sarti, A. and Citti, G. (2015). The constitution of visual perceptual units in the functional architecture of V1. Journal of Computational Neuroscience, 38:285–300.
  • Sarti et al. (2008) Sarti, A., Citti, G., and Petitot, J. (2008). The symplectic structure of the primary visual cortex. Biol. Cybern., 98(1):33–48.
  • Semeniuta et al. (2016) Semeniuta, S., Severyn, A., and Barth, E. (2016). Recurrent dropout without memory loss. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1757–1766, Osaka, Japan. The COLING 2016 Organizing Committee.
  • Sigman et al. (2001) Sigman, M., Cecchi, G., Gilbert, C. D., and Magnasco, M. O. (2001). On a common circle: Natural scenes and Gestalt rules. In Proceedings of the National Academy of Sciences, volume 98, pages 1935–1940.
  • Spoerer et al. (2017) Spoerer, C., McClure, P., and Kriegeskorte, N. (2017). Recurrent convolutional neural networks: a better model of biological object recognition. Frontiers in Psychology, 8:1551.
  • Szegedy et al. (2014) Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I. J., and Fergus, R. (2014). Intriguing properties of neural networks. CoRR, abs/1312.6199.
  • Tang et al. (2018) Tang, H., Schrimpf, M., Lotter, W., Moerman, C., Paredes, A., Caro, J. O., Hardesty, W., Cox, D., and Kreiman, G. (2018). Recurrent computations for visual pattern completion. PNAS, 115:8835–8840.
  • Wilson and Cowan (1972) Wilson, H. R. and Cowan, J. D. (1972). Excitatory and inhibitory interactions in localized populations of model neurons. Biophys J, 12(1):1–24.
  • Xiao et al. (2017) Xiao, H., Rasul, K., and Vollgraf, R. (2017). Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv:cs.LG/1708.07747.
  • Zeiler and Fergus (2014) Zeiler, M. and Fergus, R. (2014). Visualizing and understanding convolutional networks. In Fleet, D., Pajdla, T., Schiele, B., and Tuytelaars, T., editors, Lecture Notes in Computer Science, Vol 8689, ECCV 2014, pages 818–833. Springer, Cham.