Generalisation in Neural Networks Does not Require Feature Overlap

That shared features between train and test data are required for generalisation in artificial neural networks has been a common assumption of both proponents and critics of these models. Here, we show that convolutional architectures avoid this limitation by applying them to two well known challenges, based on learning the identity function and learning rules governing sequences of words. In each case, successful performance on the test set requires generalising to features that were not present in the training data, which is typically not feasible for standard connectionist models. However, our experiments demonstrate that neural networks can succeed on such problems when they incorporate the weight sharing employed by convolutional architectures. In the image processing domain, such architectures are intended to reflect the symmetry under spatial translations of the natural world that such images depict. We discuss the role of symmetry in the two tasks and its connection to generalisation.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

06/10/2019

SymNet: Symmetrical Filters in Convolutional Neural Networks

Symmetry is present in nature and science. In image processing, kernels ...
10/22/2020

Learning Invariances in Neural Networks

Invariances to translations have imbued convolutional neural networks wi...
08/25/2017

Identifying Mirror Symmetry Density with Delay in Spiking Neural Networks

The ability to rapidly identify symmetry and anti-symmetry is an essenti...
12/08/2017

Artificial Neural Networks that Learn to Satisfy Logic Constraints

Logic-based problems such as planning, theorem proving, or puzzles, typi...
09/23/2019

Learning in the Machine: To Share or Not to Share?

Weight-sharing is one of the pillars behind Convolutional Neural Network...
04/14/2022

Relaxing Equivariance Constraints with Non-stationary Continuous Filters

Equivariances provide useful inductive biases in neural network modeling...
03/03/2022

Symmetry Structured Convolutional Neural Networks

We consider Convolutional Neural Networks (CNNs) with 2D structured feat...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

In this article, we examine two problems (marcus1998; marcusetal1999) designed to highlight the shortcomings of connectionist models that rely on generalisation based on featural similarity or overlap and describe solutions to these long-standing111In the eyes of its creator, the identity learning task has remained an unresolved problem for neural network models for over two decades (marcus1998; marcus2020). challenges. The identity and rule learning tasks themselves are described in more detail in Sections 2 and 3, but the basic idea in each case is fairly simple. Both tasks require generalising to test data that contains elements - digits or words - that were not seen during training.

Prior research on such tasks has often relied on modifications to the representations of these elements, via careful input coding or appropriate pre-training, that introduce shared features as a means to achieve generalisation to the task test sets (e.g., negishi; seidenberg; sirois). In fact, it has commonly been asserted that generalisation in neural networks requires such featural overlap (alhamazuidema2019; marcus2001; mcclellandplaut1999)

and this characteristic is frequently invoked as a benefit of distributed representations

(hinton86; farrelllewandowsky2000). Here, we show that these models are capable of generalising to test data that lacks any features in common with their training data by exploiting symmetries relevant to the task. Architecturally, this is achieved using convolutional (neocognitron; Lecun98gradient-basedlearning)

layers, a common component in deep learning approaches to image categorisation, and our solutions repurpose their spatial symmetries to serve the needs of each problem.

While the computational processes involved in convolution are probably not biologically or cognitively plausible in their particular detail, we nonetheless wish to argue that symmetry, considered more abstractly, is a useful concept in thinking about how generalisation beyond similarity is possible. By symmetry, we mean a transformation, or set of transformations, that preserve some relevant characteristics of a task. For example, in a task such as object recognition, a particular dog or face is equivalent wherever it occurs within an image, and this can be thought of as a symmetry under spatial translations. Thus, ideally, a neural network model of object recognition should be able to generalise from a dog seen during training in the bottom right of an image to the same dog at test time in the top left, even though there is no overlap in the set of pixels involved. This implies that spatial translation, e.g. from bottom right to top left, should be a symmetry of the function computed by the network, and convolutional networks implement this by learning feature weights that are shared across the image and then pooling these shared features into larger and larger receptive fields (described in more detail in Appendix

A).

Our experiments show that symmetry has a wider application, beyond just spatial translation within images. In particular, we apply this approach to learning the identity function on binary digits (marcus1998) and to rule learning on sequences of words (marcusetal1999). Successful performance on these tasks requires generalisation to unseen features, i.e. novel digits or words. In each case, we analyse the structure of the domain to identify a symmetry that is relevant to the task and impose this on the network in terms of a convolutional architecture. Concretely, this results in weight sharing between the seen and unseen features, i.e. across digit positions or syllables, allowing effective generalisation to the test instances.

marcus2001 argues that such generalisation to novel features is not possible for connectionist models due to what he calls training independence. Roughly, he assumes that the weights within these networks are learned independently, with the consequence that no updates are made to the connections attached to any unit which is inactive during training. Although they are critical of the conclusions drawn by marcusetal1999 from the rule learning experiments, mcclellandplaut1999 nonetheless affirm that generalisation in neural networks depends on overlapping patterns of activity somewhere in the network. Furthermore, in a review of the literature surrounding those experiments, alhamazuidema2019

also assume that generalisation will depend on the overlap between vectors in training and test.

Thus, the requirement of featural overlap to achieve generalisation has been a basic assumption made by both proponents and critics of connectionist models of cognition. However, convolutional neural nets are designed precisely to overcome this limitation in relation to generalisation to unseen positions, and this ability has been a key factor in their widespread adoption in deep image recognition models. The mechanism by which they achieve this is simply the sharing of the same feature detectors across all positions in the image, which avoids the independence of weights criticised by marcus2001. More abstractly, this weight sharing can be understood as an implementation of the requirement for the network’s behaviour to be symmetric under spatial translations. Thus, in addition to generalisation being possible between inputs related by featural similarity, we can also obtain generalisation between inputs related by a symmetry.

In the following sections, we explore how this idea can be applied to the identity and rule learning challenges, demonstrating convolutional architectures that achieve effective generalisation to the unseen digits and syllables. In each experiment, we compare a network with an appropriate symmetry constraint to the same network without the constraint.

2 Identity Learning

Input Output
Training Set 00010 00010
00100 00100
00110 00110
01000 01000
01010 01010
Test Set 00011 00011
00101 00101
00111 00111
01001 01001
01011 01011
Table 1:

Examples from the training and test sets used in the identity learning task. Each instance consists of the same 5-digit binary numeral in both input and output, representing the identity function. However, the training set consists of only the even numbers, and the test set consists of the odd numbers. In other words, successful generalisation requires handling the digit

1 in a an unseen position.

marcus1998 introduced his identity learning task as an example of a situation in which humans are able to generalise outside their training space. The task simply requires learning the identity function, from inputs to outputs, for binary digits. In other words, the desired outcome is for the learner to copy whatever arrives in the input to the output.

However, the training data only contains even numbers, in which the final digit is always zero, but generalising effectively to the test set requires handling odd numbers, in which the final digit is one. Examples of the training and test data are shown in Table 1. marcus1998 reports that humans typically learn the identity function successfully, copying the odd numbers without problem, even though they contain a feature that was not present during training.

In contrast, feedforward networks ordinarily fail to generalise in this way, precisely because the odd numbers in the test set contain an unseen feature. This points to an important difference in the biases that guide learning in people and artificial neural networks. Whereas humans find it natural to apply the same copying operation to all digit positions, standard feedforward networks have no such preference.

The lack of this bias in these models is not surprising, as they are designed to learn general input-output associations, and in many of the tasks they are applied to there is no identity function. For example, the inputs in the MNIST dataset

(mnist) are pixel arrays and the outputs are ten digit categories. In this case, it is not possible for the input and output units to carry identical patterns of activity as the numbers of units in each is not even the same. There is therefore no unique, well-defined identity function between the pixel space and the digit category space. Even in the case where input and output have the same dimensionality, they remain distinct spaces and there need not be a single, unambiguous identity function. For example, if our task is translation and the inputs are constructed from a vocabulary of 100k French words, while the outputs use a vocabulary of 100k English words, then there is again no well defined identity function between these spaces, despite their being the same size.

For people presented with the challenge described by marcusetal1999, a relation of identity between inputs and outputs will be immediately apparent. Moreover, they will also immediately grasp that both inputs and outputs are made up of strings of digits and that there is a correspondence between digit positions in the input and digit positions in the output. It is this intrinsic structure in the task that makes it tractable for humans. In contrast, the standard fully connected architecture is indifferent to this structure, and for such a network the task consists of identifying one unexceptional one-to-one mapping from among many.

While humans find Marcus’s task easy because the identity function is obvious to them and stands out in some way, we can also imagine other tasks which lack the relevant correspondences between symbols and positions in input and output. In particular, we could efface this structure by replacing the digits 0 and 1 with arbitrary symbols at each position and permuting the order of those symbols in the output. So, for example, an input of !h*7@ might map onto the output 8j?:y. In this case, learning such an arbitrary connection between input and output and then generalising to an unseen symbol is going to be much more difficult, and more similar to the problem that a standard connectionist network is trying to solve.

Just as we can make the inherent structure in the problem more obscure for the biological neural systems of human beings, we can also try to modify the architecture of an artificial neural network to exploit that structure more effectively. In particular, if we want the network to be able to apply the same operation to the final digit that it applied to all the others, then its design ought to reflect the fact that the same digit can occur in multiple positions and that there is a correspondence between positions in the input and positions in the output.

1

0

1

1

0

1

0

1

1

0
(a) The unconstrained network has an individual weight for each pair of input and output nodes, which must be learned separately.

1

0

1

1

0

1

0

1

1

0
(b) A single filter in the convolutional network. The same weights are replicated across the other positions in the network.
Figure 1: The connection weights of the two architectures applied to the identity learning task.

In fact, this is exactly the problem that convolutional layers are designed to solve. Effective image processing requires that the same objects or features can be recognised wherever they occur in an image without losing the knowledge of where they occur. A convolutional layer addresses this by applying the same set of filters across the whole of an image, producing an equivalent bank of channels at each position. This introduces both the notion of the same structure being able to occur in multiple positions, and also a correspondence between input positions and output positions.

In mechanistic terms, this is achieved by sharing the same connection weights from inputs to outputs at each position within the network. However, at a more abstract level we can see this as an implementation of translational symmetry in the function computed by the network. In particular, if that function, , maps an input image, , to a feature map, , then we can consider the result when the function is applied to a translation, , of the original image. Convolutional layers are equivariant under these translations: . That is, the output from a translated image is itself a translated version of the feature map for the original image. Or, more concretely, if a set of feature detectors are active in response to a structure in the top right of an image, then translating that structure diagonally across the image into the opposite corner will result in an equivalent set of feature detectors firing in the bottom left.

This translational symmetry enables the convolutional layer to automatically generalise from features seen during training in one position to novel, unseen positions at test time. Moreover, this happens without any featural overlap between the training and test inputs, as the features are present at different positions. What connects the seen and unseen inputs is not their pixel similarity but instead the fact that they can be mapped onto each other by a symmetry transformation.

In terms of the problem proposed by marcus1998, the inputs and outputs consist of sequences of digits, in which the same digit, e.g. 0 or 1, can occur in multiple positions, and those positions have an intrinsic order in both the inputs and outputs. This implies that we can meaningfully talk about moving a digit from one position to another, and that there is a correspondence between such motions in the input and the output. Convolution imposes this structure on the architecture and makes the behaviour of the net symmetric under these translations.

2.1 Experiment 1

We consider two architectures: a standard, fully connected network and a convolutional network, both with 5 inputs and outputs and both consisting of a single layer without hidden units. Thus, the only difference between these models is the constraint of translational symmetry imposed on the latter. The set of functions computable by the convolutional network is therefore a strict subset of those computable by the unconstrained network. This experiment investigates how this constraint affects learning and generalisation in the identity learning task.

The data consist of all five-digit binary numbers, ranging from zero to thirty one, and all inputs and outputs are equal. The training data consists exclusively of even numbers, in which the least significant digit is zero, and the test data exclusively of odd numbers, in which the least significant digit is one.

The experiments are implemented in Tensorflow

(abadi)

, using batch gradient descent over the full training set of 16 instances. The squared error loss function is minimised for 1,000 epochs, and then performance is evaluated over the entire training and test sets. An output is treated as correct if all its digits are predicted correctly, based on a cutoff of 0.5 to discretise units activities into binary values. We report average accuracy over 100 training runs with random initialisations.


Architecture
Training Accuracy Test Accuracy
Unconstrained 100% 12%
Convolutional 100% 100%
Table 2: Accuracies on the training and test sets of the identity learning task, with and without the constraint of translational symmetry.

The results are reported in Table 2. The pattern of performance seen for the standard, unconstrained network in the first row indicate that, while learning succeeded over the training data, generalisation to the test set was not as successful. This was precisely because no updates were made to the weights attached to the inactive digit. In contrast, the accuracies in the second row of the table indicate perfect generalisation for the convolutional architecture. Here, weight sharing allowed learning to occur across all positions, even the inactive one.

2.2 Discussion

Our experiments showed that a convolutional network was able to generalise successfully to unseen digits without featural overlap between training and test instances. Thus, one potential solution to the identity learning challenge (marcus1998) turns out to be a standard architectural design that predates it by almost two decades (neocognitron). In concrete terms, weight sharing across digit positions avoids the problem of training independence identified by marcus1998, allowing what is learned at seen positions to be transferred to unseen positions. More abstractly, symmetry provides a notion of sameness that goes beyond featural overlap, allowing us to apply the same operation to features that were unseen during training. Here, we identified translational symmetry as reflecting the intrinsic structure of the problem, in terms of inputs and outputs being sequences of binary digits.

In the task, this symmetry relates to the fact that the same digit can occur in multiple positions, and the equivariance of the convolutional layer provides the required correspondence between input and output spaces. More generally, a symmetry under movement of elements between positions may be relevant whenever we are dealing with structures in which the same components can be rearranged in multiple configuration, such as list or stack memory structures.

However, unless these ideas can be shown to be applicable in a broader range of tasks, then it might be reasonable to treat the approach described here as merely a convenient trick or hack. In particular, it would be desirable to show that considerations of symmetry are relevant beyond simple spatial translations and without the requirement for the input and output spaces to share a common structure.

3 Rule Learning

Input Output
Training Set ga ti ga ABA
li gi li ABA
ga ti ti ABB
li na na ABB
Test Set wo fe wo ABA
de ko de ABA
wo fe fe ABB
de ko ko ABB
Table 3: Examples from the training set and the full test set used in the rule learning task. Each input is a sequence of three words which are categorised at the output into the structures ABA and ABB. The words used to construct the sequences in the test set are entirely disjoint from those in the training set. In other words, successful generalisation requires handling the unseen words appropriately.

In the previous task, effective generalisation required applying the same rule to a known digit when it occurred in a new position. The connection to image processing and the need to recognise known features in new positions was reasonably straightforward. In this section we will consider a task based on word sequences rather than digits, and which requires generalisation to new words rather than new positions. While translation between positions is no longer the relevant symmetry, we will nonetheless show that convolution can also be used to solve this problem.

This rule learning task involves learning to differentiate between structures of the form ABA and ABB, and is based on the the experiments with infants described by marcusetal1999. Examples of the task data can be seen in Table 3. Each input to the network is a sequence of three words, such as ga ti ga or ga ti ti, and the outputs are binary labels indicating the ABA and ABB categories. Crucially, however, the sequences in the test data are constructed from words that were not seen during training, e.g. wo fe wo or wo fe fe. In other words, the network must learn in a sufficiently abstract manner to be able to apply the same rule to both seen and unseen words.

In their experiments, marcusetal1999 familiarised 7-month-old infants with sequences from one of the categories and then subsequently observed their responses to sequences containing novel words from both categories. Overall, the infants showed more interest in stimuli from the unfamiliar category, indicating they were able to generalise their experience to words they did not hear during the familiarisation phase. marcusetal1999

were unable to obtain this form of generalisation from a neural network trained on the same data, because the backpropagation learning rule did not update the weights associated with unseen words during training. Numerous studies have investigated a variety of aspects of these experiments, including how phonetic forms are represented

(negishi), the nature of the familiarisation process (sirois) and the knowledge already acquired by the infants before the experiments (seidenberg).

Setting aside other psychological aspects, we focus here on the core computational question of whether a neural network can be endowed with the ability to learn about structure in a way that allows it to generalise to instantiations of that structure constructed from novel elements. To this end, we frame the problem as a supervised learning task in which each of the three words in the input sequence are represented by a one-hot vector, thus ensuring there is no featural overlap between train and test, and the output consists of a pair of units representing the categories ABA and ABB. The data used in our experiments are taken from the

marcusetal1999 paper, and we evaluate the models in terms of their ability to generalise to test inputs that are constructed from words not seen during training. As a result of the localist coding of words we have employed the training and test inputs share no active units in common, and so successful generalisation requires that the network is able to generalise outside its training space.

However, the architecture used in the previous experiment is not relevant to the current task. In that task, inputs and outputs lay in the same space, which permitted us to impose a constraint based on symmetry under translations acting on both spaces. Here, the output space is entirely distinct from the input space and, in particular, there is no meaningful identity function between them. Instead, the pair of output units are intended to categorise the structure of the input word sequence. Specifically, we want to learn to categorise these sequences based purely on their structure, independently of the particular words instantiating that structure.

In other words, the outputs of the network should be invariant to word substitutions that leave the structure of the sequence unchanged. For example, replacing ti with li in ga ti ti to produce ga li li does not change the structure of the sequence and so both sequences should be categorised in the same way. Thus, such substitutions should be symmetries of the network in terms of having no impact on the outputs. In this way, we can ensure that the categorisation of sequences is driven by their structure and not by the particular words seen during training.

This is comparable to the invariance under spatial translation that visual object recognition systems seek to achieve. In that case, categorisation should be driven purely by the shape of an object and not by the position it occurred in during training. Convolutional architectures address this by combining convolutional layers, which are equivariant under translations, with pooling layers, which are invariant under translations.222In fact, as described in Appendix A

, the architectures applied in practice to image recognition tasks often fall short of full translation in variance

(blything; azulayweiss2019; biscionebowers2020). Specifically, the output of a pooling function, remains unchanged when the input, , is translated within its receptive field: . Commonly, a fixed pooling function, such as maximum or average, is used to combine the signals across a range of positions, creating larger receptive fields, which then feed into the next convolutional layer, until a final global pooling layer is applied across the whole image. That is, in simple terms, a convolutional layer produces a feature for every position in the input and a global pooling layer aggregates these features to give a single position independent output.

Whereas translations are the relevant symmetry for object recognition, here we wish to make our network invariant to the substitution of one word with another. We will, nonetheless, be able to employ the same convolution and pooling layers in constructing a solution. In particular, by treating word types as if they were positions in an image, word substitutions become analogous to movement between these positions. A convolutional layer can then extract the same structural features for each word type, and a pooling layer can reduce this to a single pair of output categories which are invariant to the word types involved.

3.1 Experiment 2

wo

fe

wo

ga

ti

wo

na

gi

la

li

fe

ko

ni

ta

de

Filters

ABA

ABB

Softmax

Max-Pooling

Hidden Units

Convolution

Input Array
Figure 2: The architecture applied to the rule learning task of marcusetal1999, consisting of convolution followed by max-pooling and softmax. In the figure, the grid at the bottom represents the 12 words within the 3 time-steps, with the first word at the top and the last at the bottom. This input is convolved with the two filters, resulting in the

output above it, and max-pooling reduces this to a single pair of values, which are the logits of the output softmax. In the example,

wo fe wo is encoded at the input as ones in the first and third channels for the syllable wo and in the second channel for the syllable fe. The same two filters are applied to the three channels of every word, with the filter matching the pattern of activations in the wo position, giving a high activation value in the bottom channel at the same position in the hidden units. Max-pooling picks this value out and uses it to predict that this is an ABA sequence.

In this experiment we again compare two architectures, with and without the symmetry constraint. The former is a two layer network with 24 hidden units, consisting of a convolutional layer, followed by max-pooling and softmax activations in the output layer. The latter is the same network with a fully connected layer replacing the initial convolutional layer. Again, the set of functions computable by the convolutional architecture is a proper subset of those computable by the unconstrained network.

Each data point consist of a three word sequence having the form ABB or ABA. These inputs can be thought of in terms of a array, representing the twelve words in the vocabulary and the three time steps in the sequence, and the outputs are a binary pair of values, representing the two structures. In other words, for each time step, there is a one hot vector representing the word that occurs there. The training data contains 32 sequences, evenly balanced across structures, while the test data contains 4 sequences, consisting of 2 of each form. As discussed, the vocabulary of the test sequences is entirely disjoint from the training vocabulary, and as a result, training and test data sets lack any common active input units.

The convolutional architecture, shown in Figure 2, treats the three time steps as channels associated with each word, to which it applies a filter of width one, sharing the same weights between all words. These filters project the three input channels down to two hidden layer channels for each word, corresponding to the two sequence structures. Max-pooling then passes the maximum value in each channel across the width of the hidden layer to the output units.

The unconstrained network takes the same 12 word 3 time step matrix as its input, which it projects down onto 24 hidden units using a standard fully connected layer. As before, max-pooling is applied to these values to produce the two output values. In this way, only the word substitution symmetry constraint on the first layer changes between the two architectures.

The experiments are implemented in Tensorflow (abadi), using batch gradient descent over the full training set of 32 sequences. The cross-entropy loss function is minimised for 1,000 epochs, and then performance is evaluated over the entire training and test sets. Accuracy is measured by discretising the outputs using a cutoff of 0.5, and an average is taken over 100 random initialisations. A training run is restarted if the network fails to achieve 100% accuracy on the training set within 1,000 epochs.


Architecture
Training Accuracy Test Accuracy
Unconstrained 100% 45%
Convolutional 100% 100%
Table 4: Accuracies on the training and test sets of the rule learning task, with and without the constraint of word substitution symmetry.

The results in Table 4

demonstrate that, whereas the unconstrained network performs at the level of random chance on the novel words in the test set, the convolutional architecture handles these sequences accurately. The first column shows that both models successfully learn to identify the structure of the sequences in the training set. However, performance of the two models diverges on the test set, as can be seen in the second column. The standard, unconstrained, multi-layer perceptron is unable to generalise its knowledge to the words in the test set because they involve independent sets of weights. In contrast, the convolutional architecture generalises effectively because the weights associated with the unseen words are shared with the seen words.

3.2 Discussion

Our convolutional network handled the test sequences effectively, despite their being composed of unseen word types. This, again, illustrates the ability of these architectures to generalise to inputs that lack featural overlap with the training data. In this case, the task required a more subtle analysis than for the copying of digits. Specifically, we reasoned that if the outputs were to represent the structure of the inputs, disentangled from the particular words used, then they should be invariant to word substitutions that preserve the structure.

Imposing that symmetry constraint on the network enabled it to recognise the same structure when it was composed of new elements. Thus, our experiments have shown the relevance of symmetry in obtaining generalisation without featural overlap in two distinct situations: handling the same digit wherever it occurs and handling the same structure whichever words it contains. In addition to demonstrating the capacity of artificial neural networks to generalise in the absence of featural similarities, these results also raise the question of whether an analysis in terms of symmetry under transformations is also relevant to human cognitive abilities.

4 General Discussion

The role of similarity in the process of applying the lessons of past experience to new situations has been a longstanding topic of discussion. For example, hume asserts that In reality, all arguments from experience are founded on the similarity which we discover among natural objects, and by which we are induced to expect effects similar to those which we have found to follow from such objects. In psychology, this connection between similarity and inductive inferences has been explored in relation to verbal reasoning by rips, osherson, heit and others. Taking a broader view, shepard found that across a range of animals and with a variety of stimuli, generalisation could be modelled in terms of similarity within an abstract space.

In practical terms, this concept of generalisation as being grounded in relations of similarity underlies a variety of machine learning techniques such as k-nearest-neighbour classification

(fixhodges1951) and kernel machines (aizerman). As previously discussed, the view that the abilities of artificial neural networks are also limited to generalisation between similar items, in the sense of sharing overlapping features, has been a common assumption among both supporters and critics of connectionist modelling (mcclellandplaut1999; marcus2001; alhamazuidema2019).

However, our experiments showed that convolutional architectures are not limited in this way, and are capable of generalising to test data that shares no features with the training data. Moreover, the underlying principles of this approach arose from models of visual cortex (neocognitron), and have become standard components in image processing networks (alexnet; vgg; effnet). In concrete terms, weight sharing allows what is learned about features present in the training data to be transferred to unseen features in the test data.

More abstractly, what connects the unseen test items to those seen in the training data is not simple featural similarity but instead a transformation that is a symmetry of the system. In the case of vision, the relevant transformation is spatial translation, allowing generalisation across positions within images. This transferred fairly directly to the identity learning task, where the test data contained familiar digits in new positions. In the case of the rule learning task, however, the relevant transformation was substitution of one word with another. Making the network symmetric under these substitutions allowed it to generalise from seen to unseen words.

In fact, this idea that input stimuli can be connected not only by sharing common features, but also in terms of transformations that map representations onto each other arises in at least a couple of other domains. In analogical reasoning, for example, knowledge from a familiar base domain is leveraged in a novel target domain by identifying a mapping from one to the other (e.g. gentner). Thus, analogy overcomes the limits of simple featural similarity in a comparable manner to that being discussed here in regards to artificial neural networks. In addition, transformations are frequently invoked to connect distinct grammatical structures in natural languages. So, passivisation - e.g. from John loves Mary to Mary is loved by John - is a transformation that preserves the basic semantics of sentences. domingos have proposed an approach to semantic parsing based on treating such transformations as symmetries.

Thus, the ideas that sameness extends beyond simple featural overlap, and that it can be useful to identify transformations which connect distinct inputs are not novel within the cognitive sciences. Here, we have shown that this perspective can be effective in obtaining generalisation to novel features with artificial neural networks. In particular, we employed two types of symmetry, one relating to the translation of digits from one position to another, and another relating to the substitution of one word with another. More generally, we speculate that similar symmetries may be relevant in handling any structures in which multiple elements can be combined in multiple configurations, with the former allowing the same element to be treated consistently wherever it occurs, and the latter supporting recognition of the same structure whichever elements instantiate it.

In the identity learning task, translational symmetry ensured that the digits in all positions were handled consistently, resulting in the same copying operation being applied to the unseen digit position. Without this constraint, the behaviour of an item in a given position would be independent of its behaviour in any other, and therefore lack a notion of being the same item. A similar problem arises whenever an element can play multiple roles within a larger structure, for example John in John loves Mary and Alice loves John.

In the rule learning task, symmetry under word substitutions allowed the ABB and ABA structures to be recognised irrespective of which words they contained. In fact, such a symmetry will typically be required whenever a process needs to be driven by structure, in a way that is consistent whatever content instantiates that structure. For example, a valid logical inference - e.g. All men are mortal and Socrates is a man, therefore Socrates is mortal - remains valid under substitutions - e.g. All men are mortal and Aristotle is a man, therefore Aristotle is mortal - that preserve the form of the inference. Thus, the fact that validity is driven by formal structure implies that these substitutions are symmetries of the processes of logical derivation.

The fact that the ABA and ABB structures containing both seen and unseen words are handled effectively implies that the knowledge embodied in the connection weights has been, to some extent, abstracted away from the particular examples seen during training. marcusetal1999 interpret the behaviour of the infants in their experiments as similarly implicating some form of abstract knowledge. In particular, they suggest that the infants in their experiment had learned algebra-like rules that represent relationships between placeholders (variables), such as ‘the first item X is the same as the third item Y’. While this mechanism for representing the structure of the sequences, in terms of using variables, is clearly distinct from the convolutional approach employed here, there are, nonetheless, underlying similarities. When, for instance, the network processes the sequence wo fe wo, the wo channels contain the values , representing the fact that this particular word occurs at the beginning and end of the sequence. Moreover, from the point of view of filters that do not discriminate between words, these values simply signal that the same item occurs in both first and third position, which is what marcusetal1999 suggest the infants are sensitive to.

In fact, we can think of the algebra-like rules and the weight sharing as two ways to implement the same requirement of symmetry under word substitution. In the algebraic case, the behaviour of the rules is invariant when the words seen during training are replaced with the unseen test words because the use of variables allows the rule to be expressed in a single form that makes no reference to any specific word, and is instead valid for all values. In contrast, the convolutional architecture replicates a version of the rule for every word, and in this way achieves invariance across values.

Whatever the advantages and benefits of algebraic models of cognition, our concern here is with the abilities of artificial neural networks. In particular, our results showed that a convolutional architecture is capable of generalising to the unseen words and digits in the rule and identity learning tasks without the need for shared features in training and test inputs.

5 Conclusions

marcus2001 correctly identifies training independence as a factor limiting the ability of standard connectionist architectures to generalise outside their training space. That is, without additional constraints, the backpropagation algorithm adjusts the weights associated with each input feature independently. This flexibility makes them one of the most effective methods of machine learning when training data is plentiful and the test data is drawn from an identical distribution. However, in the identity (marcus1998) and rule (marcusetal1999) learning tasks, this results in these models failing to generalise to the unseen features.

Nonetheless, our experiments demonstrated that convolutional architectures can overcome this shortcoming, simply by sharing weights between seen and unseen features. This mechanism directly addresses the training independence critique by tying the weights associated with multiple features. More abstractly, the particular form of weight sharing was driven by considerations of relevant symmetries identified in each task. This approach allowed us to extend a notion of sameness to inputs that shared no common features, and as a consequence the trained networks were able to apply the same rule to an unseen digit position and recognise the same structure instantiated with novel elements.

Acknowledgements

This research was supported by the European Research Council Grant Generalization in Mind and Machine, ID number 741134.

Appendix A Convolutional Neural Networks

(a) A convolutional layer. The same learned function is applied to each small patch of an image, by sharing the same weights at all positions. Two such patches and their shared weights are highlighted.
(b) A global pooling layer. A fixed translationally invariant function is applied to all the units in each channel of the input layer to produce a single value in each channel of the output.
Figure 3: The structure of convolutional and global pooling layers

Many of the mechanisms found in modern Convolutional Neural Networks have their origins in the Neocognitron

(neocognitron). This model consisted of multiple layers of units inspired by the simple and complex cells described by hubelwiesel. Local features were learned by S-cells which, crucially, share their parameters across the input array. This creates a field of replicated feature detectors, which are then aggregated locally by C-cells, creating wider detectors with limited shift invariance. In this way, the receptive fields of these units grow as processing proceeds through the layers of the network, building more complex features and expanding the tolerance to spatial translation.

The sharing of feature weights across the visual field within this model is critical in its achieving invariance under spatial translation. Without it, different features would be learned in different positions, preventing a consistent response to the same object wherever it occurs. It is unclear whether some form of this mechanism can be made biologically plausible, but there is nonetheless evidence that human object recognition is tolerant to substantial displacements across the retina (blything). Moreover, marcus2018 includes translation invariance in a list of 10 computational primitives that he thinks would need to be part of the innate machinery supporting any human-like cognition.

Beyond the Neocognitron, weight sharing has been a common tool for achieving the desired form of generalisation in other visual applications (e.g., rumelhartpdp). rumelhartnature also explained generalisation within a relational learning task in terms of shared weights. Similarly, jain imposed weight sharing on their network in order to obtain appropriate generalisation in a model of natural language parsing.

The same ideas also found application in practical tasks, such as handwritten digit recognition (lenet)

, where robust generalisation to small deformations was vital. Despite these strengths and the reduced number of free parameters, effective training of such models still required large amounts of data, which motivated the development of datasets such as ImageNet

(imagenet). Given these extensive collections of natural images, it has become possible to train neural models of object recognition with impressive test set performance (e.g., alexnet; vgg; effnet).

These models are based, mainly, around the ideas introduced in the Neocognitron (neocognitron), consisting of layers of identical feature detectors combined with layers of units that pool these features into larger structures. The general term convolutional neural network is used to refer to the various variations on this design, in reference to the mathematical name for an operation equivalent to applying the same filter to all positions in an image. The convolutional layers themselves contain the trainable feature detectors, and these are typically followed by pooling layers which apply a fixed aggregation to those outputs, such as taking an average or maximum. Particular architectures in this family may combine these convolution and pooling layers with other components such as fully connected layers or skip-connections.

A convolutional layer is depicted in more detail in Figure 2(a). An input image is represented by the grid at the bottom, with spatial position depicted horizontally and the red, green and blue colour channels vertically. For simplicity, only one spatial dimension is shown, but the extension to two dimensions is straightforward. Each unit in the output layer is driven by a small patch in the input, and two such patches are highlighted. In this case, there are two output channels, so, given the three input channels and spatial width of three, the number of weights used in this transformation is . In a real image processing model, there would, of course, be an additional spatial dimension and typically a larger number of output channels is used, but the fundamental principle remains that the same weights are applied to all positions in the input. This constraint ensures that, under spatial translations , the function, , computed by the layer is equivariant: .

Figure 2(b) depicts a global pooling layer, applied to a two channel input. In this case, a fixed translationally invariant function, such as taking the maximum or average, is applied to the features across all the input positions within each channel. This preserves the number of input channels, but reduces the information across the input positions down into a single spatially invariant stream. That is, the function is insensitive to input translations: . Typically, a convolutional architecture will also contain local pooling layers, which aggregate across receptive fields of limited extent, comparable to the complex cells of the mammalian visual system. These are then interleaved with convolutional layers to produce larger and more complex feature detectors as processing proceeds through the network. Finally, a global pooling layer can be used to aggregate across the entire image in the final layers of the network. Alternatively, a fully-connected layer is sometimes used instead, but this tends to reduce the translational invariance of the network (blything).

In practice, a number of other factors, related to training data and network design, may also obstruct the theoretical ideal of perfect translation invariance (azulayweiss2019). While humans can typically recognise an object anywhere in their visual field after seeing it in a single position, the networks applied to image recognition tasks frequently lack such flexible generalisation abilities (biscionebowers2020). Nonetheless, a number of computational neuroscientists see appropriately trained convolutional architectures as having a close resemblance to the primate visual system (e.g. yaminsetal2014; kriegeskorte2015; kubiliusetal2016).

References