The notion of interpolation is built into the assumptions underlying most approaches to generalization in machine learning, in which it is typically assumed that training and test samples are drawn from the same distribution. There is a widely shared view that human reasoning involves something more than this: the ability to extrapolate (Lake et al., 2017; Marcus, 2001). In particular, advocates of this view sometimes point to analogical reasoning as a clear example of this capacity, in which a reasoner extrapolates from knowledge in one domain to make inferences in a different, often less familiar, domain, based on common structure between the two (Gentner, 1983; Holyoak, 2012).
What are the prospects for capturing the capacity for extrapolation in neural network algorithms? Recent results have begun to address this question. In general, these results point to the conclusion that generalization in neural networks, even in relatively sophisticated domains such as relational or mathematical reasoning, is primarily limited to interpolation between data points within the convex hull (i.e. boundaries defined by the extremes) of the training set (Lake and Baroni, 2017; Santoro et al., 2018; Hill et al., 2019; Saxton et al., 2019).
It is worth considering how extrapolation is possible at all. Consider, for example, theoretical physics, a spectacularly successful paradigm of extrapolation. Physical laws discovered on the basis of terrestrial observations make precise quantitative predictions about phenomena in distant galaxies. This is possible because physical laws are characterized by certain symmetries – that is, they are invariant with respect to a group of transformations such as translation and rotation in space, translation in time, etc. (Feynman, 1966). As Feynman puts it, “nature uses only the longest threads to weave her patterns, so each small piece of her fabric reveals the organization of the entire tapestry” (Feynman, 1967).
In this work, we exploit the fact that many domains can be characterized by such symmetries, and test the idea that extrapolation can be enabled by encouraging the learning of representations that reflect these symmetries. To do so, we introduce context normalization, a simple inductive bias in which normalization is applied over a task-relevant temporal window. This technique preserves local relational information (e.g. the size of one object relative to another), while introducing both scale and translation invariance over the broader scope of the learned representational space. We hypothesized that the application of context normalization would improve the ability of neural networks to extrapolate. Critically, when trained end-to-end, we expect the presence of context normalization to impose constraints on both upstream (computed prior to a layer with normalization) and downstream (computed post-normalization) representations, promoting the acquisition of abstract representations that reflect task-relevant symmetries.
In an effort to aid the systematic evaluation of extrapolation in neural networks, we also introduce a novel benchmark, the Visual Analogy Extrapolation Challenge (VAEC). This dataset has two major advantages relative to other benchmarks designed to evaluate extrapolation (Santoro et al., 2018; Hill et al., 2019; Saxton et al., 2019). First, VAEC contains generalization regimes that assess both translation and scale invariance with respect to the underlying task space. Second, in each regime, VAEC includes a progressive series of evaluation modes, in which test data lie increasingly far away from the convex hull defined by the training data, allowing the graded evaluation of extrapolation as a function of distance from the training domain. We evaluate context normalization, in addition to a number of competitive alternative techniques, on the VAEC dataset, the visual analogy dataset from Hill et al. (2019), and a dynamic object prediction task. We find that context normalization yields a considerable improvement in the ability to extrapolate in each of these task domains.
2 Task Setup
2.1 VAEC Dataset
The VAEC dataset consists of four-term visual analogy problems, constructed from objects that vary in brightness, size, and location along the X and Y axes. Each object is rendered as an RGB image depicting a green square on a gray background, with each color channel scaled to a fixed minimum and maximum possible value. Each of the four dimensions of variation consists of a linear range tiled by discrete levels.
The dataset contains proportional analogy problems of the form A : B :: C : D, in which all four terms of a given problem vary along one stimulus dimension, and in which both the distance and direction along that dimension are the same for A and B as they are for C and D (Figure 1). Each analogy problem also contains a set of foil objects, each of which takes the same values as the terms of the analogy (A, B, C, and D) along the irrelevant dimensions of a problem, but takes a different value than D along the relevant dimension. The task is to select D from a set of multiple choices consisting of D and the foil objects.
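The defining constraint on the four terms (equal distance and direction for the A-to-B and C-to-D changes) can be sketched as follows. This is a minimal illustration: the function `make_analogy` and the number of discrete levels are hypothetical, not taken from the dataset generation code.

```python
import random

def make_analogy(num_levels=10, rng=random):
    # Sample a proportional analogy A : B :: C : D along a single dimension
    # tiled by `num_levels` discrete levels, such that the distance and
    # direction from A to B equal those from C to D (hypothetical sketch).
    while True:
        a, b = rng.randrange(num_levels), rng.randrange(num_levels)
        delta = b - a                      # shared distance and direction
        c = rng.randrange(num_levels)
        d = c + delta
        if delta != 0 and 0 <= d < num_levels:
            return a, b, c, d

a, b, c, d = make_analogy()
```

Foils would then be any objects at the same levels along the irrelevant dimensions but at a level other than `d` along the relevant one.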
2.1.1 Translation Extrapolation Regime
The VAEC dataset contains two generalization regimes, each requiring a distinct form of extrapolation. The Translation Extrapolation regime tests for invariance to translation along the underlying dimensions of the visual object space (size, brightness, and location), by dividing the space into six regions along the diagonal. These are referred to as Regions 1 through 6 (Figure 2), where Region 1 consists of small, dim objects located in the upper left portion of the space, and Region 6 consists of large, bright objects located in the lower right portion of the space. This allows the graded evaluation of extrapolation to a series of increasingly remote test domains.
We note that, although the sources of variation (size, brightness, and location) are correlated in terms of how training and test regions are defined, they are not correlated within an individual analogy problem. Each individual analogy problem only includes a single source of variation. The source of variation in each problem is included as an annotation in the dataset, and can be used to analyze whether and how performance differs across these dimensions. We include such an analysis in the Supplementary Material.
For each region, analogy problems are subsampled from the set of all valid analogies, resulting in a fixed number of analogy problems per region (a small fraction of all valid analogies within a region). In our work, we train networks on analogies from Region 1, and test on analogies from Regions 2 through 6, though we note that other configurations are possible with the dataset. Critically, test samples involve not only novel objects, but objects that fall completely outside the range of values observed during training.
2.1.2 Scale Extrapolation Regime
The VAEC dataset also contains a second generalization regime, the Scale Extrapolation regime, that tests for invariance to the scale of differences between visual objects. This regime includes six evaluation modes, referred to as Scale 1 through Scale 6. Scale 1 contains objects sampled from the lowest levels along each visual object dimension (identical to the values used in Region 1 of the Translation Extrapolation regime). In Scales 2 through 6, these values are multiplied by increasingly large scalars. Just as with the Translation Extrapolation regime, analogy problems in the Scale Extrapolation regime: a) are subsampled from the set of all valid analogies at each scale; and b) use test samples outside the range of values observed during training.
2.2 Visual Analogy Dataset
We also evaluated context normalization on the extrapolation regime from the visual analogy dataset in Hill et al. (2019). Inspired by Raven’s Progressive Matrices (Raven, 1941), this dataset consists of matrices in which a rule must be inferred from the images in the first row (the source) and then applied to the images in the second row (the target). Although the extrapolation regime in this dataset is in some ways easier than the VAEC dataset, because it does not require extrapolation as far from the training domain, it is also more challenging in other ways: it involves distracting, task-irrelevant variation; it involves cross-domain analogies (e.g. mapping a change in brightness to a change in size); and each image typically contains multiple objects.
2.3 Dynamic Object Prediction Task
In order to evaluate the generality of context normalization, we also evaluated it on a dynamic object prediction task. Specifically, we created a task containing a sequence of images depicting a smoothly changing object, requiring the prediction of the image at time t+1 given the images at times 1 through t, for each time step t in the sequence. We used grayscale images, each containing a white square on a black background. Over the course of a sequence, the location and size of the square changed smoothly within fixed ranges along the X and Y axes and along the size dimension.
For each sequence, we uniformly sampled start and end values for object size and location, and generated the sequence by linearly interpolating between these values. We used fixed-length sequences. To evaluate extrapolation, we stipulated that training would be performed only on objects drawn from the lower portion of the size range, and evaluation would be performed on objects drawn from the upper portion. This task thus required extrapolation from training on one set of objects to testing on objects that were, on average, nearly three times as large (Figure 3).
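The sequence-generation procedure can be sketched as follows. This is a minimal illustration: the function name and the particular ranges are placeholders, not the values used in the task.

```python
import random

def make_property_sequence(seq_len, low, high, rng=random):
    # Uniformly sample start and end values for one object property
    # (e.g. size) and linearly interpolate between them over the sequence.
    start, end = rng.uniform(low, high), rng.uniform(low, high)
    return [start + (end - start) * t / (seq_len - 1) for t in range(seq_len)]

sizes = make_property_sequence(seq_len=10, low=0.1, high=0.3)
```

Applying the same procedure independently to size and to X/Y location yields a smoothly changing object, and restricting `low`/`high` at training time versus test time produces the extrapolation split described above.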
Our central proposal for improving extrapolation is to normalize representations with respect to a task-relevant temporal window, preserving the relations between these representations, but discarding information about their absolute magnitude. We first formalize the proposed normalization approach, then describe how we apply this approach to different task domains.
3.1 Context Normalization
Given a batch with N sequences, in which each sequence contains T time points, and in which a vector with K dimensions is presented at each time point, we refer to the activity in the i-th sequence, at the t-th time point, in the k-th dimension as z_{itk}. We define the corresponding context-normalized activity as:

z'_{itk} = γ_k · (z_{itk} − μ_{ik}) / (σ_{ik} + ε) + β_k,  with  μ_{ik} = (1/T) Σ_{t=1..T} z_{itk}  and  σ_{ik} = √((1/T) Σ_{t=1..T} (z_{itk} − μ_{ik})²),

where ε is a small constant to avoid division by zero, and γ_k and β_k are learned gain and shift parameters (initialized to one and zero, respectively). This approach is similar to batch normalization (Ioffe and Szegedy, 2015), except that it normalizes over the temporal dimension instead of the batch dimension. In our experiments, we also evaluate a range of other normalization techniques, including batch normalization.
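A minimal NumPy sketch of this computation (illustrative only; in the experiments it is applied to learned embeddings inside the network, and the function name and shapes here are assumptions):

```python
import numpy as np

def context_norm(z, gamma=None, beta=None, eps=1e-8):
    # z has shape (batch, time, dims); statistics are computed over the
    # temporal axis, separately for each sequence and each dimension.
    mu = z.mean(axis=1, keepdims=True)      # per-sequence, per-dim mean
    sigma = z.std(axis=1, keepdims=True)    # per-sequence, per-dim std
    z_hat = (z - mu) / (sigma + eps)
    gamma = np.ones(z.shape[-1]) if gamma is None else gamma  # gain (init 1)
    beta = np.zeros(z.shape[-1]) if beta is None else beta    # shift (init 0)
    return gamma * z_hat + beta

z = np.random.randn(4, 6, 8)                # 4 sequences, 6 steps, 8 dims
out = context_norm(z)
```

With the default gain and shift, each sequence's activity has approximately zero mean and unit variance along the temporal dimension, while relations between time points within a sequence are preserved.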
In the proposed approach, the context over which normalization is applied can be tailored based on knowledge of the structure of the problem. In the visual analogy dataset from Hill et al. (2019), analogy problems in the test set require extrapolation from familiar values in the source domain to novel values in the target domain. For this dataset, we therefore implement context normalization by treating the source and target as separate contexts over which to normalize (Context norm. (source/target) in Table 3), and compare to a version that treats the entire analogy as a single context (Context norm. (entire analogy) in Table 3). For all other datasets, context normalization is applied over an entire analogy problem or sequence.
3.2 Analogy Scoring Model
To solve analogy problems in both the VAEC dataset and the visual analogy dataset from Hill et al. (2019), we employ an approach also proposed by Hill et al. (2019), treating analogy as a scoring problem. For each analogy problem, our network is presented with multiple candidate analogies, consisting of the objects A, B, C, and a candidate answer, drawn from the set containing D and the foil objects. The network produces a score for each candidate analogy, these scores are passed through a softmax function, and the network is trained to maximize the probability that D is the correct answer.
The network consists of a feedforward encoder that generates a vector embedding z for each image, a recurrent network that sequentially processes the vector embedding of each image in a candidate analogy, and a linear output layer (detailed in 4.1). In our experiments, we apply context normalization, along with a range of other techniques, to the vector embeddings (z) before passing them into the recurrent network.
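The scoring scheme can be illustrated with a toy stand-in for the encoder and LSTM. Everything here is hypothetical: the real model produces scores from learned image embeddings, whereas `toy_score` simply rewards candidates that complete a proportional analogy over scalars.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def answer_distribution(score_fn, a, b, c, candidates):
    # Score each candidate analogy (A, B, C, candidate) with a shared
    # scoring function, then softmax the scores over candidates.
    scores = np.array([score_fn((a, b, c, d)) for d in candidates])
    return softmax(scores)

# Toy score: prefer candidates where the C-to-D change matches A-to-B.
toy_score = lambda seq: -abs((seq[1] - seq[0]) - (seq[3] - seq[2]))
probs = answer_distribution(toy_score, a=1.0, b=3.0, c=4.0,
                            candidates=[5.0, 6.0, 7.0])
```

Here the candidate 6.0 completes the proportional analogy (both differences equal 2), so it receives the highest probability; training with cross entropy pushes probability mass toward the correct candidate in the same way.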
3.3 Dynamic Object Prediction Model
To address the dynamic object prediction task, we employ an approach that combines an autoencoder and a recurrent network (detailed in 4.3). First, we train an autoencoder to generate a low-dimensional embedding z given an image x. Then, for each sequence of images x_1, …, x_T, we obtain the corresponding low-dimensional embeddings z_1, …, z_T. Finally, we train a recurrent network to predict z_{t+1} given z_1, …, z_t. The combined system is capable of making predictions in image space given an input sequence of images, and can be fine-tuned end-to-end.
In our experiments, we apply context normalization to the embeddings z_1, …, z_T before passing them to the recurrent network to make predictions. Then, before passing the predictions to the decoder to generate predictions in image space, we invert the transformation imposed by normalization. We do this because normalization removes information about the absolute magnitude of the object that is necessary to accurately render an image. Specifically, given the prediction ẑ_{itk} (the predicted activity in the i-th sequence, at the t-th time point, in the k-th dimension), we define the de-normalized version as:

ẑ'_{itk} = σ_{ik} · (ẑ_{itk} − β_k) / γ_k + μ_{ik},

where μ_{ik} and σ_{ik} are the context mean and standard deviation computed during normalization, and γ_k and β_k are the learned gain and shift parameters (with their default initialization of one and zero, this reduces to scaling by the standard deviation and adding the mean). When testing other normalization procedures on the dynamic object prediction task, we similarly invert the normalization procedure (scaling by the corresponding standard deviation and adding the corresponding mean) before passing the predictions to the decoder.
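The normalize → predict → de-normalize round trip can be sketched as follows (illustrative; with identity gain and shift, de-normalization reduces to scaling by the context standard deviation and adding back the context mean):

```python
import numpy as np

EPS = 1e-8

def normalize(z):
    # Context normalization over the temporal axis of (batch, time, dims),
    # returning the statistics needed for later inversion.
    mu = z.mean(axis=1, keepdims=True)
    sigma = z.std(axis=1, keepdims=True)
    return (z - mu) / (sigma + EPS), mu, sigma

def de_normalize(z_hat, mu, sigma):
    # Invert the normalization so predictions carry absolute magnitudes
    # again before being passed to the decoder.
    return z_hat * (sigma + EPS) + mu

z = np.random.randn(2, 5, 3)
z_hat, mu, sigma = normalize(z)
z_rec = de_normalize(z_hat, mu, sigma)   # recovers the original embeddings
```

In the actual pipeline the recurrent network's predictions, rather than the normalized inputs themselves, are passed through `de_normalize`; the round trip above simply verifies that the inversion is exact.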
4.1 Analogy Architecture and Training Procedure
For the analogy scoring model used on the VAEC dataset, the encoder architecture consisted of convolutional layers, each with kernels of size , and a stride of (no max-pooling), resulting in a feature map of size . This was followed by fully-connected layers, with units per layer. ReLU nonlinearities were used in all layers of the encoder. The image embedding z was then generated with a linear layer consisting of 256 units.
We applied either context normalization, or one of a number of other normalization techniques (detailed in 4.2), to these embeddings, and then passed a sequence consisting of the embeddings for A, B, C, and the candidate answer to an LSTM with a single hidden layer. The final hidden state of the LSTM was then passed through a linear layer to generate a score for the candidate answer. This process was repeated for each candidate answer (using the same encoder parameters for each image, and the same recurrent and output layer parameters for each sequence, reinitializing the recurrent state at the beginning of each sequence), and the resulting scores were passed through a softmax function to generate a probability distribution over the candidate answers.
We trained networks to maximize the probability that D was the correct answer using a cross-entropy loss. Each network was trained using the ADAM optimizer (Kingma and Ba, 2014) with a fixed learning rate (except as otherwise noted in 4.2). All weights were initialized using Xavier uniform initialization (Glorot and Bengio, 2010), and all biases were initialized to zero. All simulations on the VAEC dataset were performed using TensorFlow (Abadi et al., 2016).
For the analogy scoring model used on the visual analogy dataset from Hill et al. (2019), we used an architecture and training procedure modeled as closely as possible on the original paper. We describe these in detail in the Supplementary Material. All simulations for this dataset were performed using PyTorch (Paszke et al., 2017).
4.2 Comparison with Other Normalization Techniques
We considered a wide range of techniques as alternatives to context normalization. First, we compared to batch normalization (Ioffe and Szegedy, 2015), in which normalization statistics are computed over the batch dimension. Just as with context normalization, we applied batch normalization to the embedding vector z for each image in a sequence.
In our default implementation, we evaluate performance on the test set in batches of the same size as during training, and compute batch normalization statistics online directly from these test batches. We did this to give batch normalization the best possible chance of extrapolating to domains with statistics that differ from the training set; we note, however, that in the standard approach to batch normalization, normalization statistics at test time are computed from the training set, removing the need for batches during evaluation. We therefore also compared to a model in which batch normalization statistics during evaluation were computed from the entire training set.
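The difference between the two evaluation modes can be illustrated with synthetic activations (a sketch; the shapes and the distribution shift are arbitrary assumptions):

```python
import numpy as np

np.random.seed(0)

def bn(x, mu, sigma, eps=1e-8):
    # Batch normalization with externally supplied statistics.
    return (x - mu) / (sigma + eps)

train_acts = np.random.randn(1000, 8)             # training-domain activations
test_batch = np.random.randn(32, 8) * 3.0 + 5.0   # shifted test-domain batch

# (a) standard mode: statistics precomputed from the training set
out_train_stats = bn(test_batch, train_acts.mean(0), train_acts.std(0))
# (b) online mode: statistics computed directly from the test batch
out_online = bn(test_batch, test_batch.mean(0), test_batch.std(0))
```

Online statistics re-center and re-scale the shifted test batch, whereas training-set statistics leave the distribution shift intact, which is the contrast these two evaluation modes probe.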
We also compared to layer normalization (Ba et al., 2016), in which normalization statistics are computed over the units in a hidden layer. Given that layer normalization has been proposed specifically in the context of recurrent networks, we evaluated two versions, one in which normalization was applied to the hidden layer of the LSTM, and one in which it was applied to outputs of the feedforward encoder (z). We found that we had to train the networks with layer normalization significantly longer ( training iterations) to achieve a comparable degree of convergence on the training set.
In our experiments, batch normalization statistics were computed over a larger sample (the batch) than context normalization statistics (the terms of a single analogy). To determine whether this factor affected performance, we compared to sub-batch normalization, in which normalization statistics were computed over small sub-batches (though the overall batch size was unchanged). Thus, sub-batch normalization was performed over the same dimension as batch normalization, but with sample sizes comparable to context normalization.
We also compared to a combined form of context- and batch-normalization, in which normalization statistics were computed over both the temporal and batch dimensions (similar to the ‘sequence-wise normalization’ proposed by Laurent et al. (2016)).
Our proposed approach to context normalization is aligned with the temporal structure of our task, in that normalization statistics are computed over the terms of a candidate analogy. To determine the importance of this alignment, we compared to two control conditions, each of which involved first concatenating all of the analogy problems from a given batch into one long sequence. In the first condition, misaligned context normalization, we divided this sequence into segments that were not aligned with the boundaries between analogy problems (as opposed to the aligned segments used for context normalization), and computed normalization statistics over these segments. Thus, normalization statistics were computed over segments that intermixed terms (in varying proportion) from two separate analogy problems. In the second control condition, sliding-window context normalization, we used a sliding window to compute normalization statistics for each object based on itself and the immediately preceding objects. Thus, for every object except the last object in each analogy, normalization statistics were computed over a window that incorporated objects from two analogy problems.
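The sliding-window control can be sketched as follows (illustrative only; the window length and the data are placeholders):

```python
import numpy as np

def sliding_window_norm(seq, window, eps=1e-8):
    # For each item in the concatenated sequence, compute normalization
    # statistics from the item itself and the preceding (window - 1) items,
    # so windows can straddle the boundary between two analogy problems.
    seq = np.asarray(seq, dtype=float)
    out = np.empty_like(seq)
    for t in range(len(seq)):
        seg = seq[max(0, t - window + 1): t + 1]
        out[t] = (seq[t] - seg.mean()) / (seg.std() + eps)
    return out

normed = sliding_window_norm([1.0, 2.0, 3.0, 4.0, 10.0, 11.0], window=4)
```

In this toy sequence the window for the later items mixes values from what would be two different analogy problems (the jump from 4.0 to 10.0), which is exactly the misalignment this control introduces.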
We also compared to a model that employed dropout, a technique proposed to improve generalization in neural networks, in which a random subset of units are dropped from each batch during training (Srivastava et al., 2014). Specifically, we implemented a model that combined both batch normalization and 50% dropout (after normalization), both applied to the output of the feedforward encoder (z).
Finally, we compared to a network without any form of normalization. We found that we had to use a lower learning rate and train for significantly longer to achieve convergence with this approach.
4.3 Dynamic Object Prediction
For the dynamic object prediction model, we first trained an autoencoder to generate a low-dimensional embedding z given an image x. The encoder architecture consisted of convolutional layers, each with kernels of size , and a stride of , resulting in a feature map of size . This was followed by fully-connected layers, with units per layer. ReLU nonlinearities were used in all layers of the encoder. The image embedding z was then generated with a linear layer consisting of 10 units.
Table 1: Accuracy (%) on the Translation Extrapolation regime of the VAEC dataset (mean ± the standard error of the mean).

| Model | Region 1 (training) | Region 2 | Region 3 | Region 4 | Region 5 | Region 6 |
| --- | --- | --- | --- | --- | --- | --- |
| Context norm. | 99.1 ± 0.6 | 77.0 ± 5.8 | 73.2 ± 6.4 | 72.5 ± 5.1 | 71.7 ± 5.5 | 61.6 ± 4.9 |
| Sub-batch norm. | 83.4 ± 3.3 | 56.1 ± 1.5 | 51.7 ± 1.7 | 50.5 ± 2.2 | 47.0 ± 2.7 | 46.3 ± 1.5 |
| Sliding context norm. | 68.2 ± 8.2 | 44.8 ± 1.6 | 35.4 ± 2.6 | 34.8 ± 2.7 | 36.7 ± 4.0 | 36.6 ± 4.1 |
| Batch + context norm. | 98.0 ± 0.5 | 42.5 ± 1.4 | 33.6 ± 5.0 | 37.0 ± 6.3 | 36.7 ± 6.5 | 36.5 ± 6.2 |
| Layer norm. (recurrent) | 100.0 ± 0.0 | 52.1 ± 8.0 | 44.5 ± 7.5 | 35.7 ± 6.1 | 27.4 ± 3.9 | 23.3 ± 3.7 |
| Misaligned context norm. | 55.3 ± 4.9 | 41.5 ± 2.0 | 35.4 ± 1.0 | 33.3 ± 1.6 | 31.6 ± 1.6 | 31.6 ± 1.8 |
| Batch norm. | 99.5 ± 0.1 | 29.0 ± 2.2 | 26.8 ± 2.4 | 28.0 ± 2.6 | 29.0 ± 2.7 | 30.8 ± 1.7 |
| Batch norm. + dropout | 99.0 ± 0.1 | 27.5 ± 2.4 | 22.6 ± 1.8 | 24.4 ± 2.3 | 26.0 ± 1.6 | 26.8 ± 1.7 |
| Layer norm. | 99.0 ± 0.2 | 25.2 ± 2.6 | 22.9 ± 2.3 | 22.4 ± 2.3 | 21.6 ± 2.2 | 17.2 ± 1.9 |
| No norm. | 94.8 ± 3.3 | 23.9 ± 2.6 | 23.4 ± 2.4 | 19.1 ± 1.9 | 17.8 ± 1.7 | 16.7 ± 1.4 |
| Batch norm. (train stats) | 99.9 ± 0.0 | 33.5 ± 4.0 | 19.1 ± 6.4 | 10.2 ± 4.9 | 13.5 ± 7.7 | 6.3 ± 4.3 |
The decoder architecture consisted of fully-connected layers, followed by a fully-connected layer whose output was reshaped for input to convolutional layers. This was followed by layers of transposed convolutions, each with kernels of size , and a fractional stride, and a final transposed convolutional layer with a single output channel. ReLU nonlinearities were used in all layers of the decoder, except for the output layer, which used a sigmoid nonlinearity to generate grayscale images with values ranging between 0 and 1.
We trained the autoencoder using a reconstruction loss (mean-squared error) with the ADAM optimizer; we found that convergence was achieved well before the end of training.
After training the autoencoder, we used the encoder to obtain a sequence of low-dimensional embeddings z_1, …, z_T for each sequence of images. We trained a recurrent network to predict z_{t+1} given z_1, …, z_t. The recurrent network consisted of an LSTM (we found that using larger LSTMs did not make a difference in this task), and a linear output layer with 10 units, corresponding to the size of its input (i.e. the embedding layer of the autoencoder).
We performed context normalization before passing the embeddings to the recurrent network, and then de-normalized the predictions made by the recurrent network. We also compare to versions with batch normalization (computed either online from batches on the test set, or from statistics calculated over a sample from the training set), and to a version without any normalization.
We trained the recurrent network using the mean-squared error between the predicted embedding and the true embedding z, using the ADAM optimizer. All simulations for the dynamic object prediction task were performed using PyTorch (Paszke et al., 2017). All weights and biases were initialized using a uniform distribution bounded by ±1/√n, where n is the number of input dimensions for a given layer (the default method in PyTorch). We evaluate the combined model (including the encoder, LSTM, and decoder) using the mean-squared error between the predicted image and the true image x.
Table 2: Accuracy (%) on the Scale Extrapolation regime of the VAEC dataset (mean ± the standard error of the mean).

| Model | Scale 1 (training) | Scale 2 | Scale 3 | Scale 4 | Scale 5 | Scale 6 |
| --- | --- | --- | --- | --- | --- | --- |
| Context norm. | 98.8 ± 0.7 | 77.8 ± 1.8 | 61.2 ± 3.8 | 54.4 ± 3.3 | 51.2 ± 2.5 | 48.7 ± 2.2 |
| Layer norm. (recurrent) | 100.0 ± 0.0 | 44.1 ± 5.0 | 28.1 ± 2.4 | 23.4 ± 1.6 | 20.2 ± 1.2 | 18.3 ± 0.8 |
| Batch norm. (train stats) | 99.9 ± 0.0 | 40.2 ± 1.2 | 25.7 ± 1.7 | 21.2 ± 1.2 | 21.3 ± 0.8 | 19.9 ± 0.8 |
| Batch norm. | 99.2 ± 0.2 | 18.3 ± 0.3 | 17.7 ± 0.5 | 18.4 ± 0.4 | 20.1 ± 0.8 | 21.2 ± 1.0 |
| No norm. | 94.4 ± 3.5 | 20.9 ± 1.0 | 17.6 ± 0.7 | 17.6 ± 0.6 | 16.7 ± 0.7 | 16.7 ± 0.6 |
5.1 VAEC Dataset
5.1.1 Translation Extrapolation Regime
Table 1 shows the results on the Translation Extrapolation regime of the VAEC dataset. In general, we find that performance decreases monotonically as a function of distance from the training domain, although we note that most models struggle even with extrapolation from Region 1 to Region 2. This suggests that the VAEC dataset is indeed a challenging benchmark, and an effective method of evaluating extrapolation in neural networks.
Promisingly, we find that networks trained with context normalization extrapolate considerably better than any of the other techniques that we tested. This is particularly true when compared to networks trained without any normalization at all, but there is also a substantial difference in test accuracy when comparing to established techniques, such as layer and batch normalization, with a considerable overall decrease in test error relative to the next best method (sub-batch normalization).
Some of the techniques that we tested were designed to better understand context normalization. From the performance of these techniques, we learn a few things. First, from the comparison with sub-batch normalization, we learn that the improvement from context normalization is not due merely to normalizing over a smaller sample. Second, from the comparison with both sliding-window and misaligned context normalization, we learn that it is important for context normalization to be aligned with the temporal structure of the task. Third, we learn that combining context normalization with batch normalization actually results in worse generalization than with context normalization alone.
As expected, we also find that networks trained with batch normalization extrapolate quite poorly when statistics are computed over the training set. This result emphasizes an additional strength of context normalization: that it can be computed online from single test samples, rather than requiring batches during evaluation.
One unexpected result was that sub-batch normalization was the second best performing technique. This was surprising because previous work has found batch normalization works better with larger batch sizes (Wu and He, 2018). We speculate that normalizing over small sub-batches might implicitly regularize the learned representations by introducing a source of noise during training, enabling stronger extrapolation. However, we note that batch normalization actually outperforms sub-batch normalization within the training region, suggesting that normalizing over larger batches is indeed better in the traditional IID generalization regime.
Here, we have focused on the benefits of normalization for generalization, but a common reason for applying normalization techniques to neural networks is to decrease training time. We found in our simulations that context normalization provided a comparable acceleration in training speed to batch normalization (training time courses are presented in the Supplementary Material), demonstrating that it is also useful for this purpose.
We also performed an analysis to better understand how context normalization shaped the representations learned by our networks. We found that context normalization encouraged the learning of representations that mirrored the linear structure of the stimulus space, and that this structure was preserved across the test regions in a manner that supported extrapolation (Supplementary Material).
Finally, we note that, although context normalization does indeed enable a substantial increase in the ability to extrapolate, extrapolation is still far from perfect. Thus, we see the VAEC dataset as a tool to aid in the development of methods with even stronger abilities to extrapolate.
5.1.2 Scale Extrapolation Regime
Table 2 shows the results on the Scale Extrapolation regime. As with the Translation Extrapolation regime, we find that extrapolating between scales is quite challenging, with performance monotonically decreasing as distance from the training domain increases. We find, however, that models trained with context normalization again display a considerable improvement in the ability to extrapolate relative to the other techniques we tested.
5.2 Visual Analogy Dataset
Table 3 shows the results on the visual analogy dataset from Hill et al. (2019). We find that applying context normalization over the source and target separately enables a decrease in test error relative to the results of Hill et al. (2019). We also find that batch normalization, and context normalization applied over the entire analogy problem, both enable more limited improvements in extrapolation. These results show that context normalization can also improve extrapolation in a more complex visual setting.
Table 3: Accuracy (%) on the extrapolation regime of the visual analogy dataset from Hill et al. (2019) (mean ± the standard error of the mean).

| Model | Accuracy |
| --- | --- |
| Context norm. (source/target) | 74.2 ± 0.81 |
| Context norm. (entire analogy) | 67.6 ± 0.49 |
| Batch norm. | 66.1 ± 0.53 |
| Baseline (Hill et al., 2019) | 62 ± 0.02 |
5.3 Dynamic Object Prediction
We find that the generalization benefits of context normalization are not specific to visual analogy problems, but also enable a significant improvement in extrapolation on the dynamic object prediction task. Table 4 shows the average prediction error on the test set (in which objects are, on average, nearly three times the size of objects in the training set), for models trained with context normalization, batch normalization, or no normalization.
Note that when we implement batch normalization in the conventional manner (computing normalization statistics from the training set), test error is nearly ten times as high as with context normalization. Even when we allow normalization statistics to be computed over the test set, we find that the test error for batch normalization is higher than context normalization. These results demonstrate that the extrapolation benefits afforded by context normalization are not limited to visual analogies, but extend to sequential tasks more generally.
Table 4: Mean-squared prediction error on the test set of the dynamic object prediction task (mean ± the standard error of the mean).

| Model | Test error (MSE) |
| --- | --- |
| Context norm. | 0.0056 ± 0.00010 |
| Batch norm. | 0.0095 ± 0.00008 |
| Batch norm. (train stats) | 0.0507 ± 0.00162 |
| No norm. | 0.0675 ± 0.00275 |
6 Related Work
Recent studies have investigated the question of extrapolation in neural networks (Santoro et al., 2018; Hill et al., 2019; Saxton et al., 2019). Despite the fact that some of these studies found surprisingly strong performance in complex reasoning tasks, they nevertheless found that current approaches do not perform well when neural networks are required to extrapolate. These results are broadly consistent with the work presented here; however, we note two unique contributions of our work. First, whereas in this previous work neural networks were required to extrapolate to a domain immediately adjacent to the training domain (equivalent to extrapolating from Region 1 to Region 2 in our task), the VAEC dataset that we present allows the graded evaluation of extrapolation at distances much farther from the training domain. This is important because, as we find in this work, the ability of neural networks to extrapolate tends to degrade as a function of distance from the training domain, so measuring extrapolation in terms of this distance is essential for evaluating novel approaches. Second, we present a technique that considerably outperforms other approaches at extrapolation, performing reasonably well even in the more challenging evaluation modes of our dataset.
It is important to note that the ability to extrapolate is not the only challenging aspect of analogical reasoning. A related, but separate, challenge arises from the control demands imposed by analogical and relational reasoning tasks more broadly. When many entities are present in a scene or sequence, as is often the case in natural settings, processing the relations between these entities in a systematic manner becomes challenging. A number of architectures have recently been proposed to meet this challenge, with impressive results ranging from relational reasoning (Santoro et al., 2017, 2018; Vaswani et al., 2017) to mathematical reasoning (Saxton et al., 2019). In the present work, we pursued the hypothesis that the failure of neural networks to extrapolate may be due to the nature of the representations over which they operate, rather than the control demands inherent to relational reasoning tasks. To that end, we focused on a simple problem from a control perspective, allowing us to use a relatively simple recurrent architecture (LSTM). To extend the present approach to more complex settings involving many entities and hierarchical relations, it may be useful to combine our approach with some of these recent architectural developments.
Some studies have employed a ‘parallelogram’ computation (based on the approach of Rumelhart et al. (1973)) to perform both linguistic (Mikolov et al., 2013) and visual (Reed et al., 2015) analogies in vector space. Here, we use LSTMs instead of a prespecified computation, with the aim of developing a more flexible framework that is also amenable to other relational and analogical reasoning tasks. The parallelogram approach would not be capable of handling, for instance, either the analogy problems from Hill et al. (2019) or the dynamic object prediction task.
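For reference, the parallelogram computation amounts to completing a : b :: c : ? with the fixed vector offset d̂ = c + (b − a) and selecting the nearest candidate. A minimal sketch on toy embeddings (the function name and embeddings are illustrative):

```python
import numpy as np

def parallelogram_analogy(a, b, c, candidates):
    """Solve a : b :: c : ? via the fixed offset d = c + (b - a),
    then return the index of the nearest candidate embedding."""
    target = c + (b - a)
    dists = np.linalg.norm(candidates - target, axis=1)
    return int(np.argmin(dists))

# Toy 2-D embedding space: dimension 0 codes size, dimension 1 codes shape.
small_sq = np.array([1.0, 0.0])
large_sq = np.array([3.0, 0.0])
small_tri = np.array([1.0, 5.0])
cands = np.array([[2.0, 5.0],    # medium triangle
                  [3.0, 5.0],    # large triangle (correct)
                  [1.0, 0.0]])   # small square
print(parallelogram_analogy(small_sq, large_sq, small_tri, cands))  # -> 1
```

Because the offset b − a is fixed, this computation has no way to adapt to the richer relational structure of tasks like those of Hill et al. (2019), which is one motivation for using a learned sequence model instead.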
A key aspect of our approach involves normalizing representations with respect to their context. Other forms of normalization have played an important role in recent deep learning research, including batch normalization (Ioffe and Szegedy, 2015) and layer normalization (Ba et al., 2016). These methods have been shown both to speed convergence and to improve generalization (Bjorck et al., 2018), at least in the traditional IID setting. However, to our knowledge, it has not been tested whether any variants of these methods also improve extrapolation. In our work, we found that normalization can indeed enable a substantial improvement in extrapolation, but that the details of the normalization procedure make quite a difference. We found, for instance, that normalizing only over the temporal dimension (‘context norm.’) results in significantly better extrapolation than normalizing over both the batch and temporal dimensions (‘batch + context norm.’ in our work, referred to as ‘sequence-wise normalization’ by Laurent et al. (2016)).
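The distinction between the two procedures comes down to which axes the statistics are pooled over: per sequence over time, versus over batch and time together. An illustrative NumPy sketch (shapes and function names are ours, not from any cited implementation) makes the consequence visible: only the per-sequence variant removes a sequence's own scale.

```python
import numpy as np

def context_norm_batch(x, eps=1e-8):
    """Statistics per sequence, per feature, over the time axis only.
    x: (batch, time, features)."""
    mu = x.mean(axis=1, keepdims=True)
    sd = x.std(axis=1, keepdims=True)
    return (x - mu) / (sd + eps)

def batch_time_norm(x, eps=1e-8):
    """'Sequence-wise' normalization in the sense of Laurent et al. (2016):
    statistics pooled over both the batch and time axes."""
    mu = x.mean(axis=(0, 1), keepdims=True)
    sd = x.std(axis=(0, 1), keepdims=True)
    return (x - mu) / (sd + eps)

rng = np.random.default_rng(0)
base = rng.normal(size=(1, 9, 4))
batch = np.concatenate([base, 10.0 * base], axis=0)  # same pattern, 10x scale

cn = context_norm_batch(batch)
bn = batch_time_norm(batch)
print(np.allclose(cn[0], cn[1]))  # -> True: per-sequence scale removed
print(np.allclose(bn[0], bn[1]))  # -> False: pooled statistics keep it
```

Under the pooled statistics, a test sequence lying outside the training range remains out of range after normalization, which is consistent with the weaker extrapolation we observe for that variant.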
We note that the idea of normalizing activations with respect to the recent context is similar to the ‘adaptive detrending’ (subtraction of the mean activity over a temporal window) applied by Jung et al. (2018), who found that this procedure improved image recognition from video, providing a benefit both in terms of convergence and (IID) generalization. The details of the normalization procedure in that study differed from ours – in particular, we implement both a detrending and a scaling operation, as well as the learned gain and shift operations commonly found in other normalization procedures. But we are encouraged by the fact that a similar approach proved useful in a more applied setting. Given these results, as well as our results in the dynamic object prediction task, we predict that context normalization may also enable extrapolation in richer settings such as video prediction.
7 Conclusion
We have considered the question of how to enable neural networks to extrapolate beyond the convex domain of their training data, making two key contributions. First, we proposed a novel benchmark, the Visual Analogy Extrapolation Challenge (VAEC) dataset, that allows the graded evaluation of extrapolation as a function of distance from the training data, testing for invariance to both scale and translation. Second, we have proposed a simple technique, context normalization, that enables a considerable improvement in the ability to extrapolate, as revealed by experiments using the VAEC dataset, the visual analogy dataset from Hill et al. (2019), and a dynamic object prediction task.
One possible concern with the benchmark that we propose in this work is that it lacks much of the visual complexity characteristic of real-world data (3D objects, multiple sources of illumination, clutter, etc.). Adding complexity to the VAEC dataset would certainly pose an interesting challenge, but we opted to avoid this in the present work for a simple reason. Adding extraneous complexity to the dataset, unrelated to the issue of extrapolation, would make it difficult to determine whether model failures resulted from this added complexity or from the central challenge of extrapolation. The poor performance on our dataset exhibited by a range of competitive techniques demonstrates that the extrapolation required by the task is more than challenging enough without additional visual complexity. We argue that the VAEC dataset is thus appropriately focused on the highly challenging issue of extrapolation.
Another potential concern with the present work is that context normalization needed to be temporally aligned with the structure of the task in order to enable a significant degree of extrapolation. This was easy to do in the context of our task, but is this too strong a constraint for the approach to be generally applicable? The question of how to align a normalization procedure with the underlying temporal structure of a task is an important one to address in future work, but we highlight two aspects of this problem that are causes for optimism. First, from an engineering perspective, many problems present natural, heuristic methods for segmenting sequential data according to their underlying structure, such as segmenting natural language data at the ends of sentences or paragraphs. Second, we point to a body of work in neuroscience focused precisely on the question of event segmentation (Zacks et al., 2007). In particular, this work suggests particular signatures that might be used as cues to the presence of event boundaries, such as transient changes in prediction error (Zacks et al., 2011), or clusters of temporal associations (Schapiro et al., 2013).
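To illustrate the kind of heuristic segmentation we have in mind, one could split a token stream at sentence-final punctuation and treat each resulting span as one normalization window. This is a hypothetical sketch (the helper names are ours), not a component of the model reported above:

```python
import numpy as np

def segment_indices(tokens, boundaries=('.', '!', '?')):
    """Heuristically split a token stream into spans at sentence-final
    punctuation; each span then serves as one normalization context."""
    spans, start = [], 0
    for i, tok in enumerate(tokens):
        if tok in boundaries:
            spans.append((start, i + 1))
            start = i + 1
    if start < len(tokens):
        spans.append((start, len(tokens)))
    return spans

def normalize_per_segment(feats, spans, eps=1e-8):
    """Apply context normalization separately within each heuristic window.
    feats: (time, features) activations aligned with the token stream."""
    out = np.empty_like(feats)
    for s, e in spans:
        win = feats[s:e]
        out[s:e] = (win - win.mean(axis=0)) / (win.std(axis=0) + eps)
    return out

tokens = ['the', 'cat', 'sat', '.', 'it', 'slept', '.']
spans = segment_indices(tokens)
print(spans)  # -> [(0, 4), (4, 7)]
```

Event-segmentation signals such as prediction-error spikes could, in principle, replace the punctuation heuristic as the boundary detector while leaving the per-window normalization unchanged.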
In this work, we took inspiration from the notion of symmetries in theoretical physics, hypothesizing that data in many domains can be characterized by such symmetries, and that extrapolation can be enabled by designing learning algorithms that exploit them. Our proposed approach, context normalization, was designed to exploit such symmetries – in particular, the translation and scale invariance of underlying linear dimensions – and our results demonstrate that doing so substantially improves the ability to extrapolate. We emphasize that this result was far from obvious a priori. Though our proposed normalization procedure introduces scale and translation invariance with respect to the representational space (in whichever layer it is applied), this is not necessarily the same as introducing scale and translation invariance with respect to the underlying object space (the size, location, and brightness of objects).
These results are particularly surprising in the case of our experiments with the VAEC dataset and the dataset from Hill et al. (2019), in which no aspect of the trained system, including the feedforward encoder, experienced any objects outside of the narrowly defined training domain. In this case, the downstream presence of context normalization apparently shaped the learning of representations in the encoder that supported a significant degree of extrapolation. In other words, the learning of representations that support extrapolation was encouraged by the presence of a subtle inductive bias that reflected the underlying symmetries in the task space. Understanding this interaction better is an important task for future work, and would likely lead to even further improvements in the ability to extrapolate.
Finally, we note that there is likely much more to be gained from the design of inductive biases that reflect the underlying symmetries of a given task space. In addition to designing techniques that more effectively capitalize on translation and scale invariance, there are a host of other symmetries to be exploited, including invariance with respect to rotation in space, translation in time, and so on. There are also a number of opportunities to capitalize on symmetry between domains that are characterized by similar underlying structure. This is indeed the basis of more advanced forms of analogical reasoning (e.g., the analogy between the solar system and an atom). It is no coincidence that complex, relational representations are the hallmark of analogical reasoning: the most abstract and far-reaching invariances are expressed as systems of relations. We look forward to exploring these ideas in greater depth in future work.
Acknowledgments
We would like to thank Timothy Buschman, Simon Segert, Mariano Tepper, Jacob Russin, and the reviewers for helpful feedback and discussions. We would also like to thank David Turner for assistance in performing simulations.
References
- Abadi, M., Barham, P., Chen, J., et al. (2016). TensorFlow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283.
- Ba, J. L., Kiros, J. R., and Hinton, G. E. (2016). Layer normalization. arXiv preprint arXiv:1607.06450.
- Bjorck, N., Gomes, C. P., Selman, B., and Weinberger, K. Q. (2018). Understanding batch normalization. In Advances in Neural Information Processing Systems, pp. 7694–7705.
- Feynman, R. P. (1966). Symmetry in physical laws. The Physics Teacher 4 (4), pp. 161–174.
- Feynman, R. P. (1967). The character of physical law (1965). Cox and Wyman Ltd., London.
- Gentner, D. (1983). Structure-mapping: a theoretical framework for analogy. Cognitive Science 7 (2), pp. 155–170.
- Glorot, X. and Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256.
- Hill, F., Santoro, A., Barrett, D., Morcos, A., and Lillicrap, T. (2019). Learning to make analogies by contrasting abstract relational structure. arXiv preprint arXiv:1902.00120.
- Holyoak, K. J. (2012). Analogy and relational reasoning. In The Oxford Handbook of Thinking and Reasoning, pp. 234–259.
- Ioffe, S. and Szegedy, C. (2015). Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
- Jung, M., Lee, H., and Tani, J. (2018). Adaptive detrending to accelerate convolutional gated recurrent unit training for contextual video recognition. Neural Networks 105, pp. 356–370.
- Kingma, D. P. and Ba, J. (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Lake, B. M. and Baroni, M. (2017). Generalization without systematicity: on the compositional skills of sequence-to-sequence recurrent networks. arXiv preprint arXiv:1711.00350.
- Lake, B. M., Ullman, T. D., Tenenbaum, J. B., and Gershman, S. J. (2017). Building machines that learn and think like people. Behavioral and Brain Sciences 40.
- Laurent, C., Pereyra, G., Brakel, P., Zhang, Y., and Bengio, Y. (2016). Batch normalized recurrent neural networks. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2657–2661.
- Marcus, G. F. (2001). The Algebraic Mind. Cambridge, MA: MIT Press.
- Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111–3119.
- Paszke, A., et al. (2017). Automatic differentiation in PyTorch.
- Raven, J. C. (1941). Standardization of progressive matrices, 1938. British Journal of Medical Psychology 19 (1), pp. 137–150.
- Reed, S. E., Zhang, Y., Zhang, Y., and Lee, H. (2015). Deep visual analogy-making. In Advances in Neural Information Processing Systems, pp. 1252–1260.
- Rumelhart, D. E. and Abrahamson, A. A. (1973). A model for analogical reasoning. Cognitive Psychology 5 (1), pp. 1–28.
- Santoro, A., Hill, F., Barrett, D., Morcos, A., and Lillicrap, T. (2018). Measuring abstract reasoning in neural networks. In International Conference on Machine Learning, pp. 4477–4486.
- Santoro, A., Raposo, D., Barrett, D. G., Malinowski, M., Pascanu, R., Battaglia, P., and Lillicrap, T. (2017). A simple neural network module for relational reasoning. In Advances in Neural Information Processing Systems, pp. 4967–4976.
- Saxton, D., Grefenstette, E., Hill, F., and Kohli, P. (2019). Analysing mathematical reasoning abilities of neural models. arXiv preprint arXiv:1904.01557.
- Schapiro, A. C., Rogers, T. T., Cordova, N. I., Turk-Browne, N. B., and Botvinick, M. M. (2013). Neural representations of events arise from temporal community structure. Nature Neuroscience 16 (4), p. 486.
- Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15 (1), pp. 1929–1958.
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
- Wu, Y. and He, K. (2018). Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19.
- Zacks, J. M., Kurby, C. A., Eisenberg, M. L., and Haroutunian, N. (2011). Prediction error associated with the perceptual segmentation of naturalistic events. Journal of Cognitive Neuroscience 23 (12), pp. 4057–4066.
- Zacks, J. M., Speer, N. K., Swallow, K. M., Braver, T. S., and Reynolds, J. R. (2007). Event perception: a mind-brain perspective. Psychological Bulletin 133 (2), p. 273.