The successes of deep learning critically rely on the ability of neural networks to output meaningful predictions on unseen data, that is, to generalize. Yet despite its criticality, there remain fundamental open questions on how neural networks generalize. How much do neural networks rely on memorization (seeing highly similar training examples), and how much are they capable of human-intelligence styled reasoning (identifying abstract rules underlying the data)? In this paper we introduce a novel benchmark, Pointer Value Retrieval (PVR) tasks, that explores the limits of neural network generalization. While PVR tasks can consist of visual as well as symbolic inputs, each with varying levels of difficulty, they all have a simple underlying rule. One part of the PVR task input acts as a pointer, giving the location of a different part of the input, which forms the value (and output). We demonstrate that this task structure provides a rich testbed for understanding generalization, with our empirical study showing large variations in neural network performance based on dataset size, task complexity and model architecture. The interaction of position, values and the pointer rule also allows the development of nuanced tests of generalization, by introducing distribution shift and increasing functional complexity. These reveal both subtle failures and surprising successes, suggesting many promising directions of exploration on this benchmark.
The remarkable capabilities of neural networks across many different domains are critically reliant on their ability to generalize: to output accurate predictions on unseen data. But despite its central importance, the mechanisms of neural network generalization are still poorly understood, with many unanswered questions. For example, it has been shown that neural networks can, surprisingly, memorize training data with random labels. Is this a barrier to true, intelligent generalization? Follow-up work paints a more nuanced picture, showing that, as in humans, memorization can also aid generalization [15, 9]. But humans also generalize through alternate approaches, such as abstract reasoning. Are general neural networks also capable of this?
Answering these questions on neural network generalization has gained pressing urgency as we increase the complexity of tasks being considered (e.g. object detection for self-driving, multimodal learning) [11, 8, 24], run into limits on the amount of diverse data available, and also require reasoning capabilities in high-stakes settings, e.g. medical data with distribution shifts [33, 22]. But progress in understanding generalization has been hindered by a lack of benchmarks that can distinguish simple methods of generalization, such as memorization or nearest neighbors, from more complex approaches such as sophisticated reasoning.
In this work, we address this shortcoming, introducing a simple but versatile benchmark to test and understand the limits of neural network generalization. Specifically, our contributions are:
We introduce Pointer Value Retrieval (PVR), a family of tasks to study neural network generalization. PVR tasks can consist of visual or vectorized inputs, with natural ways to systematically vary task difficulty by introducing distribution shifts and/or increasing functional complexity.
All PVR tasks are unified by a simple rule, where part of the input acts as a pointer, indicating a different location in the input which forms the value (output). This property means that PVR tasks require understanding of position, values and the rule connecting them, which naturally enables many rich task variations. It also allows targeted testing of which of these abstract concepts a model has learned, and which it has failed at.
We perform an empirical exploration of neural network generalization, starting with visual PVR tasks and introducing distribution shift. We analyze ways in which this causes models to fail, identifying important questions for progress on these tasks.
We then look at neural network generalization as the functional complexity of PVR tasks is varied. We introduce a family of functions to aggregate values into the final output, and highlight connections between this and the notion of noise sensitivity from Boolean functions, which measures task difficulty.
We test multiple model architectures on vectorized PVR tasks with varying functional complexity, discovering performance differences between MLPs, Transformers and the recent MLP-Mixer.
We analyze the models for abstract concepts learned, discovering surprising successes on tasks combining functional complexity and distribution shift. We discuss the ramifications and important open questions for future exploration.
Neural network generalization and reasoning have been studied with several approaches and tasks distinct from our PVR framework. One active direction has looked at the capability of models to learn visual tasks in a few-shot setting [29, 31], on benchmarks such as Omniglot and Bongard-LOGO. Our focus is not on few-shot learning, but on faithfully learning the reasoning rule. Previous work has looked at learning rules in visual question answering [1, 13], drawing on classical work studying generalization. These rules are often much more complex than our simple pointer-value formulation. Other work with varied rules has looked at mathematical reasoning, both in the form of non-verbal, visual puzzles [2, 36] as well as more symbolic mathematical inputs [28, 27], with related efforts studied in RL [25, 35, 23]. There are also tasks exploring verbal reasoning abilities [17, 20], which are different from our setting with systematic difficulty variation. Sequence-to-sequence learning has been used to study compositional generalization [17, 14, 26], due to rapid advances in natural language processing in the past few years. However, the model architectures and learning dynamics are much more complicated to analyze than in the classification setting. A line of literature designs neural networks with dedicated symbolic components to improve systematic generalization [4, 5]. We instead focus on studying general architectures and analyze whether generalization behavior is learned from the data.
Here we introduce a new benchmark, Pointer Value Retrieval (PVR), for testing and understanding neural network generalization. The PVR benchmark provides a rich family of tasks with different input types and varying difficulty. But all of these tasks are unified by a simple underlying rule: part of the input acts as a pointer, indicating a position in the input, which contains the value, which forms the output. This structure requires models to understand notions of position, values and the rule linking them, enabling a diverse set of task types. We give an overview of the different subfamilies of tasks below.
Figure 1: Block style visual PVR. The pointer selects a position in the grid, such as the bottom right position; the image at that position is then the value, which determines the label (the digit value for the top row of MNIST examples, and the class label for the bottom row of CIFAR-10 examples). Task variants may have larger block sizes or a different mapping from pointer digits to positions. Train, validation and test splits are easily created by drawing images from the underlying dataset train/test splits.
One subfamily of PVR tasks, visual PVR, consists of visual (image) inputs, which provide a way to test and understand generalization in high performing computer vision models. We consider two structures for our PVR visual inputs, (i) block and (ii) sequential, illustrated in Figures 1 and 2.
In the block styled input, we have a grid with four positions, where the upper left entry acts as the pointer. The number of the pointer indicates which position of the input we should look at for the value term, the label of which provides the overall label. For example, in the top row of Figure 1, the ten possible pointer digits are split into three groups: pointer digits in the first group select the upper right digit as the value term, those in the second group select the lower left digit, and those in the third group select the lower right digit. This task structure also extends across datasets (shown with MNIST digits as pointers and CIFAR-10 images as possible values in the second row of Figure 1), and to larger block sizes.
In the sequential input, we have the visual data arranged in a row. In our example, the leftmost entry acts as the pointer, and its numerical digit specifies which entry of the remaining sequence is the value. In Figure 2, we show this for a sequence with one pointer and ten possible values, but like the block styled input, this can also be extended to different sequence lengths, pointer positions and datasets.
To disentangle learning visual representations from learning reasoning, we can also look at vectorized versions of these tasks, where the input is no longer an image, but a vector consisting of the digits themselves. This has natural parallels to language tasks: the digits form a vocabulary. Indeed, when testing models on this benchmark, we also use an embedding layer, which maps each digit to an embedding vector, as shown in Figure 3. As with the visual input, in our examples we let the leftmost entry act as the pointer, indicating which of the remaining digits is the value (and output).
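The vectorized task can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' data pipeline: the function name, the sequence layout of one pointer followed by ten values, and the 0-based pointer indexing are all assumptions made for the example.

```python
import random

NUM_VALUES = 10  # assumed layout: one pointer followed by ten candidate values


def make_vectorized_pvr_example(rng):
    """Return (tokens, label) for one vectorized PVR example (hypothetical helper)."""
    pointer = rng.randrange(NUM_VALUES)
    values = [rng.randrange(10) for _ in range(NUM_VALUES)]
    tokens = [pointer] + values
    # Assumption: the pointer is a 0-based index into the values that follow it.
    label = values[pointer]
    return tokens, label


rng = random.Random(0)
tokens, label = make_vectorized_pvr_example(rng)
print(tokens, "->", label)
```

Because the label is a pure function of the tokens, any number of train and test examples can be drawn this way.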
Having seen the different input types and the basic tasks, we turn to defining how to systematically vary the difficulty of the PVR tasks. It is this feature that makes the PVR tasks especially useful in testing and understanding generalization. Specifically, as the PVR tasks rely on reasoning about position and values, there are several ways to naturally increase task complexity.
One such method is to introduce distribution shift: at train time, we (i) ensure some of the values do not appear at some positions (i.e. value holdout at some positions), and (ii) ensure all values appear somewhere (so the model has the ability to learn all the values). We can then test to see how well the model learns notions of position, value and the pointer rule. In Figure 4, we provide an example of this (studied further in Section 4): we take the block style visual PVR task and, in the training set, make sure one set of digits never appears in the top right, a second set never appears in the bottom left, and a third set never appears in the bottom right.
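The holdout rule can be sketched at the level of digit labels (ignoring the images). The pointer-to-position mapping and the three holdout sets below are hypothetical choices made for illustration, as are the split names; the sketch only demonstrates the two constraints stated above.

```python
import random

POSITIONS = ("top_right", "bottom_left", "bottom_right")

# Hypothetical holdout sets: digits that never appear at a position during training.
HOLDOUT = {
    "top_right": {1, 2, 3},
    "bottom_left": {4, 5, 6},
    "bottom_right": {7, 8, 9},
}


def pointer_to_position(pointer):
    """Hypothetical mapping from pointer digit to the position holding the value."""
    if pointer <= 3:
        return "top_right"
    if pointer <= 6:
        return "bottom_left"
    return "bottom_right"


def sample_block(rng, split):
    """Sample (pointer, block, label); split is 'train', 'heldout_only' or 'uniform'."""
    pointer = rng.randrange(10)
    block = {}
    for pos in POSITIONS:
        if split == "train":
            allowed = [d for d in range(10) if d not in HOLDOUT[pos]]
        elif split == "heldout_only":
            allowed = sorted(HOLDOUT[pos])
        else:  # 'uniform': every digit may appear at every position
            allowed = list(range(10))
        block[pos] = rng.choice(allowed)
    label = block[pointer_to_position(pointer)]
    return pointer, block, label


rng = random.Random(0)
print(sample_block(rng, "train"))
print(sample_block(rng, "heldout_only"))
```

With these sets each digit is held out at no more than one position, so every value still appears somewhere in training.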
Another natural method to vary task difficulty is by increasing functional complexity. So far, the pointer in the input has indicated the position of the value, which has directly been taken as the output. Instead, we now use the pointer to indicate a window of values, which are aggregated to form the final output. The size of the window is determined by a fixed complexity parameter, with a window containing a single value corresponding to the original task of directly outputting the value. Figure 5 shows a schematic of this.
The connection between window size and functional complexity can be made formal through Boolean function analysis (discussed in Section 5.1). But it can also be seen through observing that having a window of values increases the number of positions used to determine the output. This prevents reliance on naive nearest neighbors, as a single digit changing could affect the output.
There are many possibilities for aggregation functions when increasing functional complexity. In our experiments, we study the following, discovering interesting variations in difficulty amongst them: (i) mod_sum (sum all the values in the window and compute the remainder mod 10), (ii) median (median of window values), (iii) maj-vote (mode of window values), and (iv) min/max of window values.
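The aggregation functions can be written down directly. The sketch below is illustrative only: the window placement (starting at the pointed position and wrapping around the end of the sequence), the modulus of 10, and the tie-breaking rules for median and maj-vote are assumptions, not details confirmed by the text.

```python
import statistics
from collections import Counter


def value_window(values, pointer, size):
    """Window of `size` values starting at the pointed position (wrap-around assumed)."""
    n = len(values)
    return [values[(pointer + i) % n] for i in range(size)]


def mod_sum(window):
    return sum(window) % 10  # modulus 10 assumed, matching the ten digit labels


def median(window):
    return statistics.median_low(window)  # low median keeps the output a digit (assumed)


def maj_vote(window):
    counts = Counter(window)
    best = max(counts.values())
    return min(v for v, c in counts.items() if c == best)  # smallest mode on ties (assumed)


AGGREGATIONS = {"mod_sum": mod_sum, "median": median, "maj_vote": maj_vote,
                "min": min, "max": max}


def pvr_label(tokens, size, aggregation):
    """Label of a vectorized PVR example: tokens[0] is the pointer, the rest are values."""
    pointer, values = tokens[0], tokens[1:]
    return AGGREGATIONS[aggregation](value_window(values, pointer, size))


tokens = [2, 7, 1, 4, 9, 3, 0, 5, 8, 6, 2]
for name in AGGREGATIONS:
    print(name, pvr_label(tokens, 3, name))
```

With a window of a single value, every aggregation reduces to returning the pointed value, which recovers the original task.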
Having introduced the PVR benchmark, different subfamilies of tasks it contains, and natural ways to vary difficulty, we turn to a simple first empirical study of the block style visual PVR task, and the effect of introducing distribution shift.
In Figure 6, we look at training two convolutional networks, VGG and ResNet, on the standard block style visual PVR MNIST task. (Full details of all PVR datasets in Appendix Section A.) We observe that both model architectures achieve strong generalization (Figure 6 right pane.) But how is this performance achieved? Are the models learning the notion of the four different positions, the ten different digit values and the pointer rule?
To understand this further, we introduce distribution shift into the task, specifically the holdout rule described in Figure 4, where some digits constitute the holdout in some positions at training, but all digits (and all pointer values) appear. This provides the models with enough information to still learn the abstract concepts (position, digit values, pointer rule) underlying the labelling process, but prevents simple memorization or nearest neighbor approaches.
We then evaluate the models on data which draws digits uniformly at random at all positions (labelled dshift) and also on all the datapoints that were held out during training (labelled holdout). The results, shown in Figure 7, are striking. We see that both models perform poorly when tested on the dshift data, and entirely fail when tested only on the holdout data, with accuracy falling away from random chance as training progresses. This reveals that the strong generalization performance seen in Figure 6 is not due to the neural network correctly learning to reason, but is a consequence of a more simplistic generalization method, like nearest neighbors.
Figure 8: Left: we plot the raw logit values for a few different test examples that share a pointer digit. We observe that the model has learned to assign very low logits to exactly the labels that were left out from the top right position during training (the position this pointer indicates). Although all test examples have only those values in this position, this correlation is ingrained in the network, leading to systematic errors. Right: for comparison, we include logits from models trained in the IID setting, where we observe no correlations between pointer digit and label values. Additional examples are in the Appendix.
The striking collapse in accuracy in the Holdout case also suggests a systematic error. To understand this further, we analyzed the model representations and outputs. In particular, looking at the raw logit values across models trained in an IID setting and in the Holdout setting revealed a surprising insight, shown in Figure 8. We observe that under Holdout shift, models learn to negatively correlate the pointer digit with values that were held out during training. In the left pane of Figure 8, because certain digits never appear in the top right position during training, and the pointer indicates the top right position, the neural network automatically assigns those labels a very low logit value, even though at test time all of the top right values are drawn from exactly those digits. By contrast, in the IID setting (right column) no such adversarial correlation is observed. Additional examples are in the Appendix.
These results raise interesting open questions that can be explored on this benchmark. Are there architectural changes that can enforce better priors and withstand distribution shift? Can novel learning objectives prevent these adversarial correlations? Progress on these questions holds promise for greater robustness.
The results also highlight a question on natural mechanisms of neural network generalization. On the one hand, the neural network did not learn to generalize on this task with the abstractions we might have expected, and new methods that better align its generalization to these concepts could be very useful. However, its systematic failure arose from spotting a pattern relating pointers to labels that was actually present in all of the training data. This suggests looking at settings of increased functional complexity, where the additional steps in mapping value to label remove such patterns.
Motivated by the previous results, we turn to studying performance and generalization on PVR tasks of increasing functional complexity. Are there limits to the functional complexities neural networks can generalize to? And how does dataset size influence this? Are there inherent difficulties in different aggregation functions, and can we formalize this? We investigate these questions below.
Our empirical investigation centers on PVR tasks with vectorized inputs (Section 3.2), where functional complexity is varied through increasing window (neighborhood) size, with the smallest neighborhood size corresponding to the original setting of the pointer indicating a value that directly becomes the output. We start with an MLP architecture, where the terms in the vector are first embedded with a learned embedding matrix (Figure 3), and use the mod_sum aggregation function because it is the most sensitive to the neighborhood values (a single change of any value changes the output).
Figure 9 illustrates the results, with some striking takeaways. Firstly, we note that the models fit the training set at every complexity (neighborhood size) and dataset size. This contrasts with test accuracy, which starts off at random for smaller training sets and shows a thresholding behavior, jumping sharply once the training set is large enough. (Neighborhood 3 is an exception: even with the most training points shown, test accuracy remains at random.) Together, this reveals that at small dataset sizes, these models completely overfit, and are not learning the underlying reasoning rules at training time. With enough data, however, we see their test accuracy jump up. In Section 7, we dive further into this, exploring whether test accuracy translates to learning the right reasoning rules.
We apply noise sensitivity, used to determine the complexity of boolean functions [21, Proposition 3.3], to quantify the complexity of our tasks. Intuitively, the noise sensitivity $\mathrm{NS}_\delta(f)$ of a function $f$ measures how sensitive the output of $f$ is to random perturbations that flip each input bit independently with probability $\delta$. Figure 10a plots $\mathrm{NS}_\delta(f)$ over a range of $\delta$ values, and the results are consistent with our intuition that target functions with a higher number of neighbors are more complex. In fact, the choice of the mod_sum label aggregation is also important. As shown in Figure 10b, other common aggregations such as majority voting do not generate a clean sequence of tasks with increasing difficulty. For min and max aggregation, the tasks actually become easier as neighborhood size increases. Additional details are in the Appendix.
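For small boolean functions, noise sensitivity can be computed exactly by enumeration, which makes the intuition concrete. The sketch below is not the paper's measurement code; `parity` and `first_bit` are stand-in functions chosen for illustration, with parity playing a role loosely analogous to mod_sum (every input bit affects the output) and `first_bit` analogous to reading a single value.

```python
from itertools import product


def noise_sensitivity(f, n, delta):
    """Exact NS_delta(f) for f on n bits: Pr[f(x) != f(y)], where x is uniform and
    y flips each bit of x independently with probability delta."""
    total = 0.0
    for x in product((0, 1), repeat=n):
        for flips in product((0, 1), repeat=n):
            k = sum(flips)
            prob = (delta ** k) * ((1 - delta) ** (n - k)) / (2 ** n)
            y = tuple(xi ^ fi for xi, fi in zip(x, flips))
            if f(x) != f(y):
                total += prob
    return total


def parity(x):
    return sum(x) % 2


def first_bit(x):
    return x[0]


for delta in (0.05, 0.1, 0.2):
    print(delta, noise_sensitivity(first_bit, 4, delta), noise_sensitivity(parity, 4, delta))
```

For parity on $n$ bits the closed form is $(1 - (1 - 2\delta)^n)/2$, which grows with $n$, whereas a function of a single bit has noise sensitivity exactly $\delta$.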
In the previous section, we studied the empirical performance of MLP architectures on PVR tasks of different functional complexities. But the similarities between vectorized PVR and language tasks, both consisting of a sequence of tokens, suggest investigating alternate architectures, such as the Transformer and even the recently proposed MLP-Mixer. In this section, we examine performance variations across different architectures, as well as massive training dataset sizes.
In Figure 11 we show test performances of different architectures as training dataset size and particularly complexity are increased. We observe that MLP and MLP 2x have similar performance, while the Transformer and particularly MLP-Mixer demonstrate significantly better performance, solving tasks of higher complexity with much less data. These architectures also have a larger number of parameters than the MLP (Appendix Section A), so their better sample complexity cannot be attributed to a smaller model and instead points to better inductive biases. This suggests a future direction of comparing representations across architectures.
Motivated by the increases in performance from dataset size in Figure 9, and to test the limits of neural network learning, we look at training with massive dataset sizes. We show the results for MLP and MLP-Mixer in Figure 12. For MLP, we observe that with enough training points it is able to learn a neighborhood size that was at random accuracy in the smaller sweep of Figure 9, but at larger numbers of datapoints it struggles to fit the training set. MLP-Mixer, with its larger capacity and better inductive bias, shows continuing performance improvements as dataset size is increased, solving more and more complex tasks, although variance increases on the most complex tasks, with some seeds failing to train (suggesting open questions on learning dynamics). Figure 12 also reports the test accuracy MLP-Mixer reaches at each task complexity and training set size.
The results of Figures 9 and 12 illustrate that larger datasets help reach high test performance, with models often showing a sharp jump in test accuracy as the dataset size is increased. This raises a key question: is the high test accuracy truly indicative of learning reasoning, or, for a task of a given complexity, are the models simply memorizing the digits corresponding to all the different choices of (i) pointer, (ii) neighborhood digits and (iii) the associated label? To answer this, we devise a test by combining distribution shift and functional complexity, which provides surprising insights on the reasoning capability of neural networks.
Specifically, for a task of a given complexity, we fix a sequence of values and ensure it never occurs in the value window in the training set. Meanwhile, the test set is adversarially constructed to contain only that sequence in the value window, while the pointers and other digits outside the value window are random. We call this task holdout-1. Similarly, holdout-2, …, holdout-i, etc. are constructed to hold out not only that sequence, but also permutations of it. This is the natural extension to higher complexity of the Holdout Shift experiment described in Section 3.3. But as the value is not directly the output (as it is in the simplest setting), there is potential to avoid the systematic failure of Section 4.
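The holdout-i construction can be sketched as follows. This is an illustrative reading of the description above, not the authors' code: the base sequence, the order in which permutations are added, the rejection sampling for the training set, keeping the window inside the sequence, and the mod 10 sum label are all assumptions.

```python
import random
from itertools import permutations


def held_out_windows(base, i):
    """The first i distinct permutations of `base`, with `base` itself first."""
    distinct = list(dict.fromkeys(permutations(base)))
    return set(distinct[:i])


def sample_example(rng, size, held_out, split, num_values=10):
    """Sample (tokens, label); every sequence in `held_out` must have length `size`."""
    while True:
        # Assumption: the pointer is restricted so the window stays inside the values.
        pointer = rng.randrange(num_values - size + 1)
        values = [rng.randrange(10) for _ in range(num_values)]
        if split == "test":
            # Adversarial test set: the window always contains a held-out sequence.
            values[pointer:pointer + size] = rng.choice(sorted(held_out))
        window = tuple(values[pointer:pointer + size])
        if split == "test" or window not in held_out:
            return [pointer] + values, sum(window) % 10


rng = random.Random(0)
held_out = held_out_windows((1, 2, 3), 2)  # a hypothetical holdout-2 task
print(sample_example(rng, 3, held_out, "train"))
print(sample_example(rng, 3, held_out, "test"))
```

Since every permutation of the base sequence has the same sum, all adversarial test examples in this sketch share one label, while the pointer and surrounding digits still vary.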
The results of holding out an increasing number of permutations and evaluating on the adversarial test set are shown in Figure 13. We observe a remarkable, if partial, success: there are some random seeds where the neural network fails to train (not included in the plot). But in seeds where it does learn, it generalizes even when all permutations are held out, which is strong evidence that it is truly learning to reason. Some examples of failed runs are shown in Figure 16: we consistently observe a difference between slow and fast learning. When training converges slowly in a random seed, the test accuracy is poor, while fast training convergence results in strong test accuracy. Together, these results suggest rich open directions on understanding task training dynamics and the mechanisms of learning.
In this paper, we propose a novel benchmark, Pointer Value Retrieval, to test the limits of neural network generalization. Although PVR consists of a diverse family of tasks, with varying inputs and difficulties, they are unified by a simple rule where part of the input acts as a pointer indicating another part of the input which forms the value. This combination of positions, values and the simple rule allows us to develop nuanced tests of generalization, and our detailed empirical study demonstrates subtle systematic errors on distribution shifts, performance variations when increasing functional complexity and the effect of training set size. Our benchmark also reveals inductive biases across different model architectures that enable learning in high complexity regimes, albeit at the cost of stability. This latter point also relates to our final, striking observation, that when training can be made to succeed, neural networks are capable of learning some amount of reasoning, working on truly unseen instances. These results raise many open questions to explore on this benchmark, from variations of representations across architectures, to properties of the dataset, to the dynamics of the learning process, all of which offer promising future directions for testing and understanding generalization.
International Conference on Machine Learning, pp. 511–520.
Train faster, generalize better: stability of stochastic gradient descent. In International Conference on Machine Learning, pp. 1225–1234.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2901–2910.
Zero-shot task generalization with multi-task deep reinforcement learning. In International Conference on Machine Learning, pp. 2661–2670.
Association between surgical skin markings in dermoscopic images and diagnostic performance of a deep learning convolutional neural network for melanoma recognition. JAMA Dermatology 155 (10), pp. 1135–1141.
PVR Datasets: In the main text, we perform experiments on:
visual block style MNIST PVR IID dataset: For this dataset, which is used in Section 4, we sample each position of the training set iid from the MNIST training examples. For the test set, we similarly sample each position iid from the MNIST test set. To introduce some more variation, each block position is larger than the 28x28 MNIST digit, and we randomly jitter the digit within the block.
visual block style MNIST PVR Holdout Shift dataset: Here, for the training set, we again sample from the MNIST training set, but hold out some digits at some positions. For each of the top right, bottom left and bottom right positions, a position-specific set of digits is excluded, and we sample iid from the remaining MNIST training digits. The pointer is sampled iid from all MNIST training digits. For the test set we sample from the MNIST test digits, but only from the held-out digits, i.e. for the top right corner we sample iid only from the MNIST test digits that were excluded from that position during training, and so on. We also jitter each digit as above.
vectorized PVR dataset: For the vectorized PVR dataset, we simply sample each position in the sequence iid from the ten digits.
For the visual PVR tasks, specifically the MNIST visual PVR task, we use the standard ResNet18 and VGG11bn, from the torchvision models library. We train for epochs with batch size . We use the Adam optimizer with learning rate .
For the vectorized PVR tasks, we use the following network architectures. Table 1 summarizes the number of parameters in each model we use.
MLP: We use an embedding layer with vocabulary size 10 and embedding dimension 64 to map each input token to a vector representation. The concatenated representations from all input tokens are then passed through 4 fully connected layers with output dimensions 512, 1024, 512, and 64, respectively. We apply a ReLU activation function after each fully connected layer. Finally, a linear layer with output dimension 10 is attached as the classifier.
MLP 2x: The same as MLP, except that the output dimensions of the 4 fully connected layers are doubled: 1024, 2048, 1024 and 128.
Transformer: We use the encoder part of the standard Transformer, which consists of multiple transformer layers, where each transformer layer consists of a multi-head self-attention block and an MLP block. In particular, we use embedding dimension 512 and 4 transformer layers, with 4 heads for each self-attention block and hidden dimension 1024 for each MLP block. Following Dosovitskiy et al., we prepend a virtual class token with a learnable vector representation, and in the final encoder output, attach a linear classifier to the representation of that token for classification.
MLP-Mixer: MLP-Mixer is a new architecture recently proposed by Tolstikhin et al. It is similar to Transformers, except that the multi-headed self-attention layers are replaced with fully connected layers. We use embedding dimension 512 and 4 mixer layers, with token-MLP dimension 768 and channel-MLP dimension 2048. Similar to the Transformer, we use a class token for the purpose of classification.
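As a rough check on how the two MLP variants compare in size, the parameter count implied by the description above can be tallied by hand. The sketch assumes an input of 11 tokens (one pointer plus ten values) and bias terms on every linear layer; neither is stated explicitly, so Table 1 remains the authoritative source for the actual counts.

```python
def mlp_param_count(hidden, seq_len=11, vocab=10, embed_dim=64, num_classes=10):
    """Parameters of embedding + fully connected stack + classifier (biases assumed)."""
    count = vocab * embed_dim        # embedding table
    fan_in = seq_len * embed_dim     # concatenated token embeddings
    for width in list(hidden) + [num_classes]:
        count += fan_in * width + width  # weight matrix plus bias vector
        fan_in = width
    return count


mlp = mlp_param_count((512, 1024, 512, 64))
mlp_2x = mlp_param_count((1024, 2048, 1024, 128))
print("MLP:", mlp, "MLP 2x:", mlp_2x, "ratio:", round(mlp_2x / mlp, 2))
```

Doubling every hidden width roughly quadruples the hidden-to-hidden weight matrices, so under these assumptions the 2x variant is considerably more than twice the size of the base MLP.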
We did a small scale hyperparameter sweep over different optimizers (SGD, Adam, Adagrad, LAMB), learning rates and batch sizes (128, 256, 512, 1024, 2048, 4096). In the end we chose the following hyperparameters by balancing performance and stability across different setups: we use weight decay, an SGD optimizer with momentum 0.9, and cosine learning rate scheduling with base learning rate 0.05 and a linear warmup period of 10 epochs. The batch size is 1024, and we train for 200 epochs. For studies with tiny training sets (e.g. 64), we train for at least 800 iterations. The training was done with NVIDIA P100 / V100 GPUs.
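The learning rate schedule described above can be sketched per epoch. The base rate, warmup length and epoch count come from the text; the exact shape (warmup rising linearly to the base rate, cosine decaying toward zero, updates per epoch rather than per step) is an assumption, since the text does not specify it.

```python
import math


def learning_rate(epoch, base_lr=0.05, warmup_epochs=10, total_epochs=200):
    """Linear warmup followed by cosine decay (schedule shape assumed)."""
    if epoch < warmup_epochs:
        # Ramp up linearly, reaching base_lr on the last warmup epoch.
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


for epoch in (0, 5, 9, 10, 100, 199):
    print(epoch, round(learning_rate(epoch), 6))
```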
We applied noise sensitivity of boolean functions in the main text to characterize the complexity of our vectorized PVR tasks with different neighborhood sizes and aggregation functions. Specifically, for a function $f$ with $n$ boolean (binary bit) inputs and a noise rate $\delta \in [0, 1]$, the noise sensitivity $\mathrm{NS}_\delta(f)$ measures the probability that $f(x) \neq f(y)$, where $x$ consists of $n$ uniformly random bits, and $y$ is formed from $x$ by flipping each bit independently with probability $\delta$. It is widely used to measure stability and is closely related to various complexity measures of boolean functions. For example, for $\delta \in (0, 1/2]$, the Fourier spectrum of a binary output boolean function $f$ can be shown to be $\epsilon$-concentrated on degree up to $1/\delta$, with $\epsilon$ at most a constant multiple of $\mathrm{NS}_\delta(f)$ [21, Proposition 3.3].
We extend it to measure the complexity of our target functions by representing the inputs of our vectorized PVR task as bit vectors. In particular, each digit in our input sequence is represented as $k$ bits by the standard binary representation of unsigned integers, where $k$ is the smallest integer such that $2^k \geq 10$. When $2^k > 10$, a subset of uniformly random bit sequences will fall outside the valid range of digits. We simply extend the definition of the target functions to take digits with arbitrary values, but convert them into the “valid range” via a modulo operation as a preprocessing step.
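The bit-vector extension can be sketched as a Monte Carlo estimate. This is an illustration under stated assumptions rather than the measurement code behind Figure 10: it assumes 4 bits per digit, mod 10 to fold out-of-range values back into the digits, a sequence of one pointer plus ten values, and a wrap-around window with mod_sum aggregation.

```python
import random

BITS = 4  # smallest k with 2**k >= 10


def decode(bits):
    """Turn a flat bit list into digits, folding out-of-range values with mod 10 (assumed)."""
    digits = []
    for i in range(0, len(bits), BITS):
        value = 0
        for b in bits[i:i + BITS]:
            value = 2 * value + b
        digits.append(value % 10)
    return digits


def pvr_mod_sum(digits, size):
    """mod_sum label; the window starts at the pointed value and wraps (assumed)."""
    pointer, values = digits[0], digits[1:]
    return sum(values[(pointer + j) % len(values)] for j in range(size)) % 10


def estimate_ns(size, delta, trials, rng, seq_len=11):
    """Monte Carlo estimate of Pr[f(x) != f(y)] for the bit-encoded PVR target f."""
    n = seq_len * BITS
    disagreements = 0
    for _ in range(trials):
        x = [rng.randrange(2) for _ in range(n)]
        y = [b ^ (rng.random() < delta) for b in x]  # flip each bit w.p. delta
        if pvr_mod_sum(decode(x), size) != pvr_mod_sum(decode(y), size):
            disagreements += 1
    return disagreements / trials


rng = random.Random(0)
for size in (1, 2, 3, 4):
    print(size, estimate_ns(size, 0.05, 2000, rng))
```

Larger windows make the label depend on more input bits, so the estimate should rise with window size for mod_sum.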
Figure 15 illustrates test accuracy for different aggregation functions across varying PVR functional complexity and dataset sizes. The learning results are consistent with our measurements of task complexity using noise sensitivity: the tasks become noticeably more challenging to learn as the neighborhood size increases only when using the mod_sum aggregation function.
We present a few cases of training curves in Figure 16. Each color shows a pair of runs with identical hyper-parameters and task setups, but with different random seeds. We found that learning could become unstable when the task complexity increases. For example, the red pair of experiments shows that one of the training jobs suddenly collapses during training and fails to recover in the remaining epochs, while another run with a different random seed succeeds and generalizes perfectly. Similarly, the green pair shows one run that never reaches 20% training accuracy, and another run with perfect training and test accuracy.
A more interesting pattern we observe is “slow learning” vs “fast learning”: sometimes, as demonstrated by the blue pair and the orange pair, both runs succeed in fitting the training set perfectly, yet they have completely different generalization capability. We found that when a model converges rapidly on the training set, it usually generalizes well; on the other hand, when a model converges slowly and smoothly on the training set, it usually generalizes poorly. It seems that in the latter case, the network is just slowly memorizing the training examples without learning the actual concept. This is consistent with earlier theoretical work that quantifies generalization via training speed. However, it remains an open question which factor causes two networks under identical settings to behave differently.