Extracting and reasoning with abstract concepts is a crucial ability for any learner that is to operate in combinatorially complex open worlds or domains with limited or structured data. It is well known that neural learners struggle to operate in such conditions due to their poor generalization capabilities in structured domains Chollet (2019); Marcus (2018). In this work, we demonstrate that spectral regularization provides neural networks with a strong inductive bias towards learning and utilizing abstract concepts akin to a symbolic learner.
For that purpose, we employ the Abstraction and Reasoning Corpus (ARC) Chollet (2019), which contains tasks related to manipulating colored patterns in a grid. In order to solve the tasks in the corpus successfully, an agent needs to be able to count, manipulate numbers, and work with topological and geometric concepts, as well as recognise the notion of objects. There are 400 training tasks and 400 (distinct) evaluation tasks. Each task has a small set of input-output example pairs (between 1 and 5) and a query input pattern. This is quite a challenging dataset due to the small amount of example data, the large number of different tasks, and their abstract nature.
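Concretely, each ARC task is a small collection of example input/output grid pairs plus a query, and a solution is scored by exact pixel match. The sketch below illustrates the task structure and scoring; the particular grids and the vertical-flip rule are invented for illustration and are not taken from the corpus.

```python
# Minimal sketch of an ARC-style task: grids are lists of rows of
# color indices 0-9, grouped into "train" example pairs and a "test" query.
task = {
    "train": [
        {"input": [[1, 0], [0, 1]], "output": [[0, 1], [1, 0]]},
        {"input": [[2, 2], [0, 0]], "output": [[0, 0], [2, 2]]},
    ],
    "test": [{"input": [[3, 0], [0, 3]], "output": [[0, 3], [3, 0]]}],
}

def is_correct(prediction, target):
    """ARC scoring: a prediction counts only if every pixel matches."""
    return prediction == target

# The hidden rule of this toy task is a vertical flip of the rows.
query = task["test"][0]
flipped = query["input"][::-1]
print(is_correct(flipped, query["output"]))  # True
```

Note that a prediction that differs in a single pixel scores zero, which is part of what makes the benchmark demanding for statistical learners.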
So far, the best known solution, with a success rate of 20%, is the winner of the ARC Kaggle challenge: a carefully hand-crafted symbolic solver written in approximately 7k lines of C++ code. In this paper, we introduce the Neural Abstract Reasoner (NAR), which achieves an accuracy rate of 79%, outperforming even the best symbolic solver created by a human. The NAR architecture contains a Differentiable Neural Computer (DNC) that learns general problem-solving skills and a Transformer network responsible for solving particular task instances (see Fig. 2).
Importantly, spectral regularization plays a fundamental role in the successful training of NAR. From a purely machine learning perspective, spectral regularization is known to reduce the effective number of parameters in the network; however, we provide additional theoretical intuition and demonstrate that spectral regularization also pushes the network towards finding algorithmically simpler solutions, as recommended by Solomonoff's theory of inductive inference Li and Vitanyi (2008).
2 Related Work
Hybrid neuro-symbolic approaches enable agents to solve structured tasks from raw data, while learning faster and being more robust to noise Gaunt et al. (2017); Verma et al. (2018); Penkov and Ramamoorthy (2018); Mao et al. (2018). However, the majority of methods proposed so far are designed with specific domains in mind, making them inapplicable to a broader range of tasks. A notable exception is the architecture proposed by Ellis et al. (2020), which is capable of learning rules from geometry, vector algebra, and physics, and of solving tasks such as drawing pictures or building complete scenes. Importantly, these methods often require large amounts of data, which is in stark contrast with human capabilities.
The ARC dataset Chollet (2019) is specially designed to push research towards data-efficient learners, as there are hundreds of tasks, each of which is represented by no more than 5 input/output examples. To the best of our knowledge, the Neural Abstract Reasoner, presented in this paper, is the first architecture that achieves a performance rate of 79% on the ARC dataset, outperforming state-of-the-art hand-coded symbolic systems by a factor of 4. The NAR architecture is a composition of a slowly learning Differentiable Neural Computer and a fast-adapting Transformer network, creating outer learning and inner execution loops, as suggested in Chollet (2019).
Complexity and generalization
The analysis of complexity and generalization metrics applied to neural networks has formed a central line of theoretical ML research, with a variety of recent breakthroughs in terms of PAC-based and compression methods (cf. Arora et al. (2018); Bartlett et al. (2017); Jiang et al. (2020); Suzuki et al. (2020); Wei and Ma (2019) and the references therein). In particular, many generalization approaches based on spectral norm analysis have so far been proposed and investigated Neyshabur et al. (2017); Sanyal et al. (2020). However, to our knowledge the present work is the first to address the relationship between the behaviour of spectral norms and abstract reasoning tasks, demonstrating a strong relationship between classical spectral regularization and the ability of a neural model to learn abstract concepts and rules. As touched upon in Section 4, one could draw motivation from well-known algorithmic information theory concepts such as Solomonoff inference and program generation based on least complexity Li and Vitanyi (2008); Blier and Ollivier (2018); Schmidhuber (1997).
The ARC dataset consists of a training portion and an evaluation portion. Each portion consists of 400 tasks (the training tasks are augmented to 15,000 through color permutations and rotations). The individual tasks are grouped into tags based on the skills needed to solve them Bonin. A task consists of up to five example input-output pairs and one query pair, and a neural learner has to infer the underlying logical rule. All inputs and outputs are grids of variable sizes with 10 colors. A solution is correct only when all the pixels on the grid match.
First, we derive a latent representation of the grids with an InceptionNet-style Szegedy et al. (2015) deterministic auto-encoder. Let $x$ be a grid, and let $E_\theta$ and $D_\theta$ be the encoder and decoder, parametrized by $\theta$. We train the embedding network by minimizing the standard autoencoder cross-entropy loss between $x$ and $D_\theta(E_\theta(x))$.
Next, we consider all latent grid embeddings and train a Differentiable Neural Computer Graves et al. (2016) to infer an instruction set from the example pairs. All inputs are then processed by a Transformer decoder stack Vaswani et al. (2017), which self-attends to all inputs and cross-attends to the inferred instruction set.
The whole model is then trained end-to-end via ADAM Kingma and Ba (2014) to minimize the cross-entropy loss between the query target and the decoded test output prediction.
We employ a two-stage curriculum during training: first training only on a single tag, and then expanding to the whole training dataset. Additionally, during the first stage of training, spectral regularization Yoshida and Miyato (2017) with a larger coefficient is applied, which is then annealed in the second stage. When evaluating the model, we apply additional optimization steps (similar to Finn et al. (2017); Krause et al. (2018)), as described in Algorithm 1. See Appendix C for additional details.
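To make the regularizer concrete, the following is a minimal pure-Python sketch of spectral norm regularization in the spirit of Yoshida and Miyato (2017): each layer's largest singular value is estimated by power iteration and its square is added to the loss. The function names and the way the penalty is assembled are illustrative, not the exact implementation used in NAR; the coefficient values are taken from the hyperparameter table in Appendix E.

```python
import math
import random

random.seed(0)

def spectral_norm(mat, iters=50):
    """Estimate the largest singular value of `mat` by power iteration."""
    rows, cols = len(mat), len(mat[0])
    v = [random.random() + 0.1 for _ in range(cols)]
    for _ in range(iters):
        # One step of power iteration on A^T A.
        u = [sum(mat[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        v = [sum(mat[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        v = [x / norm for x in v]
    u = [sum(mat[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    return math.sqrt(sum(x * x for x in u))

def spectral_penalty(weight_matrices, coeff):
    """Penalty added to the loss: (coeff / 2) * sum of squared spectral norms."""
    return 0.5 * coeff * sum(spectral_norm(w) ** 2 for w in weight_matrices)

# Annealing between the two curriculum stages divides the coefficient
# by the annealing factor (values from Appendix E).
stage1_coeff = 6e-4
stage2_coeff = stage1_coeff / 10
```

In a training loop this penalty would simply be added to the task loss before back-propagation.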
We build on methods from Santoro et al. (2016) and use a memory-augmented neural network (the Differentiable Neural Computer Graves et al. (2016)) to derive context for the current task. The multiple read heads and attention mechanisms allow the DNC to relate the input and the output of a pair and compare them to the other input/output pairs that it has already processed. Unlike Santoro et al. (2016), we leverage a Transformer to carry out the task execution based on the DNC context. This decouples the learning of the instruction set from the program execution itself, and allows us to use the input/output relations directly, rather than the more standard sequential input/target encoding of Santoro et al. (2016). Lastly, the Transformer network relates the inputs to each other, thereby exploiting similarities among them.
As this work is still in progress, these are preliminary results evaluated on grids up to a limited size. Nonetheless, we outperform all currently known solutions, including the hand-crafted symbolic Kaggle winner (see Fig. 1). Spectral regularization proved instrumental for this result, while other regularization methods did not yield any significant improvements (see Fig. 2).
Without any additional adaptation steps Finn et al. (2017), evaluation performance remains low at 1%, while after only 3 steps that number climbs to 78.8%. Analyzing more closely the changes made by the adaptation steps, the gradient norm of the DNC parameters is negligible compared to that of the Transformer, which implies that the DNC is acting as a true meta-learner, while only the Transformer requires a small change to execute the instructions flawlessly. We again attribute this generalization to spectral regularization.
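The mechanics of the adaptation can be illustrated with a deliberately tiny stand-in model. The real Algorithm 1 adapts the Transformer weights, but the pattern is the same: a handful of gradient steps on the task's example pairs before answering the query. Everything below (the one-parameter linear model, the learning rate, the step count) is illustrative, not NAR's actual procedure.

```python
def adapt(theta, pairs, steps=3, lr=0.1):
    """A few evaluation-time gradient steps on the example pairs
    (cf. MAML-style adaptation and dynamic evaluation)."""
    def loss(t):
        return sum((t * x - y) ** 2 for x, y in pairs) / len(pairs)

    for _ in range(steps):
        # Exact gradient of the mean squared error for y_hat = theta * x.
        grad = sum(2 * (theta * x - y) * x for x, y in pairs) / len(pairs)
        theta -= lr * grad
    return theta, loss(theta)

# Hidden rule of this toy task: y = 2x.
pairs = [(1.0, 2.0), (2.0, 4.0)]
theta, final_loss = adapt(0.0, pairs)
print(theta, final_loss)  # theta moves most of the way to 2.0 in 3 steps
```

The point of the illustration is that a small number of steps suffices when the initialization already encodes the right inductive bias.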
4 Effects of spectral regularization: stable ranks and complexity
The surprisingly significant effect of a simple spectral regularization strategy on the reasoning tasks suggests strong connections with generalization and the underlying model complexity estimates. On one hand, this motivates the analysis of spectral regularization in terms of some well-known generalization bounds (e.g. based on stable ranks and spectral norms); however, we first discuss a perspective inspired by algorithmic information theory. Intuitively, abstract reasoning tasks are induced by a concise set of logic rules and combinatorial patterns, and hence it is natural to search for short programs producing these rules - in this regard, we give motivation as to why spectral regularization naturally shrinks the search space towards shorter programs.
Spectral regularization and polynomials as algorithmically simple programs. Classical methods from program inference and algorithmic information theory, such as Solomonoff inference and Occam’s razor Li and Vitanyi (2008), suggest that "simpler" program models are preferable in terms of forming abstract concepts and generalization - a formal approach towards such issues is given, e.g., by Kolmogorov complexity theory Li and Vitanyi (2008); Schmidhuber (1997). Although the evaluation of algorithmic complexity is a demanding task (Kolmogorov complexity is theoretically uncomputable), one could attempt to devise various proxy metrics that capture the algorithmic complexity of a given function/program.
Here, in an attempt to evaluate and explain the algorithmic complexity of our models from a spectral-regularization perspective, we consider approximations in terms of a simple but flexible class of programs computing rational-coefficient polynomials of maximal degree $n$. Intuitively, approximating a model $f$ by such polynomials for lower values of $n$ corresponds to discovering programs of decreased algorithmic length (in terms of operations) that effectively compute $f$. In this direction, we bring forward some classical approximation theory results implying that lower spectral norms yield lower degrees of the approximating polynomial. To ease notation, here we discuss the 1-dimensional case; however, similar results hold in higher dimensions as well Trefethen (2013):
Let $f$ represent a model with Lipschitz constant $L$. Then, for every degree $n$ there exists a polynomial $p_n$ of degree $n$ so that $\|f - p_n\|_\infty \le \frac{3}{2} L n^{-1/2}$, where $\|\cdot\|_\infty$ denotes the usual sup-norm over the interval $[0,1]$.
Since the Lipschitz constant of a neural model is bounded above by the product of the spectral norms of its layers, spectral regularization gives control over $L$; moreover, a lower value of $L$ implies that one can decrease the polynomial degree $n$ and retain a similar approximation quality. These observations support our empirical results - introducing spectral regularization steers the model search space towards algorithmically simpler and more robust functions.
Generalization via spectral norms and stable ranks. We recall that the stable rank of a matrix $A$, denoted $\mathrm{srank}(A)$, is defined as the ratio $\|A\|_F^2 / \|A\|_2^2$, and note that $\mathrm{srank}(A)$ is at most the rank of $A$. The stable rank is intuitively understood as a continuous proxy to the rank and as a measure of the true parameter count of $A$. Now, let $f$ be a deep neural model consisting of $d$ layers whose corresponding weight matrices are denoted by $W_1, \dots, W_d$. Recent works (e.g. Neyshabur et al. (2017); Arora et al. (2018)) obtain generalization bounds on $f$, roughly speaking, in terms of the expression $\prod_{i=1}^{d} \|W_i\|_2^2 \sum_{i=1}^{d} \mathrm{srank}(W_i)$, where $\|W_i\|_2$ denotes the spectral norm of the matrix $W_i$. A related stronger compression-based estimate in terms of so-called noise-cushions is obtained in Arora et al. (2018). In other words, the generalization error is influenced by the spectral norms as well as the stable ranks of the layers.
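The stable rank is cheap to compute from its definition; the following pure-Python sketch (using power iteration on $A^\top A$ for the spectral norm) is illustrative rather than the exact routine used in our experiments.

```python
import math

def top_singular_value_sq(mat, iters=100):
    """Largest squared singular value via power iteration on A^T A."""
    rows, cols = len(mat), len(mat[0])
    v = [1.0] * cols
    for _ in range(iters):
        u = [sum(mat[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        v = [sum(mat[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        v = [x / norm for x in v]
    u = [sum(mat[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    return sum(x * x for x in u)

def stable_rank(mat):
    """srank(A) = ||A||_F^2 / ||A||_2^2 -- a continuous proxy for rank(A)."""
    fro2 = sum(x * x for row in mat for x in row)
    return fro2 / top_singular_value_sq(mat)

print(stable_rank([[1.0, 0.0], [0.0, 1.0]]))   # 2.0: both directions matter
print(stable_rank([[1.0, 0.0], [0.0, 0.01]]))  # ~1.0: effectively rank one
```

As the second example shows, a matrix can have full rank yet a stable rank near one, which is exactly why the stable rank serves as a softer parameter count.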
In this direction, we evaluated our model and the effect of stable ranks. Interpreting Fig. 3, one observes that initially, while the model adapts to the single task of pattern_expansion, the stable ranks increase and stabilize, settling on a true parameter count; afterwards, the model is introduced to the full task bundle, where a significant decrease of the stable ranks is observed - according to the expression above, this leads to better generalization, and further implies that at the end of training one actually deals with simpler models with better compression properties.
5 Conclusion and Acknowledgements
We have demonstrated that spectral regularization provides neural learners with a significant boost in performance on abstract reasoning tasks. We believe that studying the complexity of the underlying models in the context of powerful frameworks such as Kolmogorov complexity or Solomonoff's theory of inductive inference is a promising step towards closing the neuro-symbolic gap. We would like to thank Dimitar Vasilev (Microsoft Inc.) for the computational resources used in this work.
References

- Arora et al. (2018). Stronger generalization bounds for deep nets via a compression approach. In 35th International Conference on Machine Learning, ICML 2018, Vol. 1, pp. 390–418.
- Bartlett et al. (2017). Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems.
- Blier and Ollivier (2018). The description length of deep learning models. In Advances in Neural Information Processing Systems 31, pp. 2216–2226.
- Bonin. Task tagging. Kaggle public notebook. Accessed on 08.10.2020.
- Chollet (2019). On the measure of intelligence. arXiv preprint arXiv:1911.01547.
- Ellis et al. (2020). DreamCoder: growing generalizable, interpretable knowledge with wake-sleep Bayesian program learning. arXiv preprint arXiv:2006.08381.
- Finn et al. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400.
- Gaunt et al. (2017). Differentiable programs with neural libraries. In Proceedings of the 34th International Conference on Machine Learning, Vol. 70, pp. 1213–1222.
- Graves et al. (2016). Hybrid computing using a neural network with dynamic external memory. Nature 538 (7626), pp. 471–476.
- Jiang et al. (2020). Fantastic generalization measures and where to find them. In International Conference on Learning Representations.
- Kingma and Ba (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Krause et al. (2018). Dynamic evaluation of neural sequence models. In International Conference on Machine Learning, pp. 2766–2775.
- Li and Vitanyi (2008). An introduction to Kolmogorov complexity and its applications. 3rd edition, Springer Publishing Company, Incorporated.
- Linhares (2000). A glimpse at the metaphysics of Bongard problems. Artificial Intelligence 121 (1-2), pp. 251–270.
- Mao et al. (2018). The neuro-symbolic concept learner: interpreting scenes, words, and sentences from natural supervision. In International Conference on Learning Representations.
- Marcus (2018). Deep learning: a critical appraisal. arXiv preprint arXiv:1801.00631.
- Neyshabur et al. (2017). Exploring generalization in deep learning. In Advances in Neural Information Processing Systems.
- Penkov and Ramamoorthy (2018). Learning programmatically structured representations with perceptor gradients. In International Conference on Learning Representations.
- (2006). Polynomial approximation theory. In Spectral Methods: Fundamentals in Single Domains.
- Santoro et al. (2016). Meta-learning with memory-augmented neural networks. In International Conference on Machine Learning, pp. 1842–1850.
- Sanyal et al. (2020). Stable rank normalization for improved generalization in neural networks and GANs. In International Conference on Learning Representations.
- Schmidhuber (1997). Discovering neural nets with low Kolmogorov complexity and high generalization capability. Neural Networks 10 (5), pp. 857–873.
- Suzuki et al. (2020). Compression based bound for non-compressed network: unified generalization error analysis of large compressible deep neural network. In International Conference on Learning Representations.
- Szegedy et al. (2015). Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9.
- Trefethen. Multivariate polynomial approximation in the hypercube. Proceedings of the American Mathematical Society 145, pp. 4837–4844.
- Trefethen (2013). Approximation theory and approximation practice. SIAM.
- Vaswani et al. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
- Verma et al. (2018). Programmatically interpretable reinforcement learning. In International Conference on Machine Learning, pp. 5045–5054.
- Wei and Ma (2019). Data-dependent sample complexity of deep neural networks via Lipschitz augmentation. In Advances in Neural Information Processing Systems 32, pp. 9725–9736.
- Yoshida and Miyato (2017). Spectral norm regularization for improving the generalizability of deep learning. arXiv preprint arXiv:1705.10941.
Appendix A Abstraction and Reasoning Corpus
The Abstraction and Reasoning Corpus (ARC) [Chollet, 2019] is a dataset of grid-based pattern recognition and pattern manipulation tasks. A decision-making agent sees a small number of examples of input and output grids that illustrate the underlying logical relationship between them. It then has to infer this logical rule and apply it correctly on a test query.
In many ways, the benchmark is similar to the Bongard problems (see [Linhares, 2000]) – relations are highly abstract and geometric. Moreover, only 3-5 examples are presented for each task; therefore, the benchmark tests the ability of a decision-making agent to (i) grasp abstract logic and (ii) adapt quickly to new tasks.
There are 400 training and 400 evaluation task examples, structured as follows:
each task consists of a train and a test set;
the train set includes 3-5 input/output pairs;
the test set includes 1 input/output pair;
an input/output pair is comprised of an input grid and an output grid, the relation between which follows a consistent logic throughout the task;
grids are rectangular and are divided into squares;
grid patterns are drawn with 10 colors;
grid sizes vary between 1 and 30 in length and width; input and output grid sizes are not necessarily equal.
No single set of rules exists that can solve all tasks, and while some skills are useful across several tasks, each task follows its own unique principle. This makes trivial approaches such as brute-force computation impractical.
A human approaching these tasks can easily spot the logical relations – we have developed the necessary priors to find similarities and infer logic. Ideally, therefore, the neural network would derive such priors during training, which would allow it to generalize well to the evaluation dataset.
For the current scope of our research, we use all tasks whose grids do not exceed a fixed maximal size. For the train set, we augment the tasks to 15,000 by permuting colors and by exploiting the fact that the tasks are invariant under rotations and symmetries.
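The augmentations above act directly on the list-of-lists grid representation; a minimal sketch follows, where the particular color permutation is an arbitrary illustrative choice.

```python
def permute_colors(grid, perm):
    """Apply a color permutation (a mapping over the 10 color indices)."""
    return [[perm[c] for c in row] for row in grid]

def rotate90(grid):
    """Rotate a grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

grid = [[1, 2], [0, 1]]
perm = {c: (c + 1) % 10 for c in range(10)}  # one of many valid permutations
print(permute_colors(grid, perm))  # [[2, 3], [1, 2]]
print(rotate90(grid))              # [[0, 1], [1, 2]]
```

Because the underlying rule of a task is unchanged by any bijective recoloring or rotation, each augmented copy is a genuinely valid new training task.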
Appendix B Grid Embedding
Prior to embedding, we zero-pad all grids to a fixed size, with the original grid in the center of the image. Additionally, colors in the grids are represented as one-hot vectors over the 10 colors, which adds a channel dimension of 10 to the final grid representation.
The embedding is done with a convolutional neural network comprised of an encoder and a decoder. The encoder consists of a basket of 10 convolutions with different filter sizes (an Inception-style module, Fig. 5). The different filter sizes enable the network to capture both local and global patterns. The convolution outputs are flattened and passed to linear layers with hyperbolic tangent activation functions, which transform them into the desired latent dimensionality. A second neural network computes weights for summing the 10 resulting vectors. The decoder shares the same architecture with the encoder, but in reverse order – first linear layers, then convolutions and a weighted sum, and finally a softmax over the color dimension.
Summing the convolutional outputs makes the embedding network agnostic to the order in which it receives them (unlike an RNN, for instance). The weights reflect the fact that grid sizes vary, and therefore not all filter sizes are equally applicable or useful.
While training the embedding network, we found that spectral regularization was again instrumental, enabling 95% perfect-reconstruction accuracy. In contrast, networks regularized by weight decay failed to climb above 6%.
What is more, tanh activations yielded a network that is 3 orders of magnitude more stable to Gaussian input noise than the same network, trained under the same conditions, with ReLU activations. We postulate that this is because ReLU is unbounded for positive inputs, so random perturbations have a greater impact.
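This intuition is easy to check numerically at the level of a single unit: a saturating tanh bounds how much a perturbation can move the output, while a ReLU passes positive-side noise straight through. The operating point and noise scales below are chosen purely for illustration.

```python
import math
import random

random.seed(0)

def perturbation_gain(act, x, noise_scale=1.0, trials=2000):
    """Average change in a unit's output when Gaussian noise hits its input."""
    total = 0.0
    for _ in range(trials):
        eps = random.gauss(0.0, noise_scale)
        total += abs(act(x + eps) - act(x))
    return total / trials

relu = lambda v: max(0.0, v)
tanh = math.tanh

# At a positive operating point, ReLU's output change grows with the noise,
# while tanh's output change can never exceed 2 (its range is (-1, 1)).
print(perturbation_gain(relu, 2.0), perturbation_gain(tanh, 2.0))
```

A full comparison of the two trained networks involves many such units composed in depth, but the single-unit picture already shows the direction of the effect.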
Appendix C Architecture Details
Appendix D Additional Results
Appendix E Hyperparameters
| Hyperparameter | Value | Description |
| --- | --- | --- |
| learning rate | 0.001 | Learning rate of the models |
| regularization coefficient | 6e-4 | Coefficient in front of the spectral regularization penalty in the loss |
| annealing factor | 10 | Factor by which the regularization coefficient is divided to weaken regularization |

Embedding network hyperparameters

| Hyperparameter | Value | Description |
| --- | --- | --- |
| latent dimension | 256 | The dimension of the embedded grids |
| kernel sizes | — | Sizes of the kernels used in the convolutional modules |
| # of kernels | 128 | The number of convolutional kernels |

DNC hyperparameters (temporal linkage is disabled)

| Hyperparameter | Value | Description |
| --- | --- | --- |
| # of read heads | 6 | Number of attention heads in the reading mechanism |
| # of LSTM layers | 3 | Layers of the LSTM controller |
| LSTM hidden size | 512 | Hidden size of the LSTM |
| memory dimensions | 64 × 32 | Word length of 64, 32 memory locations |

Transformer hyperparameters

| Hyperparameter | Value | Description |
| --- | --- | --- |
| # of decoders | 4 | Number of networks in the decoder stack |
| # of attention heads | 16 | Number of heads in the multi-head attention mechanism |
| Transformer hidden size | 256 | Internal hidden size of the linear layers in the Transformer |
Appendix F Approximation via polynomial programs
In this section we briefly elaborate on the polynomial approximation mentioned in the main text. In general, classical results in this direction are based, e.g., on Chebyshev, Legendre, and Bernstein polynomial approximations. Here we discuss a 1-dimensional approximation via Bernstein polynomials, but similar estimates hold in higher dimensions, as well as for other polynomial schemes (e.g. Chebyshev). We refer to the polynomial approximation chapter of Spectral Methods: Fundamentals in Single Domains (2006) and to Trefethen (2013) for a thorough collection of results as well as further references.
A well-known method of approximating one-dimensional continuous functions is by means of Bernstein polynomials. Let $f: [0,1] \to \mathbb{R}$ be continuous. The Bernstein polynomial of order $n$ corresponding to $f$ is defined as:

$$B_n(f)(x) = \sum_{k=0}^{n} f\!\left(\frac{k}{n}\right) \binom{n}{k} x^k (1-x)^{n-k}.$$

The following estimate holds:

$$\|f - B_n(f)\|_\infty \le \frac{3}{2}\, \omega_f\!\left(n^{-1/2}\right),$$

where $\omega_f$ denotes the modulus of continuity defined as

$$\omega_f(\delta) = \sup_{|x - y| \le \delta} |f(x) - f(y)|.$$

In other words, $\omega_f(\delta)$ measures the maximal jump of $f$ over points which are no more than a distance $\delta$ apart.
Note that whenever $f$ has a Lipschitz constant $L$, the modulus of continuity is controlled from above via $\omega_f(\delta) \le L\delta$. Applying this bound in the estimate of Theorem 1 yields the result mentioned in the main text. As already mentioned, more technical high-dimensional analogues of the estimates are also available Trefethen (2013) - e.g. an analogous result for Chebyshev polynomials holds, where the modulus of continuity is appropriately replaced by extrema over complex ellipsoids Trefethen.
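The estimates above are straightforward to verify numerically. The short sketch below evaluates $B_n(f)$ for $f(x) = |x - 1/2|$ (Lipschitz on $[0,1]$ with $L = 1$) and compares the empirical sup-error against the bound $\frac{3}{2} L n^{-1/2}$; the grid of test points and the chosen degrees are illustrative.

```python
import math

def bernstein(f, n, x):
    """Evaluate the degree-n Bernstein polynomial B_n(f) at x in [0, 1]."""
    return sum(f(k / n) * math.comb(n, k) * x**k * (1 - x) ** (n - k)
               for k in range(n + 1))

f = lambda x: abs(x - 0.5)  # Lipschitz on [0, 1] with constant L = 1

errors = {}
for n in (4, 16, 64):
    # Empirical sup-norm error over a fine grid of sample points.
    errors[n] = max(abs(bernstein(f, n, i / 200) - f(i / 200))
                    for i in range(201))
    # The theorem guarantees errors[n] <= (3/2) * L / sqrt(n).
print(errors)
```

Running the loop shows the error decreasing with $n$ while always staying inside the theoretical envelope, mirroring the degree-versus-Lipschitz-constant trade-off discussed in Section 4.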