1 Introduction
Extracting and reasoning with abstract concepts is a crucial ability for any learner that is to operate in combinatorially complex open worlds or domains with limited or structured data. It is well known that neural learners struggle to operate in such conditions due to their poor generalization capabilities in structured domains Chollet (2019); Marcus (2018). In this work, we demonstrate that spectral regularization provides neural networks with a strong inductive bias towards learning and utilizing abstract concepts akin to a symbolic learner.
For that purpose, we employ the Abstraction and Reasoning Corpus (ARC) Chollet (2019), which contains tasks related to manipulating colored patterns in a grid. In order to successfully solve the tasks in the corpus, an agent needs to be able to count, manipulate numbers, work with topological and geometric concepts, and recognise the notion of objects. There are 400 training tasks and 400 (distinct) evaluation tasks. Each task has a small set of input-output example pairs (between 1 and 5) and a query input pattern. This is a challenging dataset due to the small amount of example data, the large number of different tasks, and their abstract nature.
So far, the best known solution, with a success rate of 20%, is the winner of the ARC Kaggle challenge: a carefully handcrafted symbolic solver written in approximately 7k lines of C++ code. In this paper, we introduce the Neural Abstract Reasoner (NAR), which achieves an accuracy of 79% and thus outperforms even the best symbolic solver created by a human. The NAR architecture contains a Differentiable Neural Computer (DNC) that learns general problem solving skills and a Transformer network responsible for solving particular task instances (see Fig. 2 in Chollet (2019)).
Importantly, spectral regularization plays a fundamental role in the successful training of NAR. From a purely machine learning perspective, spectral regularization is known to reduce the effective number of parameters in the network; however, we provide some additional theoretical intuition and demonstrate that spectral regularization also pushes the network towards finding algorithmically simpler solutions, as recommended by Solomonoff’s theory of inductive inference Li and Vitanyi (2008).
2 Related Work
Neurosymbolic architectures
Hybrid neurosymbolic approaches enable agents to solve structured tasks from raw data, while learning faster and being more robust to noise Gaunt et al. (2017); Verma et al. (2018); Penkov and Ramamoorthy (2018); Mao et al. (2018). However, the majority of methods proposed so far are designed with specific domains in mind, making them inapplicable to a broader range of tasks. A notable exception is the architecture proposed by Ellis et al. (2020), which is capable of learning rules from geometry, vector algebra, and physics, and of solving tasks such as drawing pictures or building complete scenes. Importantly, these methods often require large amounts of data, which is in stark contrast with human capabilities.
The ARC dataset Chollet (2019) is specially designed to push research towards data-efficient learners, as there are hundreds of tasks, each of which is represented by no more than 5 input/output examples. To the best of our knowledge, the Neural Abstract Reasoner, presented in this paper, is the first architecture that achieves a performance rate of 79% on the ARC dataset, outperforming state-of-the-art hand-coded symbolic systems by a factor of 4. The NAR architecture is a composition of a slowly learning Differentiable Neural Computer and a fast adapting Transformer network, creating an outer learning loop and an inner executing loop, as suggested in Chollet (2019).
Complexity and generalization
The analysis of complexity and generalization metrics applied to neural networks has formed a central line of theoretical ML research with a variety of recent breakthroughs in terms of PAC-based and compression methods (cf. Arora et al. (2018); Bartlett et al. (2017); Jiang et al. (2020); Suzuki et al. (2020); Wei and Ma (2019) and the references therein). In particular, many generalization approaches based on spectral norm analysis have been proposed and investigated Neyshabur et al. (2017); Sanyal et al. (2020). However, to our knowledge the present work is the first to address the relationship between the behaviour of spectral norms and abstract reasoning tasks, demonstrating a strong relationship between classical spectral regularisation and the ability of a neural model to learn abstract reasoning (concepts and rules). As touched upon in Section 4, one could draw motivation from well-known algorithmic information theory concepts such as Solomonoff inference and program generation based on least complexity Li and Vitanyi (2008); Blier and Ollivier (2018); Schmidhuber (1997).
3 Methods
Description
The ARC dataset consists of a train portion and an evaluation portion. Each portion consists of 400 tasks (train tasks are augmented to 15000 through color permutations and rotations). The individual tasks are grouped in tags based on the skills needed to solve them Bonin [4]. A task consists of up to five example input-output pairs and one query pair. A neural learner has to infer the logical rule that maps each input to its output. All inputs and outputs are grids of variable sizes with 10 colors. A solution is correct only when all the pixels on the grid match.
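The augmentation and the all-pixels-match criterion described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names are hypothetical, grids are assumed to be lists of rows of integer color ids, and it is assumed that the same rotation and color permutation is applied to every pair of a task so that the task's rule is preserved.

```python
import random

def rotate90(grid):
    """Rotate a grid (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

def recolor(grid, mapping):
    """Apply a color permutation given as a dict old_color -> new_color."""
    return [[mapping[c] for c in row] for row in grid]

def augment_task(pairs, rng, num_colors=10):
    """Apply one random rotation and one random color permutation to every
    (input, output) pair of a task (assumption: the same transformation is
    shared by all pairs of the task)."""
    colors = list(range(num_colors))
    shuffled = colors[:]
    rng.shuffle(shuffled)
    mapping = dict(zip(colors, shuffled))
    turns = rng.randrange(4)  # number of clockwise quarter turns
    augmented = []
    for inp, out in pairs:
        for _ in range(turns):
            inp, out = rotate90(inp), rotate90(out)
        augmented.append((recolor(inp, mapping), recolor(out, mapping)))
    return augmented

def is_solved(prediction, target):
    """A prediction counts as correct only if every cell of the grid matches."""
    return prediction == target
```

Calling `augment_task(pairs, random.Random(0))` repeatedly with different seeds yields distinct variants of the same task.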
First, we derive a latent representation of the grids with an InceptionNet-style Szegedy et al. (2015) deterministic autoencoder, consisting of a parametrized encoder and decoder. We train this embedding network by minimizing the standard autoencoder cross-entropy loss.
Next, we consider all latent grid embeddings, and we train a Differentiable Neural Computer Graves et al. (2016) to infer an instruction set from them. All inputs are processed by a Transformer Decoder Stack Vaswani et al. (2017), which self-attends to all inputs and cross-attends to the instruction set.
The whole model is then trained end-to-end via Adam Kingma and Ba (2014) to minimize the cross-entropy loss between the query target and the decoded test output prediction.
We employ a two-stage curriculum during training, first training only on a single tag (pattern_expansion), and then expanding to the whole train dataset. Additionally, during the first stage of training, spectral regularization Yoshida and Miyato (2017) with a larger coefficient is applied, which is then annealed in the second stage. When evaluating the model, we apply additional optimization steps (similar to Finn et al. (2017); Krause et al. (2018)), as described in Algorithm 1. See Appendix C for additional details.
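To make the regularizer concrete, the sketch below estimates the spectral norm of each weight matrix by power iteration and forms a penalty of the form (coefficient / 2) times the sum of squared spectral norms, in the spirit of Yoshida and Miyato (2017). It is an illustration under stated assumptions, not the training code used here: weights are plain nested lists, matrices are assumed to be non-zero, and the function names and the coefficient values in the usage lines are hypothetical.

```python
import math
import random

def matvec(W, v):
    """Compute W v for a matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def matvec_t(W, u):
    """Compute W^T u."""
    return [sum(W[i][j] * u[i] for i in range(len(W))) for j in range(len(W[0]))]

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def spectral_norm(W, iters=100, seed=0):
    """Largest singular value of a non-zero matrix W via power iteration."""
    rng = random.Random(seed)
    v = normalize([rng.gauss(0.0, 1.0) for _ in W[0]])
    for _ in range(iters):
        u = normalize(matvec(W, v))
        v = normalize(matvec_t(W, u))
    return math.sqrt(sum(x * x for x in matvec(W, v)))

def spectral_penalty(weights, coeff):
    """Penalty (coeff / 2) * sum of squared spectral norms over all layers."""
    return 0.5 * coeff * sum(spectral_norm(W) ** 2 for W in weights)

# Two-stage schedule: a larger coefficient first, then an annealed one.
# Both numbers are illustrative placeholders.
stage1_coeff = 1e-3
stage2_coeff = stage1_coeff / 10
```

In a training loop the penalty would be added to the task loss, with `stage1_coeff` used during the single-tag stage and `stage2_coeff` afterwards.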
Motivation
We build on methods from Santoro et al. (2016) and use a memory-augmented neural network (the Differentiable Neural Computer Graves et al. (2016)) to derive context for the current task. The multiple read heads and attention mechanisms allow the DNC to relate the input and the output of a pair and compare them to the other input/output pairs that it has already processed. Unlike Santoro et al. (2016), we leverage a Transformer to carry out the task execution based on the DNC context. This decouples the learning of the instruction set from the program execution itself, and allows us to use the input/output relations directly, rather than the more standard . Lastly, the Transformer network relates the inputs to each other, thereby exploiting similarities within them.
Performance
As this work is still in progress, these are preliminary results evaluated on grids up to . Nonetheless, we outperform all currently known solutions, including a handcrafted symbolic solution (see Fig. 1). Spectral regularization proved instrumental for this, and other regularization methods did not yield any significant results (see Fig. 2).
Without any additional adaptation steps Finn et al. (2017), evaluation performance remains low at 1%, whereas after only 3 steps that number climbs to 78.8%. Analyzing more closely the network changes made by the adaptation steps, the gradient norm of is , which implies that the DNC is acting as a true meta-learner, and only the Transformer requires a small change to execute the instructions flawlessly. We again attribute this generalization to spectral regularization.
4 Effects of spectral regularization: stable ranks and complexity
The surprisingly significant effect of a simple spectral regularization strategy in the reasoning tasks suggests strong connections with generalization and the underlying model complexity estimates. On one hand, this motivates the analysis of spectral regularization in terms of some well-known generalization bounds (e.g. based on stable rank and spectral norms); however, we first discuss a perspective inspired by algorithmic information theory. Intuitively, abstract reasoning tasks are induced by a concise set of logic rules and combinatorial patterns, and, hence, it is natural to search for short programs producing these rules. In this regard, we give motivation as to why spectral regularization naturally shrinks the search space towards shorter programs.
Spectral regularization and polynomials as algorithmically simple programs. Classical methods from program inference and algorithmic information theory, such as Solomonoff inference and Occam’s razor Li and Vitanyi (2008), suggest that "simpler" program models are preferable in terms of forming abstract concepts and generalization; a formal approach towards such issues is given, e.g., by Kolmogorov complexity theory Li and Vitanyi (2008); Schmidhuber (1997). Although the evaluation of algorithmic complexity is a demanding task (Kolmogorov complexity is theoretically uncomputable), one could attempt to devise various proxy metrics that capture the algorithmic complexity of a given function/program.
Here, in an attempt to evaluate and explain the algorithmic complexity of our models from a spectral-regularization perspective, we consider approximations in terms of a simple but flexible class of programs $\mathcal{P}_n$ computing rational-coefficient polynomials of maximal degree $n$. Intuitively, approximating a model $f$ in terms of $\mathcal{P}_n$ for lower values of $n$ corresponds to discovering programs of decreased algorithmic length (in terms of operations) that effectively compute $f$. In this direction, we bring forward some classical approximation theory results implying that lower spectral norms yield lower degrees of the approximation polynomial. To ease notation, here we discuss the 1-dimensional case; however, similar results hold for higher dimensions as well Trefethen (2013):
Proposition 1.
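Let $f: [0,1] \to \mathbb{R}$ represent a model with Lipschitz constant $L$. Then, there exists a polynomial $p_n$ of degree $n$, so that $\|f - p_n\| \le \frac{3L}{2\sqrt{n}}$, where $\|\cdot\|$ denotes the usual $L^\infty$ norm over the interval $[0,1]$.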
Since the Lipschitz constant $L$ of a neural model is bounded above by the product of the spectral norms of its layers, spectral regularization gives control over $L$; moreover, a lower value of $L$ implies that one can decrease the polynomial degree $n$ and retain similar approximation qualities. These observations support our empirical results: introducing spectral regularization steers the model search space towards algorithmically simpler and more robust functions.
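As a worked illustration of this remark, assume a Bernstein-type bound of the form $\|f - p_n\| \le \frac{3L}{2\sqrt{n}}$ (the constant $3/2$ is an assumption taken from the classical Bernstein estimate; any absolute constant gives the same scaling). Solving for the degree that suffices to reach a target accuracy $\varepsilon$, a symbol introduced here only for illustration, gives:

```latex
\frac{3L}{2\sqrt{n}} \le \varepsilon
\quad\Longleftrightarrow\quad
n \ge \left(\frac{3L}{2\varepsilon}\right)^{2}.
```

Under this assumption the sufficient degree scales quadratically in $L$, so halving the Lipschitz constant reduces it by a factor of four.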
Generalization via spectral norms and stable ranks. We recall that the stable rank of a matrix $A$, $\mathrm{srank}(A)$, is defined as the ratio $\|A\|_F^2 / \|A\|_2^2$ and note that $\mathrm{srank}(A)$ is at most the rank of $A$. The stable rank is intuitively understood as a continuous proxy to $\mathrm{rank}(A)$ and as a measure for the true parameter count of $A$. Now, let $f$ be a deep neural model consisting of $d$ layers whose corresponding weight matrices are denoted by $W_1, \dots, W_d$. Recent works (e.g. Neyshabur et al. (2017); Arora et al. (2018)) obtain generalization bounds on $f$, roughly speaking, in terms of the expression $\prod_{i=1}^{d} \|W_i\|_2^2 \sum_{i=1}^{d} \mathrm{srank}(W_i)$, where $\|W_i\|_2$ denotes the spectral norm of the matrix $W_i$. A related stronger compression-based estimate in terms of so-called noise-cushions is obtained in Arora et al. (2018). In other words, the generalization error is influenced by the spectral norms as well as stable ranks of the layers.
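The quantities in this paragraph are simple to compute. The sketch below uses the standard definition of stable rank (squared Frobenius norm divided by squared spectral norm) and a product-times-sum complexity proxy of the kind that appears in the cited bounds. It is only an illustration: the name `capacity_proxy` is hypothetical, matrices are plain nested lists assumed to be non-zero, and the spectral norm is estimated by power iteration.

```python
import math
import random

def frobenius_sq(W):
    """Squared Frobenius norm: sum of squared entries."""
    return sum(x * x for row in W for x in row)

def spectral_norm_sq(W, iters=200, seed=0):
    """Squared largest singular value of W, estimated by power iteration."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in W[0]]
    for _ in range(iters):
        u = [sum(w * x for w, x in zip(row, v)) for row in W]      # u = W v
        z = [sum(W[i][j] * u[i] for i in range(len(W)))            # z = W^T u
             for j in range(len(W[0]))]
        norm = math.sqrt(sum(x * x for x in z))
        v = [x / norm for x in z]                                  # unit vector
    u = [sum(w * x for w, x in zip(row, v)) for row in W]
    return sum(x * x for x in u)

def stable_rank(W):
    """Stable rank: squared Frobenius norm over squared spectral norm."""
    return frobenius_sq(W) / spectral_norm_sq(W)

def capacity_proxy(weights):
    """Product of squared spectral norms times the sum of stable ranks."""
    product = 1.0
    for W in weights:
        product *= spectral_norm_sq(W)
    return product * sum(stable_rank(W) for W in weights)
```

For example, an identity matrix of size 3 has stable rank 3, while any rank-one matrix has stable rank 1, which matches the reading of stable rank as a continuous proxy for rank.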
In this direction, we evaluated our model and the effect of stable ranks. Interpreting Fig. 3, one observes that initially, while the model adapts to the single task of pattern_expansion, it increases and stabilizes its true parameter count (stable rank); afterwards, the model is introduced to the full task bundle, where a significant decrease of the stable ranks is observed. According to the last expression, this leads to better generalization, and further implies that at the end of training one actually deals with simpler models with better compression properties.
5 Conclusion and Acknowledgements
We have demonstrated that spectral regularization provides neural learners with a significant boost in performance on abstract reasoning tasks. We believe that studying the complexity of the underlying models in the context of powerful frameworks such as Kolmogorov complexity or Solomonoff’s theory of inductive inference is a promising step towards closing the neurosymbolic gap. We would like to thank Dimitar Vasilev (Microsoft Inc.) for the computational resources used in this work.
References
Arora et al. (2018). Stronger generalization bounds for deep nets via a compression approach. In 35th International Conference on Machine Learning, ICML 2018, Vol. 1, pp. 390–418. arXiv:1802.05296, ISBN 9781510867963.
Bartlett et al. (2017). Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems. arXiv:1706.08498, ISSN 1049-5258.
Blier and Ollivier (2018). The description length of deep learning models. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 2216–2226.
[4] Bonin. Task tagging. Kaggle public notebook. Accessed on 08.10.2020.
Chollet (2019). On the measure of intelligence. arXiv preprint arXiv:1911.01547.
Ellis et al. (2020). DreamCoder: growing generalizable, interpretable knowledge with wake-sleep Bayesian program learning. arXiv preprint arXiv:2006.08381.
Finn et al. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400.
Gaunt et al. (2017). Differentiable programs with neural libraries. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 1213–1222.
Graves et al. (2016). Hybrid computing using a neural network with dynamic external memory. Nature 538 (7626), pp. 471–476.
Jiang et al. (2020). Fantastic generalization measures and where to find them. In International Conference on Learning Representations.
Kingma and Ba (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Krause et al. (2018). Dynamic evaluation of neural sequence models. In International Conference on Machine Learning, pp. 2766–2775.
Li and Vitanyi (2008). An introduction to Kolmogorov complexity and its applications. 3rd edition, Springer Publishing Company, Incorporated. ISBN 0387339981.
Linhares (2000). A glimpse at the metaphysics of Bongard problems. Artificial Intelligence 121 (1-2), pp. 251–270.
Mao et al. (2018). The neuro-symbolic concept learner: interpreting scenes, words, and sentences from natural supervision. In International Conference on Learning Representations.
Marcus (2018). Deep learning: a critical appraisal. arXiv preprint arXiv:1801.00631.
Neyshabur et al. (2017). Exploring generalization in deep learning. In Advances in Neural Information Processing Systems. arXiv:1706.08947, ISSN 1049-5258.
Penkov and Ramamoorthy (2018). Learning programmatically structured representations with perceptor gradients. In International Conference on Learning Representations.
[19] (2006). Polynomial approximation theory. In Spectral Methods: Fundamentals in Single Domains. ISBN 9783540307266.
Santoro et al. (2016). Meta-learning with memory-augmented neural networks. In International Conference on Machine Learning, pp. 1842–1850.
Sanyal et al. (2020). Stable rank normalization for improved generalization in neural networks and GANs. In International Conference on Learning Representations.
Schmidhuber (1997). Discovering neural nets with low Kolmogorov complexity and high generalization capability. Neural Netw. 10 (5), pp. 857–873. ISSN 0893-6080.
Suzuki et al. (2020). Compression based bound for non-compressed network: unified generalization error analysis of large compressible deep neural network. In International Conference on Learning Representations.
Szegedy et al. (2015). Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9.
Trefethen (2017). Multivariate polynomial approximation in the hypercube. Proceedings of the American Mathematical Society 145, pp. 4837–4844.
Trefethen (2013). Approximation theory and approximation practice. SIAM.
Vaswani et al. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
Verma et al. (2018). Programmatically interpretable reinforcement learning. In International Conference on Machine Learning, pp. 5045–5054.
Wei and Ma (2019). Data-dependent sample complexity of deep neural networks via Lipschitz augmentation. In Advances in Neural Information Processing Systems 32, pp. 9725–9736.
Yoshida and Miyato (2017). Spectral norm regularization for improving the generalizability of deep learning. arXiv preprint arXiv:1705.10941.
Appendix A Abstraction and Reasoning Corpus
The Abstraction and Reasoning Corpus (ARC) [Chollet, 2019] is a dataset of grid-based pattern recognition and pattern manipulation tasks. A decision-making agent sees a small number of examples of input and output grids that illustrate the underlying logical relationship between them. It then has to infer this logical rule and apply it correctly on a test query.
In many ways, the benchmark is similar to the Bongard problems (see [Linhares, 2000]): relations are highly abstract and geometric. Moreover, only 3-5 examples are presented for each task; therefore, the benchmark tests the ability of a decision-making agent to (i) grasp abstract logic and (ii) adapt quickly to new tasks.
There are 400 training and 400 evaluation tasks, structured as follows:
- each task consists of a train and a test set;
- the train set includes 3-5 input/output pairs;
- the test set includes 1 input/output pair;
- an input/output pair consists of an input grid and an output grid, the relation between which follows a consistent logic throughout the task;
- grids are rectangular and are divided into squares;
- grid patterns are drawn with 10 colors;
- grid sizes vary between 1 and 30 in length and width; input and output grid sizes are not necessarily equal.
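The structure listed above can be checked mechanically. The sketch below assumes a task is stored as a dictionary with "train" and "test" lists of {"input", "output"} grid pairs and that colors are integer ids from 0 to 9; the helper names are hypothetical.

```python
def validate_grid(grid, max_side=30, num_colors=10):
    """Raise ValueError unless grid is a rectangular list of rows of color ids."""
    if not 1 <= len(grid) <= max_side:
        raise ValueError("grid height out of range")
    width = len(grid[0])
    if not 1 <= width <= max_side:
        raise ValueError("grid width out of range")
    for row in grid:
        if len(row) != width:
            raise ValueError("grid is not rectangular")
        if any(not 0 <= c < num_colors for c in row):
            raise ValueError("color id out of range")

def validate_task(task):
    """Check that a task has non-empty train and test sets of valid grid pairs."""
    for split in ("train", "test"):
        if not task.get(split):
            raise ValueError(f"missing or empty {split} set")
        for pair in task[split]:
            validate_grid(pair["input"])
            validate_grid(pair["output"])
    return True
```

Input and output grids are validated independently, since their sizes are not required to agree.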
No set of rules exists that can solve all tasks, and while some skills are useful for several of them, each task has its own unique principle. This makes trivial approaches like brute-force computation impractical.
A human approaching those tasks would easily be able to spot logical relations, since we have developed the necessary priors to find similarities and infer logic. Therefore, ideally the neural network would derive this prior during training, and that would allow it to generalize well to the evaluation dataset.
For the current scope of our research we use all tasks with grids of size not larger than . For the train set, we augment the tasks to 15000 by permuting colors and by exploiting that the tasks are invariant to rotation and symmetry.
Appendix B Grid Embedding
Prior to embedding, we zero-pad all grids to be , with the original grid in the center of the image. Additionally, colors in the grids are represented as one-hot vectors, making the final dimensionality of the grids (10 colors).
The embedding is done with a convolutional neural network, comprised of an encoder and a decoder. The encoder consists of a basket of 10 convolutions of filter sizes equal to (the module, Fig. 5). Different filter sizes enable the network to capture both local and global patterns. The convolution outputs are flattened and passed to linear layers with hyperbolic tangent activation functions, which transform them into the desired dimensionality of 256 (the latent dimension listed in Appendix E). A second neural network computes weights for summing the 10 resulting vectors. The decoder shares the same architecture as the encoder, but in reverse order: first linear layers, then convolutions, and then a weighted sum; finally, a softmax over the color dimension.
Summing the convolutional outputs enables the embedding network to be agnostic to the order in which it receives them (which would not be the case with an RNN, for instance). The weights reflect the fact that grid sizes vary and therefore not all filter sizes would be equally applicable or useful.
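The preprocessing step described at the start of this appendix (centered zero-padding followed by one-hot color encoding) can be sketched as below. Several details are assumptions made for illustration: the target side length `size` is a free parameter, padded cells are encoded as all-zero vectors, odd size differences place the extra padding at the bottom and right, and the output layout is height by width by colors.

```python
def one_hot(color, num_colors=10):
    """Encode a color id as a one-hot vector of length num_colors."""
    return [1.0 if c == color else 0.0 for c in range(num_colors)]

def pad_and_encode(grid, size, num_colors=10):
    """Center a grid on a size x size canvas and one-hot encode its colors.

    Cells outside the original grid stay all-zero (an assumption about how
    zero-padding interacts with the one-hot encoding).
    """
    height, width = len(grid), len(grid[0])
    if height > size or width > size:
        raise ValueError("grid does not fit on the canvas")
    top, left = (size - height) // 2, (size - width) // 2
    canvas = [[[0.0] * num_colors for _ in range(size)] for _ in range(size)]
    for i, row in enumerate(grid):
        for j, color in enumerate(row):
            canvas[top + i][left + j] = one_hot(color, num_colors)
    return canvas
```

For instance, `pad_and_encode([[1, 2]], size=4)` returns a 4 x 4 x 10 nested list in which only the two cells holding the original grid are non-zero.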
While training the Embedding network, we found that spectral regularization again proved to be instrumental for achieving 95% perfect reconstruction accuracy. In contrast, networks regularized by weight decay failed to climb above 6%.
What is more, tanh functions yielded a network that is 3 orders of magnitude more stable to Gaussian input noise than the same network, trained under the same conditions, with ReLU activations. We postulate that this is due to the fact that ReLU is unbounded for $x > 0$; therefore, random perturbations would have a greater impact.
Appendix C Architecture Details
Appendix D Additional Results
Appendix E Hyperparameters
Hyperparameter | Value | Description
learning rate | 0.001 | Learning rate of the models
regularization | 6e-4 | Coefficient in front of the spectral regularization penalty in the loss
annealing factor | 10 | Factor by which the regularization coefficient is divided to weaken regularization
Embedding network hyperparameters
latent dimension | 256 | The dimension of the embedded grids
kernel sizes | | Sizes of the used kernels in the convolutional modules
# of kernels | 128 | The number of convolutional kernels
DNC hyperparameters (temporal linkage is disabled)
# of read heads | 6 | Number of attention heads in the reading mechanism
# of LSTM layers | 3 | Layers of the LSTM controller
LSTM hidden size | 512 | Hidden size of the LSTM
memory dimensions | | Word length of 64, 32 memory locations
Transformer hyperparameters
# of decoders | 4 | Number of networks in the decoder stack
# of attention heads | 16 | Number of heads in the multi-head attention mechanism
Transformer hidden size | 256 | Internal hidden size of the linear layers in the Transformer
Appendix F Approximation via polynomial programs
In this section we briefly elaborate on the polynomial approximation mentioned in the main text. In general, classical results in this direction are based, e.g., on Chebyshev, Legendre and Bernstein polynomial approximations. Here we discuss a 1-dimensional approximation via Bernstein polynomials, but similar estimates hold in higher dimensions as well as for other polynomial schemes (e.g. Chebyshev). We refer to [19] and Trefethen (2013) for a thorough collection of results as well as further references.
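A well-known method of approximating one-dimensional continuous functions is by means of Bernstein polynomials. Let $f: [0,1] \to \mathbb{R}$ be continuous. The Bernstein polynomial of order $n$ corresponding to $f$ is defined as:
$$B_n(f)(x) = \sum_{k=0}^{n} f\left(\frac{k}{n}\right) \binom{n}{k} x^{k} (1-x)^{n-k}. \tag{1}$$
Theorem 1.
The following estimate holds:
$$\|B_n(f) - f\|_\infty \le \frac{3}{2}\, \omega\left(f; \frac{1}{\sqrt{n}}\right), \tag{2}$$
where $\omega(f; \delta)$ denotes the modulus of continuity defined as
$$\omega(f; \delta) = \sup_{|x - y| \le \delta} |f(x) - f(y)|. \tag{3}$$
In other words, $\omega(f; \delta)$ measures the maximal jump of $f$ over points which are no more than a distance $\delta$ apart.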
Note that whenever $f$ has a Lipschitz constant $L$, the modulus of continuity is controlled above via $\omega(f; \delta) \le L\delta$. Applying this bound in the estimate of Theorem 1 yields the result mentioned in the text. As already mentioned, more technical high-dimensional analogues of the estimates are also available Trefethen (2013), [19]; e.g. an analogous result for Chebyshev polynomials holds where the modulus of continuity is appropriately replaced by extrema over complex ellipsoids Trefethen (2017).
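The estimate can be checked numerically on a simple example. The sketch below builds the Bernstein polynomial of a function on [0, 1] from its standard definition and compares the sampled sup-norm error against 3L / (2 sqrt(n)), the bound obtained from the classical Bernstein estimate with constant 3/2 for an L-Lipschitz function. The function names and the test function |x - 1/2| (Lipschitz constant 1) are choices made here for illustration, and the error is measured on a finite sample grid rather than exactly.

```python
import math

def bernstein(f, n):
    """Return the Bernstein polynomial of order n of f on [0, 1] as a callable."""
    def poly(x):
        return sum(
            f(k / n) * math.comb(n, k) * x ** k * (1 - x) ** (n - k)
            for k in range(n + 1)
        )
    return poly

def sup_error(f, g, samples=1001):
    """Maximum of |f - g| over an evenly spaced sample grid on [0, 1]."""
    points = [i / (samples - 1) for i in range(samples)]
    return max(abs(f(x) - g(x)) for x in points)

def kink(x):
    """Test function |x - 1/2|, which has Lipschitz constant 1."""
    return abs(x - 0.5)

def lipschitz_bound(lipschitz, n):
    """Right-hand side 3L / (2 sqrt(n)) of the Lipschitz form of the estimate."""
    return 1.5 * lipschitz / math.sqrt(n)

errors = {n: sup_error(kink, bernstein(kink, n)) for n in (4, 16, 64)}
```

Each entry of `errors` stays below `lipschitz_bound(1.0, n)`, and the error shrinks as the degree grows, in line with the discussion of degree versus Lipschitz constant.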