
Differentiable programming: Generalization, characterization and limitations of deep learning

05/13/2022
by   Adrián Hernández, et al.
UMH

In the past years, deep learning models have been successfully applied in several cognitive tasks. Originally inspired by neuroscience, these models are specific examples of differentiable programs. In this paper we define and motivate differentiable programming, as well as specify some program characteristics that allow us to incorporate the structure of the problem in a differentiable program. We analyze different types of differentiable programs, from more general to more specific, and, for a specific problem with a graph dataset, evaluate its structure and knowledge with several differentiable programs using those characteristics. Finally, we discuss some inherent limitations of deep learning and differentiable programs, which are key challenges in advancing artificial intelligence, and then analyze possible solutions.


1 Introduction

In recent years, different deep learning models (neural networks, RNNs, CNNs, …) have been developed and successfully applied in cognitive tasks such as natural language processing, gaming, computer vision and more LeCun2015DeepLearning. Although these models were originally biologically inspired, they are specific examples of differentiable programs and the key to their success is the relationships and transformations of the elements (tensors) of the model. A differentiable program stands here for any tensor transformation whose parameters (trained end-to-end) are differentiable.

Differentiable programming is a programming framework that is expressive enough to contain the space of all functions, and flexible enough to incorporate the priors and the structure of the problem when we define a new program. In traditional programming there are certain techniques for constructing algorithms (divide and conquer, recursion, dynamic programming, …) that depend on the problem to be solved. For differentiable programming, it is likewise useful to have some characteristics that allow us to incorporate the structure and priors of the problem at hand.

To be more specific, by the structure and priors of a problem we mean the knowledge we have of the problem: temporal dependency, symmetries, relationship between its components, etc. The program characteristics are a series of properties and relations between the elements (tensors) of a differentiable program that allow us to incorporate this prior knowledge and reduce the hypothesis space in which learning is carried out.

In this paper we define and motivate differentiable programming, as well as specify some program characteristics that are important to construct differentiable programs and adapt them to important classes of problems. We analyze several domains or types of differentiable programs, from more general to more specific, and map the problem knowledge and structure to these programs using such characteristics. This procedure will be exemplified with the CiteSeer dataset yang2016revisiting and the problem of classifying scientific publications into one of six classes.

We also discuss the limitations of differentiable models and possible solutions, which is a central theme for the future of artificial intelligence. We explain the need for learning strategies that leverage previous data and models to generate new experiences and knowledge, as humans do, providing some characteristics and examples of these strategies.

In sum, the topic of this paper is differentiable programming, which is a programming model that generalizes deep learning. We call the concrete examples or models that it generates differentiable programs, which are represented by acyclic directed graphs. The learning process is done in two steps: In the forward pass, the graph is built; in the backward pass, the parameters are adjusted by gradient descent.

This paper is organized as follows. In the next section we describe, motivate and define differentiable programming. In Section 3 we discuss its flexibility and propose a characterization. In Section 4 we study general programs and programs that incorporate very specific priors, while in Section 5 we perform a series of experiments to analyze a particular graph problem using the defined characteristics. Finally, in Section 6 we describe the limitations of differentiable models and possible solutions. The conclusions and a list of acronyms wind up this paper.

2 Differentiable programming. Motivation and definition

In the past years we have seen major advances in the field of machine learning. Deep neural networks, along with the computational capabilities of Graphics Processing Units (GPUs) Yadan2013MultiGPU, have improved the performance of several tasks, including image recognition, machine translation, language modelling, time series prediction and game playing LeCun2015DeepLearning; Sutskever2014SequenceTS; Silver2017MasteringTG.

Let us recall that, in a feedforward neural network (FNN) composed of multiple layers, the output (without the bias term) at layer $l$ is given by

$$x^{(l)} = \sigma\left(W^{(l)} x^{(l-1)}\right), \qquad (1)$$

$W^{(l)}$ being the weight matrix at layer $l$, $\sigma$ the activation function, $x^{(l)}$ the output vector at layer $l$ and $x^{(l-1)}$ the input vector at layer $l$. The weight matrices for the different layers are the parameters of the model.
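As an illustration of Equation (1), the following is a minimal sketch of one such layer, assuming PyTorch; the layer sizes and the choice of ReLU as activation are assumptions made here, not values from the text.

```python
import torch

# One feedforward layer: x_out = sigma(W x_in), without the bias term.
# The dimensions and the ReLU activation are illustrative assumptions.
W = torch.randn(64, 128, requires_grad=True)   # weight matrix of this layer
activation = torch.relu                        # activation function sigma

def layer(x_in: torch.Tensor) -> torch.Tensor:
    """Output vector of the layer given the output vector of the previous layer."""
    return activation(W @ x_in)

x = torch.randn(128)   # input vector (output of the previous layer)
y = layer(x)           # output vector at this layer, shape (64,)
```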

Deep learning is a part of machine learning that is based on neural networks and uses multiple layers, where each layer extracts higher level features from the input.

Although deep learning can implicitly implement logical reasoning Hohenecker2018Onto; SATNet2019; AssessingSATNet2020, it has limitations that make it difficult to achieve more general intelligence. Among such limitations, let us mention that deep learning only performs perception and does not carry out conscious and sequential reasoning Marcus2018Deep.

Simplified, the learning problem can be formulated as follows. Given an input space $X$ and an output space $Y$, the pairs $(x, y) \in X \times Y$ being random variables distributed according to a joint probability mass or density function $P(x, y)$, construct a function $f: X \rightarrow Y$ from a hypothesis space of functions $\mathcal{H}$ which predicts $y$ from $x$ after observing a sequence of $n$ pairs $(x_1, y_1), \ldots, (x_n, y_n)$ independently and identically distributed according to $P(x, y)$.

A learning algorithm over $\mathcal{H}$ is a computable map from $(X \times Y)^n$ to $\mathcal{H}$, specifically, an algorithm that takes as input the sequence of training samples and outputs a function $f \in \mathcal{H}$ with a low probability of error $P(f(x) \neq y)$.

As a general rule we are interested in having a hypothesis space (set of functions) as expressive as possible. However, to solve a specific problem, it is necessary to restrict this hypothesis space depending on the structure and priors of the problem.

This is what we have seen in recent years with the development of specific models and graph structures. For example, multilayer neural networks are based on the compositional and hierarchical structure of information, RNNs (Recurrent Neural Networks) on temporal dependence, CNNs (Convolutional Neural Networks) on translational invariance, etc.

Therefore, to advance and generalize deep learning, a programming model or framework with the following characteristics is necessary.

  1. Be expressive enough to define the space of all hypotheses or functions.

  2. Be able to incorporate the structure and the priors of the problem we want to model. As stated in Hernandez2021Chapter, incorporating prior knowledge about the underlying task or class of functions can make the learning process more efficient.

  3. Be able to define new primitives to advance deep learning capabilities (reasoning, attention, memory, modeling of physical problems, etc.)

As we are going to see, a natural evolution to move forward in these directions is to generalize deep learning using a framework composed of differentiable blocks that makes it possible to overcome the limitations of classical models.

This approach, called differentiable programming, adds new differentiable components to traditional neural networks. For the purposes of this article, differentiable programming is defined as follows.

[Differentiable programming] Differentiable programming is a programming model defined by a tuple $(n, m, E, \{f_i\}, \{v_i\})$ where:

  1. Programs are directed acyclic graphs.

  2. Graph nodes are mathematical functions or variables and the edges correspond to the flow of intermediate values between the nodes.

  3. $n$ is the number of nodes and $m$ the number of input variables of the graph, with $m \le n$. $v_i$ for $i = 1, \ldots, n$ is the variable associated with node $i$.

  4. $E$ is the set of edges in the graph. For each directed edge $(i, j)$ from node $i$ to node $j$ we have $i < j$, therefore the graph is topologically ordered.

  5. $f_i$ for $i = m+1, \ldots, n$ is the differentiable function computed by node $i$ in the graph. $\alpha_i = \{v_j \mid (j, i) \in E\}$ for $i = m+1, \ldots, n$ contains all input values for node $i$.

  6. The forward algorithm or pass, given the input variables $v_1, \ldots, v_m$, calculates $v_i = f_i(\alpha_i)$ for $i = m+1, \ldots, n$.

  7. The graph is dynamically constructed and composed of functions that are differentiable and whose parameters are learned from data.
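As a minimal illustration of this definition, the sketch below stores a topologically ordered computational graph and runs the forward pass; the dictionary representation and the example functions are assumptions made only for illustration.

```python
import math

# Nodes 1..m hold input variables; nodes m+1..n hold differentiable functions.
# Each function node lists its parent nodes (every edge i -> j satisfies i < j).
m = 2
functions = {                     # node index -> (parents, function f_i)
    3: ((1, 2), lambda a, b: a * b),
    4: ((3,),   lambda a: math.sin(a)),
    5: ((2, 4), lambda a, b: a + b),
}

def forward(inputs):
    """Given v_1..v_m, compute v_i = f_i(alpha_i) for i = m+1..n in topological order."""
    v = {i + 1: x for i, x in enumerate(inputs)}
    for i in sorted(functions):
        parents, f = functions[i]
        v[i] = f(*(v[j] for j in parents))
    return v

print(forward([2.0, 3.0]))   # {1: 2.0, 2: 3.0, 3: 6.0, 4: sin(6.0), 5: 3.0 + sin(6.0)}
```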

To learn the parameters of the graph from the data, automatic differentiation is used. Automatic differentiation, in its reverse mode and in contrast to manual, symbolic and numerical differentiation, computes the derivatives in a two-step process Baydin2018AutomaticDiff; Wang2018DemystifyingDP. As described in Baydin2018AutomaticDiff, a function $f: \mathbb{R}^n \rightarrow \mathbb{R}^m$ is constructed with intermediate variables $v_i$ such that:

  1. variables $v_{i-n} = x_i$, $i = 1, \ldots, n$, are the input variables (the parameters).

  2. variables $v_i$, $i = 1, \ldots, l$, are the intermediate variables.

  3. variables $y_{m-i} = v_{l-i}$, $i = m-1, \ldots, 0$, are the output variables.

In a first step, the forward pass, the graph is built, populating the intermediate variables and recording the dependencies. In a second step, called the backward pass, derivatives are calculated by propagating the adjoints of the output being considered from the output to the inputs.

The reverse mode is more efficient to evaluate for functions with a large number of inputs (parameters) and a small number of outputs. When $m = 1$, as is the case in machine learning with $n$ very large and $f$ the cost function, only one pass of the reverse mode is necessary to compute the gradient $\nabla f$.
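A minimal example of this two-step process, assuming PyTorch: the forward pass records the graph while computing the intermediate values, and a single backward pass propagates the adjoints of a scalar cost to all parameters. The particular cost and dimensions are arbitrary choices for illustration.

```python
import torch

w = torch.randn(3, requires_grad=True)   # parameters (inputs of the graph)
x = torch.tensor([1.0, 2.0, 3.0])        # data, treated as a constant

# Forward pass: intermediate variables are populated and dependencies recorded.
v1 = w * x
v2 = v1.sum()
cost = (v2 - 1.0) ** 2                   # scalar output (m = 1)

# Backward pass: one sweep propagates the adjoints from the output to the inputs.
cost.backward()
print(w.grad)                            # gradient of the cost w.r.t. all parameters
```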

In recent years, deep learning frameworks such as PyTorch have been developed that provide reverse-mode automatic differentiation Paszke2017automatic. The define-by-run philosophy of PyTorch, whose execution dynamically constructs the computational graph, facilitates the development of general differentiable programs.

Differentiable programming can be seen as a continuation of the deep learning end-to-end architectures that have replaced, for example, the traditional linguistic components in natural language processing Deng2018NLP; Goldberg2017NLPBook. Differentiable programs are composed of classical blocks (feedforward, recurrent neural networks, etc.) along with new ones such as differentiable branching, attention, memories, etc.

Differentiable programming is an evolution of classical (traditional) software programming where, as summarized in Table 1:

  1. Instead of specifying explicit instructions to the computer, an objective is set and an optimizable architecture is defined which allows searching in a subset of possible programs.

  2. The program is defined by the input-output data and not predefined by the user.

  3. The optimizable elements of the program have to be differentiable, e.g., by converting them into differentiable blocks.

Classical Programming | Differentiable Programming
Sequence of explicit instructions | Sequence of differentiable primitives
Fixed architecture | Optimizable architecture that searches in a subset of possible programs
User-defined programs | Data-defined programs
Imperative programming | Declarative programming, specifying the objectives but not how to achieve them
Direct, intuitive and explainable | High level of abstraction
Table 1: Differentiable vs classical programming.

RNNs, for example, are an evolution of feedforward networks because they are classical neural networks inside a for-loop (a control flow statement for iteration), which allows the neural network to be executed repeatedly with recurrence. However, this for-loop is a predefined feature of the model. Differentiable programming allows the graph to be constructed dynamically and the length of the loop to vary. The ideal situation would be to augment the neural network with programming primitives (for-loops, if branches, while statements, external memories, logical modules, etc.) that are not predefined by the user but learned with the training data.

But many of these programming primitives are not differentiable and need to be converted into optimizable modules. For instance, if the condition of an "if" primitive (e.g., if $c$ is satisfied do $a$, otherwise do $b$) is to be learned, it can be the output $p$ of a neural network (a linear transformation followed by a sigmoid function) and the conditional primitive transforms into a convex combination of both branches, $p\,a + (1-p)\,b$. Similarly, in an attention module, different weights that are learned with the model are assigned to give a different influence to each part of the input. Figure 1 shows the program (directed acyclic graph) of a conditional branching.

Figure 1: Differentiable program (computational graph) of differentiable branching.
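A minimal sketch of this differentiable branching, assuming PyTorch: the branch condition is the output of a small network and the two branches are mixed by a convex combination. The branch functions and dimensions are illustrative assumptions, not the construction used in Figure 1.

```python
import torch
import torch.nn as nn

class DifferentiableIf(nn.Module):
    """'if c(x) do a(x), otherwise do b(x)', relaxed to p * a(x) + (1 - p) * b(x)."""
    def __init__(self, dim):
        super().__init__()
        self.condition = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())
        self.branch_a = nn.Linear(dim, dim)     # illustrative branch
        self.branch_b = nn.Linear(dim, dim)     # illustrative branch

    def forward(self, x):
        p = self.condition(x)                   # learned condition in (0, 1)
        return p * self.branch_a(x) + (1 - p) * self.branch_b(x)

out = DifferentiableIf(8)(torch.randn(4, 8))    # batch of 4 inputs
```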

3 Flexibility and characterization

From Definition 2, it can be seen that differentiable programs are very flexible and give rise to very varied structures, which allows restricting the hypothesis space in an adequate way.

The following elements of differentiable programs support such flexibility and variability.

  1. Nodes can be tensors of any rank, e.g., scalars (rank-0 tensors) $x \in \mathbb{R}$, vectors (rank-1 tensors) $x \in \mathbb{R}^n$, matrices (rank-2 tensors) $X \in \mathbb{R}^{n \times m}$, and general tensors $X \in \mathbb{R}^{n_1 \times \cdots \times n_k}$.

  2. The function computed by each node can be any transformation of the input tensors, as long as the program (the composition of the functions $f_i$) can represent any continuous function (universal approximator). If the function has parameters, it has to be differentiable with respect to the parameters. Examples are a linear transformation of a vector $Wx$, a non-linear function of a vector $\sigma(x)$, a weighted sum of vectors $\sum_i w_i x_i$, etc.

  3. The program not only performs a transformation of the input but also can integrate information in a very flexible way. As we will see in the next subsection, information nodes can be integrated and related using different graph paths.

  4. The graph is dynamically constructed for each input at execution. To do this, each of the primitives are executed sequentially, be they classic programming primitives (for, while, if, etc.) or differentiable functions. Therefore, the graph can be different depending on the input.

  5. Parameters can be set at training time or dynamically calculated. For example, the attention weights in Figure 1 usually depend on both a fixed parameter and the input.

Thus, differentiable programming generates, from a general hypothesis space $\mathcal{H}$ and a set of priors, a restricted hypothesis space $\mathcal{H}'$. The learning algorithm over $\mathcal{H}'$ takes as input the sequence of training samples, builds the graph, computes the derivatives and outputs a program or function $f \in \mathcal{H}'$.

Precisely, the flexibility and richness of differentiable programming make it easy to generate the restricted hypothesis space $\mathcal{H}'$, translating the prior knowledge and structure of the problem to the elements of a differentiable program, i.e. to a tuple $(n, m, E, \{f_i\}, \{v_i\})$.

Next, we define some program characteristics that help in this process. They are also instrumental to define differentiable programs and adapt them to important classes of problems.

3.1 Program characteristics

In traditional programming there are certain heuristics for constructing algorithms depending on the problem to be solved. New structures have also been developed in neuroscience that help explain neural processing Hernandez2018MultilayerAN.

For differentiable programming, it is also useful to have some characteristics that allow us to incorporate the structure and priors of the considered problem.

Traditionally, the characteristics of differentiable and deep learning programs (multilayer feed-forward networks, convolutional networks, etc.) were inspired by the brain. However, as we are going to see, although their source of inspiration was the brain, the key to their success has been the concrete relationships and transformations of the elements (tensors) of a differentiable program.

These characteristics are not independent but related to each other. Here we define some of these characteristics.

  1. Integration and relation of information tensors (vectors). Given a sequence of input vectors $x_1, \ldots, x_T$ with $x_t \in \mathbb{R}^n$ and a sequence of intermediate or output vectors $y_1, \ldots, y_T$ with $y_t \in \mathbb{R}^m$, a program characteristic is the graph path relationship (shortest path, non-intersecting paths, …) between each pair of vectors.

    For example, in RNNs the length of the path between the output vector $y_t$ and the past input vector $x_{t-k}$ increases with $k$.

  2. Relationship between the represented problem structure (locality, local or distant relations, node degree, temporal relations, etc.) and the differentiable program characteristics (depth, edges information, temporal evolution, etc.).

    An example of this characteristic is the relationship between the average degree of a graph and the depth (number of layers) of its corresponding graph neural network (GNN).

  3. Invariant transformations of data and symmetries. If $f$ represents the function that transforms the data and $g$ is a function that performs a transformation of the data, $f$ is invariant under $g$ if and only if $f(g(x)) = f(x)$.

    For example, convolutional networks (CNNs) have translational invariance, that is, the model produces the same response despite translations of input elements.

    In this way, we restrict the structure of the differentiable program taking into account certain symmetries of the information. Another example would be a transformation of input vectors that is invariant under a change of order or permutation of the vectors: given a sequence of input vectors $x_1, \ldots, x_T$, the transformation $f(x_1, \ldots, x_T)$ does not depend on the ordering of the set of vectors (see the sketch after this list).

  4. Combination of modules or tasks, that is, the possibility of combining different modules using classical or differentiable primitives. A model is composed of different parts or tasks and each task has a direct meaning or some grounding.

    Thus, by imposing restrictions on the structure, it is possible to better approximate the real distribution of the problem and be able to carry out subsequent tasks.

    For example, different particles that interact with each other in a physical model could correspond to the different nodes of a graph neural network. Also, as in hudson2018compositional, a task can be decomposed into a series of reasoning steps, learning to perform iterative reasoning.
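As an illustration of the invariance characteristic (item 3 above), the following is a minimal sketch, assuming PyTorch, of a transformation of a set of input vectors built from a permutation-invariant aggregation (a sum), so that reordering the inputs does not change the output. The layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SetTransform(nn.Module):
    """f(x_1..x_T) = rho(sum_t phi(x_t)) is invariant to permutations of the x_t."""
    def __init__(self, dim_in, dim_hidden, dim_out):
        super().__init__()
        self.phi = nn.Linear(dim_in, dim_hidden)
        self.rho = nn.Linear(dim_hidden, dim_out)

    def forward(self, x):                      # x: (T, dim_in)
        return self.rho(self.phi(x).sum(dim=0))

f = SetTransform(16, 32, 4)
x = torch.randn(10, 16)
perm = torch.randperm(10)
print(torch.allclose(f(x), f(x[perm])))        # True: output unchanged by reordering
```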

In Section 4 we are going to see several domains or types of differentiable programs, in which we analyze and map the problem structure to specific programs using the characteristics defined above.

4 From general to specific differentiable programs

4.1 Generalization of neural networks. Self-Attention

In describing the flexibility of differentiable programming, we stated that the node function can be any transformation of the input tensors, as long as the program can represent any continuous function (universal approximator).

Traditionally, following the inspiration of biological neural networks, a linear transformation (together with the activation function) of a vector has been used, as described in Equation (1).

If we want to transform a set of vectors, then we can concatenate the vectors and use the neural network of Equation (1) (with a relevant increase in the number of parameters) or use a recurrent neural network (RNN). However, as seen in Figure 2, an RNN assumes a one-step temporal dependence, which implies that the length of the path between the output vector $y_t$ and the past input vector $x_{t-k}$ increases with $k$. Consequently, the relationship between the output and the input vectors is not symmetric. As a result, this model is biased towards certain problem structures.

Figure 2: Computational graph of a recurrent neural network.

Therefore, a more general transformation of the input vectors should have the following characteristics.

  1. It acts on a set of vectors, consciously and dynamically choosing the most important ones.

  2. It has to integrate all the input vectors in a direct and symmetric way (as seen in Figure 3), with the same path length between an output vector and each of the input vectors.

  3. The weights to integrate each input vector should not be fixed but dependent on the context (input).

Figure 3: A direct and symmetric transformation of input vectors.

The structure described above is the basis of self-attention vaswani2017attention, one of the most successful deep learning models developed in recent years. The input vectors are transformed (using learnable matrices) into query ($Q$), key ($K$) and value ($V$) vectors and, for each step, the output is a direct integration of the value vectors based on the similarity between the query and each of the keys, $\mathrm{softmax}\!\left(QK^T/\sqrt{d_k}\right)V$. Furthermore, this model has made it possible to integrate sequential and conscious reasoning into deep learning models HernandezAttention.
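A minimal sketch of this scaled dot-product self-attention, assuming PyTorch (single head, no masking; the dimensions are illustrative assumptions):

```python
import math
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # Learnable matrices producing the query, key and value vectors.
        self.W_q = nn.Linear(dim, dim, bias=False)
        self.W_k = nn.Linear(dim, dim, bias=False)
        self.W_v = nn.Linear(dim, dim, bias=False)

    def forward(self, x):                         # x: (T, dim), one vector per step
        Q, K, V = self.W_q(x), self.W_k(x), self.W_v(x)
        scores = Q @ K.T / math.sqrt(x.size(-1))
        weights = torch.softmax(scores, dim=-1)   # similarity of each query to each key
        return weights @ V                        # same path length to every input vector

y = SelfAttention(32)(torch.randn(5, 32))
```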

4.2 Compositional structures and reasoning

The problems that we want to model using machine learning usually have a certain structure and are made up of a sequence of different tasks or modules. For example, human reasoning sometimes takes place in a series of phases, with each phase focusing on different information and feeding the next phase.

To make learning more efficient, the same structure can be transferred to the differentiable program, which will be made up of a sequence of modules. For example, a problem can be decomposed into a sequence of attention tasks. Each task focuses on a set of elements of information (text, image, memory, etc.) and is the input of the next task, as depicted in Figure 4.

Figure 4: A sequence of attention tasks.

Thus, a differentiable program can be very flexibly defined as a sequence or combination of tasks or modules. As stated in HernandezAttention, an attention module can be added to any deep learning model in multiple ways: in several parts of the model; focusing on different elements; with different attention dimensions (spatial, temporal and input dimension), etc.

Even more, as in Recurrent Independent Mechanisms (RIMs) Goyal2021RecurrentIM, there can be multiple modules that operate independently and on each step compete with each other to read from the input and update their states, attending only to the parts of the input relevant to that module.

4.3 Graph structure restrictions. Graph neural networks

After the success of deep learning in image processing, time series analysis, etc., it is logical to apply it to unstructured data and graphs.

Graph neural networks PyGeometric2019; DynamicPyGeometric2021 apply neural networks to graphs, based on nodes connected to each other. To do this, the state of each node $i$ of the graph is iteratively updated by adding information from the neighboring nodes $j \in \mathcal{N}(i)$:

$$x_i^{(k)} = \gamma^{(k)}\left(x_i^{(k-1)}, \bigoplus_{j \in \mathcal{N}(i)} \phi^{(k)}\left(x_i^{(k-1)}, x_j^{(k-1)}, e_{j,i}\right)\right), \qquad (2)$$

with $x_i^{(k)} \in \mathbb{R}^F$ denoting the $F$-dimensional vector features of node $i$ in layer $k$ and $e_{j,i}$ denoting optional edge features from node $j$ to node $i$. $\bigoplus$ denotes a differentiable, permutation-invariant function (e.g., a summation) and $\gamma$ and $\phi$ denote differentiable functions such as feedforward neural networks (FNNs).

In Figure 5, where $x_i$ represents the vector features of node $i$ in the original graph, we can see two consecutive updates of information among the nodes of the graph.

Figure 5: Two consecutive updates of information between nodes of a graph.
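A minimal sketch, assuming PyTorch Geometric, of a graph neural network with two graph convolutional layers, each an instance of the update in Equation (2); the dimensions are left as arguments and ReLU is an assumed choice of non-linearity.

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class GNN(torch.nn.Module):
    """Two rounds of neighbourhood aggregation, as illustrated in Figure 5."""
    def __init__(self, num_features, hidden, num_classes):
        super().__init__()
        self.conv1 = GCNConv(num_features, hidden)
        self.conv2 = GCNConv(hidden, num_classes)

    def forward(self, x, edge_index):
        x = F.relu(self.conv1(x, edge_index))   # first update from neighbouring nodes
        return self.conv2(x, edge_index)        # second update from neighbouring nodes
```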

Based on that definition, certain relationships can be drawn between the structure of the represented graph and the characteristics of GNNs:

  1. Information aggregation in the represented network is performed via the layers of the GNN, which add information from the neighboring nodes. The more layers, the further information travels from a node.

  2. Small world networks (most nodes are separated by a short distance) do not need many layers.

  3. In theory, in graphs with a high average degree a GNN with few layers is enough to propagate the information of the nodes. However, the high average degree can cause saturation since the GNN nodes receive information from many nodes at the same time.

  4. Although increasing the number of layers of the GNN allows information to propagate between nodes, the task we perform at each node (e.g., prediction) may depend on the local neighborhood of a node or on distant information. Hence, two different graphs or problems but with similar structure (similar adjacency matrix) may require different GNNs.

  5. There are some invariant transformations of information. For example, the function that integrates the information of the neighboring nodes (e.g. summation) should be permutation invariant.

5 Experiments

5.1 Materials and Methods

In this section we evaluate the structure and priors of a particular problem with various differentiable programs (models) using the characteristics defined in the previous sections. This problem is the classification of scientific publications into one of six classes with the CiteSeer dataset yang2016revisiting. We use the PyTorch framework, especially the PyTorch Geometric library PyGeometric2019, together with a self-attention library.

The CiteSeer dataset is a citation network extracted from the CiteSeer digital library. Nodes are publications and the edges denote citations between publications. The network has 3327 nodes, each with 3703 features and one of 6 classes, and 9104 edges; it is undirected (each edge appears twice in the edge matrix). The average degree is 2.737, the average clustering coefficient is 0.141 and there are isolated nodes. The dataset is divided into 120 training nodes, 500 validation nodes and 1000 test nodes.
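A minimal sketch of loading this dataset with PyTorch Geometric's Planetoid loader (the root path is an arbitrary choice); the printed statistics correspond to the figures above.

```python
from torch_geometric.datasets import Planetoid

dataset = Planetoid(root='data/CiteSeer', name='CiteSeer')   # root path is an assumption
data = dataset[0]

print(data.num_nodes, dataset.num_features, dataset.num_classes)   # 3327 3703 6
print(data.num_edges)                     # 9104 (each undirected edge stored twice)
print(int(data.train_mask.sum()),         # 120 training nodes
      int(data.val_mask.sum()),           # 500 validation nodes
      int(data.test_mask.sum()))          # 1000 test nodes
```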

5.2 Results

First, we use non-linear transformations (neural networks) of the vectors of each node without taking into account the graph structure (links):

  1. Using a neural network (without any relational information) with two layers (input and output layer) and 59,366 parameters, with the node features as inputs and training with the training mask (120 nodes), the test accuracy is 58.2%. The model overfits because of the small number of training nodes and because it does not take into account the link information. Therefore, it generalizes poorly to unseen node representations.

  2. Using a neural network (without any relational information) with three layers, with the node features as inputs, intermediate dimensions of 512 and 256 (2,029,318 parameters) and training with the training mask, the test accuracy is 59%, that is, slightly above the 58.2% of the previous configuration. Using the validation mask (500 nodes) to train the model, the loss decreases more slowly and the test accuracy is around 68%.

Second, we apply graph neural networks by adding information from the neighboring nodes:

  1. Using a GNN with two graph convolutional layers and 59,366 parameters, the test accuracy is around 71.4%. If we use the validation mask to train, the test accuracy is 76.2%.

  2. Using a GNN with three graph convolutional layers, we see that the test accuracy decreases to 62% as the intermediate dimensions increase, probably because the model is overfitting the training data. If we lower the feature dimensions to 16, the test accuracy recovers to 67.4%. If we use the validation mask to train the model, we get a test accuracy of 73.3% with feature dimensions of 256 and 1,015,558 parameters.

Third, we use a GNN with two graph convolutional layers and 59,366 parameters but modifying the number of training nodes to assess the influence of the training size:

  1. When the first 620 nodes are used for training the GCN (to avoid overfitting), we get a test accuracy of 77%. We obtain the following training set size-test accuracy pairs: 50 nodes-62.1%, 70 nodes-63.6%, 120 nodes-71.4%, 400 nodes-77%, 500 nodes-77.3%, 620 nodes-77%, 800 nodes-76.8%, 1000 nodes-76.2%, 1500 nodes-76.9%. Using the validation mask to train, we get a test accuracy of 76.2%.
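A minimal sketch of how such a modified training mask can be built with the `data` object loaded earlier; the authors' exact code may differ, so this is only an assumed illustration.

```python
import torch

# Replace the default 120-node training mask with one that uses the first k nodes.
k = 620                                        # one of the training set sizes explored above
data.train_mask = torch.zeros(data.num_nodes, dtype=torch.bool)
data.train_mask[:k] = True
```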

Fourth, we use a GNN with two graph convolutional layers and 59,366 parameters but modifying the links (edges) of the original graph:

  1. Using random edges (directed edges) between the nodes and training with the training mask, the test accuracy is 16.5% compared to the 71.4% obtained with the correct edges.

  2. Using the training mask and removing (i) 20% of the edges gives a test accuracy of 67.7%; (ii) 25% removed, 68.7%; (iii) 33% removed, 67.1%; (iv) 50% removed, 68.1%; (v) 66% removed, 62%; (vi) 75% removed, 61.1%; and (vii) 80% removed, 59.2%; compared to the 71.4% test accuracy obtained with the correct edges and the 58.2% of the neural network.
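A minimal sketch of one way to remove a given percentage of edges before training, reusing the `data` object loaded earlier. The authors' exact procedure (e.g., whether both directions of an undirected edge are dropped together) is not specified, so this is only an assumed illustration.

```python
import torch

# Keep a random fraction of the (directed) entries of the edge matrix.
keep_fraction = 0.75                           # e.g., 25% of the edges removed
num_edges = data.edge_index.size(1)
keep = torch.rand(num_edges) < keep_fraction
data.edge_index = data.edge_index[:, keep]
```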

Finally, we design a model with a self-attention module between graph nodes, pre-linear and post-linear layers for adjusting dimensions, and a dropout layer. The model has 503,286 parameters and we modify the attention mask to prevent attention to certain positions:

  1. We apply attention to all the nodes of the graph to learn the importance weights between the nodes. Then, training with the training mask the accuracy is 18.10%, while training with the first 1500 nodes the accuracy is 23.10%.

  2. We apply attention only within each node, that is, each node attends only to itself. Then, training with the training mask the accuracy is 51.30%, while training with the first 1500 nodes the accuracy is 69.10%.

  3. We apply attention only between neighboring nodes, thus respecting the graph structure and learning the weights between neighboring nodes. Then, training with the training mask the accuracy is 65.40%, while training with the first 1500 nodes the accuracy is 73%.
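A minimal sketch of building the attention mask for the third configuration, reusing the `data` object loaded earlier. How the mask is consumed depends on the self-attention library, so only its construction is shown; this is an assumed illustration, not the authors' code.

```python
import torch

# Boolean attention mask: node i may attend to node j only if (i, j) is an edge
# of the citation graph, plus itself. True marks allowed positions.
n = data.num_nodes
mask = torch.eye(n, dtype=torch.bool)                  # each node attends to itself
mask[data.edge_index[0], data.edge_index[1]] = True    # ... and to its neighbours

# The mask is then passed to the self-attention module so that the scores of
# disallowed positions are set to -inf before the softmax.
```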

5.3 Discussion

The flexibility of differentiable programming, as a generalization of deep learning, allows us to use very diverse structures to model a specific problem. In Section 5.2 we have evaluated the performance of the model on the CiteSeer dataset by incorporating the structure of the problem to a different degree.

When we used conventional neural networks, without incorporating the links of the graph, the performance was acceptable (59%). Increasing the number of layers and parameters of the neural network did not significantly increase accuracy.

When we applied GNNs by adding information from the neighboring nodes, the accuracy increased significantly, even more when we increased the number of training nodes (77%).

However, when we added more convolutional layers to the GNN, the test accuracy decreased. As we have seen in Section 4.3, in a graph such as CiteSeer, with an average degree of 2.74, increasing the number of layers helps to propagate information between distant nodes. But the task that we perform at each node or document (classification) probably depends only on the local neighborhood and not on distant nodes, so, in such cases, we do not need more layers in the GNN.

Also, when we purposely changed the links of the graph when implementing the GNN, the accuracy decreased. When we used random links between the nodes, the accuracy was very poor (16.5%). Removing an increasing percentage of the links lowered the accuracy, approaching the accuracy achieved using plain neural networks instead of GNNs.

Finally, we evaluated the more general transformation, self-attention. As stated in Section 4.1, it transforms a set of vectors (nodes), dynamically choosing the most important ones. The model has to learn the graph structure. When we applied attention to all the nodes of the graph to learn the importance weights between the nodes, the accuracy was very low (18.1%). When we applied attention only between neighboring nodes, thus respecting the graph structure, accuracy went up to 73%. Therefore, general differentiable programs such as self-attention do not work well when applied to structured problems without a large amount of data.

6 The limits of differentiable models and the path to new learning strategies

As we have seen, in differentiable models the learning algorithm takes as input a sequence of training samples and outputs a function $f$, from the hypothesis space of functions $\mathcal{H}$, with a low probability of error $P(f(x) \neq y)$. These models are composed of arbitrary transformations of tensors and, if a transformation has parameters to be learnt, it has to be differentiable. Differentiable models thus learn the transformations of the input that minimize the error in predicting the training data.

But the issue with such models is that they do not learn certain aspects of the task that are later combined to generate new knowledge. In particular, they cannot generate new tasks or information without additional training data. As a result, it is not possible to build an increasingly complex world model with just a differentiable model.

By contrast, humans can correctly interpret novel combinations of existing concepts, even if those combinations have not been seen previously. It could be said that differentiable models scale well in data (when we increase the input data) but not in tasks, while in humans it is the other way around.

We need learning strategies that leverage previous data and models to generate new experiences and knowledge. These new learning strategies should have the following characteristics.

  1. Be similar to the learning strategies developed by humans but adapted to the machine learning framework. As pointed out in Laird_Learning_2018, humans have two levels of learning. Level 1, which is more automatic, continuous and unconscious, and level 2, which comprises the deliberate learning strategies that create the experiences for level 1.

  2. In the framework of machine learning, level 1 corresponds to differentiable models that approximate a function and level 2 to learning strategies that, based on previous data and models, create new experiences and tasks.

  3. These learning strategies are algorithmic transformations of existing models and data. They build an increasingly complex model of the world and can be more logical (some logical combination of models characteristics or outputs) or automatic.

  4. Given the training data and task for each model, a learning strategy is an algorithm that takes as input transformations of all the previous information (training data, model information, etc.) and outputs a new model that, when presented with an input and a task not seen in the training data, is capable of producing the correct output for this task.

  5. They can be based on any previous information: Priors of the models, information of the trained models (intermediate variables, output), training data, automatically generated labels, etc.

We describe here some illustrative examples of these strategies.

  1. Transformations of training data that scale in tasks. A large amount of data (e.g., text) is collected in the form of (input, output) pairs and the differentiable model learns to map a bigger space, from (input) to (input, task). The model predicts an output by also conditioning on the task in addition to the input and previous output. In this way, a large amount of data is transformed and adapted to build differentiable models that scale in tasks. GPT-X models radford2019language; brown2020language, based on language models which are trained in an unsupervised manner on webpage datasets, are a great advance in this line.

  2. Transformation using model priors. We first train one or several differentiable models (e.g., GNNs or attention models) used for some tasks. Each of these models has a series of internal components or variables with a physical meaning. Then we apply some machine learning technique (e.g., another differentiable model, symbolic regression, etc.) to extract some relations between the variables of the models. In this way we can generalize and perform new tasks. For example, if we have several attentional convolutional models that perform different tasks on an image, we can relate the attention weights of the different models.

  3. Control and guidance of a program. In an algorithm with a sequence of steps, some steps can depend on pre-trained functions. Such is the case when decisions are guided by pre-trained models without reward. Algorithm 1 below illustrates this situation with the purchase or sale of raw materials based on the prediction of oil prices and interest rates.

     Get the oil prices input data
     Get the interest rates input data
     Use a pre-trained model to predict the trend of oil prices
     Use a pre-trained model to predict the trend of interest rates
     if both trends are positive then purchase     (buy if there is a positive trend in both assets)
     else
          if only one trend is positive then don't do anything
          else
               if both trends are negative then sell
               end if
          end if
     end if
     Algorithm 1: An algorithm for guided decisions based on pre-trained models.
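A sketch of Algorithm 1 in Python, assuming hypothetical pre-trained models exposing a `predict_trend` method that returns a positive value for an upward trend; the method name, thresholds and the branch conditions beyond the first are assumptions made for illustration.

```python
def guided_decision(oil_prices, interest_rates, oil_model, rates_model):
    """Buy, hold or sell raw materials guided by pre-trained trend models (no reward)."""
    oil_trend = oil_model.predict_trend(oil_prices)          # hypothetical pre-trained model
    rates_trend = rates_model.predict_trend(interest_rates)  # hypothetical pre-trained model
    if oil_trend > 0 and rates_trend > 0:
        return "buy"        # positive trend in both assets
    elif oil_trend > 0 or rates_trend > 0:
        return "hold"       # mixed signals: do nothing
    else:
        return "sell"
```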

In sum, artificial intelligence models need to build an increasingly complex model of the world. In this section we have discussed the need, characteristics and examples of these new learning strategies. Nevertheless, much work remains to be done to characterize and specify the learning strategies that leverage existing knowledge (models and data), in a similar way as humans do. Furthermore, these strategies should be as automatic as possible and rely less on human engineering.

7 Conclusions

Differentiable programs are very flexible and give rise to a wide variety of structures. This flexibility makes it easy to generate the restricted hypothesis space, translating the prior knowledge and structure of the problem to the elements of a differentiable program, i.e., to a tuple $(n, m, E, \{f_i\}, \{v_i\})$.

Similar to classical programming, it is useful to define some characteristics to incorporate the knowledge and structure of the problem to be solved. These characteristics allow us to analyse the differentiable structures that are used in several domains or models and to define new structures with new capabilities.

In the experiments, we have seen that in problems with little training data it is important to incorporate the structure of the problem into the models rather than using very general models.

We have also seen that a shortcoming of differentiable models is that they just minimize an error to predict the data, but they cannot generate new tasks or information without additional training data. We have presented characteristics and examples of new learning strategies that leverage existing knowledge (models and data) to generate new knowledge, in a similar way as humans do. However, much work remains to be done in this regard.

Author contributions: Writing–original draft preparation and experiments, A.H.; Writing–review, G.M. and J.M.A.; Content discussion and final version, A.H., G.M. and J.M.A.

This work was supported by the Spanish Ministry of Science and Innovation, grant PID2019-108654GB-I00.

Acknowledgements.
The authors declare no conflict of interest. The funders had no role in the writing of the manuscript, or in the decision to publish the results.
Data Availability Statement: The original data used for the simulations can be obtained from https://github.com/kimiyoung/planetoid/tree/master/data
Computer Code and Software: The developed source code is available at https://github.com/adrianhernr/DPPaperExperiments
The following abbreviations are used in this manuscript:
CNN Convolutional neural network
FNN Feed-forward neural network
GNN Graph neural network
GPT-3 Generative Pre-Trained Transformer-3
GPU Graphics Processing Units
LSTM Long short-term memory
RNN Recurrent Neural Network

References