The Dynamics of Differential Learning I: Information-Dynamics and Task Reachability

October 4, 2018 · Alessandro Achille, et al.

We study the topology of the space of learning tasks, which is critical to understanding transfer learning, whereby a model such as a deep neural network is pre-trained on a task and then used on a different one after some fine-tuning. First we show that, using the Kolmogorov structure function, we can define a distance between tasks that is independent of any particular model used and that, empirically, correlates with the semantic similarity between tasks. Then, using a path integral approximation, we show that this distance plays a central role in the learning dynamics of deep networks, and in particular in the reachability of one task from another. We show that the probability of paths connecting two tasks is asymmetric and has a static component that depends on the geometry of the loss function, in particular on the curvature, and a dynamic component that is model dependent and relates to the ease of traversing such paths. Surprisingly, the static component corresponds to the distance derived from the Kolmogorov structure function. Together with the dynamic component, this gives strict lower bounds on the complexity necessary to learn a task starting from the solution to another. Our analysis also explains more complex phenomena whereby semantically similar tasks may be unreachable from one another, a phenomenon called Information Plasticity and observed in diverse learning systems such as animals and deep neural networks.


1 Introduction

Among the many virtues of deep neural networks is their transferability: one can train a model for a task (e.g., finding cats and dogs in images), and then use it for another (e.g., outlining tumors in mammograms) with relatively little effort. Sometimes it works. Alas, little is known about how to predict whether such transfer learning (TL) or domain adaptation (DA) will work, and if so how much effort is going to be needed, without just trying and seeing. It is not a given that training on a sufficiently rich task, and then fine-tuning on anything else, must succeed. Indeed, slight changes in the statistics of the data can make a task unreachable [1].¹

¹ We introduce the notion of reachability of a task in Section 2.3.

At the most fundamental level, understanding transfer learning or domain adaptation requires understanding the topology and geometry of the space of tasks. When are two tasks “close”? Can one measure the distance between tasks without actually running an experiment? Does knowing this distance help predict whether transfer learning is possible, and if so how many resources or time will be needed?

Surely the distance between tasks is not just the lexicographic distance between label sets in a taxonomy: the experiments in [1] show that even for the same label set, a task can be unreachable. Surely it is also not just the distance between two sets of parameters in a model (say, a deep neural network) trained for the task: there are many symmetries and large subsets in parameter space that implement the same model. These questions are fundamental because they do not concern a particular choice of model (e.g., neural networks) or optimization scheme (e.g., stochastic gradient descent, SGD). They are questions about the learnability of tasks, and the transferability from one task to another. But what is a task? What is its complexity? What is its structure?

1.1 Contributions in relation to prior work

There is growing interest in characterizing the similarity between tasks, and in making the process of transfer learning more systematic and predictable. After a first draft of this manuscript was completed, several works have appeared that describe relations between tasks empirically [12].

Rather than defining a proper (symmetric) distance between tasks, we use tools from quantum physics to characterize the probability of reaching one task from another. We then show that this can be decomposed into two parts: one that is static, and only depends on the geometry of the loss function that defines the learning problem; the other is dynamic, in the sense that it does not depend on the length of the path between two tasks, but on the effort needed to traverse it. Effort is related to time in a stochastic minimization procedure such as stochastic gradient descent. We refer to the resulting concept as the reachability of a task $\mathcal{D}_2$ from a task $\mathcal{D}_1$, rather than a “distance” between $\mathcal{D}_1$ and $\mathcal{D}_2$, since two tasks can be very similar both in terms of semantics (same label space) and data (slight perturbations of the same training set), and yet one can observe empirically that it is not possible to successfully fine-tune a model trained on one task to work on the other [1].

We prove that, somewhat surprisingly, the static component of the reachability cost is related to a notion of distance that can be defined abstractly using Kolmogorov complexity theory, and in particular Kolmogorov’s Structure Function. We show that the reachability function correlates with the ease of transfer learning. Kolmogorov complexity has previously seen applications in machine learning; in particular, [9] follows a similar approach to ours in defining the complexity of a dataset. However, to the best of our knowledge no previous work has linked the training dynamics of a deep neural network with the Kolmogorov Structure Function of a task.

Our work puts the emphasis on the importance of learning dynamics, not just asymptotics, in the analysis of learning, and in particular of deep neural networks. Our approach relies on Kramers’ rate theory for multi-dimensional systems [8, 4]. Using a path-integral approach, we express the probability of individual learning trajectories as an integral of the exponential of a suitably defined functional [5]. We then use a dominant path approximation to predict the qualitative behaviour of the transition rates.

2 Static distance between tasks

In this section we formalize the notion of a task and introduce an abstract notion of distance between tasks.

2.1 Kolmogorov’s Structure Function

A task is a random variable $y$ that we want to infer given an observation $x$, and given a training set $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^N$ consisting of data samples $x_i$ and their corresponding task values $y_i$. A task may therefore be identified with the given dataset $\mathcal{D}$, or with an approximation of the posterior density $p(y \mid x, \mathcal{D})$ given the dataset. Of course, without additional hypotheses, there is no unique or right solution to the inference problem: any label assignment on unseen data may, in principle, be correct.

Moreover, even the labels of seen data may be ambiguous: if the dataset is composed of a sequence of uniformly random labels, should we output the memorized labels, i.e., the posterior $p(y_i \mid x_i) = \delta_{y_i}$, or should we rather output a uniform posterior, i.e., $p(y \mid x_i) = 1/K$ for each of the $K$ classes? Notice that in one case our task has a very complex structure, since we need to memorize several labels, while in the other the description of the task would be trivial.

An elegant approach to these issues was proposed by Kolmogorov [11], which we present here through the equivalent, but more convenient, formalism of Rissanen’s minimum description length (MDL) framework. Define the Kolmogorov structure function of a dataset (task) $\mathcal{D}$ as:

S_{\mathcal{D}}(\ell) := \min_p \{\, K(p) : L_{\mathcal{D}}(p) \le \ell \,\}    (1)

where $K(p)$ is the complexity of a model $p$ and $L_{\mathcal{D}}(p)$ is the cross-entropy loss of the model’s predictions on the dataset; that is, $S_{\mathcal{D}}(\ell)$ denotes the minimum complexity of a model such that the cross-entropy loss of the model’s predictions on the dataset is less than $\ell$. This definition has the important quality of recognizing that there is no single definition of the complexity of a task: depending on the level of accuracy we want to achieve, different complexities are required.

A related quantity, which was studied in [2] and will play a central role in our analysis, is the Lagrangian associated with the minimization problem of Equation 1, which may be written as:

C_\beta(\mathcal{D}) := \min_p \, L_{\mathcal{D}}(p) + \beta K(p)    (2)

Intuitively, this may be thought of as the total cost of encoding the data using the model $p$, when the cost of encoding the model is discounted by a factor $\beta$. For this reason, we refer to $C_\beta(\mathcal{D})$ as the complexity of the data at level $\beta$. This quantity has also been studied under the name Information Bottleneck Lagrangian of the weights [3], and for $\beta = 1$ it reduces to the standard loss of the MDL framework, and to the Evidence Lower Bound (ELBO) of variational inference.
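As a numerical illustration of eq. 2 (ours, not from the paper), the following sketch sweeps $\beta$ over a toy complexity/loss frontier; the exponential shape of the frontier and all constants are assumptions chosen for illustration. Each $\beta$ selects the point of the frontier where the marginal loss reduction no longer justifies the marginal complexity, so sweeping $\beta$ traces the structure function.

```python
import numpy as np

# Candidate models on an assumed convex complexity/loss frontier: richer models
# (higher K) achieve lower cross-entropy loss on the dataset. Minimizing
# L + beta * K picks one point of the frontier per beta.
K = np.linspace(0.0, 100.0, 201)       # model complexity K(p), in nats (assumed grid)
L = 500.0 * np.exp(-K / 20.0)          # loss L_D(p) achievable at complexity K (assumed form)

for beta in [0.5, 1.0, 2.0, 5.0, 20.0]:
    i = int(np.argmin(L + beta * K))   # C_beta(D) = min_p L_D(p) + beta * K(p)
    print(f"beta={beta:5.1f}  K*={K[i]:6.1f}  L*={L[i]:7.2f}  "
          f"C_beta={L[i] + beta * K[i]:8.2f}")
```

As $\beta$ grows, the selected model becomes simpler and the tolerated loss larger, which is exactly the trade-off the structure function encodes.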

Kolmogorov’s structure function has several well-known theoretical properties that make it appealing [11], but we consider it for a different reason: it is intrinsically related to the learning dynamics of deep neural networks (DNNs). To see this, we first need to introduce the complexity of a DNN.

2.2 The complexity of a DNN

In Kolmogorov’s formulation, the complexity of a model would be the length of the smallest program encoding the network’s weights. The naive way of encoding a DNN is to save a fixed floating-point quantization of the weights. This is clearly suboptimal, since it does not exploit that: (a) close weight configurations give a similar output (especially in a local minimum), and (b) trained neural networks tend to be very low-dimensional, in the sense that only a subset of the weights is actually useful.

We can define a better encoding scheme for the weights as follows. For a fixed architecture, define the model class

\mathcal{Q} = \{\, q(w) : q(w) \text{ is a normal distribution } N(\mu, \Sigma) \,\}

and fix a prior $\pi(w)$. Notice that this is not to be interpreted as a Bayesian prior. Rather, following the MDL formalization of statistical inference, this is only a convenient way to describe an encoding algorithm.

Under this encoding, the cost of encoding the model once the prior is fixed is given by the KL-divergence $KL(q(w) \,\|\, \pi(w))$.
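For concreteness, here is a minimal sketch of this encoding cost in the factorized case, assuming a Gaussian $q(w) = N(\mu, \mathrm{diag}(\sigma^2))$ and an isotropic prior $N(0, \lambda^2 I)$; the specific choice of prior is our assumption for illustration.

```python
import numpy as np

def gaussian_kl(mu, sigma2, lam2):
    """Encoding cost KL( N(mu, diag(sigma2)) || N(0, lam2 * I) ) in nats,
    for a factorized Gaussian q and an isotropic Gaussian prior."""
    return 0.5 * np.sum((mu**2 + sigma2) / lam2 - np.log(sigma2 / lam2) - 1.0)

# Weights that sit near zero with prior-scale variance are nearly free to
# encode; precise, large weights are expensive.
mu = np.array([0.0, 0.0, 3.0])
print(gaussian_kl(mu, sigma2=np.array([1.0, 1.0, 0.01]), lam2=1.0))
```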

As it turns out, this encoding cost has a nice relation with the original cross-entropy loss over the data. Let us consider again the Lagrangian from eq. 2 associated with the structure function in eq. 1. Under this model, we can rewrite the Lagrangian as

C_\beta(\mathcal{D}) = \min_{\mu, \Sigma} \; \mathbb{E}_{w \sim N(\mu, \Sigma)}[L_{\mathcal{D}}(w)] + \beta \, KL(q(w) \,\|\, \pi(w))

Let $w^* = \mu$ be a weight configuration. Approximating $L_{\mathcal{D}}(w)$ to the second order, we can immediately minimize out $\Sigma$. Let $H = \nabla^2_w L_{\mathcal{D}}(w^*)$ be the Hessian of $L_{\mathcal{D}}$ at the point $w^*$, and assume the quadratic approximation holds in a large enough neighborhood. Then, taking a Gaussian prior $\pi(w) = N(0, \lambda^2 I)$, we can rewrite the previous expression for $C_\beta(\mathcal{D})$ as:

C_\beta(\mathcal{D}) = \min_{\Sigma} \; L_{\mathcal{D}}(w^*) + \tfrac{1}{2}\operatorname{tr}(H\Sigma) + \beta \, KL(N(w^*, \Sigma) \,\|\, N(0, \lambda^2 I))

The gradient with respect to $\Sigma$ is

\tfrac{1}{2} H + \tfrac{\beta}{2}\big( \lambda^{-2} I - \Sigma^{-1} \big)

and setting it to zero, we obtain the minimizer $\Sigma^* = \beta (H + \beta \lambda^{-2} I)^{-1}$. Substituting this back in the previous expression we obtain

C_\beta(\mathcal{D}) \approx L_{\mathcal{D}}(w^*) + \frac{\beta}{2}\frac{\|w^*\|^2}{\lambda^2} + \frac{\beta}{2} \log\det\Big( \frac{\lambda^2}{\beta} H + I \Big)    (3)

It is interesting to notice that in a local minimum, the Hessian of the cross-entropy loss coincides with the Fisher Information Matrix [10], which gives a link between complexity and Fisher information.

This also gives a first link between the complexity of a task and DNNs: assuming the optimization algorithm is such that the final solution reaches an optimal compromise between flatness of the Hessian and value of the loss function, we can conclude that the network asymptotically minimizes the Kolmogorov structure function. In the following we will prove a stronger result: during training with learning rate annealing, SGD selects with high probability solutions that realize an annealing compromise between complexity and loss, therefore tracing the whole structure function during training.

2.3 Conditional structure function and task similarity

Let $\mathcal{D}_1$, $\mathcal{D}_2$ be two tasks. We can define the reachability at level $\beta$ of $\mathcal{D}_2$ given $\mathcal{D}_1$ as

C_\beta(\mathcal{D}_2 \mid \mathcal{D}_1) := C_\beta(\mathcal{D}_1 \oplus \mathcal{D}_2) - C_\beta(\mathcal{D}_1)

where $\mathcal{D}_1 \oplus \mathcal{D}_2$ denotes the concatenation of the two datasets. Intuitively, $C_\beta(\mathcal{D}_2 \mid \mathcal{D}_1)$ measures the additional complexity that we need to learn in order to solve the task $\mathcal{D}_2$ at the required complexity level, assuming we have already learned a solution for the task $\mathcal{D}_1$.

Notice that the definition of reachability is asymmetric. This is indeed a desirable property, as we expect it to be easier to learn a simple task after having trained on a related but more complex one, while going in the opposite direction should be harder.
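A small sketch of how this definition could be used in practice (ours, not from the paper). The names `reachability_matrix` and `complexity` are hypothetical; the toy “tasks” are sets of concepts with set union standing in for dataset concatenation, and set cardinality standing in for $C_\beta$. A real estimator would use the variational bound of Section 2.2 at a fixed $\beta$.

```python
import numpy as np

def reachability_matrix(tasks, complexity):
    """R[i, j] = C(D_i combined with D_j) - C(D_i): the extra complexity needed
    to solve task j once task i is solved. Asymmetric by construction."""
    n = len(tasks)
    R = np.zeros((n, n))
    for i in range(n):
        Ci = complexity(tasks[i])
        for j in range(n):
            R[i, j] = complexity(tasks[i] | tasks[j]) - Ci
    return R

# Toy stand-in tasks as sets of "concepts"; complexity = set size.
tasks = [{"edges", "textures"}, {"edges", "textures", "shapes"}, {"digits"}]
print(reachability_matrix(tasks, lambda d: float(len(d))))
```

Note that going from the richer second task to the first costs nothing in this toy, while the reverse direction does, mirroring the asymmetry discussed above.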

2.4 Critical periods: a static distance is not enough

Figure 3 shows that the task similarity computed using Kolmogorov’s Structure Function matches one’s intuition of which tasks should be considered similar. More importantly, such similarity correlates with how easy it is to train on one task and fine-tune for another.

However, there are tasks that are remarkably similar to one another, and yet it is not possible to fine-tune from one to the other. Take for instance two classification tasks with identical labels and identical images, except that in one of the datasets the images are slightly blurred. One would think that it would be quite easy to fine-tune from one task to the other. Yet, strong empirical evidence suggests this is not possible, regardless of the architecture, the learning method, and even the computational substrate, from deep neural networks to biological circuitry across many different species [1]. Therefore, something is amiss. Not only is the ease of fine-tuning from one task to another asymmetric (one can fine-tune from sharp images to blurry ones, but not vice-versa), but whether one can fine-tune at all depends on factors not captured by the static distance defined thus far.

In the next section, we argue that not only the length of the path, but also the ease of traversing it is critical to determine whether a task is reachable from another, and ultimately whether transfer learning is successful.

3 The Dynamic Distance between tasks and reachability

We adopt the common practice of training a deep neural network (DNN) using stochastic gradient descent (SGD) to guide the construction and computation of the distance between tasks, and later point to its universal properties. The focus on DNNs should therefore not be over-emphasized: for our purpose, they are just a convenient way of representing a task.

A DNN can be described by a function $f_w(x)$ that depends on parameters $w$ and is trained to approximate a sufficient representation of the posterior $p(y \mid x)$. One of the most common losses for a DNN is $L_{\mathcal{D}}(w)$, the empirical cross-entropy of the network predictions, that is, $L_{\mathcal{D}}(w) = -\sum_{i=1}^N \log p_w(y_i \mid x_i)$.

3.1 Probability of paths of SGD

We approximate the discrete evolution of a sample path of SGD by taking the temporal sampling step to the limit, obtaining a stochastic differential equation (SDE) of the form²

\dot{w}(t) = -\nabla U(w) + \sqrt{2D}\, n(t)    (4)

where $U(w) := L_{\mathcal{D}}(w)$ is the loss potential, $n(t)$ is white noise, and the diffusion coefficient $D$ is constant.

² The notation is a short-hand for the more common form $dw_t = -\nabla U(w_t)\, dt + \sqrt{2D}\, dW_t$, where $W_t$ is a Wiener process.

Given the SDE, we can derive a probability functional over paths, which is a valid probability density but cannot be normalized. This can be done formally using the Martin-Siggia-Rose formalism, which associates to each SDE a distribution over paths. Starting from an initial condition $w(0) = w_0$ at time $t = 0$, we have

P(w(\cdot) \mid w_0) \propto \exp\Big( -\int_0^T \mathcal{L}(w, \dot w)\, dt \Big)    (5)

where we have defined the Onsager-Machlup Lagrangian [6]

\mathcal{L}(w, \dot w) = \frac{1}{4D}\, \| \dot w + \nabla U(w) \|^2 - \frac{1}{2}\, \Delta U(w)    (6)

Intuitively, the density functional penalizes paths whose speed $\dot w$ does not match the negative gradient $-\nabla U(w)$, and adds a correction based on the divergence of the gradient field in order to account for the concentrating or dissipating effects of the potential. It is also intuitive that the density of paths of SGD is informative of the topology and geometry of the loss landscape, and in particular of critical points or regions corresponding to tasks.
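To make eqs. 4-6 concrete, the following sketch (ours; the double-well potential and all constants are illustrative assumptions) simulates one SGD-like sample path by Euler-Maruyama discretization and evaluates the discretized Onsager-Machlup action along it. Paths that fight the gradient accumulate a large action and are correspondingly improbable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy double-well "loss" U(w) = (w^2 - 1)^2 standing in for the landscape.
gradU = lambda w: 4.0 * w * (w**2 - 1.0)
lapU  = lambda w: 12.0 * w**2 - 4.0

D, dt, T = 0.2, 1e-3, 5.0
steps = int(T / dt)

w = np.empty(steps + 1)
w[0] = -1.0                                # start in the left well
for t in range(steps):                     # Euler-Maruyama discretization of eq. 4
    w[t + 1] = (w[t] - gradU(w[t]) * dt
                + np.sqrt(2.0 * D * dt) * rng.standard_normal())

# Discretized Onsager-Machlup action of the sampled path (eqs. 5-6):
wdot = np.diff(w) / dt
action = np.sum((wdot + gradU(w[:-1]))**2 / (4.0 * D)
                - 0.5 * lapU(w[:-1])) * dt
print(f"final w = {w[-1]:+.3f}, discretized OM action = {action:.1f}")
```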

3.2 The Effective Potential between tasks

One of the main objects of interest for us is the transition probability $P(w_f, T \mid w_0)$ between two points $w_0$ and $w_f$ in time $T$. This can be expressed, given the probability distribution over paths, as

P(w_f, T \mid w_0) = \int_{w(0)=w_0}^{w(T)=w_f} P(w(\cdot) \mid w_0)\, \mathcal{D}w(\cdot)    (7)

where the integral is over all paths such that $w(0) = w_0$ and $w(T) = w_f$. That is, the probability of reaching $w_f$ at the given time is the mass, or “volume,” of all paths reaching $w_f$. In the following, we will be interested in estimating this transition probability, which tells us which parts of the loss landscape are accessible in a given training time.³

³ This method is more direct than using the Fokker-Planck operator and integrating the corresponding PDE to determine the probability of paths.

Intuitively, we may expect the probability of reaching a point to depend on two separate factors: the energy gap between the initial and final configurations, and the existence of probable paths connecting them. To see this formally, notice that the path density in eq. 5 can be rewritten using the Stratonovich convention [7] as

P(w(\cdot) \mid w_0) \propto e^{-\frac{1}{2D}[U(w_T) - U(w_0)]} \exp\Big( -\int_0^T \Big[ \frac{1}{4D}\|\dot w\|^2 + \frac{1}{4D}\|\nabla U(w)\|^2 - \frac{1}{2}\Delta U(w) \Big] dt \Big)

We define the effective potential as

V(w) := \frac{1}{4D}\|\nabla U(w)\|^2 - \frac{1}{2}\Delta U(w)    (8)

so we can write

P(w(\cdot) \mid w_0) \propto e^{-\frac{1}{2D}[U(w_T) - U(w_0)]} \exp\Big( -\int_0^T \Big[ \frac{1}{4D}\|\dot w\|^2 + V(w) \Big] dt \Big)    (9)

3.3 Reachability of a task

Substituting this expression in eq. 7, we obtain a corresponding decomposition for the transition probability

P(w_f, T \mid w_0) = e^{-\frac{1}{2D}[U(w_f) - U(w_0)]} \int_{w(0)=w_0}^{w(T)=w_f} \exp\Big( -\int_0^T \Big[ \frac{1}{4D}\|\dot w\|^2 + V(w) \Big] dt \Big)\, \mathcal{D}w(\cdot)    (10)

The first factor is static, in the sense that it depends only on the initial and final configurations and is independent of the path used to connect them. The first configuration could be a task for which the network is pre-trained (with weights $w_0$); the second configuration could be the target task for transfer learning.

The second factor quantifies the volume of likely paths connecting the two endpoints. It is called reachability because, regardless of how large the drop in static potential is, the absence of probable paths between the endpoints makes transfer learning impossible.

Note that reachability depends on the tasks (i.e., the data), but also on the particular class of functions (architecture) used to learn them (i.e., the weights). In the next section we develop tools to compute a reachability that does not depend on the architecture and can be quantified by information quantities.
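The transition probability of eq. 7 can also be estimated directly by sampling. The sketch below (ours, on the same illustrative double-well landscape used earlier) measures the fraction of simulated paths that reach the target well within time $T$ for several diffusion constants: at low $D$ the target is effectively unreachable in the given time, even though the static potential drop is unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
gradU = lambda w: 4.0 * w * (w**2 - 1.0)   # same double well as above

dt, T, n_paths = 1e-3, 5.0, 2000
steps = int(T / dt)

# Monte-Carlo estimate of the transition probability (eq. 7): the fraction of
# sampled paths that cross from the left well (w = -1) to the right one.
for D in [0.05, 0.1, 0.2, 0.4]:
    w = -np.ones(n_paths)
    for _ in range(steps):
        w += -gradU(w) * dt + np.sqrt(2.0 * D * dt) * rng.standard_normal(n_paths)
    print(f"D={D:.2f}: fraction of paths reaching the right well = "
          f"{(w > 0.5).mean():.3f}")
```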

3.4 Curvature, most likely paths and Lagrange approximation

In this section we derive an approximation of eq. 10 which we use to show that, to first approximation, the most likely path is deterministic and follows an effective potential

\tilde U(w) = U(w) + \frac{D}{2} \log |H(w)|_+

where $|H(w)|_+$ denotes the determinant of the positive part of the Hessian $H(w) = \nabla^2 U(w)$, i.e., the product of all its positive eigenvalues. That is, the potential needs to be corrected in order to account for the local curvature, and the amount of correction depends on the temperature.

One consequence of this fact is that sharp minima may not be minima at all of this effective potential when the temperature is sufficiently high. We will also show that the dynamic part of the potential can create spurious local minima that can inhibit learning of new problems in a transfer learning scenario. We will later also connect the curvature to the amount of information needed to solve a task. Using this connection, we will be able to characterize the “learnability” (reachability) of a task in terms of information-theoretic properties of the data. This completes our program of characterizing the geometry and topology of the space of tasks in a manner that does not depend on how the task is actually learned.

To start with, we make the Lagrange (or saddle point) approximation: that given two points $w_0$ and $w_f$, the probability concentrates around a few most likely paths joining $w_0$ and $w_f$ that are local maxima of the probability density functional. In other words, all probable paths can be obtained as perturbations of a few critical paths (or activation trajectories). If the critical paths are sufficiently separated, we can estimate the total probability by approximating each cluster as a Gaussian centered around the cluster maximum.

The local maxima of the path density can be found by minimizing the action in eq. 5 (or, equivalently, the simplified action in eq. 10). Using the Euler-Lagrange equation $\frac{d}{dt}\frac{\partial \mathcal{L}}{\partial \dot w} = \frac{\partial \mathcal{L}}{\partial w}$, we obtain that critical paths satisfy the differential equation

\ddot{w}(t) = 2D\, \nabla V(w)

where the effective potential $V(w)$ is the same that appears in the decomposition in eq. 10. We observe that the Laplacian of the potential acts as a drag term in this expression. Therefore, depending on the temperature, the critical paths move more slowly where the curvature increases, which will play a role later.

For ease of exposition, let us assume for the moment that there is only one critical path $w^*(t)$ between $w_0$ and $w_f$ that satisfies the above equation. Furthermore, let us assume that the path is along a coordinate axis, so that $w = (x, y)$ with $x$ the coordinate along the path and $y$ the transverse coordinates, and approximate the potential up to second order around the path as

U(x, y) \approx U(x, 0) + \frac{1}{2}\, y^T H_\perp(x)\, y

where $H_\perp(x)$ is the Hessian of $U$ in the transverse directions. The Lagrangian associated with this process is

\mathcal{L} = \frac{1}{4D}\,\dot x^2 + \frac{1}{4D}\,\big\| \dot y + H_\perp(x)\, y \big\|^2 + R(x, y, \dot x)    (11)

The first term in eq. (11) accounts for the diffusion along the $x$ direction. The third term $R$ contains both derivatives of $H_\perp(x)$ and second-order terms in $y$; we can neglect it if we assume the validity of the saddle point approximation and that $\partial_x H_\perp(x) \approx 0$, that is, that the curvature varies slowly enough along the path. Since we are mainly interested in the dynamics along the $x$ coordinate, we can integrate out the variable $y$ from eq. (5). We then obtain

P(x(\cdot)) \propto \int e^{-\int_0^T \mathcal{L}\, dt}\, \mathcal{D}y(\cdot)    (12)

When the diffusion in the $y$ direction is much faster than the dynamics along $x$, we can replace the integral with the local equilibrium distribution of $y$ at a fixed $x$. The final expression for the marginalized probability density is

P(x(\cdot)) \propto \exp\Big( -\int_0^T \Big[ \frac{1}{4D}\,\dot x^2 + V(x, 0) \Big] dt \Big)\, |H_\perp(x(T))|_+^{-1/2}    (13)

Under this approximation, we finally obtain that the probability of reaching a point $w$ in a given time $T$ is given by

P(w, T \mid w_0) \propto \exp\Big( -\frac{1}{2D}\big[ \tilde U(w) - \tilde U(w_0) \big] \Big), \qquad \tilde U(w) = U(w) + \frac{D}{2}\log|H(w)|_+    (14)

We then see that the convergence to a minimum is controlled by the effective potential $\tilde U(w)$, which corrects the original potential by a term that depends on both the diffusion constant $D$ and the curvature (determinant of the Hessian) at that point. In particular, local minima of $U(w)$ may not be minima of the effective potential $\tilde U(w)$, so they are unlikely to be reached using SGD. Recall that, for a fixed learning rate, the diffusion coefficient scales as $D \propto c/B$, where $B$ is the batch size and $c$ is a constant that depends on the architecture.
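A worked toy comparison of the effective potential at two minima (all numbers are illustrative assumptions, not measurements): as $D$ grows, the curvature penalty $\frac{D}{2}\log|H|_+$ makes the sharp minimum lose to the flat one, even though the sharp one has lower loss.

```python
import numpy as np

# Two candidate minima of the loss: a sharp one (lower loss, high curvature)
# and a flat one (slightly higher loss, low curvature). Assumed values.
minima = {"sharp": (0.00, 120.0),   # (loss U, positive-part Hessian determinant)
          "flat":  (0.20,   1.0)}

for D in [0.0, 0.02, 0.1, 0.3]:
    # Effective potential of eq. 14: F = U + (D/2) log |H|_+
    F = {name: U + 0.5 * D * np.log(h) for name, (U, h) in minima.items()}
    best = min(F, key=F.get)
    print(f"D={D:.2f}  F_sharp={F['sharp']:+.3f}  F_flat={F['flat']:+.3f}"
          f"  ->  {best}")
```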

From eq. 14 we can derive the Kramers convergence rate $k \propto e^{-\frac{1}{2D}\Delta \tilde U}$, with $\Delta \tilde U$ the effective-potential gap; its inverse is the expected time to reach a minimum assuming we start from a saddle point and there is only one minimum.

We will now proceed to establish a link between the curvature and the structure function of the task, therefore linking the learning dynamics with the structure of the data, which will also give us an empirically verifiable relationship (Section 4).

3.5 The structure function of the data gives a lower bound on the complexity for SGD to reach that level of precision

Recall that the complexity of a network under our model is given by:

C_\beta(\mathcal{D}) \approx L_{\mathcal{D}}(w) + \frac{\beta}{2}\frac{\|w\|^2}{\lambda^2} + \frac{\beta}{2}\log|H(w)|_+

On the other hand, when the network is trained with weight decay with coefficient $\beta/\lambda^2$, the effective potential minimized by the network is given by:

\tilde U(w) = L_{\mathcal{D}}(w) + \frac{\beta}{2}\frac{\|w\|^2}{\lambda^2} + \frac{D}{2}\log|H(w)|_+

By letting $\beta = D$, we obtain that the effective potential that affects the network while training with SGD is exactly the complexity of the dataset at level $\beta$. Therefore, we may rewrite the static term of the transition probability as:

e^{-\frac{1}{2D}[\tilde U(w_f) - \tilde U(w_0)]} = e^{-\frac{1}{2D}\big[ C_\beta(\mathcal{D})\big|_{w_f} - C_\beta(\mathcal{D})\big|_{w_0} \big]}

This has the important implication that the transition probability, and therefore the Kramers rate of convergence, is bounded by a static part that depends solely on the complexity of the task, or, more generally, on the difference in complexity between tasks when fine-tuning. To this, however, we must add a dynamic term that also depends on the architecture of the network and the geometry of the loss landscape, and may in general not be trivial.
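As a worked example of how the static bound might be used (all numbers are placeholders, and the dynamic term is ignored, so this is only the static part of the story):

```python
import numpy as np

# The expected convergence time is at least proportional to exp(Delta C / 2D),
# where Delta C is the complexity gap between the fine-tuning target and the
# starting task. Assumed, illustrative estimates:
C_start, C_target = 120.0, 200.0   # complexities C_beta, in nats
D = 5.0                            # diffusion constant of the SGD run
bound = np.exp((C_target - C_start) / (2.0 * D))
print(f"static lower bound on relative convergence time: {bound:.0f}x")
```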

4 Experiments

Figure 1: (Left) Plot of the time needed for the network to converge (minimize the loss below a certain threshold) vs. the batch size, for AlexNet on CIFAR-10 with random labels. (Right) Plot of the time needed for the network to converge vs. the number of random labels in the dataset. In both cases, the trend of the empirical curve (blue) follows the theoretical prediction (green), where the coefficients of the theoretical prediction are fitted from the data.
Figure 2: (Left) For several architectures, plot of the time needed for the network to converge (minimize the loss below a certain threshold) vs. the estimated complexity of the task. (Right) For different learning rates, plot of the number of steps needed to converge to a given threshold on CIFAR-10 using a ResNet-18 architecture, as the batch size changes.

4.1 Convergence time for different datasets

In Section 3.5 we have seen that the Kramers rate for convergence, to first approximation and ignoring the contribution of the dynamic part of the transition probability, gives an expected convergence time

T \propto e^{C_\beta(\mathcal{D}) / 2D}

This gives an empirically verifiable law to test our model: in Figure 2 we plot the time (number of SGD steps) needed by different architectures to converge on several different datasets. We can see that, as expected, different architectures have different parameters that regulate how the complexity affects the convergence time, but for a fixed architecture and hyperparameters, the time to converge depends mainly on the complexity of the task alone.

Random labels. The case of random labels is of particular theoretical interest since, provided the value of $\beta$ is below the critical point that allows memorization of the labels, the complexity scales linearly with the number of random labels. In Figure 1 we show that, in accordance with our model’s prediction, the time to converge scales with the complexity of the dataset, i.e., in this case, with the number of random labels in the dataset.

Changing the batch size. Another way we can act on the time to converge is to change the diffusion constant of the network: we know that for a fixed learning rate the diffusion constant scales as $D \propto c/B$, where $B$ is the batch size. Figure 2 (Right) shows that changing the batch size changes the time to convergence, following the predicted trend.
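A sketch of the corresponding trend test (ours; the $(B, T)$ pairs below are placeholders, not measured values): with $D \propto 1/B$ at a fixed learning rate, a Kramers-like convergence time $T \propto e^{a/D}$ predicts $\log T$ linear in $B$, which a simple regression can check.

```python
import numpy as np

# Placeholder measurements of (batch size, steps to reach a loss threshold).
B = np.array([32.0, 64.0, 128.0, 256.0, 512.0])
T = np.array([900.0, 1400.0, 2600.0, 7000.0, 30000.0])

# If T ~ exp(a * B / c), then log T is linear in B.
slope, intercept = np.polyfit(B, np.log(T), 1)
print(f"fit: log T ≈ {slope:.4f} * B + {intercept:.2f}")
print("predicted T:", np.round(np.exp(slope * B + intercept)))
```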

4.2 Time to fine-tune between tasks

In the previous section we tested the relation between the complexity of the task and the time employed by the network to converge, starting from a random initialization. In practice, we may start from the minimizer of another task, rather than from a random initialization (fine-tuning). In this case, we expect the time to converge to depend not on the complexity of the task, but rather on the reachability of the new task from the previous one (Section 2.3).

In Figure 3 (Left) we show the reachability matrix computed using the definition in Section 2.3 for several popular datasets. Notice that this matrix makes intuitive sense: semantically similar tasks are closer to each other, and it is generally easier to learn a task after training on a more complex, related task (such as going from CIFAR-100 to CIFAR-10) than it is to learn a complex task starting from a simple one (e.g., going from MNIST to CIFAR-100).

This may be compared with the matrix of the time necessary to fine-tune (train until we reach some loss threshold), which we show in Figure 3 (Center). More precisely, in Figure 3 (Right) we show the relation between the time to fine-tune and the reachability for several pairs of datasets, which again follows the theoretical prediction.

Figure 3: (Left) Matrix of the reachability between tasks, which we defined in Section 2.3 based on the relative Kolmogorov complexity of the tasks. Each element of the matrix shows the reachability between classification tasks (defined by their dataset), going from the task in the column to the task in the row. Notice that semantically similar tasks are close to each other, and that it is easier to go from a complex task to a related simple task than to go in the other direction. (Center) This may be compared to the time (number of steps) necessary to reach a given error threshold when starting the training from the solution of another task. (Right) Scatter plot of the relation between the number of steps necessary to converge and the reachability of two datasets.

4.3 Effective potential for random labels

Equation 14 defines the static potential. If the dimension of the weight vector is large enough, the static potential can be rewritten as

\tilde U(w) = U(w) + a\, D \log |H(w)|_+ + b    (15)

where $a$ and $b$ are constants that may depend on the details of the network, dataset, and optimization algorithm.

Having a new potential that takes the curvature into account, it is natural to check whether it is able to describe the system on its own. This amounts to neglecting the dynamic term in eq. 10, and it is the best description of the dynamics we can write using a conservative force. To see empirically whether this approximation holds, we use the static potential to define an energy-conserving dynamics

\frac{1}{2}\|\dot w(t)\|^2 + \tilde U(w(t)) = E    (16)

The values of the constants $a$ and $b$ depend on the learning rate and can be found by linear regression. In Figure 4 we compare the two terms that appear in eq. 16 for the ten-random-labels task on a ResNet-18. We find that $\tilde U$ correctly describes the dynamics of the system.

Figure 4: Kinetic energy as a function of time. The orange line is the numerical value, while the blue line is the theoretical prediction.

For the considered schedule and architecture, the $a D \log |H(w)|_+$ term dominates from epoch 150 to epoch 250. In this region we further test the validity of the equation by considering other expressions for the curvature term. Figure 5 shows that the determinant $|H(w)|_+$ is more suitable to use in $\tilde U$ than other quantities related to $H(w)$, such as the trace.

Figure 5: Kinetic energy as a function of time. The orange line is the numerical value, while the blue line is the theoretical prediction. The left plot uses the determinant of $H(w)$, while the right plot uses its trace.

5 Discussion

The ability of deep networks to function on tasks other than those they were trained on is one of the reasons for their recent widespread adoption. However, it is very difficult to predict whether such transfer learning will be successful, other than by just trying it. In this paper we have laid the foundations for quantifying the ease of transfer learning. This requires first defining and formally characterizing tasks, and then establishing some sort of topology in the space of tasks. To the best of our knowledge, we are the first to attempt this. We bring to bear tools from diverse fields, from Kolmogorov complexity to quantum physics, to enable defining and computing sensible notions of distance that correlate with the ease of transfer learning. In the process, we discover interesting connections.

The first is between the notion of task reachability, which we introduce, and the Kolmogorov Structure Function. This in turn is related to information-theoretic treatments of deep learning that have recently been developed [3]. Furthermore, our analysis points to the importance of analyzing the dynamics of learning, rather than just focusing on the asymptotics.

We recognize that our theory has several limitations. First, it has not yet been thoroughly put to the test empirically. While all evidence thus far is encouraging, additional evidence in support, or falsification, of the hypotheses developed here is forthcoming. Second, the theory has yet to point to better ways of doing transfer learning. Nevertheless, it is of practical value in that one could predict, before doing the experiment, the cost of training or fine-tuning a deep learning model for a task.

Acknowledgments

Supported by ONR MURI, ARO.

References

  • [1] A. Achille, M. Rovere, and S. Soatto. Critical Learning Periods in Deep Neural Networks. ArXiv e-prints, November 2017.
  • [2] Alessandro Achille, Glen Mbeng, Giovanni Paolini, and Stefano Soatto. Information complexity of tasks, their structure and their distance. Technical Report UCLA CSD: 180003, Department of Computer Science, University of California, Los Angeles, June 2018.
  • [3] Alessandro Achille and Stefano Soatto. On the Emergence of Invariance and Disentangling in Deep Representations. ArXiv e-prints, June 2017.
  • [4] B. Caroli, C. Caroli, and B. Roulet. Diffusion in a bistable potential: The functional integral approach. Journal of Statistical Physics, 26(1):83–111, Sep 1981.
  • [5] Katharine LC Hunt and John Ross. Path integral solutions of stochastic equations for nonlinear irreversible processes: the uniqueness of the thermodynamic lagrangian. The Journal of Chemical Physics, 75(2):976–984, 1981.
  • [6] Paul M Hunt, Katharine LC Hunt, and John Ross. Path integral solutions for fokker–planck conditional propagators in nonequilibrium systems: Catastrophic divergences of the onsager–machlup–laplace approximation. The Journal of chemical physics, 79(8):3765–3772, 1983.
  • [7] Ioannis Karatzas and Steven Shreve. Brownian motion and stochastic calculus, volume 113. Springer Science & Business Media, 2012.
  • [8] H.A. Kramers. Brownian motion in a field of force and the diffusion model of chemical reactions. Physica, 7(4):284 – 304, 1940.
  • [9] Ling Li. Data complexity in machine learning and novel classification algorithms. PhD thesis, California Institute of Technology, 2006.
  • [10] James Martens and Roger Grosse. Optimizing neural networks with kronecker-factored approximate curvature. In International conference on machine learning, pages 2408–2417, 2015.
  • [11] Nikolai K Vereshchagin and Paul MB Vitányi. Kolmogorov’s structure functions and model selection. IEEE Transactions on Information Theory, 50(12):3265–3290, 2004.
  • [12] Amir R. Zamir, Alexander Sax, William Shen, Leonidas Guibas, Jitendra Malik, and Silvio Savarese. Taskonomy: Disentangling task transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3712–3722, 2018.