Deep Learning Works in Practice. But Does it Work in Theory?

01/31/2018 ∙ by Lê Nguyên Hoang, et al. ∙ EPFL

Deep learning relies on a very specific kind of neural network: one that stacks several layers of neurons. In the last few years, deep learning has achieved major breakthroughs in many tasks such as image analysis, speech recognition, natural language processing, and so on. Yet, there is no theoretical explanation of this success. In particular, it is not clear why the deeper the network, the better it actually performs. We argue that the explanation is intimately connected to a key feature of the data collected from our surrounding universe to feed machine learning algorithms: large non-parallelizable logical depth. Roughly speaking, we conjecture that the shortest computational descriptions of the universe are algorithms with inherently large computation times, even when a large number of computers are available for parallelization. Interestingly, this conjecture, combined with the folklore conjecture in theoretical computer science that P ≠ NC, explains the success of deep learning.


Turing Test and Learning Machines

In 1950, Alan Turing [22] boldly conjectured that passing what is now called the Turing test would require machine learning. His amazing insight was essentially the following. The human brain has around $10^{14}$ synapses, and around $10^9$ of these synapses are likely critical to the kind of natural language processing needed to pass the Turing test. Turing went on to regard this quantity as the minimal amount of bits needed for any algorithm to pass the test. In modern terms, we would say that the Kolmogorov complexity [10] of the Turing test is likely to be of the order of $10^9$ bits. This corresponds to saying that no shorter algorithm can solve this problem.

Unfortunately, writing quality source codes that are $10^9$ bits long is extremely challenging, tedious, time-consuming and prone to errors, even for large teams of top software developers. Recall that there are around $2^{10^9}$ such source codes, which is much larger than the number of particles in the universe. Back in 1950, Turing foresaw that, around 2000, computer science would undergo a revolution, as it would rely more and more on machine-written rather than hand-written algorithms. Machine learning (or learning machines, as Turing called it) would thus correspond to the superhuman ability of computers to explore the space of algorithms whose Kolmogorov complexity exceeds the gigabyte. Of course, the key to this new capability of computers is their superhuman computational power, as well as the availability of huge amounts of data to guide them in their exploration of $10^9$-bit-long source codes.

Now, Turing did not himself specify what machine learning algorithm to use. A few years later, Solomonoff [17, 18] proposed applying Bayes' rule to all terminating Turing machines. Denoting by $\mathcal{T}$ the set of such machines and by $D$ the observed data, Solomonoff's machine learning algorithm, called Solomonoff's induction, consists of computing the posterior credence of a terminating Turing machine $T$ given data $D$ by

$$\mathbb{P}[T \mid D] \;=\; \frac{\mathbb{P}[D \mid T]\,\mathbb{P}[T]}{\sum_{T' \in \mathcal{T}} \mathbb{P}[D \mid T']\,\mathbb{P}[T']},$$

where $\mathbb{P}[D \mid T]$ is the probability that the Turing machine $T$ assigns to data $D$, and $\mathbb{P}[T]$ is the prior probability of the Turing machine $T$, which, for the sum of priors to equal 1, will typically have to be exponentially small in the size of the description of $T$. Interestingly, Solomonoff proved two things:

  1. Solomonoff proved that his very general approach to machine learning was complete, in the sense that it provably determined the best-possible predictive theory from an amount of data whose size is (roughly) the Kolmogorov complexity of the data itself.

  2. Solomonoff also proved however that his induction was incomputable. In short, this is because the space of all terminating Turing machines that Solomonoff proposed to explore is ill-behaved. Indeed, as Turing [21] showed in 1936 through the infamous halting problem, this space is not a recursively enumerable set. (Besides, Solomonoff’s use of Bayes’ rule is way too computationally costly to be applied in practice.)
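
To make the Bayesian update above concrete, here is a minimal, purely illustrative sketch in Python. It replaces Solomonoff's (incomputable) set of all terminating Turing machines with a tiny hand-picked set of predictive models; the three model names, their description lengths and the data are made up for the example, and the prior is taken exponentially small in the description length, as in the text.

```python
import math

# Toy stand-in for Solomonoff induction: a *finite* set of computable models,
# each with a description length (in bits) and the probability it assigns to the data.
# Over all terminating Turing machines this is incomputable; here we restrict
# ourselves to three hypothetical models purely for illustration.
models = {
    "always_heads":   {"length_bits": 8,  "likelihood": lambda seq: 1.0 if all(x == 1 for x in seq) else 0.0},
    "fair_coin":      {"length_bits": 12, "likelihood": lambda seq: 0.5 ** len(seq)},
    "biased_coin_09": {"length_bits": 20, "likelihood": lambda seq: math.prod(0.9 if x == 1 else 0.1 for x in seq)},
}

def posterior(data):
    # Prior ~ 2^(-description length): exponentially small in the model's size.
    prior = {name: 2.0 ** -m["length_bits"] for name, m in models.items()}
    unnormalized = {name: prior[name] * models[name]["likelihood"](data) for name in models}
    z = sum(unnormalized.values())
    return {name: value / z for name, value in unnormalized.items()}

print(posterior([1, 1, 1, 0, 1, 1]))  # the short "always heads" model is ruled out by the single 0
```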

This naturally led researchers to focus instead on restricted, better-behaved computational models, e.g. linear classifiers, whose exploration is facilitated by some easy-to-compute learning rule, such as (stochastic) gradient descent. This makes it possible to automate the data-driven exploration of a restricted subspace of large-Kolmogorov-complexity algorithms, in a way that humans cannot match. This is arguably why machine learning works in theory. What is less clear is why deep learning is particularly effective. Before discussing this, let us first recall what deep actually means, which, in turn, requires recalling some basic elements of neural networks.
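
As an illustration of such an easy-to-compute learning rule, here is a minimal sketch of stochastic gradient descent for a linear (logistic) classifier in Python with NumPy; the data, learning rate and number of epochs are arbitrary choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 points in 2D, labeled by a noisy linear rule (illustrative only).
X = rng.normal(size=(200, 2))
y = (X @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=200) > 0).astype(float)

w, b = np.zeros(2), 0.0                               # parameters of the linear classifier
lr = 0.1                                              # learning rate

for epoch in range(50):
    for i in rng.permutation(len(X)):                 # stochastic: one example at a time
        p = 1.0 / (1.0 + np.exp(-(X[i] @ w + b)))     # sigmoid of the linear score
        grad = p - y[i]                               # gradient of the log-loss w.r.t. the score
        w -= lr * grad * X[i]                         # gradient descent step
        b -= lr * grad

accuracy = np.mean(((X @ w + b) > 0) == y.astype(bool))
print(f"training accuracy: {accuracy:.2f}")
```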

Neural Networks: the Deeper the Better

Neural networks are often regarded as the most promising restricted computational model for machine learning. Neurons in such a network can be thought of as elementary processing units, while the directed edges of the network can be thought of as communication channels between processing units. A crucial feature of neurons, which we shall get back to below, is that they perform fast nonlinear operations. Neurons typically compute a linear combination of incoming signals, and then apply a sigmoid function or a piecewise linear transformation to the linear combination.

(Note that the coefficients of the linear combination are usually associated with the communication channels, also known as synapses or edges. These two ways of defining neural networks are, however, equivalent. Yet, here, we shall clearly distinguish communication steps from computation steps.)

But neurons may as well compute other fast nonlinear operations, e.g. softmax, max-pooling or energy-based sampling.

One popular neural network architecture is the feed-forward one [16]. In this particular network, neurons are organized in layers. The outer layer is fed with raw data. Neurons of this outer layer then communicate their data to some (or sometimes all) of the neurons of the first hidden layer. These neurons perform their nonlinear transformation of their inputs, then communicate their results to neurons of the second layer, and so on. In this architecture, the number of neurons per layer is commonly known as the width of the neural network, while the number of layers is called its depth. Deep learning roughly boils down to favoring depth over width, in contrast with what is sometimes called shallow learning. (It is not clear, though, how to relevantly define depth for recurrent neural networks.)

Figure 1: A neural network with 3 hidden layers.
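
To make the width/depth terminology concrete, here is a small sketch of the forward pass of a fully-connected feed-forward network in Python with NumPy, parameterized by its depth and width. The weights are random, and the choice of ReLU, the input dimension and the depth/width values are arbitrary, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def feed_forward(x, depth, width, out_dim=1):
    """Forward pass of a fully-connected network: `depth` hidden layers of `width` neurons each."""
    h = x
    for _ in range(depth):                          # one sequential step per hidden layer
        W = rng.normal(size=(h.shape[-1], width)) / np.sqrt(h.shape[-1])
        h = np.maximum(0.0, h @ W)                  # linear combination followed by a ReLU nonlinearity
    W_out = rng.normal(size=(h.shape[-1], out_dim)) / np.sqrt(h.shape[-1])
    return h @ W_out

x = rng.normal(size=(5, 16))                        # a batch of 5 inputs of dimension 16
print(feed_forward(x, depth=3, width=64).shape)     # (5, 1): like Figure 1, this net has 3 hidden layers
```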

An essential feature of neural networks is the ease with which they can be adjusted to data, usually through the (stochastic) gradient descent of a data-driven loss function combined with a back-propagation algorithm [16]. Such a learning scheme typically corresponds to adapting the importance each neuron gives to its different input signals in order to better fit the data, or even the effect of an input on the output.
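
The sketch below illustrates this adjustment on a one-hidden-layer network: it performs gradient-descent steps, with the gradients computed by hand via the chain rule (i.e. back-propagation). The network size, data, learning rate and number of steps are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# A tiny one-hidden-layer network fitted to illustrative data (squared loss).
X = rng.normal(size=(32, 4))
y = np.sin(X.sum(axis=1, keepdims=True))            # arbitrary target, for the example only

W1, b1 = rng.normal(size=(4, 8)) * 0.1, np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)) * 0.1, np.zeros(1)
lr = 0.05

for step in range(200):
    # Forward pass
    h_pre = X @ W1 + b1
    h = np.maximum(0.0, h_pre)                      # ReLU hidden layer
    pred = h @ W2 + b2
    loss = np.mean((pred - y) ** 2)

    # Backward pass (back-propagation = chain rule, layer by layer)
    d_pred = 2 * (pred - y) / len(X)
    dW2, db2 = h.T @ d_pred, d_pred.sum(axis=0)
    d_h = d_pred @ W2.T
    d_hpre = d_h * (h_pre > 0)                      # gradient through the ReLU
    dW1, db1 = X.T @ d_hpre, d_hpre.sum(axis=0)

    # Gradient-descent update: each weight shifts to better fit the data
    W1, b1 = W1 - lr * dW1, b1 - lr * db1
    W2, b2 = W2 - lr * dW2, b2 - lr * db2

print(f"final loss: {loss:.4f}")
```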

Over the last few years, deep learning has allowed monumental breakthroughs in a huge number of areas, including image analysis, speech recognition, natural language processing, car driving, winning the game of Go, music composition, painting, drawing, doodling and so on. We refer to [12] for a more thorough list of deep learning success stories. The repeated and surprising successes of deep learning have convinced most practitioners that deep learning works in practice. (This sometimes requires huge computational power: deep learning for image recognition, for instance, typically requires billions of images, whose mere processing was beyond the reach of the fastest computers a few decades ago. In fact, deep learning now typically relies on GPU hardware.)

However, as we pointed out earlier, while Turing anticipated the success of machine learning long ago, the success of deep learning, as opposed to other machine learning schemes, remains a mystery. Many researchers assert that no one really understands why deep learning performs so well. Yann LeCun, one of the main pioneers of deep learning, talks about the "unreasonable effectiveness of deep learning" [11].

Figure 2: Space of algorithms and problems.

Conjectures

We present below three conjectures. The first can be viewed as a rephrasing of Turing’s argument in modern terminology. As Turing would likely argue, it explains why machine learning often works better than human-written codes.

Conjecture 1

Most of the data from the current state of our universe and most of the problems we aim to solve with these data, as well as any good approximations of these data and problems, have a Kolmogorov complexity larger than $10^9$ bits.

Our second and main conjecture essentially says that the structures we observe in our daily lives have an inherent apparent complexity, or interestingness. Apparent complexity can be precisely captured by the notion of logical depth, introduced by Bennett [3]. More specifically, the logical depth of a datum, e.g. a video or a data set of images, is the computation time of the shortest algorithm that outputs this datum. (To avoid degeneracy, this definition is sometimes replaced by the smallest computation time over algorithms whose length is at most the Kolmogorov complexity plus a small fixed constant.) We argue that data from our daily lives typically feature large logical depth. In other words, our data can usually be remarkably well compressed, but decompressing the optimal compression often requires large computational power.
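
Following the parenthesized variant of the definition, this can be written formally as follows, where $U$ is a universal Turing machine, $\mathrm{time}_U(p)$ is the running time of program $p$ on $U$, $K(x)$ is the Kolmogorov complexity of $x$, and $c$ is a small constant (the notation here is ours):

$$\mathrm{depth}_c(x) \;=\; \min\bigl\{\, \mathrm{time}_U(p) \;:\; U(p) = x \ \text{ and } \ |p| \le K(x) + c \,\bigr\}.$$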

Conjecture 2

Most of the data from the current state of our universe and most of the problems we aim to solve with these data, as well as any good approximations of these data and problems, have a large non-parallelizable logical depth.

We will discuss below the importance of non-parallelizability and give several examples arguing for logical depth. Before doing so however, let us first better explain what the conjectures actually imply. Essentially, the two conjectures above say that the full description of our universe requires both large Kolmogorov complexity and large non-parallelizable logical depth. In fact, it seems that many classical problems can be located somewhere in a diagram of two axes that correspond to these two distinct measures of computational complexity. The same seems to hold for machine learning algorithms as well. Figure 2 depicts the relative complexity of different problems and algorithms.

Now, in order to understand the relation with deep learning, it is important to observe that, strictly speaking, a deep neural network does not perform more computation steps than a shallow one with the same number of neurons. Indeed, each neuron performs one computation step. In this sense, the number of computation steps of a neural network corresponds to the number of neurons it features. The relevant concept of depth of a problem, as highlighted in our main conjecture, is precisely that of non-parallelizable logical depth.

Non-parallelizable logical depth is intimately connected to a fundamental open question in theoretical computer science: P versus NC. This question asks whether every problem that can be solved in polynomial time on a Turing machine can also be solved in polylogarithmic time on polynomial-size logic gate circuits. It is widely believed that this is not the case: some polynomial-time problems are fundamentally non-parallelizable. This intuition seems to be corroborated by the success of deep learning over shallow learning, as deep learning seems able to compute functions of large non-parallelizable logical depth that highly parallelized shallow neural networks cannot. In fact, this is our third conjecture.
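
The contrast can be illustrated with a toy sketch (our own illustration, not taken from the paper): summing $n$ numbers can be reorganized into a logarithmic number of parallel rounds, whereas iterating a nonlinear map $n$ times is not known to collapse into few parallel rounds.

```python
import numpy as np

def parallel_sum(values):
    """Tree-structured sum: with enough processors, only O(log n) sequential rounds are needed."""
    values = np.array(values, dtype=float)
    rounds = 0
    while len(values) > 1:
        if len(values) % 2:                      # pad to an even length
            values = np.append(values, 0.0)
        values = values[0::2] + values[1::2]     # all pairwise additions of a round happen "in parallel"
        rounds += 1
    return values[0], rounds

def iterate_logistic(x, n, r=3.9):
    """Iterating a nonlinear (chaotic) map: each step needs the previous one, so n steps stay sequential."""
    for _ in range(n):
        x = r * x * (1.0 - x)
    return x

total, rounds = parallel_sum(range(1_000_000))
print(total, rounds)                      # ~20 parallel rounds suffice for a million numbers
print(iterate_logistic(0.2, 1_000_000))   # no known shortcut below ~a million sequential steps
```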

Conjecture 3

At equivalent Kolmogorov complexity, deeper neural networks compute functions with larger non-parallelizable logical depth.

Note that the P ≠ NC conjecture would only represent an asymptotic version of what is needed for our third conjecture. Having said this, given that the size of each input of a neural network is often less than a gigabit, a logarithmic-time function of such inputs would typically terminate in at most about 30 steps, which could be computed by a neural network of depth 30. Thus, if P = NC, one would expect that deep learning with a lot more than 30 layers does not significantly outperform neural networks with 30 layers. However, current state-of-the-art deep learning algorithms can have "over 1200 layers and still yield meaningful improvements" [8]. This seems like strong evidence for P ≠ NC.
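
The figure of 30 comes from a simple back-of-the-envelope count of the bits of a gigabit-sized input:

$$\log_2\!\left(10^9\right) \;=\; 9\,\log_2(10) \;\approx\; 9 \times 3.32 \;\approx\; 30.$$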

Let us recapitulate. On the one hand, hand-written codes have successfully solved some large-logical-depth problems, like playing chess, but they are limited in their Kolmogorov complexity. On the other hand, shallow machine learning allows a better exploration of a subspace of large-Kolmogorov-complexity algorithms, but it excludes all algorithms whose computation, even when fully parallelized, requires more than a few sequential steps. Considering our three conjectures, both of these approaches inevitably underperform, because the current state of our universe and many classical problems seem to require algorithms of both (a) large Kolmogorov complexity and (b) large non-parallelizable logical depth. Deep learning is the current state-of-the-art approach to efficiently explore a space of algorithms with both properties. We argue that this is why deep learning works in theory.

Corroborating the Main Conjecture

To corroborate our main conjecture, namely that the data used to feed our algorithms have a large apparent complexity, consider the following examples.

  1. Think of the way we, as humans, would describe many of the pictures of the web. Perhaps we would say that, on some image, we see a cat sitting on a laptop keyboard, that the cat is beige and puffy and that it looks sad. Somehow, however incomplete, this amazingly short description of the image successfully captures much of the information carried by the pixel luminosities and colours of a potentially high-definition image. Clearly, some thinking seems required to decompress this information and visualize the scene. We invite the reader to make this effort, before comparing what she imagined with Figure 4.

  2. While sound could a priori be any kind of time series, most of our music is actually highly codified. Music is played by a small number of instruments with characteristic timbre. Instruments play a handful of possible pitches at a handful of possible rhythms, and even the combination of pitches and rhythms is often restricted by the choice of music genre. One can thus describe music in a very efficient manner, typically by a ZIP compression of music sheets. Yet, deriving the actual music from its efficient description may require a lot of computation.

  3. Morphogenesis is a wonderfully sophisticated biological computation that captivated Turing [23]. It is the ability of a single egg to morph into an actual living organism. Organisms can thereby be regarded as structures of large logical depth. Indeed, most of the information about this structure can be found in DNA whose size is usually a few (probably compressible) megabits. This corresponds to a reasonably small Kolmogorov complexity. However, deriving the future structure of a living organism from its genome likely requires a huge number of computation steps. Indeed, gestation typically lasts months.

  4. Fractals are well-known for being computationally demanding. Indeed, fractals can often be obtained by a very simple procedure recursively repeated over and over, as in the case of the Romanesco cabbage (see Figure 3). This typically corresponds to a low Kolmogorov complexity, but a large non-parallelizable computational depth. Intriguingly, simulations by [14] show that random deep neural networks compute more fractal-like structures than shallow ones do.

    Figure 3: Romanesco cabbage.
  5. Non-parallelizable logical depth also helps explain why many classical automata, such as Wolfram's rule 30, Langton's ant or Conway's game of life, are known to feature chaotic, unpredictable phenomena, to the point where their unpredictability is sometimes used for pseudo-random number generation (see the sketch after this list). This may seem somewhat at odds with the fact that these automata are in fact deterministic and computable. However, we argue that what is actually meant is that such phenomena are unpredictable by our brains (or our machine learning models), because the logical depth of the phenomena far exceeds the depth of our brains (or even of our current deepest artificial neural networks).

  6. There is a remarkable parallel between the success of deep learning and what Wigner [24] famously dubbed "the unreasonable effectiveness of mathematics in the natural sciences". A common denominator of both approaches to describing our world is the prevalence of depth. Indeed, mathematics can be regarded as the pinnacle of what humans have produced in terms of logical depth. Even though mathematical textbooks rarely exceed a thousand pages, understanding them often requires years of study and a huge amount of cognitive effort. We argue that the formidable logical depth of mathematics has been the key to understanding physical phenomena of large logical depth (and small Kolmogorov complexity), in a manner that our comparatively shallow human brains cannot match.
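
As a small illustration of item 5, here is a sketch of Wolfram's rule 30 in Python (our own toy code, not from the paper): the rule itself is tiny, so the Kolmogorov complexity of the pattern is small, yet each row depends on the previous one, so producing row $t$ seems to require on the order of $t$ sequential steps.

```python
def rule30_row(t, width=201):
    """Return row t of rule 30 (with periodic boundaries), starting from a single live cell."""
    cells = [0] * width
    cells[width // 2] = 1
    for _ in range(t):                              # each row requires the full previous row:
        cells = [                                   # no known way to jump ahead in few parallel rounds
            cells[(i - 1) % width] ^ (cells[i] | cells[(i + 1) % width])  # rule 30 update
            for i in range(width)
        ]
    return cells

row = rule30_row(100)
print("".join("#" if c else "." for c in row))      # a chaotic-looking slice of the pattern
```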

The very notion of apparent complexity was previously studied in [1], where it was argued to be a likely transitional phase in closed systems whose initial state has remarkably small entropy, which means that our argument for the effectiveness of deep learning may have roots in the second law of thermodynamics. In particular, [1] presented four distinct measures of "apparent complexity" and proved that they are all related. Logical depth was one of these four measures. This gives us good reason to believe that the current state of the universe has a remarkably large logical depth.

Figure 4: A beige, puffy and sad-looking cat sitting on a laptop keyboard.

The Main Conjecture in Perspective

Not surprisingly, several tentative explanations of the success of deep learning have recently been proposed. Many share the same purpose, namely identifying a relevant space of functions that require either depth or exponentially large width [7, 4, 2]. To simplify the theoretical analysis, these approaches usually focus on one particular kind of neural network, e.g. ReLU networks in [19, 20], ReLU, sigmoid and threshold networks in [5], or logic gate circuits in [25]. Logic gate circuits can be regarded as specific kinds of neural networks whose inter-neural communications are bits. For instance, [6] proved the existence of functions that can be computed by a polynomial-size logic gate circuit of depth d, but which require exponentially many logic gates to be computed by circuits of depth d-1.

A somewhat different approach has been taken by researchers from Google and Stanford [15, 14]. Instead of searching for a specific space of functions that only deeper neural networks can compute, they determined typical properties of the functions computed by deep neural networks. To do so, they considered Gaussian probability distributions over deep neural networks and showed that, for certain such distributions, a certain measure of the complexity of the functions computed by the random deep neural networks grows, in expectation, at a rate that is exponential in the depth but only polynomial in the width. In other words, the effect of depth is exponential, while width only acts in a polynomial way. The measures of the complexity of the functions differ in the two papers [15, 14], as each adapts its definitions to the neural networks under study. Yet both essentially boil down to some measure of nonlinearity.

One may also wonder how the results of [15, 14] relate to our analysis. It is worth pointing out that there is a sense in which a large amount of nonlinearity is typical of large logical depth. Indeed, by contrast, the composition of linear operations remains a linear operation. As a result, an algorithm that combines a large number of linear operations is actually equivalent to a shorter algorithm that computes their composition in a single operation. Therefore, the composition of linear operations does not increase logical depth. It is only the composition of nonlinear operations that may do so.
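
This collapse of linear compositions is easy to check numerically. In the sketch below (an illustration of ours, with arbitrary dimensions), composing ten random linear maps is exactly reproduced by a single precomputed matrix, whereas this is no longer possible once a nonlinearity is inserted between the maps.

```python
import numpy as np

rng = np.random.default_rng(2)
layers = [rng.normal(size=(8, 8)) / np.sqrt(8) for _ in range(10)]
x = rng.normal(size=8)

# Composing linear maps step by step...
h = x
for W in layers:
    h = W @ h

# ...is exactly one linear map: the product of the matrices, applied in a single step.
collapsed = np.linalg.multi_dot(layers[::-1])
print(np.allclose(h, collapsed @ x))        # True: ten linear steps collapse to one

# With a nonlinearity (ReLU) between the maps, a single matrix no longer reproduces the output.
h_nl = x
for W in layers:
    h_nl = np.maximum(0.0, W @ h_nl)
print(np.allclose(h_nl, collapsed @ x))     # False: the depth is no longer removable
```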

Now, it has already been argued [13] that physical processes are fundamentally sequential, and hence that any algorithm attempting to understand physical data should be sequential as well. Our conjecture, however, differs from this claim. Indeed, we argue that data derived from sequential processes are not necessarily of large logical depth. As an example, [1] observes that the early universe and the far-future universe have low apparent complexity, even though they result from a large number of computational steps. In fact, [1] even argues that the large apparent complexity of our current universe is only a temporary phase which will vanish as entropy continues to increase. In any case, it is this fundamental computational property of our current universe, measured in terms of non-parallelizable logical depth, that our main conjecture relies on.

Corollaries and Further Steps

An interesting corollary of our conjecture is that state-of-the-art predictive models whose predictions rely on a huge amount of unavoidable computation time (rather than on a huge amount of data) are unlikely to be superseded by current machine learning algorithms, as all current machine learning algorithms can be regarded as fast algorithms. Even the deepest neural networks currently being used hardly exceed a few hundred non-parallelized computation steps.

A major difficulty with training even deeper models would be the exploration (or learning) phase. Indeed, currently, a major bottleneck of machine learning is the huge computation power needed for learning (we, humans, have to spread our learning over decades). Models that would require larger computation time to make predictions would likely require larger computation time for learning as well. A solution to this may be to decompose the learning into different phases. This is actually how people learn to play chess, or other deep endeavors like mathematics or computer science. Instead of learning to play entire chess games repeatedly, chess players focus on independent lower-logical-depth states of chess games, e.g. end games.

An important further step could be to foresee the exact need for deeper learning, depending on the task at hand. This challenge arises for humans as well. Indeed, as argued by psychologists [9], our brains seem to feature two thinking modes. One is fast, reactive and only partially reliable: it is the shallow-learning part of our brains. The other is slow and more reliable: it is the deeper-learning part of our brains. One important problem that our brains repeatedly need to solve is to decide whether the slower part is needed for the task at hand. Such a problem will likely need to be solved for artificial intelligence as well.

Conclusion

We proposed here a very first step towards understanding the success of deep learning. Basically, we argued that fundamental concepts in theoretical computer science, namely Kolmogorov complexity, logical depth and the P vs NC conjecture, could provide better insights into the nature of the data exploited by machine learning algorithms, and help foresee which machine learning algorithms are most likely to succeed. In short, neural networks, and specifically deep learning, seem preferable over other approaches, mostly because neural networks allow for a facilitated exploration of a subspace of large-Kolmogorov-complexity algorithms, and deep learning better matches the large non-parallelizable logical depth of the current state of our universe.

Formalizing our main conjecture requires a rigorous, natural and exploitable definition of non-parallelizable logical depth, which is non-trivial. It would then be interesting to mathematically prove that no shallow neural network can compute functions of large non-parallelizable logical depth, but that deeper neural networks can. Moreover, determining the non-parallelizable logical depth of real data, as well as of specific (approximate) functions related to these data, e.g. chess playing or passing the Turing test, would be a major step towards a theoretical understanding of deep learning.

References

  • [1] S. Aaronson, S. M. Carroll, and L. Ouellette. Quantifying the rise and fall of complexity in closed systems: The coffee automaton. arXiv preprint arXiv:1405.6903, 2014.
  • [2] Y. Bengio and O. Delalleau. On the expressive power of deep architectures. In International Conference on Algorithmic Learning Theory, pages 18–36. Springer, 2011.
  • [3] C. H. Bennett. Logical depth and physical complexity. The Universal Turing Machine A Half-Century Survey, pages 207–235, 1995.
  • [4] M. Braverman. Poly-logarithmic independence fools bounded-depth boolean circuits. Communications of the ACM, 54(4):108–115, 2011.
  • [5] R. Eldan and O. Shamir. The power of depth for feedforward neural networks. In Conference on Learning Theory, pages 907–940, 2016.
  • [6] J. Hastad. Almost optimal lower bounds for small depth circuits. In Proceedings of the eighteenth annual ACM symposium on Theory of computing, pages 6–20. ACM, 1986.
  • [7] J. Hastad and M. Goldmann. On the power of small-depth threshold circuits. In Foundations of Computer Science, 1990. Proceedings., 31st Annual Symposium on, pages 610–618. IEEE, 1990.
  • [8] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger. Deep networks with stochastic depth. In European Conference on Computer Vision, pages 646–661. Springer, 2016.
  • [9] D. Kahneman. Thinking, fast and slow. Macmillan, 2011.
  • [10] A. N. Kolmogorov. On tables of random numbers. Sankhyā: The Indian Journal of Statistics, Series A, pages 369–376, 1963.
  • [11] Y. LeCun. The unreasonable effectiveness of deep learning. In Seminar. Johns Hopkins University, 2014.
  • [12] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
  • [13] H. W. Lin and M. Tegmark. Why does deep and cheap learning work so well? arXiv preprint arXiv:1608.08225, 2016.
  • [14] B. Poole, S. Lahiri, M. Raghu, J. Sohl-Dickstein, and S. Ganguli. Exponential expressivity in deep neural networks through transient chaos. In Advances In Neural Information Processing Systems, pages 3360–3368, 2016.
  • [15] M. Raghu, B. Poole, J. Kleinberg, S. Ganguli, and J. Sohl-Dickstein. On the expressive power of deep neural networks. arXiv preprint arXiv:1606.05336, 2016.
  • [16] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Cognitive modeling, 5(3):1, 1988.
  • [17] R. J. Solomonoff. A formal theory of inductive inference. part i. Information and control, 7(1):1–22, 1964.
  • [18] R. J. Solomonoff. A formal theory of inductive inference. part ii. Information and control, 7(2):224–254, 1964.
  • [19] M. Telgarsky. Representation benefits of deep feedforward networks. arXiv preprint arXiv:1509.08101, 2015.
  • [20] M. Telgarsky. Benefits of depth in neural networks. arXiv preprint arXiv:1602.04485, 2016.
  • [21] A. M. Turing. On computable numbers, with an application to the Entscheidungsproblem. Proceedings of the London mathematical society, 2(1):230–265, 1937.
  • [22] A. M. Turing. Computing machinery and intelligence. Mind, 59(236):433–460, 1950.
  • [23] A. M. Turing. The chemical basis of morphogenesis. Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences, 237(641):37–72, 1952.
  • [24] E. P. Wigner. The unreasonable effectiveness of mathematics in the natural sciences. Richard Courant Lecture in Mathematical Sciences delivered at New York University, May 11, 1959. Communications on Pure and Applied Mathematics, 13(1):1–14, 1960.
  • [25] A. C.-C. Yao. Separating the polynomial-time hierarchy by oracles. In Foundations of Computer Science, 1985., 26th Annual Symposium on, pages 1–10. IEEE, 1985.