1 Timeline of Deep Learning Highlights
1.1 1962: Neurobiological Inspiration Through Simple Cells and Complex Cells
1.2 1970 ± a Decade or so: Backpropagation
Error functions and their gradients for complex, nonlinear, multi-stage, differentiable, NN-related systems have been discussed at least since the early 1960s, e.g., [41, 50, 13, 27, 12, 93, 4, 26]. Gradient descent in such systems can be performed [13, 50, 12] by iterating the ancient chain rule [57, 58] in dynamic programming style (compare simplified derivation using chain rule only). However, efficient error backpropagation (BP) in arbitrary, possibly sparse, NN-like networks apparently was first described by Linnainmaa in 1970 [59, 60] (he did not refer to NN though). BP is also known as the reverse mode of automatic differentiation, where the costs of forward activation spreading essentially equal the costs of backward derivative calculation. See early FORTRAN code, and compare. Compare the concept of ordered derivative and related work, with NN-specific discussion (section 5.5.1), and the first NN-specific efficient BP of 1981 by Werbos [90, 92]. Compare [53, 75, 67] and generalisations for sequence-processing recurrent NN, e.g., [94, 73, 91, 68, 80, 61]. See also natural gradients. As of 2013, BP is still the central Deep Learning algorithm.
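The reverse mode described above can be illustrated with a minimal sketch. It is not the code of any cited work: the class `Var` and the functions `tanh` and `backward` are hypothetical names introduced here, and the sketch assumes a small graph of scalar nodes. The forward pass stores each node's local partial derivatives; a single backward sweep in reverse topological order then applies the chain rule once per edge, so the backward cost is of the same order as the forward cost.

```python
import math


class Var:
    """Scalar node in a computation graph (hypothetical helper for illustration)."""

    def __init__(self, value, parents=()):
        self.value = value
        # Each entry is (parent node, local partial derivative d self / d parent).
        self.parents = parents
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))


def tanh(x):
    t = math.tanh(x.value)
    return Var(t, ((x, 1.0 - t * t),))


def backward(output):
    """Accumulate d output / d node into node.grad for every node in the graph."""
    order, seen = [], set()

    def visit(node):
        # Depth-first search yields a topological order (parents before children).
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)

    visit(output)
    output.grad = 1.0
    # Reverse sweep: each edge is touched exactly once.
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += node.grad * local


# One "neuron": y = tanh(w * x + b)
w, x, b = Var(0.5), Var(2.0), Var(-1.0)
y = tanh(w * x + b)
backward(y)
print(y.value, w.grad, x.grad, b.grad)
```

Here w * x + b = 0, so the derivative of tanh at that point is 1 and the gradients reduce to dy/dw = x = 2.0, dy/dx = w = 0.5, and dy/db = 1.0.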
1.3 1979: Deep Neocognitron, Weight Sharing, Convolution
Fukushima’s deep Neocognitron architecture [29, 30, 31] incorporated neurophysiological insights (TL 1.1). It introduced weight-sharing Convolutional Neural Networks (CNN) as well as winner-take-all layers. It is very similar to the architecture of modern, competition-winning, purely supervised, feedforward, gradient-based Deep Learners (TL 1.12-1.14). Fukushima, however, used local unsupervised learning rules instead.
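Weight sharing can be made concrete with a small sketch. It is not Fukushima's model: the function name `conv2d_valid`, the toy image, and the hand-set edge kernel are assumptions chosen for illustration. The same few kernel weights are applied at every image position, so the number of parameters does not grow with the image size, and a shifted input pattern produces a correspondingly shifted response.

```python
def conv2d_valid(image, kernel):
    """Slide one shared kernel over the image (cross-correlation, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(kernel[i][j] * image[r + i][c + j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# A dark-to-bright vertical edge between columns 1 and 2.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# The same edge, shifted one pixel to the left.
shifted = [
    [0, 1, 1, 1],
    [0, 1, 1, 1],
    [0, 1, 1, 1],
]
kernel = [[-1, 1]]  # two shared weights, reused at every position

feature_map = conv2d_valid(image, kernel)
shifted_map = conv2d_valid(shifted, kernel)
print(feature_map)  # [[0, 1, 0], [0, 1, 0], [0, 1, 0]]
print(shifted_map)  # [[1, 0, 0], [1, 0, 0], [1, 0, 0]]
```

The response peak moves with the edge, which is the property that later pooling or winner-take-all stages can exploit to tolerate shifts in position.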
1.4 1987: Autoencoder Hierarchies
1.5 1989: Backpropagation for CNN
1.6 1991: Fundamental Deep Learning Problem
1.7 1991: Deep Hierarchy of Recurrent NN
My first recurrent Deep Learning system (present paper) partially overcame the fundamental problem (TL 1.6) through a deep RNN stack pre-trained in unsupervised fashion [79, 81, 82] to accelerate subsequent supervised learning. This was a working Deep Learner in the modern post-2000 sense, and also the first Neural Hierarchical Temporal Memory.
1.8 1997: Supervised Deep Learner (LSTM)
1.9 2006: Deep Belief Networks / CNN Results
1.10 2009: First Competitions Won by Deep Learning
1.11 2010: Plain Backpropagation on GPUs Yields Excellent Results
In 2010, a new MNIST record was set by good old backpropagation (TL 1.2) in deep but otherwise standard NN, without unsupervised pre-training, and without convolution (but with training pattern deformations). This was made possible mainly by boosting computing power through a fast GPU implementation. (A year later, first human-competitive performance on MNIST was achieved by a deep MCMPCNN (TL 1.12).)
1.12 2011: MPCNN on GPU / First Superhuman Visual Pattern Recognition
In 2011, Ciresan et al. introduced supervised GPU-based Max-Pooling CNN or Convnets (MPCNN), today used by most if not all feedforward competition-winning deep NN (TL 1.13, TL 1.14). The first superhuman visual pattern recognition in a controlled competition (traffic signs) was achieved [20, 19] (twice better than humans, three times better than the closest artificial NN competitor, six times better than the best non-neural method), through deep and wide Multi-Column (MC) GPU-MPCNN [18, 19], the current gold standard for deep feedforward NN.
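The two ingredients named above, max-pooling and multiple columns, can be sketched as follows. This is not the GPU implementation of [18, 19]: the function names `max_pool2d` and `average_columns` are hypothetical, the pooling regions are assumed to be non-overlapping, and combining columns by plain averaging of their class outputs is an assumption of this sketch.

```python
def max_pool2d(feature_map, size=2):
    """Keep the maximum of each non-overlapping size x size region."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            max(feature_map[r + i][c + j]
                for i in range(size) for j in range(size))
            for c in range(0, cols - size + 1, size)
        ]
        for r in range(0, rows - size + 1, size)
    ]


def average_columns(column_outputs):
    """Average the per-class outputs of several independently trained columns."""
    n = len(column_outputs)
    return [sum(values) / n for values in zip(*column_outputs)]


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
pooled = max_pool2d(feature_map)
print(pooled)  # [[4, 2], [2, 8]]

# Class scores from two hypothetical columns for a three-class problem.
combined = average_columns([[0.5, 0.5, 0.0], [0.25, 0.75, 0.0]])
predicted = combined.index(max(combined))
print(combined, predicted)  # [0.375, 0.625, 0.0] 1
```

Pooling halves each spatial dimension while keeping the strongest response in each region; averaging lets a column that is undecided between two classes be overruled by a more confident one.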
1.13 2012: First Contests Won on Object Detection and Image Segmentation
2012 saw the first Deep Learning system (a GPU-MCMPCNN [18, 19], TL 1.12) to win a contest on visual object detection in large images (as opposed to mere recognition/classification): the ICPR 2012 Contest on Mitosis Detection in Breast Cancer Histological Images [2, 74, 16]. An MC (TL 1.12) variant of a GPU-MPCNN also achieved best results on the ImageNet classification benchmark. 2012 also saw the first pure image segmentation contest won by Deep Learning (again through a GPU-MCMPCNN), namely, the ISBI 2012 Challenge on segmenting neuronal structures [3, 15]. This was the 8th international pattern recognition contest won by my team since 2009.
1.14 2013: More Contests and Benchmark Records
In 2013, a new TIMIT phoneme recognition record was set by deep LSTM RNN (TL 1.8, TL 1.10). A new record on the ICDAR Chinese handwriting recognition benchmark (over 3700 classes) was set on a desktop machine by a GPU-MCMPCNN with almost human performance. The MICCAI 2013 Grand Challenge on Mitosis Detection was won by a GPU-MCMPCNN [88, 16]. Deep GPU-MPCNN also helped to achieve new best results on ImageNet classification and PASCAL object detection. Additional contests are mentioned on the web pages of the Swiss AI Lab IDSIA, the University of Toronto, NY University, and the University of Montreal.
Drafts/revisions of this paper have been published since 20 Sept 2013 on my massive open peer review web site www.idsia.ch/~juergen/firstdeeplearner.html (also under www.deeplearning.me). Thanks for valuable comments to Geoffrey Hinton, Kunihiko Fukushima, Yoshua Bengio, Sven Behnke, Yann LeCun, Sepp Hochreiter, Mike Mozer, Marc’Aurelio Ranzato, Andreas Griewank, Paul Werbos, Shun-ichi Amari, Seppo Linnainmaa, Peter Norvig, Yu-Chi Ho, Alex Graves, Dan Ciresan, Jonathan Masci, Stuart Dreyfus, and others.
-  A. Angelica interviews J. Schmidhuber: How Bio-Inspired Deep Learning Keeps Winning Competitions, 2012. KurzweilAI: http://www.kurzweilai.net/how-bio-inspired-deep-learning-keeps-winning-competitions.
-  ICPR 2012 Contest on Mitosis Detection in Breast Cancer Histological Images. Organizers: IPAL Laboratory, TRIBVN Company, Pitie-Salpetriere Hospital, CIALAB of Ohio State Univ., 2012.
-  Segmentation of Neuronal Structures in EM Stacks Challenge - IEEE International Symposium on Biomedical Imaging (ISBI), 2012.
-  S. Amari. A theory of adaptive pattern classifiers. IEEE Trans. EC, 16(3):299–307, 1967.
-  S.-I. Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
-  H. Baird. Document Image Defect Models. In Proceedings, IAPR Workshop on Syntactic and Structural Pattern Recognition, Murray Hill, NJ, 1990.
-  D. H. Ballard. Modular learning in neural networks. In AAAI, pages 279–284, 1987.
-  S. Behnke. Hierarchical Neural Networks for Image Interpretation, volume LNCS 2766 of Lecture Notes in Computer Science. Springer, 2003.
-  R. Bellman. Dynamic Programming. Princeton University Press, Princeton, NJ, USA, 1st edition, 1957.
-  Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.
-  T. Bluche, J. Louradour, M. Knibbe, B. Moysset, F. Benzeghiba, and C. Kermorvant. The A2iA Arabic Handwritten Text Recognition System at the OpenHaRT2013 Evaluation. In Submitted to DAS 2014, 2013.
-  A. Bryson and Y. Ho. Applied optimal control: optimization, estimation, and control. Blaisdell Pub. Co., 1969.
-  A. E. Bryson. A gradient method for optimizing multi-stage allocation processes. In Proc. Harvard Univ. Symposium on digital computers and their applications, 1961.
-  K. Chellapilla, S. Puri, and P. Simard. High performance convolutional neural networks for document processing. In International Workshop on Frontiers in Handwriting Recognition, 2006.
-  D. C. Ciresan, A. Giusti, L. M. Gambardella, and J. Schmidhuber. Deep neural networks segment neuronal membranes in electron microscopy images. In Advances in Neural Information Processing Systems NIPS, 2012.
-  D. C. Ciresan, A. Giusti, L. M. Gambardella, and J. Schmidhuber. Mitosis detection in breast cancer histology images using deep neural networks. In MICCAI 2013, 2013.
-  D. C. Ciresan, U. Meier, L. M. Gambardella, and J. Schmidhuber. Deep big simple neural nets for handwritten digit recognition. Neural Computation, 22(12):3207–3220, 2010.
-  D. C. Ciresan, U. Meier, J. Masci, L. M. Gambardella, and J. Schmidhuber. Flexible, high performance convolutional neural networks for image classification. In Intl. Joint Conference on Artificial Intelligence IJCAI, pages 1237–1242, 2011.
-  D. C. Ciresan, U. Meier, J. Masci, and J. Schmidhuber. A committee of neural networks for traffic sign classification. In International Joint Conference on Neural Networks, pages 1918–1921, 2011.
-  D. C. Ciresan, U. Meier, J. Masci, and J. Schmidhuber. Multi-column deep neural network for traffic sign classification. Neural Networks, 2012.
-  D. C. Ciresan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classification. Technical report, IDSIA, February 2012. arXiv:1202.2745v1 [cs.CV].
-  D. C. Ciresan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classification. In IEEE Conference on Computer Vision and Pattern Recognition CVPR 2012, 2012. Long preprint arXiv:1202.2745v1 [cs.CV].
-  D. C. Ciresan, U. Meier, and J. Schmidhuber. Transfer learning for Latin and Chinese characters with deep neural networks. In International Joint Conference on Neural Networks, pages 1301–1306, 2012.
-  D. C. Ciresan and J. Schmidhuber. Multi-column deep neural networks for offline handwritten Chinese character classification. Technical report, IDSIA, September 2013. arXiv:1309.0261.
-  A. Coates, B. Huval, T. Wang, D. J. Wu, A. Y. Ng, and B. Catanzaro. Deep learning with COTS HPC systems. In Proc. International Conference on Machine learning (ICML’13), 2013.
-  S. W. Director and R. A. Rohrer. Automated network design - the frequency-domain case. IEEE Trans. Circuit Theory, CT-16:330–337, 1969.
-  S. E. Dreyfus. The numerical solution of variational problems. Journal of Mathematical Analysis and Applications, 5(1):30–45, 1962.
-  S. E. Dreyfus. The computational solution of optimal control problems with time lag. IEEE Transactions on Automatic Control, 18(4):383–385, 1973.
-  K. Fukushima. Neural network model for a mechanism of pattern recognition unaffected by shift in position - neocognitron. In Trans. IECE, 1979.
-  K. Fukushima. Neocognitron: A self-organizing neural network for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4):193–202, 1980.
-  K. Fukushima. Artificial vision by multi-layered neural networks: Neocognitron and its advances. Neural Networks, 37:103–119, Jan. 2013.
-  F. A. Gers, J. Schmidhuber, and F. Cummins. Learning to forget: Continual prediction with LSTM. In Proc. ICANN’99, Int. Conf. on Artificial Neural Networks, pages 850–855, Edinburgh, Scotland, 1999. IEE, London.
-  F. A. Gers, J. Schmidhuber, and F. Cummins. Learning to forget: Continual prediction with LSTM. Neural Computation, 12(10):2451–2471, 2000.
-  R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. Technical Report arxiv.org/abs/1311.2524, UC Berkeley and ICSI, 2013.
-  I. J. Goodfellow, A. C. Courville, and Y. Bengio. Large-scale feature learning with spike-and-slab sparse coding. In Proceedings of the 29th International Conference on Machine Learning, 2012.
-  A. Graves, S. Fernandez, F. J. Gomez, and J. Schmidhuber. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural nets. In ICML’06: Proceedings of the International Conference on Machine Learning, 2006.
-  A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, and J. Schmidhuber. A novel connectionist system for improved unconstrained handwriting recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(5), 2009.
-  A. Graves, A. Mohamed, and G. E. Hinton. Speech recognition with deep recurrent neural networks. In Proceedings of the ICASSP, 2013.
-  A. Graves and J. Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5-6):602–610, 2005.
-  A. Graves and J. Schmidhuber. Offline handwriting recognition with multidimensional recurrent neural networks. In Advances in Neural Information Processing Systems 21. MIT Press, Cambridge, MA, 2009.
-  A. Griewank. Who invented the reverse mode of differentiation? Documenta Mathematica - Extra Volume ISMP, pages 389–400, 2012.
-  J. Hadamard. Mémoire sur le problème d’analyse relatif à l’équilibre des plaques élastiques encastrées. Mémoires présentés par divers savants à l’Académie des sciences de l’Institut de France: Éxtrait. Imprimerie nationale, 1908.
-  J. Hawkins and D. George. Hierarchical Temporal Memory - Concepts, Theory, and Terminology. Numenta Inc., 2006.
-  G. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
-  G. E. Hinton. Connectionist learning procedures. Artificial intelligence, 40(1):185–234, 1989.
-  S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut für Informatik, Lehrstuhl Prof. Brauer, Technische Universität München, 1991. Advisor: J. Schmidhuber.
-  S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In S. C. Kremer and J. F. Kolen, editors, A Field Guide to Dynamical Recurrent Neural Networks. IEEE Press, 2001.
-  S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
-  D. H. Hubel and T. Wiesel. Receptive fields, binocular interaction, and functional architecture in the cat’s visual cortex. Journal of Physiology (London), 160:106–154, 1962.
-  H. J. Kelley. Gradient theory of optimal flight paths. ARS Journal, 30(10):947–954, 1960.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS 2012), 2012.
-  R. Kurzweil. How to Create a Mind: The Secret of Human Thought Revealed. 2012.
-  Y. LeCun. Une procédure d’apprentissage pour réseau à seuil asymétrique. Proceedings of Cognitiva 85, Paris, pages 599–604, 1985.
-  Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Back-propagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.
-  Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Handwritten digit recognition with a back-propagation network. In D. S. Touretzky, editor, Advances in Neural Information Processing Systems 2, pages 396–404. Morgan Kaufmann, 1990.
-  Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998.
-  G. W. Leibniz. Memoir using the chain rule (cited in tmme 7:2&3 p 321-332, 2010). 1676.
-  G. F. A. L’Hospital. Analyse des infiniment petits, pour l’intelligence des lignes courbes. Paris: L’Imprimerie Royale, 1696.
-  S. Linnainmaa. The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. Master’s thesis, Univ. Helsinki, 1970.
-  S. Linnainmaa. Taylor expansion of the accumulated rounding error. BIT Numerical Mathematics, 16(2):146–160, 1976.
-  J. Martens and I. Sutskever. Learning recurrent neural networks with Hessian-free optimization. In Proceedings of the 28th International Conference on Machine Learning, pages 1033–1040, 2011.
-  J. Masci, D. C. Ciresan, A. Giusti, L. M. Gambardella, and J. Schmidhuber. Fast image scanning with deep max-pooling convolutional neural networks. In Proc. ICIP, 2013.
-  J. Masci, A. Giusti, D. C. Ciresan, G. Fricout, and J. Schmidhuber. A fast learning algorithm for image segmentation with max-pooling convolutional networks. In Proc. ICIP, 2013.
-  G. Montavon, G. Orr, and K. Müller. Neural Networks: Tricks of the Trade. Number LNCS 7700 in Lecture Notes in Computer Science Series. Springer Verlag, 2012.
-  G. Orr and K. Müller. Neural Networks: Tricks of the Trade. Number LNCS 1524 in Lecture Notes in Computer Science Series. Springer Verlag, 1998.
-  G. M. Ostrovskii, Y. M. Volin, and W. W. Borisov. Über die Berechnung von Ableitungen. Wiss. Z. Tech. Hochschule für Chemie, 13:382–384, 1971.
-  D. B. Parker. Learning-logic. Technical Report TR-47, Center for Comp. Research in Economics and Management Sci., MIT, 1985.
-  B. A. Pearlmutter. Learning state space trajectories in recurrent neural networks. Neural Computation, 1(2):263–269, 1989.
-  J. B. Pollack. Implications of recursive distributed representations. In NIPS, pages 527–536, 1988.
-  M. Ranzato, C. Poultney, S. Chopra, and Y. LeCun. Efficient learning of sparse representations with an energy-based model. In J. P. et al., editor, Advances in Neural Information Processing Systems (NIPS 2006). MIT Press, 2006.
-  M. A. Ranzato, F. Huang, Y. Boureau, and Y. LeCun. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In Proc. Computer Vision and Pattern Recognition Conference (CVPR’07). IEEE Press, 2007.
-  M. Riesenhuber and T. Poggio. Hierarchical models of object recognition in cortex. Nat. Neurosci., 2(11):1019–1025, 1999.
-  A. J. Robinson and F. Fallside. The utility driven dynamic error propagation network. Technical Report CUED/F-INFENG/TR.1, Cambridge University Engineering Department, 1987.
-  L. Roux, D. Racoceanu, N. Lomenie, M. Kulikova, H. Irshad, J. Klossa, F. Capron, C. Genestie, G. L. Naour, and M. N. Gurcan. Mitosis detection in breast cancer histological images - an ICPR 2012 contest. J. Pathol. Inform., 4:8, 2013.
-  D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing, volume 1, pages 318–362. MIT Press, 1986.
-  R. E. Schapire. The strength of weak learnability. Machine Learning, 5:197–227, 1990.
-  D. Scherer, A. Müller, and S. Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Proc. International Conference on Artificial Neural Networks (ICANN), 2010.
-  J. Schmidhuber. A local learning algorithm for dynamic feedforward and recurrent networks. Connection Science, 1(4):403–412, 1989.
-  J. Schmidhuber. Neural sequence chunkers. Technical Report FKI-148-91, Institut für Informatik, Technische Universität München, April 1991.
-  J. Schmidhuber. A fixed size storage time complexity learning algorithm for fully recurrent continually running networks. Neural Computation, 4(2):243–248, 1992.
-  J. Schmidhuber. Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2):234–242, 1992.
-  J. Schmidhuber. Netzwerkarchitekturen, Zielfunktionen und Kettenregel. (Network Architectures, Objective Functions, and Chain Rule.) Habilitationsschrift (Habilitation Thesis), Institut für Informatik, Technische Universität München, 1993.
-  J. Schmidhuber, D. Ciresan, U. Meier, J. Masci, and A. Graves. On fast deep nets for AGI vision. In Proc. Fourth Conference on Artificial General Intelligence (AGI), Google, Mountain View, CA, 2011.
-  J. Schmidhuber, M. C. Mozer, and D. Prelinger. Continuous history compression. In H. Hüning, S. Neuhauser, M. Raus, and W. Ritschel, editors, Proc. of Intl. Workshop on Neural Networks, RWTH Aachen, pages 87–95. Augustinus, 1993.
-  P. Sermanet and Y. LeCun. Traffic sign recognition with multi-scale convolutional networks. In Proceedings of International Joint Conference on Neural Networks (IJCNN’11), 2011.
-  P. Simard, D. Steinkraus, and J. Platt. Best practices for convolutional neural networks applied to visual document analysis. In Seventh International Conference on Document Analysis and Recognition, pages 958–963, 2003.
-  J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel. INI German Traffic Sign Recognition Benchmark for the IJCNN’11 Competition, 2011.
-  M. Veta, M. Viergever, J. Pluim, N. Stathonikos, and P. J. van Diest. MICCAI 2013 Grand Challenge on Mitosis Detection (organisers), 2013.
-  P. J. Werbos. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD thesis, Harvard University, 1974.
-  P. J. Werbos. Applications of advances in nonlinear sensitivity analysis. In Proceedings of the 10th IFIP Conference, 31.8 - 4.9, NYC, pages 762–770, 1981.
-  P. J. Werbos. Generalization of backpropagation with application to a recurrent gas market model. Neural Networks, 1, 1988.
-  P. J. Werbos. Backwards differentiation in AD and neural nets: Past links and new opportunities. In Automatic Differentiation: Applications, Theory, and Implementations, pages 15–34. Springer, 2006.
-  J. H. Wilkinson, editor. The Algebraic Eigenvalue Problem. Oxford University Press, Inc., New York, NY, USA, 1988.
-  R. J. Williams. Complexity of exact gradient computation algorithms for recurrent neural networks. Technical Report NU-CCS-89-27, Boston: Northeastern University, College of Computer Science, 1989.
-  M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. Technical Report arXiv:1311.2901 [cs.CV], NYU, 2013.