1 Timeline of Deep Learning Highlights
1.1 1962: Neurobiological Inspiration Through Simple Cells and Complex Cells
1.2 1970 ± a Decade or so: Backpropagation
Error functions and their gradients for complex, nonlinear, multistage, differentiable, NN-related systems have been discussed at least since the early 1960s, e.g., [41, 50, 13, 27, 12, 93, 4, 26]. Gradient descent [42] in such systems can be performed [13, 50, 12] by iterating the ancient chain rule [57, 58] in dynamic programming style [9] (compare the simplified derivation using the chain rule only [27]). However, efficient error backpropagation (BP) in arbitrary, possibly sparse, NN-like networks apparently was first described by Linnainmaa in 1970 [59, 60] (though he did not refer to NN). BP is also known as the reverse mode of automatic differentiation [41], where the cost of forward activation spreading essentially equals the cost of backward derivative calculation. See early FORTRAN code [59], and compare [66]. Compare the concept of ordered derivatives [89] and related work [28], with NN-specific discussion [89] (section 5.5.1), and the first NN-specific efficient BP of 1981 by Werbos [90, 92]. Compare [53, 75, 67] and generalisations for sequence-processing recurrent NN, e.g., [94, 73, 91, 68, 80, 61]. See also natural gradients [5]. As of 2013, BP is still the central Deep Learning algorithm.
1.3 1979: Deep Neocognitron, Weight Sharing, Convolution
Fukushima’s deep Neocognitron architecture [29, 30, 31] incorporated neurophysiological insights (TL 1.1) [49]. It introduced weight-sharing Convolutional Neural Networks (CNN) as well as winner-take-all layers. It is very similar to the architecture of modern, competition-winning, purely supervised, feedforward, gradient-based Deep Learners (TL 1.12–1.14). Fukushima, however, used local unsupervised learning rules instead.
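The weight-sharing idea can be sketched in a few lines: a single small kernel of weights is reused at every image position, so the number of free parameters is independent of image size. The plain-Python "valid" cross-correlation below is purely illustrative, not Fukushima's original formulation:

```python
# Weight sharing: the SAME small kernel is applied at every (i, j),
# unlike a fully connected layer with one weight per input-output pair.
def conv2d_valid(image, kernel):
    H, W = len(image), len(image[0])
    kH, kW = len(kernel), len(kernel[0])
    out = []
    for i in range(H - kH + 1):
        row = []
        for j in range(W - kW + 1):
            # shared weights: kernel does not depend on (i, j)
            s = sum(kernel[a][b] * image[i + a][j + b]
                    for a in range(kH) for b in range(kW))
            row.append(s)
        out.append(row)
    return out

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
edge = [[1, -1]]              # horizontal difference detector: 2 weights total
fmap = conv2d_valid(image, edge)   # 3x2 feature map
```

A fully connected layer mapping this 3x3 image to the same 3x2 output would need 54 weights; the shared kernel needs 2.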
1.4 1987: Autoencoder Hierarchies
1.5 1989: Backpropagation for CNN
1.6 1991: Fundamental Deep Learning Problem
1.7 1991: Deep Hierarchy of Recurrent NN
My first recurrent Deep Learning system (present paper) partially overcame the fundamental problem (TL 1.6) through a deep RNN stack pretrained in unsupervised fashion [79, 81, 82] to accelerate subsequent supervised learning. This was a working Deep Learner in the modern post-2000 sense, and also the first Neural Hierarchical Temporal Memory.
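The history compression principle behind that RNN stack [81] can be illustrated with a toy sketch: a lower level tries to predict the next symbol, and only mispredicted ("unexpected") symbols are passed upward, so the higher level operates on a shorter sequence. The table-based predictor here is a hypothetical stand-in for the lower RNN:

```python
# Toy sketch of history compression: only symbols the lower-level
# predictor fails to predict are forwarded to the higher level.
def compress(sequence, predictor):
    """Return (higher-level sequence, compression ratio)."""
    compressed = []
    prev = None
    for symbol in sequence:
        if prev is None or predictor.get(prev) != symbol:
            compressed.append(symbol)   # unexpected event -> pass upward
        prev = symbol
    return compressed, len(compressed) / len(sequence)

# deterministic sub-sequences a->b->c are predictable after their start
predictor = {'a': 'b', 'b': 'c'}
seq = list("abcabcxabc")
higher, ratio = compress(seq, predictor)   # higher level sees 4 of 10 symbols
```

The higher level thus learns from a compressed description of the input history, which shortens the credit assignment paths that make plain RNN training hard (TL 1.6).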
1.8 1997: Supervised Deep Learner (LSTM)
1.9 2006: Deep Belief Networks / CNN Results
A paper by Hinton and Salakhutdinov [44] focused on unsupervised pretraining of feedforward NN to accelerate subsequent supervised learning (compare TL 1.7). This helped to arouse interest in deep NN (keywords: restricted Boltzmann machines; Deep Belief Networks). In the same year, a BP-trained CNN (TL 1.3, TL 1.5) by Ranzato et al. [70] set a new record on the famous MNIST handwritten digit recognition benchmark [54], using training pattern deformations [6, 86].
1.10 2009: First Competitions Won by Deep Learning
1.11 2010: Plain Backpropagation on GPUs Yields Excellent Results
In 2010, a new MNIST record was set by good old backpropagation (TL 1.2) in deep but otherwise standard NN, without unsupervised pretraining and without convolution (but with training pattern deformations). This was made possible mainly by boosting computing power through a fast GPU implementation [17]. (A year later, the first human-competitive performance on MNIST was achieved by a deep MC-MPCNN (TL 1.12) [22].)
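Backpropagation itself (TL 1.2) is just the chain rule iterated backward through the network's computation graph. Below is a minimal scalar sketch of this reverse mode; the class and function names are illustrative and not taken from any of the cited papers:

```python
# Reverse-mode differentiation: a forward pass records the graph,
# then derivatives flow backward via the chain rule, at a cost
# comparable to the forward pass.
import math

class Var:
    def __init__(self, value, parents=()):
        self.value = value        # forward activation
        self.grad = 0.0           # accumulated derivative d(output)/d(self)
        self.parents = parents    # (parent, local_derivative) pairs

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def tanh(self):
        t = math.tanh(self.value)
        return Var(t, [(self, 1.0 - t * t)])

def backward(output):
    """Propagate derivatives to every node, in reverse topological order."""
    order, seen = [], set()
    def visit(v):
        if id(v) not in seen:
            seen.add(id(v))
            for p, _ in v.parents:
                visit(p)
            order.append(v)
    visit(output)
    output.grad = 1.0
    for v in reversed(order):
        for parent, local in v.parents:
            parent.grad += v.grad * local   # chain rule

# one neuron: y = tanh(w*x + b)
w, x, b = Var(0.5), Var(2.0), Var(-0.5)
y = (w * x + b).tanh()
backward(y)   # w.grad, x.grad, b.grad now hold dy/dw, dy/dx, dy/db
```

The GPU implementation of [17] applies exactly this gradient computation, only vectorised over whole layers and mini-batches.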
1.12 2011: MPCNN on GPU / First Superhuman Visual Pattern Recognition
In 2011, Ciresan et al. introduced supervised GPU-based Max-Pooling CNN or Convnets (MPCNN) [18], today used by most if not all feedforward competition-winning deep NN (TL 1.13, TL 1.14). The first superhuman visual pattern recognition in a controlled competition (traffic signs [87]) was achieved [20, 19] (twice better than humans, three times better than the closest artificial NN competitor, six times better than the best non-neural method), through deep and wide Multi-Column (MC) GPU-MPCNN [18, 19], the current gold standard for deep feedforward NN.
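The max-pooling step that distinguishes MPCNN can be sketched as follows: each non-overlapping p x p window of a feature map is replaced by its maximum, yielding a smaller map with some tolerance to small translations. This plain-Python sketch is illustrative only, not the GPU implementation of [18]:

```python
# Max-pooling: downsample a feature map by taking the maximum
# of each non-overlapping p x p window.
def max_pool(fmap, p):
    H, W = len(fmap), len(fmap[0])
    return [[max(fmap[i + a][j + b] for a in range(p) for b in range(p))
             for j in range(0, W - p + 1, p)]
            for i in range(0, H - p + 1, p)]

fmap = [[1, 3, 2, 0],
        [5, 4, 1, 1],
        [0, 2, 9, 6],
        [7, 8, 3, 2]]
pooled = max_pool(fmap, 2)   # 4x4 map -> 2x2 map
```

Because only the window maximum survives, a feature detected anywhere inside a window produces the same pooled output, which is what gives the architecture its robustness to small shifts.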
1.13 2012: First Contests Won on Object Detection and Image Segmentation
2012 saw the first Deep Learning system (a GPU-MC-MPCNN [18, 19], TL 1.12) to win a contest on visual object detection in large images (as opposed to mere recognition/classification): the ICPR 2012 Contest on Mitosis Detection in Breast Cancer Histological Images [2, 74, 16]. An MC (TL 1.12) variant of a GPU-MPCNN also achieved the best results on the ImageNet classification benchmark [51]. 2012 also saw the first pure image segmentation contest won by Deep Learning (again through a GPU-MC-MPCNN), namely, the ISBI 2012 Challenge on segmenting neuronal structures [3, 15]. This was the 8th international pattern recognition contest won by my team since 2009 [1].
1.14 2013: More Contests and Benchmark Records
In 2013, a new TIMIT phoneme recognition record was set by deep LSTM RNN [38] (TL 1.8, TL 1.10). A new record [24] on the ICDAR Chinese handwriting recognition benchmark (over 3700 classes) was set on a desktop machine by a GPU-MC-MPCNN with almost human performance. The MICCAI 2013 Grand Challenge on Mitosis Detection was won by a GPU-MC-MPCNN [88, 16]. Deep GPU-MPCNN [18] also helped to achieve new best results on ImageNet classification [95] and PASCAL object detection [34]. Additional contests are mentioned on the web pages of the Swiss AI Lab IDSIA, the University of Toronto, New York University, and the University of Montreal.
2 Acknowledgments
Drafts/revisions of this paper have been published since 20 September 2013 on my massive open peer review web site www.idsia.ch/~juergen/firstdeeplearner.html (also under www.deeplearning.me). Thanks for valuable comments to Geoffrey Hinton, Kunihiko Fukushima, Yoshua Bengio, Sven Behnke, Yann LeCun, Sepp Hochreiter, Mike Mozer, Marc’Aurelio Ranzato, Andreas Griewank, Paul Werbos, Shun-ichi Amari, Seppo Linnainmaa, Peter Norvig, Yu-Chi Ho, Alex Graves, Dan Ciresan, Jonathan Masci, Stuart Dreyfus, and others.
References
 [1] A. Angelica interviews J. Schmidhuber: How Bio-Inspired Deep Learning Keeps Winning Competitions, 2012. KurzweilAI: http://www.kurzweilai.net/how-bio-inspired-deep-learning-keeps-winning-competitions.
 [2] ICPR 2012 Contest on Mitosis Detection in Breast Cancer Histological Images. Organizers: IPAL Laboratory, TRIBVN Company, Pitié-Salpêtrière Hospital, CIALAB of Ohio State Univ., 2012.
 [3] Segmentation of Neuronal Structures in EM Stacks Challenge  IEEE International Symposium on Biomedical Imaging (ISBI), 2012.
 [4] S. Amari. A theory of adaptive pattern classifiers. IEEE Trans. EC, 16(3):299–307, 1967.
 [5] S.-I. Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
 [6] H. Baird. Document Image Defect Models. In Proceedings, IAPR Workshop on Syntactic and Structural Pattern Recognition, Murray Hill, NJ, 1990.
 [7] D. H. Ballard. Modular learning in neural networks. In AAAI, pages 279–284, 1987.
 [8] S. Behnke. Hierarchical Neural Networks for Image Interpretation, volume LNCS 2766 of Lecture Notes in Computer Science. Springer, 2003.
 [9] R. Bellman. Dynamic Programming. Princeton University Press, Princeton, NJ, USA, 1st edition, 1957.
 [10] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.
 [11] T. Bluche, J. Louradour, M. Knibbe, B. Moysset, F. Benzeghiba, and C. Kermorvant. The A2iA Arabic Handwritten Text Recognition System at the OpenHaRT2013 Evaluation. In Submitted to DAS 2014, 2013.
 [12] A. Bryson and Y. Ho. Applied optimal control: optimization, estimation, and control. Blaisdell Pub. Co., 1969.
 [13] A. E. Bryson. A gradient method for optimizing multistage allocation processes. In Proc. Harvard Univ. Symposium on digital computers and their applications, 1961.
 [14] K. Chellapilla, S. Puri, and P. Simard. High performance convolutional neural networks for document processing. In International Workshop on Frontiers in Handwriting Recognition, 2006.
 [15] D. C. Ciresan, A. Giusti, L. M. Gambardella, and J. Schmidhuber. Deep neural networks segment neuronal membranes in electron microscopy images. In Advances in Neural Information Processing Systems NIPS, 2012.
 [16] D. C. Ciresan, A. Giusti, L. M. Gambardella, and J. Schmidhuber. Mitosis detection in breast cancer histology images using deep neural networks. In MICCAI 2013, 2013.
 [17] D. C. Ciresan, U. Meier, L. M. Gambardella, and J. Schmidhuber. Deep big simple neural nets for handwritten digit recognition. Neural Computation, 22(12):3207–3220, 2010.
 [18] D. C. Ciresan, U. Meier, J. Masci, L. M. Gambardella, and J. Schmidhuber. Flexible, high performance convolutional neural networks for image classification. In Intl. Joint Conference on Artificial Intelligence IJCAI, pages 1237–1242, 2011.
 [19] D. C. Ciresan, U. Meier, J. Masci, and J. Schmidhuber. A committee of neural networks for traffic sign classification. In International Joint Conference on Neural Networks, pages 1918–1921, 2011.
 [20] D. C. Ciresan, U. Meier, J. Masci, and J. Schmidhuber. Multi-column deep neural network for traffic sign classification. Neural Networks, 2012.
 [21] D. C. Ciresan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classification. Technical report, IDSIA, February 2012. arXiv:1202.2745v1 [cs.CV].
 [22] D. C. Ciresan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classification. In IEEE Conference on Computer Vision and Pattern Recognition CVPR 2012, 2012. Long preprint arXiv:1202.2745v1 [cs.CV].
 [23] D. C. Ciresan, U. Meier, and J. Schmidhuber. Transfer learning for Latin and Chinese characters with deep neural networks. In International Joint Conference on Neural Networks, pages 1301–1306, 2012.
 [24] D. C. Ciresan and J. Schmidhuber. Multi-column deep neural networks for offline handwritten Chinese character classification. Technical report, IDSIA, September 2013. arXiv:1309.0261.
 [25] A. Coates, B. Huval, T. Wang, D. J. Wu, A. Y. Ng, and B. Catanzaro. Deep learning with COTS HPC systems. In Proc. International Conference on Machine learning (ICML’13), 2013.
 [26] S. W. Director and R. A. Rohrer. Automated network design  the frequencydomain case. IEEE Trans. Circuit Theory, CT16:330–337, 1969.
 [27] S. E. Dreyfus. The numerical solution of variational problems. Journal of Mathematical Analysis and Applications, 5(1):30–45, 1962.
 [28] S. E. Dreyfus. The computational solution of optimal control problems with time lag. IEEE Transactions on Automatic Control, 18(4):383–385, 1973.
 [29] K. Fukushima. Neural network model for a mechanism of pattern recognition unaffected by shift in position  neocognitron. In Trans. IECE, 1979.
 [30] K. Fukushima. Neocognitron: A selforganizing neural network for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4):193–202, 1980.
 [31] K. Fukushima. Artificial vision by multilayered neural networks: Neocognitron and its advances. Neural Networks, 37:103–119, Jan. 2013.
 [32] F. A. Gers, J. Schmidhuber, and F. Cummins. Learning to forget: Continual prediction with LSTM. In Proc. ICANN’99, Int. Conf. on Artificial Neural Networks, pages 850–855, Edinburgh, Scotland, 1999. IEE, London.
 [33] F. A. Gers, J. Schmidhuber, and F. Cummins. Learning to forget: Continual prediction with LSTM. Neural Computation, 12(10):2451–2471, 2000.
 [34] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. Technical Report arxiv.org/abs/1311.2524, UC Berkeley and ICSI, 2013.
 [35] I. J. Goodfellow, A. C. Courville, and Y. Bengio. Large-scale feature learning with spike-and-slab sparse coding. In Proceedings of the 29th International Conference on Machine Learning, 2012.
 [36] A. Graves, S. Fernandez, F. J. Gomez, and J. Schmidhuber. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural nets. In ICML’06: Proceedings of the International Conference on Machine Learning, 2006.
 [37] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, and J. Schmidhuber. A novel connectionist system for improved unconstrained handwriting recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(5), 2009.
 [38] A. Graves, A. Mohamed, and G. E. Hinton. Speech recognition with deep recurrent neural networks. In Proceedings of the ICASSP, 2013.
 [39] A. Graves and J. Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5-6):602–610, 2005.
 [40] A. Graves and J. Schmidhuber. Offline handwriting recognition with multidimensional recurrent neural networks. In Advances in Neural Information Processing Systems 21. MIT Press, Cambridge, MA, 2009.
 [41] A. Griewank. Who invented the reverse mode of differentiation? Documenta Mathematica, Extra Volume ISMP, pages 389–400, 2012.
 [42] J. Hadamard. Mémoire sur le problème d’analyse relatif à l’équilibre des plaques élastiques encastrées. Mémoires présentés par divers savants à l’Académie des sciences de l’Institut de France: Éxtrait. Imprimerie nationale, 1908.
 [43] J. Hawkins and D. George. Hierarchical Temporal Memory  Concepts, Theory, and Terminology. Numenta Inc., 2006.
 [44] G. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
 [45] G. E. Hinton. Connectionist learning procedures. Artificial intelligence, 40(1):185–234, 1989.
 [46] S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut für Informatik, Lehrstuhl Prof. Brauer, Technische Universität München, 1991. Advisor: J. Schmidhuber.
 [47] S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In S. C. Kremer and J. F. Kolen, editors, A Field Guide to Dynamical Recurrent Neural Networks. IEEE Press, 2001.
 [48] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
 [49] D. H. Hubel and T. Wiesel. Receptive fields, binocular interaction, and functional architecture in the cat’s visual cortex. Journal of Physiology (London), 160:106–154, 1962.
 [50] H. J. Kelley. Gradient theory of optimal flight paths. ARS Journal, 30(10):947–954, 1960.
 [51] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS 2012), 2012.
 [52] R. Kurzweil. How to Create a Mind: The Secret of Human Thought Revealed. 2012.
 [53] Y. LeCun. Une procédure d’apprentissage pour réseau à seuil asymétrique. Proceedings of Cognitiva 85, Paris, pages 599–604, 1985.
 [54] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.
 [55] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Handwritten digit recognition with a backpropagation network. In D. S. Touretzky, editor, Advances in Neural Information Processing Systems 2, pages 396–404. Morgan Kaufmann, 1990.
 [56] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998.
 [57] G. W. Leibniz. Memoir using the chain rule (cited in TMME 7:2&3, pages 321–332, 2010). 1676.
 [58] G. F. A. L’Hospital. Analyse des infiniment petits, pour l’intelligence des lignes courbes. Paris: L’Imprimerie Royale, 1696.
 [59] S. Linnainmaa. The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. Master’s thesis, Univ. Helsinki, 1970.
 [60] S. Linnainmaa. Taylor expansion of the accumulated rounding error. BIT Numerical Mathematics, 16(2):146–160, 1976.
 [61] J. Martens and I. Sutskever. Learning recurrent neural networks with Hessian-free optimization. In Proceedings of the 28th International Conference on Machine Learning, pages 1033–1040, 2011.
 [62] J. Masci, D. C. Ciresan, A. Giusti, L. M. Gambardella, and J. Schmidhuber. Fast image scanning with deep max-pooling convolutional neural networks. In Proc. ICIP, 2013.
 [63] J. Masci, A. Giusti, D. C. Ciresan, G. Fricout, and J. Schmidhuber. A fast learning algorithm for image segmentation with max-pooling convolutional networks. In Proc. ICIP, 2013.
 [64] G. Montavon, G. Orr, and K. Müller. Neural Networks: Tricks of the Trade. Number LNCS 7700 in Lecture Notes in Computer Science Series. Springer Verlag, 2012.
 [65] G. Orr and K. Müller. Neural Networks: Tricks of the Trade. Number LNCS 1524 in Lecture Notes in Computer Science Series. Springer Verlag, 1998.
 [66] G. M. Ostrovskii, Y. M. Volin, and W. W. Borisov. Über die Berechnung von Ableitungen. Wiss. Z. Tech. Hochschule für Chemie, 13:382–384, 1971.
 [67] D. B. Parker. Learning-logic. Technical Report TR-47, Center for Comp. Research in Economics and Management Sci., MIT, 1985.
 [68] B. A. Pearlmutter. Learning state space trajectories in recurrent neural networks. Neural Computation, 1(2):263–269, 1989.
 [69] J. B. Pollack. Implications of recursive distributed representations. In NIPS, pages 527–536, 1988.
 [70] M. Ranzato, C. Poultney, S. Chopra, and Y. LeCun. Efficient learning of sparse representations with an energy-based model. In J. P. et al., editor, Advances in Neural Information Processing Systems (NIPS 2006). MIT Press, 2006.
 [71] M. A. Ranzato, F. Huang, Y. Boureau, and Y. LeCun. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In Proc. Computer Vision and Pattern Recognition Conference (CVPR’07). IEEE Press, 2007.
 [72] M. Riesenhuber and T. Poggio. Hierarchical models of object recognition in cortex. Nat. Neurosci., 2(11):1019–1025, 1999.
 [73] A. J. Robinson and F. Fallside. The utility driven dynamic error propagation network. Technical Report CUED/F-INFENG/TR.1, Cambridge University Engineering Department, 1987.
 [74] L. Roux, D. Racoceanu, N. Lomenie, M. Kulikova, H. Irshad, J. Klossa, F. Capron, C. Genestie, G. L. Naour, and M. N. Gurcan. Mitosis detection in breast cancer histological images  an ICPR 2012 contest. J. Pathol. Inform., 4:8, 2013.
 [75] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing, volume 1, pages 318–362. MIT Press, 1986.
 [76] R. E. Schapire. The strength of weak learnability. Machine Learning, 5:197–227, 1990.
 [77] D. Scherer, A. Müller, and S. Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Proc. International Conference on Artificial Neural Networks (ICANN), 2010.
 [78] J. Schmidhuber. A local learning algorithm for dynamic feedforward and recurrent networks. Connection Science, 1(4):403–412, 1989.
 [79] J. Schmidhuber. Neural sequence chunkers. Technical Report FKI-148-91, Institut für Informatik, Technische Universität München, April 1991.
 [80] J. Schmidhuber. A fixed size storage time complexity learning algorithm for fully recurrent continually running networks. Neural Computation, 4(2):243–248, 1992.
 [81] J. Schmidhuber. Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2):234–242, 1992.
 [82] J. Schmidhuber. Netzwerkarchitekturen, Zielfunktionen und Kettenregel. (Network Architectures, Objective Functions, and Chain Rule.) Habilitationsschrift (Habilitation Thesis), Institut für Informatik, Technische Universität München, 1993.
 [83] J. Schmidhuber, D. Ciresan, U. Meier, J. Masci, and A. Graves. On fast deep nets for AGI vision. In Proc. Fourth Conference on Artificial General Intelligence (AGI), Google, Mountain View, CA, 2011.
 [84] J. Schmidhuber, M. C. Mozer, and D. Prelinger. Continuous history compression. In H. Hüning, S. Neuhauser, M. Raus, and W. Ritschel, editors, Proc. of Intl. Workshop on Neural Networks, RWTH Aachen, pages 87–95. Augustinus, 1993.
 [85] P. Sermanet and Y. LeCun. Traffic sign recognition with multiscale convolutional networks. In Proceedings of International Joint Conference on Neural Networks (IJCNN’11), 2011.
 [86] P. Simard, D. Steinkraus, and J. Platt. Best practices for convolutional neural networks applied to visual document analysis. In Seventh International Conference on Document Analysis and Recognition, pages 958–963, 2003.
 [87] J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel. INI German Traffic Sign Recognition Benchmark for the IJCNN’11 Competition, 2011.
 [88] M. Veta, M. Viergever, J. Pluim, N. Stathonikos, and P. J. van Diest. MICCAI 2013 Grand Challenge on Mitosis Detection (organisers), 2013.
 [89] P. J. Werbos. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD thesis, Harvard University, 1974.
 [90] P. J. Werbos. Applications of advances in nonlinear sensitivity analysis. In Proceedings of the 10th IFIP Conference, 31.8–4.9, NYC, pages 762–770, 1981.
 [91] P. J. Werbos. Generalization of backpropagation with application to a recurrent gas market model. Neural Networks, 1, 1988.
 [92] P. J. Werbos. Backwards differentiation in AD and neural nets: Past links and new opportunities. In Automatic Differentiation: Applications, Theory, and Implementations, pages 15–34. Springer, 2006.
 [93] J. H. Wilkinson, editor. The Algebraic Eigenvalue Problem. Oxford University Press, Inc., New York, NY, USA, 1988.
 [94] R. J. Williams. Complexity of exact gradient computation algorithms for recurrent neural networks. Technical Report NU-CCS-89-27, Boston: Northeastern University, College of Computer Science, 1989.
 [95] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. Technical Report arXiv:1311.2901 [cs.CV], NYU, 2013.