My First Deep Learning System of 1991 + Deep Learning Timeline 1962-2013

Jürgen Schmidhuber, 19 December 2013

Deep Learning has attracted significant attention in recent years. Here I present a brief overview of my first Deep Learner of 1991, and its historical context, with a timeline of Deep Learning highlights.

1 Timeline of Deep Learning Highlights

1.1 1962: Neurobiological Inspiration Through Simple Cells and Complex Cells

Hubel and Wiesel described simple cells and complex cells in the visual cortex [49]. This inspired later deep artificial neural network (NN) architectures (TL 1.3) used in certain modern award-winning Deep Learners (TL 1.12-1.14). (The author of the present paper was conceived in 1962.)

1.2 1970 ± a Decade or so: Backpropagation

Error functions and their gradients for complex, nonlinear, multi-stage, differentiable, NN-related systems have been discussed at least since the early 1960s, e.g., [41, 50, 13, 27, 12, 93, 4, 26]. Gradient descent [42] in such systems can be performed [13, 50, 12] by iterating the ancient chain rule [57, 58] in dynamic programming style [9] (compare a simplified derivation using the chain rule only [27]). However, efficient error backpropagation (BP) in arbitrary, possibly sparse, NN-like networks apparently was first described by Linnainmaa in 1970 [59, 60] (he did not refer to NN, though). BP is also known as the reverse mode of automatic differentiation [41], where the costs of forward activation spreading essentially equal the costs of backward derivative calculation. See early FORTRAN code [59], and compare [66]. Compare the concept of ordered derivatives [89] and related work [28], with NN-specific discussion [89] (section 5.5.1), and the first NN-specific efficient BP of 1981 by Werbos [90, 92]. Compare [53, 75, 67] and generalisations for sequence-processing recurrent NN, e.g., [94, 73, 91, 68, 80, 61]. See also natural gradients [5]. As of 2013, BP is still the central Deep Learning algorithm.
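
The recursion at the heart of BP can be stated compactly. The notation below is generic and chosen here for illustration only; it is not taken from the papers cited above:

    % A layered differentiable system computes x_{l+1} = f_l(x_l, w_l)
    % for l = 1, ..., L, with scalar error E = E(x_{L+1}).
    \[
      \delta_{L+1} = \frac{\partial E}{\partial x_{L+1}}, \qquad
      \delta_l = \Bigl(\frac{\partial f_l}{\partial x_l}\Bigr)^{\!\top} \delta_{l+1}, \qquad
      \frac{\partial E}{\partial w_l} = \Bigl(\frac{\partial f_l}{\partial w_l}\Bigr)^{\!\top} \delta_{l+1}.
    \]
    % A single backward sweep yields all weight gradients at a cost of the same
    % order as one forward sweep -- the efficiency property of the reverse mode.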

1.3 1979: Deep Neocognitron, Weight Sharing, Convolution

Fukushima’s deep Neocognitron architecture [29, 30, 31] incorporated neurophysiological insights (TL 1.1)  [49]. It introduced weight-sharing Convolutional Neural Networks (CNN) as well as winner-take-all layers. It is very similar to the architecture of modern, competition-winning, purely supervised, feedforward, gradient-based Deep Learners (TL 1.12-1.14). Fukushima, however, used local unsupervised learning rules instead.
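
To illustrate the weight-sharing idea only (a minimal numpy sketch with arbitrary sizes, not Fukushima's Neocognitron or any of the cited implementations): one small kernel is replicated across all image positions, so the number of free parameters is independent of the image size.

    import numpy as np

    def conv2d_valid(image, kernel):
        """Slide one shared weight kernel over the image ('valid' convolution)."""
        H, W = image.shape
        kh, kw = kernel.shape
        out = np.zeros((H - kh + 1, W - kw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
        return out

    rng = np.random.default_rng(0)
    image = rng.standard_normal((28, 28))   # toy input
    kernel = rng.standard_normal((5, 5))    # 25 shared weights for the whole map
    feature_map = conv2d_valid(image, kernel)
    print(feature_map.shape)                # (24, 24); a fully connected map of this
                                            # size would need 24*24*28*28 weights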

1.4 1987: Autoencoder Hierarchies

In 1987, Ballard published ideas on unsupervised autoencoder hierarchies [7], related to post-2000 feedforward Deep Learners (TL 1.9) based on unsupervised pre-training, e.g., [44]; compare survey [45] and somewhat related RAAMs [69].
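
As a rough illustration of the autoencoder idea (a minimal numpy sketch under toy assumptions, not Ballard's 1987 model nor the pre-training recipe of [44]): a network is trained to reconstruct its own input through a narrower hidden layer, and the resulting hidden code could serve as input to the next autoencoder in a hierarchy.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 20))       # 200 toy inputs of dimension 20
    W1 = rng.standard_normal((20, 5)) * 0.1  # encoder: 20 -> 5 (the bottleneck)
    W2 = rng.standard_normal((5, 20)) * 0.1  # decoder: 5 -> 20
    lr = 0.01

    for step in range(500):
        H = np.tanh(X @ W1)                  # hidden code
        X_hat = H @ W2                       # reconstruction of the input
        err = X_hat - X                      # reconstruction error
        dW2 = H.T @ err / len(X)             # gradient of the mean squared error
        dH = err @ W2.T * (1.0 - H ** 2)     # backprop through tanh
        dW1 = X.T @ dH / len(X)
        W1 -= lr * dW1
        W2 -= lr * dW2

    print(float(np.mean(err ** 2)))          # lower than the initial error of ~1.0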

1.5 1989: Backpropagation for CNN

LeCun et al. [54, 55] applied backpropagation (TL 1.2) to Fukushima’s weight-sharing convolutional neural layers (TL 1.3) [29, 30, 54]. This combination has become an essential ingredient of many modern, competition-winning, feedforward, visual Deep Learners (TL 1.12-1.13).

1.6 1991: Fundamental Deep Learning Problem

By the early 1990s, experiments had shown that deep feedforward or recurrent networks are hard to train by backpropagation (TL 1.2). My student Hochreiter [46] discovered and analyzed the reason, namely, the Fundamental Deep Learning Problem due to vanishing or exploding gradients. Compare [47].
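
In generic notation (chosen here for illustration, not the notation of [46, 47]), the problem can be summarized as follows: the error gradient propagated back through a recurrent net is a product of Jacobians, one per time step, and such products tend to shrink or grow exponentially with the number of steps.

    % Hidden states h_t = f(W h_{t-1} + U x_t); scalar error E measured at time T.
    \[
      \frac{\partial E}{\partial h_t}
        = \frac{\partial E}{\partial h_T}
          \prod_{k=t}^{T-1} \frac{\partial h_{k+1}}{\partial h_k},
      \qquad
      \Bigl\| \frac{\partial h_{k+1}}{\partial h_k} \Bigr\|
        \le \| W \| \cdot \max_z |f'(z)| =: \lambda .
    \]
    % If lambda < 1, the gradient shrinks like lambda^(T-t), i.e. exponentially in
    % the number of intervening steps (vanishing); if the Jacobian norms exceed 1,
    % it can grow exponentially (exploding).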

1.7 1991: Deep Hierarchy of Recurrent NN

My first recurrent Deep Learning system (present paper) partially overcame the fundamental problem (TL 1.6) through a deep RNN stack pre-trained in unsupervised fashion [79, 81, 82] to accelerate subsequent supervised learning. This was a working Deep Learner in the modern post-2000 sense, and also the first Neural Hierarchical Temporal Memory.
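
To convey the principle only (the sketch below replaces the RNNs of the original system with trivial bigram counters, so it is a caricature, not the 1991 implementation of [79, 81]): a lower-level predictor tries to predict the next input, and only the inputs it fails to predict are handed to the next level, which therefore operates on a much shorter sequence of unexpected events.

    from collections import defaultdict

    class NextSymbolPredictor:
        """Predicts the next symbol from the current one by counting bigrams."""
        def __init__(self):
            self.counts = defaultdict(lambda: defaultdict(int))

        def predict(self, current):
            options = self.counts[current]
            return max(options, key=options.get) if options else None

        def update(self, current, nxt):
            self.counts[current][nxt] += 1

    def compress(sequence, predictor):
        """Return only the symbols the lower level failed to predict."""
        unexpected = [sequence[0]]             # the first symbol is never predicted
        for prev, cur in zip(sequence, sequence[1:]):
            if predictor.predict(prev) != cur:
                unexpected.append(cur)         # surprise: hand it to the next level
            predictor.update(prev, cur)
        return unexpected

    seq = list("abcabcabcabxabcabc")           # mostly predictable, one rare 'x'
    level0 = NextSymbolPredictor()
    print(compress(seq, level0))               # ['a', 'b', 'c', 'a', 'x', 'a'],
                                               # i.e. 6 symbols instead of 18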

1.8 1997: Supervised Deep Learner (LSTM)

Long Short-Term Memory (LSTM) recurrent neural networks (RNN) became the first purely supervised Deep Learners, e.g., [48, 33, 39, 36, 37, 40, 38]. LSTM RNN were able to learn solutions to many previously unlearnable problems (see also TL 1.10, TL 1.14).
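
For orientation, a minimal numpy sketch of a single LSTM step follows; it uses the now-common variant with a forget gate [32, 33] rather than the exact 1997 formulation of [48], and all names and sizes are illustrative.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x, h_prev, c_prev, W, b):
        """One step for a layer of n cells; W maps [x, h_prev] to 4n gate inputs."""
        n = h_prev.shape[0]
        z = W @ np.concatenate([x, h_prev]) + b
        i = sigmoid(z[0 * n:1 * n])        # input gate
        f = sigmoid(z[1 * n:2 * n])        # forget gate
        o = sigmoid(z[2 * n:3 * n])        # output gate
        g = np.tanh(z[3 * n:4 * n])        # candidate cell input
        c = f * c_prev + i * g             # cell state: gated memory carried over time
        h = o * np.tanh(c)                 # cell output
        return h, c

    # toy usage: 3 inputs per step, 4 memory cells
    rng = np.random.default_rng(0)
    W = rng.standard_normal((16, 7)) * 0.1 # 4*4 gate rows, 3+4 input columns
    b = np.zeros(16)
    h, c = np.zeros(4), np.zeros(4)
    for x in rng.standard_normal((10, 3)): # run over a short input sequence
        h, c = lstm_step(x, h, c, W, b)
    print(h)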

1.9 2006: Deep Belief Networks / CNN Results

A paper by Hinton and Salakhutdinov [44] focused on unsupervised pre-training of feedforward NN to accelerate subsequent supervised learning (compare TL 1.7). This helped to arouse interest in deep NN (keywords: restricted Boltzmann machines; Deep Belief Networks). In the same year, a BP-trained CNN (TL 1.3, TL 1.5) by Ranzato et al. [70] set a new record on the famous MNIST handwritten digit recognition benchmark [54], using training pattern deformations [6, 86].

1.10 2009: First Competitions Won by Deep Learning

2009 saw the first Deep Learning systems to win official international pattern recognition contests (with secret test sets known only to the organisers): three connected handwriting competitions at ICDAR 2009 were won by deep LSTM RNN [40, 83] performing simultaneous segmentation and recognition.

1.11 2010: Plain Backpropagation on GPUs Yields Excellent Results

In 2010, a new MNIST record was set by good old backpropagation (TL 1.2) in deep but otherwise standard NN, without unsupervised pre-training, and without convolution (but with training pattern deformations). This was made possible mainly by boosting computing power through a fast GPU implementation [17]. (A year later, first human-competitive performance on MNIST was achieved by a deep MCMPCNN (TL 1.12) [22].)

1.12 2011: MPCNN on GPU / First Superhuman Visual Pattern Recognition

In 2011, Ciresan et al. introduced supervised GPU-based Max-Pooling CNN or Convnets (MPCNN) [18], today used by most if not all feedforward competition-winning deep NN (TL 1.13, TL 1.14). The first superhuman visual pattern recognition in a controlled competition (traffic signs [87]) was achieved [20, 19] (twice better than humans, three times better than the closest artificial NN competitor, six times better than the best non-neural method), through deep and wide Multi-Column (MC) GPU-MPCNN [18, 19], the current gold standard for deep feedforward NN.
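
A minimal numpy sketch of the max-pooling operation itself (an illustration only, not the GPU implementation of [18]): each feature map is divided into non-overlapping windows and only the strongest activation per window is kept, yielding a smaller map that is robust to small translations of the input.

    import numpy as np

    def max_pool2d(feature_map, size=2):
        """Downsample by taking the maximum of each non-overlapping size x size window."""
        H, W = feature_map.shape
        H2, W2 = H // size, W // size
        # group the map into (H2, size, W2, size) blocks, then reduce each block
        blocks = feature_map[:H2 * size, :W2 * size].reshape(H2, size, W2, size)
        return blocks.max(axis=(1, 3))

    fm = np.arange(16, dtype=float).reshape(4, 4)
    print(max_pool2d(fm))   # [[ 5.  7.] [13. 15.]]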

1.13 2012: First Contests Won on Object Detection and Image Segmentation

2012 saw the first Deep Learning system (a GPU-MCMPCNN [18, 19], TL 1.12) to win a contest on visual object detection in large images (as opposed to mere recognition/classification): the ICPR 2012 Contest on Mitosis Detection in Breast Cancer Histological Images [2, 74, 16]. An MC (TL 1.12) variant of a GPU-MPCNN also achieved best results on the ImageNet classification benchmark [51]. 2012 also saw the first pure image segmentation contest won by Deep Learning (again through a GPU-MCMPCNN), namely, the ISBI 2012 Challenge on segmenting neuronal structures [3, 15]. This was the 8th international pattern recognition contest won by my team since 2009 [1].

1.14 2013: More Contests and Benchmark Records

In 2013, a new TIMIT phoneme recognition record was set by deep LSTM RNN [38] (TL 1.8, TL 1.10). A new record [24] on the ICDAR Chinese handwriting recognition benchmark (over 3700 classes) was set on a desktop machine by a GPU-MCMPCNN with almost human performance. The MICCAI 2013 Grand Challenge on Mitosis Detection was won by a GPU-MCMPCNN [88, 16]. Deep GPU-MPCNN [18] also helped to achieve new best results on ImageNet classification [95] and PASCAL object detection [34]. Additional contests are mentioned in the web pages of the Swiss AI Lab IDSIA, the University of Toronto, New York University, and the University of Montreal.

2 Acknowledgments

Drafts/revisions of this paper have been published since 20 Sept 2013 in my massive open peer review web site www.idsia.ch/~juergen/firstdeeplearner.html (also under www.deeplearning.me). Thanks for valuable comments to Geoffrey Hinton, Kunihiko Fukushima, Yoshua Bengio, Sven Behnke, Yann LeCun, Sepp Hochreiter, Mike Mozer, Marc’Aurelio Ranzato, Andreas Griewank, Paul Werbos, Shun-ichi Amari, Seppo Linnainmaa, Peter Norvig, Yu-Chi Ho, Alex Graves, Dan Ciresan, Jonathan Masci, Stuart Dreyfus, and others.

References

  • [1] A. Angelica interviews J. Schmidhuber: How Bio-Inspired Deep Learning Keeps Winning Competitions, 2012. KurzweilAI: http://www.kurzweilai.net/how-bio-inspired-deep-learning-keeps-winning-competitions.
  • [2] ICPR 2012 Contest on Mitosis Detection in Breast Cancer Histological Images. Organizers: IPAL Laboratory, TRIBVN Company, Pitie-Salpetriere Hospital, CIALAB of Ohio State Univ., 2012.
  • [3] Segmentation of Neuronal Structures in EM Stacks Challenge - IEEE International Symposium on Biomedical Imaging (ISBI), 2012.
  • [4] S. Amari. A theory of adaptive pattern classifiers. IEEE Trans. EC, 16(3):299–307, 1967.
  • [5] S.-I. Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
  • [6] H. Baird. Document Image Defect Models. In Proceedings, IAPR Workshop on Syntactic and Structural Pattern Recognition, Murray Hill, NJ, 1990.
  • [7] D. H. Ballard. Modular learning in neural networks. In AAAI, pages 279–284, 1987.
  • [8] S. Behnke. Hierarchical Neural Networks for Image Interpretation, volume LNCS 2766 of Lecture Notes in Computer Science. Springer, 2003.
  • [9] R. Bellman. Dynamic Programming. Princeton University Press, Princeton, NJ, USA, 1st edition, 1957.
  • [10] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.
  • [11] T. Bluche, J. Louradour, M. Knibbe, B. Moysset, F. Benzeghiba, and C. Kermorvant. The A2iA Arabic Handwritten Text Recognition System at the OpenHaRT2013 Evaluation. In Submitted to DAS 2014, 2013.
  • [12] A. Bryson and Y. Ho. Applied optimal control: optimization, estimation, and control. Blaisdell Pub. Co., 1969.
  • [13] A. E. Bryson. A gradient method for optimizing multi-stage allocation processes. In Proc. Harvard Univ. Symposium on digital computers and their applications, 1961.
  • [14] K. Chellapilla, S. Puri, and P. Simard. High performance convolutional neural networks for document processing. In International Workshop on Frontiers in Handwriting Recognition, 2006.
  • [15] D. C. Ciresan, A. Giusti, L. M. Gambardella, and J. Schmidhuber. Deep neural networks segment neuronal membranes in electron microscopy images. In Advances in Neural Information Processing Systems NIPS, 2012.
  • [16] D. C. Ciresan, A. Giusti, L. M. Gambardella, and J. Schmidhuber. Mitosis detection in breast cancer histology images using deep neural networks. In MICCAI 2013, 2013.
  • [17] D. C. Ciresan, U. Meier, L. M. Gambardella, and J. Schmidhuber. Deep big simple neural nets for handwritten digit recognition. Neural Computation, 22(12):3207–3220, 2010.
  • [18] D. C. Ciresan, U. Meier, J. Masci, L. M. Gambardella, and J. Schmidhuber. Flexible, high performance convolutional neural networks for image classification. In Intl. Joint Conference on Artificial Intelligence IJCAI, pages 1237–1242, 2011.
  • [19] D. C. Ciresan, U. Meier, J. Masci, and J. Schmidhuber. A committee of neural networks for traffic sign classification. In International Joint Conference on Neural Networks, pages 1918–1921, 2011.
  • [20] D. C. Ciresan, U. Meier, J. Masci, and J. Schmidhuber. Multi-column deep neural network for traffic sign classification. Neural Networks, 2012.
  • [21] D. C. Ciresan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classification. Technical report, IDSIA, February 2012. arXiv:1202.2745v1 [cs.CV].
  • [22] D. C. Ciresan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classification. In IEEE Conference on Computer Vision and Pattern Recognition CVPR 2012, 2012. Long preprint arXiv:1202.2745v1 [cs.CV].
  • [23] D. C. Ciresan, U. Meier, and J. Schmidhuber. Transfer learning for Latin and Chinese characters with deep neural networks. In International Joint Conference on Neural Networks, pages 1301–1306, 2012.
  • [24] D. C. Ciresan and J. Schmidhuber. Multi-column deep neural networks for offline handwritten Chinese character classification. Technical report, IDSIA, September 2013. arXiv:1309.0261.
  • [25] A. Coates, B. Huval, T. Wang, D. J. Wu, A. Y. Ng, and B. Catanzaro. Deep learning with COTS HPC systems. In Proc. International Conference on Machine learning (ICML’13), 2013.
  • [26] S. W. Director and R. A. Rohrer. Automated network design - the frequency-domain case. IEEE Trans. Circuit Theory, CT-16:330–337, 1969.
  • [27] S. E. Dreyfus. The numerical solution of variational problems. Journal of Mathematical Analysis and Applications, 5(1):30–45, 1962.
  • [28] S. E. Dreyfus. The computational solution of optimal control problems with time lag. IEEE Transactions on Automatic Control, 18(4):383–385, 1973.
  • [29] K. Fukushima. Neural network model for a mechanism of pattern recognition unaffected by shift in position - neocognitron. In Trans. IECE, 1979.
  • [30] K. Fukushima. Neocognitron: A self-organizing neural network for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4):193–202, 1980.
  • [31] K. Fukushima. Artificial vision by multi-layered neural networks: Neocognitron and its advances. Neural Networks, 37:103–119, Jan. 2013.
  • [32] F. A. Gers, J. Schmidhuber, and F. Cummins. Learning to forget: Continual prediction with LSTM. In Proc. ICANN’99, Int. Conf. on Artificial Neural Networks, pages 850–855, Edinburgh, Scotland, 1999. IEE, London.
  • [33] F. A. Gers, J. Schmidhuber, and F. Cummins. Learning to forget: Continual prediction with LSTM. Neural Computation, 12(10):2451–2471, 2000.
  • [34] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. Technical Report arxiv.org/abs/1311.2524, UC Berkeley and ICSI, 2013.
  • [35] I. J. Goodfellow, A. C. Courville, and Y. Bengio. Large-scale feature learning with spike-and-slab sparse coding. In Proceedings of the 29th International Conference on Machine Learning, 2012.
  • [36] A. Graves, S. Fernandez, F. J. Gomez, and J. Schmidhuber. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural nets. In ICML’06: Proceedings of the International Conference on Machine Learning, 2006.
  • [37] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, and J. Schmidhuber. A novel connectionist system for improved unconstrained handwriting recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(5), 2009.
  • [38] A. Graves, A. Mohamed, and G. E. Hinton. Speech recognition with deep recurrent neural networks. In Proceedings of the ICASSP, 2013.
  • [39] A. Graves and J. Schmidhuber. Framewise phoneme classification with bidirectional lstm and other neural network architectures. Neural Networks, 18(5-6):602–610, 2005.
  • [40] A. Graves and J. Schmidhuber. Offline handwriting recognition with multidimensional recurrent neural networks. In Advances in Neural Information Processing Systems 21. MIT Press, Cambridge, MA, 2009.
  • [41] A. Griewank. Who invented the reverse mode of differentiation? Documenta Mathematica - Extra Volume ISMP, pages 389–400, 2012.
  • [42] J. Hadamard. Mémoire sur le problème d’analyse relatif à l’équilibre des plaques élastiques encastrées. Mémoires présentés par divers savants à l’Académie des sciences de l’Institut de France: Éxtrait. Imprimerie nationale, 1908.
  • [43] J. Hawkins and D. George. Hierarchical Temporal Memory - Concepts, Theory, and Terminology. Numenta Inc., 2006.
  • [44] G. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
  • [45] G. E. Hinton. Connectionist learning procedures. Artificial intelligence, 40(1):185–234, 1989.
  • [46] S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut für Informatik, Lehrstuhl Prof. Brauer, Technische Universität München, 1991. Advisor: J. Schmidhuber.
  • [47] S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In S. C. Kremer and J. F. Kolen, editors, A Field Guide to Dynamical Recurrent Neural Networks. IEEE Press, 2001.
  • [48] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
  • [49] D. H. Hubel and T. Wiesel. Receptive fields, binocular interaction, and functional architecture in the cat’s visual cortex. Journal of Physiology (London), 160:106–154, 1962.
  • [50] H. J. Kelley. Gradient theory of optimal flight paths. ARS Journal, 30(10):947–954, 1960.
  • [51] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS 2012), 2012.
  • [52] R. Kurzweil. How to Create a Mind: The Secret of Human Thought Revealed. 2012.
  • [53] Y. LeCun. Une procédure d’apprentissage pour réseau à seuil asymétrique. Proceedings of Cognitiva 85, Paris, pages 599–604, 1985.
  • [54] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Back-propagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.
  • [55] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Handwritten digit recognition with a back-propagation network. In D. S. Touretzky, editor, Advances in Neural Information Processing Systems 2, pages 396–404. Morgan Kaufmann, 1990.
  • [56] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998.
  • [57] G. W. Leibniz. Memoir using the chain rule (cited in tmme 7:2&3 p 321-332, 2010). 1676.
  • [58] G. F. A. L’Hospital. Analyse des infiniment petits, pour l’intelligence des lignes courbes. Paris: L’Imprimerie Royale, 1696.
  • [59] S. Linnainmaa. The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. Master’s thesis, Univ. Helsinki, 1970.
  • [60] S. Linnainmaa. Taylor expansion of the accumulated rounding error. BIT Numerical Mathematics, 16(2):146–160, 1976.
  • [61] J. Martens and I. Sutskever. Learning recurrent neural networks with Hessian-free optimization. In Proceedings of the 28th International Conference on Machine Learning, pages 1033–1040, 2011.
  • [62] J. Masci, D. C. Ciresan, A. Giusti, L. M. Gambardella, and J. Schmidhuber. Fast image scanning with deep max-pooling convolutional neural networks. In Proc. ICIP, 2013.
  • [63] J. Masci, A. Giusti, D. C. Ciresan, G. Fricout, and J. Schmidhuber. A fast learning algorithm for image segmentation with max-pooling convolutional networks. In Proc. ICIP, 2013.
  • [64] G. Montavon, G. Orr, and K. Müller. Neural Networks: Tricks of the Trade. Number LNCS 7700 in Lecture Notes in Computer Science Series. Springer Verlag, 2012.
  • [65] G. Orr and K. Müller. Neural Networks: Tricks of the Trade. Number LNCS 1524 in Lecture Notes in Computer Science Series. Springer Verlag, 1998.
  • [66] G. M. Ostrovskii, Y. M. Volin, and W. W. Borisov. Über die Berechnung von Ableitungen. Wiss. Z. Tech. Hochschule für Chemie, 13:382–384, 1971.
  • [67] D. B. Parker. Learning-logic. Technical Report TR-47, Center for Comp. Research in Economics and Management Sci., MIT, 1985.
  • [68] B. A. Pearlmutter. Learning state space trajectories in recurrent neural networks. Neural Computation, 1(2):263–269, 1989.
  • [69] J. B. Pollack. Implications of recursive distributed representations. In NIPS, pages 527–536, 1988.
  • [70] M. Ranzato, C. Poultney, S. Chopra, and Y. LeCun. Efficient learning of sparse representations with an energy-based model. In J. P. et al., editor, Advances in Neural Information Processing Systems (NIPS 2006). MIT Press, 2006.
  • [71] M. A. Ranzato, F. Huang, Y. Boureau, and Y. LeCun. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In Proc. Computer Vision and Pattern Recognition Conference (CVPR’07). IEEE Press, 2007.
  • [72] M. Riesenhuber and T. Poggio. Hierarchical models of object recognition in cortex. Nat. Neurosci., 2(11):1019–1025, 1999.
  • [73] A. J. Robinson and F. Fallside. The utility driven dynamic error propagation network. Technical Report CUED/F-INFENG/TR.1, Cambridge University Engineering Department, 1987.
  • [74] L. Roux, D. Racoceanu, N. Lomenie, M. Kulikova, H. Irshad, J. Klossa, F. Capron, C. Genestie, G. L. Naour, and M. N. Gurcan. Mitosis detection in breast cancer histological images - an ICPR 2012 contest. J. Pathol. Inform., 4:8, 2013.
  • [75] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing, volume 1, pages 318–362. MIT Press, 1986.
  • [76] R. E. Schapire. The strength of weak learnability. Machine Learning, 5:197–227, 1990.
  • [77] D. Scherer, A. Müller, and S. Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Proc. International Conference on Artificial Neural Networks (ICANN), 2010.
  • [78] J. Schmidhuber. A local learning algorithm for dynamic feedforward and recurrent networks. Connection Science, 1(4):403–412, 1989.
  • [79] J. Schmidhuber. Neural sequence chunkers. Technical Report FKI-148-91, Institut für Informatik, Technische Universität München, April 1991.
  • [80] J. Schmidhuber. A fixed size storage time complexity learning algorithm for fully recurrent continually running networks. Neural Computation, 4(2):243–248, 1992.
  • [81] J. Schmidhuber. Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2):234–242, 1992.
  • [82] J. Schmidhuber. Netzwerkarchitekturen, Zielfunktionen und Kettenregel. (Network Architectures, Objective Functions, and Chain Rule.) Habilitationsschrift (Habilitation Thesis), Institut für Informatik, Technische Universität München, 1993.
  • [83] J. Schmidhuber, D. Ciresan, U. Meier, J. Masci, and A. Graves. On fast deep nets for AGI vision. In Proc. Fourth Conference on Artificial General Intelligence (AGI), Google, Mountain View, CA, 2011.
  • [84] J. Schmidhuber, M. C. Mozer, and D. Prelinger. Continuous history compression. In H. Hüning, S. Neuhauser, M. Raus, and W. Ritschel, editors, Proc. of Intl. Workshop on Neural Networks, RWTH Aachen, pages 87–95. Augustinus, 1993.
  • [85] P. Sermanet and Y. LeCun. Traffic sign recognition with multi-scale convolutional networks. In Proceedings of International Joint Conference on Neural Networks (IJCNN’11), 2011.
  • [86] P. Simard, D. Steinkraus, and J. Platt. Best practices for convolutional neural networks applied to visual document analysis. In Seventh International Conference on Document Analysis and Recognition, pages 958–963, 2003.
  • [87] J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel. INI German Traffic Sign Recognition Benchmark for the IJCNN’11 Competition, 2011.
  • [88] M. Veta, M. Viergever, J. Pluim, N. Stathonikos, and P. J. van Diest. MICCAI 2013 Grand Challenge on Mitosis Detection (organisers), 2013.
  • [89] P. J. Werbos. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD thesis, Harvard University, 1974.
  • [90] P. J. Werbos. Applications of advances in nonlinear sensitivity analysis. In Proceedings of the 10th IFIP Conference, 31.8 - 4.9, NYC, pages 762–770, 1981.
  • [91] P. J. Werbos. Generalization of backpropagation with application to a recurrent gas market model. Neural Networks, 1, 1988.
  • [92] P. J. Werbos. Backwards differentiation in AD and neural nets: Past links and new opportunities. In Automatic Differentiation: Applications, Theory, and Implementations, pages 15–34. Springer, 2006.
  • [93] J. H. Wilkinson, editor. The Algebraic Eigenvalue Problem. Oxford University Press, Inc., New York, NY, USA, 1988.
  • [94] R. J. Williams. Complexity of exact gradient computation algorithms for recurrent neural networks. Technical Report Technical Report NU-CCS-89-27, Boston: Northeastern University, College of Computer Science, 1989.
  • [95] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. Technical Report arXiv:1311.2901 [cs.CV], NYU, 2013.