Unsupervised Feature Learning from Temporal Data

04/09/2015 ∙ by Ross Goroshin, et al. ∙ NYU

Current state-of-the-art classification and detection algorithms rely on supervised training. In this work we study unsupervised feature learning from unlabeled, temporally coherent video data, using the assumption that adjacent video frames contain semantically similar information. This assumption is exploited to train a convolutional pooling auto-encoder regularized by slowness and sparsity. We establish a connection between slow feature learning and metric learning and show that the trained encoder can be used to define a more temporally and semantically coherent metric.
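
To make the training objective concrete, the following is a minimal sketch of one plausible loss for a pooling auto-encoder regularized by slowness and sparsity. The abstract does not give the exact formulation, so everything here is an assumption for illustration: the fully connected (rather than convolutional) encoder, the rectified linear code, L2 pooling over non-overlapping groups, the L1 penalties, and the hypothetical names `encode`, `l2_pool`, `pair_loss`, `alpha`, and `beta`.

```python
import numpy as np

def encode(x, W_enc):
    """Rectified linear code for one flattened frame (assumed encoder form)."""
    return np.maximum(W_enc @ x, 0.0)

def l2_pool(h, group_size):
    """L2-pool non-overlapping groups of code units (assumed pooling form)."""
    groups = h.reshape(-1, group_size)
    return np.sqrt((groups ** 2).sum(axis=1))

def pair_loss(x_t, x_next, W_enc, W_dec, group_size=2, alpha=0.1, beta=0.5):
    """Loss for two adjacent frames: reconstruction + sparsity + slowness."""
    h_t, h_next = encode(x_t, W_enc), encode(x_next, W_enc)
    # Reconstruction: a linear decoder should recover each frame from its code.
    recon = ((W_dec @ h_t - x_t) ** 2).sum() + ((W_dec @ h_next - x_next) ** 2).sum()
    # Sparsity: L1 penalty on the codes of both frames.
    sparsity = np.abs(h_t).sum() + np.abs(h_next).sum()
    # Slowness: pooled features of adjacent frames should change little.
    slowness = np.abs(l2_pool(h_t, group_size) - l2_pool(h_next, group_size)).sum()
    return recon + alpha * sparsity + beta * slowness
```

Under these assumptions, the slowness term is zero whenever two frames produce identical pooled features, so it only penalizes temporal change in the pooled representation. The same pooled features could serve as an embedding in which distances between frames are measured, which is one way to read the connection to metric learning stated above.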




