Semi-supervised Tuning from Temporal Coherence

by Davide Maltoni, et al.
University of Bologna

Recent works demonstrated the usefulness of temporal coherence to regularize supervised training or to learn invariant features with deep architectures. In particular, enforcing smooth output changes while presenting temporally close frames from video sequences proved to be an effective strategy. In this paper we prove the efficacy of temporal coherence for semi-supervised incremental tuning. We show that a deep architecture, just mildly trained in a supervised manner, can progressively improve its classification accuracy if exposed to video sequences of unlabeled data. The extent to which, in some cases, semi-supervised tuning improves classification accuracy (approaching the supervised one) is somewhat surprising. A number of control experiments point out the fundamental role of temporal coherence.




1 Introduction

Unsupervised learning is very attractive due to the possibility of easily collecting huge amounts of patterns to feed data-hungry deep architectures. It also plays a fundamental role in incremental learning scenarios, where labeled data are often available during the initial training, but not while the system is operating.
The use of time coherence, that is, enforcing a smooth output change while presenting temporally close pattern representations, is the basis of unsupervised learning approaches such as Slow Feature Analysis (SFA) (Wiskott & Sejnowski (2002)) and Hierarchical Temporal Memory (HTM) (George & Hawkins (2009), Schmidhuber (1992)).
Recent works proved the usefulness of temporal coherence to regularize the supervised training of deep architectures (Mobahi et al. (2009), Weston et al. (2012)) or to learn invariant deep features (Zou et al. (2011), Zou et al. (2012), Goroshin et al. (2015)). In particular, Mobahi et al. (2009) trained a Convolutional Neural Network (CNN) with stochastic gradient descent by using a loss function extended with a temporal coherence term. Their training algorithm iteratively performs three interleaved steps aimed at:

(i) minimizing the negative log-likelihood; (ii) minimizing the network output difference for temporally consecutive video frames; (iii) maximizing the network output difference for temporally distant video frames. The proposed regularization significantly improves the classification accuracy on COIL-100 (Nene et al. (1996)) and ORL (Samaria & Harter (1994)).
Inspired by Mobahi et al. (2009), we wondered about the effect of completely removing the supervised component from the loss function and, with some surprise, we experimentally found that an HTM, just mildly trained in a supervised manner, can significantly improve its classification accuracy if subsequently exposed to video sequences of unlabeled data. This is not obvious, since in many application domains it has been shown that semi-supervised and self-training approaches can lead to dangerous drifts (i.e., mistakes that reinforce themselves).

Our scenario of interest is different from the typical semi-supervised learning scenario (Zhu (2005)), where a small set of labeled data and a larger set of unlabeled data (from the same classes) are available from the beginning. In fact, we assume that the unlabeled data are not available initially and become available (in small batches) only at successive stages, once the system has already been trained, and the new data (alone) are used for incremental tuning. This better matches a human-like learning scenario involving an initial small amount of direct instruction (e.g., parental labeling of objects during childhood) combined with large amounts of subsequent unsupervised experience (e.g., self-interaction with objects). However, it is well known that incremental learning poses extra challenges, such as catastrophic forgetting (McCloskey & Cohen (1989), French (1999)), a manifestation of the stability-plasticity dilemma (Mermillod et al. (2013)).

Our incremental tuning approach exploits temporal coherence as a surrogate supervisory signal. It can be used in conjunction with any architecture whose supervised training loss function includes the desired output vector (e.g., the squared difference loss function). In the simplest version we take the network output at time t − 1 as the desired output at time t, disregarding the pattern label. A slightly more sophisticated version (still semi-supervised) performs consistently well in our experiments and, under some conditions, its accuracy approaches the supervised one.
In order to verify the architecture independence of our approach, a number of tests have been carried out with two quite different deep architectures: HTM and CNN. Since both approaches performed equally well in the initial supervised training experiments, we expected similar performance when dealing with incremental learning and semi-supervised tuning. However, this was not the case in our tests, where our CNN implementation suffered more than HTM from the forgetting effect and the lack of supervision.
To set up our experiments we needed videos where the objects to recognize move smoothly in front of the camera. In particular, to study incremental learning we needed multiple video sequences of the same object. Instead of collecting a new dataset, we decided to generate video sequences from the well-known and largely used NORB dataset (LeCun et al. (2004)). Further experimental validation is provided on the COIL-100 dataset.

2 Related work

Semi-supervised learning (Zhu (2005), Chapelle et al. (2006)) exploits both labeled and unlabeled data to build robust models. In particular, in self-training (Rosenberg et al. (2005)), a classifier is first trained with a small amount of labeled data and then used to classify the unlabeled data. Typically the most confident unlabeled points, together with their predicted labels, are added to the training set; the classifier is then re-trained and the procedure repeated. Our approach can be framed in the semi-supervised learning family, since we use labeled data for initial training and unlabeled data (from the same classes) for subsequent tuning. However, as pointed out in the introduction, our approach is incremental and the labeled/unlabeled data are used at different stages to mimic human learning. Therefore, particular care must be taken to control catastrophic forgetting. A recent work by Goodfellow et al. (2015) investigates the extent to which the catastrophic forgetting problem occurs for modern neural networks.
In specific application domains, semi-supervised learning approaches have been proposed to self-update initial models (or templates): see for example Rattani et al. (2009) for biometric recognition and Matthews et al. (2004) for tracking. Several researchers pointed out that, although the use of unlabeled data can substantially increase system accuracy and robustness, the risk of drift is always present. For example, in the context of face recognition, Marcialis et al. (2008) reported that even when update procedures operate at high confidence, the introduction of impostors cannot be avoided. Analogously to many domain-specific solutions, our approach is incremental and can exploit classification confidence. Temporal coherence has already been exploited for face recognition from video (Franco et al. (2010)), but the proposed update solution is domain specific and not as easily generalizable as the one introduced here.
The works most closely related to this study are those by Mobahi et al. (2009) and Weston et al. (2012), where temporal coherence is embedded in the semi-supervised training of deep architectures. However, in those works unlabeled data are used together with labeled ones to regularize the supervised training, while we first train a system with labeled data and later tune it with unlabeled data.
The biological plausibility of the computational learning approach proposed here is discussed in Li & DiCarlo (2008), whose authors introduce the term UTL (Unsupervised Temporal slowness Learning) to describe the hypothesis under which invariance is learned from the temporal contiguity of object features during natural visual experience, without external supervision.

3 SST: Semi-Supervised Tuning

Let V = {v(1), v(2), …, v(n)} be a temporally coherent sequence of video frames taken from the same object (of class k): while the total object variation (in terms of pose, lighting, distortion, etc.) in the whole sequence can be very high, only a limited amount of variation is expected to characterize pairs of successive frames v(t − 1) and v(t).
Let N be a classifier able to map an input pattern v (i.e., a single video frame) into an output vector o(v) = [o_1(v), …, o_c(v)] denoting the posterior class probabilities p(class_i | v), i = 1, …, c. While in this work N will be instantiated with a deep architecture trained with gradient descent, in general N can be any trainable classifier returning class probabilities and whose optimization procedure minimizes a cost (or loss) including the desired output d(v) for the input v.
If the squared error is taken as loss function, for each pattern v (of class k) the optimization procedure attempts to minimize:

    E(v) = || d(v) − o(v) ||²    (1)
Assuming that N has already been trained (with supervision) by using a first batch of data, each subsequent training can be considered as a tuning (i.e., we start from already-learned parameters). Given a sequence V, we denote by o(t) the output for frame v(t) and define four ways to instantiate the desired vector d(t) during system tuning:


  • Supervised Tuning (SupT): this is the classical supervised approach, where the desired output vector has the delta form d(t) = t(k) = [0, …, 0, 1, 0, …, 0] (all terms are zero except the one corresponding to the pattern class k).

  • Supervised Tuning with Regularization (SupTR):

        d(t) = λ · t(k) + (1 − λ) · o(t − 1)

    where (1 − λ) controls the influence of the temporal coherence regularizing term. This is close to the approach proposed by Mobahi et al. (2009), but we embed the regularizing term into the desired output and then perform a single optimization step, while Mobahi et al. (2009) make disjoint optimization steps.

  • Semi-Supervised Tuning - Basic (SST-B):

        d(t) = o(t − 1)

    This simply takes as desired output at time t the output vector at time t − 1. The class label is not used, but since we assume that the input pattern belongs to one of the known classes, the update is semi-supervised.

  • Semi-Supervised Tuning - Advanced (SST-A):

        o_fused(t) = ( o(t) + o_fused(t − 1) ) / 2
        d(t) = o_fused(t)    if max_i o_fused,i(t) > th

    At each step, we fuse the posterior probabilities o(t) with the posterior probabilities accumulated before; this is a sort of sum-rule fusion where the weight of far (in time) patterns progressively vanishes. Then, if at least one of the fused class posteriors (in o_fused(t)) is higher than a given threshold th, denoting high self-confidence, the desired output is set to o_fused(t) to enforce temporal coherence. Otherwise (high uncertainty cases) no semi-supervised update has to be done; formally, this can be achieved by passing d(t) = o(t) back to equation (1), which yields a null error. Here too, the class label is not used.
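Under the squared-error loss of equation (1), the four strategies differ only in how the desired vector is built from the label and/or the past outputs. A minimal NumPy sketch (function names, the λ value and the fusion notation are illustrative assumptions; the 0.65 threshold is the value reported later in the experiments):

```python
import numpy as np

def delta(k, c):
    """One-hot target t(k) over c classes."""
    d = np.zeros(c)
    d[k] = 1.0
    return d

def supt(k, c):
    """SupT: classical supervised target."""
    return delta(k, c)

def suptr(k, o_prev, lam=0.5):
    """SupTR: supervised target mixed with the previous output
    (lam = weight of the supervised component; illustrative value)."""
    return lam * delta(k, len(o_prev)) + (1.0 - lam) * o_prev

def sst_b(o_prev):
    """SST-B: previous output as target; the class label is not used."""
    return o_prev

def sst_a(o_t, o_fused_prev, th=0.65):
    """SST-A: sum-rule fusion of the current output with the outputs
    accumulated so far (older contributions vanish progressively).
    Returns (target, new fused state); when self-confidence is low, the
    current output is returned as target, i.e. a zero-gradient update."""
    o_fused = 0.5 * (o_t + o_fused_prev)
    if o_fused.max() > th:        # high self-confidence
        return o_fused, o_fused
    return o_t, o_fused           # d(t) = o(t): update effectively skipped
```

Note that with a squared-error loss, returning the current output as the target makes the error (and hence the gradient) vanish, which implements the "no update" branch of SST-A without special-casing the training loop.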

4 Dataset and Architectures

4.1 Norb revisited

Instead of collecting just another dataset, we focused on the well-known and largely used NORB dataset (LeCun et al. (2004)). This is still one of the best datasets to study invariant object recognition and it well fits our purposes, because it contains 50 objects and 972 variations for each object. The 50 objects belong to 5 classes (10 objects per class) and the 972 variations are produced by systematically varying the camera elevation (9 steps), the object azimuth with respect to the camera (18 steps) and the lighting condition (6 steps).

Temporally coherent video sequences can be generated from NORB by randomly walking the 3D (elevation, azimuth, lighting) variation space, where consecutive frames are characterized by a single step along one dimension. In our generation approach the random walk is controlled by parameters such as the number of frames, the probability of taking a step along each of the 3 dimensions, the probability of inverting the direction of movement (flip back), etc. Fig. 1.a shows an example of a training sequence. When generating test sequences we must avoid including frames already used in the training sequences. In particular, when generating test sequences (with a given mindist), we ensure that each test frame has a city-block distance of at least mindist steps (mindist ≥ 1) from any of the training set frames. Fig. 1.b shows a test sequence with mindist = 4 with respect to the training sequence of Fig. 1.a.


Figure 1: a) An example of training sequence (20 frames). b) A test sequence with mindist = 4 from the previous training sequence.
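The random-walk generation described above can be sketched as follows (parameter names, the step probabilities and the border handling are illustrative assumptions, not the authors' actual tool; for brevity all dimensions bounce at the borders, whereas the azimuth could also be treated as circular):

```python
import random

# NORB variation space: 9 elevations x 18 azimuths x 6 lighting conditions.
DIMS = (9, 18, 6)

def random_walk(n_frames, p_dim=(1/3, 1/3, 1/3), p_flip=0.1, seed=0):
    """Generate a temporally coherent sequence of (elev, azim, light)
    coordinates: each consecutive pair differs by a single step along
    one dimension."""
    rng = random.Random(seed)
    pos = [rng.randrange(s) for s in DIMS]
    direction = [1, 1, 1]
    frames = [tuple(pos)]
    while len(frames) < n_frames:
        d = rng.choices(range(3), weights=p_dim)[0]  # pick a dimension
        if rng.random() < p_flip:                    # "flip back"
            direction[d] *= -1
        nxt = pos[d] + direction[d]
        if not 0 <= nxt < DIMS[d]:                   # bounce at the border
            direction[d] *= -1
            nxt = pos[d] + direction[d]
        pos[d] = nxt
        frames.append(tuple(pos))
    return frames
```

Each generated coordinate triple indexes one of the 972 NORB variations of an object, so mapping a walk to actual frames only requires a lookup into the dataset.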

In the standard NORB benchmark, for each of the 5 classes, 5 objects are included in the training set and 5 objects in the test set. In the proposed benchmark we prefer to focus on pose and lighting invariance; hence our training and test sets are differentiated not by object identity but by object pose and lighting (by an amount modulated by mindist). However, for completeness, in Appendix B we also report results on an equivalent benchmark where the native object segregation is maintained. In our benchmark we also focus on the monocular representation, since the availability of stereo information makes the problem unnecessarily simpler for the task at hand. The benchmark dataset used in our experimentation consists of:


  • 10 training batches B1, …, B10. Each batch is 1,000 frames wide and is composed of 50 temporally coherent sequences (20 frames each), one for each of the 50 objects. B1 is used for initial training and B2, …, B10 for successive incremental tuning. When training the system on a batch Bi we no longer have access to the previous batches. We do not enforce any mindist among training set sequences, so the same frame can be present in different batches.

  • 10 test batches for each mindist = 1, 2, 3 and 4. Test batches are structured as the training batches, but here mindist is enforced, so each frame included in the test batches has a distance of at least mindist from the 10,000 frames of the training batches (actually, due to the presence of duplicates in our training random walks, the number of different training frames is 8,531). Higher mindist values make the classification problem more difficult, because patterns are less similar to the training set ones. The temporally coherent organization of the test batches allows two types of evaluation to be performed:


    • Frame based classification: here the temporal organization is not considered and each frame has to be classified independently of its sequence/position in the batch. For simplicity, for each mindist we can treat the 10 batches as a single plain test set of 10,000 patterns.

    • Sequence based classification: this evaluation (not included in the experiments carried out in this paper) is aimed at classifying sequences rather than single frames, so one can exploit multiple frames per object and their temporal coherence. Of course this is a simpler classification problem, due to the possibility of fusing information. As a side effect, the number of patterns to classify reduces to 500 (10 batches × 50 sequences).
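The mindist constraint on the test batches can be sketched as a simple city-block filter over the (elevation, azimuth, lighting) coordinates (function names are illustrative):

```python
def city_block(a, b):
    """City-block (L1) distance between two variation-space coordinates."""
    return sum(abs(x - y) for x, y in zip(a, b))

def respects_mindist(frame, train_frames, mindist):
    """A candidate test frame is accepted only if it is at least mindist
    steps away from every frame used in the training batches."""
    return all(city_block(frame, t) >= mindist for t in train_frames)
```

A test-sequence generator would simply discard (or re-sample) any random-walk step whose coordinates fail this check against the set of training frames.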

With the purpose of evaluating our approach on a harder problem, we can consider another benchmark (denoted as the “50-class benchmark”) where each object is considered as a separate class. It is worth noting that this is a quite complex problem, due to the sometimes small variability among objects originally belonging to the same class. To set up this benchmark we can still use the above sequences, with the only caution of ignoring the original class labels and taking object labels as class labels.
Original NORB images are 96×96 pixels. We noted that working on reduced-resolution images (down to 32×32) does not reduce classification accuracy (on the 5-class problem). So, in order to speed up the experiments, we downsampled the NORB images to 32×32 pixels (the same downsampling was done in other works: Saxe et al. (2011), Ngiam et al. (2010), Wagner et al. (2013)).
The full training and test sequences used in this study (provided as sequences of filenames referring to the original NORB images) can be downloaded from []. In the same repository we make available the tool (and the code) used to generate the sequences.

4.2 HTM

Hierarchical temporal memory (HTM) (George & Hawkins (2009)) is a biologically inspired framework that can be framed into multistage Hubel-Wiesel architectures (Ranzato et al. (2007)), a specific family of deep architectures. A brief overview of HTM is provided in Appendix C. A more comprehensive introduction can be found in Maltoni (2011) and Rehn & Maltoni (2014).

Analogously to CNNs, the HTM hierarchical structure is composed of alternating feature extraction and feature pooling layers. However, in HTM feature pooling is more complex than the typical sum or max pooling used in CNNs, and time is used from the very first training steps, when the HTM self-develops its internal memories, to form groups of feature detectors responding to temporally close inputs. This is conceptually similar to the unsupervised feature learning proposed in Zou et al. (2011), Zou et al. (2012), Goroshin et al. (2015).
In the classical HTM approach (Maltoni (2011)), once a network is trained its structure is frozen, thus making further training (i.e., incremental learning) quite critical. Rehn & Maltoni (2014) introduced a technique (called HSR) for HTM (incremental) supervised training based on gradient descent error minimization, where error backpropagation is implemented through native HTM message passing based on belief propagation. In the present work, HSR will be used for semi-supervised tuning.

The HTM architecture adopted here includes some modifications with respect to Maltoni (2011) and Rehn & Maltoni (2014). We experimentally verified that these modifications yield some accuracy improvement when working with natural images and, at the same time, reduce the network complexity. Presenting these architectural changes in detail is beyond the scope of this paper, but some hints are given in the following:


  • Dilobe ordinal filters: the feature extraction at the first level is not based on a variable number of self-learned templates as described in Maltoni (2011), but is carried out with a bank of 50 dilobe ordinal filters (Sun & Tan (2009)). Each filter, of size 8×8, is the sum of two 2D Gaussians (one positive and one negative) whose center, size and orientation are randomly generated (see Fig. 2). Each filter computes a simple intensity relationship between two adjacent regions (i.e., the two filter lobes), which is quite robust with respect to light changes and (in our experience) discriminant enough for low-level feature extraction. Although one could set up an unsupervised approach to learn optimal filters from natural images, for simplicity in this work we generated them randomly and kept them fixed.

  • Partial receptive fields: in the classical HTM implementation the receptive field of a node is the union of the child node receptive fields, and it is not possible for a node to focus only on a specific portion of its receptive field. In general, this does not allow isolating objects from the background or dealing with partial occlusions. A simple but effective technique has been implemented to deal with partial receptive fields.

Figure 2: A graphical representation of the 50 “random” dilobe filters at level 1.
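A dilobe filter of the kind shown in Fig. 2 can be sketched as follows (a hypothetical generator: the lobes are axis-aligned here for brevity, whereas the paper also randomizes their orientation):

```python
import numpy as np

def dilobe_filter(size=8, rng=None):
    """Sum of a positive and a negative 2D Gaussian ("dilobe") with
    randomly generated centers and sizes, on a size x size grid."""
    rng = rng if rng is not None else np.random.default_rng(0)
    y, x = np.mgrid[0:size, 0:size].astype(float)
    def lobe(sign):
        cx, cy = rng.uniform(0, size - 1, 2)       # random lobe center
        sx, sy = rng.uniform(1.0, size / 2, 2)     # random lobe size
        return sign * np.exp(-(((x - cx) / sx) ** 2 + ((y - cy) / sy) ** 2))
    return lobe(+1.0) + lobe(-1.0)
```

The filter response to an image patch (the inner product of patch and filter) then encodes the intensity relationship between the two adjacent lobe regions.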

The HTM architecture used in our experiments has 5 levels:


  • Input: 1024 nodes (32×32) connected to image pixels.

  • Intermediate 1: 169 (13×13) nodes; each node has 8×8 child nodes.

  • Intermediate 2: 169 (13×13) nodes; each node has 1 child node.

  • Intermediate 3: 9 (3×3) nodes; each node has 5×5 child nodes.

  • Output: 1 node with 3×3 child nodes.

Since intermediate levels 2 and 3 include both feature extraction and feature pooling, and the input level in this case does not perform any particular processing, the above 5 levels correspond to 6 levels in a CNN (incidentally, the same as LeNet7). Note that an HTM node is much more complex than a single artificial neuron in a conventional NN, since it was conceived to emulate a cortical micro-circuit (George & Hawkins (2009)). HTM accuracy on baseline NORB is reported as additional material in Appendix A.

4.3 CNN: LeNet7 on Theano

The CNN architecture used in our experiments is a minor modification of “LeNet7”, which was specifically designed by LeCun et al. (2004) to classify NORB images. This is still one of the best performing architectures on NORB benchmarks. We empirically verified that working on 32×32 images does not reduce accuracy with respect to the 96×96 original images. So our main modification concerns the reduced feature map size and filter size to deal with 32×32 monocular inputs (see Fig. 3).

Figure 3: The CNN used in this work (original LeNet7 adapted to 32×32 inputs). X@Y×Y stands for X feature maps of size Y×Y; (Z×Z) denotes filters of size Z×Z.

LeCun et al. (2004) suggested training LeNet7 with the squared error loss function, which naturally fits our semi-supervised tuning formulation. In our experiments on the standard NORB benchmark we evaluated some modifications to the architecture and the training procedure: max pooling instead of the original sum pooling; soft-max + log-likelihood instead of squared error; dropout. However, none of these changes (nowadays commonly used to train CNNs on large datasets) led to consistently better accuracy, so we came back to the original version, which was easily implemented in Theano (Bergstra et al., 2010). Since we are not using any output normalization (any attempt to introduce a normalization, e.g. soft-max, resulted in some accuracy loss), the network output vector is not exactly a probability vector; however, looking at the output values, we noted that after a few training iterations they approximate a probability vector quite well: all elements lie in [0, 1] and sum to approximately 1. To be sure of the soundness of our CNN implementation and training procedure, we tried to reproduce the results reported in LeCun et al. (2004) for a LeNet7 trained on the full “normalized-uniform” dataset of 24,300 patterns (4,860 for each of the 5 classes). Since results in LeCun et al. (2004) are reported only for the binocular case, for this control experiment we also used binocular patterns (even if in the 32×32 format). After some tuning of the training procedure, we achieved a classification error of 5.6%, which is slightly better than the 6.6% reported in LeCun et al. (2004), aligned with the 5.6% of Ranzato et al. (2007) and not far from the state-of-the-art 3.9% reported in Ngiam et al. (2010). More details on the CNN accuracy on baseline NORB (including a comparison with HTM) can be found in Appendix A.

5 Experiments

5.1 Incremental tuning

In this section we focus on incremental learning and evaluate the four tuning strategies introduced in Section 3 on the benchmark proposed in Section 4.1. In all the experiments:


  • We used 32×32 monocular patterns (left eye only).

  • We report classification accuracy as frame based classification accuracy (the sequence based classification scenario will be addressed in future studies). See section 4.1 for the definition of the two scenarios.

  • For semi-supervised tuning, each training batch of 1,000 frames is treated as a single frame flow, without exploiting the regular sequence order and size within the batch to isolate the 50 temporally coherent sequences. In fact, even if in natural vision abrupt gaze shifts could be detected to segment sequences, we prefer to avoid simplifying assumptions on this.

  • To limit the bias induced by the batch presentation order, we averaged experiments over 10 runs, and at each run we randomly shuffled the batches (the first batch is always used for the initial supervised training). By measuring the standard deviation across the 10 runs, we can also study the stability of the learning process.

  • To avoid overfitting, we did not perform a fine adjustment of the parameters characterizing the (parametric) update strategies. We set them according to some exploratory tests and then kept the same values for all the experiments:


    • For SupTR, the weight λ of the supervised component is set to a fixed value.

    • For SST-A, the self-confidence threshold th is set to 0.65.
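The batch-presentation protocol can be sketched as follows (the function name is illustrative; indices 1, …, 10 stand for the 10 training batches): the first batch is always used for the initial supervised training, while the order of the remaining tuning batches is shuffled independently at each of the 10 runs.

```python
import random

def batch_orders(n_batches=10, n_runs=10, seed=0):
    """Presentation order of the training batches for each run: batch 1
    is always first (initial supervised training); batches 2..n_batches
    are shuffled per run (incremental tuning)."""
    rng = random.Random(seed)
    orders = []
    for _ in range(n_runs):
        tuning = list(range(2, n_batches + 1))
        rng.shuffle(tuning)
        orders.append([1] + tuning)
    return orders
```

Averaging accuracy (and its standard deviation) over these runs is what the confidence intervals in the following figures are computed from.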

When performing incremental learning, care must be taken to avoid catastrophic forgetting. In fact, since patterns belonging to previous batches are no longer available, training the system with new patterns could lead it to forget old ones. Even if in our tuning scenario the new patterns come from the same objects (pose and lighting variations) and there is some overlap in the training sequences (since we are not enforcing any mindist between training sequences, the random walk can lead to the inclusion of the same frame in different sequences/batches; in our opinion, this better emulates unsupervised human experience with objects, where the same object view can be refreshed over time), catastrophic forgetting is still an issue.
For HTM we experimentally found that a good tradeoff between stability and plasticity can be achieved by running only 4 HSR iterations for each batch of 1,000 patterns, while for CNN we found that the optimal number of iterations is much higher (about 100).

Figure 4: HTM accuracy on the test set for mindist = 1 (a) and mindist = 4 (b). The x-coordinates denote the test set accuracy after incremental tuning with the successive batches. Position 1 exhibits the same accuracy for all the update strategies because it denotes the accuracy after supervised training on the first batch. The bars denote 95% mean confidence intervals (over 10 runs).

Fig. 4 shows HTM accuracy at the end of pre-training on the first batch (point 1) and after each incremental tuning step (subsequent points); no additional data (e.g., jittered patterns) is used for HTM pre-training. We note that:


  • Supervised tuning SupT works well and each new batch of data contributes to increasing the overall accuracy.

  • Regularized supervised tuning SupTR performs slightly better than SupT and, more importantly, makes the learning process more stable; this can be appreciated by the smoother trend in the graphs and by the average standard deviation over the 10 runs, which (for mindist = 1) is 0.7% for SupT and 0.4% for SupTR. This is in line with the results of Mobahi et al. (2009), where a relevant accuracy improvement was reported on COIL-100 when regularizing supervised learning with temporal coherence. Here the gap between SupT and SupTR is smaller than in Mobahi et al. (2009), probably because our tuning batches are quite small (1,000 patterns) and regularization plays a minor role.

  • SST-B and SST-A accuracy is surprisingly good when compared with the supervised accuracy, proving that temporal continuity is a very effective surrogate of supervision for HTM. The initial trends of SST-B and SST-A are similar; then SST-B tends to stabilize, while SST-A accuracy continues to increase, approaching the supervised update SupT. The self-confidence computation that SST-A uses to decide whether or not to update the gradient seems to be a valid instrument to skip cases where temporal continuity is not effective (e.g., change of sequence, very ambiguous patterns, etc.).

Fig. 5 shows the results of the same experiment performed with CNN. Here we observe that:


  • The accuracy at the end of the initial supervised training (on the first batch) is similar to HTM's.

  • SupT and SupTR lead to a remarkable accuracy improvement during incremental tuning with the successive batches, even if the accuracy is about 2% lower than HTM's and for mindist = 4 the learning process appears to be less stable.

  • Unexpectedly, the semi-supervised tuning strategies SST-B and SST-A did not work with our CNN implementation. We tried some modifications (architecture, learning procedure) but without success. The only way we found to increase accuracy in the semi-supervised scenario is with the variant of SST-A (denoted SST-A-δ) introduced and discussed in Section 5.3. However, also for SST-A-δ the accuracy gain is quite limited when compared with semi-supervised tuning on HTM.

A similar trend can be observed in the experimental results reported as additional material (Appendix B), where the native object segregation is maintained.

Figure 5: CNN accuracy on the test set for mindist = 1 (a) and mindist = 4 (b).

5.2 Making the problem harder

The good performance of HTM in semi-supervised tuning reported in the previous section could be attributed to the initially high chance of self-discovering the pattern class. In fact, if the initial classification accuracy is high enough, the missing class label can be replaced by a good guess. To study SST effectiveness on harder problems, where the initial classification accuracy is lower, we set up two experiments:


  • the former consists in deliberately (and progressively) deteriorating the initial classification accuracy by providing a certain amount of wrong labels during the initial supervised training.

  • the latter uses the same training and test batches but turns the problem into a 50-class classification. As discussed in Section 4.1, this is much more difficult (especially for 32×32 patterns) because different NORB objects (e.g., two cars) are visually very difficult to distinguish at certain angles (even for humans).

Fig. 6 shows results of these experiments. We note that:


  • As the initial classification accuracy degrades, SST-A accuracy degrades gently and the gap between initial and final accuracy remains high. Even a limited initial accuracy of about 35% does not prevent SST-A from benefiting from semi-supervised tuning.

  • Of course, here the gap between SST-A and the supervised tuning strategies SupT/SupTR (not reported in the graph) is higher, because supervised tuning is able to overcome the introduced initial degradation from the second batch onward, always leading to a final accuracy close to that of Fig. 4.a.

  • The 50-class experiment can be considered an extreme case, because the initial classification accuracy is about 25% and even the supervised tuning approaches (SupT and SupTR) are not able to increase the final performance over 44%. In this scenario, SST-B, after an initial stable phase (first batches), starts drifting away (later batches). On the contrary, SST-A shows a stable (even if limited) accuracy gain, proving able to operate also in high-uncertainty conditions.

Figure 6: a) HTM + SST-A accuracy on the test set (mindist = 1) for different amounts of wrong labels provided during the initial supervised training. b) HTM accuracy on the test set (mindist = 1) for different update strategies on the 50-class problem.

5.3 Control experiments

In this section we introduce further experiments aimed at better understanding the factors contributing to the success of semi-supervised tuning. In particular, we modified SST-A as follows:

  • SST-A-δ:


    This is very similar to SST-A: the self-confidence is computed in the same way, by exploiting temporal coherence, but here, when the self-confidence is higher than the threshold, instead of enforcing the temporally coherent output we pass back the delta distribution corresponding to the self-guessed class.

  • SST-A-δ-noTC:


    Here no temporal coherence is used, neither for estimating the self-confidence nor for enforcing output continuity. This corresponds to the basic self-training approach used in several applications.
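The operational difference between SST-A and the two variants can be sketched as follows (an illustrative simplification: the averaging over recent outputs and the max-based self-confidence are stand-ins for the paper's exact formulation):

```python
import numpy as np

def sst_target(recent_outputs, variant, sc_threshold=0.65):
    """Choose the desired output vector for one unlabeled pattern.

    recent_outputs: class-posterior vectors over the last few temporally
    adjacent frames (most recent last). Variant names match the text;
    the confidence estimate below is a simplified stand-in.
    """
    current = recent_outputs[-1]
    if variant == "noTC":
        coherent = current                          # no look-back in time
    else:
        coherent = np.mean(recent_outputs, axis=0)  # temporally smoothed
    confidence = coherent.max()                     # self-confidence estimate
    if confidence < sc_threshold:
        return None                                 # too uncertain: skip update
    if variant == "SST-A":
        return coherent                             # enforce coherent output
    # SST-A-delta (and noTC): sharp one-hot target on the self-guessed class
    delta = np.zeros_like(coherent)
    delta[np.argmax(coherent)] = 1.0
    return delta
```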

Fig. 7.a compares HTM accuracy on SST-A and the two above variants. The small gap between SST-A and SST-A-δ (in favor of SST-A) can be attributed to the regularizing effect of passing back a temporally coherent output vector instead of a sharp delta vector. A totally unsatisfactory behavior can be observed for the second variant (SST-A-δ-noTC), where the network cannot look back in time and can only exploit the current pattern: the flat accuracy in the graph testifies that in this case self-training does not allow HTM to improve. This is a classical pitfall of basic self-training approaches: the patterns whose label can be correctly guessed do not add much to the current representation, while really useful patterns (in terms of diversity) are not added because of their low self-confidence.

Fig. 7.b shows CNN accuracy for the same experiments. While SST-A-δ-noTC remains ineffective here too, in this case SST-A-δ is much better than SST-A, even if far from the semi-supervised accuracy achieved by HTM. But why does our CNN implementation not tolerate a desired output vector made of (combinations of) past output vectors, preferring the more radical delta vector computed by self-estimation of the pattern class? By comparing the output vectors produced by HTM and CNN when making inference on new patterns, we noted that HTM posterior probabilities are quite peaked around one class (similarly to a delta form), while for CNN they are more softly spread among the classes. Numerically this can be made explicit by computing the average entropy over the network outputs for 1,000 previously unseen patterns: for CNN we measured an entropy of 1.44 bits, while for HTM the entropy is 0.50 bits, much closer to the zero entropy of a delta vector. Therefore, it seems that HTM output vectors are already in the right form for the loss function, while CNN output vectors need to be sharpened to make learning more effective.
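The entropy measure used above is straightforward to reproduce (a NumPy sketch, not the authors' code; the function name is ours):

```python
import numpy as np

def mean_output_entropy(outputs, eps=1e-12):
    """Average Shannon entropy (in bits) of a batch of class-posterior vectors."""
    p = np.asarray(outputs, dtype=float)
    p = p / p.sum(axis=1, keepdims=True)      # ensure each row sums to 1
    h = -(p * np.log2(p + eps)).sum(axis=1)   # entropy of each output vector
    return h.mean()
```

A peaked (delta-like) output yields an entropy close to 0, while a uniform spread over k classes yields log2(k) bits, matching the HTM-vs-CNN comparison in the text.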

Figure 7: a) HTM accuracy (5-class problem, maxdist = 1) on SST-A and its two variants. b) CNN accuracy (5-class problem, maxdist = 1) on SST-A and its two variants.

5.4 Further experimental validation on COIL-100

COIL-100 (Nene et al. (1996)) contains a larger number of classes than NORB (100 vs 5), but the available variations for each class are much more limited (72 images per class in COIL-100 vs 9,720 in NORB). The 72 poses of each class are spanned by a single mode of variation (i.e., camera azimuth), uniformly sampled with 5° steps. The single mode of variation and the limited number of poses make the generation of (disjoint) temporally coherent sequences for incremental learning challenging. However, we tried to set up a test-bed close to the NORB one (see Section 4.1):

  • 6 poses per class (one pose every 60°) are included in the test set; for each test pose the two adjacent ones (5° before and after) are excluded from the training batches to enforce a mindist = 2.

  • Temporally coherent sequences are obtained for each class by randomly walking the remaining 54 = 72 − 6 − 12 frames. Training batches (1,000 patterns each) are then generated and used for the initial supervised training (first batch) and the successive incremental tuning (remaining batches). It is worth noting that, with respect to the NORB experiments, in this case the forgetting effect induced by incremental tuning is mitigated by a higher overlap among the tuning batches, due to the small number of frames.

  • To reuse the same HTM and CNN architectures created for NORB, COIL-100 images are subsampled from 128×128 to 32×32 and converted from RGB to grayscale.
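The pose split above can be checked with a short sketch (the function name is ours; it reproduces the 54 = 72 − 6 − 12 training frames per class):

```python
def coil100_split(step_deg=5, test_every_deg=60, mindist=2):
    """Split the 72 COIL-100 azimuth poses of one object into test poses,
    excluded neighbors (enforcing mindist), and training poses."""
    poses = list(range(0, 360, step_deg))                   # 72 poses, 5° apart
    test = [a for a in poses if a % test_every_deg == 0]    # 6 test poses
    excluded = set()
    for a in test:
        for k in range(1, mindist):                         # poses closer than mindist
            excluded.add((a - k * step_deg) % 360)
            excluded.add((a + k * step_deg) % 360)
    train = [a for a in poses if a not in test and a not in excluded]
    return test, sorted(excluded), train
```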

Fig. 8 shows HTM and CNN accuracy for different incremental tuning strategies. We observe that:

  • The trend for the supervised strategies is similar to NORB: both HTM and CNN constantly improve the initial accuracy as new batches are presented, with CNN slightly outperforming HTM. For HTM, regularization does not seem to provide any advantage, probably due to the shorter sequence length (10 frames here instead of 20 in NORB) and the presence of gaps in the sequences (patterns excluded because of their inclusion in the test set).

  • Here too, semi-supervised strategies perform better for HTM than for CNN. It is worth noting that in this case the base strategy SST-B outperforms SST-A, indicating that the self-confidence threshold sc (kept fixed at 0.65) is probably too conservative for this dataset.

Figure 8: HTM and CNN incremental tuning accuracy on COIL-100.

6 Discussion and Conclusions

In this paper we studied semi-supervised tuning based on temporal coherence. The proposed tuning approaches have been evaluated on two deep architectures (HTM and CNN) obtaining partially discordant results.
As to HTM, our experiments proved that in some conditions even a trivial approach enforcing slow output change (SST-B) can significantly improve classification accuracy. A slightly more complex approach (SST-A), which exploits temporal coherence twice (to enforce slow output change and to compute a self-confidence value triggering the semi-supervised update), proved to be very effective, sometimes approaching supervised tuning accuracy.
Our CNN implementation worked well with supervised tuning strategies, but (unexpectedly) demonstrated a lower capacity to deal with incremental semi-supervised tuning. Of course, the encountered limitations could be due to the specific CNN architecture and training, and the outcomes of other recent studies (Goodfellow et al. (2015)) can be very useful for checking alternative setups (e.g., better investigating the effect of dropout). We recognize that the empirical evaluations carried out in this study are still limited; to validate and generalize our semi-supervised tuning results we need to test the proposed approaches on other, larger datasets, including natural videos of real objects smoothly moving in front of the camera. We plan to collect a new video dataset in the near future and (of course) to work with patterns larger than 32×32 pixels.
However, based on the results obtained so far a question emerges: what made HTM more effective than CNN for incremental learning and semi-supervised tuning from temporal coherence? At this stage we do not have an answer to this question, and we can only formulate some hypotheses, by pointing out architectural/training differences that could have a direct impact on forgetting and capability to work with unlabeled data:

  • Pre-training: McRae & Hetherington (1993) argued that network pre-training can mitigate catastrophic forgetting effects. During initial training HTM self-develops internal memories from patterns of the domain instead of starting from randomly initialized weights. This could make it more stable and resistant to pattern forgetting and to the lack of labels. Of course a CNN can be pre-trained as well (see Wagner et al. (2013) for a comparative evaluation of different pre-training approaches), and this is one of the directions we intend to follow in our future studies.

  • Type of parameters tuned: CNN training is mostly directed to the feature extraction layers (i.e., filter parameters), while the main targets of HTM + HSR are the parameters of the feature pooling layers. Rehn & Maltoni (2014) argued that the most important contribution of HSR is tuning the probabilities denoting how much each coincidence (i.e., a feature extractor) belongs to each group (i.e., a set of feature extractors). Our HTM incremental tuning by HSR does not alter feature extractors, but attempts to optimally arrange the existing ones into groups to maximize invariance. Referring to the stability-plasticity dilemma, we speculate that keeping feature extractors stable (especially at low levels) promotes stability, while moving pooling parameters is enough to get the required plasticity.

In conclusion, we believe that incremental (semi-supervised and unsupervised) tuning, still scarcely studied with deep learning architectures, is a powerful approach to mimic biological learning, where continuous (lifelong) learning is a key factor. The lack of supervision, here surrogated by temporal coherence only, can be complemented by other contextual information coming from different modalities (multi-view learning) or from different processing paths (e.g., co-training). Of course, when supervisor signals are available, supervised and unsupervised tuning can be fused into a hybrid scheme (as demonstrated here for SupTR). The availability of powerful computing platforms makes the development of continuous learning systems feasible for a number of practical applications. For example, in our non-optimized HTM implementation, 4 HSR iterations on 1,000 patterns take about 35 seconds (on a CPU Xeon W3550, 4 cores); we are confident that, upon proper optimization, SST can run on-line once a pre-trained system is switched to working mode.


  • Bergstra et al. (2010) Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., and Bengio, Y. Theano: a cpu and gpu math expression compiler. In Proceedings of the Python for scientific computing conference (SciPy), volume 4, pp.  3. Austin, TX, 2010.
  • Chapelle et al. (2006) Chapelle, O., Schölkopf, B., and Zien, A. (eds.). Semi-Supervised Learning. MIT Press, 2006.
  • Franco et al. (2010) Franco, A., Maio, D., and Maltoni, D. Incremental template updating for face recognition in home environments. Pattern Recognition, 43(8):2891–2903, 2010.
  • French (1999) French, R. M. Catastrophic forgetting in connectionist networks. Trends in cognitive sciences, 3(4):128–135, 1999.
  • George & Hawkins (2009) George, D. and Hawkins, J. Towards a mathematical theory of cortical micro-circuits. PLoS Comput Biol, 5(10):e1000532, 2009.
  • Goodfellow et al. (2015) Goodfellow, I. J., Mirza, M., Xiao, D., Courville, A., and Bengio, Y. An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211v3, 2015.
  • Goroshin et al. (2015) Goroshin, R., Bruna, J., Tompson, J., Eigen, D., and LeCun, Y. Unsupervised feature learning from temporal data. arXiv preprint arXiv:1504.02518, 2015.
  • LeCun et al. (2004) LeCun, Y., Huang, F. J., and Bottou, L. Learning methods for generic object recognition with invariance to pose and lighting. In In Proceedings of CVPR’04. IEEE Press, 2004.
  • Li & DiCarlo (2008) Li, N. and DiCarlo, J. J. Unsupervised natural experience rapidly alters invariant object representation in visual cortex. science, 321(5895):1502–1507, 2008.
  • Maltoni (2011) Maltoni, D. Pattern recognition by hierarchical temporal memory. Technical report, DEIS - University of Bologna, April 2011.
  • Marcialis et al. (2008) Marcialis, G. L., Rattani, A., and Roli, F. Biometric template update: an experimental investigation on the relationship between update errors and performance degradation in face verification. In Structural, Syntactic, and Statistical Pattern Recognition, pp. 684–693. Springer, 2008.
  • Matthews et al. (2004) Matthews, I., Ishikawa, T., and Baker, S. The template update problem. IEEE Transactions on Pattern Analysis & Machine Intelligence, (6):810–815, 2004.
  • McCloskey & Cohen (1989) McCloskey, M. and Cohen, N. J. Catastrophic interference in connectionist networks: The sequential learning problem. The psychology of learning and motivation, 24(109-165):92, 1989.
  • McRae & Hetherington (1993) McRae, K. and Hetherington, P. A. Catastrophic interference is eliminated in pretrained networks. In Proceedings of the 15th Annual Conference of the Cognitive Science Society, pp. 723–728, 1993.
  • Mermillod et al. (2013) Mermillod, M., Bugaiska, A., and Bonin, P. The stability-plasticity dilemma: investigating the continuum from catastrophic forgetting to age-limited learning effects. Frontiers in psychology, 4, 2013.
  • Mobahi et al. (2009) Mobahi, H., Collobert, R., and Weston, J. Deep learning from temporal coherence in video. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 737–744. ACM, 2009.
  • Nene et al. (1996) Nene, S. A., Nayar, S. K., Murase, H., et al. Columbia Object Image Library (COIL-100). Technical Report CUCS-006-96, 1996.
  • Ngiam et al. (2010) Ngiam, J., Chen, Z., Chia, D., Koh, P. W., Le, Q. V., and Ng, A. Y. Tiled convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1279–1287, 2010.
  • Ranzato et al. (2007) Ranzato, M. A., Huang, F. J., Boureau, Y., and LeCun, Y. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on, pp. 1–8. IEEE, 2007.
  • Rattani et al. (2009) Rattani, A., Freni, B., Marcialis, G. L., and Roli, F. Template update methods in adaptive biometric systems: a critical review. In Advances in Biometrics, pp. 847–856. Springer, 2009.
  • Rehn & Maltoni (2014) Rehn, E. M. and Maltoni, D. Incremental learning by message passing in hierarchical temporal memory. Neural computation, 26(8):1763–1809, 2014.
  • Rosenberg et al. (2005) Rosenberg, C., Hebert, M., and Schneiderman, M. Semi-supervised self-training of object detection models. In Seventh IEEE Workshop on Applications of Computer Vision, pp. 29–36, 2005.
  • Samaria & Harter (1994) Samaria, F. S. and Harter, A. C. Parameterisation of a stochastic model for human face identification. In Applications of Computer Vision, 1994., Proceedings of the Second IEEE Workshop on, pp. 138–142. IEEE, 1994.
  • Saxe et al. (2011) Saxe, A., Koh, P. W., Chen, Z., Bhand, M., Suresh, B., and Ng, A. Y. On random weights and unsupervised feature learning. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 1089–1096, 2011.
  • Schmidhuber (1992) Schmidhuber, J. Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2):234–242, 1992.
  • Sun & Tan (2009) Sun, Z. and Tan, T. Ordinal measures for iris recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 31(12):2211–2226, 2009.
  • Wagner et al. (2013) Wagner, R., Thom, M., Schweiger, R., Palm, G., and Rothermel, A. Learning convolutional neural networks from few samples. In Neural Networks (IJCNN), The 2013 International Joint Conference on, pp. 1–7. IEEE, 2013.
  • Weston et al. (2012) Weston, J., Ratle, F., Mobahi, H., and Collobert, R. Deep learning via semi-supervised embedding. In Neural Networks: Tricks of the Trade, pp. 639–655. Springer, 2012.
  • Wiskott & Sejnowski (2002) Wiskott, L. and Sejnowski, T. J. Slow feature analysis: Unsupervised learning of invariances. Neural computation, 14(4):715–770, 2002.
  • Zhu (2005) Zhu, X. Semi-supervised learning literature survey. Technical report, Department of Computer Sciences, University of Wisconsin, Madison, 2005.
  • Zou et al. (2012) Zou, W., Zhu, S., Yu, K., and Ng, A. Y. Deep learning of invariant features via simulated fixations in video. In Advances in neural information processing systems, pp. 3212–3220, 2012.
  • Zou et al. (2011) Zou, W. Y., Ng, A. Y., and Yu, K. Unsupervised learning of visual invariance with temporal coherence. In NIPS 2011 Workshop on Deep Learning and Unsupervised Feature Learning, 2011.

Appendix A Baseline accuracy on NORB

Here we report accuracy of HTM and CNN on the “standard” normalized-uniform NORB benchmark LeCun et al. (2004).
We consider monocular 32×32 patterns, and study the classification accuracy on the full test set of 24,300 patterns for training sets of increasing size. The results reported below are obtained through a 5-fold cross-validation where, for each round, 1/5 of the test set was taken as validation set to stop the gradient descent at an optimal point and the remaining 4/5 was used to measure accuracy.
HTM training was performed as described in Rehn & Maltoni (2014): a subset of the available patterns is used for pre-training and the rest for supervised tuning through HSR. This allows better control of the network complexity when scaling to large training sets. Since the HTM pre-training algorithm (Maltoni (2011)) internally generates a number of jittered versions of the input patterns (small translations, rotations and scale changes) to emulate temporally coherent exploration sequences, for a fair comparison we exported these patterns and added them to the training set used for CNN training. (This is not the case for the experiments with temporally coherent sequences reported in Section 5, because when the input comes from slowly moving patterns HTM does not need to internally generate jittered versions.) The number of HSR iterations (for optimal convergence on the validation set) is almost always less than 50. CNN training is performed with mini-batches of 100-200 patterns. The number of error backpropagation iterations (for optimal convergence on the validation set) is almost always less than 150.
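The evaluation protocol above (1/5 of the test set for early stopping, 4/5 for accuracy, over 5 rounds) can be sketched as follows (names are ours; `train_and_eval` is a hypothetical callback standing in for one training round):

```python
import numpy as np

def five_fold_eval(n_test_patterns, train_and_eval, seed=0):
    """For each round, hold out 1/5 of the test set as a validation set
    (used only to stop training at an optimal point) and measure accuracy
    on the remaining 4/5; return the mean accuracy over the 5 rounds."""
    idx = np.random.default_rng(seed).permutation(n_test_patterns)
    folds = np.array_split(idx, 5)
    accs = []
    for k in range(5):
        val_idx = folds[k]                                   # early-stopping set
        eval_idx = np.concatenate([folds[j] for j in range(5) if j != k])
        accs.append(train_and_eval(val_idx, eval_idx))       # accuracy of one round
    return float(np.mean(accs))
```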

Training patterns   Jittered versions   HTM      CNN
20×5                  800               64.21%   60.58%
50×5                2,000               73.22%   69.64%
100×5               4,000               78.82%   77.27%
200×5               4,000               81.86%   82.80%
500×5               4,000               84.16%   83.87%
1,000×5             4,000               85.37%   85.47%
2,000×5             4,000               85.83%   86.20%
4,860×5             4,000               86.24%   85.01%
Figure 9: HTM and CNN accuracy on standard normalized-uniform NORB benchmark. The labels (number of training patterns per class) in the x-coordinate are equispaced for better readability.

Fig. 9 shows the accuracy of HTM and CNN. When the number of training patterns per class is small (i.e., 20, 50 and 100), HTM accuracy is slightly better than CNN's; for larger training sets the accuracy of the two approaches is very similar. Note that with monocular inputs the error is markedly higher with respect to the binocular case reported in Section 4.3. Concerning the training time, a direct comparison is not possible because of the different implementation languages and hardware platforms. In particular, the CNN Theano implementation ran on a GPU Tesla C2075 Fermi, while HTM ran on a CPU Xeon W3550 (4 cores). However, to give a coarse indication, both HTM and CNN training took about 3 hours (for a single round of cross-validation) for the largest training set: 4,860×5 + 4,000 patterns.

Appendix B Incremental tuning on NORB (native object segregation)

The NORB benchmark introduced in Section 4.1 focuses on incremental learning of pose and lighting and, unlike the original NORB protocol, does not split the objects into two disjoint groups (for each class: 5 objects in the training set and 5 in the test set). This choice was aimed at isolating the capability of learning pose and lighting invariance from the capability of recognizing different objects of the same class (which is critical in NORB because of the small number of objects per class). However, to further validate the efficacy of the proposed incremental tuning, here we come back to the native object segregation and report the results of the Section 5.1 experiments under this scenario. Figure 10 shows HTM and CNN accuracy for different tuning strategies. We observe that:

  • The trend is very similar to the Section 5 experiments: even in this case, supervised strategies work well for both architectures, while semi-supervised tuning is effective for HTM but not for our CNN implementation.

  • The accuracy achieved is markedly lower with respect to Section 5, but is in line with results reported in Appendix A if we consider the number of training samples and the forgetting effect due to incremental learning.

Figure 10: HTM and CNN incremental tuning accuracy, when splitting class objects as in the original NORB protocol (for each class: 5 objects in the training set and 5 in the test set). No mindist is here necessary between test and training batches because of the object segregation.

Appendix C HTM overview

This Appendix provides a brief overview of HTM. A more detailed introduction to HTM structure, forward and backward messaging (including equations) is given in Sections 1 and 2 of Rehn & Maltoni (2014). HTM pre-training algorithms are presented in detail in Maltoni (2011) while HTM Supervised Refinement (HSR) is introduced in Rehn & Maltoni (2014).


An HTM has a hierarchical tree structure. The tree is built up by a number of levels, each composed of one or more nodes. A node in one level is bidirectionally connected to one or more nodes in the level below and the number of nodes in each level decreases as we ascend the hierarchy. Conversely, the node receptive fields increase as we move up in the tree structure. By allowing nodes to have multiple parents we can create networks with overlapping receptive fields. The lowest level is the input level, and the highest level (with typically only one node) is the output level. Levels and nodes in between input and output are called intermediate levels and nodes.

  • Input nodes constitute a sort of interface: in fact, they just forward up the signals coming from the input pattern.

  • Every intermediate node includes a set C of so-called coincidence-patterns (or just coincidences) and a set G of coincidence groups. A coincidence c is a vector representing a prototypical activation pattern of the node’s children. Coincidence groups are clusters of coincidences likely to originate from simple variations of the same input pattern. Coincidences belonging to the same group can be spatially dissimilar but are likely to be activated close in time when a pattern smoothly moves through the node’s receptive field (i.e., temporal pooling). The assignment of coincidences to groups within each node is encoded in a probability matrix PCG, where each element represents the probability of a coincidence c given a group g.

  • The structure of the output node differs from that of the intermediate nodes. In particular, the output node has coincidences but no groups. Instead of memorizing groups and group likelihoods, it stores a probability matrix PCW, whose elements represent the probability of a coincidence c given a class w.


HTM inference (feedforward flow) proceeds from the input level to the output level. Each intermediate node: (i) computes its coincidence activations by combining the messages coming from its child nodes according to the activation patterns encoded by the coincidences themselves; (ii) calculates its group activations by mixing the coincidence activations through the PCG values; (iii) finally, passes the information up to its parent node(s). The output node computes its coincidence activations and turns them into class posterior probabilities according to PCW.
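The feedforward step of one intermediate node can be sketched as follows (an illustrative simplification: the Gaussian similarity is a stand-in for the actual HTM activation rule in Maltoni (2011); PCG denotes the coincidence-to-group probability matrix):

```python
import numpy as np

def intermediate_node_forward(child_message, coincidences, PCG):
    """Feedforward step of one intermediate HTM node (simplified).

    child_message: concatenated activation vector from the child nodes.
    coincidences:  matrix whose rows are prototypical child activations.
    PCG:           matrix with PCG[i, j] = P(coincidence i | group j).
    Returns the group activation vector passed up to the parent(s).
    """
    # coincidence activations: similarity between input and each prototype
    # (a Gaussian kernel stands in for the actual activation rule)
    d2 = ((coincidences - child_message) ** 2).sum(axis=1)
    y_coinc = np.exp(-d2)
    # group activations: mix coincidence activations through PCG
    y_group = PCG.T @ y_coinc
    return y_group / y_group.sum()   # normalize the upward message
```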


HTM pre-training is unsupervised for intermediate levels and partially supervised for the output level. Coincidences are learnt by sampling the space of activation patterns while smoothly moving training patterns across the node(s) receptive fields. Once coincidences are created they are clustered in groups by maximizing a temporal proximity criterion. The output node coincidences are learnt in the same (unsupervised) way but coincidence-class relationships are learnt in a supervised fashion by counting how many times every coincidence is the most active one (i.e., the winner) in the context of each class.
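The supervised part of the output-node pre-training (winner counting) can be sketched as follows (an illustrative implementation; the function name and the column normalization are our own, with PCW denoting the coincidence-to-class probability matrix):

```python
import numpy as np

def learn_PCW(winner_ids, class_ids, n_coincidences, n_classes):
    """Build the output-node PCW matrix by counting how many times each
    coincidence is the most active one (the winner) in the context of
    each class, then normalizing per class."""
    counts = np.zeros((n_coincidences, n_classes))
    for c, w in zip(winner_ids, class_ids):
        counts[c, w] += 1                  # coincidence c won for class w
    # normalize columns so that PCW[:, w] approximates P(coincidence | class w)
    col_sums = counts.sum(axis=0, keepdims=True)
    col_sums[col_sums == 0] = 1.0          # avoid division by zero
    return counts / col_sums
```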

HTM Supervised Refinement (HSR).

The probabilities in the PCG matrices (remember there is one such matrix for each intermediate node) and in PCW are the main parameters manipulated by HSR. Similarly to error backpropagation, HSR incrementally updates parameter values by taking steps in the direction opposite to the gradient of a loss function. The whole process is implemented in a simple (and computationally light) way based on native HTM (backward) message passing.