Incremental regularized least squares for multiclass classification with recoding, extension to new classes and fixed update complexity.
We consider object recognition in the context of lifelong learning, where a robotic agent learns to discriminate between a growing number of object classes as it accumulates experience about the environment. We propose an incremental variant of the Regularized Least Squares for Classification (RLSC) algorithm, and exploit its structure to seamlessly add new classes to the learned model. The presented algorithm addresses the problem of having an unbalanced proportion of training examples per class, which occurs when new objects are presented to the system for the first time. We evaluate our algorithm on both a machine learning benchmark dataset and two challenging object recognition tasks in a robotic setting. Empirical evidence shows that our approach achieves comparable or higher classification performance than its batch counterpart when classes are unbalanced, while being significantly faster.
In order for autonomous robots to operate in unstructured environments, several perceptual capabilities are required. Most of these skills cannot be hard-coded in the system beforehand, but need to be developed and learned over time as the agent explores and acquires novel experience. As a prototypical example of this setting, in this work we consider the task of visual object recognition in robotics: Images depicting different objects are received one frame at a time, and the system needs to incrementally update the internal model of known objects as new examples are gathered.
In the last few years, machine learning has achieved remarkable results in a variety of applications for robotics and computer vision [1, 2, 3]. However, most of these methods have been developed for off-line (or “batch”) settings, where the entire training set is available beforehand. The problem of updating a learned model online has been addressed in the literature [4, 5, 6, 7], but most algorithms proposed in this context do not take into account challenges that are characteristic of realistic lifelong learning applications. Specifically, in online classification settings, a major challenge is to cope with the situation in which a novel class is added to the model. Indeed, most learning algorithms require the number of classes to be known beforehand and not to grow indefinitely, and the imbalance between the few examples of the new class (potentially just one) and the many examples of previously learned classes can lead to unexpected and undesired behaviors. More precisely, in this work we theoretically and empirically observe that the new and under-represented class is likely to be ignored by the learned model in favor of classes for which more training examples have already been observed, until a sufficient number of examples has been provided for the new class as well.
Several methods have been proposed in the literature to deal with class imbalance in the batch setting by “rebalancing” the misclassification errors accordingly [8, 9, 10]. However, as we point out in this work, rebalancing cannot be applied to the online setting without re-training the entire model from scratch every time a new example is acquired. This would incur learning times that increase at least linearly in the number of examples, which is clearly not feasible in scenarios in which training data grows indefinitely.
In this work we propose a novel method that learns incrementally both with respect to the number of examples and classes, and accounts for potential class unbalance. Our algorithm builds on a recursive version of Regularized Least Squares for Classification (RLSC) [11, 12] to achieve fixed incremental learning times when adding new examples to the model, while efficiently dealing with imbalance between classes. We evaluate our approach on a standard machine learning benchmark for classification and two challenging visual object recognition datasets for robotics. Our results highlight the clear advantages of our approach when classes are learned incrementally.
The paper is organized as follows: Sec. II overviews related work on incremental learning and class imbalance. In Sec. III we introduce the learning setting, discussing the impact of class imbalance and presenting two approaches that have been adopted in the literature to deal with this problem. Sec. IV reviews the recursive RLSC algorithm. In Sec. V we build on previous Sec. III and IV to derive the approach proposed in this work, which extends recursive RLSC to allow for the addition of new classes with fixed update time, while dealing with class imbalance. In Sec. VI we report on the empirical evaluation of our method, concluding the paper in Sec. VII.
Incremental Learning. The problem of learning from a continuous stream of data has been addressed in the literature from multiple perspectives. The simplest strategy is to re-train the system on the updated training set whenever a new example is received [13, 14]. The model from the previous iteration can be used as an initialization to learn the new predictor, reducing training time. These approaches require storing all the training data and retraining over all the points at each iteration. Their computational complexity increases at least linearly with the number of examples.
Incremental approaches that do not require keeping previous data in memory can be divided into stochastic and recursive methods. Stochastic techniques assume training data to be randomly sampled from an unknown distribution and offer asymptotic convergence guarantees to the ideal predictor. However, it has been empirically observed that these methods do not perform well when seeing each training point only once, hence requiring “multiple passes” over the data [15, 16]. This problem has been referred to as the “catastrophic effect of forgetting”, which occurs when training a stochastic model only on new examples while ignoring previous ones, and has recently attracted the attention of the Neural Networks literature [17, 7].
Recursive techniques are based, as the name suggests, on a recursive formulation of batch learning algorithms. Such a formulation typically allows computing the current model in closed form (or with a few operations independent of the number of examples) as a combination of the previous model and the newly observed example [5, 18]. As we discuss in more detail in Sec. IV, the algorithm proposed in this work is based on a recursive method.
Learning with an Increasing Number of Classes. Most classification algorithms have been developed for batch settings and therefore require the number of classes to be known a priori. However, this assumption is often broken in incremental settings, since new examples could belong to previously unknown classes. The problem of dealing with an increasing number of classes has been addressed in the contexts of transfer learning or learning to learn. These settings consider a scenario where $T$ linear predictors have been learned to model $T$ classes. Then, when a new class is observed, the associated predictor is learned with the requirement of being “close” to a linear combination of the previous ones [20, 21, 22]. Other approaches have been recently proposed where a class hierarchy is built incrementally as new classes are observed, allowing the creation of a taxonomy that exploits possible similarities among different classes [13, 23]. However, none of these methods is incremental in the number of examples, and all of them require retraining the system every time a new point is received.
Class Imbalance. The problems related to class imbalance were previously studied in the literature [8, 10, 9] and are addressed in Sec. III. Methods to tackle this issue have been proposed, typically re-weighting the misclassification loss to account for class imbalance. However, as we discuss in Sec. V-B for the case of the square loss, these methods cannot be implemented incrementally. This is problematic, since imbalance among multiple classes often arises in online settings, even if temporarily, for instance when examples of a new class are observed for the first time.
In this section, we introduce the learning framework adopted in this work and describe the disrupting effect of imbalance among class labels. For simplicity, in the following we consider a binary classification setting, postponing the extension to multiclass classification to the end of the section. We refer the reader to the Statistical Learning Theory literature for more details about classification.
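Let us consider a binary classification problem where input-output examples $(x, y)$ are sampled randomly according to a distribution $\rho$ over $\mathcal{X} \times \{-1, 1\}$. The goal is to learn a function $b_* : \mathcal{X} \to \{-1, 1\}$ minimizing the overall expected classification error

$$b_* = \operatorname*{argmin}_{b : \mathcal{X} \to \{-1, 1\}} \int_{\mathcal{X} \times \{-1, 1\}} \mathbb{1}(b(x) - y) \, d\rho(x, y) \qquad (1)$$

given a finite set of observations $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathcal{X}$, $y_i \in \{-1, 1\}$, randomly sampled from $\rho$. Here $\mathbb{1}(a)$ denotes the binary function taking value $0$ if $a = 0$ and $1$ otherwise. The solution to Eq. (1) is called the optimal Bayes classifier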
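and it can be shown to satisfy the equation

$$b_*(x) = \begin{cases} 1 & \text{if } \rho(1|x) > \rho(-1|x) \\ -1 & \text{otherwise} \end{cases} \qquad (2)$$

for all $x \in \mathcal{X}$. Here we have denoted by $\rho(y|x)$ the conditional distribution of $y$ given $x$, and in this work we will denote by $\rho_{\mathcal{X}}(x)$ the marginal distribution of $x$, such that by Bayes’ rule $\rho(x, y) = \rho(y|x)\,\rho_{\mathcal{X}}(x)$. Computing good estimates of $\rho(y|x)$ typically requires large training datasets and is often unfeasible in practice. Therefore, a so-called surrogate problem (see [9, 24]) is usually adopted to simplify the optimization problem at Eq. (1) and asymptotically recover the optimal Bayes classifier. In this sense, one well-known surrogate approach is to consider the least squares expected risk minimization

$$f_* = \operatorname*{argmin}_{f : \mathcal{X} \to \mathbb{R}} \int_{\mathcal{X} \times \{-1, 1\}} (y - f(x))^2 \, d\rho(x, y) \qquad (3)$$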
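The solution to Eq. (3) allows recovering the optimal Bayes classifier. Indeed, for any $f : \mathcal{X} \to \mathbb{R}$ we have

$$\int (y - f(x))^2 \, d\rho(x, y) = \int_{\mathcal{X}} \left[ (1 - f(x))^2 \rho(1|x) + (1 + f(x))^2 \rho(-1|x) \right] d\rho_{\mathcal{X}}(x)$$

which implies that the minimizer of Eq. (3) satisfies

$$f_*(x) = \rho(1|x) - \rho(-1|x) \qquad (4)$$

for all $x \in \mathcal{X}$. The optimal Bayes classifier can be recovered from $f_*$ by taking its sign: $b_*(x) = \operatorname{sign}(f_*(x))$. Indeed, $f_*(x) > 0$ if and only if $\rho(1|x) > \rho(-1|x)$.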
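Empirical Setting. When solving the problem in practice, we are provided with a finite set $\{(x_i, y_i)\}_{i=1}^n$ of training examples. In these settings the typical approach is to find an estimator $\hat f$ of $f_*$ by minimizing the regularized empirical risk

$$\hat f = \operatorname*{argmin}_{f \in \mathcal{H}} \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda R(f) \qquad (5)$$

where $\mathcal{H}$ is a space of candidate functions, $R$ is a regularizer and $\lambda > 0$ is the regularization parameter. Under suitable assumptions, $\lambda$ can be chosen for $\hat f$ to converge in probability to the ideal $f_*$ as the number of training points grows indefinitely. In Sec. IV we review a method to compute $\hat f$ in practice, both in the batch and in the online settings.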
The classification rule at Eq. (2) associates every $x \in \mathcal{X}$ to the class $y$ with highest likelihood $\rho(y|x)$. However, in settings where the two classes are not balanced this approach could lead to unexpected and undesired behaviors. To see this, let us denote $\gamma = \rho(y = 1)$ and notice that, by Eq. (2) and Bayes’ rule, an example $x$ is labeled $y = 1$ whenever

$$\rho(x|1) > \rho(x|-1) \, \frac{1 - \gamma}{\gamma} \qquad (6)$$

Hence, when $\gamma$ is close to one of its extremal values $0$ or $1$ (i.e. $\rho(y = 1) \ll \rho(y = -1)$ or vice-versa), one class becomes clearly preferred with respect to the other and is almost always selected.
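As a minimal numerical sketch of this effect, the snippet below assumes two one-dimensional, unit-variance Gaussian class-conditional densities centred at $+1$ and $-1$ (an illustrative choice, not taken from the experiments) and writes `gamma` for the prior probability of the positive class. Under these assumptions the Bayes decision boundary has the closed form $x = \frac{1}{2}\ln\frac{1-\gamma}{\gamma}$, so it drifts away from the rare class as `gamma` shrinks. All function names are hypothetical.

```python
import math

def gaussian_pdf(x, mean, std=1.0):
    """Density of a 1-D Gaussian, used here as the class-conditional rho(x|y)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def bayes_label(x, gamma, mean_pos=1.0, mean_neg=-1.0):
    """Label x as +1 when rho(x|1) * gamma > rho(x|-1) * (1 - gamma), else -1."""
    score_pos = gaussian_pdf(x, mean_pos) * gamma
    score_neg = gaussian_pdf(x, mean_neg) * (1.0 - gamma)
    return 1 if score_pos > score_neg else -1

def decision_threshold(gamma):
    """Closed-form boundary for unit-variance Gaussians centred at +1 and -1."""
    return 0.5 * math.log((1.0 - gamma) / gamma)

# The same point x = 0.5 changes label as the positive class becomes rarer.
for gamma in (0.5, 0.1, 0.01):
    print(f"gamma={gamma:5.2f}  boundary at x={decision_threshold(gamma):.3f}  "
          f"label of x=0.5: {bayes_label(0.5, gamma)}")
```

With a balanced prior the boundary sits midway between the two means; as the positive class becomes rarer, points that lie closer to its mean are still assigned to the frequent class.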
In Fig. 1 we report an example of the effect of unbalanced data by showing how the decision boundary (white dashed curve) of the optimal Bayes classifier from Eq. (2) varies as takes values from (balanced case) to (very unbalanced case). As it can be noticed, while the classes maintain the same shape, the decision boundary is remarkably affected by the value of .
Clearly, in an online robotics setting this effect could be critically suboptimal for two reasons: (i) We would like the robot to recognize with high accuracy even objects that are seen less often. (ii) In incremental settings, whenever a novel object is observed for the first time, only a few training examples are available (in the extreme case, just one) and we need a loss that also weights under-represented classes fairly.
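In this paper, we consider a general approach to “rebalancing” the classification loss of the standard learning problem of Eq. (1), similar to the ones in [8, 9]. We begin by noticing that in the balanced setting, namely for $\gamma = 1/2$, the classification rule at Eq. (6) is equivalent to assigning class $1$ whenever $\rho(x|1) > \rho(x|-1)$ and vice-versa. Here we want to slightly modify the misclassification loss in Eq. (1) to recover this same rule also in unbalanced settings. To do so, we propose to apply a weight $w(y) > 0$ to the loss $\mathbb{1}(b(x) - y)$, obtaining the problem

$$b_*^w = \operatorname*{argmin}_{b : \mathcal{X} \to \{-1, 1\}} \int_{\mathcal{X} \times \{-1, 1\}} w(y) \, \mathbb{1}(b(x) - y) \, d\rho(x, y)$$

Analogously to the non-weighted case, the solution to this problem is

$$b_*^w(x) = \begin{cases} 1 & \text{if } \rho(1|x) \, w(1) > \rho(-1|x) \, w(-1) \\ -1 & \text{otherwise} \end{cases} \qquad (7)$$

In this work we take the weights to be $w(1) = 1/\gamma$ and $w(-1) = 1/(1 - \gamma)$. Indeed, from the fact that $\rho(y|x) = \rho(x|y)\,\rho(y)/\rho_{\mathcal{X}}(x)$ we have that the rule at Eq. (7) is equivalent to

$$b_*^w(x) = \begin{cases} 1 & \text{if } \rho(x|1) > \rho(x|-1) \\ -1 & \text{otherwise} \end{cases} \qquad (8)$$

which corresponds to the (unbalanced) optimal Bayes classifier in the case $\gamma = 1/2$, as desired.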
Fig. 1 compares the unbalanced and rebalanced optimal Bayes classifiers for different values of $\gamma$. Notice that rebalancing leads to solutions that are invariant to the value of $\gamma$ (compare the black decision boundary with the white one).
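Interestingly, the strategy of changing the weight of the classification error loss can be naturally extended to the least squares surrogate. If we consider the weighted least squares problem,

$$f_*^w = \operatorname*{argmin}_{f : \mathcal{X} \to \mathbb{R}} \int_{\mathcal{X} \times \{-1, 1\}} w(y) \, (y - f(x))^2 \, d\rho(x, y) \qquad (9)$$

we can again recover the (weighted) rule $b_*^w(x) = \operatorname{sign}(f_*^w(x))$ like in the non-weighted setting. Indeed, by direct calculation it follows that Eq. (9) has solution

$$f_*^w(x) = \frac{\rho(1|x) \, w(1) - \rho(-1|x) \, w(-1)}{\rho(1|x) \, w(1) + \rho(-1|x) \, w(-1)} \qquad (10)$$

If we assume $w(1) > 0$ and $w(-1) > 0$ (as in this work), the denominator of Eq. (10) is always positive and therefore $f_*^w(x) > 0$ if and only if $\rho(1|x) \, w(1) > \rho(-1|x) \, w(-1)$, as desired.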
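Coding. An alternative approach to recover the rebalanced optimal Bayes classifier via least squares surrogate is to apply a suitable coding function to the class labels $y \in \{-1, 1\}$, namely

$$f_*^c = \operatorname*{argmin}_{f : \mathcal{X} \to \mathbb{R}} \int_{\mathcal{X} \times \{-1, 1\}} (c(y) - f(x))^2 \, d\rho(x, y) \qquad (11)$$

where $c : \{-1, 1\} \to \mathbb{R}$ maps the labels $y$ into scalar codes $c(y) \in \mathbb{R}$. Analogously to the unbalanced (and uncoded) case, the solution to Eq. (11) is

$$f_*^c(x) = c(1) \, \rho(1|x) + c(-1) \, \rho(-1|x) \qquad (12)$$

which, for $c(y) = y \, w(y)$, corresponds to the numerator of Eq. (10). Therefore, the optimal (rebalanced) Bayes classifier is recovered again by $b_*^w(x) = \operatorname{sign}(f_*^c(x))$.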
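The relation between the three least squares minimizers can be checked numerically at a single point. The sketch below assumes inverse-prior weights (one over the class prior), labels in $\{-1, +1\}$ and codes equal to the label times its weight; the posterior value `p_pos` and the prior `gamma` are hypothetical numbers chosen only for illustration.

```python
def ls_minimizer(p_pos):
    """Unweighted least squares minimizer: rho(1|x) - rho(-1|x)."""
    return p_pos - (1.0 - p_pos)

def weighted_ls_minimizer(p_pos, w_pos, w_neg):
    """Weighted least squares minimizer (ratio of weighted posteriors)."""
    p_neg = 1.0 - p_pos
    num = p_pos * w_pos - p_neg * w_neg
    den = p_pos * w_pos + p_neg * w_neg
    return num / den

def coded_ls_minimizer(p_pos, c_pos, c_neg):
    """Least squares minimizer when labels +1/-1 are replaced by codes c(+1), c(-1)."""
    return c_pos * p_pos + c_neg * (1.0 - p_pos)

gamma = 0.1                                   # assumed prior of the rare positive class
w_pos, w_neg = 1.0 / gamma, 1.0 / (1.0 - gamma)
p_pos = 0.3                                   # hypothetical posterior rho(1|x) at some x

plain = ls_minimizer(p_pos)
weighted = weighted_ls_minimizer(p_pos, w_pos, w_neg)
coded = coded_ls_minimizer(p_pos, w_pos, -w_neg)   # codes c(y) = y * w(y)
print(plain, weighted, coded)
```

In this example the unweighted minimizer is negative, so the rare class is rejected, while the weighted and coded minimizers share the same positive sign: the coded value equals the weighted one multiplied by its (positive) denominator.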
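In the multiclass setting, the optimal Bayes decision rule corresponds to the function $b_* : \mathcal{X} \to \{1, \dots, T\}$, assigning a label $t$ to $x$ when $\rho(t|x) > \rho(s|x)$ for all $s \neq t$, with $t, s \in \{1, \dots, T\}$. Consequently, the rebalanced decision rule would assign class $t$ whenever $\rho(t|x) \, w(t) > \rho(s|x) \, w(s)$ for all $s \neq t$, where the function $w : \{1, \dots, T\} \to \mathbb{R}_+$ assigns a weight to each class. Generalizing the binary case, in this work we set $w(t) = 1/\gamma_t$, where we denote $\gamma_t = \rho(y = t)$, for each $t \in \{1, \dots, T\}$.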
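In multiclass settings, the surrogate least squares classification approach is recovered by adopting a 1-vs-all strategy, formulated as the vector-valued problem

$$f_* = \operatorname*{argmin}_{f : \mathcal{X} \to \mathbb{R}^T} \int \| e_y - f(x) \|^2 \, d\rho(x, y) \qquad (13)$$

where $e_t \in \mathbb{R}^T$ is a vector of the canonical basis $\{e_1, \dots, e_T\}$ of $\mathbb{R}^T$ (with the $t$-th coordinate equal to $1$ and the remaining $0$). Analogously to the derivation of Eq. (4), it can be shown that the solution to this problem corresponds to $f_*(x) = (\rho(1|x), \dots, \rho(T|x))^\top \in \mathbb{R}^T$ for all $x \in \mathcal{X}$. Consequently, we recover the optimal Bayes classifier by

$$b_*(x) = \operatorname*{argmax}_{t = 1, \dots, T} f_*(x)_t \qquad (14)$$

where $f_*(x)_t$ denotes the $t$-th entry of the vector $f_*(x)$.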
The extensions of recoding and rebalancing approaches to this setting follow analogously to the binary setting discussed in Sec. III-D. In particular, the coding function consists in mapping a vector of the basis $e_t$ to $e_t / \gamma_t$.
Note. In previous sections we presented the analysis on the binary case by considering a $\{-1, 1\}$ coding for class labels. This was done to offer a clear introduction to the classification problem, since we need to solve a single least squares problem to recover the optimal Bayes classifier. Alternatively, we could have followed the approach introduced in this section where classes have labels $y \in \{1, 2\}$ and adopt surrogate labels $e_1 = (1, 0)^\top$ and $e_2 = (0, 1)^\top$. This would have led to training two distinct classifiers and choosing the predicted class as the $\operatorname{argmax}$ of their scores, according to Eq. (14). The two approaches are clearly equivalent since the Bayes classifier corresponds respectively to the inequalities $\rho(1|x) - \rho(-1|x) > 0$ or $\rho(1|x) > \rho(2|x)$.
In this section we review the standard algorithm for Regularized Least Squares Classification (RLSC) and its recursive formulation used for incremental updates.
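We address the problem of solving the empirical risk minimization introduced in Eq. (5) in the multiclass setting. Let $\{(x_i, y_i)\}_{i=1}^n$ be a finite training set, with inputs $x_i \in \mathbb{R}^d$ and labels $y_i \in \{e_1, \dots, e_T\} \subset \mathbb{R}^T$. In this work, we will assume a linear model for the classifier $\hat f$, namely $\hat f(x) = W^\top x$, with $W$ a matrix in $\mathbb{R}^{d \times T}$. We can rewrite Eq. (5) in matrix notation as

$$\hat W = \operatorname*{argmin}_{W \in \mathbb{R}^{d \times T}} \| Y - X W \|_F^2 + \lambda \| W \|_F^2 \qquad (15)$$

with $\lambda > 0$ the regularization parameter and $X \in \mathbb{R}^{n \times d}$ and $Y \in \mathbb{R}^{n \times T}$ the matrices whose $i$-th rows correspond respectively to $x_i$ and $y_i$. We denote by $\| \cdot \|_F^2$ the squared Frobenius norm of a matrix (i.e. the sum of its squared entries). The solution to Eq. (15) is available in closed form:

$$\hat W = (X^\top X + \lambda I_d)^{-1} X^\top Y \qquad (16)$$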
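Prediction. According to the rule introduced in Sec. III-E, a given $x \in \mathbb{R}^d$ is classified according to

$$\hat b(x) = \operatorname*{argmax}_{t = 1, \dots, T} \hat W_{(t)}^\top x$$

with $\hat W_{(t)}$ denoting the $t$-th column of $\hat W$.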
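The closed form for the solution at Eq. (16) allows deriving a recursive formulation to incrementally update $\hat W$ in fixed time as new training examples are observed. Consider a learning process where training data are provided to the system one at a time. At iteration $k$ we need to compute $W_k = (X_k^\top X_k + \lambda I_d)^{-1} X_k^\top Y_k$, where $X_k \in \mathbb{R}^{k \times d}$ and $Y_k \in \mathbb{R}^{k \times T}$ are the matrices whose rows correspond to the first $k$ training examples. The computational cost for evaluating $W_k$ according to Eq. (16) is $O(k d^2)$ (for the matrix products) and $O(d^3)$ (for the inversion). This is undesirable in an online setting where $k$ can grow indefinitely. We now review how $W_k$ can be computed incrementally from $W_{k-1}$ in $O(T d^2)$. To see this, first notice that, by construction,

$$X_k = \begin{bmatrix} X_{k-1} \\ x_k^\top \end{bmatrix}, \qquad Y_k = \begin{bmatrix} Y_{k-1} \\ y_k^\top \end{bmatrix}$$

and therefore, if we denote $A_k = X_k^\top X_k + \lambda I_d$ and $b_k = X_k^\top Y_k$, we obtain the recursive formulations

$$A_k = A_{k-1} + x_k x_k^\top, \qquad b_k = b_{k-1} + x_k y_k^\top$$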
Computing $b_k$ from $b_{k-1}$ requires $O(d)$ operations (since $y_k$ has all zero entries but one). Computing $A_k$ from $A_{k-1}$ requires $O(d^2)$, while the inversion $A_k^{-1}$ requires $O(d^3)$. To reduce the cost of the (incremental) inversion, we recall that for a positive definite matrix $A$ whose Cholesky decomposition $A = R^\top R$ is known (with $R \in \mathbb{R}^{d \times d}$ upper triangular), $A^{-1}$ can be applied to a vector in $O(d^2)$ by solving two triangular systems. In principle, computing the Cholesky decomposition of $A_k$ still requires $O(d^3)$, but we can apply a rank-one update to the Cholesky decomposition at the previous step, namely $A_k = R_k^\top R_k = A_{k-1} + x_k x_k^\top = R_{k-1}^\top R_{k-1} + x_k x_k^\top$, which is known to require $O(d^2)$. Several implementations are available for the Cholesky rank-one updates; in our experiments we used the MATLAB routine cholupdate.
Therefore, the update of $W_k$ from $W_{k-1}$ can be computed in $O(T d^2)$, since the most expensive operation is the multiplication $A_k^{-1} b_k$. In particular, this computation is independent of the current number of training examples seen so far, making this algorithm suited for online settings.
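The recursion can be sketched in NumPy as follows. This is an illustrative sketch rather than the code used in the experiments: the rank-one Cholesky update is written by hand in place of MATLAB's cholupdate, the names `chol_update`, `solve_with_cholesky` and `recursive_rlsc` are hypothetical, and the data are random. The model is solved only once at the end so that it can be compared with the batch closed form; in an online loop the solve would be repeated after each example.

```python
import numpy as np

def chol_update(R, x):
    """Return upper-triangular R_new with R_new^T R_new = R^T R + x x^T, in O(d^2)."""
    R = np.array(R, dtype=float)      # work on copies, leave the inputs untouched
    x = np.array(x, dtype=float)
    d = x.size
    for i in range(d):
        r = np.hypot(R[i, i], x[i])
        c, s = r / R[i, i], x[i] / R[i, i]
        R[i, i] = r
        R[i, i + 1:] = (R[i, i + 1:] + s * x[i + 1:]) / c
        x[i + 1:] = c * x[i + 1:] - s * R[i, i + 1:]
    return R

def solve_with_cholesky(R, B):
    """Solve (R^T R) W = B through two triangular systems."""
    # np.linalg.solve does not exploit the triangular structure; a dedicated
    # triangular solver (e.g. scipy.linalg.solve_triangular) would cost O(d^2)
    # per column of B.
    Z = np.linalg.solve(R.T, B)
    return np.linalg.solve(R, Z)

def recursive_rlsc(X, Y, lam):
    """Process the rows of X (n x d) and one-hot Y (n x T) one example at a time."""
    d, T = X.shape[1], Y.shape[1]
    R = np.sqrt(lam) * np.eye(d)      # Cholesky factor of A_0 = lam * I
    b = np.zeros((d, T))              # b_0 = 0
    for x, y in zip(X, Y):
        R = chol_update(R, x)         # A_k = A_{k-1} + x x^T
        b += np.outer(x, y)           # b_k = b_{k-1} + x y^T
    return solve_with_cholesky(R, b)  # W_k = A_k^{-1} b_k

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
Y = np.eye(3)[rng.integers(0, 3, size=50)]   # random one-hot labels
W_rec = recursive_rlsc(X, Y, lam=0.1)
W_batch = np.linalg.solve(X.T @ X + 0.1 * np.eye(5), X.T @ Y)
print(np.allclose(W_rec, W_batch))
```

Only the Cholesky factor and the matrix $b_k$ are stored, so memory and per-example cost do not depend on how many examples have been processed.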
In this Section, we present our approach to incremental multiclass classification where we account for the possibility to extend the number of classes incrementally and apply the recoding approach introduced in Sec. III. The algorithm is reported in Alg. 1.
We propose a modification of recursive RLSC, allowing to extend the number of classes in constant time with respect to the number of examples seen so far. Let $T_k$ denote the number of classes seen up to iteration $k$. We have two possibilities:
The new example $(x_k, y_k)$ belongs to one of the known classes, i.e. $y_k = e_t$, with $1 \le t \le T_{k-1}$.
$(x_k, y_k)$ belongs to a new class, implying that $T_k = T_{k-1} + 1$.
In the first case, the update rules for $A_k$, $b_k$ and $W_k$ explained in Section IV-B can be directly applied. In the second case, the update rule for $A_k$ remains unchanged, while the update of $b_k$ needs to account for the increase in size (since $b_k \in \mathbb{R}^{d \times (T_{k-1} + 1)}$). However, we can modify the update rule for $b_k$ without increasing its computational cost by first adding a new column of zeros $0 \in \mathbb{R}^d$ to $b_{k-1}$, namely

$$b_k = [\, b_{k-1}, \; 0 \,] + x_k y_k^\top$$

which requires $O(d)$ operations. Therefore, with the strategy described above it is indeed possible to extend the classification capabilities of the incremental learner during online operation, without re-training it from scratch. In the following, we address the problem of dealing with class imbalance during incremental updates by performing incremental recoding.
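The two cases can be sketched as a single update of the matrix that accumulates inputs per class (one column per known class). The function name `update_b` and the convention that class indices are 0-based and appear in increasing order are assumptions made for this illustration.

```python
import numpy as np

def update_b(b, x, t):
    """Add example x of class index t (0-based) to b; grow b by one column if t is new."""
    d, T = b.shape
    if t == T:
        # Unseen class: append a column of zeros before the usual update.
        # np.hstack copies the array; preallocating spare columns would avoid the copy.
        b = np.hstack([b, np.zeros((d, 1))])
    elif t > T:
        raise ValueError("class indices are assumed to appear in increasing order")
    b[:, t] += x          # x e_t^T only touches column t
    return b

b = np.zeros((4, 2))                  # two known classes, inputs of dimension 4
x = np.arange(4.0)
b = update_b(b, x, 0)                 # known class: same shape as before
b = update_b(b, x, 2)                 # new class: one extra column
print(b.shape)
```

The matrix accumulating the input outer products is unaffected by the arrival of a new class, since it depends on the inputs only.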
The main algorithmic difference between standard RLSC and the variant with recoding is in the matrix $Y$ containing output training examples. Indeed, according to the recoding strategy, the vector $e_t$ associated to an output label $y = t$ is coded into $e_t / \gamma_t$. In the batch setting, this can be formulated in matrix notation as

$$W = (X^\top X + \lambda I_d)^{-1} X^\top Y \Gamma$$

where the original output matrix $Y$ is replaced by its encoded version $Y \Gamma$, with $\Gamma$ the $T \times T$ diagonal matrix whose $t$-th diagonal element is $1/\gamma_t$. Clearly, in practice the $\gamma_t$ are estimated empirically (e.g. by $\hat\gamma_t = n_t / n$, the ratio between the number $n_t$ of training examples belonging to class $t$ and the total number $n$ of examples).
The above formulation is favorable for the online setting. Indeed, we have

$$W_k = (X_k^\top X_k + \lambda I_d)^{-1} X_k^\top Y_k \Gamma_k = A_k^{-1} b_k \Gamma_k$$

where $\Gamma_k$ is the diagonal matrix of the (inverse) class distribution estimators up to iteration $k$. $\Gamma_k$ can be computed incrementally in $O(T)$ by keeping track of the number $k_t$ of examples belonging to class $t$ and then computing $\hat\gamma_t^k = k_t / k$ (see Alg. 1 for how this update was implemented in our experiments).
Note that the above step requires $O(dT)$, since updating the (uncoded) $b_k$ from $b_{k-1}$ requires $O(d)$ and multiplying $b_k$ by a diagonal matrix requires $O(dT)$. All the above computations are dominated by the product $A_k^{-1} b_k$, which requires $O(T d^2)$. Therefore, our algorithm is computationally equivalent to the standard incremental RLSC approach.
Coding as a Regularization Parameter. Depending on the amount of training examples seen so far, the estimator $\hat\gamma_t$ may not approximate $\gamma_t$ well. In order to mitigate this issue, we propose to introduce a parameter $\alpha \in [0, 1]$ and raise $\Gamma_k$ element-wise to the power of $\alpha$ (indicated by $\Gamma_k^\alpha$). Indeed, it can be noticed that for $\alpha = 0$ we recover the (uncoded) standard RLSC, since $\Gamma_k^0 = I$, while $\alpha = 1$ applies full recoding. In Sec. VI-C we discuss an efficient heuristic to find $\alpha$ in practice.
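A minimal sketch of the recoding step, assuming per-class example counts are tracked and every class has been seen at least once. The names `recoding_weights` and `recoded_solution` are hypothetical, and the plain linear solve stands in for the Cholesky-based one.

```python
import numpy as np

def recoding_weights(class_counts, alpha):
    """Inverse empirical class frequencies raised element-wise to the power alpha."""
    counts = np.asarray(class_counts, dtype=float)
    gamma_hat = counts / counts.sum()          # empirical class frequencies
    return (1.0 / gamma_hat) ** alpha

def recoded_solution(A, b, class_counts, alpha):
    """Solve the uncoded system, then scale column t by the weight of class t."""
    W_uncoded = np.linalg.solve(A, b)
    # Right-multiplying by a diagonal matrix equals scaling the columns,
    # which NumPy broadcasting does without forming the matrix.
    return W_uncoded * recoding_weights(class_counts, alpha)

counts = [100, 100, 1]                         # one class observed a single time
print(recoding_weights(counts, alpha=0.0))     # no recoding: all weights equal one
print(recoding_weights(counts, alpha=1.0))     # full recoding: rare class weighted most
```

Intermediate values of the exponent interpolate between the two regimes, damping the weight of a class whose frequency estimate is still unreliable.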
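Incremental Rebalancing. Note that the loss-rebalancing algorithm (Sec. III-D) cannot be implemented incrementally. Indeed, the solution of the rebalanced empirical RLSC is

$$W_k = (X_k^\top \Sigma_k X_k + \lambda I_d)^{-1} X_k^\top \Sigma_k Y_k \qquad (22)$$

with $\Sigma_k$ a $k \times k$ diagonal matrix whose $i$-th entry is equal to $1/\hat\gamma_{t_i}^k$, with $t_i$ the class of the $i$-th training example. Since $\Sigma_k$ changes at every iteration, it is not possible to derive a rank-one update rule for $X_k^\top \Sigma_k X_k + \lambda I_d$ as for the standard RLSC.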
We empirically assessed the performance of Alg.1 on a standard benchmark for machine learning and on two visual recognition tasks in robotics. To evaluate the improvement provided by the incremental recoding when classes are imbalanced, we compared the accuracy of the proposed method with the standard recursive RLSC presented in Sec. IV-B. As a competitor in terms of accuracy, we also considered the rebalanced approach presented in Eq. (22) (which, we recall, cannot be implemented incrementally).
We adopted the following experimental protocol (code available at https://github.com/LCSL/incremental_multiclass_RLSC):
Given a dataset with $T$ classes, we simulated a scenario where a new class is observed by selecting $T - 1$ of them to be “balanced” and the remaining one to be under-represented.
We trained a classifier on the balanced classes, using a randomly sampled dataset with a fixed number of examples per class (specified below for each dataset). We sampled a validation set with a fixed number of examples per class.
We incrementally trained the classifier from the previous step by sampling examples of the $T$-th class online. Model selection was performed using exclusively the validation set of the balanced classes, following the strategy described in Sec. VI-C.
To measure performance, we sampled a separate test set containing a fixed number of examples per class (both balanced and under-represented) and we measured the accuracy of the algorithms on the test set while they were trained incrementally.
For each dataset, we averaged results over multiple independent trials randomly sampling the validation set. In Table I we report the test accuracy on the imbalanced class and on the entire test set.
MNIST  is a benchmark composed of K greyscale pictures of digits from to . We addressed the 10-class digit recognition problem usually considered in the literature, but using training images per class. The test set was obtained by sampling images per class. We used the raw pixels of the images as inputs for the linear classifier.
iCubWorld28  is a dataset for visual object recognition in robotics, collected during a series of sessions where a human teacher showed different objects to the iCub humanoid robot . We addressed the task of discriminating between the objects instances in the dataset, using all available acquisition sessions per object and randomly sampling and
examples per class. We performed feature extraction by taking the activations of the fc layer of the CaffeNet Convolutional Neural Network.
RGB-D Washington  is a visual object recognition dataset comprising objects belonging to categories, acquired by recording image sequences of each object while rotating on a turntable. We addressed the -class object categorization task, averaging results over the ten splits specified in  (where, for each category, a random instance is left out for testing). We subsampled one cropped RGB frame every five from the full dataset, following the standard procedure. We sampled and images per class and performed feature extraction analogously to iCubWorld28, using the output of CaffeNet’s layer.
In traditional batch learning settings for RLSC, model selection for the hyperparameter is typically performed via hold-out, k-fold or similar cross-validation techniques. In the incremental setting these strategies cannot be directly applied since examples are observed online, but a simple approach to create a validation set is to hold out every -th example without using it for training (e.g., we set ). At each iteration, multiple candidate models are trained incrementally, each for a different value of , and the one with highest validation accuracy is selected for prediction.
However, following the same argument of Sec. III, in the presence of class imbalance this strategy would often select classifiers that ignore the under-represented class. Rebalancing the validation loss (see Sec. III) does not necessarily solve the issue, but could rather lead to overfitting the under-represented class, degrading the accuracy on other classes since errors count less on them. Motivated by the empirical evidence discussed below, in this work we have adopted a model selection heuristic for $\lambda$ and $\alpha$ in Alg. 1, which is designed not to degrade validation accuracy on well-represented classes, while at the same time achieving higher or equal accuracy on the under-represented one.
Our strategy evaluates the accuracy of the candidate models on the incremental validation set, but only for classes that have a sufficient number of examples (e.g., classes with fewer examples than a pre-defined threshold are not used for validation). Then, we choose the model with the largest $\alpha$ for which such accuracy is higher than or equal to the one measured for $\alpha = 0$, namely without coding. Indeed, as can be seen in Fig. 2 for validation experiments on iCubWorld28, as $\alpha$ grows from $0$ to $1$, the classification accuracy on the under-represented class increases, Fig. 2 (Top), while it decreases on the remaining ones, Fig. 2 (Bottom). Our heuristic chooses the best trade-off for $\alpha$ such that performance does not degrade on well-known classes, but at the same time it will often improve on the under-represented one.
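The selection rule for the coding exponent can be sketched in a few lines. The function name `select_alpha` and the accuracy values are hypothetical; the accuracies are assumed to have been measured on the validation examples of well-represented classes only, and the candidate grid is assumed to include the value zero (no coding).

```python
def select_alpha(val_accuracy_by_alpha):
    """Pick the largest exponent whose validation accuracy is at least the
    accuracy obtained without coding (exponent equal to zero)."""
    baseline = val_accuracy_by_alpha[0.0]
    admissible = [a for a, acc in val_accuracy_by_alpha.items() if acc >= baseline]
    return max(admissible)

# Hypothetical validation accuracies on well-represented classes, for illustration only.
accuracies = {0.0: 0.80, 0.25: 0.81, 0.5: 0.80, 0.75: 0.78, 1.0: 0.74}
print(select_alpha(accuracies))
```

Since the uncoded model is always admissible, the rule falls back to no coding whenever every positive exponent lowers the validation accuracy.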
In Table I we report the results of the three methods on MNIST, iCubWorld28 and RGB-D for a single under-represented class (digit “8”, class and tomato, respectively). We observed a similar behaviour for other classes. We show both the accuracy on all classes (Total Acc., Left) and on the under-represented one (Imbalanced Acc., Right). We note that, on the under-represented class, Alg. 1 (RC) consistently outperforms the RLSC baseline (N), which does not account for class imbalance and learns models that ignore the class. The total accuracy of RC is also higher. Interestingly, on the two robotics tasks, RC outperforms the loss rebalancing approach (RB), particularly when very few examples of the under-represented class are available. This is favorable since, as noted above, the rebalancing approach cannot be implemented incrementally (Sec. V-B).
To offer a clear intuition of the improvement provided by our method, in Fig. 3 we show the accuracy of Naïve RLSC (Red) and Alg. 1 (Blue), separately on the under-represented class (Top), the balanced classes (Middle), and all classes (Bottom), as they are trained on new examples. For this experiment we let each class be under-represented and averaged the results. It can be noticed that Alg. 1 is much better on the imbalanced class, while being comparable on the balanced ones, resulting in overall improved performance.
We point out that the total accuracy on all datasets for is comparable with the state of the art. Indeed, on MNIST we achieve accuracy, which is slightly lower than the one reported in  for a linear classifier on top of raw pixels (this is reasonable, since we are using much fewer training examples). The total accuracy of Alg. 1 on RGB-D is approximately , which is comparable with the state of the art on this dataset . On the iCubWorld28 dataset we achieve accuracy, which is in line with the results reported in Fig. 8 of  (extended version of ).
In this paper we addressed the problem of learning online with an increasing number of classes. Motivated by the visual recognition scenario in lifelong robot learning, we focused on issues related to class imbalance, which naturally arises when a new object/category is observed for the first time. To address this problem, we proposed a variant of the recursive Regularized Least Squares for Classification (RLSC) algorithm that (i) incorporates new classes incrementally and (ii) dynamically applies class recoding when new examples are observed. Updates are performed in constant time with respect to the growing number of training examples. We evaluated the proposed algorithm on a standard machine learning benchmark and on two datasets for visual recognition in robotics, showing that our approach is indeed favorable in online settings when classes are imbalanced.
We note that, in principle, for the experiments where we used features extracted from a Convolutional Neural Network, we could have also directly trained the network online, by Stochastic Gradient Descent (backpropagation). While there are works that empirically investigate this end-to-end approach in settings where new classes are progressively included into the model, this is still a largely unexplored field, the study of which is not in the scope of this work. The method we propose allows updating a predictor in a fast and stable way without using training data from previous classes and, by relying on rich deep representations learned offline, is shown empirically to be competitive with the state of the art, while being more suitable for online applications.
The work described in this paper is supported by the Center for Brains, Minds and Machines, funded by NSF STC award CCF-1231216 and by FIRB project RBFR12M3AC, funded by the Italian Ministry of Education, University and Research. We acknowledge NVIDIA Corporation for the donation of the Tesla k40 GPU used for this research.
A. Krizhevsky, I. Sutskever, and G. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” in NIPS, 2012.
J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” Journal of Machine Learning Research, 2011.
International joint conference on artificial intelligence, 2001.
Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in ACM International Conference on Multimedia, 2014.