1 Introduction and Related Work
Deep Learning, specifically Convolutional Neural Networks (CNNs), has become the dominant approach for a wide range of visual recognition tasks. Hence, problems which have been solved in the past with other machine learning approaches have to be revisited from a Deep Learning perspective. One of them is incremental/continual/life-long learning. By life-long learning, we mean the ability of humans to learn through experience over time. For supervised (Deep Learning) approaches, it means that the data for model training are not all available at once, and the model has to be adjusted as new data arrive. In the case of unsupervised learning, such approaches have been proposed for quite some time, e.g. incremental clustering. In the supervised (Deep Neural Network) framework, while offline models have proven successful in an abundance of fields, such as medical image classification, recognition of historical objects in digital cultural heritage management and others, online models still seem to lack effectiveness.
The scenarios which might require continual learning approaches include, but are not limited to: i) introduction of new categories from newly arriving data, in particular when only a few samples are available; ii) refinement of existing categories into sub-categories (hierarchical classification); iii) additional training data for existing categories. The latter is needed not only in scenarios where an existing database is permanently enriched (such as in cultural heritage management), but also in tracking of objects of a known class, an "online learning" setting for class-specific tracking which requires re-training with gradient descent. Our contribution concerns this very case: once a model has been pre-trained on a sufficiently large dataset, it has to be adjusted with newly arriving data without adding or refining categories.
This raises the important question of catastrophic forgetting: if a system has been trained on a body of data and shows good performance, and is then sequentially trained with new data on new tasks, it may forget the older tasks. This is the so-called "stability-plasticity dilemma".
In human reasoning, the stability–plasticity dilemma concerns the extent to which a system must be prone to adapt to new knowledge and, importantly, how this adaptation process should be compensated by internal mechanisms that stabilise and modulate neural activity to prevent catastrophic forgetting. Catastrophic forgetting is inherent to neural networks because of the gradient-descent-based optimisation of their parameters. At each forward pass, the loss is computed with the current model; during backward propagation, the gradient of the loss with respect to each parameter is computed, and the parameters are adjusted with respect to this new loss. Therefore, if the training data arrive sequentially, as is the case in continual learning, we observe a "model drift" away from the already optimised solution. Some attempts have been made to avoid catastrophic forgetting, but their authors consider the case of new classes appearing without pre-training, and their problem is to balance the performance on "old" classes against that on new ones. We are interested in the case when the taxonomy does not change over time, but the object appearance may, as in tracking (due to self-occlusions and progressive changes of viewpoint) or in database enrichment, when the initial database is continuously incremented with different views of the same visual content. Re-training neural networks in order to adapt models to newly arriving content requires a heavy computational workload due to the back-propagation pass in gradient optimisation.
In this paper, we propose a novel approach for continual learning which does not require gradient-based optimisation. It was motivated by the recognition of objects to grasp in assistance to amputees with vision-based neuro-prostheses, where different views of the same object-to-grasp arrive "on the fly" to adjust the pre-trained model. Another scenario is continual image database enrichment. We show that the method performs "not worse" than continual learning by sequential gradient descent optimisation. A mathematical formulation of the method is given, and experiments on the open image dataset CIFAR-10 are reported. Our approach to incremental learning differs from all the others introduced so far. During the learning procedure, we aim to internally change the network's weight structure rather than changing the whole architecture. We also distance ourselves from classical re-training, which is too computationally expensive for real-world applications: when processing data on the fly, re-training would be far from real time. The core idea of the method is the adjustment of a weight of the neuron responding to the class of each training example arriving sequentially "on the fly". The remainder of the paper is organised as follows. In Section 2, we present the mathematical bases of our method. In Section 3, we report on experiments on a publicly available benchmark database. Section 4 concludes this work and outlines its perspectives.
2 Move-to-Data: A Continual Learning Method
In the following, we restrict ourselves to the context of deep learning, even though the definitions can be extended to the field of machine learning in general. To explain the Move-to-Data method, we first introduce notation.
2.1 Definitions and Notations
Suppose we have a sequence of labeled data $(x_t, y_t)_{t \in T}$ with some index set $T$. Let us also assume we have $n$-dimensional input data $x_t \in \mathbb{R}^n$ and a $c$-dimensional 1-of-$c$ encoded label $y_t \in \{e_1, \dots, e_c\}$, where $e_i$ is a unit vector and $c$ is the number of classes. For given parameters $\theta \in \mathbb{R}^p$, where $p$ is the number of neural network parameters, we can define the neural network as a function, see eq. (1):

$$f_\theta : \mathbb{R}^n \to \mathbb{R}^c, \quad y = f_\theta(x). \quad (1)$$
Assume further that the activation functions of our neural network are ReLU:

$$\sigma(z) = \max(0, z). \quad (2)$$

Any other non-linear response can be used, such as sigmoid, leaky ReLU, etc. Please note that we implicitly assume a fixed architecture of the neural network, that is, a known shape of the function (1).
Let $T = \{1, 2, \dots\}$ and $t_0 \in T$. Suppose now we have a neural network $f_{\theta_{t_0}}$ trained offline on the data

$$(x_t, y_t)_{t \le t_0}. \quad (3)$$
In the online learning setting, the learning step would consist of finding the parameters $\theta_{t+1}$ by using all previous information, that is, all previous models and all previous data. We are thus searching for a mapping

$$\bigl(\theta_1, \dots, \theta_t, (x_s, y_s)_{s \le t+1}\bigr) \mapsto \theta_{t+1}. \quad (4)$$
In many applications, we have neither the storage nor the computational power to process all past models and data. Therefore, in incremental learning, we would like to find a mapping of the form

$$(\theta_t, x_{t+1}, y_{t+1}) \mapsto \theta_{t+1}, \quad (5)$$

depending only on the newly arriving data $(x_{t+1}, y_{t+1})$ and on the previous model $\theta_t$.
2.2 "Move-to-Data": Incremental learning approach for a Deep CNN
The leading idea of our approach is to maintain the overall structure of the deep neural network and slightly adapt the model to the new data stream on a small scale.
The naive approach would be to re-train the model with gradient descent. Nevertheless, it is too time-consuming to apply back-propagation for each new data point arriving on the fly.
Let us start off with an observation. Let $w, v \in \mathbb{R}^d$ with $\|w\| = \|v\| = 1$. Then the scalar product $\langle w, v \rangle$ is large if the angle between $w$ and $v$ is small. This rather trivial observation indicates that, for a high activation at a neuron, the weight vector and the feature vector must be similar with respect to the angle between them. This is often referred to as the cosine similarity, a principle which has been largely used in content-based image retrieval systems.
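The observation can be checked numerically; a minimal sketch (the helper name `cosine_similarity` is ours, for illustration):

```python
import numpy as np

# For unit vectors, the scalar product equals the cosine of the angle
# between them, so a high activation <w, v> means the weight vector w
# and the feature vector v point in similar directions.
def cosine_similarity(w, v):
    """Cosine of the angle between two non-zero vectors."""
    return float(np.dot(w, v) / (np.linalg.norm(w) * np.linalg.norm(v)))

w = np.array([1.0, 0.0])
v_close = np.array([0.9, 0.1])   # small angle with respect to w
v_far = np.array([0.0, 1.0])     # orthogonal to w
```

Here `cosine_similarity(w, v_close)` is close to 1, while `cosine_similarity(w, v_far)` is 0.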
In CNNs, convolutional layers serve to extract features and are followed by fully connected (FC) layers implementing a neural classifier. These layers, of which several can be cascaded, implement a Multi-Layer Perceptron, with or without hidden layers. Let us now focus on the last hidden (FC) layer of a neural network, of width $d$. As above, we have a feature vector $v_t \in \mathbb{R}^d$ corresponding to the input data $x_t$ for some $t$. The output vector is then given by $y = W v_t$, where the weight matrix $W$ is defined by its rows:

$$W = (w_1, \dots, w_c)^\top, \quad w_i \in \mathbb{R}^d. \quad (6)$$
For the activation of the $i$-th neuron, corresponding to the $i$-th class, we have $y_i = \langle w_i, v_t \rangle$. As suggested above, the activation will be increased if the weight vector $w_i$ is closer to the feature vector $v_t$. So, for a given label $y_{t+1}$ in the incremental step belonging to the $i$-th class for some $i$, in other words $y_{t+1} = e_i$, we move the weight vector $w_i$ in the direction of the feature vector $v_{t+1}$ as defined in the following equation:

$$w_i \leftarrow (1 - \varepsilon)\, w_i + \varepsilon\, v_{t+1}, \quad (7)$$

where $\varepsilon > 0$ is chosen to be small.
In the context of incremental learning, we receive new data on the fly. For each new data point, we apply formula (7). We call this procedure Move-to-Data. The Move-to-Data method makes the loss function decrease (see Section 3). However, one should notice that the loss function will continue decreasing until all the classes have been seen in the samples.
Furthermore, it is important to note that, as formulated in (7), $w_i$ and $v_{t+1}$ need to be unit vectors to prevent biases induced by scaling. It is common practice to normalize feature and weight vectors: for features this is better known as "feature scaling", and for weights it goes by the name of re-parametrization. In our case, we do not normalize the weights $w_i$, but move them towards the data vector by the following equation, using the projection of the weight vector on the data vector direction:

$$w_i \leftarrow (1 - \varepsilon)\, w_i + \varepsilon\, \frac{\langle w_i, v_{t+1} \rangle}{\|v_{t+1}\|^2}\, v_{t+1}, \quad (8)$$

where $\|\cdot\|$ is the Euclidean norm. Note that the proposed adjustment of weights concerns only the FC layer. Hence, it is applicable not only to CNNs, but also to a classical MLP as well as to recurrent neural networks.
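For intuition, one Move-to-Data step on the last FC layer can be sketched as follows. This is a minimal illustration assuming that the ground-truth class weight vector is pulled a small step $\varepsilon$ towards the unit-normalised feature vector and then rescaled to keep its norm; the helper name and the exact rescaling are our assumptions, not the authors' code:

```python
import numpy as np

# Illustrative Move-to-Data step (assumed form, see lead-in):
# only the weight row of the true class is touched, and no gradient
# computation is needed.
def move_to_data_step(W, v, class_idx, eps=1e-4):
    """W: (c, d) weight matrix of the last FC layer (one row per class),
    v: (d,) feature vector of the newly arrived sample,
    class_idx: ground-truth class index of that sample."""
    w = W[class_idx]
    v_unit = v / np.linalg.norm(v)          # feature scaling to unit norm
    w_new = (1.0 - eps) * w + eps * v_unit  # move towards the data
    # rescale so the weight norm is unchanged, avoiding a scaling bias
    W[class_idx] = w_new * (np.linalg.norm(w) / np.linalg.norm(w_new))
    return W
```

After the step, the cosine similarity between the true-class weight vector and the feature vector increases, so the activation of that class for similar features grows, while all other rows of W stay untouched.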
3 Experiments and Results
In this section, we describe the experiments we have conducted to evaluate the proposed Move-to-Data approach on an open dataset: CIFAR-10.
The CIFAR-10 dataset consists of 60000 RGB images of dimension 32x32x3. There are 10 mutually exclusive classes with 6000 images per class. The dataset is split into 50000 training images and 10000 test images, which we will call the "original training set" and the "original test set".
3.1 CNN configuration
We use a ResNet56v2 convolutional neural network. It is trained from scratch with a batch size of 32 for 160 epochs on a subset of the 50000-image original training set of CIFAR-10, see Section 3.2. The optimizer is Adam, with the learning rate decreasing from 0.001 to 0.0001 according to a non-linear decay function of the iteration number, parameterised by a decay constant. The loss function is the cross-entropy loss. Slight data augmentation (random translations and horizontal flips) is used during training.
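The decay function itself is not reproduced above; a plausible sketch, assuming a common inverse-time decay schedule (the function and parameter names are illustrative, not taken from the paper's code):

```python
# Inverse-time decay: a standard non-linear schedule that can bring the
# learning rate from 1e-3 down towards 1e-4 over training (assumed form).
def inverse_time_decay(lr0, decay, iteration):
    """Learning rate after `iteration` steps, starting from lr0."""
    return lr0 / (1.0 + decay * iteration)
```

For example, with lr0 = 0.001, the rate has dropped by a factor of 10 once decay * iteration reaches 9.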
3.2 Dataset usage
First, a model is trained on a subset of the original training set. This training is done on only 10% (i.e., 5000 images) of the original training set of 50000 images. The images are almost uniformly distributed between classes: the least populated class contains 460 images, that is 9.2% of the offline training subset, and the most populated class contains 520 images (10.4%). The purpose of this split was twofold: first, to have enough images (90%) remaining to evaluate incremental learning; second, to have a less precise model, so as to better distinguish changes in accuracy during incremental learning.
The remaining 90% (i.e., 45000 images) are split into N data chunks. In this work we used N = 10, thus having 4500 images in each data chunk. We consider that these chunks are received progressively, one after another. The models are progressively adapted on all images of each received data chunk. The accuracy metric is computed once a whole chunk of 4500 images has been used for model adaptation.
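The protocol above can be sketched as follows (`split_stream` is an illustrative helper, not the authors' code):

```python
# Split a training set into an offline pre-training part (10%) and
# N equal data chunks received sequentially (assumed helper, see lead-in).
def split_stream(images, n_offline=5000, n_chunks=10):
    offline = images[:n_offline]
    rest = images[n_offline:]
    chunk_size = len(rest) // n_chunks
    chunks = [rest[i * chunk_size:(i + 1) * chunk_size]
              for i in range(n_chunks)]
    return offline, chunks
```

With the 50000 CIFAR-10 training images, this yields 5000 offline images and 10 chunks of 4500 images each.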
We compare two methods for quick model adaptation to each newly received chunk. The comparison is done on the "original test set". The baseline method is plain fine-tuning applied only to the last FC layer using the new data chunk. This means that all parameters in the convolutional layers are frozen, as is done for some layers in transfer learning. The second method is our Move-to-Data method.
For a fair comparison, the fine-tuning is done with a batch size of 1. This is the same strategy as in Move-to-Data, where the weight vector is adjusted with each data vector that arrives and passes through the network. Both methods are implemented in Keras, with a TensorFlow backend, and are run on CPU.
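For intuition, the baseline's per-sample update can be sketched in plain numpy (this is not the Keras code used in the experiments: a softmax + cross-entropy gradient step on the last FC layer only, with the convolutional features v assumed frozen):

```python
import numpy as np

# One single-sample gradient step on the last FC layer (batch size 1),
# as in the fine-tuning baseline: conv layers are frozen, so only the
# classifier weights W and biases b change.
def finetune_step(W, b, v, y_onehot, lr=1e-3):
    logits = W @ v + b
    p = np.exp(logits - logits.max())
    p /= p.sum()                        # softmax probabilities
    grad_logits = p - y_onehot          # d(cross-entropy)/d(logits)
    W -= lr * np.outer(grad_logits, v)  # gradient step on the FC weights
    b -= lr * grad_logits
    return W, b
```

Unlike Move-to-Data, this step needs the gradient of the loss; with batch size 1, it is applied once per arriving sample.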
Clearly, a too strong "move", with $\varepsilon = 0.1$, yields a very strong model drift with catastrophic forgetting: the model does not generalise on the data. If the "move" is weak, the Move-to-Data method gives decent accuracies close to 0.78, with a slight increase over the chunks of arriving data.
Figure 2 shows the evolution of model accuracy over successive data chunks for both methods, Move-to-Data and classical fine-tuning; here the parameter $\varepsilon$ is fixed to 0.0001. The accuracies are very close, starting from an initial accuracy of 0.776. At the beginning, Move-to-Data is even slightly better, and at the end, at the 10th chunk, they are practically equal (0.762 for fine-tuning and 0.761 for Move-to-Data). This means that our method exhibits the same "catastrophic forgetting" as sequential fine-tuning by gradient descent, and this without heavy gradient descent computations.
Our method was implemented only on CPU. Hence, to compare its computational time to the fine-tuning baseline, we also performed the fine-tuning on CPU only. The results are illustrated in Figure 3. We also give fine-tuning times with GPU acceleration at batch size 1 (as for our baseline). Obviously, the CPU implementation cannot compete with GPU acceleration, but all conditions being equal (CPU), Move-to-Data largely surpasses fine-tuning in computational speed, being more than 4 times faster. The mean computation times over the data chunks are 1327.179 ± 136.908 for fine-tuning and 333.304 ± 47.499 for Move-to-Data, for the CPU implementation.
4 Conclusion and Perspectives
In this paper, we have proposed a new method, Move-to-Data, a continual learning approach for deep convolutional neural network classifiers. We presented and discussed the mathematical formulation of the approach and tested it on the publicly available dataset CIFAR-10. The method acts only on the last fully connected layer of the "classification" part of a CNN. It is generic and can be applied to other kinds of neural networks: MLPs and RNNs. The experiments show that it is more than 4 times faster than the baseline fine-tuning with gradient descent, while having the same catastrophic forgetting effect, as measured by comparing the accuracies attained by the two methods. The next step would be to extend the Move-to-Data method to the last two or three fully connected layers.
This paper introduces the method and presents the results of its experimental evaluation. A proof of convergence remains a perspective of this work, as well as its application to object tracking in video.
This work has been supported by the CNRS Interdisciplinary Grant RoBioVis and an ERASMUS+ internship grant, University of Bordeaux. We thank Master's student Eliot Ragueneau for his help in data preparation.
-  Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, pp. 436–444, 2015.
-  J. Gu, Z. Wang, J. Kuen, L. Ma, A. Shahroudy, B. Shuai, T. Liu, X. Wang, G. Wang, J. Cai, and T. Chen, “Recent advances in convolutional neural networks,” Pattern Recognition, vol. 77, pp. 354 – 377, 2018. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0031320317304120
-  E. Lughofer, “Extensions of vector quantization for incremental clustering,” Pattern Recognition, vol. 41, no. 3, pp. 995–1011, 2008. [Online]. Available: https://doi.org/10.1016/j.patcog.2007.07.019
-  X. Zhang, C. Furtlehner, and M. Sebag, “Distributed and incremental clustering based on weighted affinity propagation,” in STAIRS, ser. Frontiers in Artificial Intelligence and Applications, vol. 179. IOS Press, 2008, pp. 199–210.
-  B. Mansencal, J. Benois-Pineau, R. Vieux, and J. Domenger, “Search of objects of interest in videos,” in 10th International Workshop on Content-Based Multimedia Indexing, CBMI 2012, Annecy, France, June 27-29, 2012, 2012, pp. 1–6. [Online]. Available: https://doi.org/10.1109/CBMI.2012.6269809
-  K. Aderghal, A. Khvostikov, A. Krylov, J. Benois-Pineau, K. Afdel, and G. Catheline, “Classification of Alzheimer disease on imaging modalities with deep CNNs using cross-modal transfer learning,” in CBMS. IEEE Computer Society, 2018, pp. 345–350.
-  A. M. Obeso, J. Benois-Pineau, M. S. García-Vázquez, and A. A. Ramírez-Acosta, “Introduction of explicit visual saliency in training of deep cnns: Application to architectural styles classification,” in CBMI. IEEE, 2018, pp. 1–5.
-  F. M. Castro, M. J. Marín-Jiménez, N. Guil, C. Schmid, and K. Alahari, “End-to-end incremental learning,” in ECCV (12), ser. Lecture Notes in Computer Science, vol. 11216. Springer, 2018, pp. 241–257.
-  S. Gidaris and N. Komodakis, “Dynamic few-shot visual learning without forgetting,” in CVPR. IEEE Computer Society, 2018, pp. 4367–4375.
-  A. Dutt, D. Pellerin, and G. Quénot, “Improving hierarchical image classification with merged CNN architectures,” in CBMI. ACM, 2017, pp. 31:1–31:7.
-  H. Li, Y. Li, and F. Porikli, “Deeptrack: Learning discriminative feature representations by convolutional neural networks for visual tracking,” in Proceedings of the British Machine Vision Conference. BMVA Press, 2014.
-  A. Mallya and S. Lazebnik, “Piggyback: Adding multiple tasks to a single, fixed network by learning to mask,” CoRR, vol. abs/1801.06519, 2018.
-  G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, and S. Wermter, “Continual lifelong learning with neural networks: A review,” Neural Networks, vol. 113, pp. 54–71, 2019.
-  D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning internal representations by error propagation,” in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, 1986, pp. 318–362.
-  K. Shmelkov, C. Schmid, and K. Alahari, “Incremental learning of object detectors without catastrophic forgetting,” in ICCV. IEEE Computer Society, 2017, pp. 3420–3429.
-  I. González-Díaz, J. Benois-Pineau, J. Domenger, D. Cattaert, and A. de Rugy, “Perceptually-guided deep neural networks for ego-action prediction: Object grasping,” Pattern Recognition, vol. 88, pp. 223–235, 2019.
-  I. González-Díaz, J. Benois-Pineau, J.-P. Domenger, and A. de Rugy, “Perceptually-guided Understanding of Egocentric Video Content,” pp. 434–441, 2018.
-  A. Krizhevsky, “Learning Multiple Layers of Features from Tiny Images,” University of Toronto, Tech. Rep., 04 2009.
-  J. Hafner, H. S. Sawhney, W. Equitz, M. Flickner, and W. Niblack, “Efficient color histogram indexing for quadratic form distance functions,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 17, no. 7, pp. 729–736, 1995.
-  A. Zemmari and J. Benois-Pineau, Deep Learning in Mining of Visual Content, ser. Springer Briefs in Computer Science. Springer, 2020.
-  T. Salimans and D. P. Kingma, “Weight normalization: A simple reparameterization to accelerate training of deep neural networks,” CoRR, vol. abs/1602.07868, 2016. [Online]. Available: http://arxiv.org/abs/1602.07868
-  K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” in European Conference on Computer Vision. Springer, 2016, pp. 630–645.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
-  J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable are features in deep neural networks?” in Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Weinberger, Eds. Curran Associates, Inc., 2014, pp. 3320–3328.
-  F. Chollet, Deep Learning with Python. Manning Publications, 2017.