Continual Learning Using Task Conditional Neural Networks

05/08/2020 ∙ by Honglin Li, et al.

Conventional deep learning models have limited capacity for learning multiple tasks sequentially. Forgetting previously learned tasks in continual learning is known as catastrophic forgetting or interference. When the input data or the goal of learning changes, a continual model will learn and adapt to the new status. However, the model will not remember or recognise any revisits to the previous states. This causes performance reduction and re-training curves when dealing with periodic or irregularly reoccurring changes in the data or goals. The changes in goals or data are referred to as new tasks in a continual learning model. Most continual learning methods have a task-known setup in which the task identities are known in advance to the learning model. We propose Task Conditional Neural Networks (TCNN), which do not require the reoccurring tasks to be known in advance. We evaluate our model on standard datasets, MNIST and CIFAR10, and on a real-world dataset that we collected in a remote healthcare monitoring study (the TIHM dataset). The proposed model outperforms the state-of-the-art solutions in continual learning and adapting to new tasks that are not defined in advance.




1 Introduction

The human brain can adapt and learn new knowledge in response to changing environments. We can continually learn different tasks while retaining previously learned variations of the same or similar phenomena, and react differently in different contexts. Neurophysiology research has found that our neurons are task-independent. Under different contexts, the neurons fire selectively with respect to the stimulus. In contrast, most machine learning models cannot, in a scalable way, adapt to changing environments quickly and automatically using specific neurons corresponding to different tasks. As a consequence, these models tend to forget the previously learned task after learning a new one. This scenario is known as catastrophic forgetting or interference in machine learning.


The catastrophic interference problem is one of the inevitable hurdles to implementing a general artificial intelligence learning system without constructing a set of models, each dedicated to a specific task or to different variations and situations in the data [21]. If a model is unable to learn several tasks sequentially, it must be trained with all the possible scenarios in advance. This requirement is intractable in practice and is at odds with the lifelong learning goal of continual learning models [38].

Continual machine learning algorithms change over time and adapt their parameters to changes in the data or the learning goal. We refer to the learning goal, or a specific part of the data with a learning goal, as a task. Learning models are often not equipped to adapt quickly to situations they have seen before if their parameters have changed significantly over time through continual learning. We use an example to illustrate the forgetting problem. We train a neural network on two different tasks sequentially. After each stage, the model is represented by the parameters $\theta_0$, $\theta_1$ and $\theta_2$ respectively, where $\theta_0$ denotes the randomly initialised weights, $\theta_1$ the weights after learning task 1, and $\theta_2$ the weights after learning task 2. We use linear path analysis [12] to visualise the loss surface, evaluating the loss at parameters interpolated between these three points.

Fig. 1: Loss surface for the first task with respect to the parameters with different distributions. $\theta_0$ represents the initial parameters; $\theta_1$ and $\theta_2$ are the optimal solutions for the first and second task respectively.

As shown in Figure 1, while the model learns the first task, the parameters change from $\theta_0$ to $\theta_1$, and the loss for the first task becomes smaller. When the model learns the second task, the parameters change from $\theta_1$ to $\theta_2$, and the loss for the first task increases significantly.
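The effect in Figure 1 can be mimicked on a toy problem. The following is a minimal sketch, not the experiment reported here: a quadratic function stands in for the task-1 loss, and the points `theta0`, `theta1`, `theta2` as well as the helpers `task1_loss` and `interpolate` are hypothetical names and values chosen purely for illustration. It evaluates the task-1 loss along the two straight-line paths between the three parameter points.

```python
def task1_loss(theta):
    # Toy quadratic loss for task 1 with its optimum at (1, 0).
    # ASSUMPTION: illustrative stand-in, not the network loss from the paper.
    return (theta[0] - 1.0) ** 2 + theta[1] ** 2


def interpolate(a, b, alpha):
    # Point on the straight line from a (alpha = 0) to b (alpha = 1).
    return [x + alpha * (y - x) for x, y in zip(a, b)]


theta0 = [0.0, 0.0]  # stand-in for the random initialisation
theta1 = [1.0, 0.0]  # stand-in for the weights after task 1
theta2 = [0.0, 2.0]  # stand-in for the weights after task 2

steps = [i / 4 for i in range(5)]

# Task-1 loss while moving from the initialisation to the task-1 solution.
path_01 = [task1_loss(interpolate(theta0, theta1, a)) for a in steps]
# Task-1 loss while moving from the task-1 solution to the task-2 solution.
path_12 = [task1_loss(interpolate(theta1, theta2, a)) for a in steps]

print(path_01)
print(path_12)
```

In this toy setting the task-1 loss falls along the first path and rises along the second, which is the qualitative behaviour described for Figure 1.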

A real-world example of this problem is a challenge that we have faced in our remote healthcare monitoring study [3]. We have developed a digital platform and a set of machine learning algorithms to perform risk analysis and provide alerts for early interventions in a use-case scenario to support people affected by dementia. In our user group with the in-home monitoring scenario, the distribution of data is periodic (due to seasonal and environmental effects). In some cases, the data and conditions change sporadically and repeat due to variations in participants’ health conditions. When we use continual and adaptive learning to update the models according to these changes, we face the problem of models not preserving the earlier learned tasks when they reoccur. There are two potential solutions: maintaining multiple models to respond to different situations, or developing models that inherently adapt to the changes and preserve the previously learned tasks as well. Maintaining several models for changing tasks also raises the challenge of detecting when a change has occurred and identifying whether the same or a similar task has been observed before.

A variety of continual learning methods have been proposed to solve the problems mentioned above. Memory-based approaches replay the trained samples to solve the forgetting problem while learning a new task [35]. Regularisation methods reduce the representational overlap of different tasks [20, 18, 41]. Dynamic network approaches assign extra neuron resources to new tasks [40]. A more detailed description of these approaches is provided in Section 3. However, the solutions mentioned above assume the models are aware of the task changes in advance, or this information is given manually to the model throughout the learning.

These models cannot detect the changes [18, 20] or utilise the task-specific neurons without knowing the emerging task in advance [42, 29, 26, 34]. Unfortunately, the task change is rarely known [1] in advance. An ideal incremental learning model must meet the following criteria: 1) The model can learn different tasks sequentially without forgetting the previous ones; 2) The model can obtain the task information and then give different responses to different tasks if required.

We propose a Task Conditional Neural Network (TCNN), which is a fully automated learning model. TCNN has two main advantages in comparison with the existing solutions: it can detect changes in the data and goals by learning new tasks without forgetting the previous ones; and it can give different responses to different tasks without being informed in advance.

The proposed model overcomes the interference between different tasks by inferring which task the model is encountering at any given time. TCNN leverages Probabilistic Neural Networks (PNN) [36] to construct a trainable measure to distinguish different tasks. PNN is a non-parametric model that suffers from significant error build-up when the number of nodes increases. Conventional neural networks, in turn, are parametric models that suffer from over-confidence in their predictions when dealing with imbalanced and dynamic data. In contrast to these models, TCNN estimates the probability density to find a general representation of the training samples by maximising the task likelihood. In this way, TCNN obtains the task information instead of being informed of the tasks in advance. The schematic description of the TCNN process is shown in Figure 2. During the training step, TCNN optimises the task likelihood measure by maximising the conditional probability of samples given the task identity. During test and run-time, TCNN computes the task likelihood and selects the corresponding neurons to provide a prediction, or it assigns new neuron resources to a new task if the existing resources are not capable of providing a suitable response. TCNN maximises the task likelihood by leveraging a probabilistic layer, shown in Figure 3.


We further combine the replay mechanism with TCNN as Replay Task Conditional Neural Networks (RTCNN) to incrementally learn which samples are outside the learned distributions. Generally speaking, while learning task $k$, the task-specific neurons of each task $i$, where $i \neq k$, minimise the task likelihood of samples from task $k$. By decreasing the impact of the classifier, RTCNN can incrementally learn the out-task samples without decreasing the performance of the $i$-th task-specific neurons on task $i$.

Our main contributions in this paper include: i) proposing a novel method to address the catastrophic interference problem; ii) demonstrating how we can incrementally learn the samples from new tasks without decreasing the performance; iii) proposing a model to detect and learn the new tasks automatically. The rest of the paper is organised as follows. Section 2 empirically investigates the cause of the catastrophic forgetting problem in machine learning. Section 3 discusses the related work. Section 4 describes the proposed Task Conditional Neural Network (TCNN) model. Section 5 presents the experiments and discusses the evaluation results. Section 6 provides an ablation study and Section 7 concludes the paper.


Fig. 2: Schematic diagram of TCNN. TCNN contains task-specific neurons to respond to different tasks. The task layer contains a fully-connected network for classification and a probabilistic layer for measuring the task likelihood. The final output is based on the task likelihood and the classification result.

Fig. 3: The probabilistic layer in TCNN. Its input comes from the task-specific neurons of a given task, and each task has its own set of parameters in the probabilistic layer.

2 The Causes of Forgetting Problem in Continual Learning

Kortge et al. [19] state that the interference problem in neural networks is caused by the back-propagation rule. This idea has been studied by several groups, including Kirkpatrick et al. and Lee et al. [18, 20]. They argue that interference occurs because the parameter space adapts rapidly to the new task and thereby compromises the previous task, which the model then handles with lower accuracy. While learning two tasks sequentially, the model pays attention to the current task. The parameters therefore change significantly after learning the new task, and if the model is given the earlier task again, it will not respond well until it re-learns it. French et al. [7, 8, 9] argue that the problem is caused by the overlap in the internal representations of different tasks. This idea is also investigated by Goodfellow et al. [11], who show that dropout [37] can mitigate catastrophic interference. Similarly, Masse et al. [26] demonstrate that deactivating a portion of the neurons before training on new tasks can address catastrophic interference.

Based on the existing studies reported in [42, 26, 29, 34], the availability of task information is one of the main factors behind catastrophic interference. We investigate different scenarios of informing the model about the changing task information:

S1: The task is unknown to the model at all times.

S2: The model does not need to be informed of the task information at the testing stage [20, 18], but it cannot detect the changes automatically at the training stage.

S3: The task information is known at both the training and testing stages [42, 29, 34, 26]. The model needs to be told which neurons should be activated during the testing stage.

S4: The model knows what task it is about to perform before the training starts and knows the task changes during test and run-time [13].

Figure 4 demonstrates how the task information affects the results. The scenarios S1 and S2 are shown as the baseline in Figure 4. In scenario S3, there are many different ways to inform the model about the task identities. Here we use the context signal [28, 26] and the multi-head approach [42]. The context signal adds the task identity to the samples in the input layer. The multi-head approach masks the output layer so that the model only responds to the current task. Scenario S4 is named warmup in Figure 4. Warmup allows the model to preserve a small set of samples drawn from the tasks to be learned. Overall, the positive effect of the task information shown in Figure 4 increases in the following order: i) Baseline (no task information), ii) Context signal, iii) Multi-head, iv) Context + warmup, v) Multi-head + warmup.

Since the multi-head approach tells the model in advance which task is about to be performed, the model can determine approximate parameters even if it has not seen the task before. This is why the multi-head approach has a higher overall accuracy after learning the first task.
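The two ways of supplying task information in scenario S3 can be sketched in a few lines. This is an illustrative sketch only: the function names `add_context_signal` and `multi_head_predict`, the one-hot encoding of the task identity and the assumption that every task owns a contiguous block of output units are choices made here, not details taken from the cited methods.

```python
def add_context_signal(sample, task_id, num_tasks):
    # Context signal: append the task identity (here as a one-hot vector,
    # an assumed encoding) to the input features.
    one_hot = [1.0 if t == task_id else 0.0 for t in range(num_tasks)]
    return list(sample) + one_hot


def multi_head_predict(logits, task_id, classes_per_task):
    # Multi-head: mask the output layer so that only the output units of
    # the current task can be selected.
    start = task_id * classes_per_task
    head = range(start, start + classes_per_task)
    return max(head, key=lambda c: logits[c])


sample = [0.5, 0.2]
print(add_context_signal(sample, task_id=1, num_tasks=3))

logits = [0.1, 0.2, 3.0, 0.4, 0.9, 0.5]
# Without the mask, class 2 has the largest logit; with the mask for
# task 2 only classes 4 and 5 are considered.
print(multi_head_predict(logits, task_id=2, classes_per_task=2))
```

The masked prediction shows why the multi-head setup can produce a plausible label even for a task the model has barely learned: the set of candidate classes is already narrowed by the task identity.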

Fig. 4: Test accuracy with different task information

As shown in Figure 4, a model can address the catastrophic interference problem by using advance information about the changes. The multi-head and warmup approaches explicitly provide all the task information to the model, and this allows them to learn new tasks without forgetting the previous ones. Overall, the more advance information is provided about the task that a model is about to encounter, the more effective the model will be in adapting to the new goal or data.

Based on the above example, one can see that the task information is important for addressing the forgetting problem in continual learning. Informing the model of the task identity can be regarded as maximising the likelihood $P(x \mid T)$, where $x$ denotes the samples and $T$ the task information. However, to the best of our knowledge, few studies propose a model that can infer this information without being told in advance. Li et al. [22] leverage uncertainty to obtain the task information at the prediction stage. However, their method cannot detect the changes automatically. Farquhar et al. [6] suggest that mutual information can be used to identify the changes. However, calculating mutual information can become intractable in large-scale scenarios [22]. The closest work to S1 that is able to detect changes in the tasks is by Aljundi et al. [1]. However, their work detects the changes based on plateaus in the loss surface. In other words, they assume that the model keeps learning new tasks continuously without identifying a set of specific and reoccurring tasks. This assumption increases the complexity of the model and hinders the scalability and applicability of the approach in online and real-world learning scenarios. In our work, we develop models that can identify tasks and preserve the learned parameters for distinct tasks. By doing this, the model can quickly respond to a new task based on the previously learned information.

3 Related work

There are different approaches to address the forgetting problem in continual learning. Parisi et al. [30] categorise these approaches into three groups: Regularisation, Memory Replay and Dynamic Network approaches.

The regularisation approaches find the overlap of the parameter space between different tasks. One of the popular algorithms in this group is Elastic Weight Consolidation (EWC) [18]. EWC avoids significantly changing the parameters that are important to a learned task. It assumes the weights have Gaussian distributions and approximates the posterior distribution of the weights by the Laplace approximation. A similar idea is used in Incremental Moment Matching (IMM) [20]. IMM finds the overlap of the parameter distributions by smoothing the loss surface of the tasks. Zeng et al. [41] address the forgetting problem by allowing the weights to change within the same subspace of the previously learned task. Li et al. [24] address the problem by using knowledge distillation [14]. They constrain the predictions for the previously learned tasks to remain similar while the new task is learned [30]. However, these models require advance knowledge of the training tasks and the task changes.
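For illustration, the EWC idea of protecting important parameters is commonly written as a quadratic penalty weighted by a diagonal Fisher information estimate. The sketch below assumes that standard form; `ewc_penalty`, `fisher` and `lam` are illustrative names, and the numbers are made up.

```python
def ewc_penalty(theta, theta_star, fisher, lam):
    # Quadratic penalty that discourages moving parameters away from the
    # values learned on the previous task (theta_star), weighted by how
    # important each parameter was (diagonal Fisher estimate).
    return 0.5 * lam * sum(
        f * (t - s) ** 2 for t, s, f in zip(theta, theta_star, fisher)
    )


theta_star = [0.0, 2.5]   # parameters after the previous task
theta = [1.0, 2.0]        # candidate parameters while learning a new task
fisher = [4.0, 0.1]       # the first parameter is far more important

# Moving the important parameter by 1.0 costs much more than moving the
# unimportant one by 0.5.
print(ewc_penalty(theta, theta_star, fisher, lam=2.0))
```

This penalty is added to the loss of the new task, which is why the method needs to know when one task ends and the next begins.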

Memory Replay methods mainly focus on interleaving the trained samples with the new tasks. A pseudo-rehearsal mechanism [33] is proposed to reduce the memory requirement for storing the training samples for each task. In pseudo-rehearsal, instead of explicitly storing all the training samples, the training samples of previously learned tasks are drawn from a probabilistic distribution model. Shin et al. [35] propose an architecture consisting of a deep generative model and a task solver. Similarly, Kamra et al. [16] use a variational autoencoder to generate the previously trained samples. However, this group of models is complex to train, and in real-world cases, the sampling methods do not offer an efficient solution for sporadic and rare events. These models also often require advance knowledge of the change occurring.

Dynamic Networks allocate new neurons to new tasks. Yoon et al. [40] propose Dynamic Expandable Networks (DEN) to learn new tasks with new parameters continuously. Similarly, Serra et al. [34] also allocate new parameters to learn new tasks. However, this group of models requires the task information to be given to the model explicitly. In other words, the model knows in advance which neurons should be activated to perform each test task.

4 Task Conditional Neural Network

The core idea in TCNN is to activate or deactivate the neurons based on the task being processed by the model. Masse et al. [26] and Serra et al. [34] use a similar idea in their work. However, they deactivate the neurons based on knowing the task identities in advance. In other words, their models manually activate or deactivate the neurons corresponding to a task. In TCNN, we select the neurons corresponding to a task by learning and preserving the parameters of earlier learned tasks at the training stage and by observing and identifying the task identity at the run-time and processing stage. A probabilistic estimation allows TCNN to determine the task identity by evaluating the model state at any given time of the training and run-time process. Determining the task information enables TCNN to learn several new tasks automatically. TCNN measures the (un)certainty of the neural network in processing a task and then uses the previously trained set of neurons and the parameters associated with that task. If the task is a brand new one, it allocates new resources and parameters in combination with the existing ones. Different from the previous uncertainty measure methods [31, 10], TCNN measures the confidence of the neural network without ensembling several neural networks. In other words, the complexity of producing the confidence in TCNN is relatively small. Furthermore, the task likelihood in TCNN is trainable in a tractable manner.

4.1 Training Stage

At the training stage, TCNN learns how to process a task. It also obtains the task information by processing the input data. While learning a new task, we maximise the task likelihood $P(x \mid T_k)$, where $x$ represents the training samples and $T_k$ represents the hypothesis that task $k$ is the current one.

TCNN uses a probabilistic layer to estimate the probability density of the input data, as defined in Equation (1). In this function, the training sample is first mapped by the previous layers, which act as a joint function of their parameters, and the result is compared with a kernel in the probabilistic layer. A hyper-parameter can be regarded as the radius of the kernel.

Unlike in conventional fully connected networks, the trainable parameters of the probabilistic layer form a set of kernels. Given the kernels in the probabilistic layer, we measure the confidence by Equation (2), which is inspired by Wedding et al. [39]. Equation (2) gives the conditional binary probability of a task given a set of samples and the network parameters, which include the parameters of the task-specific neurons. In the rest of this paper, we use P to represent the probability of the task being the current task that is processed by the model.

The form of the probabilistic layer is also different from the conventional fully connected layers. It is also a parametric layer which makes it different from the layers in PNN as well. Furthermore, combining with the other layers in the network, this layer can estimate the density of the training samples in high dimension by using a limited number of parameters.

The range of Equation (2) is from 0 to 1. It represents the probability that the samples belong to a certain task. Furthermore, the value of P is significantly affected by the maximum value among the kernel outputs. Because a single task may contain several different classes with different distributions, we would like to find a general representation for all of them. TCNN maximises Equation (3) while training and learning for various tasks by combining the density estimation with a conventional classification method.


The first term is the likelihood as defined in a classification task. The second term is the task likelihood, weighted by a hyper-parameter. Maximising the second term in (3) is equivalent to a minimisation problem, which is easier to implement.
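The training objective described above can be made concrete with a small sketch. The exact forms of Equations (1) to (3) are not reproduced here, so everything below is an assumption made for illustration: a Gaussian kernel response with radius `sigma`, a task likelihood taken as the largest kernel response, and an objective that adds the log-likelihood of the true class to the log task likelihood weighted by `lam`. The names `kernel_response`, `task_likelihood` and `objective` are hypothetical.

```python
import math


def kernel_response(h, kernel, sigma):
    # ASSUMPTION: Gaussian kernel on the hidden representation h, with
    # sigma acting as the radius of the kernel.
    sq_dist = sum((a - b) ** 2 for a, b in zip(h, kernel))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def task_likelihood(h, kernels, sigma):
    # ASSUMPTION: the confidence is driven by the best-matching kernel.
    # The value always lies in the range (0, 1].
    return max(kernel_response(h, k, sigma) for k in kernels)


def objective(class_prob_true, h, kernels, sigma, lam):
    # Quantity to maximise: classification log-likelihood plus the
    # weighted log task likelihood (assumed two-term form).
    return math.log(class_prob_true) + lam * math.log(
        task_likelihood(h, kernels, sigma)
    )


kernels = [[0.0, 0.0], [1.0, 1.0]]
print(task_likelihood([1.0, 1.0], kernels, sigma=0.5))   # on a kernel
print(task_likelihood([5.0, 5.0], kernels, sigma=0.5))   # far from all kernels
print(objective(0.9, [0.9, 1.1], kernels, sigma=0.5, lam=2.0))
```

Under the assumed Gaussian form, the log of a kernel response is a negative scaled squared distance, so maximising it amounts to minimising a squared distance between the hidden representation and the kernel.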

4.2 Prediction Stage

During the test and run-time, TCNN uses task likelihood to decide which neurons should be chosen for the current task or detect whether a change has occurred.

After the model has converged on one task, TCNN determines the confidence interval for that task by computing the mean and variance of Equation (2) for a set of training samples. We assume the task likelihood follows a truncated normal distribution within the range from 0 to 1. We define the acceptance area as Equation (4), where CF is the confidence factor that defines the bounds of the interval, i.e. the acceptance threshold.


Assuming TCNN has learned several tasks, we first calculate the maximum task likelihood of the associated samples to decide which set of neurons should provide the prediction. We then multiply the task likelihood by the classification result. Let us assume that the task-specific neurons of one task produce the maximum likelihood for the given samples. The final prediction is given by Equation (5).


In Equation (5), the classification function is the one associated with the selected task-specific neurons, and TLF is a gate function called the Task Likelihood Filter, calculated by Equation (6).


In Equation (6), the threshold is set by the confidence interval. Since we train the neural network with the mini-batch approach [23], it is more appropriate to calculate the expectation of the task likelihood during the prediction stage.

In summary, if the task likelihood of the test samples falls within at least one confidence interval of the task-specific neurons, TCNN will provide a response based on the current model; otherwise, it will raise a change alert to indicate that the model is dealing with a new task.
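The prediction-stage decision summarised above can be sketched as follows. This is a hedged illustration: the interval is assumed to be the mean plus or minus CF standard deviations, clipped to the range from 0 to 1, which may differ from the exact form of Equation (4), and `acceptance_interval` and `select_task` are hypothetical helper names.

```python
import statistics


def acceptance_interval(train_likelihoods, cf):
    # ASSUMPTION: mean +/- cf * standard deviation, clipped to [0, 1].
    mu = statistics.mean(train_likelihoods)
    sd = statistics.pstdev(train_likelihoods)
    return max(0.0, mu - cf * sd), min(1.0, mu + cf * sd)


def select_task(block_likelihoods, intervals):
    # block_likelihoods[k] holds the task likelihoods that the neurons of
    # task k assign to the samples of one test block. The expectation over
    # the block is used, as suggested for mini-batch trained models.
    means = [statistics.mean(values) for values in block_likelihoods]
    accepted = [
        k for k, m in enumerate(means)
        if intervals[k][0] <= m <= intervals[k][1]
    ]
    if not accepted:
        return None  # no interval accepts the block: raise a change alert
    return max(accepted, key=lambda k: means[k])


intervals = [
    acceptance_interval([0.9, 0.8, 1.0], cf=2.0),
    acceptance_interval([0.7, 0.9, 0.8], cf=2.0),
]
print(select_task([[0.95, 0.90], [0.10, 0.20]], intervals))  # a known task
print(select_task([[0.20, 0.10], [0.30, 0.20]], intervals))  # change alert
```

Returning `None` stands in for the change alert; a full system would then allocate new task-specific neurons.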

During the prediction stage, there could be some outliers that belong to the learned task. We do not want the model to be susceptible to outliers. We therefore introduce a process to detect the changes, shown in Equation (7). Over a given time range, if the condition in Equation (7), which involves a pre-defined threshold, is met, a change is detected; otherwise, the model provides a response based on the current structure and parameters.
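One simple way to make change detection robust to isolated outliers is to require several rejections within a sliding time window before declaring a change. The sketch below follows that idea; the counting rule, the class name `ChangeDetector` and its parameters `window` and `threshold` are assumptions made here and are not the exact rule of Equation (7).

```python
from collections import deque


class ChangeDetector:
    def __init__(self, window, threshold):
        # Keep only the most recent `window` accept/reject decisions.
        self.recent = deque(maxlen=window)
        self.threshold = threshold

    def update(self, rejected):
        # `rejected` is True when no acceptance interval contained the
        # task likelihood of the current block.
        self.recent.append(1 if rejected else 0)
        window_full = len(self.recent) == self.recent.maxlen
        return window_full and sum(self.recent) >= self.threshold


detector = ChangeDetector(window=3, threshold=3)
stream = [True, False, True, True, True]
alerts = [detector.update(r) for r in stream]
print(alerts)
```

With these settings a single rejected block, such as the first one in the stream, does not trigger an alert; only three consecutive rejections do.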


4.3 Replay Task Conditional Neural Network

Because we use mini-batches [23] to find the general representation of the training samples, TCNN may not be able to produce the confidence properly when there is only a single test sample. One possible solution is to associate several samples with a task and to calculate the expectation of the task likelihood. The performance of this solution is analysed in Sections 5 and 6. Another solution is to increase the margin of the confidence between the in-task samples and out-task samples.

To incrementally learn the samples outside the training distribution, we combine the replay mechanism with TCNN and introduce the Replay Task Conditional Neural Network (RTCNN). RTCNN allows the task-specific neurons to distinguish the in-task and out-task samples efficiently. After learning a task, we store a subset of samples associated with that task. We then re-train the task-specific neurons with all the stored exemplars. For the task-specific neurons of a given task, if the exemplar belongs to that task, the loss function is as in Equation (3); otherwise, we minimise the task likelihood of the exemplar instead. We aim to decrease the task likelihood of the out-task samples, but we do not want the model to change so much that it forgets the learned information. A hyper-parameter is set to a sufficiently large number during the replay process to control the changes in the model. In other words, instead of learning how to process new tasks, we minimise the task likelihood of samples which are from other tasks.
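The replay rule can be summarised as a per-exemplar loss that switches on whether the exemplar belongs to the task of the neurons being re-trained. The following is a rough sketch under stated assumptions: the in-task branch uses an assumed log-likelihood form of the two-term objective, the out-task branch simply penalises the task likelihood, and `replay_loss` and `lam` are hypothetical names.

```python
import math


def replay_loss(neuron_task, exemplar_task, class_prob, likelihood, lam):
    # class_prob: probability assigned to the true class of the exemplar.
    # likelihood: task likelihood the neurons of `neuron_task` assign to it.
    if exemplar_task == neuron_task:
        # In-task exemplar: negative of the assumed training objective
        # (classification term plus weighted task-likelihood term).
        return -(math.log(class_prob) + lam * math.log(likelihood))
    # Out-task exemplar: only push the task likelihood down, leaving the
    # classifier untouched. A large lam emphasises this term.
    return lam * likelihood


print(replay_loss(0, 0, class_prob=0.9, likelihood=0.8, lam=5.0))
print(replay_loss(0, 1, class_prob=0.9, likelihood=0.2, lam=5.0))
```

Because the out-task branch ignores the classification term, replay in this sketch widens the likelihood margin without asking the neurons to classify samples from other tasks.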

5 Experiments and Evaluations

We test our model on the Modified National Institute of Standards and Technology (MNIST) handwritten digits dataset and on the Canadian Institute For Advanced Research (CIFAR) 10 image dataset. For a real-world scenario, and to address some of the challenges in our healthcare monitoring research, we use the Technology Integrated Health Management (TIHM) dataset [5]. The TIHM dataset consists of several sensor data types collected using in-home monitoring technologies from over 100 homes continuously for six months. The data includes environmental sensory data such as movement, home appliance use and door open/close events, and physiological data such as body temperature, blood pressure and sleep. The data was fed to a set of analytical algorithms to detect conditions such as hypertension, Urinary Tract Infections and changes in daily activities [4]. A clinical monitoring team used the results of the algorithms on a digital platform that we developed in our previous work [3] and, in some cases, verified the results or labelled the false positives.

One of the key limitations of our previous work in TIHM was that the algorithms were trained offline and did not learn continually. Another limitation was that, when using conventional adaptive models, the algorithms changed over time as the environmental or health conditions changed due to seasonal or short-term effects. When an earlier learned status was re-observed, the algorithms were not able to perform efficiently because of significant parameter changes. To demonstrate the effectiveness of the proposed continual learning model in addressing a real-world problem, we evaluate TCNN on the TIHM dataset and show that it can address the challenges mentioned above.

We compare our model with several state-of-the-art approaches in the different scenarios discussed in Section 2. The proposed methods operate in S1, and we compare them with methods in: i) S2: Incremental Moment Matching (IMM) [20], Orthogonal Weight Modification (OWM) [41], Model Adaptation (MA) [15] and Gradient Episodic Memory (GEM) [25]; ii) S3: Variational Continual Learning (VCL) [29] and Synaptic Intelligence (SI) [42].

For the existing methods, we follow the original settings, as stated in the above-mentioned papers. Since our model has to detect the changes, there will be True Detections (TD) and False Detections (FD). After our model has learned a number of tasks, we feed batches of test samples from the learned tasks and from the next, unseen task to the model and use Equation (7) to detect the changes. If the model detects a change on samples from a previously learned task, the FD increases. If the model detects a change on samples from the new task, the TD increases. We also use the True Detection Rate (TDR) and False Detection Rate (FDR) to measure the sensitivity of the model. TD, FD, TDR and FDR are calculated by Equation (8) from the time span of the old or new task, the number of changes detected and the time interval in Equation (7).


In our experiments, we set the time interval in Equation (7) to 3. For each task, we run the model for 200 time-slots to detect the changes. We assume the data from different tasks arrives in sequence. During the test, the data comes in blocks which contain several samples drawn from the same task. The intuition behind this is that the data arriving in a specific time-span may come from a similar distribution. For example, in the healthcare monitoring scenario, a condition that affects the activity data may last for several days. Consequently, the data from these days will have a similar distribution within the time-span of the short-term condition. In the rest of this paper, the block size represents the number of samples in a block associated with a specific task. While detecting the changes, we set the block size to 10. While testing the model, we report the test accuracy with different block sizes. In RTCNN, we store 2000 samples related to each task in the MNIST and CIFAR10 experiments, and 45 samples related to each task in the TIHM experiment.

Our model has to infer the task information without being informed during the test. If the model fires the right set of neurons, the test accuracy is the same as that of a single model trained on the task; if the model activates a wrong set of neurons, it fails to infer the correct task information and the test accuracy is set to 0. This process is different from the multi-head approach discussed earlier, which may guess the right label even if the model has not come across the current task before. In the following experiments, decision accuracy represents the rate of choosing the correct set of neurons associated with a task. For the proposed methods TCNN and RTCNN, the number in parentheses represents the block size.
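The scoring rule described above can be written down directly. The helper below is an illustrative sketch with a hypothetical name, `evaluate_blocks`, and made-up numbers: a block counts towards the test accuracy only when the correct set of task-specific neurons was chosen, and the decision accuracy is the fraction of blocks for which that choice was correct.

```python
def evaluate_blocks(true_tasks, chosen_tasks, block_accuracies):
    # true_tasks[i]:       task the i-th block was drawn from
    # chosen_tasks[i]:     task whose neurons the model selected
    # block_accuracies[i]: accuracy of the correct neurons on that block
    correct = [t == c for t, c in zip(true_tasks, chosen_tasks)]
    decision_accuracy = sum(correct) / len(correct)
    # A wrongly routed block contributes an accuracy of 0.
    test_accuracy = sum(
        acc if ok else 0.0 for acc, ok in zip(block_accuracies, correct)
    ) / len(correct)
    return test_accuracy, decision_accuracy


test_acc, decision_acc = evaluate_blocks(
    true_tasks=[0, 1, 2, 3],
    chosen_tasks=[0, 1, 1, 3],          # the third block is misrouted
    block_accuracies=[1.0, 0.9, 0.8, 1.0],
)
print(test_acc, decision_acc)
```

Under this rule the test accuracy can never exceed the decision accuracy, which matches the pattern in the reported tables.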

The task likelihood of the experiments can be found in the supplemental document.

5.1 Split MNIST Experiment

The first experiment is split MNIST, which is a benchmark experiment in the continual learning field [42, 20, 26]. We split MNIST into 5 different tasks of consecutive digits. The basic architecture of our model is a multi-layer perceptron with three hidden layers containing 1000, 1000, and 2560 units, respectively. The last hidden layer of the model is connected to a probabilistic layer containing two vectors. The weight factor in Equation (3) is set to 2. The confidence factor in Equation (4) is set to 4.
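Splitting a ten-class dataset into five tasks of consecutive digits can be sketched as below. The grouping into pairs (0 and 1, 2 and 3, and so on) is the usual split MNIST convention and is assumed here; `split_by_label_pairs` is a hypothetical helper that works on a plain list of labels rather than on the real dataset.

```python
def split_by_label_pairs(labels, classes_per_task=2):
    # Map every sample index to a task according to its label:
    # labels 0-1 -> task 0, labels 2-3 -> task 1, ..., labels 8-9 -> task 4.
    tasks = {}
    for index, label in enumerate(labels):
        tasks.setdefault(label // classes_per_task, []).append(index)
    return tasks


labels = [0, 1, 2, 3, 9, 8, 4]
for task_id, indices in sorted(split_by_label_pairs(labels).items()):
    print(task_id, indices)
```

Each entry of the returned dictionary lists the sample indices of one task, which can then be presented to the model one task at a time.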

The test accuracy for this experiment is shown in Table I. When the block size is 1, TCNN cannot distinguish the task information efficiently. However, when the block size is increased to 10, TCNN can infer the task information correctly. In RTCNN, we can obtain the task information efficiently without increasing the block size. The comparison of the test accuracy and decision accuracy is shown in Table II. The results show that the replay process improves decision accuracy without affecting the test accuracy.

Method      Test Accuracy (%)
Baseline    20.00
OWM         93.55
IMM         68.32
GEM         92.20
SI          98.9
VCL         98.4
TCNN(1)     68.36
TCNN(10)    98.17
RTCNN(1)    96.10
TABLE I: Test accuracy of split MNIST experiment. Methods denoted by '*' represent the memory-based approach

Method      Test Accuracy (%)    Decision Accuracy (%)
TCNN(1)     68.36                68.37
TCNN(10)    98.17                98.35
RTCNN(1)    96.10                96.30
TABLE II: Test accuracy and decision accuracy of split MNIST experiment

The expectation of the task likelihood over blocks of 10 samples, as produced by each set of task-specific neurons, is shown in Figure 5. In TCNN, the task-specific neurons have a higher task likelihood when the samples come from the corresponding task. However, the margin of the task likelihood between different task-specific neurons is not significant. In RTCNN, the task-specific neurons have a much larger task likelihood for their corresponding task than the other sets of neurons do.

(a) First Task-Specific Neurons
(b) Second Task-Specific Neurons
(c) Third Task-Specific Neurons
(d) Fourth Task-Specific Neurons
(e) Fifth Task-Specific Neurons
(f) TCNN Task Likelihood
Fig. 5: Task likelihood of each set of task-specific neurons for each task in TCNN and RTCNN. The last panel is a zoomed view of the task likelihood in TCNN. Each set of task-specific neurons has a higher task likelihood for its corresponding task, e.g. the first task-specific neurons have the highest task likelihood for the samples from task 1. Compared with TCNN, the margin of the task likelihood between in-task and out-task samples in RTCNN is significant.

The detection rates are shown in Table III. The first column represents how many tasks have been learned. Overall, the true detection rates (TD and TDR) are higher than the false detection rates (FD and FDR). Furthermore, if we apply the detection process, the false detection (FD) alert rate is relatively small, while the true detection (TD) alert remains sensitive to the changes.

1 Task 1.0 1.0 0.0121 0.1557
2 Task 0.9220 0.9575 0.0169 0.1639
3 Task 0.9371 0.9665 0.00273 0.2030
4 Task 0.8462 0.9148 0.0434 0.2380
TABLE III: Detection rate in split MNIST experiment

5.2 Split CIFAR10 Experiment

In the second experiment, we test our model on a more complex dataset. We split CIFAR10 into 5 tasks and compare the performance of TCNN with the state-of-the-art methods. The base model is a convolutional neural network containing two convolutional layers and three fully-connected layers. To reduce the number of training parameters, we apply a transfer learning technique [32]: all the task-specific neurons share the convolutional layers, which are trained with the first task. The results are shown in Tables IV and V.

Method Test Accuracy(%)
Baseline 20.0
OWM 52.83
IMM 32.36
MA 40.47
SI 94.96
TCNN (1) 32.60
TCNN (10) 57.17
RTCNN(1) 60.10
TABLE IV: Test accuracy of split CIFAR10 experiment. Methods denoted by ’*’ represent the memory-based approach

Method Test Accuracy(%) Decision Accuracy(%)
TCNN(1) 32.60 33.50
TCNN(10) 57.17 60.20
RTCNN(1) 60.10 63.50
TABLE V: Test accuracy and decision accuracy of split CIFAR10 experiment

The detection information is shown in Table VI. The first column represents how many tasks have been learned.

1 Task 0.9340 0.9600 0.2879 0.5250
2 Task 0.7879 0.8700 0.3910 0.6300
3 Task 0.8182 0.8950 0.4000 0.6183
4 Task 0.5455 0.7650 0.3196 0.5700
TABLE VI: Detection rate in split CIFAR10 experiment

The task likelihood produced by each set of task-specific neurons for the test samples from each task is shown in the supplemental document. Overall, on a complex dataset, TCNN outperforms the state-of-the-art methods.

5.3 Healthcare Monitoring Data Experiment

Our last experiment evaluates our model on a remote healthcare monitoring dataset. The Technology Integrated Health Management (TIHM) dataset is collected by in-home monitoring sensory devices. As we discussed earlier, the TIHM dataset has been collected from over 100 homes with more than 200 participants in a clinical study aiming to improve the quality of life and to analyse the risk of adverse health conditions in people with dementia. We do not compare our model with the S3 methods: since there is only one class in each task, providing the task information is equivalent to telling the model which class it is about to predict.

In this experiment, we first evaluate changes in the daily activities of the participants in the study. This data contains three classes: low, medium and high levels of changes in the routine of daily living activities. Compared to the other experiments discussed above, this is a more challenging problem because the TIHM data is unbalanced. The low activity-change class contains 11057 samples, the medium activity-change class contains 1146 samples, and the high activity-change class contains only 64 samples. The model should be able to learn several tasks sequentially and also process the unbalanced data automatically. Different levels of activity and their changes also have various characteristics in different participants. In other words, a change in the level of activities that indicates low, medium or high activity does not have the same distribution in all the participants’ data. We split the dataset into training and test sets and then follow the same steps as described above for the other experiments to learn the three tasks in this experiment. Each task in this experiment contains only one class, which means that the test accuracy is the same as the decision accuracy.
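To make the degree of imbalance concrete, the short sketch below computes the share of each class from the sample counts quoted above. Only the three counts come from the text; the variable names are illustrative.

```python
# Class counts quoted in the text for the activity-change data.
class_counts = {"low": 11057, "medium": 1146, "high": 64}

total = sum(class_counts.values())
shares = {name: count / total for name, count in class_counts.items()}
# How many low activity-change samples exist per high activity-change sample.
imbalance_ratio = class_counts["low"] / class_counts["high"]

for name, share in shares.items():
    print(f"{name:>6}: {class_counts[name]:>5} samples ({share:.2%})")
print(f"low-to-high ratio: {imbalance_ratio:.1f}")
```

The high activity-change class accounts for well under 1% of the samples, which is why the third task is the hardest one to learn from this data.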

The test accuracy is shown in Table VII. Overall, combining the replay mechanism with TCNN improves the accuracy significantly while using a relatively small set of training samples. For the S3 methods, the model is told in advance which task to perform, and since there is only one class in each task, their accuracy is trivially 100%.

Test Accuracy (%)
IMM 33.3
TCNN (1) 78.8
TCNN (10) 98.7
RTCNN (1) 92.7
TABLE VII: Test accuracy in the TIHM experiment

The detection accuracy is shown in Table VIII. The higher rate of true detection compared with the false detection shows that the model is sensitive to the changes and confident with the learned data.

1 Task 1.0 1.0 0.0 0.025
2 Task 1.0 1.0 0.0 0.0175
TABLE VIII: Detection accuracy in TIHM experiment

We also evaluate the applicability of the model in classifying the cases of Urinary Tract Infections (UTIs) in the dataset. UTIs are one of the common causes of hospital admissions in people with dementia. In the TIHM dataset, we have some cases that are tagged by a monitoring team as true positives or false positives. The underlying data associated with these detected conditions are multivariate sensory data coming from sleep, movement, door and physiological monitoring sensors. One of the key limitations of our previous work in this area [4] was that the algorithms had to be trained offline and could not adapt to the various distributions representing patient groups that had UTIs but with a different manifestation of symptoms. Using TCNN and RTCNN with the TIHM data, the model can incrementally learn different distributions in each class (i.e. positive or negative for UTIs) and adapt to the changes in the input data. The test accuracy for this experiment is shown in Table IX.

Test Accuracy (%)
IMM 50.0
TCNN (1) 70.48
TCNN (8) 88.25
RTCNN (1) 78.65
TABLE IX: Test accuracy in detecting Urinary Tract Infections in the TIHM experiment

6 Discussion

In this section, we analyse the performance of TCNN under unknown task settings. We evaluate how the block size affects the performance of the model and visualise the density approximated by the probabilistic layer. We also discuss the probabilistic layer and provide an ablation study.

6.1 The performance of TCNN under Task-Unknown Setting

The task-unknown settings represent conditions in which we do not provide the task information to the model at any time. There are two phases in task-unknown settings: i) Training Phase: the model learns a new task; ii) Prediction Phase: the model provides the results on test data and detects the changes. The model goes to the training phase if and only if a change is detected in the prediction phase. For analysis purposes, in the prediction phase, after learning a task, the model first provides predictions on the test samples of that learned task and then on the test samples of the next, unseen task. In this experiment, we set the block size to 10 and use the split MNIST and CIFAR10 datasets. In the prediction phase, for each task, we test the model on 200 batches. For visualisation purposes, we use Piecewise Aggregate Approximation (PAA) [17] to process the data. The task likelihood and the average test accuracy on all the tasks are shown in Figure 6.
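As a reminder of what PAA does to a curve before plotting, the sketch below reduces a sequence to a fixed number of segment means. It is a simplified variant that uses integer segment boundaries; the function name `paa`, the number of segments and the example series are assumptions, since the text does not state the settings used for Figure 6.

```python
def paa(series, num_segments):
    """Piecewise Aggregate Approximation: replace each segment by its mean.

    Simplification (an assumption of this sketch): segment boundaries are
    rounded down to integer positions, so segments may differ in length by
    one element when len(series) is not divisible by num_segments.
    """
    n = len(series)
    if not 1 <= num_segments <= n:
        raise ValueError("num_segments must be between 1 and len(series)")

    reduced = []
    for k in range(num_segments):
        start = (k * n) // num_segments
        end = ((k + 1) * n) // num_segments
        segment = series[start:end]
        reduced.append(sum(segment) / len(segment))
    return reduced


# Hypothetical per-batch values reduced from six points to three.
smoothed = paa([1, 2, 3, 4, 5, 6], 3)
print(smoothed)
```

Applied to the per-batch task likelihood or accuracy, this kind of reduction smooths batch-level noise so that the drop at a task change remains visible in the plot.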

Fig. 6: TCNN learns five different tasks sequentially without having the task information in advance. The green blocks at the bottom represent the model tested on the learned task; the red blocks at the bottom represent the model tested on an unseen task. The blue shadows represent the area of , where is the task likelihood. When a change is detected, TCNN adapts to the new task automatically without forgetting the previous ones. (a) MNIST experiment. (b) CIFAR10 experiment. The task likelihoods in the top section of (b) are shown as a zoomed view.

As shown in Figure 6, when a new task appears, the task likelihood decreases significantly. The average test accuracy shows that TCNN detects the task changes and adapts to the new tasks quickly without forgetting the previously learned ones. Overall, TCNN provides a unique and novel feature by automatically detecting and adapting to new tasks in a scalable and efficient way.
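The alternation between the two phases can be illustrated with a small control-flow sketch. The detection rule used here, a fixed threshold on the highest task likelihood of a block, is only a stand-in assumption and is not the detection procedure of the paper; the function name, the threshold value and the example stream are likewise hypothetical.

```python
def run_task_unknown(stream, threshold=0.5):
    """Switch between prediction and training phases on a stream of blocks.

    stream: iterable of (block_id, best_task_likelihood) pairs, where the
    second value is the highest task likelihood over all learned tasks.
    Assumption: a change is flagged when this value falls below `threshold`.
    """
    learned_tasks = 1  # the first task is learned before prediction starts
    events = []
    for block_id, best_likelihood in stream:
        if best_likelihood < threshold:
            # Change detected: enter the training phase and learn a new task.
            learned_tasks += 1
            events.append((block_id, "train", learned_tasks))
        else:
            # No change: stay in the prediction phase with the learned tasks.
            events.append((block_id, "predict", learned_tasks))
    return events


# Hypothetical likelihood trace with two task changes (blocks 2 and 4).
events = run_task_unknown([(0, 0.9), (1, 0.8), (2, 0.1), (3, 0.85), (4, 0.2)])
for event in events:
    print(event)
```

The point of the sketch is that training is triggered only by a detected change, so the model never needs to be told when a task starts or ends.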

6.2 Block Size

Although TCNN can detect the task changes, the test accuracy can be affected by the block size. As shown in Tables I and IV, when the block size is set to 1, the accuracy is low. When the block size is increased, the test accuracy increases significantly. TCNN maximises the task likelihood by finding a general representation of the training samples. Intuitively, providing more samples makes the estimate of the task likelihood more reliable. As shown in Figure 8, at each time slot, the performance of TCNN increases as the block size increases.
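The effect of the block size can be sketched as averaging the per-sample task likelihoods over a block before choosing the task-specific neurons. The code below assumes the per-sample likelihoods have already been produced by each set of task-specific neurons; the function name `select_task` and the numbers are hypothetical and chosen only to mirror the single-sample failure case discussed in this section.

```python
def select_task(block_likelihoods):
    """Pick the task whose neurons give the highest mean likelihood.

    block_likelihoods[i][k] is the (assumed, precomputed) task likelihood
    of sample i under the k-th set of task-specific neurons.
    """
    block_size = len(block_likelihoods)
    num_tasks = len(block_likelihoods[0])
    mean_likelihood = [
        sum(row[k] for row in block_likelihoods) / block_size
        for k in range(num_tasks)
    ]
    best_task = max(range(num_tasks), key=lambda k: mean_likelihood[k])
    return best_task, mean_likelihood


# Three samples from the first task (index 0); the first one is ambiguous.
block = [[0.4, 0.6], [0.9, 0.2], [0.8, 0.3]]
single_choice, _ = select_task(block[:1])  # block size 1
block_choice, _ = select_task(block)       # block size 3
print(single_choice, block_choice)
```

With a single ambiguous sample the wrong task is selected, while averaging over the whole block recovers the correct one.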

(a) First Task-Specific Neurons
(b) Second Task-Specific Neurons
Fig. 7: Task likelihood with only one sample in the probabilistic layer in the split MNIST experiment. In this case, the task likelihood is not correctly determined. In this example, the second task-specific neurons have a higher likelihood than the first task-specific neurons to the samples from the first task.
Fig. 8: Test Accuracy in the Split MNIST experiment.

6.3 Analyse the Estimated Density

In this section, we use the split MNIST experiment to visualise the density estimated by the probabilistic layer.

In split MNIST, we have 5 different tasks, each containing two classes. Hence, in the probabilistic layer for each task, we have two kernels. Here we show the joint distribution of these two kernels. Figure 9(a) shows the distribution of the two kernels in the probabilistic layer of the first task-specific neurons. Figure 9(b) shows the sample distribution extracted by the previous hidden layers. As shown in these figures, the probabilistic layer estimates the density of the sample distributions successfully.

(a) Kernel Distribution
(b) Sample Distribution
(c) Kernel Distribution
(d) Sample Distribution
Fig. 9: Kernel distribution and sample distribution in the task-specific neurons. The samples are collected from the first task. The sample distribution is obtained from the hidden layers of the task-specific neurons. Panels (a) and (b) are from the first task-specific neurons, hence the distributions are similar to each other. Panels (c) and (d) are from the second task-specific neurons, hence the distributions are quite different from each other.

For comparison, we visualise the sample distribution in the second task-specific neurons. The kernel distribution and the sample distribution extracted by the hidden layers of the second task-specific neurons are shown in Figures 9(c) and 9(d).

In this work, we approximate the distributions that can be classified by the neural network. In other words, we consider the hidden layers as a function that maps the samples in the training task into specific distributions. We jointly train the classifier and the probabilistic layer to approximate these distributions. As the data changes from in-task samples (i.e. samples more relevant to a specific task) to out-task samples (i.e. samples less relevant to a particular task), the function can no longer map the samples to the learned distributions. As shown in Figure 9, the function cannot map the samples from the first task to the kernel distribution trained for task 2 (Figures 9(c) and 9(d)).

The difference between the joint distribution of the kernel and the samples is shown in Figure 10. The samples are drawn from the first task. The difference between the kernel and sample distribution is small for the first task-specific neurons, as shown in Figure 10(a), and significant for the second task-specific neurons, as shown in Figure 10(b).

(a) Task 1 Specific Neurons
(b) Task 2 Specific Neurons
Fig. 10: Difference between the sample distribution and the kernel distribution of the task-specific neurons in the probabilistic layer. Samples are drawn from Task 1.

6.4 Ablation Study

Without the probabilistic layer to compute the task likelihood, the model cannot decide which task-specific neurons should be activated and cannot detect changes in tasks. We use the 5-task split MNIST to perform an ablation study. As shown in Table X, the conventional neural network model used in this paper cannot detect changes from the softmax output layer alone. Farquhar et al. [6] and Li et al. [22] suggest that computing the mutual information with Monte Carlo sampling can determine the degree of uncertainty. However, testing with this approach is computationally expensive and slow. Furthermore, their method needs a large batch of samples to compute a reliable uncertainty measure, and neither work proposes a solution for detecting a new task with limited samples.

1 Task 0.0 0.02 0.0 0.0
2 Task 0.0 0.0 0.0 0.0
3 Task 0.0 0.0 0.0 0.0
4 Task 0.0 0.0 0.0 0.0
TABLE X: The detection rate without probabilistic layer

The accuracy of the model without the probabilistic layer is 0.204. Since there are five tasks in this experiment, this is effectively the same as a random guess among the tasks (1/5 = 0.2).

We then evaluate how the number of parameters in the probabilistic layer affects the detection accuracy. The number of parameters can be regarded as a general representation of the training task. We want to find the probability density of the samples that can be associated with a specific task. The parameters in the probabilistic layer represent all of these densities. We report the detection accuracy of the five-task split MNIST experiment with just one parameter in the probabilistic layer. As shown in Table XI, the detection accuracy decreases dramatically compared with Table III, which reports the scenario with two parameters in the probabilistic layer.

1 Task 0.0 0.02 0.08 0.0
2 Task 0.0 0.0 0.0 0.0
3 Task 0.0 0.0 0.0 0.0
4 Task 0.0 0.0 0.0 0.0
TABLE XI: Detection rate in the split MNIST experiment with only one parameter in the probabilistic layer

1 Task 1.0 1.0 0.0 0.025
2 Task 0.0 0.16 0.0 0.0125
3 Task 0.06 0.324 0.0 0.013
4 Task 0.015 0.1 0.0 0.03
TABLE XII: Detection rate in the split MNIST experiment with ten parameters in the probabilistic layer

The task likelihoods of the first three task-specific neurons are shown in Figure 7. With only one probability density in the probabilistic layer, the model cannot obtain the task information correctly. We aim to approximate the distributions which can be classified by the neural network. There are potentially two distributions in each task, and a single general representation of these two distributions is not sufficient to distinguish different task identities. We visualise the sample distribution from the first task and compare it with the kernel distributions of the first and second task-specific neurons in Figure 11. With only one parameter in the probabilistic layer, the fourth task-specific neurons have a smaller difference between the sample distribution and kernel distribution. In other words, the model fails to identify the correct task-specific neurons.

When the number of parameters is larger than the number of classes in the task, the model will no longer be sensitive to the task changes, as shown in Table XII. This is due to the fact that the probabilistic layer overfits the sample distributions.

(a) First Task Specific Neurons
(b) Second Task Specific Neurons
Fig. 11: Sample distributions and kernel distributions. Sample from the first task. Only one parameter in the probabilistic layer.

7 Conclusions

In this paper, we first discuss the reasons for the forgetting problem in machine learning when different tasks are given to a model at different times and demonstrate how providing or acquiring the learning task information is essential to address this issue. We also present a challenge that we have in our healthcare monitoring research and discuss how an automated and scalable model can help to solve this issue in dynamic and changing environments. We then propose a Task Conditional Neural Network (TCNN) model for continual learning of sequential tasks. TCNN is a novel model that provides task-specific neurons corresponding to different tasks.

TCNN can learn and decide which neurons should be chosen and activated under the different tasks that are given to the model, without being provided the task information in advance. TCNN can detect the changes in the tasks and learn new tasks automatically. The proposed model implements these features by using a probabilistic layer and measuring the task likelihood given a set of samples associated with a specific task. The proposed model interprets the task likelihood as a binary probability and learns the task likelihoods by utilising a probabilistic neural network.

Our proposed model outperforms the state-of-the-art methods in terms of accuracy and also detects changes automatically. We have also shown how TCNN can identify changes in the data and targets and use the previously learned parameters for each task to detect and predict changes in daily-living activities in our remote healthcare monitoring scenario. The proposed model can have a significant impact on creating continual learning methods for dynamic and changing environments in which the data distributions and goals change over time, and in which previously learned information is required when a previously learned state reoccurs.

Future work will focus on measuring the confidence value for a specific task for each neuron. In the current model, a probabilistic layer measures the joint confidence of a group of task-specific neurons. If we can measure the confidence for a single neuron, the efficiency of the model will increase significantly. A solution could be storing the confidence measure for each neuron, given the test samples associated with each task. However, this could also add several orders of magnitude of complexity and could make the model intractable. In other words, the confidence of the weights is conditional on the samples instead of having a static value [37]. Identifying a change in the current task can also be improved by a pre-processing decision layer (e.g. a time-series clustering method) or by using a set of ensemble models and an algorithm in a dynamic network to select different subsets of neurons depending on the task.


This work is supported by the Care Research and Technology Centre at the UK Dementia Research Institute (UK DRI). The work is also partially supported by the EU Horizon 2020 IoTCrawler project under contract number 779852.


  • [1] R. Aljundi, K. Kelchtermans, and T. Tuytelaars (2019) Task-free continual learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11254–11263. Cited by: §1, §2.
  • [2] W. F. Asaad, G. Rainer, and E. K. Miller (2000) Task-specific neural activity in the primate prefrontal cortex. Journal of Neurophysiology 84 (1), pp. 451–459. Cited by: §1.
  • [3] S. Enshaeifar, P. Barnaghi, S. Skillman, A. Markides, T. Elsaleh, S. T. Acton, R. Nilforooshan, and H. Rostill (2018) The internet of things for dementia care. IEEE Internet Computing 22 (1), pp. 8–17. Cited by: §1, §5.
  • [4] S. Enshaeifar, A. Zoha, S. Skillman, A. Markides, S. T. Acton, T. Elsaleh, M. Kenny, H. Rostill, R. Nilforooshan, and P. Barnaghi (2019) Machine learning methods for detecting urinary tract infection and analysing daily living activities in people with dementia. PloS one 14 (1), pp. e0209909. Cited by: §5.3, §5.
  • [5] S. Enshaeifar, A. Zoha, A. Markides, S. Skillman, S. T. Acton, T. Elsaleh, M. Hassanpour, A. Ahrabian, M. Kenny, S. Klein, et al. (2018) Health management and pattern analysis of daily living activities of people with dementia using in-home sensors and machine learning techniques. PloS one 13 (5), pp. e0195605. Cited by: §5.
  • [6] S. Farquhar and Y. Gal (2018) Towards robust evaluations of continual learning. arXiv preprint arXiv:1805.09733. Cited by: §2, §6.4.
  • [7] R. M. French (1991) Using semi-distributed representations to overcome catastrophic forgetting in connectionist networks. In Proceedings of the 13th annual cognitive science society conference, pp. 173–178. Cited by: §2.
  • [8] R. M. French (1992) Semi-distributed representations and catastrophic forgetting in connectionist networks. Connection Science 4 (3-4), pp. 365–377. Cited by: §2.
  • [9] R. M. French (1999) Catastrophic forgetting in connectionist networks. Trends in cognitive sciences 3 (4), pp. 128–135. Cited by: §2.
  • [10] Y. Gal and Z. Ghahramani (2016) Dropout as a bayesian approximation: representing model uncertainty in deep learning. In international conference on machine learning, pp. 1050–1059. Cited by: §4.
  • [11] I. J. Goodfellow, M. Mirza, D. Xiao, A. Courville, and Y. Bengio (2013) An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211. Cited by: §2.
  • [12] I. J. Goodfellow, O. Vinyals, and A. M. Saxe (2014) Qualitatively characterizing neural network optimization problems. arXiv preprint arXiv:1412.6544. Cited by: §1.
  • [13] M. K. Hetherington (1993) Catastrophic interference is eliminated in pretrained networks. In Proceedings of the 15th Annual Conference of the Cognitive Science Society, pp. 723–728. Cited by: §2.
  • [14] G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §3.
  • [15] W. Hu, Z. Lin, B. Liu, C. Tao, Z. Tao, J. Ma, D. Zhao, and R. Yan (2018) Overcoming catastrophic forgetting for continual learning via model adaptation. Cited by: §5.
  • [16] N. Kamra, U. Gupta, and Y. Liu (2017) Deep generative dual memory network for continual learning. arXiv preprint arXiv:1710.10368. Cited by: §3.
  • [17] E. Keogh, K. Chakrabarti, M. Pazzani, and S. Mehrotra (2001) Dimensionality reduction for fast similarity search in large time series databases. Knowledge and information Systems 3 (3), pp. 263–286. Cited by: §6.1.
  • [18] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. (2017) Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences 114 (13), pp. 3521–3526. Cited by: §1, §1, §2, §2, §3.
  • [19] C. S. Kortge (1990) Episodic memory in connectionist networks. Cited by: §2.
  • [20] S. Lee, J. Kim, J. Jun, J. Ha, and B. Zhang (2017) Overcoming catastrophic forgetting by incremental moment matching. In Advances in Neural Information Processing Systems, pp. 4652–4662. Cited by: §1, §1, §2, §2, §3, §5.1, §5.
  • [21] S. Legg and M. Hutter (2007) Universal intelligence: a definition of machine intelligence. Minds and machines 17 (4), pp. 391–444. Cited by: §1.
  • [22] H. Li, P. Barnaghi, S. Enshaeifar, and F. Ganz (2019) Continual learning using bayesian neural networks. External Links: 1910.04112 Cited by: §2, §6.4.
  • [23] M. Li, T. Zhang, Y. Chen, and A. J. Smola (2014) Efficient mini-batch training for stochastic optimization. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 661–670. Cited by: §4.2, §4.3.
  • [24] Z. Li and D. Hoiem (2018) Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (12), pp. 2935–2947. Cited by: §3.
  • [25] D. Lopez-Paz et al. (2017) Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, pp. 6467–6476. Cited by: §5.
  • [26] N. Y. Masse, G. D. Grant, and D. J. Freedman (2018) Alleviating catastrophic forgetting using context-dependent gating and synaptic stabilization. Proceedings of the National Academy of Sciences 115 (44), pp. E10467–E10475. Cited by: §1, §2, §2, §2, §2, §4, §5.1.
  • [27] M. McCloskey and N. J. Cohen (1989) Catastrophic interference in connectionist networks: the sequential learning problem. In Psychology of learning and motivation, Vol. 24, pp. 109–165. Cited by: §1.
  • [28] M. Mirza and S. Osindero (2014) Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784. Cited by: §2.
  • [29] C. V. Nguyen, Y. Li, T. D. Bui, and R. E. Turner (2017) Variational continual learning. arXiv preprint arXiv:1710.10628. Cited by: §1, §2, §2, §5.
  • [30] G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, and S. Wermter (2019) Continual lifelong learning with neural networks: a review. Neural Networks. Cited by: §3, §3.
  • [31] T. Pearce, M. Zaki, A. Brintrup, and A. Neel (2018) Uncertainty in neural networks: bayesian ensembling. arXiv preprint arXiv:1810.05546. Cited by: §4.
  • [32] P. Peng, Y. Tian, T. Xiang, Y. Wang, M. Pontil, and T. Huang (2017) Joint semantic and latent attribute modelling for cross-class transfer learning. IEEE transactions on pattern analysis and machine intelligence 40 (7), pp. 1625–1638. Cited by: §5.2.
  • [33] A. Robins (1995) Catastrophic forgetting, rehearsal and pseudorehearsal. Connection Science 7 (2), pp. 123–146. Cited by: §3.
  • [34] J. Serrà, D. Surís, M. Miron, and A. Karatzoglou (2018) Overcoming catastrophic forgetting with hard attention to the task. arXiv preprint arXiv:1801.01423. Cited by: §1, §2, §2, §3, §4.
  • [35] H. Shin, J. K. Lee, J. Kim, and J. Kim (2017) Continual learning with deep generative replay. In Advances in Neural Information Processing Systems, pp. 2990–2999. Cited by: §1, §3.
  • [36] D. F. Specht (1990) Probabilistic neural networks. Neural networks 3 (1), pp. 109–118. Cited by: §1.
  • [37] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958. Cited by: §2.
  • [38] S. Thrun and T. M. Mitchell (1995) Lifelong robot learning. Robotics and autonomous systems 15 (1-2), pp. 25–46. Cited by: §1.
  • [39] D. K. Wedding II and K. J. Cios (1996) Time series forecasting by combining rbf networks, certainty factors, and the box-jenkins model. Neurocomputing 10 (2), pp. 149–168. Cited by: §4.1.
  • [40] J. Yoon, E. Yang, J. Lee, and S. J. Hwang (2017) Lifelong learning with dynamically expandable networks. arXiv preprint arXiv:1708.01547. Cited by: §1, §3.
  • [41] G. Zeng, Y. Chen, B. Cui, and S. Yu (2018) Continuous learning of context-dependent processing in neural networks. arXiv preprint arXiv:1810.01256. Cited by: §1, §3, §5.
  • [42] F. Zenke, B. Poole, and S. Ganguli (2017) Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 3987–3995. Cited by: §1, §2, §2, §2, §5.1, §5.