Quantum Continual Learning Overcoming Catastrophic Forgetting

08/05/2021 ∙ by Wenjie Jiang, et al. ∙ Tsinghua University

Catastrophic forgetting describes the fact that machine learning models tend to forget the knowledge of previously learned tasks after learning a new one. It is a vital problem in the continual learning scenario and has recently attracted tremendous attention across different communities. In this paper, we explore the catastrophic forgetting phenomenon in the context of quantum machine learning. We find that, similar to classical learning models based on neural networks, quantum learning systems likewise suffer from such a forgetting problem in classification tasks emerging from various application scenes. We show that, based on the local geometrical information in the loss function landscape of the trained model, a uniform strategy can be adopted to overcome the forgetting problem in the incremental learning setting. Our results uncover the catastrophic forgetting phenomenon in quantum machine learning and offer a practical method to overcome this problem, which opens a new avenue for exploring potential quantum advantages towards continual learning.





I The Setting

Our numerical simulations are based on the open source package Yao.jl [65]. To illustrate the catastrophic forgetting phenomenon, we randomly initialize an eight-qubit variational quantum circuit (as shown in Fig. S4) as the ansatz for our quantum classifier. The rotation angles are variational parameters that are updated during training and kept fixed during inference, and the CNOT gates are necessary to entangle all qubits, since entanglement in quantum circuits is a key resource for potential quantum advantages. This variational architecture is hardware-efficient [66] and is capable of achieving satisfactory performance on our classification tasks (see Fig. S5). Moreover, this architecture does not take advantage of any specific structural information of the datasets.

All data encountered in our numerical simulations consist of 256 features and can be represented by eight qubits using amplitude encoding. For the original MNIST hand-written digit images, the $28 \times 28$-pixel images [67] are reduced to $16 \times 16$-pixel images (see Fig. S5(a)), so that we can simulate this quantum learning process with moderate classical computational resources. Then, we randomly choose a permutation of the 256 pixels and apply it to all images, which produces a new dataset consisting of pixel-permuted images (see Fig. S5(b)). For time-of-flight (TOF) images, we diagonalize the Hamiltonian of the quantum anomalous Hall effect with an open boundary condition and calculate the atomic density distributions with different spin bases for the lower band in momentum space to obtain the input data. We vary the strength of the spin-orbit coupling and the strength of the on-site Zeeman interaction in both the topological and the topologically trivial regions to generate several thousand data samples (see Fig. S5(c)). For the symmetry protected topological (SPT) state, we consider a model involving eight spins and exactly diagonalize its Hamiltonian to obtain the ground state, which can be naturally represented using eight qubits (see Fig. S5(d)). In this work, we use amplitude encoding to convert the data of our classification tasks into the input quantum states for the quantum classifier.
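
As a minimal sketch of the preprocessing described above, the snippet below normalizes a 256-feature vector so that it can serve as the amplitudes of an eight-qubit state, and applies one fixed random pixel permutation. It is written in plain Python for illustration only (the simulations themselves are based on Yao.jl); the function names, the seeds, and the randomly generated stand-in image are assumptions and not taken from the paper.

```python
import math
import random

def amplitude_encode(features):
    """Return the L2-normalized feature vector, usable as state amplitudes."""
    norm = math.sqrt(sum(x * x for x in features))
    if norm == 0.0:
        raise ValueError("cannot encode an all-zero feature vector")
    return [x / norm for x in features]

def make_pixel_permutation(n_pixels, seed):
    """Draw one fixed random permutation that is reused for every image."""
    rng = random.Random(seed)
    perm = list(range(n_pixels))
    rng.shuffle(perm)
    return perm

def apply_permutation(features, perm):
    """Reorder the pixels of a flattened image according to perm."""
    return [features[j] for j in perm]

# Hypothetical stand-in for a flattened 16x16 image (256 pixel values).
rng = random.Random(0)
image = [rng.random() for _ in range(256)]

n_qubits = int(math.log2(len(image)))  # 256 amplitudes fit into 8 qubits
perm = make_pixel_permutation(len(image), seed=1)

state = amplitude_encode(image)
permuted_state = amplitude_encode(apply_permutation(image, perm))
```

Because the permutation only reorders pixels, the permuted dataset carries the same information per image while presenting the classifier with a different input distribution.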

The process of sequential learning is divided into different phases, and our quantum classifier is trained with only one specific dataset in each training phase. For example, to illustrate the catastrophic forgetting phenomenon, we first use the randomly initialized quantum classifier to learn to classify the original MNIST images. After a satisfactory performance is obtained, this classifier is trained to distinguish the permuted MNIST images. The results of the different learning phases are shown in the main text, where the forgetting phenomenon is revealed. For continual learning via the EWC method, the Fisher information matrix for each task is computed after the corresponding training phase and is stored for the following training phases.
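
The two-phase protocol can be illustrated with a deliberately simple classical toy problem. The sketch below is not the quantum classifier: the two quadratic losses, the learning rate, the strength value, and the use of the curvature of the first loss as a stand-in for the stored diagonal Fisher information are all assumptions chosen so that the behaviour can be checked by hand.

```python
def loss_a(theta):
    # Toy loss for the first task: only theta[0] matters.
    return (theta[0] - 1.0) ** 2

def loss_b(theta):
    # Toy loss for the second task: only the sum theta[0] + theta[1] matters.
    return (theta[0] + theta[1] - 3.0) ** 2

def grad_b(theta):
    g = 2.0 * (theta[0] + theta[1] - 3.0)
    return [g, g]

def train_on_b(theta_start, lam, fisher_diag, steps=500, lr=0.1):
    """Gradient descent on loss_b plus a diagonal quadratic penalty.

    The penalty (lam / 2) * F_i * (theta_i - theta_start_i)^2 pulls each
    parameter back toward the optimum stored after the first phase.
    """
    theta = list(theta_start)
    for _ in range(steps):
        g = grad_b(theta)
        for i in range(len(theta)):
            g_penalty = lam * fisher_diag[i] * (theta[i] - theta_start[i])
            theta[i] -= lr * (g[i] + g_penalty)
    return theta

theta_a = [1.0, 0.0]   # parameters after the first training phase
fisher_a = [2.0, 0.0]  # importance weights stored after the first phase

plain = train_on_b(theta_a, lam=0.0, fisher_diag=fisher_a)  # no protection
ewc = train_on_b(theta_a, lam=1.0, fisher_diag=fisher_a)    # with penalty

print("without penalty:", loss_a(plain), loss_b(plain))
print("with penalty:   ", loss_a(ewc), loss_b(ewc))
```

In this toy setting, plain training on the second task moves both parameters and the first loss rises again, while the penalized run changes only the parameter that the first task does not depend on.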

II Elastic Weight Consolidation

Figure S5: Results of learning single task. Here we show the classification performances of our quantum classifier on tasks used in our simulations of quantum continual learning and we also plot a sample image of each task: (a) the original MNIST image and its learning performance; (b) the permuted MNIST image and its learning performance; (c) the time-of-flight image and its learning performance; (d) the symmetry protected topological state and its learning performance.

From a high-level perspective, overcoming catastrophic forgetting in quantum continual learning requires protecting the learned knowledge of previous tasks, as well as learning the new knowledge of the following tasks [64, 68]. Our quantum learning model should therefore have enough capacity to store this information. In addition, appropriate management of the model's capacity is required to achieve quantum continual learning in practice. The EWC method offers a practical way to manage this capacity: it estimates the capacity necessary for previous tasks and refreshes the remaining part, which contains little information about the previously trained tasks. To do this, the EWC method evaluates the importance of each variational parameter in the quantum classifier and only allows significant changes to the relatively unimportant ones.

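We now give a detailed mathematical derivation of the EWC method. For simplicity, we consider the two-task scenario here and use the same philosophy to explicitly write down the result for the multi-task scenario. From the perspective of maximum likelihood estimation [57], we explore all possible parameters $\theta$ of the quantum classifier to maximize the likelihood function $p(\theta|\mathcal{D})$, where $\mathcal{D}$ is the total dataset ($\mathcal{D}_A$ and $\mathcal{D}_B$ are the datasets for task $A$ and task $B$ respectively, and we assume that these two tasks are independent of each other). So we have the expression

$$
\begin{aligned}
\log p(\theta|\mathcal{D}) &= \log p(\mathcal{D}|\theta) + \log p(\theta) - \log p(\mathcal{D}) \\
&= \log p(\mathcal{D}_A|\theta) + \log p(\mathcal{D}_B|\theta) + \log p(\theta) - \log p(\mathcal{D}_A) - \log p(\mathcal{D}_B) \\
&= \log p(\mathcal{D}_B|\theta) + \log p(\theta|\mathcal{D}_A) - \log p(\mathcal{D}_B),
\end{aligned}
$$

where the first and third equalities use Bayes' rule and the second equality uses the independence condition. As shown in the main text, we have the Taylor series for the second term, expanded around the optimal solution $\theta_A^*$ for task $A$, where the first-order term vanishes and $H$ denotes the Hessian matrix of $\log p(\theta|\mathcal{D}_A)$ at $\theta_A^*$:

$$
\log p(\theta|\mathcal{D}_A) \approx \log p(\theta_A^*|\mathcal{D}_A) + \frac{1}{2} (\theta - \theta_A^*)^{T} H (\theta - \theta_A^*).
$$
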
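It is worthwhile to mention that, from the perspective of parameter estimation [69], this treatment means that we sample parameters from a multivariate normal distribution:

$$
\theta \sim \mathcal{N}\left(\theta_A^*, (-H)^{-1}\right),
$$

where the optimal solution $\theta_A^*$ for task $A$ is the mean value of this normal distribution and $-H$ is the precision matrix ($H$ is the Hessian matrix at the optimal solution for task $A$ and is equal to the negative of the Fisher information matrix $F$ under some specific conditions [58]). We can rewrite the quadratic term using the Fisher information matrix and absorb it into the likelihood function of sequential tasks. This leads to the loss function for the second task in our scenario:

$$
L(\theta) = L_B(\theta) + \frac{\lambda}{2} (\theta - \theta_A^*)^{T} F (\theta - \theta_A^*),
$$

where $L_B(\theta)$ is the original loss function for task $B$ and $\lambda$ is a hyper-parameter controlling the strength of the restriction.
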
To reduce the potential storage and computation overhead for potentially large quantum models, we use the diagonal elements of the Fisher matrix as the weights of the variational parameters and neglect the off-diagonal entries, a choice that is discussed later. Thus, we can add the regularization term shown in the main text to the loss function of the second task in order to maximize the likelihood function of the joint tasks.

Figure S6: Comparison between the learning result of using the diagonal elements of the Fisher matrix and that of using the full Fisher matrix. We train our quantum classifier using the original MNIST dataset and then adopt two kinds of regularization terms to train this classifier using the new permuted MNIST dataset. The learning settings for both cases are exactly the same except for the strength parameter $\lambda$.

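For continual learning of more than two tasks, we can compute the regularization term for each trained task and add them together to overcome catastrophic forgetting:

$$
L(\theta) = L_t(\theta) + \sum_{k<t} \sum_i \frac{\lambda}{2} F_{k,i} \left(\theta_i - \theta_{k,i}^*\right)^2,
$$

where $L_t(\theta)$ is the original loss function for the current task $t$ given the current parameters $\theta$, $F_{k,i}$ is the $i$-th diagonal element of the Fisher information matrix at the optimal point for previous task $k$, $\lambda$ is a hyper-parameter controlling the strength of this EWC restriction, and $\theta_{k,i}^*$ is the $i$-th component of the optimal parameters for task $k$.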
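
The summed regularization term described above can be sketched as follows. This is an illustrative plain-Python helper, not the authors' code: the function name, the data layout (one pair of optimal parameters and diagonal Fisher elements per finished task), the single shared strength value, and the numbers in the example are assumptions.

```python
def ewc_total_loss(current_loss, theta, stored_tasks, lam):
    """Add one diagonal quadratic penalty per previously trained task.

    stored_tasks is a list of (theta_star, fisher_diag) pairs, one pair
    stored after each finished training phase.
    """
    penalty = 0.0
    for theta_star, fisher_diag in stored_tasks:
        for t, s, f in zip(theta, theta_star, fisher_diag):
            penalty += 0.5 * lam * f * (t - s) ** 2
    return current_loss + penalty

# Two hypothetical finished tasks with their stored optima and weights.
stored = [
    ([0.0, 0.0], [1.0, 2.0]),
    ([1.0, 1.0], [3.0, 0.0]),
]
theta_now = [1.0, 0.0]

# Task 1 contributes 0.5 * 2 * (1*1 + 2*0) = 1.0 and task 2 contributes 0,
# so the total is the current loss 0.25 plus 1.0.
total = ewc_total_loss(current_loss=0.25, theta=theta_now, stored_tasks=stored, lam=2.0)
print(total)
```

Parameters with a large stored weight for any earlier task become expensive to move, while parameters with near-zero weights remain free for the current task.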

III Reasons for neglecting off-diagonal elements

In our numerical simulations, the quantum classifier consists of 248 variational parameters, for which computing and storing the full Fisher matrix is not difficult. Nevertheless, if the number of parameters grows to match the exponentially growing dimensionality of the Hilbert space, computing and storing the full Fisher matrix can become quite challenging. From a more practical perspective, we therefore use the diagonal elements of the Fisher matrix, which can be estimated from first-order derivatives [70].
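
A minimal sketch of estimating the diagonal elements from first-order derivatives is given below. It averages the squared gradients of the per-sample log-likelihood, using the observed labels. A classical logistic model stands in for the quantum classifier here, and the model, the function names, and the two data points are assumptions made purely for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood_grad(theta, x, y):
    """Gradient of log p(y | x, theta) for a logistic model, y in {0, 1}."""
    p = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
    return [(y - p) * xi for xi in x]

def diagonal_fisher(theta, samples):
    """Average of squared first-order derivatives, one value per parameter."""
    fisher = [0.0] * len(theta)
    for x, y in samples:
        grad = log_likelihood_grad(theta, x, y)
        for i, gi in enumerate(grad):
            fisher[i] += gi * gi
    return [f / len(samples) for f in fisher]

samples = [([1.0, 2.0], 1), ([3.0, 0.0], 0)]
fisher_diag = diagonal_fisher([0.0, 0.0], samples)
print(fisher_diag)

# Storage needed for the 248 parameters mentioned above.
n_params = 248
diag_entries = n_params              # one weight per parameter
full_entries = n_params * n_params   # every pairwise entry
print(diag_entries, full_entries)
```

The diagonal estimate needs only one gradient per sample and one stored number per parameter, whereas the full matrix grows quadratically with the number of parameters.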

To compare the learning result of using the diagonal elements of the Fisher matrix with that of using the full Fisher matrix, we train our quantum classifier using the original MNIST images and the permuted MNIST images sequentially. In this simulation, the diagonal elements of the Fisher matrix and the full Fisher matrix are adopted, respectively, as the metric that quantifies the distance in the parameter space. The results in Fig. S6 show that the performances of both metric choices are at the same level. We remark here that, in consideration of the summation over the off-diagonal elements, we manually lower the strength parameter $\lambda$ in the simulation that uses the full Fisher matrix. The similar performances of these two learning scenarios indicate that neglecting the off-diagonal elements of the Fisher matrix has no significant influence on the results of quantum continual learning. Thus, we use the diagonal elements as our distance metric in all other numerical simulations.
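
The difference between the two metric choices can be made concrete with a small hand-checkable example. The sketch below evaluates the quadratic form once with a full matrix and once with only its diagonal; the 2x2 matrix and the displacement vectors are made-up values, not data from the simulations.

```python
def full_quadratic(delta, fisher):
    """delta^T F delta using every entry of the matrix."""
    n = len(delta)
    return sum(delta[i] * fisher[i][j] * delta[j]
               for i in range(n) for j in range(n))

def diagonal_quadratic(delta, fisher):
    """Same form, keeping only the diagonal entries of the matrix."""
    return sum(fisher[i][i] * d * d for i, d in enumerate(delta))

fisher = [[2.0, 1.0],
          [1.0, 2.0]]

same_sign = [1.0, 1.0]       # both parameters move in the same direction
opposite_sign = [1.0, -1.0]  # parameters move in opposite directions

full_same = full_quadratic(same_sign, fisher)
diag_same = diagonal_quadratic(same_sign, fisher)
full_opp = full_quadratic(opposite_sign, fisher)
diag_opp = diagonal_quadratic(opposite_sign, fisher)

print(full_same, diag_same)  # off-diagonal terms add to the penalty
print(full_opp, diag_opp)    # off-diagonal terms reduce the penalty
```

Since the off-diagonal entries change the overall magnitude of the penalty, the same strength value does not correspond to the same effective restriction in the two cases, which is one way to read the manual adjustment of the strength parameter mentioned above.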

IV More numerical results

In this section, we give more results of quantum continual learning. The performances of learning single tasks are shown in Fig. S5, where one sample image of each dataset is also plotted. These results indicate that our quantum classifier is capable of achieving satisfactory performance on the chosen classification tasks.

Figure S7: Illustration of quantum continual learning of classifying different MNIST images. Learning curves of three related tasks: classifying digit 2 and digit 8, classifying digit 1 and digit 4, and classifying digit 0 and digit 9.

In the main text, we show that quantum continual learning in the two-task case can be accomplished whether the two problems are similar or dissimilar to each other. As a complementary example, we also simulate the quantum continual learning of related problems. We use MNIST images of different digits to construct several classification tasks and find that the continual learning of this kind of task can also be accomplished (see Fig. S7).

We group MNIST hand-written images of different digits to construct several binary classification tasks and use them to train our quantum classifier. For the multi-task case, we choose three pairs of digits and use our quantum classifier to classify their hand-written images. We first train our quantum classifier using images of digit 2 and images of digit 8, which ends with a high classification performance (). Then, we train this quantum classifier to distinguish digit 1 from digit 4. With the help of the EWC method, our quantum classifier behaves reasonably well on both tasks after the second training phase. Subsequently, we train this circuit to classify digit 0 and digit 9, and find that our quantum classifier performs relatively well on all three classification tasks after these training processes.

We also notice that, in the continual learning scenario, the performance of our quantum classifier on each task is slightly reduced compared with that in the single-task learning scenario. Intuitively, this is caused by an inevitable small deviation of the optimal solution of the joint task from the optimal solution of each single task.