Quantum machine learning is often listed as one of the most promising applications of near-term quantum computers, with important early successes in a range of problems, from classification [13, 28] to generative modelling. However, the broader roll-out of these methods to real-world problems is tempered, in part, by the limited size of quantum computers. Among other limitations, current quantum computers lack enough qubits to run large circuits. Some “circuit partitioning schemes” [5, 22] have been proposed to simulate larger circuits on smaller devices by partitioning the full circuit into a set of smaller circuits (see figure 1). However, the exponential number of circuits needed by these schemes is intractable for most applications, with billions of sub-circuit evaluations required for even modest quantum machine learning instances.
In this work we examine the necessity of each subcircuit in producing an approximation of a partitioned circuit, and argue that a smaller number of circuits could be sufficient in some cases. We then use this as inspiration for a new machine learning technique, which reconciles the need for larger circuit instances with affordable runtimes. Our new technique takes the same form as a generic machine learning architecture that has been partitioned using the aforementioned techniques, but with vastly fewer terms.
We demonstrate the technique on an instance of handwritten digit recognition using a 64 qubit ansatz with access to only a simulated 8 qubit computer (without use of excessive dimensionality reduction, such as principal component analysis). We also include an experiment testing the model’s ability to replicate the output of larger unpartitioned circuits. Error analysis and the specifics of the evaluation and training schemes are presented in Section V.
II Related work
We are not the first to consider how the partitioning schemes [5, 22] could be made more efficient. Whereas other research lines have focused on minimising the computational cost of applying the exact partitioning schemes (e.g. by minimising the number of gates cut), we focus on shrinking the number of subcircuits used to approximate the output. As such, many of the techniques in this section can be composed with our method to create an even more efficient scheme.
One line of work applies an automated cutting procedure to produce the minimum number of subcircuits needed; similarly, maximum likelihood fragment tomography has been used to improve both the cutting process and the reconstruction of output states. Other authors have considered how the set of subcircuits could be run more effectively by utilising distributed computational resources [27, 4].
Using partitioning schemes produces an additional benefit: reduced noise, stemming from the smaller circuit size [30, 1]. This can be the motivation for cut selection, even when the full circuit would “fit” on a quantum machine. This noise reduction is similar to the increased accuracy we may be able to provide to gradients in our model. We will discuss the potential link between this and the avoidance of barren plateaus in our model in Section V.
After developing our technique we will demonstrate its use on high dimensionality data, specifically handwritten digit recognition. This problem has been tackled before with quantum hardware. In one approach, dimensionality reduction techniques (such as principal component analysis) are used to reduce the dimensionality of the digits to a feature vector small enough to fit on an 8 qubit machine. In another, handwritten digits are classified on an 8 qubit machine without reducing the size of the data: the full data is carefully encoded into the quantum computer, first with amplitude encoding, and then by using 11 layers of parameterised gates. Our approach is fundamentally different from either of these. We use the same sized data (8×8 pixels) but do not apply dimensionality reduction, as in the former, or reuse qubits for multiple data points, as in the latter. We follow a simple encoding, giving each pixel its own qubit, which we can achieve because we approximate a 64 qubit machine while only using an 8 qubit machine. Other works have addressed high dimensionality data by pushing the limit of the size of quantum machine learning models on current devices.
III Model Motivation and Specification
In this section we introduce parameterised quantum circuits (PQCs), a popular concept in quantum machine learning, and circuit partitioning, a method of evaluating quantum circuits that require more qubits than are accessible. By applying these circuit partitioning schemes to PQCs we can produce a more powerful machine learning model than the smaller device naively allows, at the cost of unreasonable runtime. We then go on to develop a novel QML method which, intuitively, may be as useful at a fraction of the runtime.
III.1 Parameterised Quantum Circuits
Parameterised quantum circuits (PQCs) are a varied and promising method for quantum machine learning. In general they consist of some set of circuits, $U$, parameterised by a weight vector, $\theta$. In the most common forms the input datum, $x$, also parameterises gates in the circuit. The set can be indexed as $\{U(\theta, x)\}_{\theta}$. These circuits yield functions when we specify an initial state, $|\psi\rangle$, and an observable, $O$:
$$f_{\theta}(x) = \langle \psi | U^\dagger(\theta, x)\, O\, U(\theta, x) | \psi \rangle.$$
We can assume $|\psi\rangle$ is some fiducial state, such as $|0\rangle^{\otimes n}$ in the computational basis, without loss of generality.
As each setting of $\theta$ defines a (not necessarily unique) function, $f_\theta$ (with range limited by the spectrum of $O$), the set of unitaries defines a set of functions. We call this set the hypothesis class, to coincide with the common usage in machine learning. PQCs have been studied in other contexts, such as quantum chemistry or condensed matter physics. Although it is likely our approaches might generalise to these areas, in this work we focus on their application to machine learning.
Definition III.1 (PQC hypothesis class).
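The hypothesis class generated by the family of parameterised quantum circuits $\{U(\theta, x)\}_{\theta}$ together with an observable $O$ is given by
$$\mathcal{F}_{U,O} = \{\, f_{\theta} \mid \theta \in \mathbb{R}^{m} \,\},$$
where $m$ is the number of parameters in the model.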
These PQCs have proven popular, but the implementation of PQCs is currently tempered by the NISQ machines they run on. Notably the limited number of qubits available limits the width (defined henceforth as number of qubits the circuit acts on) of the circuit that can be run. It is the central concern of this work to produce a model as useful as PQCs of width larger than what the available machines naively permit.
III.2 Circuit Partitioning
In [5, 22] the authors propose methods to simulate large quantum circuits on smaller quantum machines by partitioning the circuit into smaller disconnected blocks. In this section we will introduce and then employ these methods on PQCs to decrease the size of quantum computer needed.
Consider a partition of the $n$ qubits into $K$ blocks, $P = \{P_1, \dots, P_K\}$, where $P_k \subseteq \{1, \dots, n\}$, such that the blocks are pairwise disjoint and together contain every qubit. Any 2-qubit gate can be written as a linear combination of tensor products of single qubit unitaries. In [5] this fact is used to decompose any particular 2-qubit gate into a gate of the form:
$$G = \sum_{i} c_i\, A_i \otimes B_i, \tag{2}$$
for some complex coefficients $c_i$ and for 2 dimensional unitaries, $A_i$ and $B_i$. The number of terms of the sum needed for any particular gate is given by its Schmidt number; generically this number is 4 for 2-qubit entangling gates, but for some important cases (including the CNOT and controlled-Z) only 2 terms are needed. For example, we can decompose the controlled-Z gate into single qubit gates as:
$$CZ = \frac{1-i}{2}\, S \otimes S + \frac{1+i}{2}\, S^\dagger \otimes S^\dagger, \tag{3}$$
where $S = \mathrm{diag}(1, i)$ is the phase gate. The identity (2) allows us to rewrite any particular 2-qubit gate as a sum of products of single qubit operators. Applying this method to every 2-qubit gate connecting two blocks of the partition decomposes the full unitary into a sum of tensor products of unitaries which individually act only on each block of the partition (figure 1).
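The two-term decomposition of the controlled-Z gate into phase gates can be checked numerically. The sketch below assumes the coefficients $(1-i)/2$ and $(1+i)/2$, which follow from matching the diagonal entries of $CZ$ against $S \otimes S$ and $S^\dagger \otimes S^\dagger$; the variable names are illustrative only.

```python
import numpy as np

S = np.diag([1, 1j])                          # phase gate
S_dag = S.conj().T
CZ = np.diag([1, 1, 1, -1]).astype(complex)   # controlled-Z

# Assumed two-term form: CZ = c0 * (S x S) + c1 * (S^dagger x S^dagger)
c0, c1 = (1 - 1j) / 2, (1 + 1j) / 2
reconstructed = c0 * np.kron(S, S) + c1 * np.kron(S_dag, S_dag)

print(np.allclose(reconstructed, CZ))         # True
```

Each term is a tensor product of single qubit unitaries, so either half of the cut can be run without the other.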
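An example is useful in illustrating this point. Suppose we are given a unitary $U$ which consists of two disconnected blocks apart from one 2-qubit gate, $G$, connecting the otherwise disjoint blocks, top ($t$) and bottom ($b$):
$$U = (W_t \otimes W_b)\, G\, (V_t \otimes V_b).$$
We can decompose this 2-qubit gate as $G = \sum_i c_i\, A_i \otimes B_i$. The full unitary can thus be written as:
$$U = \sum_i c_i\, (W_t A_i V_t) \otimes (W_b B_i V_b) = \sum_i c_i\, U_t^{i} \otimes U_b^{i}.$$
Suppose the initial state is $|0\rangle^{\otimes n}$ (which we will simply refer to as $|0\rangle$) and the measurement is the projection, $|0\rangle\langle 0|$ (using the previous notation for $|0\rangle$). We then have that
$$\langle 0 | U | 0 \rangle = \sum_i c_i\, \langle 0 | U_t^{i} | 0 \rangle\, \langle 0 | U_b^{i} | 0 \rangle,$$
and the expectation value is given by:
$$|\langle 0 | U | 0 \rangle|^2 = \sum_{i,j} c_i^{*} c_j\, \langle 0 | U_t^{i\dagger} | 0 \rangle \langle 0 | U_t^{j} | 0 \rangle\, \langle 0 | U_b^{i\dagger} | 0 \rangle \langle 0 | U_b^{j} | 0 \rangle,$$
which is a sum of products of inner products local to either element of the partition. This allows us to evaluate each smaller inner product individually and then combine them in a product and sum to replicate the expectation value of the full circuit. Depending on the observable it may be preferable to calculate the expectation value (i.e. the previous equation) or to calculate the inner product presented in the equation before and then take its squared magnitude to obtain the expectation value.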
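The recombination step in this example can be sketched numerically. The block below assumes two 2-qubit blocks joined by a single controlled-Z, random unitaries standing in for the rest of each block, and the phase-gate decomposition of the controlled-Z; the names (`V_t`, `W_t`, `amp_cut`, and so on) are hypothetical and not taken from the original.

```python
import numpy as np

def random_unitary(dim, rng):
    """Random unitary from the QR decomposition of a complex Gaussian matrix."""
    z = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
    q, _ = np.linalg.qr(z)
    return q

rng = np.random.default_rng(0)
I2 = np.eye(2, dtype=complex)
S = np.diag([1, 1j])
S_dag = S.conj().T
CZ = np.diag([1, 1, 1, -1]).astype(complex)

# Blocks: top = qubits 0,1 and bottom = qubits 2,3.
# V_* act before the cut gate, W_* after it.
V_t, V_b, W_t, W_b = (random_unitary(4, rng) for _ in range(4))

# Full 4-qubit circuit: the CZ couples qubit 1 (top) with qubit 2 (bottom).
G_full = np.kron(np.kron(I2, CZ), I2)
U_full = np.kron(W_t, W_b) @ G_full @ np.kron(V_t, V_b)
amp_full = U_full[0, 0]                       # <0|U|0>

# Partitioned evaluation: one pair of 2-qubit circuits per term of the cut gate.
terms = [((1 - 1j) / 2, S, S), ((1 + 1j) / 2, S_dag, S_dag)]
amp_cut = 0j
for c, A, B in terms:
    U_top = W_t @ np.kron(I2, A) @ V_t        # A replaces the CZ on qubit 1
    U_bot = W_b @ np.kron(B, I2) @ V_b        # B replaces the CZ on qubit 2
    amp_cut += c * U_top[0, 0] * U_bot[0, 0]

print(np.isclose(amp_full, amp_cut))          # True
```

Taking `abs(amp_cut) ** 2` then gives the expectation value of the projector onto the all-zero state, using only 2-qubit circuits.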
These results provide us with a clear path to achieve the central goal of this paper thus far, “How to fit a larger model on a smaller machine”. It is simply a matter of specifying a large PQC, then deciding on a partition that separates its initial state and measurement nicely. This partition defines a set of closely related circuits that differ only by the replacement of 2-qubit gates with single qubit gates. The next theorem encapsulates the partitioning of PQCs into a set of smaller subcircuits, and their recombination to recreate the result of the larger PQC.
Theorem III.1 (Partitioned model).
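For every function $f_{\theta} \in \mathcal{F}_{U,O}$ and qubit partition $P$ with observable $O = \bigotimes_{k=1}^{K} O_k$, there exists a set of coefficients $\{\lambda_l\}$ and unitaries $\{U_k^{l}, U_k^{\prime l}\}$ (where each $U_k^{l}$ and $U_k^{\prime l}$ acts on $|P_k|$ qubits) which can be combined in a function:
$$g_{\theta}(x) = \sum_{l} \lambda_l \prod_{k=1}^{K} \langle 0 | U_k^{l\dagger}(\theta, x)\, O_k\, U_k^{\prime l}(\theta, x) | 0 \rangle \tag{6}$$
such that $g_{\theta}(x) = f_{\theta}(x)$ for every $x$. For arbitrary gates the number of terms grows as $O(16^{r})$, where $r$ is the number of gates across the partition, but for cut gates with known Schmidt number it is the product of the squared Schmidt numbers of the cut gates.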
In many cases the same subcircuit (or its complex conjugate) appears multiple times in equation 6. By storing its value in classical memory the total number of circuit evaluations can be brought down to where is the number of gates across the partition (as mentioned in ).
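Mapping this theorem onto our example, with the cut gate decomposed as $G = \sum_i c_i\, A_i \otimes B_i$, the set of coefficients would be $\{c_i^{*} c_j\}_{i,j}$ and the set of unitaries would be the block circuits obtained by substituting $A_i$ (top block) and $B_i$ (bottom block) for $G$.
This example also illustrates the similarity of terms in equation 6, for every “top” circuit is identical up to the replacement of $A_i$.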
Theorem III.1 is useful to our goal: we can fit any large PQC on a small machine. However, we have paid a huge cost in the need to run an exponential number of smaller circuits. Indeed, given that most QML models are relatively densely connected and increasing depth can lead to improved performance, this exponential overhead in the number of cut connections is impractically costly. For example, a 2-block division of the hardware efficient ansatz up to depth 6, such as those considered in Jerbi et al. to solve a simple task, would require over 2 billion distinct sub-circuit evaluations. This rough estimation motivates us to revise our goal to “how to fit a larger model on a smaller machine in an acceptable number of circuit evaluations”.
If we are interested in exactly recreating the output of the circuit, this goal might be unattainable, unless we can find an exponential number of terms that perfectly cancel each other. There are fortunately several acceptable simplifications we can make to our goal. Firstly, we are not concerned with the exact replication of the unitary: since our input states are fixed to $|0\rangle$, we only care about the action of our recreated unitary on this state. Secondly, we may be content with approximate results, or perhaps even approximate results for most input data points. Finally, our ultimate goal for a machine learning model, in many cases, is simply to output a binary classifier (or another simple discrete set of outputs), so we are not interested in keeping terms which contribute similarly to other terms in the final assignment of a class label.
With this in mind we will now define the subset partition model as the best possible approximation of the full result in Theorem III.1 keeping only $L$ terms.
Definition III.2 (subset partition model).
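For a partitioned model, with set of unitaries $\{U_k^{l}, U_k^{\prime l}\}$, we define the $L$-subset partition model as a function using the optimal $L$-sized subset, $\mathcal{S}$, of terms given by:
$$f^{\mathrm{sub}}_{L}(x) = \sum_{l \in \mathcal{S}} \lambda'_l \prod_{k=1}^{K} \langle 0 | U_k^{l\dagger}(\theta, x)\, O_k\, U_k^{\prime l}(\theta, x) | 0 \rangle, \qquad |\mathcal{S}| = L,$$
where we have introduced free parameters $\lambda'_l$ that can also be optimised over. In the above definition $\mathcal{S}$ and $\lambda'$ are optimised to produce the best approximation of $f_{\theta}$, for some given success metric.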
This model is a step towards our goal: if we are given the model, it would be possible to run some approximation of the partitioned circuit on a small computer in acceptable time. However, we lack the capacity to choose the optimal set of “small circuits”; in general, choosing this set corresponds to a combinatorial optimisation problem. In the next section we will describe why this problem is challenging and produce a model that can work around it.
III.3 Reduced Partition Model
In the last section we tackled the problem of how to fit a large model on a smaller machine, but it required us to run an impractical number of circuits to achieve our goal. We introduced a model to get around this, but it was impractical to optimise. We now consider a situation where we are given a runtime “budget”, a hypothetical number of circuits, $L$, that we can afford to evaluate. Choosing which circuits to evaluate from the set generated by the partitioning so as to perform optimally is an incredibly challenging combinatorial optimisation problem. This process is additionally complicated by the apparent need to use a quantum computer to assess whether a circuit can be ignored. In this section we propose a relaxation of the problem: we parameterise the gates that replaced the 2-qubit gates in the circuit cutting process (henceforth called partition gates) such that all terms in the sum are identical up to these introduced parameters. The problem of optimising circuit selection then becomes one of optimising the parameters of the partition gates.
The first step of this process is parameterising the gates introduced by the partition. There are many options for doing this, for example when cutting the controlled Z we get the decomposition in equation 3, replacing the 2-qubit gate on either qubit by or . We then wish to create a new parameterised gate which takes a parameter such that the parameterised gate is when and when . composed with is one choice. Defining this way also allows us to extrapolate gates for , creating a continuous parameter we can use for e.g. gradient descent.
We can use this partitioned-gate-parameterisation trick to replace the set , with a new set, , with just one parameterised unitary for each block of the partition, with different terms of the sum differentiated only by different parameters .
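Lemma III.1.
For every $f_{\theta} \in \mathcal{F}_{U,O}$, there exists a set of unitaries $\{U_k\}_{k=1}^{K}$ and parameters $\{\lambda_l, \zeta_l, \zeta'_l\}$ defining a function:
$$g_{\theta}(x) = \sum_{l} \lambda_l \prod_{k=1}^{K} \langle 0 | U_k^{\dagger}(\theta, \zeta_l, x)\, O_k\, U_k(\theta, \zeta'_l, x) | 0 \rangle$$
such that $g_{\theta}(x) = f_{\theta}(x)$ for every $x$ and for every observable that can be written as a tensor product on the elements of the partition, $O = \bigotimes_{k} O_k$.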
In this lemma we have used our new free parameter $\zeta$ (collecting the partition gate parameters on both sides of the observable) to parameterise the partition gates. The parameters needed for these partition gates could be calculated from the partitioning theorem or trained through gradient descent. As mentioned, the advantage of this step is that now all terms of the sum are equivalent to each other up to the weight parameters $\lambda$ and $\zeta$. This is useful in the final model, where we reduce the number of terms to $L$ and then allow these parameters to learn freely, making the model capable of replicating any $L$ terms present in the original model by changing $\lambda$ and $\zeta$.
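A toy instance may make the lemma concrete. The sketch below assumes a 2-qubit circuit (RY encoding, one controlled-Z, RY trainable layer) cut into two 1-qubit blocks, and a hypothetical partition gate $\mathrm{diag}(1, e^{i\zeta})$ that equals $S$ at $\zeta = \pi/2$ and $S^\dagger$ at $\zeta = -\pi/2$; this parameterisation and all function names are assumptions made for illustration, not the construction used in the original.

```python
import numpy as np

def ry(angle):
    c, s = np.cos(angle / 2), np.sin(angle / 2)
    return np.array([[c, -s], [s, c]], dtype=complex)

def partition_gate(zeta):
    # Hypothetical parameterisation: S at zeta = pi/2, S^dagger at zeta = -pi/2.
    return np.diag([1.0, np.exp(1j * zeta)])

def block_state(theta, zeta, x):
    # U_k(theta, zeta, x)|0>: encode x, apply partition gate, apply trainable rotation.
    return (ry(theta) @ partition_gate(zeta) @ ry(x))[:, 0]

def reduced_model(lams, zetas, zetas_prime, thetas, xs, observables):
    """sum_l lam_l * prod_k <0|U_k(zeta_l)^dagger O_k U_k(zeta'_l)|0>."""
    total = 0j
    for lam, zeta, zeta_p in zip(lams, zetas, zetas_prime):
        term = lam
        for k, obs in enumerate(observables):
            bra = block_state(thetas[k], zeta[k], xs[k]).conj()
            ket = block_state(thetas[k], zeta_p[k], xs[k])
            term = term * (bra @ obs @ ket)
        total += term
    return total

Z = np.diag([1.0, -1.0]).astype(complex)
CZ = np.diag([1, 1, 1, -1]).astype(complex)
thetas, xs = [0.3, -1.1], [0.7, 2.0]

# Unpartitioned 2-qubit circuit and its <Z x Z> expectation value.
psi = np.kron(ry(thetas[0]), ry(thetas[1])) @ CZ @ np.kron(ry(xs[0]), ry(xs[1]))[:, 0]
f_full = (psi.conj() @ np.kron(Z, Z) @ psi).real

# Four terms: one per (bra, ket) pair of CZ decomposition terms.
coeff = {np.pi / 2: (1 - 1j) / 2, -np.pi / 2: (1 + 1j) / 2}
lams, zetas, zetas_prime = [], [], []
for z_bra, c_bra in coeff.items():
    for z_ket, c_ket in coeff.items():
        lams.append(np.conj(c_bra) * c_ket)
        zetas.append([z_bra, z_bra])
        zetas_prime.append([z_ket, z_ket])

f_reduced = reduced_model(lams, zetas, zetas_prime, thetas, xs, [Z, Z])
print(np.isclose(f_reduced, f_full))          # True
```

Dropping entries from `lams`, `zetas` and `zetas_prime` and treating the remainder as trainable gives a model of the reduced form with fewer terms.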
Definition III.3 (Reduced partition model).
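For a PQC hypothesis class $\mathcal{F}_{U,O}$, we define the reduced $L$-subset partition model as the family of functions where each function is given by
$$f_{\theta,\lambda,\zeta}(x) = \sum_{l=1}^{L} \lambda_l \prod_{k=1}^{K} \langle 0 | U_k^{\dagger}(\theta, \zeta_l, x)\, O_k\, U_k(\theta, \zeta'_l, x) | 0 \rangle, \tag{9}$$
where the unitaries $U_k$ are those described in Lemma III.1 and we have introduced entirely free parameters $\lambda$ and $\zeta$ that can be optimised over. This can also be referred to as a “reduced partition model” when $L$ is to be specified later.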
This new model introduces more free parameters, $\zeta$, into our model; fortunately their number scales only with the number of terms and the number of cut gates.
The reduced partition model can now use the similarity of the terms of equation 9 (they are identical up to the weight vector, $\zeta$) to replicate any subset of terms taken from the partitioned model by simply adjusting the parameters $\lambda$ and $\zeta$. This is stated formally in the following theorem.
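Theorem III.2.
For any PQC hypothesis class $\mathcal{F}_{U,O}$, the $L$-subset partition hypothesis class $\mathcal{F}^{\mathrm{sub}}_{L}$ is included in the hypothesis class of the reduced $L$-subset partition model $\mathcal{F}^{\mathrm{red}}_{L}$, i.e.,
$$\mathcal{F}^{\mathrm{sub}}_{L} \subseteq \mathcal{F}^{\mathrm{red}}_{L}.$$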
That is, if a given classifier can be sufficiently approximated by considering only $L$ terms, then the hypothesis class of the reduced partition model can do at least as well as this approximation. This is the potential advantage of our model. Additionally, the relaxation from manually picking terms to optimising continuous parameters allows us to apply gradient based methods. Relaxations of this kind generally yield much easier optimisation, but often suffer because the solutions of the relaxation do not encode meaningful solutions of the original problem (which is discrete in nature). In our case, however, since we deal with QML, all the relaxation achieves is an expansion of the hypothesis class, in which any solution is meaningful, and optimisation (if done completely) can only yield better results with respect to the training error. Expanding the hypothesis class may, however, worsen generalisation performance, often evidenced by looser generalisation bounds. We analyse these in the next section.
IV Generalisation Error
In creating the reduced partition model, we partitioned the circuit and removed terms, which intuitively makes the model simpler, but then introduced free parameters, making the model more complicated and increasing the size of the (reduced model) hypothesis class. In this section we will formalise this change in simplicity with the notion of generalisation error, defined roughly as the gap between performance on a training set and performance on unseen data from the same distribution.
We will only briefly and roughly define the few concepts that are needed; readers keen to see a more thorough treatment are referred to the literature.
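We define a supervised learning task on a domain, $\mathcal{X}$, and co-domain, $\mathcal{Y}$, with a probability distribution, $D$, over $\mathcal{X} \times \mathcal{Y}$, and loss function, $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$, as the task of outputting a hypothesis, $h : \mathcal{X} \to \mathcal{Y}$, such that the risk, $R(h)$, is minimised. We define the risk for a hypothesis $h$ on a continuous space as the loss for a point weighted by its probability mass:
$$R(h) = \int_{\mathcal{X} \times \mathcal{Y}} \ell(h(x), y)\, \mathrm{d}D(x, y).$$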
In practical settings we normally lack access to the underlying probability distribution, so the true risk cannot be evaluated. Instead we are supplied with training data drawn from $D$, $S = \{(x_i, y_i)\}_{i=1}^{N}$, and must settle for evaluating the risk on this finite set. We call this the empirical risk of $h$ with respect to $S$:
$$\hat{R}_S(h) = \frac{1}{N} \sum_{i=1}^{N} \ell(h(x_i), y_i).$$
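As a small illustration of the empirical risk, the sketch below averages a loss over a finite sample. The 0-1 loss, the threshold hypothesis and the four data points are all invented for the example.

```python
import numpy as np

def empirical_risk(hypothesis, xs, ys, loss):
    """Average loss of `hypothesis` over the sample {(x_i, y_i)}."""
    return float(np.mean([loss(hypothesis(x), y) for x, y in zip(xs, ys)]))

def zero_one_loss(prediction, label):
    return float(prediction != label)

def threshold(x):
    # Hypothetical classifier: predict 1 when x exceeds 0.5.
    return int(x > 0.5)

xs = np.array([0.1, 0.4, 0.6, 0.9])
ys = np.array([0, 1, 1, 1])

risk = empirical_risk(threshold, xs, ys, zero_one_loss)
print(risk)   # 0.25 (one of the four points is misclassified)
```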
Optimising our hypothesis on the training data optimises the empirical risk, which is generally a good proxy for the true risk. The gap between these two risks is bounded by generalisation bounds, specifically by a generalisation gap function, $\mathrm{gen}$, which can depend on many properties given the setting. For our purpose we will consider it as a function of the hypothesis class, $\mathcal{F}$, the size of the training set, $N$, and the probability of the output classifier failing to satisfy the bound, $\delta$.
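We can arrive at a probabilistic bound on the generalisation gap for all $h \in \mathcal{F}$:
$$\Pr\!\left[\, |R(h) - \hat{R}_S(h)| \le \mathrm{gen}(\mathcal{F}, N, \delta) \,\right] \ge 1 - \delta.$$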
Intuitively the gap has to do with the concept of “overfitting”. Simple models tend to have much smaller generalisation gaps. A function which outputs random labels and does no learning has a generalisation gap of 0 but a large empirical risk. In contrast, some complex and large models are found to “overfit” data: the empirical risk drops to near 0 but the generalisation of the model is very poor, with poor performance on data points not seen during training. Since our model contains more parameters than the model we derived it from, we might fear we have slid into the category of poor generalisation and good empirical risk.
We study this question by using two well-known methods to study the generalisation performance of quantum circuits. In the analysis we will see that the first method does not distinguish our model from the PQC it was derived from. This is positive in that the new parameters did not decrease performance according to this bound, but negative in that these bounds do not identify the number of terms, $L$, as a parameter influencing model complexity at all, which intuitively it should. To overcome this, we analyse the additive property of the Rademacher complexity to examine how adding terms to our model increases the generalisation error.
IV.1 Encoding Dependent Generalisation Gap
One insightful analysis of generalisation performance is based on the data encoding; we will show that its bounds apply directly to our model. The analysis first imports an earlier result, that the output of any PQC, $f_{\theta}(x)$, is a generalised trigonometric polynomial (GTP):
$$f_{\theta}(x) = \sum_{\omega \in \Omega} c_{\omega}(\theta)\, e^{i \omega \cdot x},$$
where the effect of all the parameterised gates and the measurement is only reflected in the coefficients $c_{\omega}$. The frequencies available in the GTP ($\Omega$) are determined entirely by the input data’s encoding strategy, specifically the eigenvalue spectra of the Hamiltonians encoding the input data, typically as rotation gates. Further study of the spectra of frequencies, $\Omega$, is available in the aforementioned works.
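With very similar analysis it can also be shown that a GTP of this form exists for each term of our sum. Consider a single term,
$$\prod_{k=1}^{K} \langle 0 | U_k^{\dagger}(\theta, \zeta_l, x)\, O_k\, U_k(\theta, \zeta'_l, x) | 0 \rangle;$$
this is equivalent to reuniting the $U_k$ and $O_k$ from product form, and combining them into:
$$\langle 0 | \tilde{U}^{\dagger}(\theta, \zeta_l, x)\, O\, \tilde{U}(\theta, \zeta'_l, x) | 0 \rangle, \qquad \tilde{U} = \bigotimes_{k=1}^{K} U_k.$$
This term is now an inner product of a form very similar to the PQC it is derived from. Indeed, if the encoding gates are untouched by the partitioning scheme then $\tilde{U}$ has the same encoding gates, and it can be shown that it admits a representation as a GTP of the same form, with the exact same spectra, $\Omega$. Our new GTP will contain different (and now possibly complex) coefficients $c^{(l)}_{\omega}$.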
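Since each term can be represented as a GTP with the same $\Omega$ we are able to combine them into another GTP:
$$f(x) = \sum_{l=1}^{L} \lambda_l \sum_{\omega \in \Omega} c^{(l)}_{\omega}\, e^{i \omega \cdot x} = \sum_{\omega \in \Omega} c'_{\omega}\, e^{i \omega \cdot x},$$
with new weights: $c'_{\omega} = \sum_{l=1}^{L} \lambda_l\, c^{(l)}_{\omega}$.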
This defines a new GTP of exactly the same degree and frequency set $\Omega$ as the full sized circuit which we originally partitioned. Performing the encoding-based analysis for our circuit therefore gives bounds identical to those for the whole (unpartitioned) model.
As our model dramatically differs in the number of terms (which ought to decrease the gap), yet is much more general in the parameters that are free (which should increase the complexity), we see that this bounding technique is quite coarse grained. In particular, even pure product models (no entangling gates), which are trivially classically simulatable, have the same bounds. The fact that the GTP approach yields somewhat loose bounds has been emphasised in prior work. We must tighten our analysis to achieve a meaningful bound; we do this in the next subsection.
IV.2 Term-Based Generalisation Gap
Under the previous analysis we saw that the best upper bounds of one encoding dependent generalisation error for the unpartitioned model matched those of our new model. Since that analysis only considered the encoding (an element unperturbed by our modification) it seems prudent to consider another approach, one that fundamentally considers the increasing number of terms.
In this subsection we will use the additive property of the Rademacher complexity (a metric that can be used to bound the generalisation gap) to examine how adding terms to the form of our model increases the generalisation error. Towards the end of this section we also begin to analyse how the structure of a circuit affects its generalisation bounds. Specifically, we use that for two families of functions $\mathcal{F}$ and $\mathcal{G}$, the Rademacher complexity of the family $\mathcal{F} + \mathcal{G}$ (i.e., the so-called Minkowski sum, in this case the set of sums of function pairs from $\mathcal{F}$ and $\mathcal{G}$) is bounded by
$$\mathcal{R}(\mathcal{F} + \mathcal{G}) \le \mathcal{R}(\mathcal{F}) + \mathcal{R}(\mathcal{G}). \tag{19}$$
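The extension of this result to our case is then trivial. As in the previous section we consider a single term of the sum, combine the product into a single inner product, and view this as a function
$$f_l(x) = \langle 0 | \tilde{U}^{\dagger}(\theta, \zeta_l, x)\, O\, \tilde{U}(\theta, \zeta'_l, x) | 0 \rangle, \qquad \tilde{U} = \bigotimes_{k=1}^{K} U_k.$$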
If the Rademacher complexity of the function family of these individual terms is upper bounded by $\mathcal{R}_{\mathrm{term}}$, then by applying Eq. (19) directly to our classifier we find that the Rademacher complexity of the partitioned model (i.e., when we consider $L$ terms) is upper bounded by $L\, \mathcal{R}_{\mathrm{term}}$. It is not surprising that we can limit the generalisation error by including fewer terms, since every term makes the model more expressive. This fact is also reassuring: limiting the number of terms saves us not only in generalisation risk but also in runtime. This means that the hyperparameter $L$ of the model can be used to perform structural risk minimisation.
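The additivity used here can be illustrated with a Monte Carlo estimate of the empirical Rademacher complexity on finite function families. The families below are random value tables chosen purely for illustration, not PQC outputs, and the helper name `empirical_rademacher` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n_points, n_draws = 20, 2000

# Each row holds one function's values on the n_points sample inputs.
F = rng.uniform(-1, 1, size=(5, n_points))
G = rng.uniform(-1, 1, size=(7, n_points))
# Minkowski sum F + G: every pairwise sum f + g.
F_plus_G = (F[:, None, :] + G[None, :, :]).reshape(-1, n_points)

# Shared Rademacher draws (uniform +/-1 signs), one row per draw.
sigmas = rng.choice([-1.0, 1.0], size=(n_draws, n_points))

def empirical_rademacher(values, sigmas):
    """Mean over sign draws of sup_f (1/N) sum_i sigma_i f(x_i)."""
    correlations = sigmas @ values.T / values.shape[1]
    return correlations.max(axis=1).mean()

r_f = empirical_rademacher(F, sigmas)
r_g = empirical_rademacher(G, sigmas)
r_sum = empirical_rademacher(F_plus_G, sigmas)

print(r_sum <= r_f + r_g + 1e-12)   # True
```

For these finite families the supremum over pairs separates into a sum of two suprema, so the estimate for the sum coincides with `r_f + r_g` when the same sign draws are used; applying this repeatedly gives the factor equal to the number of terms.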
If we denote the Rademacher complexity of the original unpartitioned model by $\mathcal{R}_{\mathrm{full}}$, and that of a single term by $\mathcal{R}_{\mathrm{term}}$, then it seems very likely that generally $\mathcal{R}_{\mathrm{term}} \le \mathcal{R}_{\mathrm{full}}$, as the term is of the same form as the original PQC but with some 2-qubit gates cut. For intuition as to why this should be the case, consider the extremal case of a (deep) circuit with single-qubit parametrised gates interleaved with CNOTs. This is a universal circuit if of sufficient depth, so it is highly expressive. In contrast, the extreme variant where all CNOT gates are removed is a product model, which collapses to a single layer of parameterised (arbitrary) single qubit gates and is less expressive than universal circuits of the same size. This intuition cannot be proved for all circuits (one can construct contrived examples where cutting gates in fact raises the Rademacher complexity, e.g. in appendix A), but insight can be gained from other methods of bounding PQC model complexities. One such bound is on the so-called pseudodimension of a circuit, based on the structure of said circuit. To our knowledge pseudodimension does not satisfy the simple Minkowski-sum additivity properties we have previously applied, so we convert the pseudodimension bound into a bound on empirical Rademacher complexity (w.r.t. a training set of size $N$) using [8, Chapter 3][20, 31]. To illustrate how these methods may lead to useful bounds, it is convenient to study a PQC comprising a brick-wall of fully parametrised 2-qubit gates, similar to figure 1. If there are $T$ total gates then we can bound the Rademacher complexity:
where is some universal constant originating from the transition from pseudodimension to Rademacher complexity. Thus this bound applies to each of the terms of our model.
Cutting 2-qubit gates across the partition presents opportunities to absorb gates acting on the same qubits into a single gate, reducing the total gate count. The exact effect this has on our bound of the Rademacher complexity for each term will vary for each architecture and each partition. If we consider the circuit in figure 1, cutting the 2-qubit gate between the peach coloured and mint blocks naively increases the number of gates. However, now that the partitions are not linked we can absorb all the gates on either partition into just 2 unitaries, one for the top partition and one for the bottom. The new unitaries are equal to the product of all the unitaries on each partition. For this circuit, cutting the 2-qubit gate across the partitions reduces the gate count from 6 to 2, lowering the bound on the Rademacher complexity.
The effect of cutting many gates can be much more extreme: for example in a setting with many 1 and 2 qubit gates, due to non-commutativity, we can end up with circuits of arbitrary depths. However, if we cut all 2-qubit gates and use the fact that fully parametrised gates can absorb neighbouring fully-parameterised gates, we can reduce the entire circuit to a tensor product of single qubit unitaries, with a dramatically tighter upper bound on the pseudodimension and hence Rademacher complexity.
Note that in the above analysis we used the fact that the circuit model upon which the pseudodimension bounds operate works with fully parametrised gates and allows certain rewritings, which allowed us to lower the bounds. The pseudodimension approach does not in itself “count” or measure the entangling capacity directly; rather, this is achieved via cutting and rewriting. Further work could focus on using the cutting and rewriting we present here to improve existing generalisation results.
In some cases applying similar rewriting logic can also improve the bounds obtained by the GTP approach of the last subsection. For example, consider the case where two commuting encoding gates are separated by some 2-qubit gate which does not commute with either encoding gate. The removal of this 2-qubit gate (or just some of its partitioned terms) means the two encoding gates can be combined into a single gate, reducing the available set of frequencies and lowering the resulting generalisation bound. The scope of such rewriting methods to obtain tighter bounds (which are then circuit-structure-sensitive) is of course very much dependent on the particularities of the circuit, but is a possibly interesting area of study. We leave the further details for future work.
V Evaluation and training of reduced partition model
In this section we look at how one can evaluate the circuits, what error this would entail, and how the model might be trained. We also speculate on a possible feature of partitioned PQCs that might mitigate the effects of so-called “barren plateaus”.
Evaluation of the reduced partition model is a non-trivial task: the terms are composed not of expectation values (which can be evaluated with simple circuits) but of inner products, with different unitaries, $U$ and $U'$, on either side of the observable. Fortunately this is not an insurmountable problem. To evaluate these inner products we can employ the Hadamard test shown in figure 2. The most challenging component of this circuit is the application of a controlled unitary; naively this would require controlled gates for every gate in $U$ and in $U'$. Fortunately this is not the case. Since $U$ and $U'$ differ only by the partition gates, the controlled circuits can be constructed with controlled operations only on these partition gates, which are a small subset of the total number of gates in the circuit.
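The principle can be sketched with an exact statevector simulation of the textbook Hadamard test. This is an assumption about the general technique and may differ in detail from the circuit in figure 2; in particular, it controls the whole operator rather than only the partition gates. Random single qubit unitaries stand in for the two block circuits.

```python
import numpy as np

def random_unitary(dim, rng):
    """Random unitary from the QR decomposition of a complex Gaussian matrix."""
    z = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
    q, _ = np.linalg.qr(z)
    return q

rng = np.random.default_rng(2)
U, V = random_unitary(2, rng), random_unitary(2, rng)   # stand-ins for the two sides
Z = np.diag([1.0, -1.0]).astype(complex)                # Pauli observable (unitary)
W = U.conj().T @ Z @ V                                  # operator whose <0|W|0> we want

H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
I2 = np.eye(2, dtype=complex)
P0 = np.diag([1, 0]).astype(complex)
P1 = np.diag([0, 1]).astype(complex)
controlled_W = np.kron(P0, I2) + np.kron(P1, W)         # ancilla is the first qubit

state = np.zeros(4, dtype=complex)
state[0] = 1.0                                          # ancilla and system in |0>
state = np.kron(H, I2) @ controlled_W @ np.kron(H, I2) @ state

p0 = np.sum(np.abs(state[:2]) ** 2)                     # ancilla measured as 0
p1 = np.sum(np.abs(state[2:]) ** 2)                     # ancilla measured as 1
estimate = p0 - p1                                      # equals Re <0|U^dagger Z V|0>

exact = (U[:, 0].conj() @ Z @ V[:, 0]).real
print(np.isclose(estimate, exact))                      # True
```

The imaginary part is obtained in the standard way by inserting an $S^\dagger$ on the ancilla before the final Hadamard; on hardware `p0` and `p1` would be estimated from a finite number of shots.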
As with all NISQ applications we must also concern ourselves with the impact of errors on our results. To analyse this in our model we will consider a first order additive error and find the total error satisfactory.
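First, let us replace the non-random inner products, $a_{l,k} = \langle 0 | U_k^{\dagger}(\theta, \zeta_l, x)\, O_k\, U_k(\theta, \zeta'_l, x) | 0 \rangle$, with random variables $\hat{a}_{l,k}$ which are unbiased estimators of the inner product (that is, $\overline{\hat{a}_{l,k}} = a_{l,k}$, where the bar now represents the expectation value). These random variables represent an estimation of the inner product with a finite number of shots on a quantum computer. We are interested in the difference between this noisy estimation, $\hat{f}$, and $f$. That is:
$$|\hat{f}(x) - f(x)| = \Big| \sum_{l=1}^{L} \lambda_l \prod_{k=1}^{K} \hat{a}_{l,k} - \sum_{l=1}^{L} \lambda_l \prod_{k=1}^{K} a_{l,k} \Big|.$$
Using an approximation ($\hat{a}_{l,k} = a_{l,k} + \epsilon_{l,k}$) and combining the sums we can expand the problem:
$$|\hat{f}(x) - f(x)| = \Big| \sum_{l=1}^{L} \lambda_l \Big( \prod_{k=1}^{K} (a_{l,k} + \epsilon_{l,k}) - \prod_{k=1}^{K} a_{l,k} \Big) \Big|;$$
expanding the product and ignoring terms higher than linear in $\epsilon$:
$$|\hat{f}(x) - f(x)| \approx \Big| \sum_{l=1}^{L} \lambda_l \sum_{k=1}^{K} \epsilon_{l,k} \prod_{k' \neq k} a_{l,k'} \Big|.$$
Assuming the magnitude of each inner product is bounded by $B$ (which for Pauli string observables is 1) and that each $|\epsilon_{l,k}| \le \epsilon$, we have our final error:
$$|\hat{f}(x) - f(x)| \lesssim \epsilon\, K\, B^{K-1}\, \|\lambda\|_1.$$
We recall that the dimension of $\lambda$ is $L$, so for bounded $\lambda_l$ and $B = 1$ this error is approximately linear in both $K$ and $L$, an acceptable error.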
Training with a gradient based approach is easy to apply in our model too. The derivative distributes over the terms of the sum and can be evaluated by applying the product rule to the product in each term. Indeed, since most parameters appear in only one gate on one qubit on one side of the partition, the product rule evaluates to 0 on all but 1 element of the product. The cost of evaluating the gradient then scales with the number of terms, $L$, times the number of evaluations required to evaluate the gradient of one of the smaller circuits. In this case we find that the gradient for any parameter, $\theta_j$, that exists only in the $k$th partition is:
$$\frac{\partial f}{\partial \theta_j} = \sum_{l=1}^{L} \lambda_l\, \frac{\partial}{\partial \theta_j} \Big[ \langle 0 | U_k^{\dagger}(\theta, \zeta_l, x)\, O_k\, U_k(\theta, \zeta'_l, x) | 0 \rangle \Big] \prod_{k' \neq k} \langle 0 | U_{k'}^{\dagger}(\theta, \zeta_l, x)\, O_{k'}\, U_{k'}(\theta, \zeta'_l, x) | 0 \rangle.$$
The same applies for the $\zeta$ parameters. In many instances the gradient can be made easier to compute: since we have often already evaluated the non-derivative expression before looking for the gradient, most of the circuit evaluations are already done, with only the derivative expression for a single inner product requiring a new evaluation. This can be done in the standard manner (e.g. the parameter shift rule).
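The reuse of cached block values can be sketched for a single term with two 1-qubit blocks. For simplicity the sketch assumes the same unitary on both sides of the observable, so each factor is an ordinary expectation value and the standard parameter shift rule for rotation gates applies; the function names and angles are illustrative only.

```python
import numpy as np

def ry(angle):
    c, s = np.cos(angle / 2), np.sin(angle / 2)
    return np.array([[c, -s], [s, c]], dtype=complex)

Z = np.diag([1.0, -1.0]).astype(complex)

def block_expectation(theta):
    """<0|RY(theta)^dagger Z RY(theta)|0>, which equals cos(theta)."""
    state = ry(theta)[:, 0]
    return (state.conj() @ Z @ state).real

def parameter_shift(fn, theta):
    """Parameter shift rule for a gate generated by a Pauli operator."""
    return 0.5 * (fn(theta + np.pi / 2) - fn(theta - np.pi / 2))

theta_1, theta_2 = 0.4, 1.3

# Block 2 does not depend on theta_1, so its cached value is reused unchanged.
cached_block_2 = block_expectation(theta_2)
grad_theta_1 = parameter_shift(block_expectation, theta_1) * cached_block_2

analytic = -np.sin(theta_1) * np.cos(theta_2)
print(np.isclose(grad_theta_1, analytic))   # True
```

Only the block containing the differentiated parameter needs new circuit evaluations; every other factor of the product is read from memory.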
V.3 Barren Plateaus
A well studied problem with PQCs is the “barren plateaus” phenomenon, where large parts of the parameter landscape have an exponentially small gradient, effectively crippling optimisation. This is a manageable problem for currently implementable PQCs due to their limited size, but as PQCs become larger (and their gradients decrease) the problem intensifies. While our model is not immune to barren plateaus, we may be able to reduce their effect on our model relative to their effect on the unpartitioned circuit.
Each term of our model is a multiplicatively separable function (it can be written as a product of factors, where each factor is an inner product whose input includes the data and weights). For illustrative purposes we simplify by assuming this input is a single parameter. To calculate the gradient we apply the product rule; for most architectures any particular parameter appears in only one block of the partition, so only one term of the product rule is non-zero:
The gradient is thus determined by multiplying together the many amplitudes stemming from the subcircuits of the sum with this lone gradient term (equation 21). Two aspects may make this overall gradient small. First, the lone gradient term may itself be small, as it comes from a PQC and is prone to barren plateaus; however, the individual subcircuits generically have larger gradients than the full unpartitioned circuit as they are smaller (i.e., the barrenness of the plateaus depends heavily on the number of qubits in the circuit). Second, the multiplication with the other terms may cause it to decay to zero: we are dealing with a product of terms whose absolute values are below 1, so the product decays exponentially in the number of multiplicative terms. However, in our case we are not facing this radically smaller number blindly. We fundamentally have more information about the gradient: we know the total gradient, but also the terms that are combined to form it, and hence the effect that varying any of these subterms has on the gradient of the complete circuit. One possible use of this information is to identify which term is driving the gradient to a small value and to revert its parameters to an earlier instance stored in memory; through this method the impact of barren plateaus could be mitigated. A technique similar to [26] could be developed to avoid low-gradient directions, utilising the additional information present in our case.
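As a rough illustration of the idea of inspecting the individual factors, the sketch below takes the factors of a single term and reports which one is most responsible for a small gradient. The helper `smallest_factor` and the numbers are hypothetical; the text leaves the actual training procedure to follow-up work, so this should be read only as one way the extra information might be exposed.

```python
import numpy as np

def term_gradient(factors, d_factor, k):
    # Gradient of one term: derivative of factor k times all other factors.
    others = np.delete(np.asarray(factors, dtype=float), k)
    return float(d_factor * np.prod(others))

def smallest_factor(factors, d_factor, k):
    # Replace factor k by its derivative, then find the entry of smallest
    # magnitude: the block contributing most to the decay of the gradient.
    contributions = np.abs(np.asarray(factors, dtype=float))
    contributions[k] = abs(d_factor)
    return int(np.argmin(contributions))

# Hypothetical values for a term with four blocks; block 2 is nearly zero.
factors = [0.9, 0.8, 1e-4, 0.95]
d_factor = 0.5   # derivative of the factor in block k = 0
k = 0

total = term_gradient(factors, d_factor, k)
culprit = smallest_factor(factors, d_factor, k)
print(total, culprit)

# A product of K factors of magnitude 0.5 decays as 0.5**K.
decay = [0.5 ** K for K in (2, 4, 8, 16)]
print(decay)
```

Here the total gradient alone would only reveal that it is small, whereas the per-factor view points at block 2 as the candidate whose parameters one might revert.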
There is a substantial body of research on additive separable functions, which may transfer to our case [33]. This could lead to significantly easier training. We plan to develop this method of training in a follow-up work.
In the previous sections we laid out a model with a considered theoretical underpinning. In this section we demonstrate that model's basic utility by showing it can learn a simple but large problem, MNIST handwritten digit recognition, using an ansatz much larger than the computer it has simulated access to. We also present an experiment designed to test whether an adequate approximation of a random circuit's output can be made with far fewer terms; we then apply our model to the same random circuit's output to test its performance on synthetic data.
VI.1 A Large Problem: Handwriting
Reading handwritten numbers is one of the most basic tasks in undergraduate machine learning courses. The MNIST [10] data set presents a relatively simple task: identify which digit is written in an image. Even this simple task is difficult for current-generation quantum machines due to its high dimensionality, with quantum attempts only succeeding recently through careful encoding of the problem. Often dimensionality reduction techniques such as principal component analysis are applied, but for a simple problem like MNIST handwriting this reduces the learning problem to a triviality. Here we preserve the learning problem by downsampling the image to just 64 pixels, which, importantly, is still human readable. We will show that even simple cases of our model perform adequately, and that by increasing L (the number of terms of our model) we increase that performance.
For purposes of comparison we reduce the problem to differentiating 3 and 6, as in [11]. Our model is based on an 8 block partitioning of the 64 qubit, depth 3 hardware efficient ansatz into 8 qubit blocks, a model which would normally be far outside of our computational power. An unseen validation set is evaluated at every step of training and the results are shown in figure 4. The final training loss (MSE) and testing loss (MSE) are shown in the following table; we also apply a step function to the output (to convert its real-valued output into a binary label) and list its accuracy.
Data augmentation (skews and rotations) was used to generate more data for the model. Without this augmentation, models with a high number of terms began to overfit, increasing the training performance while decreasing the validation performance. With data augmentation we can see that our model is behaving well: even in the 1 term case it selects a good arrangement of weights, and with relatively few additional terms the performance increases. By contrast, running this model using the complete circuit partitioning scheme would require the evaluation of over 46000 subcircuits. A neural network with a convolution layer and a single dense 128 neuron hidden layer is provided for comparison. We must consider that our results are on MNIST handwriting, which is known to have many problems as a benchmark, and so they cannot be used to claim that our model excels on all similarly large tasks.
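The two reported metrics for the 3-versus-6 task, the MSE and the accuracy after a step function, can be computed as in the following minimal sketch. The label encoding (+1 and -1) and the threshold at 0 are assumptions made for illustration, as are the example outputs; none of these values come from the experiments.

```python
import numpy as np

def mse(outputs, targets):
    # Mean squared error between real-valued model outputs and labels.
    return float(np.mean((outputs - targets) ** 2))

def step_accuracy(outputs, targets, threshold=0.0):
    # Step function: outputs at or above the threshold become +1, others -1
    # (assumed encoding), then compare with the true labels.
    predicted = np.where(outputs >= threshold, 1.0, -1.0)
    return float(np.mean(predicted == targets))

# Hypothetical model outputs and labels for four validation images.
outputs = np.array([0.7, -0.2, 0.1, -0.9])
targets = np.array([1.0, -1.0, -1.0, -1.0])

loss = mse(outputs, targets)
accuracy = step_accuracy(outputs, targets)
print(loss, accuracy)
```

The two numbers can disagree in ranking models: a confident wrong output raises the MSE sharply but counts as a single misclassification in the accuracy.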
VI.2 Tests on Synthetic Data
This work is built around the assumption that many terms in the partitioned equation for a given circuit are redundant, and a good approximation of the complete circuit can be made by our model. In this section we test this assumption with our first experiment, and then test our complete model on learning a synthetic data set in the second and third experiments.
We take a width 10, depth 3 instance of the hardware efficient ansatz with random weights. Using this circuit we generate a synthetic data set by recording its output on random inputs; we normalise these outputs to a mean squared average of 1. We then instantiate a modified version of our model corresponding to a partitioning of the full circuit into 2 blocks of width 5. The model is modified from the general model we have described above by fixing the weights inherited from the unpartitioned PQC and training only the weights we introduced when creating the model. This modification allows us to examine directly our claim that the introduction of these free parameters is sufficient to approximate the output of the full PQC without evaluating the many subcircuits that would be required in Theorem III.1. After this experiment we free the inherited weights (that is, we apply the full model) and examine the increased performance this gives us.
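The normalisation step mentioned above can be written in a few lines. This is a minimal sketch with a hypothetical function name and randomly generated stand-in outputs; it simply rescales the recorded outputs so that their mean square equals 1.

```python
import numpy as np

def normalise_mean_square(y):
    # Divide by the root mean square so that mean(y**2) becomes 1.
    y = np.asarray(y, dtype=float)
    return y / np.sqrt(np.mean(y ** 2))

rng = np.random.default_rng(2)
raw_outputs = rng.uniform(-0.3, 0.3, size=1000)  # stand-in for circuit outputs
normalised = normalise_mean_square(raw_outputs)
print(np.mean(normalised ** 2))
```

With this scaling, a model that always predicts zero has a validation MSE of 1, which gives the reported MSE values a natural reference point.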
The results of our experiment are shown in figure 5. The benefits of increasing L are more apparent than in the digit recognition experiment: better approximations are made at higher L. For some applications more accuracy might be required, and it seems that increasing L further will continue to improve this accuracy. Notably, all values of L considered are orders of magnitude below the number of terms or circuits needed to apply the existing partitioning schemes. The final mean squared error for unseen data, averaged over 5 random data sets, is presented in the following table:
|L||Final validation MSE|
We have included a neural network with a single dense hidden layer of 256 neurons for comparison purposes. Other neural network architectures (1 and 2 hidden layers, with 64 and 256 neurons per layer for each) were tried without meaningful improvement, although it is possible that with thorough tuning these architectures or others could be made to perform strongly.
The previous results are sufficient to show that training just the newly introduced parameters can lead to models with substantially fewer terms, L, while still sufficiently approximating the full circuit in this instance. This approximation was achieved by training only these parameters, while fixing the weights inherited from the PQC to those that were used to generate the data. However, it is not clear, a priori, that the reduced model should use the same parameters to best mimic the full model. We now allow the inherited weights to deviate from those of the generating PQC; the resulting mean squared error for unseen data after 20 epochs is presented in the following table:
|L||Final validation MSE|
This improvement in performance is unsurprising, as the unrestricted model includes the hypotheses of the model in which the inherited PQC weights are not trained. However, it was not clear before the experiment that the model would be able to find this higher performance, as the introduction of more parameters may have created too many local optima for efficient optimisation. On the other hand, we might have expected a larger increase in performance: as the inherited weights make up the majority of parameters, we should expect releasing them to correspond to a big increase in performance. The lack of this increase could be taken as weak evidence that our approximation (that a smaller set can approximate the output of the whole circuit) is relatively accurate in this case, even without retraining the inherited weights.
Finally we use the synthetic data set as a training set for our model, with random initialisation of weights. This third experiment allows us to test our model's performance on a task on which a classical algorithm (the neural network) performs poorly, without prior knowledge of good parameters.
|L||Final validation MSE|
These performances are strong and comparable to the previous two experiments, where the weights of the generating PQC were given, showing that our model performs well on this task, much better than the neural network we compare it to. This final experiment demonstrates our model as it would be deployed, and shows that it can learn a non-trivial task where a higher number of qubits would naively be required.
VII Conclusion and future work
In this work we applied previously developed circuit cutting techniques to parameterised quantum circuits. While it is obvious that this naive approach uses too many circuit evaluations to be computationally practical, we noted there may exist a smaller set of circuits which would sufficiently approximate the original circuit, although we speculate that finding it would itself be computationally intractable even if it did exist. Instead we proposed a new model based on the relaxation of fixed gates into parameterised gates, such that all circuits were identical up to the weights of these newly parameterised gates. We showed that our model's hypothesis class contains the relevant unparameterised hypothesis class and that its generalisation error is well behaved, and then went on to test it experimentally. The first experiment showed the model was capable of tackling large problem sizes (handwriting). We also tested the ability of a parameterised subset of circuits of the partition to approximate the full unpartitioned output of a random circuit and found a very satisfying approximation, although a larger number of terms was needed than with the handwriting task, suggesting a link between the problem and the number of terms needed to achieve a given accuracy.
Further work is needed in establishing how many terms (L) might be required for any given task, and what factors influence this requirement. Future work could also focus on the application of this model, testing it on larger cutting-edge problems, or on achieving higher accuracy. Improvements to the model could come from the development of a robust training procedure to avoid barren plateaus (Section V) or from integrating our work with some of the excellent work already done on improving divide and conquer schemes (Section II). Our work has opened the door for experimentation with much larger “partially quantum” models, both implicitly, as we have done here, and potentially explicitly, integrating more classical resources into a quantum machine learning setting.
The authors would like to thank Matthias Caro for his insights, particularly on the link between different complexity measures and Elies Gil-Fuster. SCM thanks Radoica Draškić and Yash Patel for their useful discussion. The authors thank Andrea Skolik, Elies Gil-Fuster and Charles Moussa for helpful comments. VD and SCM acknowledge the support by the project NEASQC funded from the European Union’s Horizon 2020 research and innovation programme (grant agreement No 951821). VD and SCM also acknowledge partial funding by an unrestricted gift from Google Quantum AI. VD and CG were supported by the Dutch Research Council (NWO/OCW), as part of the Quantum Software Consortium programme (project number 024.003.037).
-  (2021) Quantum advantage and noise reduction in distributed quantum computing. Physical Review A 104 (5), pp. 052404. Cited by: §II.
-  (2011) Operator-schmidt decomposition and the geometrical edges of two-qubit gates. Quantum Information Processing 10 (4), pp. 449–461. Cited by: §III.2.
-  (2021) -QER: an intelligent approach towards quantum error reduction. arXiv preprint arXiv:2110.06347. Cited by: §II.
-  (2021) Bringing the concepts of virtualization to gate-based quantum computing. Master’s Thesis. Cited by: §II.
-  (2016) Trading classical and quantum computational resources. Physical Review X 6 (2), pp. 021043. Cited by: §I, §II, §III.2, §III.2, Figure 2, Remark.
-  (2020) Pseudo-dimension of quantum circuits. Quantum Machine Intelligence 2 (2), pp. 1–14. Cited by: §IV.2, §IV.
-  (2021) Encoding-dependent generalization bounds for parametrized quantum circuits. arXiv preprint arXiv:2106.03880. Cited by: §IV.1, §IV.1, §IV.1, §IV.
-  (to appear in 2022) Quantum learning theory. Ph.D. Thesis, Technical University of Munich. Cited by: §IV.2.
-  (2019) Gradients of parameterized quantum gates using the parameter-shift rule and gate decomposition. arXiv preprint arXiv:1905.13311. Cited by: §V.2.
-  (2012) The mnist database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine 29 (6), pp. 141–142. Cited by: §VI.1.
-  (2018) Classification with quantum neural networks on near term processors. arXiv preprint arXiv:1802.06002. Cited by: §VI.1.
-  (2021) Large-scale quantum machine learning. arXiv preprint arXiv:2108.01039. Cited by: §II.
-  (2019) Supervised learning with quantum-enhanced feature spaces. Nature 567 (7747), pp. 209–212. Cited by: §I.
-  (2021-05) Power of data in quantum machine learning. Nature Communications 12 (1). Cited by: §VI.1.
-  (2021) Variational quantum policies for reinforcement learning. arXiv preprint arXiv:2103.05577. Cited by: §III.2, §VI.1.
-  (2017) Hardware-efficient variational quantum eigensolver for small molecules and quantum magnets. Nature 549 (7671), pp. 242–246. Cited by: §III.1.
-  (2021) Quantum federated learning through blind quantum computing. Science China Physics, Mechanics & Astronomy 64 (10), pp. 1–8. Cited by: §II, §VI.1.
-  (2018) Differentiable learning of quantum circuit born machines. Physical Review A 98 (6), pp. 062324. Cited by: §I.
-  (2018) Barren plateaus in quantum neural network training landscapes. Nature communications 9 (1), pp. 1–6. Cited by: §V.3, §V.3.
-  (2003) Entropy and the combinatorial dimension. Inventiones mathematicae 152 (1), pp. 37–55. Cited by: §IV.2.
-  (2018) Foundations of machine learning. MIT press. Cited by: §III.2, §IV.
-  (2020) Simulating large quantum circuits on a small quantum computer. Physical Review Letters 125 (15), pp. 150504. Cited by: §I, §II, §II, §III.2.
-  (2021) Quantum circuit cutting with maximum-likelihood tomography. npj Quantum Information 7 (1), pp. 1–8. Cited by: §II.
-  (2021) Machine learning of high dimensional data on a noisy quantum processor. npj Quantum Information 7 (1), pp. 1–5. Cited by: §II.
-  (2018) Quantum computing in the nisq era and beyond. Quantum 2, pp. 79. Cited by: §I.
-  (2022) Avoiding barren plateaus using classical shadows. arXiv preprint arXiv:2201.08194. Cited by: §V.3.
-  (2021) Quantum divide and conquer for combinatorial optimization and distributed computing. arXiv preprint arXiv:2107.07532. Cited by: §II.
-  (2019) Quantum machine learning in feature hilbert spaces. Physical review letters 122 (4), pp. 040504. Cited by: §I.
-  (2021) Effect of data encoding on the expressive power of variational quantum-machine-learning models. Physical Review A 103 (3), pp. 032430. Cited by: §IV.1.
-  (2021) Cutqc: using small quantum computers for large quantum circuit evaluations. In Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 473–486. Cited by: §II, §II.
-  (1997) A theory of learning and generalization: with applications to neural networks and control systems. Springer-Verlag. Cited by: §IV.2.
-  (2018) Mathematical foundations of supervised learning. July. Cited by: §IV.2.
-  (2015) Coordinate descent algorithms. Mathematical Programming 151 (1), pp. 3–34. Cited by: §V.3.
-  (2022) If you listen carefully, you can hear the goat screaming “MNIST results!” But the dude isn’t listening carefully.. Twitter. Note: https://twitter.com/ylecun/status/1481327585640521728 Cited by: §VI.1.
Appendix A Example of increased complexity after the removal of a 2-qubit gate
Consider the following circuit
where the encoding gates are Rz. With the CNOT in place, the state vector at the end will always be the same fixed state, regardless of the input vector. Removing the CNOT
produces a final state which (when combined with a larger circuit, in a way that utilises what is presently just a global phase) can be used to differentiate points.
While no one would seriously implement this ansatz, it serves to prove that in some cases removing 2-qubit gates can in fact increase the complexity.