# Quantum Speedup in Adaptive Boosting of Binary Classification

In classical machine learning, a set of weak classifiers can be adaptively combined to form a strong classifier for improving the overall performance, a technique called adaptive boosting (or AdaBoost). However, constructing the strong classifier for a large data set is typically resource consuming. Here we propose a quantum extension of AdaBoost, demonstrating a quantum algorithm that can output the optimal strong classifier with a quadratic speedup in the number of queries of the weak classifiers. Our results also include a generalization of the standard AdaBoost to the cases where the output of each classifier may be probabilistic even for the same input. We prove that the update rules and the query complexity of the non-deterministic classifiers are the same as those of deterministic classifiers, which may be of independent interest to the classical machine-learning community. Furthermore, the AdaBoost algorithm can also be applied to data encoded in the form of quantum states; we show how the training set can be simplified by using the tools of t-design. Our approach describes a model of quantum machine learning where quantum speedup is achieved in finding the optimal classifier, which can then be applied for classical machine-learning applications.


## Appendix A Proof of Theorem 1

Here we perform our analysis for the probabilistic case, which reduces to the conventional AdaBoost if the outputs of the classifiers are deterministic. Moreover, we assume that a definite target label exists for every $x \in \Omega$, where $\Omega$ is the sample space of all possible inputs. Let $p(x)$ be the probability mass function defined on $\Omega$.

The goal of AdaBoost is to find the optimal coefficients of the linear model based on the basis classifiers with the minimum exponential error, which is the average of over the joint distribution of inputs and classifiers. Here

are random variables which yield the conditional probabilities

. Let , that is if , and otherwise. Then is fully determined by a conditional probability mass function .

The exponential error as the cost function yields

$$C_T = \sum_x p(x) \prod_{t=1}^{T} \left( \sum_{r_t^x = 0,1} q_t(r_t^x|x)\, e^{-\alpha_t (-1)^{r_t^x}} \right) \tag{11}$$

because . In AdaBoost, the optimization problem is done by adding each term into one by one with the optimal weight at iteration. Let be the exponential error of the first terms of

Let be a binary string . Let , and let . Then equation (11) gives

$$C_t = \sum_{x \in \Omega,\, s_{t-1}} q(s_{t-1},x)\, w^x_{s_{t-1}} \left( \sum_{r_t^x = 0,1} q_t(r_t^x|x)\, e^{-\alpha_t (-1)^{r_t^x}} \right). \tag{12}$$

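This is a convex function with respect to $\alpha_t$, so a unique solution to the problem exists at the extremum. Taking its derivative with respect to $\alpha_t$ and setting it to zero gives

$$\sum_{x,s_{t-1}} q(s_{t-1},x)\, w^x_{s_{t-1}} \left( -q_t(0|x)\, e^{-\alpha_t} + q_t(1|x)\, e^{\alpha_t} \right) = 0 \tag{13}$$

and hence

$$e^{2\alpha_t} \sum_{x,s_{t-1}} q(s_{t-1},x)\, w^x_{s_{t-1}}\, q_t(1|x) = \sum_{x,s_{t-1}} q(s_{t-1},x)\, w^x_{s_{t-1}}\, q_t(0|x). \tag{14}$$

That is

$$\alpha_t = \frac{1}{2} \ln \frac{\sum_{x,s_{t-1}} q(s_{t-1},x)\, w^x_{s_{t-1}}\, q_t(0|x)}{\sum_{x,s_{t-1}} q(s_{t-1},x)\, w^x_{s_{t-1}}\, q_t(1|x)}. \tag{15}$$

Let

$$\tilde R_t := \frac{\sum_{x,s_{t-1}} q(s_{t-1},x)\, w^x_{s_{t-1}}\, q_t(1|x)}{\sum_{x,s_{t-1}} q(s_{t-1},x)\, w^x_{s_{t-1}}}. \tag{16}$$

Then the optimal weight of each iteration is

$$\alpha_t = \frac{1}{2} \ln\left( \frac{1-\tilde R_t}{\tilde R_t} \right). \tag{17}$$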
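As a minimal numerical sketch of (17), the following Python function maps a weighted error $\tilde R_t$ to the weight $\alpha_t$. The function name `optimal_weight` is hypothetical (it does not come from the paper), and the sketch assumes $0 < \tilde R_t < 1$ so that the logarithm is finite.

```python
import math

def optimal_weight(weighted_error: float) -> float:
    """Return alpha_t = 0.5 * ln((1 - R) / R), following equation (17)."""
    # Assumption: the weighted error lies strictly between 0 and 1.
    if not 0.0 < weighted_error < 1.0:
        raise ValueError("weighted_error must lie strictly between 0 and 1")
    return 0.5 * math.log((1.0 - weighted_error) / weighted_error)
```

A classifier that is no better than chance ($\tilde R_t = 1/2$) receives weight zero, and the weight grows as the error decreases.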

In the following, we demonstrate that the optimal weight can be adaptively obtained. When , initialize . Thus

$$\tilde R_1 = \frac{\sum_x p(x)\, q_1(1|x)}{\sum_x p(x)} = \sum_x p(x)\, q_1(1|x) = E_{p \times q_1}[r_1^x] \tag{18}$$

which is exactly the generalization error of .

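Let $Z_t := \sum_{x,s_{t-1}} q(s_{t-1},x)\, w^x_{s_{t-1}}$ be the normalization factor. Then

$$
\begin{aligned}
\tilde R_t &= \sum_{x,s_{t-1}} q(s_{t-1},x)\, q_t(1|x)\, \frac{w^x_{s_{t-1}}}{Z_t} \\
&= \sum_{x,s_{t-1}} q(s_{t-1},x) \sum_{r_t^x = 0,1} q_t(r_t^x|x)\, \frac{w^x_{s_{t-1}}}{Z_t}\, r_t^x \\
&= E_q\left[ \frac{w^x_{s_{t-1}}}{Z_t}\, r_t^x \right].
\end{aligned} \tag{19}
$$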

By definition . Therefore

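$$
\begin{aligned}
Z_{t+1} &= \sum_{x,s_t} q(s_t,x)\, w^x_{s_t} \\
&= \sum_{x,s_{t-1}} q(s_{t-1},x)\, w^x_{s_{t-1}} \left( q_t(0|x)\, e^{-\alpha_t} + q_t(1|x)\, e^{\alpha_t} \right) \\
&= e^{-\alpha_t} \left( Z_t \tilde R_t\, e^{2\alpha_t} + Z_t (1-\tilde R_t) \right) \\
&= e^{-\alpha_t} \left( Z_t \tilde R_t\, \frac{1-\tilde R_t}{\tilde R_t} + Z_t (1-\tilde R_t) \right) \\
&= 2 e^{-\alpha_t} (1-\tilde R_t)\, Z_t .
\end{aligned} \tag{20}
$$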

Let

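$$
\begin{aligned}
W^x_{s_{t+1}} &:= \frac{w^x_{s_t}}{Z_{t+1}} \\
&= \frac{w^x_{s_{t-1}}}{Z_{t+1}}\, e^{-\alpha_t (-1)^{r_t}} \\
&= W^x_{s_t}\, \frac{1}{2(1-\tilde R_t)}\, e^{\alpha_t \left(1-(-1)^{r_t}\right)} .
\end{aligned} \tag{21}
$$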

Therefore, all the values of can be obtained by iterating with the information of . It is not hard to check that (21) is equivalent to the updating rule  (6). These values again yield for next iteration, and therefore every could be determined analytically in this manner.
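The adaptive procedure derived in this appendix can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the paper's algorithm 1: the classifiers are taken to be deterministic, the expectation in (19) is replaced by an average over a finite list of samples, and the function name `adaboost_weights` and the input layout `errors[t][i]` are hypothetical.

```python
import math

def adaboost_weights(errors):
    """errors[t][i] is 0 if classifier t labels sample i correctly, 1 otherwise.

    Returns (alphas, W), where W holds the normalized weights after the last round.
    Assumes 0 < R_t < 1 in every round.
    """
    n = len(errors[0])
    W = [1.0] * n  # initial normalized weights (mean 1)
    alphas = []
    for r in errors:
        # Weighted error: finite-sample stand-in for equation (19).
        R = sum(w * ri for w, ri in zip(W, r)) / n
        # Optimal weight, equation (17).
        alpha = 0.5 * math.log((1.0 - R) / R)
        alphas.append(alpha)
        # Weight update, equation (21).
        W = [w * math.exp(alpha * (1 - (-1) ** ri)) / (2.0 * (1.0 - R))
             for w, ri in zip(W, r)]
    return alphas, W
```

With this update the weights keep mean 1 after every round, and the classifier that was just added has weighted error exactly 1/2 under the new weights, which is the familiar AdaBoost reweighting property.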

## Appendix B Proof of Theorem 2

In the section Proof of Theorem 1, a theoretically optimal solution to the AdaBoost model is derived. However, in practice, the underlying distribution of inputs is unknown, and therefore the values of $\tilde R_t$ cannot be evaluated. Also, the training algorithm usually cannot cover the whole sample space (otherwise the explicit relationship between inputs and outputs would be known, and machine learning would be unnecessary).

Similar to other machine learning tasks, this problem is solved by sampling. Clearly, with a underlying distribution on the sample space , each can be viewed as a random variable on the sample space. This can be done with an interesting result derived from Hoeffding’s inequality.

###### Theorem 4 (Hoeffding’s inequality).

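If a sample $S$ of size $N$ is drawn from a distribution $D$ on a sample space $\Omega$, then given a random variable $X$ on $\Omega$ and any positive number $\epsilon$,

$$P\left[ \left| \frac{1}{N} \sum_{x \in S} X(x) - E_D[X] \right| \ge \epsilon \right] \le 2 \exp\left( -\frac{2N\epsilon^2}{\hat c^2} \right) \tag{22}$$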

where .

The key point here is that, although $\tilde R_t$ cannot be evaluated in practice, $\hat R_t$ is computable, and it approximates $\tilde R_t$ well when $N$ is large.

According to equation (19), For a sample of pairs drawn from the distribution , let

$$\hat R_t := \frac{1}{N} \sum_{x \in S} \left[ W^x_{s_t}\, r_t^x \right]. \tag{23}$$

Then theorem 4 shows that

$$P\left( |\tilde R_t - \hat R_t| \ge \epsilon \right) \le 2 e^{-2\epsilon^2 N / c_t^2}, \tag{24}$$

where is the size of and .

To be noticed, the value of is derived with iteration according to equation (21). Since , is always positive, which means is always non-negative as well. Further, .

Therefore, for a target precision of $\epsilon$, a sample of size $N = O(c_t^2/\epsilon^2)$ is good enough to achieve the goal with a constant probability. Nevertheless, the sample size has to be determined beforehand, and hence we should choose

$$N = O\left( \frac{\hat c^2}{\epsilon^2} \right), \tag{25}$$

where .
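To make the scaling in (25) concrete, the sketch below solves the Hoeffding bound (22) for the smallest sample size that reaches a given failure probability. The failure probability `delta` and the function name `hoeffding_sample_size` are assumptions introduced for illustration; the text only asks for a constant success probability.

```python
import math

def hoeffding_sample_size(epsilon: float, c_hat: float, delta: float) -> int:
    """Smallest N with 2 * exp(-2 * N * epsilon**2 / c_hat**2) <= delta."""
    if epsilon <= 0 or c_hat <= 0 or not 0 < delta < 1:
        raise ValueError("need epsilon > 0, c_hat > 0 and 0 < delta < 1")
    # Rearranging the bound gives N >= c_hat^2 * ln(2 / delta) / (2 * epsilon^2).
    return math.ceil(c_hat ** 2 * math.log(2.0 / delta) / (2.0 * epsilon ** 2))
```

For fixed `delta`, the result grows as $\hat c^2/\epsilon^2$, so halving the target precision roughly quadruples the required sample.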

###### Remark.

However, might not be small when is large, which indicates that AdaBoost may not perform well if the model does not converge quickly in the number of classifiers used. This might be improved by other boosting algorithms, e.g. LogitBoost, Gradient Boosting, or XGBoost.

As long as we obtain a sample of size , according to theorem 4, the algorithm 1 approximates well. This algorithm evaluates each data for each classifier , and therefore requires queries.

## Appendix C Quantum Simulation of Classical Process

This section reviews some results from Kitaev (1995) on simulating classical Boolean circuits with quantum circuits. For convenience, and without loss of generality, the classical registers are written in Dirac notation here.

According to lemma 1 and 7 in Kitaev (1995), if a function can be computed with Boolean operations a basis , which is a small set of Boolean operations, then it can be computed with operations in the basis . The basis is defined in a way that, for each , there is a Also the operation to copy a state

$$\tau_{A,B}: |x\rangle_A |v\rangle_B \to |x\rangle_A |v \oplus x\rangle_B \tag{26}$$

(which is indeed a CNOT gate) has to be included in .

Furthermore, we say a circuit computes a Boolean function , if it converts . With basis , this computation is performed as .

However, one may only need partial information about the output . Classically, one is free to read out part of the output bits and drop the rest. In quantum computation, however, dropping those “garbage” bits () would destroy the quantum state if they are in superposition. But as shown above, can be constructed with reversible gates. Divide register into two parts , and then is

(it is always separable as the initial states are all tensor product states), the above process is then

$$|x\rangle_X |0\rangle_{B_1} |0\rangle_{B_2} \to |x\rangle_X |f(x)\rangle_{B_1} |\mathrm{gar}(x)\rangle_{B_2}.$$

By repeating this process on an extra register , the process

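$$|x\rangle_X |0\rangle_{B_1} |0\rangle_{B_2} |0\rangle_{B'_1} |0\rangle_{B'_2} \to |x\rangle_X |f(x)\rangle_{B_1} |\mathrm{gar}(x)\rangle_{B_2} |f(x)\rangle_{B'_1} |\mathrm{gar}(x)\rangle_{B'_2}$$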

can be constructed.

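If the input state is on quantum registers and it is in superposition

$$\left( \frac{1}{\sqrt{N}} \sum_x |x\rangle_X \right) |0\rangle_{B_1} |0\rangle_{B_2} |0\rangle_{B'_1} |0\rangle_{B'_2},$$

this process will give

$$\frac{1}{\sqrt{N}} \sum_x |x\rangle_X |f(x)\rangle_{B_1} |\mathrm{gar}(x)\rangle_{B_2} |f(x)\rangle_{B'_1} |\mathrm{gar}(x)\rangle_{B'_2}.$$

Then the pairwise operation (26) is performed between the “garbage” states on $B_2$ and $B'_2$, which gives

$$
\begin{aligned}
& \frac{1}{\sqrt{N}} \sum_x |x\rangle_X |f(x)\rangle_{B_1} |\mathrm{gar}(x) \oplus \mathrm{gar}(x)\rangle_{B_2} |f(x)\rangle_{B'_1} |\mathrm{gar}(x)\rangle_{B'_2} \\
&= \frac{1}{\sqrt{N}} \sum_x |x\rangle_X |f(x)\rangle_{B_1} |0\rangle_{B_2} |f(x)\rangle_{B'_1} |\mathrm{gar}(x)\rangle_{B'_2}.
\end{aligned}
$$

Finally, the original process is performed again on the registers $B'_1$ and $B'_2$, which ends up at

$$\left( \frac{1}{\sqrt{N}} \sum_x |x\rangle_X |f(x)\rangle_{B_1} \right) |0\rangle_{B_2} |0\rangle_{B'_1} |0\rangle_{B'_2}.$$

Since the appended registers all end up in $|0\rangle$, they can be dropped freely after the computation.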

In summary, the process $|x\rangle|0\rangle \to |x\rangle|f(x)\rangle$ can be achieved with quantum gates even for computation in superposition. This indicates that each arithmetic part of our quantum algorithm can be performed with the same complexity as in the classical algorithm. Since all ancillary registers always start and end in $|0\rangle$, they are omitted from our notation for simplicity.

Note that, although the above result is only valid for Boolean functions, these Boolean operations are universal, which is how modern computers work. To deal with real numbers on a computer, the values have to be encoded into binary strings up to some precision.
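The garbage-removal steps of this section can be traced on classical basis states with a short Python sketch. Everything named here is hypothetical: `f` and `gar` are toy stand-ins for a Boolean function and its leftover bits, and the sketch assumes the computation writes its results into the target registers by XOR, so that running it a second time on the same registers clears them. Because every step is a permutation of basis states, the same sequence acts term by term on a superposition.

```python
def f(x: int) -> int:
    """Hypothetical Boolean function: parity of the input bits."""
    return bin(x).count("1") % 2

def gar(x: int) -> int:
    """Hypothetical garbage left behind when computing f."""
    return (x * 5 + 3) % 8

def compute(state, out, junk):
    """XOR f(x) into register `out` and gar(x) into register `junk`."""
    s = dict(state)
    s[out] ^= f(s["X"])
    s[junk] ^= gar(s["X"])
    return s

def copy(state, src, dst):
    """Operation (26): XOR register `src` into register `dst`."""
    s = dict(state)
    s[dst] ^= s[src]
    return s

def clean_compute(x: int):
    """Follow the four steps of the text on the basis state |x>|0>|0>|0>|0>."""
    state = {"X": x, "B1": 0, "B2": 0, "B1p": 0, "B2p": 0}
    state = compute(state, "B1", "B2")    # first run writes f(x), gar(x)
    state = compute(state, "B1p", "B2p")  # repeat on the extra registers
    state = copy(state, "B2p", "B2")      # gar(x) XOR gar(x) = 0 on B2
    state = compute(state, "B1p", "B2p")  # run again: clears B1' and B2'
    return state
```

For every input, only `B1` is left holding `f(x)`; all other appended registers return to 0 and can be discarded.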

###### Example 1.

The updating rule (6) is purely arithmetic. This can be viewed as repeating controlled operation on a register , encoding a numerical value in terms of binary strings up to some precision. Each application of is controlled by each qubit of the string . More precisely,

$$U_i = U_0 \otimes |s_i = 0\rangle\langle s_i = 0|_i + U_1 \otimes |s_i = 1\rangle\langle s_i = 1|_i ,$$

where ; . As a result, lines 10-14 in algorithm 1 can be performed in quantum circuits with the same order of gates as classical circuit. Additionally, this can be done in superposition for all , and hence the “for” loop in classical algorithm can be done in one shot.

Similarly, another step for phase estimation in our algorithm can be done with this method.

###### Example 2.

There exists an operation such that for ,

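$$Q_t |\xi\rangle_M |0\rangle = |\xi\rangle_M \left( \sqrt{1 - \frac{\xi}{\hat c}}\, |0\rangle + \sqrt{\frac{\xi}{\hat c}}\, |1\rangle \right). \tag{27}$$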

The requirement of is presented to make sure is a real number, and therefore, the state can be constructed on an ancillary register “anc”. Here , where is the binary representation of the real number up to some precision. The process to compute is arithmetic. By further appending an additional qubit to the system, an operation can be constructed as lemma 4 in Dervovic et al. (2018), such that it converts to

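$$
\begin{aligned}
& |\xi\rangle_M |A\rangle_{\mathrm{anc}} \left( \sin(A) |0\rangle + \cos(A) |1\rangle \right) \\
&= |\xi\rangle_M |A\rangle_{\mathrm{anc}} \left( \sqrt{1 - \frac{\xi}{\hat c}}\, |0\rangle + \sqrt{\frac{\xi}{\hat c}}\, |1\rangle \right).
\end{aligned}
$$

Finally, the register “anc” can be cleared and dropped with the garbage-dropping technique above. This whole process is exactly the operation $Q_t$.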
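The arithmetic behind this example can be sketched classically. The relation $\cos(A) = \sqrt{\xi/\hat c}$ is read off from the two lines of the display above rather than from an explicit definition of $A$, so it should be treated as an inferred assumption; the function name `rotation_amplitudes` is hypothetical.

```python
import math

def rotation_amplitudes(xi: float, c_hat: float):
    """Return (A, amp0, amp1) with amp0 = sin(A), amp1 = cos(A), cos(A)**2 = xi / c_hat."""
    # Both square roots in (27) are real only when 0 <= xi <= c_hat.
    if c_hat <= 0 or not 0.0 <= xi <= c_hat:
        raise ValueError("need c_hat > 0 and 0 <= xi <= c_hat")
    A = math.acos(math.sqrt(xi / c_hat))
    return A, math.sin(A), math.cos(A)
```

The squared amplitude on $|1\rangle$ is then $\xi/\hat c$, which is what allows the value stored in register $M$ to be turned into a measurable probability.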

The operations in these examples will be used in the next section.

## Appendix D Proof of Theorem 3

In the Quantum AdaBoost Algorithm, the computation other than the average of can be performed in parallel on the whole sample. That is, for every initial state , where is the data points of a sample of size drawn from the sample space , the classical algorithm outputs to the register , which encodes the numerical value of . Note that here is the state corresponding to the binary value of , in the way modern computers store numerical values. With this property, the AdaBoost algorithm can be performed by the following adaptive procedure:

At iteration, given the classical information of (where can be obtained in iteration), initialize the state of three registers , , as

$$\frac{1}{\sqrt{N}} \sum_{x \in S} |x\rangle_X \otimes |W^x_{s_1} \equiv 1\rangle_M \otimes |0\rangle^{\otimes t}_{R_t}, \tag{28}$$

with access to the quantum oracle defined in (7), one can obtain

 (29)

With the classical information of , one can update the register with the updating rule (6) (which is a classical arithmetic process shown in example 1) to the state

 (30)

Compose the whole arithmetic process that converts (28) to (30) and rewrite it as :

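$$A_t \frac{1}{\sqrt{N}} \sum_{x \in S} |x\rangle_X \otimes |W^x_{s_1}\rangle_M \otimes |0\rangle_{R_t} = \frac{1}{\sqrt{N}} \sum_{x \in S} \sum_{s_t} \sqrt{q(s_t|x)}\, |x\rangle_X \otimes |W^x_{s_t}\rangle_M \otimes |s_t\rangle_{R_t}, \tag{31}$$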

With an extra working register, apply the operation in example 2 to the final state in (31)

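$$
\begin{aligned}
& Q_t A_t \frac{1}{\sqrt{N}} \sum_{x \in S} |x\rangle_X |W^x_{s_1}\rangle_M |0\rangle_{R_t} |0\rangle \\
&= \sqrt{1 - \frac{\sum_{x,s_t} q(s_t|x)\, r_t^x\, W^x_{s_t}}{\hat c N}}\; |\varphi_0\rangle |0\rangle + \sqrt{\frac{\sum_{x,s_t} q(s_t|x)\, r_t^x\, W^x_{s_t}}{\hat c N}}\; |\varphi_1\rangle |1\rangle,
\end{aligned} \tag{32}
$$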

where

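$$
\begin{aligned}
|\varphi_0\rangle &= \frac{1}{\sqrt{N}} \sum_{x,s_t} \sqrt{q(s_t|x)}\, \frac{\sqrt{1 - r_t^x W^x_{s_t}/\hat c}}{\sqrt{1 - \hat R_t/\hat c}}\, |x\rangle_X |W^x_{s_t}\rangle_M |s_t\rangle_{R_t} \\
|\varphi_1\rangle &= \frac{1}{\sqrt{N}} \sum_{x,s_t} \sqrt{q(s_t|x)}\, \frac{\sqrt{r_t^x W^x_{s_t}/\hat c}}{\sqrt{\hat R_t/\hat c}}\, |x\rangle_X |W^x_{s_t}\rangle_M |s_t\rangle_{R_t}.
\end{aligned} \tag{33}
$$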

Note that for each , .

According to the definition in equation (23), the result of (32) is indeed

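$$\sqrt{1 - \frac{\hat R_t}{\hat c}}\; |\varphi_0\rangle |0\rangle + \sqrt{\frac{\hat R_t}{\hat c}}\; |\varphi_1\rangle |1\rangle. \tag{34}$$

This can be rewritten as

$$|\psi_0\rangle := \sin(\theta_t) |\varphi_0\rangle |0\rangle + \cos(\theta_t) |\varphi_1\rangle |1\rangle, \tag{35}$$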

which performs a rotation of angle .

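Let $|\psi_1\rangle := \cos(\theta_t) |\varphi_0\rangle |0\rangle - \sin(\theta_t) |\varphi_1\rangle |1\rangle$, which is orthogonal to $|\psi_0\rangle$. After a Pauli-$Z$ operation is performed on the last register of $|\psi_0\rangle$, it is transformed to

$$
\begin{aligned}
& \sin(\theta_t) \left[ \sin(\theta_t) |\psi_0\rangle + \cos(\theta_t) |\psi_1\rangle \right] - \cos(\theta_t) \left[ \cos(\theta_t) |\psi_0\rangle - \sin(\theta_t) |\psi_1\rangle \right] \\
&= -\cos(2\theta_t) |\psi_0\rangle + \sin(2\theta_t) |\psi_1\rangle.
\end{aligned} \tag{36}
$$

Apply the inverse of the operation $Q_t A_t$ to (36), so that $|\psi_0\rangle$ is mapped back to the initial state (28). Note that $|\psi_1\rangle$ is orthogonal to $|\psi_0\rangle$ and our operation is unitary. Therefore, if an operation that only inverts the amplitude of every state perpendicular to the initial state (analogous to the diffusion operator in Grover’s algorithm Grover (1996)) is applied and the operation $Q_t A_t$ is performed again, the $|\psi_0\rangle$ component is left unchanged. This procedure gives

$$
\begin{aligned}
& -\cos(2\theta_t) |\psi_0\rangle - \sin(2\theta_t) |\psi_1\rangle \\
&= -\cos(2\theta_t) \left[ \sin(\theta_t) |\varphi_0\rangle |0\rangle + \cos(\theta_t) |\varphi_1\rangle |1\rangle \right] - \sin(2\theta_t) \left[ \cos(\theta_t) |\varphi_0\rangle |0\rangle - \sin(\theta_t) |\varphi_1\rangle |1\rangle \right] \\
&= -\left[ \sin(3\theta_t) |\varphi_0\rangle |0\rangle + \cos(3\theta_t) |\varphi_1\rangle |1\rangle \right].
\end{aligned} \tag{37}
$$

In conclusion, repeating this procedure $k$ times converts the initial state, up to a global sign $(-1)^k$, to

$$\sin((2k+1)\theta_t) |\varphi_0\rangle |0\rangle + \cos((2k+1)\theta_t) |\varphi_1\rangle |1\rangle. \tag{38}$$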
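The rotation can be checked numerically in the two-dimensional subspace spanned by $|\varphi_0\rangle|0\rangle$ and $|\varphi_1\rangle|1\rangle$. The sketch below is an illustration, not the paper's circuit: it starts from the amplitudes $(\sin\theta_t, \cos\theta_t)$ of (35), models one round as a Pauli-$Z$ on the last qubit followed by a reflection that keeps the $|\psi_0\rangle$ component and inverts everything perpendicular to it, and tracks the two real amplitudes directly. The function name `amplify` is hypothetical.

```python
import math

def amplify(theta: float, k: int):
    """Amplitudes on (|phi0>|0>, |phi1>|1>) after k rounds, starting from (35)."""
    s, c = math.sin(theta), math.cos(theta)
    a, b = s, c  # |psi0> = sin(theta)|phi0>|0> + cos(theta)|phi1>|1>
    for _ in range(k):
        # Pauli-Z on the last qubit flips the sign of the |1> component.
        a, b = a, -b
        # Reflection 2|psi0><psi0| - I: keep the psi0 part, invert the rest.
        overlap = a * s + b * c
        a, b = 2 * overlap * s - a, 2 * overlap * c - b
    return a, b
```

After $k$ rounds the amplitudes equal $(-1)^k \left( \sin((2k+1)\theta_t), \cos((2k+1)\theta_t) \right)$, so each round advances the angle by $2\theta_t$, which is the structure that phase estimation exploits.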

Such an operation makes it possible to estimate $\theta_t$ with the phase estimation algorithm.

To compare the query complexities fairly, we constrain the results of both the classical and the quantum algorithm to the same precision. In order to approximate with the target precision , the phase estimation algorithm has to estimate with precision , and as shown in (25), a sample of size is enough to estimate each with with precision .

In the step of our quantum algorithm, by choosing number of iterations in (38) to be , the phase estimation process could read out the value of , such that .

In order to estimate with the same precision as the classical algorithm, we need to bound to make sure

$$\left| \hat c \cos^2(\hat\theta_t) - \hat R_t \right| \le \epsilon .$$

This can be done by choosing a proper . Then, the task of our analysis is to bound the value of in terms of and as in the classical case.

Let , then gives Since , is a small number, and hence

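$$\left| \cos^{-1}\left( \sqrt{\frac{\hat R_t + \hat\epsilon}{\hat c}} \right) - \cos^{-1}\left( \sqrt{\frac{\hat R_t}{\hat c}} \right) \right| \approx \left| \frac{\hat\epsilon}{\hat c} \left. \frac{d}{dx} \cos^{-1}\left( \sqrt{x} \right) \right|_{x = \hat R_t/\hat c} \right| . \tag{39}$$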
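For reference, the derivative appearing in (39) has a simple closed form. The step below is standard calculus (chain rule on $\cos^{-1}(u)$ with $u = \sqrt{x}$) and is written out here as a worked aid rather than taken from the source:

```latex
\frac{d}{dx}\cos^{-1}\!\left(\sqrt{x}\right)
  = -\frac{1}{\sqrt{1-x}} \cdot \frac{1}{2\sqrt{x}}
  = -\frac{1}{2\sqrt{x(1-x)}}, \qquad 0 < x < 1 .
```

Evaluated at $x = \hat R_t/\hat c$, the right-hand side of (39) therefore has magnitude $\hat\epsilon \big/ \left( 2 \hat c \sqrt{x(1-x)} \right)$, which stays bounded as long as $\hat R_t/\hat c$ is not close to 0 or 1.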

When , is almost a constant, and . This is usually true since and . Note that , and . To make sure , or equivalently , the optimal can be chosen is . This gives .

Moreover, for iteration, the step (29) requires queries. So the query complexity for each iteration is .

Nevertheless, in order to obtain the value of , each quantum iteration is followed by a measurement. The information of stored in superposition is destroyed, and thus it has to be re-evaluated from the beginning every time. Therefore the overall complexity is , compared with the classical case, which is . As discussed in the remark at the end of the section Proof of Theorem 2, the AdaBoost algorithm may not work well if it does not converge within a small number of iterations. Therefore, the here may be considered as a small constant.

Also, for both quantum and classical algorithms, we use , the query complexity of classical algorithm can be rewritten as and the quantum query complexity is then .

This quantum algorithm gives the same result as the classical algorithm, to the same order of precision and with the same success probability.