1 Introduction
Quantum computing [NC10, Amb17] has been one of the hot topics in computer science in recent decades. There are many problems where quantum algorithms outperform the best known classical algorithms [DW01, Jor, KS19, KKS19]. Today quantum computing is often used in machine learning to speed up the construction of machine learning models or the prediction of a result for new input data [AdW17, Kop18, SSP15, SSP14]. Sometimes the learning process (construction) of a machine learning model takes a long time because of the large size of the data. Even a small reduction in running time can provide a significant time benefit to the program.
Decision trees are often used to build a classifier. Random forest [HTF09] and gradient tree boosting [Fri99] models are very popular and effective for solving classification and regression problems. These algorithms are based on decision trees. There are several algorithms for tree construction, such as CART [LHAJ84], ID3 [Qui86], C4.5 [Qui96], C5.0 [RJ15] and others. We consider the C5.0 algorithm [c5019] for decision tree classifiers. It works in $O(hd(NM + N\log N))$ running time, where $h$ is the height of the tree, $N$ is the size of the training set, $d$ is the number of attributes of one vector from the training set, and $M$ is the number of classes.
In this paper, firstly, we present an improved version of the classical algorithm that uses a self-balancing binary search tree [CLRS01] and has $O(hdN\log N)$ running time. As a self-balancing binary search tree we can use the AVL tree [AVL62, CLRS01] or the Red-Black tree [GS78, CLRS01]. Secondly, we describe a quantum version of the C5.0 algorithm. We call it QC5.0. The running time of QC5.0 is $O(h\sqrt{d}\log d\cdot N\log N)$. The algorithm is based on generalizations of Grover's search algorithm [Gro96], namely amplitude amplification [BHMT02] and the Dürr-Høyer algorithm for minimum search [DH96].
2 Preliminaries
Machine learning [Eth10, Qui86] allows us to predict a result using information about past events. The C5.0 algorithm is used to construct a decision tree for a classification problem [KQ02]. Let us consider the classification problem formally.
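There are two sequences: $\mathcal{X} = \{X^1,\dots,X^N\}$ is a training data set and $\mathcal{Y} = \{y_1,\dots,y_N\}$ is a set of corresponding classes. Here $X^i = (x^i_1,\dots,x^i_d)$ is a vector of attributes, where $i \in \{1,\dots,N\}$, $d$ is the number of attributes, $N$ is the number of vectors in the training data set, and $y_i \in C = \{1,\dots,M\}$ is the class of the vector $X^i$. An attribute $x^i_j$ is a real-valued variable or a discrete-valued variable, i.e. $x^i_j \in \{1,\dots,T_j\}$ for some integer $T_j$. Let $DOM_j = \mathbb{R}$ if $x_j$ is a real-valued attribute, and $DOM_j = \{1,\dots,T_j\}$ if $x_j$ is a discrete-valued attribute. The problem is to construct a function $F: DOM_1\times\dots\times DOM_d \to C$ that is called a classifier. The function classifies a new vector $X = (x_1,\dots,x_d)$ that is not from $\mathcal{X}$.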
There are many algorithms for constructing a classifier. The decision tree and the C5.0 algorithm for constructing it are the central subject of this work.
A decision tree is a tree such that each node tests some condition on the input variables. Suppose $B$ is some test with outcomes $b_1,\dots,b_t$ that is tested in a node. Then the node has $t$ outgoing edges, one for each outcome. Each leaf is associated with a result class from $C$. The testing process is the following. We start testing conditions from the root node and follow the edges according to the result of each condition. The label on the reached leaf is the result of the classification process.
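The classification walk described above can be sketched in a few lines. This is only an illustration: the `Node` class, the `classify` function and the toy tree are hypothetical names and data introduced here, not part of C5.0.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Sequence


@dataclass
class Node:
    """A decision-tree node: a leaf carries a class label, an inner node a test."""
    label: Optional[int] = None                        # set only for leaves
    test: Optional[Callable[[Sequence], int]] = None   # maps a vector to an outcome
    children: Dict[int, "Node"] = field(default_factory=dict)


def classify(root: Node, x: Sequence) -> int:
    """Start at the root and follow the edge selected by each node's test."""
    node = root
    while node.label is None:
        node = node.children[node.test(x)]
    return node.label


# Toy tree (assumed data): a threshold test on x[0] at the root,
# then an equality test on the discrete attribute x[1].
tree = Node(
    test=lambda x: 0 if x[0] <= 2.5 else 1,
    children={
        0: Node(label=1),
        1: Node(
            test=lambda x: 0 if x[1] == 1 else 1,
            children={0: Node(label=2), 1: Node(label=3)},
        ),
    },
)
```

The number of tests evaluated for one vector is bounded by the height of the tree.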
Our algorithm uses quantum algorithms as subroutines, and the remaining part is classical. As quantum algorithms, we use query model algorithms. These algorithms can query a black box that has access to the training data set and stored data. By the running time of an algorithm, we mean the number of queries to the black box. In the classical case, we use the classical analog of this computational model, the query model. We suggest [NC10] as a good book on quantum computing and [Amb17] for a description of the query model.
3 An Overview of the C4.5 and C5.0 Algorithms
We consider a classifier that is expressed by a decision tree. This section is dedicated to the C5.0 algorithm for constructing decision trees for the classification problem. This algorithm is an improved version of the C4.5 algorithm, and it is part of the commercial system See5/C5.0. The C4.5 and C5.0 algorithms were proposed by Ross Quinlan [KQ02]. Let us discuss these algorithms. C4.5 belongs to a succession of decision tree learners that trace their origins back to the work of Hunt and others in the late 1950s and early 1960s [Hun]. Its immediate predecessors were ID3 [Qui79], a simple system consisting initially of about 600 lines of Pascal, and C4 [Qui87].
3.1 The Structure of the Tree
Decision tree learners use a method known as divide and conquer to construct a suitable tree from a training set $\mathcal{X}'$ of vectors:
- If all vectors in $\mathcal{X}'$ belong to the same class $c$, then the decision tree is a leaf labeled by $c$.
- Otherwise, let $B$ be some test with outcomes $b_1,\dots,b_t$ that produces a non-trivial partition of $\mathcal{X}'$. Let $\mathcal{X}'_i$ be the set of training vectors from $\mathcal{X}'$ that have outcome $b_i$ of $B$. Then the tree is as presented in Figure 1. Here $T_i$ is the result of growing a decision tree for the set $\mathcal{X}'_i$.
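The divide-and-conquer scheme can be illustrated with a short recursive sketch. The names `grow` and `choose_test`, the nested-dict tree representation, and the majority-class fallback for a trivial partition are assumptions made for this example; the test selection itself is left to a caller-supplied function.

```python
def grow(vectors, labels, choose_test):
    """Grow a tree as nested dicts.

    choose_test(vectors, labels) is a hypothetical callback that returns a
    function mapping a vector to an outcome of the selected test.
    """
    # Case 1: all vectors belong to one class, so the tree is a leaf.
    if len(set(labels)) == 1:
        return {"leaf": labels[0]}

    # Case 2: pick a test and partition the set by its outcomes.
    test = choose_test(vectors, labels)
    parts = {}
    for x, y in zip(vectors, labels):
        xs, ys = parts.setdefault(test(x), ([], []))
        xs.append(x)
        ys.append(y)

    # Assumption of this sketch: a trivial partition ends in a majority leaf.
    if len(parts) < 2:
        return {"leaf": max(set(labels), key=labels.count)}

    children = {o: grow(xs, ys, choose_test) for o, (xs, ys) in parts.items()}
    return {"test": test, "children": children}


# Toy usage with a fixed threshold test on the first attribute.
toy = grow([(1,), (2,), (3,)], [1, 2, 2], lambda vs, ls: (lambda x: x[0] <= 1.5))
```

Each recursive call receives a strictly smaller set, so the recursion terminates.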
C4.5 uses tests of three types, each of which involves only a single attribute $x_j$. Decision regions in the instance space are thus bounded by hyperplanes, each of which is orthogonal to one of the attribute axes.
If $x_j$ is a discrete-valued attribute with values from $\{1,\dots,T_j\}$, then the possible tests are:
- "$x_j = ?$" with $T_j$ outcomes, one for each value from $\{1,\dots,T_j\}$. (This is the default test.)
- "$x_j \in G$", where $G \subset \{1,\dots,T_j\}$. Tests of this kind are found by a greedy search that maximizes the value of the splitting criterion (discussed below).
If $x_j$ is a real-valued attribute, then a test is "$x_j \le \theta$" with two outcomes, "true" and "false". Here $\theta$ is a constant threshold. Possible values of $\theta$ are found by sorting the distinct values of $x_j$ that appear in the training set. Possible thresholds are values between each pair of adjacent values in the sorted sequence. So, if the training vectors have $u$ distinct values for the $j$-th attribute, then $u-1$ thresholds are considered.
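A minimal sketch of the threshold enumeration for a real-valued attribute follows. The text only requires a value between each pair of adjacent distinct values; taking the midpoint is an assumption of this example, and `candidate_thresholds` is a name introduced here.

```python
def candidate_thresholds(values):
    """Return one threshold between each pair of adjacent distinct values."""
    distinct = sorted(set(values))
    # Midpoints are one possible choice of "a value between adjacent values".
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
```

For an attribute with `u` distinct values this yields `u - 1` candidates, matching the count stated above.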
3.2 Test Selection Procedure
C4.5 relies on a greedy search, selecting a candidate test that maximizes a heuristic splitting criterion.
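Two criteria are used in C4.5: information gain and gain ratio. Let $\mathcal{X}'$ be the current set of training vectors, and let $C_j$ be the set of indexes of training vectors from $\mathcal{X}'$ that belong to the $j$-th class, for $j \in \{1,\dots,M\}$. Let $RF(j,\mathcal{X}') = |C_j|/|\mathcal{X}'|$ be the relative frequency of training vectors in $\mathcal{X}'$ with indexes from $C_j$. The information content of a message that identifies the class of vectors from $\mathcal{X}'$ is
$$I(\mathcal{X}') = -\sum_{j=1}^{M} RF(j,\mathcal{X}')\log_2 RF(j,\mathcal{X}').$$
After we split $\mathcal{X}'$ into subsets $\mathcal{X}'_1,\dots,\mathcal{X}'_t$ with respect to a test $B$, the information gain is
$$G(\mathcal{X}',B) = I(\mathcal{X}') - \sum_{i=1}^{t}\frac{|\mathcal{X}'_i|}{|\mathcal{X}'|}\, I(\mathcal{X}'_i).$$
The potential information from the partition itself is
$$P(\mathcal{X}',B) = -\sum_{i=1}^{t}\frac{|\mathcal{X}'_i|}{|\mathcal{X}'|}\log_2\frac{|\mathcal{X}'_i|}{|\mathcal{X}'|}.$$
The test $B$ is chosen such that it maximizes the gain ratio $G(\mathcal{X}',B)/P(\mathcal{X}',B)$.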
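The splitting criterion can be made concrete with a small sketch that computes the information content, the information gain, the potential information and their ratio for a given assignment of vectors to test outcomes. The function names and the convention of returning 0 when the potential information is 0 are assumptions of this example.

```python
from collections import Counter
from math import log2


def info(labels):
    """Information content (entropy, in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def gain_ratio(labels, outcomes):
    """Gain ratio of the test whose outcome for vector i is outcomes[i]."""
    n = len(labels)
    parts = {}
    for y, o in zip(labels, outcomes):
        parts.setdefault(o, []).append(y)
    # Information gain: entropy before the split minus weighted entropy after.
    gain = info(labels) - sum(len(p) / n * info(p) for p in parts.values())
    # Potential information of the partition itself.
    split = -sum(len(p) / n * log2(len(p) / n) for p in parts.values())
    return gain / split if split > 0 else 0.0
```

A test that separates the classes perfectly into two equal halves has gain 1 bit and potential information 1 bit, while a test that is independent of the class has gain 0.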
3.3 Notes on C5.0 algorithm
C4.5 was superseded in 1997 by the commercial system See5/C5.0 (or C5.0 for short). The changes encompass new capabilities as well as much-improved efficiency, and include the following items. (1) A variant of boosting, which constructs an ensemble of classifiers that later vote to give a final classification. Boosting often leads to a dramatic improvement in predictive accuracy. (2) New data types (e.g., dates), “not applicable” values, variable misclassification costs, and mechanisms for pre-filtering of attributes. (3) Unordered rule sets: when a vector is classified, all applicable rules are found and voted. This improves both the interpretability of rule sets and their predictive accuracy. (4) Greatly improved scalability of both decision trees and (particularly) rule sets (sets of if-then rules, a representation of a decision tree). Scalability is enhanced by multi-threading; C5.0 can take advantage of computers with multiple CPUs and/or cores.
3.4 Running Time of the Single-threaded Tree Construction Part of the C4.5 and C5.0 Algorithms
Let us recall the parameters of the model: $N$ is the number of vectors in the training set, $M$ is the number of classes, and $d$ is the number of attributes (elements of vectors from the training set). Let the height $h$ of the constructed tree be a parameter of the algorithm. Let $In_R$ be the set of indexes of real-valued attributes and let $In_D$ be the set of indexes of discrete-valued attributes.
Let us describe the procedure step by step because we will use it for our improvements. Assume that we construct a binary tree of height $h$.
The main procedure is ConstructClassifiers, which invokes the recursive procedure FormTree for constructing nodes. The main parameters of FormTree are the index of the tree level, the resulting subtree that the procedure constructs, and the set of vectors that is used for constructing this subtree.
Let us present the ConstructClassifiers and FormTree procedures as Algorithm 1. The FormTree procedure performs two steps. The first step, ChooseSplit, chooses the test, that is, an attribute and a split by this attribute that maximize the objective function (the gain ratio). It returns the index of the resulting attribute and the resulting split. The second step, Divide, is the splitting process itself.
Let us describe the ChooseSplit procedure, which provides the best attribute index and the split itself. It is presented in Algorithm 2. The procedure considers each attribute and processes it in one of two ways, depending on whether the attribute is real-valued or discrete-valued. Let us describe the procedure for a real-valued attribute. We have three steps.
The first step is a sorting by element of vectors. This procedure is . Assume that the result indexes in a sorted order are , where . So, now we can split vectors by and then there will be two sets and , for .
The second step is computing , and , where , , . We use the following formula for the values:
if ; and otherwise.
, if .
, if .
The third step is choosing maximum , where and . We use these formulas because they correspond to splitting and , , and .
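The three steps for a real-valued attribute (sort, accumulate class counts along the sorted order, take the best split) can be sketched as follows. This is a simplified illustration and not the bookkeeping of Algorithm 2: it keeps one running counter of the classes to the left of the split and recomputes the right-hand counts by subtraction, which costs time proportional to the number of classes per candidate. The names `best_threshold` and `_info` and the midpoint thresholds are assumptions of this example.

```python
from collections import Counter
from math import log2


def _info(counts, n):
    """Entropy (bits) of a class-count table whose counts sum to n."""
    return -sum(c / n * log2(c / n) for c in counts.values() if c)


def best_threshold(values, labels):
    """Return (best gain ratio, threshold) for one real-valued attribute."""
    # Step 1: sort vector indexes by the attribute value.
    order = sorted(range(len(values)), key=values.__getitem__)
    n = len(order)
    total = Counter(labels)
    left = Counter()          # class counts among the first k sorted vectors
    base = _info(total, n)
    best = (0.0, None)
    # Steps 2 and 3: scan the sorted order, score each admissible split.
    for u in range(n - 1):
        i, nxt = order[u], order[u + 1]
        left[labels[i]] += 1
        if values[i] == values[nxt]:
            continue          # no threshold fits between equal values
        k = u + 1
        right = total - left
        gain = base - k / n * _info(left, k) - (n - k) / n * _info(right, n - k)
        split = -(k / n * log2(k / n) + (n - k) / n * log2((n - k) / n))
        ratio = gain / split
        if ratio > best[0]:
            best = (ratio, (values[i] + values[nxt]) / 2)
    return best
```

If no split has positive gain, for example when all values coincide, the sketch returns `(0.0, None)`.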
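If we process a discrete-valued attribute, then we can compute the value of the objective function when we split all elements of the current set according to the value of the attribute. So, the $w$-th subset consists of the vectors whose value of the attribute equals $w$, for each possible value $w$ of the attribute.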
Let us describe the processing of discrete-valued attributes. The first step is computing the case numbers of classes before the split, the case numbers of classes after the split, and the case numbers for each of the values of the current attribute.
The second step is calculating the entropy before the split. The third step is calculating the entropies after the split into branches, the information gain and the potential information. The last step is calculating the gain ratio.
Let us describe the Divide procedure, which splits the set of vectors. The procedure is also described in Algorithms 2 and 3. The Divide procedure recursively invokes the FormTree procedure for each set from the sequence of sets produced by the split in order to construct the child subtrees.
Let us discuss the running time of the algorithm.
Theorem 3.1. The running time of C5.0 is $O(hd(NM + N\log N))$.
Proof. In fact, takes the main time. That is why we focus on analyzing this procedure. Let us consider a real-valued attribute. The running time for computing of , and is . The running time for sorting procedure is . The running time of computing a maximum of gain ratios for different splits is . Additionally, we should initialize array that takes . The total complexity of this processing a real-valued attribute is .
Let us consider a discrete-valued attribute. The cases processing time complexity is . An information gain for some discrete attribute is calculated with running time, where is a number of attribute values, is a number of classes. An entropy before cutting is calculated with running time, an entropy after cutting is calculated in . The potential information is calculated with running time. The gain ratio is calculated with running time.
Therefore the running time of processing of one discrete-valued attribute is .
Note that if we consider all sets of one level of the decision tree, then together they contain all elements of the training set. Therefore, the total complexity for one level is $O(d(NM + N\log N))$, and the total complexity for the whole tree is $O(hd(NM + N\log N))$.
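The step from one node to one level relies on the node sets of a level being disjoint. As a short worked bound for the logarithmic term, let $n_v$ (a symbol introduced here only for illustration) denote the number of training vectors that reach node $v$ of a fixed level, so that the $n_v$ sum to at most $N$:

```latex
\sum_{v} n_v \log n_v
  \;\le\; \sum_{v} n_v \log N
  \;=\; \log N \sum_{v} n_v
  \;\le\; N \log N .
```

Multiplying by the number of attributes and by the number of levels gives the logarithmic part of the bound for the whole tree.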
4 Improvement of the Classical C4.5/C5.0 algorithms
4.1 Improvement of Discrete-valued Attributes Processing
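If we process a discrete-valued attribute, then we can compute the value of the objective function when we split all elements of the current set according to the value of the attribute. So, the $w$-th subset consists of the vectors whose value of the attribute equals $w$, for each possible value $w$ of the attribute.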
We will process all vectors of one by one. Let us consider processing of current -th vector such that and . Let us compute the following variables: is a number of elements of ; is a number of vectors from that belongs to the class ; is a number of vectors from that belongs to the class ; is a potential information; is ; is information of ; . Assume that these variables contains values after processing -th vector and and contains values before processing -th vector. The final values of the variables will be after processing all variables. We will recompute each variable according to the formulas from Figure 2 (only variables that depends on and are changed)
So, finally we obtain the ProcessDiscrete procedure from Algorithm 6.
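The idea of processing the vectors one by one and updating only the quantities that change can be sketched as follows. The update formulas of Figure 2 are not reproduced here; this example instead keeps running sums of $c\log_2 c$ over the class counts, the value counts and the joint counts, which is one standard way to obtain the entropies from counts (an assumption of this sketch). Python dicts stand in for the counters, and `process_discrete`, `bump` and `xlogx` are names introduced here.

```python
from collections import defaultdict
from math import log2


def xlogx(c):
    return c * log2(c) if c > 0 else 0.0


def bump(table, key):
    """Increment table[key]; return the change of the sum of c*log2(c)."""
    old = table[key]
    table[key] = old + 1
    return xlogx(old + 1) - xlogx(old)


def process_discrete(values, labels):
    """Gain ratio of splitting by a discrete attribute, in a single pass."""
    n = len(values)
    cls, val, joint = defaultdict(int), defaultdict(int), defaultdict(int)
    s_cls = s_val = s_joint = 0.0
    for w, j in zip(values, labels):
        # Only the three counters touched by this vector change.
        s_cls += bump(cls, j)
        s_val += bump(val, w)
        s_joint += bump(joint, (j, w))
    info_before = log2(n) - s_cls / n      # entropy before the split
    info_after = (s_val - s_joint) / n     # weighted entropy after the split
    potential = log2(n) - s_val / n        # potential information
    gain = info_before - info_after
    return gain / potential if potential > 0 else 0.0
```

Each vector triggers a constant number of counter updates, so the cost per vector is dominated by the cost of one lookup in the counter structure.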
4.2 Using a Self-balancing Binary Search Tree
We suggest the following improvement. Let us store the counters that are indexed by classes in a self-balancing binary search tree [CLRS01]. As a self-balancing binary search tree we can use the AVL tree [AVL62, CLRS01] or the Red-Black tree [GS78, CLRS01]. This data structure can be used to implement a mapping from a set of indexes to a set of values. We always assume that the data structure contains only indexes with a non-zero value, and that all other values are zero. We use the indexes of non-zero elements as keys for constructing the search tree, and the values as additional data stored in the corresponding node of the tree. In this paper we call this data structure Tree Map. The data structure has three properties concerning running time. (i) The running time of adding or removing an index (called a key) is $O(\log s)$, where $s$ is the number of keys in the tree, that is, the number of indexes with non-zero values. (ii) The running time of finding a value by index and modifying the value is $O(\log s)$. (iii) The running time of removing all indexes from the data structure and of visiting all indexes of the data structure is $O(s)$, where $s$ is the number of indexes with non-zero values.
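A minimal sketch of a Tree Map with this interface is given below. To keep it short, it uses a treap (a randomized balanced search tree with expected logarithmic depth) instead of the AVL or Red-Black tree named in the text; that substitution, the class name `TreeMap` and its method names are assumptions of this example. Only increments are supported, so every stored key has a non-zero value, and a missing key reads as zero.

```python
import random


class _Node:
    __slots__ = ("key", "value", "prio", "left", "right")

    def __init__(self, key, value):
        self.key, self.value = key, value
        self.prio = random.random()   # heap priority keeps the tree balanced
        self.left = self.right = None


def _rotate_right(n):
    l = n.left
    n.left, l.right = l.right, n
    return l


def _rotate_left(n):
    r = n.right
    n.right, r.left = r.left, n
    return r


def _add(n, key, delta):
    if n is None:
        return _Node(key, delta)
    if key == n.key:
        n.value += delta
    elif key < n.key:
        n.left = _add(n.left, key, delta)
        if n.left.prio > n.prio:
            n = _rotate_right(n)
    else:
        n.right = _add(n.right, key, delta)
        if n.right.prio > n.prio:
            n = _rotate_left(n)
    return n


class TreeMap:
    """Sparse map from indexes to counters; absent keys have value 0."""

    def __init__(self):
        self.root = None

    def add(self, key, delta=1):
        self.root = _add(self.root, key, delta)

    def get(self, key):
        n = self.root
        while n is not None and n.key != key:
            n = n.left if key < n.key else n.right
        return n.value if n is not None else 0

    def items(self):
        """Visit all stored keys in increasing order."""
        stack, n = [], self.root
        while stack or n is not None:
            while n is not None:
                stack.append(n)
                n = n.left
            n = stack.pop()
            yield n.key, n.value
            n = n.right

    def clear(self):
        self.root = None
```

The point of the sparse representation is that the cost of clearing or enumerating the map depends on the number of keys actually stored, not on the total number of classes.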
If we use Tree Map, then we can provide the following running time.
Theorem. The running time of C5.0 that uses Tree Map (a self-balancing binary search tree) is $O(hdN\log N)$.
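Proof. Let us follow the proof of Theorem 3.1 and consider a node with a set $\mathcal{X}'$ of vectors. With Tree Map we do not need to initialize the counters; instead, we erase the stored values after processing an attribute, and this procedure takes $O(|\mathcal{X}'|)$ steps. So, the running time for processing a real-valued attribute becomes $O(|\mathcal{X}'|\log|\mathcal{X}'|)$, and for a discrete-valued attribute it is also $O(|\mathcal{X}'|\log|\mathcal{X}'|)$, because we process the vectors one by one and each recomputation takes only $O(\log|\mathcal{X}'|)$ steps for updating the values stored in Tree Map and $O(1)$ steps for other actions. Therefore, the total complexity is $O(hdN\log N)$.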
5 Quantum C5.0
The key idea of the quantum version of the C5.0 algorithm is to use the Dürr and Høyer algorithm for maximum search [DH96] and the amplitude amplification algorithm [BHMT02].
In combination, these two algorithms have the following property:
Lemma. Suppose we have a function $f:\{1,\dots,K\}\to\mathbb{R}$ such that the running time of computing $f(x)$ is $T(K)$. Then there is a quantum algorithm that finds an argument $x_0$ of maximal $f(x_0)$; the expected running time of the algorithm is $O(\sqrt{K}\cdot T(K))$ and the success probability is at least $1/2$.
Using this lemma, we can replace the maximum search over attributes in the ChooseSplit function and use ProcessAttribute as the function $f$. Let us call the resulting function QChooseSplit. Additionally, to reduce the error probability, we can repeat the maximum finding process $\log d$ times and choose the best solution. The procedure is presented in Algorithm 7.
Theorem. The running time of the Quantum C5.0 algorithm is $O(h\sqrt{d}\log d\cdot N\log N)$. The success probability of QC5.0 is $O\left(\left(1-\frac{1}{d}\right)^{k}\right)$, where $k$ is the number of inner nodes (not leaves).
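Proof. The running time of ProcessAttribute is $O(|\mathcal{X}'|\log|\mathcal{X}'|)$ for a node with a set $\mathcal{X}'$ of vectors. So, the running time of the maximum search is $O(\sqrt{d}\,|\mathcal{X}'|\log|\mathcal{X}'|)$. With the repetitions of the algorithm, the running time is $O(\sqrt{d}\log d\,|\mathcal{X}'|\log|\mathcal{X}'|)$. If we sum the running time over all nodes, then we obtain $O(h\sqrt{d}\log d\cdot N\log N)$. The success probability of the Dürr and Høyer algorithm is at least $1/2$. We call it $\log d$ times and choose the maximum among the obtained values of gain ratios. Then we find the correct attribute for one node with success probability $O\left(1-\frac{1}{2^{\log d}}\right) = O\left(1-\frac{1}{d}\right)$. We should find the correct attributes for all nodes except leaves. Thus, the success probability for the whole tree is $O\left(\left(1-\frac{1}{d}\right)^{k}\right)$, where $k$ is the number of internal nodes (not leaves).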
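The effect of the repetitions on the success probability can be checked numerically. The sketch below assumes that each run of the maximum search succeeds independently with probability exactly `p_single` (the text only guarantees at least 1/2) and that the number of repetitions is the base-2 logarithm of the number of attributes rounded up; the function name `qc50_success` is introduced here.

```python
from math import ceil, log2


def qc50_success(d, inner_nodes, p_single=0.5):
    """Return (repetitions, per-node success, whole-tree success)."""
    repeats = ceil(log2(d))
    # A node fails only if every one of the repeated searches fails.
    per_node = 1 - (1 - p_single) ** repeats
    # Every inner node must pick the correct attribute.
    return repeats, per_node, per_node ** inner_nodes
```

For example, with 16 attributes there are 4 repetitions and each inner node succeeds with probability 1 - 1/16, so the per-node error shrinks as the number of attributes grows while the whole-tree bound decays with the number of inner nodes.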
6 Conclusion
Firstly, we have suggested a version of the C4.5/C5.0 algorithm that uses the Tree Map data structure (a self-balancing binary search tree, for example the Red-Black tree or the AVL tree). This version has a better running time. Secondly, we have presented a quantum version of the C5.0 algorithm for the classification problem. This algorithm demonstrates an almost quadratic speed-up with respect to the number of attributes.
References
- [AdW17] Srinivasan Arunachalam and Ronald de Wolf. Guest column: a survey of quantum learning theory. ACM SIGACT News, 48(2):41–67, 2017.
- [Amb17] A. Ambainis. Understanding quantum algorithms via query complexity. arXiv:1712.06349, 2017.
- [AVL62] George M Adel’son-Vel’skii and Evgenii Mikhailovich Landis. An algorithm for organization of information. In Doklady Akademii Nauk, volume 146, pages 263–266. Russian Academy of Sciences, 1962.
- [BHMT02] G. Brassard, P. Høyer, M. Mosca, and A. Tapp. Quantum amplitude amplification and estimation. Contemporary Mathematics, 305:53–74, 2002.
- [c5019] C5.0: An informal tutorial, 2019. https://www.rulequest.com/see5-unix.html.
- [CLRS01] T. H Cormen, C. E Leiserson, R. L Rivest, and C. Stein. Introduction to Algorithms-Second Edition. McGraw-Hill, 2001.
- [DH96] Christoph Dürr and Peter Høyer. A quantum algorithm for finding the minimum. arXiv preprint quant-ph/9607014, 1996.
- [DW01] Ronald De Wolf. Quantum computing and communication complexity. 2001.
- [Eth10] Ethem Alpaydin. Introduction to machine learning. 2010.
- [Fri99] J. H. Friedman. Greedy function approximation: A gradient boosting machine. 1999.
- [Gro96] Lov K Grover. A fast quantum mechanical algorithm for database search. In Proceedings of the twenty-eighth annual ACM symposium on Theory of computing, pages 212–219. ACM, 1996.
- [GS78] L. J Guibas and R. Sedgewick. A dichromatic framework for balanced trees. In Proceedings of SFCS 1978, pages 8–21. IEEE, 1978.
- [HTF09] T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second Edition. 2009.
- [Hun] EB Hunt. Concept learning: An information processing problem. 1962.
- [Jor] Stephen Jordan. Bounded error quantum algorithms zoo. https://math.nist.gov/quantum/zoo.
- [KKS19] K. Khadiev, D. Kravchenko, and D. Serov. On the quantum and classical complexity of solving subtraction games. In Proceedings of CSR 2019, volume 11532 of LNCS, pages 228–236. 2019.
- [Kop18] Dawid Kopczyk. Quantum machine learning for data scientists. arXiv preprint arXiv:1804.10068, 2018.
- [KQ02] R. Kohavi and J. R. Quinlan. Data mining tasks and methods: Classification: decision-tree discovery. Handbook of data mining and knowledge discovery. – Oxford University Press, 2002.
- [KS19] K. Khadiev and L. Safina. Quantum algorithm for dynamic programming approach for dags. applications for zhegalkin polynomial evaluation and some problems on dags. In Proceedings of UCNC 2019, volume 4362 of LNCS, pages 150–163. 2019.
- [LHAJ84] Breiman L., Friedman J. H., Olshen R. A., and Stone C. J. Classification and regression trees. 1984.
- [NC10] Michael A Nielsen and Isaac L Chuang. Quantum computation and quantum information. Cambridge university press, 2010.
- [Qui79] J R. Quinlan. Discovering rules by induction from large collections of examples. Expert systems in the micro electronics age, 1979.
- [Qui86] J. R. Quinlan. Induction of decision trees. Machine learning, pages 81–106, 1986.
- [Qui87] J. R. Quinlan. Simplifying decision trees. International journal of man-machine studies, 27(3):221–234, 1987.
- [Qui96] J. R. Quinlan. Improved use of continuous attributes in C4.5. Journal of Artificial Intelligence Research, pages 77–90, 1996.
- [RJ15] Pandya R. and Pandya J. C5.0 algorithm to improved decision tree with feature selection and reduced error pruning. International Journal of Computer Applications, pages 18–21, 2015.
- [SSP14] Maria Schuld, Ilya Sinayskiy, and Francesco Petruccione. The quest for a quantum neural network. Quantum Information Processing, 13(11):2567–2586, 2014.
- [SSP15] Maria Schuld, Ilya Sinayskiy, and Francesco Petruccione. An introduction to quantum machine learning. Contemporary Physics, 56(2):172–185, 2015.