Introduction
The use of Noisy Intermediate-Scale Quantum (NISQ) devices for machine learning tasks has been proposed in various forms [1]. However, much remains to be understood about the potential sources of quantum advantage. While it is reasonable to postulate that quantum computers may outperform classical computers in machine learning tasks, the existing propositions often perform poorly due to various restrictions [2].
As some have put it, NISQ devices are best at simulating themselves; in particular, this means running noisy random quantum circuits. This may not seem a very useful task; however, such noisy devices have been successfully used as a resource for so-called quantum generative models, leading to nontrivial probability distributions on vector spaces [3, 4, 5, 6]. Thus, we may hope to harness that specific resource for machine learning tasks. At the same time, it has been shown that any translation-invariant kernel can be substituted by an explicit feature map based on some probability distribution on the vector space of the features [7]. Put the other way around, any distinct probability distribution on a feature vector space can be expected to give rise to a new kernel for that feature space.
We recall the theory of the sampling-based approach to machine learning and establish a link to quantum generative models. Our goal is to show that we can develop quantum machine learning methods that use quantum devices solely as a source of a probability distribution, in the hope of using the quantum resource efficiently.
We use a scheme designed for random features for large-scale kernel machines [7] as sampling-based classification. We propose a method of quantum sampling based on random quantum circuits with a parameterized distribution of rotations. In short, we obtain a competitive quantum classifier whose crucial component is quantum sampling, a promising task for quantum supremacy.
1 Preliminaries – Randomized Feature Maps
One method to tackle large-scale kernels has been proposed by Rahimi and Recht [7, 8]. The proposition is to do preprocessing, mapping the input data to a randomized low-dimensional feature space, and then apply a linear classifier. The explicit pipeline is shown in Figure 1. The key idea is to replace designing a fancy kernel with developing a sophisticated method for sampling vectors. We will introduce the original idea, split it into steps, and specify which ones will be considered further.
Let us recall one of the key theorems that provides a foundation for the proposed scheme. Assume we have an input feature space $\mathcal{X} \subseteq \mathbb{R}^d$ and a shift-invariant kernel $k(x, y) = k(x - y)$; $p(\omega)$ is the corresponding probability distribution on the input feature space, given by the Fourier transform of $k$, and $z: \mathbb{R}^d \to \mathbb{R}^{2D}$ is an explicit map into a higher-dimensional space built in a certain way from random vectors $\omega_1, \dots, \omega_D$ sampled with distribution $p$.

Claim 1 (Uniform convergence of Fourier features) [7]. Let $\mathcal{M}$ be a compact subset of $\mathbb{R}^d$ with diameter $\mathrm{diam}(\mathcal{M})$. Then, for the mapping $z$, we have

$\Pr\left[\sup_{x, y \in \mathcal{M}} |z(x)^\top z(y) - k(x, y)| \geq \epsilon\right] \leq 2^8 \left(\frac{\sigma_p \, \mathrm{diam}(\mathcal{M})}{\epsilon}\right)^2 \exp\left(-\frac{D \epsilon^2}{4(d+2)}\right),$    (1)

where $\sigma_p^2 = \mathbb{E}_p[\omega^\top \omega]$ is the second moment of the Fourier transform of $k$. Further, $\sup_{x, y \in \mathcal{M}} |z(x)^\top z(y) - k(x, y)| \leq \epsilon$ with any constant probability when $D = \Omega\left(\frac{d}{\epsilon^2} \log \frac{\sigma_p \, \mathrm{diam}(\mathcal{M})}{\epsilon}\right)$.

The proposition from [7] for the construction of $z$ was to sample a set of vectors $\omega_1, \dots, \omega_D$, of the same dimension as the data at hand, as a basis for the new features. The vectors in the training data are compared with the random set, and new features are generated as inner products with the corresponding vectors. Then all features are transformed separately via a one-dimensional nonlinear (cosine, sine) function:

$z(x) = \frac{1}{\sqrt{D}} \left(\cos(\omega_1^\top x), \sin(\omega_1^\top x), \dots, \cos(\omega_D^\top x), \sin(\omega_D^\top x)\right)^\top.$    (2)
The new data set is passed on to a linear classifier for training. New test data is transformed the same way: the inner products are computed using the same sample of random vectors and transformed with the same nonlinear function. Then the previously trained linear classifier is applied.
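As a concrete illustration, the whole pipeline can be sketched in a few lines of Python. The toy data set, the RBF kernel choice, and all hyperparameter values below are illustrative only, not the ones used later in the paper:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Toy binary data: two Gaussian blobs in d dimensions (illustrative only).
d, n = 10, 400
X = np.vstack([rng.normal(-1.0, 1.0, (n // 2, d)),
               rng.normal(1.0, 1.0, (n // 2, d))])
y = np.array([0] * (n // 2) + [1] * (n // 2))

# Initialization: sample D random vectors from the Fourier transform of the
# RBF kernel exp(-gamma * ||x - y||^2), which is a Gaussian distribution.
D, gamma = 200, 0.1
W = rng.normal(0.0, np.sqrt(2 * gamma), (D, d))

def z(X):
    """Preprocessing: the explicit cos/sin random feature map."""
    P = X @ W.T
    return np.hstack([np.cos(P), np.sin(P)]) / np.sqrt(D)

# Train a linear classifier on the transformed features.
clf = LinearSVC(max_iter=5000).fit(z(X), y)
acc = clf.score(z(X), y)
```

Here the linear classifier never sees the kernel itself; the random vectors and the one-dimensional nonlinearity carry all the kernel information.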
The scheme can be seen as a typical classifier with a preprocessing phase. In such a picture we have the following steps:

Initialization.

Preprocessing of the training data.

Classifier training with the processed data.

Preprocessing of the test data.

Applying the classifier on the processed test data.
To implement the scheme we need to specify the initialization and preprocessing steps.
We split the preprocessing into feature generation and a nonlinear map, and will focus mostly on the former later in the paper. We operate on data points of dimension $d$ and create $D$ new features, where $D$ does not have to be equal to $d$.

Initialization: Sample $\omega_1, \dots, \omega_D$ from a distribution given by hyperparameters.

Preprocessing: Transform any given training or test vector $x$ into $z(x)$.
In the remaining part we will show how to implement these steps using quantum circuits for vectors sampling.
2 Features Generation from Quantum Circuits
For our purposes it should be sufficient to know that any program that can be run on an $n$-qubit quantum computing device can usually be described by a unitary operator $U$. In this notation, for an input vector $|x\rangle$, the output is simply a result of multiplication

$|y\rangle = U |x\rangle.$    (3)

In this context the considered vectors are normalized and called states.
We aim at linking random vectors to quantum circuits in such a way that allows us to compute the inner product. Such a scheme is compatible with the scheme from the previous section. The quantum part, in the most general picture, is presented in Fig. 2.
In particular, we fix a quantum operation $U$ and denote its first row (Hermitian conjugated) as

$\langle \omega | = \langle 0 | U,$    (4)

using the basis vector $|0\rangle$. For a given $|x\rangle$ we want to compute the inner product $\langle \omega | x \rangle$. Let us note that we have

$\langle \omega | x \rangle = \langle 0 | U | x \rangle$    (5)

when $|x\rangle$ is normalized. We can obtain the inner product with some $\omega$ on a quantum computer by injecting $|x\rangle$ as the input state and reading the value of the first amplitude of the output state:

$(U |x\rangle)_1 = \langle 0 | U | x \rangle = \langle \omega | x \rangle.$    (6)
Let us note that reading the exact value requires so-called state tomography [9] and can be done with arbitrary precision with arbitrarily high probability, but always only approximately. In practice, estimating this value is a Bernoulli trial.
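A minimal numerical sketch of this point, using a plain NumPy state-vector simulation with a Haar-random unitary standing in for an actual circuit: a computational-basis measurement reveals only whether the outcome was 0, one shot at a time, so the value is estimated from repeated Bernoulli trials. (Estimating the real part of the amplitude itself would additionally require an interference circuit such as a Hadamard test.)

```python
import numpy as np

rng = np.random.default_rng(1)

# A Haar-like random unitary via QR decomposition (stand-in for a circuit U).
n_qubits = 3
dim = 2 ** n_qubits
A = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
Q, R = np.linalg.qr(A)
U = Q * (np.diag(R) / np.abs(np.diag(R)))  # phase fix keeps U unitary

# A normalized input state |x>.
x = rng.normal(size=dim)
x /= np.linalg.norm(x)

out = U @ x              # output state U|x>
amp = out[0]             # exact amplitude <0|U|x>
p0 = np.abs(amp) ** 2    # probability of measuring the outcome 0

# Each computational-basis measurement is one Bernoulli trial with success
# probability p0; repeating it `shots` times gives an estimate of p0.
shots = 100_000
p0_hat = rng.binomial(shots, p0) / shots
```

The statistical error of `p0_hat` shrinks as $1/\sqrt{\text{shots}}$, which is the "arbitrary precision, but always approximately" trade-off mentioned above.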
Based on the above considerations, the randomized feature map is constructed as follows. From some quantum operation construction procedure we define a probability distribution on a set of unitary operators. This set will be described in detail later as circuit Ansätze. The parameters of this distribution are hyperparameters of the whole classification scheme. From the fixed distribution we sample a set of unitary operators $U_1, \dots, U_D$ and, in consequence, vectors $\omega_1, \dots, \omega_D$. Each vector is a row of a unitary operator and is thus normalized. To obtain vectors of variable length we sample lengths $r_1, \dots, r_D$. The resulting set of vectors is $\{r_i \omega_i\}$ for $i = 1, \dots, D$. Let us note that the length can be correlated with the sampled vector. For example, we could design a larger circuit, with additional qubits, that after the measurement would indicate the length.
The mapping is performed as follows. In order to obtain feature $i$ for a data point $x$ we

map the data point into a normalized vector $|x\rangle$ and store the norm $m = \|x\|$,

apply circuit $U_i$ to state $|x\rangle$, computing $U_i |x\rangle$,

estimate the real part of the amplitude of $|0\rangle$, obtaining $\mathrm{Re}\,\langle \omega_i | x \rangle$,

scale the values, obtaining $r_i m \, \mathrm{Re}\,\langle \omega_i | x \rangle$,

apply a nonlinear map cos/sin, obtaining $\cos(r_i m \, \mathrm{Re}\,\langle \omega_i | x \rangle)$, $\sin(r_i m \, \mathrm{Re}\,\langle \omega_i | x \rangle)$,

return the pair of values.
Effectively, $z$ is a concatenation of $\cos(\Omega X)$ and $\sin(\Omega X)$, where cos/sin are elementwise matrix operations, the rows of $\Omega$ are the scaled random vectors, and the columns of $X$ are data points. The steps are sketched in Fig. 3.
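This matrix form can be sketched directly in NumPy; the distributions of $\Omega$, the lengths, and the data below are placeholders:

```python
import numpy as np

rng = np.random.default_rng(2)

d, D, n = 8, 16, 5                 # input dimension, random vectors, data points
Omega = rng.normal(size=(D, d))
Omega /= np.linalg.norm(Omega, axis=1, keepdims=True)  # unit rows, like rows of unitaries
r = rng.exponential(size=D)        # sampled lengths (the distribution is our choice)
X = rng.normal(size=(d, n))        # columns are data points

P = (r[:, None] * Omega) @ X            # matrix of scaled inner products
Z = np.vstack([np.cos(P), np.sin(P)])   # concatenation of cos and sin features
```

Each column of `Z` is the $2D$-dimensional feature vector of one data point, ready for a linear classifier.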
3 Example
3.1 Quantum Circuits
In the previous section we stated that we use quantum operations to generate the features. In Section 2 we only mentioned that any operation corresponds to a unitary operation. However, we will use a much more practical way of defining quantum operations: quantum circuits. That is a computational model inspired by classical logic circuits. The common idea is to describe a complex global operation with a sequence of simple and small basic operations. The most often recommended introduction can be found in [10].
We will use a set of basic operations represented by two unitary operations, one of them parameterized, and build circuits by multiplying these matrices. The operations correspond to one-qubit rotations and a two-qubit entangling gate, CNOT. The matrix representations of the two are

$R(\theta) = e^{-i \theta \sigma_y / 2} = \begin{pmatrix} \cos\frac{\theta}{2} & -\sin\frac{\theta}{2} \\ \sin\frac{\theta}{2} & \cos\frac{\theta}{2} \end{pmatrix}, \qquad \mathrm{CNOT} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix},$    (7)

where $\sigma_y$ is a Pauli matrix.
The end operation acts on a bigger space than $R(\theta)$ or CNOT. We will assume that all of the gates are extended to the same space with the tensor product operation; thus we only need to specify on which subspace each operator acts. In the case of $R(\theta)$ the operator acts on a subspace corresponding to one qubit, and we will use

$R_j(\theta) = I_{2^{j-1}} \otimes R(\theta) \otimes I_{2^{n-j}}$    (8)

to mark that it is the $j$-th out of $n$ qubits, where $I_m$ is an $m$-dimensional identity matrix and $\otimes$ is the tensor product operation (also tensordot in e.g. numpy). In the case of CNOT the operator acts on a product of two subspaces, corresponding to the so-called target and control qubits. We will mark it with $\mathrm{CNOT}_{c,t}$.

3.2 Random Vectors Circuit Ansätze
For this example we chose an Ansatz that generates a broad family of quantum circuits with few hyperparameters that have an intuitive interpretation. The parameters that need to be fixed are the number of layers $L$, the parameters of the normal distribution used for rotations, $\mu$ and $\sigma$, and the variance of the vector lengths $\sigma_r$.

For $n$ qubits the circuit is created as follows. First a rotation gate is added on each of the qubits, $n$ gates in total:

$U_0 = R_1(\theta_{0,1}) R_2(\theta_{0,2}) \cdots R_n(\theta_{0,n}).$    (9)

Then for each layer $l = 1, \dots, L$ we repeat: sample control and action qubits $c_l, a_l$ for a CNOT gate and then add a rotation on both the action and control qubits,

$U_l = R_{a_l}(\theta_{l,a}) R_{c_l}(\theta_{l,c}) \, \mathrm{CNOT}_{c_l, a_l}.$    (10)

The resulting operator is composed as

$U = U_L \cdots U_1 U_0.$    (11)
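A NumPy sketch of this construction; the 0-indexed qubit ordering and the exact rotation convention $e^{-i\theta\sigma_y/2}$ are our assumptions:

```python
import numpy as np
from functools import reduce

rng = np.random.default_rng(3)

def R(theta):
    """One-qubit rotation, here taken as exp(-i * theta * sigma_y / 2)."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])

def rot(theta, j, n):
    """R(theta) acting on qubit j (0-indexed) of n: I (x) ... (x) R (x) ... (x) I."""
    return np.kron(np.kron(np.eye(2 ** j), R(theta)), np.eye(2 ** (n - j - 1)))

def cnot(control, target, n):
    """CNOT on the given control/target qubits, built from projectors."""
    P0, P1 = np.diag([1.0, 0.0]), np.diag([0.0, 1.0])
    X = np.array([[0.0, 1.0], [1.0, 0.0]])
    t0 = [P0 if q == control else np.eye(2) for q in range(n)]
    t1 = [P1 if q == control else (X if q == target else np.eye(2)) for q in range(n)]
    return reduce(np.kron, t0) + reduce(np.kron, t1)

def sample_circuit(n, layers, mu, sigma, rng):
    """Initial rotation on every qubit, then a CNOT plus two rotations per layer."""
    U = np.eye(2 ** n)
    for j in range(n):                              # the initial rotation layer
        U = rot(rng.normal(mu, sigma), j, n) @ U
    for _ in range(layers):                         # the CNOT + rotation layers
        c, t = rng.choice(n, size=2, replace=False)
        U = rot(rng.normal(mu, sigma), t, n) @ rot(rng.normal(mu, sigma), c, n) \
            @ cnot(c, t, n) @ U
    return U

U = sample_circuit(n=4, layers=8, mu=0.0, sigma=1.0, rng=rng)
```

The first row of the sampled `U` is then one random vector $\omega$ of the feature map.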
An example is presented in Figure 4. The rotation angles are sampled from the distribution described by the hyperparameters, $\theta \sim \mathcal{N}(\mu, \sigma^2)$. We use a Gaussian distribution with fixed mean and variance. An additional hyperparameter is $\sigma_r$, which affects the weights of the vectors. For each circuit $U_i$ we store a weight $r_i$, so that we effectively consider the vector $r_i \omega_i$.

3.3 Setting
In this work we perform basic accuracy-measuring experiments. As the test dataset we consider the MNIST dataset [11], as in [12]. We aim at beating the SVM with a radial basis function kernel. We explore the space of hyperparameters and the relation between the score and the number of random quantum circuits used.
The whole experimental algorithm is based on the randomized feature maps scheme presented in Section 1. We first describe the details of the preprocessing and classification used, and then report the obtained results.
The MNIST dataset contains 70000 images corresponding to the digits 0-9. We extract only two of the digits, 3 and 5, giving 13454 data points in total. For measuring the accuracy we use a single training-test split with size proportion 6:1; the sizes of the sets are 11532 and 1922, respectively.
Before feeding the algorithm with data we perform simple feature selection. We plan to use 7-qubit circuits that operate on vectors of dimension $2^7 = 128$. Thus we select the 128 best features according to a statistical test, looking for multimodal distributions. For feature selection we use the SelectKBest method, and we perform classification with the LinearSVC method from sklearn [13].
3.4 Scores
In the presented example we analyse the accuracy of the resulting classification scheme. We will compare the results to the ones obtainable by linear and nonlinear methods. The results depend on the hyperparameters selection, thus we show the results obtained for a range of values.
The accuracy considered here is the fraction of correct answers in a binary classification scheme. For comparison we take permutation-invariant methods, without any optimisation towards image processing. The two main reference points are Linear SVC and the SVM with a radial basis function kernel. The scores obtainable with these methods are roughly 96% and 99%, respectively.
One particularly important hyperparameter is the selected circuit size, as it is connected to the complexity of the quantum part of the scheme: a larger circuit size increases both the simulation cost and the quantum device running time. In the case of the selected Ansatz we can choose the number of CNOT gates freely. In this example we consider numbers of CNOT gates that are multiples of the number of qubits.
Our hyperparameter selection was done with a grid search for a small number of random vectors. The best score was equal to 0.9909, although the average score was highest for the largest tested value of the corresponding hyperparameter. This result is better than what we achieved using an SVM with a radial basis function kernel. The best results were obtained with the number of circuit layers set to twice the number of qubits.
Method                           Accuracy
Logistic regression              .959
SVM+RBF
Quantum Generative Model Kernel  .9909

The best scores obtained with the considered methods. As the reference methods we consider logistic regression and a support vector machine with a radial basis function kernel (SVM+RBF). The results of the reference methods come from [12].

The histograms of the scores are presented in Fig. 5 and Fig. 6. The relation between the best score and the number of random vectors is presented in Fig. 7.
4 Discussion
There are three key points that we want to discuss, concerning simulation complexity, applicability to quantum data and the obtained scores.
Firstly, the proposed method provides a link between quantum circuit Ansätze and machine learning kernels: any family of quantum circuits gives a new kernel. For small quantum circuits this gives a quantum-inspired kernel creation method. For large circuits, which we can expect to yield a distribution that is hard to simulate, we obtain kernels that can be considered nonclassical. However, we want to stress that the fact that simulating the circuits is time consuming does not mean that sampling from the resulting distribution is as well. Many families of random circuits are known to converge, as their length grows, to easily sampled distributions, in particular unitary designs [14].
Secondly, the method works with data encoded in quantum states. Thus, it is compatible with input data that is intrinsically of quantum nature. This may be an important feature if quantum simulation on quantum devices becomes common and methods that handle the output data directly become desirable. In particular, the kernel can be combined with methods that operate on states with a known preparation scheme but without the need to obtain the representation in the computational basis, as in VQE [15]. Also, these are the most probable scenarios to require circuit sizes that cannot be simulated classically in reasonable time.
Lastly, the exemplary Ansatz seems competitive compared to established classical methods. The comparison is far from a conclusive argument for supremacy over classical methods, but it supports an optimistic view. The selected problem is best handled with methods that harness spatial relations in the image [16]; for a fair comparison these relations should be included.
The generative model that we chose in this work is only an example. Apart from the circuit model there are other quantum computational models, and a natural direction for future work would be to look at other possibilities. Another example of a computationally universal model that can generate probability distributions with specific features is the quantum walk [17]. One could also turn to a general description of a quantum system given by a Schrödinger/Lindblad equation [18]. These models could yield different probability distributions, and thus be the source of different kernels.
References
 [1] Jacob Biamonte, Peter Wittek, Nicola Pancotti, Patrick Rebentrost, Nathan Wiebe, and Seth Lloyd. Quantum machine learning. Nature, 549(7671):195, 2017.
 [2] Carlo Ciliberto, Mark Herbster, Alessandro Davide Ialongo, Massimiliano Pontil, Andrea Rocchetto, Simone Severini, and Leonard Wossnig. Quantum machine learning: a classical perspective. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 474(2209):20170551, 2018.
 [3] John Preskill. Quantum computing in the NISQ era and beyond. Quantum, 2:79, 2018.
 [4] Scott Aaronson and Lijie Chen. Complexity-theoretic foundations of quantum supremacy experiments. arXiv preprint arXiv:1612.05903, 2016.
 [5] Sergio Boixo, Sergei V Isakov, Vadim N Smelyanskiy, Ryan Babbush, Nan Ding, Zhang Jiang, Michael J Bremner, John M Martinis, and Hartmut Neven. Characterizing quantum supremacy in near-term devices. Nature Physics, 14(6):595, 2018.
 [6] Jonathan Romero and Alán Aspuru-Guzik. Variational quantum generators: Generative adversarial quantum machine learning for continuous distributions. arXiv preprint arXiv:1901.00848, 2019.
 [7] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 1177–1184, 2008.
 [8] Ali Rahimi and Benjamin Recht. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In Advances in neural information processing systems, pages 1313–1320, 2009.
 [9] RT Thew, Kae Nemoto, Andrew G White, and William J Munro. Qudit quantum-state tomography. Physical Review A, 66(1):012303, 2002.
 [10] Michael A Nielsen and Isaac L Chuang. Quantum Computation and Quantum Information. Cambridge University Press, 2010.

 [11] Mohd Razif Shamsuddin, Shuzlina Abdul-Rahman, and Azlinah Mohamed. Exploratory analysis of MNIST handwritten digit for machine learning modelling. In International Conference on Soft Computing in Data Science, pages 134–145. Springer, 2018.
 [12] Christopher Wilson, Johannes Otterbach, Nikolas Tezak, Robert Smith, Peter Karalekas, Anthony Polloreno, Sohaib Alam, Gavin Crooks, and Marcus Da Silva. Quantum kitchen sinks: An algorithm for machine learning on near-term quantum computers. Bulletin of the American Physical Society, 2019.
 [13] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
 [14] Aram W Harrow and Richard A Low. Random quantum circuits are approximate 2-designs. Communications in Mathematical Physics, 291(1):257–302, 2009.

 [15] Alberto Peruzzo, Jarrod McClean, Peter Shadbolt, Man-Hong Yung, Xiao-Qi Zhou, Peter J Love, Alán Aspuru-Guzik, and Jeremy L O'Brien. A variational eigenvalue solver on a photonic quantum processor. Nature Communications, 5:4213, 2014.
 [16] http://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results.html#4d4e495354.
 [17] Przemysław Sadowski, Jarosław Adam Miszczak, and Mateusz Ostaszewski. Lively quantum walks on cycles. Journal of Physics A: Mathematical and Theoretical, 49(37):375302, 2016.
 [18] Łukasz Pawela and Przemysław Sadowski. Various methods of optimizing control pulses for quantum systems with decoherence. Quantum Information Processing, 15(5):1937–1953, 2016.