# Learning with Density Matrices and Random Features

A density matrix describes the statistical state of a quantum system. It is a powerful formalism to represent both the quantum and classical uncertainty of quantum systems and to express different statistical operations such as measurement, system combination and expectations as linear algebra operations. This paper explores how density matrices can be used as a building block to build machine learning models exploiting their ability to straightforwardly combine linear algebra and probability. One of the main results of the paper is to show that density matrices coupled with random Fourier features could approximate arbitrary probability distributions over ℝ^n. Based on this finding the paper builds different models for density estimation, classification and regression. These models are differentiable, so it is possible to integrate them with other differentiable components, such as deep learning architectures and to learn their parameters using gradient-based optimization. In addition, the paper presents optimization-less training strategies based on estimation and model averaging. The models are evaluated in benchmark tasks and the results are reported and discussed.

Comments

There are no comments yet.

## Authors

• 13 publications
• 1 publication
• 3 publications
• 10 publications
07/20/2021

### Quantum Measurement Classification with Qudits

This paper presents a hybrid classical-quantum program for density estim...
06/07/2020

### Random circuit block-encoded matrix and a proposal of quantum LINPACK benchmark

The LINPACK benchmark reports the performance of a computer for solving ...
08/12/2020

### Wishart and random density matrices: Analytical results for the mean-square Hilbert-Schmidt distance

Hilbert-Schmidt distance is one of the prominent distance measures in qu...
10/24/2021

### Neo: Generalizing Confusion Matrix Visualization to Hierarchical and Multi-Output Labels

The confusion matrix, a ubiquitous visualization for helping people eval...
03/09/2021

### Machine Learning the period finding algorithm

We use differentiable programming and gradient descent to find unitary m...
01/03/2018

### Randomized Linear Algebra Approaches to Estimate the Von Neumann Entropy of Density Matrices

The von Neumann entropy, named after John von Neumann, is the extension ...
04/14/2021

### Identification of unknown parameters and prediction with hierarchical matrices

Statistical analysis of massive datasets very often implies expensive li...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

The formalism of density operators and density matrices was developed by von Neumann as a foundation of quantum statistical mechanics (von1927wahrscheinlichkeitstheoretischer). From the point of view of machine learning, density matrices have an interesting feature: the fact that they combine linear algebra and probability, two of the pillars of machine learning, in a very particular but powerful way.

The main question addressed by this work is how density matrices can be used in machine learning models. One of the main approaches to machine learning is to address the problem of learning as one of estimating a probability distribution from data, joint probabilities in generative supervised models or conditional probabilities in discriminative models.

The central idea of this work is to use density matrices to represent these probability distributions tackling the important question of how to encode arbitrary probability density functions in

into density matrices.

The quantum probabilistic formalism of von Neumann is based on linear algebra, in contrast with classical probability which is based in set theory. In the quantum formalism the sample space corresponds to a Hilbert space and the event space to a set of linear operators in , the density operators.

The quantum formalism generalizes classical probability. A density matrix in an -dimensional Hilbert space can be seen as a catalog of categorical distributions on the finite set . A direct application of this fact is not very useful as we want to efficiently model continuous probability distributions in . One of the main results of this paper is to show that it is possible to model arbitrary probability distributions in using density matrices of finite dimension in conjunction with random Fourier features (rahimi2007rff). In particular the paper presents a method for non-parametric density estimation that combines density matrices and random Fourier features to efficiently learn a probability density function from data and to efficiently predict the density of new samples.

The fact that the probability density function is represented in matrix form and that the density of a sample is calculated by linear algebra operations makes it easy to implement the model in GPU-accelerated machine learning frameworks. This also facilitates using density matrices as a building block for classification and regression models, which can be trained using gradient-based optimization and can be easily integrated with conventional deep neural networks. The paper presents examples of these models and shows how they can be trained using gradient-based optimization as well as optimization-less learning based on estimation.

The paper is organized as follows: Section 2

covers the background on random features, kernel density estimation and density matrices;

Section 3 presents four different methods for density estimation, classification and regression; Section 4 discusses some relevant works; Section 5 presents the experimental evaluation; finally, Section 6 discusses the conclusions of the work.

## 2 Background

### 2.1 Random features

Random Fourier features (RFF) (rahimi2007rff) is a method that builds an embedding given a shift-invariant kernel such that , where is the transpose operator111Later on this operator will be used as the Hermitian conjugate that, for real matrices, is equivalent to the transpose operator.. One of the main applications of RFF is to speedup kernel methods, being data independence one of its main advantages.

The RFF method is based on the Bochner’s theorem. In layman’s terms, Bochner’s theorem shows that a shift invariant positive-definite kernel

is the Fourier transform of a probability measure

. rahimi2007rff use this result to approximate the kernel function by designing a sample procedure that estimates the integral of the Fourier transform. The first step is to draw iid samples from an iid samples

from an uniform distribution in

. Then, define:

 ϕrff:Rd→RDx↦√2D(cos(w∗1x+b1),…,cos(w∗Dx+bD)). (1)

rahimi2007rff showed that the expected value of uniformly converges to :

###### Theorem 2.1.

(rahimi2007rff) Let be a compact subset of with a diameter . Then for the mapping defined above, we have

 Pr[supx,y∈M|ϕ∗rff(x)ϕrff(y)−k(x,y)|≥ϵ]≤28(σpdiam(M)ϵ)2exp(−Dϵ24(d+2)), (2)

where, is the second momentum of the Fourier transform of . In particular, for the Gaussian kernel , where is the spread parameter (see Eq. LABEL:q:Parzen-Gaussian).

Different approaches to compute random features for kernel approximation have been proposed based on different strategies: Monte Carlo sampling (quocle2013fastfood; lee2016orthogonal), quasi-Monte-Carlo sampling (avron2016quasi; shen2017random), and quadrature rules (dao2017gaussian).

### 2.2 Kernel density estimation

Kernel Density Estimation (KDE) (rosenblatt1956; parzen1962estimation), also known as Parzen-Rossenblat window method, is a non-parametric density estimation method. This method does not make any particular assumption about the underlying probability density function. Given an iid set of samples , the smooth Parzen’s window estimate has the form

 ^fX(x)=1NλN∑i=1kλ(x,xi), (3)

where is a kernel function and is the smoothing bandwidth parameter of the estimate. A small -parameter implies a small grade of smoothing. In contrast, a high -parameter implies a high grade of smoothing, possibly leaving behind some important structures.

rosenblatt1956 and parzen1962estimation showed that eq. 3

is an unbiased estimator of the pdf

. When is the Gaussian kernel, eq. 3 takes the form

 ^gγ,X(x)=1N(π/γ)d2N∑i=1e−γ∥xi−x∥2. (4)

KDE has several applications: to estimate the underlying probability density function, to estimate confidence intervals and confidence bands

(efron1992bootstrap; chernozhukov2014gaussian), to find local modes for geometric feature estimation (chazal2017robust; chen2016comprehensive), to estimate ridge of the density function (genovese2014nonparametric), to build cluster trees (balakrishnan2013cluster)

, to estimate the cumulative distribution function

(nadaraya1964some), to estimate receiver operating characteristic (ROC) curves (mcneil1984statistical), among others.

### 2.3 Density matrices

The state of a quantum system is represented by a vector

, where is the Hilbert space of the possible states of the system. Usually222In the latter sections of this paper we will use , but most of the results apply to the complex case. .

As an example, consider a system that could be in two possible states, e.g. the spin of an electron. The state of this system is, in general, represented by a column vector , with . This state represents a system that is in a superposition of two basis states . The outcome of a measurement of this system, along the axis, is determined by the Born rule: the spin is up with probability and down with probability . Notice that and could be negative or complex numbers, but the Born rule guarantees that we get valid probabilities.

The probabilities that arise from the superposition of states in the previous example is a manifestation of the uncertainty that is inherent to the nature of quantum physical systems. We call this kind of uncertainty quantum uncertainty. Other kind of uncertainty comes, for instance, from errors in the measurement or state-preparation processes, we call this uncertainty classical uncertainty. A density matrix is a formalism that allows us to represent both types of uncertainty. To illustrate it, let’s go back to our previous example. The density matrix representing the state is:

 ρ=ψψ∗=[|α|2αβ∗β∗α|β|2], (5)

where is the conjugate transpose of and is the conjugate of . As a concrete example, consider the corresponding density matrix is:

 ρ1=ψ1ψ∗1=[12−12−1212], (6)

which represents a superposition state where we have a probability of measuring any of the two states. Notice that the probabilities for each state are in the diagonal of the density matrix. is a rank-1 density matrix, and this means that it represents a pure state, i.e. a state without classical uncertainty. The opposite to a pure state is a mixed state, which is represented by a density matrix with the form:

 ρ=N∑i=1piψiψ∗i, (7)

where , , and are the states of a an ensemble of quantum systems, where each system has an associated probability . As a concrete example of a mixed state consider two pure states and , and consider a system that is prepared in state with probability and in state with probability as well. The state of this system is represented by the following density matrix:

 ρ2=12ψ2ψ∗2+12ψ′2ψ′∗2=[120012], (8)

At first sight, states and may be seen as representing the same quantum system, one where the probability of measuring an up state (or down state) is . However they are different systems, represents a system with only quantum uncertainty, while corresponds a system with classical uncertainty. To better observe the differences of the two systems we have to perform a measurement along a particular axis. To do so, we use the following version of the Born rule for density matrices:

 P(φ|ρ)=Tr(ρφφ∗)=φ∗ρφ (9)

which calculates the probability of measuring the state in a system in state . If we set we get and , showing that in fact both systems are different.

## 3 Methods

### 3.1 Density matrix kernel density estimation (DMKDE)

In this section we present a non-parametric method for density estimation. We start with a set of iid samples drawn from a particular distribution in . Each sample will be embedded into using RFF as a feature map (eq. 1). Each embedded vector will be assimilated as the state of a quantum system. A density matrix representing a mixture of all the training samples is calculated by averaging the density matrix representing each sample. This matrix represents the probability distribution of the samples and is used for calculating the density of new samples. The overall process is defined next:

• Input. A sample set with , parameters ,

• Calculate and using the random Fourier features method described in Section 2.1 for approximating a Gaussian kernel with parameters and .

• Apply (eq. 1):

 zi=ϕrff(xi). (10)
• Density matrix estimation:

 ρ=1NN∑i=1ziz∗i, (11)

The density of a new sample is calculated as:

 ^fρ(x)=Tr(ρϕrff(x)ϕrff(x)∗)Z=ϕrff(x)∗ρϕrff(x)Z, (12)

where the normalizing constant is defined as:

 Z=(π2γ)d2. (13)

The following proposition shows333The proof can be found in the supplementary material. that , as defined in eq. 12, uniformly converges to the Gaussian kernel Parzen’s estimator (eq. 4).

###### Proposition 3.1.

Let be a compact subset of with a diameter , let a set of iid samples, then (eq. 12) and (eq. 4) satisfy:

 Pr[supx∈M|^fρ(x)−^g2γ,X(x)|≥3Zϵ]≤28(√2dγdiam(M)ϵ)2exp(−Dϵ24(d+2)) (14)

The Parzen’s estimator is an unbiased estimator of the true density function from which the samples were generated and Proposition 3.1 shows that can approximate this estimator. The density matrix estimator has an important advantage over the Parzen’s estimator, its computational complexity. The time to calculate the Parzen’s estimator (eq. 4) is while the time to estimate the density based on the density matrix (Eq. 12) is , which is constant on the size of the sample dataset.

Notice that the vectors in eq. 10 are in general non-normal. We normalize these vectors which ensures that and also ensures that in eq. 11 is a well defined density matrix. Empirically, we observed that this change did not affect the performance of the method and, in some cases, improved the results.

Also, the density calculation in eq. 12 can be simplified by factorizing as

 ρ=V∗ΛV, (15)

where , is a diagonal matrix and is the reduced rank of the factorization. This is done by performing a spectral decomposition thanks to the fact that is a Hermitian matrix. The calculation in eq. 12 can be expressed as:

 ^fρ(x)=1Z∥Λ12Vϕrff% (x)∥2. (16)

This reduces the time to calculate the density of a new sample to .

Something interesting to notice is that the process described by eqs. 11 and 10 generalizes density estimation for variables with a categorical distribution, i.e. . To see this, we replace in eq. 10

by the well-known one-hot-encoding feature map:

 ϕohe:D→RKi↦Ei, (17)

where is the unit vector with a 1 in position and 0 in the other positions. It is not difficult to see that in this case

 ρii=Pr(x=i)=|{xj|j∈{1,…,N},xj=i}|N. (18)

### 3.2 Density matrix kernel density classification (DMKDC)

The extension of kernel density estimation to classification is called kernel density classification (hastie2009elements)

. The posterior probability is calculated as

 ^Pr(Y=j|X=x)=πj^fj(x)∑Kk=1πk^fk(x), (19)

where and are respectively the class prior and the density estimator of class .

We follow this approach to define a classification model that uses the density estimation strategy based on RFF and density matrices described in the previous section. The input to the model is a vector . The model is defined by the following equations: equationparentequation

 z =ϕrff(x)=cos(Wrffx+brff), (20a) z′ =z∥z∥, (20b) ~yi =∥Viz′∥2  ∀i=1…K, (20c) ~y′i =πiyi∑Kj=iyj (20d)

The hyperparameters of the model are the dimension of the RFF representation

, the spread parameter of the Gaussian kernel, the class priors and the rank of the density matrix factorization. The parameters are the weights and biases of the RFF, and , and for .

The model can be trained using two different strategies: one, using DMKDE to estimate the density matrices of each class; two, use stochastic gradient descent over the parameters to minimize an appropriate loss function.

The training process based on density matrix estimation is as follows:

1. Use the RFF method to calculate and .

2. For each class :

1. Estimate as the relative frequency of the class in the dataset.

2. Estimate using eq. 11 and the training samples from class .

3. Find a factorization of rank of :

 ρi=V∗iΛVi.

Notice that this training procedure does not require any kind of iterative optimization. The training samples are only used once and the time complexity of the algorithm is linear on the number of training samples. The complexity of step 2(b) is and of 2(c) is .

Since the operations defined in eqs. 20d, 20c, 20b and 20a are differentiable, it is possible to use gradient-descent to minimize an appropriate loss function. For instance, we can use categorical cross entropy:

 L=K∑i=1yilog(~y′i) (21)

where corresponds to the one-hot-encoding of the real label of the sample . The version of the model trained with gradient descent is called DMKDC-SGD.

An advantage of this approach is that the model can be jointly trained with other differentiable architecture such as a deep learning feature extractor.

### 3.3 Quantum measurement classification (QMC)

In DMKDC we assume a categorical distribution for the output variable. If we want a more general probability distribution we need to define a more general classification model. The idea is to model the joint probability of inputs and outputs using a density matrix. This density matrix represents the state of a bipartite system whose representation space is where is the representation space of the inputs, is the representation space of the outputs and

is the tensor product. A prediction is made by performing a measurement with an operator specifically prepared from a new input sample.

This model is based on the one described by Gonzalez2020 and works as follows:

• Input encoding. The input is encoded using a feature map

 z=ϕX(x). (22)
• Measurement operator. The effect of this measurement operator is to collapse, using a projector to , the part of the bipartite system while keeping the part unmodified. This is done by defining the following operator:

 π=zz∗⊗IdHY, (23)

where is the identity in .

• Apply the measurement operator to the training density matrix:

 ρ=πρtrainπ\Tr[πρtrainπ], (24)

where can be built, for instance, as in eq. 11, or can be a parameter density matrix, as will be shown next.

• Calculate the partial trace of to obtain a density matrix that encodes the prediction:

 ρY=\TrX[ρ]. (25)

The parameter of the model, without taking into account the parameters of the feature maps, is the density matrix, where and are the dimensions of and respectively. As discussed in Section 3.1, the density matrix can be factorized as:

 ρtrain=V∗trainΛVtrain (26)

where , is a diagonal matrix and is the reduced rank of the factorization. This factorization not only helps to reduce the space necessary to store the parameters, but learning and , instead of , helps to guarantee that is a valid density matrix.

As in Subsection 3.2, we described two different approaches to train the system: one based on estimation of the and one based on learning using gradient descent. QMC can be also trained using these two strategies.

In the estimation strategy, given a training data set the training density matrix is calculated by:

 ρtrain=1NN∑i=1(ϕX(xi)⊗ϕY(yi))(ϕX(xi)⊗ϕY(yi))∗. (27)

The computational cost is ).

For the gradient-descent-based strategy (QMC-SGD) we can minimize the following loss function:

 L=DY∑i=1yilog(ρYii), (28)

where is the -th diagonal element of .

As in DMKDC-SGD, this model can be combined with a deep learning architecture and the parameters can be jointly learned using gradient descent.

QMC can be used with different feature maps for inputs and outputs. For instance, if (eq. 1) and (eq. 17), QMC corresponds to DMKDC. However, in this case DMKDC is preferred because of its reduced computational cost.

### 3.4 Quantum measurement regression (QMR)

In this section we show how to use QMC to perform regression. For this we will use a feature map that allows us to encode continuous values. The feature map is defined with the help of equally distributed landmarks in the interval444Without loss of generality the continuous variable to be encoded is restricted to the interval.:

 αi=i−1D−1 for i=1…D. (29)

The following function (which is equivalent to a softmax) defines a set of unimodal probability density functions centered at each landmark:

 pi(x)=(exp(−β∥x−αi∥2)∑mj=1exp(−β∥x−αj∥2))i=1…D, (30)

where controls the shape of the density functions.

The feature map is defined as:

 ϕsm:[0,1]→RDx↦(√p1(x),…,√pD(x)). (31)

This feature map is used in QMC as the feature map of the output variable (). To calculate the prediction for a new sample we apply the process described in Subsection 3.3 to obtain . Then the prediction is given by:

 ^y=EρY[αi]=D∑i=1ρYiiαi. (32)

Note that this framework also allows to easily compute confidence intervals for the prediction. The model can be trained using the strategies discussed in Subsection 3.3. For gradient-based optimization we use a mean squared error loss function:

 L=D∑i=1(y−^y)2+αD∑i=1ρYii(^y−αi)2 (33)

where the second term correspond to the variance of the prediction and

controls the trade-off between error and variance.

## 4 Related Work

The ability of density matrices to represent probability distributions has been used in previous works. The early work by lior

uses the density matrix formalism to perform spectral clustering, and shows that this formalism not only is able to predict cluster labels for the objects being classified, but also provides the probability that the object belongs to each of the clusters. Similarly,

tiwariQIBC2019 proposed a quantum-inspired binary classifier using density matrices, where samples are encoded into pure quantum states. In a similar fashion, Sergioli2018

proposed a quantum nearest mean classifier based on the trace distance between the quantum state of a sample, and a quantum centroid that is a mixed state of the pure quantum states of all samples belonging to a single class. Another class of proposals directly combine these quantum ideas with customary machine learning techniques, such as frameworks for multi-modal learning for sentiment analysis

(LI202158; melucci2020ieee; ZHANG201821).

Since its inception, random features have been used to improve the performance of several kernel methods: kernel ridge regression

(avron2017random)

, support vector machines (SVM)

(sun2018but), and nonlinear component analysis (xie2015scale). Besides, random features have been used in conjunction with deep learning architectures in different works (arora2019exact; ji2019polylogarithmic; li2019implicit).

The combination of RFF and density matrices was initially proposed by Gonzalez2020. In that work, RFF are used as a quantum feature map, among others, and the QMC method (Subsection 3.3

) was presented. The emphasis of that work is to show that quantum measurement can be used to do supervised learning. In contrast, the present paper addresses a wider problem with several new contributions: a new method for density estimation based on density matrices and RFF, the proof of the connection between this method and kernel density estimation, and new differentiable models for density estimation, classification and regression.

The present work can be seen as a type of quantum machine learning (QML), which is generally referred as the field in the intersection of quantum computing and machine learning (schuld2015introduction; schuld2018supervised). In particular, the methods in this paper are in the subcategory of QML called quantum inspired classical machine learning, where theory and methods from quantum physics are borrowed and adapted to machine learning methods intended to run in classical computers. Works in this category include: quantum-inspired recommendation systems (tang2019quantum), quantum-inspired kernel-based classification methods (tiwari2020kernel; Gonzalez2020)

, conversational sentiment analysis based on density matrix-like convolutional neural networks

(zhang2019conversational)

, dequantised principal component analysis

(tang2019quantuminspired), among others.

Being a memory-based strategy, KDE suffers from large-scale, high dimensional data. Due to this issue, fast approximate evaluation of non-parametric density estimation is an active research topic. Different approaches are proposed in the literature: higher-order divide-and-conquer method

(gray2003nonparametric), separation of near and far-field (pruning) (march2015askit), and hashing based estimators (HBE) (charikar2017hashing). Even though the purpose of the present work was not to design methods for fast approximation of KDE, the use of RFF to speed KDE seems to be a promising research direction. Comparing DMKDE against fast KDE approximation methods is part of our future work.

## 5 Experimental Evaluation

In this section we perform some experiments to evaluate the performance of the proposed methods in different benchmark tasks. The experiments are organized in two subsections: classification evaluation and ordinal regression evaluation.

### 5.1 Classification evaluation

In this set of experiments, we evaluated DMKDC over different well known benchmark classification datasets.

#### 5.1.1 Data sets and experimental setup

Six benchmark data sets were used. The details of the data sets are reported in the Supplementary Material. In the case of Gisette and Cifar, we applied a principal component analysis algorithm using 400 principal components in order to reduce the dimension. DMKDC was trained using the estimation strategy (DMKDC) and an ADAM stochastic gradient descent strategy (DMKDC-SGD). As baseline we compared against a linear support vector machine (SVM) trained using the same RFF as DMKDC. In the case of MNIST and Cifar, we additionally built an union of a LeNet architecture

(lecun1989backpropagation)

as a feature extraction layer and DMKDC as the classifier layer. The LeNet layer is composed of two convolutional layers, one flatten layer and one dense layer. The former convolutional layer has 20 filters, kernel size of 5, same as padding, and ReLu as the activation function. The latter convolutional layer has 50 filters, kernel size of 5, same as padding, and ReLu as the activation function. The dense layer has 84 units and ReLU as the activation function. The dense layer is finally connected to DMKDC. We report results for the combined model (LeNet DMKDC-SGD) and the LeNet model with a softmax output layer (LeNet).

For each data set, we made a hyper parameter search using a five-fold cross-validation with 25 randomly generated configurations. The number of RFF was set to 1000 for all the methods. For each dataset we calculated the inter-sample median distance and defined an interval around . The parameter of the SVM was explored in an exponential scale from to . For the ADAM optimizer in DMKDC-SGD with and without LeNet we choose the learning rate in the interval . The RFF layer was always set to trainable. The number of eigen-components of the factorization was chosen from

where each number represents a percentage of the RFF. After finding the best hyper-parameter configuration using cross validation, 10 different experiments were performed with different random initialization. The mean and the standard deviation of the accuracy is reported.

#### 5.1.2 Results and discussion

Table 1 shows the results of the classification experiments. DMKDC is a shallow method that uses RFF, so a SVM using the same RFF is fair and strong baseline. In all the cases, except one, DMKDC-SGD outperforms the SVM, which shows that it is a very competitive method. DMKDC trained using estimation shows less competitive results, but they are still remarkable taking into account that this is an optimization-less training strategy that only passes once over the training dataset. For MNIST and Cifar the use of a deep learning feature extractor is mandatory to obtain competitive results. The results show that DMKDC can be integrated with other deeper neural network architectures to reach these results.

### 5.2 Ordinal regression evaluation

Many multi-class classification problems can be seen as ordinal regression problems. That is, problems where labels not only indicate class membership, but also an order. Ordinal regression problems are halfway between a classification problem and a regression problem, and given the discrete probability distribution representation used in QMR, ordinal regression seems to be a suitable problem to test it.

#### 5.2.1 Data sets and experimental setup

Nine standard benchmark data sets for ordinal regression were used. The details of each data set are reported in the Supplementary Material. These data sets are originally used in metric regression tasks. To convert the task into an ordinal regression one, the target values were discretized by taking five intervals of equal length over the target range. For each set, 20 different train and test partitions are made. These partitions are the same used by Chu2005 and several posterior works, and are publicly available at http://www.gatsby.ucl.ac.uk/~chuwei/ordinalregression.html. The models were evaluated using the mean absolute error (MAE), which is a popular and widely used measure in ordinal regression (Gutierrez2016; Garg2020).

QMR was trained using the estimation strategy (QMR) and an ADAM stochastic gradient descent strategy (QMR-SGD). For each data set, and for each one of the 20 partitions, we made a hyper parameter search using a five-fold cross-validation procedure. The search was done generating 25 different random configuration. The range for was computed in the same way as for the classification experiments, , the number of RFF randomly chosen between the number of attributes and , and the number of eigen-components of the factorization was chosen from where each number represents a percentage of the RFF. For the ADAM optimizer in QMR-SGD we choose the learning rate in the interval and . The RFF layer was always set to trainable, and the criteria for selecting the best parameter configuration was the MAE performance.

#### 5.2.2 Results and discussion

For each data set, the means and standard deviations of the test MAE for the 20 partitions are reported in Table 2, together with the results of previous state-of-the-art works on ordinal regression: Gaussian Processes (GP) and support vector machines (SVM) (Chu2005), Neural Network Rank (NNRank) (Cheng2008), Ordinal Extreme Learning Machines (ORELM) (Deng2010) and Ordinal Regression Neural Network (ORNN) (Fernandez-Navarro2014).

QMR-SGD shows a very competitive performance. It outperforms the baseline methods in six out of the nine data sets. The training strategy based on estimation, QMR, did not performed as well. This evidences that for this problem a fine tuning of the representation is required and it is successfully accomplished by the gradient descent optimization.

## 6 Conclusions

The mathematical framework underlying quantum mechanics is a powerful formalism that harmoniously combine linear algebra and probability in the form of density matrices. This paper has shown how to use these density matrices as a building block for designing different machine learning models. The main contribution of this work is to show a novel perspective to learning that combines two very different and seemingly unrelated tools, random features and density matrices. The, somehow surprising, connection of this combination with kernel density estimation provides a new way of representing and learning probability density functions from data. The experimental results showed some evidence that this building block can be used to build competitive models for some particular tasks. However, the full potential of this new perspective is still to be explored. Examples of directions of future inquire include using complex valued density matrices, exploring the role of entanglement and exploiting the battery of practical and theoretical tools provided by quantum information theory.

## Acknowledgements

The authors gratefully acknowledge the help of Diego Useche with code testing and optimization.

## A. Proofs

###### Proposition 3.1.

Let be a compact subset of with a diameter , let a set of iid samples, then (Eq. 12) and (Eq. 4) satisfy:

 Pr[supx∈M|^fρ(x)−^g2γ,X(x)|≥3Zϵ]≤28(√2dγdiam(M)ϵ)2exp(−Dϵ24(d+2)) (34)
###### Proof.
 ^fρ(x) =1ZTr(ρϕrff(x)ϕrff(x)∗) =1ZTr(ϕrff(x)∗ρϕrff(x)) =1ZTr(ϕrff(x)∗(1NN∑i=1ziz∗i)ϕrff(x)) =1ZNN∑i=1Tr(ϕ% rff(x)∗ϕrff(xi)ϕrff(xi)∗ϕ% rff(x)) =1ZNN∑i=1(ϕrff(x)∗ϕrff(xi))2 (35)

Because of Theorem 2.1 we know that

 Pr[supx,y∈M|ϕ∗rff(x)ϕrff(y)−e−γ∥x−y∥2|≥ϵ]≤28(√2dγdiam(M)ϵ)2exp(−Dϵ24(d+2))=B

By construction , then . Then

 Pr[supx,y∈M|(ϕ∗rff(x)ϕrff(y))2−e−2γ∥x−y∥2|≥3ϵ]≤B (36)

Combining Equations A. Proofs and 36 we get:

 Pr[supx∈M|^fρ(x)−^g2γ,X(x)|≥3Zϵ]≤B

## B. Additional experiments

### B.1 Density estimation evaluation

In this subsection, we used density estimation algorithms proposed in Section 3.1. We evaluated our proposed method and compared to kernel density approximation.

#### Data sets and experimental setup

The goal of the exploratory experiments presented in this subsection is to assess the ability of DMKDE to approximate the density estimated with conventional KDE. We use a synthetic dataset and the well known MNIST dataset.

The synthetic dataset corresponds to a mixture of univariate Gaussians. The mixture weights are 0.3 and 0.7 respectively and the parameters are and . Both KDE and DMKDE are trained with different number of training samples and are evaluated over a test dataset of 1,000 samples equally spaced in the interval . Since the true pdf is known the results are assessed calculating the RMSE against that pdf. The reduced rank of the matrix in DMKDE is set to 30 (Eq. 15).

For the MNIST dataset, we use the dataset default partition in train and test. The density predicted by DMKDE is compared against the density predicted by KDE calculating the RMSE.

#### Results and discussion

Figure 1 depicted the relationship between the increase of the dimension of random Fourier features and its fit concerning the real density. DMKDE converge in terms of RMSE to the values obtained by kernel density estimation when the dimension of random Fourier features increases.

Figure 2 shows the time expended by kernel density estimation and DMKDE. The number of training samples were increased keeping the test dataset of 1,000 samples equally spaced as explained in Subsection LABEL:subsection:density-estimation-datasets. We observed that DMKDE is nearly constant when the number of samples increases. Furthermore, the time of kernel density estimation increases linearly with the number of samples.