1 Introduction
Kernel methods represent a successful class of Machine Learning models, achieving stateoftheart performance on a variety of tasks with theoretical guarantees
[1, 2, 3]. Applying kernel methods to largescale problems, however, poses computational challenges, and this has motivated a variety of contributions to develop them at scale; see, e.g., [2, 4, 5, 6, 7].Consider a supervised learning task, and let
be a set of inputs with associated with a set of labels . In kernel methods, it is possible to establish a mapping between inputs and labels by first mapping the inputs to a highdimensional (possibly infinite dimensional) Hilbert space using a nonlinear feature map , and then to apply the model to the transformed data. What characterizes these methods is that the mapping does not need to be specified and can be implicitly defined by choosing a kernel function . While kernel methods offer a flexible class of models, they do not scale well with the number of data points in the training set, as one needs to store and perform algebraic operations with the kernel matrix , whose entries are , and which require storage and operations.In a series of celebrated papers [8, 9], Rahimi and Recht have proposed approximation techniques of the kernel function using random features (RFs), which are based on random projections of the original features followed by the application of a nonlinear transformation. In practice, the kernel function is approximated by means of the scalar product between finitedimensional random maps :
(1) 
The RFbased approximation turns a kernelbased model into a linear model with a new set of nonlinear features ; as a result, the computational complexity is reduced from to to construct the random features and to optimize the linear model, where is the RF dimension and the number of data points. Furthermore, there is no need to allocate the kernel matrix, reducing the storage from to . Unless approximation strategies to compute random features are used, e.g., [10], computing RFs is one of the main computational bottlenecks.
A completely different approach was pioneered in Saade et al. [11], where the random projections are instead made via an analog optical device – the Optical Processing Unit (OPU) – that performs these random projections literally at the speed of light and without having to store the random matrix in memory. Their results demonstrate that the OPU makes a significant contribution towards making kernel methods more practical for largescale applications with the potential to drastically decrease computation time and memory, as well as power consumption. The OPU has also been applied to other frameworks like reservoir computing [12, 13]
[14].Building on the milestone work of [11], the goal of the present contribution is threefold: a) we derive in full generality the kernel to which the dot product computed by the OPU RFs converges, generalizing the earlier computation of [11] to a larger class of kernels; b) we present new examples and a benchmark of applications for the kernel of the OPU; and c) we give a detailed comparison of the running time and energy consumption between the OPU and a last generation GPU.
2 The Optical Processing Unit
The principle of the random projections performed by the Optical Processing Unit (OPU) is based on the use of a heterogeneous material that scatters the light that goes through it, see Fig. 1 for the experimental setup. The data vector is encoded into light using a digital micromirror device (DMD). This encoded light then passes through the heterogeneous medium, performing the random matrix multiplication. As discussed in [15], light going through the scattering medium follows many extremely complex paths, that depend on refractive index inhomogeneities at random positions. For a fixed scattering medium, the resulting process is still linear, deterministic, and reproducible. Reproducibility is important as all our data vectors need to be multiplied by the same realisation of the random matrix.
After going through the ”random” medium, we observe a speckle figure on the camera, where the light intensity at each point is modelled by a sum of the components of
weighted by random coefficients. Measuring the intensity of light induces a nonlinear transformation of this sum, leading to:
Proposition 1.
Given a data vector , the random feature map performed by the Optical Processing Unit is:
(2) 
where is a complex Gaussian random matrix whose elements
, the variance being set to one without loss of generality, and depends on a multiplicative factor combining laser power and attenuation of the optical system. We will name these RFs optical random features.
3 Computing the Kernel
When we map two data points into a feature space of dimension using the optical RFs of Eq. 2, we have to compute the following to obtain the associated kernel :
(3)  
(4) 
with and .
Theorem 1.
The kernel approximated by the dot product of optical random features of Eq. 2 is given by:
(5) 
where the norm is the norm.
Proof.
By rotational invariance of the complex Gaussian distribution, we can fix
and , with being the angle between and , andbeing two orthonormal vectors. Letting
, and be the complex conjugate of , we obtain:By a parity argument, the third term in the integral vanishes, and the remaining ones can be explicitly computed, yielding:
∎
Numerically, one can change the exponent of the feature map to , which, using notations of Eq. 2, becomes:
(6) 
Theorem 2.
When the exponent is even, i.e. , , the dot product of feature maps of Eq. 6 tends to the kernel (for ):
(7) 
Moreover, a generalization can be established.
Eq. 7 is connected to the polynomial kernel [1] defined as:
(8) 
with and the order of the kernel. For thekernel is called homogeneous. For the polynomial kernel consists of a sum of lower order homogeneous polynomial kernels up to order . It can be seen as having richer featurevectors including all lowerorder kernel features. For optical RFs raised to the power of we have a sum of homogeneous polynomial kernels taken to even orders up to .
Since , the kernel scales with , which is characteristic to any homogeneous polynomial kernel. It is easy to extend this relation to the inhomogeneous polynomial kernel by appending a bias to the input vectors, i.e. when and . A practical drawback of this approach is that increasing the power of the optical RFs also increases their variance. Thus, convergence requires higher projection dimensions. Although high dimensional projections can be computed easily using the OPU, solving models on top of them poses other challenges that require special treatment [2] (e.g. Ridge Regression scales cubically with ). Therefore, we did not include these cases in the experiments in the next section and leave them for future research.
4 Experiments
In this section, we assess the usefulness of optical RFs for different settings and datasets. The model of our choice in each case is Ridge Regression. OPU experiments were performed remotely on the OPU prototype ”Vulcain”, running in the LightOn Cloud with library LightOnOPU
v1.0.2. Since the current version only supports binary input data we decide to binarize inputs for all experiments using a threshold binarizer. The code of the experiments is publicly available
^{1}^{1}1 https://github.com/joneswack/opukernelexperiments.4.1 Optical random features for Fashion MNIST
We compare optical RFs (simulated as well as physical) to an RBF Fourier Features baseline for different projection dimensions
on Fashion MNIST. We use individually optimized hyperparameters for all RFs that are found for
using an extensive grid search on a heldout validation set. The same hyperparameters are also used for the precise kernel limit. Fig. 2 shows that the overall classification error decreases as increases. The RBF and synthetic optical RFs perform similarly well although the RBF features have a slight gain (roughly 0.24% for ). This gain is more noticeable when looking at the kernel limit. Using a higher degree OPU kernel (), results can be further improved but they are not shown due to slower convergence.The real OPU loses around 0.6% accuracy for , which is due to experimental noise (vibrations, electronic noise) and deviations from the theoretical model (finite pixel size, camera quantization).
4.2 Transfer learning on CIFAR10
Architecture  ResNet34  AlexNet  VGG16  

Layer  L1  L2  L3  Final  MP1  MP2  Final  MP2  MP3  MP4  MP5  Final 
Dimension  4 096  2 048  1 024  512  576  192  9 216  8 192  4 096  2 048  512  25 088 
Sim. Opt. RFs  30.4  24.7  28.9  11.6  38.1  41.9  19.6  28.2  20.5  20.7  29.8  15.2 (12.9) 
Optical RFs  31.1  25.7  29.7  12.3  39.2  42.6  20.8  30.9  23.3  21.5  30.2  16.4 
RBF Four. RFs  30.1  25.2  30.0  12.3  39.4  41.9  19.1  28.0  20.7  20.7  30.1  14.8 (13.0) 
No RFs  31.3  26.7  33.5  14.7  44.6  48.8  19.6  27.1  21.0  22.5  34.8  13.3 
An interesting use case for the OPU is transfer learning for image classification. For this purpose we extract a diverse set of features from the CIFAR10 image classification dataset using three different convolutional neural networks (ResNet34
[16], AlexNet [17] and VGG16 [18]). The networks were pretrained on the wellknown ImageNet classification benchmark
[19]. For transfer learning, we can either finetune these networks and therefore the convolutional features to the data at hand, or we can directly apply a classifier on them assuming that they generalize well enough to the data. The latter case requires much less computational resources while still producing considerable performance gains over the use of the original features. This lightweight approach can be carried out on a CPU in a short amount of time where the classification error can be improved with RFs.
We compare Optical RFs and Fourier RFs for the RBF kernel to a simple baseline that directly works with the provided convolutional features (no RFs). Table 1
shows the test errors achieved on CIFAR10. Each column corresponds to convolutional features extracted from a specific layer of one of the three networks.
Since the projection dimension was left constant throughout the experiments, it can be observed that RFs perform particularly well compared to a linear kernel when where is the input dimension. For the opposite case the lower dimensional projection leads to an increasing test error. This effect can be observed in particular in the last column where the test error of the RF approximation is higher than without RFs. The contrary can be achieved with large enough as indicated by the values for the true kernel in parenthesis. In general, the simulated as well as the physical optical RFs yield similar performances as the RBF Fourier RFs on the provided convolutional data.
4.3 Projection time and energy consumption
The main advantage of the OPU compared to a traditional CPU/GPU setup is that the OPU takes a constant time for computing RFs of arbitrary dimension (up to on current hardware) for a single input. Moreover, its power consumption stays below 30 W independently of the workload. Fig. 3 shows the computation time and the energy consumption over time for GPU and OPU for different projection dimensions
. In both cases, the time and energy spending do not include matrix building and loading. For the GPU, only the calls to the PyTorch function
torch.matmul are measured and energy consumption is the integration over time of power values given by the command nvidiasmi.For the OPU, the energy consumption is constant w.r.t. and equal to 45 Joules (30 W multiplied by 1.5 seconds). The GPU computation time and energy consumption are monotonically increasing except for an irregular energy development between and . This exact irregularity was observed throughout all simulations we performed and can most likely be attributed to an optimization routine that the GPU carries out internally. The OPU is up to 7 times more energy efficient than the GPU for large random projections. The GPU starts to use more energy than the OPU from and runs out of memory for . The exact crossover points may change in future hardware versions. The relevant point we make here is that the OPU has a better scalability in with respect to computation time and energy consumption.
Conclusion and perspectives
The increasing size of available data and the benefit of working in highdimensional spaces led to an emerging need for dedicated hardware. GPUs have been used with great success to accelerate algebraic computations for kernel methods and deep learning. Yet, they rely on finite memory, consume large amounts of energy and are very expensive.
In contrast, the OPU is a scalable memoryless hardware with reduced power consumption. In this paper, we showed that optical RFs are useful in their natural form and can be modified to yield more flexible kernels. In the future, algorithms should be developed to deal with largescale RFs, and other classes of kernels and applications should be obtained using optical RFs.
Acknowledgements
RO acknowledges support by grants from Région IledeFrance. MF acknowledges support from the AXA Research Fund and the Agence Nationale de la Recherche (grant ANR18CE460002). FK acknowledges support from ANR17CE230023 and the Chaire CFMENS. The authors thank LightOn for the access to the OPU and their kind support.
References

[1]
Bernhard Schölkopf, Alexander J Smola, Francis Bach, et al.,
Learning with kernels: support vector machines, regularization, optimization, and beyond
, MIT press, 2002.  [2] Alessandro Rudi, Luigi Carratino, and Lorenzo Rosasco, “Falkon: An optimal large scale kernel method,” in Advances in Neural Information Processing Systems, 2017, pp. 3888–3898.
 [3] Andrea Caponnetto and Ernesto De Vito, “Optimal rates for the regularized leastsquares algorithm,” Foundations of Computational Mathematics, vol. 7, no. 3, pp. 331–368, 2007.
 [4] Alex J Smola and Bernhard Schölkopf, “Sparse greedy matrix approximation for machine learning,” 2000.
 [5] Yuchen Zhang, John Duchi, and Martin Wainwright, “Divide and conquer kernel ridge regression,” in Conference on Learning Theory, 2013, pp. 592–617.
 [6] Alessandro Rudi, Raffaello Camoriano, and Lorenzo Rosasco, “Less is more: Nyström computational regularization,” in Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds., pp. 1657–1665. Curran Associates, Inc., 2015.
 [7] Kurt Cutajar, Edwin Bonilla, Pietro Michiardi, and Maurizio Filippone, “Random feature expansions for deep Gaussian processes,” in ICML 2017, 34th International Conference on Machine Learning, 611 August 2017, Sydney, Australia, Sydney, AUSTRALIA, 08 2017.
 [8] Ali Rahimi and Benjamin Recht, “Random features for largescale kernel machines,” in Advances in neural information processing systems, 2008, pp. 1177–1184.
 [9] Ali Rahimi and Benjamin Recht, “Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning,” in Advances in neural information processing systems, 2009, pp. 1313–1320.
 [10] Quoc Le, Tamás Sarlós, and Alex Smola, “Fastfoodapproximating kernel expansions in loglinear time,” in Proceedings of the international conference on machine learning, 2013, vol. 85.
 [11] A. Saade, F. Caltagirone, I. Carron, L. Daudet, A. Drémeau, S. Gigan, and F. Krzakala, “Random projections through multiple optical scattering: Approximating kernels at the speed of light,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2016, pp. 6215–6219.
 [12] Jonathan Dong, Sylvain Gigan, Florent Krzakala, and Gilles Wainrib, “Scaling up echostate networks with multiple light scattering,” in 2018 IEEE Statistical Signal Processing Workshop (SSP). IEEE, 2018, pp. 448–452.
 [13] Jonathan Dong, Mushegh Rafayelyan, Florent Krzakala, and Sylvain Gigan, “Optical reservoir computing using multiple light scattering for chaotic systems prediction,” IEEE Journal of Selected Topics in Quantum Electronics, vol. 26, no. 1, pp. 1–12, 2019.
 [14] Nicolas Keriven, Damien Garreau, and Iacopo Poli, “Newma: a new method for scalable modelfree online changepoint detection,” arXiv preprint arXiv:1805.08061, 2018.
 [15] Antoine Liutkus, David Martina, Sébastien Popoff, Gilles Chardon, Ori Katz, Geoffroy Lerosey, Sylvain Gigan, Laurent Daudet, and Igor Carron, “Imaging with nature: Compressive imaging using a multiply scattering medium,” Scientific reports, vol. 4, pp. 5552, 2014.
 [16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” arXiv preprint arXiv:1512.03385, 2015.
 [17] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds., pp. 1097–1105. Curran Associates, Inc., 2012.
 [18] K. Simonyan and A. Zisserman, “Very deep convolutional networks for largescale image recognition,” CoRR, vol. abs/1409.1556, 2014.

[19]
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma,
Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein,
Alexander C. Berg, and Li FeiFei,
“ImageNet Large Scale Visual Recognition Challenge,”
International Journal of Computer Vision (IJCV)
, vol. 115, no. 3, pp. 211–252, 2015.
Comments
There are no comments yet.