    # Towards Trainable Media: Using Waves for Neural Network-Style Training

In this paper we study the concept of using the interaction between waves and a trainable medium in order to construct a matrix-vector multiplier. In particular we study such a device in the context of the backpropagation algorithm, which is commonly used for training neural networks. Here, the weights of the connections between neurons are trained by multiplying a forward' signal with a backwards propagating error' signal. We show that this concept can be extended to trainable media, where the gradient for the local wave number is given by multiplying signal waves and error waves. We provide a numerical example of such a system with waves traveling freely in a trainable medium, and we discuss a potential way to build such a device in an integrated photonics chip.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## Results

### Gradient descent on the Helmholtz equation

In this section we present the gradient of a certain cost function w.r.t. a certain parameter for a system described by the Helmholtz equation. Suppose we have a field which adheres to the following equation:

 ∇2ϕa(r)+k2(r)ϕa(r) = a(r) for r∈Ω ϕa(r) = 0 for r∈∂Ω, (1)

where is the local wave number, and is a certain source term which can for instance encode an input vector. We assume there exists a certain cost functional that we want to minimise by adapting . It is possible to show (see the supplementary material) that the gradient of this cost functional w.r.t. is proportional to:

 g(r)∼−R(ϕa(r)ϕe(r)). (2)

Here, is a second complex field (which can be interpreted as the ‘error’ signal), which adheres to the following equation:

 ∇2ϕe(r)+k2(r)ϕe(r) = e(r) for r∈Ω ϕe(r) = 0 for r∈∂Ω,

with

 e(r)=∂Q(ϕa)∂ϕa(r). (3)

This means that the information for the local gradient can be obtained by combining information from two complex fields, both of which are solutions of the wave equation, and therefore possible to generate physically.
We can now associate these equations with a trainable neural network block as follows. First of all, we assume that the source term emerges from an array of emitters, and can be related to an input vector as follows:

 a(r)=Na∑i=1aiβi(r), (4)

with the -th element of of size , and a source field associated with the -th emitter. Next we assume that there exists an array of receivers which make measurements the complex field to generate an output vector of dimensionality , where each element is described by:

 oi=∫Ωϕa(r)γi(r)dr,

with a function describing the properties of the -th receiver. Here it is important to note that–due to the fully linear nature of the entire system–the output vector can be written as , i.e., there exists a matrix , determined by the combination of the wave number function , the boundary conditions, and the emitter-receiver functions and . In this sense, the whole system acts as an implicit MVM.
In the common neural network training scheme, there will be a certain cost vector associated with the output , which represents the gradient of an external cost function w.r.t. the individual elements of . We can relate this with the cost functional by explicitly writing the dependencies: (note that by definition does not depend on ). Therefore, the functional derivative can be written as

 e(r)=No∑i=1∂Q(o(ϕa),o∗(ϕ∗a))∂oi∂oi∂ϕa(r)=No∑i=1eiγi(r),

where are the individual elements of . Note that this equation takes on a very similar form as Equation (4). Indeed, when we can use the emitters and receivers in two directions i.e., when they act as transducers), it is possible to simply emit the error signal into the system.

### Numerical example of an MVM Figure 3: Simulation results of the wave-based MVM. (a): Depiction of the relative refractory index of the slab after optimisation. The rectangular area in the middle has a variable wave number. The colour bar at the bottom indicates the relative change from the average value. The red dots indicate the positions of the emitters/receivers. Outside the yellow boundary the medium becomes absorbing. (b): The NRMSE shown as a function of the number of training iterations. (c) Example images showing the real part of ϕa and ϕe, emitted from the left and right, respectively.

In this section we will apply the previously derived concepts to a numerical examples. We will train a slab of material to perform a matrix-vector product of the input vector with a pre-specified matrix . As we wish to use the classic neural network training mechanism, we will do this by drawing examples of input and desired output, and as a cost function we will use the squared error between the actual and desired output. Suppose we draw an input vector

, with complex elements where the real and imaginary parts have been drawn from a standard normal distribution. Using the previously defined ideas, this is then sent into a medium encoded as waves, and converted back into an output vector

. If we define

as the “target” output, the cost function is given by

 Q(o,o∗)=||o−ot||2=No∑i=1(oi−oti)(oi−oti)∗

We find that the derivative is given by

 ei=∂Q(o,o∗)∂oi=(oi−oti)∗.

Note that the gradient found for a single input/output example pair will not suffice to make the system emulate . Rather, we will use an iterative optimisation scheme such that at the -th iteration we update as follows:

 k(r)←k(r)−ηkgk(r),

with the learning rate, and the gradient obtained using input-output pair sample and

. Such an optimisation scheme is the equivalent of stochastic gradient descent used in neural network training, where each training iteration only a small fraction of the data is presented to the network. In our case this is the extreme case of a single data point per iteration.

Figure 3(a) shows a schematic depiction of the simulated setup. Further details of the experiment are explained in the supplementary material. The results are displayed in Figure 3. The medium adapts itself successfully to perform the required linear transformation between input and output fields, as can be seen from Figure 3(b), which shows the evolution of the NRMSE as a function of the number of training iterations. The sudden peak at around the -th iteration likely has to do with poor convergence of the method we used to compute the complex fields. We also found that the lowest NRMSE the system can reach during training depends on the precision tolerance used in this computation. The more precise the complex fields are computed, the lower the NRMSE gets.
Figure 3(a) depicts the structure of the simulated device, where we plotted the varying refractory index as a function of space. We have included the scaling of the colour coding, which shows the relative magnitude of the variations w.r.t. the average (equal to one). As can be seen, these variations remain relatively small, staying within 10 % of the average value. Figure 3(c) shows examples of the complex fields of and .

### Optical implementation

Here we will detail the requirements to implement a trainable MVM using optics. We will assume the trainable medium consists of a planar material of which the refractory index can be modulated spatially, for example by using the Pockels effect in indium phosphide or lithium niobate, by using free carrier dispersion in semiconductors , or by the thermo-optic effect, present in many materials including silicon.
In many cases, the Maxwell equations in a two-dimensional structure can be reduced to a scalar wave equation, which makes optics a viable implementation platform for a trainable MVM.
Looking at the expression for the gradient (Equation (2)), one would need to make a measurement of the real part of the product of two complex fields, over the entire region where one wishes to modulate the refractory index. This requires a system-wide interferometric measurement of the complex fields and , such that we can directly measure their respective real and imaginary parts. In a planar material, this is–at least in principle–possible, as we could access the evanescent field at any location. In reality, however, this would pose a significant challenge, as for interferometry we would need to have access to a reference signal at every location too. Even when we succeed in separately measuring , , , and , we would still need to compute the quantity

 R(ϕaϕe)=R(ϕa)R(ϕe)−I(ϕa)I(ϕe)

for each location, greatly increasing the device’s complexity.
An alternative way of obtaining the gradient is to generate the complex conjugate field of either or . If we are capable to do so, we could make three system-wide field intensity measurements (which are far simpler than interferometric measurements) as follows:

 2R(ϕaϕe) = ϕaϕe+ϕ∗aϕ∗e (5) = (ϕa+ϕ∗e)(ϕ∗a+ϕe)−ϕaϕ∗a−ϕeϕ∗e = I(ϕa+ϕ∗e)−I(ϕa)−I(ϕe),

with indicating the intensity of the field . Immediately obvious from this expression is that the necessary computations are far simpler (additions instead of multiplications, which can be performed in an analog fashion easily). Indeed, using this approach seems more promising than direct measurements of and . The issue then shifts to the following question: how well are we able to generate the field (or )?
Generating the complex conjugate of an optical field is a well-studied problem, and has many potential applications in microscopy, lasers, sensors, holography, etc. Especially the concept of complex conjugate mirrors to correct phase distortions has received lots of attention. In our case, Equation (5) shows that we don’t require a complex conjugate mirror, but rather, we’d need to generate the complex conjugate field separately (at a different instant than which the original field is present) and add this to another optical field. One possible approach to achieve this would be to construct an optical phased array. If we could measure the phase front , with consisting of the set of points where light exits the system, and we are able to regenerate the complex conjugate of this phase front as a source, we can–in principle–generate and . There is one important caveat though: there should be no significant internal losses anywhere within the medium; phase conjugation can only invert phase distortions but not optical losses. This means that we would need to capture and detect all the light that exits the system. Internal losses are unfortunately ubiquitous in integrated photonics. In our case they are even a necessity if we want to measure the optical intensity throughout the trainable medium.

### Trainable waveguides Figure 4: Structure of a trainable nanophotonic chip. (a): Possible design of a trainable optical circuit. The black lines are waveguides. The grey boxes represent unitary transformations of the optical signals, and the grey circles represent tuneable phase shifters. (b): Example of a waveguide structure which would implement a ‘useful’ unitary transformation. The places where the waveguides approach each other very closely are directional couplers which allow light to be exchanged between the waveguides without losses. This structure was the one used for the numeric simulations, though in reality, since the only role of the unitary matrices is to ‘mix’ the light form different waveguides, likely more compact and convenient designs are possible.

Clearly, bulk material slabs in which light can propagate freely may pose too many challenges. We need an optical system that offers more control over the complex field that is generated within. For this, let us consider optical circuits that consist of single-mode waveguides, which greatly simplifies the conceptual description of the optical field; the optical signal present in each waveguide can be described by a single complex number (when considering one direction of travel). Note that optical circuits simply are defined by the refractory index of the system varying in space. This means that the theory that has been presented in the first part of this paper can be applied to such systems too.
Let us consider the aforementioned theory on an ideal waveguide. Equation (5) tells us how we can express the gradient w.r.t. the local refractory index as a function of intensities. In the case of a waveguide, can represent a guided mode with light traveling in one directional mode which we could consider the ‘forward’ direction. The field will represent a guided mode traveling in the opposite direction, i.e., ‘backwards’. Its complex conjugate, is reversed in direction, and therefore corresponds to a guided mode traveling also traveling forward, in the same direction as . If we consider the intensity , throughout the waveguide, this is the intensity of a guided mode, unchanging in the direction of travel. Therefore, the gradient one obtains within a single waveguide is uniform throughout its length. It follows that the primary effect of adapting the refractory index will be to change the phase shift induced by this waveguide. It also follows that it suffices to measure light intensities at a single location within the waveguide. Tuneable phase shifters are an integral part of integrated photonics, and can be implemented in several ways (mechanically, electro-optically, or thermo-optically).

#### Trainable optical circuit

We start from the idea that we have an optical chip with input waveguides and output waveguides. Using trainable waveguides as phase modulators, we then consider a structure as represented in Figure 4(a). Here, the signal alternates between being passed through a set of tuneable waveguides (phase shifters) and being transformed with fixed, unitary transformations .
The presented structure would need to have coherent detection and generation of signals exclusively at the in- and output sources. Generating the conjugate field or would only require the coherent measurement of the field at the receivers (which is a requirement in any case to produce the output vector ), and generating the complex conjugate to be sent back into the system. After this, we need to perform the three measurements mentioned before and measure the light intensities from Equation (5) within each waveguide in order to know how to change its refractory index.
To shed some light on the functionality of such a device and foreseeable issues, we will simulate its training process under successively less ideal circumstances. Note that the device represented in Figure 4

will (if there are no losses) always perform a unitary transformation from input to output. Therefore, we will use a unitary matrix as a target for training.

In the supplementary material we explain in detail how the chip is modelled, and how the training is performed. During a single training update, ‘signal’ light enters the chip from the left, is transformed by the chip, and is coherently detected at the output (right side). In order to perform the training, we generate the complex conjugate of this light at the right side, add it up to the ‘error’ input, and send it backwards into the chip such that we can detect , where stands for the approximation of the complex conjugate field (the difference is due to losses in the system). Next we send in the error signal and the conjugated output signal separately to determine and . In total this means that three measurements are necessary to obtain the gradient, all based on light propagating through the chip in the backwards direction. Alternatively, we can work from the left side, where we use an approximation for the error field , combined with the normal input field . In case of losses it is beneficial to do both (as we show later). Practically this can also be performed in three measurements, as we can simply send in light from two directions at once and measure the sum of their intensities (if we average out the intensity fluctuations caused by standing waves). Before we move to numerical results, it is worth considering what form the unitary transforms would take on a chip. We found that–as long as sufficiently ‘mixes’ the states–it’s exact form matters very little for final performance. With sufficient mixing we mean that each exiting waveguide should contain light of a sufficiently large number of entering waveguides, such that information of each input channel is effectively spread over the chip. In the numerical experiments that follow we constructed the by combining multiple 50/50 directional couplers (See Figure 4(b) for more details).

#### Numeric Simulations Figure 5: Results of training the optical chip to emulate a pre-specified unitary transformation. Four different scenarios are considered, taking under consideration different degrees of non-ideal behaviour.

We will simulate the training process of a chip with 50 inputs and 50 outputs. The details of the experiments are explained in the supplementary material. We test the following scenarios:

• First of all we consider the ideal, lossless scenario.

• Next we assume a 6% power loss in each waveguide/phase shifter, based on assuming we couple out one percent of power for local measurements, and a 5% intrinsic loss (which was the lowest number we could find in the integrated photonics literature ).

• We assume the same losses as before, but now we compute two gradients resulting from the two directions light can travel through the chip, and we use the average gradient of that.

• Finally we assume uneven losses, where each waveguide loses an additional fraction of power, uniformly and randomly picked between zero and one percent. Here we will use the average gradient again, resulting from two directional modes.

Figure 5

shows the evolution of the NRMSE for each of the four experiments. Under ideal circumstances, the chip performs very well and the error goes down to negligible levels. When the chip is lossy, the training process is hampered. This is mostly due to the fact that power levels decrease by almost a factor 80 (-19 dB) from entrance to exit, which means that the gradients obtained for phase shifters close to the light source are far larger than those further away. Indeed, when we use light coming from both directions this problem is partially eliminated. Note that there are probably more elegant ways to avoid this problem, for instance by having different ‘sensitivities’ for different phase shifters at different locations, such that internal losses are compensated completely.

Finally, uneven losses seem to pose the biggest challenge. One needs to be careful to interpret this result, however: due to uneven losses, the chip will be intrinsically unable to produce a unitary transformation from input to output, as the intermediate stages no longer are unitary. This means that the NRMSE will always have a certain lower bound given by the uneven losses. To check to what degree the result is due to this bound or to impairment in the training process, we simply re-measure the NRMSE after the training phase on the same (trained) chip, but without the uneven losses. It turns out this reduces the resulting NRMSE by a factor about 2 (). If we do not keep the learning rate fixed throughout training, but let it drop linearly to zero, this even becomes almost a factor 10 (), while there is no difference in final performance for the chip with the uneven losses. This shows that uneven losses do not seem to critically endanger the training process.

One final observation is that the trained phase shifts have a relatively small standard deviation. We initialised them at zero, and after training, we observed that nearly all the phase shifts are still within

, i.e., only covering one fifth of the full range. This is important as it implies that we do not necessarily need phase shifters which can cover this entire range. Indeed, truncating the phase shifts within the aforementioned range does not visibly affect performance in any of the presented experiments.

## Discussion

Many more factors need to be taken into account before a trainable MVM using waves may be constructed in practice. For the nanophotonic implementation, two important remaining factors we haven’t yet studied are measurement noise and scalability. The first is concerned with how a noisy measurement will affect the obtained gradient. Indeed, we do not wish to lose a lot of power in the local power measurements within each waveguide/phase shifter. This means that the adaptations of the chip will need to be based on measurements of very low optical power, perhaps close to the noise floor. One factor that works in our advantage is the fact that these power measurements do not need to happen extremely fast. Whereas in typical telecommunication applications photodiodes need to measure optical power at rates well over the Gigahertz range, we can work with far lower measurement rates. Suppose for example we allow one microsecond for each of the three required power measurements, this would still allow several hundreds of thousand updates per second, while allowing a relatively long measurement time (i.e., time to accumulate energy) for each phase shifter.
Scalability in terms of allowable losses pose a more significant challenge. Right now we simulated a chip where the light needs to travel through 70 arrays of phase shifters, losing almost 99 % of power in the process. Scaling the chip up to larger proportions will certainly exacerbate this problem, and one would very quickly reach a situation in which the chip would be unusable. This means that scalability of nanophotonic trainable chips would hinge completely on the ability of using low-loss components, or potentially on the use of optical amplification that compensates the losses. Here, again, one factor that plays in our advantage is the fact that the mechanism used for phase shifting within the chip doesn’t need to have a fast response time. Note that in typical (e.g., telecom) applications, speed is more important than losses. This means that one of the most typical design constraints is lifted, and when developing suitable phase shifters for trainable optics, one can focus on compactness and low losses.
The proposed physical optimisation method of this paper may also find applications outside of neural networks. It could be used more generally to automatically tune optical linear systems if a specified target transformation is not available, but when an error (a difference between a desired and actual output field) can be defined. This may have applications in decoding multimode communication channels (similar to ), when the channel is not reliable (changing over time). A fast adaptation method based on physically implemented gradient descent may be useful here.

## References

•  Hinton, G. et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. Signal Processing Magazine, IEEE 29, 82–97 (2012).
•  Krizhevsky, A., Sutskever, I. & Hinton, G. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, 1106–1114 (2012).
•  LeCun, Y., Bengio, Y. & Hinton, G. Nature 521, 436–444 (2015).
•  Sutskever, I., Vinyals, O. & Le, Q. V. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, 3104–3112 (2014).
•  Cho, K. et al. Learning phrase representations using rnn encoder-decoder for statistical machine translation. In

Conference on Empirical Methods in Natural Language Processing (EMNLP 2014)

(2014).
•  Socher, R., Karpathy, A., Le, Q., Manning, C. & Ng, A. Grounded compositional semantics for finding and describing images with sentences. Transactions of the Association for Computational Linguistics 2014 (2014).
•  Mnih, V. et al. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013).
•  Rumelhart, D., Hinton, G. & Williams, R. Learning internal representations by error propagation (MIT Press, Cambridge, MA, 1986).
•  Werbos, P. Generalization of backpropagation with application to a recurrent gas market model. Neural Networks 1, 339–356 (1988).
•  Farhat, N. H., Psaltis, D., Prata, A. & Paek, E. Optical implementation of the hopfield model. Applied Optics 24, 1469–1475 (1985).
•  Yeh, P. & Chiou, A. E. Optical matrix–vector multiplication through four-wave mixing in photorefractive media. Optics letters 12, 138–140 (1987).
•  Goodman, J. W., Dias, A. & Woody, L.

Fully parallel, high-speed incoherent optical method for performing discrete fourier transforms.

Optics Letters 2, 1–3 (1978).
•  Caulfield, H. J., Kinser, J. & Rogers, S. K. Optical neural networks. Proceedings of the IEEE 77, 1573–1583 (1989).
•  Yang, L., Zhang, L. & Ji, R. On-chip optical matrix-vector multiplier for parallel computation. SPIE Newsroom (2013).
•  Deng, L. et al. Complex learning in bio-plausible memristive networks. Scientific reports 5 (2015).
•  Prezioso, M. et al. Training and operation of an integrated neuromorphic network based on metal-oxide memristors. Nature 521, 61–64 (2015).
•  Hermans, M., Burm, M., Van Vaerenbergh, T., Dambre, J. & Bienstman, P. Trainable hardware for dynamical computing using error backpropagation through physical media. Nature communications 6 (2015).
•  Tezak, N. & Mabuchi, H.

A coherent perceptron for all-optical learning.

EPJ Quantum Technology 2, 1–22 (2015).
•  Vandoorne, K. et al. Experimental demonstration of reservoir computing on a silicon photonics chip. Nature Communications 5 (2014).
•  Miller, D. A. Self-configuring universal linear optical component [invited]. Photonics Research 1, 1–15 (2013).
•  Miller, D. A. Sorting out light. Science 347, 1423–1424 (2015).
•  Mower, J., Harris, N. C., Steinbrecher, G. R., Lahini, Y. & Englund, D. High-fidelity quantum photonics on a programmable integrated circuit. arXiv preprint arXiv:1406.3255 (2014).
•  Steinbrecher, G., Harris, N. C., Mower, J., Prabhu, M. & Englund, D. R. Programmable nanophotonic processor for arbitrary high fidelity optical transformations. In CLEO: QELS_Fundamental Science, FW4A–2 (Optical Society of America, 2015).
•  Jensen, J. S. & Sigmund, O. Topology optimization for nano-photonics. Laser & Photonics Reviews 5, 308–321 (2011).
•  Seliger, P., Mahvash, M., Wang, C. & Levi, A. Optimization of aperiodic dielectric structures. Journal of Applied Physics 100, 034310 (2006).
•  Lalau-Keraly, C. M., Bhargava, S., Miller, O. D. & Yablonovitch, E. Adjoint shape optimization applied to electromagnetic design. Optics express 21, 21693–21701 (2013).
•  Liu, V. & Fan, S. Compact bends for multi-mode photonic crystal waveguides with high transmission and suppressed modal crosstalk. Optics express 21, 8069–8075 (2013).
•  Niederberger, A. C., Fattal, D. A., Gauger, N. R., Fan, S. & Beausoleil, R. G. Sensitivity analysis and optimization of sub-wavelength optical gratings using adjoints. Opt. Express 22, 12 (2014).
•  Poli, R., Kennedy, J. & Blackwell, T. Swarm intelligence 1, 33–57 (2007).
•  Robinson, J. & Rahmat-Samii, Y. Particle swarm optimization in electromagnetics. Antennas and Propagation, IEEE Transactions on 52, 397–407 (2004).
•  Roelkens, G. et al. High efficiency diffractive grating couplers for interfacing a single mode optical fiber with a nanophotonic silicon-on-insulator waveguide circuit. Applied Physics Letters 92, 131101 (2008).
•  Zhang, Y. et al. A compact and low loss y-junction for submicron silicon waveguide. Optics express 21, 1310–1316 (2013).
•  Skinner, S. R., Behrman, E. C., Cruz-Cabrera, A. A. & Steck, J. E. Neural network implementation using self-lensing media. Applied optics 34, 4129–4135 (1995).
•  Steck, J. E. et al. Backpropagation training of an optical neural network. In Microelectronics for Neural, Fuzzy and Bio-Inspired Systems, 1999. MicroNeuro’99. Proceedings of the Seventh International Conference on, 346–351 (IEEE, 1999).
•  Harris, N. C. et al. Efficient, compact and low loss thermo-optic phase shifter in silicon. Optics express 22, 10487–10493 (2014).

## Acknowledgements

MH has been supported by the Interuniversity Attraction Poles (IAP) program of the Belgian Science Policy Office (BELSPO) (IAP P7-35 “photonics@be”). TVV wants to thank the Special Research Fund of Ghent University (BOF) for financial support through a postdoctoral grant.

## 1 Gradient descent on the Helmholtz equation

In this section we explain how we can find the gradient of a certain cost function w.r.t. a certain parameter for a system described by the Helmholtz equation. Suppose we have a field which adheres to the following equation:

 ∇2ϕa(r)+k2(r)ϕa(r) = a(r) for r∈Ω (6) ϕa(r) = 0 for r∈∂Ω, (7)

where is a closed domain of space, and is its edge, such that vanishes at the boundary. the function acts as the source, which in our case will encode the input vector of the MVM we wish to implement. We now introduce a certain cost functional that we want to minimize. In particular we wish to adapt the function . Therefore, we are interested in finding the gradient of w.r.t. . Normally, we can write:

 g(r)=dQ(ϕa)dk(r)=∫Ω∂Q(ϕa)∂ϕa(r′)dϕa(r′)dk(r)dr′. (8)

At this point however, care needs to be taken on how to interpret these derivatives. While and are both strictly real variables, is generally complex. The fact that is strictly real means that it is not an analytic function, and we need to write: , the asterisk

indicating the complex conjugate. We need to apply the chain rule separately for

and (using the so-called Wirtinger derivatives). We find that

Since is strictly real, it follows that

 ∂Q(ϕa,ϕ∗a)∂ϕa=[∂Q(ϕa,ϕ∗a)∂ϕ∗a]∗.

Similarly, we find that

such that we can rewrite the equation as:

 g(r)=2R(∫Ω∂Q(ϕa,ϕ∗a)∂ϕa(r′)dϕa(r′)dk(r)dr′). (9)

We will use a more compact notation for the first factor:

 e(r)=∂Q(ϕa,ϕ∗a)∂ϕa(r). (10)

The second factor can be found when we consider Equation (6) and (7) as a linear operator , such that we can rewrite them as:

 Lϕa(r)=s(r), (11)

where and for . For points on the boundary we define as the identity: , and , such that the condition of Equation (7) is absorbed into the definition of , and Equation (11) describes both the Helmholtz equation and the boundary conditions. Taking the derivative of Equation (11) w.r.t. we find:

 Ldϕa(r′)dk(r)=−∂L∂k(r)ϕa(r′). (12)

The right hand side is a functional derivative, and is equal to , with the dirac delta function. The left hand side is the operator operating on . Therefore, is the solution of the same Helmholtz equation with the same boundary conditions, but a different source function.
We will use the Green’s function that acts as the inverse of operator , which allows us to write:

 ϕa(r)=∫ΩG(r,r′)s(r′)dr′, (13)

as a solution to . When using as source function, this leads to:

 dϕa(r′)dk(r) = −∫ΩG(r′,r′′)∂Lϕa(r′′)∂k(r)dr′′ (14) = −2G(r′,r)k(r)ϕa(r).

Inserting this in Equation (9) yields:

 g(r)=−4k(r)R(ϕa(r)∫ΩG(r′,r)e(r′)dr′). (15)

Note that the arguments of the Green’s function in the integral have switched places. It can be proven that the Green’s function associated with Equation (6) is self-adjoint, which means that . Therefore we can write:

 g(r)=−4k(r)R(ϕa(r)∫ΩG(r,r′)e(r′)dr′). (16)

In other words, if we define a second field which adheres to the equation this becomes:

 g(r)=−4k(r)R(ϕa(r)ϕe(r)). (17)

Both and can be obtained in a physical manner as they are solutions to the same wave equation with different source terms.
We can simplify the gradient even further when we assume that only varies slightly around a fixed average value which only has a global scaling effect on the gradient:

 g(r)∼−R(ϕa(r)ϕe(r)). (18)

## 2 Numerical example of an MVM

We use a planar medium with ten sources / receivers, which are simulated as having

 βi(r)=exp(−||r−rβi||22σ2)2πσ2,
 γi(r)=exp(−||r−rγi||22σ2)2πσ2,

with and are the locations of the -th source and receiver, respectively. The parameter we chose to be approximately equal to half a wavelength in the medium surrounding the sources and receivers. The spacing between the sources and receivers was slightly over two wavelengths.
In between the sources and receivers is the medium of which we spatially modulate the wave number. Each iteration we draw a vector and use this to generate the source . We compute the complex field , and the output . Next, we redo the simulation with the error source term , and compute the resulting complex field . After both simulations, we compute the gradient and update in the region which was designated as the trainable medium.
We numerically solve the respective Helmholtz equations using the stabilised biconjugate gradient method. The simulation area is represented by a rectangular grid of elements, which have a size of roughly one tenth of the wavelength (for the average wave number of the system). To speed up the convergence of the computation of the complex fields we use an absorbing boundary layer, which we implement by adding a gradually increasing imaginary part to the wave number. Note that this is not a necessary element for the physical setup. We could for instance put the trainable medium between two mirrors, such that less energy is lost between the source and receivers.
We used 1000 training iterations. The matrix was picked to be a random complex matrix, and was next scaled down with a factor 25. This scaling was necessary as only a fraction of the total energy of the sources is radiated onto the receivers. Interestingly, when we used much larger scaling factors for , the training algorithm adapts to extreme values in an attempt to reflect as much energy as possible back to the receivers (eventually leading to the inability to compute the complex fields). The learning rate we picked at a suitable starting value , and we let it drop linearly to zero over the course of the training phase: , with the total number of iterations. As a qualitative measure of performance of the system at the -th iteration we use the normalised root mean square error (NRMSE), equal to , where means the average over .

## 3 Nanophotonic implementation

We will obtain gradients using the following expression:

 2R(ϕaϕe) = ϕaϕe+ϕ∗aϕ∗e (19) = (ϕa+ϕ∗e)(ϕa+ϕ∗e)−ϕaϕ∗a−ϕeϕ∗e = I(ϕa+ϕ∗e)−I(ϕa)−I(ϕe),

which requires the measurement of three intensities, and the generation of the complex conjugate field of either or .

### 3.1 Error sources within waveguides

Note that there is an important difference between the way light enters a photonic chip, and the theory presented in Section 1. Here, we assumed that there exists a source term, which for EM radiation would imply an optical antenna (represented in the maxwell equations by an alternating current with the same frequency as the light). In reality, however, light from an external laser source entering a chip is modulated in phase and amplitude, which means there is no internal ‘source’ within the chip. This means that, when we generate the field , we need to modulate the light at the point of measurement as if it was produced by a source term. It turns out that this means that we need to multiply the complex field of the ‘error’ light entering the chip with a factor . This can be proven when we apply the theory of Section 1 to a waveguide with guided modes. We start by writing the ‘output’ field exiting a chip in the form of a guided mode in a waveguide as:

 ϕa(r)=p(y)exp(ȷk0x)ϕo, (20)

where is the normalised transversal field profile of the guided mode, is the effective wave number of that particular guided mode, and is a complex variable that will be the effective output. Note that this implies that is the transversal direction, the waveguide lies according to the -axis, and light is traveling in the positive -direction. We assume that the output is generated at by , such that

 o=∫Ωϕa(r)γ(r)dr=ϕo.

Conversely, the source term for the error becomes

 e(r)=∂Q∂op(y)δ(x).

We assume can be split up into different coordinate parts as well, with the transversal part having the same field profile i.e.:

 ϕe(r)=q(x)p(y).

When we insert this into the Helmholtz equation with the source term stated above, we obtain:

 p(y)∂2q(x)∂x2+q(x)∂2p(y)∂y2+k2(y)q(x)p(y)=∂Q∂oδ(x)p(y). (21)

Note that , as we assume a waveguide structure which doesn’t change in the -direction. We can eliminate by inserting Equation (20) into the homogeneous Helmholtz equation. This leads to:

 ∂2p(y)∂y2=p(y)(k20−k2(y)). (22)

If we insert Equation (22) in Equation (21), and divide by , we obtain:

 ∂2q(x)∂x2+k20q(x)=∂Q∂oδ(x).

The solution to this equation can be found by using the one-dimensional Green’s function of the Helmholtz equation, which yields:

 q(x)=−ȷ∂Q∂oexp(ȷk0|x|)2k0.

We are only concerned with the light going back into the chip, i.e., for , which indeed represents light propagating into the negative -direction and therefore into the chip. In order to ‘simulate’ a source with these properties, we need to modulate the incoming light with the error signal , times a factor (ignoring the scaling factor ).

### 3.2 Abstraction of the chip

When we describe the optical fields at the input waveguides with and those at the output with , the transformation performed by the entire chip can be written as

 o=Wa=[N∏i=1PiUi]a,

where , a diagonal matrix with phase shifts written in vector form as , describing the phase shifts performed by the waveguides after the -th unitary transformation of the signal. Concretely, after abstracting the operation of the chip we find that the complex field of the light before entering the -th unitary transformation is given by:

For the error light, entering from the opposite direction, the complex field of the incoming light at the same locations is given by:

 ek=[N−k−1∏i=0UTN−iPN−i]e,

with , the output error. We wish to apply Equation (19), so we will need to measure the complex state at the point where the original input was inserted, and re-emit its complex conjugate back into the network, added to the original . After all, when the system is completely unitary we can write:

 e∗k=[k∏i=1PiUi]e∗0. (23)

when and are lossy, however, this is no longer true, and we can only generate an approximation of the complex fields:

 ^e∗k=[k∏i=1PiUi]e∗0. (24)

When we define , this field at the same locations throughout the chip becomes:

We will denote the gradient for the phase shift with . Following Equation (19) we can write this gradient as:

 −gi∼hih∗i−aia∗i−^ei^e∗i (25)

which correspond to the intensities at the locations of the waveguides. Intensity measurements will cause a small loss everywhere, and typically variable phase shifters are lossy too. Let’s incorporate this effect; we assume that , with a loss factor slightly smaller than one. In this case decays throughout the chip at the same rate as . In the end this means that, even though the signals lose power throughout the chip, the phase relations between and remain correct. Indeed, in the numerical simulation we show that the resulting gradient is effective in training the conceptual optical chip.
Note that we can also obtain the gradient using light going in the opposite direction by approximating the complex conjugate field as

 ^a∗k=[N−k−1∏i=0UTN−iPN−i]o∗,

which would give an alternative gradient:

 −g′i∼h′ih′i∗−^ai^a∗i−eie∗i, (26)
 h′k=[N−k−1∏i=0UTN−iPN−i]h′,

and . Indeed, if the chip is lossy it is desirable to take the average of both approximate gradients. This will even out the effects of losses and perhaps reduce approximation errors.

We will also simulate what happens when the transformations aren’t perfectly unitary. Suppose for instance that a certain amount of energy is lost, but a different amount for each output channel. Effectively, the transformation could be written as , with a true unitary matrix and a diagonal matrix with elements smaller than one on the diagonal, which all differ from each other. This adds another problem to training the chips, as unevenly distributed losses will affect phase relations when attempting to create the field . Of course, there are gradations to this issue. If the differences in losses are only slight, it might only affect the operation to a limited degree, and one would still obtain useful gradients throughout the chip.
We will simulate the training process of a chip with 50 inputs and 50 outputs. To have a sufficient number of trainable parameters we assume phase shifter arrays. It can be proven that a unitary matrix of size is defined by parameters (as opposed to a random square complex matrix which has parameters). It is unclear if the chip structure as we presented is theoretically able to emulate all possible unitary matrices if it has only layers of phase shifters (which would provide the right number of parameters). In practice we find that does not converge to a satisfactory level. More layers, lead to a much better performance. We settled for as a tradeoff between performance and additional losses.
For each experiment we performed 20,000 training iterations, where as before, each iteration a single input / desired output pair was presented to the simulated chip. The learning rate we kept fixed during the training, but it was optimised in each of the four experiments to give the best final result. Note that the assumed losses will severely reduce the optical power that exits the chip. We want to train the chip to perform a unitary transform, however, which would preserve the total power. Therefore we ‘compensate’ all losses by using only normalised input vectors and renormalizing the measurements at the output stages (both the output signal and the transformed error signal ). These renormalised signals are then further used, both for training and for determining the NRMSE.