I Introduction
Feedforward networks of tunable beamsplitters or “nodes”, typically implemented as meshes of MachZehnder interferometers (MZIs), can perform linear operations on sets or “vectors” of optical inputs in singlemode waveguides Reck et al. (1994); Miller (2013a); Carolan et al. (2015). With advances in photonic integration, these networks have found a wide range of classical and quantum photonic applications where the mesh is configured to implement some specific linear transform or matrix. Some applications, like mode unscrambling in optical communications Annoni et al. (2017), favor a selfconfiguring approach Miller (2013a, b) in which the mesh sets itself up in real time to implement the matrix that undoes the mixing. Other applications, including photonic neural networks Shen et al. (2017), universal linear quantum computing Carolan et al. (2015), and photon random walks Harris et al. (2017), may need to have the mesh implement some specific matrix that is calculated externally. These applications promise fast and energyefficient matrix multiplication or analog computation via the physical process of light propagating through programmed nodes that can arbitrarily redistribute that light. For such applications, we will demonstrate how the nodes in an arbitrary feedforward network (e.g., the simple grid network of Fig. 1) can be efficiently programmed in a faulttolerant manner to implement a desired matrix operator.
Though it is straightforward to calculate phase shifts and/or beamsplitter split ratios to implement a matrix in such a mesh, any fixed fabrication of such settings is challenging for large meshes due to the precise settings required Flamini et al. (2017). For example, these errors limit the classification accuracy of optical neural networks Shen et al. (2017); Fang et al. (2019) and prevent scaling up the number of components in quantum linear optical circuits Carolan et al. (2015); Flamini et al. (2017). We therefore prefer reconfigurable beamsplitter nodes in such networks and corresponding setup algorithms that can directly program desired matrices.
Such setup algorithms also let us fully calibrate the nodes; finding the voltage drive settings for any programmed node subsequently allows us to interpolate among those settings to implement desired matrices by applying appropriate voltages directly. Although each node can be individually calibrated in some specific architectures
Mower et al. (2015); Carolan et al. (2015), this approach is slow for largescale meshes and must be repeated if components experience environmental drift. Here, we show an approach that can greatly reduce the time required for such setup and/or calibration processes and also generalizes to any feedforward mesh (i.e., where light propagates unidirectionally). By exploiting a graphtopological approach, we identify which nodes can be programmed simultaneously and provide the necessary parallelized algorithm to reduce the setup time by a factor of order .One general setup approach that does not require prior knowledge of each node’s parameters is based on what could be called “nullification.” By specific choices of input modes (amplitude and phase) sent into input waveguides, corresponding nodes can be programmed by nullifying power at one of their outputs by optoelectronic feedback control Miller (2013a); Annoni et al. (2017); Miller (2017); Hughes et al. (2019). For example, networks made from one or more diagonal sets of nodes Miller (2013a) are programmed using nullification. One such diagonal gives a selfaligning beam coupler Miller (2013b), and multiple diagonals can be cascaded to form triangular grid networks (as in the Reck scheme Reck et al. (1994); Miller (2013a); Annoni et al. (2017)) that implement arbitrary
unitary matrices. The appropriate input vectors programming such meshes are then the complex conjugates of the rows of the target unitary matrix
Miller (2013a); Annoni et al. (2017).More generally, other feedforward networks, such as rectangular grids (Clements scheme Clements et al. (2016)), may not support selfconfiguration, so some separate design calculation must be performed to calculate the desired parameters of each node. Nonetheless, with the knowledge of the desired node parameters, such networks can be progressively configured using the reversed local light interference method (RELLIM) nullification approach proposed in Ref. 11. A key question is whether we can minimize the total time required to program or calibrate the network. The selfconfiguring algorithm for diagonal or triangular meshes and the RELLIM algorithm as originally conceived Miller (2017) give prescriptions for setting up the nodes sequentially (i.e., oneatatime) and thus require a number of steps equal to the number of nodes.
In this paper, we propose a graphtopological framework that arranges any given feedforward network into columns of nodes that can be programmed simultaneously, with just one “nullification set” input vector for the entire column, rather than programming one node at a time with possibly different input vectors for each such node. The resulting “parallel nullification” or parallel RELLIM (“PRELLIM”) protocol uses up to ()times fewer calibration steps and input vectors than RELLIM, where is the number of input modes to the system. Our protocol ultimately enables efficient, faulttolerant, flexible, and scalable calibration (with time complexity of the number of columns in the device) of an arbitrary feedforward photonic mesh architecture. Example such architectures include triangular Reck et al. (1994); Miller (2013a) and rectangular Clements et al. (2016) grid networks (capable of implementing arbitrary unitary matrices) and butterfly Flamini et al. (2017); Fang et al. (2019) networks (capable of implementing any permutation or DFT unitary matrices).
We outline a typical scenario that benefits from parallel nullification in Fig. 1(a), where a model of an optical network stored in a digital computer (e.g. a CPUtrained optical neural network) must be programmed into a photonic circuit. The model consists of the network topology (node connection patterns) and tunable node parameters used to calculate the nullification set vectors for calibration. The parallel nullification procedure shown in Fig. 1(b), consists of calibrating the mesh one column at a time using the nullification set and tuning all nodes within each column in parallel until their bottom outputs are all nullified (i.e. transmit zero power). In Fig. 1(c), we show that after parallel nullification is applied to all columns, our device matches the computer model as accurately as physically possible.
The remainder of this paper is organized as follows. In Section II, we lay out the foundations of our graphtopological framework used to formally define a general feedforward photonic mesh. In Section III, we propose our parallel nullification protocol and demonstrate how our protocol can deploy machine learning models on optical neural networks in Section IV. We then more generally discuss faulttolerant performance of parallel nullification in presence of systematic errors in Section V.
Ii Feedforward photonic network
In this section, we introduce some required mathematical and graphtopological terminology and concepts for feedforward photonic networks. For any such network with input and output waveguides, there is a linear device operator Miller (2012, 2019a) or matrix relating the input and output waveguide amplitudes for monochromatic light at steady state. Because we are considering feedforward networks, by choice we consider only the forward amplitudes in all waveguides. With a set of amplitudes in the input waveguides, represented by the input mode vector (the th element stores the amplitude and phase of the input mode in waveguide ), and a corresponding vector of output mode amplitudes , then as shown in Fig. 1. Since we presume no backwards waves, can be considered to be an transmission matrix. Each individual tunable beamsplitter node (depicted by dots in Fig. 1) can similarly be described by a transmission matrix that describes how the light in its two input waveguide ports is distributed across its two output ports.
Based on the graphtopological arguments of this section, we always arrive at a compact definition for in terms of node columns that can each be programmed in a single step as suggested by Fig. 1. For example, the triangular grid mesh Reck et al. (1994) can be programmed in steps (as opposed to the typical steps), the rectangular grid mesh Clements et al. (2016) in steps, and the butterfly mesh Flamini et al. (2017) in steps.
ii.1 Nodes
Each node of the feedforward network is a tunable beamsplitter whose requirement is to be able to arbitrarily redistribute light. This is accomplished by concatenating a phase shifter () to a tunable coupler () resulting in the transmission matrix :
(1) 
where and . By notation “,” we note that there are many equivalent constructions of provided explicitly in Appendix E, which have mathematically different, but functionally equivalent, representations including tunable directional couplers (TDCs) and MachZehnder interferometers (MZIs).
Parallel nullification is agnostic of the exact modulation schemes for and , as long as corresponds to a controllable phase difference between the two input waveguides (so a differential phase) and covers the full range of transmissivities (bar state or to cross state or ).
In an port device, to represent the effect of one element we may embed the transmission matrix along the diagonal of an otherwise identity matrix. Formally, this would allow the resulting embedded operation to operate between modes in waveguides and (a Givens rotation) as follows:
(2) 
ii.2 Graphtopological framework
We typically make photonic mesh networks in a gridlike manner Reck et al. (1994); Miller (2013a); Clements et al. (2016) for efficiency and compactness in fabrication. This also can allow equal path lengths if we want to make the interference relatively insensitive to wavelength changes. Configuring the nodes of the network takes much longer than the time for light to propagate through the network. So, for the purposes of analyzing more general feedforward networks, we can consider monochromatic, continuouswave light such that actual optical distances and geometrical arrangements are unimportant for calibration and programming of the network.
In contrast, the “network topology” (i.e., connectivity independent of lengths of internode links) is important for defining valid feedforward configurations and the required inputs and node sequence by which we can progressively program or calibrate the network. We could mathematically view the internode links of the network as made of flexible (and stretchable) fibers between nodes that can move in space as long as the network topology is not changed. As such, we represent the notion of light moving “forward” or lefttoright (independent of the physical node locations in space) as moving along a given fiber away from network inputs or node outputs, which we define to be on the “left,” and towards network outputs or node inputs, which we define to be on the “right.” We will ultimately examine grouping nodes into “columns” to establish which nodes can be configured simultaneously or within the same propagation time step.
In such a graphtopological view, we can propose a sequential constructive definition that generates any arbitrary feedforward mesh network. As represented in the example in Fig. 2(a), we start with input waveguides or fibers on the physical left (though this positioning is arbitrary). In constructing the network, we perform node addition by interfering any chosen pair of waveguides or fibers at the inputs of a node to generate outputs from that node. We assume light flows forward, so at any subsequent (further to the right) node addition, we interfere any two waveguides that exit earlier (further to the left) nodes. In this way, given a sufficient number of nodes and arbitrary choice of waveguides to couple, we can generate any feedforward mesh architecture. Furthermore, rather than adding a single node at a time in our construction, we can add at most nodes at a time (where represents the largest integer less than or equal to ) to account for the largest number of waveguides that can be simultaneously coupled. The resulting architecture is shown in Fig. 2(f) where we maintain the feedforward property that propagation along the waveguides always crosses the dotted vertical lines from lefttoright. This results in what we will define as the most “compact” representation, resembling more closely some of the commonly known rectangular and triangular feedforward architectures Reck et al. (1994); Clements et al. (2016).
It can be helpful to find this compact representation of Fig. 2(f) starting from the more arbitrary layout of Fig. 2(a) using a “topological sort” or homomorphic transformation. To do this, we propose a timeordered “traversal” where we hypothetically insert a mode at every input to the mesh and allow the resulting mode vector mathematically to propagate progressively (i.e., to traverse) through the device. At every step of the traversal, shown in Fig. 2(a)(e), we allow the mode vector amplitudes to pass through and be transformed by the nodes in the network. Since the transmission matrix for each node in Eq. 1 interferes two inputs, two modes need to be input into a node to traverse that node (advance to the outputs of that node). Via this procedure, we equivalently construct sets of nodes (in circles of the same color of Fig. 2) that all connect only to previous (already traversed) nodes. We can then “compactify” the network into the numbered vertical columns as in Fig. 2(f) corresponding to the time steps (colored and indexed 1 to 4) in which nodes are traversed. Since the nodes in these columns are not connected to one another, we are free to configure these nodes in any order, including configuring them all simultaneously (i.e., in the same time step) assuming the preceding nodes are configured. Using this graphtopological approach, we can always generate the most compact device (in terms of number of node columns or “optical depth”) possible for any feedforward architecture by lining up the nodes according to their time step as in Fig. 2(f).
Such a compactified version is the network with the lowest optical depth representation of the mesh, and nodes in each time step (or column) must be independent since there cannot be a valid path between them without contradicting our traversal algorithm; hence they can to be tuned in parallel. Our columnwise or timestep labelling is a specific case of the wellknown Dijkstra’s algorithm Dijkstra (1959) to find the maximum longest path (over all input source nodes) in a directed acyclic graph (feedforward mesh) using breadthfirst (timeorder) search to the mesh.
Each node belongs to exactly one column, but if our traversal algorithm finds multiple column assignments for that node (which would require revisiting that node), then there would be be a cycle in the graph. Therefore, by applying this algorithm, we can formally identify cycles in any mesh architecture (as in the lattice meshes of Refs. 17; 18) that would disqualify such a mesh as a feedforward architecture. Though those meshes with cycles have specific uses, nodes in such meshes must be tuned individually or using some global optimization approach Pérez and Capmany (2019) and may be susceptible to backreflections during programming or calibration.
ii.3 Transmission matrix representation
Although our graphtopological construction can generally be used to define any feedforward mesh, we need an equivalent transmission matrix that allows for straightforward simulation and calibration of such devices. Once we have arrived at the compact representation of Fig. 2(f), we can define each column entirely using the general definition of Eq. 3, later depicted in Fig. 3(c). For each added column of nodes, we select the appropriate source nodes using the permutation matrix and connect them to simultaneouslyacting nodes via the blockdiagonal matrix (the product of up to nodes given by Eq. 2):
(3)  
where phase parameters are and . Assuming nodes in the column, then is defined such that we can add synthetic bar state beamsplitter nodes for all , i.e. for any waveguides that do not interact in that column. While all waveguides in a column can technically be interfered at nodes for even
, for odd
there will always be a single remaining waveguide that is not connected to a node in that column. However, this ultimately does not change the definition in Eq. 3 since we can always choose this to be the th waveguide by appropriate choice of .To complete meshes for which we desire a fully arbitrary unitary transformation, we may need a further set of phase shifters on the output nodes of the mesh that set the relative phases of the rows of the implemented matrix (represented by the diagonal unitary matrix ). (Such phase shifters can also be at the input if the external phase shift of each node is applied after the tunable coupler, effectively a mirror image of our current definition.) Furthermore, we include a final permutation before or after those final phase shifters, allowing us to arbitrarily rearrange the rows of the matrix at the end.
Performing our transmission matrix construction for columns, we arrive at an expression for an arbitrary feedforward mesh:
(4)  
where the matrices and vector represent the full set of mesh parameters with ranges and . Any feedforward architecture is entirely defined by these permutation matrices and (representing the choices of waveguides to interfere), which we explicitly define for commonly proposed architectures (e.g., triangular, rectangular, butterfly) in Appendix C, and by the various phase settings in the nodes and at the output.
Iii Parallel nullification
Now that we have defined a columnwise feedforward photonic mesh, we present parallel nullification as summarized in Fig. 3. The programming algorithm consists of an offchip calculation of a sequence of inputs (or “nullification set”) as in Fig. 3(a) followed by parallel nullification of columns from lefttoright of the mesh network as in Fig. 3(b).
In a realistic setting, we do not have direct access to the input of each column, but rather the inputs to the overall device. We therefore need the overall input vector assigned to each column that leads to the desired nullified output for that column. In both the RELLIM protocol Miller (2017) and our parallel nullification approach introduced here, the nullification set calculation works due to the reciprocity of feedforward meshes. In particular, we calculate a vector of complex amplitudes that would emerge from shining the desired nullified output backwards to the input from any column through a correctly programmed mesh. The nullification set for RELLIM results from mathematically propagating light backwards from the desired output of a single nullified node Miller (2017). In contrast, the nullification set for parallel nullification results from mathematically propagating light backwards from the desired output of an entire column of nullified nodes, as in Fig. 3(a).
When it is time to actually program the columns on the physical mesh, we physically send the phaseconjugate (i.e., complex conjugate) of this result, which we can now call the nullification set vector , into the mesh inputs as in RELLIM Miller (2017). As long as all the preceding mesh columns are set correctly, reciprocity ensures this vector can be used to correctly program the corresponding column to the desired mesh settings as shown in Fig. 3(b)(d) by physically nullifying the appropriate outputs of that column.
In this section, we first formalize the nullification set calculation which is performed separately on a traditional computer. We then discuss the mathematics and physical procedures behind parallel nullification of each column and the overall programming algorithm that sets the physical parameters of the device (which we will refer to as ) to the desired settings ( respectively).
iii.1 Nullification set calculation
For parallel nullification, there exist many valid calculations of nullification sets; we could choose to nullify either the top or the bottom port at any given node, and the value of the (nonzero) power at the nonnullified port does not matter. For definiteness, we will consider a simple valid target vector or more formally:
(5) 
where represents the th standard Euclidean basis vector in , or equivalently, unit power in node output port .
We calculate the nullification set vector for each column as depicted in Fig. 3(a):
(6) 
Since we have already programmed columns , sending in to the device yields correct values for after parallel nullification of column .
We can compute the entire nullification set offchip in time assuming all are known since each results from reverse propagating through columns. For example, we can calculate the nullification set in time for the rectangular or triangular grid meshes. Code for calculating the nullification set (Eq. 6) for any feedforward mesh is provided in our Python software module neurophox Pai et al. (2019), further discussed in Appendix A,^{1}^{1}1See https://github.com/solgaardlab/neurophox and nullification set results calculated for rectangular grid networks Clements et al. (2016) are shown in Appendix D.
iii.2 Nullification
After calculating our nullification set offchip, we program the physical device. Before programming a given node in column , the settings and will be different from desired settings and respectively. Given nullification set input (presuming all preceding columns of the mesh are already set correctly), nullification can be achieved by two independent steps to nullify the bottom port (port ) Miller (2013a, b) as proven explicitly in Appendix B:

Sweep (i.e., adjust the relative phase of the node inputs) until bottom port power is minimized.

Sweep (i.e., adjust the node split ratio) until bottom port is nullified, as shown in Fig. 3(d).
Given the permuted mode pair entering from the previous column , this straightforward twostep optimization exactly adjusts settings , to be the desired :
(7)  
Parallel nullification of all nodes in column (i.e. ) can be achieved because, as discussed in Section II, the choice of nodes assigned to column ensures all such optimizations are independent (i.e., do not influence each other). Nullification can be accomplished physically by sampling and measuring a small fraction of the power in the bottom output port. Nullification is then achieved through local feedback loops and is accomplished once zero power is measured. This nullification procedure has been experimentally demonstrated previously with noninvasive CLIPP detectors Annoni et al. (2017); Grillanda et al. (2014); Morichetti et al. (2014) that are effectively lowloss because they rely on light already absorbed in background loss processes in the waveguide.
iii.3 Programming algorithm
The parallel nullification programming algorithm proceeds formally as in Alg. 1, which we simulate in neurophox Pai et al. (2019). If we are configuring the actual physical network, the entire ForwardPropagate method is a physical process in which we generate an actual vector of optical inputs , and propagate them through the mesh to physically generate the vector at the node outputs in layer , which are permuted by to produce the vector we nullify in Eq. 7, . The vector is the propagated fields of our calculated inputs to the output of column as depicted in Fig. 3(b):
(8) 
where we use to mathematically represent the transmission matrices describing alreadyprogrammed physical columns of nodes (correctly set to column parameters ). The final (parallel) forloop of Alg. 1 represents a physical parallel nullification of output powers using optoelectronic feedback on each column in order from to .
The inputs to Alg. 1 are the desired settings for the feedforward mesh architecture which are used to first generate the nullification set . Algorithm 1 ultimately results in setting the physical mesh parameters to the desired , for which a full testing suite is provided in in neurophox Pai et al. (2019). As shown in Fig. 3(a), each nullification set vector has the necessary information from past columns to tune all devices in column in parallel. We demonstrate the parallel nullification algorithm from Fig. 3(b) for a rectangular grid mesh network in Appendix C.
While we have ignored adjusting the final output phase shifts ( in Sec. II) in Alg. 1, we emphasize that such phase shifts merely serve to define an “output phase reference,” which may be unimportant in some specific applications (e.g., in the optical neural network application we now discuss) but is anyway straightforward to adjust after parallel nullification of all columns.
Iv Optical neural networks
In this section, we primarily discuss applying parallel nullification to programming reconfigurable optical neural networks (ONNs), as shown in Fig. 4. Parallel nullification can be used to diagnose and errorcorrect integrated optical neural networks, which are capable of performing machine learning tasks that transform opticallyencoded data and can be significantly more energyefficient than their electrical counterparts Shen et al. (2017).
For our demonstration, we choose the MNIST classification task, a popular standard in machine learning models consisting of images of handwritten decimal digits from to . Using neurophox Pai et al. (2019) and GPUaccelerated automatic differentiation in tensorflow Abadi et al. (2016), we train a twolayer ONN (shown in Fig. 4(a)) consisting of two
rectangular grid meshes followed by ReLUlike optical nonlinearity or activation layers
Williamson et al. (2020) to classify each image to the appropriate digit label. Our training examples consist of lowfrequency FFT features preprocessed from each image that are input into the device, and our ultimate goal is to direct the light into port labelled by digit , as demonstrated in Fig. 4(a) for digit 0. Further details on data preprocessing, model choice, and neural network robustness and performance are discussed in Appendix F and in Ref. 22.Reprogrammable electrooptic nonlinearities Williamson et al. (2020) (or generally any ReLUlike optical nonlinearities that can be tuned to operate in a linear regime) allow multilayer, “deep” ONNs to be programmed or calibrated using our parallel approaches. Crucially, this means that it is possible to calculate inputs to the entire ONN (rather than each layer) and sequentially program the columns of all mesh networks in the overall device to program any desired operator of choice. We now demonstrate the use of this protocol for correcting significant drifts in our specific simulated ONN.
The training of the ONN parameters is shown in Fig. 4(b), achieving accuracy on training data (60000 training examples) and accuracy on holdout evaluation data (10000 testing examples). In our simulated environment, we use parallel nullification to correct phase drift for in our MNISTtrained network as shown in Fig. 4(c). In particular, the confusion matrix (representing correct predictions on the diagonal and incorrect predictions off the diagonal) improves from Fig. 4(e) to Fig. 4(d) using parallel nullification, improving the test set accuracy from to . It is important to note (and this is discussed further in Appendix F) that only very minor decrease in performance is seen for , which can generally be thought of as the phase shift tolerance threshold for nodes in our ONN for the MNIST task.
Parallel nullification is therefore a promising option for realizing machine learning models on reconfigurable devices Shen et al. (2017); Harris et al. (2018). Such devices provide strictly more generality and flexibility over nonreconfigurable ONNs which can only implement one model (specified prefabrication Fang et al. (2019)) and furthermore cannot be dynamically errorcorrected postfabrication.
V Error considerations
With our optical neural network example, we have shown how parallel nullification can correct phase drift and thus improve performance considerably. We further discuss the fault tolerance of parallel nullification to sources of systematic error that arise during fabrication of feedforward mesh networks.
v.1 Phase errors
The parallel nullification step in Eq. 7 is agnostic to any static phase shifts that may accumulate at each column due to path length variations, which in other cases (e.g., nonreconfigurable systems) result in phase errors. Specifically, parallel nullification implicitly sets the reference by which phases in the device are measured Miller (2017), so the nullification set calculation of Eq. 6 gives the correct inputs for parallel nullification regardless of how the node is controlled. A correctly programmed mesh can be achieved using a TDC or MZIbased tunable beamsplitter as depicted in Fig. 3(d) or any of the variations discussed in Appendix E.
v.2 Split ratio error
Parallel nullification can correct split ratio errors similarly to how phase errors are corrected, but in some cases, the split ratio range can be limited at each node of the photonic mesh (e.g., due to imperfect 50/50 beamsplitters in typical MZIs Miller (2015)). This limited range problem can be avoided entirely using TDCs or doubleMZIs Miller (2015) rather than single MZIs at each node. As discussed in Appendix E, we can equivalently consider as the “tunable coupling constant” for the TDC. A TDC can achieve perfect operation, or full split ratio range, if the modes are phase matched over the entire tunable range of . We could ensure this range is achievable by making the device suitably long such that the full split ratio range is contained between the minimum and maximum extent of the tunable coupling constant (i.e., such that ).
v.3 Thermal crosstalk
Thermal crosstalk between device elements can occur whenever there is significant heat generated in the phaseshifting process, such as in thermal phase shifters. We assume that thermal crosstalk is very small between columns and only occurs within each column so that calibrations of past columns are not affected by those of future columns. As this thermal crosstalk increases, the parallel nullification of each column takes longer because the optimizations within each column are no longer independent of each other (e.g., the settings of node would be affected by the settings of nodes and ). If running parallel nullification, however, we might still be able to efficiently find an optimal setting for the column as long as this crosstalk is reasonably small. Furthermore, phase shifter technologies that have little to no crosstalk (such as MEMS phase shifters Edinger et al. (2019)) would be faster to program because nullifications within the column would be truly independent. In Appendix E, we propose node configurations with at most phase shifts (rather than ), requiring smaller temperature variations per waveguide length and limiting thermal crosstalk compared to conventional designs.
v.4 Loss
Parallel nullification is capable of programming lossbalanced architectures. A lossbalanced architecture is achieved if all the waveguide path lengths and bends are equal (assuming uniform waveguide scattering loss) and the phase shifters are lossless (i.e., changing a phase shift does not increase or decrease loss incurred by that phase shift). If all modes encounter a loss at each column , then it is straightforward to show the column can be programmed to implement using parallel nullification. From Eq. 4, the overall network implements , where is ideally a “global loss” equal to the product of all columnwise losses, i.e. Harris et al. (2018). Lossbalanced grid architectures (such as rectangular grid meshes Clements et al. (2016)) and other symmetric architectures such as the butterfly (FFT) architecture Flamini et al. (2017) can be fabricated to fit this criterion. Other architectures (e.g., triangular meshes Miller (2013a)) can include “dummy” elements to achieve the same balance. In the case of optical neural networks, some additional calibration of the nonlinear elements may also be required in the presence of loss, which may benefit from reprogrammability of such elements Williamson et al. (2020).
If the feedforward mesh (or specifically a column of the mesh) suffers from “loss imbalance,” then different amounts of light are lost from each output as light propagates. In this case, we might need to readjust the nullification set (e.g., by adjusting the computer model of the mesh to account for lossy mesh columns) to more accurately program in the desired operator, which is a direction that should be further explored.
Vi Discussion and Conclusion
We derive a graphtopological property for any reconfigurable feedforward photonic network of tunable beamsplitter nodes that allows efficient programming (“parallel nullification”) of node columns that are not affected by each other and thus can be tuned simultaneously. With a model of the device stored in a computer, we find a set of vector inputs to the device (the “nullification set”) and for each column, nullify the bottom power of all tunable beamsplitters in parallel using the corresponding nullification vector. The nullification set can be internally generated given a singlemode input by appending the optical setup machine of Ref. 11 to the mesh.
Parallel nullification can quickly diagnose and errorcorrect phase drifts of a reconfigurable photonic device, demonstrated in the context of reconfigurable optical neural networks in Section IV. As nullification set inputs are sent in order from to , error appears as nonnullified power at the bottom output ports of the problematic column. This error may be corrected by nullifying these ports so that further debugging of the photonic circuit can be performed.
Our programming algorithm differs from calibration schemes Harris et al. (2017); Shen et al. (2017); Carolan et al. (2015); Mower et al. (2015) that fully characterize all tunable elements (e.g. relating phase shift to voltage via a cubic model). Such approaches can achieve high fidelities, but components that experience environmental drift need to be recalibrated. In contrast, parallel nullification is not tied to any specific calibration model and should readily adapt to calibration drift and environmental perturbations.
However, as suggested in the Introduction, it is possible to apply our graphtopological arguments to “parallel calibration.” We explicitly provide this parallel calibration protocol in Appendix G that is similar in principle to current calibration protocols Harris et al. (2017); Shen et al. (2017); Carolan et al. (2015); Mower et al. (2015), but with notable differences (e.g., increased efficiency via parallelization) and simplifications (e.g., removing the need to explicitly calibrate “metaMZI” cells Harris et al. (2017)). At a high level, each node column is calibrated in parallel, assuming no crosstalk among elements in the column, by simultaneously sweeping phase shifter values and measuring corresponding powers in embedded detectors. Such calibration can elucidate phase shiftvoltage relationships, which are required for initialization of the network or in situbackpropagation with respect to the actual node voltages Hughes et al. (2018), which can be useful in the context of training optical neural networks.
Our approach is similar to the RELLIM approach of Ref. 11, where each input tunes a single node, since it relies on the reciprocity of linear optical networks. However, each input in parallel nullification tunes an entire column of nodes simultaneously based on the feedforward mesh definition proposed in Eq. 4. Our approach can be faulttolerant for feedforward meshes with phase shift crosstalk (i.e., thermal crosstalk Shen et al. (2017)) and beamsplitter fabrication errors. Additionally, parallel nullification is currently the most efficient protocol for programming or calibrating a feedforward photonic mesh. In particular, where is the number of device columns and is the number of input modes, our parallel nullification protocol requires just input vectors and programming steps, resulting in up to times speedup over existing componentwise calibration approaches.
Funding Information
Air Force Office of Scientific Research (AFOSR), specifically for the Center for EnergyEfficient 3D Neuromorphic Nanocomputing (CEE3N) and a MURI program, Grant Nos. FA95501810186 and FA95501710002 respectively.
Acknowledgments
We would like to thank Nathnael Abebe, Ben Bartlett, and Rebecca L Hwang for useful discussions.
References
 Reck et al. (1994) Michael Reck, Anton Zeilinger, Herbert J. Bernstein, and Philip Bertani, “Experimental realization of any discrete unitary operator,” Physical Review Letters 73, 58–61 (1994).
 Miller (2013a) David A. B. Miller, “Selfconfiguring universal linear optical component [Invited],” Photonics Research 1, 1 (2013a).
 Carolan et al. (2015) Jacques Carolan, Christopher Harrold, Chris Sparrow, Enrique MartínLópez, Nicholas J. Russell, Joshua W. Silverstone, Peter J. Shadbolt, Nobuyuki Matsuda, Manabu Oguma, Mikitaka Itoh, Graham D. Marshall, Mark G. Thompson, Jonathan C.F. Matthews, Toshikazu Hashimoto, Jeremy L. O’Brien, and Anthony Laing, “Universal linear optics,” Science (2015), 10.1126/science.aab3642.
 Annoni et al. (2017) Andrea Annoni, Emanuele Guglielmi, Marco Carminati, Giorgio Ferrari, Marco Sampietro, David Ab Miller, Andrea Melloni, and Francesco Morichetti, “Unscrambling light  Automatically undoing strong mixing between modes,” Light: Science and Applications 6 (2017), 10.1038/lsa.2017.110.
 Miller (2013b) David A. B. Miller, “Selfaligning universal beam coupler,” Optics Express 21, 6360 (2013b).

Shen et al. (2017)
Yichen Shen, Nicholas C. Harris, Scott Skirlo, Mihika Prabhu, Tom BaehrJones, Michael Hochberg, Xin Sun, Shijie Zhao, Hugo Larochelle, Dirk Englund, and Marin Soljačić, “Deep learning with coherent nanophotonic circuits,”
Nature Photonics 11, 441–446 (2017).  Harris et al. (2017) Nicholas C. Harris, Gregory R. Steinbrecher, Mihika Prabhu, Yoav Lahini, Jacob Mower, Darius Bunandar, Changchen Chen, Franco N.C. Wong, Tom BaehrJones, Michael Hochberg, Seth Lloyd, and Dirk Englund, “Quantum transport simulations in a programmable nanophotonic processor,” Nature Photonics 11, 447–452 (2017).
 Flamini et al. (2017) Fulvio Flamini, Nicolò Spagnolo, Niko Viggianiello, Andrea Crespi, Roberto Osellame, and Fabio Sciarrino, “Benchmarking integrated linearoptical architectures for quantum information processing,” Scientific Reports 7, 15133 (2017).
 Fang et al. (2019) Michael Y.S. Fang, Sasikanth Manipatruni, Casimir Wierzynski, Amir Khosrowshahi, and Michael R. DeWeese, “Design of optical neural networks with component imprecisions,” Optics Express 27, 14009 (2019).
 Mower et al. (2015) Jacob Mower, Nicholas C. Harris, Gregory R. Steinbrecher, Yoav Lahini, and Dirk Englund, “Highfidelity quantum state evolution in imperfect photonic integrated circuits,” Physical Review A 92, 032322 (2015).
 Miller (2017) David A. B. Miller, “Setting up meshes of interferometers – reversed local light interference method,” Optics Express 25, 29233 (2017).
 Hughes et al. (2019) Tyler W. Hughes, R. Joel England, and Shanhui Fan, “Reconfigurable Photonic Circuit for Controlled Power Delivery to LaserDriven Accelerators on a Chip,” Physical Review Applied 11, 064014 (2019).
 Clements et al. (2016) William R. Clements, Peter C. Humphreys, Benjamin J. Metcalf, W. Steven Kolthammer, and Ian A. Walmsley, “An Optimal Design for Universal Multiport Interferometers,” Optica , 1–8 (2016).
 Miller (2012) David A.B. Miller, “All linear optical devices are mode converters,” Optics Express 20, 23985 (2012).
 Miller (2019a) David A. B. Miller, “Waves, modes, communications and optics,” arXiv preprint (2019a).
 Dijkstra (1959) E. W. Dijkstra, “A note on two problems in connexion with graphs,” Numerische Mathematik 1, 269–271 (1959).
 Pérez et al. (2017) Daniel Pérez, Ivana Gasulla, Lee Crudgington, David J. Thomson, Ali Z. Khokhar, Ke Li, Wei Cao, Goran Z. Mashanovich, and José Capmany, “Multipurpose silicon photonics signal processor core,” Nature Communications (2017), 10.1038/s41467017007141.
 Pérez and Capmany (2019) Daniel Pérez and Jose Capmany, “Scalable analysis for arbitrary photonic integrated waveguide meshes,” Optica 6, 19 (2019).
 Pai et al. (2019) Sunil Pai, Ben Bartlett, Olav Solgaard, and David A. B. Miller, “Matrix Optimization on Universal Unitary Photonic Devices,” Physical Review Applied 11, 064044 (2019).
 Grillanda et al. (2014) Stefano Grillanda, Marco Carminati, Francesco Morichetti, Pietro Ciccarella, Andrea Annoni, Giorgio Ferrari, Michael Strain, Marc Sorel, Marco Sampietro, and Andrea Melloni, “Noninvasive monitoring and control in silicon photonics using CMOS integrated electronics,” Optica 1, 129 (2014).
 Morichetti et al. (2014) Francesco Morichetti, Stefano Grillanda, Marco Carminati, Giorgio Ferrari, Marco Sampietro, Michael J. Strain, Marc Sorel, and Andrea Melloni, “NonInvasive OnChip Light Observation by Contactless Waveguide Conductivity Monitoring,” IEEE Journal of Selected Topics in Quantum Electronics 20, 292–301 (2014).

Williamson et al. (2020)
Ian A. D. Williamson, Tyler W. Hughes, Momchil Minkov, Ben Bartlett, Sunil Pai, and Shanhui Fan, “Reprogrammable ElectroOptic Nonlinear Activation Functions for Optical Neural Networks,”
IEEE Journal of Selected Topics in Quantum Electronics 26, 1–12 (2020).  Abadi et al. (2016) Martin Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng, “TensorFlow: A System for LargeScale Machine Learning,” in Operating Systems Design and Implementation (Savannah, GA, 2016) pp. 265–283.
 Harris et al. (2018) Nicholas C. Harris, Jacques Carolan, Darius Bunandar, Mihika Prabhu, Michael Hochberg, Tom BaehrJones, Michael L. Fanto, A. Matthew Smith, Christopher C. Tison, Paul M. Alsing, and Dirk Englund, “Linear programmable nanophotonic processors,” Optica 5, 1623 (2018).
 Miller (2015) David A. B. Miller, “Perfect optics with imperfect components,” Optica 2, 747 (2015).
 Edinger et al. (2019) Pierre Edinger, Carlos ErrandoHerranz, and Kristinn Gylfason, “Lowloss MEMS phase shifter for large scale reconfigurable silicon photonics,” in The 32nd IEEE International Conference on Micro Electro Mechanical Systems (2019).
 Hughes et al. (2018) Tyler W. Hughes, Momchil Minkov, Yu Shi, and Shanhui Fan, “Training of photonic neural networks through in situ backpropagation and gradient measurement,” Optica 5, 864 (2018).
 Perez et al. (2017) Daniel Perez, Ivana Gasulla, Jose Capmany, and Richard A. Soref, “Hexagonal waveguide mesh design for universal multiport interferometers,” in 2016 IEEE Photonics Conference, IPC 2016 (2017) pp. 285–286.
 Russell et al. (2017) Nicholas J. Russell, Levon Chakhmakhchyan, Jeremy L. O’Brien, and Anthony Laing, “Direct dialling of Haar random unitary matrices,” New Journal of Physics (2017), 10.1088/13672630/aa60ed.
 Miller (2019b) David A.B. Miller, “Phase shifting by mechanical movement,” (2019b).
 Taballione et al. (2018) Caterina Taballione, Tom A. W. Wolterink, Jasleen Lugani, Andreas Eckstein, Bryn A. Bell, Robert Grootjans, Ilka Visscher, Jelmer J. Renema, Dimitri Geskus, Chris G. H. Roeloffzen, Ian A. Walmsley, Pepijn W. H. Pinkse, and KlausJ. Boller, “8x8 Programmable Quantum Photonic Processor based on Silicon Nitride Waveguides,” in Frontiers in Optics / Laser Science (OSA, Washington, D.C., 2018) p. JTu3A.58.
Appendix A Neurophox: open source software
In our simulation framework neurophox, we provide our general definition of feedforward mesh architectures from Eq. 4 and the reconfigurable neural network model of Fig. 4. This is the first time to our knowledge that feedforward meshes have been defined in this way, and it allows for a greatly simplified framework for defining and simulating mesh architectures. In neurophox, we provide Python code to calculate the nullification set and simulate parallel nullification on a physical chip using this nullification set. We also provide the code to train fully optical feedforward neural networks on the MNIST dataset with automatic differentiation in tensorflow.
Appendix B Twostep nullification
We give an explicit proof of the twostep nullification protocol in Ref. 2 in our context. In our setup for this proof, we have an input vector to a node , yielding output vector , i.e. (defined in Eq. 1 of the main text). We would like to find the such that .
In particular, we want to show that minimizing bottom port power () with respect to gives regardless of the value of . Then, it is clear that if we sweep to nullify power, we must have .
(9)  
where we allow any , i.e. . Now that we have optimized , we optimize to completely nullify .
(10)  
These results match the desired expressions of Eq. 7.
Appendix C Feedforward mesh examples
We can apply the general definition of a feedforward mesh to commonly studied architectures shown in Fig. 5. Grid architectures include the rectangular or Clements mesh Clements et al. (2016) and the triangular or Reck mesh Reck et al. (1994), which are both universal photonic mesh architectures. For a rectangular mesh, each of the columns has and defined as an upward circular shift for odd and a downward circular shift for even . For a triangular mesh, each of the columns has with the same as the rectangular mesh. The fact that the triangular mesh has the same as the rectangular mesh allows it to be “embeddable” within a rectangular mesh.
As shown in Fig. 6, a physical rectangular grid mesh (note the graphtopological transformation from Fig. 5) can be progressively tuned to implement any unitary matrix by performing parallel nullification protocol.
The butterfly architecture is an example of an alternative feedforward architecture that can be tuned using parallel nullification. Such architectures are designed to be compact, robust, and faulttolerant alternatives to universal meshes Flamini et al. (2017); Fang et al. (2019). This means that implementing reconfigurable architectures of this form is potentially more scalable (to the feature size or number of inputs and outputs) and can be useful for machine learning approaches even though it cannot actually implement any arbitrary unitary operator. The operations that the butterfly mesh can
implement, however, is the discrete Fourier transform (DFT) operator and (when concatenated with its mirror image to form the Benes network) any permutation operator.
Even meshes that are not explicitly feedforwardonly Perez et al. (2017); Pérez et al. (2017), but are meant for more generalpurpose approaches, can in theory implement the rectangular or triangular grid architectures. With some additional characterization, it may be possible to program or calibrate such general architectures in the lab setting using our method.
Appendix D Nullification set patterns
While the nullification set is mostly useful in the progressive calibration of columned optical meshes, there is interesting structure in the nullification vectors for a random unitary matrix implemented on a rectangular mesh device.
For a rectangular mesh, the implementation of these Haar random unitary matrices (i.e. matrices with roughly uniformrandom magnitude elements) for large involves low reflectivity in the center of the mesh and random reflectivity on the boundary Russell et al. (2017); Pai et al. (2019). For our choice of normalization basis calculation in Eq. 6
, the nullification vectors of a Haar random matrix lie somewhere between those for a mesh of only bar state nodes (as indicated by the “bar/identity” label) and those of a mesh of only cross state nodes (as indicated by the “cross/flip” label) in Fig.
7. As expected, the nullification set for the cross/flip mesh has an antisymmetric configuration versus the more symmetric configuration of the nullification set for the bar/identity mesh in Fig. 7. We note that the vectors are not generally mutually orthogonal, so the nullification set also generally do not form a unitary matrix if arranged side by side.There is also an information theoretic connotation of the nullification set. The nullification set for a mesh with phase settings that are uniformly set from (which leads to the “banded unitary” in Fig. 7(a)) looks more random (i.e. less structured) than the nullification set for the Haarrandom unitary (the truly random unitary in Fig. 7(a)). This is due to the nonlinear relationship between the transmission amplitude of the individual nodes and the final output magnitudes of the overall device Russell et al. (2017); Pai et al. (2019).
Appendix E Tunable beamsplitter variations
There are many possible ways of positioning phase shifters in an MZI that are all ultimately equivalent in allowing universal unitary meshes and parallel nullification. The simplest general statement is that, for a MZI node, we need two phase shifters, one of which must be on one waveguide arm inside the MZI to control the split ratio. The second phase shifter can be on any input or output waveguide Reck et al. (1994) or on the other waveguide arm inside the MZI Miller (2013a). If the second phase shifter is on an input arm of the MZI, then we need additional phase shifters for the mesh to implement an arbitrary unitary operator. Though such a design is wellsuited for selfconfiguration and setting up input vectors into the device, the protocol for parallel nullification may be less straightforward since such this symmetric configuration does not have an equivalent form as in Eq. 1. For simplicity in programming the device, we primarily concern ourselves with configurations where the second phase shifter is on the input waveguide obeying the form of Eq. 1.
A commonly proposed alternative to the MZI for the split ratio modulation ( in Eq. 1 of the main text) is the tunable directional coupler (TDC), which can achieve any split ratio by simply tuning the coupling region of a directional coupler. The transmissivity varies as with and the tunable coupling constant from coupled mode theory. Phase matched modes should always allow for the full range of transmissivities as long as the range of is large enough. One proposal that follows the TDC scheme is that of Ref. 30.
An alternative control scheme for phase shift operator in Eq. 1 of the main text is a “differential mode” phase shifter scheme. To make meshes more compact, it is functionally equivalent to have phase shifters in both the top and bottom waveguides that can reach a maximum of phase shift. Tuning in the range of would then consist of tuning the top phase shifter from steady state, whereas tuning in the range of would consist of tuning the bottom phase shifter from steady state. Of course, the tradeoff of this more compact scheme is increased complexity in the number of electrical contacts and the logic of the nullification protocol. Independent of any of these control schemes, the transmission matrix model of Eq. 1 generates the correct set of nullification vectors, so parallel nullification works for any of these schemes.
Appendix F Neural network training
As in Ref. 22
, we preprocess each image by applying a Fourier transform and discrete lowpass filter, since such a task is potentially feasible in the optical domain via Fourier optics. In our lowpass feature selection, we pick the center
block of pixels since most of the useful data is within this lowfrequency band, giving us a total of features. Unlike in Ref. 22, we retain some of the redundant Fourier features in this window since it slightly boosts neural network performance, and we use a mean square error loss instead of a categorical cross entropy loss due to slight performance improvement in the former.Our prediction consists of dropping the final outputs of the second neural network layer (shown by the light blue arrows in Figure 4 of the main text), and squaring the amplitudes of the remaining outputs (equivalent to taking a power measurement, denoted by the red photodetector symbols of Figure 4). Given the th data sample (pair of input feature vector and onehot label vector), our cost function is the mean square error
(11) 
where represents the raw output powers of the neural network given input . In practice, our classification will always correspond to the port in which the highest output power is measured, ideally guiding input mode vector to the output port corresponding to . Our model achieves a final train accuracy of and a final test accuracy of . Slightly worse training performance was found using a categorical cross entropy loss.
Finally, we perform an robustness analysis of imperfections in neural network performance (as in Ref. 9). We train the same twocolumn neural network of Figure 4 for different sizes of , achieving test accuracies of respectively, though performance varies slightly from run to run. The training curves and robustness analysis for different phase errors (Gaussian phase errors of the form ) are shown in Figure 9, where we find that phase errors above begin to significantly affect performance. Of course, once parallel nullification is applied, any such errors can be corrected regardless of the exact calibration model for the individual phase shifters. Note that in the case , significant overfitting is present, though adding dropout (i.e., zeroingout power in some of the ports) after the first ONN layer might be one method to “regularize” the physical ONN and therefore reduce this overfitting.
We note that the study in Ref. 9 studies the MNIST task on larger simulated ONNs and accomplishes a similar test accuracy (). Aside from training significantly fewer parameters than in Ref. 9, we use unitary rather than general linear layers, we use mean square error rather than categorical cross entropy loss, and our feature selection is different (lowfrequency Fourier features rather than all raw features).
Appendix G Parallel calibration
When calibrating a large number of nodes in a feedforward mesh Harris et al. (2017); Carolan et al. (2015); Taballione et al. (2018), it is generally convenient to have a fast method to generate calibration curves for the voltage drive of each tunable element in the device. Like parallel nullification, our “parallel calibration” protocol proceeds from lefttoright and can generate these transmission curves in parallel for all nodes within a given column.
In some grid meshes, all nodes can be systematically tuned to cross state (e.g., in the programmable nanophotonic processor (PNP) Harris et al. (2017) or selfconfiguring triangular grid network Miller (2015)). In such schemes, power maximization at the appropriate output detectors can be used to remove the need for embedded detectors (i.e., use output detectors for the entire calibration procedure).
Other feedforward schemes, however, require embedding photodetectors for parallel nullification, which may as well be used for parallel calibration. To calibrate , we find the necessary input vector so that the bottom input port of all nodes in column are nullified (i.e., input power of to all nodes in the column) while setting all previously calibrated layers to bar state. We then measure resulting output transmissivities as we sweep all simultaneously from the minimum to maximum allowable voltage drive settings, as shown in Fig. 10(a). To calibrate , we input the uniform power in all mesh inputs such that all nodes in column have equal power in both input ports (i.e., input power of to all nodes in the column). Assuming we have already calibrated , we set all tunable . Now, we can calibrate just as we calibrated as shown in Fig. 10(b), similar to a parallelized version of the metaMZI scheme Harris et al. (2017); Mower et al. (2015). This calibration approach works for any of the node configurations in Fig. 8. As outlined in Ref. 7, we can calibrate the phase given the transmissivity using the relation , where is the reference phase and is the calibration curve given a sweep over all possible voltage settings for . Once the sweep is finished, we define our calibration by storing the full phase shiftvoltage lookup table or a cubic model fit to that curve (3 parameters per tunable element or 6 parameters per node) Carolan et al. (2015).
Calibrating then for each layer of the mesh from lefttoright results in a fully calibrated device, irrespective of the feedforward architecture. Once calibrated, any reachable unitary operator on the device can be implemented by simply flashing desired values based on the measured transmission curves. Our parallel calibration procedure would need to be repeated once the nodes in the device experience fatigue or environmental perturbation, leading to calibration drift. As mentioned in the main text, this calibration drift might be diagnosed efficiently by sending in nullification set input vectors to identify errors within each column of the feedforward network denoted by unnullified nodes for the corresponding column. Our calibration can also be used to initialize parallel nullification or calculate the final updates for a procedure like in situ backpropagation Hughes et al. (2018). Our parallel calibration, like parallel nullification, generally give us up to speedup and is also generally applicable to any feedforward mesh. It is worth noting also that our parallel calibration protocol might be transferrable to existing technologies such as the PNP grid architecture, with some minor modifications of the protocol discussed in the supplementary of Ref. 7. Such grid architectures are already arranged in node columns and thus can be calibrated efficiently by following the steps of Fig. 10, potentially without embedded detectors.