# Learning to Solve AC Optimal Power Flow by Differentiating through Holomorphic Embeddings

Alternating current optimal power flow (AC-OPF) is one of the fundamental problems in power systems operation. AC-OPF is traditionally cast as a constrained optimization problem that seeks optimal generation set points whilst fulfilling a set of non-linear equality constraints – the power flow equations. With increasing penetration of renewable generation, grid operators need to solve larger problems at shorter intervals. This motivates the research interest in learning OPF solutions with neural networks, which have fast inference time and is potentially scalable to large networks. The main difficulty in solving the AC-OPF problem lies in dealing with this equality constraint that has spurious roots, i.e. there are assignments of voltages that fulfill the power flow equations that however are not physically realizable. This property renders any method relying on projected-gradients brittle because these non-physical roots can act as attractors. In this paper, we show efficient strategies that circumvent this problem by differentiating through the operations of a power flow solver that embeds the power flow equations into a holomorphic function. The resulting learning-based approach is validated experimentally on a 200-bus system and we show that, after training, the learned agent produces optimized power flow solutions reliably and fast. Specifically, we report a 12x increase in speed and a 40 robustness compared to a traditional solver. To the best of our knowledge, this approach constitutes the first learning-based approach that successfully respects the full non-linear AC-OPF equations.

There are no comments yet.

## Authors

• 4 publications
• 4 publications
• 7 publications
• 40 publications
• ### Learning Optimal Solutions for Extremely Fast AC Optimal Power Flow

In this paper, we develop an online method that leverages machine learni...
09/27/2019 ∙ by Ahmed Zamzam, et al. ∙ 0

• ### Adversarially Robust Learning for Security-Constrained Optimal Power Flow

In recent years, the ML community has seen surges of interest in both ad...
11/12/2021 ∙ by Priya L. Donti, et al. ∙ 0

• ### DeepOPF: A Feasibility-Optimized Deep Neural Network Approach for AC Optimal Power Flow Problems

The AC-OPF problem is the key and challenging problem in the power syste...
07/02/2020 ∙ by Xiang Pan, et al. ∙ 0

• ### Solving realistic security-constrained optimal power flow problems

We present a decomposition approach for obtaining good feasible solution...
10/04/2021 ∙ by Cosmin G. Petra, et al. ∙ 0

• ### Physics-Informed Neural Networks for AC Optimal Power Flow

This paper introduces, for the first time to our knowledge, physics-info...
10/06/2021 ∙ by Rahul Nellikkath, et al. ∙ 0

• ### Optimal Operation of Power Systems with Energy Storage under Uncertainty: A Scenario-based Method with Strategic Sampling

The multi-period dynamics of energy storage (ES), intermittent renewable...
07/21/2021 ∙ by Ren Hu, et al. ∙ 0

• ### Voltage Feasibility Boundaries for Power System Security Assessment

Modern power systems face a grand challenge in grid management due to in...
02/27/2021 ∙ by Mazhar Ali, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## I Introduction

The Optimal Power Flow (OPF) problem is fundamental to power systems operation [fioretto2020predicting]. In general, OPF finds the optimal generation set points that minimize operation costs given a set of loads while at the same time satisfying physical and security constraints. The physicality of solutions is ensured by enforcing the power flow equations – a set of non-linear equality constraints.

The increasing penetration of distributed energy resources (DER) is posing new challenges for power system operation. Traditionally, generation schedules were updated at 5-min intervals [fioretto2020predicting]. Due to the variability and uncertainty of renewable generation, system operators are required to update generation schedules more frequently. At the same time, system operators need to manage a growing number of smaller resources, resulting in a significantly larger problem than the conventional OPF problem [guerrero2020towards]. Thus, the integration of DERs requires solving larger OPF problems at shorter time intervals. This motivates research on using function approximators, such as neural networks (NN), to learn solutions to the OPF problem, due to their fast inference and their potential to scale well to larger power networks [owerko2020optimal].

In order to deal with the non-linearity of the OPF problem, a common practice in power system operation is to linearize the grid dynamics, leading to the DC-OPF approximation. The DC-OPF problem is generally formulated as either a linear program (LP) or a quadratic program (QP), depending on the cost function. Such a linearization can work reasonably well for small networks, but the resulting approximation error scales poorly with network size

[ferc2012history]. Furthermore, the assumptions required for linearizing the power flow equations are not valid for heavily-loaded or distribution networks [chatzivasileiadis2018lecture] and the author of [baker2019solutions] show that even under mild assumptions DC solutions are usually never AC feasible. Ideally, the full non-linear AC power flow equations should be used. However, using the non-linear power flow equations comes with another challenge. Because the non-linear AC-OPF equality constraints have a fractal nature riddled with spurious roots, which act as attractors [thorp1997load], it is difficult to disambiguate physical from spurious non-physical solutions. As a result, convergence to the true solution cannot be guaranteed rendering existing solvers brittle.

In this paper, we propose a learning-based approach to the AC-OPF problem that alleviates the aforementioned issues by differentiating through a robust power flow solver, as shown in Figure 1. Our contributions can be summarized as follows. Firstly, we show that a power flow solver based on a holomorphic embedding of the power flow equations [trias2012holomorphic] is differentiable in its arguments. The primary benefit of using Holomorphic Embedded Load Flow method (HELM) is that it is much more likely to find the correct solution if it exists. Secondly, we can use this differentiable HELM solver as a computational layer in a NN. This allows us to learn a neural policy that solves the OPF problem, while respecting the power flow constraints. Furthermore, we show how this method can be extended to enforce convex security constraints, i.e., voltage magnitudes, as well as non-convex integer constraints, i.e., unit commitment. To validate our approach, we empirically show that it is faster and more robust (i.e., results in physically-viable solutions more often) on a 200-bus system in comparison to traditional solvers based on mixed-integer interior point methods (MATPOWER [zimmerman2011matpower]). Specifically, our proposed method achieves a 12x increase in speed and a 40% increase in robustness compared to MATPOWER.

## Ii Related Work

### Ii-a Challenges of Solving AC-OPF

As described earlier, AC-OPF is traditionally posed as a constrained optimization problem, i.e. a cost function is minimized under network constraints oftentimes relying on some form of projected gradient descent [carpentier1962contribution]. Note that this problem formulation faces numerous computational difficulties. First, the non-linear equality constraint (2) poses a challenge. In reality, (2) is a necessary but not sufficient condition for the system to be in a physical state. There are assignments of nodal voltages that fulfill the power flow equations that. however, do not constitute a physical state [tamura1983relationship, thorp1997load]. Optimization algorithms based on some form of projected gradient descent might be attracted to such a non-physical solution rendering them not robust. Advanced techniques like the Homotopy [okumura1991solution] or Continuation [milano2009continuous] methods alleviate but not fully remedy this problem, and can incur substantial computational costs. Furthermore, non-convex constraints create additional computational issues. Algorithms such as branch-and-bound that are typically employed to deal with integer constraints require solving multiple, and in the worst case exponentially many, relaxed linear programs [lawler1966branch].

### Ii-B Learning OPF

Because larger OPF problems need to be solved more frequently, there is growing interest in learning to solve OPF problems with NNs. These approaches formulate OPF as the problem of learning a policy, which maps the demand configurations to generation set points. This policy could be learned from historical data, i.e., imitation learning, and/or by interacting with the system, i.e. reinforcement learning.

The most common paradigm of existing solutions is to learn a mapping from demand assignments to optimal generation set points based on a data set provided by an external solver, such as MATPOWER [zimmerman2011matpower]. This is analogous to behaviour cloning of an expert policy. In this case, the external OPF solver constitutes the expert policy [chen2020learning]. Following [gupta2020deep], we refer to such a paradigm as OPF-then-learn. Specifically, authors of [guha2019machine, owerko2020optimal] pose AC-OPF as an end-to-end regression task. Specifically, in [owerko2020optimal] graph neural networks (GNN) were adopted, which make predictions based on local information from neighboring nodes. Such an approach allows for a decomposition of the grid-level problem based on power system connectivity and, in principle, scales well to large networks. However, a significant limitation of such an approach is that the solutions may not adhere to physical and security constraints inherent to the physical system. Concretely, 49% and 30% of the solutions were infeasible when applied to the IEEE 30-bus and 118-bus test systems respectively [guha2019machine].

This has motivated research on learning-based methods that respect constraints. The authors of [chen2020learning, zhao2020deepopf+, singh2020learning] studied the problem in the context of DC-OPF. Since DC-OPF can either be posed as an LP or QP, the problem has amenable mathematical properties and permits for relatively simple solution strategies. More relevant to our work, [zamzam2019learning, fioretto2020predicting] studied the problem in the context of AC-OPF. To ensure feasibility, in [zamzam2019learning] generation set points produced by a NN were passed into a power flow solver. This speeds up the computation since the power flow equations are easier to solve than AC-OPF. In [fioretto2020predicting], the authors applied primal-dual updates to the Lagrangian dual function of the underlying constrained optimization problem, where a NN predicts the generation set points and the power system states. Notably, the approach proposed by [fioretto2020predicting] produces more accurate and cost-effective solutions than those by DC-OPF approximation, which is commonly used in industry.

However, there are key limitations to the OPF-then-learn paradigm. Firstly, a large labeled data set needs to be generated for training. Given the convergence issues of solving AC-OPF, some load configurations may not have a solution from commercial solvers. Secondly, the learned solution may not generalize to unseen scenarios and re-training the system on incoming demand assignments requires creating a new training set. Because of this, authors of [gupta2020deep] proposed OPF-and-learn as an alternative paradigm where generation cost is optimized directly by computing gradients through the OPF problem. While OPF-then-learn is analogous to imitation learning, OPF-and-learn

can be seen as a one-step Markov Decision Process (MDP). However, the

OPF-and-learn approach proposed in [gupta2020deep] relies on the DC-OPF linearization, which limits the accuracy of solutions and is not applicable to heavily-loaded or distribution networks. Furthermore, when assuming the DC-OPF linearization, creating a training set is trivial because optimal solutions can be obtained reliably and fast. Thus, by making the DC-OPF assumption, the circular dependence on training data is broken but a significantly easier problem is solved which limits the potential advantages of the OPF-and-learn paradigm.

To the best of our knowledge, this paper introduces the first OPF-and-learn approach that respects the full non-linear power flow constraints as well as non-convex unit-decomittment and non-linear physical and security constraints.

## Iii Proposed Learning Framework

Let be the cost associated with the assignment of nodal voltages and being the number of buses of the system. In the following, without loss of generality, we define the cost in terms of for presentational simplicity. However, numerous constraints need to be enforced, some of which are non-linear and non-convex, i.e. the load flow problem is traditionally posed as:

 minimize w.r.t. v: c(v) (1) subject to: S=Sg−Sd=diag(v)(Yv)∗ (2) ki(v)≤0 (3) hi(v)=0 (4)

with and being a demand and generation assignment, respectively, and being the bus admittance matrix. For different demand assignments , this optimization problem would be solved over and over again.

Our proposed learning-based formulation following the OPF-and-learn paradigm is as follows: First, we note that the nodal voltages are a function of and , i.e. with solving the power flow equations (2). Second, we introduce a function that is tasked with producing optimal generator configurations as a function of the system state, in this case the demand , in an expected risk minimization setting. Thus, with parameterizing the function . Third, we assume knowledge of a data set containing historic demand assignments, i.e. a collection of possible states . Note that unlike OPF-then-Learn

approaches [cite,cite,cite] we do not assume knowledge of the corresponding optimal generator assignments. The goal is to estimate the parameters of the function

, in this case , that produce optimal generation assignments as a function of the demand. As an optimization objective and learning signal for the function , we propose the following:

 minimize w.r.t. Θ: ∑Sd∈Dc(v(Sd,gΘ(Sd)))=L (5) subject to: ki(v)≤0 (6) hi(v)=0 (7)

When compared to traditional optimization based approaches, this problem formulation has a number of advantages:

1. The non-linear power flow constraint (2) vanishes. This increases robustness and avoids convergence to non-physical solutions given a differentiable and robust power flow solver.

2. As we will show later, optimization under non-convex constraints can be amortized, i.e. time training the system is spent once and after training, inference is extremely fast and solely requires a forward-pass through a NN even when respecting binary unit-commitment constraints.

3. Because training entails learning an optimal mapping between Euclidean spaces that represent demand and generation assignments respectively, the proposed approach exploits covariances between load flow problems and allows for the generalization to unseen problems.

However, optimizing w.r.t. (5

) poses challenges: If gradient descent is used for optimization, gradients need to be defined. Applying the chain rule to

yields:

 ∂L∂Θ=∂c∂v∂v∂g∂g∂Θ

Thus, the loss is differentiable if the cost function , the voltage function and the actor function are differentiable. and can be usually assumed to be differentiable, however, the fact that computing the gradient through a power flow solver w.r.t. to the generation assignment , i.e. , is possible, might not be obvious. In section IV, we will show that computing the gradient through a robust power flow solver that represents the full non-linear AC-OPF equations, namely the Holomorphic Embedded Load Flow Method, is indeed possible. Doing this allows us to obtain the learning signal .
Furthermore, the constraints i.e. (6) and (7) need to be enforced. In section V, we will show that an auxiliary function can be used to enforce the Karush-Kuhn-Tucker conditions ultimately allowing us to enforce arbitrary constraints similar but not identical to the approaches introduced in [fioretto2020predicting, gupta2020deep].
In section VI, we show how the proposed approach can be extended to handle non-convex, e.g. binary unit commitment constraints, by optimizing a variational lower bound whilst introducing little computational overhead during inference.
The approach is validated empirically on a 200 bus system. The experimental setup and results are described in section 5. In section 6, our findings are concluded and pathways for future work are laid out.

## Iv Holomorphic Embedded Load Flow Method

HELM was first proposed by Trias [trias2012holomorphic, trias2015fundamentals] and was later extended in e.g. [subramanian2013pv, wallace2016alternative]. HELM addresses the convergence issues of power flow solvers based on Householder’s methods for root-finding such as the Newton-Raphson algorithm. Because the power flow equations have multiple roots but only a single physically realizable solution, these types of solvers are at risk to converge to a spurious, unstable or low voltage solution [deng2013convergence]. Whether or not the ‘correct’ root is being found is usually dependent on the initial condition  [thorp1997load]. Because HELM does not require an initial guess or initial condition, it overcomes the ambiguity problem that Householder’s solvers face, by performing analytical continuation from a known physically-realizable solution. Analytical continuation also allows HELM to find the root that is on the same branch-cut as the previously known physically-realizable solution. This solution is unique because analytical continuation is unique when the function at hand is holomorphic. HELM treats the complex nodal voltages at each bus as holomorphic functions of a complex scalar . These functions are evaluated at a point for which obtaining a physically-realizable solution is trivial (usually ) and are then continued to the solution at a desired point where the original power flow equations are recovered (usually ).

Let be a function of the complex scalar . Equation (8) then describes such a holomorphic embedding, i.e. obtaining a solution at is trivial because no power is flowing and the original power flow equations (2) are recovered at . See [wallace2016alternative] for a proof that is indeed holomorphic.

 YV(z)=zS∗V∗(z∗) (8)

In order to obtain the power series coefficients required for analytical continuation, and its reciprocal are approximated by a power series expansion, i.e.:

 V(z) =∞∑n=0c[n]zn (9) 1V(z)=W(z) =∞∑n=0d∗[n]zn (10)

Similar to traditional power flow solver such as Newton Raphson, in order to avoid overspecification of the problem, a slack bus is introduced: Let be the reduced matrix by removing the row and column of the slack bus and be the slack-row of sans self-admittance. We assume that the voltage at the slack generator is with .

For the th row of (8) the following then holds:

 ∑kYrik∞∑nck[n]zn+(vs+0j)ys =zS∗i∞∑nd∗i[n]zn (11)
 Setting z=0: ∑kYrikck[0]=−vsys (12)

Thus, solving the linear system in (12) yields a solution at . Note that this solution is physically-realizable because no power is flowing. Higher order power series coefficients can be obtained by equating coefficients of the same order and by making use of which yields:

 dk[0] =1ck[0] (13) ∑kYrikck[n] =S∗idi[n−1] (14) di[n] =−∑n−1m=0ci[n−m]di[m]ci[0] (15)

After obtaining power series coefficients, analytic continuation is performed to obtain a solution at

. However, since the radius of convergence is usually smaller than 1, analytical continuation is performed using Padé approximants instead of directly evaluating (

9). Padé is a rational approximation of power series known to have the widest radius of convergence [stahl1989convergence]. Analytical continuation by Padé approximation is performed as follows:

 Vi(z)≈R(z)=∑mj=0ai,jzj1+∑mk=1bi,kzk (16)

Approximants of order , i.e. and can be obtained from the power series coefficients by solving a linear system of equations, specifically:

 [IM(ci)][aibi]=ci

with:

and

being the identity matrix. Because we perform analytical continuation to

, plugging the obtained coefficients into (16) yields: .

### Iv-a Differentiating through HELM

In the following we will view HELM as a function that maps complex nodal power to complex nodal voltages, i.e. . We will show that is not holomorphic but -differentiable in ultimately allowing us to compute gradients w.r.t. the parameters of the actor function . By making use of the chain-tule, the strategy is to decompose HELM into a succession of functions and show that each function is -differentiable. Specifically, we decompose HELM into its algorithmic steps, i.e. with computing power series coefficients, computing Padé approximants and computing voltage phasors given Padé approximants. We then show that , and are -differentiable. Note that

is a recursive function and that writing its gradient out would be tedious. But its gradients can be computed efficiently using the backpropagation algorithm and the implementation is trivial in any deep learning frameworks with automatic differentiation capability, e.g.

PyTorch and TensorFlow.

As stated earlier, HELM first computes the power series coefficients followed by Padé approximation. The power series coefficients and are obtained in alternating fashion: Let and be the function that produces and respectively. Note that requires knowledge of the previous -coefficient and , whereas is a function of all previous - and -coefficients:

 corrs. to (14): fc,n(x)=(Yr)−1fd,n−1(x)x∗ (17) corrs. to (15): fd,n(x)=∑n−1m=0fc,n−m(x)fd,m(x)fc,0(x) (18)

Because of the complex conjugation in (17), is not holomorphic in , and , if . However, it is easy to see that, by induction, (17) and (18) are -differentiable when and are -differentiable which is easy to see from (12) and (13).
After obtaining the power series coefficients, Padé approximants and are calculated. Note that this also only includes solving a linear system of equations, i.e.

 fab(x)=[ab]=[IM(x)]−1x

which is differentiable. Then includes only a summation and fraction, i.e:

 fv([ab])=m∑i=0ai/(1+m∑i=0bi)

Although, is not holomorphic, it is -differentiable in its argument and, when applied to , it is -differentiable in .

## V Enforcing Constraints

### V-a A priori constraints

As stated earlier, we treat the generation assignment as the output of a parameterized function . Because of the reasoning laid out earlier, we require to be differentiable and because of recent successes of NNs in non-linear optimization, we choose to be a NN with a penultimate sigmoidal layer [zamzam2019learning]. We incorporate the generation limits of the generators into the output layer of the NN and therefore enforce generation limits by construction. Let be the penultimate layer with

being the number of generator buses. Thus, every generator is associated with two neurons, i.e.:

 gΘ(Sd)i=(Sg)i =(Pmaxi−Pmini)σi+Pmini +j(Qmaxi−Qmini)σi+Ng+jQmini

with ,, and being the active and reactive generation limit respectively. Because is bounded by non-slack generation limits cannot be violated. However, other constraints such as e.g. voltage magnitude or thermal line limits are not enforced by construction. That is why, in the next section we show how to enforce what we call a posteriori constraints, i.e. constraints whose violation is only known after evaluating .

### V-B A posteriori constraints

We adapt ideas from mathematical optimization to enforce arbitrary constraints on

. In mathematical optimization, the Karush-Kuhn-Tucker conditions (KKT-conditions) are necessary conditions for a solution to be optimal [gordon2012karush]. Given the optimization problem (1) expressed in terms of , the KKT conditions state that a solution is locally optimal under some regularity conditions when there exist such that:

• (Dual feasibility)

• (Complementary slackness)

• (Primal feasibility)

• (Stationarity)

Note that, without loss of generality (because any equality constraint can be expressed as two inequality constraints) and for notational convenience, we restrict the optimization problem to only have inequality constraints.

However, as stated earlier, we are not interested in the solution of a single constraint optimization problem but instead in solutions to all instances of a class of optimization problem. In this case, , i.e. the demand assignment, specifies the instance of the optimization problem whereas the network topology, i.e. admittance matrix , specifies the class. First, we note that the KKT-multipliers are dependent on the instance of the optimization problem, thus instead of introducing a scalar , we introduce a scalar-valued function . In order to enforce dual feasibility by construction, we choose to be a NN with soft-plus output parameterized by . Furthermore, let , the th output of and . We will now introduce a learning criterion and show that local optima of this criterion fulfill the KKT-conditions for instances of the class contained in the training set. As a learning criterion we propose:

 L(Sd)=c(gSdΘ)+∑i(uSdψ)ik+i(gSdΘ) (19) ∀Sd∈Dmaxψ{minΘ{L(Sd)}} (20)

We will now show that, after convergence, for all , is locally optimal under some regularity constraints, i.e. it fulfills the KKT-conditions and furthermore, that the KKT-multipliers for which the KKT-conditions hold are:

 μi={(uSdψ)iif\ % ki(gSdΘ)=00else (21)
• Dual feasibility: is dual feasible by construction: it is either or greater than because it is the output of a soft-plus NN.

• Complementary slackness: Follows directly from (21)

• Primal Feasibility: Since (20) converged, we know that and since , must be primal feasible. Or in other words: if was not primal feasible, but then the maximization step of (20) could have increased by increasing which is a contradiction to the assumption that (20) has converged.

• Stationarity: Follows directly from the assumption that (20) has converged. Note that substituting for does not have an influence because if then the corresponding (complementary slackness) and when then

Note that the detour of substituting for improves the performance substantially. Without the substitution, because more constraints are complied with initially, the NN drives the outputs before the soft-plus non-linearity to in order to make the corresponding equal to 0. The output units are then ‘dead’ and, because the gradient of the output non-linearity is close to 0, will always stay 0.

### V-C Enforcing Physicality

So far, we have shown how to enforce ‘a priori’-constraints, i.e. constraints whose violation is known before inferring nodal voltages, by construction, as well as ‘a posteriori’-constraints, i.e. constraints whose violation is known after inferring nodal voltages, by introducing a learning objective that, after convergence, will enforce the KKT-conditions. However, we have not yet shown how to keep the function in the physical regime, i.e. prevent from producing a generation assignment for some such that there is no that fulfills the power flow equations (2). An extreme example of a non-physical tuple , for any demand assignment for which is .

First, we note that HELM will always produce complex nodal voltages even for non-physical tuples. However, for non-physical tuples the power flow equations (2) will not hold, i.e. there is a mismatch between the RHS and LHS of (2). We quantify this mismatch by defining:

 ϵ(v)=||Sg−Sd−diag(v)(Yv)∗||∞

The goal now is to enforce that with being some parameter which specifies when a power flow solution is deemed physical. Note that because is a function of , in principle, an additional inequality constraint could be introduced, i.e. and one could try to enforce this constraint as an a posteriori constraint as described earlier. However, in our experience this approach struggles, i.e. the learning objective usually does not converge. Figure 2 gives an intuition as to why this is the case. Figure 2 shows as a function of on a 200 bus system. scales the generation of a physical tuple , i.e. the y-axis shows . Note that when is either small or big ( or ), is close to flat and therefore the gradient of is close to 0. After randomly initializing the the parameters of the function , its guesses about optimal generation assignments will naturally be bad which corresponds to scaling the optimal generation assignment with a small or big . However, the function cannot improve its guesses by gradient descent because the gradient will be close to 0.

In order to overcome this problem, we propose to optimize a proxy of the actual mismatch function . Note that an indicator of whether or not a solution is physical is whether or not the power series coefficients have converged to 0. Let be the mean th power series coefficient of all voltages, i.e. . Figure 3 shows a scatter-plot of and . Empirically, one can see that small is a sufficient condition for small , however not a necessary condition. That is, a small implies small but not vice versa. Thus in order to enforce physicality, can be minimized as a proxy for . However, one might think that optimizing is unnecessarily restrictive, i.e. it excludes solutions where the power series coefficients did not converge to 0 but the corresponding nevertheless fulfill the power flow equations. But, as we will show later, imposing voltage magnitude constraints naturally enforces physicality and additionally minimizing is only required after the actor function was first initialized in order to ‘nudge’ the it into the physical regime.

## Vi Binary Constraints

Binary constraints naturally occur in optimal power flow when incorporating the possibility of completely shutting down generators. Introducing this constraint also known as unit commitment makes generation limit constraints non-convex, i.e. becomes a possible generation assignment, even though points between and are not valid. Typically, optimal power flow solvers employ mixed integer programming techniques such as branch and bound or branch and cut [lawler1966branch] algorithms to tackle this problem. However, these algorithms can incur substantial computational cost, i.e. every branch requires solving an LP relaxed optimal power flow problem and there are exponentially-many branches in a worst-case scenario.

However, using the problem formulation introduced here, because the constraint is a priori

and can be enforced by construction, we can reduce the non-convex constraint into the problem of inferring the mode of a probability distribution

over binary configurations. As we will show later, because inference in is intractable, we optimize a variational bound, i.e. we introduce a variational distribution for which posterior inference is tractable and choose the variational parameters in such a way that best approximates

. Specifically, we built on recent advances in Bayesian inference, specifically Variational Inference

[wainwright2008graphical] and train a variational distribution parameterized by a NN. See [blei2017variational, zhang2017advances] for recent reviews of Variational Inference.

We begin by showing that computing the optimal binary configuration is equivalent to computing the mode of a distribution . Let

be the vector describing which generators are turned

on or off and

be an exponential distribution,

its posterior (Boltzmann distribution) and be the loss as defined in (19), i.e.

 p(b,Sd) =λexp−λL(b⋅Sd) (22) p(b|Sd) =exp−λL(b⋅Sd)∑b′∈Bexp−λL(b′⋅Sd) (23)

It is easy to see that computing the mode of (23), i.e. is equivalent to choosing the binary configuration that results in the smallest loss. However, naïve evaluation of the mode is usually intractable, because of the intractable denominator. Naïvely computing the mode of (23) is equivalent to brute-force search, i.e. enumerating all possible latent configurations and picking the one with the smallest error. However, in the following, we show how ideas from Variational Inference [wainwright2008graphical] can reduce the computational burden of inference.

We introduce a variational distribution whose posterior is tractable. Specifically, we choose

to be a multi-variate Bernoulli distribution and ensure tractability with ideas introduced in

[lange2018factornet]. Note that is parameterized with a NN, therefore ensuring that inference at test-time is fast. As a learning signal for the parameters of the auxiliary posterior distribution , we choose the Evidence Lower Bound defined by:

 LBO(ϕ) =Eqϕ(b|Sd)logp(b,Sd)qϕ(b|Sd) (24) =logp(Sd)−DKL(qϕ(b|Sd)||p(b|Sd)) (25)

Note that optimizing (24) does not require knowledge of the intractable posterior of but nevertheless allows for minimizing a divergence measure between the true () and auxiliary posterior (). Thus, after training, in order to obtain an approximation of the mode of , because and will be maximally similar, posterior inference is performed on

instead. However, the price for this ‘trick’ is increased variance. It can be shown that the stochastic gradient estimator of (

24) w.r.t. is an unbiased but higher variance estimator of the KL-divergence [mnih2014neural]. In order to combat variance, a decades-old variance reduction technique is employed, namely sampling without replacement. Sampling without replacement from is not trivial. However, there is a considerable body of preexisting work that we make us of. The sampling scheme introduced in [shah2018without] with slight modifications is employed. Specifically, instead of using the Pareto sampler as the underlying sampling mechanism, a slightly slower but more accurate elimination sampler introduced in [deville1998unequal] is used.

In order to obtain an approximation of the mode of the true posterior, because allows for drawing samples efficiently, -many samples are drawn from . Then, in order to approximate the mode of , out of the -many binary configurations sampled from , the one which results in the smallest generation cost is chosen. Note that the optimal configuration of generators is dependent on the binary configuration, thus should additionally be fed into the actor function . Figure 4 shows a graphical depiction of the proposed data pipeline.

## Vii The Lopf-algorithm

In this section we summarize the resulting algorithm, we call Learning Optimal Power Flow, or short LOPF. The algorithm iterates over batches of the data set making updates to the three constituent NNs , and . It is described in pseudo-code in Algorithm 1. For notational convenience, we define a function :

 Sg =b⋅gΘ(Sd,b) solve(Sd,b) =⎛⎜ ⎜⎝ϵ(v(Sd,Sg))c(v(Sd,Sg))+∑i(uψ(Sd))ik+i(v(Sd,Sg))N−1∑i(fc,n(Sd−Sg))i⎞⎟ ⎟⎠T

Figure 5 shows a graphical depiction of the input/output relationships of the individual networks. Note that the network not only produces active and reactive generation assignments for non-slack generators but also the voltage at the slack bus. Furthermore, the magnitude by which constraints are violated, denoted by , are fed into the network that produces a proxy of the Lagrange multipliers. Additionally feeding into the -network eases and speeds up learning considerably.

## Viii Experiments

Since this work introduces a learning based approach to the problem of ACOPF, the performance of the algorithm is evaluated similar to how the performance of reinforcement learning agents is evaluated, i.e. an empirical evaluation strategy is employed. Specifically, given a held out test set of load flow problems that the system was not presented with during training, the generation cost and the result of whether or not the system was able to find a feasible solution are recorded.

The requirements for feasibility excluding those that are met by construction are the following:

• Log-mismatch between the RHS and LHS of the power flow equations (2), i.e. , must be smaller than .

• Slack active and reactive generation are within limits

• Non-slack voltage magnitude constraints are met

The experiments were conducted on the 200 bus Illinois IEEE test case [zimmerman2011matpower]. However, since the IEEE test cases only contain a single demand assignment, the demand base case was superimposed by temporal patterns extracted from the RE Europe data set [re_europe]. RE Europe data set contains historical demand for 3 years at an hourly interval. Let be the base demand taken from test case and

be the temporal demand patterns taken from RE Europe data set. The temporal patterns were imposed such that the mean demand of every node is equal to the demand in the test case and such that the ratio between mean and standard deviation as seen in the RE Europe data set is preserved. The data set was separated into training (20.280 data points) and test set (6000) when conducting experiments.

The NNs used in this experiments constitute standard fully connected three-layer networks with intermediate activations. All intermediate layers have hidden units.

## Ix Results

As stated earlier, learning based approaches to control problems are usually not guaranteed to be optimal but can offer advantages in terms of computational time and robustness. The performance of LOPF reinforces these expectations. Figure 7 (left) shows the percentage of load flow solutions produced by the system that violate any requirement for feasibility as a function of learning steps. One learning step encompasses 32 load flow problems. For all of the 32 load flow problems 50 candidate binary configurations are drawn from the auxiliary distribution. One can see that the system quickly learns to produce feasible solutions. Initially the system produces feasible solutions to no load flow problems. However, after just 30 steps close to all solutions proposed by the system are feasible.
Figure 7 (right) shows the -mismatch between the RHS and LHS of the power flow equations (2), i.e. . Note that initially, the proposed approach produces generation assignments for which the HELM solver is unable to produce voltage phasors that fulfill the power flow equations but by minimizing the power series coefficients as described in section V-C, the system is quickly nudged into a regime where the proposed solutions fulfill the power flow equations. However, after approximately 75 learning steps, for a short period of time, the system produces generation assignments that, again, do not fulfill the power flow equations. This can most likely be explained by the fact that the system also tries to minimize cost. Thus, by trying to find cheaper generation assignments, the system left the regime in which solutions can be found by HELM but was then steered back into this regime.
Table I showcases the performance of our proposed algorithm in comparison to the MIPS solver proposed in [zimmerman2011matpower]

. In order to deal with non-convex generation limit constraint, the MIPS solver was run with a unit-decommitment heuristic (

runuopf). When obtaining the results for the MIPS solver, all initializations were unchanged and only demand was varied in the way described earlier. Slightly varying the demand reveals the weakness of traditional solvers: Convergence to a feasible solution cannot be guaranteed. In our experiments, the MIPS solver produced solution which comply with all constraints and fulfill the power flow equations in only about 61% of all problem instances (failure in 2349 out of 6000 instances). Our proposed solution produces feasible solutions for 99.86% (failure in 8 out of 6000 cases) of the problem instances.111When LOPF fails, it slightly violates voltage magnitude constraints.
On top of that, our proposed learning based approach is considerably faster than optimization based approaches: Because solutions can be obtained by feeding a demand assignment through the NNand the forward pass through NNs is usually fast, obtaining the generation assignment proposed by the system is fast. Note that when we report the time per instance for the MIPS solver, we report the mean-time over all load flow problems. However, when the solver fails, it usually fails quickly. If only the time per successful instance was reported, the mean time per instance of the MIPS solver would be close to 30s per instance.
However, Table I and Figure 6 more clearly reveal the main weakness of our proposed learning based approach. Even though solutions can be obtained robustly and fast, the proposed learning system does not find solutions that are optimal in terms of generation cost. On average, the solutions that the approach produces are approximately 29% more expensive than the solutions found by the MIPS solver. Note that the average cost is reported for only those load flow problems for which both approaches yielded feasible solutions.

## X Conclusion and future work

The main contribution of this paper is the introduction of a learning based framework for the problem of ACOPF that respects the full non-linear ACOPF equations. Specifically, we introduce a learning based approach in which a function is tasked to produce feasible and minimal cost generation assignments as a function of the demand. A learning signal for this function is obtained by differentiating through the operators of a load flow solver. Furthermore, we show how convex security constraints and non-convex generation limit constraints can be enforced. The resulting system seems to produce feasible solutions fast. However, these solutions are not necessarily optimal in terms of generation cost.

An obvious future research path is to close the optimality gap. At this moment, because of the complexity of the resulting system, it is hard to understand why the solutions are not optimal. But note that the proposed algorithm cannot be optimal by design, because the slack generator cannot be decommitted and there is a natural interpolation between load flow solutions. However, in the opinion of the authors, performance gains in terms of optimality should be possible.

Another potentially interesting research question is whether or not the trained auxiliary distribution that learns the cost surface as a function of the binary generator configuration allows for conditional sampling. Imagine a scenario where generators have failed. In such a scenario, it is paramount to reconfigure the network in a feasible state fast. If it is possible to sample from conditioned that the failed generators are off, then the proposed learning based approach could potentially find application in emergency and security sensitive situations. Note that this seemingly easy problem is not trivial because of the FactorNet [lange2018factornet] structure of the auxiliary distribution.