## 1 Introduction

Certain control algorithms, such as Model Predictive Control (MPC) and visual servoing, require heavy real-time computation, while on-site controllers in industrial control systems are often resource-constrained. Cloud-based control can resolve this issue by allowing on-site controllers to outsource their computations to cloud computers. Many such cloud-based control schemes have been proposed in the recent literature. For instance, Hegazy and Hefeeda (2014) considered a cloud-based MPC architecture with an application to large-scale power plant operations. Wu et al. (2012) considered a cloud-based visual servoing architecture, where a UDP-based communication protocol was developed to reduce latency.

Despite the computational advantages, a naive implementation of cloud-based control can leak operational records of clients’ control systems, which often contain sensitive information. Since private information is a valuable asset in modern society, an appropriate privacy-protecting mechanism is a critical requirement for practical cloud-based control services. Both non-cryptographic (e.g., differential privacy) and cryptographic (e.g., homomorphic encryption (HE), the focus of this paper) approaches have been considered in the literature.

Cloud-based control over HE was pioneered by Kogiso and Fujita (2015), Farokhi et al. (2016), and Kim et al. (2016), followed by rapid developments of related technologies in more recent works. The focus of the literature to date has mainly been on cloud-based *implementations* of control policies (Fig. 1(a)). In such scenarios, the role of the cloud is to evaluate the values of control actions in the ciphertext domain based on the encrypted values of the system output, where the control policy is given in advance. In this paper, we consider a different scenario motivated by cloud-based control *synthesis* problems (Fig. 1(b)). In this scenario, the role of the cloud is to construct a control policy, while the synthesized policy may be implemented locally by the client. Control synthesis problems are classified into model-based and data-driven approaches. In model-based approaches, a mathematical model of the system to be controlled is explicitly used for the policy construction (e.g., synthesis of explicit MPC laws; Bemporad et al. (2002)), while data-driven approaches construct a policy from historical input-output data without using system models (e.g., reinforcement learning). Since there are scenarios in which control synthesis involves heavy computations (e.g., explicit MPC for time-varying systems, or reinforcement learning by deep Q-networks; Mnih et al. (2015)), it is beneficial to consider cloud-based implementations of such computational procedures. To the best of our knowledge, cloud-based control synthesis over HE has not yet been investigated in the literature.

Although a large portion of the existing work on control over HE is restricted to partially homomorphic encryption (PHE) schemes (notable exceptions include Kim et al. (2019)), this paper adopts a framework of fully homomorphic encryption (FHE) schemes. This is because it is hard to find a useful application of PHE schemes (which support only addition *or* multiplication in the ciphertext domain) in control synthesis problems, where more advanced algebraic operations are typically required. As a step toward general control synthesis over FHE, this paper focuses on implementing one of the most elementary reinforcement learning (RL) algorithms, namely the SARSA(0) algorithm. We note that RL over FHE is largely unexplored, while supervised learning over FHE has been studied in recent years, e.g., Dowlin et al. (2016). To deal with the delays due to the computational overhead of FHE, we consider a modified SARSA(0) and discuss its convergence properties. We implement the modified SARSA(0) over the Cheon-Kim-Kim-Song (CKKS) encryption scheme and apply it to a benchmark RL problem (pole balancing) to demonstrate the feasibility of our approach.

In the remainder of this paper, we first give an overview of HE schemes in Section 2. The modified SARSA(0) algorithm is described in Section 3, and the results of numerical experiments are shown in Section 4. We conclude in Section 5 with future research directions.

## 2 Preliminaries

Homomorphic Encryption (HE) is an encryption method that possesses homomorphic properties with respect to addition and/or multiplication. Modern cryptography can be classified into two categories: symmetric key encryption and asymmetric key encryption. In symmetric key encryption, the same key is used for both encryption and decryption; the Advanced Encryption Standard (AES) is a widely known symmetric encryption algorithm. In asymmetric key encryption, often known as public-key encryption, the decryption key is different from the encryption key. A public-key homomorphic encryption method can thus enable cloud-outsourced computations with data protection. A comprehensive introduction to this technology can be found in Yi et al. (2014).

### 2.1 Fully Homomorphic Encryption

A homomorphic encryption scheme is Partially Homomorphic if only one operation (either addition or multiplication) is preserved. Popular Partially Homomorphic Encryption schemes include the ElGamal and Paillier cryptosystems; the former is homomorphic with respect to multiplication and the latter with respect to addition. Recently, both schemes have been applied in the control literature (Kogiso and Fujita (2015), Farokhi et al. (2016), Alexandru et al. (2018), and Schulze Darup et al. (2018)).

If both additions and multiplications are preserved, but only for a limited number of operations, then the encryption scheme is called Somewhat (Leveled) Homomorphic. Gentry (2009) proved that any Somewhat Homomorphic scheme equipped with a bootstrapping procedure, which controls the noise growth in ciphertexts, can be promoted to Fully Homomorphic Encryption. Since then, many more efficient schemes have emerged. It is also worth mentioning that many recent-generation FHE schemes are among the few candidates for quantum-resistant cryptosystems (Lange, 2015). A more in-depth coverage of FHE can be found in Halevi (2017).

### 2.2 CKKS Encryption scheme

The CKKS scheme was developed by Cheon et al. (2016), and a bootstrapping procedure for it was proposed by Cheon et al. (2018), thus elevating it to a fully homomorphic encryption scheme. CKKS is unique in that it encrypts vectors of complex numbers and performs approximate arithmetic, as opposed to other schemes, which encrypt integers and perform exact arithmetic. Note that a special encoding structure is necessary, as the plaintext space of the CKKS scheme is an integer-coefficient polynomial ring modulo a cyclotomic polynomial. A detailed review of the CKKS scheme is out of the scope of this paper; we refer the reader to Cheon et al. (2016).

The precision loss of the CKKS scheme is only slightly more than that of unencrypted floating-point arithmetic, which makes the scheme very convenient for control applications: a small precision loss can be modeled as an existing disturbance, and applications involving large data can reap benefits via batched operations. Moreover, Microsoft SEAL (2019) provides an open-source library that supports CKKS, with a simple tutorial for its use.

### 2.3 Cloud-based control over homomorphic encryption

More recently, the secure evaluation of affine control laws for explicit MPC using the Paillier cryptosystem (PCS) was shown in Schulze Darup et al. (2018). In Alexandru et al. (2018), an implicit MPC control evaluation using PCS was shown to be possible through the projected fast gradient method and the use of a communication protocol. Similarly, in Darup et al. (2018), a PCS-encrypted implicit MPC was realized via a real-time proximal gradient method. However, due to the nature of PCS, or any other partially homomorphic scheme, some parameters are assumed to be public. This is a valid assumption in some cases but may not fit other scenarios.

On the other hand, FHE has not been adopted as widely as PHE in control systems because it is still far more computationally demanding. Nonetheless, the feasibility of FHE for a cloud-based control system was first shown in Kim et al. (2016). Subsequently, a secure dynamic control scheme using LWE-based FHE was proposed in Kim et al. (2019); a critical observation there was that, under some conditions, the noise growth of FHE is bounded thanks to the stability of the closed-loop system, eliminating the need for bootstrapping, one of the most computationally involved procedures.

Previous attempts to apply homomorphic encryption to cloud-based control have mostly focused on cloud-based *implementations* of control laws. In many cases, the implementation of a control law imposes stringent real-time requirements (on the order of milliseconds), which significantly restricts the use of homomorphic encryption, since it introduces delays. Instead of focusing on control implementation, we apply homomorphic encryption to cloud-based control *synthesis* (the computation of the feedback control policies to be implemented), as seen in Fig. 1(b). In contrast to Fig. 1(a), the feedback control policy is implemented by the client. Since sensor data and control commands are not encrypted, fast implementation of the control policy is possible.

The role of the cloud in our model is to update the control policy based on the information uploaded by the client. In a model-based design scenario, the client uploads encrypted plant information, based on which the cloud synthesizes a policy. In a data-driven scenario (such as reinforcement learning), the client uploads encrypted operational records of the system (e.g., actions, observations, and reward signals), based on which the cloud synthesizes a policy. Since the time scale of control synthesis (i.e., the frequency at which control policies are updated) is typically much slower than that of implementation, we expect that existing FHE schemes can be fruitfully applied.

## 3 Reinforcement learning over fully homomorphic encryption

Consider a Markov decision process (MDP) defined by a tuple $(\mathcal{S}, \mathcal{A}, P, R)$, where $\mathcal{S}$ is a finite state space, $\mathcal{A}$ is a finite action space, $P(s'\mid s,a)$ is the state transition probability, and $R(s,a)$ is the reward. A policy $\pi=(\pi_0,\pi_1,\dots)$ is a sequence of stochastic kernels, where $\pi_t$ assigns the probability of selecting the next action $a_t$ given the history $h_t=(s_0,a_0,\dots,s_t)$. For a fixed policy $\pi$, the *value* of each state $s$ is defined by

$$V^\pi(s)=\mathbb{E}^\pi\left[\sum_{t=0}^{\infty}\gamma^t R(s_t,a_t)\,\middle|\,s_0=s\right],$$

where $\gamma\in(0,1)$ is a predefined discount factor. We say that a policy $\pi^*$ is *optimal* if it maximizes the value of each state simultaneously.
The existence of a time-invariant, Markov, and deterministic optimal policy under the present setup is well known (e.g., Bertsekas (2011)).
Consequently, an optimal policy of the form $a_t=\pi^*(s_t)$ can be assumed without loss of performance.
The value function $V^*$ under an optimal policy satisfies the Bellman equation

$$V^*(s)=\max_{a\in\mathcal{A}}\Big[R(s,a)+\gamma\sum_{s'\in\mathcal{S}}P(s'\mid s,a)V^*(s')\Big]. \qquad (1)$$
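To make equation (1) concrete, the following sketch computes its fixed point by iterating the Bellman operator on a hypothetical two-state, two-action MDP (the transition probabilities and rewards are illustrative choices, not taken from this paper):

```python
# Value iteration on a toy 2-state, 2-action MDP (hypothetical example).
# Fixed point of (1): V*(s) = max_a [ R(s,a) + gamma * sum_{s'} P(s'|s,a) V*(s') ]

gamma = 0.9
# P[s][a] = transition probabilities to states 0 and 1
P = {0: {0: [1.0, 0.0], 1: [0.2, 0.8]},
     1: {0: [0.5, 0.5], 1: [0.0, 1.0]}}
# R[s][a] = immediate reward
R = {0: {0: 0.0, 1: 1.0},
     1: {0: 0.0, 1: 2.0}}

V = {0: 0.0, 1: 0.0}
for _ in range(500):  # iterate the Bellman operator (a gamma-contraction)
    V = {s: max(R[s][a] + gamma * sum(P[s][a][sp] * V[sp] for sp in (0, 1))
                for a in (0, 1))
         for s in (0, 1)}

# At the fixed point, V satisfies (1) up to numerical tolerance.
bellman_residual = max(
    abs(V[s] - max(R[s][a] + gamma * sum(P[s][a][sp] * V[sp] for sp in (0, 1))
                   for a in (0, 1)))
    for s in (0, 1))
```

Here state 1 with action 1 yields reward 2 forever, so $V^*(1)=2/(1-\gamma)=20$, and the iteration converges to it geometrically.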

### 3.1 Q-learning

The focus of reinforcement learning algorithms in general is to construct an optimal policy based on the observed history of the states, actions, and reward signals.
Q-learning, Watkins and Dayan (1992), achieves this by attempting the construction of the optimal *Q-function* defined by

$$Q^*(s,a)=R(s,a)+\gamma\sum_{s'\in\mathcal{S}}P(s'\mid s,a)V^*(s') \qquad (2)$$

for each $(s,a)$. It follows from the Bellman equation (1) that $V^*(s)=\max_{a\in\mathcal{A}}Q^*(s,a)$ for each $s$. Once $Q^*$ is obtained, an optimal policy can also be obtained by $\pi^*(s)=\arg\max_{a\in\mathcal{A}}Q^*(s,a)$.

Let $\{\alpha_n\}_{n\geq 0}$ be a predefined sequence such that $0\leq\alpha_n\leq 1$, $\sum_{n}\alpha_n=\infty$, and $\sum_{n}\alpha_n^2<\infty$.
Denote by $n_t(s,a)$ the number of times that the state-action pair $(s,a)$ has been visited prior to the time step $t$.¹
Upon the observation of $(s_t,a_t,r_t,s_{t+1})$, Q-learning updates the $(s_t,a_t)$ entry of the Q-function by

$$Q_{t+1}(s_t,a_t)=(1-\alpha_{n_t(s_t,a_t)})Q_t(s_t,a_t)+\alpha_{n_t(s_t,a_t)}\Big(r_t+\gamma\max_{a'\in\mathcal{A}}Q_t(s_{t+1},a')\Big). \qquad (3)$$

No update is made to the entries $(s,a)\neq(s_t,a_t)$. The following result is standard in the literature: for an arbitrarily chosen initial Q-function $Q_0$, the convergence $Q_t\to Q^*$ as $t\to\infty$ holds under the update rule (3), provided that each state-action pair is visited infinitely often. In what follows, we assume that the underlying MDP is communicating, i.e., every state can be reached from every other state under some policy.

¹ If $(s,a)$ is visited for the first time at time step $t_0$, then $n_{t_0}(s,a)=0$ and $n_{t_0+1}(s,a)=1$.
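The update rule (3) can be sketched in a few lines of plaintext Python; the step sizes $\alpha_n = 1/(n+1)$ and the toy state/action spaces are illustrative choices, not prescribed by the paper:

```python
# Tabular Q-learning update (3), with per-pair learning rates
# alpha_n = 1/(n+1) (a hypothetical choice satisfying the step-size conditions).

def q_update(Q, n, s, a, r, s_next, gamma=0.9):
    """Apply update (3) to the (s, a) entry; all other entries are untouched."""
    alpha = 1.0 / (n[(s, a)] + 1)
    target = r + gamma * max(Q[(s_next, ap)] for ap in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
    n[(s, a)] += 1

states, actions = [0, 1], [0, 1]
Q = {(s, a): 0.0 for s in states for a in actions}
n = {(s, a): 0 for s in states for a in actions}

# First visit to (0, 1): alpha = 1, so the entry jumps straight to the target.
q_update(Q, n, s=0, a=1, r=1.0, s_next=1)
```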

### 3.2 SARSA(0)

The Q-learning method described above is an off-policy reinforcement learning algorithm, in the sense that the hypothetical action $a'$ in the update rule (3) need not coincide with the action $a_{t+1}$ actually selected. The SARSA(0) algorithm, on the other hand, is an on-policy counterpart that performs the Q-function update based on the experienced trajectory $(s_t,a_t,r_t,s_{t+1},a_{t+1})$:

$$Q_{t+1}(s_t,a_t)=(1-\alpha_{n_t(s_t,a_t)})Q_t(s_t,a_t)+\alpha_{n_t(s_t,a_t)}\big(r_t+\gamma Q_t(s_{t+1},a_{t+1})\big). \qquad (4)$$

Here, we remark that the absence of the *max* operation in (4) is a significant advantage for implementation over FHE, because the current polynomial approximations of comparison operations are highly inefficient, Cheon et al. (2019).
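The contrast between the two update targets can be made explicit; the toy Q-table below is hypothetical:

```python
gamma = 0.9

def q_learning_target(Q, r, s_next, actions):
    # Off-policy target of (3): requires a max (comparison), costly over FHE.
    return r + gamma * max(Q[(s_next, ap)] for ap in actions)

def sarsa_target(Q, r, s_next, a_next):
    # On-policy target of (4): pure add/multiply, hence CKKS-friendly.
    return r + gamma * Q[(s_next, a_next)]

Q = {(1, 0): 2.0, (1, 1): 5.0}
t_q = q_learning_target(Q, r=1.0, s_next=1, actions=[0, 1])  # 1 + 0.9 * 5
t_s = sarsa_target(Q, r=1.0, s_next=1, a_next=0)             # 1 + 0.9 * 2
```

SARSA(0) uses the action $a_{t+1}$ the agent actually took, so the cloud never has to evaluate a comparison in the ciphertext domain.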

The convergence of SARSA(0) can be guaranteed under certain conditions. Although a complete discussion must be deferred to Singh et al. (2000), roughly speaking it requires that (i) the learning policy is infinitely exploring, i.e., each state-action pair is visited infinitely often; and (ii) the learning policy is greedy in the limit, i.e., the probability that $a_t\neq\arg\max_a Q_t(s_t,a)$ tends to zero. Condition (i) is required for the convergence of Q-learning (the off-policy counterpart of SARSA). The additional condition (ii) is needed due to the on-policy nature of SARSA(0). The following is a simple example of a learning policy satisfying (i) and (ii).

(Decreasing policy) Let $m_t(s)$ be the number of times that the state $s$ has been visited prior to time step $t$ and define $\epsilon_t(s)=c/m_t(s)$ for some $0<c<1$. We say that $\pi$ is a decreasing policy if it selects an action randomly with the uniform distribution over $\mathcal{A}$ with probability $\epsilon_t(s_t)$ and the greedy action $\arg\max_a Q_t(s_t,a)$ with probability $1-\epsilon_t(s_t)$.

The following result is a consequence of Theorem 1 in Singh et al. (2000): for an arbitrarily chosen $Q_0$, the convergence $Q_t(s,a)\to Q^*(s,a)$ as $t\to\infty$ for each state-action pair occurs with probability one under the SARSA(0) update rule (4) and the decreasing policy.

(Outline only) The result relies on the convergence of stochastic approximation (Singh et al., 2000, Lemma 1), whose premises are satisfied if the learning policy is *greedy in the limit with infinite exploration (GLIE)*: (i) infinitely exploring, and (ii) greedy in the limit in that $\epsilon_t(s)\to 0$ as $t\to\infty$ with probability one. To verify (i), we note that, under the assumption of a communicating MDP, performing each action *infinitely often* in each state is sufficient to guarantee the infinite exploration of states. Since we are adopting the decreasing policy with $\epsilon_t(s)=c/m_t(s)$, $0<c<1$, the exploration probability depends on the number of visits $m_t(s)$. By the extended Borel-Cantelli lemma, each action is performed infinitely often w.p.1 at every infinitely-visited state; thus, by Lemma 4 of Singh et al. (2000), each state-action pair is visited infinitely often a.s. Condition (ii) holds by construction of the decreasing $\epsilon$-greedy policy. Conditions 1, 2, and 3 of Theorem 1 in Singh et al. (2000) are then satisfied by construction.
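A minimal sketch of the decreasing policy follows; the exploration constant $c$ and the toy Q-table are both hypothetical:

```python
import random

# Decreasing epsilon-greedy policy: explore uniformly with probability
# eps_t(s) = c / m_t(s), otherwise act greedily w.r.t. the current Q-table.

def decreasing_policy(Q, m, s, actions, c=0.5, rng=random):
    m[s] += 1                      # state s is being visited now
    eps = c / m[s]                 # decays as s is visited more often
    if rng.random() < eps:
        return rng.choice(actions)                 # uniform exploration
    return max(actions, key=lambda a: Q[(s, a)])   # greedy action

rng = random.Random(0)             # seeded for reproducibility
Q = {(0, 0): 1.0, (0, 1): 3.0}     # greedy action at state 0 is action 1
m = {0: 0}
picks = [decreasing_policy(Q, m, 0, [0, 1], rng=rng) for _ in range(2000)]
greedy_fraction = picks.count(1) / len(picks)
```

Since $\sum_k c/k$ grows only logarithmically, the fraction of exploratory picks vanishes while every action is still tried infinitely often in the limit.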

### 3.3 SARSA(0) with blocking states

Now, consider a cloud-based implementation of SARSA(0). We assume that the Q-function update (4) is performed by the cloud, while the decreasing policy is implemented by the client. As shown in Fig. 1(b), at each time step, the client can encrypt and upload a new data set and, upon completion of the Q-function update (4) on the cloud, the updated entry of the Q-function is downloaded and decrypted. If the computation of the Q-update takes less than one unit time interval, then SARSA(0) together with the decreasing policy as described above can be implemented in the considered cloud-based architecture without any modification.

However, an encrypted Q-update can take multiple unit time intervals. In such scenarios, a new data set may arrive before the previous Q-update is complete. For simplicity, we assume there is no communication delay between the cloud and the client. In what follows, we propose a modified SARSA, which we call *SARSA(0) with blocking states*. The proposed Q-update rule is depicted in Fig. 2, which shows a special case of the update schedule.

First, encrypted values of the initial Q-function $Q_0$ are recorded in the cloud’s memory.
When a state-action pair $(s,a)$ is visited, the corresponding encrypted data set is uploaded to the cloud and the computation (4) for the $Q(s,a)$-update is initiated; in the example of Fig. 2, this computation takes three time steps to complete. While $Q(s,a)$ is being updated during these time steps, the state $s$ is added to the list of *blocking states*. When a state is blocking, no updates are allowed to its entries $Q(s,\cdot)$.
If a state-action pair involving a blocking state is visited in the meantime, the update request is rejected and the new data set is discarded.
Once the computation is complete, the result (i.e., the next revision of the corresponding Q-function entry) is recorded in the memory and the state is removed from the blocking list. A data set collected at a state-action pair visited after the state leaves the blocking list is accepted again, and the computation of the next update is initiated.
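The accept/reject bookkeeping described above can be sketched as follows (a plaintext simulation; the delay of three steps mirrors the Fig. 2 example, but the trace itself is hypothetical):

```python
# Plaintext sketch of the blocking-state mechanism: an accepted update blocks
# the state s for d time steps; data arriving for a blocked state is discarded.

d = 3                          # hypothetical computation delay (time steps)
blocked_until = {}             # state -> first time step at which it unblocks
accepted, rejected = [], []

def offer(t, s):
    """Client offers a data set for state s at time t; returns True if accepted."""
    if t < blocked_until.get(s, 0):
        rejected.append((t, s))          # s is blocking: discard the data set
        return False
    blocked_until[s] = t + d             # initiate update; block s for d steps
    accepted.append((t, s))
    return True

# A short trace: state 0 is visited at t = 0, 1, 2, 3.
for t in range(4):
    offer(t, s=0)
```

In this trace the offers at $t=1,2$ fall inside the blocking window opened at $t=0$ and are discarded, while the offer at $t=3$ arrives exactly when the state unblocks and is accepted.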

Under SARSA(0) with blocking states described above, we denote by $\hat{Q}_t$ the most updated version of the Q-function recorded in the memory as of time step $t$ (see Fig. 2 for an example).

Let $\nu_t(s,a)$ be the number of times that an update has been accepted at the state-action pair $(s,a)$ prior to time step $t$, and suppose each accepted update takes $d$ time steps to complete. Whenever the update with the data set $(s_t,a_t,r_t,s_{t+1},a_{t+1})$ is accepted at time step $t$, the following update is made at time step $t+d$:

$$\hat{Q}_{t+d}(s_t,a_t)=(1-\alpha_{\nu_t(s_t,a_t)})\hat{Q}_t(s_t,a_t)+\alpha_{\nu_t(s_t,a_t)}\big(r_t+\gamma\hat{Q}_t(s_{t+1},a_{t+1})\big).$$

Since the blocking mechanism ensures that the entries $\hat{Q}(s_t,\cdot)$ are unchanged over the time steps $t,t+1,\dots,t+d-1$, the above update rule can also be written as

$$\hat{Q}_{t+d}(s_t,a_t)=(1-\alpha_{\nu_t(s_t,a_t)})\hat{Q}_{t+d-1}(s_t,a_t)+\alpha_{\nu_t(s_t,a_t)}\big(r_t+\gamma\hat{Q}_t(s_{t+1},a_{t+1})\big).$$

We define the *decreasing policy with delay* similarly to the decreasing policy (Definition 3.2), except that greedy actions are selected by $\arg\max_a\hat{Q}_t(s,a)$, where $\hat{Q}_t$ is the most updated version of the Q-function available at time step $t$.

We have the following convergence result: for an arbitrarily chosen $Q_0$, the convergence $\hat{Q}_t(s,a)\to Q^*(s,a)$ as $t\to\infty$ for each state-action pair occurs with probability one under SARSA(0) with delayed updates and blocking, provided that the decreasing policy with delay is adopted. We will prove that, under the delayed update and blocking mechanism, (i) every state is explored infinitely often, (ii) the update is still executed infinitely often, and (iii) the policy is greedy in the limit, again satisfying the convergence conditions.

(i) Since the decreasing policies with and without delay (Definitions 3.2 and 3.3) share the same probability $\epsilon_t(s)$ of random exploration, the lower bound on the exploration probability remains valid under the decreasing policy with delay. As in the proof of Theorem 3.2, this implies that each state-action pair is visited infinitely often.

(ii) Assume that updates are accepted only finitely many times at some state $s$. Since each accepted update blocks $s$ for a bounded number of time steps, $s$ can be blocking during only finitely many time steps. Because $s$ is visited infinitely often by (i), infinitely many of these visits must then be accepted, which contradicts the assumption.

(iii) The policy is greedy in the limit, in that

$$\Pr\big(a_t\neq\arg\max_{a}\hat{Q}_t(s_t,a)\big)\to 0 \qquad (5)$$

as $t\to\infty$ with probability one. To see this, suppose that $a_t$ was a greedy action selected at time step $t$ based on the most updated version $\hat{Q}_{t'}$ ($t'\leq t$) available at $t$, i.e.,

$$a_t\in\arg\max_{a}\hat{Q}_{t'}(s_t,a).$$

Due to the blocking mechanism, the entries $\hat{Q}(s_t,\cdot)$ are unchanged over the time steps $t',t'+1,\dots,t$. This means that

$$\arg\max_{a}\hat{Q}_{t'}(s_t,a)=\arg\max_{a}\hat{Q}_t(s_t,a).$$

Thus, whenever $a_t$ was selected greedily at $t$, we have

$$a_t\in\arg\max_{a}\hat{Q}_t(s_t,a).$$

Since the delayed $\epsilon$-greedy policy is greedy in the limit, the convergence (5) holds with probability one.

## 4 Numerical demonstration

We used the classical pole-balancing problem from RL; the setup can be found in Sutton and Barto (2018). With slight modifications, we demonstrate our encrypted SARSA(0) updates. For encryption, we used Microsoft SEAL to set up the CKKS scheme. Algorithm 1 gives the pseudocode of encrypted SARSA(0) with delayed updates. We assumed that the client transmits the Q-values directly corresponding to the state-action pairs it visited. We fixed a step size and queued the data sets in the client’s memory to take advantage of batched encryption. Instead of encrypting and uploading the data at every time step, we assumed a situation where the agent can explore longer and collect larger data sets; the agent then uploads the batch data and the cloud computes the batch update in return. We use a tilde to denote the encoded version of a quantity and an additional symbol in front to denote its encrypted version.

Table 1 lists the CKKS parameters used for the demonstration, and Table 2 shows the delay introduced by the operations involved in homomorphic encryption. The maximum precision error incurred by the encrypted updates was only 0.0063%. We used a batch size of 1000 to generate the particular result below, and accordingly only 1000 slots were used out of the 4096 available (a number determined by the encryption parameters). The time taken by each homomorphic encryption operation listed in Table 2 is fixed regardless of the batch size, so larger batch operations are possible if necessary. We verified that, given enough trials, learning outcomes remain successful for large batch sizes, which can learn much faster than small ones by performing more compact and less frequent encrypted operations.
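The batched update the cloud evaluates can be viewed, in plaintext, as the following slot-wise circuit (shown here with plain Python lists rather than SEAL ciphertexts; the batch of three queued data sets and all values are hypothetical):

```python
# Plaintext view of the batched SARSA(0) circuit evaluated under CKKS:
# slot-wise  q_new = (1 - alpha) * q + alpha * (r + gamma * q_next).
# Each list plays the role of one packed ciphertext (one queued update per slot).

alpha, gamma = 0.1, 0.9

q      = [1.0, 2.0, 0.5]   # Q-values of the visited pairs (one per slot)
r      = [0.0, 1.0, 1.0]   # observed rewards
q_next = [2.0, 0.0, 1.0]   # Q-values of the successor pairs

# Only additions and slot-wise multiplications appear, so the same circuit can
# be evaluated homomorphically (with rescaling/relinearization after products).
q_new = [(1 - alpha) * qi + alpha * (ri + gamma * qni)
         for qi, ri, qni in zip(q, r, q_next)]
```

Since CKKS packs a vector into one ciphertext, all slots are updated by a single sequence of homomorphic operations, which is why the per-operation times in Table 2 do not grow with the batch size.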

We note that the client is still burdened with non-trivial computing tasks, most notably CKKS encoding and encryption, which together take up more than half of the time spent on homomorphic operations; decryption and decoding are less strenuous. This is a prevalent issue in encrypted control, as seen in the results of Darup et al. (2018) and many others, and it diminishes the appeal of cloud-based control. This result is therefore encouraging for further investigation into the possibility of outsourcing the encryption itself. A novel concept in this direction was first suggested in Naehrig et al. (2011), where one can combine relatively lightweight AES encryption to offload the homomorphic encryption process altogether to the cloud. Even without such future improvements, however, in certain control *synthesis* problems the computing requirements and delays induced by homomorphic encryption may be well compensated by the privacy guarantee acquired. Another significant task we note is relinearization, but this task is performed by the cloud during the homomorphic operations.

| CKKS Parameters | Chosen |
|---|---|
| Poly. Modulus Deg. | 8192 |
| Cipher. Coeff. Modulus | (, , , , ) |
| Scale Factor | |
| Available Slots | 4096 |

| Type | Num | Time (ms) | Percent |
|---|---|---|---|
| Encode | 5 | 6.695 | 23.59 % |
| Encrypt | 5 | 33.519 | 39.38 % |
| Multiply | 4 | 2.549 | 2.99 % |
| Relinearize | 4 | 14.909 | 17.51 % |
| Rescale | 4 | 7.886 | 9.26 % |
| Addition | 3 | 0.074 | 0.09 % |
| Decrypt | 1 | 1.225 | 1.44 % |
| Decode | 1 | 4.881 | 5.73 % |

## 5 Summary and future work

We tackled the problem of privately outsourced control. In this work, we provided a general framework for control *synthesis* over fully homomorphic encryption and showed a convergence result for SARSA(0) with delayed updates. We then demonstrated solving a classical reinforcement learning problem with a privacy objective and privacy-induced delayed updates. Numerical results showed that homomorphic encryption via the CKKS scheme can successfully carry out the private learning with minimal precision loss. We also saw that batch operations can work to our advantage.

Many challenges remain for encrypted control. Obviously, the delays introduced may limit the range of applications. However, as considered in this paper, using homomorphic encryption for *control synthesis* may be feasible. Another critical challenge is the difficulty of implementing certain key operations, such as comparison and sorting, over FHE; this makes the execution of Bellman operations or optimization in the ciphertext domain challenging. As our future work seeks to extend the framework to more advanced tasks, such as the training and execution of artificial neural networks or control synthesis (e.g., model predictive control) over fully homomorphic encryption, an efficient polynomial comparison function will be of significant value. We will also continue to investigate the idea of outsourcing the homomorphic encryption task, as discussed above, to improve our framework.

## References

- Cloud-based MPC with encrypted data. In 2018 IEEE Conference on Decision and Control (CDC), pp. 5014–5019. Cited by: §2.1, §2.3.
- The explicit linear quadratic regulator for constrained systems. Automatica 38 (1), pp. 3–20. Cited by: §1.
- Dynamic Programming and Optimal Control, 3rd edition. Belmont, MA: Athena Scientific. Cited by: §3.
- Bootstrapping for approximate homomorphic encryption. In Advances in Cryptology – EUROCRYPT 2018, J. B. Nielsen and V. Rijmen (Eds.), Cham, pp. 360–384. External Links: ISBN 978-3-319-78381-9 Cited by: §2.2.
- Homomorphic encryption for arithmetic of approximate numbers. Note: Cryptology ePrint Archive, Report 2016/421, https://eprint.iacr.org/2016/421. Cited by: §2.2.
- Efficient homomorphic comparison methods with optimal complexity. Note: Cryptology ePrint Archive, Report 2019/1234, https://eprint.iacr.org/2019/1234. Cited by: §3.2.
- Encrypted cloud-based MPC for linear systems with input constraints. IFAC-PapersOnLine 51 (20), pp. 535–542. Cited by: §2.3, §4.
- CryptoNets: applying neural networks to encrypted data with high throughput and accuracy. Cited by: §1.
- Secure and private cloud-based control using semi-homomorphic encryption. IFAC-PapersOnLine 49 (22), pp. 163–168. Cited by: §1, §2.1.
- A fully homomorphic encryption scheme. Ph.D. Thesis, Stanford University. Note: crypto.stanford.edu/craig Cited by: §2.1.
- Homomorphic encryption. In Tutorials on the Foundations of Cryptography: Dedicated to Oded Goldreich, pp. 219–276. Cited by: §2.1.
- Industrial automation as a cloud service. IEEE Transactions on Parallel and Distributed Systems 26 (10), pp. 2750–2763. Cited by: §1.
- Encrypting controller using fully homomorphic encryption for security of cyber-physical systems. IFAC-PapersOnLine 49 (22), pp. 175–180. Cited by: §1, §2.3.
- Comprehensive introduction to fully homomorphic encryption for dynamic feedback controller via LWE-based cryptosystem. CoRR abs/1904.08025. External Links: 1904.08025. Cited by: §1, §2.3.
- Cyber-security enhancement of networked control systems using homomorphic encryption. 2015 54th IEEE Conference on Decision and Control (CDC), pp. 6836–6843. Cited by: §1, §2.1.
- Initial recommendations of long-term secure post-quantum systems. In PQCRYPTO, Cited by: §2.1.
- Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529. Cited by: §1.
- Can homomorphic encryption be practical?. In Proceedings of the 3rd ACM workshop on Cloud computing security workshop, pp. 113–124. Cited by: §4.
- Towards encrypted MPC for linear constrained systems. IEEE Control Systems Letters 2 (2), pp. 195–200. Cited by: §2.1, §2.3.
- Microsoft SEAL (release 3.4). Note: https://github.com/Microsoft/SEAL, Microsoft Research, Redmond, WA. Cited by: §2.2.
- Convergence results for single-step on-policy reinforcement-learning algorithms. Machine learning 38 (3), pp. 287–308. Cited by: §3.2.
- Reinforcement learning: an introduction. The MIT Press. Cited by: §4.
- Q-learning. Machine learning 8 (3-4), pp. 279–292. Cited by: §3.1.
- Cloud-based networked visual servo control. IEEE Transactions on Industrial Electronics 60 (2), pp. 554–566. Cited by: §1.
- Homomorphic encryption and applications. Vol. 3, Springer. Cited by: §2.
