## Abstract

Biopharmaceutical manufacturing faces critical challenges, including complexity, high variability, lengthy lead time, and limited historical data and knowledge of the underlying system stochastic process. To address these challenges, we propose a green simulation assisted model-based reinforcement learning to support process online learning and guide dynamic decision making. Basically, the process model risk is quantified by the posterior distribution. At any given policy, we predict the expected system response with prediction risk accounting for both inherent stochastic uncertainty and model risk. Then, we propose green simulation assisted reinforcement learning and derive the mixture proposal distribution of decision process and likelihood ratio based metamodel for the policy gradient, which can selectively reuse process trajectory outputs collected from previous experiments to increase the simulation data-efficiency, improve the policy gradient estimation accuracy, and speed up the search for the optimal policy. Our numerical study indicates that the proposed approach demonstrates the promising performance.

## 1 Introduction

To address critical needs in biomanufacturing automation, in this paper, we introduce a green simulation assisted Bayesian reinforcement learning to support bioprocess online learning and guide dynamic decision making. The biomanufacturing industry is growing rapidly and becoming one of the key drivers of personalized medicine. However, biopharmaceutical production faces critical challenges, including complexity, high variability, long lead time, and very limited process data. Biotherapeutics are manufactured in living cells whose biological processes are complex and have highly variable outputs (e.g., product critical quality attributes (CQAs)) whose values are determined by many factors (e.g., raw materials, media, critical process parameters (CPPs)). As new biotherapeutics (e.g., cell and gene therapies) become more and more “personalized,” biomanufacturing requires more advanced manufacturing protocols. In addition, the analytical testing time required by biopharmaceuticals of complex molecular structure is lengthy, and the process observations are relatively limited.

Driven by these challenges, we consider the model-based reinforcement learning (MBRL) or Markov Decision Process (MDP) to fully leverage the existing bioprocess domain knowledge, utilize the limited process data, support online learning, and guide dynamic decision making. At each time step

, the system is in state , and the decision maker takes the action by following a policy . At the next time step , the system evolves to new state by following the state transition probabilistic model , and then we collect a reward . Thus, the statistical properties and dynamic evolution of stochastic control process depend on decision policy and state transition model . In the biomanufacturing, the prior knowledge of state transition model is constructed based on the existing biological/physical/chemical mechanisms and dynamics. The unknown model parameters (e.g., cell growth, protein production, and substrate consumption rates in cell culture; nucleation rate and heat transfer coefficients in freeze drying) will be online learned and updated as the arrivals of new process data. The optimal policy depends on the current knowledge of process model parameters.In this paper, we propose a green simulation assisted Bayesian reinforcement learning (GS-RL) to guide dynamic decision making. Given any policy, we predict the expected system response with prediction risk accounting for both process inherent stochastic uncertainty and model estimation uncertainty, call model risk. The model risk is quantified by the posterior distribution and it can efficiently leverage the existing bioprocess domain knowledge through the selection of prior and support the online learning. Thus, the proposed Bayesian reinforcement learning can provide the robust dynamic decision guidance, which can be applicable for cases with various amount of process historical data. In addition, motivated by the studies on green simulation (i.e., Feng and Staum (2017) and Dong et al. (2018)), we propose the stochastic control process likelihood ratio-based metamodel to improve the policy gradient estimation, which can fully leverage the historical trajectories generated with various state transition models and policies. Therefore, the proposed green simulation assisted Bayesian reinforcement learning can: (1) incorporate the existing process domain knowledge; (2) facilitate the interprertable online learning; (3) guide complex bioprocess dynamic decision making; and (4) provide the reliable, flexible, robust, and coherent decision guidance.

For the model-free inforcement learning, Mnih et al. (2015) introduce the experience replay (ER) to reuse the past experience, increase the data efficiency, and decrease the data correlation. It randomly samples and reuses the past trajectories. Built on ER, Schaul et al. (2016) further propose the prioritized experience replay (PER), which prioritizes the historical trajectories based on temporal-difference error.

The main contribution of our study is to propose a green simulation assisted Bayesian reinforcement learning (GS-RL). Even though both GS-RL and PER are motivated by “experience replay” and reuse the historical data, there is the fundamental difference between GS-RL and PER. In our approach, the posterior distribution of state transition model can provide the risk- and science-based knowledge of underlying bioprocess dynamic mechanisms, and facilitate the online learning. Then, the likelihood ratio of stochastic decision process is used to construct the metamodel of policy gradient in the complex decision process space, accounting for the selection and impact of both policy and state transition model. It allows us to reuse the trajectories from previous experiments, and the weight assigned to each trajectory depends on its importance measured by the spatial-temporal distance of decision processes. In addition, a mixture process proposal distribution used in the likelihood ratio can improve the estimation accuracy and stability of policy gradient and speed up the search for the optimal policy. Since the model risk is automatically updated during the learning, our approach can dynamically adjust the importance weights on the previous trajectories, which makes GS-RL flexible, efficient, and automatically deal with non-stationary bioprocess.

The organization of the paper is as follow. In Section 2, we provide the problem description. To facilitate the biomanufacturing process online learning and automation, we focus on the model-based reinforcement learning with the posterior distribution quantifying the model risk. Then, in Section 3, we propose the green simulation assisted policy gradient, which can fully leverage the process trajectories obtained from previous experiments and speed up the search for the optimal policy. After that, a biomanfuacturing example is used to study the performance of proposed approach and compare it with the state-of-art policy gradient approaches in Section 4. We conclude this paper in Section 5.

## 2 Problem Description and Model Based Reinforcement Learning

To facilitate the biomanufacturing automation, we consider the reinforcement learning for finite horizon problem. In Section 2.1, we suppose the underlying model of production process is known and review the model-based reinforcement learning. Since the process model is typically unknown and estimated by very limited process data in the biomanufacturing, in Section 2.2

, the posterior distribution is used to quantify the model risk and the posterior predictive distribution, accounting for stochastic uncertainty and model risk, is used to generate the trajectories characterizing the overall prediction risk. Thus, in this paper, we focus on the model-based reinforcement learning with model risk so that we can efficiently leverage the existing process knowledge, support online learning, and guide process dynamic decision making.

### 2.1 Model-Based Reinforcement Learning for Dynamic Decision Making

We formalize model-based reinforcement learning or Markov decision process (MDP) over finite horizon as , where is a set of states, is the starting state, is the set of actions. The process proceeds in discrete time step . In each -th time step, the agent observes the current state , takes an action , and observes a feedback in form of a reward signal . Moreover, let

denote a policy specified by parameter vector

. The policy is a function of current state,, whose output is action for deterministic policy or its selection probabilities for random policy. For

non-stationary finite horizon MDP, we can write .Let represent the state transition model characterizing the probability of transitioning to a particular state from state . Suppose the underlying process model can be characterized by parameters . Let

denote the probability distribution of the trajectory

of state-action sequence over transition probabilities parameterized by transition model starting from state and following policy . The bioprocess trajectory length can be scenario-dependent. For example, it can depend on the CQAs of raw materials and working cells. We write the distribution of decision process trajectory as

(1) |

Let denote the expected total reward for the trajectory (sample path) starting from , i.e., , where is the discount factor and the reward occurring in the -th time step depends on the state and action . Therefore, given the process model specified by , we are interested in finding the optimal policy maximizing the expected total reward,

(2) |

### 2.2 Model Risk Quantification and Bayesian Reinforcement Learning

However, the underlying process model is typically unknown and estimated by the limited historical real-world data. Here, we focus on Bayesian reinforcement learning (RL) with model risk quantified by the posterior distribution. We consider the growing-batch RL setting (Laroche and Tachet des Combes, 2019). The process consists in successive periods: In each -th period, a batch of data is collected with a fixed policy from distributed complex bioprocess, it is used to update the knowledge of bioprocess state transition model, and then the policy will be updated for the next period. At any -th period, given all real-world historical data collected so far, denoted by , we construct the posterior distribution of state transition model quantifying the model risk, , where the prior quantifies the existing knowledge on bioprocess dynamic mechanisms. Since the posterior of previous time period can be the prior for the next update, the posterior will be updated as new process data are collected. There are various advantages of using the posterior distribution quantifying the model risk, including: (1) it can incorporate the existing domain knowledge on bioprocess dynamic mechanisms; (2) it is valid even when the historical process data are very limited, which often happens in the biomanufacturing; and (3) it facilitates online learning and bioprocess knowledge automatic update.

At any -th period, to provide the reliable guidance on the dynamic decision making, we need to consider both process inherent stochastic uncertainty and model risk. Let denote the total expected reward accounting for both sources of uncertainty: , with the inner conditional expectation, accounting for stochastic uncertainty and the outer expectation accounting for model risk. Therefore, given the partial information of bioprocess characterized by , we are interested in finding the optimal policy,

(3) |

## 3 Green Simulation Assisted Reinforcement Learning With Model Risk

In this section, we present the green simulation assisted Bayesian reinforcement learning, which can efficiently leverage the information from historical process trajectory data and accelerate the search for the optimal policy. In Section 3.1, at each -th period and given real-world data , we derive the policy gradient solving the stochastic optimization problem (3) and develop the likelihood ratio based green simulation to improve the gradient estimation. Motivated by the metamodel study in Dong et al. (2018), a decision process mixture proposal distribution and the likelihood ratio based metamodel for policy gradient are derived, which can reuse the process trajectories generated from previous experiments to improve the gradient estimation stability and speed up the search for the optimal policy. In Section 3.2, we provide the algorithm for proposed online green simulation assisted policy gradient with model risk.

### 3.1 Green Simulation Assisted Policy Gradient

At each -th period and given real-world data , we develop the green simulation based likelihood ratio to efficiently use the existing process data and facilitate the policy gradient search. Conditional on the posterior distribution , the objective of reinforcement learning is to maximize the expected performance Based on eq. (3), we can rewrite the objective function,

(4) | |||||

The likelihood ratio in eq. (4) can adjust the existing trajectories generated by policy and transition model to predict the mean response at the new policy .

Let denote the accumulated number of iterations for the optimal search occurring in the previous periods. For notation simplification, suppose there is a fixed number of iterations in each period (say ). At -th iteration, we only generate one posterior sample to estimate the outer expectation in eq. (4). For the candidate policy , the likelihood ratio based green simulation is used to estimate the mean response . It can reuse the process trajectories obtained from previous simulation experiments generated by using the policies and state transition models with . They are obtained in previous periods with different posterior distributions, i.e., with . Then, since each proposal distribution is based on a single decision process distribution specified by , we create the green simulation individual likelihood ratio (ILR) estimator of ,

where is the -th sample path generated by using and is the combination of replications allocated at each for . Since the process trajectory length is scenario-dependent, we replace the horizon with to indicate its trajectory dependence.

This expected total reward estimator can be used in the policy gradient to search for the optimal policy. Under some regularity conditions, we provide the derivation for the policy gradient estimator.

(5) | |||||

(6) |

(7) | |||||

where

By plugging in eq.(6), the policy gradient becomes,

(8) | |||||

Then, we obtain the individual likelihood ratio based policy gradient estimator,

(9) |

The importance weight or likelihood ratio is larger for the trajectories that are more likely to be generated by the policy and transition probabilities . During the model learning process, the current policy candidate can be quite different from the policy for

that generated the existing trajectories. Although this importance weight is unbiased, its variance could grow exponentially as the horizon

increases, which restricts their applications.Since the likelihood ratio with single proposal distribution can lead to high estimation variance, inspired by the BLR-M metamodel proposed in Dong et al. (2018), we develop the bioprocess Mixture proposal distribution and Likelihood Ratio based policy gradient estimation (MLR), which allows us to selectively reuse the previous experiment trajectories and reduce the gradient estimation variance. Specifically, at the -th iteration of search for optimal policy, we generate a posterior sample of process model parameters, . During the optimal policy search, if there are new process data coming, the posterior will automatically update. The policy candidate and transition probability model uniquely define the trajectory distribution . Based on the historical trajectories generated during the previous periods, we create a mixture proposal distribution , and then use it to construct the likelihood ratio,

(10) |

where , , , and is the number of trajectories generated during the previous -th iteration with for . By replacing the likelihood ratio in eq. (9) with , the green simulation based policy gradient estimator becomes,

(11) |

where with represent the trajectories generated in the previous -th iteration. Notice that the mixture proposal distribution based likelihoood ratio is bounded by . In this way, the mixture likelihood ratio puts higher weight on the existing trajectories that are more likely to be generated by in the -th iteration without assigning extremely large weights on the others.

Since the parameterization plays an important role in the optimal policy gradient approach, we briefly discuss several possible policy functions. The policy function in reinforcement learning can be either stochastic or deterministic; see Silver et al. (2014) and Sutton and Barto (2018). The policy for discrete actions could be defined as the softmax function, where is feature vector of state-action pair . The gradient of the policy function is . For continuous action spaces, we can apply Gaussian policy; say for example for some constant , where is feature representation of . The gradient of the policy function is

. In general, as long as the predictive models have a gradient descent learning algorithm, they can be applied in our approach, such as deep neural network, generalized linear regression, SVM, etc. In the empirical study, we considered a two-layer MLP model as our policy function.

### 3.2 Optimal Policy Search Algorithm

Algorithm 1 provides the procedure for the green simulation assisted policy gradient approach to support online learning and guide dynamic decision making.

At any -th period, given the real-world data collected so far, the model risk is quantified by the posterior distribution , and then we apply the green simulation assisted policy gradient to search for the optimal policy in Steps (1)–(4). Specifically, in each -th iteration, we first generate the posterior sample for state transition probability model in Step (1), , and then generate trajectories by using the current policy and model parameter in Step (2). Then, in Steps (3) and (4), we reuse all historical trajectories and apply the green simulation-based policy gradient to speed up the search for the optimal policy. After that, as new real-world data coming, we update the posterior of transition model in Step (6), and then repeat the above procedure. In the empirical study, we use a fixed learning rate . Notice that the proposed mixture likelihood ratio based policy gradient can be easily extended to broader reinforcement learning settings, such as online, offline, and model-free cases.

## 4 Empirical Study

In this section, we study the performance of MLR using a biomanufacturing example. The upstream simulation model was built based on a first-principle model proposed by Jahic et al. (2002) and the downstream chromatography purification process follows Martagan et al. (2018). The empirical study results show that MLR outperforms the state-of-the-art policy search and baseline model-based stochastic gradient algorithms without BLR-M metamodel.

### 4.1 A Biomanufacturing Example

In this paper, we consider the batch-based biomanfucturing and use the stochastic simulation model built based on our previous study (Wang et al., 2019) to characterize the dynamic evolution of biomanufacturing process. A reinforcement learning model with continuous state and discrete action space is then constructed to search for the optimal decisions on chromatography pooling window, which was studied by Martagan et al. (2018)

. Instead of assuming that each chromatography step removes the uniformly distributed random proportion of protein and impurity

(Martagan et al., 2018), we let the random removal fraction following Beta distribution with more realistic and flexible shape.

This biomanufacturing process consists of: (1) upstream fermentation where cells produce the target protein; and (2) downstream purification to remove the impurities through multiple chromatography steps. The primary output of fermentation is a mixture including the target protein and significant amount of unwanted impurity derived from the host cells or fermentation medium. After fermentation, each batch needs to be purified using chromatography to meet the specified quality requirements, i.e., purity concentration reaching to certain threshold level . Since the chromatography typically contributes the main cost for downstream purification, in this paper, we focus on optimizing the integrated protein purification decisions related to chromatography operations or pooling window selection. To guide the downstream purification dynamic decision making, we formulate the reinforcement learning for biomanufacturing process as follows.

Decision Epoch: Following Martagan et al. (2018)

, we consider three-step chromatography. During chromatography we observe measurements and make decisions at each decision epoch

.State Space: The state at any decision time is denoted by the protein-impurity-step tuple on the finite space , where and . The state space is bounded by a predefined constant threshold due to limitation in cell viability, growth rate and antibody production rate, etc. The state space is bounded by a predefined constant threshold following FDA process quality standards.

Action Space: Let denote the selection of pooling window given the state at the time following a policy . To simplify the problem, we consider 10 candidate pooling windows per chromatography step here.

Reward: At the end of downstream process, we record the reward,

We set the failure cost , the protein shortage cost per milligram (mg), the product price $5 per mg, , the amount of purity percentage requirement mg, and the purity requirement . The operational cost for each chromatography column is $8 for and for .

Initial State: The random protein and impurity inputs for downstream chromatography are generated with the cell culture first-principle model, which is based on the differential equations with random noise. Here, we consider a fed batch bioreactor dynamic model proposed by Jahic et al. (2002),

(12) |

where and denote the constant specific mAb protein production and impurity rates, denotes the biomass concentration from dry weight , is medium volume , denotes inlet substrate concentration , is substrate concentration , is specific maximum rate of substrate consumption , is the specific rate of substrate consumption , is the specific growth rate and is biomass yield coefficient exclusive maintenance and is maintenance coefficient . The initial biomass and substrate is set to be (). We set the total time of production fermentation to be 50 days, and obtain mg of target protein and mg of impurity by applying the PDEs in (12

). After the harvest, we further add the noise, following the normal distribution

, to account for the overall impact from other factors introduced during the cell production process. Then, the protein and impurity inputs for downstream purification become and
Comments

There are no comments yet.