The prediction of interfacial area properties in two-phase flow systems is challenging. In this paper, a conceptual idea of using single-agent reinforcement learning to predict the behavior of two-phase flows and of the interfacial area concentration (IAC) is proposed. The basic assumption is that the development of a two-phase flow can be treated as a stochastic process with the Markov property. The design of simple Markov games is described in detail, and solution approaches for such games are adapted. The experiment shows that both the steam fraction and the IAC prediction processes converge. The model predictions are compared with experimental results; the trends match, although some oscillations exist. The performance and the prediction results can be improved by elaborating the game environment setup.
The prediction of the characteristics of two-phase flow is essential to the safety of two-phase flow systems such as the reactor pressure vessel in a nuclear power plant. Many software packages and codes have been developed based on fundamental two-phase flow theories and models. Take the TRACE code, which is considered one of the most elaborate codes to date, as an example. The TRACE code was developed based on the two-fluid model and the interfacial area transport equation (IATE), and it is capable of predicting the characteristics of boiling two-phase flows. The two-fluid model uses two groups of partial differential equations (PDEs) and a series of constitutive equations to describe the two phases. The IATE is developed from the idea of the Boltzmann transport equation and is capable of dynamically predicting the transitions of two-phase flows. Both models are considered the most accurate to date. However, they are complicated to solve because they include many non-linear PDEs. Over the years, additional correlations and models have been developed to refine them, making the models even more expensive to compute.
Computing power has increased significantly and the cost of computation keeps falling, so some problems in thermodynamics and two-phase flow have been solved with model-free machine learning approaches. The advantage of these approaches is that they are easy to set up, as long as the problem meets their fundamental prerequisites. For example, two-phase flow regime classification has been studied using the self-organizing map (SOM), the support vector machine (SVM), and the artificial neural network (ANN). Although these studies used different machine learning techniques, their common foundation is to determine the key parameters that describe the flow characteristics; these parameters are then used to classify the flow regimes. There are also model-based machine learning approaches, which can be both accurate and stable because they are grounded in theory and make complex problems easier to solve. Examples are easy to find, so they are not discussed further here.
In this paper, the author proposes a new solution concept that uses a stochastic game theory approach to predict interfacial parameters. This approach is considered applicable because the changes and transitions of two-phase flows satisfy the Markov property; that is, the next state of the two-phase flow is determined only by the current state. This paper describes a simple, basic stochastic game design and a Q-learning test.
Reinforcement learning is about learning from interaction how to behave in order to achieve a goal. 
The reinforcement learning problem setup constitutes a Markov decision process (MDP). An MDP is a discrete, stochastic control process that provides a mathematical framework for decision making. The key elements of an MDP can be represented as a tuple, $(S, A, P, R)$, where $S$, $A$, $P$, and $R$ are the state, action, transition probability function, and rewards of the n-th player, respectively. The player interacts with the environment through the state by taking actions and receiving rewards. A policy, which is a stochastic rule by which the player selects actions as a function of states, is formed through this process. The player's objective is to maximize the amount of reward it receives over time. A stochastic game is a generalized concept that combines repeated games and MDPs; an MDP is a one-player stochastic game. The theory of reinforcement learning and stochastic games is not discussed in detail in this paper.
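For readers who want the objective in symbols, the policy and the quantity being maximized can be written in standard reinforcement learning notation. This notation is not given in the source: the discount factor $\gamma$, the final step $T$, and the return $G_t$ are assumed symbols for a finite, episodic task.

```latex
\pi(a \mid s) = \Pr\{A_t = a \mid S_t = s\}, \qquad
G_t = \sum_{k=0}^{T-t-1} \gamma^{k} R_{t+k+1}, \qquad
\pi^{*} = \arg\max_{\pi} \; \mathbb{E}_{\pi}\!\left[ G_t \right]
```

Here $\pi(a \mid s)$ is the stochastic rule for selecting action $a$ in state $s$, and $\pi^{*}$ is a policy that maximizes the expected accumulated reward.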
This paper aims to propose a one-player stochastic game environment design in which the player's goal is to predict integral quantities of the interfacial parameters, i.e., the total steam fraction and the total interfacial area concentration. A multi-player game can also be set up using this method. Two approaches can be considered in designing a multi-player game: 1) each player represents one group of bubbles (group 1 and group 2 bubbles); 2) each player represents one phase, either gas/steam or water.
The game design is inspired by the two-phase flow experimental setup provided by Zivi. Consider a steady-state, steam-water two-phase flow in an annular duct of finite length. The inner part of the annulus is a heater rod that provides a constant heat flux, and the outer wall is adiabatic. Suppose that the water entering the duct is at saturation. Due to the heat addition, steam emerges at a certain location and develops along the duct as the quality of the two-phase mixture, $x(z)$, changes, where $z$ is the location along the duct. The flow enters and exits the annular duct at very low velocities, so the effects of kinetic energy dissipation and frictional pressure drop are negligible.
Based on this game design, the steam fraction is estimated using the correlation provided by Zivi, Eq. (1), where $x$, $\rho_g$, and $\rho_f$ are the steam quality, steam density, and water density, respectively. $D$ denotes the fraction of the water that is entrained in the steam in the form of droplets. This parameter can be used to quantify the flow regime: $D = 0$ is pure annular flow, and $D = 1$ can be considered bubbly flow. From Eq. (1), the local average void fraction is estimated from these four parameters, and it also depends on the axial location. Thus, the state of the steam fraction game is simplified to a structure built from these quantities, Eq. (2).
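Eq. (1) is not reproduced in this text. As a rough illustration, the sketch below evaluates a commonly quoted form of Zivi's void fraction correlation with liquid entrainment. This form is an assumption and may differ from the expression the author used; the function name, argument names, and sample values are hypothetical. In this form, D = 0 recovers the entrainment-free annular-flow result and D = 1 recovers the homogeneous (no-slip) result, which is consistent with the limits described above.

```python
def zivi_void_fraction(x, rho_g, rho_f, D):
    """Void fraction from an ASSUMED form of Zivi's entrainment correlation.

    x     : steam quality, 0 < x <= 1
    rho_g : steam density
    rho_f : water density (same units as rho_g)
    D     : fraction of the water entrained as droplets, 0 <= D <= 1
    """
    if not 0.0 < x <= 1.0:
        raise ValueError("quality x must lie in (0, 1]")
    if not 0.0 <= D <= 1.0:
        raise ValueError("entrainment fraction D must lie in [0, 1]")
    r = (1.0 - x) / x   # water-to-steam mass ratio
    d = rho_g / rho_f   # steam-to-water density ratio
    # Correction factor that couples the entrained droplets to the steam core.
    slip = ((1.0 + D * r / d) / (1.0 + D * r)) ** (1.0 / 3.0)
    denom = 1.0 + D * r * d + (1.0 - D) * r * d ** (2.0 / 3.0) * slip
    return 1.0 / denom


# Illustrative values only, roughly saturated water and steam near 1 atm.
for D in (0.0, 0.5, 1.0):
    print(D, zivi_void_fraction(x=0.1, rho_g=0.6, rho_f=958.0, D=D))
```

Each argument of this function corresponds to one element of the steam fraction state, so changing a state element in the game changes the predicted void fraction directly.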
The interfacial area concentration (IAC) is calculated with the correlation developed by Kocamustafaogullari et al., in which $\alpha$, $j_f$, $j_g$, $dp/dz$, $\rho_f$, $\sigma$, and $D_h$ are the local averaged void fraction, area-averaged superficial water velocity, area-averaged superficial air velocity, pressure drop rate along the duct, water density, surface tension, and hydraulic diameter, respectively. In this case, $\alpha$ is taken from the estimate given by Eq. (1), and the value of $\rho_f$ equals that in Eq. (2). $\sigma$ and $D_h$ are nearly fixed values, so these two parameters are not included in the IAC state. Thus, the state expression of the IAC is simplified to the remaining quantities.
The game is a finite, episodic task because the two-phase flow travels over a finite length. In each step, the action changes one parameter in the state by choosing one of three options: move to a larger value, move to a smaller value, or stay the same. The reward at each step is based on the difference between the value calculated from the state and the true value, offset by +1 in the steam fraction game and by a positive constant C in the IAC game. This offset is a deliberate trick in the reward setup. The initial reward estimate for each state-action pair is set to 0. Without the +1, the rewards would always be negative, so a state-action pair that has been visited would likely be updated to a negative value (e.g., in the Q-learning algorithm). If that state were visited again, the agent would always prefer an action that has not been updated yet. This is a flawed game design, and the result fails to converge in most cases. The constant C in the IAC reward serves the same purpose as the +1 in the steam fraction reward: it compensates the reward and avoids the scenario in which visited state-action pairs have negative values. Note that the constant in the IAC reward can affect the performance and the prediction result by affecting the choice of actions, so it should be set and tuned properly during the game setup.
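The three-way action and the reward offset can be sketched as follows. The reward equations are not reproduced in this text, so the absolute-error form, the relative step `ratio`, and all function names are assumptions made for illustration. The last helper mimics one value update from a zero initial estimate to show why the offset matters.

```python
ACTIONS = (-1, 0, 1)  # smaller value, stay the same, larger value


def apply_action(value, action, ratio=0.05):
    """Change one state element by a fixed relative step (hypothetical ratio)."""
    if action not in ACTIONS:
        raise ValueError("action must be -1, 0, or 1")
    return value * (1.0 + action * ratio)


def reward(predicted, true_value, offset=1.0):
    """Offset minus absolute error (assumed form).

    offset=1.0 plays the role of the +1 in the steam fraction game;
    for the IAC game it would be the tunable positive constant C.
    """
    return offset - abs(predicted - true_value)


def greedy_after_one_visit(offset, error=0.2, lr=0.5):
    """Start all estimates at 0, update action 0 once, return the greedy action."""
    q = [0.0, 0.0, 0.0]
    q[0] += lr * (reward(error, 0.0, offset) - q[0])
    return max(range(len(q)), key=lambda a: q[a])


print(greedy_after_one_visit(offset=0.0))  # visited action now looks worse than untried ones
print(greedy_after_one_visit(offset=1.0))  # visited action stays preferred
```

With no offset, the visited action drops below the untouched zero estimates, so the greedy choice always drifts to an action that has not been updated yet. With the offset, a small error yields a positive estimate and the action remains attractive.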
The agent explores the environment and establishes a policy for selecting actions that change the state. The policy is a set of stochastic rules by which the agent selects actions. The agent's goal in the game is to predict the steam fraction or the IAC by changing and optimizing the values of the state elements that are used to estimate the steam fraction and the IAC at the next state.
In this section, the validity and robustness of the game design are tested with the Q-learning algorithm (i.e., $\epsilon$-greedy, with hyperparameter values of 0.001, 0.001, and 1.0). The procedure of the experiment is provided.
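A minimal sketch of tabular, off-policy Q-learning with $\epsilon$-greedy action selection is given below on a toy one-parameter game. It is not the author's experiment: the grid of 11 discrete values, the target index, the random start state, and the hyperparameters (`epsilon=0.2`, `lr=0.5`, `gamma=0.9`) are all illustrative assumptions chosen so that the toy problem trains quickly, and they differ from the values quoted above.

```python
import random

ACTIONS = (-1, 0, 1)  # move one grid step down, stay, move one grid step up


def train_q_table(n_states=11, target=7, episodes=10000, steps=20,
                  epsilon=0.2, lr=0.5, gamma=0.9, seed=0):
    """Tabular Q-learning on a toy game with a single discretized parameter."""
    rng = random.Random(seed)
    q = [[0.0] * len(ACTIONS) for _ in range(n_states)]  # estimates start at 0
    for _ in range(episodes):
        s = rng.randrange(n_states)  # random start state (a simplification)
        for _ in range(steps):       # finite episode, like a finite duct length
            if rng.random() < epsilon:
                a = rng.randrange(len(ACTIONS))  # explore
            else:
                best = max(q[s])                 # exploit, ties broken at random
                a = rng.choice([i for i, v in enumerate(q[s]) if v == best])
            s_next = min(max(s + ACTIONS[a], 0), n_states - 1)
            # Reward: +1 offset minus the normalized distance to the target.
            r = 1.0 - abs(s_next - target) / (n_states - 1)
            # Off-policy update: bootstrap from the best action in s_next.
            # The step limit only truncates the episode, so we always bootstrap.
            q[s][a] += lr * (r + gamma * max(q[s_next]) - q[s][a])
            s = s_next
    return q


def greedy_rollout(q, start, steps=20):
    """Follow the learned greedy policy and return the final grid index."""
    s = start
    for _ in range(steps):
        a = max(range(len(ACTIONS)), key=lambda i: q[s][i])
        s = min(max(s + ACTIONS[a], 0), len(q) - 1)
    return s


q_table = train_q_table()
print(greedy_rollout(q_table, start=2))
```

In the games described here, the state would hold several elements, each updated by its own action, and the reward would compare the correlation output with the measured steam fraction or IAC rather than with a fixed grid index.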
Fig. 1 and Fig. 2 show that off-policy Q-learning converges in the games for both steam fraction and IAC prediction. In both games, each element of the state is updated separately at each step, possibly with a different action chosen for each. The steam fraction or IAC at the state is calculated after all the elements of the state are updated. The two figures show that the steam fraction prediction converges faster than the IAC prediction because the state in the steam fraction game has fewer elements.
Fig. 3 shows the steam fraction prediction result, and Fig. 4 gives the changes of the key parameters in the prediction. Note that the ratios by which the state elements are changed can affect the final prediction results. Using a large ratio may leave too few reasonable states, although it reduces the time and space complexity. In the figures, the predictions follow the experimental trends with some oscillations. These oscillations are caused by two factors in the game setup: 1) the parameters for the steam fraction and IAC calculations are discrete; 2) the game setup is not elaborate enough because it includes too few models. From the convergence of the model/policy training and the agreement in trend between the predictions and the experimental results, it can be concluded that this approach works. It is expected that with more elaborate designs, the predictions can become more accurate and the application range can be extended. Fig. 5 and Fig. 6 give the IAC prediction results and the changes of the key parameters, respectively. In this experiment, the IAC prediction also shows large oscillations at some positions.
In this paper, a conceptual idea of using single-agent reinforcement learning to predict the behavior of two-phase flows and the IAC is proposed. The idea is realized by developing a stochastic game that uses empirical correlations for the steam fraction and the IAC. In the game, the parameters in the correlations are treated as the elements of the state, and they are updated (increased, decreased, or kept the same) according to the actions chosen in each step. The game is tested using Q-learning, and the results agree in trend with the experimental results. This approach can be further developed by elaborating the game environment setup.
The author is currently a PhD student in the Thermal-Hydraulics and Reactor Safety Laboratory (TRSL) at Purdue University, under the supervision of Dr. Mamoru Ishii. The author would like to thank him deeply for his support and guidance in the theory of thermo-fluid dynamics and two-phase flow.