I Introduction
Artificial Intelligence (AI) is a broad field of science whose main objective is to make machines smarter, i.e. to construct machines that behave intelligently as humans do. Machines built this way can adapt more quickly to whatever information they receive. AI acquired its prominence through considerable breakthroughs in various fields; common real-life examples are self-driving cars, smartphones, and computer games. Based on this definition, this section defines the concept of machine learning, a subfield of AI, then introduces the concept of deep learning. After detailing these concepts thoroughly, the section highlights the relation between AI in general and physics. Finally, a brief summary describes how the review is organized.
I.1 Machine Learning: Cornerstone of AI
Artificial Intelligence is a broad field of science whose main objective is to make machines smarter. A fundamental subject of AI is machine learning (ML) (Ray, ). Machine learning implements the ability to learn from experience, i.e. from the observational data in hand. This is what makes machines intelligent, since learning is at the core of intelligence. When a machine is fed with data, it first inspects the data and extracts the corresponding features (useful information). It then builds a model responsible for inferring new predictions based on those extracted features. Hence, the emphasis of machine learning is on constructing computer algorithms automatically, without the machine being explicitly programmed: the computer comes up with its own program rather than having humans program it directly. Applications of ML techniques often produce more accurate results than those of direct programming. ML meets statistics, mathematics, physics, and theoretical computer science over a wide range of applications. Real-life applications where ML is implemented include face detection, speech recognition, classification, medical diagnosis, prediction, and regression
Mehta et al. (2019).

I.2 Machine Learning vs. Deep Learning
Since their inception, ML techniques have achieved considerable success over direct programming. As discussed, one of the main tasks of a machine learning model is to extract the features, but this task is traditionally done by hand. If the number of features extracted is insufficient, the predictions will not be accurate enough; the model is said to be highly biased. On the other hand, if the number of features is more than enough, the model will also be weak; it is said to have high variance. Thus, if the model fails to extract the features efficiently, careful feature engineering is necessary, i.e. an expert must intervene and make adjustments to improve the accuracy. This limits the scope of machine learning techniques.
To address the aforementioned limitations, a new subset of ML emerged known as deep learning (DL). It is concerned with feature learning, also known as representation learning, Fig.(1), which finds the features on its own from the data in cases where manual extraction of features isn't fully successful (Bengio et al., 2013). Deep learning is implemented using complex architectures, known as artificial neural networks (ANNs), mimicking the biological neural network of the human brain. A network is built in a logical structure to analyze data in a way similar to how a human draws conclusions. Upon analyzing the data, a neural network is able to extract features, make predictions, and determine how accurate the drawn conclusion is. In this way, a deep learning model resembles human intelligence.
I.3 Physics and Machine Learning
I.3.1 Physics contributing to machine learning
Perhaps a question arises: what are the reasons behind unifying physics and machine learning? Upon going through the details of ML techniques, one of these reasons manifests itself: the core concepts of many ML techniques arise from the field of physics. Hence, physicists have been contributing to ML techniques since their early inception. Methods and theories developed in physics are still adopted in machine learning, where efforts are underway to explore new ML paradigms and develop physics-inspired learning algorithms. A group of researchers at Google, Princeton, Columbia, and MIT (Zeng et al., 2019) confirmed this approach and designed a robot that develops an intuition of physics. No doubt, significant success has been made in improving robots' efficiency in doing their tasks and learning from real-world experiences. However, the researchers' understanding is that robots still need careful considerations. To address this challenge, they integrated simple physics models with deep learning techniques. Since physics explains how the real world works, supplying the robot with such models can improve its capability to perform complex tasks. For example, to let the robot grasp objects efficiently, a neural network is given an image of the objects as input in order to select the appropriate one from the box. At a certain stage, the network extracts a feature of the object, specifically its position in the box. This feature, along with the throwing velocity supplied by a physical simulator, is fed to another neural network. This network performs adjustments to predict a projectile that accurately targets the selected placing location. In conclusion, this unification between physics and deep learning techniques results in better performance than either technique implemented alone.
I.3.2 Machine learning contributing to physics
In turn, machine learning techniques can be used as a toolkit in physics. Physicists can benefit from ML when it comes to data analysis. Physics is one of the scientific fields that give rise to big data sets in diverse areas such as condensed matter physics, experimental particle physics, observational cosmology, and quantum computing. For example, the recent "Event Horizon Telescope" experiment recorded 5 petabytes of data in order to generate the first-ever image of a supermassive black hole (Akiyama et al., 2019). That is why physicists are integrating ML techniques and following any advances in this direction. The benefits of machine learning for physicists don't stop here. Physicists implement different ML techniques with a view to improving their physical understanding and intuitions. To illustrate this approach, recent work investigated whether neural networks can be used to discover physical concepts even in domains where such concepts aren't clearly evident, such as quantum mechanics. This is done in the work of Iten et al. (Iten et al., 2018), detailed in section (IV) as a first step in this approach. It depicts a promising research direction for how ML techniques can be applied to solve physical problems. The central question here: can artificial intelligence discover new physics from raw data? This review introduces recent attempts made as a first step toward answering this question.
I.4 Layout
It must be emphasized that this review discusses how machine learning, and AI in general, interplays with physics. Since common AI tools are based on physical concepts, this is an indicator of how the AI community benefits from that of physics. This review, however, highlights the other direction: physicists are taking up the challenge of making breakthroughs by implementing AI tools in their research, and any approach in this direction appears very promising. To pave the way directly to the point, the review is organized as follows: section (II) reviews fundamental concepts about artificial neural networks. These are a class of DL techniques trained through different learning processes: supervised, unsupervised, and reinforcement learning. Discussing reinforcement learning smooths the way to introduce the concept of Markov decision processes, as explained in section (III). Sections (IV) and (V) fully detail two approaches showing how DL techniques are used to help physicists improve their intuition about different physical settings. The former explains how neural networks are implemented to describe physical settings, while the latter illustrates an algorithm that works the same way a physicist works when dealing with a physical problem. Both the algorithm and a physicist use the four following strategies to solve any problem: divide-and-conquer, Occam's razor, unification, and lifelong learning. Finally, some concluding remarks are made with an opening to future work.

II Artificial Neural Networks
This section provides background knowledge on artificial neural networks. This knowledge is indispensable for understanding the ML techniques that follow, which are implemented through such networks. The section first introduces the building block of ANNs, the artificial neuron, then discusses how information is processed in ANNs. Next comes an important step called training, whose main objective is to lead neural networks to produce results with very high accuracy. This training occurs through an algorithm called gradient descent. All these topics are presented in the following sections.
II.1 Artificial Neural Networks in a Nutshell
An artificial neuron is a computational model that resembles a biological one (Haykin, 2009). In the human body, electrical signals are transmitted among natural neurons through synapses located on the dendrites, i.e. membranes of the neuron. These signals activate the neuron whenever they exceed a specific threshold, and a signal is then emitted through the axon to activate the next neuron, Fig.(2). Take for example the case when a human hand approaches a hot solid. If the solid is hot enough, the neurons are quickly activated, transmitting a command to pull the hand away. Otherwise, the hand shows no reaction.

The artificial neuron with its basic components is analogous to the biological one; it is the building block of the artificial neural network (Kriesel, 2007). An ANN consists of several interconnected consecutive layers, where each layer is made up of stacked artificial neurons. The first layer is the input layer, which receives the input data. The data is provided as a vector $\mathbf{x} = (x_1, \dots, x_n)$, where each neuron of the input layer is supplied with one element $x_i$. The inputs are multiplied by weights $\mathbf{w} = (w_1, \dots, w_n)$ indicating the strength of each input, i.e. the higher the weight, the more influence the corresponding input has. The weighted sum of all inputs is then computed and an external bias, denoted by $b$, is added to it. The resulting value $z = \sum_i w_i x_i + b$ is supplied as the argument of a mathematical function called the activation function $f$. Its output $f(z)$ is fed to a neuron in the next layer as an input. Examples of activation functions are presented in Appendix (A). The computation any neuron receives basically depends on the incoming weights. It is important to note that the weights coming into a specific neuron generally differ from those coming into any other neuron in the same layer, resulting in a different computed input for each one. It is also worth mentioning that the bias is added to the weighted sum to modify the net input of the activation function: according to its sign, the net input is either increased or decreased. To make things clearer, consider the activation function to be a one-dimensional function $f(z)$. This function can be shifted by translation upon the addition of a constant $b$ to its argument, $f(z+b)$. According to the sign of $b$, the function is shifted to the left or right, allowing more flexibility in the choice of the value of the function and thus affecting its output as well. The bias plays the role of this constant (Haykin, 2009).

The preceding steps are repeated along each layer of the neural network, and thus information is processed through it. Starting from the input layer and passing through intermediate layers known as hidden layers, the process ends with the output layer, which holds the final desired results of the network, Fig.(3). Perhaps the simplest architecture of a neural network consists of an input layer, a hidden layer with a sufficient number of hidden neurons, and an output layer (Mehta et al., 2019). This structure demonstrates the universality property of neural networks, which states that any continuous function can be approximated arbitrarily well by such a network (Hornik et al., 1989; Nielsen, 2015).
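As a minimal sketch of this forward computation, consider a single hidden layer of two neurons feeding one output neuron; the weights, biases, and the choice of a sigmoid activation below are arbitrary illustrative values, not taken from the review:

```python
import math

def sigmoid(z):
    # A common activation function: squashes the net input into (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    # Weighted sum of inputs plus bias, passed through the activation.
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)

x = [0.5, -1.0, 2.0]                                   # input vector
hidden_weights = [[0.1, 0.4, -0.2], [-0.3, 0.2, 0.5]]  # weights into 2 hidden neurons
hidden_biases = [0.1, -0.1]
hidden = [neuron(x, w, b) for w, b in zip(hidden_weights, hidden_biases)]
output = neuron(hidden, [0.7, -0.6], 0.05)             # single output neuron
```

Each layer repeats the same weighted-sum-plus-activation pattern, which is all that "processing information through the network" amounts to.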
More complex architectures are described as deep ones. This refers to neural networks that contain multiple hidden layers. Such structures are used frequently in modern research for their representational power due to the increased number of layers, and therefore of parameters, i.e. the weights and biases (Bengio et al., 2013). Deep neural networks are able to learn more complex features from the input data. It is worth mentioning that the exact neural network architecture for a specific problem depends on several factors, two of which are the type and amount of data available and the task to be achieved. The choice of the number of hidden layers and the number of hidden neurons in each layer also alters the global performance of the network. In conclusion, a standard to abide by is that the number of parameters in a neural network should be neither so small that the model underfits nor so large that it overfits.
II.2 Training
As mentioned in the previous section, the output of an artificial neuron depends on the adjustment of its parameters together with the input data given to the neuron. However, in an artificial neural network composed of hundreds of interconnected neurons, setting all the corresponding parameters by hand is a very complicated task. Instead, adjusting the parameters of an artificial neural network occurs through a process called training or learning. The parameters of a neural network start from random initial values and, after training, reach optimal ones. The optimization is carried out with respect to a cost function that measures how close the output of a neural network is to the desired output for a specific input (Iten et al., 2018). This cost function must, in turn, be minimized, and this minimization is performed through an algorithm called gradient descent (Kriesel, 2007), discussed thoroughly in the following subsection.
II.2.1 Gradient descent
The cost function is a multivariable function that we aim to minimize during the learning process. It is a function of the parameters of the neural network, and these parameters are adjusted iteratively until a minimum cost is achieved. Learning parameters using gradient descent takes the following steps:

The parameters $\theta$ of the neural net are randomly initialized.

In each iteration and for each parameter $\theta_i$, the first-order gradient of the cost function, $\partial C / \partial \theta_i$, is computed.

The parameter is then updated by
$$\theta_i \leftarrow \theta_i - \eta \, \frac{\partial C}{\partial \theta_i}, \qquad (1)$$
where $\eta$ is the learning rate, a hyperparameter defining the step size of the update.

These steps are repeated for every iteration and the parameters are updated until the minimal cost is achieved.
As seen, the term gradient descent corresponds to descending along the gradient step by step, adjusting the parameters until convergence. The learning rate $\eta$ should be chosen so that the steps are neither too big, which can cause the algorithm to diverge, nor too small, which leads to slow convergence.
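The steps above can be sketched on a toy one-parameter problem. The cost $C(\theta) = (\theta - 3)^2$, with gradient $2(\theta - 3)$, and the learning rate below are arbitrary choices for illustration:

```python
def cost_gradient(theta):
    # Gradient of the illustrative cost C(theta) = (theta - 3)^2.
    return 2.0 * (theta - 3.0)

theta = 0.0            # step 1: initial value of the parameter
eta = 0.1              # learning rate: the step size of each update
for _ in range(200):   # steps 2-4: compute the gradient and update, Eq. (1)
    theta = theta - eta * cost_gradient(theta)
```

After enough iterations, `theta` settles at the minimum $\theta = 3$; with a much larger `eta` the same loop would overshoot and diverge, while a much smaller one would barely move per step.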
II.2.2 Stochastic Gradient Descent
The cost function is often encountered as a sum of sub-functions, for example
$$C = \frac{1}{N} \sum_{i=1}^{N} C_i, \qquad (2)$$
where $C_i$ may be the euclidean distance between the desired output and the prediction, i.e. $C_i = \lVert y_i - \hat{y}_i \rVert^2$, and $N$ is the total number of input data points.
The gradient of the cost function with respect to a weight is then the sum of the gradients of all sub-functions with respect to that weight. For a single step towards the minimum, the gradient is calculated over all $N$ points. This is time-consuming, especially if the number of data points is large. Stochastic gradient descent (SGD) (Nielsen, 2015) tackles this problem by taking only a random subset of the data to compute the gradient:
$$\nabla C \approx \frac{1}{m} \sum_{i \in B} \nabla C_i, \qquad (3)$$
where $B$ is termed the mini-batch and $m$ is its size. Using this gradient, the parameters are then updated. The steps are repeated, but with each iteration a new mini-batch is chosen at random.
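A minimal sketch of mini-batch SGD, assuming a toy linear model $\hat{y} = w x$ fit to synthetic data generated with $w = 2$; the data set, batch size, and learning rate are invented for illustration:

```python
import random

random.seed(0)
# Synthetic data generated by y = 2x, so the optimal weight is w = 2.
data = [(x, 2.0 * x) for x in (i / 100 for i in range(1, 101))]

w = 0.0            # parameter to learn, arbitrarily initialized
eta = 0.1          # learning rate
batch_size = 10
for _ in range(500):
    batch = random.sample(data, batch_size)   # a fresh random mini-batch, Eq. (3)
    # Gradient of the mean squared cost, averaged over the mini-batch only.
    grad = sum(2.0 * (w * x - y) * x for x, y in batch) / batch_size
    w -= eta * grad
```

Each update sees only 10 of the 100 points, so a single step is 10 times cheaper than full-batch gradient descent, yet the noisy updates still drive `w` toward 2.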
II.2.3 Adam
One arising problem in gradient descent and stochastic gradient descent is the need to specify the learning rate. If the rate is too high, the algorithm may diverge; if it is too low, the performance will be slow. Standing for adaptive moment estimation, the Adam algorithm introduced in (Kingma and Ba, 2015) takes a step towards solving this problem. The key idea of Adam is to compute a separate, adaptive step for each parameter of the neural network. It is an extension of the stochastic gradient descent method with excellent practical results.

Briefly, Adam works as follows. At each iteration $t$, the algorithm computes an estimate $m_t$ of the first moment of the gradient $g_t$ (the gradient as seen in SGD). The update rule for $m_t$ is:
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t. \qquad (4)$$
Another quantity $v_t$, an estimate of the second moment, i.e. of the squared gradient, is also computed at each iteration:
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2, \qquad (5)$$
where $\beta_1$ and $\beta_2$ are factors empirically found to work well at 0.9 and 0.999 respectively. Since the first and second moments are initially set to 0, they remain biased towards 0 after each iteration, especially since $1 - \beta_1$ and $1 - \beta_2$ are small. To correct this bias, a slight change is made:
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}. \qquad (6)$$
The update rule for the parameters at each iteration is then
$$\theta_t = \theta_{t-1} - \alpha \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}, \qquad (7)$$
where $\alpha$ is the step size and $\epsilon$ is a small constant added to avoid division by zero. This procedure is repeated until convergence.
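A minimal single-parameter sketch of the updates in Eqs. (4)-(7), again minimizing the illustrative cost $C(\theta) = (\theta - 3)^2$; the step size and iteration count are arbitrary choices:

```python
import math

def grad(theta):
    # Gradient of the illustrative cost C(theta) = (theta - 3)^2.
    return 2.0 * (theta - 3.0)

alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
theta, m, v = 0.0, 0.0, 0.0
for t in range(1, 2001):                     # t starts at 1 for bias correction
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g          # first-moment estimate, Eq. (4)
    v = beta2 * v + (1 - beta2) * g * g      # second-moment estimate, Eq. (5)
    m_hat = m / (1 - beta1 ** t)             # bias-corrected moments, Eq. (6)
    v_hat = v / (1 - beta2 ** t)
    theta -= alpha * m_hat / (math.sqrt(v_hat) + eps)   # update, Eq. (7)
```

Because the step is divided by $\sqrt{\hat{v}_t}$, its effective size adapts to the typical magnitude of recent gradients for each parameter separately.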
II.3 Learning Paradigms
Hopefully, the previous sections have given a good overview of what an artificial neural network is and how it operates. As repeatedly mentioned, a neural network, like any artificial machine, is meant to think and behave like a human, and it is trained to approach such a goal. Upon training, a neural network is first supplied with a set of data called the training set. It then adjusts its parameters, as described in section (II.2), to continually learn from this data. Whenever the parameters reach their optimal values, training stops, and the neural network reaches the desired accuracy. The neural network is now capable of generalizing and inferring new predictions about data of the same type that it did not encounter previously. Depending on the given training set, the processes through which a neural network learns differ. These are clarified in the following and can be easily generalized to any artificial machine (Haykin, 2009).
II.3.1 Supervised learning
Perhaps supervised learning is the simplest learning paradigm. An ANN is trained with a data set consisting of labeled data, i.e. data points augmented with labels. The neural network's role is to find a mapping between these pairs; upon linking each data point to its corresponding label, the data points are classified. When the training finishes, the neural network employs this mapping to assign labels to unseen data.
II.3.2 Unsupervised learning
In contrast to the preceding paradigm, this one is given unlabeled data, i.e. the labels of the data are not provided. The neural network must therefore find relationships among the data in order to cluster, i.e. group, them. The grouping can be done either by categorizing or by ordering. A sufficiently trained neural network uses the inferred clustering rule and applies it to data it did not process previously.
II.3.3 Reinforcement learning
It is important to note that reinforcement learning differs from the previous paradigms. This paradigm essentially consists of a learning agent interacting with its environment; the environment in reinforcement learning thus plays the same role as the data in the previous paradigms. The process of learning in this case is evaluative: the learning agent receives a reward whenever it performs an action in the environment it is put in. Therefore, the goal of the agent is to gain the maximum possible reward. One approach to modelling the environment is to characterize it as a Markov decision process, i.e. the environment is defined as a set of states. Reinforcement learning is discussed separately in the next section.
III Reinforcement Learning and Markov Decision Processes
Machines, like human beings, are able to move around in an environment and interact with it or with each other. However, their behaviors are of course not the same. Humans can interact adaptively and even intelligently with any environment they encounter, including any stochastic behavior, but such stochastic behaviors are troublesome for machines. Unlike previous attempts that directly engineer robots to accomplish specific tasks, robots are now made to behave independently, without human intervention.
The learning scheme for the robot is known as sequential decision making. The robot, or the agent as it is generally named, is left to take its own decisions sequentially in a series of time steps. The agent is thus both the decision maker and the learner. The agent performs actions and is rewarded based on the action performed at each step; in that way, the agent wanders through the environment. The idea of the reward is to inform the agent of how good or bad it is to take an action. The main goal is to increase the total reward as much as possible.
Markov decision process (MDP) is a fundamental formalism that deals with the agent's interaction with the environment (Sutton et al., 1998). It assumes that the environment is accessible, i.e. the agent knows exactly where it is in it. The formalism models the environment as a set of states in which the agent acts to improve its ability to behave optimally: it aims to figure out the best way to behave so that it achieves the required task in an optimal way. The agent's state is Markov, that is, it holds all the information sufficient to proceed; there is no need to check its history. The future is thus independent of past events.
The sequence of actions taken by the agent to reach the goal defines the policy followed. The MDP framework allows learning an optimal policy that maximizes a long-term reward upon reaching a goal starting from an initial state. To address this challenging goal, we first introduce all the components of MDP, then we head to discuss the two classes of algorithms that are used to compute the optimal behaviors: reinforcement learning and dynamic programming.
III.1 Components of MDP
Markov decision process (Sutton et al., 1998) is the formalism defined as a tuple $(S, A, T, R)$, where $S$ is a finite set of states, $A$ a finite set of actions, $T$ a transition function (probability), and $R$ a reward function.
States: As mentioned before, the environment is modelled as a finite set of states $S = \{s_1, \dots, s_N\}$, where the size of the state space is $|S| = N$. A state is a unique characterization of all the features sufficient to describe the problem being modelled. For example, the state in a chess game is a complete configuration of the board pieces, both black and white.

Actions: The set of actions is defined as the finite set $A = \{a_1, \dots, a_K\}$, where the size of the action space is $|A| = K$. Any action can be applied in a state to control it.

Transition Function: The transition function is defined as $T: S \times A \times S \to [0, 1]$. It defines a probability distribution over the set of all possible transitions, i.e. the conditional probability $T(s, a, s') = P(s' \mid s, a)$ of changing from a current state $s$ to a new state $s'$ when applying an action $a$. For all states $s$ and $s'$ and for all actions $a$, it is required that $T(s, a, s') \ge 0$. Furthermore, for all states $s$ and actions $a$, we have $\sum_{s'} T(s, a, s') = 1$. Based on this and the fact that the system is Markovian, we can ensure that $P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots) = P(s_{t+1} \mid s_t, a_t)$.
Reward Function: The state reward function is defined as $R: S \times A \to \mathbb{R}$. It specifies a reward, i.e. a scalar feedback signal, for being in a specific state and applying an action there. This scalar can be interpreted as negative (punishment) or positive (reward).
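To make the tuple $(S, A, T, R)$ concrete, here is a hypothetical two-state MDP encoded as plain Python dictionaries; the states, actions, probabilities, and rewards are invented purely for illustration:

```python
# A toy two-state MDP: states "cool" and "hot", actions "wait" and "work".
states = ["cool", "hot"]
actions = ["wait", "work"]

# T[(s, a)] maps each successor state s' to P(s' | s, a).
T = {
    ("cool", "wait"): {"cool": 1.0, "hot": 0.0},
    ("cool", "work"): {"cool": 0.5, "hot": 0.5},
    ("hot", "wait"):  {"cool": 0.8, "hot": 0.2},
    ("hot", "work"):  {"cool": 0.0, "hot": 1.0},
}

# R[(s, a)] is the scalar reward for applying action a in state s.
R = {
    ("cool", "wait"): 0.0,
    ("cool", "work"): 2.0,
    ("hot", "wait"):  0.0,
    ("hot", "work"):  -1.0,
}

# Sanity check: each transition distribution sums to one.
for dist in T.values():
    assert abs(sum(dist.values()) - 1.0) < 1e-12
```

Note that each row of `T` is a proper probability distribution over successor states, exactly the normalization condition stated above.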
III.2 Policy and Optimality
Given an MDP, i.e. knowing the set of states, actions, probabilities and rewards, a policy governs the action taken when present in a specific state, so the policy can be defined as a mapping $\pi: S \to A$. The policy thus controls the studied environment. There are two types of policies:

Deterministic policy, which specifies the action taken in state $s$: $a = \pi(s)$.

Stochastic policy, which defines a probability distribution over the actions: $\pi(a \mid s) = P(a \mid s)$. That is, it assigns probabilities to the actions that can be performed when present in state $s$.
Under a certain policy $\pi$ and starting with a state $s_0$, the policy suggests an action $a_0$ to move to a state $s_1$, and the agent receives a reward $r_1$ by making this transition. In this sense, the sequence under the policy is: $s_0, a_0, r_1, s_1, a_1, r_2, s_2, \dots$
Our main goal is to find the optimal policy, the policy that collects the maximum reward. It is important to note that the aim is not to maximize the immediate reward $r_{t+1}$, but rather the sum of all rewards collected during the task. These are expressed as a return function defined as:
$$G_t = r_{t+1} + r_{t+2} + \dots + r_T, \qquad (8)$$
where $T$ is the final step. This makes sense when the task has a limited number of steps; the return function will always converge. This model is known as finite-horizon. However, the interaction between the agent and the environment may be unlimited: the agent may continue moving from one state to another and gathering rewards without achieving the goal, and the return function then tends to infinity as more steps are taken. This infinite-horizon model is problematic. For this purpose, we introduce a discount factor $\gamma$ which discounts the rewards, and the discounted return function is then:
$$G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}, \qquad (9)$$
where $0 \le \gamma \le 1$. The discount factor determines the present importance of future rewards: a reward $r$ received $k$ steps in the future is worth only $\gamma^{k-1} r$ at present. It can be viewed as follows:

if $\gamma = 0$, then the agent is myopic and only cares about the immediate reward.

if $\gamma$ is close to 0, then the agent is nearsighted and cares mostly about the nearest coming rewards.

if $\gamma$ is close to 1, then the agent is farsighted and cares about future rewards.
The discount factor guarantees that the return function converges for a large number of steps. The return function also enjoys a recursive property:
$$G_t = r_{t+1} + \gamma G_{t+1}. \qquad (10)$$
The optimality criterion used to maximize the return function depends on the problem at hand.
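The discounted return and its recursion $G_t = r_{t+1} + \gamma G_{t+1}$ can be sketched as follows; the reward sequence and discount factor are arbitrary illustrative values:

```python
def discounted_return(rewards, gamma):
    """Return G_0 = r_1 + gamma*r_2 + gamma^2*r_3 + ... for a finite episode."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g   # the recursion G_t = r_{t+1} + gamma * G_{t+1}, Eq. (10)
    return g

rewards = [1.0, 0.0, 2.0, 5.0]   # hypothetical rewards collected in an episode
g0 = discounted_return(rewards, gamma=0.9)
```

Working backwards from the last reward applies Eq. (10) once per step, so the whole return is computed in a single linear pass; setting `gamma=0.0` recovers the myopic case, where only the immediate reward survives.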
III.3 Value Functions and Bellman Equations
The value function (Sutton et al., 1998) of a state estimates how good it is to be in this state, either in general or when taking a specific action. It depends on the future rewards to be gained starting from this state and following the policy. Value functions thus link optimality criteria to policies and are used to learn optimal policies.
A state-value function is the expected return when being in state $s$ under a particular policy $\pi$:
$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[ G_t \mid s_t = s \right]. \qquad (11)$$
A similar value function, denoted by $Q^{\pi}(s, a)$, can be defined as the value of being in state $s$, taking a specific action $a$, and thereafter following policy $\pi$:
$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ G_t \mid s_t = s, a_t = a \right]. \qquad (12)$$
The state-value functions satisfy a recursive relation:
$$V^{\pi}(s) = R(s, \pi(s)) + \gamma \sum_{s'} T(s, \pi(s), s') \, V^{\pi}(s'). \qquad (13)$$
This equation is known as the Bellman equation. It expresses the value function as the sum of the immediate reward and the values of all possible future states, weighted by their transition probabilities and a discount factor. The optimal state-value function is thus:
$$V^{*}(s) = \max_{\pi} V^{\pi}(s), \qquad (14)$$
and the optimal action-value function is:
$$Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a). \qquad (15)$$
In the same manner, the optimal state-value function has a recursive property:
$$V^{*}(s) = \max_{a} Q^{*}(s, a), \qquad (16)$$
$$V^{*}(s) = \max_{a} \Big[ R(s, a) + \gamma \sum_{s'} T(s, a, s') \, V^{*}(s') \Big]. \qquad (17)$$
This is known as the Bellman optimality equation. Finding $V^{*}$ or $Q^{*}$ is the cornerstone of finding the optimal policy, as will be seen in the following sections. To reach the optimal policy, several algorithms have been proposed. These algorithms are divided into two classes: model-based and model-free. Both classes work with the states and actions, but model-based algorithms are also supplied with the transition probabilities and rewards, whereas model-free algorithms are not. In the following sections, these two cases are detailed through their corresponding algorithms: the first is dynamic programming, which is model-based; the second is reinforcement learning, which is model-free.
III.4 Dynamic Programming
Dynamic programming (DP) (Sutton et al., 1998) is the category of algorithms that pursue an optimal policy given that the dynamics of the environment (transition probabilities and rewards) are completely supplied. Dynamic programming is thus a model-based approach for solving MDPs.
III.4.1 Reaching optimality: evaluation, improvement and iteration
Finding the optimal policy follows from obtaining the optimal value functions of the states, $V^{*}$ or $Q^{*}$, which satisfy Bellman's optimality equations Eq.(16, 17). The general idea is that DP algorithms find the optimal value functions by turning these equations into updates and then derive the optimal policy from the value functions. The path to the optimal policy thus mainly consists of two steps, evaluating then improving, repeated several times until the optimum is achieved.
Policy evaluation: We kick off by considering some policy $\pi$ where the dynamics of the environment are completely known. We aim to find the state-value functions under this policy. These values satisfy Bellman's equation Eq.(13). Solving this equation directly requires solving a system of $|S|$ equations in $|S|$ unknown state-value functions, where $|S|$ is the dimension of the state space, and this is tedious. One way around this problem is to turn it into an iterative one:

We start by initializing $V(s)$ for all states with arbitrary values, usually zero.

Using the Bellman equation, we evaluate the value functions for all states.

Having evaluated the functions in the first round, we repeat the evaluation over and over, with the Bellman equation now written as the update:
$$V_{k+1}(s) = R(s, \pi(s)) + \gamma \sum_{s'} T(s, \pi(s), s') \, V_k(s'), \qquad (18)$$
where $k$ represents the iteration. This means that the value of a state in the current iteration depends on the values of its successor states in the previous iteration.

We continue updating the values of the states by iterating until the current values no longer differ much from the previous ones, i.e.:
$$\max_{s} \left| V_{k+1}(s) - V_k(s) \right| < \epsilon. \qquad (19)$$

This is known as iterative policy evaluation. The final value obtained for each state under the given policy is then $V^{\pi}(s)$.
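The iterative loop above can be sketched on a hypothetical two-state MDP under a fixed policy; all states, transition probabilities, and rewards below are invented for illustration:

```python
# Iterative policy evaluation on a toy two-state MDP under a fixed policy.
states = ["cool", "hot"]
policy = {"cool": "work", "hot": "wait"}          # deterministic policy pi(s)
T = {("cool", "work"): {"cool": 0.5, "hot": 0.5},
     ("hot", "wait"):  {"cool": 0.8, "hot": 0.2}}
R = {("cool", "work"): 2.0, ("hot", "wait"): 0.0}
gamma = 0.9

V = {s: 0.0 for s in states}                      # arbitrary initialization
while True:
    # Bellman update, Eq. (18): V(s) <- R(s, pi(s)) + gamma * sum_s' T(...) V(s')
    V_new = {s: R[(s, policy[s])]
                + gamma * sum(p * V[s2] for s2, p in T[(s, policy[s])].items())
             for s in states}
    # Stop when the largest change falls below a threshold, Eq. (19).
    if max(abs(V_new[s] - V[s]) for s in states) < 1e-10:
        V = V_new
        break
    V = V_new
```

At convergence, `V` satisfies Eq. (13) for this policy, without ever solving the linear system directly.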
Policy improvement: After computing all the state-value functions under a certain policy $\pi$, we need to know whether being in a state and performing the action governed by $\pi$ is better or worse than performing an action governed by some other policy $\pi'$. In other words, once in state $s$, we perform an action $a = \pi'(s)$ and then continue with policy $\pi$. Is this better or worse? This is answered using the state-action value function with $a = \pi'(s)$:
$$Q^{\pi}(s, \pi'(s)) = R(s, \pi'(s)) + \gamma \sum_{s'} T(s, \pi'(s), s') \, V^{\pi}(s'). \qquad (20)$$
If $Q^{\pi}(s, \pi'(s))$ is in fact greater than $V^{\pi}(s)$, then choosing this action and following $\pi$ afterwards is better than following $\pi$ from the beginning. The new policy $\pi'$ is thus an improved policy. This is known as the policy improvement theorem: for two deterministic policies $\pi$ and $\pi'$, having
$$Q^{\pi}(s, \pi'(s)) \ge V^{\pi}(s) \quad \forall s \in S \qquad (21)$$
is indeed the same as having
$$V^{\pi'}(s) \ge V^{\pi}(s) \quad \forall s \in S. \qquad (22)$$
It is logical to sweep over all states and the actions available in each of them, and choose for each state the action that increases its value according to $Q^{\pi}$.
The policy that chooses the action increasing the value of a state is known as the greedy policy:
$$\pi'(s) = \arg\max_{a} Q^{\pi}(s, a), \qquad (23)$$
where $\arg\max_a$ selects the action that maximizes the action-value function $Q^{\pi}(s, a)$. Therefore, the greedy policy improves the value of each state by choosing a better action; this process of obtaining a greedy policy is the process of policy improvement. If $V^{\pi'} = V^{\pi}$, then both $\pi$ and $\pi'$ are optimal policies.
Note that if there are several actions that maximize the value function, then these actions must all be considered and given certain probabilities. This is the case where the policies aren’t deterministic but rather stochastic.
Policy iteration: Our main goal is to obtain the optimal policy, and as we’ve mentioned before it is the process of repeating two steps successively: policy evaluation and policy improvement. This is policy iteration. Starting with a policy $\pi_{0}$, we evaluate it, then improve it to get a policy $\pi_{1}$. Evaluating $\pi_{1}$ and then improving it yields $\pi_{2}$. Repeating the same process again and again converges to an optimal policy $\pi_{*}$, which is our goal. An example of starting with a random policy and ending up with an optimal one is illustrated in Fig.(3(a), 3(b)).
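The full evaluate-improve loop can be sketched end to end; the toy chain MDP, its reward scheme, and the sweep counts below are illustrative assumptions:

```python
# Policy iteration sketch: alternate policy evaluation and greedy improvement
# until the policy is stable. Toy deterministic chain MDP; all specifics
# (states, rewards, sweep count) are illustrative assumptions.
n_states, gamma = 5, 0.9
actions = (-1, 1)                      # move left / move right
goal = n_states - 1

def step(s, a):
    """Deterministic model: reward 1 only on reaching the absorbing goal."""
    if s == goal:
        return s, 0.0
    s2 = min(max(s + a, 0), n_states - 1)
    return s2, 1.0 if s2 == goal else 0.0

policy = {s: -1 for s in range(n_states)}   # deliberately poor initial policy

while True:
    # --- policy evaluation: sweep until (effectively) converged ---
    V = {s: 0.0 for s in range(n_states)}
    for _ in range(500):
        newV = {}
        for s in range(n_states):
            s2, r = step(s, policy[s])
            newV[s] = r + gamma * V[s2]
        V = newV
    # --- greedy policy improvement ---
    new_policy = {}
    for s in range(n_states):
        q = {a: step(s, a)[1] + gamma * V[step(s, a)[0]] for a in actions}
        new_policy[s] = max(q, key=q.get)
    if new_policy == policy:                # policy stable => optimal
        break
    policy = new_policy

print(policy)  # every state left of the goal now moves right
```

At the absorbing goal state both actions are equivalent, so the tie-break there is arbitrary; what matters is that all other states end up moving toward the goal.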
iii.4.2 Value iteration
Policy iteration is costly since it requires sweeping over the set of states several times during the policy evaluation step. Can this step be reduced to a single sweep, i.e. can we replace the iterations needed to obtain the value with only one step?
This is the process of value iteration. Instead of sweeping the whole set of states several times to obtain the value and then looking for the best action, we do both at once using an updated Bellman equation:
(24) $v_{k+1}(s) = \max_{a}\sum_{s',r}p(s',r|s,a)\left[r+\gamma v_{k}(s')\right]$
Policy evaluation is still present, but it now only requires taking the action that maximizes the value. Value iteration thus joins policy evaluation and improvement into one step, making the convergence to the optimal policy faster. It is worth mentioning that some sweeps may use value iteration while others still use policy evaluation, but the end result is always an optimal policy.
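A minimal sketch of value iteration, with the greedy policy read off once the values converge; the toy chain MDP is an illustrative assumption:

```python
# Value iteration sketch: the max over actions is folded directly into each
# sweep. Toy deterministic chain MDP; all specifics are illustrative assumptions.
n_states, gamma, theta = 5, 0.9, 1e-10
goal = n_states - 1

def step(s, a):
    """Deterministic model: reward 1 only on reaching the absorbing goal."""
    if s == goal:
        return s, 0.0
    s2 = min(max(s + a, 0), n_states - 1)
    return s2, 1.0 if s2 == goal else 0.0

V = [0.0] * n_states
while True:
    delta, newV = 0.0, [0.0] * n_states
    for s in range(n_states):
        # v_{k+1}(s) = max_a [ r + gamma * v_k(s') ]  (deterministic model)
        newV[s] = max(r + gamma * V[s2] for s2, r in (step(s, a) for a in (-1, 1)))
        delta = max(delta, abs(newV[s] - V[s]))
    V = newV
    if delta < theta:          # same stopping rule as iterative evaluation
        break

# Read off the greedy (optimal) policy once the values have converged
policy = [max((-1, 1), key=lambda a: step(s, a)[1] + gamma * V[step(s, a)[0]])
          for s in range(n_states)]
print(V, policy)
```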
iii.4.3 Asynchronous dynamic programming
As discussed, sweeping over a large set of states is very costly, even for just one sweep. Asynchronous DP doesn’t sweep over the whole set of states but rather over a subset in each sweep. The value of one state is updated using whatever values of the other states are available; one state may be updated several times while another is updated just once or twice. Asynchronous DP allows flexibility in choosing which states are updated at each step, under the condition that by the end of the whole process every relevant state has been updated and not completely ignored. Some states require frequent updates whereas others need updating only every now and then; states that are irrelevant to reaching optimality can be skipped altogether.
iii.4.4 Generalized policy iteration
In the preceding sections we saw how policy iteration leads us to the optimal policy. It consists of two steps, policy evaluation and policy improvement, where one step doesn’t start until the previous one has terminated. Other processes exist to make policy iteration more efficient, such as value iteration and asynchronous dynamic programming.
Generalized policy iteration (GPI) describes the interplay of policy evaluation and policy improvement whether these other processes are present or not. The whole idea, as previously explained, is that the current policy is evaluated and then improved according to a better value function. Evaluation and improvement thus interact, and one drives the other. All model-based and model-free algorithms depend on GPI. Once evaluation and improvement produce no further change, the optimal policy is reached.
iii.5 Reinforcement Learning
The previous section discussed dynamic programming, a model-based algorithm that assumes all the transition and reward functions are given in order to compute the optimal policy. When such a model is not available, reinforcement learning steps in. It builds statistical knowledge of the unknown model by generating samples of state transitions and rewards. Sampling occurs through the agent’s interaction with the environment: it performs actions to learn the optimal policy by trial and error. An important aspect must be highlighted here, namely the need for the agent’s exploration of the environment. The agent must keep trying different actions in search of better ones and not only exploit its current knowledge about good actions. Several exploration strategies exist. The most basic one is known as the $\epsilon$-greedy policy: the agent chooses its current best action with probability $1-\epsilon$, and any other action is taken randomly with probability $\epsilon$.
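The exploration rule just described can be sketched in a few lines; the action-value table and the value of epsilon are illustrative assumptions:

```python
import random

# epsilon-greedy selection: exploit the current best action with probability
# 1 - epsilon, explore a random action otherwise. The action-value table and
# epsilon below are illustrative assumptions.

def epsilon_greedy(q_values, epsilon, rng=random):
    """q_values: mapping action -> current value estimate."""
    if rng.random() < epsilon:
        return rng.choice(list(q_values))        # explore
    return max(q_values, key=q_values.get)       # exploit

random.seed(0)
q = {"left": 0.1, "right": 0.7, "stay": 0.3}
picks = [epsilon_greedy(q, epsilon=0.1) for _ in range(1000)]
print(picks.count("right") / len(picks))  # close to 1 - epsilon + epsilon/3
```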
Reinforcement learning can be solved indirectly. This occurs by learning the transition and reward functions from interaction with the environment and building up an approximate model of the MDP. All the dynamics of the system, i.e. state values and state-action values, can then be deduced using the DP methods mentioned previously. Another option is to estimate the values of states and actions directly, without estimating a model of the MDP at all. This option is known as direct reinforcement learning, and it is the choice taken in model-free contexts. Methods of this kind include temporal-difference learning, Q-learning (Sutton et al., 1998) and SARSA (State-Action-Reward-State-Action) (Graepel et al., 2004).
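As a sketch of direct, model-free learning, the tabular Q-learning update can be written as follows; the hidden two-state dynamics, learning rate, and step count are illustrative assumptions:

```python
import random

# Tabular Q-learning sketch: learn Q(s, a) directly from sampled transitions,
# with an epsilon-greedy behaviour policy. The hidden two-state dynamics,
# learning rate, and step count are illustrative assumptions.
random.seed(1)
gamma, alpha, epsilon = 0.9, 0.1, 0.1
states, actions = (0, 1), ("stay", "go")

def env_step(s, a):
    """Dynamics the agent can only sample, never inspect: 'go' toggles the state."""
    s2 = 1 - s if a == "go" else s
    r = 2.0 if (s == 1 and a == "stay") else (1.0 if s2 == 1 else 0.0)
    return s2, r

Q = {(s, a): 0.0 for s in states for a in actions}
s = 0
for _ in range(20000):
    if random.random() < epsilon:                       # explore
        a = random.choice(actions)
    else:                                               # exploit
        a = max(actions, key=lambda act: Q[(s, act)])
    s2, r = env_step(s, a)
    # Q-learning update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    best_next = max(Q[(s2, act)] for act in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    s = s2

greedy = {st: max(actions, key=lambda act: Q[(st, act)]) for st in states}
print(greedy)  # learned behaviour: 'go' to state 1, then 'stay'
```

No transition model is ever consulted: the agent only sees the sampled `(s, a, r, s')` tuples, which is exactly the model-free setting described above.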
We detail in the next section the trust region policy optimization (TRPO) algorithm, a reinforcement learning algorithm that improves the policy iteratively with cautious step sizes. An example of using TRPO efficiently follows the algorithm.
iii.5.1 Trust region policy optimization
Another method to reach the optimal policy is the Trust Region Policy Optimization algorithm (Schulman et al., 2015). This algorithm outperforms other policy improvement algorithms because it specifies a trust region for the improvement step, i.e. it takes the largest step that can be trusted. In general, taking large steps is very risky and taking small steps makes the process very slow. TRPO solves this problem by defining a trust region within which the best steps can be taken without a collapse of the improvement process.
As explained in (Schulman et al., 2015)
, the procedure will start off by monotonically improving the policy through minimizing a certain loss function, then introducing approximations that are the core of the practical TRPO algorithm.
Consider an infinite-horizon MDP where $\rho_{0}$ is the probability distribution of the initial state $s_{0}$. Recall the functions defined for an MDP:

The state-action value function:
(25) $Q_{\pi}(s_t,a_t) = \mathbb{E}_{s_{t+1},a_{t+1},\dots}\left[\sum_{l=0}^{\infty}\gamma^{l}r(s_{t+l})\right]$
the value function:
(26) $V_{\pi}(s_t) = \mathbb{E}_{a_t,s_{t+1},\dots}\left[\sum_{l=0}^{\infty}\gamma^{l}r(s_{t+l})\right]$
and the expected rewards:
(27) $\eta(\pi) = \mathbb{E}_{s_0,a_0,\dots}\left[\sum_{t=0}^{\infty}\gamma^{t}r(s_t)\right]$
A new function that quantifies how well an action performs compared to the average action is defined as

The advantage function:
(28) $A_{\pi}(s,a) = Q_{\pi}(s,a) - V_{\pi}(s)$
Given two stochastic policies $\pi$ and $\tilde{\pi}$, we can express the expected rewards following the policy $\tilde{\pi}$ as:
(29) $\eta(\tilde{\pi}) = \eta(\pi) + \mathbb{E}_{\tau\sim\tilde{\pi}}\left[\sum_{t=0}^{\infty}\gamma^{t}A_{\pi}(s_t,a_t)\right]$
Let’s prove this formula as done in (Kakade and Langford, 2002). Start with the expected discounted advantage of $\tilde{\pi}$ over $\pi$, using $A_{\pi}(s_t,a_t)=\mathbb{E}_{s_{t+1}}\left[r(s_t)+\gamma V_{\pi}(s_{t+1})-V_{\pi}(s_t)\right]$, so that the sum telescopes:
(30) $\mathbb{E}_{\tau\sim\tilde{\pi}}\left[\sum_{t=0}^{\infty}\gamma^{t}A_{\pi}(s_t,a_t)\right] = \mathbb{E}_{\tau\sim\tilde{\pi}}\left[\sum_{t=0}^{\infty}\gamma^{t}\left(r(s_t)+\gamma V_{\pi}(s_{t+1})-V_{\pi}(s_t)\right)\right] = \mathbb{E}_{\tau\sim\tilde{\pi}}\left[-V_{\pi}(s_0)+\sum_{t=0}^{\infty}\gamma^{t}r(s_t)\right] = -\eta(\pi)+\eta(\tilde{\pi})$
The result is then as in Eq.(29). Before continuing with the explanation, let us briefly discuss visitation frequencies (Si et al., 2004). The state visitation frequency is the (discounted) distribution of the probability of passing through a certain state while following a specific policy. For a state $s$ it is defined as:
(31) $\rho_{\pi}(s) = P(s_0=s) + \gamma P(s_1=s) + \gamma^{2}P(s_2=s) + \cdots$
where the first term is the probability of encountering the state at the first time step, the second term the probability at the second time step, and so on. Keep in mind that visitation frequencies change heavily when the policy changes.
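The definition can be checked with a quick Monte Carlo estimate; the two-state random-toggle dynamics, horizon, and episode count are illustrative assumptions:

```python
import random

# Monte Carlo estimate of the discounted visitation frequency:
#   rho(s) = P(s0 = s) + gamma * P(s1 = s) + gamma^2 * P(s2 = s) + ...
# The two-state random-toggle dynamics, horizon, and episode count are
# illustrative assumptions.
random.seed(0)
gamma, horizon, n_episodes = 0.9, 60, 5000

rho = {0: 0.0, 1: 0.0}
for _ in range(n_episodes):
    s = 0                                   # fixed start: P(s0 = 0) = 1
    for t in range(horizon):
        rho[s] += gamma ** t / n_episodes   # accumulate gamma^t * 1[s_t = s]
        if random.random() < 0.5:           # policy + dynamics: toggle half the time
            s = 1 - s

# With a fair toggle, P(s_t = 0) = P(s_t = 1) = 1/2 for all t >= 1, so the two
# frequencies should differ by roughly the t = 0 term alone, i.e. by about 1.
print(rho)
```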
Starting once more with the advantage, and rewriting the sum over time steps as a sum over states:
(32) $\mathbb{E}_{\tau\sim\tilde{\pi}}\left[\sum_{t=0}^{\infty}\gamma^{t}A_{\pi}(s_t,a_t)\right] = \sum_{t=0}^{\infty}\sum_{s}\gamma^{t}P(s_t=s\,|\,\tilde{\pi})\sum_{a}\tilde{\pi}(a|s)A_{\pi}(s,a) = \sum_{s}\rho_{\tilde{\pi}}(s)\sum_{a}\tilde{\pi}(a|s)A_{\pi}(s,a)$
Combining this result with Eq.(29), we end up with
(33) $\eta(\tilde{\pi}) = \eta(\pi) + \sum_{s}\rho_{\tilde{\pi}}(s)\sum_{a}\tilde{\pi}(a|s)A_{\pi}(s,a)$
This equation implies that a nonnegative sum of expected advantages guarantees that the expected rewards do not decrease when updating from policy $\pi$ to policy $\tilde{\pi}$, and a positive sum makes $\tilde{\pi}$ an improved policy. If the sum of the expected advantages is zero, then the optimal policy has been reached and the performance stays constant, $\eta(\tilde{\pi})=\eta(\pi)$.
As mentioned before, the policy could be deterministic, in which case improvement is guaranteed if at least one advantage is positive at a state with nonzero visitation frequency. However, if the policy is stochastic and we work in an approximate regime, then due to the inevitable estimation errors there may be negative advantages. Moreover, the dependence of the visitation frequency $\rho_{\tilde{\pi}}$ on the new policy makes it really tedious to solve the optimization of Eq.(33). For that reason, it is easier to use $\rho_{\pi}$ instead of $\rho_{\tilde{\pi}}$ in the optimization equation. This substitution is valid if the update from $\pi$ to $\tilde{\pi}$ is small enough that the changes in the visitation frequencies can be ignored. Then, instead of Eq.(33), use
(34) $L_{\pi}(\tilde{\pi}) = \eta(\pi) + \sum_{s}\rho_{\pi}(s)\sum_{a}\tilde{\pi}(a|s)A_{\pi}(s,a)$
If the policy is parameterized and differentiable with respect to a parameter $\theta$, then for the current policy $\pi_{\theta_0}$ we have
(35) $L_{\pi_{\theta_0}}(\pi_{\theta_0}) = \eta(\pi_{\theta_0}), \qquad \nabla_{\theta}L_{\pi_{\theta_0}}(\pi_{\theta})\big|_{\theta=\theta_0} = \nabla_{\theta}\eta(\pi_{\theta})\big|_{\theta=\theta_0}$
This equation shows that a sufficiently small step from $\pi_{\theta_0}$ to $\tilde{\pi}$ that increases $L$ will also increase $\eta$, but it doesn’t specify how good a step is. Recall that it’s quite risky to take large steps and very slow to take small ones, so we must specify how big a step to take.
In the work of (Kakade and Langford, 2002), this issue was solved for mixture policy updates by the following lower bound:
(36) $\eta(\pi_{\mathrm{new}}) \ge L_{\pi_{\mathrm{old}}}(\pi_{\mathrm{new}}) - \frac{2\epsilon\gamma}{(1-\gamma)^{2}}\alpha^{2}$
where $\alpha$ is the mixing coefficient of the update and $\epsilon$ bounds the expected advantage.
This guarantees that increasing the right-hand side will surely increase the expected rewards under the new policy, thus improving the policy. To tackle general stochastic policies, $\alpha$ is replaced by a distance measure between the two policies such as the KL-divergence; more specifically, the maximum KL-divergence over states, $D_{KL}^{\max}(\pi,\tilde{\pi}) = \max_{s}D_{KL}\left(\pi(\cdot|s)\,\|\,\tilde{\pi}(\cdot|s)\right)$, is taken to lower the bound further. Using $D_{KL}^{\max}$, the bound becomes
(37) $\eta(\tilde{\pi}) \ge L_{\pi}(\tilde{\pi}) - C\,D_{KL}^{\max}(\pi,\tilde{\pi})$
where $C = \frac{4\epsilon\gamma}{(1-\gamma)^{2}}$ and $\epsilon = \max_{s,a}\left|A_{\pi}(s,a)\right|$. For the sake of simplicity, define the surrogate function $M_{i}$ such that
(38) $M_{i}(\pi) = L_{\pi_{i}}(\pi) - C\,D_{KL}^{\max}(\pi_{i},\pi)$
where $\pi_{i}$ is the current policy and $\pi_{i+1}$ is the new one. So,
(39) $\eta(\pi_{i+1}) \ge M_{i}(\pi_{i+1})$
and, $\eta(\pi_{i}) = M_{i}(\pi_{i})$,
then, $\eta(\pi_{i+1}) - \eta(\pi_{i}) \ge M_{i}(\pi_{i+1}) - M_{i}(\pi_{i})$.
By maximizing the surrogate function $M_{i}$ at each iteration, we are guaranteed a monotonically increasing improvement of the policies, i.e. $\eta(\pi_{0}) \le \eta(\pi_{1}) \le \eta(\pi_{2}) \le \dots$, until the optimum is reached. This algorithm is called the Minorization-Maximization algorithm, where minorization corresponds to the fact that $M_{i}$ is a lower bound of $\eta$, and maximization is quite obvious.
But we still haven’t specified how big of a step to take. To do that, the trust region policy optimization algorithm (Schulman et al., 2015) is now presented as a practical approximation to the theoretical MinorizationMaximization algorithm.
Recall that the policies may be parameterized by some parameter $\theta$. To make things a bit simpler, define the following notations as used by the authors in (Schulman et al., 2015):
(40) $\eta(\theta) := \eta(\pi_{\theta}), \qquad L_{\theta}(\tilde{\theta}) := L_{\pi_{\theta}}(\pi_{\tilde{\theta}}), \qquad D_{KL}(\theta\,\|\,\tilde{\theta}) := D_{KL}(\pi_{\theta}\,\|\,\pi_{\tilde{\theta}})$
Denoting by $\theta_{\mathrm{old}}$ the old parameters to be improved, the bound becomes
(41) $\eta(\theta) \ge L_{\theta_{\mathrm{old}}}(\theta) - C\,D_{KL}^{\max}(\theta_{\mathrm{old}},\theta)$
The main goal can then be summarized as
(42) $\max_{\theta}\left[L_{\theta_{\mathrm{old}}}(\theta) - C\,D_{KL}^{\max}(\theta_{\mathrm{old}},\theta)\right]$
However, using the penalty coefficient $C$ as above leads to small steps and thus a slow rate of improvement. A constraint must be put on the KL-divergence instead, in order to take larger steps but not so large as to cause collapses. The condition to satisfy now becomes:
(43) $\max_{\theta} L_{\theta_{\mathrm{old}}}(\theta) \quad \text{subject to} \quad D_{KL}^{\max}(\theta_{\mathrm{old}},\theta) \le \delta$
Eqs.(42) and (43) are related by Lagrangian duality: the constraint may be integrated back into the objective using a multiplier. The constraint implies that the KL-divergence is bounded at every state, which is tedious to work with due to the large number of constraints. A way to avoid this problem is to bound the average KL-divergence instead. The condition thus becomes:
(44) $\max_{\theta} L_{\theta_{\mathrm{old}}}(\theta) \quad \text{subject to} \quad \bar{D}_{KL}^{\rho_{\theta_{\mathrm{old}}}}(\theta_{\mathrm{old}},\theta) \le \delta$
where $\bar{D}_{KL}^{\rho}(\theta_{1},\theta_{2}) := \mathbb{E}_{s\sim\rho}\left[D_{KL}\left(\pi_{\theta_{1}}(\cdot|s)\,\|\,\pi_{\theta_{2}}(\cdot|s)\right)\right]$. Recall that
(45) $L_{\theta_{\mathrm{old}}}(\theta) = \eta(\theta_{\mathrm{old}}) + \sum_{s}\rho_{\theta_{\mathrm{old}}}(s)\sum_{a}\pi_{\theta}(a|s)A_{\theta_{\mathrm{old}}}(s,a)$
Since $\eta(\theta_{\mathrm{old}})$ is constant, the condition in Eq.(44) simplifies to:
(46) $\max_{\theta}\sum_{s}\rho_{\theta_{\mathrm{old}}}(s)\sum_{a}\pi_{\theta}(a|s)A_{\theta_{\mathrm{old}}}(s,a) \quad \text{subject to} \quad \bar{D}_{KL}^{\rho_{\theta_{\mathrm{old}}}}(\theta_{\mathrm{old}},\theta) \le \delta$
Two replacements are now introduced. First, the sum over states is replaced by an expectation over the visitation frequency, and the advantage by the state-action value (which shifts the objective only by a constant):
(47) $\sum_{s}\rho_{\theta_{\mathrm{old}}}(s)\left[\,\cdots\right] \to \frac{1}{1-\gamma}\,\mathbb{E}_{s\sim\rho_{\theta_{\mathrm{old}}}}\left[\,\cdots\right], \qquad A_{\theta_{\mathrm{old}}} \to Q_{\theta_{\mathrm{old}}}$
Second, importance sampling (Neal, 2001) is used for the sum over actions at a single state $s_{n}$:
(48) $\sum_{a}\pi_{\theta}(a|s_{n})A_{\theta_{\mathrm{old}}}(s_{n},a) = \mathbb{E}_{a\sim q}\left[\frac{\pi_{\theta}(a|s_{n})}{q(a|s_{n})}A_{\theta_{\mathrm{old}}}(s_{n},a)\right]$
where $q$ is another, simpler distribution and $\frac{\pi_{\theta}(a|s_{n})}{q(a|s_{n})}$ is known as the sampling weight. In the TRPO context, $q$ is $\pi_{\theta_{\mathrm{old}}}$. Introducing these replacements, the condition thus becomes:
(49) $\max_{\theta}\ \mathbb{E}_{s\sim\rho_{\theta_{\mathrm{old}}},\,a\sim q}\left[\frac{\pi_{\theta}(a|s)}{q(a|s)}Q_{\theta_{\mathrm{old}}}(s,a)\right] \quad \text{subject to} \quad \mathbb{E}_{s\sim\rho_{\theta_{\mathrm{old}}}}\left[D_{KL}\left(\pi_{\theta_{\mathrm{old}}}(\cdot|s)\,\|\,\pi_{\theta}(\cdot|s)\right)\right] \le \delta$
By solving this constrained problem, with $\delta$ defined based on the problem at hand, the expected rewards increase at each iteration, guaranteeing a better policy until the optimum is reached.
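For a single state and a categorical policy, the constrained problem can be illustrated numerically: compute the importance-weighted surrogate for candidate policies and keep only those inside the KL trust region. The action values, the candidate policies, and the radius delta below are illustrative assumptions:

```python
import math

# Toy illustration of a KL-constrained surrogate maximization at one state with
# a categorical policy. Action values, candidates, and delta are illustrative
# assumptions, not from any experiment.

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def surrogate(new, old, q_values):
    # E_{a ~ old}[ (new(a)/old(a)) * Q(a) ], with sampling distribution q = old
    return sum(po * (pn / po) * qv for pn, po, qv in zip(new, old, q_values))

pi_old = [0.5, 0.3, 0.2]
q_vals = [1.0, 2.0, 0.5]            # Q_old(s, a) for the three actions
delta = 0.05                        # trust-region radius

candidates = [
    [0.4, 0.45, 0.15],              # modest shift toward the high-Q action
    [0.05, 0.9, 0.05],              # aggressive shift, far outside the region
]
best, best_val = pi_old, surrogate(pi_old, pi_old, q_vals)
for cand in candidates:
    if kl(pi_old, cand) <= delta:   # enforce KL(old || new) <= delta
        val = surrogate(cand, pi_old, q_vals)
        if val > best_val:
            best, best_val = cand, val

print(best, best_val)  # the modest shift wins; the aggressive one is rejected
```

The aggressive candidate would score a higher surrogate value, but its KL-divergence from the old policy exceeds the trust region, so it is discarded; this is exactly the collapse-avoidance behaviour described above.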
iii.5.2 Divide and conquer reinforcement learning
Finding the optimal policy for highly stochastic environments is a main challenge in reinforcement learning. High stochasticity means a wide diversity of initial states and goals, and thus a tedious learning process. With TRPO presented, the work of (Ghosh et al., 2017) applied the algorithm to, for example, training a robotic arm to pick up a block and place it in different positions. The idea behind their strategy is the following:

Slice the initial statespace into distinct slices.

Train each slice to find the corresponding optimal policy.

Merge the policies into a single optimal one that describes the entire space.
This strategy is named Divide-and-Conquer (DnC) reinforcement learning and is efficient for tasks with high diversity. Let’s describe this algorithm.
Consider an MDP described as a tuple $(S, A, P, R)$. This MDP is modified to fit the "slicing" strategy, i.e. auxiliary variables called contexts are introduced. The initial state set $S_{0}$ is partitioned as $S_{0} = \cup_{i} S_{i}$, and each partition $S_{i}$ is associated with a context $\omega_{i}$. Slicing can be done using k-means clustering (Appendix B), for example. By that, $\rho(\omega, s_{0})$ becomes the joint probability distribution over contexts and initial states. Based on this slicing, the MDP extends into two:
Context-restricted MDP $M_{\omega}$: given the context $\omega$, we find the corresponding policy $\pi_{\omega}(a|s)$.

Augmented MDP $M'$: each state is accompanied by a context, giving the tuple $(s, \omega)$, and the stochastic policy in this MDP is the family of context-restricted policies, i.e. $\pi(a|s,\omega) = \pi_{\omega}(a|s)$.
Finding the optimal policies in the context-restricted MDPs amounts to finding the optimal policy in the augmented MDP $M'$. Once the policy in the augmented MDP is found (whether optimal or not), the central policy in the original, context-free MDP can be found from the family of local policies $\pi_{\omega}$. The main condition presented in this work is that the local policy in one context should generalize to other contexts; this accelerates finding the global policy for the original MDP, which is context-independent. Here is where TRPO kicks in. To find the optimal central policy, the augmented policy must maximize the expected rewards while keeping the local policies mutually consistent; specifically, it maximizes
(50) $\mathbb{E}_{\omega}\left[\eta(\pi_{\omega})\right] - \alpha\,\mathbb{E}_{\omega,\omega'}\left[D_{KL}\left(\pi_{\omega}\,\|\,\pi_{\omega'}\right)\right]$
where $\alpha$ is the multiplier integrating the condition back into the objective. Following the TRPO regime for specifying the trust region, the policies in two respective contexts $\omega_{i}$ and $\omega_{j}$ should share as much information as possible:
(51) $\mathbb{E}_{\omega_{i},\omega_{j}}\,\mathbb{E}_{s\sim\rho_{\pi_{\omega_{i}}}}\left[D_{KL}\left(\pi_{\omega_{i}}(\cdot|s)\,\|\,\pi_{\omega_{j}}(\cdot|s)\right)\right] \le \delta$
In the work of (Ghosh et al., 2017), instead of maximizing the surrogate function as in Eq.(42), they consider it as a loss (multiplying by a minus sign) and aim to minimize it. The surrogate loss is then the negative of the objective above:
(52) $\mathcal{L}(\pi) = -\mathbb{E}_{\omega}\left[\eta(\pi_{\omega})\right] + \alpha\,\mathbb{E}_{\omega,\omega'}\left[D_{KL}\left(\pi_{\omega}\,\|\,\pi_{\omega'}\right)\right]$
Each policy is trained to take actions in its own context, but it is trained with data from the other contexts as well to ensure generalization. That being said, the surrogate loss for a single policy $\pi_{\omega_{i}}$ collects the importance-weighted reward term and the KL terms involving $\pi_{\omega_{i}}$:
(53) $\mathcal{L}(\pi_{\omega_{i}}) = -\mathbb{E}_{s,a\sim\pi_{\omega_{i}}}\left[\frac{\pi_{\omega_{i}}(a|s)}{\pi_{\omega_{i}}^{\mathrm{old}}(a|s)}Q_{\omega_{i}}(s,a)\right] + \alpha\,\mathbb{E}_{\omega_{j}}\,\mathbb{E}_{s\sim\rho_{\pi_{\omega_{j}}}}\left[D_{KL}\left(\pi_{\omega_{i}}(\cdot|s)\,\|\,\pi_{\omega_{j}}(\cdot|s)\right)\right]$
The steps for finding the optimal central policy are thus as follows. Within each context $\omega$, the local policy $\pi_{\omega}$ is enhanced using the surrogate loss at each iteration. After repeating this optimization for several iterations, the central policy $\pi_{c}$ is found from the local policies by minimizing the KL-divergence of Eq.(51), which simplifies to the supervised objective
(54) $\pi_{c} = \arg\max_{\pi}\ \mathbb{E}_{\omega}\,\mathbb{E}_{s,a\sim\pi_{\omega}}\left[\log \pi(a|s)\right]$
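The merge step can be illustrated for categorical policies: at a fixed state, minimizing the summed KL-divergence from the local policies to the central one yields their average. The three local distributions below are illustrative assumptions:

```python
import math

# Merging context-restricted categorical policies into a central one at a fixed
# state: the minimizer of sum_i KL(pi_i || pi_c) over normalized pi_c is the
# average of the local policies. Local distributions are illustrative assumptions.

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

local_policies = [          # pi_omega(a|s) for three contexts at the same state
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.5, 0.2, 0.3],
]
n = len(local_policies)
central = [sum(p[a] for p in local_policies) / n for a in range(3)]

# Sanity check: the average attains a lower summed KL than a perturbed alternative
alternative = [central[0] + 0.05, central[1] - 0.05, central[2]]
avg_kl = sum(kl(p, central) for p in local_policies)
alt_kl = sum(kl(p, alternative) for p in local_policies)
print(central, avg_kl, alt_kl)
```

This is only the distributional core of the merge; in the paper the central policy is a parameterized network trained on data from all contexts rather than a per-state average.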
The TRPO algorithm, together with the constraints introduced here, made it possible to solve a highly stochastic MDP with diverse initial states and goals. The experimental work in the Divide-and-Conquer paper shows that this algorithm outperforms other RL algorithms.
IV SciNet: A Physics Machine
We now introduce an example of a machine learning technique that empowers physics using neural networks. Existing approaches usually feed experimental data to neural networks in order to have them come up with the theory explaining the data. However, most techniques impose constraints on, say, the space of initial states or the space of mathematical expressions. More specifically, these techniques incorporate our physical intuition into the neural network, so they mainly test the network’s efficiency and learnability rather than its ability to output theories from scratch. In the work of (Iten et al., 2018), this problem is tackled by constructing a neural network, named SciNet, on which no constraints or prior knowledge are imposed. SciNet must as well output the parameters that describe the physical setting wholly and sufficiently. The idea presented in this work is as follows:

Supplying SciNet with experimental data,

SciNet finds a simple representation of the data,

then a question is asked for SciNet to answer.
SciNet must be able to answer the question using only the representation it produced, without going back to the input data. These steps are approached using two models:

Encoder: The encoder structure is made of one or more neural networks. It takes the observations (experimental data) $o$ and encodes them into representations $r$, named latent representations in the machine learning context. The mapping is thus $o \mapsto r$.

Decoder: The decoder structure is also a neural network. It takes as inputs the latent (hidden) representations $r$ produced by the encoder, as well as the question $q$ to be answered. It outputs the answer to the question. The mapping is thus $(r, q) \mapsto a$.
Fig.(4) illustrates the encoder and decoder networks.
SciNet’s encoder and decoder are trained with a chosen training set of observations and questions, and then tested with a chosen test set to assess prediction accuracy. Note that since we don’t know or impose the number of latent neurons (those storing the latent representation) in advance, prediction accuracy may be low due to insufficient latent neurons. In that case, during the training phase, the number of latent neurons may be adjusted to fit the representations.
As a simple example, suppose SciNet is fed observations of the variation of the electric potential as a function of current, governed by Ohm’s law. But SciNet has no idea what Ohm’s law is; it only sees the introduced observations. The encoder will find a representation for these observations, namely the resistance $R$, and store it in a latent neuron. Supplied with this representation and a question such as what the potential will be for a given current, the decoder predicts the right answer.
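This picture can be mimicked with a toy encoder-decoder pair in which a single latent parameter plays the role of the resistance; the true resistance, the noise level, and the least-squares "encoder" stand in for the paper's neural networks and are illustrative assumptions:

```python
import random

# Toy analogue of the Ohm's-law example: an "encoder" compresses (current,
# voltage) observations into one latent parameter (the resistance), and a
# "decoder" answers new questions from that latent alone. The true resistance,
# noise level, and least-squares encoder are illustrative assumptions.
random.seed(0)
R_true = 4.7                                    # hidden physics, unknown to the model

currents = [0.1 * k for k in range(1, 21)]
voltages = [R_true * i + random.gauss(0.0, 0.01) for i in currents]

def encoder(I, V):
    """Least-squares fit of V = R * I: the whole dataset collapses to one number."""
    return sum(v * i for v, i in zip(V, I)) / sum(i * i for i in I)

def decoder(latent_R, question_current):
    """Answer 'what voltage for this current?' using only the latent parameter."""
    return latent_R * question_current

latent = encoder(currents, voltages)
answer = decoder(latent, 2.0)       # question: the voltage at I = 2.0 A
print(latent, answer)
```

The key point mirrored here is that the decoder never sees the raw observations again: one latent number suffices to answer every question about this circuit.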
Let’s introduce some of the examples presented in the paper that demonstrate SciNet’s efficiency and accuracy in predicting representations and answers from scratch, without any constraints imposed.
iv.1 Experiment One: Damped Pendulum
Presented as a simple classical test of how SciNet works, the damped pendulum is described by the following differential equation
(55) $m\ddot{x} = -\kappa x - b\dot{x}$
where $\kappa$ is the spring constant, which governs the frequency of oscillation, and $b$ is the damping factor. The solution of the damped system is
(56) $x(t) = A\,e^{-\frac{b}{2m}t}\cos(\omega t + \phi), \qquad \omega = \sqrt{\frac{\kappa}{m} - \frac{b^{2}}{4m^{2}}}$
SciNet is implemented as a network with three latent neurons, and it is fed time-series observations of the position of the pendulum. The amplitude $A$, mass $m$, and phase $\phi$ are fixed for all training sets; only the spring constant $\kappa$ and the damping factor $b$ vary over fixed training ranges.
The encoder outputs the parameters $\kappa$ and $b$ and stores them in two latent neurons, without using the third neuron. Upon providing a time as a question, SciNet predicts through the decoder network the position of the pendulum at that time with excellent accuracy. Therefore, SciNet was able to extract the physical parameters and store them, as well as predict future positions accurately. This implies that the parameters extracted were sufficient to describe the whole system and to make future predictions.
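As a quick sanity check, the damped-oscillator solution can be verified numerically against the equation of motion; the parameter values below are illustrative assumptions:

```python
import math

# Numeric check that the damped-oscillator solution satisfies the equation of
# motion m x'' = -kappa x - b x'. Parameter values are illustrative assumptions.
m, kappa, b, A, phi = 1.0, 7.0, 0.8, 1.0, 0.0
omega = math.sqrt(kappa / m - (b / (2 * m)) ** 2)   # damped angular frequency

def x(t):
    return A * math.exp(-b * t / (2 * m)) * math.cos(omega * t + phi)

def residual(t, h=1e-5):
    """m x'' + b x' + kappa x, with derivatives by central finite differences."""
    xpp = (x(t + h) - 2.0 * x(t) + x(t - h)) / h ** 2
    xp = (x(t + h) - x(t - h)) / (2.0 * h)
    return m * xpp + b * xp + kappa * x(t)

checks = [residual(t) for t in (0.1, 0.5, 1.0, 2.0)]
print(checks)  # all residuals are numerically close to zero
```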
iv.2 Experiment Two: Qubits
After presenting a classical example, SciNet is tested on quantum examples, specifically with qubits. Before explaining the problem at hand, we define a couple of the terms used:

Qubit: A qubit, the quantum analog of a classical bit, is a twodimensional system that can exist in a superposition of two states. It forms the fundamental unit in quantum computing.

Quantum tomography: a method of reconstructing the quantum state from a series of measurements (Paris and Rehacek, 2004). A typical approach is to prepare many copies of the quantum state and perform several measurements on the copies. Each of these measurements reveals part of the information stored in the state. If the set of measurements is informationally complete, thus allowing full reconstruction of the quantum state, it is said to be tomographically complete. Otherwise, it is tomographically incomplete.

Binary projective measurements: measurements yielding one of two outcomes, e.g. the states $|0\rangle$ or $|1\rangle$ for a single qubit.
Given a set of measurements, SciNet is required to represent the state of the quantum system and make accurate predictions without any prior quantum knowledge. In this example, the two states to be represented are a 1-qubit state and a 2-qubit state. The number of real parameters for a single qubit is two, and for two qubits it is six. Here is how we get those numbers:

The dimension of the complex vector space is $2^{n}$ for $n$ qubits, so for a single qubit we have two basis states, $|0\rangle$ and $|1\rangle$; for two qubits we have four: $|00\rangle$, $|01\rangle$, $|10\rangle$, and $|11\rangle$.

Counting the real parameters, each complex amplitude contributes two, giving $2 \cdot 2^{n}$.

Two constraints are present. The first is the normalization condition $\langle\psi|\psi\rangle = 1$, and the second is that the global phase factor carries no information, meaning that replacing $|\psi\rangle$ by $e^{i\varphi}|\psi\rangle$ does not affect any measurement probability. These constraints reduce the number of parameters by two.
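The counting above can be condensed into one line:

```python
# Parameter counting for an n-qubit pure state: 2^n complex amplitudes give
# 2 * 2^n real numbers; normalization and the irrelevant global phase remove two.
def real_parameters(n_qubits):
    return 2 * 2 ** n_qubits - 2

print([real_parameters(n) for n in (1, 2, 3)])  # [2, 6, 14]
```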
Therefore, we should expect the encoder to use two latent parameters for the 1-qubit state and six for the 2-qubit state. From the set of all binary projective measurements, a random subset of measurements ($N = 10$ for the single qubit, and $N = 30$ for two qubits) is chosen and applied to the quantum state $|\psi\rangle$ to be represented. The generated probabilities $p_{i}$ are the probabilities of measuring zero. After repeating the measurements several times, the resulting probabilities are fed to the network as observations. Given these observations, SciNet determines the minimal number of parameters sufficient to describe the quantum state.
Choosing another random set of binary projective measurements (10 for the single qubit, and 30 for two qubits), we apply these measurements as questions to generate the set of probabilities