## 1 Introduction

Because Full Waveform Inversion (FWI) is highly nonlinear, a hierarchical approach is common practice. Developing such a strategy is a daunting task considering the size of the data, the realities of how the data were acquired (limited bandwidth and aperture), and the physical assumptions we impose on the model. Regarding the choice of objective function, recent advances have produced misfit functions that are reasonably free of cycle-skipping, such as the matching filter misfit, the optimal transport (OT) function [opt2], or a combination of the two, i.e., the optimal transport of the matching filter (OTMF) misfit [MF_OTMF]. Unlike the L2-norm misfit, which is a local comparison, these advanced misfit functions seek global comparisons between the predicted and measured data and can thus avoid cycle-skipping. However, we often still need to switch to the L2 norm to obtain higher-resolution models once the data are less cycle-skipped and safe for local comparison. In practice, we need to carefully QC the data matching to determine the optimal time to switch. Besides, the probability of cycle-skipping varies with offset. It would be ideal to use different misfit functions for different traces to accommodate their specific cycle-skipping probabilities. In principle, we can formulate the problem as follows: in each iteration, given the predicted and measured data, we make a decision (action) to choose between the L2-norm misfit and a cycle-skipping-free misfit such as the OTMF, and the consequence of that action should be a better fit of the data over a long time horizon (running over many iterations). Mathematically, this is a Markov decision process, which is well studied in statistics and machine learning. In this paper, based on the concept of reinforcement learning, we train a deep Q network (DQN) [DQN] to achieve fast convergence by learning the optimal choice of objective function over the FWI iterations.

Reinforcement learning (RL) is one of three basic machine learning paradigms, alongside supervised learning and unsupervised learning. It is regarded as a potential path towards true artificial intelligence. It differs from supervised learning in that it does not require labels given as input/output pairs. Instead, the focus of RL is on finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge). Recently, RL-based algorithms have demonstrated their potential in solving complex problems that are extremely difficult for conventional machine learning algorithms, e.g., AlphaGo beat the top human Go players, while AlphaStar reached Grandmaster level in StarCraft [Silver2017, Vinyals2019]. In this paper, we share our first attempt to use RL to automatically select a misfit function in full-waveform inversion. We start with a brief review of the OTMF misfit function used here and then develop the method for misfit function selection using DQN. Finally, we demonstrate our method using a time-shifted signal example.

## 2 A robust misfit function by optimal transport of the matching filter (OTMF)

The conventional L2-norm misfit function seeks a local, point-wise comparison between the predicted data $p(t)$ and the measured data $d(t)$:

$$J_{L_2} = \frac{1}{2}\,\big\| p(t) - d(t) \big\|_2^2. \tag{1}$$

[MF_OTMF] introduced the optimal transport of the matching filter (OTMF) misfit for FWI. In OTMF, a matching filter $w(t)$ is first computed by deconvolving the predicted data $p(t)$ with the measured data $d(t)$:

$$d(t) * w(t) = p(t), \tag{2}$$

where $*$ denotes the convolution operation. After suitably preconditioning the resulting matching filter so that it fulfills the requirements of a distribution, we minimize the Wasserstein distance between the preconditioned matching filter $\tilde{w}(t)$ and a target distribution given by, e.g., a Dirac delta function $\delta(t)$:

$$J_{\mathrm{OTMF}} = W\big(\tilde{w}(t),\, \delta(t)\big), \tag{3}$$

where $W(\cdot,\cdot)$ denotes the Wasserstein distance [opt2]. The resulting OTMF misfit in Equation 3 effectively overcomes cycle-skipping, as demonstrated by [SUN_OTMF_EAGE].
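As an idealized illustration of why this comparison is global, consider predicted data that are a pure time shift $\tau$ of the measured data. The worked case below assumes noise-free data, an exact deconvolution, and the 2-Wasserstein distance $W_2$; these are illustrative assumptions and not necessarily the choices made in [MF_OTMF]. The matching filter is then a shifted delta, which already satisfies the requirements of a distribution:

```latex
p(t) = d(t - \tau)
\;\Longrightarrow\;
w(t) = \delta(t - \tau),
\qquad
W_2^2\big(\delta(t - \tau),\, \delta(t)\big) = \tau^2 .
```

Under these assumptions the misfit is quadratic, and hence convex, in $\tau$ however large the shift is, whereas a point-wise L2 comparison loses sensitivity to $\tau$ once the two wavelets no longer overlap.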

## 3 Automatic misfit function selection using Deep Q network (DQN)

We first summarize the mathematical background for the Markov decision process (MDP), the deep Q network (DQN), and the associated RL techniques. We adopt the standard MDP formalism. An MDP is defined by a tuple $(\mathcal{S}, \mathcal{A}, R, T, \gamma)$, which consists of a set of states $\mathcal{S}$, a set of actions $\mathcal{A}$, a reward function $R(s, a)$, a transition function $T(s, a, s') = P(s' \mid s, a)$, and a discount factor $\gamma$. In each state $s \in \mathcal{S}$, the agent takes an action $a \in \mathcal{A}$. Upon taking this action, the agent receives a reward $R(s, a)$ and reaches a new state $s'$, determined from the probability distribution $P(s' \mid s, a)$. In RL, we try to learn a policy $\pi$ that specifies, for each state, which action the agent will take. The goal of the agent is to find the policy, mapping states to actions, that maximizes the expected discounted total reward over the agent's lifetime. This long-term expected reward is formulated as the action-value function (Q):

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\, \sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t) \;\middle|\; s_0 = s,\ a_0 = a \right], \tag{4}$$

where $\mathbb{E}_{\pi}$ is the expectation over the distribution of the admissible trajectories $(s_0, a_0, s_1, a_1, \dots)$ obtained by executing the policy $\pi$ starting from $s_0 = s$ and $a_0 = a$. Many algorithms have been developed in RL to learn the policy $\pi$. DQN is a popular method for dealing with a discrete action space. It learns only the Q function, and the optimal policy can be derived directly from the learned Q function:

$$\pi^{*}(s) = \arg\max_{a} Q(s, a). \tag{5}$$

Equation 5 is intuitive: the best action for each state is the one that gives the largest Q value for that state. To learn the Q function, we take a single move from the current state $s$ to the next state $s'$ and observe the reward $R$ we receive. This admits a one-step look-ahead:

$$Q(s, a) = R + \gamma \max_{a'} Q(s', a'). \tag{6}$$

To stabilize the learning process, we keep track of a separate target Q function, $\hat{Q}$. The loss function of DQN in training is then the temporal-difference (TD) error between the Q function and its target:

$$L = \mathbb{E}\left[ \Big( R + \gamma \max_{a'} \hat{Q}(s', a') - Q(s, a) \Big)^{2} \right]. \tag{7}$$
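The TD loss can be sketched in a few lines of plain Python. This is a minimal illustration and not the authors' implementation: the function name `td_loss`, the callable interface for the Q functions, and the toy two-state table are all assumptions made for the example, and the special handling of terminal states is omitted.

```python
def td_loss(batch, q, q_target, gamma):
    """Mean squared TD error over a batch of transitions.

    batch    : list of (state, action, reward, next_state) tuples
    q        : callable, state -> list of Q values (one per action)
    q_target : callable with the same signature (the target copy)
    gamma    : discount factor
    """
    total = 0.0
    for state, action, reward, next_state in batch:
        # One-step look-ahead target built from the target network
        target = reward + gamma * max(q_target(next_state))
        total += (target - q(state)[action]) ** 2
    return total / len(batch)


# Toy check with a hypothetical two-state, two-action Q table
table = {"far": [-2.0, -1.0], "near": [-0.5, -0.8]}


def q(state):
    return table[state]


batch = [("far", 1, -0.4, "near")]
loss = td_loss(batch, q, q, gamma=0.9)
```

In a real DQN the two callables would be neural networks, with the target copy refreshed from the online network only occasionally; here the same table is passed twice purely to keep the example small.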

For training efficiency, we save each transition $(s, a, R, s')$ in a replay buffer and reuse these samples for training (this is referred to as experience replay in RL).
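A replay buffer of this kind can be sketched with the Python standard library alone. The class name `ReplayBuffer`, the capacity of 1000, and the dummy integer transitions are assumptions for illustration; only the batch size of 128 is taken from the Results section.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state) tuples."""

    def __init__(self, capacity):
        # A deque with maxlen discards the oldest transition once full
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform sampling without replacement
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


buffer = ReplayBuffer(capacity=1000)
for k in range(1500):
    # Dummy transitions: integers stand in for the real states
    buffer.push(k, k % 2, -1.0 / (k + 1), k + 1)
batch = buffer.sample(128)
```

Sampling uniformly from past transitions breaks the correlation between consecutive inversion iterations, which is the usual motivation for experience replay.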

Exploration plays an important role in RL: it gives the agent the ability to expand its knowledge when interacting with the environment. The $\epsilon$-greedy exploration strategy chooses a random action with probability $\epsilon$:

$$a = \begin{cases} \text{a random action}, & \text{with probability } \epsilon, \\ \pi^{*}(s), & \text{otherwise}, \end{cases} \tag{8}$$

where $\pi^{*}(s)$ is the greedy action given by the policy of Equation 5. We start with a large $\epsilon$ and gradually reduce it during training. Another important aspect of RL is the reward: the choice of its form is problem-specific, and it may affect the training significantly. Algorithm 1 shows a typical DQN flow with experience replay and the $\epsilon$-greedy exploration policy.
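The $\epsilon$-greedy rule and a decaying $\epsilon$ can be sketched as follows. The function names are hypothetical, and the particular exponential schedule is an assumption; the text only states that $\epsilon$ starts large and is reduced, with the values 0.90 and 0.05 quoted in the Results section.

```python
import math
import random


def epsilon_at(step, total_steps, eps_start=0.90, eps_end=0.05):
    """Exponential decay from eps_start at step 0 to eps_end at total_steps."""
    rate = math.log(eps_start / eps_end) / total_steps
    return eps_start * math.exp(-rate * step)


def select_action(q_values, epsilon, rng=random):
    """Random action with probability epsilon, greedy action otherwise."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Greedy choice: index of the largest Q value
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Action 0 could stand for the L2 norm and action 1 for OTMF
greedy_action = select_action([0.1, 0.7], epsilon=0.0)
```

With `epsilon=0.0` the call always returns the greedy action, which is how a trained network would be used at inference time.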

It is straightforward to adapt DQN to our misfit function selection problem, i.e., selecting between the L2-norm and the OTMF misfit. Considering a one-dimensional FWI problem, the state in RL is the pair of predicted and measured data:

$$s_k = \{\, p_k(t),\ d(t) \,\}, \tag{9}$$

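where $p_k(t)$ and $d(t)$ are single traces of the predicted and measured data in the time domain at iteration step $k$. The Q function takes this state as input and outputs two values that determine whether we use the L2-norm misfit function or the OTMF misfit function. We also incorporate the $\epsilon$-greedy exploration policy, i.e., we choose randomly between the L2-norm and the OTMF misfit with probability $\epsilon$. For the reward, we can define it as the negative of the normalized L2 norm of the model difference or, as another option, the negative of the normalized L2 norm of the data residuals:

$$R_k = -\frac{\| m_k - m_{\mathrm{true}} \|_2}{\| m_{\mathrm{true}} \|_2} \quad \text{or} \quad R_k = -\frac{\| p_k(t) - d(t) \|_2}{\| d(t) \|_2}, \tag{10}$$

where $m_k$ denotes the inverted model at iteration $k$ and $m_{\mathrm{true}}$ the true model.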

Note that, although we use an L2 norm of the data difference to formulate the reward in the RL training, this does not cause the problem it causes in FWI, because the Q function we fit in Equation 4 seeks a long-term expected reward (over many iterations). This means that the best policy learned is the one that gives fast convergence, with a smaller accumulated L2 norm of the data residuals throughout the inversion process.
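The distinction between a per-iteration reward and the accumulated reward can be illustrated with a small sketch. The function names, the normalization by the norm of the measured trace, the discount factor of 0.9, and the four-sample toy traces are all assumptions made for the example.

```python
import math


def data_reward(predicted, measured):
    """Negative L2 norm of the residual, normalized by the measured data."""
    resid = math.sqrt(sum((p - d) ** 2 for p, d in zip(predicted, measured)))
    norm = math.sqrt(sum(d ** 2 for d in measured))
    return -resid / norm


def discounted_return(rewards, gamma):
    """Sum of gamma**k * reward_k over one episode."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))


measured = [0.0, 1.0, 0.0, -1.0]
# Predicted traces over three iterations for two hypothetical inversions
slow = [[0.0, 0.0, 0.0, 0.0]] * 3                      # residual never shrinks
fast = [[0.0, 0.5, 0.0, -0.5], measured, measured]     # converges quickly

g_slow = discounted_return([data_reward(p, measured) for p in slow], 0.9)
g_fast = discounted_return([data_reward(p, measured) for p in fast], 0.9)
```

The quickly converging run accumulates the larger (less negative) return, so a policy that maximizes the return is rewarded for where the residual ends up over the episode rather than for the immediate decrease at any single step.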

## 4 Results

In this example, we optimize a single parameter, i.e., the time shift between signals. The assumed forward modeling produces a shifted Ricker wavelet using the formula

$$p(t; \tau) = \left[ 1 - 2\pi^{2} f^{2} (t - \tau)^{2} \right] e^{-\pi^{2} f^{2} (t - \tau)^{2}}, \tag{11}$$

where $\tau$ is the time shift and $f$ is the dominant frequency. The modeling given by Equation 11 is a simplified version of a PDE-based simulation. The reward we use for training is the negative of the normalized L2 norm of the data residuals (the second formula in Equation 10). In this example, the data are discretized in time at a uniform sampling interval. We use a direct connected network (DCN) with one hidden layer for the Q function. The Q network outputs two scalar values representing the Q values for the L2 norm and the OTMF. We set the initial value of $\epsilon$ to 0.90 and decrease it exponentially to 0.05 by the end of training. Using a wavelet with a 3 Hz peak frequency, we randomly generate the initial and true time shifts between 0.4 s and 1.2 s. In each episode (one full run of the inversion), we run 12 iterations. We run ten thousand episodes for training, and we update the Q network based on Equation 7 at every iteration. The batch size is set to 128, i.e., we randomly fetch 128 tuples $(s, a, R, s')$ from the replay buffer for updating the Q function.

Figure 1a shows the loss of Equation 7 over episodes (the curves in Figure 1 have been smoothed with a moving average over 100 episodes). Its convergence demonstrates the success of the RL training. Figure 1b shows the accumulated reward over episodes; its increasing value further indicates that the learned policy improved throughout training and can achieve fast convergence with higher reward.

To further understand the trained Q function, we plot the Q value for different time shifts (we set the measured data to a time shift of 0.5 s and scan the Q function over predicted data with time shifts varying from 0.5 s to 1.1 s). Figure 2a shows the Q function over the relative time shift between the predicted and measured data. Figure 2b shows the action that will be taken based on the learned Q function (0 for the L2 norm, 1 for OTMF). We can see that if the relative time shift is smaller than approximately 0.15 s, the Q value for the L2 norm is larger than that for the OTMF, suggesting that the L2-norm misfit function be applied. Otherwise, the learned Q function suggests using OTMF to avoid cycle-skipping. Note that the switch point at 0.15 s is consistent with the half cycle of the 3 Hz peak frequency we used in training (1/(2 × 3 Hz) ≈ 0.17 s). However, this number is determined entirely from the data itself in the framework of reinforcement learning.
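The switch point can be compared with the shift at which the L2 misfit itself stops growing with the time shift, which is computable directly from the Ricker wavelet without any training. The sketch below is an independent check and not part of the trained network: the sampling interval of 0.002 s, the 2 s window, and the function names are assumptions, while the 3 Hz peak frequency and the 0.5 s reference shift come from the text.

```python
import math


def ricker(t, tau, f):
    """Ricker wavelet with peak frequency f (Hz) centered at time tau (s)."""
    arg = (math.pi * f * (t - tau)) ** 2
    return (1.0 - 2.0 * arg) * math.exp(-arg)


def l2_misfit(tau_pred, tau_true, f, times):
    """Half the sum of squared sample differences between two wavelets."""
    return 0.5 * sum(
        (ricker(t, tau_pred, f) - ricker(t, tau_true, f)) ** 2 for t in times
    )


f_peak = 3.0                             # Hz, as in the text
dt = 0.002                               # s, assumed sampling interval
times = [i * dt for i in range(1001)]    # assumed window from 0 to 2 s
tau_true = 0.5                           # s, shift of the measured data
shifts = [i * dt for i in range(201)]    # relative shifts from 0 to 0.4 s
misfits = [l2_misfit(tau_true + s, tau_true, f_peak, times) for s in shifts]

# The misfit rises with the shift up to this point and falls beyond it,
# so a gradient step taken from a larger shift moves away from the truth.
basin_edge = shifts[misfits.index(max(misfits))]
```

With these assumed settings the misfit peaks at a relative shift of about 0.14 s, close to the switch point of roughly 0.15 s read from the learned Q function, which supports the interpretation that the network has learned where the L2 norm becomes prone to cycle-skipping.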

## 5 Conclusions

In the framework of reinforcement learning, we trained a deep Q network (DQN) to select a misfit function for FWI. We used a time-shift inversion example to demonstrate the basic principle of our method. The trained network managed to use the data to determine the appropriate objective function to achieve convergence.