1 Introduction
Nowadays, artificial intelligence (AI) has fundamentally changed a vast range of industries. AI leverages computers and machines to mimic the problem-solving and decision-making capabilities of the human mind. AI has already achieved performance comparable to that of human experts in many fields, such as image recognition
(ir), Go (go), StarCraft (star), and voice generation (wavenet). Recently, AI has been applied in many scientific fields, e.g., protein structure prediction (fold), climate forecasting (clim), astronomical pattern recognition
(ast), etc. These exciting successes encourage the exploration of AI in various areas of scientific research. This paper introduces reinforcement learning (RL) to inverse problems and develops an intelligent computing method for inverse scattering to achieve precision imaging. In particular, a reinforcement learning framework is designed to learn and decide sensor positions and wave frequencies adaptively for different scatterers, thus obtaining a significant improvement in reconstruction quality with limited imaging resources.

The inverse scattering problem is to reconstruct or recover the physical and/or geometric properties of an object from measured data. The reconstructed information of interest includes, for instance, the dielectric constant distribution and the shape or structure of the object. The interrogating or probing radiation can be an electromagnetic wave (e.g., microwave, optical wave, or X-ray), an acoustic wave, or some other wave. The problem of inverse scattering is important when details about the structure and composition of an object are required. Inverse scattering has wide applications in nondestructive evaluation, medical imaging, remote sensing, seismic exploration, target identification, geophysics, optics, atmospheric sciences, and other such fields (app1; app2; app3; app4; app5; s1).
We focus on the two-dimensional time-harmonic acoustic inverse scattering problem as a proof of concept for the reinforcement learning framework. In a compact domain of interest, the inhomogeneous media scattering problem at a fixed frequency is modeled by the Helmholtz equation:
where is an unknown velocity field. We assume that there is a known background velocity field such that except in the domain . We introduce a scatterer compactly supported in :
Then we can work with instead of . The aim of inverse problems is to recover the unknown given some observation data . Waves are sent from a set of sensors and received by a set of receivers. The intrinsic properties of scatterers are contained in the measurements . A corresponding forward problem aims at computing from a given . Both problems are computationally challenging. It is difficult to obtain a numerical solution for the inverse problem because of the nonlinearity of reconstructing . Traditionally, several numerical methods have been developed for this inverse problem, and they can mainly be divided into two types: nonlinear-optimization-based iterative methods (o1; o2; o3) and imaging-based direct methods (i1; i2; i3). Recently, deep learning has also been introduced to solve inverse scattering problems, with promising developments (d1; d2; d3; rec; add).

Inverse scattering problems are ill-posed when the incident wave has only one frequency, due to the lack of stability (ill1). Minor variations in the measured data may lead to significant errors in the reconstruction (sta1; sta2). There have been extensive efforts in different directions to alleviate this issue. For example, regularization methods under single-frequency data (single1; single2) have been proposed to increase reconstruction efficiency and stability. Another direction is to apply multi-frequency data in the case of time-harmonic scattering problems (m1; m2). It can be shown that the inverse problem is uniquely solvable and Lipschitz stable when the highest wavenumber exceeds a certain real number. However, the nonlinear equation becomes more oscillatory at higher frequencies and contains many more local minima. Therefore, (r1) developed a recursive-linearization-based algorithm that utilizes multi-frequency data to form a continuation procedure combining the advantages of low and high frequencies. In detail, it solves the essentially linear equation at the lowest wavenumber and then uses the solution to gradually linearize the equation at higher frequencies. Seminal works in other directions have also been proposed for inverse scattering. For example, single-frequency algorithms can be naturally extended to multi-frequency versions following the idea in (mr1). (mr3) devises a novel Fourier method that directly reconstructs acoustic sources from multi-frequency measurements, avoiding the expensive computation incurred by iterative methods.
Current literature focuses on computational algorithms and uses fixed sensor positions and frequencies. Motivated by the numerical challenges and the aforementioned works, a reinforcement learning framework is proposed in this paper to select scatterer-dependent sensor locations and multiple frequencies to improve reconstruction stability and quality for precision imaging. Previously, reinforcement learning has been applied to medical imaging (ct; med; med2), a numerical problem related to inverse scattering. Our algorithm is mainly inspired by the work (ct), where reinforcement learning is applied to learn sensor locations and X-ray doses in CT imaging. It is worth emphasizing several differences. First, the reconstruction problem in CT imaging is linear, while the one in inverse scattering is nonlinear and, hence, more challenging. Second, the goal in CT imaging is to optimize sensor locations by balancing sensing safety and reconstruction quality, while the goal in inverse scattering is to balance sensing expense and reconstruction quality, leading to a different learning target in this paper from the one in (ct). Finally, we develop a new reinforcement learning framework that optimizes not only sensor locations but also incident wave frequencies. We focus on the case of relatively weak scatterers as a proof of concept, while still maintaining the nonlinear nature of the forward problem by keeping a few leading-order terms in the Born series. Extensive numerical results will be provided to demonstrate the superiority of the proposed method over existing methods with limited imaging resources.
The rest of the paper is organized as follows. In Section , the inverse scattering problem is introduced. In Section , we explain the proposed reinforcement learning framework. In Section , numerical results are provided to demonstrate the effectiveness of the proposed framework. In Section , we conclude this paper with a short discussion.
2 Preliminary of Inverse Scattering
2.1 Background
In this section, we discuss the forward model for the inverse scattering problem. The inhomogeneous media scattering problem at a fixed frequency is modeled by the Helmholtz equation in (1). It is assumed that a scatterer is compactly supported in a domain (see Figure 3 for a visualization). Typically, in a numerical solution of the Helmholtz operator, is discretized by a Cartesian grid at the rate of a few points per wavelength. Assume that has grid points and that is used to denote the discretization points of . After discretization, the scatterer field
can be treated as a vector in
evaluated at . Assuming a known background velocity , the background Helmholtz operator can be written as . Then can be treated as a perturbation with
Consider as a background Green’s function. When the scatterer field is sufficiently small, the expansion of can be constructed via
Note that can be determined by the known background velocity . Therefore, the difference becomes the quantity of interest to recover . In a standard experimental setup, a set of sources, denoted as , and a set of receivers, denoted as , are installed around . Incident waves are sent from sources to probe the intrinsic structure of a scatterer. Receivers receive the waves scattered from the object. Let be a sourcedependent operator imposing an incoming wavefield via sources in . Similarly, let be a receiverdependent operator collecting data with receivers in . Then the observation data can be modeled as
(1) 
In this paper, a summation of finitely many terms in the expansion (1) is used as the forward model in our computation. Note that this forward model is a high-order polynomial in , which leads to a challenging nonlinear model in inverse scattering. Next, we provide the concrete form of the expansion (1) under a far-field assumption.
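As a concrete illustration, a truncated expansion of this polynomial type can be sketched in a few lines of linear algebra. All names and shapes below (`G` for a discretized background Green's function, `Pi_r` and `Pi_s` for receiver and source operators, `eta` for the scatterer vector) are hypothetical stand-ins, not the paper's actual discretization:

```python
import numpy as np

def born_forward(G, Pi_r, Pi_s, eta, order=2):
    """Truncated Born-type forward model (illustrative sketch).

    Sums the first `order` terms of an expansion of the form
    Pi_r (G E G + G E G E G + ...) Pi_s with E = diag(eta),
    so the output is a polynomial of degree `order` in eta."""
    E = np.diag(eta)
    term = G @ E @ G            # first-order (linearized) scattering
    total = term.copy()
    for _ in range(order - 1):
        term = term @ E @ G     # append one more scattering event
        total = total + term
    return Pi_r @ total @ Pi_s
```

Each extra term multiplies by another factor of the scatterer, which is why the model stays nonlinear in the unknown even after truncation.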
2.2 Far-Field Approximation
Without loss of generality, we assume that the domain is rescaled to a support in a unit circle denoted as . For simplicity, we assume the background velocity . We also assume that each sensor is placed on the unit circle, with its location represented by its angle. Let be a unit direction; the source in direction sends out an incoming plane wave . The scattered wave field at a large distance is modeled by (colton1998inverse):
(2) 
where is defined on . The incident wave launched from a source at location is transmitted through the domain and received by a receiver at location . The measurement data corresponding to this transmission process can be modeled by
The exact form of the far-field data can be derived under the general framework of (1). Consider a source located at a position , where is a direction and is a distance going to infinity. The source magnitude is required to scale up by to compensate for the spreading of the wave field and the phase shift (see, for example, d2). In the limit of , we have
The same setup works in the prescription of receivers. Suppose a receiver is located at a position , where is a direction and is a distance going to infinity. Then we have
Combining the above two limiting processes together, the observed data for a source at a location and a receiver at a location can be computed as follows in the sense of limit:
(3) 
where is the frequency of the wave field from the source at the location . As mentioned previously, we approximate (3) as a polynomial in up to the th order. For simplicity, we write (3) succinctly as:
(4) 
In the next section, based on the forward model (3), we introduce the proposed reinforcement learning scheme to solve the inverse problem.
3 Reinforcement Learning Framework for Inverse Scattering
In typical data collection for inverse scattering, sensors are installed either randomly or uniformly on the unit circle, and wave frequencies are selected empirically. To achieve better reconstruction quality and stability, reinforcement learning is applied in this paper to learn a strategy that adaptively decides sensor angles and wave frequencies in a sequential manner: 1) several sets of sensors and frequencies are set up sequentially; 2) the locations of sensors and the frequencies of waves are decided according to the reconstruction results from previous steps. This is a sequential decision process that gradually adjusts data collection to obtain better reconstructions; furthermore, this method uses individualized strategies for imaging different 's.
The rest of the section is organized as follows. In Section , we introduce the problem setting, which establishes the foundation for the following sections. In Section , we describe the MDP formulation of the problem, which enables us to use RL methods to solve it. In Section , we review some basic notions and methods in RL and introduce the RL algorithm used to optimize the policy in the MDP described in Section . In Section , we introduce the solver used in scatterer reconstruction, which is required by the RL algorithms in Section . In Sections and , we present the structures of the policy network and the value network. In Sections and , we explain the training and test procedures of our RL model, combining all the sections from to .
3.1 Problem Setting
Without loss of generality, we assume that the true scatterer is compactly supported in a unit square centered at the origin, and all the probes are placed on the unit circle in (containing ). In the reinforcement learning framework described later, an action decides the location of a sensor and the choice of frequency at time . In this case, we discretize the unit circle uniformly and define as an indicator vector to specify the location of one sensor; has only one nonzero entry, indicating the angle of the sensor on the unit circle. In the experiment, we place one sensor on the unit circle in each step until sensors have been placed, where is a number decided by the user in advance. In each step, sensors send out incident waves and receive scattered waves. At time , given a group of sensors placed sequentially at , a new sensor is added on the unit circle. A wave field of frequency is launched from this new sensor and transmits through the domain . Then the sensors at receive the scattered wave. Each sensor is not only a source but also a receiver. We denote the receiver set at time as and the source set at time as . Therefore, and . We also assume that the observed data at time can be approximated by (4): . It is important to emphasize that the frequency at step can be different from the others. After steps, the data collection procedure ends and we reconstruct the scatterer with all the recorded measurements. We define the entire data collection procedure as an episode; this episode consists of sequential steps. The goal of this paper is to improve the reconstruction quality while limiting the number of probes used. Our solution is to learn an optimal strategy for data collection.
The original problem of determining sensor locations and wave frequencies is a combinatorial optimization problem and is NP-hard. We formulate the problem as a Markov Decision Process (MDP) in Section , which can then be solved by reinforcement learning methods.

3.2 Markov Decision Process Formulation
The procedure of deciding sensor locations and frequency values in inverse scattering is a sequential decision problem, where one needs to choose an angle and a frequency at each step. Thus it can be formulated as a Markov Decision Process (MDP), which allows us to use RL to solve the problem efficiently. We now elaborate on how to formulate our problem as an MDP:

State at time is , where . The first term is the observation data collected at step . The size comes from the fact that at time , we send out a wave from a single source to receivers. Measurements up to time are all included in the state because the reconstruction at step relies on all the previously collected data. The second term is a vector recording all the angles at which sensors have already been placed by time , together with the corresponding wave frequencies. The th entry of corresponds to the angle . If the th angle is selected at some step , then the th entry of is , the frequency of the wave sent at the th step. If no wave is sent from a specific angle, the entry of corresponding to that angle is . The last term records the number of sensors left to be placed.

Action taken at time is . It defines the choices of angle and frequency, and . is a one-hot vector denoting the angle of the new sensor to be added at time , and stands for the frequency of its incident wave. In particular, can be defined in terms of and as .

Transition model. State and action allow us to compute a deterministic next state under a noise-free model. According to (4), the new measurement can be computed given and . Meanwhile, . Therefore, and the new state is .

Reward at time is , defined as the increment in the Peak Signal-to-Noise Ratio (PSNR) of the reconstruction compared with the last step. PSNR is commonly used to quantify reconstruction quality for images. We use its increment here to quantify how much the new reconstruction has improved due to the new action. Suppose the reconstruction at step is and the true scatterer is ; then . In this paper, is reconstructed by a regularization-based optimization method to be introduced later. We would like to point out that better results might be possible if more sophisticated reconstruction methods were used.
Since the scatterer, and hence the transition model, is unknown in advance, we need a reinforcement learning framework that learns which actions to take directly from interactions with the environment.
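The transition and reward above can be sketched together in code. Here `measure_fn` and `reconstruct_fn` are hypothetical stand-ins for the forward model (4) and the reconstruction solver (6), and all shapes are illustrative, not the paper's implementation:

```python
import numpy as np

def psnr(recon, truth, max_val=1.0):
    """Peak signal-to-noise ratio; higher is better."""
    mse = np.mean((recon - truth) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def mdp_step(state, angle_idx, freq, measure_fn, reconstruct_fn, truth):
    """One deterministic step of the sensing MDP (illustrative).

    state = (measurements, v, remaining, last_psnr); v[i] stores the
    frequency used at angle i (0 if that angle is unused)."""
    measurements, v, remaining, last_psnr = state
    v = v.copy()
    v[angle_idx] = freq                       # record the chosen action
    measurements = measurements + [measure_fn(angle_idx, freq)]
    recon = reconstruct_fn(measurements)      # reconstruction at this step
    new_psnr = psnr(recon, truth)
    reward = new_psnr - last_psnr             # PSNR increment as reward
    return (measurements, v, remaining - 1, new_psnr), reward
```

Note that in training the true scatterer is known, so the PSNR-based reward is computable; at deployment time only the learned policy is needed.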
3.3 Reinforcement Learning Algorithm
In RL, an agent interacts with the environment to obtain a sequence of data, based on which the agent learns a policy maximizing a certain accumulated reward to finish a task. Given an interaction trajectory between the agent and the environment , the total reward over time is
where is a discount rate. We use since the reward at each step is of equal importance in our application. The policy in the MDP is defined as the conditional probability for and . denotes the probability of taking action at step while the state is . The value function of a state following a certain policy is given as
Similarly, the value function of a stateaction pair is defined as
In an MDP, the RL algorithm aims to find an optimal policy that maximizes the expected value of the initial state following a probability distribution :

Currently, model-free RL algorithms mainly fall into two categories: value-based methods and policy-based methods. Value-based algorithms learn the state or state-action value and act by choosing the best action in each state, which requires comprehensive exploration. For example, Q-learning (qlearning) learns the optimal Q function through the Bellman equation and chooses the greedy action that maximizes the learned Q function. Since the maximization requires searching over the action space, it becomes slow and imprecise when the actions are continuous. This means that value-based algorithms are more suitable for discrete actions. On the other hand, policy gradient methods (policy1; policy2) are more suitable for continuous actions because they directly optimize a parameterized policy under a surrogate objective. Though the variables (angles and frequencies) in our current setting are discrete, we adopt a policy-based method for possible future extensions to continuous cases.
More specifically, we use policy gradient methods that directly optimize a policy
(parameterized by a neural network) under a certain objective function. Policy gradient methods seek to maximize the performance of
via stochastic updates whose expectation approximates the gradient of the performance measure with respect to . In our experiments, we use the Proximal Policy Optimization (PPO) algorithm (ppo). Given an old policy and a new policy , let denote the probability ratio . An advantage function of the policy is introduced as

(5)
The objective function to optimize is then defined as:
where
is a hyperparameter and
. This method employs clipping to avoid destructively large policy updates, which retains the stability and reliability of trust-region methods but is much simpler to implement in practice.

In the MDP, a reward needs to be computed at each step. This means that we need to generate a reconstruction of the scatterer at each step because of our choice of reward.
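The clipped surrogate above can be written in a few lines. This is a generic sketch of the standard PPO objective, not the paper's implementation:

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Clipped PPO surrogate (to be maximized).

    ratio     : pi_new(a|s) / pi_old(a|s) for each sampled step
    advantage : advantage estimates A_t for the same steps
    eps       : clip-range hyperparameter"""
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # Taking the elementwise minimum removes the incentive for
    # destructively large policy updates in either direction.
    return np.minimum(unclipped, clipped).mean()
```

Because the minimum is a lower bound on the unclipped objective, the same batch of trajectories can safely be reused for several gradient steps.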
3.4 Reconstruction
According to (4), for a choice of frequency , receiver set and source set at step , the approximate measurement is given by
where is the true scatterer. We reconstruct by minimizing an loss of data discrepancy with an penalization to encourage sparsity:
where is a hyperparameter. Due to the small number of measurements, we observe that the use of a regularization like significantly improves the results. A new reconstruction is generated at each step in order to compute the reward of the RL model. Given state and action , the reconstruction obtained at time is:
(6) 
which has taken the advantage of all the previously collected measurements.
In our numerical experiments, is set to and L-BFGS (rec) is used to solve the optimization problem in (6) for the reconstruction at each step. Given an initialization , we perform L-BFGS for iterations to obtain the reconstruction of the current step. It is important to point out that a warm start is necessary to obtain good results. In our tests, we use the reconstruction result of the last step as the initialization of the current step. This greatly reduces the computational cost and ensures convergence.
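A minimal sketch of one warm-started reconstruction step follows. It uses SciPy's L-BFGS-B and, since the sparsity penalty is nonsmooth, replaces it with a smooth surrogate for illustration; `forward`, `alpha`, and `eps` are hypothetical names, not the paper's code:

```python
import numpy as np
from scipy.optimize import minimize

def reconstruct(eta0, d_obs, forward, alpha=1e-3, iters=10, eps=1e-8):
    """One reconstruction step: minimize the squared data misfit plus
    alpha * sum(sqrt(eta^2 + eps)), a smooth surrogate for a sparsity
    penalty, warm-started from the previous step's result eta0."""
    def loss(eta):
        r = forward(eta) - d_obs
        return np.sum(r * r) + alpha * np.sum(np.sqrt(eta ** 2 + eps))
    out = minimize(loss, eta0, method="L-BFGS-B",
                   options={"maxiter": iters})
    return out.x
```

Passing the previous step's reconstruction as `eta0` is exactly the warm start described above: with only a few iterations per step, a good initialization carries most of the convergence.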
In the previous sections, we formulate the original problem as an MDP and introduce how to compute the terms in the MDP. In order to apply policy gradient methods, we need to build a policy network that parameterizes the policy to learn.
3.5 Policy Network
In order to handle the increasing dimension of state
over time, we use a Recurrent Neural Network (RNN) to parameterize the policy. The specific structure of RNN enables us to store all the past information in
in a hidden state while adding new information at each step. More specifically, we use multi-layer Gated Recurrent Units (GRU), introduced in 2014 by (gru) and also applied in (ct). A GRU is similar to a long short-term memory (LSTM) unit but has fewer parameters. The structure of the whole GRU is shown in Figure 2. The policy network is denoted by , where represents the training parameters. We use to represent the output of the policy network at the th layer, which is called a hidden state. First, a multi-layer perceptron (MLP) is used to extract features from an input . Then, the features and the output of the GRU at the last step (a hidden state ) are processed by a GRU. In order to learn the sensor angle, the output of the GRU is further processed by another MLP. This MLP aims to learn the policy for the angle based on the information from state . With the help of a softmax function, its output is turned into a categorical distribution over angles (a 360-dimensional vector). The value at each entry denotes the probability of placing a sensor at the corresponding angle. Because an angle cannot be chosen more than once in one episode, a mask is introduced to remove all the previously chosen angles. This angle distribution is denoted by , which represents the policy for the angle . During training, an angle is sampled from the distribution and is then used to generate a one-hot 360-dimensional vector. This vector uses to denote the chosen angle and 0 otherwise. After that, the output of the GRU and the selected angle are merged into a single vector, which is used to learn the frequency policy. This is because the wave frequency should depend not only on the state but also on the chosen angle . In our experiment, the frequency can only be chosen from given values, so the merged vector is processed by another MLP whose output is a -dimensional vector denoting the categorical distribution over frequencies. We denote this distribution by , emphasizing that the policy for the frequency depends on the angle . Finally, a random frequency is sampled from the distribution . The structure of the policy network is shown in Figure 3. To conclude, the policy network learns the policy of an action given a state, which consists of an angle policy and a frequency policy . The angles and frequencies generated by the policy network during training are random samples from these two distributions.

Besides the policy network, a value network is required for training with policy gradient methods.
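The two output heads of the policy network can be sketched as follows: a masked softmax for the angle distribution and a frequency head that conditions on the chosen angle. The weights `W`, `b`, and all shapes here are hypothetical, not the trained network's parameters:

```python
import numpy as np

def masked_angle_policy(logits, used_mask):
    """Softmax over angles with previously chosen angles masked out,
    so an angle cannot be selected twice in one episode (illustrative)."""
    logits = np.asarray(logits, dtype=float)
    free = ~np.asarray(used_mask)
    z = np.where(free, logits - logits[free].max(), 0.0)
    e = np.where(free, np.exp(z), 0.0)   # masked entries get probability 0
    return e / e.sum()

def frequency_policy(hidden, angle_onehot, W, b):
    """Frequency head: the GRU hidden state is concatenated with the
    one-hot chosen angle before a linear layer and a softmax
    (W, b stand in for the frequency MLP's parameters)."""
    x = np.concatenate([hidden, angle_onehot])
    logits = W @ x + b
    e = np.exp(logits - logits.max())
    return e / e.sum()
```

Masking before the softmax, rather than zeroing probabilities afterward, keeps the output a proper distribution over the remaining angles.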
3.6 Value Network
In addition to the policy network, we also build a value network, parameterized by trainable parameters , with a structure similar to that of the policy network to approximate the value function of states. The value function approximation is required in the evaluation of the advantage function of the policy in (5). The value network design is visualized in Figure 4. We use to represent the output of the value network after the th layer. At the th step, an MLP is used to extract features from the current input . Then, the features of and the output of the GRU at the previous step (a hidden state ) are processed by a GRU to generate . The difference between the policy network and the value network lies in how they process to generate useful information. The output of the GRU in the value network is processed by an MLP to generate a deterministic estimate of the value function , instead of a distribution as in the policy network. The estimated value of a state given by the value network is denoted by .

With the policy network and the value network, we can train the RL model through policy gradient methods.
3.7 Training Procedure
Now we introduce the training procedure for the policy network and the value network . The training set consists of randomly generated scatterers of a specific type. A scatterer is randomly chosen from the training set and used to generate an interaction trajectory of the RL model. At the initialization of an episode, we set , and ; thus the initial state is . At step , given the past information and the current information , the policy network outputs a policy . Meanwhile, the value network approximates the value of state under the current policy . During training, angles and frequencies are randomly sampled based on the policy . In this way, a more comprehensive exploration of the action space is encouraged for faster convergence. A new sensor is then placed at angle to launch an incident wave of frequency , so the receiver set is and the source set is at step . Based on the formula in (4), a new measurement is obtained. Meanwhile, we compute and let . Then the reconstruction at time is computed based on all the previously collected measurements; we denote it explicitly as according to (6). In this way, we can compute the reward as the increment in the PSNR of the reconstruction: . When the reward and state are ready, the value network approximates the value in order to estimate the advantage function ; we denote this estimate by . These estimates are used in PPO. The episode ends after steps, yielding an interaction trajectory . In practice, we generate several episodes in parallel and train the policy network and the value network on mini-batches of episodes. Based on these simulations, we compute the surrogate objective function of PPO and optimize our neural networks using auto-differentiation and Adam; the choice of hyperparameters can be found in Section .
A major advantage of using PPO is that we can apply multiple optimization steps using a few trajectories without destructively large policy updates. A more detailed algorithm can be found in Algorithm 1.
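As a small illustration of the quantities fed into PPO, one simple way to estimate advantages with a discount of 1 is returns-to-go minus the value-network baseline. This is a generic Monte-Carlo sketch, not necessarily the paper's exact estimator:

```python
import numpy as np

def returns_and_advantages(rewards, values, gamma=1.0):
    """Monte-Carlo returns-to-go G_t and advantages A_t = G_t - V(s_t).

    With gamma = 1 (as in this paper), every step's reward counts
    equally. `values` are the value-network estimates V(s_t)."""
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G, G - np.asarray(values, dtype=float)
```

Subtracting the learned baseline reduces the variance of the policy-gradient estimate without biasing it.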
After training the RL model on a training set, we need to test the performance of the model on a new test set. The test procedure is a bit different from the training procedure.
3.8 Test Procedure
Given the fully trained policy network , we can test the RL model on a set of scatterers similar to those in the training set. During testing, given a scatterer randomly selected from the test set, the policy network generates an interaction trajectory . The initialization is the same as before: , , and . At step , the policy network produces a policy for deciding angles and a policy for choosing frequencies based on the hidden state and the current information . These policies return probability distributions over angles and frequencies. Note that, during the training of the RL model, angles and frequencies are randomly sampled according to their probability distributions to encourage exploration of the action space. During testing, however, we choose the angle and the frequency with the highest probability, aiming for the best reconstruction quality. After deciding an action, a new sensor is placed and the state is updated: and . After steps, we apply L-BFGS with more iterations (20) to reconstruct the scatterer via (6) to ensure convergence: . This is the reconstruction of given by the RL model.
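The train/test difference in action selection reduces to sampling versus argmax; the helper below is an illustrative sketch (`freq_probs_fn` maps a chosen angle index to the conditional frequency distribution):

```python
import numpy as np

def select_action(angle_probs, freq_probs_fn, greedy, rng=None):
    """Sample from the policy during training; take the argmax at test
    time (illustrative)."""
    angle_probs = np.asarray(angle_probs, dtype=float)
    if greedy:                                 # test time: exploit
        angle = int(np.argmax(angle_probs))
    else:                                      # training: explore
        rng = rng or np.random.default_rng()
        angle = int(rng.choice(len(angle_probs), p=angle_probs))
    freq_probs = np.asarray(freq_probs_fn(angle), dtype=float)
    if greedy:
        freq = int(np.argmax(freq_probs))
    else:
        freq = int(rng.choice(len(freq_probs), p=freq_probs))
    return angle, freq
```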
4 Numerical Experiment
4.1 Setting
As introduced in the preliminary, the scatterer field is discretized at grid points. or is used in our numerical experiments, and the numerical conclusions remain similar for other 's. In the RL training and testing, scatterers are randomly generated with specific shapes. In each experiment, we generate different scatterers, of which are used for training and the rest for testing. Since our sensors are located on a unit circle, the positions of sensors are specified by integer angles in , while the possible choices of frequencies are for and for . Our algorithm allows more frequency choices for possibly better resolution, but we limit the choices here for computational efficiency. For , probes in total are used for sensing, corresponding to in Algorithm 1. For , the number is increased to . The choice of is a user-defined hyperparameter determined by the required reconstruction resolution; a larger leads to a better reconstruction. We assume that the measurements in our experiments approximately follow the model in (4), and we take the second-order and third-order models in our numerical tests. As we shall see, our method works well for both nonlinear models and is significantly better than standard sensing methods. A linear model has also been tested, and our method again outperforms existing sensing methods; since the linear case is less interesting in practice than the nonlinear ones, only the results of the second- and third-order models are presented here. We use L-BFGS to optimize the objective function in the reconstruction, and the model requires a new reconstruction at each step. Only iterations of L-BFGS are performed, and the optimization result is taken as the reconstruction; this result then serves as the initialization for the reconstruction at the next step. We set the penalization constant introduced in Section to .
In the policy network, a multi-layer GRU with recurrent layers is used, with neurons in each layer. The angle MLP has hidden layer of neurons, and the frequency MLP has hidden layers of neurons. In the value network, the value MLP is composed of hidden layer of neurons. Besides, the MLPs in the policy network and the value network that extract features from also contain hidden layers with neurons. We use Adam (kingma2014adam) to optimize the policy network and value network parameters with a learning rate of ; the coefficients and used for computing running averages of the gradient and its square are and . We train the policy network and the value network with PPO in each experiment for several hundred steps. In each step, we generate episodes using scatterers randomly chosen from the training set, and then perform step of Adam on a mini-batch of episodes randomly selected from these scatterers; this is repeated times within a single step. The PPO algorithm allows us to repeat the optimization on a few repetitive samples without destructively large policy updates, thus improving data utilization and algorithm efficiency.
We test the numerical results of different sampling strategies for angles and frequencies. The first is our reinforcement learning method that learns both angles and frequencies, denoted as "Learn Both". The second uses random angles with a fixed frequency and is denoted as "Random Angle". The third uses uniformly sampled angles with a fixed frequency and is denoted as "Uniform Angle". The fourth uses angles learned by the reinforcement learning method with a fixed, unlearned frequency and is denoted as "Learn Angle". The fifth uses a frequency learned by reinforcement learning with random angles and is therefore denoted as "Learn Frequency". Comparative experiments will show the impact of learning angles and frequencies.
For methods that do not learn how to select frequencies, a fixed is used throughout an entire episode; the frequency that achieves the lowest error in the single-frequency case is chosen. During testing, we run L-BFGS for iterations in the final reconstruction, after all probes have been placed, to ensure convergence. To quantify reconstruction accuracy, we use the Mean Squared Error (MSE):
and thepeak signaltonoise ratio (PSNR):
where is the maximum possible pixel value of an image. A method is satisfactory if it produces a small MSE or a large PSNR.
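The two metrics can be computed directly (a minimal sketch; `max_val` plays the role of the maximum possible pixel value in the PSNR formula):

```python
import numpy as np

# The two reconstruction metrics; x is the true image and x_hat the
# reconstruction, both as NumPy arrays of the same shape.
def mse(x, x_hat):
    return np.mean((x - x_hat) ** 2)

def psnr(x, x_hat, max_val=1.0):
    """PSNR in dB; max_val is the maximum possible pixel value."""
    return 10.0 * np.log10(max_val ** 2 / mse(x, x_hat))
```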
4.2 Numerical Results
In summary, the RL model that learns both angles and frequencies improves significantly over the other methods under limited sensing resources. We now present and discuss the numerical results on several datasets in detail.
Experiment . The size of the scatterers is , and the measurements come from the second-order model following (4). The scatterers are generated by randomly placing three triangles and three ovals of different sizes in a unit square. The RL model also works for the third-order model, based on the results in Experiment , but we choose the second-order model here to save computational cost. In the second-order model, the norm of the second-order measurement (the second-order term in (4)) is around of the norm of the first-order one. We test this setting because the first-order term should dominate the expansion, while the second-order data should not be too small (about one order of magnitude smaller), so that we can verify the power of the RL model in a nonlinear setting. We also tested cases where the norm of the second-order data is or as large as that of the first-order data; the model performs similarly, so we do not present those results here.
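A hypothetical sketch of such a scatterer generator is given below. Only the shape counts (three triangles, three ovals, placed in the unit square) come from the text; the grid size, the size ranges, and the binary contrast are our assumptions.

```python
import numpy as np

# Hypothetical sketch of the scatterer generator: three ovals and three
# triangles of random sizes rasterized on an n x n grid over the unit square.
def make_scatterer(n=64, seed=0):
    rng = np.random.default_rng(seed)
    y, x = np.mgrid[0:n, 0:n] / n                 # pixel coordinates in [0, 1)
    img = np.zeros((n, n))
    for _ in range(3):                            # three ovals
        cx, cy = rng.uniform(0.2, 0.8, size=2)    # center
        ax, ay = rng.uniform(0.05, 0.15, size=2)  # semi-axes
        img[((x - cx) / ax) ** 2 + ((y - cy) / ay) ** 2 <= 1.0] = 1.0
    for _ in range(3):                            # three triangles
        v = rng.uniform(0.1, 0.9, size=(3, 2))    # random vertices (x, y)
        inside = np.ones((n, n), dtype=bool)
        for i in range(3):
            a, b, c = v[i], v[(i + 1) % 3], v[(i + 2) % 3]
            # Signed-area test: keep the side of edge (a, b) containing c.
            cross = (b[0] - a[0]) * (y - a[1]) - (b[1] - a[1]) * (x - a[0])
            ref = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
            inside &= cross * ref >= 0
        img[inside] = 1.0
    return img
```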
The policy network and the value network are trained for steps, and iterations of Adam are performed in each step. After being fully trained, the model selects different frequencies ranging from to , while the methods that do not learn frequencies are assigned a fixed frequency of .
We compute the MSE and PSNR of all methods over test samples randomly selected from the test set. The reconstructions of our model have the smallest error and the largest PSNR among all methods on all test samples. The means and standard deviations of MSE and PSNR are shown in Table 1, and the reconstruction results are visualized in Figure 5. Learning both angles and frequencies clearly results in a significantly smaller MSE and a larger PSNR than random or uniform angles, and is also better than learning angles only or learning frequencies only. These results demonstrate the effectiveness and necessity of training both angles and frequencies. Meanwhile, the significantly lower variance of the MSE shows that our algorithm is more stable than the others.
|                 | Learn Both       | Random Angle    | Uniform Angle   | Learn Angle    | Learn Frequency |
|-----------------|------------------|-----------------|-----------------|----------------|-----------------|
| MSE mean (std)  | 7.6e-5 (1.2e-10) | 0.0008 (1.3e-7) | 0.0012 (2.8e-7) | 0.00015 (6e-9) | 0.0018 (6.2e-7) |
| PSNR mean (std) | 113.4 (0.35)     | 103.6 (5.2)     | 101.6 (3.9)     | 110.7 (3.8)    | 100 (6.1)       |
Experiment . The scatterers are handwritten digits from the MNIST dataset (mnist), chosen to test a different kind of scatterer. The grid size for discretizing the domain is . The measurements are generated from the second-order model. The RL model is trained for steps, and it chooses frequency for all sensors. This choice means that the RL model decides that a single frequency is more suitable than multiple frequencies for this type of scatterer. Nevertheless, learning frequencies is still better than fixing a frequency empirically and manually, because one does not know in advance which frequency will lead to better reconstructions, especially as the number of possible frequencies increases. Since the RL model selects the same frequency for all incident waves, there is no difference between learning both and learning angles only; the same holds for sampling angles uniformly versus learning frequencies only. We therefore only compare learning both angles and frequencies, random angles, and uniform angles in Table 2 and Figure 6. From the table and the figure, it is clear that learning both is significantly better than random or uniform angles, which demonstrates the necessity of training angles.

|                 | Learn Both       | Random Angle  | Uniform Angle   |
|-----------------|------------------|---------------|-----------------|
| MSE mean (std)  | 0.00012 (8.5e-9) | 0.0046 (3e-7) | 0.0039 (3.7e-7) |
| PSNR mean (std) | 111.6 (3.3)      | 95.6 (0.29)   | 96.29 (0.43)    |
Experiment . The size of the scatterers is , and the measurements are generated from the third-order model in (4). We train and test the RL model separately on two types of scatterers. The first type is the same as in Experiment : scatterers generated by randomly placing three triangles and three ovals of different sizes in the unit square. The second type is generated by randomly placing three circles with different random intensities. The RL model is trained for iterations with PPO. The final model sets the frequencies to and in the first case and chooses in the second. Because the second case is reduced to a single-frequency one by the RL model, as in Experiment , we do not show the results of learning angles only and learning frequencies only. The errors are recorded in Table 3 and Table 4, and the reconstruction results are visualized in Figure 7. From these results, we see that the error of learning both angles and frequencies is much smaller than that of learning angles only, which illustrates the benefit of training frequencies. The other observations are similar to those of the former two experiments. As shown in Figure 7, the RL model is also able to handle objects of different intensities.
|                 | Learn Both      | Random Angle | Uniform Angle | Learn Angle   | Learn Frequency |
|-----------------|-----------------|--------------|---------------|---------------|-----------------|
| MSE mean (std)  | 0.0006 (1.2e-8) | 0.004 (7e-7) | 0.004 (8e-7)  | 0.0017 (5e-7) | 0.011 (7e-6)    |
| PSNR mean (std) | 104.3 (0.62)    | 96.3 (1.4)   | 96.2 (1.6)    | 100.7 (4.4)   | 92.2 (1.2)      |
|                 | Learn Both      | Random Angle  | Uniform Angle |
|-----------------|-----------------|---------------|---------------|
| MSE mean (std)  | 0.0007 (1.4e-8) | 0.0045 (9e-7) | 0.0053 (4e-7) |
| PSNR mean (std) | 103.6 (0.9)     | 95.9 (2)      | 95 (0.5)      |
The numerical results show that learning both angles and frequencies achieves the highest resolution, with a significant improvement over the other methods. Meanwhile, the variance of the error indicates that the RL algorithm is also much more stable than the other methods. Learning angles only is the second-best method, though it produces only a blurry reconstruction; the gap between the best two methods demonstrates the effectiveness of training frequencies. Sampling angles randomly performs close to sampling uniformly, and both are much worse than the previous two methods, which implies that learning angles is necessary for data collection. Learning frequencies only performs worst, which means that the learned frequencies must work together with the learned angles. Altogether, the numerical results demonstrate the necessity of training both sensor angles and incident-wave frequencies; the proposed RL model outperforms all the other methods.
5 Discussion
In this paper, reinforcement learning is applied to learn a policy that selects scatterer-dependent sensing angles and frequencies in inverse scattering. The process of sensor installation, information collection, and scatterer reconstruction is reformulated as a Markov decision process, so reinforcement learning can be used to optimize it. A recurrent neural network is adopted as the policy network to choose sensor locations and wave frequencies adaptively. The proposed reinforcement learning method learns to make scatterer-dependent decisions from previous imaging results, each of which requires the solution of an expensive optimization problem; to facilitate convergence and reduce the computational cost, a warm-start strategy is used in these optimization problems. Extensive numerical experiments have been conducted on several types of scatterers with the second- and third-order nonlinear inverse scattering models. The results demonstrate that the proposed method significantly outperforms existing algorithms in reconstruction quality. This paper serves as a first step towards intelligent computing for precision imaging in inverse scattering; the case of weak scattering is adopted as a proof of concept, and more advanced learning techniques will be developed in the future to handle more challenging cases.
Acknowledgments
Y. K. was partially supported by NSF grant DMS-2111563. H. Y. was partially supported by the NSF CAREER Award DMS-1945029 and the NVIDIA GPU grant. We thank the authors of the seminal work (ct) for sharing their code.