
Reinforced Inverse Scattering

06/08/2022
by   Hanyang Jiang, et al.
University of Maryland
The University of Chicago

Inverse wave scattering aims at determining the properties of an object from data on how the object scatters incoming waves. In order to collect information, sensors are put in different locations to send and receive waves from each other. The choice of sensor positions and incident wave frequencies determines the reconstruction quality of the scatterer properties. This paper introduces reinforcement learning to develop precision imaging that decides sensor positions and wave frequencies adaptively for different scatterers in an intelligent way, thus obtaining a significant improvement in reconstruction quality with limited imaging resources. Extensive numerical results are provided to demonstrate the superiority of the proposed method over existing methods.


1 Introduction

Nowadays, artificial intelligence (AI) has fundamentally changed a vast range of industries. AI leverages computers and machines to mimic the problem-solving and decision-making capabilities of the human mind. AI has already achieved performance comparable to that of human experts in many fields, such as image recognition (ir), Go (go), StarCraft (star), and voice generation (wavenet). Recently, AI has been applied in many scientific fields, e.g., protein structure prediction (fold), climate forecasting (clim), astronomical pattern recognition (ast), etc. These exciting successes encourage the exploration of AI in various areas of scientific research. This paper introduces reinforcement learning (RL) to inverse problems and develops an intelligent computing method for inverse scattering to achieve precision imaging. In particular, a reinforcement learning framework is designed to learn and decide sensor positions and wave frequencies adaptively for different scatterers in an intelligent way, thus obtaining a significant improvement in reconstruction quality with limited imaging resources.

The inverse scattering problem is to reconstruct or recover the physical and/or geometric properties of an object from the measured data. The reconstructed information of interest includes, for instance, the dielectric constant distribution and the shape or structure. The interrogating or probing radiation can be an electromagnetic wave (e.g., microwave, optical wave, and X-ray), an acoustic wave, or some other waves. The problem of inverse scattering is important when details about the structure and composition of an object are required. Inverse scattering has wide applications in nondestructive evaluation, medical imaging, remote sensing, seismic exploration, target identification, geophysics, optics, atmospheric sciences, and other such fields (app1; app2; app3; app4; app5; s1).

We focus on the two-dimensional time-harmonic acoustic inverse scattering problem as a proof of concept for the reinforcement learning framework. In a compact domain of interest Ω, the inhomogeneous media scattering problem at a fixed frequency ω is modeled by the Helmholtz equation

Δu(x) + (ω^2 / c(x)^2) u(x) = s(x),

where c(x) is an unknown velocity field. We assume that there is a known background velocity field c_0(x) such that c(x) = c_0(x) except in the domain Ω. We introduce a scatterer η compactly supported in Ω:

η(x) := 1/c(x)^2 - 1/c_0(x)^2.

Then we can work with η instead of c. The aim of the inverse problem is to recover the unknown η given some observation data d. Waves are sent from a set of sensors and received by a set of receivers, and the intrinsic properties of the scatterer are contained in the measurements d. The corresponding forward problem aims at computing d from a given η. Both problems are computationally challenging. It is difficult to obtain a numerical solution of the inverse problem because of the nonlinearity of reconstructing η. Traditionally, several types of numerical methods have been developed for this inverse problem, and they can be mainly divided into two categories: nonlinear-optimization-based iterative methods (o1; o2; o3) and imaging-based direct methods (i1; i2; i3). Recently, deep learning has been introduced to solve inverse scattering problems with new developments (d1; d2; d3; rec; add).
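To make the forward problem concrete, below is a minimal numerical sketch (not the solver used in this paper): it assembles a 5-point finite-difference discretization of the Helmholtz operator on the unit square with a zero Dirichlet boundary and a point source, then forms a scattered field as the difference between the perturbed and background solutions. The grid size, frequency, boundary condition, and all names are illustrative assumptions.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def helmholtz_forward(eta, omega=20.0, c0=1.0, src=(0.1, 0.9)):
    """Solve (Delta + omega^2/c(x)^2) u = s on the unit square with a 5-point
    stencil and zero Dirichlet boundary (illustrative only; a practical solver
    would use absorbing boundary conditions). Here eta = 1/c^2 - 1/c0^2."""
    n = eta.shape[0]                      # grid points per dimension
    h = 1.0 / (n + 1)
    # 1D second-difference matrix and the 2D Laplacian via Kronecker sums
    main, off = -2.0 * np.ones(n), np.ones(n - 1)
    L1 = sp.diags([off, main, off], [-1, 0, 1]) / h**2
    I = sp.identity(n)
    lap = sp.kron(I, L1) + sp.kron(L1, I)
    # squared wavenumber: omega^2/c^2 = omega^2 * (1/c0^2 + eta)
    k2 = omega**2 * (1.0 / c0**2 + eta.reshape(-1))
    A = (lap + sp.diags(k2)).tocsc()
    # point source at a chosen grid location
    s = np.zeros(n * n)
    i, j = int(src[0] * n), int(src[1] * n)
    s[i * n + j] = 1.0 / h**2
    return spla.spsolve(A, s).reshape(n, n)

# usage: a small Gaussian bump as the scatterer; the scattered field is the
# difference between the perturbed and background solutions
n = 64
x = np.linspace(0, 1, n)
X, Y = np.meshgrid(x, x, indexing="ij")
eta = 0.2 * np.exp(-200 * ((X - 0.5)**2 + (Y - 0.5)**2))
u_scattered = helmholtz_forward(eta) - helmholtz_forward(np.zeros_like(eta))
```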

Inverse scattering problems are ill-posed when the incident wave has only one frequency due to the lack of stability (ill1). Minor variations in measured data may lead to significant errors in the reconstruction (sta1; sta2). There have been extensive efforts in different directions trying to alleviate this issue. For example, regularization methods under single-frequency data (single1; single2) have been proposed to increase reconstruction efficiency and stability. Another direction to alleviate this stability issue is to apply multi-frequency data in the case of time-harmonic scattering problems (m1; m2). It can be shown that the inverse problem is uniquely solvable and is Lipschitz stable when the highest wavenumber exceeds a certain threshold. However, the nonlinear equation becomes more oscillatory at a higher frequency and contains many more local minima. Therefore, (r1) developed a recursive-linearization-based algorithm utilizing multi-frequency data to form a continuation procedure that combines the advantages of low and high frequencies. In detail, it solves the essentially linear equation at the lowest wavenumber and then gradually uses the solution to linearize the equation at higher frequencies. Seminal works in other directions have also been proposed for inverse scattering. For example, single-frequency algorithms can be naturally extended to multi-frequency versions following the idea in (mr1). (mr3) devises a novel Fourier method that directly reconstructs acoustic sources from multi-frequency measurements, avoiding the expensive computation incurred by iterative methods.

Current literature focuses on computational algorithms and uses fixed sensor positions and frequencies. Motivated by the numerical challenges and the aforementioned works, a reinforcement learning framework is proposed in this paper to select scatterer-dependent sensor locations and multiple frequencies to improve the reconstruction stability and quality for precision imaging. Previously, reinforcement learning has been applied to medical imaging (ct; med; med2), a numerical problem related to inverse scattering. Our algorithm is mainly inspired by the work (ct), where reinforcement learning is applied to learn sensor locations and X-ray doses in CT imaging. It is worth emphasizing several differences. First, the reconstruction problem in CT imaging is a linear problem while the one in inverse scattering is nonlinear and, hence, more challenging. Second, the goal in CT imaging is to optimize sensor locations by balancing sensing safety and reconstruction quality, while the goal in inverse scattering is to balance sensing expense and reconstruction quality, leading to a different learning target in this paper than the one in (ct). Finally, we develop a new reinforcement learning framework that optimizes not only sensor locations but also incident wave frequencies. We focus on the case of relatively weak scatterers as a proof of concept while still maintaining the nonlinear nature of the forward problem by keeping a few leading-order terms in the Born series. Extensive numerical results will be provided to demonstrate the superiority of the proposed method over existing methods with limited imaging resources.

The rest of the paper is organized as follows. In Section 2, the inverse scattering problem is introduced. In Section 3, we explain the proposed reinforcement learning framework. In Section 4, numerical results are provided to demonstrate the effectiveness of the proposed framework. In Section 5, we conclude this paper with a short discussion.

2 Preliminary of Inverse Scattering

2.1 Background

In this section, we discuss the forward model for the inverse scattering problem. The inhomogeneous media scattering problem at a fixed frequency ω is modeled by the Helmholtz equation introduced in Section 1. It is assumed that the scatterer η is compactly supported in a domain Ω (see Figure 1 for a visualization of the setup). Typically, in a numerical solution of the Helmholtz operator, Ω is discretized by a Cartesian grid at the rate of a few points per wavelength. Assume that the grid has N points and let X denote the set of discretization points of Ω. After discretization, the scatterer field η can be treated as a vector in R^N evaluated at X.

Assuming a known background velocity c_0, the background Helmholtz operator at frequency ω can be written as L_0 = Δ + ω^2/c_0(x)^2. Then the full operator L = Δ + ω^2/c(x)^2 can be treated as a perturbation of L_0 with L = L_0 + ω^2 η. Consider G_0 = L_0^{-1} as a background Green's function. When the scatterer field η is sufficiently small, the expansion of G = L^{-1} can be constructed via the Born series

G = G_0 - ω^2 G_0 η G_0 + ω^4 G_0 η G_0 η G_0 - ...

Note that G_0 can be determined by the known background velocity c_0. Therefore, the difference G - G_0 becomes the quantity of interest for recovering η. In a standard experimental setup, a set of sources, denoted as S, and a set of receivers, denoted as R, are installed around Ω. Incident waves are sent from sources to probe the intrinsic structure of a scatterer, and receivers record the waves scattered from the object. Let Π_S be a source-dependent operator imposing an incoming wave field via the sources in S. Similarly, let Π_R be a receiver-dependent operator collecting data with the receivers in R. Then the observation data can be modeled as

d = Π_R (G - G_0) Π_S = Π_R ( -ω^2 G_0 η G_0 + ω^4 G_0 η G_0 η G_0 - ... ) Π_S.

(1)

In this paper, a summation of finitely many terms in the expansion (1) is used as the forward model in our computation. Note that this forward model is a high-order polynomial in η, which leads to a challenging nonlinear model in inverse scattering. Next we concretely provide the form of the expansion (1) under a far-field assumption.
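As an illustration of such a truncated forward model, the sketch below evaluates the first few terms of a Born-type series given matrix stand-ins for the background Green's function and the source/receiver restriction operators. The operator names, the signs, and the ω^2 scaling follow the conventions assumed in (1) above and are not taken verbatim from the paper.

```python
import numpy as np

def born_data(eta, G0, S, R, omega=1.0, order=2):
    """Truncated Born-series data model (illustrative): the scattered-field
    operator is approximated by the first `order` terms of
        G - G0 = -w^2 G0 E G0 + w^4 G0 E G0 E G0 - ...,  E = diag(eta),
    and the measurements are d = R (G - G0) S for source/receiver
    restriction matrices S and R."""
    E = np.diag(eta)
    term = -(omega**2) * (G0 @ E @ G0)     # first-order (linearized) scattering
    total = term.copy()
    for _ in range(order - 1):             # higher-order multiple scattering
        term = -(omega**2) * (term @ E @ G0)
        total += term
    return R @ total @ S

# usage with random stand-in operators (shapes only)
rng = np.random.default_rng(0)
n, ns, nr = 100, 8, 8                      # grid points, sources, receivers
G0 = rng.standard_normal((n, n)) / n       # stand-in background Green's matrix
S = rng.standard_normal((n, ns))           # incoming-field operator for sources
R = rng.standard_normal((nr, n))           # sampling operator for receivers
eta = 1e-2 * rng.standard_normal(n)        # weak scatterer
d = born_data(eta, G0, S, R, order=2)      # (nr, ns) measurement matrix
```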

2.2 Far-Field Approximation

Without loss of generality, we assume that the domain Ω is rescaled so that it is supported in the unit circle. For simplicity, we assume a unit background velocity, c_0 ≡ 1. We also assume that every sensor is placed on the unit circle, with its location represented by an angle. Let ŝ be a unit direction; the source in direction ŝ sends out an incoming plane wave e^{iω x·ŝ}. The scattered wave field at a large distance is modeled by (colton1998inverse):

(2)

where the far-field pattern is defined on the unit circle of directions. The incident wave launched from a source at one location on the unit circle is transmitted through the domain Ω and received by a receiver at another location on the unit circle. The measurement data corresponding to this transmission process is modeled by the corresponding value of the far-field pattern.

Figure 1: The data generating process for the far-field pattern problem.

The exact form of the far-field data can be derived under the general framework of (1). Consider a source located at a position ρŝ, where ŝ is a unit direction and ρ is a distance going to infinity. The source magnitude is required to scale with ρ to compensate for the spreading of the wave field and the phase shift (see, for example, d2), and the limit ρ → ∞ gives the limiting incoming field. The same setup works in the prescription of receivers: suppose a receiver is located at a position ρr̂, where r̂ is a unit direction and ρ is a distance going to infinity, and take the same limit. Combining the two limiting processes, the observed data for a source in direction ŝ and a receiver in direction r̂ can be computed as follows in the sense of this limit:

(3)

where ω is the frequency of the wave field sent from the source. As mentioned previously, we approximate (3) by a polynomial in η up to a fixed order. For simplicity, we write (3) succinctly as

d = F_ω(η; S, R),

(4)

where F_ω(·; S, R) denotes this truncated forward map for frequency ω, source set S, and receiver set R.

In the next section, based on the forward model (3), we introduce the proposed reinforcement learning scheme to solve the inverse problem.

3 Reinforcement Learning Framework for Inverse Scattering

In the typical data collection of inverse scattering, sensors are installed either randomly or uniformly on the unit circle, and wave frequencies are selected empirically. To achieve better reconstruction quality and stability, reinforcement learning is applied in this paper to learn a strategy that adaptively decides sensor angles and wave frequencies in a sequential manner: 1) several sets of sensors and frequencies are set up sequentially; 2) the locations of sensors and the frequencies of waves are decided according to the reconstruction results obtained from previously collected data. This is a sequential decision process that gradually adjusts data collection to obtain better reconstructions; furthermore, this method uses individualized strategies for imaging different scatterers.

The rest of the section is organized as follows. In Section 3.1, we introduce the problem setting, which establishes the foundation for the following sections. In Section 3.2, we describe the MDP formulation of the problem, which enables us to use RL methods to solve it. In Section 3.3, we review some basic notions and methods in RL and introduce the RL algorithm used to optimize the policy in the MDP described in Section 3.2. In Section 3.4, we introduce the solver used in scatterer reconstruction, which is required by the RL algorithm in Section 3.3. In Sections 3.5 and 3.6, we present the structures of the policy network and the value network. In Sections 3.7 and 3.8, we explain the training and test procedures of our RL model, combining all the components from Sections 3.1 to 3.6.

3.1 Problem Setting

Without loss of generality, we assume that the true scatterer is compactly supported in a unit square centered at the origin, and all the probes are placed on the unit circle containing this square. In the reinforcement learning framework described later, an action a_t decides the location of a sensor and the choice of frequency at time t. We discretize the unit circle uniformly and define an indicator vector to specify the location of one sensor; it has only one non-zero entry, indicating the angle of the sensor on the unit circle. In the experiment, we place one sensor on the unit circle in each step until T sensors have been placed, where T is a number decided by the user in advance. In each step, sensors send out incident waves and receive scattered waves. At time t, given the group of sensors placed sequentially in the previous steps, a new sensor is added on the unit circle. A wave field of frequency ω_t is launched from this new sensor and transmits through the domain Ω; the sensors already placed then receive the scattered wave. Each sensor is not only a source but also a receiver. We denote the receiver set at time t as R_t and the source set at time t as S_t; therefore, S_t contains only the newly added sensor while R_t contains all t sensors placed so far. We also assume the observed data at time t can be approximated by (4): d_t = F_{ω_t}(η; S_t, R_t). It is important to emphasize that the frequency at step t can differ from that of other steps. After T steps, the data collection procedure ends and we reconstruct the scatterer with all the recorded measurements. We define the entire data collection procedure as an episode; this episode consists of T sequential steps. The goal of this paper is to improve the reconstruction quality while limiting the number of probes used. Our solution is to learn an optimal strategy for data collection.

The original problem of determining sensor locations and wave frequencies is a combinatorial optimization problem and is NP-hard. We formulate the problem as a Markov Decision Process (MDP) in Section 3.2, which can then be solved by reinforcement learning methods.

3.2 Markov Decision Process Formulation

The procedure of deciding sensor locations and frequency values in inverse scattering is a sequential decision problem, where one needs to make a choice of angle and frequency at each step. Thus it can be formulated as a Markov Decision Process (MDP), and we can use RL to solve the problem efficiently. We now elaborate on how to formulate our problem as an MDP:

  1. State. The state at time t is s_t = (D_t, q_t, T - t). The first term D_t = {d_1, ..., d_t} is the collection of observation data up to step t; the measurement at step t comes from sending a wave from a single source to the t receivers placed so far. All measurements up to time t are included in the state because the reconstruction at step t relies on all the previously collected data. The second term q_t is a vector recording all the angles where sensors have already been placed by time t together with the corresponding wave frequencies: if the k-th discretized angle was selected at some step j ≤ t, then the k-th entry of q_t is the frequency ω_j of the wave sent at step j; if no wave is sent from a specific angle, the corresponding entry of q_t is 0. The last term, T - t, is the number of sensors left to be placed.

  2. Action. The action taken at time t is a_t = (p_t, ω_t), which defines the choices of angle and frequency. Here p_t is a one-hot vector denoting the angle of the new sensor added at time t, and ω_t stands for the frequency of its incident wave. In particular, q_t can be obtained from q_{t-1}, p_t, and ω_t by setting the entry selected by p_t to ω_t.

  3. Transition model. The state s_{t-1} and the action a_t determine a deterministic next state s_t under a noise-free model. According to (4), the new measurement d_t can be computed given p_t and ω_t. Meanwhile, q_t is obtained from q_{t-1} by recording the new angle-frequency pair. Therefore, D_t = D_{t-1} ∪ {d_t} and the new state is s_t = (D_t, q_t, T - t).

  4. Reward. The reward at time t is r_t, defined as the increment in the Peak Signal to Noise Ratio (PSNR) of the reconstruction compared to the last step. PSNR is commonly used to quantify reconstruction quality for images; we use its increment here to quantify how much the new reconstruction has been improved by the new action. Suppose the reconstruction at step t is η̂_t and the true scatterer is η*; then r_t = PSNR(η̂_t) - PSNR(η̂_{t-1}). In this paper, η̂_t is reconstructed by a regularization-based optimization method to be introduced later. We would like to point out that better results might be possible if more sophisticated reconstruction methods were used.

Since the effect of an action on the eventual reconstruction quality is not available in an explicit analytic form, we need a reinforcement learning framework to learn a good policy directly from interactions with the environment.
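The following is a minimal sketch of this MDP as an environment object, assuming placeholder forward and reconstruct callables that stand in for the truncated model (4) and the regularized solver of Section 3.4; the class and attribute names are illustrative, not the paper's implementation.

```python
import numpy as np

def psnr(recon, truth, max_val=1.0):
    mse = np.mean((recon - truth) ** 2)
    return 10.0 * np.log10(max_val**2 / mse)

class ScatteringEpisode:
    """One data-collection episode: place T sensors one by one, each time
    probing with a chosen frequency, reconstructing, and returning the PSNR
    increment as the reward (forward/reconstruct are user-supplied stubs)."""

    def __init__(self, eta_true, forward, reconstruct, n_angles=360, T=20):
        self.eta_true, self.forward, self.reconstruct = eta_true, forward, reconstruct
        self.n_angles, self.T = n_angles, T
        self.angles, self.freqs, self.data = [], [], []
        self.recon = np.zeros_like(eta_true)          # initial reconstruction
        self.last_psnr = psnr(self.recon, eta_true)   # baseline PSNR

    def state(self):
        # q_t: entry k holds the frequency used at angle k (0 if unused)
        q = np.zeros(self.n_angles)
        for a, w in zip(self.angles, self.freqs):
            q[a] = w
        return self.data, q, self.T - len(self.angles)  # sensors left to place

    def step(self, angle, freq):
        # place a new sensor, probe with the chosen frequency, collect new data
        self.angles.append(angle)
        self.freqs.append(freq)
        self.data.append(self.forward(self.eta_true, self.angles, angle, freq))
        # reconstruct from all data collected so far (warm-started in practice)
        self.recon = self.reconstruct(self.data, self.angles, self.freqs,
                                      init=self.recon)
        new_psnr = psnr(self.recon, self.eta_true)
        reward = new_psnr - self.last_psnr               # PSNR increment
        self.last_psnr = new_psnr
        done = len(self.angles) == self.T
        return self.state(), reward, done
```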

3.3 Reinforcement Learning Algorithm

In RL, an agent interacts with the environment to obtain a sequence of data, based on which the agent learns a policy to maximize a certain accumulated reward for finishing a task. Given an interaction trajectory between the agent and the environment, τ = (s_0, a_0, r_0, s_1, a_1, r_1, ...), the total reward over time is

R(τ) = Σ_t γ^t r_t,

where γ ∈ [0, 1] is a discount rate. We use γ = 1 since the reward at each step is of equal importance in our application. The policy π in the MDP is defined as the conditional probability π(a | s) for states s and actions a; π(a_t | s_t) denotes the probability of taking action a_t at step t while the state is s_t. The value function of a state s following a certain policy π is given as

V^π(s) = E_π[ Σ_t γ^t r_t | s_0 = s ].

Similarly, the value function of a state-action pair is defined as

Q^π(s, a) = E_π[ Σ_t γ^t r_t | s_0 = s, a_0 = a ].

In an MDP, the RL algorithm aims to find an optimal policy π* that maximizes the expected value of the initial state s_0 following a probability distribution ρ_0:

π* = argmax_π E_{s_0 ∼ ρ_0}[ V^π(s_0) ].

Currently, model-free RL algorithms mainly fall into two categories: value-based methods and policy-based methods. Value-based algorithms learn the state or state-action value and act by choosing the best action in each state, which requires comprehensive exploration. For example, Q-learning (qlearning) learns the optimal Q function through the Bellman equation and chooses the greedy action that maximizes the learned Q function. Since this maximization requires searching over the action space, it becomes slow and imprecise if the actions are continuous. This means that value-based algorithms are more suitable for discrete actions. On the other hand, policy gradient methods (policy1; policy2) are more suitable for continuous actions because they directly optimize a parameterized policy under a surrogate objective. Though the variables (angles and frequencies) in our current setting are discrete, we adopt the policy-based approach for possible future extensions to continuous cases.

More specifically, we use policy gradient methods that directly optimize a policy π_θ (parameterized by a neural network with parameters θ) under a certain objective function. Policy gradient methods seek to maximize the performance of π_θ via stochastic updates whose expectation approximates the gradient of the performance measure with respect to θ. In our experiments, we use the Proximal Policy Optimization (PPO) algorithm (ppo). Given an old policy π_{θ_old} and a new policy π_θ, let ρ_t(θ) denote the probability ratio π_θ(a_t | s_t) / π_{θ_old}(a_t | s_t). An advantage function of a policy π is introduced as

A^π(s, a) = Q^π(s, a) - V^π(s).

(5)

The objective function to optimize is then defined as:

L^CLIP(θ) = E_t[ min( ρ_t(θ) Â_t, clip(ρ_t(θ), 1 - ε, 1 + ε) Â_t ) ],

where ε is a clipping hyperparameter and Â_t is an estimate of the advantage at step t. This method employs clipping to avoid destructively large policy updates, which retains the stability and reliability of trust-region methods but is much simpler to implement in practice.
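For concreteness, here is a minimal PyTorch sketch of the clipped surrogate loss (together with the squared-error loss commonly used to fit the value network); it assumes that log-probabilities, advantage estimates, and value targets have already been collected from rollouts.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate objective of PPO (returned as a loss to minimize).
    Inputs are per-step log-probabilities under the new and old policies and
    advantage estimates, all 1-D tensors of the same length."""
    ratio = torch.exp(logp_new - logp_old)                 # probability ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.mean(torch.min(unclipped, clipped))

def value_loss(values_pred, returns):
    # squared-error regression target for the value network
    return torch.mean((values_pred - returns) ** 2)
```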

In the MDP, a reward needs to be computed at each step. This means that we need to generate a reconstruction of the scatterer at each step because of our choice of the reward.

3.4 Reconstruction

According to (4), for a choice of frequency ω_t, receiver set R_t, and source set S_t at step t, the approximate measurement is given by

d_t = F_{ω_t}(η*; S_t, R_t),

where η* is the true scatterer. We reconstruct η by minimizing an ℓ_2 loss of data discrepancy with an ℓ_1 penalization to encourage sparsity:

min_η ||F_ω(η; S, R) - d||_2^2 + λ ||η||_1,

where λ is a hyperparameter. Due to the small number of measurements, we observe that the use of a regularization term like the ℓ_1 penalty significantly improves the results. A new reconstruction is generated at each step in order to compute a reward for the RL model. Given the data collected up to time t, the reconstruction obtained at time t is:

η̂_t = argmin_η Σ_{j=1}^{t} ||F_{ω_j}(η; S_j, R_j) - d_j||_2^2 + λ ||η||_1,

(6)

which takes advantage of all the previously collected measurements.

In our numerical experiments, λ is fixed to a small constant and L-BFGS (rec) is used to solve the optimization problem in (6) at each step. Given an initialization, we perform a small number of L-BFGS iterations to obtain the reconstruction of the current step. It is important to point out that a warm start is necessary to obtain good results: in our tests, we use the reconstruction result of the last step as the initialization of the current step. This greatly reduces the computational cost and helps ensure convergence.
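A minimal sketch of this step-wise reconstruction is given below. For simplicity it uses a linear stand-in for the forward map and a smoothed ℓ_1 penalty so that L-BFGS applies; the warm start enters through the initialization argument. The function name, defaults, and the linearization are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.optimize import minimize

def reconstruct(A, d, lam=1e-3, x0=None, maxiter=5, eps=1e-8):
    """Minimize ||A eta - d||_2^2 + lam * ||eta||_1 with a few L-BFGS
    iterations, warm-started at the previous reconstruction x0. A linear
    operator A stands in for the forward map; the l1 term is smoothed as
    sqrt(eta^2 + eps) so that a gradient exists everywhere."""
    n = A.shape[1]
    x0 = np.zeros(n) if x0 is None else x0

    def fg(eta):
        r = A @ eta - d
        smooth_abs = np.sqrt(eta**2 + eps)
        f = r @ r + lam * smooth_abs.sum()
        g = 2.0 * (A.T @ r) + lam * eta / smooth_abs
        return f, g

    res = minimize(fg, x0, jac=True, method="L-BFGS-B",
                   options={"maxiter": maxiter})
    return res.x

# warm start across steps: pass the previous estimate as x0, e.g.
# eta_t = reconstruct(A_t, d_t, x0=eta_prev, maxiter=5)
```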

In the previous sections, we formulated the original problem as an MDP and introduced how to compute its components. In order to apply policy gradient methods, we need to build a policy network that parameterizes the policy to be learned.

3.5 Policy Network

In order to handle the increasing dimension of the state over time, we use a Recurrent Neural Network (RNN) to parameterize the policy. The structure of an RNN enables us to store all the past information in a hidden state while adding new information at each step. More specifically, we use multi-layer Gated Recurrent Units (GRU), introduced in 2014 in (gru) and also applied in (ct); a GRU is similar to a long short-term memory (LSTM) unit but has fewer parameters. The structure of the whole GRU is shown in Figure 2. The policy network is denoted by π_θ, where θ represents the training parameters, and the output of each recurrent layer is called a hidden state. First, a multi-layer perceptron (MLP) is used to extract features from the input at the current step. Then, these features and the hidden state from the previous step are processed by the GRU. In order to learn the sensor angle, the output of the GRU is further processed by another MLP, which learns the policy for the angle based on the information in the state. With the help of a softmax function, its output is turned into a categorical distribution over angles (a 360-dimensional vector); the value at each entry denotes the probability of placing a sensor at the corresponding angle. Because an angle cannot be chosen more than once in one episode, a mask is applied to remove all the previously chosen angles. The resulting distribution represents the policy for the angle. During training, an angle is sampled from this distribution and encoded as a one-hot 360-dimensional vector, with 1 at the chosen angle and 0 elsewhere. After that, the output of the GRU and the selected angle are merged into a single vector, which is used to learn the frequency policy, because the wave frequency should depend not only on the state but also on the chosen angle. In our experiments, the frequency can only be chosen from a small set of given values, so the merged vector is processed by another MLP whose output is a categorical distribution over the candidate frequencies; this distribution represents the policy of the frequency conditioned on the chosen angle. Finally, a frequency is sampled from this distribution. The structure of the policy network is shown in Figure 3. To conclude, the policy network learns the policy of an action given a state, which consists of an angle policy and a frequency policy; the angles and frequencies generated by the policy network during training are random samples from these two distributions.
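A compact PyTorch sketch of such a policy network is shown below: an MLP encoder, a two-layer GRU, a masked 360-way angle head, and a frequency head conditioned on the sampled angle. Layer sizes, names, and the number of frequency candidates are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """GRU-based policy: encode the current input, update the hidden state,
    output a masked categorical distribution over angles, then a categorical
    distribution over candidate frequencies conditioned on the sampled angle."""

    def __init__(self, in_dim, hidden=128, n_angles=360, n_freq=4):
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.gru = nn.GRU(hidden, hidden, num_layers=2, batch_first=True)
        self.angle_head = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                        nn.Linear(hidden, n_angles))
        self.freq_head = nn.Sequential(nn.Linear(hidden + n_angles, hidden), nn.ReLU(),
                                       nn.Linear(hidden, n_freq))

    def forward(self, x, h, angle_mask):
        # x: (batch, in_dim) current observation; h: GRU hidden state (or None);
        # angle_mask: (batch, n_angles) bool marking angles already used
        feat = self.encode(x).unsqueeze(1)               # (batch, 1, hidden)
        out, h_new = self.gru(feat, h)
        out = out.squeeze(1)
        logits = self.angle_head(out)
        logits = logits.masked_fill(angle_mask, float("-inf"))  # forbid reuse
        angle_dist = torch.distributions.Categorical(logits=logits)
        angle = angle_dist.sample()
        one_hot = nn.functional.one_hot(angle, logits.shape[-1]).float()
        freq_dist = torch.distributions.Categorical(
            logits=self.freq_head(torch.cat([out, one_hot], dim=-1)))
        freq = freq_dist.sample()
        return angle, freq, angle_dist, freq_dist, h_new
```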

Besides the policy network, a value network is required for training in policy gradient methods.

3.6 Value Network

In addition to the policy network, we also build a value network, parameterized by trainable parameters φ and with a structure similar to that of the policy network, to approximate the value function of states. The value function approximation is required in the evaluation of the advantage function of the policy in (5). The value network design is visualized in Figure 4. At each step, an MLP is used to extract features from the current input; then these features and the hidden state from the previous step are processed by a GRU to generate a new hidden state. The difference between the policy network and the value network is how this hidden state is processed to generate useful information: in the value network, the GRU output is processed by an MLP to generate a deterministic estimate of the value function, instead of the distributions produced in the policy network. The estimated value of a state s given by the value network is denoted by V_φ(s).
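Under the same assumptions, a matching sketch of the value network keeps the encoder and GRU but replaces the two categorical heads with a single scalar head:

```python
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    """GRU-based value network: same encoder/recurrent structure as the
    policy sketch above, but the head outputs one scalar value estimate."""

    def __init__(self, in_dim, hidden=128):
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.gru = nn.GRU(hidden, hidden, num_layers=2, batch_first=True)
        self.value_head = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                        nn.Linear(hidden, 1))

    def forward(self, x, h):
        out, h_new = self.gru(self.encode(x).unsqueeze(1), h)
        return self.value_head(out.squeeze(1)).squeeze(-1), h_new
```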

With the policy network and the value network, we can train the RL model through policy gradient methods.

3.7 Training Procedure

Now we introduce the training procedure of the policy network π_θ and the value network V_φ. The training set consists of randomly generated scatterers of a specific type. A scatterer is randomly chosen from the training set and used to generate an interaction trajectory of the RL model. In the initialization of an episode, the collected measurements, the angle-frequency record, and the receiver and source sets are all empty, which gives the initial state s_0. At step t, given the past information stored in the hidden state and the current input, the policy network produces an angle policy and a frequency policy, while the value network approximates the value of the state under the current policy. During training, angles and frequencies are randomly sampled from these policies; in this way, a more comprehensive exploration of the action space is encouraged for faster convergence. A new sensor is then placed at the sampled angle to launch an incident wave of the sampled frequency, so at step t the source set S_t contains the newly placed sensor and the receiver set R_t contains all t sensors placed so far. Based on the formula in (4), a new measurement d_t is obtained and the state is updated accordingly. Then the reconstruction at time t is computed from all the previously collected measurements according to (6), and the reward is computed as the increment in the PSNR of the reconstruction: r_t = PSNR(η̂_t) - PSNR(η̂_{t-1}). When the reward and state are ready, the value network approximates the state value in order to estimate the advantage function in (5); we denote the estimate of the advantage at step t as Â_t. These estimates are used in PPO. The episode ends after T steps, yielding one interaction trajectory. In practice, we generate several episodes in parallel and train the policy network and the value network on a mini-batch of episodes. Based on these simulations, we compute the surrogate objective of PPO and optimize our neural networks using auto-differentiation and Adam; the choice of hyperparameters can be found in Section 4.1. A major advantage of using PPO is that we can apply multiple optimization steps to a few trajectories without destructively large policy updates. A more detailed description can be found in Algorithm 1.

After training the RL model on a training set, we need to test the performance of the model on a new test set. The test procedure is a bit different from the training procedure.

3.8 Test Procedure

Given the fully trained policy network, we can test the RL model on a set of scatterers that are similar to those in the training set. During testing, given a scatterer randomly selected from the test set, the policy network generates an interaction trajectory. The initialization is the same as in training, with empty measurements, angle-frequency record, and sensor sets. At step t, the policy network produces a policy for deciding the angle and a policy for choosing the frequency based on the hidden state and the current input; these policies return probability distributions over angles and frequencies. Note that, in the training of the RL model, angles and frequencies are randomly sampled according to these probability distributions to encourage exploration of the action space. In testing, however, we choose the angle and the frequency with the highest probability, aiming for the highest reconstruction resolution. After deciding an action, a new sensor is placed and the state is updated. After T steps, we apply L-BFGS with more iterations (20 iterations) to reconstruct a scatterer via (6) to ensure convergence. This is the reconstruction of the scatterer given by the RL model.
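A sketch of this greedy test-time rollout, reusing the hypothetical PolicyNet and ScatteringEpisode classes from the earlier sketches, could look as follows; featurize and freq_values are assumed problem-specific inputs (a state-to-tensor featurization and the list of candidate frequencies).

```python
import torch

@torch.no_grad()
def greedy_rollout(policy, episode, featurize, freq_values, h0=None, n_angles=360):
    """Test-time rollout: pick the argmax angle and frequency at every step
    instead of sampling, never reusing an angle, and return the final
    reconstruction produced by the episode object."""
    h = h0
    mask = torch.zeros(1, n_angles, dtype=torch.bool)
    done = False
    while not done:
        x = featurize(episode.state())                    # (1, in_dim) tensor
        _, _, angle_dist, freq_dist, h = policy(x, h, mask)
        angle = int(torch.argmax(angle_dist.logits))      # greedy action
        freq_idx = int(torch.argmax(freq_dist.logits))
        mask[0, angle] = True                             # angle cannot repeat
        _, _, done = episode.step(angle, freq_values[freq_idx])
    return episode.recon
```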

1:  Require: a training sample size, randomly generated training scatterer samples, a grid size, the total number of sensors T, a policy network, a value network, and a clipping constant ε
2:  For each epoch:
3:    Initialization: , , , , , , , randomly choose a scatterer from
4:    For :
5:      Given and , use policy networks to generate policies and , and then update
6:      Randomly sample an angle from the policy and a frequency from the policy
7:      Let and
8:      Update the receiver set and the source set
9:      Compute the state using , , and
10:      Given and , use the value network to approximate the value , then update
11:      Reconstruct a scatterer
12:      Compute the reward
13:      Approximately compute the advantage function
14:    End for
15:    Given and , let and be the probability distributions of the angle and the frequency learned by the policy network
16:    Compute
17:    Evaluate
18:    Given and , let be the value approximated by the value network
19:    Evaluate
20:    Use auto-differentiation and Adam to optimize and in the objective functions and , respectively
21:  End for
22:  Return: The trained policies and
Algorithm 1 The Training Procedure of Our Reinforcement Learning Algorithm
Figure 2: Structure of the Recurrent Neural Network. Each GRU block represents one layer and receives the input of that layer. Each layer outputs a hidden state, which is also the input of the next layer.
Figure 3: Structure of the policy network. The hidden state of the policy network is the output of the GRU at each layer. It is used as the input of another perceptron to generate a 360-dimensional categorical distribution over angles through a softmax function, with a mask removing angles that have already been chosen; this distribution is the angle policy. Then we randomly generate an angle from this distribution and combine its one-hot encoding with the hidden state as the input of another MLP, which gives rise to a categorical distribution over frequencies; this distribution represents the frequency policy given the chosen angle. Finally, we use it to randomly generate a frequency.
Figure 4: Structure of the value network. The hidden state of the value network is the output of the GRU at each layer. It is used as the input of another perceptron to generate an estimate of the value of the current state under the policy parameterized by the policy network.

4 Numerical Experiment

4.1 Setting

As introduced in Section 2, the scatterer field is discretized on a Cartesian grid. Two grid sizes are used in our numerical results, and the numerical conclusions remain similar for other grid sizes. In the RL training and testing, scatterers are randomly generated with specific shapes. In each experiment, we generate a set of scatterers, part of which is used for training while the rest is used for testing. Since our sensors are located on the unit circle, the positions of sensors are specified with integer angles in [0°, 360°), while the possible frequencies are drawn from a small candidate set that depends on the grid size. Our algorithm allows more choices of frequencies to achieve possibly better resolution, but we limit the choices here for computational efficiency. A fixed total number of probes, corresponding to T in Algorithm 1, is applied for sensing, and this number is larger for the larger grid. The choice of T is a user-defined hyperparameter determined by the required reconstruction resolution; a larger T leads to a better reconstruction. We assume the measurements in our experiments approximately follow the model in (4), and we take the second-order and third-order models in our numerical tests. As we shall see, our method works well with both nonlinear models and is significantly better than standard sensing methods. A linear model has also been tested, and our method also outperforms existing sensing methods in that case; since the linear case is less interesting in practice, only the results of the second- and third-order models are presented here. We use L-BFGS to optimize the objective function in the reconstruction, and the model requires a new reconstruction at each step. Only a few iterations of L-BFGS are performed at each step; the result of this optimization is taken as the reconstruction of the current step and used as the initialization for the reconstruction at the next step. The penalization constant λ introduced in Section 3.4 is kept fixed.

In the policy network, a multi-layer GRU with a few recurrent layers is used, with the same number of neurons in each layer. The angle MLP and the frequency MLP each consist of a small number of hidden layers, and in the value network the value MLP is composed of a single hidden layer. Besides, the MLPs in the policy network and the value network that extract features from the input also contain hidden layers. We use Adam (kingma2014adam) to optimize the policy network and value network parameters with a fixed learning rate; the coefficients used for computing running averages of the gradient and its square are also fixed. We train the policy network and the value network with PPO in each experiment for several hundred steps. In each step, we generate episodes using scatterers randomly chosen from the training set, and then perform an Adam update on a mini-batch of episodes randomly selected from these scatterers; this update is repeated several times within a single training step. The PPO algorithm allows us to repeat the optimization on a few repeated samples without destructively large policy updates, thus improving data utilization and algorithm efficiency.

We test different sampling strategies for angles and frequencies. The first one is our reinforcement learning method that learns both angles and frequencies; this method is denoted as "Learn Both". The second one uses random angles with a fixed frequency and, hence, is denoted as "Random Angle". The third one uses uniformly sampled angles with a fixed frequency and, hence, is denoted as "Uniform Angle". The fourth one uses angles learned by the reinforcement learning method while the frequency is fixed and not learned; this method is denoted as "Learn Angle". The fifth method uses a learned frequency from reinforcement learning with random angles and is therefore denoted as "Learn Frequency". Comparative experiments are conducted to see the impact of learning angles and frequencies.

For methods that do not learn how to select frequencies, a fixed frequency is used throughout an entire episode; the frequency that reaches the lowest error in the single-frequency case is chosen. During testing, we run L-BFGS for more iterations in the final reconstruction, after all the probes have been placed, to ensure convergence. To quantify reconstruction accuracy, we use the Mean Squared Error (MSE):

MSE(η̂, η*) = (1/N) Σ_{i=1}^{N} (η̂_i - η*_i)^2,

and the peak signal-to-noise ratio (PSNR):

PSNR(η̂) = 10 log_10( MAX^2 / MSE(η̂, η*) ),

where MAX is the maximum possible pixel value of an image. A method is satisfactory if it produces a small MSE or a large PSNR.
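For reference, a small sketch of these two metrics (the maximum pixel value defaults to 1 here and should be set to the actual image range):

```python
import numpy as np

def mse(recon, truth):
    return np.mean((recon - truth) ** 2)

def psnr(recon, truth, max_val=1.0):
    # max_val is the maximum possible pixel value of the image
    return 10.0 * np.log10(max_val**2 / mse(recon, truth))
```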

4.2 Numerical Results

In conclusion, the RL model that learns both angles and frequencies improves significantly over the other methods under limited sensing resources. We now present and explain the numerical results on several datasets in detail.

Experiment 1. The scatterers are discretized as described in Section 4.1, and the measurements come from the second-order model following (4). The scatterers are generated by randomly placing three triangles and three ovals of different sizes in a unit square. The RL model also works for the third-order model, based on the results of Experiment 3, but we only use the second-order model here to save computational cost. In the second-order model, the norm of the second-order measurement (the second-order term in (4)) is around one tenth of the norm of the first-order one. We test this setting because the first-order term should dominate the expansion, while the second-order data should not be too small (about one order of magnitude smaller), so that we can verify the power of the RL model in the nonlinear setting. We also test cases where the norm of the second-order data is a larger or smaller fraction of that of the first-order data; the model performs similarly, so we do not present those results here.

The policy network and the value network are trained for a number of steps, with several Adam iterations performed in each step. After being fully trained, the model selects different frequencies from the candidate range, while the other methods that do not learn frequencies are assigned a fixed frequency.

We computed the MSE and PSNR of the methods over a set of test samples randomly selected from the test set. The reconstructions of our model have the smallest error and the largest PSNR among all the methods on all the test samples. The mean and standard deviation of the MSE and PSNR are shown in Table 1, and the reconstruction results are visualized in Figure 5. It is clear that learning both angles and frequencies results in a significantly smaller MSE and a larger PSNR than random or uniform angles. Meanwhile, learning both angles and frequencies is also better than learning angles only or learning frequencies only. These results demonstrate the effectiveness and necessity of training both angles and frequencies. Meanwhile, the significantly lower variance of the MSE demonstrates a better stability of our algorithm compared with the others.

Learn Both Random Angle Uniform Angle Learn Angle Learn Frequency
MSE mean (std) 7.6e-5 (1.2e-10) 0.0008 (1.3e-7) 0.0012 (2.8e-7) 0.00015 (6e-9) 0.0018 (6.2e-7)
PSNR mean (std) 113.4 (0.35) 103.6 (5.2) 101.6 (3.9) 110.7 (3.8) 100 (6.1)
Table 1: The MSE and PSNR statistics of five different methods in Experiment 1.
(a) True image
(b) Reconstruction of learning both angles and frequencies (MSE=8e-5)
(c) Reconstruction of random angle (MSE=0.0012)
(d) Reconstruction of uniform angle (MSE=0.0017)
(e) Reconstruction of learning angles only (MSE=3e-4)
(f) Reconstruction of learning frequencies only (MSE=2e-3)
Figure 5: Experiment 1. We compare the reconstruction results of the different methods on a specific type of scatterer. The true scatterer is shown in subplot (a). We tag the methods under each plot, with the MSE of the reconstruction showing the difference in resolution.

Experiment 2. The scatterers used here are handwritten digits from the MNIST dataset (mnist), chosen to test a different kind of scatterer. The domain is discretized as described in Section 4.1, and the measurements are generated from the second-order model. After training, the RL model chooses the same frequency for all sensors. This choice means that the RL model decides that a single frequency is more suitable than multiple frequencies for this type of scatterer. Nevertheless, learning frequencies is better than selecting a fixed frequency empirically and manually, because one does not know in advance which frequency will lead to better reconstructions, especially when the number of possible frequencies increases. Since the RL model selects the same frequency for all incident waves, there is no difference between learning both and learning angles only; the same holds for sampling angles uniformly and learning frequencies only. So we only compare the performance of learning both angles and frequencies, random angles, and uniform angles in Table 2 and Figure 6. From the table and figure, it is clear that learning both is significantly better than random angles or uniform angles, which demonstrates the necessity of training angles.

Learn Both Random Angle Uniform Angle
MSE mean (std) 0.00012 (8.5e-9) 0.0046 (3e-7) 0.0039 (3.7e-7)
PSNR mean (std) 111.6 (3.3) 95.6 (0.29) 96.29 (0.43)
Table 2: The MSE and PSNR statistics of three different methods in Experiment 2.

Experiment 3. The scatterers are discretized as described in Section 4.1, and the measurements are generated from the third-order model in (4). We train and test the RL model separately on two types of scatterers. The first type is the same as in Experiment 1, where scatterers are generated by randomly placing three triangles and three ovals of different sizes in the unit square. The second type is generated by randomly placing three circles with different random intensities. The RL model is trained with PPO as before. The final model selects two different frequencies in the first case and a single frequency in the second. Because the second case is reduced to a single-frequency one by the RL model, as in Experiment 2, we do not show the results of learning angles only and learning frequencies only for it. The errors are recorded in Table 3 and Table 4, and the reconstruction results are visualized in Figure 7. From these results, we can see that the error of learning both angles and frequencies is much smaller than that of learning angles only, which illustrates the power of training frequencies. The other results are similar to those of the former two experiments. As shown in Figure 7, the RL model is also able to handle different intensities of various objects.

Learn Both Random Angle Uniform Angle Learn Angle Learn Frequency
MSE mean (std) 0.0006 (1.2e-8) 0.004 (7e-7) 0.004 (8e-7) 0.0017 (5e-7) 0.011 (7e-6)
PSNR mean (std) 104.3 (0.62) 96.3 (1.4) 96.2 (1.6) 100.7 (4.4) 92.2 (1.2)
Table 3: The MSE and PSNR statistics of five different methods on the first type of scatterer in Experiment 3. The first type of scatterer consists of randomly placed triangles and randomly placed ovals with different sizes.
Learn Both Random Angle Uniform Angle
MSE mean (std) 0.0007 (1.4e-8) 0.0045 (9e-7) 0.0053 (4e-7)
PSNR mean (std) 103.6 (0.9) 95.9 (2) 95 (0.5)
Table 4: The MSE and PSNR statistics of three different methods on the second type of scatterer in Experiment 3. The second type of scatterer consists of randomly placed circles with random intensities.
(a) True image
(b) Reconstruction of learning both angles and frequencies (MSE=2e-4)
(c) Reconstruction of random angle (MSE=5e-3)
(d) Reconstruction of uniform angle (MSE=3e-3)
Figure 6: Experiment 2. We compare the reconstruction results of the different methods on a specific type of scatterer. The true scatterer is shown in subplot (a). We tag the methods under each plot, with the MSE of the reconstruction showing the difference in resolution.
(a) True image
(b) Reconstruction of learning both angles and frequencies (MSE=5e-4)
(c) Reconstruction of random angle (MSE=4e-3)
(d) Reconstruction of uniform angle (MSE=5e-3)
(e) Reconstruction of learning angles only (MSE=2e-3)
(f) Reconstruction of learning frequencies only (MSE=0.01)
(g) True image
(h) Reconstruction of learning both angles and frequencies (MSE=6e-4)
(i) Reconstruction of random angle (MSE=4e-3)
(j) Reconstruction of uniform angle (MSE=4e-3)
Figure 7: Experiment 3. We compare the reconstruction results of different methods on two different types of scatterers. The true scatterer of the first type is shown in subplot (a); this type consists of randomly placed triangles and ovals with different sizes, and its reconstruction results are shown in subplots (b) to (f). The true scatterer of the second type is shown in subplot (g); this type consists of randomly placed circles with random intensities, and its reconstruction results are shown in subplots (h) to (j). We tag the methods under each plot, with the MSE of the reconstruction showing the difference in resolution.

The numerical results show that learning both angles and frequencies achieves the highest resolution, with a significant improvement over the other methods. Meanwhile, the variance of the error indicates that the stability of the RL algorithm is also much better than that of the other methods. Learning angles only is the second-best method, but it only generates a blurry reconstruction; the difference between the best two methods demonstrates the effectiveness of training frequencies. Sampling angles randomly performs close to sampling them uniformly, and both are much worse than the previous two methods, which implies that learning angles is necessary for data collection. The performance of learning frequencies only is the worst, which means that the learned frequencies must work together with the learned angles. Overall, the numerical results demonstrate the necessity of training sensor angles and incident wave frequencies, and the proposed RL model outperforms all the other methods.

5 Discussion

In this paper, reinforcement learning is applied to learn a policy that selects scatterer-dependent sensing angles and frequencies in inverse scattering. The process of sensor installation, information collection, and scatterer reconstruction is reformulated as a Markov decision process, and hence reinforcement learning can help to optimize this process. A recurrent neural network is adopted as the policy network to choose sensor locations and wave frequencies adaptively. The proposed reinforcement learning method learns to make scatterer-dependent decisions from previous imaging results, each of which requires the solution of an expensive optimization problem; to facilitate convergence and reduce the computational cost of reinforcement learning, a warm-start strategy is used in these optimization problems. Extensive numerical experiments have been conducted using several types of scatterers with the second- and third-order nonlinear inverse scattering models. These results demonstrate that the proposed method significantly outperforms existing algorithms in terms of reconstruction quality. This paper serves as a first step towards intelligent computing for precision imaging in inverse scattering. The case of weak scattering is adopted as a proof of concept; in the future, more advanced learning techniques will be developed to deal with more challenging cases.

Acknowledgments

Y. K. was partially supported by NSF grant DMS-2111563. H. Y. was partially supported by the NSF CAREER Award DMS-1945029 and the NVIDIA GPU grant. We thank the authors in the seminal work (ct) for sharing their code.

References