1 Introduction
Optical coherence tomography (OCT) provides clinicians with real-time, high-resolution images of ocular structures, which are of great use in diagnosing and monitoring retinal diseases, evaluating progression, and assessing response to therapy [fujimoto2016development]. OCT has gone through three generations since it was invented in 1991 [huang1991optical], and over the past few decades a range of commercial OCT instruments has been developed. Each device is characterized by several parameters, such as lateral and axial resolution, penetration depth, and imaging speed. Fast acquisition of OCT volumes, i.e. a high A-scan rate, is important for reducing retinal motion artifacts, yet it is limited by the camera readout rate of the OCT scanner [sanchez2019review]. Another type of motion artifact is axial eye motion, whose frequency varies from 3 to 12 Hz and whose exact mechanism is not clearly understood. In addition, there are involuntary fixational eye movements, which are usually categorized as high-frequency tremors, rapid microsaccades, or slow drifts, depending on their frequency and magnitude. These motion artifacts during OCT volume scanning result in deformed 3D data of the retina [spaide2015image]. Correcting the distorted data is essential for improving OCT image quality and, in turn, diagnosis.
Retinal motion varies in amplitude, direction, and frequency, making the combined motion difficult to predict [sanchez2019review]. Moreover, retinal motion may differ significantly between individuals, hindering the development of generalized theoretical or learning-based models for retinal motion prediction. This issue can be tackled by using advanced deep learning (DL) models with a large OCT dataset that covers different types of retinal motion. Such models need ground truth for inter-frame misalignment, which is expensive and difficult to annotate manually. Further, unsupervised learning models trained on OCT retinal cubes are strongly biased by speckle noise, which misleads the final alignment. In this work, we propose an unsupervised inter-frame movement correction approach that remains effective in the presence of speckle noise. The approach is based on deep reinforcement learning, in which an artificial agent is trained using a deep Q-network to find a strategy of sequential actions that best improves the alignment between 2D slices in OCT data.
Related Work.
Eye movement correction methods in the literature can be divided into feature-based and intensity-based approaches. Feature-based registration methods use image landmarks, such as vasculature, vessel intersections, and retinal layers, to correct OCT data misalignment, while intensity-based approaches rely on similarity measures between images, such as correlation and mutual information [sanchez2019review, baghaie2015state, baghaie2017involuntary].
Further, the literature also presents the use of deep reinforcement learning for medical image registration [liao2017artificial, krebs2017robust], where the agent uses ground truth transformation parameters for training.
We summarize our contributions as follows:

For the first time, we propose a dueling deep Q-network for OCT inter-frame image alignment (DDQNOCT) that does not require landmarks or ground truth transformation parameters.

We use a combination of intensity-based image similarity metrics to guide the reward system for training the artificial agent in an unsupervised fashion.

The proposed approach does not require the removal of speckle noise, which is a common preprocessing step for 2D and 3D OCT registration methods.

Our approach significantly outperforms elastix intensity-based medical image registration.
2 Methodology
2.1 Reinforcement Learning Framework
We formulate inter-frame movement correction in OCT volumes as a 2D rigid registration problem that matches two adjacent B-scans in the fast scanning plane (a reference B-scan $I_r$ and a floating B-scan $I_f$). This is accomplished by finding the optimal spatial transformation $T$ that aligns $I_f$ with $I_r$. The 2D rigid transformation has 3 parameters: two translations ($t_x$, $t_y$) and one rotation ($\theta$). We use deep reinforcement learning (DRL) to solve this optimization problem in an unsupervised manner. Figure 1 shows the framework of the proposed approach. Specifically, an artificial agent learns by interacting with an environment ($E$) to maximize the cumulative reward signal ($R$) throughout the agent's lifetime. At every iteration $t$, given a state $s_t \in \mathcal{S}$ that represents the difference image of the two adjacent B-scans, the agent selects an action $a_t \in \mathcal{A}$ that is associated with a scalar reward signal $r_t$. $\mathcal{S}$ is the set of states that the agent can see and $\mathcal{A}$ is the set of discrete actions that the agent can take. $\mathcal{A}$ consists of 6 candidate transformations, each of which changes one parameter of $T$ by a fixed step in the positive or negative direction. During training, the agent learns the optimal policy (i.e. a strategy of sequential actions) that maps the current state $s_t$ to the action $a_t$ that best improves the alignment, by maximizing the sum of reward signals seen over the agent's lifetime. The optimal action-selection policy is identified by learning an action-value function (i.e. the $Q$ function) that measures the quality of taking an action $a_t$ given a state $s_t$, as defined by Watkins et al. [Watkins1992]. The $Q$ function can be solved using the Bellman iterative approach [bellman2013dynamic], as in Equation 1.
$$Q(s_t, a_t) = \mathbb{E}\left[ r_t + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) \right] \tag{1}$$
where $\gamma$ is the discount factor for future rewards, and $s_{t+1}$ and $a_{t+1}$ are the next state and action.
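As a minimal sketch of the formulation above, the six discrete actions can be written as signed steps on one of the three rigid parameters, and the Bellman target as the reward plus the discounted best next action value. The step size `STEP`, the default `gamma`, and all function names are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

# Hypothetical step size: the text only states that each action
# changes one parameter of the rigid transformation.
STEP = 1.0

# Six discrete actions as (parameter index, direction) over (tx, ty, theta).
ACTIONS = [(0, +1), (0, -1), (1, +1), (1, -1), (2, +1), (2, -1)]


def apply_action(params, action_index, step=STEP):
    """Return new (tx, ty, theta) after changing exactly one parameter."""
    index, direction = ACTIONS[action_index]
    updated = np.array(params, dtype=float)
    updated[index] += direction * step
    return updated


def rigid_matrix(tx, ty, theta_deg):
    """3x3 homogeneous matrix for a 2D rotation followed by a translation."""
    t = np.deg2rad(theta_deg)
    return np.array([
        [np.cos(t), -np.sin(t), tx],
        [np.sin(t), np.cos(t), ty],
        [0.0, 0.0, 1.0],
    ])


def bellman_target(reward, next_q_values, gamma=0.9):
    """Reward plus the discounted value of the best next action."""
    return reward + gamma * np.max(next_q_values)
```

For example, `apply_action([0, 0, 0], 5)` decreases the rotation by one step and leaves both translations unchanged.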
2.2 Dueling Deep QNetwork for Optimal Policy Estimation
In this paper, we follow Mnih et al. [mnih2015human], who proposed a deep Q-network (DQN) that approximates the action-value function using a deep neural network (DNN), as in Equation 2.
(2)
where the bonus term is the reward the agent receives when it finds the best alignment. This is reached when the distance between the reference B-scan and the aligned B-scan is within the distance threshold. The immediate reward for a state-action pair is calculated using Equation 3.
$$r_t = D(T_t) - D(T_{t+1}) \tag{3}$$
where $T_t$ and $T_{t+1}$ refer to the transformation parameters before and after the selected action $a_t$ is applied, and $D$ is the dissimilarity metric.
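The reward logic described above can be sketched as the drop in dissimilarity produced by an action, with a bonus once the aligned B-scan is within the distance threshold. This is an illustration under stated assumptions: `mean_abs_difference` is a stand-in dissimilarity rather than the metric used in the paper, the `threshold` and `bonus` values are hypothetical, and adding the bonus on top of the immediate reward is one possible reading of the text.

```python
import numpy as np


def mean_abs_difference(reference, moving):
    """Stand-in dissimilarity (an assumption, not the paper's metric)."""
    return float(np.mean(np.abs(reference - moving)))


def step_reward(reference, before, after,
                dissimilarity=mean_abs_difference,
                threshold=0.01, bonus=10.0):
    """Reward = decrease in dissimilarity, plus a bonus when aligned.

    `threshold` and `bonus` are hypothetical values for illustration.
    Returns (reward, done), where done marks the end of the episode.
    """
    d_before = dissimilarity(reference, before)
    d_after = dissimilarity(reference, after)
    reward = d_before - d_after
    done = d_after <= threshold
    if done:
        reward += bonus
    return reward, done
```

An action that moves the floating B-scan closer to the reference yields a positive reward, and one that moves it away yields a negative reward.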
Finding a good measure of dissimilarity is crucial for the success of the agent's learning process. Existing work in this area defines the dissimilarity from the ground truth transformation parameters [liao2017artificial, krebs2017robust]. In our work, we propose the use of intensity-based image similarity metrics to train the agent in an unsupervised manner, which has not been proposed in the literature before. The proposed dissimilarity metric is based on two statistical measures, namely the correlation coefficient (CC) and the structural similarity index measure (SSIM), as in Equation 4. CC measures the degree to which a change in one image is associated with a change in the other, while SSIM considers perceptual likeness and structural information.
$$\mathrm{CC}(x, y) = \frac{\sigma_{xy}}{\sigma_x \sigma_y}, \qquad \mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)} \tag{4}$$
where $\mu_x$ and $\sigma_x^2$ are the average and variance of the reference image $x$, and $\mu_y$ and $\sigma_y^2$ are the average and variance of the transformed image $y$. Also, $\sigma_{xy}$ refers to the covariance between $x$ and $y$, and $c_1$ and $c_2$ are stabilization variables.

We also adopt the split of the action-state value function proposed by Wang et al. [wang2015dueling], called dueling DQN, in which the $Q$ function is decomposed into action-independent ($V$) and action-dependent ($A$) value functions. This has been shown to provide robust state value estimates. The architecture of our DNN is shown in Figure 1. It consists of 4 convolutional layers, each followed by batch normalization and ReLU activation, except the first convolutional layer, which does not have batch normalization. The convolutional layers have 32, 32, 64, and 64 filters with kernel sizes of 5, 5, 4, and 3, in order, and a stride of 2 for all layers. This is followed by a 2D max-pooling layer of size 2 and a fully-connected layer with 512 nodes. The output of the fully-connected layer is then passed to two branches for the action-dependent and action-independent value functions, with 6 nodes and 1 node, in order. Finally, a fully-connected layer connects the sum of the two branches to the output layer with 6 nodes, each corresponding to one of the candidate actions. The input to the network is computed by taking the difference between the reference B-scan and the transformed one for each of the previous steps. This yields more stable search trajectories and prevents the agent from oscillating. The loss function is calculated using the mean squared error, as shown in Equation 5.
(5)
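The pieces described in this subsection can be sketched in NumPy as follows. The correlation coefficient and a single-window SSIM follow their standard definitions; how the two are combined into one dissimilarity (`dissimilarity` below averages `1 - CC` and `1 - SSIM`) is an assumption, as are the constants `c1` and `c2`, the default `gamma`, and every function name. The loss is the usual mean squared temporal-difference error, shown only as one plausible reading of the text.

```python
import numpy as np


def correlation_coefficient(x, y):
    """Pearson correlation between two images of the same shape."""
    xc, yc = x - x.mean(), y - y.mean()
    denom = np.sqrt((xc ** 2).sum() * (yc ** 2).sum())
    return float((xc * yc).sum() / denom) if denom > 0 else 0.0


def global_ssim(x, y, c1=1e-4, c2=9e-4):
    """SSIM over the whole image as one window (images scaled to [0, 1])."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    num = (2 * mx * my + c1) * (2 * cov + c2)
    den = (mx ** 2 + my ** 2 + c1) * (vx + vy + c2)
    return float(num / den)


def dissimilarity(x, y):
    """Assumed combination: average of (1 - CC) and (1 - SSIM)."""
    cc = correlation_coefficient(x, y)
    return 0.5 * ((1.0 - cc) + (1.0 - global_ssim(x, y)))


def dueling_q(value, advantage):
    """Sum the state value (batch, 1) and action values (batch, n_actions)."""
    return value + advantage


def mse_td_loss(q_pred, actions, rewards, q_next, done, gamma=0.9):
    """Mean squared error between Q(s, a) and the Bellman target."""
    target = rewards + gamma * (1.0 - done) * q_next.max(axis=1)
    chosen = q_pred[np.arange(len(actions)), actions]
    return float(np.mean((target - chosen) ** 2))
```

Wang et al. additionally subtract the mean of the action-dependent branch before summing; the plain sum above follows the description in the text.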
3 Data and Implementation Details
Dataset. The dataset contains 10,370 OCT macular scans from both eyes of 1678 individuals, acquired on a Cirrus SD-OCT scanner (Zeiss; Dublin, CA, USA) over multiple visits. It includes 427 healthy scans from 109 individuals; the remaining scans cover different ocular conditions, including glaucoma, optic neuropathy, plateau iris, and others.
The scans have 200 × 200 × 1024 (A-scans × B-scans × depth) voxels per cube, covering a volume of 6 × 6 × 2 mm³. This is an observational study that was conducted in accordance with the tenets of the Declaration of Helsinki and the Health Insurance Portability and Accountability Act.
Training and Validation. The OCT volumes are divided into training (7290 scans), validation (1608 scans), and testing (1472 scans) subsets, ensuring that eyes belonging to the same patient are not split across subsets. Twenty B-scans are randomly selected from each volume and normalized to have pixel values from 0 to 1. A random window around the retinal layers is selected, with spacings of 4 and 2 along the two image axes, in order. Then, a random rigid transformation is applied to each cropped window separately. The simulated transformation parameters range from -5 to 5 for the two translations and the rotation, a range chosen to cover the expected eye movements. The cropped B-scan and its corresponding transformed B-scan represent the environment that the agent interacts with throughout its lifetime (i.e. one episode). During training, we use an experience replay memory to store transitions $(s_t, a_t, r_t, s_{t+1})$, from which a batch of size 256 is randomly sampled. Each input sample has 4 channels, one for each of the previous action steps taken by the agent. The network is trained using the Adam optimizer over a fixed number of epochs, each consisting of a fixed number of steps, with a cap on the number of steps per episode.
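The data simulation and replay steps above can be sketched as follows. This is an illustration only: the memory capacity is not stated here, so `capacity` is a free parameter, and the function and class names are hypothetical.

```python
import random
from collections import deque

import numpy as np


def normalize(bscan):
    """Rescale pixel values linearly to the range [0, 1]."""
    bscan = bscan.astype(float)
    span = bscan.max() - bscan.min()
    if span == 0:
        return np.zeros_like(bscan)
    return (bscan - bscan.min()) / span


def sample_rigid_params(rng, limit=5.0):
    """Draw (tx, ty, theta) uniformly from [-limit, limit]."""
    return rng.uniform(-limit, limit, size=3)


class ReplayMemory:
    """Fixed-capacity buffer of (state, action, reward, next_state) tuples."""

    def __init__(self, capacity):
        # Oldest transitions are dropped automatically once full.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=256):
        """Return a uniformly random batch without replacement."""
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)
```

Sampling uniformly from the memory breaks the correlation between consecutive steps of one episode, which is the usual motivation for experience replay in DQN training.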
4 Experimental Results and Discussion
The proposed DDQNOCT model is implemented using Python and TensorFlow on a single V100 GPU. The exploration rate of the artificial agent starts at 1 and decreases linearly to 0.1 at epoch 20, followed by a second linear decrease until epoch 100, as shown in Figure 2(a). Figures 2(b) and 2(c) plot the training and validation loss curves and the image distances, in order, which show that the model converges without overfitting.
Figure 2: (a) Exploration rate. (b) Training and validation loss. (c) Image distances.
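The two-phase exploration schedule can be sketched as a piecewise-linear function of the epoch, together with the epsilon-greedy action choice it controls. The final value `eps_end` is not given in the text, so the default below is an assumption, as are the function names.

```python
import random


def exploration_rate(epoch, eps_start=1.0, eps_mid=0.1, eps_end=0.01,
                     mid_epoch=20, end_epoch=100):
    """Piecewise-linear epsilon: eps_start -> eps_mid -> eps_end.

    `eps_end` is a hypothetical value chosen for illustration.
    """
    if epoch <= mid_epoch:
        frac = epoch / mid_epoch
        return eps_start + frac * (eps_mid - eps_start)
    if epoch <= end_epoch:
        frac = (epoch - mid_epoch) / (end_epoch - mid_epoch)
        return eps_mid + frac * (eps_end - eps_mid)
    return eps_end


def select_action(q_values, epsilon):
    """Epsilon-greedy: random action with probability epsilon, else greedy."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])
```

With these defaults the agent acts almost entirely at random in the first epochs and mostly greedily after epoch 20.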
To visualize how our method works, we test the DDQNOCT model, trained on noisy scans, on both noisy and denoised B-scans. We denoise the B-scans using the Generative Adversarial Network (GAN) model proposed in [halupka2018retinal]. Instead of cropping a random window, we resize the whole B-scan to match the network input shape (see Figure 3). The figure shows the agent's results on a randomly selected noisy B-scan (left column) and denoised B-scan (right column). The figure has 4 rows, each displaying the reference B-scan, the aligned B-scan, and the agent screen at steps 3, 5, 9, and 11. The agent reaches the best alignment after 11 steps. The ground truth transformation parameters and the agent's current transformation are displayed in yellow.
For quantitative evaluation, we compute the normalized mutual information (NMI), cross-correlation coefficient (CC), agent score (i.e. cumulative reward), and execution time for each sample in our test set, which contains 29,440 B-scans. We then compute the average, standard deviation, and lower, middle, and upper quartiles of each measure across the test set, as reported in Table 1. The proposed model achieves an average of 0.985 and 0.914 for NMI and CC, respectively. Furthermore, to quantify the impact of speckle noise on the agent's training, we retrain the model using denoised B-scans. The results show an increase of 4% in CC and a decrease of 2% in NMI (Table 1).
Figure 3: (a) Noisy B-scan. (b) Denoised B-scan.
Table 1: Performance of the proposed DDQNOCT model on noisy and denoised B-scans.
Statistical measure | NMI | CC | Episode score | Time (sec)
Noisy B-scans
Average ± Std | 0.985 ± 0.064 | 0.914 ± 0.128 | 3.343 ± 3.036 | 0.445 ± 0.498
Lower quartile (25%) | 0.990 | 0.826 | 0.218 | 0.040
Median (50%) | 1.00 | 1.00 | 5.455 | 0.061
Upper quartile (75%) | 1.00 | 1.00 | 5.562 | 1.027
Denoised B-scans
Average ± Std | 0.969 ± 0.094 | 0.957 ± 0.076 | 3.621 ± 2.623 | 0.607 ± 0.787
Lower quartile (25%) | 0.986 | 0.945 | 0.280 | 0.058
Median (50%) | 0.989 | 0.984 | 5.254 | 0.105
Upper quartile (75%) | 1.00 | 1.00 | 5.347 | 1.195
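The per-sample measures and the summary statistics reported above can be computed along the following lines. NMI has several normalizations in the literature; the sketch below uses 2·I(X;Y)/(H(X)+H(Y)) with a joint intensity histogram, which is an assumption and may differ from the definition used for the reported numbers. The bin count and function names are also illustrative.

```python
import numpy as np


def normalized_mutual_information(x, y, bins=32):
    """NMI = 2 * I(X; Y) / (H(X) + H(Y)) from a joint histogram."""
    joint, _, _ = np.histogram2d(x.ravel(), y.ravel(), bins=bins)
    pxy = joint / joint.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)

    def entropy(p):
        p = p[p > 0]  # skip empty bins so log is defined
        return float(-(p * np.log(p)).sum())

    hx, hy, hxy = entropy(px), entropy(py), entropy(pxy.ravel())
    denom = hx + hy
    return 2.0 * (hx + hy - hxy) / denom if denom > 0 else 1.0


def summarize(values):
    """Mean, standard deviation, and quartiles of a 1D sequence."""
    values = np.asarray(values, dtype=float)
    q25, q50, q75 = np.percentile(values, [25, 50, 75])
    return {"mean": float(values.mean()), "std": float(values.std()),
            "q25": float(q25), "median": float(q50), "q75": float(q75)}
```

Under this normalization, two identical images give an NMI of 1 and two unrelated images give a value close to 0.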
As a comparative study, we evaluate the performance of the elastix intensity-based medical image registration approach described in [klein2009elastix]. We use the same test set and again evaluate on noisy and denoised B-scans separately. Results are reported in Table 2. Our DDQNOCT model improves on the elastix registration approach by more than 50% and 10% for NMI and CC, respectively. The table also shows that elastix works better on denoised scans than on noisy scans, with an improvement of 10% and 4% for NMI and CC, respectively. We also record the execution time for all our experiments, as shown in Tables 1 and 2; our model takes much less time than elastix, with an average of about 0.5 seconds per image.
Table 2: Performance of the elastix intensity-based registration approach on noisy and denoised B-scans.
Statistical measure | NMI | CC | Time (sec)
Noisy B-scans
Average ± Std | 0.344 ± 0.140 | 0.814 ± 0.083 | 38.322 ± 14.978
Lower quartile (25%) | 0.281 | 0.762 | 35.545
Median (50%) | 0.306 | 0.826 | 36.581
Upper quartile (75%) | 0.331 | 0.875 | 37.867
Denoised B-scans
Average ± Std | 0.448 ± 0.103 | 0.847 ± 0.072 | 6.840 ± 0.392
Lower quartile (25%) | 0.386 | 0.802 | 6.580
Median (50%) | 0.421 | 0.854 | 6.851
Upper quartile (75%) | 0.465 | 0.900 | 7.063
We also compare the proposed dueling DQN with other DQN architectures that have recently been presented in the literature. For example, Van Hasselt et al. [van2016deep] proposed double DQN, which decouples action selection from the target network, reducing the observed overestimation and improving performance. Also, in [alansary2019evaluating], a combination of the double and dueling approaches was shown to outperform the original DQN. In this experiment, we train four DQN variants, namely DQN, Double DQN, Dueling DQN, and Double Dueling DQN. For evaluation, we apply a 2D rigid transformation to random crops from our test set, which contains 29,440 B-scans (within 10 pixels and 10 degrees for a crop size of 84 × 84). Performance measures are reported in Table 3, which shows only slight performance differences among the variants. Double DQN has the best performance, with NMI and CC of 0.98 and 0.97, in order. The original DQN achieves the lowest performance, which is consistent with the results presented in [wang2015dueling, van2016deep, alansary2019evaluating].
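The difference between the standard and double DQN targets can be sketched in a few lines. In standard DQN the target network both selects and evaluates the next action; in double DQN the online network selects the action and the target network evaluates it. The function names and the default `gamma` are assumptions for illustration.

```python
import numpy as np


def dqn_target(reward, q_next_target, gamma=0.9):
    """Standard DQN: the target network selects and evaluates the action."""
    return reward + gamma * np.max(q_next_target)


def double_dqn_target(reward, q_next_online, q_next_target, gamma=0.9):
    """Double DQN: online network selects, target network evaluates."""
    best_action = int(np.argmax(q_next_online))
    return reward + gamma * q_next_target[best_action]
```

Because the maximum over noisy estimates is biased upward, the standard target is never smaller than the double DQN target for the same inputs, which is the overestimation that double DQN reduces.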
Furthermore, we compare our unsupervised reward approach with the supervised one in [liao2017artificial] (i.e. using ground truth transformation parameters). Surprisingly, the unsupervised reward approach outperforms supervised training, with roughly 4% improvement.
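Table 3: Comparison of DQN variants and of unsupervised versus supervised training.
Model | NMI | CC | Episode score | Time (sec)
Unsupervised Training
DQN | 0.969 ± 0.083 | 0.958 ± 0.077 | 3.796 ± 2.687 | 0.335 ± 0.469
Dueling DQN | 0.973 ± 0.085 | 0.959 ± 0.080 | 9.475 ± 4.341 | 0.249 ± 0.265
Double DQN | 0.978 ± 0.070 | 0.965 ± 0.066 | 9.738 ± 4.487 | 0.198 ± 0.216
Double Dueling DQN | 0.974 ± 0.081 | 0.960 ± 0.075 | 9.594 ± 4.010 | 0.248 ± 0.264
Supervised Training
Dueling DQN | 0.934 ± 0.147 | 0.902 ± 0.121 | 5.979 ± 13.399 | 0.388 ± 0.218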
5 Conclusion
Registration is a critical step in the automated analysis of medical images for monitoring patients, and an accurate unsupervised registration method is of great value given the cost of curating the annotated images that supervised methods require. In this paper, we lay out a novel framework for unsupervised 2D rigid registration of medical images, in particular OCT volumes of the retina, which takes advantage of intensity-based techniques and results in state-of-the-art performance. We present an artificial agent that efficiently aligns two B-scans by finding the 2D transformation parameters. The agent is trained using a dueling deep Q-network in an unsupervised manner, where a combination of intensity-based image similarity measures guides the reward system. The proposed DDQNOCT model markedly outperforms the elastix intensity-based medical image registration approach. The proposed framework also shows strong potential for other applications.