1 Introduction
Optical coherence tomography (OCT) provides clinicians with real-time, high-resolution images of ocular structures that are of great value for diagnosing and monitoring retinal diseases, evaluating disease progression, and assessing response to therapy [fujimoto2016development]. Since its invention in 1991 [huang1991optical], OCT has evolved through three generations, and a variety of commercially available OCT instruments have been developed over the past few decades. Each device is characterized by several parameters such as lateral and axial resolution, penetration depth, and imaging speed. Fast acquisition of OCT volumes, i.e. a high A-scan rate, is important to reduce retinal motion, yet it is limited by the camera read-out rate of the OCT scanner [sanchez2019review]. Another type of motion artifact is axial eye motion, which occurs at frequencies between 3 and 12 Hz and whose exact mechanism is not yet fully understood. In addition, involuntary fixational eye movements are usually categorized as high-frequency tremors, rapid microsaccades, or slow drifts, depending on their frequency and magnitude. These types of motion during OCT volume scanning result in deformed 3D data of the retina [spaide2015image]. Correcting the distorted data is essential to improve OCT image quality and, consequently, diagnosis.
Retinal motion varies in amplitude, direction, and frequency, making its combined effect difficult to predict [sanchez2019review]. Moreover, retinal motion may differ significantly between individuals, hindering the development of generalized theoretical or learning models for retinal motion prediction. This issue can be tackled by using advanced deep learning (DL) models with a large OCT dataset that covers different types of retinal motion. However, this requires ground truth for inter-frame misalignment, which is expensive and difficult to annotate manually. Furthermore, unsupervised learning models trained on OCT retinal cubes are strongly biased by speckle noise, which can mislead the final alignment. In this work, we propose an unsupervised inter-frame movement correction approach that performs well even in the presence of speckle noise. The approach is based on deep reinforcement learning, in which an artificial agent is trained using a deep Q-network to find a strategy of sequential actions that best improves the alignment between 2D slices in OCT data.
Related Work.
Eye movement correction methods in the literature can be divided into feature-based and intensity-based approaches. Feature-based registration methods use image landmarks, such as vasculature, vessel intersections, and retinal layers, to correct OCT data misalignment, while intensity-based approaches rely on similarity measures between images, such as correlation and mutual information [sanchez2019review, baghaie2015state, baghaie2017involuntary].
Deep reinforcement learning has also been applied to medical image registration [liao2017artificial, krebs2017robust], where the agent is trained using ground-truth transformation parameters.
We summarize our contributions as follows:

- For the first time, we propose a dueling deep Q-network for OCT inter-frame image alignment (DDQN-OCT) that requires neither landmarks nor ground-truth transformation parameters.
- We use a combination of intensity-based image similarity metrics to guide the reward system for training the artificial agent in an unsupervised fashion.
- The proposed approach does not require the removal of speckle noise, which is a common preprocessing step in 2D and 3D OCT registration methods.
- Our approach significantly outperforms elastix intensity-based medical image registration.
2 Methodology
2.1 Reinforcement Learning Framework
We formulate the inter-frame movement correction in OCT volumes as a 2D rigid registration problem that matches two adjacent B-scans in the fast scanning plane (a reference B-scan $B_{ref}$ and a moving B-scan $B_{mov}$). This is accomplished by finding the optimal spatial transformation $T$ that aligns $B_{mov}$ with $B_{ref}$. The 2D rigid transformation has 3 parameters: two translations ($t_x$, $t_y$) and one rotation ($\theta$). For this purpose, we use deep reinforcement learning (DRL) to solve this optimization problem in an unsupervised manner. Figure 1 shows the framework of the proposed approach. Specifically, an artificial agent learns by interacting with an environment $E$ to maximize the cumulative reward signal $r$ throughout the agent's lifetime. At every iteration $t$, given a state $s_t$ that represents the difference image of the two adjacent B-scans, the agent selects an action $a_t$ that is associated with a scalar reward signal $r_t$. $S$ is the set of states that the agent can observe and $A$ is the set of discrete actions that the agent can take. $A$ consists of 6 candidate transformations, each increasing or decreasing one parameter of $T$ by a fixed step. During training, the agent learns the optimal policy (i.e. a strategy of sequential actions) that maps a current state to the action that best improves the alignment by maximizing the sum of reward signals seen over the agent's lifetime. The optimal action-selection policy is identified by learning an action-value function (i.e. $Q$-function) that measures the quality of taking an action $a_t$ given a state $s_t$, as defined by Watkins et al. [Watkins1992]. The $Q$-function can be solved using the Bellman iterative approach [bellman2013dynamic] as in Equation 1.
$Q(s_t, a_t) = r_t + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1})$  (1)
where $\gamma$ is the discount factor for future rewards, and $s_{t+1}$ and $a_{t+1}$ are the next state and action.
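To make the discrete action space and the Bellman update of Equation 1 concrete, a minimal Python sketch is given below. The action step sizes, the discount factor value, and all helper names (`ACTIONS`, `apply_rigid`, `step`, `bellman_target`) are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np
from scipy.ndimage import affine_transform

# Six candidate actions: +/- one step in each rigid parameter (t_x, t_y, theta).
# The step magnitudes (1 pixel / 1 degree) are assumptions.
ACTIONS = [(+1, 0, 0), (-1, 0, 0),   # translate along x
           (0, +1, 0), (0, -1, 0),   # translate along y
           (0, 0, +1), (0, 0, -1)]   # rotate (degrees)

def apply_rigid(image, tx, ty, theta_deg):
    """Warp a B-scan with a 2D rigid transform (rotation about the image center
    plus translation). Offsets are in (row, col) order; inverse-mapping warp."""
    theta = np.deg2rad(theta_deg)
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    center = np.array(image.shape) / 2.0
    offset = center - rot @ center + np.array([ty, tx])
    return affine_transform(image, rot, offset=offset, order=1)

def step(params, action_idx):
    """Environment step: one action changes exactly one rigid parameter."""
    return tuple(p + d for p, d in zip(params, ACTIONS[action_idx]))

def bellman_target(r_t, q_next, gamma=0.9):
    """Equation 1: immediate reward plus discounted best next Q-value."""
    return r_t + gamma * np.max(q_next)
```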

2.2 Dueling Deep Q-Network for Optimal Policy Estimation
In this paper, we follow Mnih et al. [mnih2015human], who proposed the deep Q-network (DQN) to approximate the action-value function with a deep neural network (DNN) parameterized by weights $\omega$, as in Equation 2.
$Q(s_t, a_t; \omega) = r_t + R_b + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}; \omega)$  (2)
where $R_b$ is a bonus reward the agent receives when it finds the best alignment, which is reached when the distance between the reference B-scan and the aligned B-scan falls within a distance threshold $\epsilon$. The immediate reward for a state-action pair is calculated using Equation 3.
$r(s_t, a_t) = D(T_{t-1}) - D(T_t)$  (3)
where $T_{t-1}$ and $T_t$ refer to the transformation parameters $(t_x, t_y, \theta)$ before and after action $a_t$ is selected, and $D$ is the dissimilarity metric.
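A minimal sketch of the reward in Equation 3 follows, reusing the hypothetical `apply_rigid` helper from the earlier sketch and leaving the dissimilarity measure abstract (a concrete version is sketched after Equation 4).

```python
def immediate_reward(b_ref, b_mov, T_prev, T_curr, dissimilarity):
    """Equation 3 (sketch): reward = reduction in dissimilarity after the action."""
    d_prev = dissimilarity(b_ref, apply_rigid(b_mov, *T_prev))
    d_curr = dissimilarity(b_ref, apply_rigid(b_mov, *T_curr))
    return d_prev - d_curr  # positive if the action improved the alignment
```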
Finding a good measure of dissimilarity is crucial for the success of the agent's learning process. Prior work in this area relies on ground-truth transformation parameters [liao2017artificial, krebs2017robust], where $D$ is the distance between the current and ground-truth transformation parameters. In our work, we propose the use of intensity-based image similarity metrics to train the agent in an unsupervised manner, which has not been proposed in the literature before. The proposed dissimilarity metric is based on two statistical measures, namely the correlation coefficient ($\rho$) and the structural similarity index measure (SSIM), as in Equation 4. $\rho$ measures the degree to which a change in one image is associated with a change in the other, while SSIM captures perceptual likeness and structural information.
$D(x, y) = -\left[\rho(x, y) + \mathrm{SSIM}(x, y)\right]$, with $\rho(x, y) = \dfrac{\sigma_{xy}}{\sigma_x \sigma_y}$ and $\mathrm{SSIM}(x, y) = \dfrac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}$  (4)
where $\mu_x$ and $\sigma_x^2$ are the average and variance of the reference image $x$, $\mu_y$ and $\sigma_y^2$ are the average and variance of the transformed image $y$, $\sigma_{xy}$ refers to the covariance between $x$ and $y$, and $c_1$ and $c_2$ are stabilization constants.
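The combined metric in Equation 4 could be computed as in the following sketch; the exact weighting and sign convention used to combine $\rho$ and SSIM are assumptions, and `data_range=1.0` assumes B-scans normalized to [0, 1].

```python
import numpy as np
from skimage.metrics import structural_similarity

def dissimilarity(x, y):
    """Combined dissimilarity (sketch of Equation 4): higher correlation and
    SSIM mean better alignment, so their sum is negated."""
    rho = np.corrcoef(x.ravel(), y.ravel())[0, 1]        # correlation coefficient
    ssim = structural_similarity(x, y, data_range=1.0)   # structural similarity
    return -(rho + ssim)
```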
We also adopt the action-value function decomposition proposed by Wang et al. [wang2015dueling], called dueling DQN, in which $Q(s, a)$ is decomposed into an action-independent value function $V(s)$ and an action-dependent value function $A(s, a)$. This decomposition has been shown to provide more robust state-value estimates. The architecture of our DNN is shown in Figure 1. It consists of 4 convolutional layers, each followed by batch normalization and ReLU activation, except the first convolutional layer, which has no batch normalization. The convolutional layers have 32-32-64-64 filters with kernel sizes of 5-5-4-3, respectively, and a stride of 2 for all layers. This is followed by a 2D max-pooling layer of size 2 and a fully-connected layer with 512 nodes. The output of the fully-connected layer is passed to two branches for the action-dependent and action-independent value functions, with 6 and 1 nodes, respectively. Finally, a fully-connected layer connects the sum of the two branches to the output layer with 6 nodes, each corresponding to one of the actions in $A$. The input to the network is the difference between the reference B-scan and the transformed B-scan for each of the previous 4 steps. This yields more stable search trajectories and prevents the agent from oscillating. The loss function is the mean squared error between the predicted and target Q-values, as shown in Equation 5.
$L(\omega) = \mathbb{E}\left[\left(r_t + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}; \omega^{-}) - Q(s_t, a_t; \omega)\right)^2\right]$  (5)

where $\omega^{-}$ denotes the target-network weights.
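The following Keras sketch mirrors the architecture described above. The 84×84×4 input shape is inferred from Sections 3 and 4, and details such as padding, initialization, and the final 6-node layer applied to the sum of the two branches follow our reading of the text, not released code (standard dueling DQN instead aggregates as $V + (A - \bar{A})$).

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_ddqn_oct(input_shape=(84, 84, 4), n_actions=6):
    """Sketch of the dueling network described in Sec. 2.2."""
    inp = layers.Input(shape=input_shape)

    # Conv stack: 32-32-64-64 filters, kernels 5-5-4-3, stride 2;
    # batch norm + ReLU after every layer except the first (ReLU only).
    x = layers.Conv2D(32, 5, strides=2, activation="relu")(inp)
    for filters, kernel in [(32, 5), (64, 4), (64, 3)]:
        x = layers.Conv2D(filters, kernel, strides=2)(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)

    x = layers.MaxPooling2D(pool_size=2)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(512, activation="relu")(x)

    advantage = layers.Dense(n_actions)(x)   # action-dependent branch (6 nodes)
    value = layers.Dense(1)(x)               # action-independent branch (1 node)

    # Per the paper's description, the sum of the two branches feeds a final
    # 6-node fully-connected layer producing the Q-values.
    q_values = layers.Dense(n_actions)(advantage + value)
    return tf.keras.Model(inp, q_values)
```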
3 Data and Implementation Details
Dataset. The dataset contains 10,370 OCT macular scans from both eyes of 1678 individuals, acquired on a Cirrus SD-OCT Scanner (Zeiss; Dublin, CA, USA) over multiple visits. The dataset includes 427 healthy scans from 109 individuals; the remaining scans are from eyes with different ocular conditions including glaucoma, optic neuropathy, plateau iris, and others.
The scans have 200 × 200 × 1024 (a-scans × b-scans × depth) voxels per cube, covering a volume of 6 × 6 × 2 mm. This is an observational study conducted in accordance with the tenets of the Declaration of Helsinki and the Health Insurance Portability and Accountability Act.
Training and Validation. The OCT volumes are divided into training (7290 scans), validation (1608), and testing (1472) subsets, ensuring that eyes belonging to the same patient are not split across subsets. From each volume, 20 B-scans are randomly selected and normalized to pixel values between 0 and 1. A random window around the retinal layers is cropped, with spacings of 4 and 2 in the two image dimensions, respectively. Then, a random rigid transformation is applied to each cropped window separately. The range of simulated transformation parameters is chosen to be from -5 to 5 for $t_x$, $t_y$, and $\theta$, to cover all plausible eye movements. The cropped B-scan and its corresponding transformed B-scan represent the environment that the agent interacts with throughout its lifetime (i.e. one episode). During training, we use an experience replay memory to store transitions $(s_t, a_t, r_t, s_{t+1})$, from which batches of size 256 are randomly sampled. Each input sample has the size 84 × 84 × 4, where 4 represents the previous action steps taken by the agent. The network is trained using the Adam optimizer for 100 epochs, with a cap on the number of steps per episode.
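A sketch of the episode setup described above is given below, reusing the hypothetical `apply_rigid` helper from the Section 2.1 sketch; the sampling distribution and function name are illustrative.

```python
import numpy as np

rng = np.random.default_rng()

def simulate_misalignment(b_scan):
    """Draw a random rigid transform within +/-5 pixels translation and
    +/-5 degrees rotation (ranges stated in Sec. 3) and warp the cropped B-scan."""
    tx, ty, theta = rng.uniform(-5, 5, size=3)
    moving = apply_rigid(b_scan, tx, ty, theta)
    return moving, (tx, ty, theta)   # transformed B-scan and its ground-truth params
```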
4 Experimental Results and Discussion
The proposed DDQN-OCT model is implemented in Python and TensorFlow on a single V100 GPU. The exploration rate of the artificial agent starts at 1 and decreases linearly to 0.1 at epoch 20, followed by another linear decrease until epoch 100, as shown in Figure 2-(a). Figure 2-(b) and (c) plot the training and validation loss curves and the image distances, respectively, showing good convergence without overfitting.
Figure 2: (a) Exploration rate schedule; (b) training and validation loss; (c) image distances.
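The exploration schedule plotted in Figure 2-(a) can be written as a small helper; the final exploration value reached at epoch 100 is not stated in the text, so `eps_final` below is an assumption.

```python
def exploration_rate(epoch, total_epochs=100, eps_final=0.01):
    """Piecewise-linear epsilon schedule: 1.0 -> 0.1 over the first 20 epochs,
    then a second linear decay until the last epoch (final value assumed)."""
    if epoch <= 20:
        return 1.0 - (1.0 - 0.1) * (epoch / 20)
    return max(eps_final,
               0.1 - (0.1 - eps_final) * ((epoch - 20) / (total_epochs - 20)))
```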
To visualize how our method works, we test the DDQN-OCT model, trained on noisy scans, on both noisy and denoised B-scans. We denoise the B-scans using the generative adversarial network (GAN) model proposed in [halupka2018retinal]. Instead of cropping a random window, we resize the whole B-scan to match the network input shape (see Figure 3). The figure shows the agent's results for a randomly selected noisy B-scan (left column) and the corresponding denoised B-scan (right column). Each of the four rows displays the reference B-scan, the aligned B-scan, and the agent screen at steps 3, 5, 9, and 11; the agent reaches the best alignment after 11 steps. The ground-truth transformation parameters and the agent's current transformation are displayed in yellow.
For quantitative evaluation, we compute the normalized mutual information (NMI), cross-correlation coefficient ($\rho$), agent score (i.e. cumulative reward), and execution time for each sample in our test set of 29,440 B-scans. We then compute the average, standard deviation, and the lower, middle, and upper quartiles of each measure across the test set, as reported in Table 1. The proposed model achieves averages of 0.985 and 0.914 for NMI and $\rho$, respectively. Furthermore, to quantify the impact of speckle noise on the agent's training, we retrain the model using denoised B-scans. The statistics show an increase of 4% in the correlation measure ($\rho$) and a decrease of 2% in NMI (Table 1).
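For reference, the evaluation metrics could be computed as in the following sketch; the histogram binning and the NMI normalization are assumptions, since the exact definitions used for Table 1 are not given.

```python
import numpy as np

def normalized_mutual_information(x, y, bins=64):
    """NMI with the symmetric normalization 2*I(x;y)/(H(x)+H(y)), in [0, 1]."""
    hist_2d, _, _ = np.histogram2d(x.ravel(), y.ravel(), bins=bins)
    pxy = hist_2d / hist_2d.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nonzero = pxy > 0
    h_xy = -np.sum(pxy[nonzero] * np.log(pxy[nonzero]))   # joint entropy
    h_x = -np.sum(px[px > 0] * np.log(px[px > 0]))        # marginal entropies
    h_y = -np.sum(py[py > 0] * np.log(py[py > 0]))
    mi = h_x + h_y - h_xy
    return 2.0 * mi / (h_x + h_y)

def cross_correlation(x, y):
    """Pearson cross-correlation coefficient between two B-scans."""
    return np.corrcoef(x.ravel(), y.ravel())[0, 1]
```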
Figure 3: Agent alignment results on (a) a noisy B-scan and (b) the corresponding denoised B-scan.
Table 1: Performance of the proposed DDQN-OCT model on noisy and denoised B-scans.

| Statistical measure | NMI | $\rho$ | Episode score | Time (sec) |
|---|---|---|---|---|
| Noisy B-scans | | | | |
| Average ± Std | 0.985 ± 0.064 | 0.914 ± 0.128 | 3.343 ± 3.036 | 0.445 ± 0.498 |
| Lower quartile (25%) | 0.990 | 0.826 | 0.218 | 0.040 |
| Median (50%) | 1.00 | 1.00 | 5.455 | 0.061 |
| Upper quartile (75%) | 1.00 | 1.00 | 5.562 | 1.027 |
| Denoised B-scans | | | | |
| Average ± Std | 0.969 ± 0.094 | 0.957 ± 0.076 | 3.621 ± 2.623 | 0.607 ± 0.787 |
| Lower quartile (25%) | 0.986 | 0.945 | 0.280 | 0.058 |
| Median (50%) | 0.989 | 0.984 | 5.254 | 0.105 |
| Upper quartile (75%) | 1.00 | 1.00 | 5.347 | 1.195 |
As a comparative study, we evaluate the performance of the elastix intensity-based medical image registration approach described in [klein2009elastix], using the same test set and evaluating noisy and denoised B-scans separately. The results are reported in Table 2: our DDQN-OCT model shows a significant improvement over elastix, with gains of more than 50% and 10% in NMI and $\rho$, respectively. The table also shows that elastix performs considerably better on denoised scans than on noisy scans, with improvements of 10% and 4% in NMI and $\rho$, respectively. We also record the execution time for all experiments (Tables 1 and 2); our model takes much less time than elastix, averaging about 0.5 seconds per image.
Table 2: Performance of elastix intensity-based registration on the same test set.

| Statistical measure | NMI | $\rho$ | Time (sec) |
|---|---|---|---|
| Noisy B-scans | | | |
| Average ± Std | 0.344 ± 0.140 | 0.814 ± 0.083 | 38.322 ± 14.978 |
| Lower quartile (25%) | 0.281 | 0.762 | 35.545 |
| Median (50%) | 0.306 | 0.826 | 36.581 |
| Upper quartile (75%) | 0.331 | 0.875 | 37.867 |
| Denoised B-scans | | | |
| Average ± Std | 0.448 ± 0.103 | 0.847 ± 0.072 | 6.840 ± 0.392 |
| Lower quartile (25%) | 0.386 | 0.802 | 6.580 |
| Median (50%) | 0.421 | 0.854 | 6.851 |
| Upper quartile (75%) | 0.465 | 0.900 | 7.063 |
We also compare the proposed dueling DQN with other DQN architectures recently presented in the literature. For example, Van Hasselt et al. [van2016deep] proposed double DQN, which decouples action selection from the target network, reducing the observed overestimation and improving performance. In [alansary2019evaluating], a combination of the double and dueling approaches was shown to outperform the original DQN. In this experiment, we train four DQN variants: DQN, double DQN, dueling DQN, and double dueling DQN. For evaluation, we apply 2D rigid transformations (within 10 pixels and 10 degrees for a crop size of 84×84) to random crops from our test set of 29,440 B-scans. Performance measures are reported in Table 3, which shows only slight differences between the variants. Double DQN performs best, with NMI and $\rho$ of 0.98 and 0.97, respectively, while the original DQN performs worst, which aligns with the results presented in [wang2015dueling, van2016deep, alansary2019evaluating].
Furthermore, we compare our unsupervised rewarding approach with the supervised one of [liao2017artificial] (i.e. using ground-truth transformation parameters for the reward). Surprisingly, the unsupervised approach outperforms supervised training by roughly 4%.
Table 3: Comparison of DQN variants and of unsupervised vs. supervised reward training.

| | NMI | $\rho$ | Episode score | Time (sec) |
|---|---|---|---|---|
| Unsupervised training | | | | |
| DQN | 0.969 ± 0.083 | 0.958 ± 0.077 | 3.796 ± 2.687 | 0.335 ± 0.469 |
| Dueling DQN | 0.973 ± 0.085 | 0.959 ± 0.080 | 9.475 ± 4.341 | 0.249 ± 0.265 |
| Double DQN | 0.978 ± 0.070 | 0.965 ± 0.066 | 9.738 ± 4.487 | 0.198 ± 0.216 |
| Double Dueling DQN | 0.974 ± 0.081 | 0.960 ± 0.075 | 9.594 ± 4.010 | 0.248 ± 0.264 |
| Supervised training | | | | |
| Dueling DQN | 0.934 ± 0.147 | 0.902 ± 0.121 | 5.979 ± 13.399 | 0.388 ± 0.218 |
5 Conclusion
Registration is a critical step in the automated analysis of medical images for patient monitoring, and access to an accurate unsupervised registration method is of immense value given the costly practice of curating annotated images required by supervised methods. In this paper, we present a novel framework for unsupervised 2D rigid registration of medical images, in particular OCT volumes of the retina, which takes advantage of intensity-based techniques and achieves state-of-the-art performance. An artificial agent is trained to efficiently align two B-scans by finding the 2D transformation parameters. The agent is trained using a dueling deep Q-network in an unsupervised manner, where a combination of intensity-based image similarity measures guides the reward system. The proposed DDQN-OCT model markedly outperforms the elastix intensity-based medical image registration approach. The proposed framework also shows strong potential for application to other registration problems.