Sarcopenia corresponds to muscle atrophy that may be due to ageing, inactivity, or disease. The decrease of skeletal muscle mass is a good indicator of a patient's overall health state. In oncology, sarcopenia has been shown to be linked to outcome in patients treated with chemotherapy [14, 5], immunotherapy, or surgery. There are multiple definitions of sarcopenia [7, 23] and consequently multiple ways of assessing it. On CT imaging, the assessment is based on muscle mass quantification. Muscle mass is most commonly assessed on a slice passing through the middle of the third lumbar vertebra (L3), a level that has been found to be representative of body composition. After manual selection of the correct CT slice at the L3 level, the muscles are segmented to compute the skeletal muscle area. In practice, this evaluation is tedious, time-consuming, and rarely done by radiologists, highlighting the need for an automatic diagnosis tool. Such automated measurement of muscle mass could be of great help for introducing sarcopenia assessment into daily clinical practice.
Muscle segmentation and quantification on a single slice have been thoroughly addressed in multiple works using simple 2D U-Net-like architectures [4, 6]. Few works, however, focus on L3 slice detection. The main challenges of this task stem from the inherent diversity of patient anatomy, the strong resemblance between vertebrae, and the variability of CT fields of view as well as of acquisition and reconstruction protocols.
The most straightforward approach to L3 localization is to apply methods for multiple-vertebrae labeling in 3D images, using detection or even segmentation algorithms. Such methods require a substantial volume of annotations and are computationally inefficient when dealing with the entire 3D CT scan. In fact, even though our input is 3D, a one-dimensional output, the z-coordinate of the slice, is sufficient to solve the L3 localization problem.
Other works focus on training simple convolutional neural networks (CNNs). These techniques use maximal intensity projection (MIP), which projects the voxels with maximal intensity values onto a 2D plane. Frontal-view MIP projections contain enough information to differentiate the vertebrae within the body's bone structure. On the sagittal view, restricted MIP projections are used to focus solely on the spinal area. Some authors tackle this problem through regression, training the CNN on the parts of the MIP that contain the L3 vertebra only. More recently, a U-Net-like architecture (L3UNet-2D) was proposed to draw a 2D confidence map over the position of the L3 slice.
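The frontal MIP these 2D methods rely on reduces to a single reduction along the antero-posterior axis; a minimal numpy sketch, assuming the volume is indexed as (z, y, x):

```python
import numpy as np

def frontal_mip(volume: np.ndarray) -> np.ndarray:
    """Maximal intensity projection onto the frontal (coronal) plane:
    keep, for every (z, x) position, the maximum voxel value along y."""
    return volume.max(axis=1)

# A toy 4-slice volume of 3x3 voxels collapses to a 4x3 frontal image.
vol = np.zeros((4, 3, 3))
vol[2, 1, 0] = 7.0  # a bright voxel survives the projection
mip = frontal_mip(vol)
assert mip.shape == (4, 3) and mip[2, 0] == 7.0
```

A sagittal MIP restricted to the spinal area would be the analogous `volume[:, :, x0:x1].max(axis=2)` over a chosen slab, with `x0:x1` a hypothetical crop.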
In this paper, we propose a reinforcement learning algorithm for accurate detection of the L3 slice in CT scans, automating the process of sarcopenia assessment. The main contribution of our paper is a novel formulation of the L3 localization problem, exploiting different deep reinforcement learning (DRL) schemes that boost the state of the art for this challenging task, even in scarce-data settings. Moreover, we demonstrate that 2D approaches for vertebra detection provide state-of-the-art results compared to 3D landmark detection methods, simplifying the problem and reducing both the search space and the amount of annotation needed. To the best of our knowledge, this is the first time a reinforcement learning algorithm is explored for vertebra slice localization, reporting performance similar to medical experts and opening new directions for this challenging task.
Reinforcement learning is a fundamental tool of machine learning that deals efficiently with the exploration/exploitation trade-off. Given state-reward pairs, a reinforcement learning agent picks actions either to reach unexplored states or to increase its accumulated future reward. These principles are appealing for medical applications because they imitate a practitioner's behavior and allow self-learning from experience based on ground truth. One of the main issues of this class of algorithms is sample complexity: a large amount of interaction with the environment is needed before obtaining an agent close to optimal. However, these techniques were recently combined with deep learning approaches, which efficiently addressed this issue [12] in a variety of tasks and applications.
In medical imaging, model-free reinforcement learning algorithms are widely used for landmark detection as well as localization tasks. A Deep Q-Network (DQN) was proposed to automate the view planning process on brain and cardiac MRI; this framework takes a single plane as input and updates its angle and position during training until convergence. A DQN framework has also been presented for the localization of different anatomical landmarks, introducing multiple agents that act and learn simultaneously. DRL has also been studied for object and lesion localization. More recently, a DQN framework was proposed for the localization of different organs in CT scans, achieving performance comparable to supervised CNNs; it uses a 3D volume as input with different actions to generate bounding boxes for these organs. Our work is the first to explore and validate an RL scheme on MIP representations for single-slice detection, exploiting the discrete and 2D nature of the problem.
3 Reinforcement Learning Strategy
In this paper, we formulate the slice localization problem as a Markov Decision Process (MDP), which contains a set of states $\mathcal{S}$, actions $\mathcal{A}$, and rewards $\mathcal{R}$.
States: For our formulation, the environment that we explore and exploit is a 2D image representing the frontal MIP projection of the 3D CT scan. This projection reduces the problem's dimensionality from a volume of varying height to a single 2D image. The reinforcement learning agent is self-taught by interacting with this environment, executing a set of actions and receiving a reward linked to the action taken. An input example is shown in Figure 1. We define a state as a fixed-size crop of this image. We consider the middle of the image to be the current position of the slice along the z-axis. To highlight this, we assign a line of maximum pixel intensity to the middle of each image provided as input to our DQN.
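Building such a state amounts to cropping a window of the MIP around the current slice and marking its middle row; a sketch under assumed dimensions (the paper's exact crop size is not reproduced), with 1.0 standing in for the maximum intensity of a [0, 1]-normalized MIP:

```python
import numpy as np

def make_state(mip: np.ndarray, z: int, half: int = 32) -> np.ndarray:
    """Crop a (2*half+1)-row window of the frontal MIP centred on slice z,
    zero-padding past the borders, and mark the current position with a
    row of maximal intensity in the middle."""
    h, w = mip.shape
    state = np.zeros((2 * half + 1, w), dtype=mip.dtype)
    lo, hi = max(0, z - half), min(h, z + half + 1)
    state[lo - (z - half):hi - (z - half)] = mip[lo:hi]
    state[half] = 1.0  # highlight the current slice in the middle row
    return state
```

Each action then shifts `z` by one slice and the state window slides with it.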
Actions: We define a set of two discrete actions: one corresponds to a positive translation (going up by one slice) and the other to a negative translation (going down by one slice). These two actions allow the agent to explore the entirety of the environment. Experiments with an additional stop action did not improve the agent's accuracy, since adding more actions increases the complexity of the task to learn.
Rewards: In reinforcement learning, designing a good reward function is crucial for learning the goal to achieve. To measure the quality of an action, we use the distance along the z-axis between the current slice and the annotated slice. The reward for non-terminating states is computed as:

$$R = \operatorname{sgn}\big( d(p_{s}, g) - d(p_{s'}, g) \big)$$

where $s$ and $s'$ denote the current and next state and $g$ the ground-truth annotation. Our positions $p_{s}$ and $p_{s'}$ are the z-coordinates of the current and next state respectively, and $d$ is the Euclidean distance between coordinates along the z-axis. The reward is non-sparse and binary, and helps the agent differentiate between good and bad actions, a good action being one that brings the agent closer to the correct slice. For a terminating state, a fixed positive reward is assigned.
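This reward amounts to a sign test on the distance to the annotated slice; a minimal sketch in plain Python:

```python
def step_reward(z: int, z_next: int, z_gt: int) -> int:
    """Binary, non-sparse reward: +1 if the action brought the current
    slice closer to the annotated slice z_gt, -1 otherwise."""
    return 1 if abs(z_next - z_gt) < abs(z - z_gt) else -1

assert step_reward(40, 39, 30) == 1   # moving towards the annotation
assert step_reward(40, 41, 30) == -1  # moving away from it
```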
Starting States: An episode starts by randomly sampling a slice along the z-axis and ends when the agent has achieved its goal of finding the right slice. The agent executes a set of actions and collects rewards until the episode terminates. When the agent reaches the upper or lower border of the image, the next state is set to the current state (i.e., the agent does not move) and a penalty is assigned to this action.
Final States: During training, a terminal state is defined as a state in which the agent has reached the right slice. A positive terminal reward is assigned in this case and the episode is terminated. During testing, an episode terminates when oscillations occur. We adopt the same approach as prior work and choose the action with the lowest Q-value, which has been found to be closest to the right slice, since the DQN outputs higher Q-values for actions taken when the current slice is far from the ground truth.
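Putting these transition rules together, a training-time environment step might look as follows; the terminal and border reward constants are illustrative placeholders, since the paper's exact values are not reproduced here:

```python
def env_step(z: int, action: int, z_gt: int, z_max: int):
    """One MDP transition. action is +1 (slice up) or -1 (slice down).
    Returns (next position, reward, done). TERMINAL_R and BORDER_R are
    placeholder constants, not the paper's."""
    TERMINAL_R, BORDER_R = 10, -1
    z_next = z + action
    if z_next < 0 or z_next > z_max:   # border: stay put, penalise
        return z, BORDER_R, False
    if z_next == z_gt:                 # reached the annotated slice
        return z_next, TERMINAL_R, True
    # otherwise the binary distance-based reward
    r = 1 if abs(z_next - z_gt) < abs(z - z_gt) else -1
    return z_next, r, False

assert env_step(5, 1, 6, 10) == (6, 10, True)   # terminal transition
assert env_step(0, -1, 6, 10) == (0, -1, False) # border transition
```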
3.1 Deep Q-Learning
To find the optimal policy of the MDP, a state-action value function $Q(s, a)$ can be learned. In Q-Learning, the expected value of the accumulated discounted future rewards is estimated recursively using the Bellman optimality equation:

$$Q^{*}(s, a) = \mathbb{E}\big[ r + \gamma \max_{a'} Q^{*}(s', a') \big]$$
In practice, since the state is not easily exploitable, we can take advantage of neural networks as universal function approximators to approximate $Q$. We use an experience replay technique, which consists in storing the agent's experience at each time step in a replay memory. To break the correlation between consecutive samples, we uniformly sample a batch of experiences from this memory. The Deep Q-Network (DQN) iteratively optimizes its parameters $\theta$
by minimizing the following loss function:

$$L(\theta) = \mathbb{E}\Big[ \big( r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta) \big)^{2} \Big]$$
with $\theta$ and $\theta^{-}$ being the parameters of the policy and the target network respectively. To stabilize rapid policy changes due to the distribution of the data and the variations in Q-values, the DQN uses $\theta^{-}$, a fixed copy of $\theta$ that is updated periodically. For our experiments, we update $\theta^{-}$ every 50 iterations.
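The target term of this loss can be illustrated on toy arrays; a numpy sketch in which the target network's outputs are assumed stand-in values, not learned ones:

```python
import numpy as np

def dqn_targets(r, q_next_target, done, gamma=0.9):
    """Bellman targets y = r + gamma * max_a' Q(s', a'; theta_minus),
    dropping the bootstrap term for terminal transitions."""
    return r + gamma * q_next_target.max(axis=1) * (1.0 - done)

r = np.array([1.0, -1.0])
qn = np.array([[0.5, 2.0],   # target-network Q-values for s' (non-terminal)
               [1.0, 0.0]])  # unused: this transition is terminal
done = np.array([0.0, 1.0])
y = dqn_targets(r, qn, done)  # -> [1 + 0.9*2, -1] = [2.8, -1.0]
```

The squared difference between `y` and the policy network's `Q(s, a; theta)` is then minimized by gradient descent, with `theta_minus` refreshed periodically.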
3.2 Network Architecture
Our Deep Q-Network takes the state as input and passes it through a convolutional network. The network contains four convolution layers separated by parametric ReLUs, to break the linearity of the network, and four linear layers with LeakyReLU. Contrary to previous work, we chose not to include the history of previously visited states. We opted for this approach because there is a single path leading to the right slice, which allows us to simplify the problem even further. Ideally, our agent should learn, just by looking at the current state, to go up when the current slice is below the L3 slice and down when it is above. An overview of our framework is presented in Figure 1.
We also explore dueling DQNs. Dueling DQNs rely on the concept of an advantage, which quantifies the benefit each action provides. The advantage is defined as $A(s, a) = Q(s, a) - V(s)$, with $V(s)$ being the state value function. The algorithm uses the advantage to separate the contribution of each action from the state's baseline value. Dueling DQNs have been shown to produce more robust agents that are wiser in choosing the next best action. For our dueling DQN, we use the same architecture as the one in Figure 1 but split the second-to-last fully connected layer to compute state values on one side and advantages on the other.
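The two heads can be recombined into Q-values with the mean-subtracted aggregation commonly used in dueling networks for identifiability (an assumption here; the paper does not spell out its aggregation):

```python
import numpy as np

def dueling_q(v: np.ndarray, adv: np.ndarray) -> np.ndarray:
    """Combine state values V(s) and advantages A(s, a) into Q-values:
    Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""
    return v[:, None] + adv - adv.mean(axis=1, keepdims=True)

v = np.array([1.0])            # one state in the batch
adv = np.array([[2.0, 0.0]])   # two actions: up / down
q = dueling_q(v, adv)          # -> [[2.0, 0.0]]
```

Subtracting the mean advantage removes the ambiguity of shifting a constant between the value and advantage streams.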
Since our agent is initially unaware of the possible states and rewards of its environment, the exploration step is implemented first. After a few iterations, the agent can start exploiting what it has learned. To balance exploration and exploitation, we use an $\epsilon$-greedy strategy: an exploration rate $\epsilon$ is initialized to a high value and decayed over time, allowing the agent to progressively become greedy and exploit the environment. The batch size and experience replay memory size are kept fixed across experiments. The entire framework was developed with the PyTorch library using an NVIDIA GTX 1080Ti GPU, and training the model took several hours. A visualization of the training process is provided as supplementary material.
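The $\epsilon$-greedy behaviour policy is a few lines of plain Python; the decay schedule below is multiplicative with placeholder constants, since the paper's exact rate and initial value are not reproduced:

```python
import random

def epsilon_greedy(q_values, epsilon: float) -> int:
    """With probability epsilon pick a random action, else the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def decay(epsilon: float, rate: float = 0.995, floor: float = 0.05) -> float:
    """Multiplicative decay towards a small exploration floor.
    rate and floor are illustrative, not the paper's constants."""
    return max(floor, epsilon * rate)

assert epsilon_greedy([0.1, 0.9], 0.0) == 1  # fully greedy picks action 1
```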
4 Experiments and Results
A diverse dataset of CT scans was retrospectively collected for this study. CT scans were acquired on 4 different CT models from 3 manufacturers (Revolution HD from GE Healthcare, Milwaukee, WI; Brilliance 16 from Philips Healthcare, Best, Netherlands; and Somatom AS+ and Somatom Edge from Siemens Healthineers, Erlangen, Germany). Exams were abdominal, thoraco-abdominal, or thoraco-abdominopelvic CT scans acquired with or without contrast media injection. Images were reconstructed with an abdominal kernel using either filtered back-projection or iterative reconstruction, with varying slice thicknesses and numbers of slices. The heterogeneity of our dataset highlights the challenges of the problem from a clinical perspective.
Experienced radiologists manually annotated the dataset, indicating the position of the slice passing through the middle of the L3 vertebra. Before computing the MIP, all CT scans are resampled to a common resolution along the z-axis. This normalization step harmonizes the network's input, especially since the agent performs actions along the z-axis. After the MIP, we apply a thresholding in Hounsfield units (HU) that eliminates artifacts and foreign metal bodies while keeping the skeletal structure. The MIPs are finally normalized to [0, 1]. From the entire dataset, we randomly selected 100 patients for testing and used the rest for training and validation. For the testing cohort, L3 annotations from a second experienced radiologist were collected to measure interobserver performance.
4.2 Results and Discussion
Our method is compared with other techniques from the literature. The error is calculated as the distance in millimeters (mm) between the predicted L3 slice and the one annotated by the experts. In particular, we performed experiments with the L3UNet-2D approach and with SC-Net, the winning method of the VerSe2020 challenge (https://verse2020.grand-challenge.org/). Even though SC-Net is trained on more precise annotations, with 3D landmarks as well as vertebra segmentations, and addresses a different problem, we applied it to our testing cohort. The comparison of the different methods is summarised in Table 1. SC-Net produces several CT scans with large errors. L3UNet-2D performs better when trained on the entire training set, with only a few scans showing large errors for L3 detection. Our proposed method yields the lowest mean error of all compared approaches. Finally, we evaluated our technique with a dueling DQN strategy, which reported higher errors than the proposed one. This observation could be linked to the small action space designed for this study; dueling DQNs have proven powerful in settings with larger action spaces, where the computation of the advantage function makes a difference.
Table 1: Method | # of samples | Mean | Std | Median | Max | Error
For the proposed reinforcement learning framework trained on the full training set, 9 CTs had a large detection error. These scans were analysed by a medical expert, who indicated that two of them have a lumbosacral transitional vertebra (LSTV) anomaly. Transitional vertebrae are common, observed in 15-35% of the population, highlighting once again the challenges of this task. In both cases, the localization of the L3 vertebra for sarcopenia assessment is ambiguous for radiologists and consequently for the reinforcement learning agent. In fact, the only large error in the interobserver comparison corresponds to an LSTV case where each radiologist chose a different vertebra as the basis for sarcopenia assessment. Even though the interobserver performance is better than that reported by the algorithms, our method reports the lowest errors, proving its potential.
Qualitative results are displayed in Figure 2 as well as in the supplementary materials. The yellow line represents the medical expert's annotation and the blue one the prediction of the different models. One can notice that all of the methods converge to the correct L3 region, with ours performing best. It is important to note that, for sarcopenia assessment, an automatic system does not need to hit the exact middle slice: a prediction a few millimeters away will not skew the end result, since muscle mass does not change significantly within the L3 zone. Prediction times for the RL agent depend on the initial slice randomly sampled on the MIP; the computed inference time for a single step is approximately 0.03 seconds.
To highlight the robustness of our network with a low number of annotated samples, we performed experiments using 100, 50, and 10 CTs, corresponding respectively to 10%, 5%, and 1% of our dataset. We tested these 3 agents on the same 100-patient test set and report results in Table 1. Our experiments demonstrate the robustness of reinforcement learning algorithms compared to traditional CNN-based ones on small annotated datasets. The traditional methods fail to train properly with a small number of annotations, reporting large errors for all three experiments. Learning a valid policy from a small number of annotations is one of the strengths of reinforcement learning. In our case, training with less data increases the reported detection error; however, trained on only 10 CTs with the same number of iterations and memory size, our agent was still able to learn a correct policy. Traditional deep learning techniques rely on pairs of images and annotations to build a robust generalization, so each pair is exploited only once by the learning algorithm. Reinforcement learning, by contrast, relies on experiences, each experience being a tuple of state, action, reward, and next state. Therefore, a single CT scan can provide multiple experiences to the self-learning agent, making our method ideal for slice localization problems on datasets with a limited amount of annotations.
In this paper, we propose a novel direction for the problem of CT slice localization. Our experiments empirically show that reinforcement learning schemes work very well on small datasets and boost performance compared to classical convolutional architectures. One limitation of our work lies in the fact that our agent always moves one slice at a time regardless of its location, slowing down inference. In the future, we aim to explore ways to adapt the action taken to the current location, one possibility being to incentivize actions with higher increments. Future work also includes the use of reinforcement learning for multiple-vertebrae detection with competitive or collaborative agents.
-  (2018) Automatic view planning with multi-scale deep reinforcement learning agents. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 277–285. Cited by: §2, §3.2, §3.
-  (2011) The prevalence of transitional vertebrae in the lumbar spine. The Spine Journal 11 (9), pp. 858–862. External Links: Cited by: §4.2.
-  Spotting L3 slice in CT scans using deep convolutional network and transfer learning. Computers in Biology and Medicine 87, pp. 95–103. Cited by: §1.
-  (2020) Abdominal musculature segmentation and surface prediction from ct using deep learning for sarcopenia assessment. Diagnostic and Interventional Imaging 101 (12), pp. 789–794. Cited by: §1.
-  (2017) Forcing the vicious circle: sarcopenia increases toxicity, decreases response to chemotherapy and worsens with chemotherapy. Annals of Oncology 28 (9), pp. 2107–2118. Note: A focus on esophageal squamous cell carinoma External Links: Cited by: §1.
-  Automated segmentation of abdominal skeletal muscle in pediatric CT scans using deep learning. Radiology: Artificial Intelligence, pp. e200130. Cited by: §1.
-  (2019) Sarcopenia: revised european consensus on definition and diagnosis. Age and ageing 48 (1), pp. 16–31. Cited by: §1.
-  (2017) Quantifying Sarcopenia Reference Values Using Lumbar and Thoracic Muscle Areas in a Healthy Population. J Nutr Health Aging 21 (10), pp. 180–185. Cited by: §1.
-  (2014) Sarcopenia is a predictor of outcomes in very elderly patients undergoing emergency surgery. Surgery 156 (3), pp. 521–527. External Links: Cited by: §1.
-  (2018-08) Towards intelligent robust detection of anatomical structures in incomplete volumetric data. Med Image Anal 48, pp. 203–213. Cited by: §2.
-  (2020-04-01) Quantification of skeletal muscle mass: sarcopenia as a marker of overall health in children and adults. Pediatric Radiology 50 (4), pp. 455–464. External Links: Cited by: §1.
-  Bootstrap your own latent: a new approach to self-supervised learning. arXiv preprint arXiv:2006.07733. Cited by: §2.
-  (2018) Automatic l3 slice detection in 3d ct images using fully-convolutional networks. External Links: Cited by: §1, §4.2, §4.2, Table 1.
-  (2018) Skeletal muscle loss is an imaging biomarker of outcome after definitive chemoradiotherapy for locally advanced cervical cancer. Clinical Cancer Research 24 (20), pp. 5028–5036. Cited by: §1.
-  (2018-05-01) A review of lumbosacral transitional vertebrae and associated vertebral numeration. European Spine Journal 27 (5), pp. 995–1004. External Links: Cited by: §4.2.
-  (2017) Deep reinforcement learning for active breast lesion detection from dce-mri. In Medical Image Computing and Computer Assisted Intervention - MICCAI 2017, M. Descoteaux, L. Maier-Hein, A. Franz, P. Jannin, D. L. Collins, and S. Duchesne (Eds.), Cham, pp. 665–673. External Links: Cited by: §2.
-  (2013) Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602. Cited by: §2.
-  (2015-02-01) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533. External Links: Cited by: §3.1.
-  (2020) Deep reinforcement learning for organ localization in ct. In Medical Imaging with Deep Learning, pp. 544–554. Cited by: §2.
-  (2019) Association of sarcopenia with and efficacy of anti-pd-1/pd-l1 therapy in non-small-cell lung cancer. Journal of Clinical Medicine 8 (4). External Links: Cited by: §1.
-  (2017) Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, Cited by: §3.3.
-  (2020) Coarse to fine vertebrae localization and segmentation with spatialconfiguration-net and u-net. In Proceedings of the 15th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 5: VISAPP, Vol. 5, pp. 124–133. External Links: Cited by: §1, §4.2, Table 1.
-  (2014) Clinical definition of sarcopenia. Clinical cases in mineral and bone metabolism 11 (3), pp. 177. Cited by: §1.
-  (2018) Reinforcement learning: an introduction. MIT press. Cited by: §2.
-  (2015) Fast automatic vertebrae detection and localization in pathological ct scans - a deep learning approach. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, N. Navab, J. Hornegger, W. M. Wells, and A. F. Frangi (Eds.), pp. 678–686. Cited by: §1.
-  (2020) No-regret exploration in goal-oriented reinforcement learning. In International Conference on Machine Learning, pp. 9428–9437. Cited by: §2.
-  (2019) Multiple landmark detection using multi-agent reinforcement learning. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 262–270. Cited by: §2.
-  (2015) Dueling network architectures for deep reinforcement learning. CoRR abs/1511.06581. External Links: Cited by: §3.2.
-  (2019-11) Single-slice ct measurements allow for accurate assessment of sarcopenia and body composition. European radiology 30, pp. . Cited by: §1.