|Dataset||N. Subjects||Acquired Data||Probe Orientation||US Parameters||Applied Force [N]||Robot Speed [mm/s]|
|Dataset 1||19||B-Mode Linear US||Transverse||Gain = 92%, Freq. = 14 Hz, Depth = 4 cm|
|Dataset 2||13||B-Mode Linear US||Transverse||Gain = 92%, Freq. = 14 Hz, Depth = 4 cm||[2, 10, 15]||[12, 20, 40]|
|Dataset 3||19||B-Mode Convex US||Paramedian-Sagittal||Gain = 92%, Freq. = 14 Hz, Depth = 7 cm|
Lumbar spinal injections are commonly performed in clinical procedures such as facet joint or epidural injections [alexander2019lumbosacral, skaribas2019lumbar]. Such procedures typically require the correct localization of the target vertebra to deliver pharmaceuticals effectively. In clinical practice, vertebral level detection is achieved either through palpation or under X-ray guidance. Although X-ray guidance can improve the overall precision of the procedure, the use of ionizing radiation is considered a hazard for the patient and especially for the clinicians and assistants. On the other hand, the accuracy of the palpation technique is lower, especially for less experienced clinicians. Furthermore, an incorrectly chosen injection level can lead to avoidable complications, such as headaches, nerve damage, and paralysis [Boon2004].
Ultrasound (US) has proven to be an alternative to X-ray, providing precise guidance while sparing patients and clinicians unnecessary radiation [Galiano317, Evansa2015, wu2016effectiveness]. Despite being real-time and non-invasive, ultrasound guidance is particularly challenging in spine procedures due to artifacts and noise caused by the curvature of the spinal bones and the layer of soft tissue covering the spine. To address these issues, various authors have proposed using image processing techniques to support the clinician in the detection of vertebral levels.
One such method automatically classifies images acquired during manual ultrasound-guided epidural injections. In that work, a Convolutional Neural Network (CNN) is used to classify the acquired images as either “vertebra” or “intervertebral gap”, and a state machine is implemented to refine the results. In [Kerby2008] and [Yu2015], panorama image stitching is used to obtain a 2-dimensional (2D) representation of the vertebral laminae along the spine in the paramedian-sagittal plane. In [Kerby2008], a set of filters is applied to the panorama image to enhance bony structures. Local minima in the resulting pattern are extracted and labelled as vertebrae. In [Yu2015], the identification of vertebrae is performed on the panorama image using a template matching approach.
The aforementioned methods provide support tools for the interpretation of ultrasound data during manual injection procedures. However, they still rely on the operator’s skill to manually establish the correspondence between ultrasound images and patient anatomy. Few studies have evaluated the potential of robot integration in the clinical environment for injection procedures. In [Esteban2018], a robotic-ultrasound system for precise needle placement is described in an initial clinical study. In this study, a robotic system with a calibrated ultrasound probe is used to scan the patient’s back. The acquired US volume is then used by the operator to select the needle insertion path. The manipulator, equipped with a calibrated needle holder, moves to the desired insertion point to offer visual guidance during the insertion. Although they show promising results, such systems still rely on the operator for the interpretation of ultrasound images. Furthermore, they do not provide any tactile feedback, which, in the standard procedure, is given by palpation.
The contribution of this work is a robotic-ultrasound approach that combines force and ultrasound data for automatic vertebral level classification in the lumbar spine. The target region is the lumbar region (i.e. vertebral levels from L5 to L1), where spinal injections commonly take place. Force feedback reproduces the tactile information the operator obtains through palpation, while ultrasound images provide continuous visual feedback during the procedure. Compared to the previously presented methods for vertebral level classification, the proposed approach combines the benefits of both robotic and standard procedures. Furthermore, it does not rely on visual feedback alone, but exploits information from multiple sensors. It is demonstrated that fusing ultrasound and force data improves the performance of the method in the presence of data corruption and single-sensor misclassifications. The potential of the proposed approach is explored for an example application, i.e. automatic target plane detection for facet injection procedures.
II-A Materials and Experimental Setup
The system consists of a main workstation (Intel i7, GeForce GTX 1050 Mobile), a robotic arm certified for human interaction (KUKA LBR iiwa 7 R800) combined with a six-axis force/torque sensor system FTD-GAMMA (SCHUNK GmbH & Co. KG), and a Zonare z.one ultra sp Convertible Ultrasound System with an L8-3 linear probe, with purely linear and steered trapezoidal imaging (Fig. 7). The ultrasound system is connected to the main workstation through an Epiphan DVI2USB 3.0 frame-grabber (Epiphan Systems Inc., Palo Alto, California, USA), with an 800x600 resolution and a sampling frequency of 30 fps. Deep Learning models were trained on an NVIDIA Titan V 12 GB HBM2, using PyTorch 1.1.0 as the Deep Learning framework for both training and inference. ImFusion Suite Version 2.9.4 (ImFusion GmbH, Munich, Germany) is used for basic image processing and visualization.
Three different datasets were used for training and testing the Deep Learning models. The datasets were acquired on different subjects with different ultrasound, robot force and robot speed settings. The acquisition was performed in the lumbar region, from L5 to L1. The Body Mass Index (BMI) of the scanned subjects is in the range 20-30 for all three datasets. The dataset sizes and acquisition parameters are reported in Table I.
II-B Scanning Procedure
Before starting the procedure, the robotic arm is manually placed at the level of the sacrum with a transverse probe orientation. After probe placement, the robot starts moving in an upward direction towards the subject’s head, while force and ultrasound data are simultaneously collected (Fig. 1(a)). The subjects are asked to hold their breath for the whole duration of the scan (around 10 s), which is comparable to the breath-hold time of standard imaging procedures, such as abdominal MRI or PET/CT [van2012motion, pepin2014management]. Once the scan is completed, the collected data are processed to provide the location of the vertebral level at which the injection must be performed. The robot is redirected to the target vertebra, where it can perform additional maneuvers depending on the clinical application. In the reported showcase (integration of the counting system for facet plane detection), the robot performs a further 90° rotation about the z-axis and acquires an ultrasound scan of the target vertebra with the probe in paramedian-sagittal orientation (Fig. 1(b)).
II-C Force Data Extraction
In Fig. 4(a), a model of the vertebra-robot interaction is provided. In the absence of vertebrae, the robot moves on a surface (the patient’s back) that can be considered flat. The reaction force is directed along the z-axis and its modulus balances the force applied by the robot, which is constant and set prior to the acquisition (point A). In correspondence with a vertebra, the local direction of the subject’s back changes, giving rise to a non-null y-axis component of the reaction force (point B). Once the vertebral peak has been reached (point C), the inclination of the plane changes again (point D), leading to a non-null y-component of the reaction force with the opposite sign with respect to point B. When the original surface direction is recovered, the y-component of the reaction force vanishes and the initial force value is recovered. The variations in the force y-component due to reaction forces are recorded by the force sensor and result in a very characteristic pattern in the force trace (Fig. 4(b)). This pattern can be used to count the vertebral levels while the patient’s back is scanned. In Fig. 4(b), a plot of the y-component of the force signal is provided, in relation to points A, B and C.
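The sign change described above can be illustrated with a simple inclined-plane balance. This is only a sketch under assumptions that are not stated in the text: the contact is taken as frictionless, the local inclination angle $\theta$ of the back is measured about the x-axis, and the sign convention for the y-axis is chosen arbitrarily.

```latex
% Assumed: frictionless contact, reaction R normal to the local surface,
% surface inclined by \theta about the x-axis, constant applied force F_z.
\begin{align}
  \mathbf{R} &= R\,(0,\; -\sin\theta,\; \cos\theta) \\
  R\cos\theta &= F_z
    \quad\Rightarrow\quad R = \frac{F_z}{\cos\theta} \\
  F_y &= -R\sin\theta = -F_z\tan\theta
\end{align}
```

On a flat surface ($\theta = 0$, points A and C) the y-component vanishes, while the ascending and descending flanks of a vertebra (points B and D) correspond to inclinations of opposite sign and therefore to y-components of opposite sign, in line with the pattern described for Fig. 4(b).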
The recorded force in the y-direction ($F_y$) is pre-processed to remove the low-frequency drift that appears due to the robot’s initial acceleration and final deceleration. Drift is removed by subtracting from the original signal a low-pass-filtered version of it, obtained with a second-order Butterworth filter with a cutoff frequency of 0.05 Hz. The de-drifted signal is then low-pass filtered with a second-order Butterworth filter with a cutoff frequency of 0.3 Hz, normalized between 0 and 1, and re-sampled on an equally spaced spatial grid.
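The force pre-processing chain can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the filter is applied causally (the text does not say whether zero-phase filtering was used), the 30 Hz sampling rate is an assumption, and the final re-sampling on a spatial grid is omitted.

```python
import math


def butter2_lowpass(signal, cutoff_hz, fs_hz):
    """Causal second-order Butterworth low-pass (bilinear-transform biquad)."""
    k = math.tan(math.pi * cutoff_hz / fs_hz)
    q = 1.0 / math.sqrt(2.0)  # quality factor of a Butterworth section
    norm = 1.0 / (1.0 + k / q + k * k)
    b0 = k * k * norm
    b1 = 2.0 * b0
    b2 = b0
    a1 = 2.0 * (k * k - 1.0) * norm
    a2 = (1.0 - k / q + k * k) * norm
    # Initialise the filter state at the first sample to avoid a start-up transient.
    x1 = x2 = y1 = y2 = signal[0]
    out = []
    for x in signal:
        y = b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        out.append(y)
        x2, x1 = x1, x
        y2, y1 = y1, y
    return out


def preprocess_force(fy, fs_hz=30.0):
    """Remove drift, smooth, and normalise a force trace to [0, 1].

    fs_hz is an assumed sampling rate, not a value confirmed for the sensor.
    """
    drift = butter2_lowpass(fy, 0.05, fs_hz)           # slow component
    undrifted = [x - d for x, d in zip(fy, drift)]     # drift removal
    smooth = butter2_lowpass(undrifted, 0.3, fs_hz)    # noise suppression
    lo, hi = min(smooth), max(smooth)
    if hi == lo:
        raise ValueError("constant signal cannot be normalised")
    return [(v - lo) / (hi - lo) for v in smooth]


# Synthetic example: a linear drift plus a slow oscillation, 10 s at 30 Hz.
trace = [0.01 * n + math.sin(2 * math.pi * 0.2 * n / 30.0) for n in range(300)]
processed = preprocess_force(trace)
```

Subtracting the 0.05 Hz low-pass output acts as a high-pass filter, so the two steps together behave like a band-pass between roughly 0.05 Hz and 0.3 Hz.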
As mentioned above, the force applied by the robot along the z-direction ($F_z$) is constant and manually set before the acquisition takes place. The robot complies with the force control scheme described in [Zettinig2017]. The value of $F_z$ has a notable impact on the quality of the force signal recorded along the y-axis ($F_y$) and on the visibility of vertebral patterns. In particular, higher values of $F_z$ lead to more visible and better-defined vertebral spikes. However, high values of $F_z$ also result in less comfort for the subjects, especially for those with a thin muscle/fat layer. In this study, the quality of the force signal recorded along the spine direction is evaluated for three different values of $F_z$ on a group of 13 subjects with BMI ranging from 20 to 30 (Dataset 2). The selected force values are comparable to those used in clinical experimentation [Esteban2018]. Each subject was asked to report the comfort level of the procedure on a scale ranging from 1 to 4, defined as follows: 1 - very uncomfortable, 2 - uncomfortable, 3 - slightly uncomfortable, 4 - comfortable. No subject rated the procedure as “very uncomfortable” or “uncomfortable”. However, subjects with lower BMI tended to rate the procedure performed with $F_z$ = 15 N as slightly uncomfortable. For this reason, the force applied by the robot along the z-axis is set to 10 N for subjects with lower BMI and to 15 N for subjects with higher BMI. In Fig. 3 and Fig. 4, the force signals are reported for the three values of $F_z$ (i.e. 2, 10 and 15 N) for two subjects with different BMI. For both subjects, the amplitude of the spikes in the force trace increases with increasing force. However, for the subject with lower BMI, the spikes are still clearly recognizable in the signals obtained with lower pressures along the z-direction.
II-D Ultrasound Data Processing
The informative component of the force signal (along the y-axis, $F_y$, Fig. 4(b)) is a 1D signal providing spatial information about the spine anatomy along the spine direction. Ultrasound data, however, are 3D: each position along the spine corresponds to a 2D (B-mode) ultrasound frame. To compare the information from the two sensors effectively, ultrasound data are reduced to a 1D vector defined along the spine direction. The dimension reduction is achieved by analyzing each ultrasound frame in the acquired sweeps and estimating the probability that it contains a vertebra. The concatenation of the resulting values along the spine direction is a 1D signal in which high-probability peaks ideally coincide with vertebrae and therefore correspond to peaks in the force signal.
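The reduction of a sweep to a 1D signal can be sketched as below. The sketch assumes that a two-class classifier returns one pair of raw scores (logits) per frame and that the class probability is obtained with a softmax; the function names, the class ordering and the example scores are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw class scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def vertebra_probability_signal(frame_logits, vertebra_index=0):
    """Map per-frame class scores to a 1D vertebra-probability signal.

    frame_logits: one [vertebra_score, gap_score] pair per frame, ordered
    along the spine direction (class order is an assumption).
    """
    return [softmax(scores)[vertebra_index] for scores in frame_logits]


# Hypothetical scores for three consecutive frames.
example_logits = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]]
us_signal = vertebra_probability_signal(example_logits)
```

Each entry of `us_signal` lies between 0 and 1, so the result can be compared directly with the normalized force trace once both are sampled on the same spatial grid.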
The vertebra probability value is extracted from each frame using a Convolutional Neural Network trained for classification. To ensure the best classification results, three state-of-the-art classification networks were tested and compared: ResNet18 [he2015deep], DenseNet121 [huang2016densely], and VGG11 with batch normalization [Simonyan15]. The training and validation performance was evaluated for all the architectures in the following cases: a) using ImageNet [imagenet_cvpr09] weights as initialization (pre-trained network) and fine-tuning all layers; b) using ImageNet weights as initialization and fine-tuning the last layer only; c) training the network with randomly initialized weights. Each model was trained using the Adam optimizer, a cross-entropy loss function, a learning rate of 0.0005 and a learning rate decay of 0.1 every 5 epochs, for 30 epochs. The data for CNN training and testing were sampled from Dataset 1. The training dataset consisted of 15 subjects (12 for training and 3 for validation), for a total of 1986 images for each class to ensure class balance. The test set consisted of 4 subjects, for a total of 696 images for each class. A 5-fold cross-validation study was performed over the training and validation datasets to exclude false-positive results. The resulting 1D signal is smoothed using a second-order low-pass Butterworth filter with a cutoff frequency of 0.3 Hz and re-sampled on an equally spaced spatial grid.
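The learning rate schedule can be made concrete with a short sketch. It assumes that “a learning rate decay of 0.1 every 5 epochs” means the rate is multiplied by 0.1 at every fifth epoch, which is the behaviour of PyTorch's `torch.optim.lr_scheduler.StepLR` with `step_size=5` and `gamma=0.1`; whether that scheduler was actually used is not stated. The function name is hypothetical.

```python
def step_decay_lr(epoch, base_lr=5e-4, gamma=0.1, step=5):
    """Learning rate at a given epoch under a multiplicative step decay."""
    return base_lr * gamma ** (epoch // step)


# Learning rate for each of the 30 training epochs.
schedule = [step_decay_lr(epoch) for epoch in range(30)]
```

Under this reading the rate stays at 0.0005 for epochs 0 to 4, drops to 0.00005 for epochs 5 to 9, and keeps shrinking by a factor of ten in each later block of five epochs.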
II-E Force - Ultrasound Data Fusion
The extracted and pre-processed force and ultrasound 1D signals represent variations of the inner/outer spine anatomy along the spine direction. In optimal conditions, both signals present clearly visible peaks in correspondence with vertebral levels (Fig. 6(a)). However, in some cases one (or both) signals may be corrupted by noise, making it challenging to identify the real position of the vertebral levels. Noise in the signal extracted from the ultrasound data typically arises from poor visibility of the spinous process in the ultrasound sweep (Fig. 6(b)). This can be related to several factors, such as device-specific noise, non-optimal coupling between the probe and the patient’s skin, or subject-specific anatomy and tissue distribution. Noise in the force signal may arise from sudden movements of the subject during the acquisition, or from subject-specific anatomical features (e.g. vertebral peaks may be less evident in particularly muscular subjects) (Fig. 6(c)).
To make the method more robust against single-sensor misclassifications, a force-ultrasound fusion method was implemented. In particular, a 1D Spatial Convolutional Network was trained to classify vertebral levels from the input signals. The vertebral level counting problem is modelled as a classification problem, where the network is trained to classify each vertebral level in the lumbar region (Fig. 8).
A multi-stage temporal convolutional network is devised based on [farha2019ms], where the overall architecture consists of three stages and each stage is trained to classify the input data. Each stage refines the results from the previous stages, yielding smoother and more accurate classification results. Each stage consists of an initial 1x1 convolution layer which re-sizes the input into a 32 x N sequence, where N is the original signal length (number of samples along the spine direction). The initial layer is followed by nine 1xD dilated convolution layers with kernel size 3 and increasing dilation size (Fig. 9). Dilated convolution is defined as:
$$(x *_d k)(n) = \sum_{i} k(i)\, x(n - d \cdot i)$$
where $x$ is the input signal, $k$ is the filter kernel and $d$ is the dilation factor. It can be seen from the formula that, compared to standard convolution, the result at each point of the convolved signal is obtained by considering a larger spatial field in the input signal, therefore allowing the network to exploit a broader spatial context for the classification of the input. A softmax layer is added after the last convolution layer to retrieve class probabilities (Fig. 9). The cross-entropy and an additional smoothing term are used as the loss function for network training, as described in [farha2019ms]. The convolutional network for force and ultrasound fusion was trained using the Adam optimizer, a learning rate of 0.0005 and a batch size of 1 for 110 epochs. The data for network training and testing were sampled from Dataset 2. The training dataset consisted of multiple sweeps acquired over 9 subjects (7 for training and 2 for validation), for a total of 27 sequences for training and 7 for validation. The test set consists of 4 unseen subjects, acquired with the optimal robot parameters (force equal to 10 N or 15 N depending on subject BMI and robot speed equal to 20 mm/s).
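A dilated convolution computes each output sample as a weighted sum of input samples spaced by the dilation factor, i.e. y[n] = sum over taps i of k[i] * x[n - d*i]. The sketch below is a plain-Python illustration with hypothetical function names. The centred kernel taps, the zero padding and the doubling of the dilation from layer to layer are assumptions borrowed from the multi-stage temporal convolutional network design, not details confirmed for this network.

```python
def dilated_conv1d(x, kernel, dilation):
    """Zero-padded dilated convolution with an output as long as the input."""
    half = len(kernel) // 2
    out = []
    for n in range(len(x)):
        acc = 0.0
        for j, weight in enumerate(kernel):
            i = j - half                  # tap offset, e.g. -1, 0, 1 for size 3
            idx = n - dilation * i        # input sample seen by this tap
            if 0 <= idx < len(x):         # samples outside the signal count as 0
                acc += weight * x[idx]
        out.append(acc)
    return out


def receptive_field(num_layers, kernel_size=3):
    """Receptive field of stacked layers whose dilation doubles (1, 2, 4, ...)."""
    return 1 + (kernel_size - 1) * sum(2 ** layer for layer in range(num_layers))


identity = dilated_conv1d([1, 2, 3, 4, 5], [0, 1, 0], dilation=2)
shifted = dilated_conv1d([1, 2, 3, 4, 5], [1, 0, 0], dilation=2)
field = receptive_field(9)
```

With nine layers, kernel size 3 and doubling dilation, `receptive_field(9)` evaluates to 1023 samples, which illustrates how a shallow stack can cover a long stretch of the spine signal.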
III-A1 Ultrasound Data Processing
In Table II, the test accuracy is reported for each CNN architecture (ResNet18, DenseNet121, VGG11) for the three training cases (fine-tuning all layers of a network pre-trained on ImageNet; fine-tuning only the last layer of the pre-trained network; training the entire network from randomly initialized weights). The best accuracy on the test set is obtained by fine-tuning all the layers of ResNet18 from the pre-trained model, providing an average accuracy of . The best-performing ResNet18 model was tested on a test set of 4 subjects, yielding an overall accuracy of 0.938. The confusion matrix computed on the test data is displayed in Table III. The values are normalized by the total number of images (n = 1392), and the corresponding number of frames is shown in parentheses.
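The normalization used for the confusion matrix can be sketched as follows. The function names and the four example labels are hypothetical and are not taken from the study data; only the normalization by the total number of frames follows the text.

```python
def confusion_counts(y_true, y_pred, labels):
    """Count matrix with true classes as rows and predicted classes as columns."""
    index = {label: i for i, label in enumerate(labels)}
    counts = [[0] * len(labels) for _ in labels]
    for truth, pred in zip(y_true, y_pred):
        counts[index[truth]][index[pred]] += 1
    return counts


def normalize_by_total(counts):
    """Divide every cell by the total number of classified frames."""
    total = sum(sum(row) for row in counts)
    return [[cell / total for cell in row] for row in counts]


def overall_accuracy(counts):
    """Fraction of frames on the diagonal of the count matrix."""
    total = sum(sum(row) for row in counts)
    return sum(counts[i][i] for i in range(len(counts))) / total


# Made-up labels for four frames, for illustration only.
labels = ["vertebra", "gap"]
truth = ["vertebra", "vertebra", "gap", "gap"]
predicted = ["vertebra", "gap", "gap", "gap"]
counts = confusion_counts(truth, predicted, labels)
normalized = normalize_by_total(counts)
acc = overall_accuracy(counts)
```

With this normalization all cells sum to 1, and the overall accuracy equals the sum of the diagonal cells of the normalized matrix.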
|Training with randomly initialized weights|
|Pre-trained weights & all layers fine-tuned|
|Pre-trained weights & last layer only fine-tuned|
|n = 1392||Vertebra||Intervertebral Gap|
III-A2 Force-Ultrasound Data Fusion
The performance of the force-ultrasound data fusion method was evaluated in terms of its capability to correctly label each vertebral level. The test group consists of 4 (unseen) subjects, for a total of 20 vertebral levels. To prove the robustness of the method and the effectiveness of combining different sensor data for vertebral level counting, the force-ultrasound fusion method was compared against pure ultrasound-based and pure force-based approaches. The pure force-based and pure image-based methods were obtained using the same network architecture described in Sec. II-E, with the force and image signals alone as input data. In order to simulate a realistic environment in the test phase, the offline data were streamed to the main workstation at the proper streaming frequencies (30 fps for both the ultrasound and the force sensor).
In Table IV, the results of the vertebral classification are reported for the three methods, together with the distance from the ground truth vertebral level. A vertebral level classification is considered successful if an overlap higher than 0.5 exists between labels and predictions. It can be seen that the fusion method outperforms both the pure force-based and the pure image-based methods in the classification task.
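The success criterion can be illustrated with a short sketch. The text does not define how the overlap is computed, so the sketch assumes it is the fraction of ground-truth samples of a level that receive the same label in the prediction; the function names, the label strings and the example sequences are hypothetical.

```python
def level_overlap(labels, predictions, level):
    """Fraction of ground-truth samples of `level` predicted as `level`."""
    positions = [i for i, label in enumerate(labels) if label == level]
    if not positions:
        raise ValueError("level not present in the ground truth")
    hits = sum(1 for i in positions if predictions[i] == level)
    return hits / len(positions)


def correctly_classified_levels(labels, predictions, levels, threshold=0.5):
    """Levels whose overlap with the prediction exceeds the threshold."""
    return [
        level
        for level in levels
        if level_overlap(labels, predictions, level) > threshold
    ]


# Made-up per-sample labels along the spine direction ("bg" = background).
gt = ["bg", "L5", "L5", "L5", "L5", "bg", "L4", "L4", "L4", "L4"]
pred = ["bg", "L5", "L5", "L5", "bg", "bg", "L4", "bg", "bg", "bg"]
detected = correctly_classified_levels(gt, pred, ["L5", "L4"])
```

In this example L5 has an overlap of 0.75 and counts as correctly classified, while L4 has an overlap of 0.25 and does not.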
In Fig. 10 the results for the three methods are shown for the optimal case where both the force and ultrasound signals are not corrupted by noise. It can be seen that in this case, using force or ultrasound data alone is sufficient for obtaining a precise classification and counting of the vertebral levels.
|Metric||Pure Force||Pure Ultrasound||Fusion|
|Correctly Classified Levels||18/20||16/20||20/20|
|Distance from Ground Truth Label [mm]||5.97 ± 5.91||3.22 ± 3.07||3.47 ± 2.78|
In Fig. 11 the results for the three methods are shown in the presence of noisy ultrasound data. It can be seen that the pure force-based method is able to correctly classify the vertebral levels. However, the pure ultrasound-based method fails to classify the last 3 vertebral levels. The fusion between the two methods is able to compensate for the ultrasound signal misclassification and to correctly classify all the lumbar vertebral levels.
In Fig. 12, the results for the three methods are shown in the presence of noisy force data. It can be seen that the pure force-based method fails to classify the last two vertebral levels, while the pure ultrasound-based method classifies them correctly. In this case too, the fusion method correctly classifies the input data despite the corruption of the force signal.
III-A3 Potential Application
The performance of the presented method was tested for an example application, i.e. automatic target plane selection for facet injection procedures. The facet injection procedure is performed to deliver anaesthetics at the level of the facet joints, i.e. the anatomical structures connecting consecutive vertebrae (Fig. 1(b)). Using the proposed vertebral level classification method, the correct vertebral level can be selected, and a sweep can be taken at that level with the probe in a paramedian-sagittal orientation to identify the target injection plane.
The method for facet plane identification is similar to the one presented in [Pesteie2015]. Each frame in the sweep is classified as either a “facet” or a “non-facet” plane, and the two frames with the highest probability in the sweep are labelled as the right and left facet planes. The plane classification task is performed using ResNet18, given its high performance in the ultrasound classification task (Sec. III-A1).
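The frame selection rule can be sketched as follows. This is a literal reading of “the two frames with the highest probability”: the function name and the example probabilities are hypothetical, and which of the two returned frames is the left or the right facet depends on the sweep direction, which is not specified here.

```python
def select_facet_frames(facet_probs):
    """Indices of the two most probable facet frames, ordered along the sweep."""
    if len(facet_probs) < 2:
        raise ValueError("at least two frames are required")
    ranked = sorted(
        range(len(facet_probs)), key=lambda i: facet_probs[i], reverse=True
    )
    first, second = sorted(ranked[:2])
    return first, second


# Made-up per-frame facet probabilities for a short sweep.
probs = [0.1, 0.9, 0.2, 0.3, 0.8, 0.1]
facet_frames = select_facet_frames(probs)
```

A practical implementation would probably also keep the two selected frames a minimum distance apart, so that two neighbouring frames of the same facet are not both picked; that refinement is not described in the text and is left out of the sketch.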
The model was pre-trained on ImageNet and fine-tuned on a training set sampled from Dataset 3. The spatial errors between the identified and the labelled facet joint planes were calculated on 4 test subjects sampled from Dataset 3, corresponding to 20 vertebra sweeps (5 vertebrae for each subject), each containing two facet joints, for 40 facet joints in total. For 37 facet joints out of 40, the mean distance error between the detected and manually labelled facet planes is mm. According to [Greher2004], an error below 5 mm leads to an effective anaesthetic result for facet joint injections. For the remaining 3 facet joints out of 40, the error is mm, since the CNN output was less precise due to the poor image quality.
Currently, routine clinical spine injection procedures rely entirely on the expertise of the surgeon, both to ensure the accuracy of the procedure and to limit the exposure time to ionizing radiation. In this study, a robotic-ultrasound method for vertebral level detection and counting was developed for spine injection procedures. To the best of our knowledge, it is the first robotic system integrating visual and force feedback for vertebral level classification.
The method was tested on a group of healthy volunteers, chosen to maximize the inter-subject variability in terms of gender and BMI. However, a more thorough analysis should be conducted on a larger dataset, to better understand the correlation between the performance of the method and the anatomical characteristics of the subjects. Future work should also focus on the online validation of the method in a real clinical environment and on further automation of each step of the injection procedure. Furthermore, possible applications in other clinical scenarios, such as scoliosis assessment, should be considered, since knowing the position of each vertebral level is beneficial for curvature reconstruction [Victorova2019].
The proposed method effectively fuses ultrasound and force data acquired during a robot-actuated scan of the patient’s back for vertebral level classification. It was shown that the proposed fusion method yields higher performance than the pure image-based and pure force-based methods. From the results, it can be noticed that, when it is able to classify a vertebral level, the pure image-based method can precisely detect its location. However, in cases where the input ultrasound image is particularly noisy, it fails to detect vertebrae altogether. Likewise, with the pure force-based method, corruption in the input force data may lead to misclassification of the vertebral levels. By combining image and force data, the fusion method is able to correctly classify vertebral levels even in the presence of force or ultrasound data corruption, with a precision comparable to that obtained with the pure image-based method. In particular, the fusion method correctly classifies 100% of the vertebral levels in the test set with a precision of 3.47 mm, while the pure image-based and pure force-based methods could only classify 16 and 18 out of 20 vertebrae with a precision of 3.22 mm and 5.97 mm, respectively.
The potential of the proposed method was explored in the integration with a common clinical procedure, opening the path for future exploration toward fully automatic injection procedures.