Accurate assessment of neointimal coverage after stent implantation in intravascular optical coherence tomography (IVOCT) images is important to ensure the safety and efficacy of the percutaneous coronary intervention (PCI) procedure. Unfortunately, manual assessment requires the detection and analysis of thousands of struts within the stent, which is a challenging, tedious and time-consuming task. As shown in Fig. 1, the stent struts are small, and the visual characteristics of the thick intima (the innermost layer of the artery) covering them may make the struts inconspicuous.
Motivated by these challenges, a number of automated detection methods have been proposed. Existing methods typically use handcrafted features to encode candidate strut points and then apply supervised classification to identify the struts. Commonly used handcrafted features and supervised approaches include shadow features [11, 12], [5] and wavelet-based detection. In addition, some studies used lumen segmentation [7, 14] and stent shape models to constrain the search space for potential strut candidates. However, all these methods rely on effective pre-processing steps, such as denoising, illumination correction and lumen boundary detection, to produce accurate results, which restricts their generalizability.
An alternative is to derive features using convolutional neural networks (CNNs), which have achieved great success in medical image analysis. Results from CNN architectures including U-Net and FCN demonstrate accurate detection and segmentation of large objects. However, these networks downsample the image to enlarge the receptive field and to encode global information, which can discard the fine detail that small objects such as struts depend on, and their application to strut detection has not been validated.
We propose an automated method for stent detection in IVOCT images that overcomes the limitations mentioned above. We leverage CNNs for their ability to combine low-level appearance information with high-level semantic information in a hierarchical manner. Our method couples a local network, which densely detects potential strut-like points in image patches, with a global network, which uses image appearance information to iteratively refine the detections and remove false predictions that are unlikely to be struts. We have named our method the deep local-global refinement network (LGRN). We contribute the following to the state-of-the-art:
- To the best of our knowledge, this is the first deep learning method for strut point detection. Our method removes the reliance on pre-processing steps such as denoising and illumination correction.
- Our coupling of a local network that has high recall with a global network that provides image-level refinement enables false positive reduction while maintaining high sensitivity and efficiency in strut point detection.
- Our global network uses an appearance constrained attention module for false positive reduction, which preserves the detected struts that fit the overall appearance of the image.
Our dataset consists of 57 patients with stents implanted for more than 1 year. Each patient has an average of about 127 IVOCT images containing a stent. All the IVOCT images were acquired using a C7-XR OCT system (St. Jude Medical, St. Paul, MN, USA) with a 2.7F Dragonfly imaging catheter. Each IVOCT image has a resolution of pixels. A cardiologist manually annotated the struts and lumen on all the IVOCT images. We follow the same protocol as in  to design a morphological filter that enlarges the annotated struts; the enlarged annotations were then used as the ground truth for evaluation.
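The enlargement of annotated struts described above can be illustrated with a binary dilation. This is a minimal sketch only: the square structuring element, the `radius` parameter and the function name `dilate_mask` are assumptions made for illustration, since the text does not specify the filter that was actually used.

```python
import numpy as np


def dilate_mask(mask: np.ndarray, radius: int = 1) -> np.ndarray:
    """Binary dilation with a (2*radius+1) x (2*radius+1) square element.

    The square element and the radius are assumptions for illustration;
    the paper does not state the structuring element it used.
    """
    mask = mask.astype(bool)
    h, w = mask.shape
    padded = np.pad(mask, radius, mode="constant", constant_values=False)
    out = np.zeros_like(mask)
    # OR together every shifted copy of the mask covered by the element.
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out


# Toy annotation: a single strut pixel at row 3, column 3.
annotation = np.zeros((7, 7), dtype=bool)
annotation[3, 3] = True
ground_truth = dilate_mask(annotation, radius=1)
print(int(ground_truth.sum()))  # 9: the pixel grew into a 3x3 block
```

Enlarging point annotations in this way gives a detector a small target region rather than a single pixel, which is one plausible reason for applying such a filter before evaluation.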
2.2 Deep Local-Global Refinement Network
Fig. 2 shows an overview of the proposed local-global refinement network. First, the Local-Network is applied to the input IVOCT image to detect all potential struts from small input patches. The detected struts, together with the original IVOCT image, are then used as the input to the Global-Network for refinement, where an appearance constrained attention module guides the overall spatial distribution of the struts and removes falsely detected struts.
A patch-based deep CNN is used as the Local-Network for detecting all the strut points. It consists of 9 zero-padded convolutional layers with a kernel size of 3 and a stride of 1. Residual blocks connect adjacent layers. At the end of the Local-Network we use a linear convolution layer and a Gaussian convolution kernel, where the convolution is used to compensate for batch normalization, and the Gaussian convolution kernel is used to smooth the output, e.g., suppressing the patch artifacts of the Local-Network. The patch size is set to for the Local-Network. The network is trained with Loss.
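The smoothing role of the Gaussian convolution kernel can be sketched as follows. This is an illustrative example under stated assumptions, not the network's actual layer: the value of `sigma`, the reflect padding and the function names are hypothetical, and the toy input simply mimics two adjacent patches whose outputs disagree at their shared border.

```python
import numpy as np


def gaussian_kernel1d(sigma: float, radius: int) -> np.ndarray:
    """Normalized 1D Gaussian kernel of length 2*radius + 1."""
    x = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()


def gaussian_smooth(image: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    """Separable Gaussian smoothing that preserves the image shape.

    sigma and the reflect padding are assumptions for illustration.
    """
    radius = int(3 * sigma + 0.5)
    k = gaussian_kernel1d(sigma, radius)
    padded = np.pad(image.astype(float), radius, mode="reflect")
    # Filter along rows first, then along columns.
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)


# Toy stitched output: two adjacent patches with a visible seam between them.
stitched = np.hstack([np.full((8, 4), 0.2), np.full((8, 4), 0.8)])
smoothed = gaussian_smooth(stitched, sigma=1.0)

jump_before = np.abs(np.diff(stitched, axis=1)).max()
jump_after = np.abs(np.diff(smoothed, axis=1)).max()
print(jump_before, jump_after)  # the seam is less abrupt after smoothing
```

Because the kernel is normalized, constant regions are left unchanged and only the abrupt transition at the patch border is softened.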
The purpose of the Global-Network is to extract high-level semantic information that can be used to guide the refinement of all the detected struts. The Global-Network uses a modified U-Net architecture. There are 4 downsampling layers, each with a max-pooling operator, and 4 upsampling layers. At each down/up-sampling layer, we repeat the following parallel architecture module: one sub-branch with convolutional kernel, and another sub-branch with dilated convolution and 2 dilations. The outputs of these two sub-branches are added at the end of this module. This combination of regular convolution and dilated convolution has a larger receptive field, so it can learn visual characteristics that assist in inferring struts with fewer visual features. Therefore, global context constraints can ensure the accuracy of the prediction results, while local context learning can improve the sensitivity of the model to detect the object. However, the uneven distribution of the background and foreground makes it difficult for the global network to converge during training. To overcome this, we added an appearance constrained attention module to guide its convergence, where we used a separate 5-layer CNN to differentiate whether the predicted detection results have a similar appearance to the ground truth. To facilitate the learning process, we used two loss functions (and ) as:
where is the ground truth annotation and P indicates the predicted results. It is used to balance the uneven distribution of foreground and background. We also add an attention loss to constrain the overall appearance of the predicted struts, which can be defined as:
where indicates the attention module that discriminates whether has a similar appearance to the ground truth.
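The parallel module of the Global-Network, in which a regular convolution branch and a dilated convolution branch are summed, can be sketched on a single channel as follows. The 3x3 kernel size, the dilation rate of 2 and all function names are assumptions for illustration (the kernel sizes are not recoverable from the text), and learned weights, normalization and nonlinearities are omitted.

```python
import numpy as np


def conv2d_same(image: np.ndarray, kernel: np.ndarray, dilation: int = 1) -> np.ndarray:
    """Zero-padded 'same' 2D cross-correlation with an optional dilation rate."""
    k = kernel.shape[0]  # square kernel with odd side length
    pad = dilation * (k // 2)
    h, w = image.shape
    padded = np.pad(image.astype(float), pad)
    out = np.zeros((h, w))
    for i in range(k):
        for j in range(k):
            # Each kernel tap reads a copy of the image shifted by i*dilation, j*dilation.
            out += kernel[i, j] * padded[i * dilation:i * dilation + h,
                                         j * dilation:j * dilation + w]
    return out


def effective_kernel_size(k: int, dilation: int) -> int:
    """Side length of the area covered by a k x k kernel at a given dilation."""
    return k + (k - 1) * (dilation - 1)


def parallel_module(image, regular_kernel, dilated_kernel, dilation=2):
    """Sum of a regular branch and a dilated branch (assumed 3x3 kernels)."""
    return conv2d_same(image, regular_kernel) + conv2d_same(image, dilated_kernel, dilation)


# An impulse shows which pixels each branch can reach.
impulse = np.zeros((9, 9))
impulse[4, 4] = 1.0
ones = np.ones((3, 3))
response = parallel_module(impulse, ones, ones, dilation=2)
rows_touched = int(response.any(axis=1).sum())
print(effective_kernel_size(3, 2), rows_touched)  # 5 5
```

Under these assumptions the dilated branch covers a 5x5 area with the same nine weights as the regular 3x3 branch, which is how the combined module sees a wider context without extra parameters.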
2.3 Implementation Details
We pre-processed the dataset with maximum normalization and cropped all images to . Both the Local-Network and the Global-Network were trained for 80 epochs with the Adam optimizer at an initial learning rate of 0.001 and a batch size of 1. Training took an average of 15 hours on an NVIDIA 1080Ti GPU with 11 GB of memory.
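The pre-processing step can be sketched as below. Two points are assumptions rather than statements from the text: "maximum normalization" is interpreted here as dividing each image by its maximum intensity, and the crop is taken from the image centre. The crop size is left as a hypothetical `size` parameter because the value used in the paper is not recoverable.

```python
import numpy as np


def max_normalize(image: np.ndarray) -> np.ndarray:
    """Scale intensities by the image maximum (assumed meaning of
    'maximum normalization'); an all-zero image is returned unchanged."""
    image = image.astype(np.float32)
    peak = image.max()
    return image / peak if peak > 0 else image


def center_crop(image: np.ndarray, size: int) -> np.ndarray:
    """Crop a size x size window from the image centre (assumed crop location)."""
    h, w = image.shape[:2]
    if size > h or size > w:
        raise ValueError("crop size exceeds image dimensions")
    top = (h - size) // 2
    left = (w - size) // 2
    return image[top:top + size, left:left + size]


# Toy 6x6 frame standing in for an IVOCT image; size=4 is a placeholder value.
frame = np.arange(36).reshape(6, 6)
normalized = max_normalize(frame)
prepared = center_crop(normalized, size=4)
print(prepared.shape)  # (4, 4)
```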
3 Results and Discussions
3.1 Experimental Setup
We randomly divided the dataset into a training set (30 patients, 3873 images) and a test set (27 patients, 3352 images) for evaluation. We performed the following experiments on the dataset: (a) comparison of the performance of our method with the state-of-the-art methods; and (b) analysis of the performance of each component in our method. The state-of-the-art methods include: (i) Wang, Ancong et al. - Bayesian network based detection method; (ii) Faster-RCNN - Region Proposal Network (RPN) that is trained end-to-end to generate high-quality region proposals for detection; (iii) Lu, Hong et al. - Bagged decision trees classifier for classifying candidate struts using structure features; and (iv) Nam, Hyeong Soo et al. - Neural network classifier to classify the features from gradient images. For , due to the unavailability of the algorithm source code, we refer to the published result as a reference, acknowledging that the dataset is different. We use recall and precision for the evaluation (we followed  and counted a detection as a true positive if it is within 5 pixels of the ground truth).
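The evaluation criterion above, where a detection counts as a true positive if it lies within 5 pixels of the ground truth, can be sketched as follows. The greedy one-to-one matching and the inclusive distance threshold are assumptions for illustration; the text does not say how multiple detections near the same strut are handled.

```python
import math


def match_detections(predicted, ground_truth, tolerance=5.0):
    """Greedy one-to-one matching of predicted points to ground-truth points.

    Returns (true_positives, false_positives, false_negatives).
    The one-to-one rule and the inclusive threshold are assumptions.
    """
    unmatched = list(ground_truth)
    tp = 0
    for p in predicted:
        best_idx, best_dist = None, tolerance
        for idx, g in enumerate(unmatched):
            dist = math.hypot(p[0] - g[0], p[1] - g[1])
            if dist <= best_dist:
                best_idx, best_dist = idx, dist
        if best_idx is not None:
            unmatched.pop(best_idx)  # each ground-truth strut is matched once
            tp += 1
    fp = len(predicted) - tp
    fn = len(unmatched)
    return tp, fp, fn


def precision_recall(tp, fp, fn):
    """Precision and recall, defined as 0.0 when the denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy example with three annotated struts and four detections.
gt_points = [(10, 10), (50, 50), (90, 20)]
pred_points = [(12, 11), (48, 53), (200, 200), (13, 9)]
tp, fp, fn = match_detections(pred_points, gt_points)
precision, recall = precision_recall(tp, fp, fn)
print(tp, fp, fn)  # 2 2 1
```

In this toy case the fourth detection is close to an already matched strut, so it counts as a false positive under the one-to-one assumption; a many-to-one rule would instead raise precision.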
3.2 State-of-the-art methods comparison
Table 1 shows the detection results of our method compared to the state-of-the-art methods. Our method increases recall by 1.2% and precision by 4.7% relative to the second-best results, which come from Faster-RCNN, as shown in Fig. 3(a).
| Hyeong et al. | 0.894 | 0.818 |
| Hong et al. | 0.804 | 0.826 |
| Ancong et al. | 0.910 | 0.840 |
3.3 Component Analysis
Table 1 and Fig. 3(b) show the detection results of our method at individual stages. Fig. 5(a) shows two example images with varying coverage thickness. The Local-Network achieved higher recall, while the Global-Network achieved better precision (as shown in Fig. 5(c) and Fig. 5(d)). As exemplified in Fig. 5(e), the proposed method integrates both the Local-Network and the Global-Network and achieved more consistent performance in recall and precision.
Table 1 and Fig. 3(a) show that our method achieved the best overall performance when compared to the existing methods for strut detection. The traditional methods (Hyeong et al., Hong et al. and Ancong et al.), which use hand-crafted features with conventional classifiers, achieved competitive performance when compared with the Faster-RCNN method. Fig. 4(d) and Fig. 4(e) show two example results where both the Hyeong et al. and Hong et al. methods fail to detect strut points that have low contrast to the background. In contrast, Faster-RCNN can combine deep semantic information and shallow appearance information in a hierarchical manner, which enables it to encode image-wide location information and semantic characteristics. However, Faster-RCNN lacks a constraint on the overall appearance of all the struts. Consequently, Faster-RCNN generates poor detection results for small struts (as shown in Fig. 4(c)).
Table 1, Fig. 3(b) and Fig. 5 compare the main components of our method individually to quantify their contributions to the final detection results. These results demonstrate that the Local-Network has higher recall, which we attribute to the use of a patch-based network to detect all the potential strut candidates. In contrast, the Global-Network achieved higher precision because it incorporates global context, e.g., appearance information, into the learning process, which ensures that all the detected struts are consistent with the shape of the stent. Table 1, Fig. 3(b) and Fig. 5 also show the advantage of our combination, which integrates the complementary detection results produced by the individual components.
We propose a deep learning-based method for stent strut detection in IVOCT images. We achieved state-of-the-art strut detection performance via a local-global refinement network, which first detects potential struts and then refines them according to global appearance constraints to reduce false positives. Our experimental results demonstrate that our method achieved higher recall and precision than the existing state-of-the-art methods on a large clinical dataset.
Acknowledgement This work was supported in part by Australian Research Council (ARC) grants (LP140100686 and IC170100022), the University of Sydney – Shanghai Jiao Tong University Joint Research Alliance (USYD-SJTU JRA) grants and STCSM grant (17411953300).
-  Ciompi, F., et al.: Computer-aided detection of intracoronary stent in intravascular ultrasound sequences. MED PHYS 43(10), 5616–5625 (2016)
-  Jonathan, L., et al.: Fully convolutional networks for semantic segmentation. In: CVPR. pp. 3431–3440 (2015)
-  Kostas, M., et al.: Automatic quantitative analysis of in-stent restenosis using fd-oct in vivo intra-arterial imaging. MED PHYS 40(6Part1) (2013)
-  Litjens, G., et al.: A survey on deep learning in medical image analysis. MED IMAGE ANAL 42, 60–88 (2017)
-  Lu, H., et al.: Automatic stent detection in intravascular oct images using bagged decision trees. BIOMED OPT EXPRESS 3(11), 2809–2824 (2012)
-  Merget, D., et al.: Robust facial landmark detection via a fully-convolutional local-global context network. In: CVPR. pp. 781–790 (2018)
-  Nam, H.S., et al.: Automated detection of vessel lumen and stent struts in intravascular optical coherence tomography to evaluate stent apposition and neointimal coverage. MED PHYS 43(4), 1662–1675 (2016)
-  Otsuka, F., et al.: Neoatherosclerosis: overview of histopathologic findings and implications for intravascular imaging assessment. EUR HEART J 36(32), 2147–2159 (2015)
-  Ren, S., et al.: Faster r-cnn: Towards real-time object detection with region proposal networks. In: NIPS. pp. 91–99 (2015)
-  Ronneberger, O., et al.: U-net: Convolutional networks for biomedical image segmentation. In: MICCAI. pp. 234–241. Springer (2015)
-  Ughi, G.J., et al.: Automatic segmentation of in-vivo intra-coronary optical coherence tomography images to assess stent strut apposition and coverage. Int J Card Imaging 28(2), 229–241 (2012)
-  Wang, A., et al.: Automatic stent strut detection in intravascular optical coherence tomographic pullback runs. Int J Card Imaging 29(1), 29–38 (2013)
-  Wang, A., et al.: 3-d stent detection in intravascular oct using a bayesian network and graph search. IEEE T MED IMAGING 34(7), 1549–1561 (2015)
-  Yong, Y.L., et al.: Linear-regression convolutional neural network for fully automated coronary lumen segmentation in intravascular optical coherence tomography. J BIOMED OPT 22(12), 126005 (2017)