Reflectance confocal microscopy (RCM) is a non-invasive optical imaging technology that enables users to examine mosaics of small field-of-view (FOV) images of thin layers ("optical sections") of skin at high lateral resolution. Imaging can reach deep enough to cover the epidermis and papillary dermis, which is sufficient for diagnosing important skin conditions. Recent studies have demonstrated that RCM imaging is highly sensitive and specific for detecting skin cancers by expert visual inspection Rajadhyaksha et al. (2016). Moreover, the combination of RCM and dermoscopy has been shown to reduce the rate of biopsy of benign lesions per detected malignancy, leading to better patient care Borsari et al. (2016); Pellacani et al. (2016).
To identify the correct depths for gathering mosaics, clinicians take a sequence of RCM images at fixed separations in depth, from the epidermis down to the dermis. This set of images is referred to as an RCM stack, and each image in the stack is called an RCM slice. After acquiring the stack, the clinician classifies each image as belonging to the epidermis, the dermal-epidermal junction (DEJ), or the dermis layer. Figure 1 shows typical RCM images from each layer. The clinician then uses the stack as a reference to choose depths at which to collect larger-FOV, high-resolution mosaics for subsequent diagnostic analysis.
Our work is concerned with the first part of this process. Given a stack of RCM images, we try to distinguish the boundaries between the epidermis, dermal-epidermal junction, and dermis. RCM stack data has a strong sequential structure, which recurrent neural networks can naturally exploit. Human skin maintains a strict ordering of its strata: transitions occur only between contiguous layers (e.g., direct epidermis→dermis transitions that skip the DEJ do not happen) and are unidirectional (dermis→DEJ or DEJ→epidermis transitions are not possible). These constraints help experienced clinicians quickly classify the stacks. This observation was also made in Bozkurt et al. (2017), where the authors used a recurrent convolutional network (RCN) to emulate the clinicians' behavior. They reported high accuracy and consistent classification using a model that looks at all of the images in a stack. They also hinted at the possibility of performing equally well while considering only a small part of the stack at a time. They tested this possibility by performing the same task while looking at only three slices at a time, and reported only a minimal loss in performance.
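The anatomical constraints above can be made concrete with a small sketch. Assuming layers are encoded by depth order (0 = epidermis, 1 = DEJ, 2 = dermis), a transition in a predicted label sequence is valid only if the label stays the same or advances by exactly one; the function name and encoding here are illustrative, not from the original work.

```python
# Hypothetical sketch: count anatomically impossible transitions in a
# predicted label sequence, assuming labels encoded by depth order:
# 0 = epidermis, 1 = DEJ, 2 = dermis. A transition is valid only if the
# label stays the same or increases by exactly one (monotonic, no skips).
def count_impossible_transitions(labels):
    errors = 0
    for prev, curr in zip(labels, labels[1:]):
        if curr < prev or curr - prev > 1:
            errors += 1
    return errors

# Example: one skip (0 -> 2) and one reversal (2 -> 1)
print(count_impossible_transitions([0, 0, 2, 2, 1]))  # -> 2
```

A count of zero over every test stack corresponds to the "perfectly consistent" behavior discussed later in the paper.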
In this work, we apply two attention-based models to extend Bozkurt et al. (2017). Our first model is similar to their full-sequence RCN model, in the sense that both approaches use information from all images in a stack. Even though their model has very powerful representational ability (and was shown to perform very well on this dataset), it is not interpretable: we do not know how much each image contributes to a decision. To improve interpretability, we utilize a soft (global) attention mechanism Bahdanau et al. (2014), which also uses information from all images in a stack, but applies a soft mask over the representations of the images in the network. Visualizing the mask gives us clues about which images the network pays attention to when making a decision.
We also address redundancy in Bozkurt et al. (2017). With their partial-sequence RCN model, the authors showed that it is possible to perform comparably to the full-sequence RCN model by looking at just 3 images per decision. However, this choice of the number of images is not well justified. Finding the optimal location and number of components of an input to attend to is a well-known problem in computer vision (Mnih et al., 2014), and solutions like hard attention require sampling-based approximation algorithms such as REINFORCE Williams (1992), due to the discrete nature of the problem. To find a compromise between hard and soft attention, we restrict the support of the attention vector and make it sequence-independent, which results in an attention map with Toeplitz structure. In the following sections, we explain the attention mechanisms used in this work in more detail.
2 Global Attention
Global attention Bahdanau et al. (2014)
has been proposed as a way to align source and target segments in neural machine translation in a differentiable manner. Since then, it has been used in many computer vision and NLP tasks. In this mechanism, an attention vector with the same length as the sequence is computed by a neural network from the encoding of the whole sequence; a context vector is then computed as a weighted sum of the encodings according to the attention vector.
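The two steps above (score each encoding, then take the softmax-weighted sum) can be sketched with numpy; the random projection standing in for the trained scoring network, and the shapes chosen, are assumptions for illustration only.

```python
import numpy as np

# Minimal sketch of global (soft) attention over a stack of slice
# encodings, assuming encodings h of shape (T, d) and a scalar score per
# slice. A random projection stands in for the trained scoring network.
rng = np.random.default_rng(0)
T, d = 10, 16                     # slices per stack, encoding size
h = rng.normal(size=(T, d))      # encoder outputs, one per slice
w = rng.normal(size=d)           # stand-in scoring parameters

scores = h @ w                    # one scalar score per slice
a = np.exp(scores - scores.max())
a /= a.sum()                      # softmax: attention weights, sum to 1
context = a @ h                   # context vector: weighted sum of encodings

assert np.isclose(a.sum(), 1.0) and context.shape == (d,)
```

Because every slice receives a nonzero weight, the full stack contributes to each decision, which is what makes this variant "global".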
3 Toeplitz Attention
The name Toeplitz attention comes from the fact that the attention map created by this method has Toeplitz structure. The method is a compromise between global (soft) attention and hard attention: the support of the attention weights is more compact than in global attention, but the mechanism remains fully differentiable.
This mechanism can be seen as a special case of the local attention with monotonic alignment introduced in Luong et al. (2015), where the context vector is calculated as a weighted average over the encodings within the window [p_t − D, p_t + D] (D is chosen, in both Luong et al. (2015) and our work, empirically). The aligned position p_t can be calculated with an MLP (predictive alignment) or set as p_t = t (monotonic alignment). Monotonic alignment is suitable when source and target sequences are aligned, as in our case. The attention vector a_t now has a shorter support of 2D + 1, compared to the full input sequence length in the global attention case, and is calculated in a similar fashion to global attention in Luong et al. (2015). In our case a_t is time (here, depth) independent, i.e. a_t = k, where k is a learnable kernel with all non-negative entries that sum to one (a convex combiner). The attention map A, which is a concatenation of a_t for each slice, therefore has a Toeplitz structure. This structure lends itself to an efficient implementation using convolution.
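The construction of the attention map from a depth-independent kernel can be sketched as follows; the function name is illustrative, and edge handling (no renormalization at the stack boundaries) is a simplifying assumption of this sketch.

```python
import numpy as np

# Sketch of the Toeplitz attention map, assuming a depth-independent
# kernel k of support 2*D + 1 with non-negative entries summing to one.
# Row t of the map A holds k centered at slice t, so A has Toeplitz
# structure and A @ h computes all context vectors at once (equivalent
# to a 1-D convolution of the encodings with k).
def toeplitz_attention_map(k, T):
    D = (len(k) - 1) // 2
    A = np.zeros((T, T))
    for t in range(T):
        for j, w in enumerate(k):
            s = t + j - D                 # source slice index
            if 0 <= s < T:
                A[t, s] = w
    return A

k = np.array([0.2, 0.6, 0.2])             # D = 1: attend to 3 slices
A = toeplitz_attention_map(k, T=6)
assert np.allclose(np.diag(A), 0.6)       # same kernel at every depth
# D = 0 recovers the identity map (the full-sequence RCN special case)
assert np.allclose(toeplitz_attention_map(np.array([1.0]), 4), np.eye(4))
```

Since the same kernel is applied at every depth, the map has only 2D + 1 learnable weights regardless of stack length.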
(Left) Global attention model. (Right) Toeplitz attention model. Note that this figure is intended to explain only the attention layers; the encoder and decoder structures are not presented in detail due to space constraints.
4 Model Definition
Our models follow the encoder-decoder structure common in sequence-to-sequence learning Sutskever et al. (2014); we replicate this structure in our work as well. We use bidirectional gated recurrent units (GRU) Cho et al. (2014a) appended to Inception v3 networks Szegedy et al. (2016) to create a recurrent encoder network. This network produces an encoding for every image in a stack. The full-sequence RCN model used in Bozkurt et al. (2017) can be formed by attaching a fully connected layer to the end of this encoder. Through this lens, it can be seen as a special case of attention-augmented networks. Indeed, we can recover the full-sequence RCN as the special case of Toeplitz attention where D = 0, so the attention map becomes an identity matrix.
We use a different decoder network for each attention mechanism. For global attention, we use a GRU followed by a fully connected layer. For Toeplitz attention, we use just a fully connected layer. In both cases, we augment the attended input to the decoder with the decoder's output at the previous timestep (again, time corresponds to slice depth here), to efficiently exploit the sequential nature of the data.
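The input-feeding loop can be sketched as below; the linear "cell" and all shapes are stand-ins for the trained fully connected (or GRU) decoder, not the paper's actual parameters.

```python
import numpy as np

# Sketch of input-feeding in the decoder: at each depth t, the attended
# context is concatenated with the decoder's output from depth t-1
# before being classified. A linear layer stands in for the trained
# fully connected (or GRU) decoder.
rng = np.random.default_rng(1)
T, d, n_classes = 8, 16, 3               # slices, context size, skin layers
contexts = rng.normal(size=(T, d))       # attended inputs, one per slice
W = rng.normal(size=(d + n_classes, n_classes)) * 0.1

prev = np.zeros(n_classes)               # no output before the first slice
outputs = []
for t in range(T):
    x = np.concatenate([contexts[t], prev])       # input-feeding step
    logits = x @ W
    prev = np.exp(logits) / np.exp(logits).sum()  # softmax over 3 layers
    outputs.append(prev)
outputs = np.stack(outputs)
assert outputs.shape == (T, n_classes)
```

Feeding the previous output back in lets the decoder condition each slice's prediction on the label it just produced, which is how the sequential layer-ordering constraint can be exploited.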
We tested our methods on the dataset from Bozkurt et al. (2017). To compare our methods with the current state of the art, we report accuracy, sensitivity, specificity, and the number of errors that imply impossible transitions. Perhaps the most interesting parameter of our model is D, which determines the support (2D + 1) of the local attention vector. We experimented with several values to provide insight into the model. For D = 0, we can compare our network with Toeplitz attention to the full-sequence RCN to assess the effect of input-feeding. The D = 1 case can be compared with the partial-sequence RCN; however, the main difference between the models is that the partial-sequence RCN's attention acts on the input (the output of Inception v3 in their case) and operates in a complex manner that cannot be expressed as a weighted sum (due to the recurrent units).
Interestingly, looking at 3 images per decision indeed gives the highest accuracy in our experiments (Table 1). Even though the global attention model does not perform as accurately as the Toeplitz attention models, it is the only model that is perfectly consistent, producing no anatomically impossible transitions (Table 2).
Looking at the attention maps, we see our model's power in capturing the sequential relationships within the stacks. The global attention map has a block-diagonal structure, and the block edges align with the transition boundaries. In the case of Toeplitz attention (D = 7), even though the support of the attention vector is 15 slices, it learns to fit a sharper Gaussian-like curve around the image of interest without any sparsity regularization. Note that the brightest band is not exactly on the main diagonal, so the network leverages the freedom to change the alignment. This freedom is restricted as D decreases, and the attention map for D = 0 is an identity matrix, as expected.
Table 1: Image-wise classification performance (%).

| Method | Accuracy | Sens. (Epi.) | Sens. (DEJ) | Sens. (Derm.) | Spec. (Epi.) | Spec. (DEJ) | Spec. (Derm.) |
|---|---|---|---|---|---|---|---|
| Toeplitz Attention (D=1) | 88.18 | 93.76 | 83.88 | 84.34 | 95.98 | 90.48 | 95.75 |
| Toeplitz Attention (D=0) | 88.04 | 93.88 | 83.55 | 84.12 | 95.71 | 90.82 | 95.45 |
| Full seq. RCN Bozkurt et al. (2017) | 87.97 | 93.95 | 83.22 | 84.16 | 95.82 | 90.54 | 95.51 |
| Toeplitz Attention (D=7) | 87.69 | 93.84 | 83.21 | 83.27 | 95.29 | 90.07 | 95.96 |
| Par. seq. RCN Bozkurt et al. (2017) | 87.52 | 94.14 | 82.54 | 83.33 | 94.78 | 90.83 | 95.44 |
| Inception-V3 Bozkurt et al. (2017) | 84.87 | 88.83 | 84.66 | 78.18 | 95.84 | 85.73 | 96.23 |
| Hames et al. (2016) | 84.48 | 88.87 | 80.93 | 81.85 | 93.81 | 87.81 | 94.78 |
| Kaur et al. (2016) | 64.33 | 73.99 | 51.14 | 68.27 | 86.22 | 74.85 | 84.89 |
Table 2: Number of anatomically impossible transitions, by transition type and in total.

| Method | Type 1 | Type 2 | Type 3 | Type 4 | Total |
|---|---|---|---|---|---|
| Toeplitz Attention (D=7) | 0 | 2 | 0 | 0 | 2 |
| Toeplitz Attention (D=1) | 0 | 4 | 0 | 1 | 5 |
| Full seq. RCN Bozkurt et al. (2017) | 0 | 4 | 0 | 3 | 7 |
| Toeplitz Attention (D=0) | 0 | 2 | 2 | 4 | 8 |
| Par. seq. RCN Bozkurt et al. (2017) | 3 | 10 | 5 | 5 | 23 |
| Inception-V3 Bozkurt et al. (2017) | 3 | 25 | 8 | 32 | 68 |
| Hames et al. (2016) | 14 | 59 | 11 | 56 | 140 |
| Kaur et al. (2016) | 32 | 255 | 16 | 99 | 402 |
In this work, we incorporated attention mechanisms to improve the interpretability of the recurrent convolutional networks of Bozkurt et al. (2017). We experimented with two different mechanisms: first, we tried global attention, where the network attends to the whole stack with different weights. Second, we tried a Toeplitz attention mechanism, where we forced the attention map into a Toeplitz structure by making the attention weights have smaller support and be depth-independent. Comparing Toeplitz attention with different parameters, we observed that looking at 3 images per decision (D = 1) indeed gives the highest image-wise classification accuracy. Our model with global attention, in turn, behaves most consistently, reporting no anatomically impossible transitions. Comparing the D = 0 case with the full-sequence RCN model, we also conclude that input-feeding in the decoder helps improve image-wise classification accuracy, but does not help consistency.
This project was supported by NIH grant R01CA199673 from NCI. This project was also supported in part by MSKCC’s Cancer Center core support NIH grant P30CA008748 from NCI.
- Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
- Borsari et al. (2016) Stefania Borsari, Riccardo Pampena, Aimilios Lallas, Athanassios Kyrgidis, Elvira Moscarella, Elisa Benati, Margherita Raucci, Giovanni Pellacani, Iris Zalaudek, Giuseppe Argenziano, et al. Clinical indications for use of reflectance confocal microscopy for skin cancer diagnosis. JAMA dermatology, 152(10):1093–1098, 2016.
- Bozkurt et al. (2017) Alican Bozkurt, Trevor Gale, Kivanc Kose, Christi Alessi-Fox, Dana H Brooks, Milind Rajadhyaksha, and Jennifer Dy. Delineation of skin strata in reflectance confocal microscopy images with recurrent convolutional networks. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on, pages 777–785. IEEE, 2017.
- Cho et al. (2014a) Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014a.
- Hames et al. (2016) Samuel C Hames, Marco Ardigò, H Peter Soyer, Andrew P Bradley, and Tarl W Prow. Automated segmentation of skin strata in reflectance confocal microscopy depth stacks. PloS one, 11(4):1–12, 2016.
- Kaur et al. (2016) P. Kaur, K. J. Dana, G. O. Cula, and C. Mack. Hybrid deep learning for reflectance confocal microscopy skin images. In 2016 23rd International Conference on Pattern Recognition, Dec 2016.
- Luong et al. (2015) Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025, 2015.
- Mnih et al. (2014) Volodymyr Mnih, Nicolas Heess, Alex Graves, et al. Recurrent models of visual attention. In Advances in neural information processing systems, pages 2204–2212, 2014.
- Pellacani et al. (2016) G. Pellacani, A. Witkowski, A. M. Cesinaro, A. Losi, G. L. Colombo, A. Campagna, Caterina Longo, Simonetta Piana, N. De Carvalho, F. Giusti, et al. Cost–benefit of reflectance confocal microscopy in the diagnostic performance of melanoma. Journal of the European Academy of Dermatology and Venereology, 30(3):413–419, 2016.
- Rajadhyaksha et al. (2016) Milind Rajadhyaksha, Ashfaq Marghoob, Anthony Rossi, Allan C Halpern, and Kishwer S Nehal. Reflectance confocal microscopy of skin in vivo: From bench to bedside. Lasers in Surgery and Medicine, 2016.
- Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112, 2014.
- Szegedy et al. (2016) Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.
- Williams (1992) Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.