
Assessing Knee OA Severity with CNN attention-based end-to-end architectures

This work proposes a novel end-to-end convolutional neural network (CNN) architecture to automatically quantify the severity of knee osteoarthritis (OA) using X-Ray images, which incorporates trainable attention modules acting as unsupervised fine-grained detectors of the region of interest (ROI). The proposed attention modules can be applied at different levels and scales across any CNN pipeline helping the network to learn relevant attention patterns over the most informative parts of the image at different resolutions. We test the proposed attention mechanism on existing state-of-the-art CNN architectures as our base models, achieving promising results on the benchmark knee OA datasets from the osteoarthritis initiative (OAI) and multicenter osteoarthritis study (MOST). All code from our experiments will be publicly available on the github repository:



1 Introduction

Knee osteoarthritis (OA) is the most common articular disease and a leading cause of chronic disability [heidari2011knee]. It mainly affects the elderly, the obese, and those with a sedentary lifestyle. Degenerative processes of the articular cartilage, resulting from excessive load on the joint and from aging, contribute to the natural breakdown of joint cartilage, with joint space narrowing (JSN) and osteophytes [KneeOA5]. Knee OA causes excruciating pain and often leads to joint arthroplasty in its severe stages. An early diagnosis is crucial for clinical treatment to be effective in curtailing progression and mitigating future disability [KneeOA1] [KneeOA2]. Despite the introduction of several imaging modalities such as MRI, OCT, and ultrasound for augmented OA diagnosis, X-ray is still the method of choice in diagnosing knee OA, although clinical evidence also contributes.

Previous work has approached the challenge of automatically assessing knee OA severity as an image classification problem [KneeOA2] [KneeOA4] [KneeOA6] using the Kellgren and Lawrence (KL) grading [KL]. KL grading quantifies the degree of degeneration on a five-point scale (0 to 4): KL-0 (no OA changes), KL-1 (doubtful), KL-2 (early OA changes), KL-3 (moderate), and KL-4 (end-stage). Assessment is based on JSN, presence of osteophytes, sclerosis, and bone deformity. Most methods in the literature use a two-step process to automatically quantify knee OA severity: 1) localization of knee joints; and 2) quantification of severity. Separate models for knee joint localization, using either hand-crafted features [KneeOA4] [KneeOA6] or CNNs [antony2], are not always highly accurate, which affects the subsequent quantification accuracy and adds extra complexity to the training process.

To overcome this problem, this work proposes a novel end-to-end architecture incorporating trainable attention modules that act as unsupervised fine-grained ROI detectors, which automatically localize knee joints without a separate localization step. The proposed attention modules can be applied at different levels and scales across an arbitrary CNN pipeline. This helps the network to learn attention patterns over the most informative parts of the image at different resolutions, achieving improvements in the quantification performance.

2 Related work

Much of the literature has proposed image classification-based solutions to assess knee OA severity using radiography-based semi-quantitative scoring systems, such as KL grading, which are based on the study of anatomical features such as variations in joint space width or osteophyte formation [KneeOA3] [KneeOA2] [KneeOA4]. Shamir et al. [KneeOA4] proposed WND-CHARM, a multi-purpose medical image classifier that automatically assesses knee OA severity in radiographs using a set of features based on polynomial decompositions, contrast, pixel statistics, textures, and features from image transforms. Recently, Yoo et al. proposed a self-assessment scoring system associating risk factors and radiographic knee OA features using multivariable logistic regression models, additionally using an Artificial Neural Network (ANN) to improve the overall scoring performance. Shamir et al. [KneeOA2] proposed template matching to automatically detect knee joints from X-ray images. This method is slow to compute for large datasets and gives poor detection performance. Antony et al. [antony2] introduced an SVM-based approach for automatically detecting the knee joints. Later, Antony et al. [antony2017automatic] proposed an FCN-based approach to improve the localization of the knee joints. Although this approach is more accurate, the aspect ratio chosen for the extracted knee joints affects the overall quantification.

Recently, the emergence of deep learning has enabled the development of new intelligent diagnostics based on computer vision. CNNs outperform many state-of-the-art methods based on hand-crafted features in tasks such as image classification [krizhevsky2012imagenet], retrieval [babenko2014neural] and object detection [lawrence1997face] [wei2011computer]. Antony et al. [antony2] showed that off-the-shelf CNNs such as the VGG 16-layer network [VGG16], the VGG-M-128 network [chatfield2014return], and the BVLC reference CaffeNet [jia2014caffe] [karayev2013recognizing], pre-trained on the ImageNet LSVRC dataset, can be fine-tuned for classifying knee OA images through transfer learning. They argued that it is appropriate to assess knee OA severity using continuous metrics like mean-squared error together with binary or multi-class classification losses, showing that predicting the continuous grades through regression reduces the error and improves overall quantification. They proposed a novel pipeline [antony2017automatic] to automatically quantify knee OA severity using an FCN for localization and a CNN jointly trained for classification and regression. That work consolidates the state-of-the-art baseline for the application of CNNs in the field, opening a range of research lines for further improvements. Tiulpin et al. [tiulpin2018automatic] presented a new computer-aided diagnosis method based on deep Siamese CNNs, which were originally designed to learn a similarity metric between pairs of images. However, rather than comparing image pairs, the authors extend this idea to similarity in knee X-ray images (with 2 symmetric knee joints). Splitting the images at the central position and feeding each knee joint into a separate CNN branch allows the network to learn identical weights for both branches. They outperform the previous approaches by achieving an average multi-class testing accuracy of 66.71% on the entire OAI dataset, although they also need a localization step to focus the network branches on the knee joint areas.

This work mainly focuses on designing an end-to-end architecture with attention mechanisms. Similar methods have been reported in the literature. Xiao et al. [xiao2015application] propose a pipeline to apply visual attention to deep neural networks by integrating and combining attention models to train domain-specific nets. In another approach, Liu et al. introduce a reinforcement learning framework based on fully convolutional attention networks (FCAN) to optimally select local discriminative regions adaptive to different fine-grained domains. The proposed weakly-supervised reinforcement method combined with a fully-convolutional architecture achieves fast convergence without requiring expensive annotation. Recently, Jetley et al. introduced an end-to-end-trainable attention module for CNN architectures built for image classification. The module takes as input the 2D feature vector maps, which form the intermediate representations of the input image at different stages in the CNN pipeline, and outputs a matrix of scores for each map. They redesign standard architectures to classify the input image using only a weighted combination of local features, forcing the network to learn relevant attention patterns.

3 Method

This section describes the proposed methods, detailing the design, implementation, and training of the attention modules. Several strategies are investigated to integrate the attention mechanism into standard CNN architectures, proposing experimental approaches to classify the knee images.

3.1 Trainable Attention Module for CNNs

The selected attention module is inspired by the work of Kevin Mader [kevinmader], in which a trainable attention mechanism is designed for a pretrained VGG-16 network to predict bone age from hand X-Ray images. Figure 1 illustrates this idea.


Figure 1: Attention module scheme.

Given an input volume from a convolutional layer, consisting of a set of feature maps, several convolutional layers are stacked to extract spatial features. The output is then passed to a locally connected layer [chen2015locally] (convolution with unshared weights) with sigmoidal activation to give an attention mask. The original feature maps are element-wise multiplied by the attention mask, generating a new convolutional volume that accentuates informative areas. A spatial dimensionality reduction is performed by applying global average pooling (GAP) on the masked volume, generating a feature vector with one entry per feature map, which is then normalized by the average value of the attention mask. Additionally, a softmax layer can be applied to yield a vector of output class probabilities with one entry per class.
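The masking, pooling, and normalization steps described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `attention_pool` is hypothetical, the mask is assumed to be a single channel shared by all feature maps, and the pre-sigmoid scores `mask_logits` stand in for the output of the stacked convolutional and locally connected layers, which are not modeled here.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def attention_pool(feature_maps, mask_logits):
    """Apply an attention mask, then global average pooling with renormalization.

    feature_maps: list of C maps, each an H x W list of lists.
    mask_logits:  H x W pre-sigmoid scores (assumption: one shared mask channel).
    Returns (feature vector of length C, H x W attention mask).
    """
    mask = [[sigmoid(v) for v in row] for row in mask_logits]
    h, w = len(mask), len(mask[0])
    mask_mean = sum(sum(row) for row in mask) / (h * w)
    features = []
    for fmap in feature_maps:
        # Element-wise product with the mask, followed by global average pooling.
        masked_mean = sum(
            fmap[i][j] * mask[i][j] for i in range(h) for j in range(w)
        ) / (h * w)
        # Normalize by the average mask value so the scale does not depend
        # on how much of the image the mask keeps.
        features.append(masked_mean / mask_mean)
    return features, mask


# Toy input: the second map is active only where the mask is (almost) open.
maps = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[0.0, 4.0], [0.0, 0.0]],
]
logits = [[-10.0, 10.0], [-10.0, -10.0]]
features, mask = attention_pool(maps, logits)
```

With this toy input the constant map pools to 1, while the second map pools to a value close to 4 instead of the plain average of 1, because the normalization rescales by the small area that the mask keeps.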


3.2 Module Integration to CNN Pipeline

Standard CNN architectures typically stack several convolutional layers with occasional pooling operations that reduce the spatial dimension and increase the receptive field. Therefore, the degree of abstraction of the attention modules is closely related to their location in the CNN, focusing on more global details as depth increases. We define an attention branch as an attention module placed at a specific convolutional block, with a softmax operation applied at the top to produce independent class probabilities based on the KL scores. Each attention branch can be seen as a new model in itself that can be trained end-to-end. Figure 2 shows a sample architecture integrating the attention modules into the VGG-16 pipeline. With the input size fixed as described in Section 3.4, we build the branches taking as input volumes the feature maps from the pooling layers after three of the convolutional blocks. Following the methodology in Section 3.3, a combinational module is applied to fuse the local features from all the branches into a global feature vector, and the KL grade probability distribution is then generated by applying a softmax layer at the top.


Figure 2: Sample architecture integrating the attention modules in the VGG-16 pipeline.

3.3 Combining Multiple Attention Branches

Several strategies are investigated to merge features from multiple branches with the aim of combining attention patterns at different resolutions. Our first strategy is to perform early fusion of features from different branches. Each attention module generates a feature vector containing the average values of the masked feature maps of its input convolutional volume. The branch vectors are concatenated channel-wise into a single vector whose dimension depends on the total number of attention branches and on the dimension of each branch vector. In addition, a fully connected layer is added at the top to perform early fusion of the concatenated features, and a softmax operation is applied to generate the output class probabilities. As shown above, the complexity of the attention modules correlates with their location in the CNN pipeline, which biases the convergence behavior. This can be critical for a combined model that attempts to train modules with different convergence rates at the same time: deeper branches quickly overfit while waiting for the slower ones to converge. Conversely, if the overall learning time is reduced, the shallower branches with more complex modules may lose performance due to under-training.
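The early fusion step can be sketched as follows: concatenate the branch vectors, apply one fully connected layer, and take a softmax. This is only an illustration under stated assumptions. The function name `early_fusion`, the branch dimensions, and the constant weights are hypothetical placeholders for parameters that would be learned during training.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def early_fusion(branch_features, weights, biases):
    """Concatenate branch vectors, apply a dense layer, return class probabilities.

    branch_features: list of per-branch feature vectors (lengths may differ).
    weights: one row per class, each row as long as the concatenated vector.
    biases:  one bias per class.
    """
    fused = [v for feats in branch_features for v in feats]
    logits = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, biases)
    ]
    return fused, softmax(logits)


# Two hypothetical branches with 2 and 3 features, and 5 output classes (KL-0..KL-4).
branch_a = [0.2, 0.7]
branch_b = [0.1, 0.4, 0.9]
num_classes = 5
# Placeholder weights: class k simply scales the summed features by 0.1 * k.
weights = [[0.1 * k] * 5 for k in range(num_classes)]
biases = [0.0] * num_classes
fused, probs = early_fusion([branch_a, branch_b], weights, biases)
```

Because all branches feed one classifier and one loss, they are forced to train at the same pace, which is the limitation discussed in the paragraph above.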

Our next strategy is to simplify the multiple branch learning process. We propose the use of multi-loss training, which aims to improve learning efficiency and prediction performance by learning multiple objectives from a shared representation. Each attention branch makes separate predictions via a softmax to generate its class probabilities, and we linearly combine the individual categorical cross-entropies into a global loss, i.e. a weighted sum with one weight per branch. This makes it possible to control the rate of convergence by weighting the contribution of each branch, assigning low weights to branches with faster convergence to reduce their influence at the initial stages of training and attenuate updates in shallower attention modules. Previous approaches have used multi-loss training to address different machine learning tasks such as dense prediction, scene understanding, natural language processing [collobert2008unified] or speech recognition [huang2013cross]. However, model performance is extremely sensitive to the weight selection, which requires an expensive and time-consuming hyperparameter search.
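The linear combination of per-branch cross-entropies can be written down directly. The sketch below is a minimal illustration: the function names are hypothetical, and the branch weights and probabilities are arbitrary placeholder values, not the ones used in the experiments.

```python
import math


def cross_entropy(probs, true_class):
    """Categorical cross-entropy for one sample with a one-hot target."""
    return -math.log(max(probs[true_class], 1e-12))


def multi_loss(branch_probs, true_class, branch_weights):
    """Weighted sum of per-branch cross-entropies.

    The same target is used for every branch, mirroring the duplication of
    the target data across attention branches.
    """
    losses = [cross_entropy(p, true_class) for p in branch_probs]
    total = sum(w * l for w, l in zip(branch_weights, losses))
    return total, losses


# Two hypothetical branches predicting over the five KL grades.
shallow = [0.10, 0.20, 0.40, 0.20, 0.10]
deep = [0.05, 0.05, 0.70, 0.15, 0.05]
weights = (1.0, 0.9)  # placeholder values, chosen only for illustration
total, losses = multi_loss([shallow, deep], true_class=2, branch_weights=weights)
```

Lowering the weight of a branch scales down its gradient contribution to the shared layers, which is how the weighting controls the relative convergence rates.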

Several multi-branch combinations were tested by applying multidimensional cross-validation to find the optimum branch locations and multi-loss weights. We used a 2D grid search over the loss weights of the two branches, sweeping each weight over a fixed range with a fixed step size and using the validation loss as the monitor. The best performance was achieved with weights that slightly reduce the contribution of the deeper attention modules, decreasing their tendency to overfit while the shallower branches are still learning.
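A 2D grid search of this kind can be sketched as below. Everything numeric here is an assumption made for illustration: the range 0.5 to 1.0, the step 0.1, and the toy objective are placeholders, and `toy_validation_loss` stands in for the expensive step of training a model with the given weights and measuring its validation loss.

```python
from itertools import product


def weight_grid(start, stop, step):
    """Evenly spaced candidate weights from start to stop, inclusive."""
    n = int(round((stop - start) / step))
    return [round(start + i * step, 10) for i in range(n + 1)]


def grid_search(validation_loss, grid):
    """Evaluate every pair of weights and keep the one with the lowest loss."""
    best = None
    for w_a, w_b in product(grid, repeat=2):
        loss = validation_loss(w_a, w_b)
        if best is None or loss < best[0]:
            best = (loss, w_a, w_b)
    return best


def toy_validation_loss(w_a, w_b):
    # Hypothetical stand-in for "train with these weights, return validation loss".
    return (w_a - 1.0) ** 2 + (w_b - 0.9) ** 2


grid = weight_grid(0.5, 1.0, 0.1)  # placeholder range and step size
best_loss, best_a, best_b = grid_search(toy_validation_loss, grid)
```

The cost grows with the square of the grid size, since each grid point requires a full training run, which is why the weight selection is described as expensive.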


Figure 3: Comparison between merging solutions in the VGG-16 pipeline, visualizing the masks generated in two attention branches and showing a large improvement in the shallower modules when using multi-loss.

3.4 Public Knee OA Datasets

The data used for this work are bilateral PA fixed flexion knee X-ray images. The datasets are from the Osteoarthritis Initiative (OAI) and the Multicenter Osteoarthritis Study (MOST) at UCSF, which are standard public datasets widely used in knee OA studies. The baseline cohort of the OAI dataset contains MRI and X-ray images of 4,476 participants. From this entire cohort, we selected 4,446 X-ray images based on the availability of KL grades for both knees as per the assessments by the Boston University X-ray reading center (BU). The MOST dataset includes lateral knee radiograph assessments of 3,026 participants. From this, 2,920 radiographs are selected based on the availability of KL grades for both knees as per baseline assessments. As a pre-processing step, all the X-ray images are manually split in the middle, generating two vertical sections from the left and right sides, which are resized to a fixed mean size that preserves the average aspect ratio. Histogram equalization is performed for intensity level normalization, and finally data augmentation is applied by performing horizontal right-left flips to generate more training data. The training, validation, and test sets were split based on the KL grade distribution. A 70-30 train-test split was used, and part of the training data was kept for validation.
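The pre-processing steps described above (splitting at the middle, histogram equalization, and horizontal flips) can be sketched on a tiny image stored as a list of rows. This is a minimal illustration, not the authors' pipeline: the function names are hypothetical, resizing is omitted, 8-bit integer grey levels are assumed, and the equalization uses the standard cumulative-histogram mapping.

```python
def split_knees(image):
    """Split a bilateral radiograph (list of pixel rows) at the middle column."""
    mid = len(image[0]) // 2
    left = [row[:mid] for row in image]
    right = [row[mid:] for row in image]
    return left, right


def flip_horizontal(image):
    """Mirror an image left-to-right (used here as data augmentation)."""
    return [row[::-1] for row in image]


def equalize_histogram(image, levels=256):
    """Histogram equalization for integer grey levels in [0, levels - 1]."""
    pixels = [p for row in image for p in row]
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    cdf_min = next(c for c in cdf if c > 0)
    total = len(pixels)
    if total == cdf_min:
        return [row[:] for row in image]  # constant image: nothing to spread
    scale = (levels - 1) / (total - cdf_min)
    return [[round((cdf[p] - cdf_min) * scale) for p in row] for row in image]


# Toy 2 x 4 "radiograph" with low contrast.
image = [[10, 20, 30, 40], [10, 20, 30, 40]]
left, right = split_knees(image)
left_eq = equalize_histogram(left)
right_flipped = flip_horizontal(right)
```

On the toy image, equalization stretches the two grey levels of the left half to the full range, and the flip gives a mirrored copy of the right half.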

3.5 Training

All models were trained from scratch using categorical cross-entropy with the ground truth KL grades. For multi-branch training, all target data were duplicated for each attention branch. We used Adam [kingma2014adam] with a batch size of 64, an initial learning rate that is scaled by a fixed factor after a set number of epochs without improvement in validation loss, and early stopping after a set number of epochs without improvement.
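The learning rate schedule and early stopping rule can be sketched as a small monitor that is updated once per epoch with the validation loss. The class name `PlateauMonitor` and all numeric settings below (initial rate, factor, and both patience values) are hypothetical placeholders chosen for illustration, not the values used in the experiments.

```python
class PlateauMonitor:
    """Scale the learning rate on plateaus and signal when to stop training."""

    def __init__(self, lr, factor, lr_patience, stop_patience):
        self.lr = lr
        self.factor = factor
        self.lr_patience = lr_patience
        self.stop_patience = stop_patience
        self.best = float("inf")
        self.stale = 0  # epochs since the last improvement

    def update(self, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.stale = 0
            return False
        self.stale += 1
        # Scale the learning rate every lr_patience epochs without improvement.
        if self.stale % self.lr_patience == 0:
            self.lr *= self.factor
        return self.stale >= self.stop_patience


# Placeholder settings and a made-up validation curve.
monitor = PlateauMonitor(lr=1e-3, factor=0.1, lr_patience=2, stop_patience=4)
stopped_at = None
for epoch, val_loss in enumerate([1.0, 0.9, 0.95, 0.93, 0.94, 0.92]):
    if monitor.update(val_loss):
        stopped_at = epoch
        break
```

In this made-up curve the loss never improves on 0.9, so the rate is scaled twice and training stops at the sixth epoch.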

4 Results

Although the attention mechanism can be integrated into any CNN pipeline, not all architectures are well suited to assessing knee OA severity. We explored several architectures from the literature, including state-of-the-art models from previous work by Antony et al. [antony2017automatic] and more complex architectures such as ResNet-50 [Resnet50], and analyzed their performance. We found that the level of abstraction reached by the attention branches of the Antony et al. models can be achieved in shallower branches of deeper architectures, implying that the best branch location depends on model complexity. After testing different branch combinations, the best performing locations are presented in Table 1 and detailed in Tables 3, 4, and 5, which specify the output resolution of the convolutional blocks and the location of the attention branches. For the multi-loss models, since every attention branch produces an independent prediction, the top performing branch is used at test time. A more sophisticated ensemble approach was considered but not included in this paper. This approach involves averaging the pre-activation outputs (i.e. values before the softmax) of each of the model branches and then passing the result through a softmax to enforce a valid probability distribution over the classes. This idea is often effective in test-time data augmentation and ensemble methods and may improve performance over the single best model reported here. As Table 1 shows, the VGG-16 attention branch with multi-loss learning achieved the best overall classification accuracy (64.3%).
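The ensemble idea mentioned above, which was considered but not evaluated in this paper, can be sketched as follows. The function name `ensemble_predict` and the logit values are made up for illustration.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predict(branch_logits):
    """Average pre-softmax outputs across branches, then apply one softmax."""
    n = len(branch_logits)
    mean_logits = [sum(col) / n for col in zip(*branch_logits)]
    probs = softmax(mean_logits)
    predicted = max(range(len(probs)), key=probs.__getitem__)
    return probs, predicted


# Hypothetical pre-activation outputs of two branches over the five KL grades.
branch_1 = [0.1, 0.2, 2.0, 1.5, 0.0]
branch_2 = [0.0, 0.1, 1.0, 2.9, 0.2]
probs, predicted = ensemble_predict([branch_1, branch_2])
```

Here the first branch alone would predict KL-2 and the second KL-3; after averaging, the more confident second branch decides the ensemble prediction.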

Models Antony Clsf. Antony Extended ResNet-50 VGG-16 Early fusion Multi-Loss : : : 60% : 64.3% : : : :

Table 1: Evaluation overview for different CNN pipelines.

We also compared the attention mechanism with related knee OA classification-based solutions in the literature. First, we retrained the Antony et al. models with the same training data as in the previous experiments, applying the FCN introduced by the authors to extract the knee joints [antony2017automatic]. The results (Table 2) show that the attention-based models with end-to-end architectures clearly outperform the state-of-the-art frameworks. We further compared our results to human-level accuracy using the radiologic reliability readings from the OAI [klara2016reliability]. Although the data used to compute the reliability grading does not match our test set, we followed the methodology of previous works in the literature [tiulpin2018automatic], with the aim of providing an overview of the current gold standard for diagnosing OA involving human performance. Cohen's kappa coefficient [cohen1960coefficient] was used to evaluate the agreement between non-clinician readers and experienced radiologists when classifying items into mutually exclusive categories. Considering the following grading: up to 0.20 slight agreement, 0.21-0.40 fair, 0.41-0.60 moderate, 0.61-0.80 substantial, and 0.81-1.00 almost perfect agreement, their inter-reader reliability for the KL scores was moderate to substantial. In the case of automatic assessments, by considering a CNN model as a non-clinician X-ray reader, we can apply the coefficient to evaluate the inter-reader reliability between its predictions and the corresponding ground truth annotations provided by experienced radiologists. As Table 2 shows, the VGG-16 attention branch trained with multi-loss improves on the reliability of related works with a substantial agreement, reaching the margins of human accuracy.
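For reference, Cohen's kappa compares the observed agreement between two raters with the agreement expected by chance from their label frequencies. The sketch below computes the unweighted coefficient. The function name `cohen_kappa` is hypothetical and the label lists are made up for illustration; they are not data from the experiments.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa between two equally long label sequences.

    Undefined (division by zero) when both raters always use one same label.
    """
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability that both raters pick the same class
    # if each labels independently according to its own class frequencies.
    expected = sum(
        count_a[c] * count_b[c] for c in set(count_a) | set(count_b)
    ) / (n * n)
    return (observed - expected) / (1.0 - expected)


# Made-up KL grades: radiologist annotations versus model predictions.
ground_truth = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
predictions = [0, 0, 1, 2, 2, 2, 3, 3, 4, 3]
kappa = cohen_kappa(ground_truth, predictions)
```

In this toy case the raters agree on 8 of 10 items and the chance agreement is 0.2, giving a kappa of 0.75, which falls in the substantial band of the grading above.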

Test Acc. Test Loss Kappa # parameters Antony Clsf M (FCN: K) Antony Joint Clsf & Reg M (FCN: K) VGG-16: Multi-loss () 64.3% 0.63 M

Table 2: Comparison with related frameworks

5 Conclusions

This work proposed a novel end-to-end architecture that incorporates trainable attention modules acting as unsupervised fine-grained ROI detectors. The proposed attention modules can be applied at different levels and scales across the CNN pipeline, helping the network to learn relevant attention patterns over the most informative parts of the image at different resolutions. The results obtained on the public knee OA datasets OAI and MOST were satisfactory, although there is considerable scope for further improvement.

The proposed attention mechanism can be easily integrated into any convolutional neural network architecture and adapts to any input convolutional volume. However, after exploring off-the-shelf base models for classification with different complexities, we observed that the best performance is achieved by models with a balanced ratio between the complexity of the overall architecture and the depth of the convolutional volumes, which avoids overfitting while still providing abstraction in the local features used to train the attention modules. In addition, we propose the use of multi-loss training to manage the simultaneous training of multiple attention branches with different convergence rates, boosting the overall performance by fusing attention features with different levels of abstraction. The best performance was achieved by slightly reducing the contribution of the deepest attention branches, which improves the precision of the shallower attention masks and reaches the effectiveness of related approaches with a test accuracy of 64.3% and substantial Kappa agreement. Although our method does not surpass the state of the art and could be seen as challenging to implement, the overall aim was to reduce the training complexity using an end-to-end architecture. As mentioned in Section 4, without an end-to-end design, the models require a localization step to focus the classifier on the knee joint regions of interest. For instance, the previous work of Antony et al. [antony2017automatic] needed a manual annotation process to train an FCN to automatically segment the input knee joints. Our approach, in contrast, requires no such annotation of knee joint locations in the training data. Finally, we observed that localizing the knee joints in an unsupervised way can reduce performance by adding noise to the attention masks and thus to the overall process. A more robust attention module could improve the results and have a bigger impact in the future.
As future work, it may be interesting to design better base networks for the attention mechanism and then to test new state-of-the-art fine-grained methods, with the aim of improving the performance of the attention modules and reducing their dependence on the complexity of the base model.

This research was supported by contract SGR1421 by the Catalan AGAUR office. The work has been developed in the framework of project TEC2016-75976-R, funded by the Spanish Ministerio de Economia y Competitividad and the European Regional Development Fund (ERDF). The authors also thank NVIDIA for generous hardware donations.

This publication has emanated from research conducted with the financial support of Science Foundation Ireland (SFI) under grant numbers SFI/12/RC/2289 and 15/SIRG/3283.

The OAI is a public-private partnership comprised of five contracts (N01-AR-2-2258; N01-AR-2-2259; N01-AR-2-2260; N01-AR-2-2261; N01-AR-2-2262) funded by the National Institutes of Health, a branch of the Department of Health and Human Services, and conducted by the OAI Study Investigators. Private funding partners include Merck Research Laboratories; Novartis Pharmaceuticals Corporation, GlaxoSmithKline; and Pfizer, Inc. Private sector funding for the OAI is managed by the Foundation for the National Institutes of Health. MOST is comprised of four cooperative grants (Felson – AG18820; Torner – AG18832; Lewis – AG18947; and Nevitt – AG19069) funded by the National Institutes of Health, a branch of the Department of Health and Human Services, and conducted by MOST study investigators. This manuscript was prepared using MOST data and does not necessarily reflect the opinions or views of MOST investigators.


Appendix A CNN Architectures, Learning Curves and Visualizations

tab:antony-clf Layer Kernels Kernel Size Strides Output shape conv1 pool1 - conv2 pool2 - conv3 pool3 - conv4 pool4 -

Table 3: Antony et al. base architecture for classification

tab:antony-ext Layer Kernels Kernel Size Strides Output shape conv1 pool1 - conv2-1 conv2-2 pool2 - conv3-1 conv3-2 pool3 - conv4-1 conv4-2 pool4 - 2

Table 4: Antony et al. extended base architecture for classification and regression

tab:resnet50 Layer Kernels Kernel Size Strides Output shape conv1 maxpool - conv2_ () conv3_ () conv4_ () conv5_ ()

Table 5: ResNet-50 base architecture for classification
Figure 4: Learning curves and visualization for early fusion experiment in VGG-16.
Figure 5: Learning curves and visualization for multi-loss experiment in VGG-16.
Figure 6: Learning curves and visualization for Antony et al. pipeline for classification.
Figure 7: Learning curves and visualization for Antony et al. pipeline for joint classification and regression.
Figure 8: Learning curves and visualization for ResNet-50 pipeline.