Lung cancer is the deadliest type of cancer worldwide, but its morbidity and mortality rates can be significantly reduced if the diagnosis is made early enough. Namely, screening programs using low-dose chest computed tomography (CT) in risk groups have been shown to reduce mortality by more than 20% relative to chest radiography [Screening2015]. During the screening process, trained radiologists search the lung parenchyma for pulmonary nodules, the primary indicators of lung cancer.
Lung cancer screening is non-trivial because lung nodules can present a wide range of opacities (commonly referred to as textures), shapes, dimensions and locations, and thus the experience of the specialist plays an important role in the success of nodule hunting and the corresponding characterization [MacMahon2017]. Furthermore, CT scans are inherently complex to analyze due to their three-dimensionality and the large range of intensities to explore, making the process tiresome and consequently more prone to errors.
Radiologists fail at nodule detection due to either fixation or recognition errors [Krupinski2010CurrentPerception]. Fixation errors, mostly related to stress and fatigue, occur when the expert does not fixate on a region of interest for long enough to identify potential nodule candidates. Recognition errors, on the other hand, result from failing to correctly identify a found abnormality as a nodule and depend mostly on the experience of the radiologist [Brunye2019Eye-trackingCardiologists].
Assessing the gaze of radiologists during the screening process provides important information on why failures occur, and thus may be used to improve the overall success of the procedure. Namely, eye-tracking equipment records the spatial position of the radiologist's gaze during the analysis of the scan, providing insight into how screening is performed. For instance, it is known that radiologists usually follow one of two distinct nodule search strategies: scanning and drilling. In scanning, a radiologist searches for nodules on an entire slice before moving to the next, thus having to resort to techniques such as maximum intensity projection (MIP) to assess depth information. Alternatively, in drilling the radiologist focuses on a single quadrant of the volume at a time, scrolling through all the slices of the scan to account for 3D information [Drew20003, Diaz2015Eye-trackingData].
Lung cancer computer-aided detection and diagnosis (CADe and CADx) systems can help to further increase the success of screening programs by pointing out potential abnormalities to the radiologists and mitigating fixation-related failures. Also, the demand for these CADe systems has been rising due to the increase in the number of patients and the consequent equipment and trained personnel costs. CADe systems operate by automatically identifying potential nodules in the CT scan, which are then assessed by the radiologist. Because of this, a high detection sensitivity and a low false-positive rate are essential characteristics of these systems. Given the complexity of the task, deep learning-based approaches are becoming the backbone of lung CADe systems since they significantly reduce the field knowledge required to design efficient solutions.
Lung cancer CADe systems are usually composed of two stages:
1. a high-sensitivity/low-specificity 3D or 2D object detection framework, such as Faster R-CNN [Ren2017], that guarantees the detection of the majority of the nodules at the cost of also detecting other structures, such as blood vessels or scars, and
2. a false-positive reduction neural network to remove the non-nodules proposed by the nodule detector.
A properly trained system can achieve detection sensitivities greater than with FP/scan or greater than with 1 FP/scan [Ding2017].
Despite the high detection performance of lung nodule CADe systems, their success as stand-alone tools in clinical practice is limited. Indeed, human supervision can ensure the relevance of the findings, making it possible to re-plan or even avoid unnecessary follow-ups. Also, CADe systems tend to fail on cases that significantly deviate from the training data, namely unseen types of abnormalities. Because of this, CADe systems are used by radiologists either as an indicator of regions of interest or as a second independent observer.
When used collaboratively, CADe systems can bias the decision process of the radiologist. Namely, checking a case for the first time with the CADe markings on it can lead the expert to focus their attention on the highlighted regions to the detriment of the remaining scan. Furthermore, less experienced experts may over-trust the proposals of the CADe and increase the number of false-positive detections. On the other hand, a posteriori review of CADe suggestions may introduce a large time overhead. In this scenario, adjusting CADe results according to the attention and experience of specialists is of interest since it mitigates the drawbacks of CADe systems without compromising the success of the screening routine. Namely, the integration of eye-tracking information with CADe has been proposed, showing promising results. Specifically, recent studies have shown that the gaze of the radiologists during the nodule search task can be used for establishing a set of nodule candidates, which can then be classified by a deep learning system as nodule/non-nodule with state-of-the-art performance [Khosravan2019ALearning].
This study assesses the performance of 4 young radiologists on the lung nodule hunting task and how a CADe system can contribute to improving their success. For that purpose, gaze information recorded via an eye-tracker in a clinical setting is used for understanding how the search is conducted and how the radiologists' experience affects the process. Likewise, inter-observer evaluations are also conducted. Finally, it is shown that using a deep learning-based CADe system as a posterior second observer, both independently and together with gaze information, improves the global nodule detection sensitivity without increasing the number of false-positives. The experimental setup, including the acquisition setting, gaze processing and the lung nodule detection algorithm, is described in Section II. Section III details the results of the reading sessions and the impact of the CADe system on the detection sensitivity. Finally, Section IV discusses and Section V summarizes the findings of this study.
II Materials and methods
II-A Annotation procedure
The annotation team is composed of 4 radiology interns from Hospital de São João, Porto, Portugal (Rad1, Rad2, Rad3 and Rad4) with between 1 and 4 years of experience. Each medical expert was asked to annotate the scans similarly to the first step of the LIDC-IDRI annotation protocol [Armato2011]. Namely, the radiologists were instructed to mark every non-nodule and nodule with diameter with a point on the abnormality's center of mass and segment voxel-wise all lung nodules with diameter . For each of the abnormalities, the radiologists were also asked to perform a subjective categorical characterization of the nodules' calcification pattern and internal structure (soft tissue, fluid, fat, air) and an ordinal characterization of how well defined the margin is, the extent of the spiculation, their sphericity and lobulation, expected malignancy, subtlety and their texture (solid, sub-solid or non-solid) [McNittGray20071464]. For this purpose, a custom version of the ITK-SNAP software [py06nimg] was used. This custom version retrieves, at a fixed sample rate, the physical pixel size of the view windows (axial, coronal and sagittal), the scan-wise coordinates of the slices currently under analysis and the respective pan and zoom settings. The annotation procedure was blind, i.e. the radiologists did not have access to the ground-truth and could not discuss their markings with the other annotators.
II-B Gaze capturing and processing
The gaze of the radiologists was recorded using a Tobii Eye Tracker 4C (sampling frequency of 90 Hz) attached to the base of a Fujitsu E22T-7 monitor (1920×1080 pixels). The sensor records the absolute position of the gaze in physical screen pixels.
The radiologists were asked to sit at a distance of 60 cm from the monitor in a room with reduced lighting and limited access to avoid distractions, and to calibrate the sensor prior to each annotation session. Since the axial view was the one mainly used for nodule hunting, segmentation and characterization, all gaze points outside the window containing this view and those corresponding to the annotation procedure were removed. On this window, all gaze points outside the lung volume were also removed. The lung volume mask was estimated by applying a fixed Hounsfield Unit threshold followed by a morphological closing operation to fill the gaps created by nodules, blood vessels and other structures.
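The lung mask estimation described above can be sketched as follows. This is a minimal 2D illustration, not the implementation used in the study: the threshold value of -400 HU, the 3×3 structuring element and the function names are assumptions, since the text only states that a fixed threshold and a morphological closing were used. The sketch also ignores the air surrounding the body, which a practical pipeline would have to exclude.

```python
def threshold_lung(slice_hu, hu_threshold=-400):
    """Mark air-like voxels (below the assumed HU threshold) as lung candidates."""
    return [[value < hu_threshold for value in row] for row in slice_hu]


def _neighbourhood(mask, r, c):
    """Yield the 3x3 neighbourhood of (r, c); pixels outside the image count as background."""
    rows, cols = len(mask), len(mask[0])
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            rr, cc = r + dr, c + dc
            yield mask[rr][cc] if 0 <= rr < rows and 0 <= cc < cols else False


def dilate(mask):
    """A pixel becomes foreground if any pixel in its neighbourhood is foreground."""
    return [[any(_neighbourhood(mask, r, c)) for c in range(len(mask[0]))]
            for r in range(len(mask))]


def erode(mask):
    """A pixel stays foreground only if its whole neighbourhood is foreground."""
    return [[all(_neighbourhood(mask, r, c)) for c in range(len(mask[0]))]
            for r in range(len(mask))]


def close_mask(mask):
    """Morphological closing: dilation followed by erosion."""
    return erode(dilate(mask))


# Toy slice: soft tissue (40 HU) around a 5x5 lung region (-800 HU)
# that contains a single nodule-like voxel (40 HU) at its centre.
slice_hu = [[40] * 7 for _ in range(7)]
for r in range(1, 6):
    for c in range(1, 6):
        slice_hu[r][c] = -800
slice_hu[3][3] = 40

raw_mask = threshold_lung(slice_hu)
lung_mask = close_mask(raw_mask)
```

In the toy slice, the thresholded mask has a hole at the nodule position, and the closing fills it so that gaze points falling on the nodule are kept.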
Let be the dimensions, in voxels, of an analyzed scan. The coordinates of the gaze for each time point are converted to integer scan-wise voxel coordinates, taking into account the current slice of the observation, , as well as the respective zoom and pan. The corresponding attention map for each slice , , is defined as in Eq. 1:
where is the identity matrix, is a zero-valued matrix, , and zoom is the spread of the isotropic multivariate normal distribution, . Assuming that the radiologists' gaze corresponds approximately to the foveal vision (approximately 5° [millodot2014dictionary]), the values of are computed so that of is contained inside a circle of diameter cm, i.e. (cm), . This way, the voxel-wise diameter of corresponds to the expected physical dimension of the gaze, , when considering the head-to-monitor distance and the foveal vision angle. Finally, the attention map is defined as the concatenation of all attention slices:
where is the concatenation operator. is a matrix where each element indicates an estimate of the duration, in seconds, of the observation of the respective voxel in the scan.
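The construction of the attention map can be illustrated with the sketch below, which should be read under several assumptions. The physical gaze diameter follows from the 60 cm viewing distance and the 5° foveal angle stated in the text. The fraction of the Gaussian contained in the gaze circle (`MASS`), the on-screen voxel size (`SCREEN_MM_PER_VOXEL`), the function names and the choice of normalising the kernel so that each gaze sample contributes 1/90 s in total are hypothetical and not taken from the study.

```python
import math


def gaze_diameter_cm(distance_cm=60.0, foveal_angle_deg=5.0):
    """Diameter of the circle on the screen covered by the foveal cone."""
    return 2.0 * distance_cm * math.tan(math.radians(foveal_angle_deg) / 2.0)


def sigma_for_mass(radius, mass):
    """Spread of a 2D isotropic Gaussian holding `mass` of its weight within `radius`.

    For such a Gaussian the radial distance is Rayleigh distributed, so
    P(R <= r) = 1 - exp(-r**2 / (2 * sigma**2)).
    """
    return radius / math.sqrt(-2.0 * math.log(1.0 - mass))


def add_gaze_sample(attention_slice, row, col, sigma, dt):
    """Spread `dt` seconds around (row, col) with a normalised Gaussian kernel."""
    weights = [[math.exp(-((r - row) ** 2 + (c - col) ** 2) / (2.0 * sigma ** 2))
                for c in range(len(attention_slice[0]))]
               for r in range(len(attention_slice))]
    total = sum(sum(w_row) for w_row in weights)
    for r, w_row in enumerate(weights):
        for c, w in enumerate(w_row):
            attention_slice[r][c] += dt * w / total


# Hypothetical viewing conditions (not reported values).
SCREEN_MM_PER_VOXEL = 4.0  # assumed on-screen size of one voxel at the current zoom
MASS = 0.99                # assumed fraction of the kernel inside the gaze circle

diameter_vox = gaze_diameter_cm() * 10.0 / SCREEN_MM_PER_VOXEL
sigma = sigma_for_mass(diameter_vox / 2.0, MASS)

attention_slice = [[0.0] * 32 for _ in range(32)]
add_gaze_sample(attention_slice, 16, 16, sigma, dt=1.0 / 90.0)
```

With the stated distance and angle, `gaze_diameter_cm()` evaluates to roughly 5.2 cm. Because the on-screen voxel size changes with zoom, the spread in voxels has to be recomputed whenever the zoom changes.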
The performance of the annotators was evaluated on the LIDC-IDRI dataset [Armato2011]. These scans have been assessed by 4 radiologists, first blindly and afterwards with the markings of their peers. The subset of LIDC-IDRI considered for this study follows the LUNA16 Challenge [Setio2016]. Namely, an annotation was considered to be a nodule if at least 3 medical experts agreed on the diagnosis. The remaining lesions were considered as non-nodules. All nodules were subjectively characterized by each specialist from 1 to 6 in terms of calcification and 1 to 5 in terms of internal structure, lobulation, expected malignancy, margin, sphericity, spiculation, subtlety and texture (non-solid, sub-solid and solid). In total, this study considers 20 scans with 42 nodules with radius of known center-of-mass and equivalent radius. Also, the scans have an average number of slices of , slice thickness of (mm) and axial resolution of (mm/voxel).
Besides the 888 scans from the LUNA16 Challenge, 294 thin-slice scans from a proprietary dataset were also used for developing the automatic detection method. All images were acquired with several Siemens scanner models at Centro Hospitalar de São João. All scans have voxel-wise annotations (single blind) and most of the volumes were assessed by 2 of the radiologists that participated in this study. The total number of annotated nodules used was 985. The scans have an average number of slices, slice thickness of mm and axial resolution of mm/voxel.
II-D Deep CNN for automatic lung nodule detection
The studied nodule detection system is composed of an initial candidate detector followed by a false-positive reduction step, as shown in Fig. 1. The detection algorithm is based on the YOLOv3 architecture [Redmon2018YOLOv3:Improvement, 10.1007/978-3-030-00946-5_31] and outputs bounding boxes of potential nodules on the scan. The model assumes that each patch of the input image can have at least one object of interest. Instead of predicting the bounding boxes from scratch, the nodule detection is performed by adjusting the dimensions and positions of several template boxes assigned to the same patch. Furthermore, given the wide range of the nodules' diameters, the prediction is performed by assessing feature maps of the network at 2 different scales in a pyramidal fashion, i.e. at each scale the object location prediction results from the processing of the current set of feature maps as well as the ones from the previous scales. This increases the robustness of the model to variations in the size of the nodules. At each scale, the feature maps are convolved to a tensor, where is the number of patches (a function of the model's architecture), is the number of template bounding boxes and 5 is the number of parameters to optimize (the horizontal and vertical displacement of the bounding box, its width and height, and the confidence of containing a nodule). The detection network is trained only on slices containing nodules by minimizing a detection loss for each of the scales:
where is the mean square error, , and are the loss components associated with the centroid, width/height and nodule presence of the bounding box, respectively, and are predefined weights.
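As an illustration of the loss structure described above, the sketch below combines three mean square error terms (centroid, width/height and nodule presence) with predefined weights. It is a simplified reading of the text: the weight values, the function names and the fact that every term is averaged over all template boxes, rather than only over boxes assigned to a nodule, are assumptions.

```python
def mse(pred, target):
    """Mean square error between two equally long sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def detection_loss(pred_boxes, target_boxes, w_xy=1.0, w_wh=1.0, w_obj=1.0):
    """Weighted sum of centroid, size and objectness errors at a single scale.

    Each box is a tuple (dx, dy, width, height, confidence), i.e. the
    5 parameters predicted per template bounding box.
    """
    def column(boxes, indices):
        return [box[i] for box in boxes for i in indices]

    loss_xy = mse(column(pred_boxes, (0, 1)), column(target_boxes, (0, 1)))
    loss_wh = mse(column(pred_boxes, (2, 3)), column(target_boxes, (2, 3)))
    loss_obj = mse(column(pred_boxes, (4,)), column(target_boxes, (4,)))
    return w_xy * loss_xy + w_wh * loss_wh + w_obj * loss_obj


# One template box whose centroid is off by 0.5 in both directions.
target = [(0.0, 0.0, 1.0, 1.0, 1.0)]
prediction = [(0.5, 0.5, 1.0, 1.0, 1.0)]
loss_value = detection_loss(prediction, target)
```

The total training loss would then be the sum of this quantity over the 2 prediction scales.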
The input to the model is a image composed of 3 neighbour axial slices [10.1007/978-3-030-00946-5_31] of the CT scan to reduce the complexity of the model and take advantage of transfer learning approaches. Specifically, lung nodule hunting is non-trivial due to the large amount of information to process and the existence of blood vessels, whose circular cross-sections in the axial slices may act as nodule confounders. A possible solution is to use 3D networks that, by assessing the data in a volumetric fashion, ease the distinction of spherical nodules from the cylindrical blood vessels. However, 3D approaches are computationally heavy, hindering their application in clinical settings without resorting to cloud-based solutions. Also, it is known that fine-tuning networks pre-trained on natural images for medical image problems eases the training process. By using 3 neighbour axial slices, the system can take advantage of pre-trained networks for feature extraction and still encode depth information, as depicted in Fig. 1. For thin-slice CT scans, increasing the distance between neighbour slices simulates a higher slice thickness, helping to both increase the 3D context and standardize the training data.
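The 3-slice input can be sketched as below, where the neighbour slices are placed in the three channels normally used for colour. The function name, the `gap` parameter and the clamping of indices at the first and last slices are assumptions made for illustration.

```python
def stack_neighbour_slices(volume, z, gap=1):
    """Return the slices [z - gap, z, z + gap] as a 3-channel input.

    `volume` is a sequence of axial slices. Indices falling outside the
    volume are clamped to the first or last slice (an assumed convention).
    """
    last = len(volume) - 1
    indices = [min(max(z + k * gap, 0), last) for k in (-1, 0, 1)]
    return [volume[i] for i in indices]


# Slices are represented by labels here; in practice they would be 2D arrays.
volume = ["s0", "s1", "s2", "s3", "s4", "s5"]
thick_input = stack_neighbour_slices(volume, z=2, gap=2)
border_input = stack_neighbour_slices(volume, z=0)
```

For a thin-slice scan, a `gap` larger than 1 skips intermediate slices, which is one way of simulating the higher slice thickness mentioned above.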
The optimizer is Adam [DBLP:journals/corr/KingmaB14] and the data is augmented by random crops, rotations, translations and small color alterations (so that the 3D information is not lost). Also, hard sample mining is performed by, at the end of each epoch, increasing the probability of assessing images with higher prediction error on the previous iteration.
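One possible way of implementing the hard sample mining step is to sample training images with a probability that grows with their previous prediction error, as sketched below. The linear mapping from error to probability, the `floor` term that keeps easy images in rotation and all names are assumptions, since the text does not specify how the probabilities are updated.

```python
import random


def sampling_weights(errors, floor=0.05):
    """Map per-image errors from the previous epoch to sampling probabilities."""
    raw = [error + floor for error in errors]
    total = sum(raw)
    return [value / total for value in raw]


def draw_epoch(image_ids, errors, n_samples, rng):
    """Draw the images of the next epoch, favouring those with higher error."""
    return rng.choices(image_ids, weights=sampling_weights(errors), k=n_samples)


image_ids = ["img_a", "img_b", "img_c"]
previous_errors = [0.9, 0.1, 0.0]  # hypothetical per-image losses
weights = sampling_weights(previous_errors)
next_epoch = draw_epoch(image_ids, previous_errors, n_samples=8, rng=random.Random(0))
```

The sampling is done with replacement, so hard images may appear several times within the same epoch.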
After training, scan-wise inference is performed by sliding through all the slices of the volume. The network predicts, for each slice, candidates characterized by their bounding boxes and the respective probability of containing lung nodules. To reduce the number of false-positive detections, all candidates outside the lung volume, computed via a Hounsfield units-based threshold followed by a morphological closing, are discarded. The remaining predictions for which inter-centroid distance is less than half the size of their bounding box are merged by averaging their centroid and maintaining the highest network object detection probability.
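The merging of slice-wise predictions can be sketched with a greedy procedure. The text only states the merging criterion (inter-centroid distance below half the bounding box size), the centroid averaging and the retention of the highest probability. The processing order, the use of the larger of the two box sizes and the dictionary layout below are assumptions.

```python
import math


def merge_candidates(candidates):
    """Greedily merge candidates whose centroids are closer than half a box size.

    Each candidate is a dict with 'xyz' (centroid), 'size' (box side length)
    and 'prob' (detection probability). Candidates are visited from the most
    to the least confident, so each cluster keeps the highest probability.
    """
    clusters = []
    for cand in sorted(candidates, key=lambda c: c["prob"], reverse=True):
        for cluster in clusters:
            limit = 0.5 * max(cluster["size"], cand["size"])
            if math.dist(cluster["xyz"], cand["xyz"]) < limit:
                cluster["members"].append(tuple(cand["xyz"]))
                n = len(cluster["members"])
                cluster["xyz"] = tuple(
                    sum(m[i] for m in cluster["members"]) / n for i in range(3))
                cluster["size"] = max(cluster["size"], cand["size"])
                break
        else:
            clusters.append({"xyz": tuple(cand["xyz"]), "size": cand["size"],
                             "prob": cand["prob"], "members": [tuple(cand["xyz"])]})
    return [{"xyz": c["xyz"], "prob": c["prob"]} for c in clusters]


candidates = [
    {"xyz": (10.0, 10.0, 5.0), "size": 8.0, "prob": 0.6},
    {"xyz": (12.0, 10.0, 6.0), "size": 8.0, "prob": 0.9},
    {"xyz": (40.0, 40.0, 20.0), "size": 6.0, "prob": 0.3},
]
merged = merge_candidates(candidates)
```

In this example the first two candidates lie about 2.2 voxels apart, which is below half of their box size, so they collapse into a single finding that keeps the probability of 0.9.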
The false-positive reduction network, summarized in Fig. 1, is trained based on the results of the nodule detection algorithm. Namely, the training dataset is composed of all the nodules from the ground-truth and the highest-scoring false-positives of each scan. The input to the network is a cube of , resized to (voxels), centered on the candidate's centroid. The model considers the binary non-nodule/nodule classification as a multiple-instance learning (MIL) problem, which leads to a weight optimization via the minimization of the loss function:
where is the binary label of image (non-nodule or nodule), is the number of images, m is the global max pooling operation and is the last layer of the false-positive reduction network. Adam is used as optimizer and the dataset is artificially augmented via random crops, translations, flips and rotations.
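A common way of writing a MIL objective with global max pooling is a binary cross-entropy computed on the maximum instance score of each image. The sketch below follows that formulation as an assumption: the exact loss used in the study is not recoverable from the text, and the function name, the clipping constant and the interpretation of the last layer outputs as probabilities are hypothetical.

```python
import math


def mil_bce_loss(instance_scores, labels, eps=1e-7):
    """Binary cross-entropy on the max-pooled instance scores of each image.

    `instance_scores[i]` holds the last-layer outputs (assumed to be
    probabilities in [0, 1]) of image i and `labels[i]` is its binary label.
    """
    total = 0.0
    for scores, label in zip(instance_scores, labels):
        pooled = max(scores)                    # global max pooling
        pooled = min(max(pooled, eps), 1 - eps)  # avoid log(0)
        total -= label * math.log(pooled) + (1 - label) * math.log(1 - pooled)
    return total / len(labels)


# Two candidate cubes: a nodule (label 1) and a non-nodule (label 0).
scores = [[0.1, 0.9], [0.2, 0.1]]
labels = [1, 0]
loss_value = mil_bce_loss(scores, labels)
```

Under this formulation a candidate is classified as a nodule if at least one location in the output map responds strongly, which is the usual MIL assumption.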
The initial candidate detector was trained on the proprietary dataset (Section II-C). This was done to avoid a potential overfit to the annotation style of the LIDC-IDRI dataset, which could result in an over-estimation of the system's performance. On the other hand, preliminary results of the false-positive reduction network trained with the proprietary dataset suggested that the model was not generalizing well to the different reconstruction kernels and slice thicknesses observed in the test set. A poor generalization ability across different equipment was also present. Indeed, this is a known problem of deep learning systems, but fine-tuning on independent samples acquired with the equipment on which the model will be tested helps to mitigate the issue [DeFauw2018ClinicallyDisease]. With this in mind, this second network was trained on samples from the LIDC-IDRI dataset.
II-E Performance evaluation
This study focuses on the nodule search technique of the radiologists during lung cancer screening, as well as the nodule detection performance without and with the automatic detection system. The evaluation procedure is detailed in the following paragraphs.
II-E1 Search technique
The search technique is qualitatively and quantitatively evaluated by assessing the position of the gaze on the left and right lungs. For that, the left and right lungs were estimated by dividing the scan in the sagittal direction at the location corresponding to the mean of the minimum and maximum transverse points of the segmentation mask. Also, for each scan, the search time of each point, was normalized to , where is the total scan reading time. Then, for all scans, the normalized time points were sampled to 100 points. The probability of the gaze being located on the right lung in time point is computed as , where and are the number of points on the right lung and the total number of points between time points and , respectively, and is the number of assessed scans.
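The computation of the right lung gaze probability can be sketched as follows. Several details are assumptions: the total reading time is approximated by the last gaze timestamp, the right lung is taken to be on the left side of the axial image, bins without gaze samples for a given scan are skipped when averaging, and all names are hypothetical.

```python
def split_column(mask_columns):
    """Column separating the lungs: mean of the extreme columns of the lung mask."""
    return (min(mask_columns) + max(mask_columns)) / 2.0


def is_right_lung(column, split):
    """The right lung is assumed to appear on the left side of the axial image."""
    return column < split


def right_lung_probability(scans, n_bins=100):
    """Average, over scans, of the fraction of gaze samples on the right lung per bin.

    `scans` is a list of scans, each a list of (time_s, on_right_lung) samples.
    """
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for samples in scans:
        t_end = max(t for t, _ in samples)  # stand-in for the total reading time
        hits = [0] * n_bins
        totals = [0] * n_bins
        for t, on_right in samples:
            b = min(int(t / t_end * n_bins), n_bins - 1)
            totals[b] += 1
            hits[b] += int(on_right)
        for b in range(n_bins):
            if totals[b]:
                sums[b] += hits[b] / totals[b]
                counts[b] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


split = split_column([40, 41, 200, 470])
scans = [
    [(0.0, True), (1.0, True), (2.0, False), (3.0, False)],
    [(0.0, True), (1.0, False), (2.0, False), (4.0, True)],
]
probabilities = right_lung_probability(scans, n_bins=2)
```

With 2 bins, the example yields 0.75 for the first half of the normalized reading time and 0.25 for the second half.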
II-E2 Nodule detection performance
Similarly to LUNA16, an annotation is considered a true-positive () if the distance between the centroids of the ground-truth and of the marking is less than the nodule's diameter. Also, annotations on non-nodules and multiple hits on the same nodule were counted neither as false-positives nor as . Finally, all ground-truth nodules without an annotation were counted as false-negatives (), and all marks without an associated nodule as false-positives. The combination of the annotators is performed via the union of the respective single annotation sets. Also, the nodule detection performance of the annotators and the automatic system is evaluated in terms of sensitivity () and average number of false-positive findings per scan.
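The scoring rules above can be sketched as below. The data layout, the function names and the use of the same distance criterion for non-nodules are assumptions. The hit criterion (distance below the nodule's diameter) follows the text.

```python
import math


def score_scan(marks, nodules, non_nodules=()):
    """Return (true-positives, false-negatives, false-positives) for one scan.

    `marks` are (x, y, z) centroids; `nodules` and `non_nodules` are lists of
    ((x, y, z), diameter) pairs in the same physical units.
    """
    found = set()
    false_positives = 0
    for mark in marks:
        hits = [i for i, (centre, diameter) in enumerate(nodules)
                if math.dist(mark, centre) < diameter]
        if hits:
            # Repeated hits on the same nodule are only counted once.
            found.add(min(hits, key=lambda i: math.dist(mark, nodules[i][0])))
        elif not any(math.dist(mark, c) < d for c, d in non_nodules):
            false_positives += 1  # marks on non-nodules are ignored instead
    return len(found), len(nodules) - len(found), false_positives


def sensitivity(true_positives, false_negatives):
    return true_positives / (true_positives + false_negatives)


nodules = [((10, 10, 10), 6.0), ((50, 50, 20), 4.0)]
non_nodules = [((30, 30, 15), 5.0)]
marks_a = [(11, 10, 10), (12, 11, 10), (31, 30, 15)]
marks_b = [(50, 51, 20), (80, 80, 30)]

score_a = score_scan(marks_a, nodules, non_nodules)
score_b = score_scan(marks_b, nodules, non_nodules)
# Combining annotators corresponds to taking the union of their marks.
score_ab = score_scan(marks_a + marks_b, nodules, non_nodules)
```

In the example each annotator finds one of the two nodules, while their union finds both at the cost of carrying over the false-positive of the second annotator.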
The time spent analyzing whether an abnormality is indeed a nodule is assessed by summing the values of circumscribed by a cylinder of diameter (5.2 cm) and height of the nodule's equivalent diameter, centered on its center-of-mass. The normalized attention time is defined as .
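A sketch of the attention time computation is given below. The attention map is indexed as `attention[z][y][x]` and the cylinder is expressed in voxel units, which are illustrative conventions. Normalizing by the total scan reading time is an assumption about the definition of the normalized attention time.

```python
def attention_in_cylinder(attention, centre, radius_vox, half_height_vox):
    """Sum the attention values inside an axial cylinder centred on (z, y, x)."""
    cz, cy, cx = centre
    total = 0.0
    for z, axial_slice in enumerate(attention):
        if abs(z - cz) > half_height_vox:
            continue
        for y, row in enumerate(axial_slice):
            for x, seconds in enumerate(row):
                if (x - cx) ** 2 + (y - cy) ** 2 <= radius_vox ** 2:
                    total += seconds
    return total


def normalised_attention(attention, centre, radius_vox, half_height_vox,
                         reading_time_s):
    """Fraction of the scan reading time spent on the cylinder (assumed definition)."""
    return attention_in_cylinder(
        attention, centre, radius_vox, half_height_vox) / reading_time_s


# Toy attention map: 3 slices of 5x5 voxels, each observed for 0.1 s.
attention = [[[0.1] * 5 for _ in range(5)] for _ in range(3)]
seconds_on_nodule = attention_in_cylinder(attention, (1, 2, 2), 1, 0)
fraction = normalised_attention(attention, (1, 2, 2), 1, 0, reading_time_s=10.0)
```

A radius of 1 voxel on a single slice covers 5 voxels, so the example accumulates 0.5 s, i.e. 5% of a 10 s reading session.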
II-E3 Statistical analysis
Statistical differences related to detection performance are assessed using an adaptation of McNemar's test [McNemar1947NotePercentages]. In this study, this test is used to compare the performance of pairs of annotators A and B (including the automatic system) based on their accuracy on an independent test set. Namely, the chi-squared ( ) statistic, which follows a distribution with 1 degree of freedom, is defined by Eq. 5:
where is the number of nodules not detected by B but detected by A and is the number of nodules not detected by A but detected by B.
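For reference, the standard form of McNemar's test on the two discordant counts can be sketched as below. Whether the adaptation used in the study includes the continuity correction is not stated, so it is exposed as an option. The p-value uses the fact that a chi-squared variable with 1 degree of freedom is the square of a standard normal variable.

```python
import math


def mcnemar(n_a_only, n_b_only, continuity=True):
    """McNemar's chi-squared statistic and p-value (1 degree of freedom).

    `n_a_only` counts nodules detected by A but not by B and `n_b_only`
    counts nodules detected by B but not by A.
    """
    n = n_a_only + n_b_only
    if n == 0:
        return 0.0, 1.0
    diff = abs(n_a_only - n_b_only)
    if continuity:
        diff = max(diff - 1, 0)
    chi2 = diff ** 2 / n
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


chi2, p_value = mcnemar(10, 2)
chi2_equal, p_equal = mcnemar(5, 5)
```

With 10 and 2 discordant nodules the corrected statistic is 49/12, approximately 4.08, which falls just below the usual 0.05 significance level.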
Statistical differences related to the elapsed time are assessed via the ANOVA test [john1996applied]. Herein, this test is used for assessing whether the average elapsed time of analyzing an abnormality or a scan is different between the annotators. The ANOVA test is based on an -distribution with (, ) degrees of freedom as in Eq. 6:
where is the number of annotators, is the total number of observations, is the variation of the annotator means from the overall mean and is the variation of the observations of each annotator from the respective annotator mean. For both tests, the null hypothesis that the annotators are not statistically different is rejected if the p-value .
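The one-way ANOVA statistic described above can be sketched as the ratio between the between-annotator and within-annotator mean squares. The function name and data layout are illustrative, and the sketch stops at the statistic and its degrees of freedom, since the p-value additionally requires the F-distribution.

```python
def anova_f(groups):
    """One-way ANOVA F statistic and its degrees of freedom.

    `groups` holds one list of observations (e.g. elapsed times) per annotator.
    """
    k = len(groups)
    n = sum(len(group) for group in groups)
    grand_mean = sum(sum(group) for group in groups) / n
    means = [sum(group) / len(group) for group in groups]
    # Variation of the annotator means around the overall mean.
    ss_between = sum(len(group) * (mean - grand_mean) ** 2
                     for group, mean in zip(groups, means))
    # Variation of the observations around their own annotator mean.
    ss_within = sum((x - mean) ** 2
                    for group, mean in zip(groups, means) for x in group)
    f_value = (ss_between / (k - 1)) / (ss_within / (n - k))
    return f_value, (k - 1, n - k)


# Hypothetical elapsed times (s) for three annotators.
times = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_value, dof = anova_f(times)
```

In this toy example the between-annotator sum of squares is 26 and the within-annotator sum is 6, giving F = 13 with (2, 6) degrees of freedom.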
III-A Search technique
The average reading time for the left and right lungs per observer is shown in Fig. 3. In this study, the average scan reading time was (s) and the right lung tends to be 20% more observed than its counterpart.
The four radiologists use a similar drilling search strategy, as illustrated in Fig. 4. Specifically, they spend at least 30% of the initial reading time assessing the right lung (), then tend to refocus their attention on the left lung and finally return to the right.
Specialists tend to focus their attention on anatomical features such as fissures and blood vessels during nodule hunting, as suggested by Fig. 2. Also, Table I shows that approximately 20% of the reading time was used for assessing findings that were nodules. Rad 2 was significantly faster than Rad 1 and Rad 3 when marking false-positive findings, but no other statistical differences were found.
III-B Nodule detection performance
The overall nodule detection performance of the radiologists and the automatic system is depicted in Fig. 5 and 6. Specifically, Fig. 4(a) shows the sensitivity of the single and combined annotators for all the studied nodules, whereas Fig. 4(b) depicts the effect of combining all the annotations of the radiologists with those from the automatic system that had less than 10% normalized attention time from the corresponding radiologist. Fig. 4(c) shows the number of found nodules per range of normalized attention time and Fig. 6 depicts all nodules in the study and the respective detection performance of the specialists and the automatic detection system. The average human sensitivity is with false-positives/scan. For the same number of false-positives, the automatic system achieves a sensitivity of . Sensitivity-wise, i.e. ignoring false-positive annotations, all annotators, including the automatic system, are statistically different, with the exception of Rad 4 in comparison with Rad 1 and Rad 3. Likewise, combining any two annotators statistically increases the detection sensitivity in comparison to the reader alone.
As shown in Fig. 5, the a posteriori combination with the automatic system increases the sensitivity by 0.11 on average without increasing the number of false-positives. On the other hand, combining two radiologists yields an increase of 0.13 with 76% more false-positives. Fig. 5 also indicates a tendency of increased sensitivity with the time spent assessing an abnormality. Generally, adding the automatic system increases the sensitivity across all relative gaze ranges. For local assessments shorter than 10% of the reading time, the automatic system still significantly improves the detection performance.
In this study, the radiologists used a drilling strategy to search for abnormalities. Unlike scanning, drilling focuses the attention on one region of the lung, reducing the complexity of the search space. Furthermore, drilling provides more 3D context, easing the differentiation of nodules and blood vessels. In turn, the automatic system uses a hybrid drilling-scanning strategy by inferring on stacks of axial slices, which also provide 3D context, and predicting candidates patch-wise over the entire scan. During the visual assessment of the scans, there was a clear tendency to give more focus to the right lung by i) investing more assessment time and ii) starting the reading session with this structure. This may be partially related to the left-to-right, top-to-bottom writing system used in the majority of Western countries, since the right lung appears on the left side of the scan and thus is the first in the reading order. However, the most likely explanation is related to medical education and experience factors. Indeed, it is known that the right lung has a higher probability of containing malignant lesions than the left lung [Perandini2016DistributionNature]. Because of this, radiologists may tend to pay more attention to this side. Further studies should be made to assess the relative influence of these two factors on the search strategy.
The average human lung nodule detection sensitivity of 0.67 found in this study is similar to that of previous studies [ArmatoIII200928, AlMohammad2019RadiologistCT]. Interestingly, there was no evidence that the average reading time was associated with the detection performance. This indicates that each specialist has their own reading speed and tends to take a proportional amount of time assessing found abnormalities. Indeed, as shown in Fig. 6, all radiologists were capable of locating nodules with a large variety of textures and sizes, including highly subtle abnormalities such as E2 and D3. On the other hand, detection failures of non-subtle nodules tended to occur for smaller sizes, as in D6 and E6. The individual assessment times indicate that these failures are most likely due to fixation errors. In fact, as shown in Fig. 4(c), the number of detected nodules tends to increase with higher attention times. Interestingly, Table I suggests that failure, either under- or over-diagnosis, is usually associated with lower observation time. These findings further indicate the need for automatic second opinion systems, since these can force radiologists to assess unseen abnormalities and possibly mitigate attention-related detection failures.
The automatic system achieved a nodule detection performance similar to that of the radiologists, as depicted in Fig. 4(a). In fact, when combined with human annotations, the system enables a performance increase similar to or better than that of two radiologists. These results suggest that the second opinion provided by the automatic system is as valid as a human's, significantly increasing the detection sensitivity without changing the number of false-positives. Likewise, combining gaze information with the automatic system mitigates failures related to lower observation times. As shown in Fig. 4(b), using the CADe system only on regions with less than 10% normalized attention time (i.e. regions where the false-positive and false-negative probabilities are higher, according to Table I) still increases the detection performance. In a scenario where the combination is not done a posteriori, as in this study, but instead the radiologist is invited to review the CADe findings after the screening routine, these results suggest that using the gaze-CADe pair could improve the overall detection sensitivity while reducing the time overhead introduced by the analysis of all CADe findings.
Lung nodule hunting is a complex task, but using the opinion of a second radiologist significantly improves the success of the process. This second opinion can be replaced by a properly trained automatic detection system. Also, assessing the gaze during the screening routine makes it possible to retrieve important information related to search strategies and to identify potential regions of detection failure. When combining this gaze information with the inferences of an automatic system, it is possible to significantly increase the global detection sensitivity without forcing the radiologist to review the entire volume. This leads to a less tiresome and faster verification process by reducing the number of candidates to review, while also reducing CADe-related bias since the process is done after the initial assessment. Because of this, the introduction into clinical practice of systems similar to the one presented herein may contribute to increasing the success of lung cancer screening programs by reducing personnel costs and, most importantly, improving the quality of life of the patient.