Interactive user interface based on Convolutional Auto-encoders for annotating CT-scans

04/26/2019 ∙ by Martin Längkvist, et al.

High resolution computed tomography (HRCT) is the most important imaging modality for interstitial lung diseases, where the radiologists are interested in identifying certain patterns and their volumetric and regional distribution. The use of machine learning can assist the radiologists with both these tasks by performing semantic segmentation. In this paper, we propose an interactive annotation tool for semantic segmentation that assists the radiologist in labeling CT scans. The annotation tool is evaluated by six radiologists and radiology residents classifying healthy lung and reticular pattern in HRCT images. The usability of the system is evaluated with the System Usability Scale (SUS) and interaction information from the readers that used the tool for annotating the CT volumes. The experienced usability and the way the users interacted with the system differed between the users. A higher SUS score was given by users that prioritized learning speed over model accuracy and spent less time on manual labeling, instead utilizing the suggestions provided by the GUI. An analysis of the annotation variations between the readers shows substantial agreement (Cohen's kappa=0.69) for the classification of healthy and affected lung parenchyma in pulmonary fibrosis. The inter-reader variation is a challenge for the definition of ground truth.


1 Introduction

The application of machine learning in radiology has gained a lot of interest in recent years in various fields of medical image analysis [1]. In computed tomography (CT) imaging, a rotating x-ray tube and detector are used to acquire thin-slice images of different parts of the body. One particular application, high resolution CT of the lungs (HRCT), has during the last decades emerged as the most important imaging modality for interstitial lung diseases, including pulmonary fibrosis [2, 3].

The radiologists analyze the HRCT images through visual assessment, by identifying the presence and distribution of pathological patterns in the lung parenchyma. The radiologists are therefore interested both in identifying certain patterns (classification) and in assessing the distribution of the patterns (segmentation).

Semantic segmentation is the task of classifying all pixels in an image and can solve both the classification and the segmentation task. Various networks have recently been shown to be successful as image decoders for solving the semantic segmentation task, such as the variational auto-encoder [4] and the generative adversarial network (GAN) [5].

However, these models have a high number of parameters and require a large amount of labeled data. Acquiring labeled HRCT data is costly, time-consuming, and requires the expertise of radiologists. In this study, we instead present an interactive graphical user-interface (GUI), including a convolutional auto-encoder, that aims to assist the radiologist in the process of annotating CT images in order to speed up the labeling process.

The inter-reader variation in the interpretation of HRCT is a well known issue in visual analysis [6, 7]. This is also the rationale for the development of quantitative tools for HRCT analysis [8, 9, 10]. Quantitative analysis of pulmonary fibrosis is based on the delineation of affected and healthy lung parenchyma, which requires a definable border between healthy and affected parenchyma to be present. However, to the best of our knowledge, there is no previous study that quantifies the inter-reader variability in the delineation of healthy and affected lung parenchyma in HRCT images.

The first purpose of the present study is therefore to evaluate the usefulness of the interactive GUI for the labeling process, and the second purpose is to quantify the reader variations in the delineation of healthy lung and reticular pattern in HRCT using the developed GUI.

2 Related Work

There has been a large amount of effort put into developing tools for annotating images [11] and videos [12, 13] for computer vision applications and research. Many different strategies have been explored in order to reduce the manual human effort and required annotation time. The work by [14] proposes sparse labeling of a video, where the tool suggests labels for the in-between frames based on the current detector and physical constraints on target motions.

Semi-automatic tools that provide hotkey-shortcuts, smart drawing tools, drag-and-drop, and integrated computer vision segmentation algorithms allow humans to annotate more efficiently [15]. The use of semi-supervised learning (see [16] for an overview) makes use of the large amount of available unlabelled data and has been used for object recognition [17, 18] and in an active learning setting [19, 20].

The use of transfer learning [21, 22] allows pre-trained models to share knowledge and provide suggestions during the human supervised annotation of new data. The work by [23] uses transfer learning with a pre-trained RetinaNet model for an object detection annotation tool.

The authors in [24] and [25] have developed methods that effectively rank the query images and prioritize images that give the best trade-off between human effort and information gain, in order to reduce the needed human intervention.

An alternative to using AI tool-assisted programs is crowd-sourced annotation [26, 27], which has been used for videos [28, 29], object detection [30], and energy data [31] with the use of web-based frameworks or online games.

In this work, we focus on an interactive, semi-automatic annotation tool for semantic segmentation of CT-scans. Semi-automatic annotation has previously been explored on other datasets such as cooking videos [32] and soccer videos [33].

Another annotation tool for object detection in videos that has been proposed is the interactive Video Annotation Tool, iVAT [34], which supports manual, semi-automatic, and automatic annotations. The tool can train a model from a list of computer vision algorithms on already labeled data in an incremental learning framework. The work uses a quantitative evaluation and a system usability study. Similarly, the work by [35] uses active learning to query the user for corrections and shows that the amount of human effort, measured in the number of user interactions, is reduced.

While these works use active learning and semi-automatic setups, they are not strictly interactive since the user needs to wait for a complete training pass before getting feedback. An interactive object annotation method has been proposed by [36], but it is more focused on incrementally training the object detector while the user provides annotations in order to reduce human annotation time rather than on learning performance.

Many interactive segmentation methods for images have been proposed [37, 38, 39], and specifically for medical image segmentation [40, 41].

Our work differs from these previous works in a number of ways. We focus on full semantic segmentation of the data and not on annotating bounding boxes for object detection or segmentation in medical images. There is also a larger focus on evaluating the experienced system usability and user interaction, and on relating the inter- and intra-reader variation to the perceived usability of the system. The focus on interactivity and continuously provided feedback allows us to examine the differences in how users used the system and their preferred trade-off between training speed and accuracy in the provided feedback.

3 Materials and methods

3.1 Interactive GUI for annotation

An interactive graphical user-interface (GUI) was developed for this study, see Figure 1. The GUI works as an annotation tool with which the user can annotate CT volumes efficiently, and it is designed with a radiologist's feedback to resemble the look and functionality of traditional software used by radiologists. In addition, the GUI contains a fast classifier that is concurrently trained in the background to make predictions during the labeling process, which gradually improve as more pixels become labeled. The GUI also stores information about how the user is using the interface for evaluation purposes. The user can draw with different brush sizes in the 2D scans or fill polygons in 2D and 3D. The learning process is iterative and the model is reset for each new user, so that each user gets a custom-trained model based only on their own annotations.
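
To make the interaction concrete, the sketch below illustrates one background training round on the pixels labeled so far. This is a hypothetical PyTorch sketch, not the paper's implementation; the convention of marking unlabeled pixels with -1, the number of gradient steps, and the function name interactive_update are our assumptions.

```python
import torch
import torch.nn.functional as F

def interactive_update(model, optimizer, ct_slice, label_map, steps=5):
    """One background training round on the pixels labeled so far.

    ct_slice: (H, W) float tensor with the gray-scale CT slice.
    label_map: (H, W) long tensor; -1 marks unlabeled pixels (assumption).
    """
    x = ct_slice.unsqueeze(0).unsqueeze(0)      # (1, 1, H, W)
    y = label_map.unsqueeze(0).long()           # (1, H, W)
    for _ in range(steps):
        optimizer.zero_grad()
        logits = model(x)                       # (1, num_classes, H, W)
        # Only the user-labeled pixels contribute to the loss
        loss = F.cross_entropy(logits, y, ignore_index=-1)
        loss.backward()
        optimizer.step()
    with torch.no_grad():                       # suggestions shown in the GUI
        probs = torch.softmax(model(x), dim=1)
    return probs.squeeze(0)                     # (num_classes, H, W)
```

In the GUI, the returned per-class probabilities would then be thresholded with the draggable slider before being shown as suggestions.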

Figure 1: GUI used for annotating and training the model. The CT data and the labels are shown in the annotation area. The user labels data by selecting a class and a brush size and marking the data. The predictions from the model are shown in the prediction area. If the user is satisfied with the output from the model, the predictions can be added to the annotation area, where corrections can be made where necessary. The user can change slices by clicking the current slice area or by using the mouse wheel. The user can define the model's hyperparameters, such as the number of classes, filter dimensions, pooling size, number of layers, and number of filters in each layer, in the model architecture area. Showing annotations and/or raw data can be toggled. Predictions below a threshold controlled by a draggable slider can be hidden.

3.2 Convolutional Auto-encoder

Figure 2: Overview of one layer of a Convolutional Auto-encoder (CAE), which consists of an encoder and a decoder. The input is a slice of a CT scan and the output is an image of the classified pixels.

A Convolutional Auto-encoder (CAE) [42] is used to perform the classification of the CT images and consists of an encoder and a decoder, see Figure 2. The encoder calculates the k-th feature map in the convolutional layer as:

$$h^k = f\left(x \ast W^k + b^k\right) \qquad (1)$$

where $x$ is the gray-scale input image, $W^k$ is the k-th filter between the input and the convolutional layer, $b^k$ is the bias for the k-th filter, $\ast$ is the convolution operation, and $f(\cdot)$ is the Rectified Linear Unit (ReLU) [43] activation function. Downsampling of the convolutional layer is then performed by taking the maximum value in each non-overlapping subregion.

The decoder transforms the representation from the encoder back to the original size by first unpooling the representation and then performing a deconvolution operation as:

$$y = \sum_k \tilde{h}^k \ast \tilde{W}^k + c \qquad (2)$$

where $\tilde{h}^k$ is the unpooled k-th feature map, $\tilde{W}^k$ is the k-th filter between the unpooling layer and the output $y$, and $c$ is the bias for the output layer.

The architecture for the CAE used in this work was a one-layer CAE with 10 filters, a filter size of 15, and a pooling dimension of 2. These design choices are motivated by the aim of achieving a fast classifier suitable for interactive use.
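
For illustration, a minimal sketch of such a one-layer CAE with the stated hyperparameters (10 filters, filter size 15, pooling size 2) could look as follows in PyTorch; the framework choice, the padding, and the absence of an output activation are our assumptions, since the paper does not specify the implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OneLayerCAE(nn.Module):
    """Sketch of a one-layer convolutional auto-encoder for pixel-wise
    classification (encoder: Eq. 1 + max pooling; decoder: unpooling + Eq. 2)."""

    def __init__(self, num_classes=3, num_filters=10, kernel_size=15, pool=2):
        super().__init__()
        pad = kernel_size // 2                        # keep spatial size (assumption)
        self.enc_conv = nn.Conv2d(1, num_filters, kernel_size, padding=pad)
        self.pool = nn.MaxPool2d(pool, return_indices=True)
        self.unpool = nn.MaxUnpool2d(pool)
        self.dec_conv = nn.Conv2d(num_filters, num_classes, kernel_size, padding=pad)

    def forward(self, x):                             # x: (N, 1, H, W)
        h = F.relu(self.enc_conv(x))                  # Eq. (1)
        p, idx = self.pool(h)                         # downsampling
        u = self.unpool(p, idx, output_size=h.shape)  # back to input resolution
        return self.dec_conv(u)                       # per-pixel class scores, Eq. (2)

# Usage on a single gray-scale slice (hypothetical size)
model = OneLayerCAE()
logits = model(torch.randn(1, 1, 256, 256))           # -> (1, 3, 256, 256)
prediction = logits.argmax(dim=1)                      # per-pixel class labels
```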

3.3 Image Data

Twelve scans, demonstrating pulmonary fibrosis with reticular pattern and honeycombing, were selected from a retrospectively created databank consisting of high resolution CT of the lungs (HRCT). The images were acquired with Siemens Somatom Definition AS (n=6), Siemens Somatom Definition Flash (n=4), and Siemens Biograph (n=2) scanners. Images were reconstructed as contiguous 1 mm slices using a high spatial resolution kernel (B70f or I70f1). Acquisition parameters were 120 kVp (n=9), 100 kVp (n=2), 140 kVp (n=1), and standard chest reference mAs settings (ref-mAs 150-160, ref-kV 120).

3.4 Image Review

Two radiologists (seven and five years of experience in thoracic radiology) and four radiology residents participated in the study. The study was performed as two separate experiments. In the first reading, in which all readers participated, the user interface was evaluated in addition to the inter- and intra-reader variation. The purpose of the second reading, in which only the two thoracic radiologists participated, was to obtain additional data for the inter-reader variation.

The task in the experiments was to completely annotate four specific slices, spaced 1 cm apart, through the middle part of the lungs into the three categories: normal lung parenchyma; reticular pattern, including honeycombing; and non-pulmonary tissue. The same image slices were annotated by all readers. Structures and air outside the body were not annotated.

Immediately before the first reading, the readers completed a tutorial designed to guide the reader through the central and necessary functions of the interactive GUI. One reader (reader 6) only completed a written, unsupervised tutorial, and one reader (reader 3) was already familiar with the GUI. The other readers completed a supervised written tutorial, which took approximately 30 minutes.

In the first reading, each reader annotated four slices in four CT stacks. The first stack was repeated as stack number four for the analysis of intra-reader variation. Some readers completed the annotation during a single session, while others completed the annotation over multiple sessions. After completing the annotation task, each reader received the standardized System Usability Scale (SUS) questionnaire.

In the second reading, the two thoracic radiologists performed an additional annotation task on nine more CT-stacks. The task in the second reading was identical to the first reading. Since the purpose of the second reading was to obtain inter-reader variation data on a larger number of subjects, the usability of the system was not assessed.

3.5 Inter- and intrareader variability analysis

The inter- and intra-reader variations in the annotations between the six readers in the first reading were evaluated on a pixelwise basis and were quantified with Cohen's kappa and the Jaccard index (Intersection over Union), which were computed for each pair of readers using all slices in the first three CT stacks. Cohen's kappa was computed for the three-class classification (healthy parenchyma, reticular pattern, or non-pulmonary tissue) and for the two-class classification (healthy parenchyma or reticular pattern).

The intra-reader analysis was performed for each reader using the annotations made on the identical first and fourth CT stacks. Pixelwise Cohen's kappa and Jaccard index were computed similarly to the inter-reader analysis for each reader, with 95% confidence intervals.

For the two readers that performed the second annotation task, the inter-reader variations were correspondingly quantified using all twelve CT stacks.
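
As an illustration, the pairwise agreement measures could be computed along the following lines. This is a minimal sketch using scikit-learn; the class coding, the handling of non-pulmonary pixels in the two-class case, and the data layout are our assumptions, not the study's exact analysis code.

```python
import numpy as np
from itertools import combinations
from sklearn.metrics import cohen_kappa_score, jaccard_score

def reader_agreement(labels_a, labels_b, two_class=False):
    """Pixel-wise agreement between two readers' label maps.
    Assumed coding: 0 = normal parenchyma, 1 = reticular pattern, 2 = non-pulmonary."""
    a, b = labels_a.ravel(), labels_b.ravel()
    if two_class:                                   # drop non-pulmonary pixels (assumption)
        keep = (a != 2) & (b != 2)
        a, b = a[keep], b[keep]
    kappa = cohen_kappa_score(a, b)
    jaccard_per_class = jaccard_score(a, b, average=None)
    return kappa, jaccard_per_class

def pairwise_inter_reader(annotations):
    """annotations: dict mapping reader id -> stacked label maps over all slices."""
    return {
        (r1, r2): reader_agreement(annotations[r1], annotations[r2])
        for r1, r2 in combinations(sorted(annotations), 2)
    }
```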

The Regional Research Ethics Board approved the study protocol and waived the informed consent requirement for images in the data bank.

4 Experimental results

4.1 GUI evaluation

The usability of the GUI was evaluated using the System Usability Scale (SUS) [44]. The survey consists of 10 statements graded on a Likert scale from 1 (strongly disagree) to 5 (strongly agree). The combination of the answers measures the user's effectiveness, efficiency, and satisfaction; a final score between 0 and 100 is calculated and converted to a letter grade for easier interpretation [45].
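
For reference, a SUS score can be computed from the ten Likert answers with the standard SUS scoring [44], as in the sketch below; the example answers are hypothetical.

```python
def sus_score(answers):
    """Standard SUS scoring: odd-numbered items contribute (answer - 1),
    even-numbered items contribute (5 - answer); the sum is scaled by 2.5."""
    assert len(answers) == 10 and all(1 <= a <= 5 for a in answers)
    contributions = [
        (a - 1) if i % 2 == 0 else (5 - a)          # i = 0 corresponds to item 1
        for i, a in enumerate(answers)
    ]
    return 2.5 * sum(contributions)                 # final score in [0, 100]

# Hypothetical reader answers to the 10 statements
print(sus_score([4, 2, 4, 2, 5, 1, 4, 2, 5, 3]))    # -> 80.0
```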

Table 1 shows the survey results from the six readers in this study. A SUS-score above 68 (grade D) is considered above average. Half of the users gave a SUS-score below this usability threshold and the other half gave a higher score, indicating disagreement among the readers regarding the usability of the system.

Reader    SUS-score    Grade
#1        58           F
#2        75           C
#3        73           C
#4        53           F
#5        68           D
#6        40           F
Average

Table 1: Calculated SUS-scores from six readers and their respective grades (A-F).

The results can further be analysed by examining the answers from the individual questions in the SUS survey. Figure 3 shows a box plot of the answers to each question in the SUS-survey.

Figure 3: A box plot of the answers from the six readers to the 10 questions in the SUS survey. The boxes span the 25th to 75th percentiles, dashed lines are the min and max values, circles indicate the median, and the plus symbol indicates an outlier.

Most users answered that they would use this system frequently (1), except for one user, who thought that the system was not easy to use (3). The other users rated the system somewhere between easy and difficult to use (2), (4), and (8). Most users agreed that the functions in the system were well integrated and felt confident using the system (5), (9).

4.2 User interaction evaluation

The system logged the total labeling time, the percentage of data labeled by the user, and the amount of manual work in terms of initial labeling and corrections; see Figure 4. All users labeled the CT stacks in the same order (CT stack 1, 2, 3, and then 1 again). The model started from random initialization and was only reset when a new user started using the GUI. As a result, each user got a personalized model trained only on that user's own annotated data.

Figure 4: (a) Training time for each user for each stack. (b) Amount of manual labeling in terms of the number of brush strokes for each user, colored by class, for the first stack. (c) Percentage of the data that was labeled by the user for the first stack.

Figure 4(a) shows how long the model was trained for each user for each stack. The training time is highest for the first stack and decreases for each subsequent stack, except for user 3, due to interruptions during labeling. The average training time for all users, except user 3, was for the first and last stack minutes and minutes, respectively.

Figure 4(b) shows the number of brush strokes for each user on the first stack, colored by class. User 4 used the most brush strokes compared to the other users. The average proportional amounts of brush strokes over all users and all stacks are , , , for normal parenchyma, reticular pattern, and non-pulmonary tissue, respectively. This means that most manual work is spent on labeling normal parenchyma and reticular pattern, which are the most difficult classes to differentiate.

Figure 4(c) shows how large a percentage of the first stack was labeled by the user. All users labeled on average of the data for the first stack and the GUI labeled the rest. User 3 labeled the least amount with and user 6 the most with .

4.3 Inter- and intrareader variation

Quantitative measures for the inter- and intrareader variations are shown in Table 2 and Table 3. As anticipated, the agreement on the delineation of non-pulmonary tissue is excellent with a Jaccard index of close to 1. The variation in labels between the readers was consequently related to the delineation between healthy and affected lung parenchyma.

Cohen’s kappa analysis showed excellent agreement (kappa=0.86) between readers for the three-class analysis. When the non-pulmonary tissue, which is diagnostically unproblematic, was removed from the analysis, the agreement was substantial (kappa = 0.69).

                                            Normal            Reticular         Non-pulmonary
                                            parenchyma        pattern           tissue
Inter-reader variation (6 readers, 3 CTs)
mean (95% CI) [range]                       0.81 (0.79-0.82)  0.58 (0.55-0.60)  0.94 (0.92-0.95)
                                            [0.75-0.84]       [0.49-0.65]       [0.89-0.97]
Inter-reader variation (2 readers, 12 CTs)  0.76              0.58              0.97
Intra-reader variation (6 readers, 1 CT)
mean (95% CI) [range]                       0.82 (0.76-0.87)  0.62 (0.47-0.68)  0.94 (0.87-1.00)
                                            [0.77-0.88]       [0.47-0.68]       [0.81-0.97]

Table 2: Inter- and intra-reader variation, Jaccard index.
                                            Three-class        Two-class
                                            Cohen's kappa      Cohen's kappa
Inter-reader variation (6 readers, 3 CTs)
mean (95% CI) [range]                       0.86 (0.84-0.87)   0.69 (0.67-0.71)
                                            [0.81-0.88]        [0.61-0.75]
Inter-reader variation (2 readers, 12 CTs)  0.85               0.65
Intra-reader variation (6 readers, 1 CT)
mean (95% CI) [range]                       0.87 (0.83-0.90)   0.72 (0.65-0.79)
                                            [0.80-0.90]        [0.62-0.80]

Table 3: Intra- and inter-reader variations, Cohen's kappa.

In the second reading, where two readers provided labels on twelve CTs, the reader variations were similar to those in the first reading, with excellent agreement in the three-class analysis and substantial agreement in the two-class analysis, see Table 2 and Table 3. The analysis of intra-reader variations also showed Jaccard index and Cohen's kappa values similar to those in the inter-reader analysis. However, the numbers are not entirely comparable, since the intra-reader variation was analyzed on just one CT stack, while the inter-reader variations were computed using annotation data from three or twelve different CT stacks.

The inter-reader variations in the delineation of healthy and reticular lung parenchyma can to a large extent be explained by each reader's personal cut-off on the continuous spectrum between clearly healthy lung parenchyma and typical reticular pattern.

Figure 5 shows an example of the delineation by six different readers on one of the annotated slices.

Figure 5: Detail from one annotated CT slice with the corresponding labelling by six different readers (red - reticular pattern; green - normal parenchyma; blue - non-pulmonary tissue).

4.4 Model classification accuracy

The capacity of the model determines the trade-off between accuracy and training time. During the labeling process, the model should aim to reduce the training time by using a small model at the expense of classification accuracy. Once enough data has been labeled, a larger model can be trained to achieve higher performance. In this section we compare the classification accuracy between models of different sizes. The training data are slices 5, 15, and 25 from each of the three stacks. The test data is slice 35 from each of the three stacks. The ground truth is taken as the mode of the annotations from all 6 readers. Table 4 shows how the total accuracy increases with larger model sizes. The F1-score for each class is also shown and increases for all classes with an increased number of layers. The most difficult class to classify is reticular pattern, followed by normal parenchyma. The easiest class to classify is non-pulmonary tissue. The F1-score is a measure of the balance between precision and recall and is calculated as:

$$F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \qquad (3)$$
layers   Accuracy [%]   F1-score            Training time [h]
1        88.4           0.79, 0.69, 0.91    1
3        91.4           0.85, 0.75, 0.93    2
5        93.3           0.87, 0.83, 0.96    7

Table 4: Classification accuracy, F1-score per class (normal parenchyma, reticular pattern, and non-pulmonary tissue), and training time for different numbers of layers in the classifier.
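
For reference, the total accuracy and per-class F1-scores in Table 4 can be computed from a predicted label map and the consensus (mode over readers) ground truth along these lines; this is a sketch using scikit-learn, and the helper names and array shapes are our assumptions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def consensus_labels(reader_maps):
    """Pixel-wise mode over the readers' label maps, used as ground truth.
    Ties are resolved towards the lowest class index (assumption)."""
    stacked = np.stack(reader_maps)                           # (readers, H, W)
    n_classes = int(stacked.max()) + 1
    votes = np.stack([(stacked == c).sum(axis=0) for c in range(n_classes)])
    return votes.argmax(axis=0)                               # (H, W)

def evaluate(pred_map, gt_map):
    """Total accuracy and per-class F1 (normal parenchyma, reticular pattern,
    non-pulmonary tissue), as reported in Table 4."""
    y_true, y_pred = gt_map.ravel(), pred_map.ravel()
    return accuracy_score(y_true, y_pred), f1_score(y_true, y_pred, average=None)
```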

5 Discussion

Acquisition of labeled image data is a key issue for the development of machine learning methods in medical imaging. The acquisition process is costly and time consuming, especially in medical imaging, where labels need to be provided by expert medical readers [46]. In the present study, we propose an interactive GUI using a convolutional auto-encoder for pixelwise labelling of CT volumes, and we use the GUI to compute the inter- and intra-reader variations in the delineation of pathology.

The definition of ground truth is crucial for the results of neural network training, and in semantic segmentation the labels need to be provided pixelwise. In the present study, the delineation between healthy and affected (reticular) lung parenchyma in HRCT in patients with pulmonary fibrosis was studied [2]. The analysis of inter-reader variation showed that the different readers had different visual cut-offs between affected and healthy lung parenchyma, see Figure 5. The inter-reader variation in the interpretation of medical images, including HRCT in idiopathic pulmonary fibrosis, has been analyzed before [6, 7]. While previous studies focus on a more general view, the present study demonstrates the inter-reader variation in the detailed delineation of healthy and affected parenchyma in the images.

The inter-reader variations were quantified in two experimental designs: between several readers on a small number of CT stacks and between two readers on a larger number of examinations. As detailed in Table 2 and Table 3, the quantified inter-reader variation was consistent in the two experiments. A visual analysis of the labels provided by the two radiologists in the second reading shows that the spatial distribution of the provided labels is similar between the readers, see Figure 6. The difference in the provided labels, quantified in the inter-reader analysis, is essentially the position of the boundary, not the location of the affected areas.

Figure 6: Example of reader variations in one CT stack in the second reading. Green, red, blue - labelled as healthy lung, affected lung and non-pulmonary tissue. Yellow - Labelled normal by reader A and affected by reader B. Cyan - Labelled affected by reader A and normal by reader B.

The different cut-offs between the readers for delineating affected parenchyma in pulmonary fibrosis suggest that there is no discrete border between healthy and affected parenchyma, but rather a continuous range between healthy and clearly pathologic lung pattern. This finding has important implications for volumetric analysis of pathology in HRCT: First, an objective, reader-independent method is necessary for any study involving volumetric analysis of HRCT. Second, a thorough validation of any reader-independent method is necessary to ensure consistency. Third, a perfect reader-independent method is not reasonably achievable.

A strength of the study is the inclusion of several radiologists and radiology residents who provided detailed annotations in the included slices. Detailed annotation of scattered findings such as reticular parenchyma, as demonstrated in Figure 6, is time-consuming, and to preserve the detailed annotation we intentionally limited the number of slices per examination.

In the interactive labelling phase, learning speed was preferred over accuracy, which motivated the use of a single-layer network. The development of more complex models after acquisition of the labeled data verified that, when training time is not crucial, a better performing system can be achieved. The study thus demonstrated how the machine learning algorithms used in the labelling and those used for final model development can be separated depending on their purpose.

The way that the users interacted with the system, and the perceived usability of the system, differed between the users. Users that spent less time on manual labeling and utilized the predictions from the classifier reported a higher SUS usability score. For future work, the system needs to be simplified to make it easier to use, and further optimized for performance and faster training.

There are some limitations in the present study. The readers, who were included from a single radiology department, only received a short training session before performing the test. With more familiarity, the interaction with the GUI may change. Although the delineation task and annotation classes [2] were clearly defined for the readers, the design of the GUI may, just like any other tool, influence the labels provided by readers, which can affect the measured inter-reader variability.

6 Conclusion

The present study demonstrated how a fast convolutional auto-encoder can be used interactively when pixel-wise labels of CT-scans are acquired, while a more complex model can be used when training time is not an issue. The inter-reader variability is an obstacle for the definition of ground truth, but also motivates the development of AI tools that may improve quantitative image analysis in HRCT.

Acknowledgements

This work has been sponsored by Nyckelfonden (grant OLL-597511) and by Vinnova under the project Interactive Deep Learning for 3D image analysis (2016-04915).

References

  • [1] Geert Litjens, Thijs Kooi, Babak Ehteshami Bejnordi, Arnaud Arindra Adiyoso Setio, Francesco Ciompi, Mohsen Ghafoorian, Jeroen A.W.M. van der Laak, Bram van Ginneken, and Clara I. Sánchez. A survey on deep learning in medical image analysis. Medical Image Analysis, 42(Supplement C):60 – 88, 2017.
  • [2] David M. Hansell, Alexander A. Bankier, Heber MacMahon, Theresa C. McLoud, Nestor L. Müller, and Jacques Remy. Fleischner Society: Glossary of Terms for Thoracic Imaging. Radiology, 246(3):697–722, mar 2008.
  • [3] Ganesh Raghu, Harold R Collard, Jim J Egan, Fernando J Martinez, Juergen Behr, Kevin K Brown, Thomas V Colby, Jean-François Cordier, Kevin R Flaherty, and Joseph A Lasky. An official ATS/ERS/JRS/ALAT statement: idiopathic pulmonary fibrosis: evidence-based guidelines for diagnosis and management. American journal of respiratory and critical care medicine, 183(6):788–824, 2011.
  • [4] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. stat, 1050:10, 2014.
  • [5] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc., 2014.
  • [6] Simon L F Walsh, Lucio Calandriello, Nicola Sverzellati, Athol U Wells, David M Hansell, and UIP Observer Consort. Interobserver agreement for the ATS/ERS/JRS/ALAT criteria for a UIP pattern on CT. Thorax, 71(1):45–51, jan 2016.
  • [7] Takeyuki Watadani, Fumikazu Sakai, Takeshi Johkoh, Satoshi Noma, Masanori Akira, Kiminori Fujimoto, Alexander A. Bankier, Kyung Soo Lee, Nestor L. Müller, Jae-Woo Song, Jai-Soung Park, David A. Lynch, David M. Hansell, Martine Remy-Jardin, Tomás Franquet, and Yukihiko Sugiyama. Interobserver Variability in the CT Assessment of Honeycombing in the Lungs. Radiology, 266(3):936–944, mar 2013.
  • [8] Brian J Bartholmai, Sushravya Raghunath, Ronald A Karwoski, Teng Moua, Srinivasan Rajagopalan, Fabien Maldonado, Paul A Decker, and Richard A Robb. Quantitative computed tomography imaging of interstitial lung diseases. Journal of thoracic imaging, 28(5):298–307, sep 2013.
  • [9] Joseph Jacob, Brian J. Bartholmai, Srinivasan Rajagopalan, Maria Kokosi, Arjun Nair, Ronald Karwoski, Sushravya M. Raghunath, Simon L F Walsh, Athol U. Wells, and David M. Hansell. Automated Quantitative Computed Tomography Versus Visual Computed Tomography Scoring in Idiopathic Pulmonary Fibrosis: Validation Against Pulmonary Function. Journal of thoracic imaging, 31(5):304–11, sep 2016.
  • [10] Stephen M. Humphries, Kunihiro Yagihashi, Jason Huckleberry, Byung-Hak Rho, Joyce D. Schroeder, Matthew Strand, Marvin I. Schwarz, Kevin R. Flaherty, Ella A. Kazerooni, Edwin J. R. van Beek, and David A. Lynch. Idiopathic Pulmonary Fibrosis: Data-driven Textural Analysis of Extent of Fibrosis at Baseline and 15-Month Follow-up. Radiology, 285(1):270–278, 2017.
  • [11] Jenny Yuen, Bryan Russell, Ce Liu, and Antonio Torralba. Labelme video: Building a video database with human annotations. In Computer Vision, 2009 IEEE 12th International Conference on, pages 1451–1458. IEEE, 2009.
  • [12] David Mihalcik and David Doermann. The design and implementation of viper. University of Maryland, pages 234–241, 2003.
  • [13] Amol Ambardekar, Mircea Nicolescu, and Sergiu Dascalu. Ground truth verification tool (gtvt) for video surveillance systems. In 2009 Second International Conferences on Advances in Computer-Human Interactions, pages 354–359. IEEE, 2009.
  • [14] Karim Ali, David Hasler, and François Fleuret. Flowboost - appearance learning from sparsely annotated video. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1433–1440. IEEE, 2011.
  • [15] Isaak Kavasidis, Simone Palazzo, Roberto Di Salvo, Daniela Giordano, and Concetto Spampinato. A semi-automatic tool for detection and tracking ground truth generation in videos. In Proceedings of the 1st International Workshop on Visual Interfaces for Ground Truth Collection in Computer Vision Applications, page 6. ACM, 2012.
  • [16] Xiaojin Zhu. Semi-supervised learning literature survey. Computer Science, University of Wisconsin-Madison, 2(3):4, 2006.
  • [17] Rob Fergus, Yair Weiss, and Antonio Torralba. Semi-supervised learning in gigantic image collections. In Advances in neural information processing systems, pages 522–530, 2009.
  • [18] Yong Jae Lee and Kristen Grauman. Learning the easy things first: Self-paced visual category discovery. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1721–1728. IEEE, 2011.
  • [19] Zhi-Hua Zhou, Ke-Jia Chen, and Hong-Bin Dai. Enhancing relevance feedback in image retrieval using unlabeled data. ACM Transactions on Information Systems (TOIS), 24(2):219–244, 2006.
  • [20] Ajay J Joshi, Fatih Porikli, and Nikolaos Papanikolopoulos. Breaking the interactive bottleneck in multi-class classification with active selection and binary feedback. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 2995–3002. IEEE, 2010.
  • [21] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345–1359, 2010.
  • [22] Christoph H Lampert, Hannes Nickisch, and Stefan Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 951–958. IEEE, 2009.
  • [23] Viraj Mavani. Anno-mage: A semi automatic image annotation tool. https://github.com/virajmavani/semi-auto-image-annotation-tool/, 2018.
  • [24] Sudheendra Vijayanarasimhan and Kristen Grauman. Cost-sensitive active visual category learning. International Journal of Computer Vision, 91(1):24–44, 2011.
  • [25] Sudheendra Vijayanarasimhan and Kristen Grauman. Large-scale live active learning: Training object detectors with crawled data and crowds. International Journal of Computer Vision, 108(1-2):97–114, 2014.
  • [26] Alexander Sorokin and David Forsyth. Utility data annotation with amazon mechanical turk. In Computer Vision and Pattern Recognition Workshops, 2008. CVPRW’08. IEEE Computer Society Conference on, pages 1–8. IEEE, 2008.
  • [27] Peter Welinder and Pietro Perona. Online crowdsourcing: rating annotators and obtaining cost-effective labels. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on, pages 25–32. IEEE, 2010.
  • [28] Carl Vondrick, Donald Patterson, and Deva Ramanan. Efficiently scaling up crowdsourced video annotation. International Journal of Computer Vision, 101(1):184–204, 2013.
  • [29] Isaak Kavasidis, Simone Palazzo, Roberto Di Salvo, Daniela Giordano, and Concetto Spampinato. An innovative web-based collaborative platform for video annotation. Multimedia Tools and Applications, 70(1):413–432, 2014.
  • [30] Isaak Kavasidis, Concetto Spampinato, and Daniela Giordano. Generation of ground truth for object detection while playing an online game: Productive gaming or recreational working? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 694–699, 2013.
  • [31] H. Cao, T. K. Wijaya, K. Aberer, and N. Nunes. A collaborative framework for annotating energy datasets. In 2015 IEEE International Conference on Big Data (Big Data), pages 2716–2725, Oct 2015.
  • [32] Simone Bianco, Gianluigi Ciocca, Paolo Napoletano, Raimondo Schettini, Roberto Margherita, Gianluca Marini, Giorgio Gianforme, and Giuseppe Pantaleo. A semi-automatic annotation tool for cooking video. In Image Processing: Machine Vision Applications VI, volume 8661, page 866112. International Society for Optics and Photonics, 2013.
  • [33] T. D’Orazio, M. Leo, N. Mosca, P. Spagnolo, and P. L. Mazzeo. A semi-automatic system for ground truth generation of soccer video sequences. In 2009 Sixth IEEE International Conference on Advanced Video and Signal Based Surveillance, pages 559–564, Sep. 2009.
  • [34] Simone Bianco, Gianluigi Ciocca, Paolo Napoletano, and Raimondo Schettini. An interactive tool for manual, semi-automatic and automatic video annotation. Computer Vision and Image Understanding, 131:88 – 99, 2015.
  • [35] Carl Vondrick and Deva Ramanan. Video annotation and tracking with active learning. In Advances in Neural Information Processing Systems, pages 28–36, 2011.
  • [36] Angela Yao, Juergen Gall, Christian Leistner, and Luc Van Gool. Interactive object detection. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 3242–3249. IEEE, 2012.
  • [37] Yuri Y Boykov and M-P Jolly. Interactive graph cuts for optimal boundary & region segmentation of objects in nd images. In Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE International Conference on, volume 1, pages 105–112. IEEE, 2001.
  • [38] Antonio Criminisi, Toby Sharp, and Andrew Blake. Geos: Geodesic image segmentation. In European Conference on Computer Vision, pages 99–112. Springer, 2008.
  • [39] Martin Rajchl, Matthew CH Lee, Ozan Oktay, Konstantinos Kamnitsas, Jonathan Passerat-Palmbach, Wenjia Bai, Mellisa Damodaram, Mary A Rutherford, Joseph V Hajnal, and Bernhard Kainz. Deepcut: Object segmentation from bounding box annotations using convolutional neural networks. IEEE Transactions on Medical Imaging, 36(2):674–683, 2017.
  • [40] Feng Zhao and Xianghua Xie. An overview of interactive medical image segmentation. Annals of the BMVA, 2013(7):1–22, 2013.
  • [41] Guotai Wang, Wenqi Li, Maria A Zuluaga, Rosalind Pratt, Premal A Patel, Michael Aertsen, Tom Doel, Anna L David, Jan Deprest, and Sébastien Ourselin. Interactive medical image segmentation using deep learning with image-specific fine-tuning. IEEE Transactions on Medical Imaging, 2018.
  • [42] J. Masci, U. Meier, D. Cireşan, and J. Schmidhuber. Stacked convolutional auto-encoders for hierarchical feature extraction. Artificial Neural Networks and Machine Learning - ICANN 2011, pages 52–59, 2011.
  • [43] V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann machines. In Proc. 27th Int. Conf. on Machine Learning (ICML-10), pages 807–814, 2010.
  • [44] John Brooke. SUS - A quick and dirty usability scale Usability and context. In Patrick W Jordan, Bruce Thomas, Bernad A Weerdmeester, and Ian L McClelland, editors, Usability Evaluation in Industry, pages 189–194. Taylor & Francis, 1996.
  • [45] Aaron Bangor, Philip Kortum, and James Miller. Determining what individual sus scores mean: Adding an adjective rating scale. Journal of usability studies, 4(3):114–123, 2009.
  • [46] Garry Choy, Omid Khalilzadeh, Mark Michalski, Synho Do, Anthony E Samir, Oleg S Pianykh, J Raymond Geis, Pari V Pandharipande, James A Brink, and Keith J Dreyer. Current applications and future impact of machine learning in radiology. Radiology, 288(2):318–328, 2018.