The limited availability and size of labeled datasets for training machine learning algorithms is a common problem in medical image analysis (Greenspan et al., 2016; Litjens et al., 2017; Cheplygina et al., 2018). In several other fields, crowdsourcing, defined as the outsourcing of tasks to a crowd of individuals (Howe, 2006), has been found effective for labeling large quantities of data. For example, in computer vision, crowdsourcing has been used to annotate large datasets of images and videos with various tags (Kovashka et al., 2016).
Due to the success of crowdsourcing, several researchers have recently applied these techniques to the annotation of medical images. Although such images present specific challenges, including absence of expertise of the crowd, several early papers such as (Mitry et al., 2013; Mavandadi et al., 2012; Maier-Hein et al., 2014a) have demonstrated promising results. Despite the growing interest, there has not been an overview of the work in this field. In this paper we summarize existing literature on crowdsourcing in medical imaging.
This paper originated during the Lorentz workshop “Crowdsourcing in medical image analysis” in June 2018 (https://www.lorentzcenter.nl/lc/web/2018/967/info.php3?wsid=967&venue=Snellius). As participants of the workshop, we searched Google Scholar with the query “crowdsourcing AND (medical or biomedical)” and screened the results for papers focusing on the topic. Google Scholar was selected because previous papers have highlighted the poor indexing of the topic in databases and the high prevalence of crowdsourcing papers in conferences (Wazny, 2017). Additional papers were identified for inclusion by examining the references and citations of selected papers. We only included papers where the crowd was involved in the analysis of medical or biomedical images, for example by annotating them. Our search strategy resulted in 55 papers. Key terms emerging from these studies are illustrated in Fig. 1. Five key dimensions were identified for discussion: the application involved, the type of interaction between the crowd and the images, the scale of the task (such as the number of images), the type of evaluation performed on the crowd annotations, and the results of the evaluation.
There are a number of surveys which are related to this work. However, they are quite different in scope:
Ranard et al. (2014) survey crowdsourcing in health and medical research. They identify four tasks (problem solving, data processing, monitoring, and surveying) and cover 21 papers published until March 2013. In contrast, we only focus on papers where image analysis (i.e. data processing) is involved.
Wazny (2017) presents a meta-review of crowdsourcing from 2006 to 2016. Similar to Ranard et al. (2014), they take a broader view of crowdsourcing. They review existing review papers until August 2015, focusing on how each review categorizes the papers, for example by platform, size of crowd, and so forth.
Alialy et al. (2018) is most similar to our survey, but only focuses on crowdsourcing in human pathology. They do a systematic search with several steps, excluding conference papers or abstracts, and summarize seven papers. The coverage of literature is therefore much more limited than in this work.
While this paper is a preprint, we welcome feedback from other researchers, which we will aim to incorporate in the journal version. Interested researchers can submit comments via https://goo.gl/forms/Qzr2yAJQjOnRCAF23.
There are a variety of crowdsourcing applications to medical imaging data addressed in the papers surveyed in this work. We group these applications by (1) the type of task performed by the crowd, (2) the biomedical content of the image and (3) the dimensionality of the images.
II-A Type of task
An important task in medical image analysis is classification, and 42% of the surveyed papers focus on this task. Classification can refer to assigning a label to an entire image, such as diagnosing whether a chest CT image contains any abnormalities. Classification can also refer to assigning a label to a part of the image, for example, the type of abnormality located in a particular region of interest. Other types of labels include non-diagnostic labels such as image modality (de Herrera et al., 2014), visual attributes (Cheplygina and Pluim, 2018), and assessing the quality of the image (Keshavan et al., 2018). These three types of labels are based more on visual characteristics, and thus might be easier to provide than diagnostic labels without any medical training.
A further 38% of the papers focus on localization or segmentation. Typically the goal is to delineate the boundary of an entire healthy structure, or of an abnormality such as a lesion. The difference with how we define the classification task above is that instead of providing information about the image, the annotator has to modify the image, by providing positions or outlines. These tasks rely more on visual characteristics than classification tasks, and may be more easily explained to a non-expert crowd.
In 13% of the papers both classification and segmentation are addressed. Often this means that the annotator first has to indicate if the structure of interest is visible, and if yes, to locate it in the image.
Finally, 7% of papers request less standard tasks from their crowd. For example, Maier-Hein et al. (2015) focus on determining correspondence between pairs of images. Although this is a type of detection task, where the annotator has to locate points of interest in an image, it is also different since a point of reference is already provided. Another example is Ørting et al. (2017), where the annotator has to decide which image is more similar to a reference image. This is a type of classification problem, but again relying more on visual features than on prior knowledge.
II-B Type of image
Medical images are acquired at vastly different scales and locations depending on the physiological measurement of interest. The imaging acquisition modality and strategy depend heavily on the scale of the anatomy of interest and on the contrast each technology is expected to provide with surrounding tissues. Here we categorize the images by where in the body they originate, which narrows down the modality. We use the following categorization, also used in two recent surveys of medical imaging (Litjens et al., 2017; Cheplygina et al., 2018): brain, eye, heart, breast, lung, abdomen, histology/microscopy, multiple, other.
We compare the distribution of applications surveyed in this work with the two other surveys in Table I. An interesting observation is that Litjens et al. (2017) and Cheplygina et al. (2018) have a similar distribution of applications despite surveying different topics: Litjens et al. cover deep learning, where a larger dataset is preferred, while Cheplygina et al. cover weakly supervised learning, where datasets are smaller in size. Given that crowdsourcing is often proposed as an alternative to weakly supervised learning, it is surprising that the current survey has a different distribution of papers.
|Application||This survey||Cheplygina et al. (2018)||Litjens et al. (2017)|
Many of the papers in this survey are aimed at 2D images. The most common application is histopathology/microscopy with 29% of all the papers, followed by retinal images with 15% of the papers. Both applications are overrepresented compared to Litjens et al. (2017) and Cheplygina et al. (2018). This overrepresentation in crowdsourcing studies may be because many retinal and microscopic images are acquired in 2D, which might be easier to use in a crowdsourcing study than 3D images.
Breast and heart images, which were already not well represented in the other two surveys, are almost absent in crowdsourcing studies. Both applications can be aimed at 2D or 3D images. However, perhaps due to lack of datasets or perceived difficulty of assessing these images, these applications are almost never considered for crowdsourcing.
Several other papers address applications where images are often 3D, such as the brain (9%) and the lungs (9%). Compared to Litjens et al. (2017) and Cheplygina et al. (2018), brain and lung images are underrepresented in crowdsourcing. This could be due to the complexity of the images or limitations in interfaces. One approach for dealing with 3D images is to select 2D parts of the original 3D images. For example, Ørting et al. (2017) and O’Neil et al. (2017) select axial slices, while Cheplygina et al. (2016) show patches of 2D projections in various directions in the image. Others circumvent the 3D problem by presenting a video to the users where the entire image is displayed as a sequence of 2D frames (Boorboor et al., 2018). Only a few of the papers addressing 3D images present images in 3D (Huang and Hamarneh, 2017; Sonabend et al., 2017).
The last type of data that is addressed is video, common for endoscopy and colonoscopy (both in the abdomen category). Several different approaches are used for presenting video data: 2D frames (Maier-Hein et al., 2014b, 2015, 2016; Heim, 2018; Roethlingshoefer et al., 2017), 3D renderings (Nguyen et al., 2012; McKenna et al., 2012), short video clips (Park et al., 2017), or longer videos that can be paused and annotated (Park et al., 2018).
Other applications of crowdsourcing include segmenting hip joints in 2D MRI (Chávez-Aragón et al., 2013), rating visual characteristics of dermatological images (Cheplygina and Pluim, 2018) and assessing surgical performance (Malpani et al., 2015; Holst et al., 2015). Two papers (Foncubierta Rodríguez and Müller, 2012; de Herrera et al., 2014) look at multiple applications, where the task is classifying image modality, rather than segmentation or diagnosis. A few papers address segmentation in multiple modalities: Gurari et al. (2016) focus on both natural and biomedical images, and Lejeune et al. (2017) address segmentation across four medical applications.
|Section II||Section III||Section IV||Section V|
|Albarqouni et al. (2016a)||classify||histo||rate||M||custom||unknown||before/during||majority/weighted||indirect||multiple experts|
|Albarqouni et al. (2016b)||other||histo||rate||L||custom||volunteers||none||weighted||direct||multiple experts|
|Roethlingshoefer et al. (2017)||segment||abdomen||draw||L||none||none||none||none||na||na|
|Boorboor et al. (2018)||segment||lung||draw||S||paid||low||during||none||direct||multiple experts|
|Brady et al. (2014)||classify||eye||rate||S||paid||unknown||none||none||direct||?|
|Brady et al. (2017)||classify||eye||rate||M||paid||low||none||majority/weighted||direct||?|
|Bruggemann et al. (2018)||segment||histo||click||L||paid||low||none||none||direct||multiple experts|
|Cabrera-Bean et al. (2017)||segment||histo||click||XS||volunteer||volunteers||none||other||direct||?|
|Chávez-Aragón et al. (2013)||segment||other||draw||M||custom||volunteers||after||none||direct||other|
|Cheplygina et al. (2016)||segment||lung||draw||S||paid||unknown||after||average||direct||one expert|
|Cheplygina and Pluim (2018)||classify||other||rate||S||students||volunteers||none||average||indirect||?|
|Della Mea et al. (2014)||s+c||histo||click||S||paid||unknown||during||average||direct||one expert|
|dos Reis et al. (2015)||classify||histo||rate||L||volunteer||volunteers||during||average||direct||one expert|
|Eickhoff (2014)||classify||histo||rate||M||paid||low||none||majority||direct||one expert|
|Foncubierta Rodríguez and Müller (2012)||classify||multiple||rate||L||paid||unknown||during||none||direct||one expert|
|Ganz et al. (2017)||segment||brain||draw||S||paid||low||none||average||direct||one expert|
|Gur et al. (2017)||classify||heart||rate||M||custom||unknown||none||none||indirect||multiple experts|
|Gurari et al. (2016)||s+c||multiple||rate+draw||M||paid||low||before/after||majority||direct||multiple experts|
|Heim (2018)||s+c||abdomen||rate+draw||M||paid||low||before||majority/weighted||direct||multiple experts|
|Heller et al. (2017)||segment||abdomen||draw||XS||custom||unknown||none||none||direct||other|
|de Herrera et al. (2014)||classify||multiple||rate||L||paid||volunteers||none||none||indirect||other|
|Holst et al. (2015)||classify||other||rate||S||paid||low||before||other||direct||multiple experts|
|Huang and Hamarneh (2017)||classify||lung||click||?||custom||unknown||none||other||direct||?|
|Irshad et al. (2015)||segment||histo||click+draw||M||paid||unknown||before/during||none||direct||multiple experts|
|Irshad et al. (2017)||s+c||histo||rate+draw||M||paid||unknown||before/during||majority/weighted||direct||one expert|
|Keshavan et al. (2018)||classify||brain||rate||M||custom||volunteers||during||weighted||direct||multiple experts|
|Lawson et al. (2017)||classify||histo||click||S||volunteer||hourly||none||none||direct||multiple experts|
|Lee and Tufail (2014)||segment||eye||draw||S||paid||low||none||none||direct||na|
|Lee et al. (2016)||segment||eye||draw||S||paid||low||before||none||direct||na|
|Leifman et al. (2015)||s+c||eye||rate+draw+click||?||custom||volunteers||during||other||direct||multiple experts|
|Lejeune et al. (2017)||segment||multiple||click||S||experts||unknown||none||average||indirect||?|
|Luengo-Oroz et al. (2012)||segment||histo||click||S||volunteer||volunteers||during||other||direct||multiple experts|
|Maier-Hein et al. (2014a)||segment||abdomen||draw||S||paid||unknown||none||none||indirect||one expert|
|Maier-Hein et al. (2014b)||other||abdomen||click+compare||S||paid||unknown||none||other||direct||multiple experts|
|Maier-Hein et al. (2015)||other||abdomen||click+compare||S||paid||low||none||other||direct||?|
|Maier-Hein et al. (2016)||segment||abdomen||click+compare||M||paid||unknown||none||majority||indirect||?|
|Malpani et al. (2015)||classify||other||rate+compare||M||?||hourly||during||majority/weighted||direct||multiple experts|
|Mavandadi et al. (2012)||classify||histo||rate||M||none||unknown||before/during||none||direct||one expert|
|McKenna et al. (2012)||classify||abdomen||rate||M||paid||low||before||other||direct||?|
|Mitry et al. (2013)||classify||eye||rate||S||paid||low||none||none||direct||multiple experts|
|Mitry et al. (2015)||classify||eye||rate||M||paid||low||before/after||none||direct||multiple experts|
|Mitry et al. (2016)||s+c||eye||rate+draw||S||paid||low||before/after||majority||direct||multiple experts|
|Nguyen et al. (2012)||classify||abdomen||rate||M||paid||low||before||majority||direct||?|
|O’Neil et al. (2017)||segment||lung||draw||S||custom||volunteers||after||majority||direct||one expert|
|Ørting et al. (2017)||other||lung||compare||M||paid||low||before||other||indirect||multiple experts|
|Park et al. (2017)||classify||abdomen||rate||M||paid||unknown||none||majority||direct||one expert|
|Park et al. (2018)||segment||abdomen||click||M||paid||unknown||before/during||none||direct||?|
|Rajchl et al. (2016)||segment||brain||click||L||custom||volunteers||none||none||indirect||one expert|
|Rajchl et al. (2017)||segment||abdomen||?||M||none||none||none||none||na||na|
|Sameki et al. (2016)||segment||histo||draw||M||paid||low||none||other||direct||?|
|Sharma et al. (2017)||segment||histo||draw||S||paid||low||none||other||direct||na|
|Smittenaar et al. (2018)||classify||histo||rate||L||custom||volunteers||none||weighted||direct||multiple experts|
|Sonabend et al. (2017)||classify||brain||rate||?||experts||unknown||none||none||direct||na|
|Sullivan et al. (2018)||classify||histo||rate+draw||L||custom||volunteers||none||other||direct||?|
|Timmermans et al. (2016)||s+c||brain||rate+draw||M||custom||volunteers||none||none||na||na|
An important aspect of crowdsourcing medical image annotations is task design. The interplay between the type of image data, the type of annotations that are needed and the available tools for annotation, needs to be considered to successfully crowdsource annotations. A major component of task design is choosing how workers interact with the task. The type of interaction influences time per task and the required level of expertise and training, which ultimately translates into cost and quality. We identified four categories of interaction tasks across the studies surveyed:
Rate an entire image
Draw shapes to identify regions of interest
Click on specific locations
Compare two or more images
Furthermore, we also observed that studies generally had crowds either (1) create entirely new annotations on unlabeled data, or (2) make responses based on pre-existing annotations, e.g., output from automated segmentation methods.
Rating entire images was the most common interaction and was the main task of 52% of the studies surveyed here. Ratings took many forms: identifying the presence/absence of specific visual features (Sonabend et al., 2017), counting the number of cells (Smittenaar et al., 2018), assessing the intensity of cell staining (dos Reis et al., 2015), or discriminating healthy samples from diseased ones (Mavandadi et al., 2012). Most commonly, crowd workers were asked to create new annotations (89% of rating tasks). Less commonly, crowd workers were asked to validate pre-existing annotations (14%). One study involved both validating pre-existing annotations and creating new ones (Heim, 2018), so the percentages do not sum to 100%. Existing annotations were the output of automated methods (Roethlingshoefer et al., 2017; Ganz et al., 2017; Gur et al., 2017) half of the time, and the crowdsourced annotations were used to identify instances with errors to be corrected.
Drawing a shape was the second most common task, comprising 37% of studies. Here crowd workers were asked to draw bounding boxes or segment outlines of structures of interest. Sometimes, this was only after identifying if a structure was present in the image or not (Heim, 2018). Similar to rating images, drawing shapes was used as an interaction for both creating new annotations (90% of drawing tasks) and validating existing annotations (15%). In the case of evaluating existing annotations, drawing was used as a means to indicate the location of errors in segmentations produced by automated methods (Roethlingshoefer et al., 2017; Ganz et al., 2017).
Clicking on specific locations was the third most used interaction, occurring in 26% of studies. Clicking was only used to create new annotations such as identifying the precise location of specific cells, abnormalities, or artifacts within an image. The use of multiple clicks to outline a structure was considered a “drawing a shape” interaction. Selecting points was also used in pairs of video frames to determine the stereotactic correspondence of two video streams for follow-up 3D reconstruction (Maier-Hein et al., 2014b, 2015, 2016).
Comparing two or more images was the least used interaction, occurring in only 5 studies (9%). In all cases, comparisons were used to create new annotations, such as marking corresponding points in two consecutive video frames (Maier-Hein et al., 2015, 2016) or choosing which of two images was more similar to a target image (Ørting et al., 2017).
Overall, crowds were more often used to create new annotations than to make judgments on existing annotations. Ratings and drawing of shapes can be used to obtain more detailed annotations than information already present in datasets. Clicking interactions are sometimes used to identify specific image features, but more commonly used to create bounding boxes or draw object boundaries. Evaluating existing annotations is always done with rating or drawing interactions.
IV Platform, Scale and Wages
In this section we summarize the main meta parameters and settings of crowdsourcing experiments. First, we classify the reviewed papers based on the type of platform used to perform the crowdsourcing experiments. Second, we report on the scale of the experiments, where we consider (1) the number of images annotated and (2) the number of annotators per image. Finally, we summarize the wages paid to crowd workers.
IV-A Crowdsourcing platforms
A potentially important factor that varies across the surveyed papers is the choice of platform for conducting crowdsourcing experiments. We classify the platforms into six categories: paid commercial marketplaces such as Amazon Mechanical Turk (https://www.mturk.com) and FigureEight (https://www.figure-eight.com, formerly known as CrowdFlower), volunteer platforms such as Zooniverse (https://www.zooniverse.org) and Volunteer Science (https://volunteerscience.com), custom recruitment/platforms, lab participants, experts, and simulation or no experiment at all. The most common choice is a commercial platform (53%). The second most common choice is a custom platform (22%), followed by a volunteer platform (10%). The remaining 15% were almost equally divided into the other categories, with around 7% of all papers reporting prototypes or simulation studies.
We summarize the scale of the crowdsourcing experiments in terms of number of images annotated and number of annotations per image.
IV-B1 Number of images
We classify the number of images into four categories: very small (less than 10 annotated images), small (10 to 100 annotated images), medium (100 to 1000 annotated images) and large (more than 1000 annotated images). The large majority of reviewed papers, 71%, report small and medium scale experiments, while a smaller part report large experiments (21%) or very small experiments (5%). However, in around 3% of the reviewed papers, the scale of the experiments is not reported.
IV-B2 Number of annotations per image
We divide the number of annotations per image into two categories: a single annotator per image (5%) or multiple annotators per image (61%). Surprisingly, for 34% of surveyed papers the number of annotations per image is not reported nor can it be inferred.
Overall, the experiments using a single annotator per image involve either simulations or locally recruited, volunteer-based annotators that are not remunerated. The number of annotators per image for experiments using multiple annotators per image ranges from 2 to 5000. However, the majority (70%) of these experiments use between 5 and 25 annotators per image.
IV-C Annotators Wage
We classify the wage given to annotators into six different categories: a few dollars per hour, less than or equal to $0.10 per annotation, more than $0.10 per annotation, volunteers (no monetary incentive), not specified (if we have no information about compensation) and none (if no actual experiment or recruitment took place).
More than a third (35%) of papers did not specify anything about wage. In 32% of papers the wage was less than or equal to $0.10 per annotation, in 25% of papers crowds were volunteers with no monetary incentive, in 5% of papers the wage was more than $0.10 per annotation, and in 3% of papers the wage was an hourly payment of a few dollars per hour.
Overall, very few papers, mainly those that mention an hourly payment, considered crowd worker wages in relation to minimum wage rules and regulations.
In this section we describe how the crowdsourced annotations are evaluated. This is done via two strategies:
ensuring sufficient quality of annotations by preprocessing
estimating the utility of the crowd annotations for the task at hand
Although the two strategies are closely related and should be considered jointly when designing crowdsourcing experiments, it is informative to consider them separately here.
The first strategy is closely related to the field of quality control in crowdsourcing. Numerous approaches exist to tackle this, starting from simple majority voting and worker filtering to sophisticated statistical and machine learning methods that consider workers’ specific skills, task difficulty and clarity of task descriptions. The second strategy is more domain-specific, as different tasks may have different levels of tolerance for errors.
V-A Preprocessing of annotations
Preprocessing of annotations broadly covers what is done to the crowdsourced annotations prior to using them for their intended purpose. It includes filtering individual annotations and/or aggregating annotations. Of the surveyed papers, 84% perform some form of preprocessing.
V-A1 Filtering individuals
One way to filter annotations is to remove annotations made by “poorly performing” annotators. Most crowdsourcing platforms offer a rating score for workers that provides an estimate of their performance, based on their percentage of previously approved tasks. This score is used in 15% of surveyed papers to filter workers prior to assigning tasks. A related approach, used in 13% of surveyed papers, is to exclude workers that fail a test task prior to the actual tasks. A refinement of this, used in 24% of surveyed papers, is to embed separate test tasks among the actual tasks and exclude workers that fail the tests. One example is adding a smiley face to colonoscopy videos to ensure attention (Park et al., 2018).
Another common filtering approach for individual workers, used in 22% of surveyed papers, is comparing annotations to gold standard annotations. In this case, tasks with known gold standard annotations are injected into the regular working process. A worker’s correspondence with the gold standard can then be used to estimate individual worker performance. In contrast to platform scores and unrelated test tasks, this approach assesses worker performance on the specific task, allowing more fine-grained worker selection.
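As a minimal sketch of gold standard filtering, the following Python example scores each worker on injected gold standard tasks and keeps only the regular annotations of workers above a threshold. The function names, the data layout and the 0.7 threshold are illustrative assumptions, not taken from any of the surveyed papers.

```python
from collections import defaultdict


def worker_accuracy(responses, gold):
    """Return each worker's fraction of correct answers on gold standard tasks.

    responses: iterable of (worker_id, task_id, label) tuples.
    gold: dict mapping task_id -> known label, only for the injected tasks.
    """
    correct = defaultdict(int)
    seen = defaultdict(int)
    for worker, task, label in responses:
        if task in gold:  # only injected gold standard tasks are scored
            seen[worker] += 1
            correct[worker] += int(label == gold[task])
    return {worker: correct[worker] / seen[worker] for worker in seen}


def filter_responses(responses, gold, min_accuracy=0.7):
    """Keep regular (non-gold) responses from sufficiently accurate workers.

    min_accuracy is a hypothetical threshold. Workers who never saw a gold
    standard task get accuracy 0.0 here and are therefore excluded.
    """
    accuracy = worker_accuracy(responses, gold)
    return [
        (worker, task, label)
        for worker, task, label in responses
        if task not in gold and accuracy.get(worker, 0.0) >= min_accuracy
    ]


# Toy data: g1 and g2 are injected gold standard tasks, t1 is a regular task.
responses = [
    ("w1", "g1", "healthy"), ("w1", "g2", "diseased"), ("w1", "t1", "diseased"),
    ("w2", "g1", "diseased"), ("w2", "g2", "diseased"), ("w2", "t1", "healthy"),
]
gold = {"g1": "healthy", "g2": "diseased"}
kept = filter_responses(responses, gold)
print(kept)  # [('w1', 't1', 'diseased')]
```

In this toy example worker w1 answers both gold standard tasks correctly, while w2 answers one of two correctly and falls below the assumed threshold, so only w1's regular annotation is retained.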
V-A2 Aggregating results
One of the main benefits of crowdsourcing is the fast and cost-effective collection of a large number of annotations. This allows aggregating annotations to reduce noise in the individual annotations.
Majority voting is widely used due to its computational and conceptual simplicity and was found in 22% of the papers. In the context of medical image analysis, majority voting is applied to annotations, ratings, and also to aggregate slices of images. One example is presented in Heim (2018), where the authors used crowdsourcing for organ segmentation in computed tomography scans. Multiple organ outlines are collected via an online tool and pixel-wise majority voting is applied to improve the accuracy of the segmentation.
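Pixel-wise majority voting can be illustrated with a short Python sketch. It assumes that each annotator's outline has already been converted to a binary mask of the same size, represented as a list of rows, and that ties are assigned to the background; both choices are assumptions for illustration and are not necessarily those of Heim (2018).

```python
def pixelwise_majority(masks):
    """Fuse binary masks (lists of rows of 0/1) by pixel-wise majority vote.

    A pixel becomes foreground when strictly more than half of the
    annotators marked it; ties go to background (an assumption).
    """
    if not masks:
        raise ValueError("at least one mask is required")
    n_annotators = len(masks)
    n_rows, n_cols = len(masks[0]), len(masks[0][0])
    fused = []
    for r in range(n_rows):
        fused.append([
            1 if 2 * sum(mask[r][c] for mask in masks) > n_annotators else 0
            for c in range(n_cols)
        ])
    return fused


# Three annotators, each providing a 2 x 3 binary mask.
masks = [
    [[0, 1, 1], [0, 1, 0]],
    [[0, 1, 0], [1, 1, 0]],
    [[1, 1, 1], [0, 0, 0]],
]
fused = pixelwise_majority(masks)
print(fused)  # [[0, 1, 1], [0, 1, 0]]
```

Pixels marked by only one of the three annotators are discarded, which is how aggregation reduces the noise in individual annotations.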
In the case of numerical ratings, mean and median statistics are also used in 13% of the papers to determine a final annotation. For example, Cheplygina et al. (2016) use the median to aggregate the areas of the annotations created by individual workers.
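The difference between the two statistics can be seen in a small Python sketch. The area values below are made up for illustration; the point is only that a single outlying annotation shifts the mean far more than the median.

```python
from statistics import mean, median


def aggregate_numeric(values, method="median"):
    """Aggregate numeric annotations for one item with the mean or the median."""
    if not values:
        raise ValueError("at least one value is required")
    if method == "median":
        return median(values)
    if method == "mean":
        return mean(values)
    raise ValueError(f"unknown method: {method}")


# Hypothetical areas reported by five workers for the same structure;
# the last worker outlined a much larger region.
areas = [102.0, 98.5, 110.0, 97.0, 250.0]
print(aggregate_numeric(areas, "median"))  # 102.0
print(aggregate_numeric(areas, "mean"))    # 131.5
```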
A more sophisticated version of the majority vote uses additional information about the general quality of workers. This information can be derived if workers perform multiple tasks or if gold standard data is available. Weighted voting is used in 16% of surveyed papers, for example in (Keshavan et al., 2018), where the XGBoost algorithm was used to estimate annotator weights, and in (Brady et al., 2017), where task difficulty is taken into account and annotator weights are estimated as the probability that an annotator is correct conditioned on the difficulty of the task.
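The basic mechanism of weighted voting can be sketched in Python as follows. This is a generic illustration in which the weights are simply given, for instance as each worker's accuracy on gold standard tasks; it does not reproduce the XGBoost-based or difficulty-conditioned weight estimation of the cited papers, and all names and values are hypothetical.

```python
from collections import defaultdict


def weighted_vote(labels, weights, default_weight=1.0):
    """Return the label with the largest total annotator weight.

    labels: dict mapping worker_id -> label for a single item.
    weights: dict mapping worker_id -> non-negative weight; workers
        without an entry receive default_weight (an assumption).
    """
    if not labels:
        raise ValueError("at least one label is required")
    totals = defaultdict(float)
    for worker, label in labels.items():
        totals[label] += weights.get(worker, default_weight)
    # Iterating over sorted labels makes tie-breaking deterministic.
    return max(sorted(totals), key=totals.get)


labels = {"w1": "abnormal", "w2": "normal", "w3": "normal"}
weights = {"w1": 0.95, "w2": 0.40, "w3": 0.45}  # hypothetical worker weights
print(weighted_vote(labels, weights))  # abnormal
print(weighted_vote(labels, {}))       # normal (equal weights: plain majority)
```

With equal weights the function reduces to a plain majority vote; with the assumed weights, the single highly weighted worker outvotes the two less reliable ones.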
V-B Evaluating annotations
Evaluating how well crowd annotations solve the intended purpose is most commonly (78% of surveyed papers) done by directly comparing crowdsourced annotations to a gold standard. In about 16% of surveyed papers, crowd annotations are used for training a machine learning method, and the performance of the machine learning method is used to indirectly evaluate the annotations. The remaining 5% have no evaluation of how well annotations solve the intended purpose.
The gold standard originates from different sources. In about 20% of surveyed papers the gold standard is based on a single expert, in about 36% the gold standard is based on multiple experts and in the remaining papers the number of experts is not reported or no expert gold standard is used. Using a gold standard based on a single expert can be problematic since experts often disagree on all but the most trivial tasks. However, only 3 of 20 papers that use multiple experts consider how well experts agree.
Expert-based gold standards are generally not obtained from experts performing exactly the same task as the crowd. In several cases the only difference in expert and crowd tasks is due to differences in user interface, e.g. a clinical workstation for experts and a web interface for crowds. As long as the fundamental task is the same (e.g. count cells) and the user interface has not been dramatically changed we consider the expert and crowd tasks to be the same. Using this definition, about 40% of the papers use the same task and about 40% use a different task. In the remaining 20% it is either not reported or no expert gold standard is used.
There are several reasons for asking crowds to perform a different task than what experts have done for the gold standard. Some papers use a simplified version of the expert task in order to make the task easier or more suitable as a small self-contained task. Examples include ranking relative performance in pairs of surgical videos instead of grading performance in each (Malpani et al., 2015); assessing visual similarity of images instead of classifying disease patterns (Ørting et al., 2017); refining segmentation proposals instead of performing a full segmentation (Maier-Hein et al., 2016); annotating polyps in a single frame instead of in a full video (Park et al., 2017); and counting stained cells instead of classifying disease status (Irshad et al., 2017). Other papers focus on changing the user interface, such as in (Lejeune et al., 2017) where an eye tracker is used for segmentation instead of a mouse, or in (Albarqouni et al., 2016b; Mavandadi et al., 2012) where the user interface is changed to support gamification strategies.
In a few papers, evaluation is focused on variation in annotations. For example, in (Lee and Tufail, 2014; Lee et al., 2016) annotations are evaluated in terms of inter-rater reliability, and in (Heller et al., 2017; Huang and Hamarneh, 2017; Leifman et al., 2015; Sonabend et al., 2017) individual annotations are compared to aggregated annotations. Measuring variability of annotations is not directly useful for evaluating the correctness of annotations. However, annotation variability is essential when evaluating how much the crowdsourced annotations can be trusted. Additionally, variation provides an indirect measure of correctness. Large variation can indicate that annotations are often wrong, while small variation indicates that annotations are often correct or that the task has been designed such that annotators are consistently wrong.
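One simple way to quantify such variation for categorical labels is the fraction of annotators that agree with the aggregated (most common) label of each item. The Python sketch below illustrates this; the measure and the toy labels are assumptions for illustration and are not the specific statistics used in the cited papers.

```python
from collections import Counter


def agreement_with_majority(annotations):
    """Return, per item, the fraction of labels equal to the most common label.

    annotations: dict mapping item_id -> list of labels from different workers.
    Items without any labels are skipped.
    """
    agreement = {}
    for item, labels in annotations.items():
        if not labels:
            continue
        _, top_count = Counter(labels).most_common(1)[0]
        agreement[item] = top_count / len(labels)
    return agreement


annotations = {
    "img1": ["polyp", "polyp", "polyp", "polyp"],  # unanimous
    "img2": ["polyp", "none", "polyp", "none"],    # evenly split
}
print(agreement_with_majority(annotations))  # {'img1': 1.0, 'img2': 0.5}
```

A value of 1.0 only shows that the annotators are consistent; as noted above, they may still be consistently wrong, so this measure complements rather than replaces comparison with a gold standard.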
VI Results and recommendations
Here, we provide an overview of the primary results and recommendations emerging from the papers examined in this review. Complementary to the topics discussed in Section V we consider (1) How effective is the application of crowdsourcing to medical image analysis? (2) Recommendations to ensure data quality.
VI-A How effective is the application of crowdsourcing to medical image analysis?
The vast majority of studies examined in this review found crowdsourcing to be a valid approach for data production. Crowdsourcing of medical image analysis was noted to be an accurate approach (Lawson et al., 2017), that can produce large quantities of annotations needed to solve high-throughput problems requiring human input (Irshad et al., 2015; dos Reis et al., 2015; Lee and Tufail, 2014; Maier-Hein et al., 2014b). Crowdsourcing can be used to create new annotations or make existing data more robust, both cheaper and faster than annotation by medical experts (Rajchl et al., 2016; Holst et al., 2015; Gurari et al., 2016; Eickhoff, 2014; Park et al., 2017).
Although the relative efficacy of crowdsourcing applied to medical image analysis will be dependent on the complexity of the task, the papers examined here show crowdsourcing to be an effective methodology across a wide variety of applications, including objective assessment of surgical skill (Malpani et al., 2015), emphysema assessment (Ørting et al., 2017), polyp marking in virtual colonoscopy (Park et al., 2018), identification of chromosomes (Sharma et al., 2017) and biomarker discovery in immunohistochemistry data (Smittenaar et al., 2018). Notably, only one project stated that crowdsourcing could not always be applied effectively to the studied task (“it is very difficult and maybe even impossible to entirely outsource the task of labelling mitotic figures in histology images to crowds” (Albarqouni et al., 2016a)).
Rather than comparing the absolute performance of the crowd to experts or to algorithms, it might be worth considering their relative benefits. For example, crowds were particularly useful for rare classes (Sullivan et al., 2018), which are often difficult cases for algorithms. Another situation where crowds can be useful is identifying data that is missing from the gold standard provided by experts, see for example (Luengo-Oroz et al., 2012). Benefits of combining crowds with algorithms were also demonstrated by (Albarqouni et al., 2016a; Sharma et al., 2017).
VI-B Recommendations to ensure data quality
The papers examined in this review included suggestions to improve the quality of data produced through crowdsourcing. These suggestions focused on refining the task design, crowdsourcing platform and post-processing of annotations. We summarize these recommendations here.
VI-B1 Task design
As discussed, crowdsourcing has been applied effectively to many medical imaging applications. However, careful study design remains necessary to ensure generation of data of sufficient quality.
The selection and design of an appropriate crowdsourcing task is central to project success. Effort should be made to make the task simple and unambiguous (Rajchl et al., 2016; Gurari et al., 2016), and to present study data appropriately (McKenna et al., 2012). For unavoidably challenging tasks, crowdsourcing may still provide useful data, for instance by enabling a rapid first-pass evaluation of large-scale data sets (Della Mea et al., 2014; Park et al., 2017). Particularly challenging tasks may be made tractable through gamification (Albarqouni et al., 2016b) or careful reframing of the task, e.g. crowdsourcing of emphysema assessment was made possible by reframing the task as a question of image similarity (Ørting et al., 2017). Alternatively, it may be possible to achieve the desired data quality simply by asking more crowd workers to annotate each data point. An interesting example of task design is given in (Gurari et al., 2016), where the quality and speed of crowdsourced segmentations in natural images are increased by flipping images, suggesting that familiarity with an image can be detrimental.
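To illustrate why asking more workers per data point can raise data quality, the sketch below computes the probability that a simple majority vote is correct for a binary task. It assumes that workers err independently and share the same accuracy, which is a strong simplification that real crowds are unlikely to satisfy; the function name and the accuracy value of 0.7 are hypothetical.

```python
from math import comb


def majority_vote_accuracy(p_correct, n_workers):
    """Probability that a strict majority of n independent workers is correct."""
    if n_workers % 2 == 0:
        raise ValueError("use an odd number of workers to avoid ties")
    needed = n_workers // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_workers, k) * p_correct**k * (1 - p_correct) ** (n_workers - k)
        for k in range(needed, n_workers + 1)
    )


# Assumed individual accuracy of 0.7, evaluated for several cohort sizes.
accuracy_by_cohort = {n: majority_vote_accuracy(0.7, n) for n in (1, 3, 5, 7)}
```

Under these assumptions the majority becomes more reliable as the cohort grows, but only when individual workers are better than chance. If workers are consistently wrong, adding more of them does not help.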
VI-B2 Crowdsourcing platform
The choice of crowdsourcing platform can influence study cost and completion time, as well as the size and demographics of the crowd. Furthermore, different platforms offer distinct features which may influence the quality of data produced. For example, Heller et al. (2017) noted that user interface features, such as zoom and intuitive controls, can increase data quality. Depending on the complexity of the task and the interface design, training materials should be provided, as this can improve results (McKenna et al., 2012). However, this is not always necessary; in some cases minimal training (Brady et al., 2014) or no training (Ganz et al., 2017) was required.
VI-B3 Post-processing of annotations
Post-processing of annotations is recommended to improve annotation quality by removing annotations from poorly performing workers. Alternatively, if multiple workers annotate the same data it is possible to improve annotation quality by aggregating annotations.
The surveyed papers propose a variety of criteria for filtering individual annotations. Examples include time spent on the task (O’Neil et al., 2017), expected shape of the segmentation (Cheplygina et al., 2016; Chávez-Aragón et al., 2013), correlation with other workers’ results (Sharma et al., 2017; Chávez-Aragón et al., 2013) and correlation with expert annotations or ground truth (Sameki et al., 2016; Keshavan et al., 2018; Irshad et al., 2017, 2015; Foncubierta Rodríguez and Müller, 2012). However, due to the lack of comparisons between different filtering approaches, the only clear recommendation from these works is to use some form of filtering.
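Two of the criteria above, time spent on the task and agreement with a gold standard, can be combined as in the following minimal sketch. It is not the procedure of any surveyed paper. The `Annotation` record, the thresholds, and the example data are hypothetical, and the sketch assumes that expert labels exist for a small subset of images.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    worker_id: str
    image_id: str
    label: str
    seconds_spent: float


def filter_annotations(annotations, gold_labels, min_seconds=2.0, min_accuracy=0.5):
    """Drop rushed annotations, then drop workers who score poorly on gold images."""
    timed = [a for a in annotations if a.seconds_spent >= min_seconds]
    hits, seen = {}, {}
    for a in timed:
        if a.image_id in gold_labels:
            seen[a.worker_id] = seen.get(a.worker_id, 0) + 1
            hits[a.worker_id] = hits.get(a.worker_id, 0) + int(a.label == gold_labels[a.image_id])

    def trusted(worker_id):
        # Workers who never saw a gold image are kept; this is a design choice.
        if worker_id not in seen:
            return True
        return hits[worker_id] / seen[worker_id] >= min_accuracy

    return [a for a in timed if trusted(a.worker_id)]


# Hypothetical annotations; only img1 has an expert (gold) label.
annotations = [
    Annotation("w1", "img1", "polyp", 8.0),
    Annotation("w1", "img2", "normal", 6.5),
    Annotation("w2", "img1", "normal", 7.0),
    Annotation("w2", "img2", "polyp", 9.0),
    Annotation("w3", "img2", "normal", 0.4),
    Annotation("w3", "img3", "polyp", 5.0),
]
gold = {"img1": "polyp"}
kept = filter_annotations(annotations, gold)
```

In this example the rushed annotation by `w3` and all annotations by `w2`, who disagreed with the gold label, are removed. Suitable thresholds depend on the task and would need to be calibrated, for instance in a pilot experiment.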
Contrary to this recommendation, Nguyen et al. (2012) found that filtering unreliable workers did not have a significant influence when annotations from multiple workers are aggregated. However, aggregating without taking individual performance into account might not be the best approach. Malpani et al. (2015) compared different aggregation methods, and found that weighted voting, with weights based on self-reported confidence scores, improved results compared to simple majority voting. Similarly, Irshad et al. found that aggregating segmentations from 3-5 workers, using weights based on consensus and worker trust scores, improved performance over using single worker annotations. Further, Cheplygina and Pluim (2018) found that disagreement between workers was predictive of melanoma diagnosis in skin lesions, suggesting that simple aggregation, such as majority voting or mean statistics, might not be the best approach.
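The difference between simple majority voting and weighted voting can be sketched as follows. This is a generic illustration rather than the method of any cited paper. The weights stand in for self-reported confidence or a worker trust score, and the function names and example votes are hypothetical.

```python
from collections import defaultdict


def weighted_vote(votes):
    """Return the label with the largest summed weight.

    votes: list of (label, weight) pairs for one item.
    """
    if not votes:
        raise ValueError("no votes to aggregate")
    totals = defaultdict(float)
    for label, weight in votes:
        totals[label] += weight
    # On a tie, max() keeps the label that first appeared in the input.
    return max(totals, key=totals.get)


def majority_vote(votes):
    """Simple majority voting: every vote counts equally."""
    return weighted_vote([(label, 1.0) for label, _ in votes])


# Three hypothetical workers: one confident "positive", two unsure "negative".
votes = [("positive", 0.9), ("negative", 0.3), ("negative", 0.4)]
simple_result = majority_vote(votes)
weighted_result = weighted_vote(votes)
```

Here the two schemes disagree: the majority selects "negative", while the confidence-weighted sum favours "positive". Which result is preferable depends on how well the weights reflect actual worker reliability.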
As discussed in Section II, crowdsourcing is applied to a variety of medical images; however, it is most commonly applied to histology or microscopy images. The trend for crowdsourcing of this image type may be due to the ease with which these (typically 2D) images can be incorporated into a crowdsourcing or citizen science project. Alternatively, the microscopy images examined in these papers may not have been derived from a patient, and would therefore not require the consent of an individual to use for crowdsourcing purposes.
The most common crowd task is rating entire images. This is somewhat surprising, given that we would expect such tasks to rely more on prior knowledge than other crowdsourced tasks, such as drawing outlines of objects. Again, this trend might be explained by the ease with which rating images can be incorporated in existing platforms.
Most crowdsourcing studies are set up on commercial platforms, followed by custom platforms. Each image is annotated by multiple crowd workers, who typically receive less than $0.10 per annotation. On the one hand, this low reimbursement might be a product of researchers trying to optimize the total number of annotations given a particular budget. On the other hand, it could be a lack of awareness of what appropriate compensation should be (Hara et al., 2017).
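To put a per-annotation payment into perspective, it can be converted into an implied hourly rate. The sketch below is simple arithmetic. The function name is hypothetical, and the assumed 60 seconds per annotation is an illustrative value, not a figure reported by the surveyed papers.

```python
def implied_hourly_rate(pay_per_annotation, seconds_per_annotation):
    """Hourly earnings implied by a fixed payment per annotation."""
    if seconds_per_annotation <= 0:
        raise ValueError("seconds_per_annotation must be positive")
    return pay_per_annotation * 3600.0 / seconds_per_annotation


# $0.10 per annotation at an assumed 60 seconds per annotation.
rate = implied_hourly_rate(0.10, 60.0)
```

Under this assumption the implied rate is about $6 per hour. Slower tasks reduce it proportionally, which is one reason to report both payment and time per task.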
A surprising finding is that important details about the crowd and their compensation are often missing. Details regarding the number of requested annotations per unit are also frequently missing. For some of the surveyed papers, we could infer an approximate number of annotations gathered per unit from the scale of the experiment and the total number of annotations gathered. However, for at least a third of the surveyed papers (34%) this was not possible due to a lack of detail in the description of the crowdsourcing experimental methodology.
Crowdsourced annotations are generally processed prior to evaluating how well the annotations served the intended purpose. Simply excluding workers based on platform scores or a single test task is not as popular as continuously monitoring worker performance. 29% of the surveyed papers aggregate annotations from multiple crowd workers. This is most commonly done by simple majority voting, but some papers use estimates of task difficulty and/or worker performance to obtain a weighted aggregation.
The most common approach to evaluating the quality of preprocessed annotations is to compare them to an expert-defined gold standard. A smaller set of papers use the annotations to train an ML method and evaluate the performance of the trained method.
The studies we reviewed almost unanimously conclude that crowdsourcing is a viable solution for medical image annotation, which may seem unexpected given the complexity of medical imaging as a field in general. There are several possible reasons for the lack of negative results. One is researchers selecting tasks which they already expect to be suitable for crowdsourcing. Another reason is publication bias, with papers demonstrating negative results having less chance of being published, which is also an issue in computer vision (Borji, 2018).
There are a number of limitations in the way that the current studies are being conducted. There is generally a lack of clarity in the reporting of experimental design and evaluation protocols. Additionally, ethical questions regarding worker compensation, image content and patient privacy are rarely discussed, but seem crucial to address. In several papers the study design appears to be ad-hoc. Characteristics such as the platform, number of annotators, how the task is explained and so forth, are not always motivated, or even described. This creates difficulties in understanding what leads to a successful crowdsourcing study and increases the barrier for researchers who have not used crowdsourcing before. The studies which do examine such factors are often conducted on a single application, making it difficult to generalize lessons learned to other applications. Detailed documentation of experiments is a crucial factor for ensuring reproducible science and essential for replication studies.
Another problem is the evaluation of results. The quality of crowdsourced annotations is generally estimated by comparing directly to expert annotations. However, variation in both the expert-defined gold standard and the crowd annotations is not systematically accounted for, making it difficult to assess whether crowd annotations are actually good enough. When using annotations to train ML methods, noisy crowd annotations might not be a problem if handled by the ML algorithm. However, variation in annotations should still be investigated in this case. A related problem is using expert annotations to filter crowd annotations, which would not be possible for real unlabeled data, thus leading to overly optimistic results.
Overall, surveyed papers reported successful results. However, from our personal experience and discussions with other researchers, it is non-trivial to set up a crowdsourcing project for medical images. Due to the lack of negative results, the current literature does not inform researchers inexperienced with crowdsourcing about the main considerations of such a project. Furthermore, very few articles report on pilot experiments which aim to calibrate and identify the optimal crowdsourcing parameter settings, such as the number of annotators per image.
There are important ethical issues which are largely not mentioned in the papers we surveyed. First of all, details about compensation are often missing, even though compensation can have an important effect on the crowd (Hara et al., 2017). Furthermore, what is reasonable compensation in one country may be too low in another due to differences in the cost of living. How to set the compensation fairly is an open issue that researchers should consider in their work.
Another ethical concern is whether it is possible and/or appropriate to share images with the crowd. Some images (for example of surgery) may be traumatic to view or unsuitable for children, a concern that is more specific to the medical domain than to other areas where crowdsourcing is applied, e.g. astronomy or ecology projects. Another issue is sharing images from the perspective of patient consent, which must be considered case by case.
Several papers discuss directions they want to take in further research. One of the popular directions is increasing the role of machine learning. Several papers not using machine learning plan to do so in future (Brady et al., 2017; Cheplygina and Pluim, 2018; Sullivan et al., 2018). Papers that already use machine learning discuss improvements to their algorithms or crowd-algorithm combinations (Sharma et al., 2017; Sameki et al., 2016).
Related to the above, tailoring the tasks to individual workers is another possibility. The rating score given to workers by platforms only reflects an overall completion rate, and might be artificially high because employers tend to rate the majority of the tasks positively and apply filtering afterwards. Considering worker scores on different task types could help to make a better selection of workers beforehand.
Another strategy discussed as future work is the use of gamification. Several papers have explored this idea (Luengo-Oroz et al., 2012; Mavandadi et al., 2012; Albarqouni et al., 2016b; Sullivan et al., 2018), citing increased motivation of annotators. While the earlier papers (Luengo-Oroz et al., 2012; Mavandadi et al., 2012; Albarqouni et al., 2016b) have task-specific games, Sullivan et al. (2018) take a more task-independent approach of a mini-game within an existing, larger game. This could be an opportunity for many other researchers, without the need to design a game from scratch. Finally, annotating images at a festival as in (Timmermans et al., 2016) could be an interesting direction.
Beyond the opportunities that the papers discuss as future research, we see a number of other future directions for the community as a whole. Perhaps the most important future direction is openly sharing our experiences with crowdsourcing, including failures. Due to publication bias, current papers may not reflect the performance and difficulties encountered in a typical crowdsourcing project.
More generally, there is an opportunity to create a set of guidelines for crowdsourcing medical imaging studies. Rather than relying on ad-hoc choices, researchers could then make informed decisions about the platform, reward of the annotators and other variables. For example, the European Citizen Science Initiative has a selection of guides for performing citizen science studies (https://ecsa.citizen-science.net/blog/collection-citizen-science-guidelines-and-publications). A further opportunity is to interact more with other fields where crowdsourcing has been in use longer, and to see which of their best practices are also applicable to medical imaging.
Interacting with workers could both improve projects and help establish guidelines. Workers have created communities (e.g. Reddit, Facebook) and discussion boards (https://www.mturkforum.com, https://www.turkernation.com) for some platforms. Chandler et al. (2014) found that 28% ± 5% of the workers on Mechanical Turk read discussion boards and blogs related to Mechanical Turk. The topics of conversation about a task (a Human Intelligence Task, or HIT), in order of frequency, are: pay, gratification, completion time, difficulty, how to complete it successfully, its purpose and the reputation of the requester. These forums are a valuable source for researchers for gathering information, measuring opinions and getting feedback on improving their project. This is particularly important because high-throughput workers are more likely to discuss HITs (Chandler et al., 2014). This subgroup (less than 10% of the workers do more than 75% of the work (Hara et al., 2017)) is likely to have experience with similar tasks (Chandler et al., 2014), and interacting with these workers may lead to improvements, for example to the user interface as in (Bruggemann et al., 2018).
Next to image analysis, crowdsourcing could also be a way to collect, rather than curate, data to improve medical knowledge. This could vary from donating your own medical images (such as http://www.medicaldatadonors.org) to contributing experiences about rare diseases. Since such initiatives do not focus on image analysis, we did not include them in this survey; however, (Ranard et al., 2014; Wazny, 2017) may be good starting points for readers interested in these topics.
The authors would like to thank eScience-Lorentz grant 2018 and Ms Gerda Filippo (Lorentz center) for their support in organizing the workshop where this paper was conceived.
- Albarqouni et al. [2016a] S. Albarqouni, C. Baur, F. Achilles, V. Belagiannis, S. Demirci, and N. Navab. AggNet: Deep learning from crowds for mitosis detection in breast cancer histology images. IEEE Transactions on Medical Imaging, 35(5):1313–1321, May 2016a.
- Albarqouni et al. [2016b] S. Albarqouni, S. Matl, M. Baust, N. Navab, and S. Demirci. Playsourcing: a novel concept for knowledge creation in biomedical research. In Deep Learning and Data Labeling for Medical Applications, pages 269–277. Springer, 2016b.
- Alialy et al.  R. Alialy, S. Tavakkol, E. Tavakkol, A. Ghorbani-Aghbologhi, A. Ghaffarieh, S. H. Kim, and C. Shahabi. A review on the applications of crowdsourcing in human pathology. Journal of pathology informatics, 9, 2018.
- Boorboor et al.  S. Boorboor, S. Nadeem, J. H. Park, K. Baker, and A. Kaufman. Crowdsourcing lung nodules detection and annotation. In Medical Imaging 2018: Imaging Informatics for Healthcare, Research, and Applications, volume 10579, page 105791D. International Society for Optics and Photonics, 2018.
- Borji  A. Borji. Negative results in computer vision: A perspective. Image and Vision Computing, 69:1–8, 2018.
- Brady et al.  C. J. Brady, A. C. Villanti, J. L. Pearson, T. R. Kirchner, O. Gup, and C. Shah. Rapid grading of fundus photos for diabetic retinopathy using crowdsourcing. Investigative Ophthalmology & Visual Science, 55(13):4826–4826, 2014.
- Brady et al.  C. J. Brady, L. I. Mudie, X. Wang, E. Guallar, and D. S. Friedman. Improving consensus scoring of crowdsourced data using the rasch model: development and refinement of a diagnostic instrument. Journal of medical Internet research, 19(6), 2017.
- Bruggemann et al.  J. Bruggemann, G. C. Lander, and A. I. Su. Exploring applications of crowdsourcing to cryo-EM. Journal of structural biology, 203(1):37–45, 2018.
- Cabrera-Bean et al.  M. Cabrera-Bean, A. Pages-Zamora, C. Diaz-Vilor, M. Postigo-Camps, D. Cuadrado-Sánchez, and M. A. Luengo-Oroz. Counting malaria parasites with a two-stage EM based algorithm using crowsourced data. In Engineering in Medicine and Biology Society (EMBC), pages 2283–2287. IEEE, 2017.
- Chandler et al.  J. Chandler, P. Mueller, and G. Paolacci. Nonnaïveté among amazon mechanical turk workers: Consequences and solutions for behavioral researchers. Behavior research methods, 46(1):112–130, 2014.
- Chávez-Aragón et al.  A. Chávez-Aragón, W.-S. Lee, and A. Vyas. A crowdsourcing web platform-hip joint segmentation by non-expert contributors. In Medical Measurements and Applications Proceedings (MeMeA), 2013 IEEE International Symposium on, pages 350–354. IEEE, 2013.
- Cheplygina and Pluim  V. Cheplygina and J. P. W. Pluim. Crowd disagreement about medical images is informative. In Intravascular Imaging and Computer Assisted Stenting and Large-Scale Annotation of Biomedical Data and Expert Label Synthesis (MICCAI LABELS), pages 105–111. Springer, 2018.
- Cheplygina et al.  V. Cheplygina, A. Perez-Rovira, W. Kuo, H. Tiddens, and M. de Bruijne. Early experiences with crowdsourcing airway annotations in chest CT. In Large-scale Annotation of Biomedical data and Expert Label Synthesis (MICCAI LABELS), pages 209–218, 2016.
- Cheplygina et al.  V. Cheplygina, M. de Bruijne, and J. P. Pluim. Not-so-supervised: a survey of semi-supervised, multi-instance, and transfer learning in medical image analysis. arXiv preprint arXiv:1804.06353, 2018.
- de Herrera et al.  A. G. S. de Herrera, A. Foncubierta-Rodríguez, D. Markonis, R. Schaer, and H. Müller. Crowdsourcing for medical image classification. Swiss Medical Informatics, 30, 2014.
- Della Mea et al.  V. Della Mea, E. Maddalena, S. Mizzaro, P. Machin, and C. A. Beltrami. Preliminary results from a crowdsourcing experiment in immunohistochemistry. In Diagnostic pathology, volume 9, page S6. BioMed Central, 2014.
- dos Reis et al.  F. J. C. dos Reis, S. Lynn, H. R. Ali, D. Eccles, A. Hanby, E. Provenzano, C. Caldas, W. J. Howat, L.-A. McDuffus, B. Liu, et al. Crowdsourcing the general public for large scale molecular pathology studies in cancer. EBioMedicine, 2(7):681–689, 2015.
- Eickhoff  C. Eickhoff. Crowd-powered experts: Helping surgeons interpret breast cancer images. In Gamification for Information Retrieval (GamifIR), pages 53–56. ACM, 2014.
- Foncubierta Rodríguez and Müller  A. Foncubierta Rodríguez and H. Müller. Ground truth generation in medical imaging: a crowdsourcing-based iterative approach. In ACM Multimedia workshop on Crowdsourcing for Multimedia, pages 9–14. ACM, 2012.
- Ganz et al.  M. Ganz, D. Kondermann, J. Andrulis, G. M. Knudsen, and L. Maier-Hein. Crowdsourcing for error detection in cortical surface delineations. International journal of computer assisted radiology and surgery, 12(1):161–166, 2017.
- Greenspan et al.  H. Greenspan, B. Van Ginneken, and R. M. Summers. Guest editorial deep learning in medical imaging: Overview and future promise of an exciting new technique. IEEE Transactions on Medical Imaging, 35(5):1153–1159, 2016.
- Gur et al.  Y. Gur, M. Moradi, H. Bulu, Y. Guo, C. Compas, and T. Syeda-Mahmood. Towards an efficient way of building annotated medical image collections for big data studies. In Intravascular Imaging and Computer Assisted Stenting, and Large-Scale Annotation of Biomedical Data and Expert Label Synthesis, pages 87–95. Springer, 2017.
- Gurari et al.  D. Gurari, D. Theriault, M. Sameki, B. Isenberg, T. A. Pham, A. Purwada, P. Solski, M. Walker, C. Zhang, J. Y. Wong, and M. Betke. How to collect segmentations for biomedical images? A benchmark evaluating the performance of experts, crowdsourced non-experts, and algorithms. In Winter Conference on Applications of Computer Vision, (WACV), pages 1169–1176, 2015.
- Gurari et al.  D. Gurari, M. Sameki, and M. Betke. Investigating the influence of data familiarity to improve the design of a crowdsourcing image annotation system. In Human Computation (HCOMP), 2016.
- Hara et al.  K. Hara, A. Adams, K. Milland, S. Savage, C. Callison-Burch, and J. Bigham. A data-driven analysis of workers’ earnings on Amazon Mechanical Turk. arXiv preprint arXiv:1712.05796, 2017.
- Heim  E. Heim. Large-scale medical image annotation with quality-controlled crowdsourcing. PhD thesis, German Cancer Research Center (DKFZ), 2018.
- Heller et al.  N. Heller, P. Stanitsas, V. Morellas, and N. Papanikolopoulos. A web-based platform for distributed annotation of computerized tomography scans. In Intravascular Imaging and Computer Assisted Stenting, and Large-Scale Annotation of Biomedical Data and Expert Label Synthesis (MICCAI LABELS), pages 136–145. Springer, 2017.
- Holst et al.  D. Holst, T. M. Kowalewski, L. W. White, T. C. Brand, J. D. Harper, M. D. Sorensen, M. Truong, K. Simpson, A. Tanaka, R. Smith, et al. Crowd-sourced assessment of technical skills: differentiating animate surgical skill through the wisdom of crowds. Journal of endourology, 29(10):1183–1188, 2015.
- Howe  J. Howe. The rise of crowdsourcing. Wired magazine, 14(6):1–4, 2006.
- Huang and Hamarneh  M. Huang and G. Hamarneh. SwifTree: Interactive extraction of 3D trees supporting gaming and crowdsourcing. In Intravascular Imaging and Computer Assisted Stenting, and Large-Scale Annotation of Biomedical Data and Expert Label Synthesis (MICCAI LABELS), pages 116–125. Springer, 2017.
- Irshad et al.  H. Irshad, L. Montaser-Kouhsari, G. Waltz, O. Bucur, J. Nowak, F. Dong, N. W. Knoblauch, and A. H. Beck. Crowdsourcing image annotation for nucleus detection and segmentation in computational pathology: evaluating experts, automated methods, and the crowd. In Pacific Symposium on Biocomputing, pages 294–305. World Scientific, 2015.
- Irshad et al.  H. Irshad, E.-Y. Oh, D. Schmolze, L. M. Quintana, L. Collins, R. M. Tamimi, and A. H. Beck. Crowdsourcing scoring of immunohistochemistry images: Evaluating performance of the crowd and an automated computational method. Scientific Reports, 7:43286, 2017.
- Keshavan et al.  A. Keshavan, J. Yeatman, and A. Rokem. Combining citizen science and deep learning to amplify expertise in neuroimaging. bioRxiv, page 363382, 2018.
- Kovashka et al.  A. Kovashka, O. Russakovsky, L. Fei-Fei, and K. Grauman. Crowdsourcing in computer vision. Foundations and Trends in Computer Graphics and Vision, 10(3):177–243, 2016.
- Lawson et al.  J. Lawson, R. J. Robinson-Vyas, J. P. McQuillan, A. Paterson, S. Christie, M. Kidza-Griffiths, L.-A. McDuffus, K. A. Moutasim, E. C. Shaw, A. E. Kiltie, et al. Crowdsourcing for translational research: analysis of biomarker expression using cancer microarrays. British journal of cancer, 116(2):237, 2017.
- Lee and Tufail  A. Y. Lee and A. Tufail. Mechanical turk based system for macular OCT segmentation. Investigative Ophthalmology & Visual Science, 55(13):4787–4787, 2014.
- Lee et al.  A. Y. Lee, C. S. Lee, P. A. Keane, and A. Tufail. Use of mechanical turk as a mapreduce framework for macular OCT segmentation. Journal of ophthalmology, 2016, 2016.
- Leifman et al.  G. Leifman, T. Swedish, K. Roesch, and R. Raskar. Leveraging the crowd for annotation of retinal images. In International Conference of the Engineering in Medicine and Biology Society (EMBC), pages 7736–7739. IEEE, 2015.
- Lejeune et al.  L. Lejeune, M. Christoudias, and R. Sznitman. Expected exponential loss for gaze-based video and volume ground truth annotation. In Intravascular Imaging and Computer Assisted Stenting, and Large-Scale Annotation of Biomedical Data and Expert Label Synthesis (MICCAI LABELS), pages 106–115. Springer, 2017.
- Litjens et al.  G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. van der Laak, B. Van Ginneken, and C. I. Sánchez. A survey on deep learning in medical image analysis. Medical image analysis, 42:60–88, 2017.
- Luengo-Oroz et al.  M. A. Luengo-Oroz, A. Arranz, and J. Frean. Crowdsourcing malaria parasite quantification: an online game for analyzing images of infected thick blood smears. Journal of medical Internet research, 14(6), 2012.
- Maier-Hein et al. [2014a] L. Maier-Hein, S. Mersmann, D. Kondermann, S. Bodenstedt, A. Sanchez, C. Stock, H. G. Kenngott, M. Eisenmann, and S. Speidel. Can masses of non-experts train highly accurate image classifiers? In Medical Image Computing and Computer-Assisted Intervention (MICCAI), pages 438–445. Springer, 2014a.
- Maier-Hein et al. [2014b] L. Maier-Hein, S. Mersmann, D. Kondermann, et al. Crowdsourcing for reference correspondence generation in endoscopic images. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), pages 349–356. Springer, 2014b.
- Maier-Hein et al.  L. Maier-Hein, D. Kondermann, T. Roß, S. Mersmann, E. Heim, S. Bodenstedt, H. G. Kenngott, A. Sanchez, M. Wagner, A. Preukschas, et al. Crowdtruth validation: a new paradigm for validating algorithms that rely on image correspondences. International Journal of Computer Assisted Radiology and Surgery, 10(8):1201–1212, 2015.
- Maier-Hein et al.  L. Maier-Hein, T. Ross, J. Gröhl, B. Glocker, S. Bodenstedt, C. Stock, E. Heim, M. Götz, S. Wirkert, H. Kenngott, et al. Crowd-algorithm collaboration for large-scale endoscopic image annotation with confidence. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), pages 616–623. Springer, 2016.
- Malpani et al.  A. Malpani, S. S. Vedula, C. C. G. Chen, and G. D. Hager. A study of crowdsourced segment-level surgical skill assessment using pairwise rankings. International journal of computer assisted radiology and surgery, 10(9):1435–47, Sept. 2015.
- Mavandadi et al.  S. Mavandadi, S. Dimitrov, S. Feng, F. Yu, U. Sikora, O. Yaglidere, S. Padmanabhan, K. Nielsen, and A. Ozcan. Distributed medical image analysis and diagnosis through crowd-sourced games: A malaria case study. PLoS ONE, 7(5), 2012.
- McKenna et al.  M. T. McKenna, S. Wang, T. B. Nguyen, J. E. Burns, N. Petrick, and R. M. Summers. Strategies for improved interpretation of computer-aided detections for ct colonography utilizing distributed human intelligence. Medical image analysis, 16(6):1280–1292, 2012.
- Mitry et al.  D. Mitry, T. Peto, S. Hayat, J. E. Morgan, K.-T. Khaw, and P. J. Foster. Crowdsourcing as a novel technique for retinal fundus photography classification: Analysis of images in the epic norfolk cohort on behalf of the ukbiobank eye and vision consortium. PLoS ONE, 8(8):e71154, 2013.
- Mitry et al.  D. Mitry, T. Peto, S. Hayat, P. Blows, J. Morgan, K.-T. Khaw, and P. J. Foster. Crowdsourcing as a screening tool to detect clinical features of glaucomatous optic neuropathy from digital photography. PLoS ONE, 10(2):1–8, 2015.
- Mitry et al.  D. Mitry, K. Zutis, B. Dhillon, T. Peto, S. Hayat, K.-T. Khaw, J. E. Morgan, W. Moncur, E. Trucco, and P. J. Foster. The accuracy and reliability of crowdsource annotations of digital retinal images. Translational vision science & technology, 5(5):6–6, 2016.
- Nguyen et al.  T. B. Nguyen, S. Wang, V. Anugu, N. Rose, M. McKenna, N. Petrick, J. E. Burns, and R. M. Summers. Distributed human intelligence for colonic polyp classification in computer-aided detection for CT colonography. Radiology, 262(3):824–833, 2012.
- O’Neil et al.  A. Q. O’Neil, J. T. Murchison, E. J. van Beek, and K. A. Goatman. Crowdsourcing labels for pathological patterns in ct lung scans: Can non-experts contribute expert-quality ground truth? In Intravascular Imaging and Computer Assisted Stenting, and Large-Scale Annotation of Biomedical Data and Expert Label Synthesis (MICCAI LABELS), pages 96–105. Springer, 2017.
- Ørting et al.  S. N. Ørting, V. Cheplygina, J. Petersen, L. H. Thomsen, M. M. W. Wille, and M. de Bruijne. Crowdsourced emphysema assessment. In Intravascular Imaging and Computer Assisted Stenting, and Large-Scale Annotation of Biomedical Data and Expert Label Synthesis (MICCAI LABELS), pages 126–135. Springer, 2017.
- Park et al.  J. H. Park, S. Mirhosseini, S. Nadeem, J. Marino, A. Kaufman, K. Baker, and M. Barish. Crowdsourcing for identification of polyp-free segments in virtual colonoscopy videos. In Medical Imaging 2017: Imaging Informatics for Healthcare, Research, and Applications, volume 10138, page 101380V. International Society for Optics and Photonics, 2017.
- Park et al.  J. H. Park, S. Nadeem, J. Marino, K. Baker, M. Barish, and A. Kaufman. Crowd-assisted polyp annotation of virtual colonoscopy videos. In Medical Imaging 2018: Imaging Informatics for Healthcare, Research, and Applications, volume 10579, page 105790M. International Society for Optics and Photonics, 2018.
- Rajchl et al.  M. Rajchl, M. C. Lee, F. Schrans, A. Davidson, J. Passerat-Palmbach, G. Tarroni, A. Alansary, O. Oktay, B. Kainz, and D. Rueckert. Learning under distributed weak supervision. arXiv preprint arXiv:1606.01100, 2016.
- Rajchl et al.  M. Rajchl, L. M. Koch, C. Ledig, J. Passerat-Palmbach, K. Misawa, K. Mori, and D. Rueckert. Employing weak annotations for medical image analysis problems. arXiv preprint arXiv:1708.06297, 2017.
- Ranard et al.  B. L. Ranard, Y. P. Ha, Z. F. Meisel, D. A. Asch, S. S. Hill, L. B. Becker, A. K. Seymour, and R. M. Merchant. Crowdsourcing: harnessing the masses to advance health and medicine, a systematic review. Journal of General Internal Medicine, 29(1):187–203, Jan. 2014.
- Roethlingshoefer et al.  V. Roethlingshoefer, S. Bittel, H. Kenngott, M. Wagner, S. Bodenstedt, T. Ross, S. Speidel, and M.-H. L. How to create the largest in-vivo endoscopic dataset. In Intravascular Imaging and Computer Assisted Stenting, and Large-Scale Annotation of Biomedical Data and Expert Label Synthesis (MICCAI LABELS), 2017.
- Sameki et al.  M. Sameki, D. Gurari, and M. Betke. ICORD: Intelligent Collection of Redundant Data - A Dynamic System for Crowdsourcing Cell Segmentations Accurately [...]. In Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1380–1389, 2016.
- Sharma et al.  M. Sharma, O. Saha, A. Sriraman, R. Hebbalaguppe, L. Vig, and S. Karande. Crowdsourcing for chromosome segmentation and deep classification. In Computer Vision and Pattern Recognition Workshops (CVPRW), pages 786–793. IEEE, 2017.
- Smittenaar et al.  P. Smittenaar, A. K. Walker, S. McGill, C. Kartsonaki, R. J. Robinson-Vyas, J. P. McQuillan, S. Christie, L. Harris, J. Lawson, E. Henderson, et al. Harnessing citizen science through mobile phone technology to screen for immunohistochemical biomarkers in bladder cancer. British journal of cancer, 119(2):220, 2018.
- Sonabend et al.  A. M. Sonabend, B. E. Zacharia, M. B. Cloney, A. Sonabend, C. Showers, V. Ebiana, M. Nazarian, K. R. Swanson, A. Baldock, H. Brem, et al. Defining glioblastoma resectability through the wisdom of the crowd: a proof-of-principle study. Neurosurgery, 80(4):590–601, 2017.
- Sullivan et al.  D. P. Sullivan, C. F. Winsnes, L. Åkesson, M. Hjelmare, M. Wiking, R. Schutten, L. Campbell, H. Leifsson, S. Rhodes, A. Nordgren, et al. Deep learning is combined with massive-scale citizen science to improve large-scale image classification. Nature biotechnology, 36(9):820, 2018.
- Timmermans et al.  B. Timmermans, Z. Szlávik, and R.-J. Sips. Crowdsourcing ground truth data for analysing brainstem tumors in [...]. In Belgium Netherlands Artificial Intelligence Conference (BNAIC), 2016.
- Wazny  K. Wazny. Crowdsourcing ten years in: A review. Journal of global health, 7(2), 2017.