Automatic indexing is an important prerequisite for enabling semantic search in large video archives. In particular, visual concept classification and person recognition play an essential role in providing fine-grained access to large image and video databases such as the historical video collection of the German Broadcasting Archive (DRA). The DRA maintains the cultural heritage of television broadcasts of the former German Democratic Republic (GDR) and thus grants access to researchers who are interested in German-German history. Semantic search through pre-defined lexicons of visual concepts and personalities associated with the former GDR can support the investigation of specific research questions and serve as a starting point for further analyses and scientific studies. Typically, a lexicon of visual concepts and persons has to be defined in advance. However, in order to allow for such a fine-grained search by automatically indexing the video collection, a huge manual effort is necessary to initialize this process: a large amount of manually labeled keyframes and shots is required to train or fine-tune deep learning based video indexing approaches. While the number of images is known to matter for these approaches, the quality of the manually annotated training data also affects video indexing performance. In this context, visual concepts in images and video frames are not always perceived objectively. As shown in previous work [7, 2], human annotators can have different understandings and judgments of certain concepts, and the inter-coder agreement varies noticeably for different visual concepts. Therefore, the precise definition of such concepts and consequently carefully labeled data are crucial for system success.
In this paper, we investigate the inter-coder agreement of experts and non-experts for annotations of historical TV data of the former GDR. The quality of manually labeled keyframes is assessed for both groups for visual concept classification and person recognition. Moreover, correlations between inter-coder agreement and system performance, in terms of average precision results on the two tasks of visual concept classification and person recognition, are computed and analyzed. Our hypothesis is that, besides the amount of available training data, the inter-coder agreement is correlated with the performance of video indexing methods based on supervised learning. Moreover, we suppose that inter-coder agreement might form an upper bound for such methods. In this regard, first experiments are presented that predict average precision using support vector regression based on inter-coder agreement and training data size. Furthermore, the influence of image annotation quality induced by experts vs. non-experts on person recognition performance is investigated based on the annotations acquired during the study.
The remainder of the paper is organized as follows. Section 2 discusses related work regarding inter-annotator studies. Section 3 deals with the comprehensive user study, including a description of the used dataset, the study participants, the experimental design, and the results of a comparison between expert and non-expert inter-coder agreement. In Section 4, a performance prediction for the task of concept classification is presented, based on inter-coder agreement and training data size as input. In Section 5, the impact of expert vs. non-expert image annotations on person identification performance is investigated. Section 6 concludes the paper and outlines areas for future work.
2 Related Work
In this section, we briefly survey related work for inter-annotator studies conducted for natural language as well as image annotation tasks. Snow et al. 
have evaluated non-expert annotations by means of Mechanical Turk workers for natural language tasks. Among other experiments, they found that the annotations of four non-experts are needed to rival the annotation quality of one expert annotator in selected tasks. Furthermore, they trained machine learning classifiers on expert as well as non-expert annotations and reported better system performance for non-experts, attributed to a high annotation diversity reducing the annotator bias. However, these observations were exclusively made for natural language tasks. Nowak and Rüger presented a study on inter-coder agreement for image annotation with annotations from both crowdsourcing workers and experts. Human annotators had to label 99 images of the ImageCLEF test data (http://www.imageclef.org/) with respect to 22 concept categories. Some of the categories were mutually exclusive (season, time of day, indoor/outdoor/none). The images were assessed by experts as well as by Mechanical Turk workers. They measured higher agreement for experts, but argue that majority voting filters out noise in non-expert annotations, closing the gap to the higher-quality expert annotations. A more recent study deals with the question of whether machines perform better than humans in visual recognition tasks. To assess human performance, inter-coder reliability in terms of Krippendorff's α is measured on 20 common categories of the PASCAL VOC benchmark. For the best submission on PASCAL VOC's leaderboard, an above-average human-level performance for visual concept annotation is reported, on a par with or better than 19 of 23 participants.
3 Annotation Study: Expert vs. Non-Expert Agreement on GDR-specific Concepts and Persons
In this section, we aim at comparing the reliability of expert vs. non-expert annotations for historical TV data in terms of inter-coder agreement. For this purpose, we collected annotations of expert as well as student participants (Section 3.2) on a selected set of concept and person images (Section 3.1), following the experimental design described in Section 3.3. Agreement results are discussed in Section 3.4.
3.1 Dataset of Historical Concepts and Personalities
In a previous project that realized content-based video retrieval in historical GDR television data, a GDR-specific lexicon of 100 classes (91 concepts and 9 persons) was defined based on the utility and usefulness for anticipated search queries . The concept classification was based on a multi-label CNN approach using a GoogLeNet architecture . Person recognition was performed using FaceVACS (http://www.cognitec.com), a commercial approach combining face alignment and recognition. The visual concept and person recognition models have been applied to a test set of about 2,500 hours of video data and evaluated on the Top-200 retrieval results in terms of mean average precision. For this purpose, the Top-200 retrieval results per class had been manually labeled by an expert . For the current study, 20 concepts and 9 persons were chosen from the GDR-specific lexicon. The concepts were selected based on their average precision performance so that concepts with high, medium, and low retrieval quality are evenly represented. The images for the study were randomly sampled from the Top-200 retrieval results of the selected concepts and persons. Altogether, 744 positive images for concepts and 199 positive images of persons are incorporated. This includes additional images for rarely occurring classes and an extra person ("Mikhail Gorbachev"). To ensure a minimum of 10 images per class, the additional images were collected from the test set using metadata information and video OCR (optical character recognition). To further enlarge the dataset, 257 images with negative labels were randomly chosen from the Top-200 retrieval results (178 images from concept retrieval results and 79 images from person retrieval results). These images of unknown content tend to be similar to one of the concepts or persons, making the manual labeling process more difficult. Finally, the image annotation study comprises 1,200 images, 20 concepts, and 10 persons, as presented in Table 1.
3.2 Study Participants
Five experts and five non-experts participated in our inter-coder agreement study. The group of experts consists of DRA employees who are very familiar with the historical material used in our study. In particular, the experts are fully trained archivists and information specialists with work experience ranging from 3 to 28 years. The non-expert group is composed of Master's students in computer science who have not worked with the material previously. For their participation in the study, the students were rewarded with 30 Euro.
3.3 Experimental Design
Participants were asked to classify historical images into 20 concept as well as 10 person categories. Multiple labels could be assigned in case more than one concept was present in an image. Since we included random images depicting none of the available categories, participants also had the choice to skip images and leave them unlabeled.
Participants were provided with annotation instructions, which they were allowed to keep during the study. For the concept categories, we provided formal descriptions as well as some hints on how certain concepts may appear in the images. For each of the personalities, we provided three images, which were chosen from both Google and further archive material. The images were selected such that the person's variability in appearance and age was covered as much as possible. Furthermore, we made sure that the example images were mutually exclusive with the ones in the study. Before the start of the annotation task, the participants were given the opportunity to look up concepts and persons on the Internet in case of uncertainties. However, they were prohibited from searching the Internet during the study.
To evaluate the degree of agreement between different sets of annotators, we used Krippendorff's alpha (K's α) coefficient . We chose K's α instead of other reliability measures since it is not affected by the number of coders . Hence, reliability can be compared for annotator groups of asymmetric size. Moreover, K's α is able to handle missing data. This also suits our study setup well, since we included images that do not belong to any of our categories and allowed participants to leave images unlabeled.
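The agreement computation can be illustrated with a compact implementation of K's α for the nominal case; the sketch below (function name and data layout are our own) handles the missing values that arise when participants leave images unlabeled.

```python
from collections import defaultdict

def krippendorff_alpha_nominal(reliability_data):
    """Krippendorff's alpha for nominal data with missing values.

    reliability_data: one list of labels per coder, aligned by unit
    (image); None marks a skipped, unlabeled unit.
    """
    n_units = len(reliability_data[0])
    # Units rated by fewer than two coders carry no pairable
    # information and are ignored.
    units = []
    for u in range(n_units):
        values = [row[u] for row in reliability_data if row[u] is not None]
        if len(values) >= 2:
            units.append(values)

    # Coincidence matrix o[(c, k)]: pairable value co-occurrences.
    o = defaultdict(float)
    for values in units:
        m = len(values)
        for i, c in enumerate(values):
            for j, k in enumerate(values):
                if i != j:
                    o[(c, k)] += 1.0 / (m - 1)

    totals = defaultdict(float)
    for (c, _), v in o.items():
        totals[c] += v
    n = sum(totals.values())

    # Observed vs. expected disagreement (nominal distance metric).
    d_o = sum(v for (c, k), v in o.items() if c != k) / n
    d_e = sum(totals[c] * totals[k]
              for c in totals for k in totals if c != k) / (n * (n - 1))
    return 1.0 - d_o / d_e

# Two coders, four images, one image skipped by the first coder.
print(krippendorff_alpha_nominal([[1, 0, 1, None],
                                  [1, 0, 1, 1]]))  # -> 1.0
```

Because units with only a single rating are dropped from the coincidence matrix, skipped images reduce the amount of pairable data but do not bias the coefficient.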
The resulting inter-coder agreement in terms of Krippendorff's α for the 1,200 image samples of our study is displayed in Figure 2. We report K's α per category for the group of experts and the group of non-experts, respectively. In order to measure how well non-experts and experts agree on the categories, we also determined the combined agreement among both groups. While experts and non-experts agree almost equally strongly on the concept categories, expert annotations are more homogeneous for the GDR personalities. According to Krippendorff , an α ≥ 0.8 can be considered an agreement value for reliable annotations. Therefore, annotations on concepts can be counted as reliable for both groups of experts and students. Considering single concepts, there are many categories non-experts do surprisingly well on (12 concepts above the reference value of 0.8). In comparison to the experts, they performed noticeably worse on concepts like mining, microelectronics, or ship launching. For some of these categories the provided images are darker or low-quality close-ups, which are hard to classify due to either similar categories or lacking knowledge of the historical context. Nevertheless, non-experts were able to annotate 12 concept categories reliably, whereas expert annotations of 11 concepts are considered reliable.
For persons, greater differences between experts and non-experts can be observed, suggesting that experts are the better annotators for the person identification task. The non-expert agreement is lower for all selected GDR personalities. While non-experts accurately identify commonly known personalities like Erich Honecker, Mikhail Gorbachev, or Walter Ulbricht, the largest uncertainties can be observed for persons like Fritz Cremer or Hermann Henselmann. This could be due to images of poor quality and also to a larger variability in appearance for the latter person, making him harder to recognize. Though overall only the person annotations of the expert group are considered a reliable source according to Krippendorff's rule, non-expert annotations seem to be 'good enough' for the same personalities that experts annotated reliably.
For the agreement of the aggregated expert and non-expert annotations, some interesting observations can also be made. Experts agree more strongly among themselves than with the combined group, especially in the case of persons. Non-experts, on the contrary, agree more strongly with experts in the case of persons. Surprisingly, the overall concept agreement in the inter-group scenario is also lower than the agreement of the weaker non-expert annotators. This suggests that the groups of archivists and students may have different understandings of some of the concepts, causing diverging biases. Example concepts for which the inter-group agreement is slightly lower than both the isolated expert and non-expert agreement are waiting line, military parade, and theater performance. Superior joint agreement of experts and non-experts in comparison to single-group agreement can only be observed for the concept shipyard, due to a subset of strongly agreeing experts and students.
Earlier in this section, we argued that high agreement is an indicator of an accurate set of annotations. For the concept categories, experts and non-experts were found to be equally accurate annotators, whereas experts are the more accurate labelers when it comes to historically relevant persons of the GDR. However, the judgments of different groups may carry different biases, as also reasoned for the concept categories. In the next section, we analyze to what extent agreement contributes to video indexing performance and investigate the opportunity of performance prediction. In Section 5, we then determine, for the sets of personality annotations collected in our study, whether higher agreement on training images implies higher system performance for person recognition.
4 Predicting Concept Classification Performance
In this section, we discuss correlations between the original average precision results and the inter-coder agreements from our study as well as the training data size (Section 4.1). Afterwards, we exploit the determined correlations in order to predict the AP (average precision) of visual concept classification depending on the amount of training data and the agreement on concepts (Section 4.2).
4.1 Correlations with Average Precision Results
In Figure 3, the correlations between the original average precision results presented in Table 1, the inter-coder agreement, and the number of training images (see also Table 1) are displayed. The highest correlations are measured for AP results on concepts. While the correlation with expert agreement (0.64) is slightly higher than with non-expert agreement (0.62), an even stronger correlation between AP values and the number of training images can be measured (0.78). Only moderately high correlations can be noticed for AP on persons as well as for the total AP results. For this reason, we exploit the high correlations observed for concepts in order to perform a performance prediction for the task of concept classification based on both inter-coder agreement and training data size.
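Such pairwise correlations can be computed directly from per-concept vectors; the short sketch below uses invented illustrative numbers, not the study's actual values.

```python
import numpy as np

# Illustrative per-concept values (not the study's actual numbers):
ap        = np.array([0.91, 0.85, 0.62, 0.55, 0.40, 0.33])  # average precision
agreement = np.array([0.92, 0.88, 0.75, 0.70, 0.66, 0.58])  # K's alpha
n_train   = np.array([900, 750, 400, 350, 200, 150])        # training images

# Pairwise Pearson correlation coefficients, as visualized in Figure 3.
r_ap_agreement = np.corrcoef(ap, agreement)[0, 1]
r_ap_ntrain = np.corrcoef(ap, n_train)[0, 1]
print(r_ap_agreement, r_ap_ntrain)
```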
The concepts in Figure 4 are sorted according to AP, and the figure compares AP to inter-coder agreement and normalized training data size. For example, it can be noticed that for nine out of the ten best concepts the inter-coder agreement is above 0.8, whereas it is below 0.8 for the other ten concepts. Furthermore, it can be observed that the six best concepts have the highest numbers of training images.
4.2 Performance Prediction for Concept Classification
We employ support vector regression (SVR) based on a linear kernel to estimate the AP on our study concepts, given the corresponding training data sizes and agreement values as input. SVR is performed separately for expert and student agreement. According to leave-one-out cross-validation, utilizing expert agreement yields a lower mean absolute error than using non-expert agreement as input. In order to determine the individual impact of the number of training images and of annotation quality on predicting the AP on concepts, we also train SVR with training data size and expert agreement as separate, single inputs. Using solely the training data size as input results in a higher mean absolute error, as does prediction based on the agreement only. As a point of reference, we also compare against a random baseline that predicts by chance.
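The leave-one-out setup can be sketched as follows, assuming scikit-learn is available; the data here are synthetic stand-ins for the per-concept agreement values, training data sizes, and AP scores.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import LeaveOneOut
from sklearn.svm import SVR

# Synthetic stand-ins: agreement and normalized training data size
# per concept, with an assumed roughly linear relation to AP.
rng = np.random.default_rng(0)
agreement = rng.uniform(0.4, 1.0, size=20)
train_size = rng.uniform(0.0, 1.0, size=20)
X = np.column_stack([agreement, train_size])
ap = 0.5 * agreement + 0.4 * train_size

# Leave-one-out cross-validation with a linear-kernel SVR.
preds = np.empty_like(ap)
for train_idx, test_idx in LeaveOneOut().split(X):
    model = SVR(kernel="linear", epsilon=0.01)
    model.fit(X[train_idx], ap[train_idx])
    preds[test_idx] = model.predict(X[test_idx])

print(mean_absolute_error(ap, preds))
```

With only 20 samples, leave-one-out is preferable to k-fold splits because every concept serves as a test case exactly once.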
Figure 5 shows the errors on the individual concepts for the different performance estimation approaches. It can be seen that the combined input variables significantly reduce the error for some of the concepts and overall lead to lower errors for most of them, whereas training-data-size-based and especially agreement-based prediction lead to high errors for many concepts. Overall, the estimation results suggest that the AP performance on the historical concepts can be jointly approximated by training image quantity and image annotation quality. Using expert agreement and performing SVR on all concepts as training samples (i.e., without the leave-one-out setting), AP can be approximated by a linear function of the form

AP(a, t) ≈ w_a · a + w_t · t + b,

where a corresponds to the normalized agreement K's α, t to the normalized training data size, and w_a, w_t, and b to the parameters learned by the SVR. The inputs have to lie within [0, 1] for valid estimates, and predicted AP values above 1 are set to 1.
5 Impact of Expert vs. Non-Expert Annotations on Person Identification Performance
In this section, we evaluate person identification performance based on different sets of expert and non-expert annotations acquired in the inter-coder annotation study (Section 3). In particular, we create annotation-dependent classifiers following a face matching approach and evaluate systems based on single participants' votes as well as on majority votes for both experts and non-experts.
5.1 Experimental Setup
5.1.1 Person Identification Method
For the identification of the ten introduced personalities associated with the GDR, we employ a basic face matching approach in which comparisons rely on deep feature representations. In order to build a dictionary of facial representations, each personality is represented by the mean of their embeddings, which are generated from a ground truth set of face images. An unknown face is then considered to be a person from the dictionary if the cosine similarity between its embedding and a dictionary embedding is sufficiently high according to a cosine similarity threshold. In order to extract deep facial features, we employ a FaceNet implementation (https://github.com/davidsandberg/facenet), which was trained on the VGGFace2 dataset  using the Inception-ResNet v1 architecture . The model achieves a competitive accuracy on the LFW benchmark . We use the cosine similarity threshold provided in the FaceNet implementation code.
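The dictionary-based matching step can be sketched as follows; the function names and the threshold value are illustrative, whereas the actual system uses FaceNet embeddings and the threshold from the implementation code.

```python
import numpy as np

def build_dictionary(embeddings_by_person):
    """One mean embedding per personality, computed from the
    ground-truth face images of that person."""
    return {name: np.mean(np.asarray(embs), axis=0)
            for name, embs in embeddings_by_person.items()}

def identify(query_embedding, dictionary, threshold=0.5):
    """Best-matching person by cosine similarity, or None if no
    dictionary entry reaches the threshold."""
    best_name, best_sim = None, -1.0
    q = query_embedding / np.linalg.norm(query_embedding)
    for name, ref in dictionary.items():
        sim = float(np.dot(q, ref / np.linalg.norm(ref)))
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name if best_sim >= threshold else None

# Toy 2-D "embeddings" for two personalities.
dictionary = build_dictionary({
    "Erich Honecker":  [[1.0, 0.0], [1.0, 0.1]],
    "Walter Ulbricht": [[0.0, 1.0]],
})
print(identify(np.array([0.9, 0.1]), dictionary))  # -> Erich Honecker
```

The threshold turns the nearest-neighbor match into an open-set decision: faces that resemble none of the dictionary entries are rejected rather than forced onto the closest personality.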
5.1.2 Training and Test Set
We use the judgments on the GDR personalities of the different annotators from the presented annotation study as input data for the described person identification approach. For testing, we utilize further archive image data from the DRA as well as additional images crawled from Google. After detecting faces and manually discarding irrelevant ones, we obtain a total of 350 face images, which constitute our test set.
Following the described approach, we independently build a training dictionary for each of the expert and non-expert study participants. Each dictionary relies on the annotator's personal judgments from the annotation study, which allows us to directly compare person identification performance across different sets of annotators. After evaluating the classification accuracy based on each participant's training dictionary individually, we average the estimated performances for experts and non-experts, respectively. Furthermore, we generate ground truth labels based on the majority vote of the five experts and also evaluate the performance for this dictionary. For the non-experts, we perform the majority-vote-based approach accordingly.
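Deriving the majority-vote labels from the individual judgments might look as follows; the function name and the strict-majority handling of ties are our own assumptions.

```python
from collections import Counter

def majority_vote(labels_per_annotator):
    """Aggregate per-image labels from several annotators.

    labels_per_annotator: one dict per annotator, mapping image id
    to a person label (or None for skipped images). Returns the
    label per image that reaches a strict majority, else None.
    """
    n_annotators = len(labels_per_annotator)
    image_ids = set()
    for labels in labels_per_annotator:
        image_ids.update(labels)

    result = {}
    for image_id in image_ids:
        votes = Counter(labels[image_id] for labels in labels_per_annotator
                        if labels.get(image_id) is not None)
        if votes:
            label, count = votes.most_common(1)[0]
            result[image_id] = label if count > n_annotators / 2 else None
        else:
            result[image_id] = None
    return result

# Five annotators; image "a" has a clear majority, "b" does not.
annotations = [{"a": "Honecker", "b": "Cremer"},
               {"a": "Honecker", "b": "Cremer"},
               {"a": "Honecker", "b": "Hermlin"},
               {"a": "Ulbricht", "b": "Hermlin"},
               {"a": "Ulbricht", "b": None}]
print(majority_vote(annotations))
```

Images without a strict majority are excluded from the aggregated dictionary, which is one way majority voting filters out individual incorrect judgments.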
| Person | Single expert (avg.) | Single non-expert (avg.) | Expert majority vote | Non-expert majority vote |
| --- | --- | --- | --- | --- |
| Christa Wolf | 76.00 % | 72.80 % | 76.00 % | 64.00 % |
| Erich Honecker | 87.13 % | 85.94 % | 86.14 % | 87.13 % |
| Fritz Cremer | 88.33 % | 83.33 % | 91.67 % | 83.33 % |
| Hermann Henselmann | 53.33 % | 48.33 % | 75.00 % | 66.67 % |
| Hilde Benjamin | 86.67 % | 86.67 % | 87.50 % | 87.50 % |
| Mikhail Gorbachev | 95.45 % | 95.45 % | 95.45 % | 95.45 % |
| Sigmund Jähn | 86.06 % | 81.52 % | 84.85 % | 81.82 % |
| Stephan Hermlin | 64.12 % | 60.29 % | 70.59 % | 67.65 % |
| Walter Ulbricht | 76.29 % | 77.29 % | 75.71 % | 78.57 % |
| Werner Tübke | 82.35 % | 81.76 % | 82.35 % | 82.35 % |
| Overall | 80.97 % | 79.43 % | 82.00 % | 80.86 % |
Comparing the results of the single expert and non-expert systems in Table 2, expert annotations lead to an average increase of over 1.5 % in performance. For only three of the ten personalities, the non-expert systems can keep up with those of the experts. Only in the case of Walter Ulbricht is the non-expert performance superior, by 1 %. For the other seven personalities, expert training images lead to an increase of up to 5 % in accuracy. In accordance with expectations, annotations determined using the majority vote of the respective five participants lead to an increase in performance for both experts and non-experts in comparison to individual judgments. The majority vote allows for more certain labels and acts as a filter for noise such as individual incorrect judgments or accidental false annotations. However, expert majority voting is still superior to that of single as well as multiple non-experts. For Christa Wolf, the majority vote annotations even lead to a decrease in accuracy of almost 9 % compared to the individual non-expert systems. This may be due to a minority of better non-expert labels lifting the average of the single non-expert systems while having no influence on the majority vote.
As for the number of annotated images used per system, we cannot measure a significant correlation with the person identification results (average correlation: 0.297). Therefore, we assume that differences in performance are not caused by imbalanced training data but by incorrectly annotated person images.
Although the results suggest only a moderate impact of labeling quality on person identification performance, we argue that in a real-world annotation task, in which the archive material can depict arbitrary persons, identifying more than ten persons is even harder. In this regard, labeling errors may have a higher impact on system performance. Please also note that the non-expert participants were German students who were more or less familiar with GDR history. Hence, labels from annotators who are completely unaware of German-German history could be more erroneous, presumably causing a higher performance gap between expert and non-expert person identification.
6 Conclusion
In this paper, we have presented an inter-coder agreement study for historical image annotation involving both expert and non-expert annotators. We conducted our study for different image categories of 20 concepts and 10 personalities associated with the former GDR. In this regard, we evaluated whether amateur annotators can compete with the expertise of DRA archivists. We found that non-experts are as reliable annotators as experts for common-sense and GDR-specific concepts. However, in identifying historical personalities, non-experts were less certain, which is reflected in a lower agreement. We also found that non-experts agree more strongly with experts than with other non-experts on personalities, presumably due to a fraction of 'better' labelers among the non-experts.
In an attempt to identify the importance of inter-coder agreement for the performance of visual concept classification, we could determine correlations between AP results, training data size, and inter-coder agreement. Using support vector regression, we jointly modeled AP prediction for visual concepts by the amount of training data and agreement value. Our results indicate that it is very important to consider inter-coder agreement when designing a lexicon of visual concepts for a specific domain.
We also extended our study to determine how labeling quality affects system performance for the task of person identification. Single expert systems on average outperformed systems based on single non-expert annotation sets. In this context, we argued that the differences in performance may even grow in a more complex real-world annotation task involving more persons to be identified among even more unknown persons in the data. It turned out that majority voting corrects erroneous annotations and thus positively affects system performance. However, when it comes to annotating the large-scale image material that current deep learning methods require, it is usually not feasible to annotate several times. Also, employing crowdworkers for co-annotation may not yield labels as accurate as those of the German students considered non-experts in our case, due to a lack of domain knowledge.
This work is financially supported by the German Research Foundation (DFG: Deutsche Forschungsgemeinschaft, project numbers: EW 134/4-1, FR 791/15-1, HO 5800).
-  Cao, Q., Shen, L., Xie, W., Parkhi, O.M., Zisserman, A.: Vggface2: A dataset for recognising faces across pose and age. In: International Conference on Automatic Face and Gesture Recognition (2018)
-  Ewerth, R., Springstein, M., Phan-Vogtmann, L.A., Schütze, J.: ”Are machines better than humans in image tagging?” - A user study adds to the puzzle. In: Proceedings of the European Conference on Information Retrieval (ECIR). pp. 186–198. Springer, Cham (2017)
-  Hayes, A.F., Krippendorff, K.: Reliability in Content Analysis. Human Communication Research 30(3), 411–433 (2004). https://doi.org/10.1111/j.1468-2958.2004.tb00738.x
-  Huang, G.B., Ramesh, M., Berg, T., Learned-Miller, E.: Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Tech. Rep. 07-49, University of Massachusetts, Amherst (2007)
-  Krippendorff, K.: Answering the Call for a Standard Reliability Measure for Coding Data. Communication Methods and Measures 1(1), 77–89 (2007). https://doi.org/10.1080/19312450709336664
-  Mühling, M., Meister, M., Korfhage, N., Wehling, J., Hörth, A., Ewerth, R., Freisleben, B.: Content-based video retrieval in historical collections of the German Broadcasting Archive. In: International Conference on Theory and Practice of Digital Libraries (TPDL). pp. 67–78 (2016). https://doi.org/10.1007/s00799-018-0236-z
-  Nowak, S., Rüger, S.: How reliable are annotations via crowdsourcing? A study about inter-annotator agreement for multi-label image annotation. In: Proceedings of the international conference on Multimedia information retrieval (MIR). pp. 557–566 (2010). https://doi.org/10.1145/1743384.1743478
-  Snow, R., O'Connor, B., Jurafsky, D., Ng, A.Y.: Cheap and fast - but is it good? Evaluating non-expert annotations for natural language tasks. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 254–263 (2008)
-  Smola, A.J., Schölkopf, B.: A Tutorial on Support Vector Regression. Statistics and Computing 14(3), 199–222 (2004). https://doi.org/10.1023/B:STCO.0000035301.49549.88
-  Szegedy, C., Ioffe, S., Vanhoucke, V., Alemi, A.: Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning (2016), http://arxiv.org/abs/1602.07261
-  Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2015)