The development of real-time affect detection models often depends on obtaining annotated data for supervised learning by employing human experts to label student data. One open question in annotating affective data for affect detection is whether the labelers (i.e., human experts) need to be socio-culturally similar to the students being labeled, as this impacts the cost and feasibility of obtaining the labels. In this study, we investigate the following research questions: First, for affective state annotation, how does the socio-cultural background of human expert labelers, relative to that of the subjects, impact the degree of consensus and the distribution of affective states obtained? Second, how do differences in labeler background impact the performance of affect detection models trained on these labels?
We employed 5 experts from the United States and 5 experts from Turkey to label the same data, collected through authentic classroom pilots with students in Turkey. Using HELP, each group labeled 14 hours of multi-modal data collected from ten 9th-grade students in 2 sessions (40 mins each) for 3 affective states: Satisfied, Bored, and Confused. We analyzed within-country and cross-country inter-rater agreements using Krippendorff's alpha [3, 8], considering both all 5 experts and the best 3 experts (those with the highest mutual agreement) of each group. We also compared affective state distributions using the majority labels obtained by each group.
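As a minimal sketch of the agreement analysis (not the study's actual implementation; the function names and input format are illustrative), nominal Krippendorff's alpha and the best-3 subset selection can be computed as follows:

```python
from collections import Counter
from itertools import combinations, permutations

def krippendorff_alpha_nominal(units):
    """units: one list of labels per instance (one label per expert)."""
    o = Counter()  # coincidence matrix over ordered label pairs
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a single rating contributes no (dis)agreement
        for i, j in permutations(range(m), 2):
            o[(labels[i], labels[j])] += 1.0 / (m - 1)
    n_c = Counter()  # marginal totals per label
    for (c, _), w in o.items():
        n_c[c] += w
    n = sum(n_c.values())
    d_o = sum(w for (c, k), w in o.items() if c != k)  # observed disagreement
    d_e = sum(n_c[c] * n_c[k]                          # disagreement expected by chance
              for c, k in permutations(n_c, 2)) / (n - 1)
    return 1.0 - d_o / d_e

def best_subset(ratings_by_expert, k=3):
    """Pick the k experts whose labels yield the highest alpha."""
    return max(combinations(ratings_by_expert, k),
               key=lambda grp: krippendorff_alpha_nominal(
                   list(zip(*(ratings_by_expert[e] for e in grp)))))
```

Perfect agreement yields alpha = 1.0, while systematic disagreement drives alpha negative; the best-3 selection simply exhausts all 3-expert subsets, which is cheap for 5 experts per group.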
For the affect detection models, we employed two modalities: (1) Appearance (Appr): upper-body information from the camera; (2) Context & Performance (C&P): interaction and performance logs from the online learning platform for Math. For Appr, the raw video data were segmented into instances, and time-series analysis methods were utilized to extract 188 appearance features, consisting of motion and energy measures, robust statistical estimators of head velocity, and frequency-domain features related to head position, pose, and facial expressions. Further details of the Appr modality can be found in our previous study, where the same features were used. For C&P, we extracted 24 features related to time (time spent on videos/questions), grade (success/failure of attempts), hints (number of hints used on questions), attempts (number of trials), and others (gender). Further details of the C&P features employed, which are adapted from an earlier study, can be found in our previous work. Instances are sliding windows of 8 sec with 4-sec overlaps. Further details of the methodology used in this study can be found in the full version of this paper.
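The windowing step above can be sketched as follows; this is a simplified illustration in which the frame rate and function name are assumptions, not the study's actual pipeline:

```python
def sliding_windows(n_frames, fps, win_sec=8, hop_sec=4):
    """Yield (start, end) frame-index pairs for fixed-length windows.

    win_sec=8 and hop_sec=4 reproduce the 8-sec windows with 4-sec
    overlaps described in the text; fps is an assumed camera frame rate.
    """
    win, hop = int(win_sec * fps), int(hop_sec * fps)
    for start in range(0, n_frames - win + 1, hop):
        yield start, start + win
```

Each yielded window becomes one instance, from which the 188 Appr features (and the aligned C&P features) are then extracted.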
Table 1: Krippendorff's alpha inter-rater agreements for the human expert groups (Labels: US and Labels: TR).
3 Experimental Results
The inter-rater agreements and affect detection model results are summarized in Table 1 and Table 2, respectively. Students' affective state distributions are given in Figure 1. These results indicate that the experts from Turkey obtained moderately better inter-rater agreement than the experts from the U.S. Note that even though the U.S. experts agree with each other, they agree fairly poorly with the experts from Turkey. In addition, we observed important differences between the distributions of affective states provided by experts in the U.S. versus Turkey, and between the performances of the resulting real-time multi-modal affect detectors, especially for the Bored and Confused states.
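The distribution comparison in Figure 1 rests on per-instance majority votes within each expert group. A minimal sketch, with an illustrative tie-handling policy (dropping instances with no majority) that is an assumption rather than the study's documented procedure:

```python
from collections import Counter

def majority_label(labels):
    """Majority vote across one group's experts for a single instance."""
    ranked = Counter(labels).most_common()
    if len(ranked) > 1 and ranked[1][1] == ranked[0][1]:
        return None  # tie: no majority, drop the instance (assumed policy)
    return ranked[0][0]

def label_distribution(instances):
    """Fraction of each affective state among majority-labeled instances."""
    votes = [v for v in (majority_label(ls) for ls in instances)
             if v is not None]
    return {c: n / len(votes) for c, n in Counter(votes).items()}
```

Running this separately on the US and TR label sets gives the two per-group distributions whose differences are discussed above.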
4 Discussion and Conclusion
Our findings suggest that there are indeed implications to using expert labelers who do not belong to the same population as the research subjects. The results of this study indicate that there could be a cultural impact on how labeling ambiguities for affective states are interpreted, which in turn affects affect detection model accuracies, especially for detecting the Bored and Confused states of the students. One key take-away message from this research is that cross-national or cross-cultural labelers should be vetted for inter-rater agreement very carefully.
- Alyuz et al.  N. Alyuz, E. Okur, U. Genc, S. Aslan, C. Tanriover, and A. A. Esme. An unobtrusive and multimodal approach for behavioral engagement detection of students. In Proceedings of the 1st ACM SIGCHI International Workshop on Multimodal Interaction for Education, MIE 2017, pages 26–32, New York, NY, USA, 2017. ACM. ISBN 978-1-4503-5557-5. doi: 10.1145/3139513.3139521. URL https://doi.acm.org/10.1145/3139513.3139521.
- Aslan et al.  S. Aslan, S. E. Mete, E. Okur, E. Oktay, N. Alyuz, U. E. Genc, D. Stanhill, and A. A. Esme. Human expert labeling process (HELP): Towards a reliable higher-order user state labeling process and tool to assess student engagement. Educational Technology, 57(1):53–59, 2017. ISSN 00131962. URL https://eric.ed.gov/?id=EJ1126255.
- Krippendorff  K. Krippendorff. Computing Krippendorff's alpha-reliability. Departmental Papers (ASC), (43), 2011. URL https://repository.upenn.edu/asc_papers/43.
- Ocumpaugh  J. Ocumpaugh. Baker Rodrigo Ocumpaugh Monitoring Protocol (BROMP) 2.0 technical and training manual. New York, NY and Manila, Philippines: Teachers College, Columbia University and Ateneo Laboratory for the Learning Sciences, 2015.
- Okur et al.  E. Okur, N. Alyuz, S. Aslan, U. Genc, C. Tanriover, and A. Arslan Esme. Behavioral engagement detection of students in the wild. In International Conference on Artificial Intelligence in Education (AIED 2017), volume 10331 of Lecture Notes in Computer Science, pages 250–261, Cham, June 2017. Springer International Publishing. ISBN 978-3-319-61425-0. doi: 10.1007/978-3-319-61425-0_21. URL https://doi.org/10.1007/978-3-319-61425-0_21.
- Okur et al.  E. Okur, S. Aslan, N. Alyuz, A. Arslan Esme, and R. S. Baker. Role of socio-cultural differences in labeling students’ affective states. In International Conference on Artificial Intelligence in Education (AIED 2018), volume 10947 of Lecture Notes in Computer Science, pages 367–380, Cham, June 2018. Springer International Publishing. ISBN 978-3-319-93843-1. doi: 10.1007/978-3-319-93843-1_27. URL https://doi.org/10.1007/978-3-319-93843-1_27.
- Pardos et al.  Z. A. Pardos, R. S. Baker, M. San Pedro, S. M. Gowda, and S. M. Gowda. Affective states and state tests: investigating how affect and engagement during the school year predict end-of-year learning outcomes. Journal of Learning Analytics, 1(1):107–128, 2014.
- Siegert et al.  I. Siegert, R. Böck, and A. Wendemuth. Inter-rater reliability for emotion annotation in human–computer interaction: comparison and methodological improvements. Journal on Multimodal User Interfaces, 8(1):17–28, Mar 2014. ISSN 1783-8738. doi: 10.1007/s12193-013-0129-9. URL https://doi.org/10.1007/s12193-013-0129-9.