The Importance of Socio-Cultural Differences for Annotating and Detecting the Affective States of Students

01/12/2019
by Eda Okur, et al.
University of Pennsylvania
Intel

The development of real-time affect detection models often depends upon obtaining annotated data for supervised learning by employing human experts to label the student data. One open question in annotating affective data for affect detection is whether the labelers (i.e., human experts) need to be socio-culturally similar to the students being labeled, as this impacts the cost and feasibility of obtaining the labels. In this study, we investigate the following research questions: First, for affective state annotation, how does the socio-cultural background of human expert labelers, compared to that of the subjects, impact the degree of consensus and the distribution of affective states obtained? Second, how do differences in labeler background impact the performance of affect detection models trained using these labels?

1 Introduction

The development of real-time affect detection models often depends upon obtaining annotated data for supervised learning by employing human experts to label the student data. One open question in annotating affective data for affect detection is whether the labelers (i.e., human experts) need to be socio-culturally similar to the students being labeled, as this impacts the cost and feasibility of obtaining the labels. In this study, we investigate the following research questions: First, for affective state annotation, how does the socio-cultural background of human expert labelers, compared to that of the subjects, impact the degree of consensus and the distribution of affective states obtained? Second, how do differences in labeler background impact the performance of affect detection models trained using these labels?

2 Methodology

We employed 5 experts from the United States (US) and 5 experts from Turkey (TR) to label the same data, collected through authentic classroom pilots with students in Turkey. Using HELP (the Human Expert Labeling Process) [2], each group labeled 14 hours of multi-modal data collected from ten 9th-grade students in 2 sessions (40 minutes each) for 3 affective states: Satisfied, Bored, and Confused. We analyzed within-country and cross-country inter-rater agreement using Krippendorff’s alpha [3, 8], considering both all 5 experts and the best 3 experts (i.e., the three with the highest mutual agreement) in each group. We also compared the affective state distributions obtained from each group’s majority labels.
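To make the agreement analysis concrete, the following is a minimal Python sketch (not the authors' code) of how Krippendorff's alpha can be computed for one expert group and how a best-3 subset can be selected. It assumes the third-party `krippendorff` package and a small hypothetical label matrix with integer-coded affective states and NaNs marking missing annotations.

```python
# Sketch of the within-group agreement analysis described above:
# Krippendorff's alpha over all 5 experts and over the best-3 subset.
from itertools import combinations

import numpy as np
import krippendorff  # pip install krippendorff

# Hypothetical label matrix: rows = experts, columns = labeled instances.
# Affective states coded as integers (0 = Satisfied, 1 = Bored, 2 = Confused);
# np.nan marks instances an expert left unlabeled.
labels = np.array([
    [0, 1, 1, 0, 2, np.nan, 1, 0],
    [0, 1, 0, 0, 2, 2,      1, 0],
    [0, 1, 1, 0, 1, 2,      1, 0],
    [0, 0, 1, 0, 2, 2,      1, 1],
    [0, 1, 1, 0, 2, 2,      0, 0],
], dtype=float)

def nominal_alpha(matrix: np.ndarray) -> float:
    """Krippendorff's alpha for nominal (categorical) labels."""
    return krippendorff.alpha(reliability_data=matrix,
                              level_of_measurement="nominal")

# All-5 agreement within one expert group.
print("all-5 alpha:", round(nominal_alpha(labels), 3))

# Best-3 subset: the triple of experts with the highest mutual agreement.
best = max(combinations(range(labels.shape[0]), 3),
           key=lambda idx: nominal_alpha(labels[list(idx)]))
print("best-3 experts:", best,
      "alpha:", round(nominal_alpha(labels[list(best)]), 3))
```

The same computation, run once per group and once over the pooled set of labelers, yields the within-country and cross-country figures of the kind reported in Table 1.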

For affect detection models, we employed two modalities: (1) Appearance (Appr): upper-body information captured by the camera; (2) Context & Performance (C&P): interaction and performance logs from the online learning platform for Math. For Appr, the raw video data were segmented into instances, and time series analysis methods were used to extract 188 appearance features, consisting of motion and energy measures, robust statistical estimators of head velocity, and frequency domain features related to head position, pose, and facial expressions. Further details of the Appr modality can be found in our previous study [5], which used the same features as this study. For C&P, we extracted 24 features related to time (time spent on videos/questions), grade (success/failure of attempts), hints (number of hints used on questions), attempts (number of trials), and others (e.g., gender). The C&P features, adapted from [7], are described in more detail in our previous study [1].
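The toy sketch below illustrates the general shape of such a windowed feature-extraction step for the Appr modality. The 30 fps frame rate, the single head-position signal, and the specific descriptors are illustrative assumptions, not the actual 188-feature pipeline described in [5]; only the 8-second window with 4-second overlap comes from the study.

```python
# Illustrative window-level feature extraction for the Appr modality:
# a few motion/energy, robust-statistics, and frequency-domain descriptors
# computed from a hypothetical head-position time series.
import numpy as np
from scipy import stats, fft

FPS = 30                       # assumed camera frame rate
WIN, HOP = 8 * FPS, 4 * FPS    # 8-second windows, 4-second overlap

def appearance_features(head_x: np.ndarray) -> dict:
    """Toy descriptors for one windowed head-position signal."""
    velocity = np.diff(head_x)
    spectrum = np.abs(fft.rfft(head_x - head_x.mean()))
    return {
        "motion_energy": float(np.sum(velocity ** 2)),
        "vel_median": float(np.median(velocity)),   # robust location estimate
        "vel_iqr": float(stats.iqr(velocity)),      # robust spread estimate
        "dominant_freq_hz": float(np.argmax(spectrum) * FPS / len(head_x)),
    }

def sliding_windows(signal: np.ndarray):
    """Yield 8-second windows with 4-second overlap (one window per instance)."""
    for start in range(0, len(signal) - WIN + 1, HOP):
        yield signal[start:start + WIN]

# Example: features for every window of a synthetic 40-minute session.
session = np.random.default_rng(0).normal(size=40 * 60 * FPS)
features = [appearance_features(w) for w in sliding_windows(session)]
```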

Separate generic classifiers (Random Forests) were trained using majority labels from each expert group, for each modality and each activity type (Instructional and Assessment). Instances are 8-second sliding windows with 4-second overlaps. Further details of the methodology used in this study can be found in the full version of this paper [6].
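As a rough illustration of this setup, the sketch below trains one such Random Forest on hypothetical window-level features and majority labels using scikit-learn and reports per-class and overall F1. The hyperparameters, the train/test split, and the synthetic data are assumptions for the sake of the example, not the study's actual configuration.

```python
# Minimal sketch: one Random Forest per modality and per activity type,
# trained on majority labels from one expert group, scored with F1.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def train_affect_detector(X, y, classes):
    """Train a generic Random Forest on window-level features and majority labels."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=0)
    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    per_class = f1_score(y_te, pred, labels=classes, average=None)
    overall = f1_score(y_te, pred, average="weighted")
    return clf, dict(zip(classes, per_class)), overall

# Hypothetical Instructional-section data: 188-dim Appr features per window,
# majority labels from one labeler group (Satisfied vs. Bored).
X_appr = rng.normal(size=(500, 188))
y_major = rng.choice(["Satisfied", "Bored"], size=500)
model, f1_by_class, f1_overall = train_affect_detector(
    X_appr, y_major, ["Satisfied", "Bored"])
print(f1_by_class, round(f1_overall, 2))
```

Repeating this per modality (Appr, C&P), per section type (Instructional, Assessment), and per label source (US, TR majority labels) yields a grid of detectors comparable to Table 2.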

Group                     Human Experts   Krippendorff’s Alpha
Within-country (all-5)    US-all-5        0.472
                          TR-all-5        0.585
Within-country (best-3)   US-best-3       0.564
                          TR-best-3       0.626
Cross-country             Cross-all-10    0.379
                          Cross-best-6    0.400

Table 1: Inter-rater agreements (Krippendorff’s Alpha) among experts from the United States (US) and Turkey (TR).
                                 Labels: US       Labels: TR
Section Type    Class            Appr    C&P      Appr    C&P
Instructional   Satisfied        0.62    0.58     0.41    0.42
                Bored            0.67    0.59     0.86    0.88
                Overall          0.65    0.58     0.77    0.80
Assessment      Satisfied        0.59    0.80     0.43    0.73
                Confused         0.45    0.63     0.57    0.66
                Overall          0.53    0.74     0.51    0.70

Table 2: Affect detection classifier results (F1-scores) for separate modalities (Appr: Appearance, C&P: Context & Performance) and section types (Instructional, Assessment), trained using labels by experts from the United States (US) and Turkey (TR).
Figure 1: Affective-state distributions by the experts from the US and TR.

3 Experimental Results

The inter-rater agreements and affect detection model results are summarized in Table 1 and Table 2, respectively. The students’ affective state distributions are given in Figure 1. These results indicate that the experts from Turkey obtained moderately better inter-rater agreement than the experts from the U.S. Note that even though the U.S. experts agree with one another, they agree fairly poorly with the experts from Turkey. In addition, we observed important differences between the distributions of affective states provided by the experts in the U.S. versus Turkey, and between the performances of the resulting real-time multi-modal affect detectors, especially for the Bored and Confused states.
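For reference, the distribution comparison behind Figure 1 amounts to a per-instance majority vote within each expert group followed by normalizing the label counts. The snippet below is an illustrative sketch with made-up labels, not the authors' analysis code.

```python
# Rough sketch of the Figure 1 comparison: majority vote per instance within
# one expert group, then the share of each affective state under its labels.
from collections import Counter
from statistics import mode

# Hypothetical per-instance labels: one list per expert, aligned by instance.
us_experts = [
    ["Satisfied", "Bored", "Bored", "Satisfied"],
    ["Satisfied", "Satisfied", "Bored", "Satisfied"],
    ["Bored", "Bored", "Bored", "Satisfied"],
    ["Satisfied", "Bored", "Satisfied", "Satisfied"],
    ["Satisfied", "Bored", "Bored", "Bored"],
]

def majority_labels(expert_lists):
    """Majority vote per instance across experts."""
    return [mode(instance) for instance in zip(*expert_lists)]

def distribution(labels):
    """Fraction of instances assigned to each affective state."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {state: round(n / total, 2) for state, n in counts.items()}

print(distribution(majority_labels(us_experts)))
# Repeating this for the Turkey group gives the second distribution to compare.
```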

4 Discussion and Conclusion

Our findings suggest that there are indeed implications to using expert labelers who do not belong to the same population as the research subjects. The results of this study indicate that there could be a cultural effect on how labeling ambiguities for affective states are interpreted, which in turn affects the accuracy of the affect detection models, especially for detecting the Bored and Confused states of the students. One key take-away message from this research is that cross-national or cross-cultural labelers should be vetted for inter-rater agreement very carefully [4].

References

  • Alyuz et al. [2017] N. Alyuz, E. Okur, U. Genc, S. Aslan, C. Tanriover, and A. A. Esme. An unobtrusive and multimodal approach for behavioral engagement detection of students. In Proceedings of the 1st ACM SIGCHI International Workshop on Multimodal Interaction for Education, MIE 2017, pages 26–32, New York, NY, USA, 2017. ACM. ISBN 978-1-4503-5557-5. doi: 10.1145/3139513.3139521. URL https://doi.acm.org/10.1145/3139513.3139521.
  • Aslan et al. [2017] S. Aslan, S. E. Mete, E. Okur, E. Oktay, N. Alyuz, U. E. Genc, D. Stanhill, and A. A. Esme. Human Expert Labeling Process (HELP): Towards a reliable higher-order user state labeling process and tool to assess student engagement. Educational Technology, 57(1):53–59, 2017. ISSN 00131962. URL https://eric.ed.gov/?id=EJ1126255.
  • Krippendorff [2011] K. Krippendorff. Computing Krippendorff's alpha-reliability. Departmental Papers (ASC), (43), 2011. URL https://repository.upenn.edu/asc_papers/43.
  • Ocumpaugh [2015] J. Ocumpaugh. Baker Rodrigo Ocumpaugh Monitoring Protocol (BROMP) 2.0 technical and training manual. New York, NY and Manila, Philippines: Teachers College, Columbia University and Ateneo Laboratory for the Learning Sciences, 2015.
  • Okur et al. [2017] E. Okur, N. Alyuz, S. Aslan, U. Genc, C. Tanriover, and A. Arslan Esme. Behavioral engagement detection of students in the wild. In International Conference on Artificial Intelligence in Education (AIED 2017), volume 10331 of Lecture Notes in Computer Science, pages 250–261, Cham, June 2017. Springer International Publishing. ISBN 978-3-319-61425-0. doi: 10.1007/978-3-319-61425-0_21. URL https://doi.org/10.1007/978-3-319-61425-0_21.
  • Okur et al. [2018] E. Okur, S. Aslan, N. Alyuz, A. Arslan Esme, and R. S. Baker. Role of socio-cultural differences in labeling students’ affective states. In International Conference on Artificial Intelligence in Education (AIED 2018), volume 10947 of Lecture Notes in Computer Science, pages 367–380, Cham, June 2018. Springer International Publishing. ISBN 978-3-319-93843-1. doi: 10.1007/978-3-319-93843-1_27. URL https://doi.org/10.1007/978-3-319-93843-1_27.
  • Pardos et al. [2014] Z. A. Pardos, R. S. Baker, M. San Pedro, S. M. Gowda, and S. M. Gowda. Affective states and state tests: investigating how affect and engagement during the school year predict end-of-year learning outcomes. Journal of Learning Analytics, 1(1):107–128, 2014.
  • Siegert et al. [2014] I. Siegert, R. Böck, and A. Wendemuth. Inter-rater reliability for emotion annotation in human–computer interaction: comparison and methodological improvements. Journal on Multimodal User Interfaces, 8(1):17–28, Mar 2014. ISSN 1783-8738. doi: 10.1007/s12193-013-0129-9. URL https://doi.org/10.1007/s12193-013-0129-9.