
A Machine Learning Approach to Assess Student Group Collaboration Using Individual Level Behavioral Cues

by Anirudh Som, et al.

K-12 classrooms consistently integrate collaboration as part of their learning experiences. However, owing to large classroom sizes, teachers do not have the time to properly assess each student and give them feedback. In this paper we propose using simple deep-learning-based machine learning models to automatically determine the overall collaboration quality of a group based on annotations of the individual roles and individual-level behaviors of all the students in the group. We encounter two challenges when building these models: 1) limited training data and 2) severe class-label imbalance. We address these challenges by using a controlled variant of Mixup data augmentation, a method for generating additional data samples by convexly combining different pairs of data samples and their corresponding class labels. Additionally, the label space for our problem exhibits an ordered structure. We take advantage of this fact and also explore using an ordinal-cross-entropy loss function, studying its effects with and without Mixup.





1 Introduction

Collaboration is identified by both the Next Generation Science Standards [20] and the Common Core State Standards [7] as a required and necessary skill for students to successfully engage in the fields of Science, Technology, Engineering and Mathematics (STEM). Most teachers in K-12 classrooms instill collaborative skills in students by using instructional methods like project-based learning [16] or problem-based learning [8]. For a group of students performing a group-based collaborative task, a teacher monitors and assesses each student based on various verbal and non-verbal behavioral cues. However, due to the wide range of behavioral cues, it can often be hard for the teacher to identify specific behaviors that contribute to or detract from the collaboration effort [19, 17, 21]. This task becomes even more difficult when several student groups need to be assessed simultaneously.

Figure 1: The collaboration assessment conceptual model. In this paper we focus on building machine learning models that map features from Level B2 → Level A and Level C → Level A, as indicated by the red arrows. For the duration of each task, Level A codes were assigned based on the overall collaboration ability of the group; Level B2 codes consisted of fixed-length 1-minute segments for each student in the group; Level C consisted of fluid/variable-length segments for each student in the group.

To better assist teachers, in our previous work we proposed an automated collaboration assessment conceptual model that provides an assessment of the collaboration quality of student groups based on behavioral communication at individual and group levels [2, 3]. The conceptual model illustrated in Figure 1 represents a multi-level, multi-modal integrated behavior analysis tool. The input to this model consists of Video or Audio+Video data recordings of a student group performing a collaborative task; the Video-only condition was included to test whether visual behaviors alone could be used to estimate collaboration skills and quality. Next, low-level features like facial expressions and body pose are extracted at Level E. Information like joint attention and engagement is encoded at Level D. Level C describes complex interactions and individual behaviors. Level B is divided into two categories: Level B1 describes the overall group dynamics for a given task, while Level B2 describes the changing individual roles assumed by each student in the group. Finally, Level A describes the overall collaboration quality of the group based on the information from all previous levels. This paper focuses on building machine learning models that predict a group's collaboration quality from the individual roles (Level B2) and individual behaviors (Level C) of the students, indicated by the red arrows in Figure 1.


Deep-learning algorithms have gained increasing attention in the Educational Data Mining (EDM) community. The first papers to use deep-learning for EDM were published in 2015, and the number of publications in this field keeps growing each year [12]. Despite their growing popularity, deep-learning methods are difficult to work with under certain challenging scenarios. For example, deep-learning algorithms work best with access to large amounts of representative training data, i.e., data containing sufficient variations of each class-label pattern. They also assume that the label distribution of the training data is approximately uniform. If either condition is not satisfied, deep-learning methods tend to perform poorly at the desired task. The challenges arising due to limited and imbalanced training data are clearly depicted in Figure 2. For our classification problem the label distribution resembles a bell-shaped normal distribution. As a result, for both the Video and Audio+Video modality cases we have very few data samples for the Effective and Working Independently codes, and the highest number of samples for the Progressing code. Figure 2 also shows the aggregate confusion matrix over all test sets after training Multi Layer Perceptron (MLP) classification models with class-balancing (i.e., assigning to each training sample a weight that is inversely proportional to the number of training samples corresponding to that sample's class label). The input feature representations used were obtained from Level B2 and Level C. We observe that despite using class-balancing, the predictions of the MLP model are biased towards the Progressing code.

Figure 2: (left) Distribution of Level A codes, which also represents the target label distribution for our classification problem. (middle, right) Aggregate confusion matrices of Multi Layer Perceptron (MLP) classification models that were subjected to class-balancing during the training process. Even with class-balancing, the MLP models are unable to overcome the bias in the training data. Note, each confusion matrix is normalized along each row, with the number in each cell representing the percentage of data samples classified to each class.

Contributions: To address the above challenges, in this paper we explore using a controlled variant of Mixup data augmentation, a simple and common approach for generating additional data samples [23]. Additional data samples are obtained by linearly combining different pairs of data samples and their corresponding class labels. Also note that the label space for our classification problem exhibits an ordered relationship. In addition to Mixup, we also explore the value in using an ordinal-cross-entropy loss function instead of the commonly used categorical-cross-entropy loss function.

Outline of the paper: Section 2 discusses related work. Section 3 provides the necessary background on categorical-cross-entropy loss, ordinal-cross-entropy loss and Mixup data augmentation. Section 4 provides a description of the dataset, the features extracted and the controlled variant of Mixup data augmentation. Section 5 describes the experiments and results. Section 6 concludes the paper.

2 Related Work

Use of machine learning concepts for collaboration problem-solving analysis and assessment is still relatively new in the Educational Data Mining community. Reilly et al. used Coh-Metrix indices (a natural language processing tool to measure cohesion for written and spoken texts) to train machine learning models to classify co-located participant discourse in a multi-modal learning analytics study [18]. The multi-modal dataset consisted of eye-tracking, physiological and motion-sensing data. They analyzed the collaboration quality between novice programmers who were instructed to program a robot to solve a series of mazes. However, they studied only two collaborative states, thereby making it a binary classification problem. Huang et al. used an unsupervised machine learning approach to discover unproductive collaborative states for the same multi-modal dataset [14]. For input features they computed different measures for each modality. Using an unsupervised approach they were able to identify a three-state solution that showed high correlation with task performance, collaboration quality and learning gain. Kang et al. also used an unsupervised learning approach to study the collaborative problem-solving process of middle school students. They analyzed data collected using a computer-based learning environment of student groups playing a serious game [15]. They used KmL, an R package useful for applying k-means clustering on longitudinal data [10]. They too identified three different states using the proposed unsupervised method. In our paper we define five different group collaboration quality states in a supervised learning setup. The above studies discuss different ways to model positive collaboration between participants in a group. For Massive Open Online Courses (MOOCs), Alexandron et al. proposed a technique to detect cheating in the form of unauthorized collaboration using machine learning classifiers trained on data of another form of cheating (copying using multiple accounts) [1].

Guo and Barmaki used a deep-learning based object detection approach for analysis of pairs of students collaborating to locate and paint specific body muscles on each other [11]. They used a Mask R-CNN for detecting students in video data. This is the only paper we found that used deep-learning for collaboration assessment. They claim that close proximity of group participants and longer time taken to complete a task are indicators of good collaboration. However, they quantify participant proximity by the percentage of overlap between the student masks obtained using the Mask R-CNN, and the amount of overlap can change dramatically across different viewpoints. Also, good collaboration need not necessarily be exhibited by groups that take a longer time to complete a task. In this paper, our deep-learning models are based on the systematically designed multi-level conceptual model shown in Figure 1. The proposed approach utilizes features at the lower levels of our conceptual model, but we go well beyond these and also include higher-level behavior analysis as well as the roles taken on by students to predict the overall group collaboration quality.

We propose using Mixup augmentation, an over-sampling approach, together with an ordinal-cross-entropy loss function to better handle limited and imbalanced training data. Over-sampling techniques have been commonly used to make the different label categories approximately equal in size. SMOTE, proposed by Chawla et al., is one of the oldest and most widely cited over-sampling methods [5]. The controlled variant of Mixup that we propose is very similar to their approach. Ordinal loss functions, however, have not received as much attention, since the label spaces of most current classification problems of interest do not exhibit an ordered structure or relationship. We refer interested readers to the following papers on ordinal loss functions for deep ordinal classification [13, 4]. In this paper we propose a simple variant of the regular cross-entropy loss that takes into account the relative distance of the predicted samples from their true class-label location.

3 Preliminaries

In this section we briefly go through the concepts that are used in the following sections. For the remainder of this section, let us denote the input variables or covariates as x, the ground-truth label vector as y, and the predicted probability distribution as ŷ.

3.1 Classification Loss Functions

The cross-entropy loss, a.k.a. the categorical-cross-entropy loss function, is commonly used for training deep-learning models for multi-class classification. Given a training sample (x, y), the cross-entropy loss can be represented as

L_CE(y, ŷ) = − Σ_{i=1}^{C} y_i log(ŷ_i).    (1)

Here, C represents the number of classes. For a classification problem with C label categories, a deep-learning model's softmax layer outputs a probability distribution vector ŷ of length C. The i-th entry in ŷ represents the predicted probability of the i-th class. The ground-truth label y is one-hot-encoded and represents a binary vector whose length is also equal to C. Note, ŷ_i ∈ [0, 1] and Σ_{i=1}^{C} ŷ_i = 1. For an imbalanced dataset, the learnt weights of a deep-learning model will be greatly governed by the class having the most samples in the training set. Also, if the label space exhibits an ordinal structure, the cross-entropy loss focuses only on the predicted probability of the ground-truth class and ignores the relative distance between an incorrectly predicted data sample and its true class label. A simple variant of the cross-entropy loss that is useful for problems exhibiting an ordered label space is shown in Equation 2.

L_OCE(y, ŷ) = − (1 + w) Σ_{i=1}^{C} y_i log(ŷ_i),  where  w = |argmax(y) − argmax(ŷ)|.    (2)

Here, (1 + w) is an additional weight that is multiplied with the regular cross-entropy loss. Within w, argmax returns the index of the maximum-valued element in the vector and |·| denotes the absolute value. During the training process, w = 0 for training samples that are correctly classified, making the ordinal-cross-entropy loss the same as the cross-entropy loss. However, the ordinal-cross-entropy loss will be higher than the cross-entropy loss for misclassified samples, and the increase in loss is proportional to how far the samples have been misclassified from their true label locations. We later go over the benefit of using the ordinal-cross-entropy loss function in Section 5.1.
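The weighting in Equation 2 can be sketched in a few lines of NumPy. This is an illustrative implementation (the function names are ours), assuming the weight takes the form (1 + w) with w the index distance between the predicted and true classes:

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Regular categorical-cross-entropy for a one-hot y_true (Equation 1)."""
    return -float(np.sum(y_true * np.log(y_pred + eps)))

def ordinal_cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy scaled by (1 + w), where w is the absolute distance
    between the predicted and ground-truth class indices (Equation 2).
    Correctly classified samples (w = 0) incur the plain cross-entropy."""
    w = abs(int(np.argmax(y_true)) - int(np.argmax(y_pred)))
    return (1.0 + w) * cross_entropy(y_true, y_pred, eps)
```

For a 5-class ordered label space, a prediction four positions away from the true class (w = 4) is penalized five times as heavily as the plain cross-entropy, while a prediction in the adjacent class (w = 1) is only doubled.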

3.2 Mixup Data Augmentation

Figure 3: Beta distributions for different values of α. Each Beta distribution plot has a different y-axis range and represents a 500-bin histogram of 200,000 randomly selected λs. Note, most λs for Beta(0.1,0.1) are at 0 and 1.

Despite best data-collection practices, bias exists in most training datasets, resulting from time or resource constraints. These biases, and the resulting performance problems of machine learning models trained on this data, are directly correlated with the problem of class imbalance. Class imbalance refers to the unequal representation or number of occurrences of different class labels. If the training data is more representative of some classes than others, then the model's predictions will systematically be worse for the under-represented classes. Conversely, an over-representation of certain classes can skew the decision toward a particular result. Mixup is a simple data augmentation technique that can be used for imbalanced datasets [23]. It is used for generating additional training samples and encourages the deep-learning model to behave linearly in-between training samples. It extends the training distribution by incorporating the prior knowledge that linear interpolations of input variables x should lead to linear interpolations of the corresponding target labels y. For example, given a random pair of training samples (x_i, y_i), (x_j, y_j), additional samples can be obtained by convexly combining the input covariate information and the corresponding class labels. This is illustrated in Equation 3.

x̃ = λ x_i + (1 − λ) x_j,   ỹ = λ y_i + (1 − λ) y_j.    (3)

Here, (x̃, ỹ) is the newly generated sample. λ ∈ [0, 1] is obtained using a Beta(α, α) distribution with α ∈ (0, ∞). Figure 3 shows Beta distributions for different values of α. If α approaches 0, the λs obtained have a higher probability of being 0 or 1. If α approaches 1, the Beta distribution looks more like a uniform distribution. Based on the suggestions and findings in other papers [23, 22], for our experiments we set α = 0.4. Apart from improving the classification performance on various image classification benchmarks [23], Mixup also leads to better-calibrated deep-learning models [22]. This means that the predicted softmax scores of a model trained using Mixup are much better indicators of the actual likelihood of a correct prediction than those of models trained in a regular fashion. In Section 5.2, we explore the benefit of using Mixup with and without the ordinal-cross-entropy loss.
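Equation 3 amounts to a two-line sampling routine. A minimal NumPy sketch (function and variable names are ours, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def mixup_pair(x_i, y_i, x_j, y_j, alpha=0.4):
    """Convexly combine a random pair of training samples and their
    one-hot labels (Equation 3), with lambda ~ Beta(alpha, alpha)."""
    lam = rng.beta(alpha, alpha)
    x_new = lam * x_i + (1 - lam) * x_j
    y_new = lam * y_i + (1 - lam) * y_j
    return x_new, y_new
```

Because the labels are mixed as well, the generated ỹ is a soft label: its entries still sum to one, with λ of the mass on the first sample's class.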

4 Dataset Description, Feature Extraction and Controlled Mixup Data Generation

4.1 Dataset Description

Audio and video data was collected from 15 student groups across five middle schools. Each group was asked to perform 12 open-ended life science and physical science tasks that required the students to construct models of science phenomena. Each group was given only one hour to complete as many tasks as they possibly could. This resulted in 15 hours of audio and video recordings. Out of the 15 groups, 13 groups had 4 students, 1 group had 3 students, and 1 group had 5 students.

For Level A and Level B2, each video recording was coded by three human annotators using ELAN (an open-source annotation software) under two different modality conditions: 1) Video, 2) Audio+Video. For a given task performed by a group, each annotator first manually coded each level for the Video modality and later coded the same task for the Audio+Video modality. This was done to prevent any coding bias resulting from the difference in modalities. A total of 117 tasks were coded by each of the three annotators. Next, the majority vote (code) from the group of three coders was used to determine the ground-truth Level A code. For cases where a clear majority was not possible, the median of the three codes was used as the ground-truth. We used the same code ordering depicted in Figure 2. For example, if the three coders assigned Effective, Satisfactory and Progressing for a certain task, then Satisfactory would be selected as the ground-truth label. Note that out of the 117 tasks within each modality, we did not observe a majority Level A code for only 2 tasks. The distribution of the Level A target labels is shown in Figure 2. For learning mappings from Level B2 → Level A we had access to only 351 data samples (117 tasks × 3 coders) to train the machine learning models, with the ground-truth Level A labels determined using the process described above. The protocol used for generating training-test splits is described in Section 5.

In the case of Level C, each video recording was coded by just one annotator. Because of this we only had access to 117 data samples (117 tasks coded) for training the machine learning models to learn mappings from Level C → Level A. This makes it an even more challenging classification problem. Note, the distribution of the Level A labels for this classification setting is similar to the distribution shown in Figure 2, with the difference being that each label class now has just one-third of the samples.

Figure 4: Histogram feature generation for Level B2 and Level C. Different colors indicate different codes assigned to each segment. Level B2 codes are represented as fixed-length 1-minute segments. Level C codes are represented as variable-length segments. A Level B2 histogram is generated for each task by compiling all the codes from all the students in the group. Similarly, the Level C histogram is generated by compiling all the codes observed every 0.1 seconds over the duration of the task.
Level B2 Codes: Group guide/Coordinator [GG]; Contributor (Active) [C]; Follower [F]; Conflict Resolver [CR]; Conflict Instigator/Disagreeable [CI]; Off-task/Disinterested [OT]; Lone Solver [LS].

Level C Codes: Talking; Reading; Writing; Using/Working with materials; Setting up the physical space; Actively listening/Paying attention; Explaining/Sharing ideas; Problem solving/Negotiation; Recognizing/Inviting others' contributions; Setting group roles and responsibilities; Comforting, encouraging others/Corralling; Agreeing; Off-task/Disinterested; Disagreeing; Arguing; Seeking recognition/Boasting; Joking/Laughing; Playing/Horsing around/Rough housing; Excessive deference to authority/leader; Blocking information from being shared; Doing nothing/Withdrawing; Engaging with outside environment; Waiting.

Table 1: Coding rubric for Level B2 and Level C.

4.2 Level B2 and Level C Histogram Representation

For the entire length of each task, Level B2 was coded using fixed-length one-minute segments and Level C was coded using variable-length segments. This is illustrated in Figure 4. The coding rubric used by the annotators for these two levels is shown in Table 1. Level B2 and Level C consist of 7 codes and 23 codes respectively. Our objective in this paper is to determine the overall collaboration quality of a group by summarizing all of the individual student roles and behaviors for a given task. A simple but effective way to do this is by generating histogram representations of all the codes observed in each task. Figure 4 also provides a simple illustration of the histogram generation process. While it is straightforward to generate histograms for Level B2, in the case of Level C we compile all the codes observed every 0.1 seconds to generate the histogram. Once the histogram is generated for each task, we normalize it by dividing by the total number of codes in the histogram. Normalizing the histogram in a way removes the temporal component of the task. For example, suppose group-1 took 10 minutes to solve a task and group-2 took 30 minutes to solve the same task, yet both groups were assigned the same Level A code despite group-1 finishing sooner. The raw histogram representations of these two groups would look different due to the difference in the number of segments coded, but normalizing the histograms makes the two groups comparable. Note, the normalized histograms are the input to our machine learning models.
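The normalization step can be sketched as follows for the seven Level B2 role codes from Table 1 (the helper and its names are ours):

```python
import numpy as np

# Abbreviations of the seven Level B2 role codes from Table 1.
B2_CODES = ["GG", "C", "F", "CR", "CI", "OT", "LS"]

def b2_histogram(segment_codes):
    """Compile one task's fixed-length segment codes (from all students
    in the group) into a normalized 7-bin histogram. Dividing by the
    total number of coded segments removes the task's duration."""
    hist = np.zeros(len(B2_CODES))
    for code in segment_codes:
        hist[B2_CODES.index(code)] += 1.0
    return hist / hist.sum()
```

A short task and a longer task with the same mix of roles produce identical normalized histograms, which is exactly the comparability the normalization is meant to provide.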

4.3 Controlled Mixup

We described the simplicity and benefits of Mixup augmentation in Section 3.2. Here, we describe a controlled variant of Mixup and how it is used for our dataset. From Figure 2, we know that our dataset has an imbalanced label distribution: we have many data samples corresponding to the Progressing class, and the number of samples keeps decreasing as we move towards the Effective and Working Independently classes. Conventional Mixup selects a random pair of samples and interpolates them by a λ that is determined using a Beta distribution. However, this generates samples that follow the same imbalanced class distribution. We instead want to generate a fixed number of samples for a specific category. To do this, we first limit the range of λ, i.e., λ ∈ [τ, 1] for a threshold τ. Figure 5 shows a Beta(0.4,0.4) distribution where we only consider λ above threshold τ.

Figure 5: Illustration of a Beta(α, α) distribution with threshold τ. Using Mixup we generate additional data samples, where the selected λ is always above threshold τ.

Next, to generate additional samples for a specific class, we pair that class with its adjacent or neighboring classes. Let us use the denotation (primary-class, [adjacent-class-1, adjacent-class-2]), where primary-class represents the class for which we want to create additional samples, and adjacent-class-1 and adjacent-class-2 represent its neighbors. We create the following pairs: (Effective, [Satisfactory, Progressing]), (Satisfactory, [Effective, Progressing]), (Progressing, [Satisfactory, Needs Improvement]), (Needs Improvement, [Progressing, Working Independently]) and (Working Independently, [Progressing, Needs Improvement]). The final step consists of generating samples for the primary-class using Mixup. We do this by randomly pairing samples from the primary-class with samples from the adjacent-classes, and repeat this process N times, where N is the desired number of generated samples per class. Note that for Mixup augmentation, λ is always multiplied with the primary-class sample and (1 − λ) is multiplied with the adjacent-class sample. For our experiments we explore the following values of τ: 0.55, 0.75 and 0.95. Setting τ > 0.5 guarantees that the generated sample is always dominated by the primary-class.
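Putting the threshold and the class pairing together, controlled Mixup for one primary class can be sketched as below. This is a simplified illustration with hypothetical names; here the constraint λ ≥ τ is enforced by rejection sampling from Beta(α, α):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def controlled_mixup(primary_x, primary_label, adjacent, n_classes,
                     n_samples, alpha=0.4, tau=0.75):
    """Generate n_samples for one primary class. Each new sample mixes a
    random primary-class sample with a random adjacent-class sample
    (given as (x, one-hot y) pairs), with lambda ~ Beta(alpha, alpha)
    resampled until lambda >= tau so the primary class dominates."""
    y_primary = np.eye(n_classes)[primary_label]
    xs, ys = [], []
    for _ in range(n_samples):
        x_p = primary_x[rng.integers(len(primary_x))]
        x_a, y_a = adjacent[rng.integers(len(adjacent))]
        lam = rng.beta(alpha, alpha)
        while lam < tau:          # keep only lambdas above the threshold
            lam = rng.beta(alpha, alpha)
        xs.append(lam * x_p + (1 - lam) * x_a)
        ys.append(lam * y_primary + (1 - lam) * y_a)
    return np.array(xs), np.array(ys)
```

Because λ ≥ τ > 0.5, the primary class always receives the majority of the soft-label mass in every generated sample.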

5 Experiments

5.0.1 Network Architecture:

We used a 5-layer Multi Layer Perceptron (MLP) model whose design was based on the MLP model described in [9]. It contains the following layers: 1 input layer, 3 middle dense layers and 1 output dense layer. The normalized histogram representations discussed in Section 4.2 are passed as input to the input layer. Each middle dense layer has 500 units with ReLU activation. The output dense layer has a softmax activation, and its number of units is equal to 5 (the total number of classes in Level A). We also used dropout layers between each layer to avoid overfitting. The dropout-rates after the input layer and after each of the three middle layers were set to 0.1, 0.2, 0.2 and 0.3 respectively. We try three different types of input data: B2 histograms, C histograms, and the concatenation of B2 and C histograms (referred to as B2+C histograms). The number of trainable parameters is 507,505 for the B2 histogram, 515,505 for the C histogram, and 519,005 for the B2+C histogram. Our models were developed using Keras with a TensorFlow backend. We used the Adam optimizer and trained all our models for 500 epochs. The batch-size was set to one-tenth of the number of training samples in any given training-test split. We saved the best model, i.e., the one with the lowest test loss, for each training-test split.
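The architecture above can be sketched in Keras as follows. This is our reconstruction from the description in the text (layer ordering beyond what is stated, and compile options other than the Adam optimizer, are assumptions); with a 7-bin B2 histogram input it reproduces the stated count of 507,505 trainable parameters:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_mlp(input_dim, n_classes=5):
    """5-layer MLP: input layer, three 500-unit ReLU dense layers and a
    softmax output layer, with dropout rates 0.1, 0.2, 0.2 and 0.3
    after the input layer and each middle layer respectively."""
    inputs = keras.Input(shape=(input_dim,))
    x = layers.Dropout(0.1)(inputs)
    for rate in (0.2, 0.2, 0.3):
        x = layers.Dense(500, activation="relu")(x)
        x = layers.Dropout(rate)(x)
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="categorical_crossentropy")
    return model
```

Using input dimensions 7, 23 and 30 for the B2, C and B2+C histograms yields the three parameter counts quoted above, since the dropout layers contribute no parameters.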

5.0.2 Training and Evaluation Protocol:

We adopt a round-robin leave-one-group-out cross-validation protocol. This means that for each training-test split we use data from 14 groups for training, and the remaining group is used as the test set. This process is repeated for all 15 groups, as we have histogram representations for each task performed by the 15 student groups. Note that in the Audio+Video modality setting, all samples corresponding to the Effective class were found in only one group. Similarly, for the Video modality, all samples corresponding to the Working Independently class were also found in just one group. For this reason we do not see any test samples for the Effective class in Audio+Video and the Working Independently class in Video in the confusion matrices shown earlier in Figure 2. Note, for Level B2 → Level A we have 351 data samples, and for Level C → Level A we only have 117 data samples (discussed in Section 4.1).
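The round-robin protocol can be sketched as a small generator over per-sample group ids (names are ours):

```python
import numpy as np

def leave_one_group_out(group_ids):
    """Yield (train_idx, test_idx) pairs where each split holds out all
    samples from exactly one group, repeated round-robin over groups."""
    group_ids = np.asarray(group_ids)
    for g in np.unique(group_ids):
        yield np.where(group_ids != g)[0], np.where(group_ids == g)[0]
```

Every sample appears in exactly one test set, and no group is ever split between training and test, which is what keeps the evaluation group-independent.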

5.1 Effect of Ordinal-Cross-Entropy Loss

The ordinal-cross-entropy loss shown in Equation 2 takes into account the distance of the highest predicted probability from its one-hot encoded true label. This is what separates it from the regular cross-entropy loss (Equation 1) which only focuses on the predicted probability corresponding to the ground-truth label. In this section we explore the following four variations: cross-entropy loss only, cross-entropy loss with class balancing, ordinal-cross-entropy loss only and ordinal-cross-entropy loss with class balancing. Here class balancing refers to weighting each data sample by a weight that is inversely proportional to the number of data samples corresponding to that sample’s class label.
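The class-balancing weights described here can be computed as below. This sketch uses the common convention of weighting each sample by n_samples / (n_classes × class_count), one of several ways to realize "inversely proportional"; the exact constant used in the experiments may differ:

```python
import numpy as np

def sample_weights(labels):
    """Assign each sample a weight inversely proportional to the number
    of training samples sharing its class label."""
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    per_class = {c: len(labels) / (len(classes) * n)
                 for c, n in zip(classes, counts)}
    return np.array([per_class[l] for l in labels])
```

With this convention, minority-class samples receive larger weights while the total weight still sums to the number of samples.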

Figure 6: Comparison of the average weighted F1-Score performance between cross-entropy loss and ordinal-cross-entropy loss, with and without class balancing and under different parameter settings S1-S9.

Figure 6 illustrates the average weighted F1-score classification performance for the four variations under different parameter settings. We only varied the patience and minimum-learning-rate (Min-LR) parameters, as we found that these two affected the classification performance the most. These parameters were used to reduce the learning-rate by a factor of 0.5 if the loss did not change after a certain number of epochs, indicated by the patience parameter. Compared to the two cross-entropy-loss variants, we clearly see that the two ordinal-cross-entropy-loss variants significantly improve the F1-scores across all the parameter settings. We consistently see improvements across both modality conditions and for the different histogram inputs. With class balancing we see only marginal improvements for both loss functions. Also, the F1-scores for the Video modality are always lower than those for the corresponding settings in the Audio+Video modality. This is expected, as it shows that annotations obtained using Audio+Video recordings are cleaner and better represent the student behaviors.

5.2 Effect of Controlled Mixup Data Augmentation

Figure 7: Comparison of the average weighted F1-Score performance of using controlled Mixup augmentation, with and without the ordinal-cross-entropy loss. Here, 200, 500 and 1000 samples refer to the number of samples generated per class (N) using controlled Mixup.

We use different parameter settings (listed in Figure 6) for each type of histogram input. For the B2 histogram we use parameter setting S5 for both modalities. For the C histogram we use parameter setting S6 for Video and S3 for Audio+Video. For the B2+C histogram we use parameter setting S7 for both Video and Audio+Video. For each modality and each type of histogram input we vary the threshold τ over 0.55, 0.75 and 0.95, and also vary N (the number of controlled-Mixup samples generated per class) over 200, 500 and 1000. Note, N = 200 implies that the training dataset contains 1000 samples (200 for each of the 5 classes), and this form of Mixup does not retain the original set of training samples. Figure 7 shows the effect of controlled Mixup with and without the ordinal-cross-entropy loss. Across both modality settings, we observe that Mixup augmentation with ordinal-cross-entropy loss is better than Mixup with the regular cross-entropy loss for all cases for the B2 histogram and for most cases for the C and B2+C histograms. This implies that controlled Mixup and the ordinal-cross-entropy loss complement each other in most cases. We also observed that a larger N does not necessarily imply better performance. For the Audio+Video modality we observe similar F1-scores irrespective of the value of N. However, in the Video modality case we observe that the F1-score decreases as N increases. This could be attributed to the noisy nature of the codes assigned by the annotators due to the lack of the Audio modality. For the B2 and B2+C histograms, the best-performing threshold τ also differs between the Audio+Video and Video modalities, with the opposite effect in the case of the C histogram. In the next section we discuss two different variants of the controlled Mixup augmentation.

5.3 Full Mixup Vs Limited Mixup

For the controlled-Mixup experiment described in the previous section, the MLP models were trained using only the N generated samples per class, which do not retain the original set of training samples. Let us refer to this as Full-Mixup. In this section we explore training MLP models with the original set of training samples, generating only the additional samples needed to reach N samples per class using controlled Mixup. For example, if the Effective class already has n training samples, then we only generate N − n samples using controlled Mixup to reach the required N samples per class. This process makes sure that we always retain the original set of training samples. Let us refer to this as Limited-Mixup. Figure 8 shows the average weighted F1-scores comparing Full-Mixup and Limited-Mixup. We only show results for the B2 histogram feature, as we observed similar trends in the other cases as well. We see that Full-Mixup and Limited-Mixup have similar F1-scores. This implies that we can generate the N samples per class using only the controlled-Mixup protocol described in Section 4.3 without much noticeable difference in F1-score performance.

Figure 8: Full-Mixup vs. Limited-Mixup evaluation using different loss functions. Average weighted F1-scores are shown only for the B2 histogram feature input and a fixed number of controlled Mixup samples per class.

5.4 Additional Analysis and Discussion

In this section we discuss in more detail the behavior of the different classification models seen in the previous sections. For each modality, Table 2 shows the weighted precision, weighted recall and weighted F1-score results for the best MLP models under different experimental settings. The best MLP models were selected based on the weighted F1-score, since it provides a more complete summary by combining information from both precision and recall. Values in the table correspond to the longest bars observed in Figures 6 and 7. Note, weighted recall is equal to accuracy. We also show results using an SVM classifier. For the SVM we explored linear and different non-linear kernels with different parameter settings and show only the best result in Table 2.
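The weighted metrics in Table 2 average the per-class scores with class-frequency weights, which is why weighted recall coincides with accuracy: summing (n_c / n) · (TP_c / n_c) over classes gives total correct over n. A small self-contained sketch (our own helper, not the paper's code):

```python
import numpy as np

def weighted_prf(y_true, y_pred):
    """Class-frequency-weighted precision, recall and F1-score."""
    classes, counts = np.unique(y_true, return_counts=True)
    weights = counts / counts.sum()
    P = R = F = 0.0
    for c, w in zip(classes, weights):
        tp = np.sum((y_pred == c) & (y_true == c))
        pred_c = np.sum(y_pred == c)   # predicted positives for class c
        true_c = np.sum(y_true == c)   # actual members of class c
        p = tp / pred_c if pred_c else 0.0
        r = tp / true_c
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        P += w * p
        R += w * r
        F += w * f
    return P, R, F
```

The per-class recall weights cancel the per-class counts, so the weighted recall term telescopes to plain accuracy.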

For both modalities and for each type of histogram input, if we focus only on the weighted F1-scores we notice little or no improvement as we move towards incorporating controlled Mixup and ordinal-cross-entropy loss. For this reason we also show the corresponding weighted precision and weighted recall values. We observe that the average weighted precision increases and the standard deviation of the weighted precision decreases as we move towards the proposed approach. For an imbalanced classification problem the objective is to predict more true positives, and a higher precision reflects this, since precision is computed only over the predicted positives and is not affected by false negatives. The bold values in Table 2 indicate the top two methods with the highest weighted precision values in each modality. We find that the Cross-Entropy loss + Mixup and Ordinal-Cross-Entropy loss + Mixup methods show the highest weighted precision using the B2 histogram input in the Video modality and the B2+C histogram input in the Audio+Video modality.
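The ordinal-cross-entropy loss compared in these experiments penalizes a misclassification by how far it lands from the true label on the ordered label scale. One plausible formulation, shown below, scales the standard cross-entropy term by (1 + index distance); the paper's exact weighting scheme may differ, and the function name is ours.

```python
import numpy as np

def ordinal_cross_entropy(y_true_idx, probs):
    """Cross-entropy scaled by (1 + |predicted index - true index|),
    so mistakes far from the true ordinal label cost more.
    y_true_idx: (n,) integer class indices; probs: (n, k) softmax outputs."""
    eps = 1e-12
    n = len(y_true_idx)
    ce = -np.log(probs[np.arange(n), y_true_idx] + eps)
    dist = np.abs(probs.argmax(axis=1) - y_true_idx)
    return np.mean((1.0 + dist) * ce)
```

When every prediction is correct the distance weight is 1 and the loss reduces to ordinary cross-entropy; a miss two classes away costs three times its cross-entropy term.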

| Feature | Classifier | Precision (Video) | Recall (Video) | F1 (Video) | Precision (Audio+Video) | Recall (Audio+Video) | F1 (Audio+Video) |
|---|---|---|---|---|---|---|---|
| B2 histogram | SVM | 74.60±11.27 | 62.67±9.42 | 63.84±11.18 | 84.45±13.43 | 73.19±16.65 | 76.92±15.39 |
| | MLP - Cross-Entropy Loss | 76.90±12.91 | 73.95±11.02 | 72.89±13.22 | 83.72±16.50 | 86.42±10.44 | 84.40±13.85 |
| | MLP - Cross-Entropy Loss + Class-Balancing | 77.08±13.03 | 73.84±13.27 | 74.12±13.59 | 83.93±17.89 | 85.29±14.37 | 84.16±16.23 |
| | MLP - Ordinal-Cross-Entropy Loss | 81.51±13.44 | 79.09±13.62 | 79.11±13.96 | 86.96±14.56 | 88.78±10.36 | 87.03±13.16 |
| | MLP - Ordinal-Cross-Entropy Loss + Class-Balancing | 80.78±14.12 | 78.70±11.98 | 77.93±14.05 | 86.73±14.43 | 88.20±9.66 | 86.60±12.54 |
| | MLP - Cross-Entropy Loss + Mixup | **81.61±12.81** | 73.56±10.31 | 76.40±11.00 | 88.51±12.32 | 83.58±14.14 | 85.64±13.23 |
| | MLP - Ordinal-Cross-Entropy Loss + Mixup | **83.30±10.06** | 76.57±9.42 | 79.06±9.66 | 89.59±10.15 | 84.93±13.20 | 86.09±12.94 |
| C histogram | SVM | 59.27±27.00 | 42.76±20.69 | 46.85±22.26 | 72.33±20.33 | 60.15±19.45 | 63.25±17.96 |
| | MLP - Cross-Entropy Loss | 63.24±20.78 | 65.73±16.34 | 60.46±17.57 | 81.15±16.90 | 84.16±11.67 | 81.70±14.41 |
| | MLP - Cross-Entropy Loss + Class-Balancing | 63.82±22.08 | 64.77±18.51 | 60.64±19.89 | 80.44±18.11 | 84.88±11.70 | 81.67±15.06 |
| | MLP - Ordinal-Cross-Entropy Loss | 68.16±27.13 | 72.59±17.88 | 67.88±23.01 | 86.05±14.11 | 86.90±11.43 | 85.33±13.07 |
| | MLP - Ordinal-Cross-Entropy Loss + Class-Balancing | 71.74±24.34 | 74.10±16.75 | 70.37±20.94 | 85.24±13.54 | 86.11±11.65 | 84.94±12.52 |
| | MLP - Cross-Entropy Loss + Mixup | 72.27±23.29 | 64.45±19.55 | 66.02±20.35 | 84.25±13.78 | 81.91±13.68 | 81.82±13.93 |
| | MLP - Ordinal-Cross-Entropy Loss + Mixup | 75.11±21.63 | 69.54±18.64 | 70.03±20.01 | 82.94±14.63 | 81.91±14.68 | 81.63±14.46 |
| B2+C histogram | SVM | 72.49±15.35 | 61.89±13.21 | 64.95±14.15 | 82.32±16.53 | 73.32±15.27 | 76.65±15.65 |
| | MLP - Cross-Entropy Loss | 76.15±15.81 | 74.59±15.02 | 73.35±16.08 | 83.38±19.42 | 87.75±14.68 | 85.09±17.12 |
| | MLP - Cross-Entropy Loss + Class-Balancing | 75.75±17.23 | 73.81±16.50 | 73.11±17.17 | 84.71±16.57 | 88.68±11.04 | 85.52±15.01 |
| | MLP - Ordinal-Cross-Entropy Loss | 78.05±17.94 | 77.88±16.16 | 76.73±17.65 | 85.51±17.28 | 89.25±12.19 | 86.91±14.99 |
| | MLP - Ordinal-Cross-Entropy Loss + Class-Balancing | 78.10±17.70 | 77.33±17.02 | 76.61±17.96 | 86.90±15.83 | 88.82±11.50 | 86.99±14.15 |
| | MLP - Cross-Entropy Loss + Mixup | 77.99±17.42 | 72.86±14.32 | 74.29±16.08 | **90.48±11.20** | 86.57±14.07 | 87.45±13.65 |
| | MLP - Ordinal-Cross-Entropy Loss + Mixup | 77.92±16.66 | 75.82±15.27 | 76.45±16.02 | **90.05±10.80** | 85.91±14.00 | 87.01±13.18 |

Table 2: Weighted precision, weighted recall and weighted F1-score (mean±std) for the best MLP models under different experimental settings. The best models were selected based on the weighted F1-score. Bold values indicate the top two methods with the highest weighted precision under each modality condition.
Figure 9: Aggregate confusion matrix illustrations of the MLP classification model under different experimental conditions. The confusion matrix for each method corresponds to the best MLP model described in Table 2. The confusion matrices are normalized along each row. Note, the number in each cell represents the percentage of samples classified to each class.

The higher weighted precision is better illustrated by the confusion matrices shown in Figure 9. Here, we show confusion matrices for the Video modality using the B2 histogram features and for the Audio+Video modality using the B2+C histogram, as these showed the best weighted precision values in Table 2. As seen earlier in Section 5.1, ordinal-cross-entropy loss did show significant improvements in terms of weighted F1-score. However, even with class balancing we notice that the best MLP model is still biased towards the class with the most training samples. If we look at the controlled Mixup variants with either cross-entropy loss or ordinal-cross-entropy loss, we notice a better diagonal structure in the confusion matrix, indicating more true positives. Note, the confusion matrices contain no test samples for the Effective class in Audio+Video and the Working Independently class in Video. Between Cross-Entropy loss + Mixup and Ordinal-Cross-Entropy loss + Mixup, we notice that ordinal-cross-entropy loss helps confine the spread of test-sample predictions to the nearest neighboring classes.
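The row-normalized confusion matrices of Figure 9 can be reproduced with a short helper (our own sketch; rows are true classes, columns predicted classes, and a class with no test samples keeps an all-zero row, as noted above):

```python
import numpy as np

def row_normalized_confusion(y_true, y_pred, n_classes):
    """Confusion matrix with each row normalized to percentages.
    Rows = true class, columns = predicted class."""
    cm = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    row_sums = cm.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1  # leave rows with no test samples all-zero
    return 100.0 * cm / row_sums
```

A strong diagonal then directly reflects a high per-class true-positive rate, and off-diagonal mass adjacent to the diagonal corresponds to near-miss predictions on the ordinal label scale.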

6 Conclusion

In this paper we built simple machine learning models to determine the overall collaboration quality of a student group based on summaries of the individual roles and individual behaviors exhibited by each student. We encountered challenges like limited training data and severe class imbalance when building these models. To address these challenges we proposed using an ordinal-cross-entropy loss function together with a controlled variant of Mixup data augmentation. Ordinal-cross-entropy loss differs from the regular categorical cross-entropy loss in that it takes into account how far training samples are classified from their true label locations. Our controlled variant of Mixup allows us to generate a desired number of data samples for each label category. Through various experiments we studied the behavior of different machine learning models under different experimental conditions and demonstrated the benefit of using ordinal-cross-entropy loss with Mixup. For future work, we would like to explore building machine learning models that learn mappings across the other levels described in Figure 1, and to explore the temporal nature of the annotation segments as a regression problem.


  • [1] G. Alexandron, J. A. Ruipérez-Valiente, and D. E. Pritchard (2020) Towards a general purpose anomaly detection method to identify cheaters in massive open online courses. Cited by: §2.
  • [2] N. Alozie, S. Dhamija, E. McBride, and A. Tamrakar (2020) Automated collaboration assessment using behavioral analytics. Cited by: §1.
  • [3] N. Alozie, E. McBride, and S. Dhamija (2020) Collaboration conceptual model to inform the development of machine learning models using behavioral analytics. Cited by: §1.
  • [4] C. Beckham and C. Pal (2017) Unimodal probability distributions for deep ordinal classification. arXiv preprint arXiv:1705.05278. Cited by: §2.
  • [5] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer (2002) SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16, pp. 321–357. Cited by: §2.
  • [6] F. Chollet et al. (2015) Keras. Cited by: §5.0.1.
  • [7] W. R. Daggett and D. S. Gendron (2010) Common core state standards initiative. International Center. Cited by: §1.
  • [8] N. Davidson and C. H. Major (2014) Boundary crossings: cooperative learning, collaborative learning, and problem-based learning.. Journal on excellence in college teaching 25. Cited by: §1.
  • [9] H. I. Fawaz, G. Forestier, J. Weber, L. Idoumghar, and P. Muller (2019) Deep learning for time series classification: a review. Data Mining and Knowledge Discovery 33 (4), pp. 917–963. Cited by: §5.0.1.
  • [10] C. Genolini and B. Falissard (2011) KmL: a package to cluster longitudinal data. Computer methods and programs in biomedicine 104 (3), pp. e112–e121. Cited by: §2.
  • [11] Z. Guo and R. Barmaki (2019) Collaboration analysis using object detection.. In EDM, Cited by: §2.
  • [12] A. Hernández-Blanco, B. Herrera-Flores, D. Tomás, and B. Navarro-Colorado (2019) A systematic review of deep learning approaches to educational data mining. Complexity 2019. Cited by: §1.
  • [13] L. Hou, C. Yu, and D. Samaras (2016) Squared earth mover’s distance-based loss for training deep neural networks. arXiv preprint arXiv:1611.05916. Cited by: §2.
  • [14] K. Huang, T. Bryant, and B. Schneider (2019) Identifying collaborative learning states using unsupervised machine learning on eye-tracking, physiological and motion sensor data.. International Educational Data Mining Society. Cited by: §2.
  • [15] J. Kang, D. An, L. Yan, and M. Liu (2019) Collaborative problem-solving process in a science serious game: exploring group action similarity trajectory.. International Educational Data Mining Society. Cited by: §2.
  • [16] J. S. Krajcik and P. C. Blumenfeld (2006) Project-based learning. na. Cited by: §1.
  • [17] M. L. Loughry, M. W. Ohland, and D. DeWayne Moore (2007) Development of a theory-based assessment of team member effectiveness. Educational and psychological measurement 67 (3), pp. 505–524. Cited by: §1.
  • [18] J. M. Reilly and B. Schneider (2019) Predicting the quality of collaborative problem solving through linguistic analysis of discourse.. International Educational Data Mining Society. Cited by: §2.
  • [19] K. A. Smith-Jentsch, J. A. Cannon-Bowers, S. I. Tannenbaum, and E. Salas (2008) Guided team self-correction: impacts on team mental models, processes, and effectiveness. Small Group Research 39 (3), pp. 303–327. Cited by: §1.
  • [20] N. L. States (2013) Next generation science standards: for states, by states. The National Academies Press. Cited by: §1.
  • [21] S. Taggar and T. C. Brown (2001) Problem-solving team behaviors: development and validation of bos and a hierarchical factor structure. Small Group Research 32 (6), pp. 698–726. Cited by: §1.
  • [22] S. Thulasidasan, G. Chennupati, J. A. Bilmes, T. Bhattacharya, and S. Michalak (2019) On mixup training: improved calibration and predictive uncertainty for deep neural networks. In Advances in Neural Information Processing Systems, pp. 13888–13899. Cited by: §3.2.
  • [23] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz (2017) Mixup: beyond empirical risk minimization. arXiv preprint arXiv:1710.09412. Cited by: §1, §3.2, §3.2.