1 Introduction
Collaboration is identified by both the Next Generation Science Standards [20] and the Common Core State Standards [7] as a required and necessary skill for students to successfully engage in the fields of Science, Technology, Engineering and Mathematics (STEM). Most teachers in K-12 classrooms instill collaborative skills in students by using instructional methods like project-based learning [16] or problem-based learning [8]. For a group of students performing a group-based collaborative task, a teacher monitors and assesses each student based on various verbal and nonverbal behavioral cues. However, due to the wide range of behavioral cues, it can often be hard for the teacher to identify specific behaviors that contribute to or detract from the collaboration effort [19, 17, 21]. This task becomes even more difficult when several student groups need to be assessed simultaneously.
To better assist teachers, in our previous work we proposed an automated collaboration assessment conceptual model that provides an assessment of the collaboration quality of student groups based on behavioral communication at individual and group levels [2, 3]. The conceptual model illustrated in Figure 1
represents a multi-level, multi-modal integrated behavior analysis tool. The input to this model consists of Video or Audio+Video recordings of a student group performing a collaborative task. The Video-only condition was included to test whether visual behaviors alone could be used to estimate collaboration skills and quality. Next, low-level features like facial expressions and body pose are extracted at Level E. Information like joint attention and engagement is encoded at Level D. Level C describes complex interactions and individual behaviors. Level B is divided into two categories: Level B1 describes the overall group dynamics for a given task, while Level B2 describes the changing individual roles assumed by each student in the group. Finally, Level A describes the overall collaboration quality of the group based on the information from all previous levels. This paper focuses on building machine learning models that predict a group's collaboration quality from the individual roles (Level B2) and individual behaviors (Level C) of the students, indicated by red arrows in Figure 1.

Deep learning algorithms have gained increasing attention in the Educational Data Mining (EDM) community. The first papers to use deep learning for EDM were published in 2015, and the number of publications in this field keeps growing each year [12]. Despite their growing popularity, deep learning methods are difficult to work with under certain challenging scenarios. For example, deep learning algorithms work best with access to large amounts of representative training data, i.e., data containing sufficient variations of each class label pattern. They also assume that the label distribution of the training data is approximately uniform. If either condition is not satisfied, deep learning methods tend to perform poorly at the desired task. The challenges arising from limited and imbalanced training data are clearly depicted in Figure 2. For our classification problem the label distribution resembles a bell-shaped normal distribution. As a result, for both the Video and Audio+Video modality cases we have very few data samples for the
Effective and Working Independently codes, and the highest number of samples for the Progressing code. Figure 2 also shows the aggregate confusion matrix over all test sets after training Multi-Layer Perceptron (MLP) classification models with class balancing (i.e., assigning to each training sample a weight that is inversely proportional to the number of training samples corresponding to that sample's class label). The input feature representations used were obtained from Level B2 and Level C. We observe that despite using class balancing, the predictions of the MLP model are biased towards the Progressing code.

Contributions: To address the above challenges, in this paper we explore using a controlled variant of Mixup data augmentation, a simple and common approach for generating additional data samples [23]. Additional data samples are obtained by linearly combining different pairs of data samples and their corresponding class labels. Also note that the label space for our classification problem exhibits an ordered relationship. In addition to Mixup, we explore the value of using an ordinal-cross-entropy loss function instead of the commonly used categorical-cross-entropy loss function.
Outline of the paper: Section 2 discusses related work. Section 3 provides the necessary background on categorical-cross-entropy loss, ordinal-cross-entropy loss and Mixup data augmentation. Section 4 describes the dataset, the features extracted and the controlled variant of Mixup data augmentation. Section 5 describes the experiments and results. Section 6 concludes the paper.

2 Related Work
The use of machine learning for collaboration problem-solving analysis and assessment is still relatively new in the Educational Data Mining community. Reilly et al. used Coh-Metrix indices (a natural language processing tool to measure cohesion for written and spoken texts) to train machine learning models to classify co-located participant discourse in a multimodal learning analytics study [18]. The multimodal dataset consisted of eye-tracking, physiological and motion-sensing data. They analyzed the collaboration quality between novice programmers who were instructed to program a robot to solve a series of mazes. However, they studied only two collaborative states, thereby making it a binary classification problem. Huang et al. used an unsupervised machine learning approach to discover unproductive collaborative states for the same multimodal dataset [14]. For input features they computed different measures for each modality. Using an unsupervised approach they were able to identify a three-state solution that showed high correlation with task performance, collaboration quality and learning gain. Kang et al. also used an unsupervised learning approach to study the collaborative problem-solving process of middle school students. They analyzed data collected using a computer-based learning environment of student groups playing a serious game
[15]. They used KmL, an R package for applying k-means clustering to longitudinal data [10]. They too identified three different states using the proposed unsupervised method. In our paper we define five different group collaboration quality states in a supervised learning setup. The above studies discuss different ways to model positive collaboration between participants in a group. For Massive Open Online Courses (MOOCs), Alexandron et al. proposed a technique to detect cheating in the form of unauthorized collaboration, using machine learning classifiers trained on data of another form of cheating (copying using multiple accounts) [1].

Guo and Barmaki used a deep learning based object detection approach to analyze pairs of students collaborating to locate and paint specific body muscles on each other [11]. They used a Mask R-CNN for detecting students in video data. This is the only paper we found that used deep learning for collaboration assessment. They claim that close proximity of group participants and longer time taken to complete a task are indicators of good collaboration. However, they quantify participant proximity by the percentage of overlap between the student masks obtained using the Mask R-CNN, and the amount of overlap can change dramatically across different viewpoints. Also, good collaboration need not be exhibited by groups that take longer to complete a task. In this paper, the deep learning models are based on the systematically designed multi-level conceptual model shown in Figure 1. The proposed approach utilizes features at the lower levels of our conceptual model, but we go well beyond these and also include higher-level behavior analysis as well as the roles taken on by students to predict the overall group collaboration quality.
We propose using Mixup augmentation, an oversampling approach, together with an ordinal-cross-entropy loss function to better handle limited and imbalanced training data. Oversampling techniques are commonly used to make the number of samples in the different label categories approximately equal. SMOTE, proposed by Chawla et al., is one of the oldest and most widely cited oversampling methods [5]. The controlled variant of Mixup that we propose is very similar to their approach. However, ordinal loss functions have not received as much attention, since the label spaces of most classification problems of current interest do not exhibit an ordered structure or relationship. We refer interested readers to the following papers on ordinal loss functions for deep ordinal classification [13, 4]. In this paper we propose a simple variant of the regular cross-entropy loss that takes into account the relative distance of the predicted samples from their true class label location.
3 Preliminaries
In this section we briefly go through the concepts that are used in the following sections. For the remainder of this section, let us denote the input variables or covariates as x, the ground-truth label vector as y, and the predicted probability distribution as ŷ.

3.1 Classification Loss Functions
The cross-entropy loss, a.k.a. the categorical-cross-entropy loss function, is commonly used for training deep learning models for multi-class classification. Given a training sample (x, y), the cross-entropy loss can be represented as

L_CE(y, ŷ) = − Σ_{k=1}^{K} y_k · log(ŷ_k)    (1)

Here, K represents the number of classes. For a classification problem with K label categories, a deep learning model's softmax layer outputs a probability distribution vector ŷ of length K. The k-th entry in ŷ represents the predicted probability of the k-th class. The ground-truth label y is one-hot-encoded and represents a binary vector whose length is also equal to K. Note, Σ_k ŷ_k = 1 and Σ_k y_k = 1. For an imbalanced dataset, the learnt weights of a deep learning model will be largely governed by the class having the most samples in the training set. Also, if the label space exhibits an ordinal structure, the cross-entropy loss focuses only on the predicted probability of the ground-truth class and ignores the relative distance between an incorrectly predicted data sample and its true class label. A simple variant of the cross-entropy loss that is useful for problems exhibiting an ordered label space is shown in Equation 2:

L_OCE(y, ŷ) = (1 + w) · L_CE(y, ŷ),  where  w = |argmax(y) − argmax(ŷ)|    (2)

Here, w is an additional weight that is multiplied with the regular cross-entropy loss. Within w, argmax returns the index of the maximum-valued element in the vector and |·| denotes the absolute value. During the training process, w = 0 for training samples that are correctly classified, making the ordinal-cross-entropy loss the same as the cross-entropy loss. However, the ordinal-cross-entropy loss will be higher than the cross-entropy loss for misclassified samples, and the increase in loss is proportional to how far the samples have been misclassified from their true label locations. We go over the benefit of using the ordinal-cross-entropy loss function in Section 5.1.
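As a concrete illustration, the following is a minimal NumPy sketch of Equations 1 and 2 for a single training sample (the function names are ours, not from any released codebase):

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Categorical cross-entropy for a one-hot y_true (Equation 1)."""
    return -np.sum(y_true * np.log(y_pred + eps))

def ordinal_cross_entropy(y_true, y_pred, eps=1e-12):
    """Ordinal variant (Equation 2): scale the cross-entropy loss by
    1 + |argmax(y) - argmax(y_hat)|."""
    w = abs(int(np.argmax(y_true)) - int(np.argmax(y_pred)))
    return (1 + w) * cross_entropy(y_true, y_pred, eps)

y = np.array([0., 0., 1., 0., 0.])          # true class: index 2
p = np.array([0.7, 0.1, 0.1, 0.05, 0.05])   # predicted class: index 0
```

For the sample above, the prediction lands two positions away from the true class, so the loss is scaled by a factor of 3; a correctly classified sample incurs no extra penalty.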
3.2 Mixup Data Augmentation
Despite best data collection practices, bias exists in most training datasets as a result of time or resource constraints. These biases, and the resulting performance problems of machine learning models trained on such data, are directly correlated with the problem of class imbalance. Class imbalance refers to the unequal representation or number of occurrences of different class labels. If the training data is more representative of some classes than others, then the model's predictions will systematically be worse for the under-represented classes. Conversely, an over-representation of certain classes can skew predictions toward a particular result. Mixup is a simple data augmentation technique that can be used for imbalanced datasets
[23]. It is used for generating additional training samples and encourages the deep learning model to behave linearly in between training samples. It extends the training distribution by incorporating the prior knowledge that linear interpolations of the input variables x should lead to linear interpolations of the corresponding target labels y. Given a random pair of training samples (x_i, y_i) and (x_j, y_j), an additional sample can be obtained by convexly combining the input covariate information and the corresponding class labels, as illustrated in Equation 3:

x̃ = λ · x_i + (1 − λ) · x_j,    ỹ = λ · y_i + (1 − λ) · y_j    (3)
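A minimal NumPy sketch of Equation 3, assuming one-hot label vectors (`alpha` is the Beta-distribution parameter α):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def mixup_pair(x_i, y_i, x_j, y_j, alpha=0.4):
    """Convexly combine one random pair of samples (Equation 3)."""
    lam = rng.beta(alpha, alpha)  # lambda ~ Beta(alpha, alpha)
    return lam * x_i + (1 - lam) * x_j, lam * y_i + (1 - lam) * y_j
```

Because the same λ mixes both the covariates and the one-hot labels, the generated label ỹ is a valid probability vector whose mass is split between the two source classes.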
Here, (x̃, ỹ) is the newly generated sample, and λ ∈ [0, 1] is obtained using a Beta(α, α) distribution. Figure 3 shows Beta distributions for different values of α. If α approaches 0, the sampled λ values have a higher probability of being close to 0 or 1. If α approaches 1, the Beta distribution looks more like a uniform distribution. Based on the suggestions and findings in other papers [23, 22], for our experiments we set α = 0.4. Apart from improving classification performance on various image classification benchmarks [23], Mixup also leads to better-calibrated deep learning models [22]. This means that the predicted softmax scores of a model trained using Mixup are much better indicators of the actual likelihood of a correct prediction than those of models trained in a regular fashion. In Section 5.2, we explore the benefit of using Mixup with and without ordinal-cross-entropy loss.

4 Dataset Description, Feature Extraction and Controlled Mixup Data Generation
4.1 Dataset Description
Audio and video data were collected from 15 student groups across five middle schools. Each group was asked to perform 12 open-ended life science and physical science tasks that required the students to construct models of science phenomena. Each group was given only one hour to complete as many tasks as they could, resulting in 15 hours of audio and video recordings. Of the 15 groups, 13 had 4 students, 1 had 3 students, and 1 had 5 students.
For Level A and Level B2, each video recording was coded by three human annotators using ELAN (an open-source annotation software) under two different modality conditions: 1) Video, 2) Audio+Video. For a given task performed by a group, each annotator first manually coded each level for the Video modality and later coded the same task for the Audio+Video modality. This was done to prevent any coding bias resulting from the difference in modalities. A total of 117 tasks were coded by each of the three annotators. Next, the majority vote (code) from the group of three coders was used to determine the ground-truth Level A code. For cases where a clear majority was not possible, the median of the three codes was used as the ground truth, based on the code ordering depicted in Figure 2. For example, if the three coders assigned Effective, Satisfactory and Progressing for a certain task, then Satisfactory would be selected as the ground-truth label. Note that out of the 117 tasks within each modality, we did not observe a majority Level A code for only 2 tasks. The distribution of the Level A target labels is shown in Figure 2. For learning mappings from Level B2 → Level A we had access to only 351 data samples (117 tasks × 3 coders) to train the machine learning models, with the ground-truth Level A labels determined using the process described above. The protocol used for generating training-test splits is described in Section 5.

In the case of Level C, each video recording was coded by just one annotator. Because of this we had access to only 117 data samples (117 tasks coded) for training the machine learning models to learn mappings from Level C → Level A, making this an even more challenging classification problem. Note, the distribution of the Level A labels for this classification setting is similar to the distribution shown in Figure 2, with the difference that each label class now has just one-third of the samples.
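The majority-vote rule with median fallback can be sketched as follows (a hypothetical helper; we assume the ordinal code ordering of Figure 2, from Effective down to Working Independently):

```python
from collections import Counter

# Assumed ordinal ordering of the Level A codes, following Figure 2.
CODES = ["Effective", "Satisfactory", "Progressing",
         "Needs Improvement", "Working Independently"]

def level_a_ground_truth(annotations):
    """Majority vote over the three annotators' codes; if all three
    disagree, fall back to the median code under the ordinal ordering."""
    label, count = Counter(annotations).most_common(1)[0]
    if count >= 2:
        return label
    ranks = sorted(CODES.index(code) for code in annotations)
    return CODES[ranks[1]]  # middle of the three distinct ranks
```

For the example in the text, the codes Effective, Satisfactory and Progressing have no majority, so the median code Satisfactory is returned.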
Table 1: Coding rubric for Level B2 (individual roles, 7 codes) and Level C (individual behaviors, 23 codes).

Level B2 codes: Group guide/Coordinator [GG]; Contributor (Active) [C]; Follower [F]; Conflict Resolver [CR]; Conflict Instigator/Disagreeable [CI]; Off-task/Disinterested [OT]; Lone Solver [LS].

Level C codes: Talking; Reading; Writing; Using/Working with materials; Setting up the physical space; Actively listening/Paying attention; Explaining/Sharing ideas; Problem solving/Negotiation; Recognizing/Inviting others' contributions; Setting group roles and responsibilities; Comforting, encouraging others/Corralling; Agreeing; Off-task/Disinterested; Disagreeing; Arguing; Seeking recognition/Boasting; Joking/Laughing; Playing/Horsing around/Rough housing; Excessive deference to authority/leader; Blocking information from being shared; Doing nothing/Withdrawing; Engaging with outside environment; Waiting.
4.2 Level B2 and Level C Histogram Representation
For the entire length of each task, Level B2 was coded using fixed-length one-minute segments and Level C was coded using variable-length segments. This is illustrated in Figure 4. The coding rubric used by the annotators for these two levels is shown in Table 1. Level B2 and Level C consist of 7 codes and 23 codes respectively. Our objective in this paper is to determine the overall collaboration quality of a group by summarizing all of the individual student roles and behaviors observed during a given task. A simple but effective way to do this is to generate histogram representations of all the codes observed in each task. Figure 4 also provides a simple illustration of the histogram generation process. While it is straightforward to generate histograms for Level B2, in the case of Level C we compile the codes observed at every 0.1-second interval when generating the histogram. Once the histogram is generated for each task, we normalize it by dividing by the total number of codes in the histogram. Normalizing the histogram in a way removes the temporal component of the task. For example, suppose group 1 took 10 minutes to solve a task and group 2 took 30 minutes to solve the same task, but both groups were assigned the same Level A code despite group 1 finishing sooner. The raw histogram representations of these two groups would look different due to the difference in the number of coded segments, but normalizing the histograms makes the two groups more comparable. Note, the normalized histograms are the input to our machine learning models.
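The histogram construction and normalization can be sketched as follows (the Level B2 code abbreviations come from Table 1; the function name is ours):

```python
import numpy as np

B2_CODES = ["GG", "C", "F", "CR", "CI", "OT", "LS"]  # 7 Level B2 codes (Table 1)

def normalized_histogram(segment_codes, code_list=B2_CODES):
    """Count how often each code occurs over a task's coded segments,
    then divide by the total count to remove the task's duration."""
    hist = np.zeros(len(code_list))
    for code in segment_codes:
        hist[code_list.index(code)] += 1.0
    return hist / hist.sum()

# Two tasks of different lengths but identical role proportions
# yield the same normalized histogram:
short_task = ["C", "C", "F", "GG", "C"]
long_task = short_task * 3
```

This mirrors the group 1 versus group 2 example in the text: the raw counts differ, the normalized representations do not.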
4.3 Controlled Mixup
We described the simplicity and benefits of Mixup augmentation in Section 3.2. Here, we describe a controlled variant of Mixup and how it is used for our dataset. From Figure 2, we know that our dataset has an imbalanced label distribution: we have many data samples corresponding to the Progressing class, and the number of samples keeps decreasing as we move towards the Effective and Working Independently classes. Conventional Mixup selects a random pair of samples and interpolates them using a λ drawn from a Beta distribution. However, this generates samples that follow the same imbalanced class distribution. We instead want to generate a fixed number of samples for a specific category. To do this we first limit the range of λ, i.e., λ ≥ λ_min. Figure 5 shows a Beta(0.4, 0.4) distribution where we only consider λ values above the threshold λ_min.
Next, to generate additional samples for a specific class, we pair that class with its adjacent or neighboring classes. Let us use the following notation: (primary-class, [adjacent-class-1, adjacent-class-2]), where primary-class represents the class for which we want to create additional samples, and adjacent-class-1 and adjacent-class-2 represent its neighbors. We create the following pairs: (Effective, [Satisfactory, Progressing]), (Satisfactory, [Effective, Progressing]), (Progressing, [Satisfactory, Needs Improvement]), (Needs Improvement, [Progressing, Working Independently]) and (Working Independently, [Progressing, Needs Improvement]). The final step consists of generating samples for the primary-class using Mixup. We do this by randomly pairing samples from the primary-class with samples from the adjacent-classes, repeating the process N times, where N is the desired number of generated samples per class. Note that for Mixup augmentation, λ is always multiplied with the primary-class sample and (1 − λ) is multiplied with the adjacent-class sample. For our experiments we explore the following values of λ_min: 0.55, 0.75 and 0.95. Setting λ_min > 0.5 guarantees that each generated sample is always dominated by the primary-class.
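Putting the thresholded λ and the class pairings together, controlled Mixup can be sketched as follows (variable names are ours; λ is rejection-sampled from Beta(0.4, 0.4) until it clears the threshold):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def controlled_mixup(primary_x, primary_y, adjacent_x, adjacent_y,
                     n_samples, alpha=0.4, lam_min=0.75):
    """Generate n_samples for the primary class by mixing random
    primary-class samples with random adjacent-class samples, keeping
    only lambda >= lam_min so the primary class always dominates."""
    xs, ys = [], []
    for _ in range(n_samples):
        lam = rng.beta(alpha, alpha)
        while lam < lam_min:          # reject lambdas below the threshold
            lam = rng.beta(alpha, alpha)
        i = rng.integers(len(primary_x))
        j = rng.integers(len(adjacent_x))
        xs.append(lam * primary_x[i] + (1 - lam) * adjacent_x[j])
        ys.append(lam * primary_y[i] + (1 - lam) * adjacent_y[j])
    return np.stack(xs), np.stack(ys)
```

With lam_min > 0.5, the argmax of every generated label vector equals the primary class, so the per-class counts after augmentation are exactly what we ask for.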
5 Experiments
5.0.1 Network Architecture:
We used a 5-layer Multi-Layer Perceptron (MLP) model whose design was based on the MLP model described in [9]. It contains the following layers: 1 input layer, 3 middle dense layers and 1 output dense layer. The normalized histogram representations discussed in Section 4.2 are passed as input to the input layer. Each middle dense layer has 500 units with ReLU activation. The output dense layer has a softmax activation, and its number of units is equal to 5 (the total number of classes in Level A). We also used dropout layers between each layer to avoid overfitting. The dropout rates after the input layer and after each of the three middle layers were set to 0.1, 0.2, 0.2 and 0.3 respectively. We try three different types of input data: B2 histograms, C histograms, and the concatenation of B2 and C histograms (referred to as B2+C histograms). The number of trainable parameters is 507,505 for the B2 histogram, 515,505 for the C histogram and 519,005 for the B2+C histogram. Our models were developed using Keras with the TensorFlow backend [6]. We used the Adam optimizer and trained all our models for 500 epochs. The batch size was set to one-tenth of the number of training samples in any given training-test split. We saved the best model, i.e., the one that gave the lowest test loss for each training-test split.
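A Keras sketch of this architecture, reconstructed from the description above (with input dimensions 7, 23 and 30 for the B2, C and B2+C histograms, the trainable-parameter counts match the figures reported):

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_mlp(input_dim, n_classes=5):
    """5-layer MLP: input, three 500-unit ReLU dense layers, softmax
    output, with dropout rates 0.1, 0.2, 0.2, 0.3 after the input and
    middle layers as described in the text."""
    model = keras.Sequential([
        layers.Input(shape=(input_dim,)),
        layers.Dropout(0.1),
        layers.Dense(500, activation="relu"),
        layers.Dropout(0.2),
        layers.Dense(500, activation="relu"),
        layers.Dropout(0.2),
        layers.Dense(500, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy")
    return model
```

build_mlp(7) yields 507,505 trainable parameters, build_mlp(23) yields 515,505, and build_mlp(30) yields 519,005, consistent with the counts stated in the text.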
5.0.2 Training and Evaluation Protocol:
We adopt a round-robin leave-one-group-out cross-validation protocol. This means that for each training-test split we use the data from all but one group for training, and the held-out group is used as the test set. This process is repeated for all groups, for each of which we have histogram representations of every task performed. In the Audio+Video modality setting, all samples corresponding to the Effective class were found in only one group; similarly, in the Video modality all samples corresponding to the Working Independently class were found in just one group. For this reason we do not see any test samples for the Effective class in Audio+Video or for the Working Independently class in Video in the confusion matrices shown earlier in Figure 2. Note, for Level B2 → Level A we have 351 data samples, and for Level C → Level A we have only 117 data samples (as discussed in Section 4.1).
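The leave-one-group-out splits can be generated with a small helper (ours; scikit-learn's LeaveOneGroupOut would work equally well):

```python
def leave_one_group_out(groups):
    """Yield (train_idx, test_idx) pairs: each unique group id is held
    out once as the test set while all other groups form the training set."""
    for held_out in sorted(set(groups)):
        train = [i for i, g in enumerate(groups) if g != held_out]
        test = [i for i, g in enumerate(groups) if g == held_out]
        yield train, test
```

With 15 groups this yields 15 splits, each training on the samples of 14 groups and testing on the remaining one.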
5.1 Effect of OrdinalCrossEntropy Loss
The ordinal-cross-entropy loss shown in Equation 2 takes into account the distance of the highest predicted probability from the one-hot-encoded true label. This is what separates it from the regular cross-entropy loss (Equation 1), which focuses only on the predicted probability corresponding to the ground-truth label. In this section we explore the following four variations: cross-entropy loss only, cross-entropy loss with class balancing, ordinal-cross-entropy loss only, and ordinal-cross-entropy loss with class balancing. Here, class balancing refers to weighting each data sample by a weight that is inversely proportional to the number of data samples corresponding to that sample's class label.
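The class-balancing weights can be computed as in the following sketch (equivalent to scikit-learn's "balanced" heuristic; the function name is ours):

```python
from collections import Counter

def class_weights(labels):
    """Give each class a weight inversely proportional to its frequency,
    normalized so a perfectly balanced dataset yields weight 1 per class."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}
```

A class with one-quarter the samples of another thus receives four times its weight.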
Figure 6 illustrates the average weighted-F1-score classification performance of the four variations under different parameter settings. We only varied the patience and minimum-learning-rate (Min-LR) parameters, as we found that these two affected the classification performance the most. These parameters were used to reduce the learning rate by a factor of 0.5 if the loss did not change after a certain number of epochs, indicated by the patience parameter. Compared to the two cross-entropy-loss variants, we clearly see that the two ordinal-cross-entropy-loss variants significantly improve the F1-scores across all parameter settings. We consistently see improvements across both modality conditions and for the different histogram inputs. With class balancing we see only marginal improvements for both loss functions. Also, the F1-scores for the Video modality are always lower than those for the corresponding settings in the Audio+Video modality. This is expected, as it shows that annotations obtained using Audio+Video recordings are cleaner and better represent the student behaviors.
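The learning-rate schedule described above maps directly onto Keras's ReduceLROnPlateau callback; in the sketch below the monitored quantity and the specific patience/Min-LR values are placeholders for the settings varied in Figure 6:

```python
from tensorflow import keras

# Halve the learning rate when the monitored loss plateaus for
# `patience` epochs, never dropping below `min_lr`.
reduce_lr = keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss",   # assumed monitored quantity
    factor=0.5,           # reduction factor stated in the text
    patience=10,          # placeholder patience value
    min_lr=1e-5,          # placeholder Min-LR value
)
# Passed to model.fit(..., callbacks=[reduce_lr]).
```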
5.2 Effect of Controlled Mixup Data Augmentation
We use different parameter settings (listed in Figure 6) for each type of histogram input. For the B2 histogram we use parameter setting S5 for both modalities. For the C histogram we use parameter setting S6 for Video and S3 for Audio+Video. For the B2+C histogram we use parameter setting S7 for both Video and Audio+Video. For each modality and each type of histogram input we vary the threshold λ_min over {0.55, 0.75, 0.95}, and also vary N (the number of controlled-Mixup samples generated per class) over {200, 500, 1000}. Note, N = 200 implies that the training dataset contains 1000 samples (200 samples × 5 classes). Note also that this form of Mixup does not retain the original set of training samples. Figure 7 shows the effect of controlled Mixup with and without ordinal-cross-entropy loss. Across both modality settings, we observe that Mixup augmentation with ordinal-cross-entropy loss is better than Mixup with regular cross-entropy loss in all cases for the B2 histogram and in most cases for the C and B2+C histograms. This implies that controlled Mixup and ordinal-cross-entropy loss complement each other in most cases. We also observed that a larger N does not necessarily imply better performance. For the Audio+Video modality we observe similar F1-scores irrespective of the value of N. However, in the Video modality case we observe that the F1-score decreases as N increases. This could be attributed to the noisy nature of the codes assigned by the annotators due to the lack of the Audio modality. For the B2 and B2+C histograms we also notice that the best-performing λ_min values differ between the Audio+Video and Video modalities, with the opposite effect in the case of the C histogram. In the next section we discuss two different variants of the controlled-Mixup augmentation.
5.3 Full-Mixup vs. Limited-Mixup
For the controlled-Mixup experiments described in the previous section, the MLP models were trained using only the N generated samples per class, without retaining the original set of training samples. Let us refer to this as Full-Mixup. In this section we explore training MLP models with the original set of training samples, generating only the samples needed to reach N samples per class using controlled Mixup. For example, if the Effective class already has m training samples, then we only compute N − m samples using controlled Mixup to reach the required N samples per class. This process makes sure that we always retain the original set of training samples. Let us refer to this as Limited-Mixup. Figure 8 shows the average weighted F1-scores comparing Full-Mixup and Limited-Mixup. We only show results for the B2 histogram feature, as we observed similar trends in the other cases as well. We see that Full-Mixup and Limited-Mixup have similar F1-scores. This implies that we can generate the N samples per class using only the controlled-Mixup protocol described in Section 4.3 without much noticeable difference in F1-score performance.
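The difference between the two variants reduces to how many samples are generated per class; a small helper (names ours) makes this explicit:

```python
from collections import Counter

def mixup_budget(labels, n_per_class, limited=True):
    """Number of controlled-Mixup samples to generate for each class.
    Limited-Mixup keeps the originals and only tops classes up to
    n_per_class; Full-Mixup regenerates all n_per_class samples."""
    counts = Counter(labels)
    if not limited:
        return {cls: n_per_class for cls in counts}
    return {cls: max(n_per_class - cnt, 0) for cls, cnt in counts.items()}
```

For instance, with 30 original Effective samples and N = 200, Limited-Mixup generates only 170 new Effective samples while Full-Mixup generates all 200.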
5.4 Additional Analysis and Discussion
In this section we discuss in more detail the behavior of the different classification models seen in the previous sections. For each modality, Table 2 shows the weighted-precision, weighted-recall and weighted-F1-score results of the best MLP models under different experimental settings. Here, the best MLP models were selected based on the weighted F1-score, since it provides a more summarized assessment by combining information from both precision and recall. Values in the table correspond to the longest bars observed in Figures 6 and 7. Note, weighted recall is equal to accuracy. We also show results using an SVM classifier. For the SVM we explored linear and different non-linear kernels with different parameter settings and show only the best result in Table 2. For both modalities and each type of histogram input, if we focus only on the weighted F1-scores we notice little or no improvement as we move towards incorporating controlled Mixup and ordinal-cross-entropy loss. For this reason we also show the corresponding weighted-precision and weighted-recall values. We observe that the average weighted precision increases, and the standard deviation of the weighted precision decreases, as we move towards the proposed approach. For an imbalanced classification problem the objective is to predict more true positives. Thus a higher precision indicates more true positives, as its calculation does not consider any false negatives. The bold values in Table
2 indicate the top two methods with the highest weighted-precision values in each modality. We find that the Cross-Entropy Loss + Mixup and Ordinal-Cross-Entropy Loss + Mixup methods show the highest weighted precision using the B2 histogram input in the Video modality and the B2+C histogram input in the Audio+Video modality.

Table 2: Weighted precision, weighted recall and weighted F1-score (mean ± standard deviation, in %) for each feature type and classifier, under the Video and Audio+Video modalities.

| Feature | Classifier | Video: Weighted Precision | Video: Weighted Recall | Video: Weighted F1-Score | Audio+Video: Weighted Precision | Audio+Video: Weighted Recall | Audio+Video: Weighted F1-Score |
|---|---|---|---|---|---|---|---|
| B2 | SVM | 74.60 ± 11.27 | 62.67 ± 9.42 | 63.84 ± 11.18 | 84.45 ± 13.43 | 73.19 ± 16.65 | 76.92 ± 15.39 |
| B2 | MLP, Cross-Entropy Loss | 76.90 ± 12.91 | 73.95 ± 11.02 | 72.89 ± 13.22 | 83.72 ± 16.50 | 86.42 ± 10.44 | 84.40 ± 13.85 |
| B2 | MLP, Cross-Entropy Loss + Class Balancing | 77.08 ± 13.03 | 73.84 ± 13.27 | 74.12 ± 13.59 | 83.93 ± 17.89 | 85.29 ± 14.37 | 84.16 ± 16.23 |
| B2 | MLP, Ordinal-Cross-Entropy Loss | 81.51 ± 13.44 | 79.09 ± 13.62 | 79.11 ± 13.96 | 86.96 ± 14.56 | 88.78 ± 10.36 | 87.03 ± 13.16 |
| B2 | MLP, Ordinal-Cross-Entropy Loss + Class Balancing | 80.78 ± 14.12 | 78.70 ± 11.98 | 77.93 ± 14.05 | 86.73 ± 14.43 | 88.20 ± 9.66 | 86.60 ± 12.54 |
| B2 | MLP, Cross-Entropy Loss + Mixup | **81.61 ± 12.81** | 73.56 ± 10.31 | 76.40 ± 11.00 | 88.51 ± 12.32 | 83.58 ± 14.14 | 85.64 ± 13.23 |
| B2 | MLP, Ordinal-Cross-Entropy Loss + Mixup | **83.30 ± 10.06** | 76.57 ± 9.42 | 79.06 ± 9.66 | 89.59 ± 10.15 | 84.93 ± 13.20 | 86.09 ± 12.94 |
| C | SVM | 59.27 ± 27.00 | 42.76 ± 20.69 | 46.85 ± 22.26 | 72.33 ± 20.33 | 60.15 ± 19.45 | 63.25 ± 17.96 |
| C | MLP, Cross-Entropy Loss | 63.24 ± 20.78 | 65.73 ± 16.34 | 60.46 ± 17.57 | 81.15 ± 16.90 | 84.16 ± 11.67 | 81.70 ± 14.41 |
| C | MLP, Cross-Entropy Loss + Class Balancing | 63.82 ± 22.08 | 64.77 ± 18.51 | 60.64 ± 19.89 | 80.44 ± 18.11 | 84.88 ± 11.70 | 81.67 ± 15.06 |
| C | MLP, Ordinal-Cross-Entropy Loss | 68.16 ± 27.13 | 72.59 ± 17.88 | 67.88 ± 23.01 | 86.05 ± 14.11 | 86.90 ± 11.43 | 85.33 ± 13.07 |
| C | MLP, Ordinal-Cross-Entropy Loss + Class Balancing | 71.74 ± 24.34 | 74.10 ± 16.75 | 70.37 ± 20.94 | 85.24 ± 13.54 | 86.11 ± 11.65 | 84.94 ± 12.52 |
| C | MLP, Cross-Entropy Loss + Mixup | 72.27 ± 23.29 | 64.45 ± 19.55 | 66.02 ± 20.35 | 84.25 ± 13.78 | 81.91 ± 13.68 | 81.82 ± 13.93 |
| C | MLP, Ordinal-Cross-Entropy Loss + Mixup | 75.11 ± 21.63 | 69.54 ± 18.64 | 70.03 ± 20.01 | 82.94 ± 14.63 | 81.91 ± 14.68 | 81.63 ± 14.46 |
| B2+C | SVM | 72.49 ± 15.35 | 61.89 ± 13.21 | 64.95 ± 14.15 | 82.32 ± 16.53 | 73.32 ± 15.27 | 76.65 ± 15.65 |
| B2+C | MLP, Cross-Entropy Loss | 76.15 ± 15.81 | 74.59 ± 15.02 | 73.35 ± 16.08 | 83.38 ± 19.42 | 87.75 ± 14.68 | 85.09 ± 17.12 |
| B2+C | MLP, Cross-Entropy Loss + Class Balancing | 75.75 ± 17.23 | 73.81 ± 16.50 | 73.11 ± 17.17 | 84.71 ± 16.57 | 88.68 ± 11.04 | 85.52 ± 15.01 |
| B2+C | MLP, Ordinal-Cross-Entropy Loss | 78.05 ± 17.94 | 77.88 ± 16.16 | 76.73 ± 17.65 | 85.51 ± 17.28 | 89.25 ± 12.19 | 86.91 ± 14.99 |
| B2+C | MLP, Ordinal-Cross-Entropy Loss + Class Balancing | 78.10 ± 17.70 | 77.33 ± 17.02 | 76.61 ± 17.96 | 86.90 ± 15.83 | 88.82 ± 11.50 | 86.99 ± 14.15 |
| B2+C | MLP, Cross-Entropy Loss + Mixup | 77.99 ± 17.42 | 72.86 ± 14.32 | 74.29 ± 16.08 | **90.48 ± 11.20** | 86.57 ± 14.07 | 87.45 ± 13.65 |
| B2+C | MLP, Ordinal-Cross-Entropy Loss + Mixup | 77.92 ± 16.66 | 75.82 ± 15.27 | 76.45 ± 16.02 | **90.05 ± 10.80** | 85.91 ± 14.00 | 87.01 ± 13.18 |
The higher weighted precision is better illustrated by the confusion matrices shown in Figure 9. Here, we show confusion matrices for the Video modality using the B2 histogram features and for the Audio+Video modality using the B2+C histogram features, as these showed the best weighted-precision values in Table 2. As seen earlier in Section 5.1, ordinal-cross-entropy loss did show significant improvements in terms of weighted F1-score. However, even with class balancing we notice that the best MLP model is still biased towards the class with the most training samples. If we look at the controlled-Mixup variants with either cross-entropy loss or ordinal-cross-entropy loss, we notice a better diagonal structure in the confusion matrix, indicating more true positives. Note, we do not see any test samples for the Effective class in Audio+Video or for the Working Independently class in Video in the confusion matrices. Between Cross-Entropy Loss + Mixup and Ordinal-Cross-Entropy Loss + Mixup, we notice that ordinal-cross-entropy loss helps confine the spread of test-sample predictions to the classes nearest the true label.
6 Conclusion
In this paper we built simple machine learning models to determine the overall collaboration quality of a student group from a summary of the individual roles and individual behaviors exhibited by each student. Building these models posed challenges such as limited training data and severe class imbalance. To address these challenges, we proposed an ordinal-cross-entropy loss function together with a controlled variation of Mixup data augmentation. Ordinal-cross-entropy loss differs from regular categorical cross-entropy loss in that it takes into account how far training samples are classified from their true label locations. Our controlled variant of Mixup allows us to generate a desired number of data samples for each label category. Through various experiments we studied the behavior of different machine learning models under different experimental conditions and found clear benefits in using ordinal-cross-entropy loss with Mixup. For future work, we would like to explore building machine learning models that learn mappings across the other levels described in Figure 1, and to explore the temporal nature of the annotation segments as a regression problem.
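The two ingredients summarized above can be sketched compactly. The snippet below is a minimal, stdlib-only illustration of one plausible formulation of a distance-weighted ordinal cross-entropy and of Mixup controlled to emit a fixed number of samples per class; the function names, the Beta(0.2, 0.2) mixing distribution, and the exact weighting form are our assumptions for illustration, not the paper's definitions:

```python
import math
import random

def ordinal_cross_entropy(y_true_idx, y_prob, n_classes, eps=1e-12):
    """Cross-entropy scaled by the normalized ordinal distance between
    the predicted and true classes (a plausible formulation)."""
    losses = []
    for t, probs in zip(y_true_idx, y_prob):
        pred = max(range(n_classes), key=lambda k: probs[k])
        dist = abs(t - pred) / (n_classes - 1)        # in [0, 1]
        losses.append((1.0 + dist) * -math.log(probs[t] + eps))
    return sum(losses) / len(losses)

def controlled_mixup(X, y_idx, n_classes, per_class, alpha=0.2, seed=0):
    """Generate a fixed number of Mixup samples for each label category
    by mixing a sample from that class with a random other sample."""
    rng = random.Random(seed)
    X_new, y_new = [], []
    for c in range(n_classes):
        idx_c = [i for i, y in enumerate(y_idx) if y == c]
        if not idx_c:
            continue
        for _ in range(per_class):
            i = rng.choice(idx_c)
            j = rng.randrange(len(X))
            lam = rng.betavariate(alpha, alpha)       # standard Mixup Beta(alpha, alpha)
            X_new.append([lam * a + (1 - lam) * b for a, b in zip(X[i], X[j])])
            label = [0.0] * n_classes                 # soft two-hot label
            label[c] += lam
            label[y_idx[j]] += 1 - lam
            y_new.append(label)
    return X_new, y_new
```

Under this formulation, misclassifying a sample into a distant class costs up to twice the plain cross-entropy of the same prediction into an adjacent class, and the augmentation loop yields exactly `per_class` synthetic samples per label, which is how the controlled variant counters class imbalance.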
References

[1] (2020) Towards a general purpose anomaly detection method to identify cheaters in massive open online courses.
[2] (2020) Automated collaboration assessment using behavioral analytics.
[3] (2020) Collaboration conceptual model to inform the development of machine learning models using behavioral analytics.
[4] (2017) Unimodal probability distributions for deep ordinal classification. arXiv preprint arXiv:1705.05278.
[5] (2002) SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16, pp. 321–357.
[6] (2015) Keras. https://keras.io
[7] (2010) Common core state standards initiative. International Center.
[8] (2014) Boundary crossings: cooperative learning, collaborative learning, and problem-based learning. Journal on Excellence in College Teaching 25.
[9] (2019) Deep learning for time series classification: a review. Data Mining and Knowledge Discovery 33(4), pp. 917–963.
[10] (2011) KmL: a package to cluster longitudinal data. Computer Methods and Programs in Biomedicine 104(3), pp. e112–e121.
[11] (2019) Collaboration analysis using object detection. In EDM.
[12] (2019) A systematic review of deep learning approaches to educational data mining. Complexity 2019.
[13] (2016) Squared earth mover's distance-based loss for training deep neural networks. arXiv preprint arXiv:1611.05916.
[14] (2019) Identifying collaborative learning states using unsupervised machine learning on eye-tracking, physiological and motion sensor data. International Educational Data Mining Society.
[15] (2019) Collaborative problem-solving process in a science serious game: exploring group action similarity trajectory. International Educational Data Mining Society.
[16] (2006) Project-based learning. na.
[17] (2007) Development of a theory-based assessment of team member effectiveness. Educational and Psychological Measurement 67(3), pp. 505–524.
[18] (2019) Predicting the quality of collaborative problem solving through linguistic analysis of discourse. International Educational Data Mining Society.
[19] (2008) Guided team self-correction: impacts on team mental models, processes, and effectiveness. Small Group Research 39(3), pp. 303–327.
[20] (2013) Next generation science standards: for states, by states. The National Academies Press.
[21] (2001) Problem-solving team behaviors: development and validation of BOS and a hierarchical factor structure. Small Group Research 32(6), pp. 698–726.
[22] (2019) On Mixup training: improved calibration and predictive uncertainty for deep neural networks. In Advances in Neural Information Processing Systems, pp. 13888–13899.
[23] (2017) Mixup: beyond empirical risk minimization. arXiv preprint arXiv:1710.09412.