Predicting Engagement in Video Lectures

05/31/2020 ∙ by Sahan Bulathwela, et al. ∙ UCL

The explosion of Open Educational Resources (OERs) in recent years creates the demand for scalable, automatic approaches to process and evaluate OERs, with the end goal of identifying and recommending the most suitable educational materials for learners. We focus on building models to find the characteristics and features involved in context-agnostic (i.e. population-based) engagement, a seldom researched topic compared to contextualised and personalised approaches that focus on individual learner engagement. Learner engagement is arguably a more reliable measure than popularity/number of views, is more abundant than user ratings, and has also been shown to be a crucial component in achieving learning outcomes. In this work, we explore the idea of building a predictive model for population-based engagement in education. We introduce a novel, large dataset of video lectures for predicting context-agnostic engagement and propose both cross-modal and modality-specific feature sets to achieve this task. We further test different strategies for quantifying learner engagement signals. We demonstrate the use of our approach in the case of data scarcity. Additionally, we perform a sensitivity analysis of the best performing model, which shows promising performance and can be easily integrated into an educational recommender system for OERs.


1 Introduction

With the recent popularity of online learning platforms, the creation of Open Educational Resources (OERs) is increasing rapidly [TVET2018]. This recent large-scale creation of educational material demands ways to automatically manage educational resources. In the context of OERs, this means finding and recommending material that fits the learners’ goals while maximising learning outcomes. Such a goal usually entails a large personalisation factor [truelearn]. We define it as contextualised engagement, which captures how engaging a learning resource is with regard to the context of the learner (e.g., learning needs/goals and learner state). Although contextualised engagement has gained interest in recent years [sum_20], we argue that there is also a context-agnostic engagement factor, which relates only to features of the learning resource and attempts to capture the gold-standard label of population-based engagement (i.e. the marginal of contextual engagement for a resource across the population of learners). Modelling context-agnostic engagement enables identifying highly engaging resources across a population of learners before personalising educational recommendations to individuals. This paper studies the features involved in context-agnostic engagement, as a first step towards building an integrative educational recommendation system that joins both contextualised and context-agnostic features [trueeducation].

A high quality learning resource needs to satisfy three main properties: i) academic soundness and appropriate coverage of the body of knowledge, ii) pedagogical robustness and iii) enabling learners to achieve their desired learning outcomes [Lane10]. Learner engagement has been shown to be a proxy for (iii), as engaging with material is a prerequisite for learning. There is evidence from both online [ramesh2014learning, lan2017behavior] and classroom [pardos2014affective, slater2016semantic] educational settings showing that higher learner engagement increases the likelihood of better learning outcomes. We thus focus on finding the general characteristics of engaging material. Using features that can be extracted across multiple modalities (video, text, audio etc.) allows developing prediction models for gold-standard engagement that are easily adaptable to a wide range of OERs and can be automated [x5gon].

Our work is one of the first to address educational engagement prediction with video lectures, especially from a quantitative perspective. One of our primary goals is to understand whether easily automatable cross-modal features, as opposed to modality-specific features, can be used as predictors of how engaging an educational resource is. Although large-scale studies (involving millions of videos) have been conducted to analyse the prediction of engagement for general-purpose videos [beyondviews], the largest study in the context of educational video lectures involves 800 videos from 4 courses and analyses engagement from a qualitative perspective [Guo_vid_prod]. To the best of our knowledge, this work is the first attempt to predict engagement with educational videos automatically. Our experiments involve more than 4,000 video lectures that span over 20 diverse subjects, making it the largest dataset to date in this field. Our dataset, code and best performing model are released with the paper.

Given the usefulness of predicting context-agnostic engagement and the scarcity of work in this topic, we are motivated to answer the following research questions, which will enable the deployment of such a model in an educational platform:

  • How to encode context-agnostic engagement?

  • How effective are cross-modal language-based features for predicting engagement with video lectures?

  • Does including modality-specific features lead to a significant improvement in performance?

  • What features influence context-agnostic engagement?

  • Is predicting marginal population-based engagement useful over personalised engagement?

  • Can we assume a common underlying model for predicting engagement across different knowledge areas?

2 Related Work

The interest in identifying useful and engaging information goes beyond the educational domain and is investigated in numerous other fields [quality_features]. For example, Wikipedia uses a review system to evaluate the quality of its articles. To do so, different machine learning models, such as support vector regression and ensemble methods, are used with features such as text style, readability, structure, network, recency and review information [Dalip_wiki_svr, wiki_wang]. Moreover, in the context of automatic essay scoring, promising results have been obtained through rank preference support vector machines [AES_dataset] and more sophisticated deep learning models [taghipour2016neural].

Quality-biased document ranking [Bendersky2011] and spam web-page detection [Ntoulas2006] are other areas in the information retrieval domain that also utilise textual and recency-related features. These features fall into different verticals such as understandability, topic coverage, presentation, freshness and authority [quality_features].

OERs available to the public come at large scale and in various modalities [x5gon, stud_eng_ou], which makes modality-specific models of limited use. As existing work proposes models with domain/modality-specific features (e.g. network features of Wikipedia [dalip_quality_features] or speaker speed in videos [Guo_vid_prod]), there is a need for models that can evaluate how engaging educational materials are at scale using a cross-modal feature set. We attempt to address this gap through this work.

2.1 Why Modelling Engagement?

As argued by Lane [Lane10], a well designed learning resource should enable the learner to achieve the expected learning outcomes. Prior work has studied learner engagement in Massive Open Online Courses and shown that, when optimised, engagement can increase the likelihood of achieving better learning outcomes [ramesh2014learning, lan2017behavior]. User engagement has also been shown to differ greatly from popularity measures such as number of views [beyondviews], as the latter does not necessarily capture whether learners consume the material. In our work, we also show that engagement does not positively correlate with user ratings. Instead, what we observe is that lectures with low ratings also present low engagement, whereas lectures with higher ratings can have widely different engagement rates.

For videos, watch time has been used as the main measure for quantifying engagement in the literature, e.g., for YouTube recommendations [Covington2016], predicting engagement with videos [beyondviews]. For educational content, the median of normalised engagement time (i.e., the percentage of watch time from the total video) has been used as gold standard for engagement [Guo_vid_prod]. Our work tests several approaches to encoding user engagement.

Most of the related work regarding predicting educational engagement attempts to model learner engagement as a function of the learner’s context (demography, user activity, etc.) [bonafini2017much, stud_eng_ou, beal2006classifying], as opposed to modelling context-agnostic learner engagement as a function of content-based features of the educational resource, which is our aim. Context-agnostic engagement has been previously studied for video lectures, advocating for qualitative and general recommendations such as keeping videos short [Guo_vid_prod], using conversational language for lecture delivery [brame2016effective] and others. These recommendations empower authors to create better educational videos. However, none of these works address the need for automatically identifying the features of highly engaging educational resources, which is imperative for retrieving and recommending educational material at scale.

3 Data and Methodology

This section first describes the dataset built for predicting engagement, together with the set of features proposed in this paper. Then, we introduce the machine learning methods and the feature importance analysis method considered.

To address the research questions outlined in the introductory section of this paper we do the following: i) We study different ways of refining user engagement signals, linking to literature on psychometrics (RQ1). ii) We propose two sets of easily automatable features for predicting engagement (cross-modal features inspired by context-agnostic quality literature and video-specific features) and evaluate the difference of predictive performance between them (RQ2 and RQ3). iii) We construct a large dataset of video lectures and evaluate the performance of the proposed engagement signals and sets of features (RQ2-4). iv) We compare cross-modal to modality specific features, analysing the impact of individual features in the predictive model that presents the most promising performance (RQ4). v) We compare our population-based engagement approach to its personalised analogue to demonstrate its usefulness (RQ5). vi) We compare the engagement models obtained from dividing the video lectures in two differentiated knowledge areas: STEM (such as technology, physics and mathematics lectures) vs others (such as arts, social science and philosophy lectures).

3.1 Dataset and Features (RQ2-4)

We use data from a popular OER repository, VideoLectures.Net (VLN, www.videolectures.net), a collection of videos of researchers presenting in peer-reviewed conferences. This data is suitable for our aim for two reasons: i) it contains watch patterns about how learners consume lectures, and ii) the lectures are peer-reviewed and hence the material is controlled for correctness of knowledge and pedagogical robustness. The transcriptions of English lectures and English translations of the non-English lectures are provided by the TransLectures project (www.translectures.eu). We restrict the final dataset to lectures that have been viewed by at least 5 unique users, leading to a final dataset of 4,063 lectures. These lectures are categorised into 21 subjects, e.g. Computer Science, Physics, Philosophy, etc. Learner engagement labels are computed using 155,850 user view log events (video viewing events) created between December 8, 2016 and February 17, 2018. The dataset constructed is publicly available, including different statistics of population engagement and all the cross-modal and video-based features proposed.
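As a rough illustration (not the released preprocessing code), the filtering and labelling steps above could be sketched in pandas as follows; the event-log schema (columns lecture_id, user_id, watch_time and duration) is an assumption made only for this sketch.

    import pandas as pd

    # Hypothetical event-log schema: one row per viewing event with columns
    # `lecture_id`, `user_id`, `watch_time` (seconds) and `duration` (seconds).
    events = pd.read_csv("vln_view_events.csv")

    # Normalised engagement time per event, capped at 1 (a user may replay parts).
    events["net"] = (events["watch_time"] / events["duration"]).clip(upper=1.0)

    # Keep only lectures viewed by at least 5 unique users.
    viewers = events.groupby("lecture_id")["user_id"].nunique()
    events = events[events["lecture_id"].isin(viewers[viewers >= 5].index)]

    # Median of Normalised Engagement Time (MNET) per lecture.
    mnet = events.groupby("lecture_id")["net"].median().rename("MNET")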

3.1.1 Cross-modal Features

We selected a subset of cross-modal and mostly language-based features that are easy to extract from the VLN dataset. The 13 extracted features are shown in Table 1. This set has been selected based on recurring features in the related work [Bendersky2011, Dalip_wiki_svr, Guo_vid_prod, Ntoulas2006, wiki_wang] and their quality verticals [quality_features] identified in our prior work. The majority of features were extracted using methods and token (word) sets that are found in the prior work referenced in Table 1.

Additionally, we introduce the published date, represented by converting the video publication date to UNIX epoch time (in days). In other words, it is the number of days between January 1, 1970 and the lecture published date.
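A minimal sketch of the published date feature and two of the simpler textual features is given below; the stop-word list and helper names are illustrative placeholders, not the token sets used in the cited prior work.

    from datetime import date

    def published_date_days(published: date) -> int:
        # Days between the UNIX epoch (1 January 1970) and the publication date.
        return (published - date(1970, 1, 1)).days

    print(published_date_days(date(2017, 3, 14)))  # 17239 (example date)

    # Illustrative stop-word list (placeholder only).
    STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}

    def word_count(transcript: str) -> int:
        return len(transcript.split())

    def stopword_presence_rate(transcript: str) -> float:
        tokens = transcript.lower().split()
        return sum(t in STOP_WORDS for t in tokens) / max(len(tokens), 1)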

3.1.2 Video-based Features

We also extracted four out of the seven features proposed for analysing educational engagement with video lectures in [Guo_vid_prod], selecting those features that can be automated and are objective. These are: i) lecture duration, as shorter videos have been shown to be much more engaging; ii) is chunked, whether the lecture has been partitioned into multiple parts; iii) a set of indicator variables describing the type of lecture, such as tutorial, workshop, etc.; and iv) speaker speed, measured by the average number of words spoken per minute. We also include the silence period rate (SPR), calculated using the special tags in the video transcripts that indicate silence. Formally, for a lecture $\ell$, this feature is calculated as follows:

$$\mathrm{SPR}(\ell) = \frac{\sum_{t \in T_\ell} \mathbb{1}\left[\mathrm{type}(t) = \mathrm{silence}\right] \cdot \mathrm{dur}(t)}{\mathrm{dur}(\ell)} \qquad (1)$$

where $t$ is a tag in the collection of tags $T_\ell$ that belong to lecture $\ell$, $\mathrm{type}(\cdot)$ returns the type of a tag, $\mathrm{dur}(\cdot)$ returns the duration of a tag or lecture, and $\mathbb{1}[\cdot]$ is the indicator function (returning 1 when the condition is verified, 0 otherwise).
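The silence period rate of Eq. (1) could be computed roughly as below; the tag representation (a list of dicts with `type` and `duration` fields) is an assumption about the transcript format, used here only for illustration.

    # Minimal sketch of Eq. (1): fraction of the lecture duration covered by silence tags.
    def silence_period_rate(tags, lecture_duration):
        silence = sum(t["duration"] for t in tags if t["type"] == "silence")
        return silence / lecture_duration

    tags = [{"type": "speech", "duration": 55.0}, {"type": "silence", "duration": 5.0}]
    print(silence_period_rate(tags, lecture_duration=60.0))  # ~0.083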

Feature Reference
Content-based features
Easiness (FK Easiness) [Dalip_wiki_svr]
Stop-word Presence Rate [Ntoulas2006]
Stop-word Coverage Rate [Ntoulas2006]
Document Entropy [Bendersky2011]
Word Count [wiki_wang]
Title Word Count [Bendersky2011]
Preposition Rate [Dalip_wiki_svr]
Auxiliary Rate [Dalip_wiki_svr]
To Be Rate [Dalip_wiki_svr]
Conjunction Rate [Dalip_wiki_svr]
Normalization Rate [Dalip_wiki_svr]
Pronoun Rate [Dalip_wiki_svr]
Published Date
Video-based features
Lecture Duration [Guo_vid_prod]
Is Chunked [Guo_vid_prod]
Video Lecture Type [Guo_vid_prod]
Speaker speed [Guo_vid_prod]
Silence Period Rate (SPR)
Table 1: Extracted features from the VLN dataset.

3.2 Quantifying Engagement (RQ1)

Our work focuses on implicit user feedback (more specifically, engagement). Implicit feedback (in the form of number of views, engagement or any other measure that does not require the user to provide explicit feedback) has been used for building recommender systems for nearly two decades with great success [Oard1998ImplicitFF, Jannach2018RecommendingBO, Jawaheer14], as an alternative to explicit ratings, which place a high cognitive load on users and thus are usually sparse. However, implicit signals have other challenges associated with them. For example, implicit feedback is usually positive-only [Jannach2018RecommendingBO] and can contain effects such as popularity bias, i.e., there might be a bias towards more popular items, whereas implicit feedback for other items may be very sparse. There have been several works investigating the relationship between explicit and implicit feedback [Claypool:2001:III:359784.359836, ShapiraTM06, Zigoris2006], a relationship we also investigate in this work.

The main measure that we use to quantify engagement is the Median of Normalised Engagement/watch Time (MNET), as it has been proposed as the gold standard for engagement with educational materials in previous work [Guo_vid_prod]. To keep the MNET label in the range $[0, 1]$, we set the upper bound of MNET to 1. We observed in our initial data analysis that MNET values in the VLN dataset follow a log-normal distribution: most users abandon the lecture after a generally low time threshold. We hypothesise this may be because it takes some time to decide whether the content is relevant for the learner. Users that make it past this threshold seem more committed and thus the leaving rate is significantly lower. To address this skew, which is usually a problem when using machine learning methods, we applied a log transformation to the engagement signal. The final label, Log Median Normalised Engagement Time (LMNET), is computed for a lecture $\ell$ as:

$$\mathrm{LMNET}_\ell = \log\left(\mathrm{MNET}_\ell\right) \qquad (2)$$
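Continuing the earlier preprocessing sketch, the label of Eq. (2) could be derived from the capped MNET values as follows; the small clipping constant is a safeguard added for this sketch, not part of the original formulation.

    import numpy as np

    def lmnet(mnet):
        # Log transform of the (log-normally distributed) MNET values, Eq. (2).
        # Clipping away from zero avoids log(0) for fully unwatched lectures.
        return np.log(np.clip(np.asarray(mnet, dtype=float), 1e-6, 1.0))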

To test whether LMNET can be further improved, we compare this approach of encoding engagement with other alternative ways of quantifying and cleaning engagement signals, drawing inspiration from the literature on psychometrics and subjective assessment [Janowski15, Thurstone1927], which focuses on explicit human feedback, assumes that users present cognitive biases and differences, and has applications in preference ranking and measuring perception-based qualities such as engagement. The intuition behind this is that different learners may have a different engagement threshold and scale, similarly to explicit ratings [Janowski15]. We compare the following approaches for defining engagement:

  1. Raw LMNET, as per Eq. (2), which assumes that no user differences exist and that the marginal over the population can be directly used as the gold-standard engagement label, similarly to [Guo_vid_prod].

  2. Cleaned LMNET, for which we test the removal of bot-like users (those users with an average engagement rate of less than 5%), which may distort the median of raw engagement.

  3. Standardised LMNET, in which we preprocess LMNET per user (subtracting the mean of the user and dividing by the standard deviation), as commonly done with human ratings in order to remove user biases and differences [Janowski15]. In this scale, positive values indicate lectures that are more engaging than the mean of the user and vice versa (a minimal sketch of this per-user standardisation is given after this list).

  4. Comparative MNET, in which we exploit the law of comparative judgement and use psychometric scaling to go from per-user comparative engagement data to a probabilistically interpretable engagement scale [Thurstone1927, perezortiz2017practical]. More specifically, we assume that engagement data can only be compared within each user (as users may have different biases, thresholds or engagement scales). To do so, we generated a matrix of engagement comparisons (of the type: did a given learner prefer lecture $i$ to lecture $j$ in terms of engagement?), which is used as the input for psychometric scaling, producing a final scale in which distances can be interpreted in terms of probability of greater engagement.
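A minimal sketch of the per-user standardisation (approach 3) is shown below, assuming an event-level frame with user_id and lmnet columns; users with a single event yield an undefined standard deviation and are left as NaN here.

    import pandas as pd

    def standardise_per_user(events: pd.DataFrame) -> pd.DataFrame:
        # Subtract each user's mean LMNET and divide by their standard deviation.
        grouped = events.groupby("user_id")["lmnet"]
        out = events.copy()
        out["lmnet_std"] = (out["lmnet"] - grouped.transform("mean")) / grouped.transform("std")
        return out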

As discussed, the limitation of these approaches is that they disregard the context of the learner and the temporal component that may inherently be present when engaging with educational material. A different measure to encode engagement is found in Wu et al. [beyondviews], where the main idea is to compare engagement relative to the length of the video. The authors propose this for entertainment videos. However, we argue against this approach in the case of educational material, as the aim is to take the learner to the desired state in the most efficient way, hence the general recommendation in the literature of keeping videos as short as possible [Guo_vid_prod].

3.3 Machine Learning Models (RQ2)

To learn to rank video lectures based on engagement, we evaluate the performance of pointwise ranking models. Regression algorithms predict the target variable in real value space ($\mathbb{R}$), which allows them to create a global ranking of observations based on predictions. We also evaluate the performance of engagement prediction using kernelised models. Kernelisation allows capturing non-linear patterns in data without having to operate in the respective basis. Although it is more computationally efficient than working in the non-linear space itself, it is more computationally expensive than solving the non-kernelised problem. Our choice of kernel for the models is the Radial Basis Function (RBF). The RBF kernel is widely used in the literature and has mathematical connections to other popular kernels such as exponential and polynomial kernels [chang2010training, shawe2004kernel].

We use two regression algorithms, namely Ridge Regression (RR) and Support Vector Regression (SVR) in primal form. We use RR as it is a widely used algorithm for regression [beyondviews] and SVR as it has performed well in a similar task in prior work [Dalip_wiki_svr]. We also evaluate the performance of the kernelised versions of the same two algorithms (with the RBF kernel), Kernelised Ridge Regression (KRR) and Kernelised Support Vector Regression (KSVR). This allows us to understand whether there is non-linearity in the patterns that benefits the prediction task. In all four models discussed above, we employ standard scaling as these models are not scale invariant. L2 regularisation is used to defend against overfitting and multicollinearity [Ng_l2]. As ensemble techniques have been shown to perform well in prior work [wiki_wang], we also employ a Random Forest Regressor (RF) to evaluate its prediction capabilities. This model is also capable of capturing non-linear patterns.
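The model zoo described above can be written down with scikit-learn roughly as follows; the hyperparameter values are placeholders rather than the settings tuned in the experiments.

    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import Ridge
    from sklearn.svm import SVR, LinearSVR
    from sklearn.kernel_ridge import KernelRidge
    from sklearn.ensemble import RandomForestRegressor

    models = {
        "RR": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),            # L2-regularised
        "SVR": make_pipeline(StandardScaler(), LinearSVR(C=1.0)),           # primal form
        "KRR": make_pipeline(StandardScaler(), KernelRidge(kernel="rbf")),  # RBF kernel
        "KSVR": make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0)),  # RBF kernel
        "RF": RandomForestRegressor(n_estimators=100, random_state=0),      # scale-invariant
    }

Each entry can then be evaluated with the 5-fold cross-validation protocol described in Section 4.1.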

3.3.1 Comparison to Personalised Models (RQ5)

One of our aims is to compare the population-based model to its personalised counterpart. The idea in this case is to test if a common baseline can be assumed for all users. For this, we train the same machine learning models per user, using the features previously proposed.

3.4 Feature Importance Analysis (RQ4)

Understanding how different features influence the engageability of materials is vital in the educational domain, as learners will be guided on life-changing pathways based on these judgements. In a conventional linear model such as RR or SVR, feature importance analysis is straightforward as the weight coefficients reflect the influence of features.

In this paper we use SHapley Additive exPlanations (SHAP), which is a model-agnostic framework that quantifies the impact of features on the model predictions. It reliably estimates feature importance for complex model families such as ensembles [NIPS2017_shap]. A SHAP value is computed for every feature of every prediction. Given a prediction and a feature, SHAP is computed by averaging how the prediction changes when the feature is present and vice versa. This procedure enables quantifying the contribution of each feature to the model prediction. By plotting all the SHAP values of the prediction data points in a SHAP summary plot, we can identify how each feature influences the prediction. By calculating the Mean Absolute SHAP (MAS) for each feature $j$ over the observations:

$$\mathrm{MAS}_j = \frac{1}{N} \sum_{i=1}^{N} \left| \mathrm{SHAP}(x_i, j) \right| \qquad (3)$$

we obtain a more quantitative understanding of feature influence, where $N$ is the number of observations.
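A sketch of this analysis, using shap.TreeExplainer for the Random Forest model and computing the MAS of Eq. (3) per feature; the function and variable names are illustrative.

    import numpy as np
    import shap

    def mean_absolute_shap(rf_model, X_test, feature_names):
        explainer = shap.TreeExplainer(rf_model)
        shap_values = explainer.shap_values(X_test)   # shape: (N observations, n features)
        # shap.summary_plot(shap_values, X_test, feature_names=feature_names)  # Figures 6-7 style plot
        mas = np.abs(shap_values).mean(axis=0)        # Mean Absolute SHAP per feature, Eq. (3)
        return dict(zip(feature_names, mas))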

4 Experiments and Discussion

This section shows the experimental setup and results for the different experiments conducted.

4.1 Experimental Setup

The evaluation of the machine learning models is performed using a 5-fold cross-validation for both feature sets. The performance of different machine learning models with different engagement quantification approaches can be found in Table 2. The performance when video-specific features are added is found in Table 3.

After gaining an understanding of model performance (see results in Table 2), we employ the best performing method and encoding for the rest of the analyses, using a hold-out validation with a train-test split of 70:30 to save computation. That is, the model is trained on the 70% training set and interpreted using the 30% test set. The experiments were implemented using Scikit-learn [scikit-learn], textatistic [textatistic] and SHAP [NIPS2017_shap] python packages. The source code in python and dataset are publicly available333https://github.com/sahanbull/context-agnostic-engagement.

4.1.1 Evaluation metrics

Pairwise accuracy (Pair.) and Spearman Rank Order Correlation Coefficient (SROCC) are the ranking metrics we used to evaluate the ranking performance of machine learning models with different engagement signal encodings.

Identifying models that can rank video lectures is the core objective of this work. Hence, we use pairwise accuracy as the main evaluation metric. Pairwise accuracy is intuitive for this task as it represents the fraction of pairwise comparisons in which the model correctly predicts the more engaging lecture. Another opportunity that pairwise comparison provides is the ability to restrict the comparisons to subsets of lecture pairs (e.g. lectures that belong to the same subject, or lectures that have similar LMNET).

In some of our experiments we also perform a misranking analysis and report the pairwise accuracy. Misranking could happen if a subset of examples is systematically difficult to rank. We hypothesise that misranking happens more frequently as the difference of LMNET between a pair of video lectures gets smaller; that is, the model may struggle to differentiate between two lectures with similar engagement. This analysis also lets us understand the sensitivity of the prediction model to similarly engaging lectures. Naturally, misranking a pair of lectures that differ significantly in engagement incurs a larger cost in terms of user satisfaction than misranking a pair of lectures with similar engagement.
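Pairwise accuracy can be computed with a straightforward (quadratic-time) sketch like the one below; the optional pair_filter argument is an illustrative way to restrict comparisons, e.g. to same-subject pairs or to pairs within a given MNET-difference bin. SROCC can be obtained directly with scipy.stats.spearmanr.

    from itertools import combinations

    def pairwise_accuracy(y_true, y_pred, pair_filter=None):
        # Fraction of lecture pairs whose ordering by predicted engagement
        # matches their ordering by observed engagement.
        correct, total = 0, 0
        for i, j in combinations(range(len(y_true)), 2):
            if pair_filter is not None and not pair_filter(i, j):
                continue
            if y_true[i] == y_true[j]:
                continue  # ties carry no ordering information
            total += 1
            correct += int((y_true[i] > y_true[j]) == (y_pred[i] > y_pred[j]))
        return correct / total if total else float("nan")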

4.1.2 Controlling for Topics in Content

The topics covered in the content of the lecture are likely to drive learner engagement. For instance, Data Science lectures may be more popular than Physics lectures, leading to easy pairwise comparison predictions between the domains. To test this, in some experiments we restrict the pairwise accuracy calculation to pairs of lectures that belong to the same domain (subject-specific column in Table 3) and observe whether the accuracy changes significantly compared to its counterpart metric that considers all lecture pairs in a domain-agnostic fashion.

4.2 Results

This section presents a series of experiments to:

  • Analyse the relationship between engagement, number of views and mean star ratings (RQ1).

  • Test different machine learning models and engagement signals for the cross-modal features (RQ1-2).

  • Study the distribution of engagement with respect to length of materials (RQ4).

  • Study the influence of modality-specific features and comparison across subject areas (RQ3).

  • Analyse the importance of different features in the model (RQ4).

  • Compare the population-based model to its personalised counterpart (RQ5).

  • Test if the same underlying model can be assumed for different knowledge/subject areas (RQ6).

Figure 1: Scatter plots showing the relationship between (i) number of views vs. MNET, (ii) mean star rating for the video lecture vs. MNET and (iii) mean star rating vs. number of views, together with the Spearman’s rank correlation coefficient (SROCC).

4.2.1 E1: Engagement vs Views and Ratings

The VLN data source also has mean star ratings (explicit feedback) for a subset of the considered lectures. It is noteworthy that we only have access to mean star ratings, not to the individual ratings per observer or the number of measurements. As done in previous work, we also analyse the relationship between implicit signals (engagement and number of views) and explicit ratings. This can be found in Figure 1, where we show mean star rating vs. MNET and number of views. The SROCC is close to zero, mainly because of the large number of lectures with high rating but low engagement and number of views. We test the correlation for the 4 different versions of engagement considered (raw, cleaned, standardised and comparative), but all achieve similar results, with SROCC close to zero. One conclusion that is clear from the plot in Figure 1 is that number of views, ratings and engagement represent very different information. For example, it can be appreciated that the variance of MNET and number of views increases with higher ratings, showing heteroskedasticity. This indicates that for low quality resources (with low ratings) engagement is generally low, whereas for resources with higher ratings engagement differs and may be either high or low. This suggests that factors other than the quality perceived by learners are involved in engagement. Regarding number of views, the correlation is if anything negative, showing that the materials with the highest number of views present very low engagement.

Engagement      RR                   SVR                  KRR                  KSVR                 RF
                Pair.     SROCC      Pair.     SROCC      Pair.     SROCC      Pair.     SROCC      Pair.     SROCC
Raw             .705±.011 .581±.027  .707±.000 .586±.000  .715±.004 .607±.011  .714±.007 .604±.019  .723±.009 .625±.027
Cleaned         .636±.033 .396±.093  .634±.031 .392±.089  .646±.025 .424±.071  .642±.028 .414±.078  .646±.031 .427±.087
Standardised    .603±.035 .302±.098  .600±.035 .292±.100  .609±.035 .315±.099  .602±.025 .297±.071  .611±.035 .323±.099
Comparative     .624±.010 .365±.028  .624±.012 .363±.036  .626±.013 .370±.040  .627±.009 .373±.027  .636±.012 .397±.038

Table 2: Pairwise accuracy (Pair.) and Spearman's Rank Order Correlation Coefficient (SROCC) of engagement prediction models with standard error from 5-fold cross-validation, using cross-modal features.

4.2.2 E2: Encoding and Predicting Engagement

Inherently, the task of finding a better engagement signal is very challenging, given the lack of ground truth. In this paper, we first attempt to see if any of these signals present better correlation with star ratings. However, we observe from Figure 1 that engagement is not strongly correlated with perceived quality by users (explicit star ratings) and similar results emerge for different methods of quantifying engagement, meaning it is inconclusive that transforming raw engagement signals strengthens its relationship to explicit perceived quality. Thus, in order to decide on which is the best way of capturing and quantifying engagement, we compare the pairwise accuracy for the four proposed approaches (raw LMNET, cleaned, standardised and comparative). This simply tells us which output target variable is easier to predict given the proposed features. Table 2 presents these results, together with the pairwise accuracy (Pair.) and Spearman’s Rank Order Correlation Coefficient (SROCC) obtained for each machine learning model with the standard error bounds based on 5-fold cross validation. The larger the accuracy value, the better performing the model is.

These results suggest that raw LMNET may be the most appropriate target label, particularly since the proposed features seem to be more useful when building a model for predicting raw LMNET. These results do not contradict the literature, both educational and non-educational, as MNET has been used as the gold-standard way of quantifying engagement. Our experiments thus show that the subjective-assessment-inspired transformations do not improve the predictive power of engagement signals. This may be because these transformation/correction methods were initially designed to address biases in latent user preferences. Although similar biases may exist in learners when consuming educational materials (e.g. learner fatigue, different engagement thresholds, language level preferences, etc.), we hypothesise that the most influential driver of engagement is the information content and style of the video.

Another observation from Table 2 is that KRR and KSVR models outperform their linear versions. This suggests that there could be non-linearity in the dataset that is better captured by the kernel techniques. RF seems to be the best performing model providing more evidence that non-linearity plays a significant role.

To show how the accuracy changes as the difference of MNET between two lectures changes, we first compute all possible differences of MNET between pairs of lectures, group these pairs into equal-width bins spanning the range from 0 to 1, and finally compute the pairwise accuracy for each bin. Figure 2 shows how the performance of the model changes based on the difference of MNET between lecture pairs. The bars in the figure represent the pairwise accuracy for all the pairs that belong to the same bin. For example, the pairs with the largest difference of MNET are predicted correctly with 0.962 accuracy, whereas pairs with the smallest difference are predicted with 0.642 accuracy.

Intuitively, a learner might have a similar experience consuming a pair of video lectures that are similarly engaging (at least disregarding the topic), as they are less likely to notice the difference. The black line in Figure 2 presents the cumulative pairwise accuracy of the model if we assume that learners are insensitive to the difference in experience for lecture pairs that have a small difference of MNET. The plotted cumulative pairwise accuracy (y-axis) is computed by restricting the comparisons to lecture pairs with a difference of MNET between the lower bound of the x-axis value and 1.0. For instance, the cumulative pairwise accuracy of the model is 0.816 when learners do not notice the difference for similarly engaging lecture pairs with a MNET difference in [0.0, 0.2]. This value is the pairwise accuracy of all the lecture pairs with a MNET difference in (0.2, 1.0].

Figure 2: Bar chart plot showing how the pairwise accuracy changes based on the difference of MNET between lecture pairs

4.2.3 E3: Length of Materials vs. Engagement

Several studies have shown that features quantifying material length have a significant impact on sustained engagement with the material [Guo_vid_prod, Dalip_wiki_svr]; this is also reaffirmed by our feature importance analysis in Figures 6 and 7. We investigate how the length of the lectures impacts engagement prediction (i.e. whether the engagement predictor is naïvely distinguishing between long and short video lectures). We first investigate the distribution of total word count in the video lectures (Figure 3), which is directly related to length. Based on the observed multimodal distribution, we form two groups: i) short lectures of fewer than 5,000 words and ii) long lectures of 5,000 words or more (see the engagement distribution in Figure 4). It can be seen that, as anticipated, the percentage of watch time tends to be shorter for long lectures.

Figure 3: Distribution of word count of video lectures
Figure 4: Distribution of engagement labels for short and long lectures.
Figure 5: Accuracy bar chart for different types of comparisons using short and long lecture labels.

We investigate how median engagement labels are distributed in the aforementioned groups and also how the pairwise accuracy differs within and between the groups. Figure 5 shows that the model is better at comparing short-short lecture pairs than long-long lecture pairs. In the context of the VLN dataset, this is good news because there are more short lectures than long lectures (Figure 3). Recent findings (e.g. [Guo_vid_prod]) also encourage authors to make short videos, increasing the likelihood that future video productions will be short lectures. The MNET distribution in Figure 4 shows that long lectures have a more skewed target value distribution, concentrated closer to 0, compared to short lectures, suggesting that learners tend to consume smaller fractions of long videos. This is likely to be driven by factors beyond the other measured features of the lectures, such as limited time availability and the short attention span of learners.

                                    Pairwise Accuracy
Model                               Subject-agnostic    Subject-specific
Content-based Features              .724±.014           .733±.018
Video-specific Features             .744±.011           .755±.014

Table 3: Pairwise accuracy with standard error via 5-fold cross-validation for the RF model using content-based features vs. content-based + video-specific features.

4.2.4 E4: Video-Features and Subject Areas

Table 3 shows how the pairwise accuracy increases when restricted to subject-specific comparisons (lecture pairs belonging to the same subject area). This is clearly an advantage, given that most often, an educational recommendation system needs to make choices among sets of resources that belong to the same subject area.

Table 3 additionally shows how the performance differs when using exclusively the cross-modal set of features and when adding video-specific features. The addition of video features increases the performance by approximately 2%. This result shows that there is a compromise in performance when restricting the model to cross-modal features, although the corresponding feature extractors can be reused across modalities in a practical scenario.

4.2.5 E5: Feature Importance Analysis

The SHAP value summary plots for the content-based and video-specific feature sets are presented in Figures 6 and 7 respectively, where the features are ordered by overall feature influence using the best performing prediction model (RF). Colour represents the raw feature value (blue low, red high). For example, when the observed values of a feature are red and have a negative SHAP value, higher values of this feature negatively impact the LMNET prediction. Regarding video length, the figures validate its impact on engagement, showing that long videos generally present lower engagement and vice versa, with lecture duration and word count being the most relevant features. Prior studies confirm this observation [Guo_vid_prod, beyondviews, dalip_quality_features].

Figure 6: SHAP summary plot for cross-modal features.
Figure 7: SHAP summary plot with video-specific features.
Quality Vertical Feature MAS % MAS
Topic Coverage Word Count .250 .366
Freshness Published Date .107 .157
Understandability Easiness .052 .076
Understandability Stop-word Coverage Rate .042 .061
Presentation Normalization Rate .039 .058
Topic Coverage Title Word Count .039 .057
Presentation To Be Rate .038 .055
Topic Coverage Document Entropy .033 .048
Understandability Stop-word Presence Rate .028 .041
Presentation Conjunction Rate .019 .028
Presentation Preposition Rate .014 .020
Presentation Pronoun Rate .013 .020
Presentation Auxiliary Rate .009 .013
Table 4: Influence of content-based features on engagement as per their verticals outlined in [quality_features].

Table 4 complements Figure 6 by giving a more quantitative representation of how the influence of different features varies across the test dataset. Higher MAS is associated with more important features. By looking at the five most influential features, we observe that all identified quality verticals (topic coverage, understandability, freshness and presentation) are represented. This observation supports the importance of considering all the different verticals when predicting context-agnostic engagement. The influence of the top features is also consistent with results on quality-biased information search [Bendersky2011], where Title Word Count is likewise found to be comparatively less important. Figure 7 also shows the importance of modality-specific features in this prediction task, with Lecture Duration, Silence Period Rate and Speaker Speed ranking highly.

4.2.6 E6: Population-based vs. Personalised

We use the 20 most active learners from the VLN dataset to compare the predictive performance of context-agnostic and contextual/personalised models when predicting engagement. Firstly, we train the population-based prediction model using the VLN dataset (outlined in Section 3.1) with a 70:30 train-test split. In order to build the personalised models, for each user we make a similar 70:30 train-test split respecting the temporal order of their individual events. We use the training data to build a personalised model per user using only the cross-modal set of features (no video-specific features). For each learner $u$, we make predictions on the test events using (i) the population-based model and (ii) the personalised model trained on the personal events of the learner. We calculate the Mean Absolute Error (MAE) over the $N_u$ test events of learner $u$ as:

$$\mathrm{MAE}_u = \frac{1}{N_u} \sum_{i=1}^{N_u} \left| y_i - \hat{y}_i \right| \qquad (4)$$

where $y_i$ is the observed engagement label and $\hat{y}_i$ is the prediction. As regression models are devised for the task, MAE is a sensible evaluation metric to measure the predictive performance of the models. We then calculate the difference of MAE between the population-based and personalised models; thus, a negative value indicates that the population model is better and vice versa. Figure 8, where the y-axis represents the difference in performance between the population-based and personalised models, shows that the population-based model has better predictive power when the number of training examples available for the individual learner is limited. This is represented by the green line (at a MAE difference of 0). This demonstrates the usefulness of the population-based engagement prediction model in situations where the recommender system is in a cold-start phase.
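A minimal sketch of the per-learner comparison, assuming both models expose a scikit-learn style predict method; a negative difference means the population-based model fits the learner's held-out events better than their personalised model.

    import numpy as np

    def mae(y_true, y_pred):
        # Mean Absolute Error, Eq. (4).
        return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

    def mae_difference(population_model, personal_model, X_test_u, y_test_u):
        # MAE(population) - MAE(personalised) for one learner's test events.
        return (mae(y_test_u, population_model.predict(X_test_u))
                - mae(y_test_u, personal_model.predict(X_test_u)))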

Figure 8: How the difference between the Mean Absolute Error (MAE) of the population-based and personalised models changes with the number of training events per learner. Each data point is an individual learner in the dataset.

4.2.7 E7: Individual Models per Knowledge Area

To understand if training subject-specific models can improve on the predictive power of the overall task, we partition the lecture records into 2 categories:

  • STEM: Life Sciences, Physics, Technology and Mathematics.

  • Miscellaneous: Social Sciences, Humanities, Arts and Philosophy.

Then, we compare the performance of models trained on subject-agnostic (STEM + miscellaneous) and subject-specific (STEM only or miscellaneous only) training data. Table 5 shows little evidence in our results against the assumption that a common subject-agnostic engagement model holds across knowledge areas: whether training with all knowledge areas or dividing into two, the models obtain very similar test accuracy for each category (.737 vs. .732 and .708 vs. .704). In fact, the best performance is obtained in both cases by training with the whole dataset. This indicates that, in general, a common engagement model can be assumed across knowledge areas.

                     Test Data
Training Data        STEM     Misc.
Subject-agnostic     .737     .708
Subject-specific     .732     .704

Table 5: Pairwise accuracy for STEM and Miscellaneous (Misc.) lectures when trained with subject-agnostic and subject-specific training data.

4.3 Limitations

Firstly, the model does not include features that capture the authority of the content or its authors. Authority has been identified as an influential feature and lacking it is a weakness of this model. However, identifying an authority indicator that generalises beyond niche communities (e.g. academia) is challenging yet necessary, especially in the OER landscape where anyone can author learning materials. Secondly, the topic coverage features used in this model (Word Count, Title Word Count and Document Entropy) are relatively naïve, although useful. Having better features will likely improve the model. The current work demonstrates promise in predicting learner engagement with video lectures using easily automatable material features alone; more sophisticated features, both cross-modal and modality-specific, could lead to higher predictive performance and a better understanding of context-agnostic engagement. Thirdly, the engagement model is trained on English lectures and English translations of non-English lectures, which impacts the generalisation ability of the model; the same applies to non-video content, and more rigorous testing is needed on these fronts. Lastly, given that our dataset only considers OERs and excludes the learning dimension, we highlight that some of our findings may not be directly applicable to other types of educational material. In particular, given that most of our features are language-based and we disregard visual information, the built models may not generalise to general-purpose videos.

5 Conclusions

Given the timely need, we set out to develop and empirically test the suitability of engagement prediction models for automatically assessing context-agnostic engagement of OERs. Due to the scarcity of publicly available datasets for the task, we sourced a new video-lectures dataset and evaluated how different machine learning models perform on it. In our analysis, we observed that the Random Forest algorithm performs best. We show that cross-modal features provide satisfactory performance, which is a major advantage, since these can be extracted from different resource modalities. Further experiments show that the predictive performance of the model gains a slight boost from adding modality-specific features; however, the performance does not deviate significantly. Feature analysis showed that lecture length features are the most influential in predicting context-agnostic engagement, which agrees with prior work. Other moderately influential features come from diverse quality verticals. Our analysis also showed that the model ranks much better when lectures with very different engagement values are compared, as opposed to lectures with similar engagement; this is natural, and the negative impact of misranking pairs of similarly engaging lectures is relatively small. Our experiments demonstrated that the built model is useful in data scarcity scenarios, e.g. to approach the common cold-start problem in recommender systems. This holds both for new users and new content, as the model can automatically estimate the engagement of new material and can be used as a prior when we do not have enough data from a user to build a personalised model. We finally show that dividing the dataset into different knowledge areas (subjects) and building separate models does not improve performance, validating that a common underlying model can be built for estimating engagement across differentiated knowledge areas.

The proposed context-agnostic engagement prediction model can be beneficial in improving different components of an educational recommendation system. In situations where new content is discovered frequently (e.g. the OER landscape [x5gon, x5learn]), the proposed model estimates how engaging materials are prior to exposing them to the learner population. This allows better balancing of the risks relating to learner satisfaction with the opportunities of having fresh materials. The proposed context-agnostic model can also be integrated with a personalisation system in different ways. It can act as a prior that mitigates the cold-start problem on both the user and content fronts. In systems where personalisation heavily focuses on the topics covered in the materials [trueeducation, truelearn], this model can complement the content-based model by accounting for stylistic and linguistic features that go beyond topic coverage.

To further improve the models, future work should address the main limitations discussed above. Future versions of our model should incorporate more sophisticated features; it could be beneficial to include features capturing authority and topic coverage [quality_features]. In this sense, Wikification [wikifier] can be used to extract covered topics, and data-driven authority features, such as [Adler_wiki_content_driven], can be used to learn a universal author authority score. On the cross-modal front, more features focusing on content understanding, such as topic coherence and argument strength, can be considered. On the video-specific front, features such as the liveliness of the presenter, sound quality and narration quality can be incorporated. Regarding the generalisation capabilities of the model, evaluating the effectiveness of the cross-modal feature set with a bigger video lecture dataset [Guo_vid_prod, beyondviews] and a text dataset [Dalip_wiki_svr] will increase confidence in the feature set. Similarly, non-English datasets should also be taken into account.

6 Acknowledgments

This research is part of the X5GON project funded from the EU’s Horizon 2020 research programme grant No 761758 and partially funded by the EPSRC Fellowship titled "Task Based Information Retrieval", under grant No EP/P024289/1.

References