
Can Population-based Engagement Improve Personalisation? A Novel Dataset and Experiments

by   Sahan Bulathwela, et al.

This work explores how population-based engagement prediction can address cold-start at scale in large learning resource collections. The paper introduces i) VLE, a novel dataset that consists of content and video based features extracted from publicly available scientific video lectures coupled with implicit and explicit signals related to learner engagement, ii) two standard tasks related to predicting and ranking context-agnostic engagement in video lectures with preliminary baselines and iii) a set of experiments that validate the usefulness of the proposed dataset. Our experimental results indicate that the newly proposed VLE dataset leads to context-agnostic engagement prediction models that are significantly more performant than ones based on previous datasets, mainly attributable to the increase in training examples. The VLE dataset's suitability for building models for Computer Science/Artificial Intelligence education focused on e-learning/MOOC use-cases is also evidenced. Further experiments in combining the built model with a personalising algorithm show promising improvements in addressing the cold-start problem encountered in educational recommenders. To our knowledge, this is the largest and most diverse publicly available dataset that deals with learner engagement prediction tasks. The dataset, helper tools, descriptive statistics and example code snippets are available publicly.



1 Introduction

With the rapid growth of Open Educational Resources (OER) [42] and Massively Open Online Courses (MOOC) [37, 23], large educational resource repositories need scalable tools to understand and estimate the engagement potential of newly added materials [13]. While contextualised engagement can be defined as learner’s engagement driven by variables related to the context of the learner at a given time in their learning path (e.g., learning needs/goals, knowledge state, background on the topic, etc.), context-agnostic engagement aims to capture patterns and features associated with engagement that instead are applicable to an entire learner population rather than individual contexts of specific learners [6]. Put simply, context-agnostic engagement is concerned with the features that generally make a piece of educational material engaging.

For works such as this, it is important to clarify that Information Retrieval (IR) in education has characteristics very distinct from those of conventional web search, as learners tend to explore and gather information. Learners discovering knowledge in an educational IR system usually learn and familiarise themselves with novel concepts during the search session, which leads the task to deviate or expand into unanticipated sub-tasks [15]. This behaviour is drastically different from that of typical search engine users, who are usually aware of the exact information they are after. This distinction has led to many works that are specific to educational IR [21, 32]. It also makes conventional query log datasets sub-optimal for training e-learning IR models. Prior works have shown this mismatch [11], leading to the use of education-specific datasets [40] instead of general-purpose query logs.

This work attempts to enrich the educational IR research domain by making a large educational video dataset available to the community. Video Lecture Engagement (VLE) is a novel dataset that presents around 12,000 peer-reviewed scientific video lectures constructed from a popular OER repository, VideoLectures.NET, and contains a variety of lecture types ranging from scientific talks to expert panels to MOOC-like lectures. The majority of these lectures belong to Computer Science (CS), Artificial Intelligence (AI) and Data Science subject areas, making this dataset a great source for understanding learner engagement with AI/CS related educational videos. The dataset provides an extensive set of textual and video-specific features extracted from the lecture transcripts, together with Wikipedia topics covered in the lecture (via entity linking) and user engagement labels (both explicit and implicit) for each video.

This dataset is uniquely suited for solving the cold-start problem in educational recommenders, both i) user cold-start, where new users join the system and we may not have enough information about their context and ii) item cold-start, where new educational content is released, for which we may not have user engagement data yet and thus an engagement predictive model would be necessary. The effectiveness of using context-agnostic engagement prediction to address the cold-start problem has been empirically demonstrated before (see Figure 1). To the best of our knowledge, this is the largest publicly available dataset to tackle such problems in educational recommenders. The aim of the dataset is not to replace personalised recommenders, but rather to complement them by providing meaningful population baselines/priors to solve the common cold-start problem. While constructing the VLE dataset is a major contribution of this paper, a series of experimental results is also included as an additional contribution. These results validate the VLE dataset’s usefulness in i) significantly improving engagement prediction models, ii) determining how the training data size impacts model improvements, iii) using the VLE dataset for AI/CS and E-Learning/MOOC educational recommenders and finally, iv) showing the feasibility of fusing context-agnostic prediction with personalised recommenders to improve overall prediction performance.

Figure 1: How the difference between the Mean Absolute Error (MAE) of population-based and personalised models changes with the number of training events per learner. Until about 80 events in this plot, population-based predictions are more accurate. Plot from [6].

2 Related Work

The majority of work in Intelligent Tutoring Systems (ITS) and Educational Recommendation Systems (EduRecSys) revolves around using learner context [26, 8] to predict learner engagement. Although the connection between learner engagement and learning gains has been explored by many [2, 20, 28], public datasets in this realm are hard to come by. Many MOOC platforms such as edX [23] and Khan Academy [29] harvest valuable data created in an in-the-wild setting, yet this data is gated within course owners and consortia [21] (or heavily anonymised) due to its proprietary nature. However, with the boom of online education, understanding the content features that lead to context-agnostic (population-based) engagement, an under-researched knowledge area, has become critical. This work marks a significant step in this direction by publishing a large dataset that the community can use to push the research frontiers. Other public datasets relating to educational question generation [11, 36], argument strength [34] and essay scoring [46] serve different objectives and tasks.

The study of context-agnostic engagement with video lectures has so far been mainly qualitative, deriving guidelines such as keeping videos short and in parts [23] and using conversational language [3]. These guidelines are only useful at the content creation stage and are of no use in moderating the mammoth volume of materials already circulating on the Internet. Our work proposes to model context-agnostic engagement using features associated with the educational resource itself, which is useful for scalable quality assurance and recommendation of existing (and newly created) materials.

2.1 Related Datasets

A few works on engagement prediction with videos (e.g. modelling watch time) have been done with YouTube [16, 24], where YouTube-specific meta-data features (e.g. channel reputation, category etc.) are used exclusively. Although large-scale public datasets relating to engagement prediction exist, they focus on general-purpose videos (largely entertainment) rather than educational videos [44]. Some of the features used by these works are similar to ones proposed in this paper (such as video duration, language and topic features). However, no textual features (based on the video transcript) relating to understandability and presentation are used in this prior work, making the methods hard to generalise outside of YouTube. Beyond videos, educational IR [39, 14] and Wikipedia page quality prediction [19, 43] have been attempted using features such as text style, readability, structure, network, recency and review data. A publicly available Wikipedia article quality dataset [19] with human-annotated (explicit) labels is used to tackle the latter task, although implicit labels are not included in this dataset. Similar datasets are available for automated essay scoring [41]. However, none of these datasets address the lack of resources for predicting engagement with educational videos.

In the context of education, a different line of work looks at learner engagement using learner-specific multi-modal data (brain waves [45], facial expressions [25] etc.). While tackling a different task, these datasets are mainly collected in a controlled lab setting where the in-the-wild factor is missing [22]. A large number of public datasets and competitions in education also relate to students interacting with learning problems (e.g. ASSISTments [30] or multiple choice questions [12]), but these datasets, contrary to the proposed VLE dataset, do not focus on implicit engagement. More relevant and similar datasets that address context-agnostic engagement prediction in education have been emerging with a focus on MOOCs. Studying approximately 800 videos from the edX platform, Guo et al. [23] manually processed and provided a qualitative analysis of engagement, with some features being relatively subjective and difficult to automate. A similar work [38] takes 22 edX videos, extracts cross-modal features and manually annotates their quality with no focus on learner engagement with the videos. Neither dataset is publicly available. MOOCCube is a recently released dataset that contains a spectrum of details relating to MOOC interactions [47]. Although large, the video watch logs in MOOCCube come from 190,000 users, whereas VLE signals are generated by over 1.1 Million users. As central values (e.g. median) are used for context-agnostic engagement prediction, a larger user base adds stability to the engagement centres. MOOCCube uses a closed topic taxonomy disconnected from Wikipedia, which prevents the dataset from using the powerful signals in Wikipedia (e.g. semantic relatedness and the category tree, to name a few) to improve prediction models. No engagement prediction attempts have been published thus far to demonstrate MOOCCube’s promise in the task.

As a more relevant contribution, [6] has demonstrated that context-agnostic engagement prediction models can be built with implicit watch time based labels and that these models can be used in addressing the cold-start problem (see Figure 1). We identify this work as the closest contribution to the proposed dataset. They gather a collection of 4,000 video lectures with engagement signals generated by 150,000 informal learners. We consider our work, VLE, an expansion of this dataset with around 12,000 video lectures where engagement signals are generated by 7 times as many learners. The new dataset also restricts itself strictly to English lectures to preserve the meaningfulness of the majority of features, which are specific to the English language. VLE also introduces a new set of Wikipedia-based topical features. Furthermore, this work is accompanied by additional experiments that go beyond data quantity and validate the usefulness of the dataset.

3 VLE Dataset

Figure 2: The video data and the learner interaction logs from VideoLectures.Net repository are processed separately to create the content-based, video-specific features and Wikipedia-based Topics. Multiple different engagement labels are extracted from interaction logs and published in VLE dataset.

The VLE dataset is constructed using the aggregated video lecture consumption data coming from a popular OER repository, VideoLectures.NET (VLN). These videos are recorded when researchers are presenting their work at peer-reviewed conferences. Lectures are thus reviewed and material is controlled for correctness of knowledge. The presenters and authors of published work that is recorded by VLN agree and provide rights to publish their presentation video, slides and supplementary materials under an open licence on the VLN website, where they can be used for educational and research purposes. The majority of research venues covered by VLN are related to Artificial Intelligence and Computer Science, so most videos are associated with these topics. Although the dataset mainly consists of scientific talks that are geared towards postgraduate and PhD level learners, a significant number of tutorials (Table 1) are geared towards university level students. In that respect, many videos in the dataset are similar in style to conventional MOOC lectures.

The dataset provides a set of statistics aimed at studying population-based engagement in scientific videos, together with other conventional metrics in subjective assessment such as star ratings and number of views. We believe the dataset will help the community applying AI in Education to further understand which features of educational material make it engaging for learners. The users also agree for the user-generated content on the VLN website to be available for research purposes. We have anonymised all user interaction data and aggregated them to ensure the privacy of the users. Figure 2 gives a high level representation of how different data silos within the VLN repository have been used to create the VLE dataset.

3.1 Feature Extraction

We process the video meta-data and English transcriptions to extract i) content-based textual features, ii) Wikipedia topic-based features and iii) video-specific features. The majority of the extracted features (with the exception of a few features in the video-specific category) are cross-modal (i.e. also applicable to other modalities such as books, websites and audio) and are easily automatable.

3.1.1 Content-based Features

Prior work [6] has proposed a set of effective content-based features for similar datasets. We use the video meta-data and transcript to extract Word Count [43], Title Word Count and Document Entropy [1], language style features [18], Preposition Rate, Auxiliary Rate, To Be Rate, Conjunction Rate, Normalisation Rate, Pronoun Rate, readability related Easiness (FK Easiness) [18] and language style related Stop-word Presence Rate, Stop-word Coverage Rate [1, 31]. To represent the Freshness of a lecture, we calculate the number of days between January 01, 1960 and the lecture's published date, which serves as a proxy for recency [6].
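As a minimal sketch of the Freshness calculation described above, the snippet below counts the days between January 01, 1960 and a publication date. The function names and the whitespace-based title word count are illustrative assumptions, not the released feature extraction code.

```python
from datetime import date

# Reference date stated in the text; the names below are illustrative.
EPOCH = date(1960, 1, 1)


def freshness(published: date) -> int:
    """Number of days between January 01, 1960 and the publication date."""
    return (published - EPOCH).days


def title_word_count(title: str) -> int:
    """Whitespace-token count (assumption: the dataset's tokeniser may differ)."""
    return len(title.split())
```

Larger Freshness values correspond to more recently published lectures.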

3.1.2 Wikipedia-based Features

We use Wikifier [4], a novel entity linking method, on transcripts to extract topical features. Two main types of novel topical features that cover topic authority and topic coverage verticals [9] are proposed through this work.

The top-5 authoritative topic URLs and top-5 PageRank scores features represent the Topic Authority feature vertical. Wikifier [4] produces a PageRank score [5] that indicates the marginal authoritativeness of a Wikipedia concept among all Wikipedia concepts associated with a lecture. We use this score to rank and identify the most authoritative topics and use the actual PageRank score as a proxy for quantifying authority. It is noteworthy that authority of a learning resource entails author, organisation and content authority [9]. These features represent content authority.

The top-5 covered topic URLs and top-5 cosine similarity scores features represent the Topic Coverage feature vertical. The cosine similarity score between the Term Frequency-Inverse Document Frequency (TF-IDF) representations of the lecture transcript and the Wikipedia page of the concept is also an output from the Wikifier. We use these values to i) rank and identify the most covered topics and ii) quantify the topic coverage [8].
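To make the coverage score concrete, the sketch below computes a cosine similarity between TF-IDF vectors of a transcript and candidate concept pages and ranks the concepts by it. This is only an illustration: the tokenisation, the smoothed IDF and all function names are assumptions, and the Wikifier's internal weighting may differ.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one sparse TF-IDF dict per tokenised document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (an assumption) keeps weights positive for common terms.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_covered_topics(transcript, concept_pages, k=5):
    """Rank concept pages by TF-IDF cosine similarity to the transcript."""
    names = list(concept_pages)
    docs = [transcript.lower().split()]
    docs += [concept_pages[name].lower().split() for name in names]
    vecs = tf_idf_vectors(docs)
    scored = [(name, cosine(vecs[0], vec)) for name, vec in zip(names, vecs[1:])]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```

A concept page that shares no terms with the transcript receives a score of zero, so it can never outrank a page with textual overlap.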

For authority and coverage, the top 5 topic URLs and their scores are included introducing four additional feature groups providing 20 distinct feature columns in the final dataset. Figure 3 presents two word clouds that show the 25 most authoritative and covered topics in the VLE dataset. With respect to the topics in a lecture, the authoritative topics are the ones that have strong connection (linkage in Wikipedia) with other co-occurring topics in a lecture whereas the covered topics are the ones with high textual overlap with Wikipedia concept pages. This figure confirms that these two feature groups represent different things although there is some overlap between the topics.
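The 20 topic columns described above can be pictured with the following sketch, which takes the concepts linked to one lecture and keeps the top 5 by PageRank and the top 5 by cosine similarity. The input schema and the column names are hypothetical and chosen only for illustration; the published dataset may name its columns differently.

```python
def topic_feature_columns(concepts, k=5):
    """Build 4 * k topic columns for one lecture.

    concepts: list of dicts with 'url', 'pagerank' and 'cosine' keys
    (a hypothetical schema for the entity linker output).
    """
    by_authority = sorted(concepts, key=lambda c: c["pagerank"], reverse=True)[:k]
    by_coverage = sorted(concepts, key=lambda c: c["cosine"], reverse=True)[:k]
    row = {}
    for i in range(k):
        # Lectures with fewer than k linked concepts get None placeholders.
        a = by_authority[i] if i < len(by_authority) else None
        c = by_coverage[i] if i < len(by_coverage) else None
        row[f"authority_topic_url_{i + 1}"] = a["url"] if a else None
        row[f"authority_topic_pagerank_{i + 1}"] = a["pagerank"] if a else None
        row[f"coverage_topic_url_{i + 1}"] = c["url"] if c else None
        row[f"coverage_topic_cosine_{i + 1}"] = c["cosine"] if c else None
    return row
```

Because the two rankings use different scores, the same concept may appear in both groups or in only one of them.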

Figure 3: WordClouds summarising the distribution of top 25 (i) most authoritative and (ii) most covered Wikipedia topics in the dataset. Note that Data Science and Computer Science related topics are most dominant topics.

3.1.3 Video-specific Features

We identify a set of easily automatable, previously proposed [6] video-specific features. Lecture Duration, Is Chunked, Lecture Type [23], Silence Period Rate (SPR) and Speaker Speed [6] are calculated based on prior work. Lecture Duration is reported in seconds. Is Chunked is a binary feature which indicates if a lecture has multiple parts. The Lecture Type value is derived from the metadata. The possible values for this feature are described in Table 1.

Table 1 also gives insight into how diverse the VLE dataset is. There are different types of videos in the dataset, such as research presentations, scientific talks, dialogues and tutorials (e.g. vtt).

Abbr.  Description        Freq.    Abbr.  Description   Freq.
vbp    Best Paper         67       vdb    Debate        135
vdm    Demonstration      315      viv    Interview     121
vid    Introduction       75       vit    Invited Talk  609
vkn    Keynote            274      vl     Lecture       7125
vop    Opening            190      oth    Other         58
vpa    Panel              207      vps    Poster        162
vpr    Promotional Video  69       vtt    Tutorial      2142
Table 1: The 14 types of lectures in the VLE dataset with their abbreviation (Abbr.) and frequency (Freq.).

3.2 Labels

Multiple labels based on explicit and implicit feedback are provided with this dataset which will allow researchers to compare between different labels and also integrate multiple label types when modelling engagement. Three main types of quantification of engagement labels are presented.

3.2.1 Explicit Ratings

Mean Star Rating based on a rating scale of 1-5 for each lecture is provided. This value is accompanied by the number of ratings used to calculate the mean. The proposed dataset has 2,127 ratings (almost twice as many as Bulathwela et al. [6]). Missing ratings are labelled with -1.

3.2.2 Popularity

The total number of views, named View Count, for each video lecture as of February 1, 2021 is extracted from the metadata and provided with the dataset.

3.2.3 Watch Time/Engagement

The majority of learner engagement labels in the VLE dataset are based on watch time. We compute the Normalised Engagement Time (NET) of each user session and use it to calculate the Median of Normalised Engagement (MNET), as it has been proposed as the gold standard for engagement with educational materials in previous work [23]. We further compute the Average of Normalised Engagement (ANET). To have the MNET and ANET labels in the range [0, 1], we set the upper bound to 1 and derive Saturated MNET (SMNET) and Saturated ANET (SANET), which are included in the dataset.

The standard deviation of NET for each lecture (Std of Engagement) is reported, together with the Number of User Sessions used for calculating MNET and ANET. These measures allow understanding the stability of the published centres. The set of individual NET values is also published to allow future researchers to exploit the true distribution of values.
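The watch time labels above can be illustrated with a short sketch. It assumes that NET is a session's watch time divided by the lecture duration (so that values above 1 are possible, for example through re-watching) and uses the population standard deviation; both choices are assumptions, since the text does not spell them out.

```python
from statistics import mean, median, pstdev


def engagement_labels(watch_seconds, duration_seconds):
    """Aggregate per-session watch times of one lecture into engagement labels."""
    # Assumption: NET = watch time / lecture duration, one value per session.
    net = [w / duration_seconds for w in watch_seconds]
    mnet, anet = median(net), mean(net)
    return {
        "MNET": mnet,
        "ANET": anet,
        "SMNET": min(mnet, 1.0),  # saturate at the upper bound of 1
        "SANET": min(anet, 1.0),
        "std_engagement": pstdev(net),  # assumption: population std
        "num_sessions": len(net),
    }
```

For a 100 second lecture watched for 50, 100 and 300 seconds, the sketch gives MNET = 1.0 and ANET = 1.5, so SANET is saturated to 1.0 while SMNET is unchanged.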

3.3 Preserving Anonymity and Ethics

We only publish lectures with more than 5 views to preserve k-anonymity and avoid revealing learner identities [17]. A set of additional techniques is used to preserve lecturer anonymity in order to avoid unanticipated effects on a lecturer’s reputation from associating implicit learner engagement values with their content.

Rarely occurring Lecture Type values were grouped together to create the other category in Table 1. Similarly, the subject categories Life Sciences, Physics, Technology, Mathematics, Computer Science, Data Science and Computers are grouped into the stem category and the other subjects into the misc category. Freshness and Lecture Duration are rounded to the nearest 10 days and 10 seconds respectively. Gaussian white noise (10%) is added to the Title Word Count feature, which is then rounded to the nearest integer.
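The anonymisation steps above might look roughly like the sketch below. Reading "10%" as a noise standard deviation equal to 10% of the feature value is an assumption, as are the function names and the clipping of noisy counts at zero.

```python
import random

# Subjects mapped to the stem category, as listed in the text.
STEM_SUBJECTS = {
    "Life Sciences", "Physics", "Technology", "Mathematics",
    "Computer Science", "Data Science", "Computers",
}


def round_to_nearest(value, step=10):
    """Round to the nearest multiple of step (exact ties follow Python's round)."""
    return int(round(value / step) * step)


def noisy_title_word_count(count, rng, rel_sigma=0.10):
    """Add Gaussian noise and round to the nearest integer.

    Assumption: the noise standard deviation is rel_sigma * count.
    """
    return max(0, round(rng.gauss(count, rel_sigma * count)))


def group_subject(subject):
    """Collapse a subject into the stem or misc category."""
    return "stem" if subject in STEM_SUBJECTS else "misc"
```

Passing a seeded `random.Random` instance as `rng` keeps the perturbation reproducible.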

VLE dataset comes from a video lecture collection that mainly belongs to the Computer Science community, where there is a gender imbalance, both in audience and presenters. To enhance the neutrality of our findings and contributions, we have avoided using feature classes that could potentially reflect gender characteristics. For example, we have avoided using visual features (facial features, presenter emotions) and sound related features (pitch) that may actively or passively embed gender characteristics. Instead we have focused on features that mainly reflect informational content of the lectures. Furthermore, where video specific features are incorporated, we have used very generic features such as "speaker speed" that are unlikely to be correlated with characteristics such as gender or age.

3.4 Final Dataset

The final dataset includes lectures that were published between September 1, 1999 and December 31, 2020. The engagement labels are created from events of over 1.1 Million users logged between December 01, 2016 and February 01, 2020. The final dataset contains 11,548 lectures across 21 subjects (e.g. Computer Science, Philosophy, etc., with a majority from AI and Computer Science) that are grouped into STEM and Miscellaneous categories. The collection of videos spans various video lengths, with the duration distribution having two modes at approx. 2000s (33 mins) and 4000s (1hr), which align with typical lengths of research talks and presentations. The mean word count of the videos is 5347.9. On average, 93.9 learners per video are used when calculating engagement centres. The dataset, helper tools, example code snippets and various descriptive statistics related to the VLE dataset are available publicly.

3.5 Supported Tasks

This section introduces the reader to the tasks that the dataset could be used for. The main application areas of these tasks are i) quality assurance in educational video repositories and ii) understanding and predicting context-agnostic engagement in a web-based learning setting. We establish two main tasks, which we mainly focus on in this paper, that can be objectively addressed with the VLE dataset using a supervised learning approach. These are:

  • Task 1: Predicting context-agnostic (population-based) engagement of video lectures: The dataset provides a set of relevant features and labels to construct machine learning models to predict context-agnostic engagement in video lectures. The task can be treated as a regression problem.

  • Task 2: Ranking of video lectures based on engagement: Building predictive models that can rank videos based on their context-agnostic engagement could be useful in the setting of an educational recommendation system, including tackling the cold-start problem associated to new video lectures. The task can be treated as a ranking problem to predict the global/relative ranking of video lectures.

Other Tasks

Several auxiliary tasks can also be addressed with this dataset. It is suitable for, but not limited to, tasks such as i) understanding influential features for engagement prediction and ii) understanding the strengths and weaknesses of different implicit/explicit labels, both of which have been investigated in prior work with similar datasets [6, 33]. The video representations can be used with unsupervised approaches to understand meaningful hidden patterns that contrast clusters of videos (e.g. talks vs. lectures vs. tutorials). With the use of Wikipedia-based topics, we also see opportunities in deducing the structure of knowledge based on how topics co-occur within videos. Such investigations can be done on this dataset in isolation or fused with similar datasets where this task has been attempted before [47].

3.6 Evaluating Performance

We identify Root Mean Squared Error (RMSE) as a suitable metric for evaluating Task 1. Measuring RMSE against the original labels published with the datasets will allow different works to be compared fairly. With reference to Task 2, we identify Spearman’s Rank Order Correlation Coefficient (SROCC) as a suitable metric. SROCC is suitable for comparing between ranking models that create global rankings (e.g. point-wise rankers).

We use 5-fold cross validation to evaluate model performance with tasks 1 and 2. The folds are released together with the dataset to facilitate fair comparison and reproducibility. The five folds can be identified using the fold column in the dataset. 5-fold cross validation allows reporting the confidence intervals (1.96 × Standard Error) of the performance estimate, which we include in Table 2.

4 Baselines and Experiments

Through our experiments we seek answers to multiple research questions, which we detail below. Note that these research questions overlap only partially with the proposed supported tasks outlined in section 3.5, as the use purposes of the dataset go much beyond what we could explore in this paper. The main research questions of interest are:

  • RQ1: Does the newly constructed VLE dataset lead to training more performant prediction models?

  • RQ2: How does the larger quantity of training data affect predictive performance?

  • RQ3: Is the model useful for modelling engagement with Computer Science materials?

  • RQ4: Is this dataset useful for modelling engagement in E-Learning lectures and MOOC videos?

  • RQ5: Does context-agnostic engagement prediction help in the cold-start scenario?

Prior work by Bulathwela et al. [6] demonstrated that the Random Forest (RF) model obtains the best performance among linear and non-linear models on similar datasets. Therefore, we use the RF model to benchmark the new VLE dataset for Tasks 1 and 2 described earlier. In addition, a handful of Multi Layer Perceptron (MLP) architectures were also experimented with due to their success in engagement prediction with YouTube videos [16].

The code for extracting the proposed features is also published with this work in the dataset repository.

4.1 Labels and Features for Baseline Models


The SMNET label is used as the target variable for both tasks. Preliminary investigations indicated that the SMNET label follows a Log-Normal distribution, motivating us to use a log transformation on the SMNET values before training the models. Empirical results further confirmed that this step improves the final performance of the models. We undo this transformation for computing RMSE, while this transformation doesn’t affect SROCC.
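The handling of the log-transformed target can be sketched as follows. The snippet shows the transformation and its inverse and illustrates why a rank-based metric is unaffected: the exponential is strictly increasing, so the ordering of predictions is the same in both spaces. It assumes strictly positive labels, since the text does not say how zero-valued labels are treated, and the function names are illustrative.

```python
import math


def to_log_space(labels):
    """Log-transform labels before training (assumes labels > 0)."""
    return [math.log(v) for v in labels]


def from_log_space(log_values):
    """Undo the transformation before computing an error metric."""
    return [math.exp(v) for v in log_values]


def ranking(values):
    """Indices of the items ordered from lowest to highest value."""
    return sorted(range(len(values)), key=lambda i: values[i])
```

A model trained on `to_log_space(labels)` would have its predictions passed through `from_log_space` before the error against the original labels is measured.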

All the features outlined in the content-based and video-specific parts of section 3.1 are included in the baseline models. The models are trained with three different feature sets in an incremental fashion:

  1. Content-based: Features extracted from lecture metadata and the transcript-based textual features.

  2. + Wiki-based: Content-based + 2 Wikipedia-based features (Top 1 Most Authoritative Topic URL and Most Covered Topic URL).

  3. + Video-based: Content-based + Wikipedia-based + Video-specific features.

Due to the large number of topics in the Wikipedia-based feature groups, we restrict ourselves to the top 1 authoritative and covered topic features, which are encoded as binary categorical variables. Our initial attempts to encode these features in a reduced dimension space (using Singular Value Decomposition) led to deteriorated results, contrary to our expectations. Practitioners are nevertheless encouraged to try other encodings of the topic variables, as a suitable encoding is likely to have a positive impact on performance.
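One way to read "encoded as binary categorical variables" is a one-hot encoding of the top-1 topic, sketched below. The column name and the row layout are hypothetical, and the baselines may have used a different encoding utility.

```python
def one_hot_topics(rows, column):
    """Replace a categorical topic column with one 0/1 column per topic.

    rows: list of feature dicts; column: name of the topic column
    (a hypothetical layout for illustration).
    """
    vocab = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        out = {k: v for k, v in row.items() if k != column}
        for topic in vocab:
            out[f"{column}={topic}"] = 1 if row[column] == topic else 0
        encoded.append(out)
    return encoded, vocab
```

The number of generated columns equals the number of distinct topics, which is why restricting the encoding to the top-1 topic keeps the feature space manageable.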

4.2 Experiments

Addressing RQ1, both the RF and MLP models are trained with the proposed feature sets on the smaller 4k dataset [6] and the newly proposed VLE dataset. 5-fold cross validation is used in this experiment. This setup allows identifying i) how performance gains are achieved through adding each new group of features and ii) how performance gains are achieved through adding new observations. Follow-on experiments addressing RQ2 and RQ3 are run using folds 1-4 as training data and fold 5 of the dataset as testing data. We experiment by using varying proportions of training data to train the model. When selecting training data, random sampling is used to keep the diversity of the lectures similar to the full dataset. All the models trained using different quantities of training data are then evaluated on the same held-out test set.
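The data preparation for the training-size experiment could be sketched as below: fold 5 is held out, and random subsets of folds 1-4 are drawn at several proportions. The `fold` key mirrors the fold column mentioned in the text, while the row layout, the function names, the proportions and the seed are illustrative assumptions.

```python
import random


def split_by_fold(rows, test_fold=5):
    """Use one fold as the held-out test set and the rest for training."""
    train = [r for r in rows if r["fold"] != test_fold]
    test = [r for r in rows if r["fold"] == test_fold]
    return train, test


def training_subsets(train, proportions, seed=0):
    """Randomly sample (without replacement) a subset per proportion."""
    rng = random.Random(seed)
    return {
        p: rng.sample(train, max(1, round(p * len(train))))
        for p in proportions
    }
```

A model would then be fitted on each subset and scored on the same test rows, so that differences in performance reflect training size only.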

To address RQ4, we first partition the entire dataset into two parts, i) tutorial videos (vtt in Table 1) and ii) all other videos, as test and train data respectively. However, tutorials presented in a research conference may significantly vary from e-learning videos geared for course learning. To address this mismatch, we further identify 1,035 videos (among the tutorials) that exclusively belong to the Open Course Ware Consortium (OCWC). OCWC contains university lectures that have been intended for course teaching via e-learning. These lectures are recorded using a variety of MOOC production techniques such as classroom lecture, talking head and PowerPoint presentation [23]. We define these videos as ocw lectures. We use the training data (all except tutorials) to train the engagement model and evaluate prediction performance on i) OpenCourseWare lectures (ocw), ii) all tutorials except OpenCourseWare (!ocw) and iii) all tutorials (vtt, the entire test data). The follow-on experiments (RQ2-4) are only done with the best performing model from Table 2 (RF model with the Content + Wiki + Video feature group) to reduce computational cost.

A different experiment was run to answer RQ5. We utilise TrueLearn Novel [8] (hereafter referred to as TrueLearn), a recently proposed educational recommender model that learns individualised models to predict engagement with video lectures. A key limitation of personalised models such as TrueLearn is that there is no information about the user at the beginning, leading to a user cold-start problem, which effectively means ill-informed engagement predictions at the start of the learner session. To address this, we build a hybrid recommendation system (hereafter referred to as hybrid TrueLearn) by combining the proposed context-agnostic engagement prediction model with the TrueLearn model. For simplicity's sake, hybrid TrueLearn uses "switching" [10], where the context-agnostic model makes the prediction for the first event of the user (where the personalisation model has no information to work with) and the system then switches to the TrueLearn model, which can exploit the user's watch history. The PEEK dataset [7], which includes more than 20,000 user sessions, is used for the experiment, and the context-agnostic model is trained using lectures that are not present in the PEEK test data. The predictive performance on the PEEK test data using TrueLearn (the baseline) and hybrid TrueLearn (where the first event is predicted using the context-agnostic model) is then measured using Accuracy, Precision, Recall and F1-Score. To evaluate whether the improvement in metrics is statistically significant, a learner-wise, one-tailed paired t-test is used.
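To make the learner-wise paired test concrete, the sketch below computes the paired t statistic from per-learner scores of the two models. It is a minimal illustration, not the authors' code: the function name and the example accuracies are hypothetical, and only the standard library is used, so the statistic and degrees of freedom are returned without a p-value.

```python
import math
import statistics


def paired_t_statistic(proposed, baseline):
    """Return (t, degrees_of_freedom) for per-learner paired differences."""
    if len(proposed) != len(baseline):
        raise ValueError("both models need exactly one score per learner")
    diffs = [p - b for p, b in zip(proposed, baseline)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(n)  # sample std / sqrt(n)
    return mean_diff / std_err, n - 1


# Hypothetical per-learner accuracies, for illustration only.
hybrid_acc = [0.6, 0.7, 0.5, 0.8]
baseline_acc = [0.5, 0.6, 0.5, 0.6]
t_stat, dof = paired_t_statistic(hybrid_acc, baseline_acc)
print(round(t_stat, 3), dof)
```

For the one-tailed alternative "the proposed model is better", the p-value is the upper-tail probability of a t distribution with `dof` degrees of freedom evaluated at `t_stat`; a statistics package or a t table is needed for that last step.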

5 Results and Discussion

| Feature set | RMSE (Task 1), 4k | RMSE (Task 1), 12k (Ours) | SROCC (Task 2), 4k | SROCC (Task 2), 12k (Ours) |
|---|---|---|---|---|
| Content-based | .1801 ± .006 | .1170 ± .006 | .6190 ± .011 | .7504 ± .013 |
| + Wiki-based | .1798 ± .007 | .1178 ± .006 | .6251 ± .014 | .7505 ± .013 |
| + Video-specific | .1728 ± .007 | .1098 ± .007 | .6758 ± .020 | .7832 ± .009 |

Table 2: RMSE and SROCC with confidence intervals for the engagement prediction (Task 1) and lecture ranking (Task 2) using the Random Forests model with both 4k [6] and 12k (Our VLE) datasets. Better performance per task is highlighted in bold.
Figure 4: Predictive performance for (i) engagement prediction and (ii) lecture ranking tasks with varying proportions of randomly sampled training data. The test set performance for the full test dataset (Blue) and for subsets of the test dataset that consist of CS lectures only (Orange) and Non-CS lectures only (Green) is also reported.
| Metric | ocw | !ocw | vtt | From Table 2 |
|---|---|---|---|---|
| RMSE (Task 1) | .0539 | .0404 | .0406 | .1098 |
| SROCC (Task 2) | .9485 | .9209 | .9223 | .7832 |

Table 3: Performance for OpenCourseWare (ocw), Non-OpenCourseWare (!ocw) tutorial and All tutorial (vtt) videos for engagement prediction and lecture ranking tasks. Better performance per task is highlighted in bold.
| Model | Acc. | Prec. | Rec. | F1 |
|---|---|---|---|---|
| TrueLearn (baseline) | 62.69 | 57.54 | 81.88 | 64.98 |
| TrueLearn (hybrid) | 63.51 | 57.91 | 79.13 | 64.39 |

Table 4: Average test set performance for Accuracy (Acc.), Precision (Prec.), Recall (Rec.) and F1-Score (F1). The more performant value is highlighted in bold. The metrics where the proposed model outperforms the baseline counterpart in the PEEK dataset (in a one-tailed paired t-test) are marked.
| Model | Acc. | Prec. | Rec. | F1 |
|---|---|---|---|---|
| TrueLearn (baseline) | 44.21 | 44.21 | 100.00 | 61.32 |
| TrueLearn (hybrid) | 56.09 | 50.32 | 53.58 | 51.90 |

Table 5: Test set performance for Accuracy (Acc.), Precision (Prec.), Recall (Rec.) and F1-Score (F1) for the first event of each learner. The more performant value is highlighted in bold. The metrics where the proposed model outperforms the baseline counterpart in the PEEK dataset (in a one-tailed paired t-test) are marked.

The performance metrics observed with the RF model on Tasks 1 and 2 (RQ1) are outlined in Table 2. Although we expected competitive results with MLP models, we failed to observe promising results. We believe this is due to the restrictive architecture space we used in our experiments. We encourage future researchers to experiment with a more complex and wider range of neural architectures that may lead to better results.

Figure 4 illustrates how the training data size impacts the i) RMSE and ii) SROCC (RQ2). It also demonstrates the performance difference between predicting engagement of Computer Science videos (CS) vs. Non-CS videos (RQ3). Finally, Table 3 presents the predictive performance of the model on e-learning type lectures and tutorials (RQ4).
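For readers less familiar with the two evaluation metrics, the following standard-library sketch shows how RMSE and SROCC (Spearman's rank-order correlation coefficient) can be computed for a set of predictions. The engagement values are made up for illustration, and the helper names are chosen here rather than taken from the dataset's helper tools.

```python
import math


def rmse(y_true, y_pred):
    """Root Mean Square Error between labels and predictions."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def srocc(y_true, y_pred):
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    rt, rp = average_ranks(y_true), average_ranks(y_pred)
    mt, mp = sum(rt) / len(rt), sum(rp) / len(rp)
    cov = sum((a - mt) * (b - mp) for a, b in zip(rt, rp))
    var_t = sum((a - mt) ** 2 for a in rt)
    var_p = sum((b - mp) ** 2 for b in rp)
    return cov / math.sqrt(var_t * var_p)


# Made-up engagement labels and predictions, for illustration only.
y_true = [0.1, 0.4, 0.3, 0.8]
y_pred = [0.2, 0.3, 0.5, 0.7]
print(round(rmse(y_true, y_pred), 4), round(srocc(y_true, y_pred), 4))
```

RMSE rewards predictions that are numerically close to the labels (Task 1), while SROCC only depends on whether the lectures are ordered correctly (Task 2), which is why the two can behave differently as training data grows.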

The overall performance results relating to the effect of combining the context-agnostic model with personalisation models to battle the cold-start problem (RQ5) are reported in Table 4. A magnified view of the performance of predicting the outcome of the first event of each user (where the context-agnostic predictor is supposed to help the personaliser) is reported in Table 5.
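The baseline row of Table 5 shows a recognisable pattern: recall is 100 while accuracy and precision are identical. A small sketch with hypothetical first-event labels shows that this is exactly what a predictor that always outputs "engaged" produces; the labels and function name below are illustrative assumptions, not data from PEEK.

```python
def binary_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall, f1) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1


# Hypothetical engagement labels for the first event of five learners.
first_event_labels = [1, 0, 0, 1, 0]
always_positive = [1] * len(first_event_labels)

acc, prec, rec, f1 = binary_metrics(first_event_labels, always_positive)
print(acc, prec, rec, round(f1, 4))
```

With an always-positive predictor there are no false negatives, so recall is 1, and both accuracy and precision reduce to the share of positive labels.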

5.1 Performance Gains and Causes (RQ1-2)

Results in Table 2 clearly show that the VLE dataset, which is roughly three times the size of the prior dataset [6], leads to significant performance gains in both engagement prediction and lecture ranking tasks. In the case where the full feature set is used with the RF model, RMSE on Task 1 drops from .1728 to .1098 (roughly 36%) while SROCC on Task 2 rises from .6758 to .7832 (roughly 16%). The labels (both explicit and implicit) themselves are more accurate as they are calculated using a larger user population. This means that many of the lectures that already existed in the smaller dataset [6] are likely to get improved engagement labels, as the new labels are calculated using more user sessions that interacted with the videos during a wider time period (including very recent sessions until February 2020).
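As a quick arithmetic check, the relative changes can be recomputed from the four-decimal Table 2 entries for the full feature set. The helper name `relative_change` is introduced here for illustration. Note that the result is sensitive to rounding: starting from the rounded values .17 and .10 gives an RMSE reduction of about 41%, whereas the four-decimal entries give about 36%.

```python
def relative_change(old, new):
    """Signed relative change from old to new, as a fraction of old."""
    return (new - old) / old


# Full feature set (Content + Wiki + Video), 4k vs. 12k, taken from Table 2.
rmse_change = relative_change(0.1728, 0.1098)
srocc_change = relative_change(0.6758, 0.7832)
rounded_rmse_change = relative_change(0.17, 0.10)

print(f"RMSE: {rmse_change:.1%}, SROCC: {srocc_change:.1%}")
print(f"RMSE from rounded values: {rounded_rmse_change:.1%}")
```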

Within the VLE dataset itself, using additional feature groups tends to lead to better performing models. Results for the VLE dataset in Table 2 demonstrate this trend, where a significant jump in performance is evidenced when incorporating modality-specific features (Video-specific features). However, the results also show that the cross-modal content-based features alone lead to substantial performance. This is a good indication that easy-to-compute, cross-modal features alone are sufficient to build a system that can predict context-agnostic engagement of videos. From a practical viewpoint, the proposed cross-modal features are computationally light (unlike complex deep models, e.g. vision models). Wikification, used in generating Wiki-based features, also operates at web-scale. Although the results show minute gains from adding the Wiki-based features, we believe that this is due to the simplicity of the Wiki features used in constructing the baselines, leaving much room for sophistication (e.g. exploiting the semantic relatedness of topics). The topics, coming from a humanly-intuitive taxonomy, leave room for building interpretable features.

Figure 4 confirms that the increase in training data leads to performance gains in both tasks. Figure 4(i) suggests that the Root Mean Square Error (RMSE) can be further improved with more training examples. However, Figure 4(ii) tells a rather different story, where the performance gain seems to saturate around the 60% mark. This is an indication that improving ranks of the test data gets significantly harder at around 5,500 training examples (roughly 60% of the 9,239 training set).

5.2 Relevance to AI/CS Education (RQ3)

The follow-up experimental results in Figure 4 demonstrate that this dataset allows achieving higher performance on CS-only lectures. A likely reason for this may be the higher diversity of lectures in the non-CS category, as it consists of significantly different subjects. The test set RMSE and SROCC achievable with CS lectures are reported in Figure 4. Figure 3 further shows that the majority of the lectures in the dataset contain concepts relating to Artificial Intelligence and Data Science (e.g. Machine Learning, Ontology, Semantic Web …). This indicates that the majority of the CS lectures in the dataset contain topics relating to AI, making this dataset highly suitable for training engagement prediction models for AI and CS education.

5.3 Relevance to E-Learning and MOOCs (RQ4)

Table 3 shows strong evidence that the models trained with the VLE dataset generalise really well for engagement modelling in e-learning type videos created for course teaching, even though the dataset contains many different video types (as per Table 1). The models trained are much better at engagement prediction and ranking of tutorial-like videos than general scientific talks. Having tested with lectures that have been recorded using different MOOC video production techniques, the high performance obtained on ocw lectures confirms that the VLE dataset can be highly effective in building context-agnostic engagement models for e-learning and MOOC systems.

5.4 Relevance to Addressing Cold-Start (RQ5)

Table 4 shows that simply incorporating the context-agnostic engagement prediction into the TrueLearn Novel algorithm (together forming the hybrid TrueLearn model) can lead to significant improvements in accuracy and precision. The same table also shows that the drop in overall F1 score can be attributed to the steep drop in Recall. Table 5 sheds more light on where this drop in recall occurs: the baseline TrueLearn model always predicts positive engagement for the first event of the user. As seen in Table 5, the recall of the baseline TrueLearn model is 1.0 while its accuracy and precision are identical, which reflects this fact. TrueLearn predicts positive in event 1 of each user because the model has no information to base the prediction on [8]. However, Table 5 shows that the scenario is different for the hybrid TrueLearn model, as it has additional information during the first event. Both accuracy and precision of predictions for the first event of the learner population improve significantly. The recall falls because the proposed context-agnostic model only captures a population-based prior, which may deviate from the individuality of the learners. However, it can be argued that making a prediction with additional information is better than predicting with no prior information. In the bigger picture, being able to make slightly more informed and varied predictions for the first event of learners based on lecture content features enables a significant improvement in the prediction accuracy and precision of TrueLearn in Table 4. It is also noteworthy that our experiment, for the sake of simplicity, uses a switching rule that could be improved significantly, e.g. by weighting the probabilities of both population-based and personalised models at the beginning of a user session (using weighting or stacking [10]), where the weight of population-based engagement decreases as we gather more information about the user.
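The difference between the switching rule used in the experiment and the weighting alternative suggested above can be sketched as follows. This is a hedged illustration only: it assumes both models output an engagement probability, and the geometric decay schedule, the 0.5 threshold and the function names are hypothetical choices made here, not part of the paper or of TrueLearn.

```python
def switching_predict(n_events_seen, population_prob, personal_prob, threshold=0.5):
    """Use the population-based model only for a learner's first event."""
    prob = population_prob if n_events_seen == 0 else personal_prob
    return prob >= threshold


def weighted_predict(n_events_seen, population_prob, personal_prob,
                     decay=0.5, threshold=0.5):
    """Blend both models; the population weight shrinks with every event seen."""
    w_population = decay ** n_events_seen  # 1.0 at the first event (assumed schedule)
    prob = w_population * population_prob + (1 - w_population) * personal_prob
    return prob >= threshold


# Hypothetical probabilities: the population prior is sceptical (0.3),
# the personalised model is mildly positive (0.6).
for n in range(3):
    print(n, switching_predict(n, 0.3, 0.6), weighted_predict(n, 0.3, 0.6))
```

At the first event both rules rely entirely on the population-based prior. They differ afterwards: switching trusts the personalised model immediately, whereas the weighted blend lets the population prior fade out gradually as the learner's history grows.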

6 Opportunities and Limitations

The VLE dataset brings plenty of opportunities to the research community that is interested in building context-agnostic engagement models using content-related features. It is significantly larger (15x larger than [23] and 3x larger than [6]) and more education focused than the contenders [44] focused on engagement with videos. The larger quantity of examples may even enable more complex model families (e.g. deep learning) to be used in this task, although our limited early experiments did not bear fruit. Another noteworthy opportunity is that this dataset can be extended periodically to create even larger and more accurate evaluations of the dataset.

The Wiki-based features open up limitless possibilities as many sophisticated feature sets can be built and experimented with. Due to the connectivity to Wikipedia, both its content and link structures can be exploited to invent meaningful, yet interpretable features. A further step can enable other data structures such as knowledge bases (e.g. Wikidata), category trees etc. to be used for feature creation.

Support for other tasks beyond the main tasks will allow this dataset to be useful in creating decision support systems that can help future content creators create engaging educational videos [27] while confirming prior findings [23, 3, 6] regarding drivers of learner engagement. Furthermore, the dataset also enables comparison with the findings from studies outside of education [44]. The Wikipedia features also provide grounds to explore topic-related tasks such as learning the structure of knowledge [47].

There also exist limitations in this dataset. The VLE dataset is largely comprised of Computer Science and Data Science materials (Figure 3) that are all delivered in English. While this is an opportunity for AI and Computer Science education, results in Figure 4 also show that this fact leads to comparatively less fruitful non-CS results. The dataset and its features are also not suitable for non-English video collections. Despite its size, the dataset still lacks variety of materials in the topical and lingual sense.

At first sight, the majority of the lectures in our dataset come from male presenters, potentially creating a significant gender imbalance in the dataset. As pointed out in section 3.3, we have taken some measures to restrict the feature set to what we believe to be more neutral features. However, since we do not have access to gender information in the data collected, it is impossible to test and guarantee that there is zero correlation between the proposed features in the VLE dataset and negative gender biases. Care should be taken when enhancing these features and there is room to do more rigorous tests to understand if any gender biases are present within the dataset.

Learner Engagement is a loaded concept with many facets. In relation to consuming videos, many behavioural actions such as pausing, rewinding and skipping can contribute to latent engagement with a video lecture [28]. Due to the technical limitations of the platform and privacy concerns, only watch time, view and mean ratings are included in this dataset. Although watch time has been used as a representative proxy for learner engagement with videos [23, 44, 16], we acknowledge that more informative measures may lead to more complete engagement signals.

7 Conclusions

Identifying the need for resources to push the frontiers of context-agnostic engagement prediction, which is a crucial part of addressing scalable quality assurance and the cold-start problem in educational recommenders, we release the VLE dataset. This dataset consists of around 12,000 videos with three groups of features, namely, i) content-based, ii) Wikipedia-based and iii) Video-specific features, accompanied as well by iv) multiple implicit and explicit engagement labels. We establish 2 main tasks, and identify multiple other tasks that can be tackled with this dataset. Addressing the 2 main tasks proposed, we benchmark the new dataset to show significant prediction gains over a similar, yet smaller prior dataset. In follow-on experiments, we investigate how the number of training examples relates to performance gains while also demonstrating the suitability of this dataset to build models for AI/CS related video collections. We further validate that the dataset can be used to model engagement with e-learning type video lectures to show its relevance to educational recommendation systems in the context of MOOCs. With the use of a simple experiment, it is also demonstrated that such a model can be incorporated with a personalised (contextual) engagement prediction model to significantly improve the predictions.

Future Directions

We see several lines of future work addressing current limitations of the dataset. Adding diverse examples from different subject areas is our top priority. Experimenting further with more sophisticated Wiki-based features (e.g. by exploiting the Wikipedia semantic graph [35]) is a promising way forward. The possibility of including more learner engagement related signals (e.g. pauses, replays, skips, etc.) will be explored in the subsequent version of the dataset, without compromising learner privacy. As more understanding of engagement with other modalities (such as PDFs and e-Books) is gained, it is possible to add more learning resources from diverse modalities to widen the horizons of the dataset and improve understanding of engagement with different modalities of educational material.

This work only devises a simple mechanism to combine the context-agnostic model with a personalisation model for the sake of proving that the proposal has genuine use-cases. The experiment barely scratches the surface of how a context-agnostic engagement prediction model can be incorporated in an educational recommendation system. There are a variety of alternative and more sophisticated approaches that can potentially bring larger performance gains, which we will explore in future work.

8 Acknowledgments

This research was partially conducted as part of the X5GON project funded from the EU’s Horizon 2020 research programme grant No 761758. This work is also supported by the European Commission funded project "Humane AI: Toward AI Systems That Augment and Empower Humans by Understanding Us, our Society and the World Around Us" (grant 820437) and the EPSRC Fellowship titled "Task Based Information Retrieval" (grant EP/P024289/1).

References


  • [1] M. Bendersky, W. B. Croft, and Y. Diao. Quality-biased ranking of web documents. In Proc. of ACM Int. Conf. on Web Search and Data Mining, 2011.
  • [2] F. Bonafini, C. Chae, E. Park, and K. Jablokow. How much does student engagement with videos and forums in a mooc affect their achievement? Online Learning Journal, 21(4), 2017.
  • [3] C. J. Brame. Effective educational videos: Principles and guidelines for maximizing student learning from video content. CBE—Life Sciences Education, 15(4), 2016.
  • [4] J. Brank, G. Leban, and M. Grobelnik. Annotating documents with relevant wikipedia concepts. In Proc. of Slovenian KDD Conf. on Data Mining and Data Warehouses (SiKDD), 2017.
  • [5] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In Proc. of Int. Conf. on World Wide Web, 1998.
  • [6] S. Bulathwela, M. Perez-Ortiz, A. Lipani, E. Yilmaz, and J. Shawe-Taylor. Predicting engagement in video lectures. In Proc. of Int. Conf. on Educational Data Mining, EDM ’20, 2020.
  • [7] S. Bulathwela, M. Perez-Ortiz, E. Novak, E. Yilmaz, and J. Shawe-Taylor. Peek: A large dataset of learner engagement with educational videos, 2021.
  • [8] S. Bulathwela, M. Perez-Ortiz, E. Yilmaz, and J. Shawe-Taylor. Truelearn: A family of bayesian algorithms to match lifelong learners to open educational resources. In AAAI Conference on Artificial Intelligence, AAAI ’20, 2020.
  • [9] S. Bulathwela, E. Yilmaz, and J. Shawe-Taylor. Towards Automatic, Scalable Quality Assurance in Open Education. In Workshop on AI and the United Nations SDGs at Int. Joint Conf. on Artificial Intelligence, 2019.
  • [10] R. Burke. Hybrid recommender systems: Survey and experiments. User modeling and user-adapted interaction, 12(4):331–370, 2002.
  • [11] G. Chen, J. Yang, C. Hauff, and G. Houben. Learningq: A large-scale dataset for educational question generation. In Proceedings of the Twelfth International Conference on Web and Social Media, ICWSM 2018, Stanford, California, USA, June 25-28, 2018, pages 481–490, 2018.
  • [12] Y. Choi, Y. Lee, D. Shin, J. Cho, S. Park, S. Lee, J. Baek, C. Bae, B. Kim, and J. Heo. Ednet: A large-scale hierarchical dataset in education. In International Conference on Artificial Intelligence in Education, pages 69–73. Springer, 2020.
  • [13] K. Clements and J. Pawlowski. User-oriented quality for oer: understanding teachers’ views on re-use, quality, and trust. Journal of Computer Assisted Learning, 28(1), 2012.
  • [14] K. Collins-Thompson, P. N. Bennett, R. W. White, S. de la Chica, and D. Sontag. Personalizing web search results by reading level. In Proc. of Conf. on Information and Knowledge Management, 2011.
  • [15] R. Cortinovis, A. Mikroyannidis, J. Domingue, P. Mulholland, and R. Farrow. Supporting the discoverability of open educational resources. Education and Information Technologies, 24(5):3129–3161, 2019.
  • [16] P. Covington, J. Adams, and E. Sargin. Deep neural networks for youtube recommendations. In Proc. of ACM Conf. on Recommender Systems, 2016.
  • [17] N. Craswell, D. Campos, B. Mitra, E. Yilmaz, and B. Billerbeck. Orcas: 20 million clicked query-document pairs for analyzing search. In Proceedings of the 29th ACM International Conference on Information and Knowledge Management, CIKM ’20, page 2983–2989, New York, NY, USA, 2020. Association for Computing Machinery.
  • [18] D. H. Dalip, M. A. Gonçalves, M. Cristo, and P. Calado. A general multiview framework for assessing the quality of collaboratively created content on web 2.0. Journal of the Association for Information Science and Technology, 2017.
  • [19] D. H. Dalip, M. A. Gonçalves, M. Cristo, and P. Calado. Automatic assessment of document quality in web collaborative digital libraries. Journal of Data and Information Quality, 2(3), Dec. 2011.
  • [20] D. Davis, G. Chen, C. Hauff, and G. Houben. Activating learning at scale: A review of innovations in online learning strategies. Comput. Educ., 125:327–344, 2018.
  • [21] D. Davis, D. Seaton, C. Hauff, and G. Houben. Toward large-scale learning design: categorizing course designs in service of supporting learning outcomes. In Proceedings of the Fifth Annual ACM Conference on Learning at Scale, London, UK, June 26-28, 2018, pages 4:1–4:10, 2018.
  • [22] M. A. Dewan, M. Murshed, and F. Lin. Engagement detection in online learning: a review. Smart Learning Environments, 6(1):1, 2019.
  • [23] P. J. Guo, J. Kim, and R. Rubin. How video production affects student engagement: An empirical study of mooc videos. In Proc. of the First ACM Conf. on Learning @ Scale, 2014.
  • [24] M. Horta Ribeiro and R. West. Youniverse: Large-scale channel and video metadata from english-speaking youtube. Proceedings of the International AAAI Conference on Web and Social Media, 15(1), 2021.
  • [25] A. Kaur, A. Mustafa, L. Mehta, and A. Dhall. Prediction and localization of student engagement in the wild. In 2018 Digital Image Computing: Techniques and Applications (DICTA), pages 1–8. IEEE, 2018.
  • [26] S. Kim, W. Kim, Y. Jang, S. Choi, H. Jung, and H. Kim. Student knowledge prediction for teacher-student interaction. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 15560–15568, 2021.
  • [27] M. Kurdi, N. Albadi, and S. Mishra. “think before you upload”: an in-depth analysis of unavailable videos on youtube. Social Network Analysis and Mining, 11(1):1–21, 2021.
  • [28] A. S. Lan, C. G. Brinton, T.-Y. Yang, and M. Chiang. Behavior-based latent variable model for learner engagement. In Proc. of Int. Conf. on Educational Data Mining, 2017.
  • [29] Z. MacHardy and Z. A. Pardos. Evaluating the relevance of educational videos using bkt and big data. In Proc. of Int. Conf. on Educational Data Mining, 2015.
  • [30] M. Mendicino, L. Razzaq, and N. T. Heffernan. A comparison of traditional homework to computer-supported homework. Journal of Research on Technology in Education, 41(3):331–359, 2009.
  • [31] A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly. Detecting spam web pages through content analysis. In Proc. of Int. Conf. on World Wide Web, 2006.
  • [32] G. Penha and C. Hauff. Curriculum learning strategies for IR. In Advances in Information Retrieval - 42nd European Conference on IR Research, ECIR 2020, Lisbon, Portugal, April 14-17, 2020, Proceedings, Part I, 2020.
  • [33] M. Perez-Ortiz, A. Mikhailiuk, E. Zerman, V. Hulusic, G. Valenzise, and R. K. Mantiuk. From pairwise comparisons and rating to a unified quality scale. IEEE Transactions on Image Processing, 29:1139–1151, 2019.
  • [34] I. Persing and V. Ng. Modeling argument strength in student essays. In Association for Computational Linguistics, 2015.
  • [35] M. Ponza, P. Ferragina, and S. Chakrabarti. On computing entity relatedness in wikipedia, with applications. Knowledge-Based Systems, 188:105051, 2020.
  • [36] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. Squad: 100, 000+ questions for machine comprehension of text. In EMNLP, 2016.
  • [37] A. Ramesh, D. Goldwasser, B. Huang, H. Daume III, and L. Getoor. Learning latent engagement patterns of students in online courses. In Proc. of AAAI Conference on Artificial Intelligence, 2014.
  • [38] J. Shi, C. Otto, A. Hoppe, P. Holtz, and R. Ewerth. Investigating correlations of automatically extracted multimodal features and lecture video quality. In Proceedings of the 1st International Workshop on Search as Learning with Multimedia Information, SALMM ’19, page 11–19, New York, NY, USA, 2019. Association for Computing Machinery.
  • [39] R. Syed and K. Collins-Thompson. Optimizing search results for human learning goals. Inf. Retr. J., 20(5):506–523, 2017.
  • [40] R. Syed and K. Collins-Thompson. Retrieval algorithms optimized for human learning. In Proc. of Int. Conf. on Research and Development in Information Retrieval, 2017.
  • [41] K. Taghipour and H. T. Ng. A neural approach to automated essay scoring. In Proc. of Conf. on Empirical Methods in Natural Language Processing, 2016.
  • [42] UNESCO. Open educational resources (OER), 2021. Accessed: 2021-04-01.
  • [43] M. Warncke-Wang, D. Cosley, and J. Riedl. Tell me more: An actionable quality model for wikipedia. In Proc. of Int. Symposium on Open Collaboration, 2013.
  • [44] S. Wu, M. Rizoiu, and L. Xie. Beyond views: Measuring and predicting engagement in online videos. In Proc. of the Twelfth Int. Conf. on Web and Social Media, 2018.
  • [45] F. Xu, L. Wu, K. P. Thai, C. Hsu, W. Wang, and R. Tong. MUTLA: A large-scale dataset for multimodal teaching and learning analytics. CoRR, abs/1910.06078, 2019.
  • [46] H. Yannakoudakis, T. Briscoe, and B. Medlock. A new dataset and method for automatically grading esol texts. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT ’11, pages 180–189, Stroudsburg, PA, USA, 2011. Association for Computational Linguistics.
  • [47] J. Yu, G. Luo, T. Xiao, Q. Zhong, Y. Wang, W. Feng, J. Luo, C. Wang, L. Hou, J. Li, et al. Mooccube: a large-scale data repository for nlp applications in moocs. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3135–3142, 2020.