Augmenting classroom teaching with technology can support students with personalised educational needs, while mitigating limited teaching resources. Educational pedagogical agents aim to deliver personalised learning interactions with students. Additionally, pedagogical agents have the flexibility to be deployed in any setting, which is beneficial given the demands and challenges of remote learning exposed in recent times.
Within this area, both virtual and embodied agents have been deployed to engage students on a range of subjects, occupying roles as both the tutor and the novice. When pedagogical agents act in the role of a novice, or learner, the student takes on the role of a teacher, communicating material to the agent for the benefit of the student’s own learning. An illustration of the roles in this interaction is given in Figure 1. This framing aims to elicit the Protégé Effect, a pedagogical phenomenon in which the student is likely to invest more in learning the material when it is for the benefit of the agent. Students synthesise information better, and adapt their teaching based on the tutee’s performance, which may lead to positive cognitive outcomes. Allowing the agent to fill the role of pupil may also act in the agent’s own favour: its actions may not always be perfect, and any errors may be more easily forgiven if it is not expected to be an expert.
Teachable agents rarely engage in conversations with their student teachers, while intelligent tutoring systems employ this style of interaction more frequently in order to emulate tutoring dialogues [13, 32]. When natural language is not the primary mode of communication, teachable agents may instead use buttons , or a concept map . When learning is considered along the continuum of active (doing something), constructive (producing something) to interactive (exchanging information with someone), students have been shown to experience greater learning gains when interacting with material through discourse with a peer or tutor . Allowing students to interact with a teachable agent in this way may provide the same benefits to learning. Using natural language to synthesise and communicate new information also draws on the benefits of paraphrasing as a comprehension strategy . In pedagogical research, paraphrasing has been shown to encourage students to connect new material with prior knowledge, and establish retrieval cues .
The Curiosity Notebook  is an example of a technology that supports conversational teaching interactions with a teachable agent, either virtually, or via integration with a robot. The Curiosity Notebook is a learning by teaching web application, in which students teach an agent about a classification task. In the current implementation, there is limited communication through natural language to direct the teaching interaction. Users can communicate to the teachable agent by clicking on one of seven teaching buttons, after which they are prompted to teach material through a series of agent-directed questions. To these questions, users sometimes type short, free-form answers, but largely are prompted to select sentences from the source material.
This study builds upon the prior work conducted with the Curiosity Notebook. In this work, we first investigate the effect of teaching modality on learning outcomes and engagement, during an interaction with a virtual teachable agent. In particular, the study investigates the effects of rephrasing by comparing two teaching modalities: (1) by selecting full sentences from source material in order to teach the agent information , against (2) the student typing their teaching utterances into a chat window, with encouragement to put the source material into their own words. Our results show that teaching modality influences learning outcomes and engagement, and that the amount of rephrasing effort correlates to learning gains.
II Related Work
There has been significant development of educational technologies that aim to address the need for widely accessible, personalised learning tools for the classroom. Pedagogical agents are one such technology: virtual or embodied characters designed to help students learn material. Such agents can occupy a variety of roles, for example a tutor [13, 23], a peer learner , or a novice [3, 5, 22]. The following sections provide an overview of educational teachable agents, research relevant to teaching via natural language communication, an introduction to the Curiosity Notebook , the technology upon which this research is based, and an overview of the measures related to this study.
II-A Educational Teachable Agents
Much of the initial research in the field of educational technology was focused on the development of Intelligent Tutoring Systems (ITS) . These systems were designed to provide instruction similar to that of a knowledgeable human tutor . Early approaches were limited by not including the student as an active participant in the learning interaction , though later work has remedied this. The platform AutoTutor  supports conversations between a virtual tutor and student, where the virtual tutor is capable of engaging in human-inspired tutoring dialogue. The platform has been shown to produce learning gains across a variety of domains .
More recently, pedagogical agents have been developed which occupy different roles than those in ITS, such as where the agent acts as a learner, or novice. In such interactions, the student takes on the role of a teacher, engaging in the learning-by-teaching paradigm, a recognised pedagogical tool . Learning by teaching is beneficial to students as it elicits the Protégé Effect, where a student is likely to invest more in learning the material when it is for the benefit of someone else . Studies have shown that teaching someone else promotes more organised cognitive structures, compared to learning for oneself . Additionally, it has been shown that teaching requires the student to reflect upon their teaching based on the performance of the tutee, which may lead to positive cognitive outcomes . The presence of the Protégé effect has been confirmed in several studies, with positive cognitive  and meta-cognitive  outcomes.
Applications of the learning-by-teaching approach include SimStudent, a tool to help students gain skills in mathematical problem solving. In a game-like interaction, a virtual agent performs steps to solve a mathematical problem, and the student provides hints and corrections to help the agent arrive at the correct answer. Another example is Betty’s Brain, in which students teach Betty, a virtual learner, about causal relationships in science through the manipulation of concept maps in a shared visual interface. This approach has been shown to improve motivation and learning gains.
These teachable agents share a limited use of natural language as the mode of teaching, relying largely on interaction with the computer interface. Research shows that when human-computer interfaces are consistent with social conventions in daily life, this leads to a more engaging and satisfying user experience . This motivates the use of tutoring dialogues in technology such as AutoTutor, as this is the mode of interaction between human tutors and students [13, 26]. There are also benefits to learning outcomes when students engage with material through discourse and argumentation with a peer or tutor . Technologies such as AutoTutor support complex dialogue for a virtual tutor, however there is a lack of such dialogue systems for teachable agents.
II-B Teaching via Natural Language
Interacting with a teachable agent using natural language not only elicits the Protégé effect, and its attendant benefits, but also engages the student-teacher in paraphrasing, which is recognised as an effective comprehension strategy in the pedagogical research . The act of reading and paraphrasing content helps to establish retrieval cues, and encourages the student to connect more deeply to the material .
Natural Language Processing (NLP) is a necessary component for a teachable agent taught via natural language conversation. NLP uses computational techniques in order to learn, understand, and produce human language content [14, 31]. A review of NLP approaches groups them into techniques concerned with syntax, semantics, and pragmatics, which respectively are concerned with grammar, word meaning, and word meaning with context. Dialog systems can use statistical NLP approaches like intent classification and slot filling, for example Snips, which uses logistic regression to train intent classifiers, and several linear-chain conditional random fields for slot extraction. Alternatively, approaches based on lexical semantics such as Latent Semantic Analysis can be used, and can be seen in dialog systems such as the previously mentioned AutoTutor [21, 13].
II-C Curiosity Notebook
The Curiosity Notebook  is one such technology that has the potential to support an interaction with a teachable robot, through natural language, and it forms the basis of this research. The Curiosity Notebook is a highly configurable, learning by teaching web application, where students teach an agent about different classification tasks. It supports flexible agent embodiment—for example a virtual agent , or a physical robot —the configuration of agent characteristics—for example its style of humour —and also facilitates group based teaching . In the current implementation, natural language input is only supported for short, defined inputs, such as what topic to teach or responses to well-defined questions that are easy to parse (e.g. what kind of rock is Granite?). Teaching the agent about the features of these items occurs through the selection of sentences within source articles embedded in the interface. The research described in this paper extends upon this implementation by adding support for natural language communication throughout the teaching interaction.
II-D Student Engagement and Learning Outcomes
Education technologies are typically evaluated along two main metrics: the efficacy of the tool in teaching students, and the students’ interaction with the system . These metrics consider, respectively, the knowledge gain or learning outcomes, and the functionality of the technology (independently of learning outcomes). A review of educational technologies found that almost 78.6% of included studies measured learning . Additionally, almost 61.6% measured affective elements, such as perceptions, engagement and attitudes and beliefs . Educators are particularly concerned with educational technologies increasing student engagement . Fredricks et al.  conceptualise student engagement in three dimensions: behavioural, which considers observable behaviours related directly to the learning process, affective, the emotional response to the educational experience, and cognitive, the expenditure of energy related to comprehension and learning.
Teachable agents have been shown to improve learning outcomes over tutoring agents, through the Protégé effect [8, 3, 25]. They have been used in applications where students manipulate interface elements such as concept maps , or provide hints to a virtual student , rarely using natural language as a teaching modality. Intelligent tutoring systems such as AutoTutor  have successfully employed natural language in tutoring conversations, though natural language dialogue systems for teachable agents remain an under researched area.
The Curiosity Notebook  supports teaching conversations between a student and teachable agent, however its natural language capabilities to date are limited to very short inputs. This work builds upon this technology to examine the effects of teaching modality on learning outcomes and engagement.
The aim of this study was to investigate the effect of teaching modality within an interaction with a virtual teachable agent. The study compared two methods of providing information to the teachable agent: (1) by selecting full sentences from source material in order to teach the agent information, as implemented in the original implementation of the Curiosity Notebook , against (2) the student paraphrasing the source material by typing their teaching utterances into a chat window, with encouragement to put the source material into their own words. In the first condition, the agent was named Alpha, and the condition will be described as the sentence selection condition; in the second condition the agent was named Gamma, and the condition will be described as the text input condition.
The study aimed to answer how the interaction modality affects two main metrics: firstly, the student-teacher’s learning outcomes, and secondly, their engagement in the teaching task. As discussed in the related work, student engagement can be considered along three dimensions: behavioural, affective, and cognitive. It is expected that the intervention of teaching modality would most significantly affect aspects of behavioural engagement, and also the participants’ perception of the teaching task.
The implementation required the development or revision of the following components: the Curiosity Notebook interface to support natural language input, a natural language model to parse user input utterances, and agent utterance generation in order to drive the interaction and reflect the different teaching modalities.
III-A Curiosity Notebook Interface
The Curiosity Notebook is a learning-by-teaching platform built as an Angular web application, with Python Flask on the backend . The teaching activity is a classification task, in this experiment on rocks and minerals. In this study, we used a simplified version of the Curiosity Notebook; the interface can be seen in figure 2. There are three rock categories, with two examples within each category, for a total of six rocks for each test condition, and twelve in total. Each rock is accompanied by a short article, written in simple language, and a picture. Contained within the article is information relating to a total of 30 rock features.
Three teaching actions are available: ‘Describe’, in which the user teaches a feature of the rock, ‘Explain’, in which the user explains why a feature is present in the rock, and ‘Compare’, in which the user compares features that are similar, or different, between two rocks. Users can quiz the agent on the content it has learned, and correct information it has mis-learned. A button can be clicked to show the agent’s “notebook” which shows what the agent has learned.
III-B Teaching via Sentence Selection
While teaching Alpha in the sentence selection condition, users communicate in one of two ways. The user will type in the chatbox using natural language when prompted for simple information such as which rock they would like to teach. The information in these utterances is recognised using a simple string match for known entities e.g. recognising ‘igneous’ in the input ‘this is an igneous rock.’
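The string matching described above can be illustrated with a minimal sketch. The entity list and the function name `find_entities` are hypothetical, since the actual vocabulary used by the Curiosity Notebook is not listed here, and a case-insensitive substring test is only one plausible reading of "simple string match".

```python
# Hypothetical entity vocabulary; the real list used by the platform is not given.
KNOWN_ENTITIES = ["igneous", "sedimentary", "metamorphic", "granite", "slate"]


def find_entities(utterance, known=KNOWN_ENTITIES):
    """Return the known entities that occur in the utterance (case-insensitive)."""
    text = utterance.lower()
    return [entity for entity in known if entity in text]


print(find_entities("this is an igneous rock."))  # ['igneous']
```

A plain substring test will also match inside longer words (for example "slate" inside "translate"), so a production matcher would likely tokenise the input first.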
When the user is required to select a sentence to teach the agent new information, the chatbox is greyed out, and when the student hovers their mouse over the sentences in the source material, they animate with a yellow highlight effect, and are clickable in order to communicate the sentence to the agent.
If a sentence is about the feature of an object (e.g., colour of Slate), that sentence will be mapped to a set of feature IDs. When a sentence is selected, the features are added to the agent’s database, and each appears as a ‘note’ in the agent’s notebook, which keeps track of what it has learned. In order to avoid Alpha being a perfect learner, and therefore seeming cleverer than Gamma, an artificial learning error was included in the sentence selection condition such that 20% of taught sentences resulted in an error response that prompted the student to select something different.
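A minimal sketch of this step follows, assuming the 20% error is applied as an independent random draw per selected sentence (the text gives only the overall rate). The names `teach_sentence`, `sentence_to_features`, and `notebook` are illustrative and not taken from the platform's code.

```python
import random

ERROR_RATE = 0.2  # from the text: 20% of taught sentences produce an error response


def teach_sentence(sentence_id, sentence_to_features, notebook, rng=random):
    """Add the features mapped to a selected sentence to the agent's notebook.

    Returns False when the artificial learning error fires, in which case
    nothing is learned and the student is asked to select something different.
    """
    if rng.random() < ERROR_RATE:
        return False
    for feature_id in sentence_to_features.get(sentence_id, []):
        notebook.add(feature_id)
    return True


# Hypothetical mapping from a sentence to the feature IDs it contains.
mapping = {"slate-s3": [4, 9]}
notebook = set()
learned = teach_sentence("slate-s3", mapping, notebook)
```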
This version of the interface was used as the baseline condition in the experiment.
III-C Teaching via Text Input
While teaching Gamma, all communication occurs via the user typing in the chatbox, and sentences in the articles are not clickable at any time.
User inputs are parsed using a natural language model built using the Snips Natural Language Understanding python library . The library supports parsing input sentences written in natural language and extracting structured information such as intents and slot values.
The natural language model used in the text input condition was designed in such a way that input utterances would be mapped onto the 30 known features already in the database. A combination of intents and slot values was created to match inputs to the known database features. These were named in alignment with the feature ID they corresponded to, i.e. the slot value for ‘large crystals’ was named ‘2’ to match its feature ID, while the intent ‘why the rock has large crystals’ was named ‘12’ to match its feature ID.
For each input sentence, Snips extracts all matched intents, and if the intent has slots defined, the slot values as well, ordered by confidence in the match. The array of matched intents was looped over and those with a confidence score of 20% or greater were included as features to add to Gamma’s notebook.
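The thresholding step can be sketched as below. To avoid guessing at the parser's output format, the sketch works on a hypothetical, already-flattened list of matches, each carrying a `feature_id` and a `confidence`; these key names and the example values are assumptions, not the Snips data structure.

```python
CONFIDENCE_THRESHOLD = 0.2  # from the text: matches at 20% confidence or above are kept


def select_features(matches, threshold=CONFIDENCE_THRESHOLD):
    """Keep the feature IDs of all matches whose confidence meets the threshold."""
    return [m["feature_id"] for m in matches if m["confidence"] >= threshold]


# Hypothetical parser output, ordered by confidence as described above.
matches = [
    {"feature_id": "12", "confidence": 0.61},
    {"feature_id": "2", "confidence": 0.35},
    {"feature_id": "7", "confidence": 0.08},
]
print(select_features(matches))  # ['12', '2']
```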
The language model was developed by hand, using the source material as a guide to generate micro and macro variations of the content within them. Input utterances obtained during the pilot phase of this experiment were used to supplement the model, which then remained the same for all participants in this experiment.
III-D Mixed-Initiative Interaction
A mixed-initiative capability was added in this experiment. In the published version of the Curiosity Notebook, all teaching actions are initiated by the user. Once an action is initiated, the agent will then prompt the user on which information is provided, and in what order. When the teaching action is over, the agent defers to the user by saying, for example, “You can now select a new button to keep teaching me.”
The mixed-initiative capabilities were added for both conditions in this experiment, such that when a teaching action was completed, 75% of the time the user was prompted to select the next teaching action, as in the published version of the notebook, and for the remaining 25% of the time, the agent chose for itself, and immediately began the teaching action dialogue. The agent chose between the three teaching actions, describe, explain, and compare, along a probability distribution of [0.5, 0.3, 0.2]. Independently of who initiated, the agent and user each control the decision of which rock to teach, or be taught, 50% of the time. If fewer than 50% of all rocks were known to the agent (i.e. something has been taught about them), the agent would choose an unknown rock, otherwise a rock would be chosen at random, which may or may not have been unknown. Examples of the dialogue in these initiation modes are provided in figure 3.
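The sampling rules above can be sketched with Python's standard `random` module. The function names and the representation of rocks as strings are illustrative assumptions; only the probabilities come from the text.

```python
import random

ACTIONS = ["describe", "explain", "compare"]
ACTION_WEIGHTS = [0.5, 0.3, 0.2]  # distribution given in the text
AGENT_INITIATIVE_PROB = 0.25      # agent starts the next action 25% of the time
AGENT_PICKS_ROCK_PROB = 0.5       # agent and user each choose the rock half the time


def next_initiator(rng=random):
    """Decide who initiates the next teaching action."""
    return "agent" if rng.random() < AGENT_INITIATIVE_PROB else "user"


def agent_choose_action(rng=random):
    """Sample a teaching action from the stated distribution."""
    return rng.choices(ACTIONS, weights=ACTION_WEIGHTS, k=1)[0]


def rock_chooser(rng=random):
    """Decide who picks the rock, independently of who initiated."""
    return "agent" if rng.random() < AGENT_PICKS_ROCK_PROB else "user"


def agent_choose_rock(all_rocks, known_rocks, rng=random):
    """Prefer an unknown rock while fewer than half of the rocks are known."""
    unknown = [rock for rock in all_rocks if rock not in known_rocks]
    if len(known_rocks) < 0.5 * len(all_rocks) and unknown:
        return rng.choice(unknown)
    return rng.choice(all_rocks)
```

Passing a seeded `random.Random` instance as `rng` makes a session reproducible, which is convenient when checking the dialogue flow.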
III-E Agent Utterances
The agent utterances between the two conditions differed when teaching actions were requested. When teaching Alpha, the user would be asked to ‘select a sentence’ to teach, and when teaching Gamma, they would be asked to ‘read the material and type’ their response. An example of this difference is shown in figure 4. Other than this, the utterances were identical between conditions. Each utterance type included several variations in wording, which were chosen at random.
In the event of a learning error, the agent would tell the participant that they didn’t understand the input, and they were prompted to confirm whether they would like to try again. After three failed attempts, the agent would automatically move on.
IV Teaching Modality User Study
The effect of teaching modality on learning outcomes and engagement was examined by allowing participants to interact with the two different agents, Alpha and Gamma.
Participants were recruited from Monash University and personal contacts of the researcher. A total of 46 participants were recruited (23 female), ranging from 20 to 50 years old. Participants were provided with the Explanatory Statement ahead of a mutually agreed-upon time slot. The true nature of the experiment was not disclosed to the participants until the conclusion of the experiment to avoid influencing the participants’ behaviours. This user study was reviewed and received ethics approval through the Monash University Human Research Ethics Committee (Project ID: 26087).
IV-B Experimental Procedure
The experiments were conducted as a mix of in-person sessions, and virtual sessions held via Zoom. In both environments, the researcher was present (either on call, or in the room), but did not take part in the teaching interaction, and was only available for technical assistance.
Each participant taught two different agents, Alpha and Gamma. Participants taught Alpha via sentence selection, and taught Gamma by typed text input. They were told that they would be helping Alpha and Gamma prepare for a test, and were randomly assigned the order in which they would teach the agents. Participants received an introductory explanation of the interface, and of their task, then completed a demographic survey, and a pre-test on rock classification. The test contained two types of questions: those that required users to select all relevant features of a specific rock, and multiple choice questions about rock formation in general. Participants could spend up to 20 minutes with each agent, and could conduct the teaching interaction as they chose. After interacting with each agent, they completed a post-test on only the six rocks they had just seen, and also a user experience survey. Finally, a survey directly comparing the two agents, and a post-experiment interview, were completed.
The hypotheses of this experiment were:
H1: Teaching via text input would result in greater learning gains
H2: Teaching via text input would result in higher engagement from the participant
V User Study Results
V-A Learning Outcomes
Learning outcomes were measured using the difference between test scores on the pre- and post-test knowledge assessments, per condition. As the pre-test contained questions on all twelve rocks, these scores were split according to the scores achieved for the rocks seen in each condition. The difference in test scores per condition was chosen as a measure of learning outcomes in order to reduce the effects of participants’ level of prior knowledge.
Of the 46 participants, five had test results missing. Outliers were detected in the data as those with test scores that fell outside the range of $Q_1 - 1.5 \times \mathrm{IQR}$ and $Q_3 + 1.5 \times \mathrm{IQR}$, where $Q_1$ and $Q_3$ are the lower and upper quartile, respectively, and IQR is the inter-quartile range. Two participants were removed under these conditions for this analysis. This gives a total of 44 test differences in the sentence selection condition, and 41 in the text-input condition.
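As an illustration of this kind of outlier screening, the sketch below assumes the conventional Tukey fences with a multiplier of 1.5 and uses the default quartile method of Python's `statistics.quantiles`. Both are assumptions: the quartile convention used in the study is not specified, and different conventions can give slightly different fences. The scores are made-up values, not study data.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Made-up test score differences, for illustration only.
scores = [1, 2, 3, 4, 5, 6, 7, 8, 100]
print(iqr_outliers(scores))  # [100]
```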
[Table I: Time per teaching action (s) and number of teaching attempts, per condition.]
Prior to analysing the differences in test scores, a repeated measures ANOVA was conducted to examine the effect of teaching modality on the time taken per teaching action, defined as the time from when a teaching action was initiated to when it was completed (either successfully or unsuccessfully). A significant difference was found between the conditions: participants teaching Gamma took longer, on average, to complete each teaching action than those teaching Alpha, as seen in Table I. This resulted in Gamma receiving significantly fewer teaching attempts than Alpha in the fixed-time sessions. This is consistent with the requirements of each condition, where teaching via paraphrasing is expected to take longer than teaching via sentence selection, due to the need to both read and type out the paraphrased response.
[Table II: Test score differences, and test score differences normalised by teaching attempts, per condition.]
Following this, to test H1, repeated measures ANOVA was conducted to examine the effect of teaching modality on the difference in test scores.
Participants showed a slightly larger difference in test scores in the sentence selection condition compared to the text input condition, as shown in Table II, however this difference was not statistically significant.
Due to the significant difference in the time taken to complete each teaching action, and the fixed duration of the experiment, the test differences were then normalised by the number of teaching attempts, and a repeated measures ANOVA was conducted on these values. For these normalised values, there is a statistically significant difference between conditions: the normalised test difference was higher for tests covering material from the text input condition than for tests covering material from the sentence selection condition, as shown in Table II. The normalised results indicate that learning gains using a text-input modality are higher per teaching action, supporting H1.
We next assessed metrics related to participant engagement, firstly those focusing on quantitative metrics of behavioural engagement, and secondly on qualitative measures representing affective engagement.
V-B1 Behavioural Engagement
Behavioural engagement was measured using data representing the participant’s interaction with the Curiosity Notebook interface. The metrics used are the number of quizzes initiated by the user, the number of times the agent’s notebook was checked, and the number of clicks to navigate between different articles and categories.
As mentioned in the presentation of results on learning outcomes, there was a significant difference in the time taken to complete each teaching action between the two conditions. Participants in the text-input condition tended to spend longer formulating their typed responses, and so had less opportunity to click around the interface in general, or engage in tasks other than teaching, such as quizzes. All metrics have been normalised by the number of teaching attempts to account for this difference. The results are summarised in Table III.
[Table III: Quizzes, notebook checks, and navigation clicks, each normalised by teaching attempts, per condition.]
A Wilcoxon signed rank test was used to examine the difference between conditions, as all data was non-normal. This analysis found no statistically significant differences between these normalised measures across the two conditions.
V-B2 Affective Engagement
Metrics of affective engagement were obtained from survey data, administered directly after each teaching interaction, and also after experiencing both conditions. After teaching each agent, a 5-point Likert scale was used to rate enjoyment of the interaction, and perceived utility of the interaction for one’s own learning. An IMI survey was used to rate perceived effort and stress during the interaction. The statistically significant results from these surveys are summarised in Table IV.
[Table IV: Statistically significant survey results per condition, including perceived utility for learning.]
A Wilcoxon signed rank test found a significant difference in ratings of enjoyment between conditions, with participants rating Alpha slightly higher than Gamma. A significant difference was also found in the participants’ perceived effort in the teaching task, where Gamma was rated moderately higher than Alpha.
Participants were also asked about the perceived usefulness of the teaching task for their own learning. There was a significant difference between conditions, where participants rated Gamma slightly higher than Alpha.
There was no statistically significant difference in participants’ ratings of their stress.
After participants had experienced both conditions, they were given a survey that directly compared their experiences teaching the two agents. They were asked which agent they preferred teaching, teaching which agent was more useful for their own learning, and which agent they would prefer to interact with in the future. A summary of all three survey question results can be seen in figure 5.
When asked which agent they preferred teaching, 26.1% of participants stated that they preferred teaching Alpha, with a further 19.6% indicating a strong preference for Alpha. 37.0% of participants stated that they preferred teaching Gamma, with a further 13.0% indicating a strong preference for Gamma. The remaining 4.3% did not have a preference.
When asked which agent they found most helpful for their own learning, 2.2% of participants felt that Alpha was somewhat more helpful, with 15.2% indicating that Alpha was much more helpful. 17.4% of participants felt Gamma was somewhat more helpful, with 39.1% indicating that Gamma was much more helpful. The remaining 26.1% found both agents equally helpful.
When asked which agent they would prefer to interact with in the future, 26.1% of participants indicated they would prefer Alpha, with a further 6.5% indicating a strong preference for Alpha. 39.1% of participants indicated a preference for Gamma, with a further 13.0% indicating a strong preference for Gamma. The remaining 15.2% did not have a preference.
Metrics relating to both behavioural engagement and affective engagement have been measured in this study, which provides two different ways through which H2 can be evaluated. The results indicate that H2 is supported when we consider affective engagement, but that there is no significant support when we consider behavioural engagement.
VI Paraphrasing in Text Input Condition
The analysis of results from the user study indicated that teaching via text input has a positive effect on learning outcomes, and on aspects of affective engagement. During this experimental condition, users were encouraged to paraphrase the source material while formulating their responses. However, there were differences in paraphrasing effort between participants, from fully reformulating the source material, to copying down full sentences from the source material verbatim. Additionally, while learning outcomes in the text input condition were greater than for the sentence selection condition, these results still showed a large variance. To further examine H1, an analysis was performed on the user inputs to examine the effect of the amount of paraphrasing on learning outcomes, and also on measures of affective engagement.
In order to estimate the amount of paraphrasing in each teaching input, semantic similarity between the user generated sentence and the relevant sentence(s) from the source material is used as a proxy. This analysis of user and source data is implemented using Sentence-BERT (SBERT), a network that rapidly derives sentence embeddings that can be compared using cosine similarity. Using this method, if a user typed exactly what was contained in the source sentence, their cosine similarity would be high, indicating a small amount of paraphrasing, and if the cosine similarity is low, this indicates a greater semantic distance between the two sentences, indicating that more paraphrasing has occurred.
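For reference, the cosine similarity of two embedding vectors is their dot product divided by the product of their norms. The sketch below computes it in plain Python on toy vectors; in the analysis described here the vectors would be SBERT sentence embeddings, which this example does not produce.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy vectors standing in for sentence embeddings.
verbatim = cosine_similarity([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])  # close to 1: little paraphrasing
distant = cosine_similarity([1.0, 0.0], [0.0, 1.0])             # 0: no overlap
```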
Out of all user inputs, only those that returned a matched feature from the Snips natural language model were used. This was done to provide certainty that the user was trying to teach the agent something within scope, and that there would be corresponding material in the source articles with which to compare the user’s input. If the user taught about a feature that was recognised by the model, but wasn’t relevant to the rock in question, the user input also wasn’t processed. The sentence-transformers Python library was used to create a model for all usable user inputs, and all source material sentences. Training the model then produced vector sentence encodings for these sentences.
To calculate the cosine similarity between the user input and source material, several situations that may have been present in the user data needed to be accounted for, including teaching:
One feature to the agent, about one rock
Multiple features from a single sentence, about one rock
Multiple features from multiple sentences, about one rock
One feature relevant to multiple rocks
Multiple features relevant to multiple rocks, from one or more sentences per rock
In each user teaching input, for the rock it was about, and for each feature that was extracted from the input sentence, all sentences from that rock’s source material containing information on that feature were obtained. The cosine similarities between the user input sentence and all relevant source material sentences were calculated. The source sentence with the highest cosine similarity was identified as the most likely sentence the user was paraphrasing from for that feature.
In the cases where multiple features were taught, if the cosine similarity was highest for all features from the same source sentence, only that single sentence was identified. However, if the cosine similarity was highest for different source sentences, those sentences were combined, and the model was retrained with this additional, compound sentence. The cosine similarity was then calculated between the user input sentence and the new, combined sentence.
A similar approach was taken where a user taught information relevant to more than one rock. The sentences with the highest cosine similarity to the user input sentence, per rock, per feature, were identified, and all unique sentences from those identified were combined. The model was retrained on all sentences including the new compound sentences, and the cosine similarity was calculated between the user input sentence and the new compound sentence.
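The matching procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the study: `embed` is a bag-of-words stand-in for the SBERT encoder, embedding the compound sentence on the fly stands in for retraining the model with it, and the rock sentences, function names, and the `(rock, feature)` keying are all hypothetical.

```python
import math
import re
from collections import Counter


def embed(sentence):
    """Stand-in for an SBERT encoder (assumption): a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine_similarity(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def best_source_sentence(user_input, candidates):
    """Return the candidate sentence most similar to the user input."""
    user_vec = embed(user_input)
    return max(candidates, key=lambda s: cosine_similarity(user_vec, embed(s)))


def paraphrase_similarity(user_input, candidates_by_key):
    """Score a user input against its most likely source sentence(s).

    candidates_by_key maps (rock, feature) to the source sentences that
    mention that feature. The best sentence is chosen per key, the unique
    winners are joined into one compound sentence, and the user input is
    compared against that compound sentence.
    """
    winners = []
    for candidates in candidates_by_key.values():
        best = best_source_sentence(user_input, candidates)
        if best not in winners:  # keep unique sentences, first-seen order
            winners.append(best)
    compound = " ".join(winners)
    return cosine_similarity(embed(user_input), embed(compound)), compound


# Hypothetical source sentences (not taken from the study's articles).
candidates_by_key = {
    ("granite", "colour"): ["Granite is usually pink or grey in colour."],
    ("granite", "formation"): [
        "Granite forms slowly from cooling magma.",
        "It forms deep underground.",
    ],
}
user_input = "Granite is pink and forms from cooling magma"
score, compound = paraphrase_similarity(user_input, candidates_by_key)

# When every feature points at the same sentence, only that sentence is used.
same_key_candidates = {
    ("granite", "colour"): ["Granite is pink and forms from magma."],
    ("granite", "formation"): ["Granite is pink and forms from magma."],
}
_, single = paraphrase_similarity(user_input, same_key_candidates)
```

Keying the candidates by `(rock, feature)` lets one function cover all five situations listed earlier, since a single-feature, single-rock input is simply a dictionary with one entry.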
For each user, the average cosine similarity for all their teaching inputs was calculated, and used as an approximation of their paraphrasing effort across the teaching interaction.
Effect on Learning Outcomes: Considering the difference in test scores using all questions (‘full learning outcomes’), a correlation is present with the cosine similarity, but the relationship is not statistically significant ().
Considering only rock-specific questions (‘rock-specific learning outcomes’), a statistically significant correlation is present between cosine similarity and rock-specific learning outcomes (). This relationship is shown in Figure 6.
Effect on Affective Engagement: The Pearson Correlation Coefficient was also calculated for the relationship between the amount of paraphrasing, and the metrics of affective engagement (enjoyment, perceived effort, perceived stress, and perceived usefulness of the interaction for learning).
There were no statistically significant relationships found between the amount of paraphrasing and the metrics of engagement.
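The per-user averaging and correlation steps above can be sketched as below. The user identifiers, similarity values, and learning gains are hypothetical numbers chosen only to exercise the code; they are not results from the study. The sketch computes the Pearson coefficient only and does not include a significance test.

```python
import math
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length, at least 2 points")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0.0 or sy == 0.0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / (sx * sy)


# Hypothetical cosine similarities, one list of teaching inputs per user.
similarities_by_user = {
    "u1": [0.9, 0.8],
    "u2": [0.6, 0.7, 0.5],
    "u3": [0.4, 0.3],
}
# Hypothetical learning gains (difference in test scores) per user.
learning_gain_by_user = {"u1": 1, "u2": 2, "u3": 4}

users = sorted(similarities_by_user)
# Average similarity per user approximates paraphrasing effort.
mean_similarity = [mean(similarities_by_user[u]) for u in users]
gains = [learning_gain_by_user[u] for u in users]

r = pearson_r(mean_similarity, gains)
```

Because a lower cosine similarity is read as more paraphrasing, a negative coefficient in this setup corresponds to more paraphrasing accompanying larger gains.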
VII-A Learning Outcomes
This study compares the effect of two different teaching modalities along the metrics of learning outcomes and engagement, during an interaction with a virtual teachable agent. The results indicate that paraphrasing and typing to teach can have a positive effect on both of these metrics. The results also show that there is a significant difference in how long each teaching action takes, and this has an impact on both learning outcomes and engagement. When teaching via paraphrasing and text input, participants were required to read the material, and also take the time to formulate and type out their response. The latter requirement is notably absent when teaching via sentence selection. For both metrics of interest, the increased time taken to complete each teaching action is important due to the fixed duration of the interaction, as fewer teaching actions would be able to occur during the available time.
We see this difference in the results on learning outcomes, where differences in test scores between the two conditions are not statistically significant. However, when we account for the number of teaching actions completed, we see that teaching via paraphrasing and text input resulted in better learning outcomes per teaching action. This suggests that participants were able to better recall the information they taught using this modality. This may be due to a number of reasons, including the increased time spent on the material during each teaching interaction, the requirement to read the source material more closely, the paraphrasing, the act of typing out the response, or a combination of these. These results also indicate that future experiments using this teaching modality may benefit from removing the time constraint on the teaching interaction.
The secondary analysis of only the text-input condition, which characterises the influence of paraphrasing, indicates that the degree to which a participant paraphrased the source material, measured in this case by semantic similarity, has a positive impact on learning outcomes. This adds support to the hypothesis of the study, indicating that not only do paraphrasing and typing the teaching response have a positive impact on learning outcomes, but that more paraphrasing can further improve these outcomes.
This analysis makes use of the rock-specific learning outcomes, as opposed to the full learning outcomes. This may show that more paraphrasing is more helpful in questions that focus on knowledge recall. The rock-specific questions ask users to identify features of a rock from a list, while the multi-choice questions ask participants to attribute the explanation of features common across many rocks that they may have seen. It is possible that the broader nature of the multi-choice questions reflected knowledge that the participants already had, and thus did not contribute to a knowledge gain after the teaching interaction took place.
During the study, both behavioural and affective engagement were measured. Metrics of behavioural engagement, representing the participants’ interaction with the interface, were not statistically significantly different between the two conditions after normalising for teaching attempts.
In terms of affective engagement, participants’ responses differed slightly depending on when they were asked: immediately following each condition, or at the conclusion of the experiment, when they compared the conditions directly. Responses given directly after the teaching interaction indicate that participants preferred teaching Alpha, via sentence selection. Participants also indicated that teaching Gamma, via paraphrasing, required more effort. This suggests that immediately following the interaction, participants were more favourable towards the interaction which required less effort. When participants were asked to directly compare the two conditions, after experiencing them both, there was no clear preference for either condition. This suggests that after some time, any bias that the amount of effort had on reported enjoyment was less prevalent.
Consistently, participants responded that they felt like teaching Gamma was more helpful for their own learning. This supports the hypothesis that participants better learned the material they taught to Gamma, though it is important to note that this study does not explicitly quantify the relationship between perceived versus actual utility for learning.
We also found that a majority of participants would choose to interact with Gamma again, over Alpha. This suggests that there is a priority among this demographic of participants in interacting with the agent that is perceived to have a greater impact on learning, rather than the one that was more enjoyable, or less effort, to teach.
The secondary analysis focusing on the amount of paraphrasing found no significant impact on measures of affective engagement. This may be interpreted positively, as it indicates that paraphrasing more extensively from the source material was not perceived as requiring more or less effort, or as being more or less stressful, and neither increased nor decreased enjoyment of the teaching task, compared to more limited, or no, paraphrasing. The differences in measures of affective engagement when compared to the sentence selection condition may only be a reflection of the time and effort required to type out the teaching response at all. Encouraging users to paraphrase more may therefore benefit their learning outcomes at limited detriment to their user experience.
It is also noteworthy that despite this analysis indicating that the amount of paraphrasing did positively impact learning outcomes, participants did not perceive this effect themselves. This may be a reflection on the design of the knowledge assessments, where participants were not able to accurately assess their performance, or draw meaningful connections between the material they taught the agent, and the material they were tested on.
VII-C Limitations of Study and Analysis
There are some limitations in the study design that should be considered when contextualising the results. This study examined learning outcomes, but only focused on short-term recall, and did not investigate the effects of teaching modality on long term knowledge retention, which may be of interest if the technology were to be deployed in a classroom setting.
The fixed duration of the experiment also poses some limitations, including those already discussed in relation to the pace of teaching in each condition. Additionally, constraining the amount of time participants can spend in each condition may have placed pressure on participants to maximally cover the material in the available time, and may have impacted teaching choices, perhaps to favour wider coverage over depth.
Finally, using semantic similarity between the teaching utterances and the source material as a proxy for the amount of paraphrasing places limitations on the conclusions drawn from this analysis. Most significantly, this measure does not include any indication of the quality of the paraphrase, or of the accuracy of the information from a teaching perspective. For example, the input sentence may be semantically distant from the source material because it is brief, lacking in detail, or contains information outside the source article.
VIII Conclusion and Future Work
In this work, we investigate the effect of teaching modality while interacting with a virtual teachable agent, via an online teaching platform, the Curiosity Notebook. A method of selecting sentences from the source material is compared against a method of reading and paraphrasing the source material, and typing out responses to teach. A user study was conducted to measure the learning outcomes and engagement of participants across the two conditions. A secondary analysis was conducted on the paraphrasing condition alone to measure the effect of differences in individual paraphrasing behaviours.
The results of the user study indicate that teaching via paraphrasing does have a positive effect on learning outcomes, improving recall of the material covered. It was observed that individual teaching actions using this modality took longer to complete than those for teaching via sentence selection, which reduced the total volume of material covered when teaching via paraphrasing.
This also affected metrics of behavioural engagement, as users experienced periods of inactivity while reading or typing. Teaching via paraphrasing was recognised as requiring more effort; however, this did not significantly negatively impact enjoyment of the teaching task. Additionally, it was consistently recognised that teaching via paraphrasing was more helpful for the users’ own learning, which positively affected their desire to use this teaching modality in the future.
It was also found that more paraphrasing in the participants’ teaching inputs, measured via semantic similarity, had a positive impact on learning outcomes. In contrast, the amount of paraphrasing did not have any effect on the perceived effort or enjoyment of the teaching task, indicating that this benefit to learning outcomes does not come at the expense of these affective measures.
This teaching interaction involves other factors, such as the error rate in responses of the agent, and the teaching path selected by individual users. Future development of the Curiosity Notebook that utilises teaching via paraphrasing would benefit from exploration of these factors to determine which have an effect on the metrics of interest, and how they may be optimised to benefit students.
-  (1980) On the cognitive benefits of teaching. Journal of Educational Psychology 72 (5), pp. 593–604. External Links: Cited by: §II-A.
-  (2018-08) Social robots for education: A review. Science Robotics 3 (21) (en). External Links: Cited by: §I.
-  (2005-03) Learning By Teaching: A New Agent Paradigm For Educational Software. Applied Artificial Intelligence 19, pp. 363–392. External Links: Cited by: §I, §I, §II-A, §II-A, §II-E, §II.
-  (2020-01) Mapping research in student engagement and educational technology in higher education: a systematic evidence map. International Journal of Educational Technology in Higher Education 17 (1), pp. 1–30 (en). External Links: Cited by: §II-D.
-  (1999) Teachable Agents: Combining Insights from Learning Theory and Computer Science. In S. P. Lajoie and M. Vivet (Eds.), Artificial Intelligence in Education, pp. 21–28. Cited by: §II.
-  (2014-05) Jumping NLP Curves: A Review of Natural Language Processing Research. IEEE Computational Intelligence Magazine 9 (2), pp. 48–57. External Links: Cited by: §II-B.
-  (2021-05) Can a Humorous Conversational Agent Enhance Learning Experience and Outcomes?. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, pp. 1–14. External Links: Cited by: §II-C.
-  (2009-08) Teachable Agents and the Protégé Effect: Increasing the Effort Towards Learning. Journal of Science Education and Technology 18 (4), pp. 334–352 (en). External Links: Cited by: §I, §II-A, §II-E.
-  (2019) Using Conversational Agents To Support Learning By Teaching. arXiv e-prints, pp. 7 (en). Cited by: §II-C.
-  (2019) Towards the Learning, Perception, and Effectiveness of Teachable Conversational Agents. Ph.D. Thesis, University of Waterloo. Cited by: §II-C.
-  (2018-12) Snips Voice Platform: an embedded Spoken Language Understanding system for private-by-design voice interfaces. arXiv:1805.10190 [cs]. Cited by: §II-B, §III-C.
-  (2004-03) School Engagement: Potential of the Concept, State of the Evidence. Review of Educational Research 74 (1), pp. 59–109 (en). External Links: Cited by: §II-D.
-  (1999-12) AutoTutor: A simulation of a human tutor. Cognitive Systems Research 1 (1), pp. 35–51 (en). External Links: Cited by: §I, §II-A, §II-A, §II-B, §II-E, §II.
-  (2015-07) Advances in natural language processing. Science 349 (6245), pp. 261–266 (en). External Links: Cited by: §II-B.
-  (2009-12) Measuring the Effectiveness of Educational Technology: what are we Attempting to Measure?. Electronic Journal of e-Learning 7 (3), pp. 273–280 (en). External Links: Cited by: §II-D.
-  (1998-01) Comprehension: A Paradigm for Cognition. Cambridge University Press (en). External Links: Cited by: §I, §II-B.
-  (2009-09) Paraphrasing: An Effective Comprehension Strategy. The Reading Teacher 63 (1), pp. 73–77 (en). External Links: Cited by: §I, §II-B.
-  (2019-05) How is the use of technology in education evaluated? A systematic review. Computers & Education 133, pp. 27–42 (en). External Links: Cited by: §II-D.
-  (2020-04) Curiosity Notebook: A Platform for Learning by Teaching Conversational Agents. In Extended Abstracts of the 2020 CHI Conference on Human Factors in Computing Systems, CHI EA ’20, Honolulu, HI, USA, pp. 1–9. External Links: Cited by: §II-C, §II-E, §III-A, §III.
-  (2021-10) Curiosity Notebook: The Design of a Research Platform for Learning by Teaching. Proceedings of the ACM on Human-Computer Interaction 5 (CSCW2), pp. 1–26. External Links: Cited by: §I, §I, §II.
-  A Review of Technologies for Conversational Systems. Advanced Computational Methods for Knowledge Engineering, pp. 212–225. External Links: Cited by: §II-B.
-  (2012-12) Cognitive Anatomy of Tutor Learning: Lessons Learned With SimStudent. Journal of Educational Psychology 105, pp. 1152–1163. External Links: Cited by: §II-A, §II-E, §II.
-  (2001-06) The Case for Social Agency in Computer-Based Teaching: Do Students Learn More Deeply When They Interact With Animated Pedagogical Agents?. Cognition and Instruction 19 (2), pp. 177–213. External Links: Cited by: §II.
-  (2015-11) Learning by preparing to teach: Fostering self-regulatory processes and achievement during complex mathematics problem solving. Journal of Educational Psychology 108 (4), pp. 474–492. Cited by: §II-A.
-  (2014-10) Expecting to teach enhances learning and organization of knowledge in free recall of text passages. Memory & Cognition 42 (7), pp. 1038–1048 (en). External Links: Cited by: §II-A, §II-E.
-  (2014-12) AutoTutor and Family: A Review of 17 Years of Natural Language Tutoring. International Journal of Artificial Intelligence in Education 24 (4), pp. 427–469 (en). External Links: Cited by: §II-A, §II-A.
-  (2021-08) Effects of an Adaptive Robot Encouraging Teamwork on Students’ Learning. In 2021 30th IEEE International Conference on Robot Human Interactive Communication (RO-MAN), pp. 250–257. External Links: Cited by: §II-C.
-  (2019-08) Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. arXiv preprint arXiv:1908.10084 (en). Cited by: §VI-A.
-  Sentence-transformers: Sentence Embeddings using BERT / RoBERTa / XLM-R. External Links: Cited by: §VI-A.
-  (2007-06) The Case for Caring Colearners: The Effects of a Computer-Mediated Colearner Agent on Trust and Learning. Journal of Communication 57 (2), pp. 183–204 (en). External Links: Cited by: §II.
-  Natural Language Processing Advancements By Deep Learning: A Survey. arXiv:2003.01200 [cs]. Cited by: §II-B.
-  (2013-11) My Science Tutor: A Conversational Multimedia Virtual Tutor. Journal of Educational Psychology 105, pp. 1115–1125. External Links: Cited by: §I, §II-A, §II-A.
-  (1987) Artificial Intelligence and Tutoring Systems: Computational and Cognitive Approaches to the Communication of Knowledge. Morgan Kaufmann (en). External Links: Cited by: §II-A.
-  (2014) Embodied Teachable Agents: Learning by Teaching Robots. In The 13th International Conference on Intelligent Autonomous Systems. Cited by: §II-A.
-  (2018-06) When deictic gestures in a robot can harm child-robot collaboration. In Proceedings of the 17th ACM Conference on Interaction Design and Children, IDC ’18, Trondheim, Norway, pp. 195–206. External Links: Cited by: §I.