
Curriculum Q-Learning for Visual Vocabulary Acquisition

by   Ahmed H. Zaidi, et al.

The structure of a curriculum plays a vital role in our learning process, both as children and adults. Presenting material in ascending order of difficulty that also exploits prior knowledge can have a significant impact on the rate of learning. However, the notion of difficulty and prior knowledge differs from person to person. Motivated by the need for a personalised curriculum, we present a novel method of curriculum learning for vocabulary words in the form of visual prompts. We employ a reinforcement learning model grounded in pedagogical theories that emulates the actions of a tutor. We simulate three students with different levels of vocabulary knowledge in order to evaluate how well our model adapts to the environment. The results of the simulation reveal that, through interaction, the model is able to identify the areas of weakness, as well as push students to the edge of their zone of proximal development (ZPD). We hypothesise that these methods can also be effective in training agents to learn language representations in a simulated environment, where it has previously been shown that the order of words and prior knowledge play an important role in the efficacy of language learning.



1 Introduction

With the rise of machine learning and tasks such as automated teaching and assessment, there is increased interest in understanding how machine learning models can be grounded in theories of language acquisition. Additionally, with the abundance of learner data, both archived and newly generated, we now have an avenue through which we can not only evaluate our theories of learning, but also explore whether these theories can be used to train agents for the purpose of general AI.

Language acquisition is a multidisciplinary field that includes linguistics, psychology, neuroscience, philosophy, and more recently computer science. At the intersection of language acquisition and pedagogy lie theories of educational practices for language learners, including, for example, an optimal curriculum for both L1 and L2 learners. A curriculum is a guide that helps teachers decide what content to present and the order in which it needs to be presented. The aim of a curriculum is to provide a highly structured method of introducing concepts in order to maximise the rate of learning.

The idea of a curriculum to facilitate the rate of learning has been discussed from the perspective of animal training [1, 2], where it is defined as shaping. It has also been referenced in an educational framework [3], where the author introduces the idea of a spiral curriculum, a process by which complex information is first presented in a simplified manner and then revisited at a more difficult level later on. Similarly, Vygotsky, from the view of language acquisition, introduces the idea of scaffolding in order to provide contextual support for more complex ideas using simplified language or visuals. All of these concepts have been discussed in different fields but reference the same underlying idea of presenting information in a structured manner in order to exploit prior knowledge.

Bruner [4] argues that the role of the teacher is not to present information by rote learning but rather to facilitate the learning process in order to teach students to become active learners: put simply, they are “learning to learn”. There are many factors that teachers need to consider when constructing a curriculum to achieve this goal, namely the difficulty and appropriateness of content.

Difficulty is measured relative to the zone of proximal development (ZPD), introduced by Vygotsky, which is a representation of what a learner is capable of achieving without help, with some help, and of concepts that are beyond the learner’s current ability. Appropriateness is a measure of whether content being presented is within the ZPD or, in the case of scaffolding, comprises material from within the ZPD.

Determining difficulty and appropriateness is traditionally a very laborious and resource intensive task which entails experts conducting focus groups and analysis to decide where a particular question or topic sits in the curriculum. This method is not only inefficient, it also assumes a static curriculum for all students.

To address these limitations, we propose the use of reinforcement learning (RL) in order to learn an optimal policy and curriculum for each student in the task of visual vocabulary acquisition. Through this, we also discuss the similarities between the properties and features of RL and those of language acquisition. We evaluate our models by simulating three types of student at different levels of proficiency (beginner, intermediate, and advanced). We find that the system is able to identify the difference in proficiency and adapt its curriculum accordingly.

Previous uses of RL in pedagogy include [5], where it is used to teach students arithmetic, aiming to minimise the time taken to answer questions. Iglesias et al. [6, 7] teach students database design using Q-learning. Both [5] and [6, 7] evaluate results on simulated students. Martin and Arroyo [8] use RL for maths, while Tetreault and Litman [9] use it for physics. However, as far as we know, no previous work has been done in the space of visual lexical acquisition where the principles of RL have explicitly been related to theories of language acquisition.

The importance of curriculum learning in training deep learning models and agents has also been discussed by Bengio et al. [10], where its use is shown to facilitate the generalisation as well as the rate of convergence and training of deep learning networks. Hermann et al. [11] also illustrate the need for some form of curriculum to improve the rate of learning for agents in a 3D simulation. However, it is worth noting that no explicit RL is used to model curriculum by either [10] or [11].

2 Curriculum Q-Learning

In order to automate the process of curriculum learning for visual vocabulary acquisition, we must first identify the key components of our RL system. The agent in this task is the automated tutor that must learn what information to present to the student. The environment is the student who is interacting with the agent.

We assume that the student interacting with the tutor is a learner of English who has reached a given level on the Common European Framework of Reference (CEFR) scale. CEFR is an international standard for describing language ability, using a six point scale, from A1 for beginners, up to C2 for those who have mastered a language.

The RL algorithm used by our proposed system is Q-Learning, an off-policy algorithm for Temporal Difference (TD) Learning. Q-Learning can be defined as follows:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$$

where $Q(s_t, a_t)$ is the Q-value of a state and action tuple, $\alpha$ is the learning rate and $\gamma$ is the discount factor. $\gamma$ models the fact that future rewards are less valuable than immediate rewards at a given time $t$.
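As an illustration of the standard tabular Q-Learning update described here, the following is a minimal sketch rather than the system's actual implementation. The function name `q_update`, the action labels and the numeric values of the learning rate and discount factor are assumptions chosen only for the example.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions, alpha, gamma):
    """Apply one tabular Q-Learning update in place and return the new value."""
    # Off-policy target: immediate reward plus the discounted value of the
    # best action available in the next state.
    best_next = max(q[(next_state, a)] for a in actions)
    td_error = reward + gamma * best_next - q[(state, action)]
    q[(state, action)] += alpha * td_error
    return q[(state, action)]


# Hypothetical action labels and hyperparameters, for illustration only.
actions = ["back", "stay", "forward"]
q_table = defaultdict(float)  # unseen state-action pairs start at 0.0

first = q_update(q_table, "A1", "forward", 1.0, "A2", actions, alpha=0.5, gamma=0.9)
second = q_update(q_table, "A1", "stay", -1.0, "A1", actions, alpha=0.5, gamma=0.9)
```

The second call shows the bootstrapping step: its target already uses the value learnt for `("A1", "forward")` in the first call.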

A policy $\pi$ maps states $s \in S$ to actions $a \in A$. The aim of the Q-Learning algorithm is to find an optimal policy $\pi^*$ that maximises the long-term cumulative reward. The policy achieves this by acting greedily and taking the action that presents the maximum Q-value given the state, such that $\pi^*(s) = \arg\max_{a} Q(s, a)$.

In action selection, there is a trade-off between exploiting what you have learnt so far and exploring other state-action tuples. In this task we model that using $\epsilon$-greedy. This means the policy will, for the most part, select the actions that provide the highest estimated future reward given the state. However, with a probability of $\epsilon$, an action will be selected randomly and independently from a uniform distribution. Action selection is usually drawn from a Q-Table, which is a table that stores all state-action Q-values.
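The exploration rule described above can be sketched as follows. This is an illustrative reading of epsilon-greedy selection over one row of a Q-Table; the function name `epsilon_greedy`, the action labels and the Q-values are hypothetical.

```python
import random


def epsilon_greedy(q_row, epsilon, rng=random):
    """Pick an action from one Q-Table row (a dict of action -> Q-value)."""
    actions = list(q_row)
    if rng.random() < epsilon:
        # Explore: choose uniformly at random, ignoring the Q-values.
        return rng.choice(actions)
    # Exploit: choose the action with the highest estimated Q-value.
    return max(actions, key=q_row.get)


# Hypothetical Q-values for a single state.
q_row = {"back": -0.2, "stay": 0.1, "forward": 0.7}
greedy_choice = epsilon_greedy(q_row, epsilon=0.0)
```

With `epsilon=0.0` the rule reduces to the purely greedy policy; larger values trade some immediate reward for visits to rarely tried state-action tuples.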

In this task, a policy can be viewed as a curriculum as it decides what should be shown and in what order. In order to learn a curriculum for vocabulary acquisition, we incorporate two models: the CEFR level model and the word level model. The CEFR level model has 6 states, which are defined by the 6 CEFR levels. Its actions are whether the student should progress to the next level, stay in the current level, or go back a level. The word level model has two states: active (show the word) and inactive (hide the word). Its actions are to remain in the current state or to toggle the state. This architecture ensures that there is also an estimated long-term reward associated with showing a student a particular word.
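The two state and action spaces can be written down compactly. The sketch below is one possible encoding; the identifier names are hypothetical, and clamping at the lowest and highest CEFR levels is an assumption, since the text does not say what happens when the tutor tries to move past A1 or C2.

```python
# CEFR level model: six states, three actions.
CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]
LEVEL_ACTIONS = {"back": -1, "stay": 0, "forward": 1}

# Word level model: two states, two actions.
WORD_STATES = ["active", "inactive"]
WORD_ACTIONS = ["remain", "toggle"]


def next_level(level_index, action):
    """Return the index of the next CEFR state.

    Assumption: moves past A1 or C2 are clamped to the boundary level.
    """
    step = LEVEL_ACTIONS[action]
    return min(max(level_index + step, 0), len(CEFR_LEVELS) - 1)


def next_word_state(state, action):
    """Return the next word state: toggling flips active and inactive."""
    if action == "toggle":
        return "inactive" if state == "active" else "active"
    return state
```

For example, `next_level(2, "forward")` moves a student from B1 to B2, and `next_word_state("active", "toggle")` hides a word.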

Modelling reward is often viewed as a challenging task in RL. For this application, a student is rewarded negatively (-1) for getting a question correct and positively (+1) for getting it incorrect. The motivation behind using these values is grounded in how we learn. As the RL model acts greedily and takes the action with the maximum reward, if we review a concept we understand, then we are not gaining in terms of knowledge by reviewing it again and thus its value should be reduced. Alternatively, if we get a question wrong, the benefit of reviewing that word is higher and thus we should increase the associated Q-value.
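To make the effect of this reward scheme concrete, the sketch below traces the value of showing a single word over repeated answers. It is a simplification under stated assumptions: future rewards are ignored (a discount factor of 0), the learning rate of 0.5 is arbitrary, and the function names are hypothetical.

```python
def reward_for(answer_correct):
    """Reward scheme from the text: -1 for a correct answer, +1 for an incorrect one."""
    return -1.0 if answer_correct else 1.0


def review_value_trace(outcomes, alpha=0.5):
    """Trace the Q-value of showing one word across a sequence of answers.

    Assumption: gamma = 0, so each update only moves the value towards
    the immediate reward.
    """
    value, trace = 0.0, []
    for answer_correct in outcomes:
        value += alpha * (reward_for(answer_correct) - value)
        trace.append(value)
    return trace


known_word = review_value_trace([True, True, True])
unknown_word = review_value_trace([False, False, False])
```

A word the student keeps getting right drifts towards -1 and becomes less attractive to a greedy policy, while a word the student keeps getting wrong drifts towards +1 and is more likely to be shown again.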

To evaluate the students’ understanding, we present a word in the form of an image. The objective for the students is to describe the image, and based on their response, the Q-Learning algorithm and thus the policy is updated. A valid response is defined by the target word associated with the image or a synonym of that target word, which is automatically generated by looking at the top 10 nearest words to the target word in a pre-trained word2vec model [12]. The use of images was motivated by the inexpensive nature of producing evaluation material from them. Additionally, prior studies indicate the effectiveness of images for learning [13].
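The response check described above can be sketched as a small function. The embedding lookup is passed in as a callable so the example stays self-contained; `is_valid_response`, `toy_nearest_words` and the toy neighbour list are hypothetical stand-ins, not the system's code or real word2vec output. In practice the callable could wrap a pre-trained model, for instance gensim's `KeyedVectors.most_similar(target, topn=10)`, which returns (word, similarity) pairs.

```python
def is_valid_response(response, target, nearest_words, topn=10):
    """Accept the target word itself or one of its topn nearest neighbours.

    `nearest_words(word, topn)` is assumed to return a list of words.
    """
    answer = response.strip().lower()
    if answer == target.lower():
        return True
    synonyms = {word.lower() for word in nearest_words(target, topn)}
    return answer in synonyms


# Toy stand-in for an embedding model; these neighbours are made up.
TOY_NEIGHBOURS = {"dog": ["puppy", "dogs", "hound", "canine"]}


def toy_nearest_words(word, topn):
    return TOY_NEIGHBOURS.get(word, [])[:topn]


accepted = is_valid_response(" Puppy ", "dog", toy_nearest_words)
rejected = is_valid_response("cat", "dog", toy_nearest_words)
```

Nearest neighbours in an embedding space are related words rather than strict synonyms, so a check of this kind will accept some near misses.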

3 Experiments

For the CEFR level model, we use a learning rate $\alpha$ of , a discount rate $\gamma$ of and an $\epsilon$ value of . The word level model uses an $\alpha$ of , a $\gamma$ of and an $\epsilon$ value of 1 in order to prevent words randomly going into an inactive state.

To evaluate the performance of our system, we simulated three types of students at varying levels of proficiency: beginner, intermediate and advanced. We modelled the student’s probability of getting a question correct as a negated Gompertz distribution defined by the level of the user, the level of the item and a single parameter.

The level of the user is calibrated to a scale of 0 to 5, where each integer represents a corresponding CEFR level from A1 to C2 (e.g. 0 for A1, 1 for A2, etc.). The level of an item (i.e. a word which must be guessed from an image) is calibrated to the same scale. The parameter determines the probability of success when student and item level match, and is set to model a ‘typical’ pass rate of 75%. The calibrated curve is shown in Appendix B. The curve is flatter at the lower end, as students may be expected to be comfortable with most of the material at lower CEFR levels than their own, whereas at higher levels their ability is more uncertain. We ran simulations where each student had 100 interactions with the system. An interaction is defined as a student responding to a question.
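A simulated student of this kind can be sketched in a few lines. The functional form below is an assumption: it is one common way to negate a Gompertz curve, chosen so that the success probability is 75% when student and item levels match, and it is not taken from the paper. The names `p_correct`, `simulate_answers`, `C_MATCH` and the `slope` parameter are hypothetical.

```python
import math
import random

# Assumed constant: 1 - exp(-ln 4) = 0.75, the pass rate when levels match.
C_MATCH = math.log(4)


def p_correct(user_level, item_level, c=C_MATCH, slope=1.0):
    """Assumed negated Gompertz form: 1 - exp(-c * exp(-slope * (item - user)))."""
    gap = item_level - user_level
    return 1.0 - math.exp(-c * math.exp(-slope * gap))


def simulate_answers(user_level, item_levels, seed=0):
    """Draw one correct/incorrect outcome per item for a simulated student."""
    rng = random.Random(seed)
    return [rng.random() < p_correct(user_level, item) for item in item_levels]


# An intermediate student (level 2, i.e. B1) answering 100 items at their own level.
answers = simulate_answers(2, [2] * 100)
```

Under this assumed form the curve is nearly flat for items below the student's level, where success is close to certain, and falls away for items above it.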

Figure 1: CEFR levels determined by the agent for students of varying levels of proficiency over 100 interactions
Figure 2: Cumulative reward earned by students of varying levels of proficiency from the agent over 100 interactions

3.1 Results

The results in Figure 1 show how the agent responds to the various proficiency levels. The beginner student remains relatively constant around A1 and A2, which is reflective of the student’s current level. The intermediate student continually increases in CEFR level until level 3 (B2). The advanced student, although tested with material beneath their actual level of proficiency, eventually reaches an advanced or higher CEFR level. We can also see that the agent tutor pushes the student to what can be interpreted as the edge of their ZPD. Figure 2 illustrates how the cumulative reward varies for students at different proficiencies. The curve slopes downward as the students reach their current level of vocabulary and are pushed to understand material beyond their scope.

4 Discussion

We have shown through the use of simulations that we can effectively model a personalised curriculum for vocabulary acquisition using Q-Learning. Figure 1 and Figure 2 show clear indications of varying agent behaviour for students at different levels of lexical proficiency. However, beyond that, we have set up a framework that can be used in the future to extrapolate the difficulty and appropriateness of new material. The system will serve as a test bed that will yield metrics to determine where the content fits in the curriculum. Although this is foundational work, it lays the building blocks for future pedagogically inspired RL architectures.

Through this work, we have also shown that there are many similarities between the principles of RL and theories of language acquisition. Specifically, parallels can be drawn between the concept of $\epsilon$-greedy and Krashen’s Input Hypothesis, or $i+1$. The Input Hypothesis states that students progress their learning by comprehending language that is slightly above their current language level. The interactions between the agent and the environment in RL are analogous to the social interaction approach to language acquisition, specifically the equal importance of input and output. We also use the Q-Learning algorithm as opposed to the SARSA algorithm, mainly due to the properties of Q-Learning that ensure an "optimal path" is followed, i.e. the minimum number of steps to reach our goal (language fluency).

However, there is scope for substantial extensions in this space. Deploying the system online in order to collect user data will allow us to validate and improve our existing models. Incorporating memory and spaced repetition learning in order to optimise the policy and emulate cognitive processes is also an important extension that may have a great impact on the learning output. Using deep learning models to approximate the Q-value will allow the system to capture additional signals pertinent to language acquisition. Additionally, moving towards an adaptive reward model that reflects difficulty may encourage memory retention.

All of these models can also be applied to agents instead of students. As discussed previously, Hermann et al. [11] indicated the need for a curriculum in order to effectively train an agent in the simulated environment. Creating a dynamic environment guided by a curriculum grounded in pedagogically inspired RL may result in improved learning rates for the agent.


Acknowledgements

We thank Wenchao Chen, who helped develop the back-end of our web-based platform.


References

  • [1] Burrhus Frederic Skinner. Teaching machines. Science, 128(3330):969–977, 1958.
  • [2] Gail B Peterson. A day of great illumination: B. F. Skinner’s discovery of shaping. Journal of the Experimental Analysis of Behavior, 82(3):317–328, 2004.
  • [3] Jerome S Bruner. The process of education. Vintage Books, 1960.
  • [4] Jerome S Bruner. The act of discovery. Harvard educational review, 1961.
  • [5] Joseph Beck, Beverly Park Woolf, and Carole R Beal. Advisor: A machine learning architecture for intelligent tutor construction. AAAI/IAAI, 2000:552–557, 2000.
  • [6] Ana Iglesias, Paloma Martínez, Ricardo Aler, and Fernando Fernández. Learning teaching strategies in an adaptive and intelligent educational system through reinforcement learning. Applied Intelligence, 31(1):89–106, 2009.
  • [7] Ana Iglesias, Paloma Martinez, and Fernando Fernández. An experience applying reinforcement learning in a web-based adaptive and intelligent educational system. 2003.
  • [8] Kimberly N Martin and Ivon Arroyo. Agentx: Using reinforcement learning to improve the effectiveness of intelligent tutoring systems. In Intelligent Tutoring Systems, pages 564–572. Springer, 2004.
  • [9] Joel R Tetreault and Diane J Litman. Comparing the utility of state features in spoken dialogue using reinforcement learning. In Proceedings of the main conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 272–279. Association for Computational Linguistics, 2006.
  • [10] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pages 41–48. ACM, 2009.
  • [11] Karl Moritz Hermann, Felix Hill, Simon Green, Fumin Wang, Ryan Faulkner, Hubert Soyer, David Szepesvari, Wojtek Czarnecki, Max Jaderberg, Denis Teplyashin, et al. Grounded language learning in a simulated 3d world. arXiv preprint arXiv:1706.06551, 2017.
  • [12] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
  • [13] Michael P Verdi, Janet T Johnson, William A Stock, Raymond W Kulhavy, and Polly Whitman-Ahern. Organized spatial displays and texts: Effects of presentation order and display type on learning outcomes. The Journal of Experimental Education, 65(4):303–317, 1997.

Appendix A Curriculum Q-Learning System Overview

Figure 3: Overview of the system. A simulated student takes the place of a human actor in our study.

Appendix B Negated Gompertz Curve

Figure 4: Gompertz curve used as a model to simulate student success probabilities.

Appendix C Preview of Web-based Curriculum Q-Learning

Figure 5: A preview of the web-based Curriculum Q-Learning platform.