Knowledge tracing (KT) is the task of modeling how much knowledge students have obtained over time, so that we can accurately predict how students will perform on future exercises and dynamically arrange study plans according to their real-time situations [2, 24]. In particular, second language acquisition (SLA) modeling is a kind of KT in the field of language learning. With the increasing importance of language-learning activity in people's daily life, SLA modeling attracts more and more attention. For example, NAACL 2018 held a public SLA modeling challenge (http://sharedtask.duolingo.com/). Therefore, in this paper, we focus on SLA modeling.
SLA modeling targets the learning process of a specific language, so each SLA modeling task has a corresponding language, e.g., English, Spanish, or French. Meanwhile, each language dataset is composed of many exercises, and an exercise is the smallest data unit. An exercise has one of three possible types, i.e., Listen, Translate, and Reverse Tap, and the answer to an exercise is always a sentence regardless of the exercise type. In an exercise, a student answers the given question by writing an answer sentence. Then the student-provided sentence and the correct sentence are compared word by word to evaluate the ability of the student. As shown in Fig. 1 (A), taking an English listening exercise as an example, the correct sentence is "I love my mother and my father", and the answer of the student is "I love mader and fhader"; it can be seen that three words are correctly answered. Therefore, the SLA modeling task is to predict whether students can answer each word correctly according to the exercise information (meta information, and the correct sentence with its corresponding linguistic information). Thus, it can simply be taken as a word-level binary classification task.
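The word-by-word comparison above can be sketched as follows. This is a minimal illustration using alignment via Python's `difflib`, which is an assumption for illustration, not necessarily the exact matching procedure used by the dataset:

```python
import difflib

def word_labels(correct_sentence, student_sentence):
    """Label each word of the correct sentence: 1 if the student
    produced it (in order), 0 otherwise. difflib alignment is an
    illustrative stand-in for the official matching procedure."""
    correct = correct_sentence.lower().split()
    student = student_sentence.lower().split()
    matcher = difflib.SequenceMatcher(None, correct, student)
    labels = [0] * len(correct)
    for block in matcher.get_matching_blocks():
        for i in range(block.a, block.a + block.size):
            labels[i] = 1
    return labels

labels = word_labels("I love my mother and my father", "I love mader and fhader")
# "I", "love", and "and" align, giving three correctly answered words
```

The resulting 0/1 sequence over the words of the correct sentence is exactly the target of the word-level binary classification task.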
In the SLA modeling task, low-resource settings are a common phenomenon that affects the training process significantly. Specifically, this phenomenon has two main causes: (1) for some specific language-learning datasets, e.g., Czech, the size of the data may be very small because we cannot collect enough language-learning exercises; (2) a user will encounter a cold-start scenario when starting to learn a new language. However, almost all existing methods for the SLA modeling task train a model separately for each language-learning dataset, so their performance largely depends on the size of the training data. Thus, they can hardly work well in low-resource scenarios. Fig. 1 (B) illustrates an example. Suppose that we have two languages, English and Czech; existing methods will train two separate models for these languages: model_en and model_cz. These two models will perform poorly in two low-resource scenarios: (1) if the English dataset has a large amount of data, model_en will perform well, but the small size of the Czech dataset may significantly hinder the performance of model_cz; (2) suppose that a user has a large number of exercises for learning Czech; when he/she begins to learn English, the number of English exercises for him/her will be very small, even zero, so model_en can hardly predict the answers of his/her English exercises well.
Intuitively, there are lots of common patterns among different language-learning tasks, such as the learning habits of users and grammar learning skills. If the latent common patterns across these language-learning tasks can be well learned, they can be used to solve the low-resource SLA modeling problem.
Inspired by this idea, in this paper, we propose a novel multi-task learning method for SLA modeling, i.e., a unified model that processes several language-learning datasets simultaneously. Specifically, the proposed model jointly learns features shared across all language-learning datasets; these shared features reflect the inner nature of language-learning activity and can be taken as important prior knowledge when dealing with small language-learning datasets. Moreover, the embedding information of each user is shared, so the learning habits and language talents of a user can be transferred through the unified model to other low-resource language-learning tasks. Therefore, when a user begins to learn a new language, the unified model can work well even if there is no exercise data for this user in that language.
The main contributions of this paper are three-fold. (1) As far as we know, this is the first work applying a multi-task neural network to SLA modeling, and we effectively address the problem of insufficient training data in low-resource scenarios. (2) We deeply study the common patterns among different languages and reveal the inner nature of language learning. (3) Extensive experiments show that our method performs much better than the state-of-the-art baselines in low-resource scenarios, and it also obtains a slight improvement in the non-low-resource scenario.
II Related Work
II-A SLA Modeling
Existing methods for SLA modeling can be roughly divided into three categories: (a) logistic regression based methods, (b) tree ensemble methods, and (c) sequence modeling methods. (a) The logistic regression based methods [14, 22, 3] take the meta and context features provided by the datasets, together with other manually constructed features, as input, and output the probability of answering each word correctly. These methods are simple, yet their performance is not poor. (b) The tree ensemble methods (e.g., Gradient Boosting Decision Trees (GBDT)) [31, 27, 4] can powerfully capture non-linear relationships between features. Therefore, although their input and output are the same as in (a), they are generally better than the methods in (a). (c) The sequence modeling methods (e.g., Recurrent Neural Networks (RNNs)) [33, 34, 11] use neural networks, especially RNNs, so that they can capture users' performance over time. The performance of these methods is also very competitive.
However, the methods above can hardly work well in low-resource scenarios because their performance largely depends on the size of the training data.
II-B Multi-Task Learning
Multi-task learning (MTL) has been widely used in various tasks, such as machine learning [21, 16, 18], natural language processing [5, 17, 7], speech recognition [6, 12, 32], and computer vision [8, 26, 35]. It effectively increases the sample size available for training a model. Thus, it can improve generalization by leveraging the domain-specific information contained in related tasks, and enables the model to obtain better shared representations across related tasks.
MTL is typically done with hard or soft parameter sharing of hidden layers, and hard parameter sharing is the most commonly used approach to MTL in neural networks. It is generally applied by sharing the hidden layers between all tasks, while keeping several task-specific output layers.
SLA modeling involves different language-learning tasks, and these tasks have much in common, which gives us an opportunity to use MTL to improve the overall performance.
III-A Problem Definition
Suppose there are $K$ second language-learning datasets $\{D_1, D_2, \dots, D_K\}$, and the $k$-th dataset $D_k$ is composed of $n_k$ exercises $\{e^k_1, e^k_2, \dots, e^k_{n_k}\}$, where $e^k_i$ is the $i$-th exercise in the $k$-th dataset.
There are two kinds of information in an exercise $e^k_i$, i.e., the meta information and the language-related context information.
The meta information contains two user-related fields: (1) user: the unique identifier for each student, e.g., D2inf5; (2) country: the student's country, e.g., CN; and the following five exercise-related fields: (1) days: the number of days since the student started learning this language, e.g., 1.793; (2) client: the student's device platform, e.g., android; (3) session: the session type, e.g., lesson; (4) format (or type): the exercise type, e.g., Listen; (5) time: the amount of time in seconds it took the student to construct and submit the whole answer, e.g., 16s. The meta information is shared among all language datasets.
The context information in the exercise includes the word sequence $(w_1, w_2, \dots, w_T)$ and each word's linguistic sequences, such as the sequence $(p_1, p_2, \dots, p_T)$ of POS tags of the words. This is unique to each language-learning dataset.
At last, $e^k_i$ has a word-level label sequence $(y_1, y_2, \dots, y_T)$, where $y_t \in \{0, 1\}$. $y_t = 1$ means the $t$-th word is answered correctly, and $y_t = 0$ means the opposite.
Our task is to build a model based on users' exercises, and further to predict the word-level label sequences of future exercises.
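Concretely, one exercise can be viewed as a record like the following; the field names follow the meta information listed above, while the values and POS tags are only illustrative:

```python
# One exercise record (illustrative values; field names follow the
# meta information described in the problem definition).
exercise = {
    # meta information, shared among all language datasets
    "meta": {
        "user": "D2inf5",     # unique student identifier
        "country": "CN",      # student's country
        "days": 1.793,        # days since the student started this language
        "client": "android",  # device platform
        "session": "lesson",  # session type
        "format": "listen",   # exercise type
        "time": 16,           # seconds to construct and submit the answer
    },
    # context information, unique to each language dataset
    "words": ["I", "love", "my", "mother", "and", "my", "father"],
    "pos_tags": ["PRON", "VERB", "DET", "NOUN", "CONJ", "DET", "NOUN"],
    # word-level labels: 1 = answered correctly, 0 = answered incorrectly
    "labels": [1, 1, 0, 0, 1, 0, 0],
}
```

The model consumes the meta and context parts and predicts the label sequence.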
III-B Encoder and Decoder Structure
Our model is an encoder-decoder structure with two encoders, i.e., a meta encoder and a context encoder, and a decoder. We use the meta encoder to learn the non-linear relationships among the meta information, the context encoder to learn the representation of a sequence of words, and the decoder to generate the final prediction for each word. The overall structure of the proposed model is shown in Fig. 2.
Meta Encoder: The meta encoder is a multi-layer perceptron (MLP) based neural network. This encoder takes the metadata as input. First, the inputs are converted into high-dimensional representations by the embedding layers, which are randomly initialized and map each input into a 150-dimensional vector. After the embedding step, we separately concatenate the user-related embeddings and the exercise-related embeddings, and send them into $\mathrm{MLP}_{user}$ and $\mathrm{MLP}_{exercise}$ to get the representation of the user-related meta information $r_{user}$ and the representation of the exercise-related meta information $r_{exercise}$, respectively. Finally, we concatenate $r_{user}$ and $r_{exercise}$, and send the result to $\mathrm{MLP}_{meta}$ to obtain the representation of the whole meta information $r_{meta}$. The meta encoder can be formulated as
$$r_{user} = \mathrm{MLP}_{user}\big([E_{user}; E_{country}]\big),$$
$$r_{exercise} = \mathrm{MLP}_{exercise}\big([E_{days}; E_{client}; E_{session}; E_{format}; E_{time}]\big),$$
$$r_{meta} = \mathrm{MLP}_{meta}\big([r_{user}; r_{exercise}]\big),$$
where, for the sake of simplicity, the exercise index is omitted from the subscripts, and $E_{x}$ is the embedded representation of meta feature $x$.
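The meta encoder can be sketched in NumPy as below. The ReLU non-linearity, the table sizes, and treating every meta feature (including continuous ones such as days and time) as an embedding look-up are assumptions for illustration, not the exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
EMB = 150  # embedding size used in the paper

def mlp(x, w, b):
    # single dense layer; ReLU is an assumed non-linearity
    return np.maximum(0.0, x @ w + b)

# randomly initialized embedding tables (vocabulary size 100 is illustrative)
emb = {f: rng.normal(size=(100, EMB)) for f in
       ["user", "country", "days", "client", "session", "format", "time"]}

def meta_encoder(ids, params):
    user_in = np.concatenate([emb["user"][ids["user"]], emb["country"][ids["country"]]])
    ex_in = np.concatenate([emb[f][ids[f]] for f in
                            ["days", "client", "session", "format", "time"]])
    r_user = mlp(user_in, *params["user"])          # user-related representation
    r_exercise = mlp(ex_in, *params["exercise"])    # exercise-related representation
    return mlp(np.concatenate([r_user, r_exercise]), *params["meta"])

params = {
    "user": (rng.normal(size=(2 * EMB, EMB)) * 0.01, np.zeros(EMB)),
    "exercise": (rng.normal(size=(5 * EMB, EMB)) * 0.01, np.zeros(EMB)),
    "meta": (rng.normal(size=(2 * EMB, EMB)) * 0.01, np.zeros(EMB)),
}
ids = {f: 0 for f in emb}
r_meta = meta_encoder(ids, params)  # 150-dimensional meta representation
```

In the real model the weights are trained jointly with the rest of the network; the sketch only shows the concatenate-then-MLP structure of the three formulas above.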
Context Encoder: The context encoder consists of three sub-encoders, i.e., a word level context encoder, a char level Long Short-Term Memory (LSTM) context encoder, and a char level Convolutional Neural Network (CNN) context encoder. The word level encoder can capture better semantics and longer dependencies than the character level encoders. By modeling the character sequence, we can partially avoid the out-of-vocabulary (OOV) problem [19, 1]. Furthermore, we only use the word sequence in the datasets, without using any of the provided linguistic information here: previous work has pointed out that the linguistic information given by the datasets contains mistakes. Instead, through the two character level encoders, we can learn certain word information and linguistic rules.
Given the word sequence $(w_1, w_2, \dots, w_T)$, the word level context encoder is computed as
$$h^{w}_{t} = \mathrm{BiLSTM}_{w}\big(\mathrm{ELMo}(w_t)\big),$$
where $w_t$ is the $t$-th word in the sequence, and $\mathrm{ELMo}(w_t)$ is its word embedding. Here, we use the pre-trained ELMo model as the look-up table. $h^{w}_{t}$ is the concatenated result of the last layer's hidden states of the forward and backward cells of $\mathrm{BiLSTM}_{w}$. It is also the output of the word level context encoder.
The char level LSTM context encoder is computed over the character sequence $(c^{t}_{1}, \dots, c^{t}_{m})$ of word $w_t$. This can be formulated as
$$v^{cl}_{t} = \mathrm{LSTM}_{c}(c^{t}_{1}, \dots, c^{t}_{m}), \qquad h^{cl}_{t} = \mathrm{BiLSTM}_{cl}(v^{cl}_{t}),$$
where $v^{cl}_{t}$ is the last hidden state of the last layer of $\mathrm{LSTM}_{c}$, and $h^{cl}_{t}$ is the concatenated result of the last layer's hidden states of the forward and backward cells of $\mathrm{BiLSTM}_{cl}$. It is also the output of the char level LSTM context encoder.
The char level CNN context encoder can be similarly formulated as
$$v^{cc}_{t} = \mathrm{CNN}(c^{t}_{1}, \dots, c^{t}_{m}), \qquad h^{cc}_{t} = \mathrm{BiLSTM}_{cc}(v^{cc}_{t}),$$
where $v^{cc}_{t}$ is the result of the CNN encoder, and $h^{cc}_{t}$ is the concatenated result of the last layer's hidden states of the forward and backward cells of $\mathrm{BiLSTM}_{cc}$. It is also the output of the char level CNN context encoder.
The final output of the context encoder is generated by a single-layer MLP, which takes the concatenation of $h^{w}_{t}$, $h^{cl}_{t}$, and $h^{cc}_{t}$ as input. The process is formulated as
$$r^{context}_{t} = \mathrm{MLP}_{context}\big([h^{w}_{t}; h^{cl}_{t}; h^{cc}_{t}]\big),$$
where $r^{context}_{t}$ is the final context representation of the word $w_t$.
III-C Multi-Task Learning
As shown in Fig. 3, suppose there are $K$ languages, and each has a corresponding dataset, i.e., $D_1, D_2, \dots, D_K$. Since our task is to predict the exercise accuracy of language learners on each language, we can regard these predictions as different tasks. Therefore, there are $K$ tasks.
We define a cross-entropy loss for each task, which encourages correct predictions and punishes incorrect ones. Specifically, for the $k$-th task, we have
$$\mathcal{L}_{k} = -\sum_{t} \big( \alpha\, y_{t} \log \hat{y}_{t} + (1 - y_{t}) \log (1 - \hat{y}_{t}) \big),$$
where $\hat{y}_{t}$ is the predicted probability that the $t$-th word is answered correctly, and $\alpha$ is a hyper-parameter to balance the negative and positive samples.
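The balanced cross-entropy can be written as a small function; weighting the positive class by `alpha` is one plausible reading of the balancing hyper-parameter:

```python
import math

def weighted_cross_entropy(y_true, y_prob, alpha=1.0, eps=1e-12):
    """Word-level binary cross-entropy with a class-balancing weight
    alpha on the positive (correctly answered) class."""
    loss = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip for numerical stability
        loss += -(alpha * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return loss / len(y_true)
```

With `alpha > 1`, mistakes on positive words are penalized more heavily, compensating for class imbalance in the labels.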
In multi-task learning, the parameters of the meta encoder and the decoder are shared, and each task only has its own parameters for the context encoder part, so the whole model has only one meta encoder, one decoder, and $K$ context encoders. In this way, the common patterns extracted from all language datasets can be utilized simultaneously through the shared meta encoder and decoder.
In the training process, one mini-batch contains data from all $K$ datasets; all examples are sent to the same meta encoder and decoder, but are sent to their corresponding context encoder according to their language type. Thus, the final loss over the $K$ tasks is calculated as
$$\mathcal{L} = \sum_{k=1}^{K} \mathcal{L}_{k}.$$
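The sharing-and-routing scheme can be sketched as follows. The encoder and decoder stand-ins are trivial placeholders; only the structure (one shared meta encoder and decoder, one context encoder per language, losses summed over tasks) mirrors the description above:

```python
# Hard parameter sharing across K language tasks (placeholder functions).
shared_meta_encoder = lambda meta: sum(meta)        # shared by all tasks
shared_decoder = lambda r: min(max(r, 0.0), 1.0)    # shared by all tasks
context_encoders = {                                # one per language dataset
    "en_es": lambda words: len(words) * 0.1,
    "es_en": lambda words: len(words) * 0.2,
    "fr_en": lambda words: len(words) * 0.3,
}

def predict(example):
    """Route an example to its language-specific context encoder,
    while the meta encoder and decoder are shared."""
    r_meta = shared_meta_encoder(example["meta"])
    r_ctx = context_encoders[example["language"]](example["words"])
    return shared_decoder(r_meta + r_ctx)           # placeholder combination

batch = [
    {"language": "en_es", "meta": [0.1], "words": ["i", "love"]},
    {"language": "fr_en", "meta": [0.2], "words": ["je", "t'aime"]},
]
preds = [predict(ex) for ex in batch]

# the final loss sums the per-task losses (placeholder values)
task_losses = {"en_es": 0.40, "es_en": 0.45, "fr_en": 0.50}
total_loss = sum(task_losses.values())
```

A mini-batch can thus mix languages freely: gradients from every language update the shared parameters, while each context encoder only receives gradients from its own language.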
Finally, we use the Adam algorithm to train the model.
IV-A Datasets and Settings
TABLE I (excerpt): Basic statistics of each dataset.

| | en_es | es_en | fr_en |
| #words / exercise | 3.18 | 2.70 | 2.84 |
| %OOV ratio (Test) | 4.5% | 10.0% | 5.9% |
We conduct experiments on the Duolingo SLA modeling shared task datasets, which contain three datasets collected from students learning English who speak Spanish (en_es), students learning Spanish who speak English (es_en), and students learning French who speak English (fr_en). Table I shows basic statistics of each dataset.
We compare our method with the following state-of-the-art baselines:
LR: Here, we use the official baseline provided by Duolingo. It is a simple logistic regression using all the meta information and context information provided by the datasets.
GBDT: Here, we use NYU's method, which is the best among all tree ensemble methods. It uses an ensemble of GBDTs with the existing features of the dataset and manually constructed features based on psychological theories.
RNN: Here, we use singsound's method, which is the best among all sequence modeling methods. It uses an RNN architecture with four types of encoders representing different types of features: token context, linguistic information, user data, and exercise format.
ours-MTL: This is our encoder-decoder model without multi-task learning, i.e., we separately train a model for each language-learning dataset.
In the experiments, the embedding size is set to 150, and the hidden size is also set to 150. Dropout regularization is applied with a dropout rate of 0.5. We use the Adam optimization algorithm with a learning rate of 0.001.
IV-B Evaluation Metrics

AUC is calculated as
$$\mathrm{AUC} = P\big( f(x^{+}) > f(x^{-}) \big),$$
where $P$ is the probability, $f$ is the trained classifier, $x^{+}$ is an instance randomly extracted from the positive samples, and $x^{-}$ is an instance randomly extracted from the negative samples.
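For a small set of scores, this pairwise definition can be computed directly; a minimal sketch (ties counted as 0.5, a common convention):

```python
def auc(pos_scores, neg_scores):
    """AUC as the probability that a randomly chosen positive instance
    is scored higher than a randomly chosen negative one."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties contribute half, by convention
    return wins / (len(pos_scores) * len(neg_scores))
```

In practice AUC is computed from ranks for efficiency, but the pairwise form above is exactly the probability in the formula.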
$F_1$ is calculated as
$$F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}},$$
where precision and recall are the precision rate and the recall rate of the trained model.
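The corresponding computation as the harmonic mean of precision and recall:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```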
TABLE III (excerpt): Component removal analysis (AUC / F1 on each dataset).

| Model | en_es AUC | en_es F1 | es_en AUC | es_en F1 | fr_en AUC | fr_en F1 |
| ours - meta encoder | 0.743 | 0.353 | 0.716 | 0.320 | 0.750 | 0.478 |
| ours - word level context encoder | 0.862 | 0.559 | 0.838 | 0.526 | 0.858 | 0.575 |
| ours - char level LSTM context encoder | 0.863 | 0.563 | 0.838 | 0.526 | 0.860 | 0.579 |
| ours - char level CNN context encoder | 0.863 | 0.564 | 0.838 | 0.528 | 0.860 | 0.559 |
| ours - both char level context encoders | 0.863 | 0.562 | 0.838 | 0.526 | 0.859 | 0.579 |
IV-C Experiment on Small-scale Datasets
We first verify the advantages of our method in cases where the training data of the whole language-learning dataset is insufficient.
Specifically, we gradually decrease the size of the training data from 400K (300K for fr_en) to 1K while keeping the development and test sets unchanged. For all baseline methods, since they only use a single language dataset for training, we only reduce the data of the corresponding language. For our multi-task learning method, we reduce the training data of one language dataset and keep the other two datasets unchanged.
The experimental results are shown in Fig. 4. It can be found that our method outperforms all the state-of-the-art baselines when the training data of a language dataset is insufficient, a huge improvement over the existing methods. For example, as shown in the AUC/en_es panel of Fig. 4, using only 1K training samples, our multi-task learning method still achieves an AUC score of 0.738, while the AUC score of ours-MTL is only 0.640, and those of the existing RNN, GBDT, and LR methods are 0.659, 0.658, and 0.650, respectively. Therefore, introducing multi-task learning improves the performance by nearly ten percent. Moreover, to achieve the same performance as our multi-task learning method trained on 1K samples, the methods without multi-task learning require more than 10K training samples, ten times more than ours. Thus, multi-task learning utilizes data from all language-learning datasets simultaneously and effectively alleviates the lack of data in a single language-learning dataset.
At the same time, we notice that ours-MTL is slightly worse than the RNN and GBDT when the amount of training data is very small (1K, 5K, 10K). This is because our model does not utilize the linguistic features provided by the dataset, and the deep model over-fits when the amount of training data is insufficient. However, as the training data grows beyond 10K, ours-MTL becomes better than the existing RNN and GBDT. Thus, our encoder-decoder structure is very competitive with existing methods even without multi-task learning.
IV-D Experiment in the Cold-start Scenario
Further, we consider directly predicting a user's answers in a language without any training exercises of this user in that language at all. This is the cold-start scenario, and also a situation that language-learning platforms must consider.
Specifically, users RWDt7srk and t6nj6nr/ are both English speakers who learn both Spanish and French, so they have data in both the es_en and fr_en datasets. The statistics are shown in Table IV. For the baseline methods, we remove the data of these two users from the training and development sets of es_en, and then train a model. Finally, we use the trained model to directly predict the data of these two users on the es_en test set. Similarly, we use our multi-task method to do the same experiment, where the training data of these two users is also removed from the es_en dataset, but fr_en and en_es are unchanged.
The experimental results are shown in Table V. Without multi-task learning, directly predicting for new users performs very poorly. Compared with the method without multi-task learning, i.e., ours-MTL, our multi-task learning method improves AUC by 11% and F1 by 20%. Because of multi-task learning, the information of these two users has been learned through the fr_en dataset. Therefore, although there is no training data of these two users on es_en, we can still obtain good performance with multi-task learning.
IV-E Experiment in the Non-low-resource Scenario
The experiments above show that our method has a huge advantage over the existing methods in low-resource scenarios. In this section, we will observe the performance of our method in the non-low-resource scenario.
Specifically, we use all the data of the three language datasets to compare our method with the existing methods. This experimental setting is exactly the 2018 public SLA modeling challenge held by Duolingo (http://sharedtask.duolingo.com/). Here, we add a new baseline, GBDT+RNN. This is SanaLabs's method, which combines the predictions of a GBDT and an RNN, and it is also the current best method on the 2018 public SLA modeling challenge.
As shown in Table II, although the improvement is not very big, our method surpasses all existing methods and refreshes the best scores on all three datasets. Especially for the smallest dataset, fr_en, our method obtains the largest improvement over ours-MTL. As for the largest dataset, en_es, our method also improves the AUC score by 0.003 over the best existing method, GBDT+RNN. Therefore, our method also gains a slight improvement in the non-low-resource scenario.
V Model Analysis
V-A Component Analysis
Our encoder-decoder structure contains two encoders, i.e., the meta encoder and the context encoder, where the context encoder includes three sub-encoders, i.e., the word level context encoder, the char level LSTM context encoder, and the char level CNN context encoder. In order to explore the importance of each encoder, we conduct a component removal experiment.
Specifically, we remove each encoder component in turn, train a model, and record its performance on the test set. We also remove both char level context encoders together and do the same experiment.
The experimental results are shown in Table III. It can be found that the meta information is critical to the final result, much more important than the context information. If the meta encoder is removed, the results drop sharply. The reason is that with only a context encoder, the model merely captures the global word-error distribution and completely ignores the individual's situation, which contradicts the goal of adaptive learning.
For the context encoder, the word level encoder has a greater impact on the performance of our model than the char level encoders.
V-B Metadata Analysis
The analysis above has shown that the meta information is important for the prediction results. Obviously, different meta features have different influences. Therefore, we conduct a feature removal analysis to find the important features. Specifically, we remove each meta feature and measure the performance of the model without this feature.
As shown in Fig. 5, the most important feature is the user (id). Without the user (id), the model performance declines rapidly, because user information is the key to building user-adaptive learning. This also shows that the most common pattern across learning different languages is the students themselves. Besides, it can be found that the exercise format and the spent time also have a significant influence on the model.
V-C User Embedding Analysis

In this part, we show what the meta encoder has learned from the three datasets through multi-task learning.
We cluster the user embeddings with the k-means algorithm (k = 4), and calculate the average accuracy of each user and the overall average accuracy of each cluster. The embeddings are projected by t-SNE for visualization. As shown in Fig. 6, every point represents a user, and its color represents the average accuracy of this user: red means low accuracy and blue means high. The four large points indicate the cluster centers, and the value next to each center is the overall average accuracy of the corresponding cluster. It can be found that students with good grades and students with poor grades are distinguished very well according to their user embeddings, so the user embedding trained by our model contains rich information for the final prediction.
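The clustering step can be sketched with a minimal k-means on synthetic 2-D "embeddings"; the real analysis uses the trained 150-dimensional user embeddings and t-SNE only for display:

```python
import numpy as np

def kmeans(points, k, iters=50, seed=0):
    """Minimal k-means: returns (centers, assignment)."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # distance of every point to every center
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):  # keep a center if its cluster empties
                centers[j] = points[assign == j].mean(axis=0)
    return centers, assign

# two well-separated synthetic "user embedding" blobs
rng = np.random.default_rng(1)
blob_a = rng.normal(loc=0.0, scale=0.1, size=(20, 2))
blob_b = rng.normal(loc=5.0, scale=0.1, size=(20, 2))
points = np.vstack([blob_a, blob_b])
centers, assign = kmeans(points, k=2)
# per-cluster average accuracy would then be computed from the users' labels
```

Each cluster's overall average accuracy is then simply the mean of the per-user accuracies of the users assigned to it.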
VI Conclusion

In this paper, we have proposed a novel multi-task learning method for SLA modeling. As far as we know, this is the first work applying a multi-task neural network to SLA modeling and studying the common patterns among different languages. Extensive experiments show that our method performs much better than the state-of-the-art baselines in low-resource scenarios, and it also obtains a slight improvement in the non-low-resource scenario.
-  (2015) Improved transition-based parsing by modeling characters instead of words with lstms. arXiv preprint arXiv:1508.00657. Cited by: §III-B.
-  (2014) Recommending learning materials to students by identifying their knowledge gaps.. In RecSys Posters, Cited by: §I.
-  (2018) Predicting second language learner successes and mistakes by means of conjunctive features. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 349–355. Cited by: §II-A.
-  (2018) Feature engineering for second language acquisition modeling. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 356–364. Cited by: §II-A.
-  (2008) A unified architecture for natural language processing: deep neural networks with multitask learning. In Proceedings of the 25th international conference on Machine learning, pp. 160–167. Cited by: §II-B.
-  (2013) New types of deep neural network learning for speech recognition and related applications: an overview. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 8599–8603. Cited by: §II-B.
-  (2015) Multi-task learning for multiple language translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Vol. 1, pp. 1723–1732. Cited by: §II-B.
-  (2015) Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 1440–1448. Cited by: §II-B.
-  (2005) A probabilistic interpretation of precision, recall and f-score, with implication for evaluation. In European Conference on Information Retrieval, pp. 345–359. Cited by: §IV-B.
-  (1982) The meaning and use of the area under a receiver operating characteristic (roc) curve.. Radiology 143 (1), pp. 29–36. Cited by: §IV-B.
-  (2018) TMU system for slam-2018. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 365–369. Cited by: §II-A.
-  (2017) Joint ctc-attention based end-to-end speech recognition using multi-task learning. In 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4835–4839. Cited by: §II-B.
-  (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §III-C.
-  (2018) Grotoco@ slam: second language acquisition modeling with simple features, learners and task-wise models. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 206–211. Cited by: §II-A.
-  (2014) An introduction to second language acquisition research. Routledge. Cited by: §I.
-  (2016) Hierarchical clustering multi-task learning for joint human action grouping and recognition. IEEE transactions on pattern analysis and machine intelligence 39 (1), pp. 102–114. Cited by: §II-B.
-  (2016) Recurrent neural network for text classification with multi-task learning. arXiv preprint arXiv:1605.05101. Cited by: §II-B.
-  (2015) Multi-task sequence to sequence learning. arXiv preprint arXiv:1511.06114. Cited by: §II-B.
-  (2014) Addressing the rare word problem in neural machine translation. arXiv preprint arXiv:1410.8206. Cited by: §III-B.
-  (2008) Visualizing data using t-sne. Journal of machine learning research 9 (Nov), pp. 2579–2605. Cited by: §V-C.
-  (2016) Cross-stitch networks for multi-task learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3994–4003. Cited by: §II-B.
-  (2018) Context based approach for second language acquisition. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 212–216. Cited by: §II-A.
-  (2018) Second language acquisition modeling: an ensemble approach. arXiv preprint arXiv:1806.04525. Cited by: 3rd item, §IV-E, TABLE II, TABLE V.
-  (2017) Bayesian knowledge tracing, logistic models, and beyond: an overview of learner modeling techniques. User Modeling and User-Adapted Interaction 27 (3-5), pp. 313–350. Cited by: §I.
-  (2018) Deep contextualized word representations. In Proc. of NAACL, Cited by: §III-B.
-  (2019) . IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (1), pp. 121–135. Cited by: §II-B.
-  (2018) Modeling second-language learning from a psychological perspective. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 223–230. Cited by: §II-A, §III-B, 2nd item, TABLE II, TABLE V.
-  (2017) An overview of multi-task learning in deep neural networks. CoRR abs/1706.05098. External Links: Cited by: §II-B.
-  (2018) Second language acquisition modeling. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 56–65. Cited by: 1st item, §IV-A, TABLE II, TABLE V.
-  (2014) Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958. Cited by: §IV-A.
-  (2018) A memory-sensitive classification model of errors in early second language learning. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 231–239. Cited by: §II-A.
-  (2015) Deep neural networks employing multi-task learning and stacked bottleneck features for speech synthesis. In 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4460–4464. Cited by: §II-B.
-  (2018) CLUF: a neural model for second language acquisition modeling. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 374–380. Cited by: §II-A, §III-B, TABLE II.
-  (2018) Neural sequence modelling for learner error prediction. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 381–388. Cited by: §II-A.
-  (2014) Facial landmark detection by deep multi-task learning. In European conference on computer vision, pp. 94–108. Cited by: §II-B.