Multi-task Learning for Low-resource Second Language Acquisition Modeling

08/25/2019 ∙ by Yong Hu, et al. ∙ Beijing Institute of Technology; Baidu, Inc.

Second language acquisition (SLA) modeling aims to predict whether second language learners can correctly answer questions according to what they have learned. It is a fundamental building block of personalized learning systems and has attracted increasing attention recently. However, as far as we know, almost all existing methods cannot work well in low-resource scenarios because of a lack of training data. Fortunately, there are latent common patterns among different language-learning tasks, which gives us an opportunity to solve the low-resource SLA modeling problem. Inspired by this idea, in this paper we propose a novel SLA modeling method that learns the latent common patterns among different language-learning datasets by multi-task learning and applies them to improve prediction performance in low-resource scenarios. Extensive experiments show that the proposed method performs much better than the state-of-the-art baselines in the low-resource scenario, and it also obtains a slight improvement in the non-low-resource scenario.




I Introduction

Knowledge tracing (KT) is the task of modeling how much knowledge students have obtained over time, so that we can accurately predict how students will perform on future exercises and arrange study plans dynamically according to their real-time situations [2, 24]. In particular, second language acquisition (SLA) modeling is a kind of KT in the field of language learning. With the increasing importance of language-learning activity in people's daily life [15], SLA modeling has attracted more and more attention; for example, NAACL 2018 held a public SLA modeling challenge. Therefore, in this paper, we focus on SLA modeling.

SLA modeling concerns the learning process of a specific language, so each SLA modeling task has a corresponding language, e.g., English, Spanish, or French. Each language dataset is composed of many exercises, and an exercise is the smallest data unit. An exercise has one of three possible types, i.e., Listen, Translation, and Reverse Tap, and the answer to an exercise is always a sentence regardless of its type. In an exercise, a student answers the given question by writing an answer sentence. The student's sentence is then compared word by word with the correct sentence to evaluate the student's ability. As shown in Fig. 1 (A), taking an English listening exercise as an example, the correct sentence is "I love my mother and my father" and the student's answer is "I love mader and fhader"; three words are answered correctly. The SLA modeling task is therefore to predict whether a student can answer each word correctly according to the exercise information (meta information, and the correct sentence with its linguistic information). Thus, it can simply be taken as a word-level binary classification task.
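As an illustration, the word-by-word comparison above can be sketched in a few lines. This is a simplified positional comparison rather than the proper token alignment the dataset uses, so on the example from Fig. 1 (A) it marks only two words correct where alignment would give three:

```python
def word_level_labels(correct, answer):
    """Compare a student's answer against the correct sentence word by word.

    Returns (word, label) pairs where label 1 means the word was answered
    correctly.  Simplified positional comparison; the real dataset uses a
    proper token alignment.
    """
    correct_words = correct.lower().split()
    answer_words = answer.lower().split()
    labels = []
    for i, word in enumerate(correct_words):
        ok = i < len(answer_words) and answer_words[i] == word
        labels.append((word, 1 if ok else 0))
    return labels

pairs = word_level_labels("I love my mother and my father",
                          "I love mader and fhader")
```

The binary labels produced this way are exactly the word-level classification targets the model is trained to predict.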

Fig. 1: (A) Illustration of an example of the SLA modeling task. (B) Illustration of two kinds of low-resource phenomena and a comparison of our method with existing methods.

In the SLA modeling task, low-resource settings are a common phenomenon that significantly affects the training process. This phenomenon is mainly caused by two factors: (1) for some specific language-learning datasets, e.g., Czech, the dataset may be very small because we cannot collect enough language-learning exercises; (2) a user encounters a cold start scenario when starting to learn a new language. However, almost all existing methods for SLA modeling train a separate model for each language-learning dataset, so their performance largely depends on the size of the training data, and they can hardly work well in low-resource scenarios. Fig. 1 (B) illustrates an example. Suppose that we have two languages, English and Czech; existing methods will train two separate models, model_en and model_cz. These two models will perform poorly in two low-resource scenarios: (1) if the English dataset is large, model_en will perform well, but the small size of the Czech dataset may significantly hinder the performance of model_cz; (2) suppose that a user has a large number of exercises for learning Czech, but when he/she begins to learn English, the number of his/her English exercises will be very small, possibly even zero; thus, model_en can hardly predict the answers of his/her English exercises well.

Intuitively, there are lots of common patterns among different language-learning tasks, such as the learning habits of users and grammar learning skills. If the latent common patterns across these language-learning tasks can be well learned, they can be used to solve the low-resource SLA modeling problem.

Inspired by this idea, in this paper we propose a novel multi-task learning method for SLA modeling: a unified model that processes several language-learning datasets simultaneously. Specifically, the proposed model jointly learns features shared across all language-learning datasets, which reflect the inner nature of language-learning activity and can be taken as important prior knowledge when dealing with small language-learning datasets. Moreover, each user's embedding is shared, so the learning habits and language talents of the user can be shared in the unified model across other low-resource language-learning tasks. Therefore, when a user begins to learn a new language, the unified model can work well even though there is no exercise data for this user in that language.

The main contributions of this paper are three-fold. (1) As far as we know, this is the first work applying a multi-task neural network to SLA modeling, and we effectively solve the problem of insufficient training data in low-resource scenarios. (2) We deeply study the common patterns among different languages and reveal the inner nature of language learning. (3) Extensive experiments show that our method performs much better than the state-of-the-art baselines in low-resource scenarios, and it also obtains a slight improvement in the non-low-resource scenario.

II Related Work

II-A SLA Modeling

Existing methods for SLA modeling can be roughly divided into three categories: (a) logistic regression based methods, (b) tree ensemble methods, and (c) sequence modeling methods. (a) The logistic regression based methods [14, 22, 3] take the meta and context features provided by the datasets, together with other manually constructed features, as input, and output the probability of answering each word correctly. These methods are simple, but their performance is limited. (b) The tree ensemble methods (e.g., Gradient Boosting Decision Trees (GBDT)) [31, 27, 4] can powerfully capture non-linear relationships between features. Although their input and output are the same as (a), they are generally better than the methods in (a). (c) The sequence modeling methods (e.g., Recurrent Neural Networks (RNNs)) [33, 34, 11] use neural networks, especially RNNs, so that they can capture users' performance over time. The performance of these methods is also very competitive.

However, the methods above can hardly work well in low-resource scenarios because their performance largely depends on the size of the training data.

II-B Multi-Task Learning

Multi-task learning (MTL) has been widely used in various areas, such as machine learning [21, 16, 18], natural language processing [5, 17, 7], speech recognition [6, 12, 32] and computer vision [8, 26, 35]. It effectively increases the sample size used for training the model. Thus, it can improve generalization by leveraging the domain-specific information contained in related tasks, and enables the model to obtain a better shared representation across related tasks.

MTL is typically done with hard or soft parameter sharing of hidden layers and hard parameter sharing is the most commonly used approach to MTL in neural networks [28]. It is generally applied by sharing the hidden layers between all tasks, while keeping several task-specific output layers.
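Hard parameter sharing can be sketched in a few lines: one hidden layer whose weights are shared by every task, plus one small output head per task. All dimensions and weights below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

# Hard parameter sharing: one shared hidden layer, one output head per task.
d_in, d_hid, n_tasks = 8, 16, 3
W_shared = rng.normal(size=(d_in, d_hid))
heads = [rng.normal(size=(d_hid, 1)) for _ in range(n_tasks)]

def forward(x, task_id):
    h = relu(x @ W_shared)      # parameters shared by every task
    return h @ heads[task_id]   # task-specific output layer

x = rng.normal(size=(4, d_in))  # a mini-batch of 4 examples
outs = [forward(x, t) for t in range(n_tasks)]
```

Gradients flowing back from any task's head update `W_shared`, which is how the shared layer ends up encoding patterns common to all tasks.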

SLA modeling consists of different language-learning tasks, and these tasks have something in common, which gives us an opportunity to use MTL to improve the overall performance.

III Model

III-A Problem Definition

Fig. 2: Illustration of our encoder-decoder structure

Suppose there are $K$ second language-learning datasets $D_1, \dots, D_K$, and the $k$-th dataset $D_k$ is composed of exercises $\{e_1, e_2, \dots\}$, where $e_i$ is the $i$-th exercise in the dataset.

There are two kinds of information in an exercise $e_i$, i.e., the meta information and the language-related context information.

The meta information contains two user-related fields: (1) user: the unique identifier of each student, e.g., D2inf5; (2) country: the student's country, e.g., CN; and the following five exercise-related fields: (1) days: the number of days since the student started learning this language, e.g., 1.793; (2) client: the student's device platform, e.g., android; (3) session: the session type, e.g., lesson; (4) format (or type): the exercise type, e.g., Listen; (5) time: the amount of time in seconds it took the student to construct and submit the whole answer, e.g., 16 s. The meta information is shared among all language datasets.

The context information in the exercise includes the word sequence $w_1, \dots, w_n$ and the words' linguistic sequences, such as the POS-tag sequence $p_1, \dots, p_n$, which gives the POS tag of each word. The context information is unique to each language-learning dataset.

Finally, $e_i$ has a word-level label sequence $y_1, \dots, y_n$, where $y_j \in \{0, 1\}$: $y_j = 1$ means the $j$-th word is answered correctly, and $y_j = 0$ means the opposite.

Our task is to build a model based on users' exercises, and further to predict the word-level label sequences of future exercises.
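Concretely, one exercise record under this problem definition might look like the following structure. The field names are illustrative, taken from the descriptions above rather than from the actual dataset schema:

```python
# A hypothetical record for one exercise e_i; field names follow the paper's
# description, and the actual dataset field names may differ.
exercise = {
    "meta": {
        "user": "D2inf5", "country": "CN", "days": 1.793,
        "client": "android", "session": "lesson",
        "format": "listen", "time": 16,
    },
    "words": ["I", "love", "my", "mother", "and", "my", "father"],
    "pos_tags": ["PRON", "VERB", "DET", "NOUN", "CONJ", "DET", "NOUN"],
    "labels": [1, 1, 0, 0, 0, 0, 0],  # 1 = answered correctly
}
assert len(exercise["words"]) == len(exercise["labels"])
```

The model consumes `meta` through the meta encoder and `words` (with character sequences) through the context encoder, and is trained to reproduce `labels`.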

III-B Encoder and Decoder Structure

Our model is an encoder-decoder structure with two encoders, i.e., a meta encoder and a context encoder, and a decoder. We use the meta encoder to learn the non-linear relationships within the meta information, the context encoder to learn the representation of the word sequence, and the decoder to generate the final prediction for each word. The overall structure of the proposed model is shown in Fig. 2.

Meta Encoder: The meta encoder is a multi-layer perceptron (MLP) based neural network that takes the metadata as input. First, the inputs are converted into high-dimensional representations by embedding layers, which are randomly initialized and map each input into a 150-dimensional vector. After the embedding step, we separately concatenate the user-related embeddings and the exercise-related embeddings, and send them into $\text{MLP}_{user}$ and $\text{MLP}_{exe}$ to get the representation of user-related meta information $r_{user}$ and the representation of exercise-related meta information $r_{exe}$, respectively. Finally, we concatenate $r_{user}$ and $r_{exe}$, and send the result to $\text{MLP}_{meta}$ to obtain the representation of the whole meta information $r_{meta}$. The meta encoder can be formulated as

$$r_{user} = \text{MLP}_{user}([E(user); E(country)])$$
$$r_{exe} = \text{MLP}_{exe}([E(days); E(client); E(session); E(format); E(time)])$$
$$r_{meta} = \text{MLP}_{meta}([r_{user}; r_{exe}])$$

where, for the sake of simplicity, the exercise index is omitted from the subscripts, $[\cdot\,;\cdot]$ denotes concatenation, and $E(\cdot)$ is the embedded representation of each meta field.
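As a rough illustration of this data flow, with untrained random stand-ins for the learned embedding tables and MLPs (all weights and helper names below are invented):

```python
import numpy as np

rng = np.random.default_rng(0)
EMB = 150  # embedding size from the paper

def embed(_field):
    # Stand-in for a learned embedding lookup table.
    return rng.normal(size=EMB)

def mlp(x, out_dim):
    # Stand-in for a trained single-layer MLP, using random weights.
    W = rng.normal(size=(x.size, out_dim))
    return np.tanh(x @ W)

user_emb = np.concatenate([embed("user"), embed("country")])
exe_emb = np.concatenate([embed(f) for f in
                          ("days", "client", "session", "format", "time")])
r_user = mlp(user_emb, 150)   # user-related meta representation
r_exe = mlp(exe_emb, 150)     # exercise-related meta representation
r_meta = mlp(np.concatenate([r_user, r_exe]), 150)  # whole meta representation
```

The point is the wiring, not the weights: user and exercise fields are embedded, grouped, compressed separately, and then fused into a single 150-dimensional meta vector.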

Context Encoder: The context encoder consists of three sub-encoders, i.e., a word-level context encoder, a char-level Long Short-Term Memory (LSTM) context encoder, and a char-level Convolutional Neural Network (CNN) context encoder. The word-level encoder can capture better semantics and longer dependencies than the character-level encoders [33], while modeling the character sequence lets us partially avoid the out-of-vocabulary (OOV) problem [19, 1]. Furthermore, we only use the word sequences in the datasets, without any of the provided linguistic information, because previous work [27] has pointed out that the linguistic information given by the datasets contains mistakes. Through the two character-level encoders, we can still learn certain word information and linguistic rules.

Given the word sequence $w_1, \dots, w_n$, the word-level context encoder is computed as

$$x_i = E_{ELMo}(w_i), \qquad h^{word}_i = \text{BiLSTM}_{word}(x_1, \dots, x_n)_i$$

where $w_i$ is the $i$-th word in the sequence and $x_i$ is its word embedding; here, we use pre-trained ELMo [25] as the look-up table. $h^{word}_i$ is the concatenation of the last layer's hidden states of the forward and backward cells of $\text{BiLSTM}_{word}$, and is the output of the word-level context encoder.

The char-level LSTM context encoder is computed over the character sequence $c_1, \dots, c_m$ of each word $w_i$. This can be formulated as

$$u_i = \text{LSTM}_{char}(c_1, \dots, c_m), \qquad h^{clstm}_i = \text{BiLSTM}_{clstm}(u_1, \dots, u_n)_i$$

where $u_i$ is the last hidden state of the last layer of $\text{LSTM}_{char}$, and $h^{clstm}_i$ is the concatenation of the last layer's hidden states of the forward and backward cells of $\text{BiLSTM}_{clstm}$. It is the output of the char-level LSTM context encoder.

The char-level CNN context encoder can be similarly formulated as

$$v_i = \text{CNN}_{char}(c_1, \dots, c_m), \qquad h^{ccnn}_i = \text{BiLSTM}_{ccnn}(v_1, \dots, v_n)_i$$

where $v_i$ is the result of the CNN encoder for word $w_i$, and $h^{ccnn}_i$ is the concatenation of the last layer's hidden states of the forward and backward cells of $\text{BiLSTM}_{ccnn}$. It is the output of the char-level CNN context encoder.

The final output of the context encoder is generated by a single-layer MLP whose input is the concatenation of $h^{word}_i$, $h^{clstm}_i$, and $h^{ccnn}_i$:

$$c_i = \text{MLP}_{ctx}([h^{word}_i; h^{clstm}_i; h^{ccnn}_i])$$

where $c_i$ is the final context representation of the word $w_i$.

Decoder: The decoder takes the output of the meta encoder and the output of the context encoder as inputs, and the prediction for word $w_i$ is computed with an MLP:

$$\hat{y}_i = \text{MLP}_{dec}([r_{meta}; c_i])$$

where the activation function of $\text{MLP}_{dec}$ is the sigmoid function.
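A minimal sketch of this decoding step, with random untrained weights standing in for the learned MLP (vector sizes follow the paper's 150-dimensional representations; the single linear layer is a simplification):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def decode(r_meta, c_i, W):
    # Probability that word w_i is answered correctly, computed from the
    # concatenated meta and context representations.
    return sigmoid(np.concatenate([r_meta, c_i]) @ W)

rng = np.random.default_rng(0)
r_meta = rng.normal(size=150)   # meta representation of the exercise
c_i = rng.normal(size=150)      # context representation of word w_i
W = rng.normal(size=300) * 0.01 # stand-in decoder weights
p = decode(r_meta, c_i, W)      # a probability in (0, 1)
```

Because the sigmoid squashes the output into (0, 1), `p` can be read directly as the predicted probability of a correct answer for the word.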

III-C Multi-Task Learning

Fig. 3: Illustration of multi-task learning

As shown in Fig. 3, suppose there are $K$ languages, each with a corresponding dataset, i.e., $D_1, \dots, D_K$. Since our task is to predict the exercise accuracy of language learners on each language, we can regard these predictions as different tasks. Therefore, there are $K$ tasks.

We define a cross-entropy loss for each task, which encourages correct predictions and punishes incorrect ones. Specifically, for the $k$-th task, we have

$$L_k = -\sum_j \left[ \lambda\, y_j \log \hat{y}_j + (1 - \lambda)(1 - y_j) \log (1 - \hat{y}_j) \right]$$

where $\lambda$ is a hyper-parameter to balance the negative and positive samples.
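A numpy sketch of such a balanced cross-entropy, assuming the balancing weight scales the positive term and its complement the negative term (our reading of the scheme; the paper's exact weighting may differ):

```python
import numpy as np

def weighted_bce(y_true, y_pred, lam=0.5, eps=1e-12):
    """Cross-entropy with a weight lam balancing positive vs. negative
    samples; a sketch of the per-task loss."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    pos = lam * y_true * np.log(y_pred)            # correct-answer term
    neg = (1 - lam) * (1 - y_true) * np.log(1 - y_pred)  # mistake term
    return -np.mean(pos + neg)

loss = weighted_bce([1, 0, 1, 1], [0.9, 0.2, 0.8, 0.6], lam=0.7)
```

With `lam > 0.5` the loss pays more attention to the majority "correct" labels; with `lam < 0.5` it emphasizes the rarer mistakes, which matters here because roughly 85% of words are answered correctly.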

In multi-task learning, the parameters of the meta encoder and the decoder are shared, and each task only has its own parameters in the context encoder part; thus the whole model has only one meta encoder, one decoder, and $K$ context encoders. In this way, the common patterns extracted from all language datasets can be utilized simultaneously by the shared meta encoder and decoder.

In the training process, one mini-batch contains data from all $K$ datasets; all examples are sent to the same meta encoder and decoder, but each is sent to the context encoder corresponding to its language. Thus, the final loss over the $K$ tasks is calculated as

$$L = \sum_{k=1}^{K} L_k$$
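This routing can be sketched as follows; the encoder and decoder stand-ins are trivial placeholders rather than the real networks, and serve only to show that the language type selects the context encoder while the meta encoder and decoder are shared:

```python
# One multi-task step: every example goes through the shared meta encoder
# and decoder, but through the context encoder matching its language.
context_encoders = {
    "en_es": lambda words: f"ctx_en_es({len(words)} words)",
    "es_en": lambda words: f"ctx_es_en({len(words)} words)",
    "fr_en": lambda words: f"ctx_fr_en({len(words)} words)",
}

def meta_encoder(meta):      # shared across all tasks
    return f"meta({meta['user']})"

def decoder(r_meta, r_ctx):  # shared across all tasks
    return (r_meta, r_ctx)

mini_batch = [
    {"lang": "en_es", "meta": {"user": "u1"}, "words": ["I", "run"]},
    {"lang": "fr_en", "meta": {"user": "u2"}, "words": ["je", "cours"]},
]
preds = [decoder(meta_encoder(ex["meta"]),
                 context_encoders[ex["lang"]](ex["words"]))
         for ex in mini_batch]
```

Because the shared components see every language in every mini-batch, gradients from all tasks shape them jointly, which is where the cross-language common patterns accumulate.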
Finally, we use the Adam algorithm [13] to train the model.

IV Experiments

IV-A Datasets and Settings

                     en_es    es_en    fr_en
#Exercises (Train)  824,012  731,896  326,792
#Exercises (Dev)    115,770   96,003   43,610
#Exercises (Test)   114,586   93,145   41,753
#Unique words         2,226    2,915    2,178
#Unique users         2,593    2,643    1,213
#Words / exercise      3.18     2.7      2.84
OOV ratio (Test)       4.5%    10.0%     5.9%
Correct ratio           87%      86%      84%
Incorrect ratio         13%      14%      16%
TABLE I: The statistics of the Duolingo SLA modeling datasets

We conduct experiments on the Duolingo SLA modeling shared datasets, which comprise three datasets collected from English learners who speak Spanish (en_es), Spanish learners who speak English (es_en), and French learners who speak English (fr_en) [29]. Table I shows the basic statistics of each dataset.

Fig. 4: Comparison of our method and baselines on training data of different sizes

We compare our method with the following state-of-the-art baselines:

  • LR: the official baseline provided by Duolingo [29]. It is a simple logistic regression using all the meta and context information provided by the datasets.

  • GBDT: NYU's method [27], the best among all tree ensemble methods. It uses an ensemble of GBDTs with the existing dataset features and manually constructed features based on psychological theories.

  • RNN: singsound's method [23], the best among all sequence modeling methods. It uses an RNN architecture with four types of encoders representing different types of features: token context, linguistic information, user data, and exercise format.

  • ours−MTL: our encoder-decoder model without multi-task learning; a separate model is trained for each language-learning dataset.

In the experiments, both the embedding size and the hidden size are set to 150. Dropout [30] regularization is applied with a dropout rate of 0.5. We use the Adam optimization algorithm with a learning rate of 0.001.

IV-B Metrics

SLA modeling is in fact a word-level classification task, so we use the area under the ROC curve (AUC) [10] and the F1 score [9] as evaluation metrics.

  • AUC is calculated as

    $$\text{AUC} = P\big(f(x^+) > f(x^-)\big)$$

    where $P$ is the probability, $f$ is the trained classifier, $x^+$ is an instance randomly drawn from the positive samples, and $x^-$ is an instance randomly drawn from the negative samples.

  • F1 is calculated as

    $$F_1 = \frac{2 \cdot P \cdot R}{P + R}$$

    where $P$ and $R$ are the precision and recall of the trained model.
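Both metrics are easy to compute from scratch. The sketch below implements the pairwise definition of AUC (counting ties as half a win) and the standard F1 formula:

```python
def auc(scores_pos, scores_neg):
    """Probability that a random positive scores higher than a random
    negative, with ties counted as half."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

a = auc([0.9, 0.8, 0.4], [0.5, 0.3])  # 5 of 6 pairs ordered correctly
```

The pairwise form makes the meaning of AUC concrete: it is a ranking metric, insensitive to the classification threshold, which is why it suits the heavily imbalanced correct/incorrect label distribution here.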

Methods           en_es          es_en          fr_en
                 AUC    F1      AUC    F1      AUC    F1
LR [29]         0.774  0.190   0.746  0.175   0.771  0.281
GBDT [27]       0.859  0.468   0.835  0.420   0.854  0.493
RNN [33]        0.861  0.559   0.835  0.524   0.854  0.569
GBDT+RNN [23]   0.861  0.561   0.838  0.530   0.857  0.573
ours−MTL        0.863  0.564   0.837  0.527   0.857  0.575
ours            0.864  0.564   0.839  0.530   0.860  0.579
TABLE II: Comparison of our method with existing methods on different language datasets
Methods                                   en_es          es_en          fr_en
                                         AUC    F1      AUC    F1      AUC    F1
ours − meta encoder                     0.743  0.353   0.716  0.320   0.750  0.478
ours − word-level context encoder       0.862  0.559   0.838  0.526   0.858  0.575
ours − char-level LSTM context encoder  0.863  0.563   0.838  0.526   0.860  0.579
ours − char-level CNN context encoder   0.863  0.564   0.838  0.528   0.860  0.559
ours − both char-level context encoders 0.863  0.562   0.838  0.526   0.859  0.579
ours                                    0.864  0.564   0.839  0.530   0.860  0.579
TABLE III: Comparison of encoder removal
User       Dataset  Train  Dev  Test
RWDt7srk   es_en      361   68    19
           fr_en      519   80    51
t6nj6nr/   es_en      562  245   274
           fr_en      998    0     0
TABLE IV: The statistics of the two users (the numbers are word counts in exercises)
Methods      AUC    F1
LR [29]     0.765  0.083
GBDT [27]   0.751  0.187
RNN [23]    0.771  0.276
ours−MTL    0.770  0.210
ours        0.881  0.411
TABLE V: Comparison of our method and the baselines in the cold start scenario

IV-C Experiment on Small-scale Datasets

We first verify the advantages of our method in cases where the training data of the whole language-learning dataset is insufficient.

Specifically, we gradually decrease the size of the training data from 400K exercises (300K for fr_en) to 1K while keeping the development and test sets unchanged. Since the baseline methods only use a single language dataset for training, we only reduce the data of the corresponding language. For our multi-task learning method, we reduce the training data of one language dataset and keep the other two datasets unchanged.

The experimental results are shown in Fig. 4. Our method outperforms all the state-of-the-art baselines when the training data of a language dataset is insufficient, which is a huge improvement over the existing methods. For example, as shown in the AUC/en_es panel of Fig. 4, using only 1K training exercises, our multi-task learning method still obtains an AUC score of 0.738, while ours−MTL only reaches 0.640, and the existing RNN, GBDT and LR methods reach 0.659, 0.658 and 0.650, respectively. Introducing multi-task learning thus improves performance by nearly ten percentage points. Moreover, to achieve the same performance as our multi-task method with 1K training exercises, the methods without multi-task learning require more than 10K training exercises, ten times more than ours. Multi-task learning thus utilizes data from all language-learning datasets simultaneously and effectively alleviates the lack of data in a single language-learning dataset.

At the same time, we notice that ours−MTL is slightly worse than RNN and GBDT when the amount of training data is very small (1K, 5K, 10K). This is because our model does not use the linguistic features of the dataset, and a deep model overfits when the training data is insufficient. However, as the training data grows beyond 10K, ours−MTL becomes better than the existing RNN and GBDT. Thus, our encoder-decoder structure is very competitive with existing methods even without multi-task learning.

IV-D Experiment in the Cold Start Scenario

Further, we consider directly predicting a user's answers in a language without any training exercises of this user in that language at all. This is the cold start scenario, a situation that language-learning platforms must consider.

Specifically, the users RWDt7srk and t6nj6nr/ are both English speakers who learn both Spanish and French, so they have data in both the es_en and fr_en datasets. The statistics are shown in Table IV. For the baseline methods, we remove the data of these two users from the training and development sets of es_en, train a model, and then use the trained model to directly predict the data of these two users in the es_en test set. We conduct the same experiment with our multi-task method: the training data of these two users is also removed from the es_en dataset, while fr_en and en_es are unchanged.

The experimental results are shown in Table V. Without multi-task learning, directly predicting for new users performs very poorly. Compared with a method without multi-task learning such as ours−MTL, our multi-task learning method improves AUC by 11% and F1 by 20%. Thanks to multi-task learning, the information about these two users has been learned through the fr_en dataset. Therefore, although there is no training data for these two users in es_en, we can still obtain good performance.

IV-E Experiment in the Non-low-resource Scenario

The experiments above show that our method has a huge advantage over the existing methods in low-resource scenarios. In this section, we will observe the performance of our method in the non-low-resource scenario.

Specifically, we use all the data of the three language datasets to compare our methods with existing methods. This experiment corresponds exactly to the 2018 public SLA modeling challenge held by Duolingo. Here, we add a new baseline, GBDT+RNN. This is SanaLabs's method [23], which combines the predictions of a GBDT and an RNN, and it is the current best method on the 2018 public SLA modeling challenge.

As shown in Table II, although the improvement is not large, our method surpasses all existing methods on all three datasets and sets new best scores on all of them. For the smallest dataset, fr_en, our method obtains the largest improvement over ours−MTL. On the largest dataset, en_es, our method also improves the AUC score by 0.003 over the best existing method, GBDT+RNN. Therefore, our method also gains a slight improvement in the non-low-resource scenario.

V Model Analysis

V-A Component Analysis

Our encoder-decoder structure contains two encoders, i.e., the meta encoder and the context encoder, where the context encoder includes three sub-encoders, i.e., the word-level context encoder, the char-level LSTM context encoder and the char-level CNN context encoder. To explore the importance of each encoder, we conduct a component removal experiment.

Specifically, we remove one encoder component at a time, train a model, and record its performance on the test set. We also remove both char-level context encoders together and run the same experiment.

The experimental results are shown in Table III. The meta information is critical to the final result, much more important than the context encoder: if the meta encoder is removed, performance drops sharply. The reason is that with only a context encoder, the model merely fits the global word error distribution, completely ignoring each individual's situation, which defeats the purpose of adaptive learning.

Within the context encoder, the word-level encoder has a greater impact on the performance of our model than the char-level encoders.

V-B Metadata Analysis

The analysis above has shown that meta information is important for prediction. Obviously, different meta features have different influence, so we conduct a feature removal analysis to find the important features. Specifically, we remove each meta feature in turn and measure the performance of the model without it.

As shown in Fig. 5, the most important feature is the user (id). Without the user (id), model performance declines rapidly, because user information is the key to user-adaptive learning. This also shows that the most common pattern across learning different languages is the students themselves. Besides, the exercise format and the spent time also have a significant influence on the model.

V-C Visualization

In this part, we show what the meta encoder has learned from the three datasets through multi-task learning.

Fig. 5: Analysis of meta features removal
Fig. 6: User embedding cluster

We cluster the user embeddings with the k-means algorithm (k = 4), and calculate the average accuracy of each user and the overall average accuracy of each cluster. The embeddings are projected by t-SNE [20] for visualization. As shown in Fig. 6, every point represents a user and its color represents the average accuracy of this user: red means low accuracy and blue means high. The four large points indicate the cluster centers, and the value next to each center is the overall average accuracy of the corresponding cluster. Students with good grades and students with poor grades can be distinguished very well according to their user embeddings, so the user embeddings trained by our model contain rich information for the final prediction.

Vi Conclusion

In this paper, we have proposed a novel multi-task learning method for SLA modeling. As far as we know, this is the first work applying a multi-task neural network to SLA modeling and studying the common patterns among different languages. Extensive experiments show that our method performs much better than the state-of-the-art baselines in low-resource scenarios, and it also obtains a slight improvement in the non-low-resource scenario.


  • [1] M. Ballesteros, C. Dyer, and N. A. Smith (2015) Improved transition-based parsing by modeling characters instead of words with lstms. arXiv preprint arXiv:1508.00657. Cited by: §III-B.
  • [2] K. Bauman and A. Tuzhilin (2014) Recommending learning materials to students by identifying their knowledge gaps.. In RecSys Posters, Cited by: §I.
  • [3] Y. Bestgen (2018) Predicting second language learner successes and mistakes by means of conjunctive features. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 349–355. Cited by: §II-A.
  • [4] G. Chen, C. Hauff, and G. Houben (2018) Feature engineering for second language acquisition modeling. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 356–364. Cited by: §II-A.
  • [5] R. Collobert and J. Weston (2008) A unified architecture for natural language processing: deep neural networks with multitask learning. In Proceedings of the 25th international conference on Machine learning, pp. 160–167. Cited by: §II-B.
  • [6] L. Deng, G. Hinton, and B. Kingsbury (2013) New types of deep neural network learning for speech recognition and related applications: an overview. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 8599–8603. Cited by: §II-B.
  • [7] D. Dong, H. Wu, W. He, D. Yu, and H. Wang (2015) Multi-task learning for multiple language translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Vol. 1, pp. 1723–1732. Cited by: §II-B.
  • [8] R. Girshick (2015) Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 1440–1448. Cited by: §II-B.
  • [9] C. Goutte and E. Gaussier (2005) A probabilistic interpretation of precision, recall and f-score, with implication for evaluation. In European Conference on Information Retrieval, pp. 345–359. Cited by: §IV-B.
  • [10] J. A. Hanley and B. J. McNeil (1982) The meaning and use of the area under a receiver operating characteristic (roc) curve.. Radiology 143 (1), pp. 29–36. Cited by: §IV-B.
  • [11] M. Kaneko, T. Kajiwara, and M. Komachi (2018) TMU system for slam-2018. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 365–369. Cited by: §II-A.
  • [12] S. Kim, T. Hori, and S. Watanabe (2017) Joint ctc-attention based end-to-end speech recognition using multi-task learning. In 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4835–4839. Cited by: §II-B.
  • [13] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §III-C.
  • [14] S. Klerke, H. M. Alonso, and B. Plank (2018) Grotoco@ slam: second language acquisition modeling with simple features, learners and task-wise models. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 206–211. Cited by: §II-A.
  • [15] D. Larsen-Freeman and M. H. Long (2014) An introduction to second language acquisition research. Routledge. Cited by: §I.
  • [16] A. Liu, Y. Su, W. Nie, and M. Kankanhalli (2016) Hierarchical clustering multi-task learning for joint human action grouping and recognition. IEEE transactions on pattern analysis and machine intelligence 39 (1), pp. 102–114. Cited by: §II-B.
  • [17] P. Liu, X. Qiu, and X. Huang (2016) Recurrent neural network for text classification with multi-task learning. arXiv preprint arXiv:1605.05101. Cited by: §II-B.
  • [18] M. Luong, Q. V. Le, I. Sutskever, O. Vinyals, and L. Kaiser (2015) Multi-task sequence to sequence learning. arXiv preprint arXiv:1511.06114. Cited by: §II-B.
  • [19] M. Luong, I. Sutskever, Q. V. Le, O. Vinyals, and W. Zaremba (2014) Addressing the rare word problem in neural machine translation. arXiv preprint arXiv:1410.8206. Cited by: §III-B.
  • [20] L. v. d. Maaten and G. Hinton (2008) Visualizing data using t-sne. Journal of machine learning research 9 (Nov), pp. 2579–2605. Cited by: §V-C.
  • [21] I. Misra, A. Shrivastava, A. Gupta, and M. Hebert (2016) Cross-stitch networks for multi-task learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3994–4003. Cited by: §II-B.
  • [22] N. V. Nayak and A. R. Rao (2018) Context based approach for second language acquisition. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 212–216. Cited by: §II-A.
  • [23] A. Osika, S. Nilsson, A. Sydorchuk, F. Sahin, and A. Huss (2018) Second language acquisition modeling: an ensemble approach. arXiv preprint arXiv:1806.04525. Cited by: 3rd item, §IV-E, TABLE II, TABLE V.
  • [24] R. Pelánek (2017) Bayesian knowledge tracing, logistic models, and beyond: an overview of learner modeling techniques. User Modeling and User-Adapted Interaction 27 (3-5), pp. 313–350. Cited by: §I.
  • [25] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. In Proc. of NAACL, Cited by: §III-B.
  • [26] R. Ranjan, V. M. Patel, and R. Chellappa (2019) Hyperface: a deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (1), pp. 121–135. Cited by: §II-B.
  • [27] A. Rich, P. O. Popp, D. Halpern, A. Rothe, and T. Gureckis (2018) Modeling second-language learning from a psychological perspective. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 223–230. Cited by: §II-A, §III-B, 2nd item, TABLE II, TABLE V.
  • [28] S. Ruder (2017) An overview of multi-task learning in deep neural networks. CoRR abs/1706.05098. External Links: Link, 1706.05098 Cited by: §II-B.
  • [29] B. Settles, C. Brust, E. Gustafson, M. Hagiwara, and N. Madnani (2018) Second language acquisition modeling. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 56–65. Cited by: 1st item, §IV-A, TABLE II, TABLE V.
  • [30] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958. Cited by: §IV-A.
  • [31] B. Tomoschuk and J. Lovelett (2018) A memory-sensitive classification model of errors in early second language learning. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 231–239. Cited by: §II-A.
  • [32] Z. Wu, C. Valentini-Botinhao, O. Watts, and S. King (2015) Deep neural networks employing multi-task learning and stacked bottleneck features for speech synthesis. In 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4460–4464. Cited by: §II-B.
  • [33] S. Xu, J. Chen, and L. Qin (2018) CLUF: a neural model for second language acquisition modeling. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 374–380. Cited by: §II-A, §III-B, TABLE II.
  • [34] Z. Yuan (2018) Neural sequence modelling for learner error prediction. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 381–388. Cited by: §II-A.
  • [35] Z. Zhang, P. Luo, C. C. Loy, and X. Tang (2014) Facial landmark detection by deep multi-task learning. In European conference on computer vision, pp. 94–108. Cited by: §II-B.