Are pre-trained text representations useful for multilingual and multi-dimensional language proficiency modeling?

by Taraka Rama, et al.

The development of language proficiency models for non-native learners has been an active area of NLP research for the past few years. Although language proficiency is multidimensional in nature, existing research typically models a single "overall proficiency" score. Further, existing approaches consider only one language at a time. This paper describes our experiments and observations on the role of pre-trained and fine-tuned multilingual embeddings in multi-dimensional, multilingual language proficiency classification. We report experiments with three languages (German, Italian, and Czech) and model seven dimensions of proficiency, ranging from vocabulary control to sociolinguistic appropriateness. Our results indicate that while fine-tuned embeddings are useful for multilingual proficiency modeling, no single feature set achieves consistently the best performance across all dimensions of language proficiency. All code, data and related supplementary material can be found at:






