X-FACTR: Multilingual Factual Knowledge Retrieval from Pretrained Language Models

10/13/2020
by Zhengbao Jiang, et al.

Language models (LMs) have proven surprisingly successful at capturing factual knowledge by completing cloze-style fill-in-the-blank questions such as "Punta Cana is located in _." However, while knowledge is both written and queried in many languages, studies on LMs' factual representation ability have almost invariably been performed on English. To assess factual knowledge retrieval in LMs in different languages, we create a multilingual benchmark of cloze-style probes for typologically diverse languages. To properly handle language variations, we expand probing methods from single- to multi-word entities, and develop several decoding algorithms to generate multi-token predictions. Extensive experimental results provide insights about how well (or poorly) current state-of-the-art LMs perform at this task in languages with more or fewer available resources. We further propose a code-switching-based method to improve the ability of multilingual LMs to access knowledge, and verify its effectiveness on several benchmark languages. Benchmark data and code have been released at https://x-factr.github.io.
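As a rough illustration of the multi-token probing idea described above, the sketch below fills a cloze template with one or more mask tokens and keeps the candidate whose tokens score best on average. It is only a minimal sketch under stated assumptions, not the paper's decoding algorithms: the `[X]`/`[Y]` placeholder convention, the `[MASK]` token, the helper names, and the `toy_scorer` probabilities are all hypothetical stand-ins for a real masked LM.

```python
import math
from typing import Callable, Dict, List, Tuple

MASK = "[MASK]"  # assumed mask token; real models define their own

# A scorer maps a prompt to one token->probability dict per mask position.
Scorer = Callable[[str], List[Dict[str, float]]]


def build_prompts(template: str, subject: str, max_masks: int) -> List[str]:
    """Fill [X] with the subject and [Y] with 1..max_masks mask tokens."""
    filled = template.replace("[X]", subject)
    return [
        filled.replace("[Y]", " ".join([MASK] * n))
        for n in range(1, max_masks + 1)
    ]


def decode_independent(prompts: List[str], scorer: Scorer) -> Tuple[List[str], float]:
    """Pick the argmax token at each mask independently, then keep the
    prompt length whose tokens have the highest mean log-probability."""
    best_tokens: List[str] = []
    best_score = -math.inf
    for prompt in prompts:
        dists = scorer(prompt)
        tokens = [max(d, key=d.get) for d in dists]
        score = sum(math.log(d[t]) for d, t in zip(dists, tokens)) / len(tokens)
        if score > best_score:
            best_tokens, best_score = tokens, score
    return best_tokens, best_score


def toy_scorer(prompt: str) -> List[Dict[str, float]]:
    """Hypothetical stand-in for a masked LM; the numbers are made up."""
    table = {
        1: [{"Spain": 0.3, "Dominican": 0.4, "Mexico": 0.3}],
        2: [{"Dominican": 0.8, "Costa": 0.2}, {"Republic": 0.9, "Rica": 0.1}],
        3: [{"the": 0.5, "a": 0.5},
            {"Dominican": 0.5, "Costa": 0.5},
            {"Republic": 0.5, "Rica": 0.5}],
    }
    return table[prompt.count(MASK)]


prompts = build_prompts("[X] is located in [Y].", "Punta Cana", max_masks=3)
tokens, score = decode_independent(prompts, toy_scorer)
print(" ".join(tokens), round(score, 3))
```

Averaging the log-probabilities keeps longer candidates from being penalized simply for having more tokens, which is one plausible way to compare single- and multi-word entities; other normalizations are equally possible.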


