ANALOGICAL - A New Benchmark for Analogy of Long Text for Large Language Models

05/08/2023
by Thilini Wijesiriwardene, et al.

Over the past decade, word-level analogies have played a significant role as an intrinsic measure for evaluating the quality of word embedding methods such as word2vec. Modern large language models (LLMs), however, are primarily evaluated with extrinsic measures based on benchmarks such as GLUE and SuperGLUE, and few studies have investigated whether LLMs can draw analogies between long texts. In this paper, we present ANALOGICAL, a new benchmark to intrinsically evaluate LLMs across a taxonomy of analogies of long text with six levels of complexity – (i) word, (ii) word vs. sentence, (iii) syntactic, (iv) negation, (v) entailment, and (vi) metaphor. Using thirteen datasets and three different distance measures, we evaluate the abilities of eight LLMs in identifying analogical pairs in the semantic vector space. Our evaluation finds that identifying analogies becomes increasingly challenging for LLMs at higher levels of the analogy taxonomy.
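As a rough illustration of what identifying analogical pairs in a semantic vector space can involve, the sketch below scores a pair of embedding vectors under several distance measures; a smaller distance suggests the two texts lie closer together. The abstract does not name the three measures used in the paper, so cosine, Euclidean, and Manhattan distance here are stand-ins chosen for illustration, and the function names and toy vectors are hypothetical rather than taken from the benchmark.

```python
import math
from typing import Callable, Dict, Sequence

Vector = Sequence[float]


def cosine_distance(u: Vector, v: Vector) -> float:
    """1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def euclidean_distance(u: Vector, v: Vector) -> float:
    """Straight-line (L2) distance between the two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan_distance(u: Vector, v: Vector) -> float:
    """Sum of absolute coordinate differences (L1 distance)."""
    return sum(abs(a - b) for a, b in zip(u, v))


# Assumption: these three measures are illustrative, not the paper's list.
MEASURES: Dict[str, Callable[[Vector, Vector], float]] = {
    "cosine": cosine_distance,
    "euclidean": euclidean_distance,
    "manhattan": manhattan_distance,
}


def pair_distances(u: Vector, v: Vector) -> Dict[str, float]:
    """Score one candidate pair of embeddings under every measure."""
    return {name: measure(u, v) for name, measure in MEASURES.items()}


# Toy 3-dimensional "embeddings"; a real model would produce
# vectors with hundreds of dimensions.
analogous_pair = ([1.0, 0.0, 1.0], [0.9, 0.1, 1.1])
unrelated_pair = ([1.0, 0.0, 1.0], [-1.0, 2.0, 0.0])

if __name__ == "__main__":
    print("analogous:", pair_distances(*analogous_pair))
    print("unrelated:", pair_distances(*unrelated_pair))
```

In this toy setup the analogous pair scores lower than the unrelated pair under all three measures. With real sentence embeddings the measures may disagree with one another, which is one reason to report more than one.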


