Is Information Extraction Solved by ChatGPT? An Analysis of Performance, Evaluation Criteria, Robustness and Errors

05/23/2023
by Ridong Han, et al.

ChatGPT has stimulated a research boom in the field of large language models. In this paper, we assess the capabilities of ChatGPT from four perspectives: Performance, Evaluation Criteria, Robustness and Error Types. Specifically, we first evaluate ChatGPT's performance on 17 datasets with 14 IE sub-tasks under the zero-shot, few-shot and chain-of-thought scenarios, and find a huge performance gap between ChatGPT and SOTA results. Next, we rethink this gap and propose a soft-matching strategy for evaluation to more accurately reflect ChatGPT's performance. Then, we analyze the robustness of ChatGPT on 14 IE sub-tasks, and find that: 1) ChatGPT rarely outputs invalid responses; 2) irrelevant context and long-tail target types greatly affect ChatGPT's performance; 3) ChatGPT does not understand subject-object relationships well in the RE task. Finally, we analyze the errors of ChatGPT, and find that "unannotated spans" is the most dominant error type. This raises concerns about the quality of annotated data, and indicates the possibility of annotating data with ChatGPT. The data and code are released on GitHub.
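The abstract does not spell out how the soft-matching strategy works, so the following is only an illustrative sketch of the general idea: a predicted span counts as correct when it is sufficiently similar to a gold span of the same type, rather than identical to it. The function names (`soft_match`, `exact_match`, `f1`), the use of `difflib.SequenceMatcher`, and the 0.5 similarity threshold are all assumptions made for illustration, not the authors' implementation.

```python
from difflib import SequenceMatcher


def exact_match(pred: str, gold: str) -> bool:
    """Strict criterion: spans must be identical after normalization."""
    return pred.strip().lower() == gold.strip().lower()


def soft_match(pred: str, gold: str, threshold: float = 0.5) -> bool:
    """Lenient criterion (hypothetical): spans must be similar enough.

    The threshold value is an assumption chosen for this example.
    """
    p, g = pred.strip().lower(), gold.strip().lower()
    if p == g:
        return True
    return SequenceMatcher(None, p, g).ratio() >= threshold


def f1(preds, golds, match) -> float:
    """Micro F1 over (span, type) pairs; each gold item is matched at most once."""
    unmatched = list(golds)
    tp = 0
    for span, label in preds:
        for i, (g_span, g_label) in enumerate(unmatched):
            if label == g_label and match(span, g_span):
                tp += 1
                del unmatched[i]
                break
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(golds) if golds else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example: the model includes a title that the annotation leaves out.
golds = [("Barack Obama", "PER"), ("Hawaii", "LOC")]
preds = [("President Barack Obama", "PER"), ("Hawaii", "LOC")]

print(f1(preds, golds, exact_match))  # 0.5
print(f1(preds, golds, soft_match))   # 1.0
```

In this toy case, exact matching penalizes a prediction that differs from the annotation only in its span boundary, while the lenient criterion accepts it. That is the kind of gap a soft-matching evaluation is meant to narrow.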

research
10/06/2022

Retrieval of Soft Prompt Enhances Zero-Shot Task Generalization

During zero-shot inference with language models (LMs), using hard prompt...
research
05/24/2023

Self-ICL: Zero-Shot In-Context Learning with Self-Generated Demonstrations

Large language models (LMs) have exhibited superior in-context learning ...
research
08/19/2023

FinEval: A Chinese Financial Domain Knowledge Evaluation Benchmark for Large Language Models

Large language models (LLMs) have demonstrated exceptional performance i...
research
08/21/2023

Dynamic Strategy Chain: Dynamic Zero-Shot CoT for Long Mental Health Support Generation

Long counseling Text Generation for Mental health support (LTGM), an inn...
research
07/05/2023

Comparative Analysis of GPT-4 and Human Graders in Evaluating Praise Given to Students in Synthetic Dialogues

Research suggests that providing specific and timely feedback to human t...
research
10/23/2022

TAPE: Assessing Few-shot Russian Language Understanding

Recent advances in zero-shot and few-shot learning have shown promise fo...
