Reporting Score Distributions Makes a Difference: Performance Study of LSTM-networks for Sequence Tagging

07/31/2017
by Nils Reimers, et al.

In this paper, we show that reporting a single performance score is insufficient for comparing non-deterministic approaches. We demonstrate for common sequence tagging tasks that the seed value for the random number generator can result in statistically significant (p < 10^-4) differences for state-of-the-art systems. For two recent NER systems, we observe an absolute difference of one percentage point in F1-score depending on the selected seed value, so that these systems are perceived as either state-of-the-art or mediocre. Instead of publishing and reporting single performance scores, we propose comparing score distributions based on multiple executions. Based on the evaluation of 50,000 LSTM networks for five sequence tagging tasks, we present network architectures that both produce superior performance and are more stable with respect to the remaining hyperparameters.
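To make the idea of comparing score distributions concrete, the following is a minimal sketch, not the authors' evaluation code. It assumes that each system has been trained several times with different seeds and that one score per run (for example, test F1) has been collected. The function names `summarize` and `permutation_test` are hypothetical, and the permutation test on the difference of means is one possible choice of significance test; the abstract does not state which test the paper uses.

```python
import random
import statistics


def summarize(scores):
    """Describe a score distribution instead of reporting a single number."""
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),
        "min": min(scores),
        "max": max(scores),
    }


def permutation_test(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Two-sided permutation test on the difference of mean scores.

    Assumption: scores_a and scores_b each hold one score per training run
    (one run per random seed) of systems A and B. This test is a stand-in
    chosen for illustration, not necessarily the test used in the paper.
    """
    rng = random.Random(seed)
    observed = abs(statistics.mean(scores_a) - statistics.mean(scores_b))
    pooled = list(scores_a) + list(scores_b)
    n_a = len(scores_a)
    hits = 0
    for _ in range(n_resamples):
        # Randomly reassign the pooled runs to the two systems.
        rng.shuffle(pooled)
        diff = abs(statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one smoothing keeps the estimate away from an impossible p = 0.
    return (hits + 1) / (n_resamples + 1)


if __name__ == "__main__":
    # Placeholder scores invented for illustration, NOT results from the paper.
    system_a = [0.905, 0.899, 0.910, 0.902, 0.907]
    system_b = [0.901, 0.896, 0.908, 0.903, 0.899]
    print(summarize(system_a))
    print(summarize(system_b))
    print("p-value:", permutation_test(system_a, system_b))
```

Reporting the mean, spread, and range over several seeds, together with a test on the two samples of scores, shows whether an observed gap between two systems exceeds what seed variation alone would produce.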


research
07/21/2017

Optimal Hyperparameters for Deep LSTM-Networks for Sequence Labeling Tasks

Selecting optimal parameters for a neural network architecture can often...
research
03/26/2018

Why Comparing Single Performance Scores Does Not Allow to Draw Conclusions About Machine Learning Approaches

Developing state-of-the-art approaches for specific tasks is a major dri...
research
06/24/2018

Character-Level Feature Extraction with Densely Connected Networks

Generating character-level features is an important step for achieving g...
research
12/26/2018

A New Concept of Deep Reinforcement Learning based Augmented General Sequence Tagging System

In this paper, a new deep reinforcement learning based augmented general...
research
09/10/2018

Toward a Standardized and More Accurate Indonesian Part-of-Speech Tagging

Previous work in Indonesian part-of-speech (POS) tagging are hard to com...
research
06/04/2018

An unsupervised and customizable misspelling generator for mining noisy health-related text sources

In this paper, we present a customizable datacentric system that automat...
research
04/09/2021

Larger-Context Tagging: When and Why Does It Work?

The development of neural networks and pretraining techniques has spawne...
