Reporting Score Distributions Makes a Difference: Performance Study of LSTM-networks for Sequence Tagging

by Nils Reimers, et al.

In this paper we show that reporting a single performance score is insufficient to compare non-deterministic approaches. We demonstrate for common sequence tagging tasks that the seed value for the random number generator can result in statistically significant (p < 10^-4) differences for state-of-the-art systems. For two recent systems for NER, we observe an absolute difference of one percentage point in F1-score depending on the selected seed value, so that these systems are perceived either as state-of-the-art or as mediocre. Instead of publishing and reporting single performance scores, we propose to compare score distributions based on multiple executions. Based on the evaluation of 50,000 LSTM-networks for five sequence tagging tasks, we present network architectures that both produce superior performance and are more stable with respect to the remaining hyperparameters.
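The idea of comparing score distributions instead of single scores can be illustrated with a minimal sketch. It assumes each system has been trained several times with different seeds and that the resulting F1-scores are available as plain lists. The function names, the example scores, and the choice of a permutation test on the difference of means are all assumptions made for illustration; the abstract does not state which statistical test the authors use.

```python
import random
import statistics


def summarize(scores):
    """Describe a score distribution rather than a single number."""
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),
        "min": min(scores),
        "max": max(scores),
    }


def permutation_test(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Two-sided permutation test on the difference of mean scores.

    Illustrative choice of test (an assumption, not taken from the paper).
    Returns an estimated p-value for the null hypothesis that both score
    lists come from the same distribution.
    """
    rng = random.Random(seed)
    observed = abs(statistics.mean(scores_a) - statistics.mean(scores_b))
    pooled = list(scores_a) + list(scores_b)
    n_a = len(scores_a)
    hits = 0
    for _ in range(n_resamples):
        # Randomly reassign the pooled scores to the two systems.
        rng.shuffle(pooled)
        diff = abs(
            statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:])
        )
        if diff >= observed:
            hits += 1
    # Add-one smoothing keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_resamples + 1)


# Hypothetical F1-scores from five seeds per system (placeholder values,
# not results from the paper).
system_a = [90.1, 90.6, 90.3, 90.9, 90.4]
system_b = [90.5, 90.2, 90.8, 90.0, 90.7]

print(summarize(system_a))
print(summarize(system_b))
print("p-value:", permutation_test(system_a, system_b))
```

Reporting the mean, spread, and range per system, together with a test over all runs, avoids the situation described above in which a single lucky or unlucky seed decides which system looks better.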

