Improving the output quality of official statistics based on machine learning algorithms

03/01/2021
by Quinten Meertens, et al.

National statistical institutes are currently investigating how to improve the output quality of official statistics based on machine learning algorithms. A key obstacle is concept drift, i.e., a change over time in the joint distribution of the independent variables and a dependent (categorical) variable. Under concept drift, a statistical model requires regular updating to prevent it from becoming biased. However, updating a model requires additional data, which are not always available. The literature offers a variety of bias correction methods as a promising solution. In this paper, we compare two popular correction methods: the misclassification estimator and the calibration estimator. For prior probability shift (a specific type of concept drift), we investigate the two correction methods theoretically as well as experimentally. Our theoretical results are expressions for the bias and variance of both methods. As an experimental result, we present a decision boundary (as a function of (a) model accuracy, (b) class distribution and (c) test set size) for the relative performance of the two methods. Close inspection of the results provides insight into the effect of prior probability shift on output quality, leading to practical recommendations on the use of machine learning algorithms in official statistics.

research
12/02/2022

On the Change of Decision Boundaries and Loss in Learning with Concept Drift

The notion of concept drift refers to the phenomenon that the distributi...
research
06/01/2015

Bootstrap Bias Corrections for Ensemble Methods

This paper examines the use of a residual bootstrap for bias correction ...
research
09/21/2020

Selectivity correction with online machine learning

Computer systems are full of heuristic rules which drive the decisions t...
research
07/29/2022

Factorizable Joint Shift in Multinomial Classification

Factorizable joint shift (FJS) was recently proposed as a type of datase...
research
12/10/2002

How to Shift Bias: Lessons from the Baldwin Effect

An inductive learning algorithm takes a set of data as input and generat...
research
12/13/2018

Machine Learning in Official Statistics

In the first half of 2018, the Federal Statistical Office of Germany (De...
research
10/08/2020

Transcending Transcend: Revisiting Malware Classification with Conformal Evaluation

Machine learning for malware classification shows encouraging results, b...
