A random forest system combination approach for error detection in digital dictionaries

10/30/2014
by Michael Bloodgood, et al.

When digitizing a print bilingual dictionary, whether via optical character recognition or manual entry, errors are inevitably introduced into the resulting electronic version. We investigate automating the detection of errors in an XML representation of a digitized print dictionary using a hybrid approach that combines rule-based, feature-based, and language model-based methods. We investigate ways of combining these methods and show that random forests are a promising approach. We find that, in isolation, unsupervised methods rival the performance of supervised methods. Random forests typically require training data, so we investigate how to apply them to combine individual base methods that are themselves unsupervised without requiring large amounts of training data. Experiments show that a relatively small amount of training data is sufficient, and that the amount can potentially be reduced further through specific selection criteria.
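To make the idea of system combination concrete, here is a minimal sketch in Python, not the authors' implementation. It assumes three hypothetical base detectors (a rule-based flag, a feature-based anomaly score, and a language-model score) whose outputs become the features of a small labeled set, and it trains a simplified random forest of depth-1 trees (stumps) with bootstrap sampling and random feature selection. All names, scores, and labels below are invented for illustration, and the toy data is deliberately easy to separate.

```python
import random
from collections import Counter

# Hypothetical toy data: (rule_flag, feature_score, lm_score) per dictionary entry.
# Label 1 means "entry contains an error". Values are invented for illustration.
ROWS = [(0, 0.10, 0.20), (0, 0.20, 0.10), (0, 0.30, 0.30), (0, 0.15, 0.25),
        (1, 0.70, 0.80), (1, 0.80, 0.90), (1, 0.90, 0.70), (1, 0.75, 0.85)]
LABELS = [0, 0, 0, 0, 1, 1, 1, 1]

def majority(labels, default):
    """Most common label, or `default` when the list is empty."""
    return Counter(labels).most_common(1)[0][0] if labels else default

def fit_stump(rows, labels, feature_ids):
    """Pick the (feature, threshold) split with the fewest training errors."""
    overall = majority(labels, 0)
    best = None
    for f in feature_ids:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            left_pred = majority(left, overall)
            right_pred = majority(right, overall)
            errors = (sum(y != left_pred for y in left)
                      + sum(y != right_pred for y in right))
            if best is None or errors < best[0]:
                best = (errors, f, t, left_pred, right_pred)
    return best[1:]  # (feature, threshold, left_pred, right_pred)

def fit_forest(rows, labels, n_trees=25, seed=0):
    """Train stumps on bootstrap samples, each seeing a random feature subset."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    k = max(1, int(n_features ** 0.5))  # features considered per tree
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        feats = rng.sample(range(n_features), k)
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], feats))
    return forest

def predict(forest, row):
    """Majority vote over all stumps in the forest."""
    votes = [lp if row[f] <= t else rp for f, t, lp, rp in forest]
    return majority(votes, 0)

forest = fit_forest(ROWS, LABELS)
print(predict(forest, (1, 0.95, 0.95)), predict(forest, (0, 0.05, 0.05)))
```

The point of the sketch is that only the combiner needs labels: the base detectors can remain unsupervised, and the labeled set only has to be large enough to learn how to weigh their outputs. A production setup would more likely use deeper trees from an established library.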

research
04/06/2016

Comments on: "A Random Forest Guided Tour" by G. Biau and E. Scornet

This paper is a comment on the survey paper by Biau and Scornet (2016) a...
research
02/25/2016

Data Cleaning for XML Electronic Dictionaries via Statistical Anomaly Detection

Many important forms of data are stored digitally in XML format. Errors ...
research
06/08/2018

Prediction of the FIFA World Cup 2018 - A random forest approach with an emphasis on estimated team ability parameters

In this work, we compare three different modeling approaches for the sco...
research
04/08/2021

Grapheme-to-Phoneme Transformer Model for Transfer Learning Dialects

Grapheme-to-Phoneme (G2P) models convert words to their phonetic pronunc...
research
12/13/2014

Oriented Edge Forests for Boundary Detection

We present a simple, efficient model for learning boundary detection bas...
research
10/31/2022

HARRIS: Hybrid Ranking and Regression Forests for Algorithm Selection

It is well known that different algorithms perform differently well on a...
