DeeperDive: The Unreasonable Effectiveness of Weak Supervision in Document Understanding A Case Study in Collaboration with UiPath Inc

08/17/2022
by   Emad Elwany, et al.
0

Weak supervision has been applied to various Natural Language Understanding tasks in recent years. Due to technical challenges with scaling weak supervision to work on long-form documents, spanning up to hundreds of pages, applications in the document understanding space have been limited. At Lexion, we built a weak supervision-based system tailored for long-form (10-200 pages long) PDF documents. We use this platform for building dozens of language understanding models and have applied it successfully to various domains, from commercial agreements to corporate formation documents. In this paper, we demonstrate the effectiveness of supervised learning with weak supervision in a situation with limited time, workforce, and training data. We built 8 high quality machine learning models in the span of one week, with the help of a small team of just 3 annotators working with a dataset of under 300 documents. We share some details about our overall architecture, how we utilize weak supervision, and what results we are able to achieve. We also include the dataset for researchers who would like to experiment with alternate approaches or refine ours. Furthermore, we shed some light on the additional complexities that arise when working with poorly scanned long-form documents in PDF format, and some of the techniques that help us achieve state-of-the-art performance on such data.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
08/28/2021

WALNUT: A Benchmark on Weakly Supervised Learning for Natural Language Understanding

Building quality machine learning models for natural language understand...
research
11/06/2019

Open Domain Web Keyphrase Extraction Beyond Language Modeling

This paper studies keyphrase extraction in real-world scenarios where do...
research
12/01/2017

Learning Deep Representations for Word Spotting Under Weak Supervision

Convolutional Neural Networks have made their mark in various fields of ...
research
12/01/2021

Towards More Robust Natural Language Understanding

Natural Language Understanding (NLU) is a branch of Natural Language Pro...
research
03/28/2022

Understanding Questions that Arise When Working with Business Documents

While digital assistants are increasingly used to help with various prod...
research
04/28/2023

CCpdf: Building a High Quality Corpus for Visually Rich Documents from Web Crawl Data

In recent years, the field of document understanding has progressed a lo...
research
06/10/2019

Automated Curriculum Learning for Turn-level Spoken Language Understanding with Weak Supervision

We propose a learning approach for turn-level spoken language understand...

Please sign up or login with your details

Forgot password? Click here to reset