Transforming Unstructured Text into Data with Context Rule Assisted Machine Learning (CRAML)

01/20/2023
by Stephen Meisenbacher, et al.

We describe a method and new no-code software tools that enable domain experts to build custom structured, labeled datasets from the unstructured text of documents, and to build niche machine learning text classification models traceable to expert-written rules. The Context Rule Assisted Machine Learning (CRAML) method allows accurate and reproducible labeling of massive volumes of unstructured text. CRAML enables domain experts to access uncommon constructs buried within a document corpus, and avoids limitations of current computational approaches that often lack context, transparency, and interpretability. In this research methods paper, we present three use cases for CRAML: we analyze recent management literature that draws from text data, describe and release new machine learning models from an analysis of proprietary job advertisement text, and present findings of social and economic interest from a public corpus of franchise documents. CRAML produces document-level coded tabular datasets that can be used for quantitative academic research, and allows qualitative researchers to scale niche classification schemes over massive text data. CRAML is a low-resource, flexible, and scalable methodology for building training data for supervised ML. We make available as open-source resources: the software, job advertisement text classifiers, a novel corpus of franchise documents, and a fully replicable start-to-finish trained example in the context of no-poach clauses.
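
The general idea of labeling text with expert-written context rules and rolling the labels up into a document-level table can be illustrated with a minimal sketch. This is not the CRAML software or its rule format: the rule list, the keyword pattern, the window size, the function names, and the two sample sentences are all hypothetical, chosen only to show how rule-matched context windows might become one coded row per document.

```python
import re
from collections import defaultdict

# Hypothetical expert-written context rules: (tag, pattern, label).
# These are illustrative only and do not reproduce CRAML's rule syntax.
RULES = [
    ("no_poach", re.compile(r"\bshall not (solicit|hire|recruit)\b", re.I), 1),
    ("no_poach", re.compile(r"\bmay (solicit|hire|recruit)\b", re.I), 0),
]

# Hypothetical keywords used to locate candidate context windows.
KEYWORDS = re.compile(r"\b(solicit|hire|recruit)\w*", re.I)


def extract_contexts(text, window=40):
    """Yield a character window around each keyword hit."""
    for m in KEYWORDS.finditer(text):
        start = max(0, m.start() - window)
        end = min(len(text), m.end() + window)
        yield text[start:end]


def label_context(chunk):
    """Return {tag: label}, using the first matching rule for each tag."""
    labels = {}
    for tag, pattern, label in RULES:
        if tag not in labels and pattern.search(chunk):
            labels[tag] = label
    return labels


def code_documents(docs):
    """Aggregate chunk labels into one row per document (max over chunks)."""
    table = {}
    for doc_id, text in docs.items():
        row = defaultdict(int)
        for chunk in extract_contexts(text):
            for tag, label in label_context(chunk).items():
                row[tag] = max(row[tag], label)
        table[doc_id] = dict(row)
    return table


# Made-up sample documents, not drawn from the franchise corpus.
docs = {
    "doc_a": "Franchisee shall not solicit or hire any employee of another franchisee.",
    "doc_b": "Franchisee may hire staff at its own discretion.",
}
result = code_documents(docs)
print(result)  # {'doc_a': {'no_poach': 1}, 'doc_b': {'no_poach': 0}}
```

Because every label in the resulting table traces back to a specific rule and a specific text window, rows produced this way could in principle serve as auditable training data for a supervised classifier, which is the role the abstract assigns to rule-based labeling.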

research
08/21/2023

bbOCR: An Open-source Multi-domain OCR Pipeline for Bengali Documents

Despite the existence of numerous Optical Character Recognition (OCR) to...
research
06/24/2021

Evaluation of Representation Models for Text Classification with AutoML Tools

Automated Machine Learning (AutoML) has gained increasing success on tab...
research
08/05/2021

Exploring Out-of-Distribution Generalization in Text Classifiers Trained on Tobacco-3482 and RVL-CDIP

To be robust enough for widespread adoption, document analysis systems i...
research
05/11/2018

iLCM - A Virtual Research Infrastructure for Large-Scale Qualitative Data

The iLCM project pursues the development of an integrated research envir...
research
12/05/2018

How practical is it? Machine Learning for Identifying Conceptual Interoperability Constraints in API Documents

Building meaningful interoperation with external software units requires...
research
08/12/2022

Scholastic: Graphical Human-AI Collaboration for Inductive and Interpretive Text Analysis

Interpretive scholars generate knowledge from text corpora by manually s...
research
08/02/2022

Label Sleuth: From Unlabeled Text to a Classifier in a Few Hours

Text classification can be useful in many real-world scenarios, saving a...
