From Strings to Data Science: a Practical Framework for Automated String Handling

11/02/2021
by John W. van Lith, et al.

Many machine learning libraries require that string features be converted to a numerical representation for the models to work as intended. Categorical string features can represent a wide variety of data (e.g., zip codes, names, marital status), and are notoriously difficult to preprocess automatically. In this paper, we propose a framework for automating this preprocessing, based on best practices, domain knowledge, and novel techniques. It automatically identifies different types of string features, processes them accordingly, and encodes them into numerical representations. We also provide an open-source Python implementation that automatically preprocesses categorical string data in tabular datasets, and we demonstrate promising results on a wide range of datasets.
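To make the identify-then-encode idea concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the function names (`infer_string_type`, `encode_column`), the three type labels, the leading-zero rule, and the cardinality threshold are all assumptions chosen for illustration only.

```python
from collections import Counter


def _is_missing(value):
    """Treat None and blank strings as missing (an assumption of this sketch)."""
    return value is None or value.strip() == ""


def _looks_numeric(text):
    """Return True if the string parses as a float and is not an identifier-like code."""
    stripped = text.strip()
    # Digit strings with a leading zero (e.g. "02139") are kept as categories,
    # since zip codes and similar identifiers are not quantities.
    if len(stripped) > 1 and stripped.isdigit() and stripped.startswith("0"):
        return False
    try:
        float(stripped)
    except ValueError:
        return False
    return True


def infer_string_type(values, max_onehot_cardinality=10):
    """Hypothetical heuristic: label a string column as numeric, low- or high-cardinality."""
    present = [v for v in values if not _is_missing(v)]
    if present and all(_looks_numeric(v) for v in present):
        return "numeric"
    if len(set(present)) <= max_onehot_cardinality:
        return "nominal_low"
    return "nominal_high"


def encode_column(values, max_onehot_cardinality=10):
    """Encode a string column as rows of floats, choosing the encoding from the inferred type."""
    kind = infer_string_type(values, max_onehot_cardinality)
    if kind == "numeric":
        # Parse numbers; missing entries become NaN.
        rows = [[float("nan")] if _is_missing(v) else [float(v)] for v in values]
    elif kind == "nominal_low":
        # One-hot encoding over the sorted set of observed categories;
        # missing entries become an all-zero row.
        categories = sorted({v for v in values if not _is_missing(v)})
        rows = [[1.0 if v == c else 0.0 for c in categories] for v in values]
    else:
        # Frequency encoding: each value is replaced by its relative frequency.
        counts = Counter(values)
        rows = [[counts[v] / len(values)] for v in values]
    return kind, rows


if __name__ == "__main__":
    print(encode_column(["single", "married", "single", "divorced"]))
    print(encode_column(["02139", "10001", "94105"], max_onehot_cardinality=2))
```

The sketch separates type inference from encoding so that each column's encoder is picked from its inferred type, mirroring the identify, process, and encode steps described in the abstract. A real framework would need many more string types and more careful handling of missing values.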

research
08/17/2018

An Automata-based Abstract Semantics for String Manipulation Languages

In recent years, dynamic languages, such as JavaScript or Python, have f...
research
09/16/2023

Parallel Longest Common SubSequence Analysis In Chapel

One of the most critical problems in the field of string algorithms is t...
research
07/14/2020

A Decision Procedure for Path Feasibility of String Manipulating Programs with Integer Data Type

Strings are widely used in programs, especially in web applications. Int...
research
11/29/2017

A critical analysis of string APIs: The case of Pharo

Most programming languages, besides C, provide a native abstraction for ...
research
02/07/2023

Recent advances in the Self-Referencing Embedding Strings (SELFIES) library

String-based molecular representations play a crucial role in cheminform...
research
02/11/2020

Hidden in Plain Sight: Obfuscated Strings Threatening Your Privacy

String obfuscation is an established technique used by proprietary, clos...
research
09/13/2018

Where Does Haydn End and Mozart Begin? Composer Classification of String Quartets

For humans and machines, perceiving differences between string quartets ...
