Encoding Categorical Variables with Conjugate Bayesian Models for WeWork Lead Scoring Engine

04/30/2019
by Austin Slakey, et al.

Applied data scientists across industries are commonly faced with the challenging task of encoding high-cardinality categorical features into digestible inputs for machine learning algorithms. This paper describes a Bayesian encoding technique developed for WeWork's lead scoring engine, which outputs the probability of a person touring one of our office spaces based on interaction, enrichment, and geospatial data. We present a paradigm for ensemble modeling which mitigates the need to build complicated preprocessing and encoding schemes for categorical variables. In particular, domain-specific conjugate Bayesian models are employed as base learners for features in a stacked ensemble model. For each column of a categorical feature matrix we fit a problem-specific prior distribution, for example, the Beta distribution for a binary classification problem. In order to analytically derive the moments of the posterior distribution, we update the prior with the conjugate likelihood of the corresponding target variable for each unique value of the given categorical feature. This function of column and value encodes the categorical feature matrix so that the final learner in the ensemble model ingests low-dimensional numerical input. Experimental results on both curated and real-world datasets demonstrate impressive accuracy and computational efficiency on a variety of problem archetypes. In particular, for the lead scoring engine at WeWork -- where some categorical features have as many as 300,000 levels -- we have seen an AUC improvement from 0.87 to 0.97 through implementing conjugate Bayesian model encoding.
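For the binary-classification case the abstract describes, the encoding can be sketched as a Beta-Binomial update computed per category level. The sketch below is an illustrative reconstruction, not the authors' implementation; the function name, column names, and the uniform Beta(1, 1) default prior are assumptions made for the example.

```python
import pandas as pd

def beta_target_encode(train, col, target, alpha=1.0, beta=1.0):
    """Encode a high-cardinality categorical column using the analytic
    posterior moments of a per-level Beta-Binomial conjugate model.

    Assumes `target` is binary (0/1). `alpha` and `beta` parameterize
    the Beta prior shared across all levels of `col`.
    """
    # Successes and trials observed for each unique category level
    stats = train.groupby(col)[target].agg(["sum", "count"])

    # Conjugate update: posterior is Beta(alpha + successes, beta + failures)
    a = alpha + stats["sum"]
    b = beta + (stats["count"] - stats["sum"])

    # Analytic posterior moments, available in closed form for the Beta
    mean = a / (a + b)
    var = (a * b) / ((a + b) ** 2 * (a + b + 1))

    # Each categorical level is replaced by low-dimensional numeric features
    return pd.DataFrame(
        {f"{col}_post_mean": mean, f"{col}_post_var": var}
    )

# Toy usage: encode a 'city' feature against a binary 'toured' target
df = pd.DataFrame(
    {"city": ["nyc", "nyc", "sf", "sf", "sf"], "toured": [1, 0, 1, 1, 0]}
)
encoding = beta_target_encode(df, "city", "toured")
```

Unseen levels at prediction time would fall back to the prior moments (here, mean 0.5 under Beta(1, 1)), which is one practical appeal of this scheme over one-hot encoding at 300,000 levels.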


