1 Introduction
Tabular datasets often contain columns with string entries. However, fitting statistical models on such data generally requires a numerical representation of all entries, which calls for building an encoding, or vector representation, of the entries. Considering string entries as nominal (unordered) categories gives a well-framed statistical analysis. In such situations, categories are supposed to be mutually exclusive and unrelated, with a fixed, known set of possible values. Yet, in many real-world datasets, string columns are not standardized into a small number of categories. This poses challenges for statistical analysis. First, the set of all possible categories may be huge and not known a priori, as the number of different strings in the column can increase indefinitely with the number of samples. Second, categories may be related: they often carry some morphological or semantic links.
The classic approach to encode categorical variables for statistical analysis is one-hot encoding. It creates vectors that agree with the general intuition of nominal categories: orthogonal and equidistant [cohen2013applied]. However, for high-cardinality categories, one-hot encoding leads to feature vectors of high dimensionality. This is especially problematic in big data settings, which can lead to a very large number of categories, posing computational and statistical problems.
Data engineering practices typically tackle these issues with data-cleaning techniques [pyle1999data, rahm2000data]. In particular, deduplication tries to merge different variants of the same entity [winkler2006overview, elmagarmid2007duplicate, christen2012data]. A related concept is that of normalization, used in databases and text processing to put entries in canonical forms. However, data cleaning and normalization often require human intervention and are major costs in data analysis (Kaggle industry survey: https://www.kaggle.com/surveys/2017). To avoid the cleaning step, similarity encoding [cerda2018similarity] relaxes one-hot encoding by using string similarities [gomaa2013survey]. Hence, it addresses the problem of related categories and has been shown to improve statistical analysis upon one-hot encoding [cerda2018similarity]. Yet, it does not tackle the problem of high cardinality, and the data analyst must resort to heuristics such as choosing a subset of the training categories [cerda2018similarity].
Here, we seek encoding approaches for statistical analysis on string categorical entries that are suited to a very large number of categories without any human intervention: avoiding data cleaning, feature engineering, or neural architecture search. Our goals are: i) to provide feature vectors of limited dimensionality without any cleaning or feature engineering step, even for very large datasets; ii) to improve statistical analysis tasks such as supervised learning; and iii) to preserve the intuitions behind categories: entries can be arranged in natural groups that can be easily interpreted. We study two novel encoding methods that both address scalability and statistical performance: a minhash encoder, based on locality-sensitive hashing (LSH) [gionis1999similarity], and a low-rank model of co-occurrences in character n-grams: a Gamma-Poisson matrix factorization, suited to counting statistics. Both models scale linearly with the number of samples and are suitable for statistical analysis in streaming settings. Moreover, we show that the Gamma-Poisson factorization model enables interpretability with a sparse encoding that expresses the entries of the data as linear combinations of a small number of latent categories, built from their substring information. This interpretability is very important: opaque and black-box machine learning models have limited adoption in real-world data-science applications. Often, practitioners resort to manual data cleaning to regain interpretability of the models.
Finally, we demonstrate on 17 real-life datasets that our encoding methods improve supervised learning on non-curated data without the need for dataset-specific choices. As such, these encodings provide a scalable and automated replacement to data cleaning or feature engineering, and restore the benefits of a low-dimensional categorical encoding, such as one-hot encoding.
The paper is organized as follows. Section 2 states the problem in detail and the prior art on creating feature vectors from categorical variables. Section 3 introduces and studies our two encoding approaches. In Section 4, we present our experimental study with an emphasis on interpretation and on statistical learning on 17 non-curated datasets. Section 5 discusses these results, after which appendices provide information on the datasets and the experiments to facilitate the reproduction of our findings.
2 Problem setting and prior art
The statistics literature often considers datasets that contain only categorical variables with a low cardinality, such as those in the UCI repository [dua2017uci] (see, for example, the Adult dataset: https://archive.ics.uci.edu/ml/datasets/adult). In such settings, the popular one-hot encoding is a suitable solution for supervised learning [cohen2013applied]. It models categories as mutually exclusive and, as categories are known a priori, new categories are not expected to appear in the test set. With enough data, supervised learning can then be used to link each category to a target variable.
2.1 High-cardinality categorical variables
However, in many real-world problems, the number of different string entries in a column is very large, often growing with the number of observations (Figure 1). Consider for instance the Drug Directory dataset (product listing data for all unfinished, unapproved drugs; source: U.S. Food and Drug Administration, FDA): one of the variables is a categorical column with non-proprietary names of drugs. As entries in this column have not been normalized, many different entries are likely related: they share a common ingredient such as alcohol (see (a)). Another example is the Employee Salaries dataset (annual salary information for employees of Montgomery County, MD, U.S.A.; source: https://data.montgomerycountymd.gov/). Here, a relevant variable is the position title of employees. As shown in (b), there is also overlap between the different occupations.
High-cardinality categorical variables may arise from variability in their string representations, such as abbreviations, special characters, or typos (a taxonomy of different sources of dirty data can be found in [kim2003taxonomy], and a formal description of data quality problems is proposed by [oliveira2005formal]). Such non-normalized data often contains very rare categories. Yet, these categories tend to have common morphological information. Indeed, the number of unique entries grows more slowly with the size of the data than the number of words in natural language (Figure 1). In both examples above, drug names and position titles of employees, there is an implicit taxonomy. Crafting feature-engineering or data-cleaning rules can recover a small number of relevant categories. However, it is time consuming and often needs domain expertise.
Notation
We write sets of elements with capital curly fonts, as $\mathcal{X}$. Elements of a vector space (we consider row vectors) are written in bold, $\mathbf{x}$, with the $i$-th entry denoted by $x_i$, and matrices are in capital and bold, $\mathbf{X}$, with $x_{ij}$ the entry on the $i$-th row and $j$-th column.
Let $C$ be a categorical variable such that $\text{dom}(C) \subseteq \mathcal{S}$, the set of finite-length strings. We call categories the elements of $\text{dom}(C)$. Let $s_i \in \mathcal{S}$, $i = 1, \dots, n$, be the category corresponding to the $i$-th sample of a dataset. For statistical learning, we want to find an encoding function $\text{enc}: \mathcal{S} \to \mathbb{R}^d$, such that $\text{enc}(s_i) = \mathbf{x}_i$. We call $\mathbf{x}_i$ the feature map of $s_i$. Table II contains a summary of the main variables used in the next sections.
2.2 One-hot encoding, limitations and extensions
2.2.1 Shortcomings of one-hot encoding
From a statistical-analysis standpoint, the multiplication of entries with related information is challenging for two reasons. First, it dilutes the information: learning on rare categories is hard. Second, with one-hot encoding, representing these as separate categories creates high-dimensional feature vectors. This high dimensionality entails large computational and memory costs; it increases the complexity of the associated learning problem, resulting in poor statistical estimation [bottou2008tradeoffs]. Dimensionality reduction of the one-hot encoded matrix can help with this issue, but at the risk of losing information.
Encoding all unique entries with orthogonal vectors discards the overlap information visible in the string representations. Also, one-hot encoding cannot assign a feature vector to new categories that may appear in the testing set, even if their representation is close to one in the training set. Heuristics, such as assigning the zero vector to new categories, create collisions if more than one new category appears. As a result, one-hot encoding is ill suited to online learning settings: if new categories arrive, the entire encoding of the dataset has to be recomputed and the dimensionality of the feature vector becomes unbounded.
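As a small illustration of these shortcomings, the following Python sketch one-hot encodes a toy column and shows the collision created by the zero-vector heuristic. The helper names `fit_one_hot` and `one_hot_encode` and the example strings are hypothetical, chosen for illustration only.

```python
import numpy as np

def fit_one_hot(train_categories):
    """Map each unique training category to a column index."""
    vocabulary = sorted(set(train_categories))
    return {category: index for index, category in enumerate(vocabulary)}

def one_hot_encode(categories, index_of):
    """Encode categories; unseen ones fall back to the zero vector."""
    encoded = np.zeros((len(categories), len(index_of)))
    for row, category in enumerate(categories):
        column = index_of.get(category)
        if column is not None:
            encoded[row, column] = 1.0
    return encoded

# Toy training column: the dimensionality equals the number of unique strings.
train = ["police officer", "police aide", "firefighter", "police officer"]
index_of = fit_one_hot(train)
X_train = one_hot_encode(train, index_of)

# Two different unseen categories collide on the same all-zero vector,
# even though each is morphologically close to a training category.
X_test = one_hot_encode(["police sergeant", "fire fighter"], index_of)
```

Here the related entries "police officer" and "police aide" are orthogonal in `X_train`, and both test entries receive the same all-zero encoding.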
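Table II: Summary of the main variables.
Symbol  Definition
$\mathcal{S}$  Set of all finite-length strings.
$\mathcal{G}(s) \subseteq \mathcal{S}$  Set of all consecutive n-grams in $s \in \mathcal{S}$.
$\mathcal{V} = \bigcup_{i=1}^{n} \mathcal{G}(s_i)$  Vocabulary of n-grams in the train set.
$C$  Categorical variable.
$n$  Number of samples.
$d$  Dimension of the categorical encoder.
$m = |\mathcal{V}|$  Cardinality of the vocabulary.
$\mathbf{F} \in \mathbb{R}^{n \times m}$  Count matrix of n-grams.
$\mathbf{X} \in \mathbb{R}^{n \times d}$  Feature matrix of $C$.
$\text{sim}: \mathcal{S} \times \mathcal{S} \to [0, 1]$  String similarity.
$h_k: \mathcal{S} \to [0, 1]$  Hash function with salt value equal to $k$.
$Z_k: \mathcal{S} \to [0, 1]$  Minhash function with salt value equal to $k$.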
2.2.2 Similarity encoding for string categorical variables
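For categorical variables represented by strings, similarity encoding extends one-hot encoding by taking into account a measure of string similarity between pairs of categories [cerda2018similarity].
Let $s_i \in \mathcal{S}$, $i = 1, \dots, n$, be the category corresponding to the $i$-th sample of a given training dataset. Given a string similarity $\text{sim}(s_i, s_j): \mathcal{S} \times \mathcal{S} \to [0, 1]$, similarity encoding builds a feature map $\mathbf{x}_i^{\text{sim}} \in \mathbb{R}^{d}$ as:
$$\mathbf{x}_i^{\text{sim}} = \left[\text{sim}(s_i, s^{(1)}),\ \text{sim}(s_i, s^{(2)}),\ \dots,\ \text{sim}(s_i, s^{(d)})\right] \in \mathbb{R}^{d} \tag{1}$$
where $\{s^{(l)},\ l = 1, \dots, d\} \subseteq \mathcal{S}$ is the set of all unique categories in the training set, or a subset of prototype categories chosen heuristically (in this work, we use as dimensionality reduction technique the k-means strategy explained in [cerda2018similarity]). With the previous definition, one-hot encoding corresponds to taking the discrete string similarity:
$$\text{sim}_{\text{one-hot}}(s_i, s_j) = \mathbb{1}[s_i = s_j] \tag{2}$$
where $\mathbb{1}[\cdot]$ is the indicator function.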
Empirical work on databases with categorical columns containing non-normalized entries showed that similarity encoding with a continuous string similarity brings significant benefits upon one-hot encoding [cerda2018similarity]. Indeed, it relates rare categories to similar, more frequent ones. In columns with typos or morphological variants of the same information, a simple string similarity is often enough to capture additional information. Similarity encoding outperforms a bag-of-n-grams representation of the input string, as well as methods that encode high-cardinality categorical variables without capturing information in the string representations [cerda2018similarity], such as target encoding [micci2001preprocessing] or hash encoding [weinberger2009feature].
A variety of string similarities can be considered for similarity encoding, but [cerda2018similarity] found that a good performer was a similarity based on n-grams of consecutive characters. This n-gram similarity is based on splitting the two strings to compare into their character n-grams and calculating the Jaccard coefficient between these two sets [angell1983automatic]:
$$\text{sim}_{n\text{-gram}}(s_i, s_j) = J\left(\mathcal{G}(s_i), \mathcal{G}(s_j)\right) = \frac{|\mathcal{G}(s_i) \cap \mathcal{G}(s_j)|}{|\mathcal{G}(s_i) \cup \mathcal{G}(s_j)|} \tag{3}$$
where $\mathcal{G}(s)$, $s \in \mathcal{S}$, is the set of consecutive character n-grams for the string $s$. Beyond the use of string similarity, an important aspect of similarity encoding is that it is a prototype method, using as prototypes a subset of the categories in the train set.
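The n-gram similarity and the resulting similarity encoding can be sketched in a few lines of Python. This is a minimal illustration, not the reference implementation of [cerda2018similarity]: the function names, the choice of 3-grams, the convention for strings shorter than n, and the toy prototypes are all assumptions.

```python
def char_ngrams(string, n=3):
    """Set of consecutive character n-grams of a string."""
    return {string[i:i + n] for i in range(len(string) - n + 1)}

def ngram_similarity(s1, s2, n=3):
    """Jaccard coefficient between the n-gram sets of two strings."""
    g1, g2 = char_ngrams(s1, n), char_ngrams(s2, n)
    if not g1 and not g2:
        # Assumed convention when both strings are shorter than n.
        return 1.0 if s1 == s2 else 0.0
    return len(g1 & g2) / len(g1 | g2)

def similarity_encode(strings, prototypes, n=3):
    """One row per string, one column per prototype category."""
    return [[ngram_similarity(s, p, n) for p in prototypes] for s in strings]

# Hypothetical prototypes taken from a training column.
prototypes = ["police officer", "police aide", "firefighter"]
features = similarity_encode(["police officer", "police oficer"], prototypes)
```

With these toy inputs, the misspelled entry "police oficer" keeps a high similarity to the prototype "police officer", whereas one-hot encoding would treat the two strings as unrelated.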
2.3 Related solutions for encoding string categories
2.3.1 Bag of n-grams
A simple way to capture morphology in a string is to characterize it by the count of its character n-grams. This is sometimes called a bag-of-n-grams characterization of strings. Such a representation has been shown to be efficient for spelling correction [angell1983automatic] or for named-entity recognition [klein2003named].
For high-cardinality categorical variables, the number of different n-grams tends to increase with the number of samples. Yet, this number increases more slowly than in a typical NLP problem (see Figure 2). Indeed, categorical variables have less entropy than free text: they are usually repeated, often have subparts in common, and refer to a particular, more restrictive subject.
Representing strings by character-level n-grams is related to vectorizing text by their tokens or words. Common practice uses term-frequency inverse-document-frequency (tf-idf) reweighting: dividing a token's count in a sample by its count in the whole document. Dimensionality reduction by a singular value decomposition (SVD) on this matrix leads to a simple topic extraction, latent semantic analysis (LSA) [landauer1998introduction]. A related but more scalable solution for dimensionality reduction is random projections, which give a low-dimensional approximation of Euclidean distances [johnson1984extensions, achlioptas2003database].
2.3.2 Word embeddings
If the string entries are common words, an approach to represent them as vectors is to leverage word embeddings developed in natural language processing [pennington2014glove, mikolov2013efficient]. Euclidean similarity of these vectors captures related semantic meaning in words. Multiple words can be represented as a weighted sum of their vectors, or with more complex approaches [arora2016simple]. To cater for out-of-vocabulary strings, FastText [bojanowski2017enriching] considers subword information of words, i.e., character-level n-grams. Hence, it can encode strings even in the presence of typos. Word vectors computed on very large corpora are available for download. These have captured fine semantic links between words. However, to analyze a given database, the danger of such an approach is that the semantics of categories may differ from that in the pretrained model. These encodings do not adapt to the information specific to the data at hand. Moreover, they cannot be trained directly on the categorical variables for two reasons: categories lack sufficient context, as they are usually composed of short strings; and the number of samples in some datasets is not enough to properly train these models.
3 Scalable encoding of string categories
We now describe two novel approaches for categorical encoding of string variables. Both are based on the character-level structure of categories. The first one, which we call minhash encoding, comes from the document indexation literature, and in particular from the idea of locality-sensitive hashing (LSH) [gionis1999similarity]. The method is stateless and it has been shown to approximate the Jaccard coefficient between two strings [broder1997resemblance]. The second one is the Gamma-Poisson factorization [canny2004gap], a matrix factorization technique, originally used in the probabilistic topic modeling literature, that assumes a Poisson distribution on the n-gram counts of categories, with a Gamma prior on the activations. An online algorithm for the matrix factorization allows the method to scale linearly with the number of samples. Both methods capture the morphological similarity of categories in a reduced dimensionality.
3.1 Minhash encoding
3.1.1 Background: minhash
Locality-sensitive hashing (LSH) [gionis1999similarity] has been extensively used for approximate nearest neighbor search as an efficient way of finding similar objects (documents, pictures, etc.) in high-dimensional settings. One of the most famous functions in the LSH family is the minhash function [broder1997resemblance, broder2000min], originally designed to retrieve similar documents in terms of the Jaccard coefficient of the word counts of documents (see [leskovec2014mining], chapter 3, for a primer).
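Let $\mathcal{X}^\star$ be a totally ordered set and $\pi$ a random permutation of the order in $\mathcal{X}^\star$. For any non-empty $\mathcal{X} \subseteq \mathcal{X}^\star$ with finite cardinality, the minhash function $Z(\mathcal{X})$ can be defined as:
$$Z(\mathcal{X}) = \min_{x \in \mathcal{X}} \pi(x) \tag{4}$$
Note that $Z(\mathcal{X})$ can also be seen as a random variable. As shown in [broder1997resemblance], for any $\mathcal{X}, \mathcal{Y} \subseteq \mathcal{X}^\star$, the minhash function has the following property:
$$P\left(Z(\mathcal{X}) = Z(\mathcal{Y})\right) = \frac{|\mathcal{X} \cap \mathcal{Y}|}{|\mathcal{X} \cup \mathcal{Y}|} = J(\mathcal{X}, \mathcal{Y}) \tag{5}$$
where $J$ is the Jaccard coefficient between the two sets. For a controlled approximation, several random permutations can be taken, which defines a minhash signature. For $d$ permutations $\pi_j$ drawn i.i.d., with $Z_j$ the minhash function associated with $\pi_j$, Equation 5 leads to:
$$\sum_{j=1}^{d} \mathbb{1}\left[Z_j(\mathcal{X}) = Z_j(\mathcal{Y})\right] \sim \mathcal{B}\left(d, J(\mathcal{X}, \mathcal{Y})\right) \tag{6}$$
where $\mathcal{B}$ denotes the Binomial distribution. Dividing the above quantity by $d$ thus gives a consistent estimate of the Jaccard coefficient $J(\mathcal{X}, \mathcal{Y})$.
Without loss of generality, we can consider the case of $\mathcal{X}^\star$ being equal to the real interval $[0, 1]$, so now for any $x \in [0, 1]$, $\pi(x) \sim \mathcal{U}(0, 1)$.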
Proposition 3.1.
Marginal distribution. If , and such that , then .
Proof.
It comes directly from considering that:
.
∎
Now that we know the distribution of the minhash random variable, we will show how each dimension of a minhash signature maps inclusion of sets to simple inequalities.
Proposition 3.2.
Inclusion. Let such that and .

If , then .

Proof.
At this point, we do not know anything about the case when , so for a fixed , we can not ensure that any set with lower minhash value has
as inclusion. The following theorem allows us to define regions in the vector space generated by the minhash signature that, with high probability, are associated to inclusion rules.
Theorem 3.1.
Identifiability of inclusion rules.
Let be two finite sets
such that
and . ,
if , then:
(7) 
Proof.
First, notice that:
Then, defining , with :
Finally:
∎
Theorem 3.1 tells us that taking enough random permutations ensures that when , the probability that is small. This result is very important, as it shows a global property of the minhash representation when using several random permutations, going beyond the well-known properties of collisions in the minhash signature. Figure 11 in the Appendix confirms empirically the bound on the dimensionality and its logarithmic dependence on the desired false positive rate.
3.1.2 The minhash encoder
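A practical way to build a computationally efficient implementation of minhash is to use a hash function with different salt numbers instead of random permutations. Indeed, hash functions can be built with suitable i.i.d. random-process properties [broder2000min]. Thus, the minhash function can be constructed as follows:
$$Z_j(\mathcal{X}) = \min_{x \in \mathcal{X}} h_j(x) \tag{8}$$
where $h_j$ is a hash function on $\mathcal{X}^\star$ with salt value $j$ (here we use a 32-bit version of the MurmurHash3 function [appleby2014murmurhash3]).
For the specific problem of categorical data, we are interested in a fast approximation of $J(\mathcal{G}(s_i), \mathcal{G}(s_j))$, where $\mathcal{G}(s)$ is the set of all consecutive character n-grams for the string $s$. We define the minhash encoder as:
$$\mathbf{x}^{\text{min-hash}}(s) = \left[Z_1(\mathcal{G}(s)), \dots, Z_d(\mathcal{G}(s))\right] \in \mathbb{R}^{d} \tag{9}$$
Considering the hash functions as random processes, Equation 6 implies that this encoder has the following property:
$$\frac{1}{d}\, \mathbb{E}\left[\left\| \mathbf{x}^{\text{min-hash}}(s_i) - \mathbf{x}^{\text{min-hash}}(s_j) \right\|_{\ell_0}\right] = 1 - J\left(\mathcal{G}(s_i), \mathcal{G}(s_j)\right) \tag{10}$$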
Proposition 3.2 tells us that the minhash encoder transforms the inclusion relations of strings into an order relation in the feature space. This is especially relevant for learning tree-based models, as Theorem 3.1 proves that by performing a reduced number of splits in the minhash dimensions, the space can be divided between the elements that contain and do not contain a given substring.
As an example, Figure 3 shows this global property of the minhash encoder for the case of the Employee Salaries dataset with $d = 2$. The substrings Senior, Supply and Technician are all included in the category Senior Supply Technician and, as a consequence, the position of this category in the encoding space will always be in the intersection of the bottom-left regions generated by its substrings.
Finally, this encoder is especially suitable for very large scale settings, as it is very fast to compute and completely stateless. A stateless encoding is very useful for distributed computing: different workers can then process data simultaneously without communication. Its drawback is that, as it relies on hashing, the encoding cannot easily be inverted and interpreted in terms of the original string entries.
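A minimal Python sketch of a minhash encoder along these lines follows. It is an illustration under stated assumptions, not the implementation used in the experiments: it swaps the 32-bit MurmurHash3 for a salted BLAKE2b hash from the Python standard library, uses 3-grams only, and the names `char_ngrams`, `salted_hash` and `minhash_encode` are hypothetical.

```python
import hashlib
import numpy as np

def char_ngrams(string, n=3):
    """Set of consecutive character n-grams of a string."""
    return {string[i:i + n] for i in range(len(string) - n + 1)}

def salted_hash(token, salt):
    """Hash a token to [0, 1); BLAKE2b stands in for MurmurHash3 (assumption)."""
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8,
                             salt=salt.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little") / 2**64

def minhash_encode(string, d=30, n=3):
    """Stateless encoding: minimum salted hash over the n-grams, per dimension."""
    # Fall back to the whole string when it is shorter than n (assumption).
    grams = char_ngrams(string, n) or {string}
    return np.array([min(salted_hash(g, j) for g in grams) for j in range(d)])

full = minhash_encode("Senior Supply Technician", d=1000)
part = minhash_encode("Supply Technician", d=1000)

# Inclusion becomes an order relation: the n-grams of the substring are a
# subset of those of the full string, so `full <= part` in every dimension.
inclusion_holds = bool((full <= part).all())

# The fraction of equal entries approximates the Jaccard coefficient.
collision_rate = float(np.mean(full == part))
```

No state is fitted or stored: each string is encoded independently of the others, which is what makes this kind of encoder easy to distribute across workers.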
3.2 Gamma-Poisson factorization
To facilitate interpretation, we now introduce an encoding approach that estimates a decomposition of the string entries in terms of a linear combination of latent categories.
3.2.1 Model
We use a generative model of strings from latent categories. For this, we rely on the Gamma-Poisson model [canny2004gap], a matrix factorization technique well suited to counting statistics. The idea was originally developed for finding low-dimensional representations, known as topics, of documents given their word count representation. As the string entries we consider are much shorter than text documents and can contain typos, we rely on their substring representation: we represent each observation by its count vector of character-level n-grams. Each observation, a string entry described by its count vector $\mathbf{f} \in \mathbb{N}^{m}$, is modeled as a linear combination of $d$ unknown prototypes or topics, $\mathbf{\Lambda} \in \mathbb{R}^{d \times m}$:
$$\mathbf{f} \approx \mathbf{x}\, \mathbf{\Lambda} \tag{11}$$
Here, $\mathbf{x} \in \mathbb{R}^{d}$ are the activations that decompose the observation $\mathbf{f}$ in the prototypes $\mathbf{\Lambda}$ in the count space. As we will see later, these prototypes can be seen as latent categories.
Given a training dataset with $n$ samples, the model estimates the unknown prototypes $\mathbf{\Lambda}$ by factorizing the data's bag-of-n-grams representation $\mathbf{F} \in \mathbb{N}^{n \times m}$, where $m$ is the number of different n-grams in the data:
$$\mathbf{F} \approx \mathbf{X}\, \mathbf{\Lambda}, \quad \text{with } \mathbf{X} \in \mathbb{R}^{n \times d},\ \mathbf{\Lambda} \in \mathbb{R}^{d \times m} \tag{12}$$
As $\mathbf{f}$ is a vector of counts, it is natural to consider a Poisson distribution for each of its elements:
$$f_j \sim \text{Poisson}\left((\mathbf{x}\, \mathbf{\Lambda})_j\right), \quad j = 1, \dots, m \tag{13}$$
For a prior on the elements of $\mathbf{x}$, we use a Gamma distribution, as it is the conjugate prior of the Poisson distribution, but also because it can foster a soft sparsity:
$$x_i \sim \text{Gamma}(\alpha_i, \beta_i), \quad i = 1, \dots, d \tag{14}$$
where $\boldsymbol{\alpha}, \boldsymbol{\beta} \in \mathbb{R}^{d}$ are the shape and scale parameters of the Gamma distribution for each one of the $d$ topics.
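To make the generative reading of the model concrete, the following NumPy sketch samples toy activations from a Gamma prior and then n-gram counts from a Poisson distribution whose mean is the product of activations and topics. All sizes, the parameter values, and the variable names (`alpha`, `beta`, `Lambda`, `X`, `F`) are illustrative assumptions, not values from the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d, m = 5, 3, 12          # samples, topics, n-gram vocabulary size (toy values)
alpha = np.full(d, 1.1)     # Gamma shape parameter per topic (assumed value)
beta = np.full(d, 1.0)      # Gamma scale parameter per topic (assumed value)

# Toy non-negative topic matrix: each row is a prototype in the count space.
Lambda = rng.gamma(1.0, 1.0, size=(d, m))

# Activations drawn from the Gamma prior, one row per sample.
X = rng.gamma(shape=alpha, scale=beta, size=(n, d))

# Counts drawn from a Poisson distribution with mean X @ Lambda.
F = rng.poisson(X @ Lambda)
```

Fitting the model goes the other way: given the observed counts `F`, it estimates the topics and the activations.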
3.2.2 Estimation strategy
To fit the model to the input data, we maximize the likelihood of the model, denoted by:
(15) 
Maximizing the log-likelihood with respect to the parameters gives:
(16)  
(17) 
As explained in [canny2004gap], these expressions are analogous to solving the following non-negative matrix factorization (NMF) with the generalized Kullback-Leibler divergence (in the sense of the NMF literature; see for instance [lee2001algorithms]) as loss:
(18)
In other words, the Gamma-Poisson model can be interpreted as a constrained non-negative matrix factorization in which the generalized Kullback-Leibler divergence is minimized between $\mathbf{F}$ and $\mathbf{X}\mathbf{\Lambda}$, subject to a Gamma prior in the distribution of the elements of $\mathbf{X}$. The Gamma prior induces sparsity in the activations of the model.
To solve the NMF problem above, [lee2001algorithms] proposes the following recurrences:
(19)  
(20) 
As $\mathbf{F}$ is a sparse matrix, the summations above only need to be computed on the non-zero elements of $\mathbf{F}$. This fact considerably decreases the computational cost of the algorithm.
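For intuition, the classic multiplicative updates of [lee2001algorithms] for NMF under the generalized KL divergence can be sketched in NumPy as below. This is the plain batch version without the Gamma prior, so it should not be read as the exact recurrences of the Gamma-Poisson solver. The function names, the dense toy count matrix, and the small constant `eps` guarding the divisions are assumptions.

```python
import numpy as np

def generalized_kl(F, R):
    """Generalized KL divergence D(F || R), with the convention 0 log 0 = 0."""
    mask = F > 0
    return np.sum(F[mask] * np.log(F[mask] / R[mask])) - F.sum() + R.sum()

def kl_nmf(F, d, n_iter=50, seed=0, eps=1e-12):
    """Factorize F (n x m) as X (n x d) @ Lambda (d x m), all non-negative."""
    rng = np.random.default_rng(seed)
    n, m = F.shape
    X = rng.uniform(0.5, 1.5, size=(n, d))
    Lambda = rng.uniform(0.5, 1.5, size=(d, m))
    losses = [generalized_kl(F, X @ Lambda)]
    for _ in range(n_iter):
        # Topic update: sums run over the training samples (rows of F).
        ratio = F / (X @ Lambda + eps)
        Lambda *= (X.T @ ratio) / (X.sum(axis=0)[:, None] + eps)
        # Activation update: sums run over the n-grams (columns of F).
        ratio = F / (X @ Lambda + eps)
        X *= (ratio @ Lambda.T) / (Lambda.sum(axis=1)[None, :] + eps)
        losses.append(generalized_kl(F, X @ Lambda))
    return X, Lambda, losses

# Toy count matrix standing in for a bag-of-n-grams representation.
rng = np.random.default_rng(1)
F = rng.poisson(1.0, size=(20, 15)).astype(float)
X, Lambda, losses = kl_nmf(F, d=4)
```

Entries where `F` is zero contribute nothing to `ratio`, which is why a sparse implementation only needs to visit the non-zero counts when computing the numerators.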
Following [lefevre2011online], we present an online (or streaming) version of the Gamma-Poisson solver (algorithm 1). The basic idea of the algorithm is to exploit the fact that in the recursion for the topic matrix (eq. 19 and 20), the summations are done with respect to the training samples. Instead of computing the numerator and denominator on the entire training set at each update, one can update these values only with minibatches of data, which considerably decreases the memory usage and time of the computations.
For better computational performance, we adapt the implementation of this solver to the specificities of our problem—factorizing substring counts across entries of a categorical variable. In particular, we take advantage of the repeated entries by saving a dictionary of the activations for each category in the convergence of the previous minibatches (algorithm 1, line 4) and use them as an initial guess for the same category in a future minibatch. This is a warm restart and is especially important in the case of categorical variables because for most datasets, the number of unique categories is much lower than the number of samples.
The hyperparameters of the algorithm and its initialization can affect convergence. One important parameter is the discount factor for the previous iterations of the topic matrix (algorithm 1, lines 9-10). Figure 9 in the Appendix shows the value that gives a good compromise between stability of the convergence and data fitting in terms of the generalized KL divergence. With respect to the initialization of the topic matrix, a good option is to choose the centroids of a k-means clustering (Figure 10) in a hashed version of the n-gram count matrix (in order to speed up the k-means algorithm) and then project back to the n-gram space with a nearest-neighbors algorithm. In the case of a streaming setting, the same approach can be used on a subset of the data.
3.2.3 Inferring feature names
An encoding strategy where each dimension can be understood by humans facilitates the interpretation of the full statistical analysis. A straightforward strategy for interpretation of the Gamma-Poisson encoder is to describe each encoding dimension by the features of the string entries that it captures. For this, one alternative is to track the feature maps corresponding to each input category, and assign labels based on the input categories that activate the most in a given dimension. Another option is to apply the same strategy, but for substrings, such as words contained in the input categories. In the experiments, we follow the second approach, as many datasets are composed of entries with overlap; hence, individual words carry more information for interpretability than the entire strings.
This method can be applied to any encoder, but it is expected to work well if the encodings are sparse and composed only of non-negative values with a meaningful magnitude. The Gamma-Poisson factorization model ensures these properties.
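The word-based labeling strategy can be sketched as follows. The function name `infer_feature_names`, the number of labels per dimension, and the activation values are hypothetical; in practice the activations would come from encoding each word with a fitted encoder.

```python
import numpy as np

def infer_feature_names(words, word_activations, n_labels=2):
    """Label each encoding dimension with the words that activate it the most."""
    names = []
    for column in np.asarray(word_activations).T:
        top = np.argsort(column)[::-1][:n_labels]   # largest activations first
        names.append(", ".join(words[i] for i in top))
    return names

# Words found in the input categories (toy example).
words = ["police", "officer", "fire", "rescuer", "library"]

# Hypothetical sparse, non-negative encodings of each word (rows), d = 3.
activations = np.array([
    [2.0, 0.1, 0.0],
    [1.5, 0.3, 0.0],
    [0.0, 3.0, 0.1],
    [0.2, 2.5, 0.0],
    [0.0, 0.0, 4.0],
])

names = infer_feature_names(words, activations)
```

Ranking by magnitude only makes sense when larger values mean stronger membership, which is why sparse, non-negative encodings are the natural fit for this procedure.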
Dataset  #samples  #categories  #categories per 1000 samples  Gini coefficient  Mean category length (#chars)  Source of high cardinality 

Crime Data  1.5M  135  64.5  0.85  30.6  Multilabel 
Medical Charges  163k  100  99.9  0.23  41.1  Multilabel 
Kickstarter Projects  281k  158  123.8  0.64  11.0  Multilabel 
Employee Salaries  9.2k  385  186.3  0.79  24.9  Multilabel 
Open Payments  2.0M  1.4k  231.9  0.90  24.7  Multilabel 
Traffic Violations  1.2M  11.3k  243.5  0.97  62.1  Typos; Description 
Vancouver Employees  2.6k  640  341.8  0.67  21.5  Multilabel 
Federal Election  3.3M  145.3k  361.7  0.76  13.0  Typos; Multilabel 
Midwest Survey  2.8k  844  371.9  0.67  15.0  Typos 
Met Objects  469k  26.8k  386.1  0.88  12.2  Typos; Multilabel 
Drug Directory  120k  17.1k  641.9  0.81  31.3  Multilabel 
Road Safety  139k  15.8k  790.1  0.65  29.0  Multilabel 
Public Procurement  352k  28.9k  804.6  0.82  46.8  Multilabel; Multilanguage 
Journal Influence  3.6k  3.2k  956.9  0.10  30.0  Multilabel; Multilanguage 
Building Permits  554k  430.6k  940.0  0.48  94.0  Typos; Description 
Wine Reviews  138k  89.1k  997.7  0.23  245.0  Description 
Colleges  7.8k  6.9k  998.0  0.02  32.1  Multilabel 
4 Experimental study of encodings
We now study experimentally different encoding methods in terms of interpretability and supervised-learning performance. For this purpose, we use three different types of data: simulated categorical data, and real data with curated and non-curated categorical entries.
We benchmark the following strategies: one-hot, tf-idf, fastText [mikolov2018advances], similarity encoding [cerda2018similarity], the Gamma-Poisson factorization (default parameter values are listed in Table VIII), and minhash encoding. For all the strategies based on an n-gram representation, we use the set of 2-4 character grams (in addition to the word as tokens, pretrained versions of fastText also use the set of 3-6 character n-grams). For a fair comparison across encoding strategies, we used the same dimensionality in all approaches. To set the dimensionality of one-hot encoding, tf-idf and fastText, we used a truncated SVD (implemented efficiently following [halko2011finding]). Note that dimensionality reduction improves one-hot encoding with tree-based learners for data with rare categories [cerda2018similarity]. For similarity encoding, we selected prototypes with a k-means strategy, following [cerda2018similarity], as it gives slightly better prediction results than the most frequent categories (an implementation of these strategies can be found at https://dirtycat.github.io; we do not test the random projections strategy for similarity encoding as it is not scalable).
4.1 Real-life datasets with string categories
4.1.1 Datasets with high-cardinality categories
In order to evaluate the different encoding strategies, we collected 17 real-world datasets containing a prediction task and at least one relevant high-cardinality categorical variable as feature (if a dataset has more than one categorical variable, only one selected variable was encoded with the proposed approaches, while the rest of them were one-hot encoded). Table III shows a quick description of the datasets and the corresponding categorical variables (see Appendix A.1.1 for a description of datasets and the related learning tasks). Table III also details the source of high cardinality for the datasets: multilabel, typos, description and multilanguage. We call multilabel the situation when a single column contains multiple pieces of information shared by several entries, e.g., supply technician, where supply denotes the type of activity, and technician denotes the rank of the employee (as opposed, e.g., to manager). Typos refers to entries having small morphological variations, as midwest and mid-west. Description refers to categorical entries that are composed of a short free-text description. These are close to a typical NLP problem, although constrained to a very particular subject, so they tend to contain very recurrent informative words and near-duplicate entries. Finally, multilanguage are datasets in which the categorical variable contains more than one language across the different entries.
4.1.2 Datasets with curated strings
We also evaluate the behavior of encoders when the categorical variables have already been curated: usually, entries are standardized to create low-cardinality categorical variables. For this, we collected seven such datasets (see Appendix A.1.2). Experiments on these datasets are intended to show the robustness of the n-gram based approaches to situations where there is no need to reduce the dimensionality of the problem, or when capturing the subword information is not necessarily an issue.
4.2 Recovering latent categories
4.2.1 Recovery on simulated data
Table III shows that the most common scenario for high-cardinality string variables is multilabel categories. The second most common problem is the presence of typos (or any source of morphological variation of the same idea). To analyze these two cases in a more controlled setting, we created two simulated categorical variables. Table IV shows examples of the categories we generated, taking as a base 8 ground truth categories of animals: chicken, eagle, giraffe, horse, leopard, lion, tiger and turtle.
The multilabel data was created by concatenating $k + 2$ ground truth categories, with $k$ following a Poisson distribution; hence, all entries contain at least two labels. For the generation of data with typos, we added 10% of typos to the original ground truth categories by randomly replacing one character by another one (x, y, or z).
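The two simulation procedures can be sketched in Python as follows. Several details are assumptions made for this sketch, since the text does not fix them: a Poisson rate of 1, labels sampled with replacement, and the 10% of typos applied per entry with probability 0.1. The function names are hypothetical.

```python
import numpy as np

ANIMALS = ["chicken", "eagle", "giraffe", "horse",
           "leopard", "lion", "tiger", "turtle"]

def simulate_multilabel(n_samples, rng, rate=1.0):
    """Concatenate k + 2 ground-truth categories, with k ~ Poisson(rate)."""
    entries = []
    for _ in range(n_samples):
        k = rng.poisson(rate)
        labels = rng.choice(ANIMALS, size=k + 2)   # with replacement (assumption)
        entries.append(" ".join(labels))
    return entries

def simulate_typos(n_samples, rng, typo_rate=0.1):
    """Sample categories and corrupt a fraction by replacing one character."""
    entries = []
    for _ in range(n_samples):
        word = str(rng.choice(ANIMALS))
        if rng.random() < typo_rate:
            position = int(rng.integers(len(word)))
            typo = str(rng.choice(list("xyz")))
            word = word[:position] + typo + word[position + 1:]
        entries.append(word)
    return entries

rng = np.random.default_rng(0)
multilabel_entries = simulate_multilabel(200, rng)
typo_entries = simulate_typos(500, rng)
```

Adding 2 to the Poisson draw is what guarantees at least two labels per entry, since a Poisson variable can be zero.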
To measure the ability of an encoder to recover a feature matrix close to a one-hot encoding matrix of ground-truth categories in these simulated settings, we use the Normalized Mutual Information (NMI) as metric. Given two random variables $X_1$ and $X_2$, the NMI is defined as:

$$\mathrm{NMI} = \frac{2\, I(X_1; X_2)}{H(X_1) + H(X_2)} \qquad (21)$$

where $I(\cdot\,;\cdot)$ is the mutual information and $H(\cdot)$ the entropy. To apply this metric to the feature matrix $\mathbf{X}$ generated by the encoding of all ground truth categories, we consider that $\mathbf{X}$, after rescaling (an $\ell_1$ normalization of the rows), can be seen as a two-dimensional probability distribution. For encoders that produce feature matrices with negative values, we take the elementwise absolute value of $\mathbf{X}$. The NMI is a classic measure of correspondences between clustering results [vinh2010information]. Beyond its information-theoretical interpretation, an appealing property is that it is invariant to order permutations. The NMI of any permutation of the identity matrix is equal to 1 and the NMI of any constant matrix is equal to 0. Thus, the NMI in this case is interpreted as a recovering metric of a one-hot encoded matrix of latent, ground truth, categories.
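To make the metric concrete, here is a minimal sketch of the NMI of a feature matrix. It assumes the common normalization $2I/(H_1 + H_2)$ and an $\ell_1$ normalization of the rows as the rescaling step; the function name `nmi` is illustrative and not taken from the authors' code.

```python
import math


def nmi(matrix):
    """NMI of a feature matrix viewed as a joint distribution over (row, column)."""
    # Elementwise absolute value, then l1-normalize each row (assumed rescaling).
    rows = []
    for row in matrix:
        abs_row = [abs(v) for v in row]
        total = sum(abs_row)
        rows.append([v / total for v in abs_row] if total > 0 else abs_row)
    grand = sum(sum(r) for r in rows)
    if grand == 0:
        return 0.0
    # Divide by the grand total so that all entries sum to one.
    joint = [[v / grand for v in r] for r in rows]
    p_row = [sum(r) for r in joint]
    p_col = [sum(col) for col in zip(*joint)]

    def entropy(p):
        return -sum(v * math.log(v) for v in p if v > 0)

    mutual_info = sum(
        v * math.log(v / (p_row[i] * p_col[j]))
        for i, r in enumerate(joint)
        for j, v in enumerate(r)
        if v > 0
    )
    denom = entropy(p_row) + entropy(p_col)
    return 2 * mutual_info / denom if denom > 0 else 0.0
```

With this definition, any permutation of the identity matrix scores 1 and any constant matrix scores 0, matching the two properties stated in the text.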
Table V shows the NMI values for both simulated datasets. The Gamma-Poisson factorization obtains the highest values in both multilabel and typos settings and for different dimensionalities of the encoders. The best recovery is obtained when the dimensionality of the encoder is equal to the number of ground-truth categories, i.e., $d=8$.
Table IV: Examples of simulated categories.

| Type | Example categories |
|---|---|
| Ground truth | chicken; eagle; giraffe; horse; leopard; lion; tiger; turtle. |
| Multilabel | lion chicken; horse eagle lion; tiger leopard giraffe turtle. |
| Typos (10%) | chxcken; eazle; gixaffe; gizaffe; hoyse; lexpard; lezpard; lixn; tiyer; tuxtle. |
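Table V: NMI values for the simulated datasets.

| Encoder | Multilabel, $d$=6 | Multilabel, $d$=8 | Multilabel, $d$=10 | Typos, $d$=6 | Typos, $d$=8 | Typos, $d$=10 |
|---|---|---|---|---|---|---|
| Tf-idf + SVD | 0.16 | 0.18 | 0.17 | 0.17 | 0.17 | 0.17 |
| FastText + SVD | 0.08 | 0.09 | 0.09 | 0.08 | 0.08 | 0.09 |
| Similarity Encoder | 0.32 | 0.25 | 0.24 | 0.72 | 0.82 | 0.78 |
| Minhash Encoder | 0.14 | 0.15 | 0.13 | 0.14 | 0.15 | 0.13 |
| Gamma-Poisson | 0.76 | 0.82 | 0.79 | 0.78 | 0.83 | 0.80 |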
4.2.2 Results for real curated data
Table VI: NMI values for the curated datasets.

| Dataset (cardinality) | Gamma-Poisson | Similarity Encoding | Tf-idf + SVD | FastText + SVD |
|---|---|---|---|---|
| Adult (15) | 0.75 | 0.71 | 0.54 | 0.19 |
| Cacao Flavors (100) | 0.51 | 0.30 | 0.28 | 0.07 |
| California Housing (5) | 0.46 | 0.51 | 0.56 | 0.20 |
| Dating Profiles (19) | 0.52 | 0.24 | 0.25 | 0.12 |
| House Prices (15) | 0.83 | 0.25 | 0.32 | 0.11 |
| House Sales (70) | 0.42 | 0.04 | 0.18 | 0.06 |
| Intrusion Detection (66) | 0.34 | 0.58 | 0.46 | 0.11 |
For curated data, the cardinality is usually low. We nevertheless perform the encoding using a default choice of the dimensionality $d$, to gauge how well turnkey generic encodings represent these curated strings. Table VI shows the NMI values for the different curated datasets, measuring how much the generated encoding resembles a one-hot encoding on the curated categories. Despite the fact that it is used with a dimensionality larger than the cardinality of the curated category, Gamma-Poisson factorization has the highest recovery performance in 5 out of 7 datasets (Table XI in the Appendix shows the same analysis but with $d$ set to the actual cardinality of the categorical variable; in this setting, the Gamma-Poisson factorization gives much higher recovery results).

These experiments show that Gamma-Poisson factorization recovers latent categories well. To validate this intuition, Figure 4 shows such encodings in the case of the simulated data as well as the real-world non-curated Employee Salaries dataset. It confirms that the encodings can be interpreted as loadings on discovered categories that match the inferred feature names.
4.3 Encoding for supervised learning
To study how the encoders perform for statistical analysis, we now turn to measuring prediction accuracy in supervisedlearning tasks.
4.3.1 Experiment settings
We use gradient boosted trees, as implemented in XGBoost [chen2016xgboost]. Note that trees can be implemented on categorical variables (XGBoost does not support categorical features; the recommended option is to use one-hot encoding, see https://xgboost.readthedocs.io). However, this encounters the same problems as one-hot encoding: the number of comparisons grows with the number of categories. Hence, the best tree-based approaches for categorical data use target encoding to impose an order on categories [prokhorenkova2018catboost]. We also investigated other supervised-learning approaches: linear models, neural networks, and kernel machines with RBF and polynomial kernels. However, even with significant hyperparameter tuning, they underperformed XGBoost on our tabular datasets. The good performance of gradient-boosted trees is consistent with previous reports of systematic benchmarks [olson2017data].

Depending on the dataset, the learning task can be either regression, binary or multiclass classification (we use different scores to evaluate the performance of the corresponding supervised learning problem: the $R^2$ score for regression; average precision for binary classification; and accuracy for multiclass classification). As datasets get different prediction scores, we visualize encoders' performance with prediction results scaled in a relative score. It is a dataset-specific scaling of the original score, in order to bring performance across datasets in the same range. In other words, for a given dataset $i$:
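$$\text{relative score}_i^j = 100 \, \frac{\text{score}_i^j - \min_{j' \in \mathcal{J}} \text{score}_i^{j'}}{\max_{j' \in \mathcal{J}} \text{score}_i^{j'} - \min_{j' \in \mathcal{J}} \text{score}_i^{j'}} \qquad (22)$$

where $\text{score}_i^j$ is the prediction score for the dataset $i$ with the configuration $j \in \mathcal{J}$, the set of all trained models, in terms of dimensionality, type of encoder and cross-validation split. The relative score is figure-specific and is only intended to be used as a visual comparison of classifiers' performance across multiple datasets. A higher relative score means better results.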
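As an illustration of such a dataset-specific rescaling, the sketch below assumes a min-max scaling over all configurations of one dataset, expressed in percent. Both the exact form of the scaling and the example scores are assumptions made for illustration; they are not values from the benchmark.

```python
def relative_scores(scores):
    """Min-max rescale the scores of one dataset to the range [0, 100].

    `scores` maps a configuration (encoder, dimensionality, split) to its
    prediction score on that dataset.
    """
    lowest, highest = min(scores.values()), max(scores.values())
    span = highest - lowest
    if span == 0:
        # All configurations tie: no meaningful relative ordering.
        return {config: 0.0 for config in scores}
    return {config: 100 * (s - lowest) / span for config, s in scores.items()}


# Hypothetical scores for a single dataset.
example = {
    ("minhash", 30, 0): 0.60,
    ("gamma-poisson", 30, 0): 0.55,
    ("one-hot", 30, 0): 0.40,
}
relative = relative_scores(example)
```

The worst configuration maps to 0 and the best to 100, which puts datasets with very different raw scores on a common axis.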
For a proper statistical comparison of encoders, we use a ranking test across multiple datasets [demvsar2006statistical]. Note that in this framework each dataset represents a single sample, and not the cross-validation splits, which are not mutually independent. To do so, for a particular dataset, encoders were ranked according to the median score value over cross-validation splits. At the end, a Friedman test [friedman1937use] is used to determine if all encoders, for a fixed dimensionality $d$, come from the same distribution. If the null hypothesis is rejected, we use a Nemenyi post-hoc test [nemenyi1962distribution] to verify whether the difference in performance across pairs of encoders is significant.

To do pairwise comparison between two encoders, we use a pairwise Wilcoxon signed-rank test. A small p-value rejects the null hypothesis that the two encoders perform equally well across different datasets.
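The ranking step and the Friedman statistic can be sketched in plain Python as below. This is a simplified illustration under stated assumptions: it computes only the chi-square form of the statistic without tie correction or p-value, and the scores in `cv_scores` are hypothetical.

```python
from statistics import median


def rank_descending(values):
    """Rank values so that the highest gets rank 1; ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for i in order[start:end + 1]:
            ranks[i] = shared
        start = end + 1
    return ranks


def friedman_statistic(cv_scores):
    """cv_scores[dataset][encoder] is a list of scores over cross-validation splits."""
    datasets = list(cv_scores)
    encoders = list(cv_scores[datasets[0]])
    n, k = len(datasets), len(encoders)
    rank_sums = [0.0] * k
    for name in datasets:
        # One sample per dataset: rank encoders by their median score over splits.
        medians = [median(cv_scores[name][enc]) for enc in encoders]
        for j, r in enumerate(rank_descending(medians)):
            rank_sums[j] += r
    avg_ranks = {enc: rank_sums[j] / n for j, enc in enumerate(encoders)}
    chi2 = (12 * n / (k * (k + 1)) * sum(r ** 2 for r in avg_ranks.values())
            - 3 * n * (k + 1))
    return avg_ranks, chi2


# Hypothetical scores: 3 datasets, 3 encoders, 3 splits each.
cv_scores = {
    "d1": {"one-hot": [0.50, 0.52, 0.51], "minhash": [0.60, 0.61, 0.62],
           "gamma-poisson": [0.55, 0.56, 0.57]},
    "d2": {"one-hot": [0.10, 0.12, 0.11], "minhash": [0.30, 0.31, 0.32],
           "gamma-poisson": [0.20, 0.21, 0.22]},
    "d3": {"one-hot": [0.70, 0.71, 0.72], "minhash": [0.90, 0.91, 0.92],
           "gamma-poisson": [0.80, 0.81, 0.82]},
}
avg_ranks, chi2 = friedman_statistic(cv_scores)
```

In practice a statistics library would be used to obtain p-values and the post-hoc comparisons; the sketch only shows how the per-dataset medians turn into average ranks.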
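Table VII: Comparison of SVD and Gaussian random projection for dimensionality reduction.

| Encoder | SVD vs. random projection (p-value) |
|---|---|
| Tf-idf | 0.003 |
| FastText | 0.001 |
| One-hot | 0.492 |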
4.3.2 Prediction with noncurated data
We now describe the results of several prediction benchmarks with the 17 noncurated datasets.
First, note that one-hot, tf-idf and fastText are naturally high-dimensional encoders, so a dimensionality reduction technique needs to be applied in order to compare the different methodologies; without this reduction, the benchmark would also be infeasible given the long computation times of gradient boosting. Moreover, dimensionality reduction helps to improve prediction with tree-based methods (see [cerda2018similarity]). To approximate Euclidean distances, SVD is optimal. However, it is computationally costly. Using Gaussian random projections [rahimi2008random] is appealing, as they can lead to stateless encoders that require no fit. Table VII compares the prediction performance of both strategies. For tf-idf and fastText, the SVD is significantly superior to random projections. In contrast, there is no statistical difference for one-hot, even though the performance is slightly superior for the SVD (p-value equal to 0.492). Given these results, we use SVD for all further benchmarks.
Figure 5 compares encoders in terms of the relative score of Equation 22. All n-gram based encoders clearly improve upon one-hot encoding, at both dimensions ($d$ equal to 30 and 100). Minhash gives a slightly better prediction performance across datasets, despite being the only method that does not require a data fit step. Results of the Nemenyi ranking test confirm the impression of the figure: n-gram based methods are superior to one-hot encoding; and the minhash encoder has the best average ranking value for both dimensionalities, although the difference in prediction with respect to the other n-gram based methods is not statistically significant.
While we seek generic encoding approaches, using precomputed fastText embeddings requires the choice of a language. As 15 out of 17 datasets are fully in English, the benchmarks above use English embeddings for fastText. Figure 6 studies the importance of this choice, comparing the prediction results for fastText in different languages (English, French and Hungarian). Not choosing English leads to a sizeable drop in prediction accuracy, which gets bigger for more distant languages (such as Hungarian). This shows that the natural language semantics of fastText are indeed important to explain its good prediction performance. A good encoding not only needs to represent the data in a low dimension, but also needs to capture the similarities between the different entries.
4.3.3 Prediction with curated data
We now test the robustness of the different encoding methods to situations where there is no need to capture subword information, e.g., low-cardinality categorical variables, or variables such as "Country name", where the overlap of character n-grams does not have a relevant meaning. We benchmark in Figure 7 all encoders on 7 curated datasets. To simulate black-box usage, the dimensionality $d$ was fixed to the same default value for all of them, with the exception of one-hot. None of the n-gram based encoders performs worse than one-hot. Indeed, the F statistic for the average ranking does not reject the null hypothesis of all encoders coming from the same distribution (p-value equal to 0.37).
4.3.4 Interpretable data science with the GammaPoisson factorization
As shown in Figure 4, the Gamma-Poisson factorization creates sparse, non-negative feature vectors that are easily interpretable as a linear combination of latent categories. We give informative feature names to each of these latent categories (see Section 3.2.3). To illustrate how such an encoding can be used in a data-science setting where humans need to understand results, Figure 8 shows the permutation importances [altmann2010permutation] of each encoding direction of the Gamma-Poisson factorization and its corresponding feature names. By far, the most important inferred feature name to predict salaries in the Employee Salaries dataset is the latent category Manager, Management, Property, which matches general intuitions on salaries.
5 Discussion and conclusion
Onehot encoding is not well suited to columns of a table containing categories represented with many different strings [cerda2018similarity]. Character ngram count vectors can represent strings well, but they dilute the notion of categories with extremely highdimensional vectors. A good encoding should capture string similarity between entries and reflect it in a lower dimensional encoding.
We study several encoding approaches to capture the structural similarities of string entries. The minhash encoder gives a stateless injection of strings to a vector space, transforming inclusions between strings into simple inequalities (Theorem 3.1). A GammaPoisson factorization on the count matrix of substrings gives a lowrank approximation of similarities.
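The inclusion property of the minhash encoder can be illustrated with a small stateless sketch. It assumes character 3-grams and salted SHA-256 digests as a stand-in hash family; this is an illustrative construction, not the hashing scheme of the paper, and the function names are hypothetical.

```python
import hashlib


def char_ngrams(string, n=3):
    """Set of character n-grams of a string (the string itself if too short)."""
    if len(string) < n:
        return {string}
    return {string[i:i + n] for i in range(len(string) - n + 1)}


def minhash_encode(string, dim=8, n=3):
    """Stateless min-hash encoding: one minimum per salted hash function."""
    grams = char_ngrams(string, n)
    encoding = []
    for salt in range(dim):
        hashes = (
            int.from_bytes(
                hashlib.sha256(f"{salt}:{gram}".encode("utf-8")).digest()[:4],
                "big",
            )
            for gram in grams
        )
        encoding.append(min(hashes) / 2 ** 32)  # scale to [0, 1)
    return encoding


short = minhash_encode("lion")
longer = minhash_encode("lion chicken")
# The 3-grams of "lion" are a subset of those of "lion chicken", so every
# coordinate of the longer string is at most the matching coordinate of the
# shorter one.
dominated = all(l <= s for l, s in zip(longer, short))
```

No vocabulary or fitted state is involved: the encoding of a string depends only on the string itself, which is what makes the encoder easy to distribute.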
Scalability
Both GammaPoisson factorization and the minhash encoder can be used on very large datasets, as they can be used in streaming settings. They markedly improve upon onehot encoder for large scale learning as i) they do not need the definition of a vocabulary, ii) they give low dimensional representations, and thus decrease the cost of the subsequent analysis step. Indeed, for both of these encoding approaches, the cost of encoding is usually significantly smaller than that of running a powerful supervised learning method such as XGBoost, even on the reduced dimensionality (see Table X in the Appendix). The minhash encoder is unique in terms of scalability, as it gives lowdimensional representations while being completely stateless, which greatly facilitates distributed computing. The representations enable much better statistical analysis than a simpler stateless lowdimensional encoding built with random projections of ngram string representations. Notably, the most scalable encoder is also the best performing for supervised learning, at the cost of some loss in interpretability.
Recovery of latent categories
Describing results in terms of a small number of categories can greatly help interpreting a statistical analysis. Our experiments on real and simulated data show that encodings created by the Gamma-Poisson factorization correspond to loadings on meaningful recovered categories. It removes the need to manually curate entries to understand what drives an analysis. For this, positivity of the loadings and the soft sparsity imposed by the Gamma prior are crucial; a simple SVD fails to give interpretable loadings (Appendix Figure 13).
AutoML settings
AutoML (automatic machine learning) strives to develop machine-learning pipelines that can be applied to datasets without human intervention [hutter2015automatic, hutter2019automated]. To date, it has focused on tuning and model selection for supervised learning on numerical data. Our work addresses the feature-engineering step. In our experiments, we apply the exact same prediction pipeline to 17 non-curated and 7 curated tabular datasets, without any custom feature engineering. Both Gamma-Poisson factorization and minhash encoder led to the best-performing prediction accuracy, using a classic gradient-boosted tree implementation (XGBoost). We did not tune hyperparameters of the encoding, such as dimensionality or parameters of the priors for the Gamma-Poisson. These string categorical encodings therefore open the door to autoML on the original data, removing the need for feature engineering, which can lead to difficult model selection. A possible rule when integrating tabular data into an autoML pipeline could be to apply minhash or Gamma-Poisson encoder for string categorical columns with a cardinality above 30, and use one-hot encoding for low-cardinality columns. Indeed, results show that these encoders are also suitable for normalized entries.
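The cardinality rule suggested above could be expressed as a small dispatch function. This is only a sketch of the heuristic: the function name, the returned labels, and the default `high_cardinality_encoder` are illustrative choices, and the threshold of 30 is the one mentioned in the text.

```python
def choose_encoder(column, threshold=30, high_cardinality_encoder="minhash"):
    """Pick an encoder label from the number of distinct entries in a column."""
    cardinality = len(set(column))
    if cardinality > threshold:
        return high_cardinality_encoder  # or "gamma-poisson" for interpretability
    return "one-hot"


# Hypothetical columns: 5 distinct values versus 200 distinct values.
low = choose_encoder(["a", "b", "c", "d", "e"] * 10)
high = choose_encoder([f"title {i}" for i in range(200)])
```

Setting `high_cardinality_encoder="gamma-poisson"` reflects the trade-off discussed in the text between raw prediction performance and interpretability.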
One-hot encoding is the de facto standard for statistical analysis on categorical entries. Beyond its simplicity, its strength is to represent the discrete nature of categories. However, it becomes impractical when there are too many different unique entries, for instance because the string representations have not been curated and display typos or combinations of multiple pieces of information in the same entries. For high-cardinality string categories, we have presented two scalable approaches to create low-dimensional encodings that retain the qualitative properties of categorical entries. The minhash encoder is extremely scalable and gives the best prediction performance because it transforms string inclusions to vector-space operations that can easily be captured by a supervised learning step. If interpretability of results is an issue, the Gamma-Poisson factorization performs almost as well for supervised learning, but enables expressing results in terms of meaningful latent categories. As such, it gives a readily usable replacement to one-hot encoding for high-cardinality string categorical variables. Progress brought by these encoders is important, as they avoid one of the time-consuming steps of a data-science study: normalizing entries of databases via human-crafted rules.
Acknowledgments
Authors were supported by the DirtyData (ANR17CE23001801) project.
References
Appendix A Reproducibility
A.1 Dataset Description
A.1.1 Non-curated datasets
Building Permits (https://www.kaggle.com/chicago/chicagobuildingpermits, sample size: 554k). Permits issued by the Chicago Department of Buildings since 2006. Target (regression): Estimated Cost. Categorical variable: Work Description (cardinality: 430k).
Colleges (https://beachpartyserver.azurewebsites.net/VueBigData/DataFiles/Colleges.txt, 7.8k). Information about U.S. colleges and schools. Target (regression): Percent Pell Grant. Categorical variable: School Name (6.9k).
Crime Data (https://data.lacity.org/ASafeCity/CrimeDatafrom2010toPresent/y8tr7khq, 1.5M). Incidents of crime in the City of Los Angeles since 2010. Target (regression): Victim Age. Categorical variable: Crime Code Description (135).
Drug Directory (https://www.fda.gov/Drugs/InformationOnDrugs/ucm142438.htm, 120k). Product listing data submitted to the U.S. FDA for all unfinished, unapproved drugs. Target (multiclass): Product Type Name. Categorical variable: Non Proprietary Name (17k).
Employee Salaries (https://catalog.data.gov/dataset/employeesalaries2016, 9.2k). Salary information for employees of Montgomery County, MD. Target (regression): Current Annual Salary. Categorical variable: Employee Position Title (385).
Federal Election (https://classic.fec.gov/finance/disclosure/ftpdet.shtml, 3.3M). Campaign finance data for the 2011-2012 US election cycle. Target (regression): Transaction Amount. Categorical variable: Memo Text (17k).
Journal Influence (https://github.com/FlourishOA/Data, 3.6k). Scientific journals and the respective influence scores. Target (regression): Average Cites per Paper. Categorical variable: Journal Name (3.1k).
Kickstarter Projects (https://www.kaggle.com/kemical/kickstarterprojects, 281k). More than 300,000 projects from https://www.kickstarter.com. Target (binary): State. Categorical variable: Category (158).
Medical Charges (https://www.cms.gov/ResearchStatisticsDataandSystems/StatisticsTrendsandReports/MedicareProviderChargeData/Inpatient.html, 163k). Inpatient discharges for Medicare beneficiaries for more than 3,000 U.S. hospitals. Target (regression): Average Total Payments. Categorical variable: Medical Procedure (100).
Met Objects (https://github.com/metmuseum/openaccess, 469k). Information on artwork objects of the Metropolitan Museum of Art's collection. Target (binary): Department. Categorical variable: Object Name (26k).
Midwest Survey (https://github.com/fivethirtyeight/data/tree/master/regionsurvey, 2.8k). Survey to know if people self-identify as Midwesterners. Target (multiclass): Census Region (10 classes). Categorical variable: What would you call the part of the country you live in now? (844).
Open Payments (https://openpaymentsdata.cms.gov, 2M). Payments given by healthcare manufacturing companies to medical doctors or hospitals (year 2013). Target (binary): Status (if the payment was made under a research protocol). Categorical variable: Company name (1.4k).
Public Procurement (https://data.europa.eu/euodp/en/data/dataset/tedcsv, 352k). Public procurement data for the European Economic Area, Switzerland, and Macedonia. Target (regression): Award Value Euro. Categorical variable: CAE Name (29k).
Road Safety (https://data.gov.uk/dataset/roadaccidentssafetydata, 139k). Circumstances of personal injury road accidents in Great Britain from 1979. Target (binary): Sex of Driver. Categorical variable: Car Model (16k).
Traffic Violations (https://catalog.data.gov/dataset/trafficviolations56dda, 1.2M). Traffic information from electronic violations issued in Montgomery County, MD. Target (multiclass): Violation type (4 classes). Categorical variable: Description (11k).
Vancouver Employee (https://data.vancouver.ca/datacatalogue/employeeRemunerationExpensesOver75k.htm, 2.6k). Remuneration and expenses for employees earning over $75,000 per year. Target (regression): Remuneration. Categorical variable: Title (640).
Wine Reviews (https://www.kaggle.com/zynicide/winereviews/home, 138k). Wine reviews scraped from WineEnthusiast. Target (regression): Points. Categorical variable: Description (89k).
A.1.2 Curated datasets
Adult (https://archive.ics.uci.edu/ml/datasets/adult, sample size: 32k). Predict whether income exceeds $50K/yr based on census data. Target (binary): Income. Categorical variable: Occupation (cardinality: 15).
Cacao Flavors (https://www.kaggle.com/rtatman/chocolatebarratings, 1.7k). Expert ratings of over 1,700 individual chocolate bars, along with information on their origin and bean variety. Target (multiclass): Bean Type. Categorical variable: Broad Bean Origin (97).
California Housing (https://github.com/ageron/handsonml/tree/master/datasets/housing, 20k). Based on the 1990 California census data. It contains one row per census block group (a block group typically has a population of 600 to 3,000 people). Target (regression): Median House Value. Categorical variable: Ocean Proximity (5).
Dating Profiles (https://github.com/rudeboybert/JSE_OkCupid, 60k). Anonymized data of dating profiles from OkCupid. Target (regression): Age. Categorical variable: Diet (19).
House Prices (https://www.kaggle.com/c/housepricesadvancedregressiontechniques, 1.1k). Contains variables describing residential homes in Ames, Iowa. Target (regression): Sale Price. Categorical variable: MSSubClass (15).
House Sales (https://www.kaggle.com/harlfoxem/housesalesprediction, 21k). Sale prices for houses in King County, which includes Seattle. Target (regression): Price. Categorical variable: ZIP code (70).
Intrusion Detection (https://archive.ics.uci.edu/ml/datasets/KDD+Cup+1999+Data, 493k). Network intrusion simulations with a variety of descriptors of the attack type. Target (multiclass): Attack Type. Categorical variable: Service (66).
A.2 Learning pipeline
Sample size
Dataset sizes range from a couple of thousand to several million samples. To reduce computation time on the learning step, the number of samples was limited to 100k for large datasets.
Data preprocessing
We removed rows with missing values in the target or in any explanatory variable other than the selected categorical variable, for which we replaced missing entries by the string ‘nan’. The only additional preprocessing for the categorical variable was to transform all entries to lower case.
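A minimal sketch of this preprocessing, assuming rows are stored as dictionaries and missing values are represented by `None` (both are assumptions for illustration; the function name `preprocess` is hypothetical):

```python
def preprocess(rows, categorical):
    """Drop rows with a missing target or explanatory variable, clean the
    selected categorical column, and return the cleaned rows as new dicts."""
    cleaned = []
    for row in rows:
        others_missing = any(
            value is None for key, value in row.items() if key != categorical
        )
        if others_missing:
            continue  # missing value outside the categorical column: drop the row
        entry = row[categorical]
        # Missing categorical entries become the string 'nan'; others are lowercased.
        entry = "nan" if entry is None else str(entry).lower()
        cleaned.append({**row, categorical: entry})
    return cleaned


# Hypothetical rows in the spirit of the Employee Salaries dataset.
rows = [
    {"salary": 50000, "gender": "F", "title": "Police Officer III"},
    {"salary": None, "gender": "M", "title": "Manager"},
    {"salary": 60000, "gender": "M", "title": None},
]
cleaned = preprocess(rows, "title")
```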
Crossvalidation
For every dataset, we made 20 random splits of the data, with one third of samples for testing at each time. In the case of binary classification, we performed stratified randomization.
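The splitting scheme can be sketched with the standard library as follows. The sketch covers only the non-stratified case, and the function name and seed are illustrative; scikit-learn's `ShuffleSplit` and `StratifiedShuffleSplit` offer equivalent functionality, including the stratified variant used for binary classification.

```python
import random


def random_splits(n_samples, n_splits=20, test_fraction=1 / 3, seed=0):
    """Yield (train_indices, test_indices) for independent random splits."""
    rng = random.Random(seed)
    n_test = round(n_samples * test_fraction)
    for _ in range(n_splits):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        # The first n_test shuffled indices form the test set.
        yield indices[n_test:], indices[:n_test]


splits = list(random_splits(90))
```

Because each split is drawn independently, the same sample can appear in several test sets, which is why the splits are not treated as independent samples in the statistical tests.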
Performance metrics
Depending on the type of prediction task, we used different scores to evaluate the performance of the supervised learning problem: for regression, we used the $R^2$ score; for binary classification, the average precision; and for multiclass classification, the accuracy score.
Parametrization of classifiers
We used scikit-learn [pedregosa2011scikit] for most of the data processing. For all the experiments, we used the scikit-learn compatible implementation of XGBoost [chen2016xgboost], with a grid search on the learning_rate (0.05, 0.1, 0.3) and max_depth (3, 6, 9) parameters. All datasets and encoders use the same parametrization.
Dimensionality reduction
We used the scikit-learn implementations of TruncatedSVD and GaussianRandomProjection, with the default parametrization in both cases.
A.3 Online Resources
Experiments are available in Python code at https://github.com/pcerda/tkde2019. Implementations and examples on learning with string categories can be found at http://dirtycat.github.io. The available encoders are compatible with the scikit-learn API.
Appendix B Algorithmic considerations
B.1 Gamma-Poisson factorization
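Table VIII: Default parameter values used in the experiments.

| Parameter | Definition | Default value |
|---|---|---|
| | Poisson shape | 1.1 |
| | Poisson scale | 1.0 |
| | Discount factor | 0.95 |
| | Mini-batch size | 256 |
| | Approximation error | |
| | Approximation error | |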
Algorithm 1 requires some input parameters and initializations that can affect convergence. One important parameter is the discount factor applied to the fitting in the past. Figure 9 shows that choosing a discount factor of 0.95 gives the best compromise between stability of the convergence and data fitting in terms of the generalized KL divergence. The default values used in the experiments are listed in Table VIII.
With respect to the initialization of the topic matrix, the best option is to choose the centroids of a k-means clustering (Figure 10) in a hashed version of the n-gram count matrix in a reduced dimensionality (in order to speed up convergence of the k-means algorithm) and then project back to the n-gram space with a nearest-neighbors algorithm.
is used, as it gives a good tradeoff between convergence and stability of the solution across the number of epochs.
Appendix C Additional figures

| Datasets | One-hot + SVD | Similarity encoder | Tf-idf + SVD | FastText + SVD | Gamma-Poisson | Minhash encoder |
|---|---|---|---|---|---|---|
| building permits | 0.244 | 0.505 | 0.550 | 0.544 | 0.566 | |
| colleges | 0.499 | 0.532 | 0.530 | 0.524 | 0.527 | |
| crime data | 0.443 | 0.445 | 0.445 | 0.445 | 0.446 | |
| drug directory | 0.971 | 0.979 | 0.980 | 0.980 | 0.981 | |
| employee salaries | 0.880 | 0.905 | 0.892 | 0.901 | 0.900 | |
| federal election | 0.135 | 0.141 | 0.144 | 0.146 | 0.146 | |
| journal influence | 0.019 | 0.138 | 0.164 | 0.118 | 0.133 | |
| kickstarter projects | 0.879 | 0.879 | 0.880 | 0.879 | 0.880 | |
| medical charge | 0.904 | 0.905 | 0.904 | 0.904 | 0.904 | |
| met objects | 0.771 | 0.790 | 0.789 | 0.791 | 0.794 | |
| midwest survey | 0.575 | 0.635 | 0.646 | 0.636 | 0.651 | |
| public procurement | 0.678 | 0.677 | 0.678 | 0.678 | 0.674 | |
| road safety | 0.553 | 0.562 | 0.562 | 0.560 | 0.563 | |
| traffic violations | 0.782 | 0.789 | 0.789 | 0.790 | 0.792 | |
| vancouver employee | 0.395 | 0.550 | 0.530 | 0.509 | 0.556 | |
| wine reviews | 0.439 | 0.671 | 0.724 | 0.657 | 0.679 | |
Datasets  Encoding time  Training time  Encoding time / 

GammaPoisson  XGBoost  training time  
building permits  699  
colleges  17  
crime data  28  
drug directory  255  
employee salaries  4  
federal election  126  
journal influence  7  
kickstarter projects  20  
medical charge  42  
met objects  154  
midwest survey  2  
public procurement  547  
road safety  191  
traffic violations  105  
vancouver employee  2  
wine reviews  877 
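Table XI: NMI values for the curated datasets, with $d$ set to the cardinality of the categorical variable.

| Dataset (cardinality) | Gamma-Poisson | Similarity Encoding | Tf-idf + SVD | FastText + SVD |
|---|---|---|---|---|
| Adult (15) | 0.84 | 0.71 | 0.54 | 0.19 |
| Cacao Flavors (100) | 0.48 | 0.34 | 0.34 | 0.1 |
| California Housing (5) | 0.83 | 0.51 | 0.56 | 0.20 |
| Dating Profiles (19) | 0.47 | 0.26 | 0.29 | 0.12 |
| House Prices (15) | 0.91 | 0.25 | 0.32 | 0.11 |
| House Sales (70) | 0.29 | 0.03 | 0.26 | 0.07 |
| Intrusion Detection (66) | 0.27 | 0.65 | 0.61 | 0.13 |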