Tabular datasets often contain columns with string entries. However, fitting statistical models on such data generally requires a numerical representation of all entries, which calls for building an encoding, or vector representation of the entries. Considering string entries as nominal—unordered—categories gives well-framed statistical analysis. In such situations, categories are supposed to be mutually exclusive and unrelated, with a fixed known set of possible values. Yet, in many real-world datasets, string columns are not standardized in a small number of categories. This poses challenges for statistical analysis. First, the set of all possible categories may be huge and not known a priori, as the number of different strings in the column can indefinitely increase with the number of samples. Second, categories may be related: they often carry some morphological or semantic links.
The classic approach to encode categorical variables for statistical analysis is one-hot encoding. It creates vectors that agree with the general intuition of nominal categories: orthogonal and equidistant [cohen2013applied]. However, for high-cardinality categories, one-hot encoding leads to feature vectors of high dimensionality. This is especially problematic in big data settings, which can lead to a very large number of categories, posing computational and statistical problems.
Data engineering practices typically tackle these issues with data-cleaning techniques [pyle1999data, rahm2000data]. In particular, deduplication tries to merge different variants of the same entity [winkler2006overview, elmagarmid2007duplicate, christen2012data]. A related concept is that of normalization, used in databases and text processing to put entries in canonical forms. However, data cleaning and normalization often require human intervention and are major costs in data analysis (see the Kaggle industry survey: https://www.kaggle.com/surveys/2017). To avoid the cleaning step, similarity encoding [cerda2018similarity] relaxes one-hot encoding by using string similarities [gomaa2013survey]. Hence, it addresses the problem of related categories and has been shown to improve statistical analysis upon one-hot encoding [cerda2018similarity]. Yet, it does not tackle the problem of high cardinality, and the data analyst must resort to heuristics such as choosing a subset of the training categories [cerda2018similarity].
Here, we seek encoding approaches for statistical analysis on string categorical entries that are suited to a very large number of categories without any human intervention: avoiding data cleaning, feature engineering, or neural architecture search. Our goals are: i) to provide feature vectors of limited dimensionality without any cleaning or feature engineering step, even for very large datasets; ii) to improve statistical analysis tasks such as supervised learning; and iii) to preserve the intuitions behind categories: entries can be arranged in natural groups that can be easily interpreted. We study two novel encoding methods that both address scalability and statistical performance: a min-hash encoder, based on locality-sensitive hashing (LSH) [gionis1999similarity], and a low-rank model of co-occurrences in character n-grams: a Gamma-Poisson matrix factorization, suited to counting statistics. Both models scale linearly with the number of samples and are suitable for statistical analysis in streaming settings. Moreover, we show that the Gamma-Poisson factorization model enables interpretability with a sparse encoding that expresses the entries of the data as linear combinations of a small number of latent categories, built from their substring information. This interpretability is very important: opaque and black-box machine learning models have limited adoption in real-world data-science applications. Often, practitioners resort to manual data cleaning to regain interpretability of the models.
Finally, we demonstrate on 17 real-life datasets that our encoding methods improve supervised learning on non-curated data without the need for dataset-specific choices. As such, these encodings provide a scalable and automated replacement for data cleaning or feature engineering, and restore the benefits of a low-dimensional categorical encoding, such as one-hot encoding.
The paper is organized as follows. Section 2 states the problem in detail and the prior art on creating feature vectors from categorical variables. Section 3 introduces and studies our two encoding approaches. In section 4, we present our experimental study with an emphasis on interpretation and on statistical learning on 17 non-curated datasets. Section 5 discusses these results, after which appendices provide information on the datasets and the experiments to facilitate the reproduction of our findings.
2 Problem setting and prior art
The statistics literature often considers datasets that contain only categorical variables with a low cardinality, as datasets in the UCI repository [dua2017uci] (see, for example, the Adult dataset: https://archive.ics.uci.edu/ml/datasets/adult). In such settings, the popular one-hot encoding is a suitable solution for supervised learning [cohen2013applied]. It models categories as mutually exclusive and, as categories are known a priori, new categories are not expected to appear in the test set. With enough data, supervised learning can then be used to link each category to a target variable.
2.1 High-cardinality categorical variables
However, in many real-world problems, the number of different string entries in a column is very large, often growing with the number of observations (Figure 1). Consider for instance the Drug Directory dataset (product listing data for all unfinished, unapproved drugs; source: U.S. Food and Drug Administration, FDA): one of the variables is a categorical column with non-proprietary names of drugs. As entries in this column have not been normalized, many different entries are likely related: they share a common ingredient such as alcohol (see (a)). Another example is the Employee Salaries dataset (annual salary information for employees of Montgomery County, MD, U.S.A.; source: https://data.montgomerycountymd.gov/). Here, a relevant variable is the position title of employees. As shown in (b), there is also overlap between the different occupations.
High-cardinality categorical variables may arise from variability in their string representations, such as abbreviations, special characters, or typos (a taxonomy of different sources of dirty data can be found in [kim2003taxonomy], and a formal description of data quality problems is proposed by [oliveira2005formal]). Such non-normalized data often contains very rare categories. Yet, these categories tend to have common morphological information. Indeed, the number of unique entries grows more slowly with the size of the data than the number of words in natural language (Figure 1). In both examples above, drug names and position titles of employees, there is an implicit taxonomy. Crafting feature-engineering or data-cleaning rules can recover a small number of relevant categories. However, it is time consuming and often needs domain expertise.
We write sets of elements with capital curly fonts, as $\mathcal{X}$. Elements of a vector space (we consider row vectors) are written in bold $\mathbf{x}$ with the $i$-th entry denoted by $x_i$, and matrices are in capital and bold $\mathbf{X}$, with $x_{ij}$ the entry on the $i$-th row and $j$-th column.
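Let $C$ be a categorical variable such that $\text{dom}(C) \subseteq \mathcal{S}$, the set of finite length strings. We call categories the elements of $\text{dom}(C)$. Let $s_i \in \mathcal{S}$, $i = 1 \dots n$, be the category corresponding to the $i$-th sample of a dataset. For statistical learning, we want to find an encoding function $\text{enc}: \mathcal{S} \to \mathbb{R}^d$, such that $\text{enc}(s_i) = \mathbf{x}_i$. We call $\mathbf{x}_i$ the feature map of $s_i$. Table II contains a summary of the main variables used in the next sections.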
2.2 One-hot encoding, limitations and extensions
2.2.1 Shortcomings of one-hot encoding
From a statistical-analysis standpoint, the multiplication of entries with related information is challenging for two reasons. First, it dilutes the information: learning on rare categories is hard. Second, with one-hot encoding, representing these as separate categories creates high-dimensional feature vectors. This high dimensionality entails large computational and memory costs; it increases the complexity of the associated learning problem, resulting in poor statistical estimation [bottou2008tradeoffs]. Dimensionality reduction of the one-hot encoded matrix can help with this issue, but at the risk of losing information.
Encoding all unique entries with orthogonal vectors discards the overlap information visible in the string representations. Also, one-hot encoding cannot assign a feature vector to new categories that may appear in the testing set, even if its representation is close to one in the training set. Heuristics, such as assigning the zero vector to new categories, create collisions if more than one new category appears. As a result, one-hot encoding is ill suited to online learning settings: if new categories arrive, the entire encoding of the dataset has to be recomputed and the dimensionality of the feature vector becomes unbounded.
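The two shortcomings above can be illustrated with a minimal sketch of one-hot encoding in plain Python. The position titles and the helper names (`fit_one_hot`, `one_hot`) are hypothetical and only serve the illustration; the zero-vector rule for unseen categories is the heuristic mentioned in the text.

```python
def fit_one_hot(train_categories):
    """Map each unique training category to a column index."""
    return {cat: i for i, cat in enumerate(sorted(set(train_categories)))}


def one_hot(category, index):
    """Encode one category; unseen categories get the all-zero vector."""
    vec = [0] * len(index)
    if category in index:
        vec[index[category]] = 1
    return vec


# Hypothetical position titles, in the spirit of the Employee Salaries data.
train = ["Police Officer", "Police Officer II", "Firefighter"]
index = fit_one_hot(train)

a = one_hot("Police Officer", index)
b = one_hot("Police Officer II", index)
# Related strings still get orthogonal vectors: their dot product is 0.
dot_ab = sum(x * y for x, y in zip(a, b))

# Two different unseen categories collide on the zero vector.
u = one_hot("Police Sergeant", index)
v = one_hot("Bus Operator", index)
```

The encoding dimension equals the number of distinct training categories, so it grows whenever a new spelling appears in the training data.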
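|Symbol||Definition|
|$\mathcal{S}$||Set of all finite-length strings.|
|$\mathcal{G}(s) \subseteq \mathcal{S}$||Set of all consecutive n-grams in $s \in \mathcal{S}$.|
|$\mathcal{V}$||Vocabulary of n-grams in the train set.|
|$n$||Number of samples.|
|$d$||Dimension of the categorical encoder.|
|$m = |\mathcal{V}|$||Cardinality of the vocabulary.|
|$\mathbf{F}$||Count matrix of n-grams.|
|$\mathbf{X}$||Feature matrix of $C$.|
|$h_j$||Hash function with salt value equal to $j$.|
|$Z_j$||Min-hash function with salt value equal to $j$.|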
2.2.2 Similarity encoding for string categorical variables
For categorical variables represented by strings, similarity encoding extends one-hot encoding by taking into account a measure of string similarity between pairs of categories [cerda2018similarity].
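Let $s_i \in \mathcal{S}$, $i = 1 \dots n$, be the category corresponding to the $i$-th sample of a given training dataset. Given a string similarity $\text{sim}(s_i, s_j): \mathcal{S} \times \mathcal{S} \to [0, 1]$, similarity encoding builds a feature map $\mathbf{x}_i^{\text{sim}} \in \mathbb{R}^k$ as:

$$\mathbf{x}_i^{\text{sim}} = \left[\text{sim}(s_i, s^{(1)}), \dots, \text{sim}(s_i, s^{(k)})\right] \in \mathbb{R}^k \tag{1}$$

where $\{s^{(l)}, l = 1 \dots k\}$ is the set of all unique categories in the training set, or a subset of prototype categories chosen heuristically (in this work, we use as dimensionality reduction technique the k-means strategy explained in [cerda2018similarity]). With the previous definition, one-hot encoding corresponds to taking the discrete string similarity:

$$\text{sim}_{\text{one-hot}}(s_i, s_j) = \mathbb{1}[s_i = s_j] \tag{2}$$

where $\mathbb{1}[\cdot]$ is the indicator function.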
Empirical work on databases with categorical columns containing non-normalized entries showed that similarity encoding with a continuous string similarity brings significant benefits upon one-hot encoding [cerda2018similarity]. Indeed, it relates rare categories to similar, more frequent ones. In columns with typos or morphological variants of the same information, a simple string similarity is often enough to capture additional information. Similarity encoding outperforms a bag-of-n-grams representation of the input string, as well as methods that encode high-cardinality categorical variables without capturing information in the strings representations [cerda2018similarity], such as target encoding [micci2001preprocessing] or hash encoding [weinberger2009feature].
A variety of string similarities can be considered for similarity encoding, but [cerda2018similarity] found that a good performer was a similarity based on n-grams of consecutive characters. This n-gram similarity is based on splitting the two strings to compare into their character n-grams and calculating the Jaccard coefficient between these two sets [angell1983automatic]:

$$\text{sim}_{\text{n-gram}}(s_i, s_j) = J(\mathcal{G}(s_i), \mathcal{G}(s_j)) = \frac{|\mathcal{G}(s_i) \cap \mathcal{G}(s_j)|}{|\mathcal{G}(s_i) \cup \mathcal{G}(s_j)|} \tag{3}$$

where $\mathcal{G}(s)$, $s \in \mathcal{S}$, is the set of consecutive character n-grams for the string $s$. Beyond the use of string similarity, an important aspect of similarity encoding is that it is a prototype method, using as prototypes a subset of the categories in the train set.
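As a minimal sketch, the n-gram similarity and the resulting similarity encoding can be written in a few lines of Python. The function names, the choice of 3-grams, the prototype list and the convention for strings shorter than $n$ are assumptions made for this illustration; they are not taken from a reference implementation.

```python
def ngrams(s, n=3):
    """Set of consecutive character n-grams of s."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard coefficient between the n-gram sets of two strings."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    union = ga | gb
    if not union:
        # Assumed convention for strings shorter than n: exact match only.
        return 1.0 if a == b else 0.0
    return len(ga & gb) / len(union)


def similarity_encode(s, prototypes, n=3):
    """One feature per prototype: the similarity of s to that prototype."""
    return [ngram_similarity(s, p, n) for p in prototypes]


# "midwest" and "mid-west" share the 3-grams mid, wes, est out of 8 in total.
sim = ngram_similarity("midwest", "mid-west")

# Hypothetical prototypes; an unseen variant still gets a meaningful vector.
encoding = similarity_encode("mid-west", ["midwest", "northeast"])
```

Unlike one-hot encoding, a new spelling variant is mapped close to the prototype it resembles instead of to the zero vector.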
2.3 Related solutions for encoding string categories
2.3.1 Bag of n-grams
A simple way to capture morphology in a string is to characterize it by the count of its character n-grams. This is sometimes called a bag-of-n-grams characterization of strings. Such a representation has been shown to be efficient for spelling correction [angell1983automatic] or for named-entity recognition [klein2003named].
For high-cardinality categorical variables, the number of different n-grams tends to increase with the number of samples. Yet, this number increases slower than in a typical NLP problem (see Figure 2). Indeed, categorical variables have less entropy than free text: they are usually repeated, often have subparts in common, and refer to a particular, more restrictive subject.
Representing strings by character-level n-grams is related to vectorizing text by their tokens or words. Common practice uses term-frequency inverse-document-frequency (tf-idf) reweighting: dividing a token's count in a sample by its count in the whole document. Dimensionality reduction by a singular value decomposition (SVD) on this matrix leads to a simple topic extraction, latent semantic analysis (LSA) [landauer1998introduction]. A related but more scalable solution for dimensionality reduction is random projection, which gives a low-dimensional approximation of Euclidean distances [johnson1984extensions, achlioptas2003database].
2.3.2 Word embeddings
If the string entries are common words, an approach to represent them as vectors is to leverage word embeddings developed in natural language processing [pennington2014glove, mikolov2013efficient]. Euclidean similarity of these vectors captures related semantic meaning in words. Multiple words can be represented as a weighted sum of their vectors, or with more complex approaches [arora2016simple]. To cater for out-of-vocabulary strings, FastText [bojanowski2017enriching] considers subword information of words, i.e., character-level n-grams. Hence, it can encode strings even in the presence of typos. Word vectors computed on very large corpora are available for download. These have captured fine semantic links between words. However, to analyze a given database, the danger of such an approach is that the semantics of categories may differ from that in the pretrained model. These encodings do not adapt to the information specific to the data at hand. Moreover, they cannot be trained directly on the categorical variables for two reasons: categories lack enough context, as they are usually composed of short strings; and the number of samples in some datasets is not enough to properly train these models.
3 Scalable encoding of string categories
We now describe two novel approaches for categorical encoding of string variables. Both are based on the character-level structure of categories. The first one, which we call min-hash encoding, comes from the document indexation literature, and in particular from the idea of locality-sensitive hashing (LSH) [gionis1999similarity]. The method is stateless and it has been shown to approximate the Jaccard coefficient between two strings [broder1997resemblance]. The second one is the Gamma-Poisson factorization [canny2004gap], a matrix factorization technique (originally used in the probabilistic topic modeling literature) that assumes a Poisson distribution on the n-gram counts of categories, with a Gamma prior on the activations. An online algorithm for the matrix factorization lets the method scale with a complexity linear in the number of samples. Both methods capture the morphological similarity of categories in a reduced dimensionality.
3.1 Min-hash encoding
3.1.1 Background: min-hash
Locality-sensitive hashing (LSH) [gionis1999similarity] has been extensively used for approximate nearest neighbor search as an efficient way of finding similar objects (documents, pictures, etc.) in high-dimensional settings. One of the most famous functions in the LSH family is the min-hash function [broder1997resemblance, broder2000min], originally designed to retrieve similar documents in terms of the Jaccard coefficient of the word counts of documents (see [leskovec2014mining], chapter 3, for a primer).
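Let $\mathcal{X}^\star$ be a totally ordered set and $\pi$ a random permutation of the order in $\mathcal{X}^\star$. For any non-empty $\mathcal{X} \subseteq \mathcal{X}^\star$ with finite cardinality, the min-hash function $Z(\mathcal{X})$ can be defined as:

$$Z(\mathcal{X}) = \min_{x \in \mathcal{X}} \pi(x) \tag{4}$$

$Z(\mathcal{X})$ can also be seen as a random variable. As shown in [broder1997resemblance], for any $\mathcal{X}, \mathcal{Y} \subseteq \mathcal{X}^\star$, the min-hash function has the following property:

$$P\left(Z(\mathcal{X}) = Z(\mathcal{Y})\right) = \frac{|\mathcal{X} \cap \mathcal{Y}|}{|\mathcal{X} \cup \mathcal{Y}|} = J(\mathcal{X}, \mathcal{Y}) \tag{5}$$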
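where $J(\mathcal{X}, \mathcal{Y})$ is the Jaccard coefficient between the two sets. For a controlled approximation, several random permutations can be taken, which defines a min-hash signature. For $d$ permutations $\pi_j$ drawn i.i.d., Equation 5 leads to:

$$\sum_{j=1}^{d} \mathbb{1}\left[Z_j(\mathcal{X}) = Z_j(\mathcal{Y})\right] \sim \mathcal{B}\left(d, J(\mathcal{X}, \mathcal{Y})\right) \tag{6}$$

where $\mathcal{B}$ denotes the Binomial distribution. Dividing the above quantity by $d$ thus gives a consistent estimate of the Jaccard coefficient $J(\mathcal{X}, \mathcal{Y})$.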
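The estimate can be checked numerically with a small simulation. The sketch below draws explicit random permutations with Python's `random` module; the sets, the signature length and the helper names are arbitrary choices for the illustration, not part of the method's definition.

```python
import random


def random_permutations(universe, d, seed=0):
    """Draw d random permutations, each stored as a dict element -> rank."""
    rng = random.Random(seed)
    perms = []
    for _ in range(d):
        order = list(universe)
        rng.shuffle(order)
        perms.append({e: r for r, e in enumerate(order)})
    return perms


def minhash_signature(elements, permutations):
    """One min-hash value per permutation: the smallest rank in the set."""
    return [min(rank[e] for e in elements) for rank in permutations]


X = set(range(0, 60))
Y = set(range(30, 90))
d = 2000

perms = random_permutations(sorted(X | Y), d)
sig_x = minhash_signature(X, perms)
sig_y = minhash_signature(Y, perms)

# Fraction of matching signature entries: the min-hash estimate of J(X, Y).
estimate = sum(zx == zy for zx, zy in zip(sig_x, sig_y)) / d
exact = len(X & Y) / len(X | Y)
```

With these sets the exact Jaccard coefficient is 1/3, and the binomial model gives a standard deviation of about 0.01 for the estimate at $d = 2000$.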
Without loss of generality, we can consider the case of $\mathcal{X}^\star$ being equal to the real interval $[0, 1]$, so now for any $x \in [0, 1]$, $\pi(x) \sim \mathcal{U}(0, 1)$.
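Proposition 3.1. Marginal distribution. If $\pi(x) \sim \mathcal{U}(0, 1)$, and $\mathcal{X} \subset [0, 1]$ is such that $|\mathcal{X}| = k$, then $Z(\mathcal{X}) \sim \text{Beta}(1, k)$.
It comes directly from considering that:

$$P\left(Z(\mathcal{X}) \le z\right) = 1 - P\left(Z(\mathcal{X}) > z\right) = 1 - \prod_{i=1}^{k} P\left(\pi(x_i) > z\right) = 1 - (1 - z)^k$$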
Now that we know the distribution of the min-hash random variable, we will show how each dimension of a min-hash signature maps inclusion of sets to simple inequalities.
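Proposition 3.2. Inclusion. Let $\mathcal{X}, \mathcal{Y} \subset [0, 1]$ be such that $|\mathcal{X}| = k_x$ and $|\mathcal{Y}| = k_y$.
(i) If $\mathcal{X} \subset \mathcal{Y}$, then $Z(\mathcal{Y}) \le Z(\mathcal{X})$.
(ii) If $\mathcal{X} \cap \mathcal{Y} = \emptyset$, then $P\left(Z(\mathcal{Y}) \le Z(\mathcal{X})\right) = \frac{k_y}{k_x + k_y}$.
(i) is trivial and (ii) comes directly from Prop. 3.1:

$$P\left(Z(\mathcal{Y}) - Z(\mathcal{X}) \le 0\right) = \int_0^1 \int_0^x f_{Z(\mathcal{Y})}(y)\, f_{Z(\mathcal{X})}(x)\, dy\, dx = \int_0^1 \left(1 - (1 - x)^{k_y}\right) k_x (1 - x)^{k_x - 1}\, dx = \frac{k_y}{k_x + k_y}$$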
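Both statements are easy to check by simulation. The following sketch assumes that the permuted values of the set elements are i.i.d. uniform on $[0, 1]$ and compares empirical frequencies with two closed forms that hold under that assumption: the distribution function $1 - (1 - z)^k$ of the minimum of $k$ uniform values, and the probability $k_y / (k_x + k_y)$ that the overall minimum of two disjoint sets falls in the set of size $k_y$. The set sizes and the number of trials are arbitrary.

```python
import random

rng = random.Random(0)


def minhash_uniform(k):
    """Min-hash of a set of k elements whose permuted values are i.i.d. U(0, 1)."""
    return min(rng.random() for _ in range(k))


k_x, k_y, trials = 3, 5, 20000

# Marginal distribution: P(Z(X) <= z) should be close to 1 - (1 - z)^k.
z = 0.2
cdf_empirical = sum(minhash_uniform(k_x) <= z for _ in range(trials)) / trials
cdf_expected = 1 - (1 - z) ** k_x

# Disjoint sets: P(Z(Y) <= Z(X)) should be close to k_y / (k_x + k_y).
wins = sum(minhash_uniform(k_y) <= minhash_uniform(k_x) for _ in range(trials))
p_empirical = wins / trials
p_expected = k_y / (k_x + k_y)
```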
At this point, we do not know anything about the case when $\mathcal{X} \not\subseteq \mathcal{Y}$, so for a fixed $Z(\mathcal{X})$, we cannot ensure that any set with lower min-hash value has $\mathcal{X}$ as inclusion. The following theorem allows us to define regions in the vector space generated by the min-hash signature that, with high probability, are associated with inclusion rules.
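Theorem 3.1. Identifiability of inclusion rules.
Let $\mathcal{X}, \mathcal{Y} \subset [0, 1]$ be two finite sets such that $|\mathcal{X}| = k_x$ and $|\mathcal{Y}| = k_y$. $\forall \epsilon > 0$, if $d \ge \left\lceil -\log(\epsilon) / \log\left(1 + \frac{k_x}{k_y}\right) \right\rceil$, then:

$$\mathcal{X} \not\subseteq \mathcal{Y} \Rightarrow P\left(\sum_{j=1}^{d} \mathbb{1}\left[Z_j(\mathcal{Y}) \le Z_j(\mathcal{X})\right] = d\right) \le \epsilon$$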
First, notice that:
Then, defining , with :
Theorem 3.1 tells us that taking enough random permutations ensures that when $Z_j(\mathcal{Y}) \le Z_j(\mathcal{X})$ for all $j$, the probability that $\mathcal{X} \not\subseteq \mathcal{Y}$ is small. This result is very important, as it shows a global property of the min-hash representation when using several random permutations, going beyond the well-known properties of collisions in the min-hash signature. Figure 11 in the Appendix confirms empirically the bound on the dimensionality $d$ and its logarithmic dependence on the desired false positive rate $\epsilon$.
3.1.2 The min-hash encoder
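A practical way to build a computationally efficient implementation of min-hash is to use a hash function with different salt numbers instead of random permutations. Indeed, hash functions can be built with suitable i.i.d. random-process properties [broder2000min]. Thus, the min-hash function can be constructed as follows:

$$Z_j(\mathcal{X}) = \min_{x \in \mathcal{X}} h_j(x)$$

where $h_j$ is a hash function on $\mathcal{X}^\star$ with salt value $j$ (here we use a 32-bit version of the MurmurHash3 function [appleby2014murmurhash3]).
For the specific problem of categorical data, we are interested in a fast approximation of $J(\mathcal{G}(s_i), \mathcal{G}(s_j))$, where $\mathcal{G}(s)$ is the set of all consecutive character n-grams for the string $s$. We define the min-hash encoder as:

$$\mathbf{x}^{\text{min-hash}}(s) = \left[Z_1(\mathcal{G}(s)), \dots, Z_d(\mathcal{G}(s))\right] \in \mathbb{R}^d$$

Considering the hash functions as random processes, Equation 6 implies that this encoder has the following property:

$$\frac{1}{d}\, \mathbb{E}\left[\left\|\mathbf{x}^{\text{min-hash}}(s_i) - \mathbf{x}^{\text{min-hash}}(s_j)\right\|_{\ell_0}\right] = 1 - J(\mathcal{G}(s_i), \mathcal{G}(s_j))$$

where $\|\cdot\|_{\ell_0}$ counts the number of non-zero entries.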
Proposition 3.2 tells us that the min-hash encoder transforms the inclusion relations of strings into an order relation in the feature space. This is especially relevant for learning tree-based models, as Theorem 3.1 proves that by performing a reduced number of splits in the min-hash dimensions, the space can be divided between the elements that contain and do not contain a given substring.
As an example, Figure 3 shows this global property of the min-hash encoder for the case of the Employee Salaries dataset. The substrings Senior, Supply and Technician are all included in the category Senior Supply Technician, and as a consequence, the position for this category in the encoding space will always be in the intersection of the bottom-left regions generated by its substrings.
Finally, this encoder is especially suitable for very large scale settings, as it is very fast to compute and completely stateless. A stateless encoding is very useful for distributed computing: different workers can then process data simultaneously without communication. Its drawback is that, as it relies on hashing, the encoding cannot easily be inverted and interpreted in terms of the original string entries.
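A minimal sketch of such an encoder is given below. It is an illustration under stated assumptions rather than the implementation used in the experiments: SHA-256 from Python's `hashlib` stands in for MurmurHash3, the salt is simply prepended to each n-gram, and the defaults (`d=30`, 3-grams) as well as the function names are hypothetical.

```python
import hashlib


def ngrams(s, n=3):
    """Set of consecutive character n-grams (the whole string if shorter than n)."""
    if len(s) < n:
        return {s}
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def salted_hash(token, salt):
    """Hash a token to [0, 1); each salt value selects one hash function."""
    digest = hashlib.sha256(f"{salt}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") / 2**32


def minhash_encode(s, d=30, n=3):
    """d-dimensional min-hash encoding of the n-gram set of s."""
    grams = ngrams(s, n)
    return [min(salted_hash(g, j) for g in grams) for j in range(d)]


full = minhash_encode("Senior Supply Technician")
parts = [minhash_encode(w) for w in ("Senior", "Supply", "Technician")]

# Every n-gram of a substring is also an n-gram of the full string, so each
# coordinate of the full string is at most the matching substring coordinate.
dominated = all(f <= p for part in parts for f, p in zip(full, part))
```

The encoder keeps no state: encoding a string requires no fitted vocabulary, so the same function can run independently on separate workers.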
3.2 Gamma-Poisson factorization
To facilitate interpretation, we now introduce an encoding approach that estimates a decomposition of the string entries in terms of a linear combination of latent categories.
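We use a generative model of strings from latent categories. For this, we rely on the Gamma-Poisson model [canny2004gap], a matrix factorization technique well-suited to counting statistics. The idea was originally developed for finding low-dimensional representations, known as topics, of documents given their word count representation. As the string entries we consider are much shorter than text documents and can contain typos, we rely on their substring representation: we represent each observation by the count vector of its character-level n-grams. Each observation, a string entry described by its count vector $\mathbf{f} \in \mathbb{N}^m$, is modeled as a linear combination of $d$ unknown prototypes or topics, $\mathbf{\Lambda} \in \mathbb{R}^{d \times m}$:

$$\mathbf{f} \approx \mathbf{x}\, \mathbf{\Lambda}$$

Here, $\mathbf{x} \in \mathbb{R}^d$ are the activations that decompose the observation $\mathbf{f}$ in the prototypes $\mathbf{\Lambda}$ in the count space. As we will see later, these prototypes can be seen as latent categories.
Given a training dataset with $n$ samples, the model estimates the unknown prototypes $\mathbf{\Lambda}$ by factorizing the data's bag-of-n-grams representation $\mathbf{F} \in \mathbb{N}^{n \times m}$, where $m$ is the number of different n-grams in the data:

$$\mathbf{F} \approx \mathbf{X}\, \mathbf{\Lambda}, \quad \text{with } \mathbf{X} \in \mathbb{R}^{n \times d},\ \mathbf{\Lambda} \in \mathbb{R}^{d \times m}$$

As $\mathbf{f}$ is a vector of counts, it is natural to consider a Poisson distribution for each of its elements:

$$p\left(f_j \mid (\mathbf{x}\mathbf{\Lambda})_j\right) = \frac{1}{f_j!}\, (\mathbf{x}\mathbf{\Lambda})_j^{f_j}\, e^{-(\mathbf{x}\mathbf{\Lambda})_j}, \quad j = 1, \dots, m$$

For a prior on the elements of $\mathbf{x}$, we use a Gamma distribution:

$$p(x_i) = \frac{x_i^{\alpha_i - 1}\, e^{-x_i / \beta_i}}{\beta_i^{\alpha_i}\, \Gamma(\alpha_i)}, \quad i = 1, \dots, d$$

where $\boldsymbol{\alpha}, \boldsymbol{\beta} \in \mathbb{R}^d$ are the shape and scale parameters of the Gamma distribution for each one of the $d$ topics.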
3.2.2 Estimation strategy
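To fit the model to the input data, we maximize the log-likelihood of the model, denoted by:

$$\log \mathcal{L} = \sum_{j=1}^{m} \left[ f_j \log\left((\mathbf{x}\mathbf{\Lambda})_j\right) - (\mathbf{x}\mathbf{\Lambda})_j - \log(f_j!) \right] + \sum_{i=1}^{d} \left[ (\alpha_i - 1) \log(x_i) - \frac{x_i}{\beta_i} - \alpha_i \log \beta_i - \log \Gamma(\alpha_i) \right]$$

Maximizing the log-likelihood with respect to the parameters gives:

$$\frac{\partial}{\partial \Lambda_{ij}} \log \mathcal{L} = \frac{f_j}{(\mathbf{x}\mathbf{\Lambda})_j}\, x_i - x_i$$

$$\frac{\partial}{\partial x_i} \log \mathcal{L} = \sum_{j=1}^{m} \left[ \frac{f_j}{(\mathbf{x}\mathbf{\Lambda})_j}\, \Lambda_{ij} - \Lambda_{ij} \right] + \frac{\alpha_i - 1}{x_i} - \frac{1}{\beta_i}$$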
As explained in [canny2004gap], these expressions are analogous to solving a non-negative matrix factorization (NMF) with the generalized Kullback-Leibler divergence (in the sense of the NMF literature; see for instance [lee2001algorithms]) as loss.
In other words, the Gamma-Poisson model can be interpreted as a constrained non-negative matrix factorization in which the generalized Kullback-Leibler divergence is minimized between $\mathbf{F}$ and $\mathbf{X}\mathbf{\Lambda}$, subject to a Gamma prior in the distribution of the elements of $\mathbf{X}$. The Gamma prior induces sparsity in the activations of the model.
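To solve the NMF problem above, [lee2001algorithms] proposes the following recurrences:

$$\Lambda_{ij} \leftarrow \Lambda_{ij} \left( \sum_{\ell=1}^{n} \frac{f_{\ell j}}{(\mathbf{X}\mathbf{\Lambda})_{\ell j}}\, x_{\ell i} \right) \left( \sum_{\ell=1}^{n} x_{\ell i} \right)^{-1}$$

$$x_{\ell i} \leftarrow x_{\ell i} \left( \sum_{j=1}^{m} \frac{f_{\ell j}}{(\mathbf{X}\mathbf{\Lambda})_{\ell j}}\, \Lambda_{ij} + \frac{\alpha_i - 1}{x_{\ell i}} \right) \left( \sum_{j=1}^{m} \Lambda_{ij} + \beta_i^{-1} \right)^{-1}$$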
As $\mathbf{F}$ is a sparse matrix, the summations above only need to be computed on the non-zero elements of $\mathbf{F}$. This fact considerably decreases the computational cost of the algorithm.
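To make the estimation procedure concrete, the sketch below runs multiplicative updates of the Lee-Seung type on a toy count matrix, in plain Python. It is an assumption-laden illustration and not the solver used in the experiments: the exact placement of the Gamma-prior terms ($(\alpha - 1)/x$ in the numerator, $1/\beta$ in the denominator of the activation update), the values `alpha=1.1` and `beta=1.0`, the toy matrix `F` and all function names are choices made for this example. Sums are restricted to the non-zero counts.

```python
import math
import random


def reconstruct(X, L):
    """Return the product X @ L for list-of-lists matrices."""
    d, m = len(L), len(L[0])
    return [[sum(x[i] * L[i][j] for i in range(d)) for j in range(m)] for x in X]


def generalized_kl(F, R):
    """Generalized Kullback-Leibler divergence between counts F and model R."""
    total = 0.0
    for f_row, r_row in zip(F, R):
        for f, r in zip(f_row, r_row):
            total += r - f + (f * math.log(f / r) if f > 0 else 0.0)
    return total


def update(F, X, L, alpha=1.1, beta=1.0):
    """One pass of multiplicative updates: activations X, then topics L."""
    n, m, d = len(F), len(F[0]), len(L)
    R = reconstruct(X, L)
    topic_mass = [sum(row) for row in L]
    for l in range(n):
        nonzero = [j for j in range(m) if F[l][j] > 0]  # sparse counts only
        for i in range(d):
            ratio = sum(F[l][j] / R[l][j] * L[i][j] for j in nonzero)
            # Gamma prior: (alpha - 1) / x in the numerator, 1 / beta below.
            X[l][i] = (X[l][i] * ratio + alpha - 1.0) / (topic_mass[i] + 1.0 / beta)
    R = reconstruct(X, L)
    for i in range(d):
        activation_mass = sum(X[l][i] for l in range(n))
        for j in range(m):
            ratio = sum(F[l][j] / R[l][j] * X[l][i]
                        for l in range(n) if F[l][j] > 0)
            L[i][j] *= ratio / activation_mass
    return X, L


# Toy count matrix (hypothetical): two groups of entries with disjoint n-grams.
F = [[3, 3, 0, 0], [2, 2, 0, 0], [0, 0, 4, 4], [0, 0, 1, 1]]
d = 2
rng = random.Random(0)
X = [[rng.uniform(0.5, 1.5) for _ in range(d)] for _ in F]
L = [[rng.uniform(0.5, 1.5) for _ in range(len(F[0]))] for _ in range(d)]

initial = generalized_kl(F, reconstruct(X, L))
for _ in range(200):
    X, L = update(F, X, L)
final = generalized_kl(F, reconstruct(X, L))
```

Because the updates are multiplicative, non-negative initial values stay non-negative, which is what allows the rows of the topic matrix to be read as latent categories.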
Following [lefevre2011online], we present an online (or streaming) version of the Gamma-Poisson solver (algorithm 1). The basic idea of the algorithm is to exploit the fact that in the recursion for $\mathbf{\Lambda}$ (eq. 19 and 20), the summations are done with respect to the training samples. Instead of computing the numerator and denominator on the entire training set at each update, one can update these values only with mini-batches of data, which considerably decreases the memory usage and time of the computations.
For better computational performance, we adapt the implementation of this solver to the specificities of our problem—factorizing substring counts across entries of a categorical variable. In particular, we take advantage of the repeated entries by saving a dictionary of the activations for each category in the convergence of the previous mini-batches (algorithm 1, line 4) and use them as an initial guess for the same category in a future mini-batch. This is a warm restart and is especially important in the case of categorical variables because for most datasets, the number of unique categories is much lower than the number of samples.
The hyper-parameters of the algorithm and its initialization can affect convergence. One important parameter is the discount factor for the previous iterations of the topic matrix $\mathbf{\Lambda}$ (algorithm 1, lines 9-10). Figure 9 in the Appendix shows which value of this factor gives a good compromise between stability of the convergence and data fitting in terms of the generalized KL divergence. With respect to the initialization of the topic matrix $\mathbf{\Lambda}$, a good option is to choose the centroids of a k-means clustering (Figure 10) in a hashed version of the n-gram count matrix (in order to speed up the k-means algorithm) and then project back to the n-gram space with a nearest neighbors algorithm. In the case of a streaming setting, the same approach can be used on a subset of the data.
3.2.3 Inferring feature names
An encoding strategy where each dimension can be understood by humans facilitates the interpretation of the full statistical analysis. A straightforward strategy for interpretation of the Gamma-Poisson encoder is to describe each encoding dimension by the features of the string entries that it captures. For this, one alternative is to track the feature maps corresponding to each input category, and assign labels based on the input categories that activate the most in a given dimension. Another option is to apply the same strategy, but for substrings, such as words contained in the input categories. In the experiments, we follow the second approach, as many datasets are composed of entries with overlap, hence individual words carry more information for interpretability than the entire strings.
This method can be applied to any encoder, but it is expected to work well if the encodings are sparse and composed only of non-negative values with a meaningful magnitude. The Gamma-Poisson factorization model ensures these properties.
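The word-based labeling strategy can be sketched as follows. The encoder is passed in as a function, so the sketch applies to any encoder; the toy activations below are made-up numbers standing in for a fitted Gamma-Poisson model, and the function name and the `n_labels` parameter are hypothetical.

```python
def infer_feature_names(words, encode, n_labels=2):
    """Label each encoding dimension with the words that activate it the most."""
    encodings = {w: encode(w) for w in words}
    d = len(next(iter(encodings.values())))
    names = []
    for k in range(d):
        top = sorted(words, key=lambda w: encodings[w][k], reverse=True)
        names.append(", ".join(top[:n_labels]))
    return names


# Made-up non-negative activations for four words and two dimensions.
toy_activations = {
    "police": [0.9, 0.0],
    "officer": [0.7, 0.1],
    "technician": [0.0, 0.8],
    "supply": [0.1, 0.6],
}

feature_names = infer_feature_names(list(toy_activations), toy_activations.get)
```

Ranking by raw activation is only meaningful when activations are non-negative and comparable across words, which is why sparse non-negative encodings suit this procedure.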
|Dataset||#samples||#categories||#categories per 1000 samples||Gini coefficient||Mean category length (#chars)||Source of high cardinality|
|Traffic Violations||1.2M||11.3k||243.5||0.97||62.1||Typos; Description|
|Federal Election||3.3M||145.3k||361.7||0.76||13.0||Typos; Multi-label|
|Met Objects||469k||26.8k||386.1||0.88||12.2||Typos; Multi-label|
|Public Procurement||352k||28.9k||804.6||0.82||46.8||Multi-label; Multi-language|
|Journal Influence||3.6k||3.2k||956.9||0.10||30.0||Multi-label; Multi-language|
|Building Permits||554k||430.6k||940.0||0.48||94.0||Typos; Description|
4 Experimental study of encodings
We now study experimentally different encoding methods in terms of interpretability and supervised-learning performance. For this purpose, we use three different types of data: simulated categorical data, and real data with curated and non-curated categorical entries.
We benchmark the following strategies: one-hot, tf-idf, fastText [mikolov2018advances], similarity encoding [cerda2018similarity], the Gamma-Poisson factorization (default parameter values are listed in Table VIII), and min-hash encoding. For all the strategies based on an n-gram representation, we use the set of 2-4 character grams (in addition to the word as tokens, pretrained versions of fastText also use the set of 3-6 character n-grams). For a fair comparison across encoding strategies, we used the same dimensionality in all approaches. To set the dimensionality of one-hot encoding, tf-idf and fastText, we used a truncated SVD (implemented efficiently following [halko2011finding]). Note that dimensionality reduction improves one-hot encoding with tree-based learners for data with rare categories [cerda2018similarity]. For similarity encoding, we selected prototypes with a k-means strategy, following [cerda2018similarity], as it gives slightly better prediction results than the most frequent categories (an implementation of these strategies can be found at https://dirty-cat.github.io; we do not test the random projections strategy for similarity encoding as it is not scalable).
4.1 Real-life datasets with string categories
4.1.1 Datasets with high-cardinality categories
In order to evaluate the different encoding strategies, we collected 17 real-world datasets containing a prediction task and at least one relevant high-cardinality categorical variable as feature (if a dataset has more than one categorical variable, only one selected variable was encoded with the proposed approaches, while the rest of them were one-hot encoded). Table III shows a quick description of the datasets and the corresponding categorical variables (see Appendix A.1.1 for a description of datasets and the related learning tasks). Table III also details the source of high cardinality for the datasets: multi-label, typos, description and multi-language. We call multi-label the situation when a single column contains multiple pieces of information shared by several entries, e.g., supply technician, where supply denotes the type of activity, and technician denotes the rank of the employee (as opposed, e.g., to manager). Typos refers to entries having small morphological variations, as midwest and mid-west. Description refers to categorical entries that are composed of a short free-text description. These are close to a typical NLP problem, although constrained to a very particular subject, so they tend to contain very recurrent informative words and near-duplicate entries. Finally, multi-language refers to datasets in which the categorical variable contains more than one language across the different entries.
4.1.2 Datasets with curated strings
We also evaluate the behavior of encoders when the categorical variables have already been curated: usually, entries are standardized to create low-cardinality categorical variables. For this, we collected seven such datasets (see Appendix A.1.2). Experiments on these datasets are intended to show the robustness of the n-gram based approaches to situations where there is no need to reduce the dimensionality of the problem, or when capturing the subword information is not necessarily an issue.
4.2 Recovering latent categories
4.2.1 Recovery on simulated data
Table III shows that the most common scenario for high cardinality string variables is the multi-label categories. The second most common problem is the presence of typos (or any source of morphological variation of the same idea). To analyze these two cases in a more controlled setting, we created two simulated categorical variables. Table IV shows examples of the categories we generated, taking as a base 8 ground truth categories of animals: chicken, eagle, giraffe, horse, leopard, lion, tiger and turtle.
The multi-label data was created by concatenating $k + 2$ ground truth categories, with $k$ following a Poisson distribution; hence, all entries contain at least two labels. For the generation of data with typos, we added 10% of typos to the original ground truth categories by randomly replacing one character by another one (x, y, or z).
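A possible generator for both simulated variables is sketched below. Several details are assumptions, since the text does not fix them: the Poisson rate (`rate=2.0`), drawing labels with replacement, and reading "10% of typos" as a 10% chance that an entry gets one replaced character. The function names are hypothetical.

```python
import math
import random

GROUND_TRUTH = ["chicken", "eagle", "giraffe", "horse",
                "leopard", "lion", "tiger", "turtle"]


def sample_poisson(rng, rate):
    """Draw a Poisson variate with Knuth's multiplication method."""
    threshold = math.exp(-rate)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def multi_label_entry(rng, rate=2.0):
    """Concatenate k + 2 ground-truth labels, with k Poisson distributed."""
    k = sample_poisson(rng, rate)
    return " ".join(rng.choice(GROUND_TRUTH) for _ in range(k + 2))


def typo_entry(rng, typo_rate=0.1):
    """Pick a label and, with probability typo_rate, replace one character."""
    word = rng.choice(GROUND_TRUTH)
    if rng.random() < typo_rate:
        pos = rng.randrange(len(word))
        word = word[:pos] + rng.choice("xyz") + word[pos + 1:]
    return word


rng = random.Random(0)
multi_label = [multi_label_entry(rng) for _ in range(1000)]
typos = [typo_entry(rng) for _ in range(1000)]
n_with_typo = sum(t not in GROUND_TRUTH for t in typos)
```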
To measure the ability of an encoder to recover a feature matrix close to a one-hot encoding matrix of ground-truth categories in these simulated settings, we use the Normalized Mutual Information (NMI) as metric. Given two random variables $X_1$ and $X_2$, the NMI is defined as:

$$\text{NMI} = 2\, \frac{I(X_1; X_2)}{H(X_1) + H(X_2)}$$

where $I(\cdot\,;\cdot)$ is the mutual information and $H(\cdot)$ the entropy. To apply this metric to the feature matrix $\mathbf{X}$ generated by the encoding of all ground truth categories, we consider that $\mathbf{X}$, after rescaling (an $\ell_1$ normalization of the rows), can be seen as a two dimensional probability distribution. For encoders that produce feature matrices with negative values, we take the element-wise absolute value of $\mathbf{X}$. The NMI is a classic measure of correspondences between clustering results [vinh2010information]. Beyond its information-theoretical interpretation, an appealing property is that it is invariant to order permutations. The NMI of any permutation of the identity matrix is equal to 1 and the NMI of any constant matrix is equal to 0. Thus, the NMI in this case is interpreted as a recovering metric of a one-hot encoded matrix of latent, ground truth, categories.
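The metric can be computed directly from a feature matrix, as in the sketch below. It assumes that the NMI is normalized by the arithmetic mean of the two entropies, that rows are rescaled to sum to one, and that the rescaled matrix is divided by its total to obtain a joint distribution over (row, column) pairs; the function name is hypothetical.

```python
import math


def nmi(matrix):
    """NMI of a feature matrix viewed as a joint distribution over (row, column)."""
    rows = [[abs(v) for v in row] for row in matrix]
    # Row-wise l1 normalization (rows that are entirely zero are left as is).
    rows = [[v / sum(r) for v in r] if sum(r) > 0 else r for r in rows]
    total = sum(sum(r) for r in rows)
    joint = [[v / total for v in r] for r in rows]
    p_row = [sum(r) for r in joint]
    p_col = [sum(c) for c in zip(*joint)]

    mutual_info = sum(
        p * math.log(p / (p_row[i] * p_col[j]))
        for i, r in enumerate(joint)
        for j, p in enumerate(r)
        if p > 0
    )

    def entropy(dist):
        return -sum(p * math.log(p) for p in dist if p > 0)

    denom = entropy(p_row) + entropy(p_col)
    return 2 * mutual_info / denom if denom > 0 else 0.0


permuted_identity = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
constant = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]

nmi_permuted = nmi(permuted_identity)
nmi_constant = nmi(constant)
```

The two test matrices reproduce the limiting cases stated in the text: a permuted identity matrix scores 1 and a constant matrix scores 0, up to floating-point error.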
Table V shows the NMI values for both simulated datasets. The Gamma-Poisson factorization obtains the highest values in both multi-label and typos settings and for different dimensionalities of the encoders. The best recovery is obtained when the dimensionality of the encoder is equal to the number of ground-truth categories, i.e., $d = 8$.
|Type||Example categories|
|Ground truth||chicken; eagle; giraffe; horse; leopard; lion; tiger; turtle.|
|Multi-label||lion chicken; horse eagle lion; tiger leopard giraffe turtle.|
|Typos (10%)||chxcken; eazle; gixaffe; gizaffe; hoyse; lexpard; lezpard; lixn; tiyer; tuxtle.|
|Tf-idf + SVD||0.16||0.18||0.17||0.17||0.17||0.17|
|FastText + SVD||0.08||0.09||0.09||0.08||0.08||0.09|
4.2.2 Results for real curated data
|Dataset (cardinality)||Gamma-Poisson||Similarity Encoding||Tf-idf + SVD||FastText + SVD|
|Cacao Flavors (100)||0.51||0.30||0.28||0.07|
|California Housing (5)||0.46||0.51||0.56||0.20|
|Dating Profiles (19)||0.52||0.24||0.25||0.12|
|House Prices (15)||0.83||0.25||0.32||0.11|
|House Sales (70)||0.42||0.04||0.18||0.06|
|Intrusion Detection (66)||0.34||0.58||0.46||0.11|
For curated data, the cardinality is usually low. We nevertheless perform the encoding using a default choice of dimensionality, to gauge how well turn-key generic encodings represent these curated strings. Table VI shows the NMI values for the different curated datasets, measuring how much the generated encoding resembles a one-hot encoding on the curated categories. Despite the fact that it is used with a dimensionality larger than the cardinality of the curated category, Gamma-Poisson factorization has the highest recovery performance in 5 out of 7 datasets. (Table XI in the Appendix shows the same analysis with the dimensionality set to the actual cardinality of the categorical variable. In this setting, the Gamma-Poisson factorization gives much higher recovery results.)
These experiments show that Gamma-Poisson factorization recovers latent categories well. To validate this intuition, Figure 4 shows such encodings in the case of the simulated data as well as the real-world non-curated Employee Salaries dataset. It confirms that the encodings can be interpreted as loadings on discovered categories that match the inferred feature names.
4.3 Encoding for supervised learning
To study how the encoders perform for statistical analysis, we now turn to measuring prediction accuracy in supervised-learning tasks.
4.3.1 Experiment settings
We use gradient-boosted trees, as implemented in XGBoost [chen2016xgboost]. We also investigated other supervised-learning approaches: linear models, neural networks, and kernel machines with RBF and polynomial kernels. However, even with significant hyper-parameter tuning, they under-performed XGBoost on our tabular datasets. The good performance of gradient-boosted trees is consistent with previous reports of systematic benchmarks [olson2017data].
Depending on the dataset, the learning task can be either regression, binary or multiclass classification. We use different scores to evaluate the performance of the corresponding supervised-learning problem: the $R^2$ score for regression; average precision for binary classification; and accuracy for multiclass classification. As datasets get different prediction scores, we visualize encoders' performance with prediction results scaled in a relative score. It is a dataset-specific scaling of the original score, in order to bring performance across datasets in the same range. In other words, for a given dataset $i$:

$$\text{relative score}_j^i = 100\, \frac{\text{score}_j^i - \min_{j} \text{score}_j^i}{\max_{j} \text{score}_j^i - \min_{j} \text{score}_j^i} \qquad (22)$$

where $\text{score}_j^i$ is the prediction score for the dataset $i$ with the configuration $j \in \mathcal{J}$, the set of all trained models (in terms of dimensionality, type of encoder and cross-validation split). The relative score is figure-specific and is only intended to be used as a visual comparison of classifiers' performance across multiple datasets. A higher relative score means better results.
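A dataset-specific scaling of this kind can be sketched in a few lines. The example assumes a min-max scaling to the range 0 to 100 over all configurations of one dataset; the function name and the score values are hypothetical and serve only as an illustration.

```python
def relative_scores(scores):
    """Min-max scale the scores of one dataset to the range [0, 100].

    `scores` maps a configuration (encoder, dimensionality, split, ...)
    to its prediction score on that dataset.
    """
    lowest, highest = min(scores.values()), max(scores.values())
    if highest == lowest:
        # Degenerate case: all configurations perform identically.
        return {config: 0.0 for config in scores}
    return {
        config: 100.0 * (score - lowest) / (highest - lowest)
        for config, score in scores.items()
    }


# Hypothetical scores for three configurations on a single dataset.
example = {
    ("one-hot", 30): 0.52,
    ("gamma-poisson", 30): 0.60,
    ("min-hash", 30): 0.61,
}
scaled = relative_scores(example)
```

The worst configuration of a dataset is mapped to 0 and the best to 100, which is why such a score is only meaningful for visual comparison within one figure.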
For a proper statistical comparison of encoders, we use a ranking test across multiple datasets [demvsar2006statistical]. Note that in this framework each dataset represents a single sample, and not the cross-validation splits, which are not mutually independent. To do so, for a particular dataset, encoders were ranked according to the median score value over cross-validation splits. At the end, a Friedman test [friedman1937use] is used to determine if all encoders, for a fixed dimensionality, come from the same distribution. If the null hypothesis is rejected, we use a Nemenyi post-hoc test [nemenyi1962distribution] to verify whether the difference in performance across pairs of encoders is significant.
To do pairwise comparison between two encoders, we use a pairwise Wilcoxon signed-rank test. The corresponding p-value tests the null hypothesis that the two encoders perform equally well across different datasets.
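The ranking step can be illustrated with a short sketch. It computes, for each dataset, the median score of every encoder over cross-validation splits, ranks encoders within the dataset, and derives the average ranks and the Friedman chi-square statistic. The helper names and the toy scores are hypothetical; in practice, library routines such as `scipy.stats.friedmanchisquare` and `scipy.stats.wilcoxon` also return the p-values.

```python
from statistics import median


def rank_descending(values):
    """Ranks with 1 = highest value; ties receive the average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        average_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = average_rank
        start = end + 1
    return ranks


def friedman_statistic(split_scores):
    """split_scores[dataset][encoder] is a list of scores over CV splits."""
    datasets = list(split_scores)
    encoders = list(split_scores[datasets[0]])
    n, k = len(datasets), len(encoders)
    rank_sums = [0.0] * k
    for dataset in datasets:
        # One sample per dataset: the median over cross-validation splits.
        medians = [median(split_scores[dataset][e]) for e in encoders]
        for j, rank in enumerate(rank_descending(medians)):
            rank_sums[j] += rank
    average_ranks = [s / n for s in rank_sums]
    chi2 = 12 * n / (k * (k + 1)) * (
        sum(r * r for r in average_ranks) - k * (k + 1) ** 2 / 4
    )
    return dict(zip(encoders, average_ranks)), chi2


# Toy scores (hypothetical): encoder A always ranks first, C always last.
toy = {
    "d1": {"A": [0.90, 0.80, 0.85], "B": [0.70, 0.75, 0.72], "C": [0.50, 0.55, 0.60]},
    "d2": {"A": [0.60, 0.62, 0.61], "B": [0.50, 0.52, 0.51], "C": [0.40, 0.42, 0.41]},
    "d3": {"A": [0.30, 0.35, 0.33], "B": [0.20, 0.25, 0.22], "C": [0.10, 0.15, 0.12]},
}
average_ranks, chi2 = friedman_statistic(toy)
```

Under the null hypothesis, the statistic approximately follows a chi-square distribution with one degree of freedom less than the number of encoders.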
|Encoder||SVD v/s Random projection (p-value)|
4.3.2 Prediction with non-curated data
We now describe the results of several prediction benchmarks with the 17 non-curated datasets.
First, note that one-hot, tf-idf and fastText are naturally high-dimensional encoders, so a dimensionality reduction technique needs to be applied in order to compare the different methodologies. Without this reduction, the benchmark would also be infeasible given the long computational times of gradient boosting. Moreover, dimensionality reduction helps to improve prediction (see [cerda2018similarity]) with tree-based methods. To approximate Euclidean distances, SVD is optimal. However, it is computationally costly. Using Gaussian random projections [rahimi2008random] is appealing, as it can lead to stateless encoders that require no fit. Table VII compares the prediction performance of both strategies. For tf-idf and fastText, the SVD is significantly superior to random projections. In contrast, there is no statistical difference for one-hot, even though the performance is slightly superior for the SVD (p-value equal to 0.492). Given these results, we use SVD for all further benchmarks.
Figure 5 compares encoders in terms of the relative score of Equation 22. All n-gram based encoders clearly improve upon one-hot encoding, at both dimensionalities (30 and 100). Min-hash gives a slightly better prediction performance across datasets, despite being the only method that does not require a data fit step. Results of the Nemenyi ranking test confirm the impression of the figure: n-gram-based methods are superior to one-hot encoding; and the min-hash encoder has the best average ranking value for both dimensionalities, although the difference in prediction with respect to the other n-gram based methods is not statistically significant.
While we seek generic encoding approaches, using precomputed fastText embeddings requires the choice of a language. As 15 out of 17 datasets are fully in English, the benchmarks above use English embeddings for fastText. Figure 6 studies the importance of this choice, comparing the prediction results for fastText in different languages (English, French and Hungarian). Not choosing English leads to a sizeable drop in prediction accuracy, which gets bigger for more distant languages (such as Hungarian). This shows that the natural language semantics of fastText indeed are important to explain its good prediction performance. A good encoding not only needs to represent the data in a low dimension, but also needs to capture the similarities between the different entries.
4.3.3 Prediction with curated data
We now test the robustness of the different encoding methods to situations where there is no need to capture subword information, e.g., low-cardinality categorical variables, or variables such as "Country name", where the overlap of character n-grams does not have a relevant meaning. We benchmark in Figure 7 all encoders on 7 curated datasets. To simulate black-box usage, the dimensionality was fixed to the same value for all of them, with the exception of one-hot. None of the n-gram based encoders performs worse than one-hot. Indeed, the F statistic for the average ranking does not reject the null hypothesis of all encoders coming from the same distribution (p-value equal to 0.37).
4.3.4 Interpretable data science with the Gamma-Poisson factorization
As shown in Figure 4, the Gamma-Poisson factorization creates sparse, non-negative feature vectors that are easily interpretable as a linear combination of latent categories. We give informative feature names to each of these latent categories (see Section 3.2.3). To illustrate how such encoding can be used in a data-science setting where humans need to understand results, Figure 8 shows the permutation importances [altmann2010permutation] of each encoding direction of the Gamma-Poisson factorization and its corresponding feature names. By far, the most important inferred feature name to predict salaries in the Employee Salaries dataset is the latent category Manager, Management, Property, which matches general intuitions on salaries.
5 Discussion and conclusion
One-hot encoding is not well suited to columns of a table containing categories represented with many different strings [cerda2018similarity]. Character n-gram count vectors can represent strings well, but they dilute the notion of categories with extremely high-dimensional vectors. A good encoding should capture string similarity between entries and reflect it in a lower dimensional encoding.
We study several encoding approaches to capture the structural similarities of string entries. The min-hash encoder gives a stateless injection of strings to a vector space, transforming inclusions between strings into simple inequalities (Theorem 3.1). A Gamma-Poisson factorization on the count matrix of sub-strings gives a low-rank approximation of similarities.
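The inclusion property of the min-hash encoder can be illustrated with a small sketch. It is not the implementation used in the experiments: the salted MD5 hash, the use of character 3-grams without padding, and the names `char_ngrams` and `minhash_encode` are assumptions made for the example.

```python
import hashlib


def char_ngrams(string, n=3):
    """Set of character n-grams of a string (no padding, an assumption)."""
    return {string[i:i + n] for i in range(len(string) - n + 1)}


def salted_hash(ngram, seed):
    """Deterministic hash of an n-gram, mapped to the interval [0, 1)."""
    digest = hashlib.md5(f"{seed}:{ngram}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2 ** 64


def minhash_encode(string, dim=30, n=3):
    """Stateless encoding: each dimension is the minimum of one hash
    function over the n-grams of the string. No fit step is needed."""
    grams = char_ngrams(string, n)
    if not grams:
        # Strings shorter than n have no n-gram; return the upper bound.
        return [1.0] * dim
    return [min(salted_hash(g, seed) for g in grams) for seed in range(dim)]


short = minhash_encode("lion")
longer = minhash_encode("lion chicken")
# Every 3-gram of "lion" also appears in "lion chicken", so the minimum
# over the larger set can only be smaller or equal, in every dimension.
included = all(b <= a for a, b in zip(short, longer))
```

Because the encoding depends only on the input string and on fixed hash functions, two workers encoding the same entry obtain the same vector without sharing any state.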
Both Gamma-Poisson factorization and the min-hash encoder can be used on very large datasets, as they can be used in streaming settings. They markedly improve upon one-hot encoder for large scale learning as i) they do not need the definition of a vocabulary, ii) they give low dimensional representations, and thus decrease the cost of the subsequent analysis step. Indeed, for both of these encoding approaches, the cost of encoding is usually significantly smaller than that of running a powerful supervised learning method such as XGBoost, even on the reduced dimensionality (see Table X in the Appendix). The min-hash encoder is unique in terms of scalability, as it gives low-dimensional representations while being completely stateless, which greatly facilitates distributed computing. The representations enable much better statistical analysis than a simpler stateless low-dimensional encoding built with random projections of n-gram string representations. Notably, the most scalable encoder is also the best performing for supervised learning, at the cost of some loss in interpretability.
Recovery of latent categories
Describing results in terms of a small number of categories can greatly help interpreting a statistical analysis. Our experiments on real and simulated data show that encodings created by the Gamma-Poisson factorization correspond to loadings on meaningful recovered categories. It removes the need to manually curate entries to understand what drives an analysis. For this, positivity of the loadings and the soft sparsity imposed by the Gamma prior are crucial; a simple SVD fails to give interpretable loadings (Appendix Figure 13).
AutoML (automatic machine learning) strives to develop machine-learning pipelines that can be applied to datasets without human intervention [hutter2015automatic, hutter2019automated]. To date, it has focused on tuning and model selection for supervised learning on numerical data. Our work addresses the feature-engineering step. In our experiments, we apply the exact same prediction pipeline to 17 non-curated and 7 curated tabular datasets, without any custom feature engineering. Both Gamma-Poisson factorization and min-hash encoder led to the best prediction accuracy, using a classic gradient-boosted tree implementation (XGBoost). We did not tune hyper-parameters of the encoding, such as dimensionality or parameters of the priors for the Gamma-Poisson factorization. These string categorical encodings therefore open the door to autoML on the original data, removing the need for feature engineering, which can lead to difficult model selection. A possible rule when integrating tabular data into an autoML pipeline could be to apply the min-hash or Gamma-Poisson encoder for string categorical columns with a cardinality above 30, and use one-hot encoding for low-cardinality columns. Indeed, results show that these encoders are also suitable for normalized entries.
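The cardinality-based rule suggested above could be expressed as a simple selection function. This is only a sketch of the heuristic: the function name and the returned labels are hypothetical, and the threshold of 30 is the one mentioned in the text.

```python
def choose_encoder(column, threshold=30):
    """Pick an encoder family from the cardinality of a string column."""
    cardinality = len(set(column))
    if cardinality > threshold:
        return "min-hash or Gamma-Poisson"
    return "one-hot"


low_cardinality = ["chicken", "eagle", "giraffe", "horse"] * 10
high_cardinality = [f"position title {i}" for i in range(500)]
choices = (choose_encoder(low_cardinality), choose_encoder(high_cardinality))
```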
One-hot encoding is the de facto standard for statistical analysis on categorical entries. Beyond its simplicity, its strength is to represent the discrete nature of categories. However, it becomes impractical when there are too many different unique entries, for instance because the string representations have not been curated and display typos or combinations of multiple pieces of information in the same entries. For high-cardinality string categories, we have presented two scalable approaches to create low-dimensional encodings that retain the qualitative properties of categorical entries. The min-hash encoder is extremely scalable and gives the best prediction performance because it transforms string inclusions to vector-space operations that can easily be captured by a supervised learning step. If interpretability of results is an issue, the Gamma-Poisson factorization performs almost as well for supervised learning, but enables expressing results in terms of meaningful latent categories. As such, it gives a readily usable replacement for one-hot encoding for high-cardinality string categorical variables. Progress brought by these encoders is important, as they avoid one of the time-consuming steps of a data-science study: normalizing entries of databases via human-crafted rules.
Authors were supported by the DirtyData (ANR-17-CE23-0018-01) project.
Appendix A Reproducibility
A.1 Dataset Description
A.1.1 Non-curated datasets
Building Permits (https://www.kaggle.com/chicago/chicago-building-permits; sample size: 554k). Permits issued by the Chicago Department of Buildings since 2006. Target (regression): Estimated Cost. Categorical variable: Work Description (cardinality: 430k).
Colleges (https://beachpartyserver.azurewebsites.net/VueBigData/DataFiles/Colleges.txt; 7.8k). Information about U.S. colleges and schools. Target (regression): Percent Pell Grant. Categorical variable: School Name (6.9k).
Crime Data (https://data.lacity.org/A-Safe-City/Crime-Data-from-2010-to-Present/y8tr-7khq; 1.5M). Incidents of crime in the City of Los Angeles since 2010. Target (regression): Victim Age. Categorical variable: Crime Code Description (135).
Drug Directory (https://www.fda.gov/Drugs/InformationOnDrugs/ucm142438.htm; 120k). Product listing data submitted to the U.S. FDA for all unfinished, unapproved drugs. Target (multiclass): Product Type Name. Categorical variable: Non Proprietary Name (17k).
Employee Salaries (https://catalog.data.gov/dataset/employee-salaries-2016; 9.2k). Salary information for employees of Montgomery County, MD. Target (regression): Current Annual Salary. Categorical variable: Employee Position Title (385).
Federal Election (https://classic.fec.gov/finance/disclosure/ftpdet.shtml; 3.3M). Campaign finance data for the 2011-2012 US election cycle. Target (regression): Transaction Amount. Categorical variable: Memo Text (17k).
Journal Influence (https://github.com/FlourishOA/Data; 3.6k). Scientific journals and their respective influence scores. Target (regression): Average Cites per Paper. Categorical variable: Journal Name (3.1k).
Kickstarter Projects (https://www.kaggle.com/kemical/kickstarter-projects; 281k). More than 300,000 projects from https://www.kickstarter.com. Target (binary): State. Categorical variable: Category (158).
Medical Charges (https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/Inpatient.html; 163k). Inpatient discharges for Medicare beneficiaries for more than 3,000 U.S. hospitals. Target (regression): Average Total Payments. Categorical variable: Medical Procedure (100).
Met Objects (https://github.com/metmuseum/openaccess; 469k). Information on artwork objects of the Metropolitan Museum of Art's collection. Target (binary): Department. Categorical variable: Object Name (26k).
Midwest Survey (https://github.com/fivethirtyeight/data/tree/master/region-survey; 2.8k). Survey to know if people self-identify as Midwesterners. Target (multiclass): Census Region (10 classes). Categorical variable: What would you call the part of the country you live in now? (844).
Open Payments (https://openpaymentsdata.cms.gov; 2M). Payments given by healthcare manufacturing companies to medical doctors or hospitals (year 2013). Target (binary): Status (if the payment was made under a research protocol). Categorical variable: Company name (1.4k).
Public Procurement (https://data.europa.eu/euodp/en/data/dataset/ted-csv; 352k). Public procurement data for the European Economic Area, Switzerland, and Macedonia. Target (regression): Award Value Euro. Categorical variable: CAE Name (29k).
Road Safety (https://data.gov.uk/dataset/road-accidents-safety-data; 139k). Circumstances of personal injury of road accidents in Great Britain from 1979. Target (binary): Sex of Driver. Categorical variable: Car Model (16k).
Traffic Violations (https://catalog.data.gov/dataset/traffic-violations-56dda; 1.2M). Traffic information from electronic violations issued in Montgomery County, MD. Target (multiclass): Violation type (4 classes). Categorical variable: Description (11k).
Vancouver Employee (https://data.vancouver.ca/datacatalogue/employeeRemunerationExpensesOver75k.htm; 2.6k). Remuneration and expenses for employees earning over $75,000 per year. Target (regression): Remuneration. Categorical variable: Title (640).
Wine Reviews (https://www.kaggle.com/zynicide/wine-reviews/home; 138k). Wine reviews scraped from WineEnthusiast. Target (regression): Points. Categorical variable: Description (89k).
A.1.2 Curated datasets
Adult (https://archive.ics.uci.edu/ml/datasets/adult; sample size: 32k). Predict whether income exceeds $50K/yr based on census data. Target (binary): Income. Categorical variable: Occupation (cardinality: 15).
Cacao Flavors (https://www.kaggle.com/rtatman/chocolate-bar-ratings; 1.7k). Expert ratings of over 1,700 individual chocolate bars, along with information on their origin and bean variety. Target (multiclass): Bean Type. Categorical variable: Broad Bean Origin (97).
California Housing (https://github.com/ageron/handson-ml/tree/master/datasets/housing; 20k). Based on the 1990 California census data. It contains one row per census block group (a block group typically has a population of 600 to 3,000 people). Target (regression): Median House Value. Categorical variable: Ocean Proximity (5).
Dating Profiles (https://github.com/rudeboybert/JSE_OkCupid; 60k). Anonymized data of dating profiles from OkCupid. Target (regression): Age. Categorical variable: Diet (19).
House Prices (https://www.kaggle.com/c/house-prices-advanced-regression-techniques; 1.1k). Contains variables describing residential homes in Ames, Iowa. Target (regression): Sale Price. Categorical variable: MSSubClass (15).
House Sales (https://www.kaggle.com/harlfoxem/housesalesprediction; 21k). Sale prices for houses in King County, which includes Seattle. Target (regression): Price. Categorical variable: ZIP code (70).
Intrusion Detection (https://archive.ics.uci.edu/ml/datasets/KDD+Cup+1999+Data; 493k). Network intrusion simulations with a variety of descriptors of the attack type. Target (multiclass): Attack Type. Categorical variable: Service (66).
A.2 Learning pipeline
Dataset sizes range from a couple of thousand to several million samples. To reduce computation time on the learning step, the number of samples was limited to 100k for large datasets.
We removed rows with missing values in the target or in any explanatory variable other than the selected categorical variable, for which we replaced missing entries by the string ‘nan’. The only additional preprocessing for the categorical variable was to transform all entries to lower case.
For every dataset, we made 20 random splits of the data, with one third of samples for testing at each time. In the case of binary classification, we performed stratified randomization.
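The preprocessing and splitting steps described above can be sketched as follows. This is an illustrative sketch with hypothetical helper names; it does not implement the stratification used for binary classification, for which scikit-learn utilities such as `StratifiedShuffleSplit` are a natural choice.

```python
import random


def preprocess(entries):
    """Lower-case entries; missing values (None) become the string 'nan'."""
    return ["nan" if e is None else str(e).lower() for e in entries]


def subsample(n_samples, max_samples=100_000, seed=0):
    """Indices of at most `max_samples` rows, drawn without replacement."""
    indices = list(range(n_samples))
    if n_samples <= max_samples:
        return indices
    return random.Random(seed).sample(indices, max_samples)


def random_splits(n_samples, n_splits=20, test_fraction=1 / 3, seed=0):
    """List of (train_indices, test_indices) pairs from random shuffles."""
    rng = random.Random(seed)
    n_test = round(n_samples * test_fraction)
    splits = []
    for _ in range(n_splits):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        splits.append((indices[n_test:], indices[:n_test]))
    return splits


cleaned = preprocess(["Police Officer III", None, "MANAGER II"])
splits = random_splits(9)
```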
Depending on the type of prediction task, we used different scores to evaluate the performance of the supervised learning problem: for regression, we used the $R^2$ score; for binary classification, the average precision; and for multi-class classification, the accuracy score.
Parametrization of classifiers
We used scikit-learn [pedregosa2011scikit] for most of the data processing. For all the experiments, we used the scikit-learn compatible implementation of XGBoost [chen2016xgboost], with a grid search over the learning_rate (0.05, 0.1, 0.3) and max_depth (3, 6, 9) parameters. All datasets and encoders use the same parametrization.
We used the scikit-learn implementations of the SVD and of GaussianRandomProjection, with the default parametrization in both cases.
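For reference, the hyper-parameter grid described above corresponds to nine candidate configurations, which can be enumerated as in the sketch below. The dictionary layout follows the convention expected by grid-search utilities such as scikit-learn's `GridSearchCV`; the variable names are chosen for illustration.

```python
from itertools import product

# Grid over the two XGBoost parameters mentioned in the text.
param_grid = {
    "learning_rate": [0.05, 0.1, 0.3],
    "max_depth": [3, 6, 9],
}

# Cartesian product of the parameter values: one dict per candidate.
candidates = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]
```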
A.3 Online Resources
Appendix B Algorithmic considerations
B.1 Gamma-Poisson factorization
Algorithm 1 requires some input parameters and initializations that can affect convergence. One important parameter is the discount factor for the fitting in the past. Figure 9 shows which value of this factor gives the best compromise between stability of the convergence and data fitting in terms of the generalized KL divergence. The default values used in the experiments are listed in Table VIII.
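As a reminder of the fitting criterion, the generalized KL divergence between non-negative counts and their reconstruction can be computed as in the sketch below. It assumes the standard definition, the sum over entries of x log(x/y) - x + y with the convention 0 log 0 = 0; the function name and the small constant `eps`, used to avoid a division by zero, are choices made for this example.

```python
import math


def generalized_kl(x, y, eps=1e-10):
    """Generalized KL divergence between two non-negative sequences.

    `x` holds observed counts (flattened) and `y` their reconstruction.
    """
    total = 0.0
    for xi, yi in zip(x, y):
        yi = max(yi, eps)  # guard against zero reconstructions
        if xi > 0:
            total += xi * math.log(xi / yi)
        total += yi - xi
    return total


perfect_fit = generalized_kl([1, 2, 3], [1, 2, 3])
poor_fit = generalized_kl([1, 0], [2, 1])
```

The divergence is zero for a perfect reconstruction and positive otherwise.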
With respect to the initialization of the topic matrix, the best option is to choose the centroids of a k-means clustering (Figure 10) in a hashed version of the n-gram count matrix in a reduced dimensionality (in order to speed up convergence of the k-means algorithm) and then project back to the n-gram space with a nearest neighbors algorithm.
is used, as it gives a good trade-off between convergence and stability of the solution across the number of epochs.
Appendix C Additional figures
|Datasets||One-hot + SVD||Similarity encoder||TfIdf + SVD||FastText + SVD||Gamma Poisson||Min-hash encoder|
|Datasets||Encoding time||Training time||Encoding time /|
|Dataset (cardinality)||Gamma-Poisson||Similarity Encoding||Tf-idf + SVD||FastText + SVD|
|Cacao Flavors (100)||0.48||0.34||0.34||0.1|
|California Housing (5)||0.83||0.51||0.56||0.20|
|Dating Profiles (19)||0.47||0.26||0.29||0.12|
|House Prices (15)||0.91||0.25||0.32||0.11|
|House Sales (70)||0.29||0.03||0.26||0.07|
|Intrusion Detection (66)||0.27||0.65||0.61||0.13|