Interpretable Set Functions

05/31/2018
by   Andrew Cotter, et al.
Google

We propose learning flexible but interpretable functions that aggregate a variable-length set of permutation-invariant feature vectors to predict a label. We use a deep lattice network model so that we can architect the model structure to enhance interpretability, and we add monotonicity constraints between inputs and outputs. We then use the proposed set function to automate the engineering of dense, interpretable features from sparse categorical features, an approach we call the semantic feature engine. Experiments on real-world data show that the achieved accuracy is similar to that of deep sets or deep neural networks, while the models are easier to debug and understand.


1 Introduction

We consider the problem of learning a function that acts on a variable-length set of unordered feature vectors. For example, in one of the experiments we predict the sales of a product based on its customer reviews, where each customer review is described by features such as that review's star rating and word count. Recently, Zaheer et al. (2017) showed that for such permutation-invariant countable sets $S$, all valid functions $f(S)$ can be expressed as a transform of the average of per-token transforms:

f(S) = \rho\left(\frac{1}{n}\sum_{i=1}^{n} \phi(x_i)\right)    (1)

where $x_i$ is the $i$th token out of $n$ tokens in the example set $S$, $\phi: \mathbb{R}^D \rightarrow \mathbb{R}^K$, and $\rho: \mathbb{R}^K \rightarrow \mathbb{R}$. (Note, we have changed their expression from a sum to an average to make the generalization of classic aggregation functions like the $\ell_p$ norm clearer; the two forms are equivalent because one of the features for each token can be the number of tokens $n$.)
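To make the structure of (1) concrete, the following is a minimal Python sketch of a permutation-invariant set function. The particular `phi` and `rho` below are hypothetical placeholders (the paper learns them as deep lattice networks or DNNs); only the aggregation pattern itself reflects the text.

```python
import numpy as np

def phi(x):
    """Per-token transform: maps one D-dimensional feature vector to K values.
    (Placeholder nonlinearity; in the paper this is a learned model.)"""
    return np.array([np.tanh(x).mean(), np.log1p(np.abs(x)).mean()])  # K = 2

def rho(z):
    """Fusion transform: maps the K averaged values to a single prediction.
    (Placeholder linear map; in the paper this is also learned.)"""
    return float(z @ np.array([1.0, 0.5]))

def set_function(tokens):
    """f(S) = rho( (1/n) * sum_i phi(x_i) ): invariant to the order of the
    tokens and applicable to sets of any size n."""
    per_token = np.stack([phi(x) for x in tokens])  # shape (n, K)
    return rho(per_token.mean(axis=0))              # average over tokens, then fuse

# Example: a product with three reviews, each described by (star rating, word count).
reviews = [np.array([5.0, 40.0]), np.array([3.0, 12.0]), np.array([1.0, 80.0])]
print(set_function(reviews))
print(set_function(reviews[::-1]))  # same value (up to rounding): permutation invariant
```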

Zaheer et al. (2017) propose training neural networks representing $\phi$ and $\rho$, which can be jointly optimized, and they call the result deep sets. This strategy inherits the arbitrary expressibility of DNNs (Zaheer et al., 2017). Another formulation of (1) comes from support distribution machines (Muandet et al., 2012; Poczos et al., 2012), which in this context can be expressed as:

f(S) = \sum_{j} \alpha_j \, k(S_j, S)    (2)

where $S_j$ are training examples, $k$ is a kernel, and those training examples with non-zero $\alpha_j$ play the role of finitely-sampled support distributions. Other work has also defined kernels for distributions derived from sets of inputs (Kondor and Jebara, 2003) (in less related work, some machine-learning algorithms have also been proposed for permutation-invariant inputs, without handling variable-length inputs, e.g. (Shivaswamy and Jebara, 2006)). Another related approach is that of Hartford et al. (2016), who create a deep neural network architecture that takes a variable-sized bimatrix game as input and allows permutation invariance across the actions (i.e. the rows and columns of the payoff matrix).

2 Interpretable Set Functions with Lattice Models

In this paper, we propose using the deep lattice network (DLN) function class (You et al., 2017) for the $\phi$ and $\rho$ transforms in (1), which enables engineering more interpretable models than DNNs. This produces a new kind of DLN that we refer to as a DLN aggregation function, which we abbreviate in some places as DLN agg function. DLNs improve interpretability in two key ways: (i) the visualizability of the first layer of 1-d calibrator curves, and (ii) the ability to capture prior knowledge about global trends (aka monotonicity), as detailed in the following subsections. We also explain how bottle-necking (1) by setting $K = 1$ improves debuggability.

2.1 Calibrator Curves Promote Visual Understanding

The first layer of a DLN is a calibration layer that automates feature pre-processing by learning a 1-d nonlinear transform for each of the $D$ features using 1-d piecewise linear functions. The resulting 1-d calibrators are easy to visualize and interpret (see Fig. 1). Specifically, we define the $k$th output of $\phi$ in (1) to take the form:

\phi_k(x) = g_k\big(c_{k,1}(x[1]; \theta_{k,1}), \ldots, c_{k,D}(x[D]; \theta_{k,D}); \beta_k\big)    (3)

where each calibrator $c_{k,d}$ is a one-dimensional piecewise linear function, stored as a look-up table parameterized by the vector $\theta_{k,d}$, and after each feature is calibrated, the calibrated features are fused together by other DLN layers, represented here as $g_k$ with parameters $\beta_k$.

Such discriminatively-trained per-feature transforms have been shown to be an efficient way to capture nonlinearities in each feature (e.g. (Sharma and Bala, 2002; Howard and Jebara, 2007; Gupta et al., 2016)), and can also be framed as giving the model a first layer that is a generalized additive model (GAM) (Hastie and Tibshirani, 1990). Each piecewise-linear calibrator can equivalently be expressed as a sum of weighted, shifted ReLUs (You et al., 2017), but the look-up table parameterization enables monotonicity regularization.
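As an illustration of the look-up table parameterization just described, here is a small Python sketch of a 1-d piecewise-linear calibrator. The keypoints and values are illustrative stand-ins rather than learned parameters.

```python
import numpy as np

class Calibrator:
    """A 1-d piecewise-linear calibrator stored as a look-up table of
    (keypoint, value) pairs, evaluated by linear interpolation."""

    def __init__(self, keypoints, values):
        self.keypoints = np.asarray(keypoints, dtype=float)
        self.values = np.asarray(values, dtype=float)

    def __call__(self, x):
        # np.interp clips inputs outside the keypoint range to the end values.
        return np.interp(x, self.keypoints, self.values)

    def is_monotonic(self):
        # Monotonic (non-decreasing) iff adjacent look-up values never decrease.
        return bool(np.all(np.diff(self.values) >= 0.0))

# Illustrative star-rating calibrator in the spirit of Fig. 1 (left): 1-2 stars
# treated alike, 3-4 stars alike, with the main jump at 5 stars.
stars = Calibrator(keypoints=[1, 2, 3, 4, 5], values=[0.0, 0.0, 0.4, 0.4, 1.0])
print(stars(4.5), stars.is_monotonic())   # 0.7 True
```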

Figure 1: Learned calibrator curves in the agg function for the dataset of Section 3.2. Left: The agg function has learned to treat 2-star reviews as just as bad as 1-star reviews, and similarly considers 3- and 4-star reviews of equal importance; its main distinction is that anything lower than 5 stars is a bad sign. Right: This calibrator curve shows that the agg function learned to treat reviews under 25 words as indistinguishably short and reviews over 50 words as equally usefully long, and it is linearly sensitive to reviews of 25-50 words (around 1/4 of all reviews).

2.2 Monotonicity Regularization Promotes End-to-End Model Understanding

For many applications, there is domain knowledge that some features should have a monotonic impact on the output. Thus a particularly interpretable way to regularize is to constrain a model to capture such domain knowledge (see e.g. (Groeneboom and Jongbloed, 2014; Barlow et al., 1972; Howard and Jebara, 2007; Daniels and Velikova, 2010; Sill and Abu-Mostafa, 1997; Kotlowski and Slowinski, 2009; You et al., 2017; Canini et al., 2016; Gupta et al., 2016)). For example, when training a model to predict the sales of a product given its customer reviews, we constrain the model so that if the star rating of the $i$th review is increased, the predicted product sales can never go down.

Monotonicity constraints especially improve interpretability and debuggability for nonlinear models, because no matter how complex the learned model is, the user knows the model respects their specified global properties, for example, that better reviews will never hurt predicted product sales. Monotonicity constraints are in general a handy regularizer because the per-feature constraints can be set a priori by domain experts without needing to tune how much regularization to apply, and the resulting regularization is robust to domain shift between the train and test distributions.

DLNs are a state-of-the-art function class for efficiently enabling monotonicity constraints (You et al., 2017). A DLN can alternate three kinds of layers: (i) calibration layers of one-dimensional piecewise linear transforms as in (3), (ii) linear embedding layers, and (iii) layers of multi-dimensional lattices (interpolated look-up tables) which enable nonlinear mixing of inputs. All three types of layers can be constrained for monotonicity, resulting in end-to-end monotonicity guarantees (by composition).
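The TensorFlow Lattice package referenced later provides the actual monotonicity projection operators; the toy sketch below only illustrates the idea for a single calibrator's look-up table, using a simple left-to-right sweep (not the exact Euclidean projection, which would be an isotonic regression).

```python
import numpy as np

def project_monotonic(values):
    """Make a calibrator's look-up table values non-decreasing by lifting any
    entry that falls below its left neighbor. This enforces the linear
    inequality constraints value[j] >= value[j-1] after a gradient step."""
    projected = np.array(values, dtype=float)
    for j in range(1, len(projected)):
        projected[j] = max(projected[j], projected[j - 1])
    return projected

after_gradient_step = [0.10, 0.30, 0.25, 0.60, 0.55]
print(project_monotonic(after_gradient_step))  # [0.1  0.3  0.3  0.6  0.6]
```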

2.3 Special Case: $K = 1$ For Better Debuggability and Memory Usage

We will show with the proposed DLN agg functions that for real problems we may be able to use a restricted architecture with just $K = 1$ output from the $\phi$ function in (1), which has debuggability and memory advantages.

Debuggability: We have found that restricting $\phi$ such that $K = 1$ greatly aids interpretability, because the agg function becomes a visualizable 1-d transform $\rho$ applied after an average of per-token values $\phi(x_i)$, and each per-token value can be viewed and individually debugged, especially since each $\phi(x_i)$ is a smooth monotonic function of $x_i$ and can be easily debugged with partial dependence plots. In the customer reviews example, this makes it easy to quickly identify if a particular review is dominating the prediction, and if so, what it is about that review's features that is important. Limiting to $K = 1$ still enables learning variations of most of the classic aggregation functions, such as the min, max, unnormalized weighted mean, and geometric mean (which are also all monotonic with respect to the main feature). However, expressing a normalized weighted mean of the form $\sum_i w(x_i) v(x_i) / \sum_i w(x_i)$ requires the per-token function $\phi$ to produce two output values ($K = 2$), one for the numerator and one for the denominator.

Memory Usage: Using $K = 1$, or even just a small $K$, may make it possible to substantially reduce run-time memory if there are a finite number of possible tokens, because one can compute $\phi(x)$ offline for every possible token $x$, and then only store its $K$ values for each token. At runtime, one sees the exact tokens that are needed, retrieves the pre-computed $\phi$ values, takes their average, and applies $\rho$.
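A minimal sketch of this precomputation trick, assuming a finite token vocabulary and $K = 1$; the `phi`, `rho`, and vocabulary below are hypothetical placeholders.

```python
import numpy as np

# Hypothetical finite vocabulary: token name -> its D-dimensional feature vector.
vocabulary = {
    "token_a": np.array([5.0, 40.0]),
    "token_b": np.array([3.0, 12.0]),
    "token_c": np.array([1.0, 80.0]),
}

phi = lambda x: float(np.tanh(x).mean())  # stand-in per-token score (K = 1)
rho = lambda z: 2.0 * z + 1.0             # stand-in 1-d transform of the average

# Offline: store one precomputed score per token instead of running phi at serving time.
score_table = {token: phi(features) for token, features in vocabulary.items()}

def serve(tokens):
    """At runtime: look up precomputed scores, average them, and apply rho."""
    scores = [score_table[t] for t in tokens if t in score_table]
    return rho(float(np.mean(scores))) if scores else None

print(serve(["token_a", "token_c"]))
```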

3 Experiments on Sets of Feature Vectors

We demonstrate the proposed aggregation functions with two real-world case studies (more experiments in Section 5). The DLN Agg function architecture we use is illustrated in Fig. 2.

Figure 2: Example set function architecture with $n$ tokens, $D$ features per token, and $K$ intermediate dimensions. Note that the same function $\phi$ is applied to each input token $x_i$, before the outputs are averaged along each of the $K$ dimensions and fed into $\rho$ to produce the final output.

3.1 Implementation Details

For all the experiments in this paper, as shown in Fig. 2, we use a 6-layer DLN agg function architecture composed of calibrated lattice models for $\phi$, followed by an average, and then $\rho$ is composed of a $K$-dimensional calibrated lattice model followed by a final one-dimensional calibration layer. All DLN layers are differentiable, so we jointly train the parameters of $\phi$ and $\rho$ using backpropagation of gradients as in (1). See Appendix A in the supplemental for more details on our implementation, initialization, and optimization of $\phi$ and $\rho$, which largely follow the descriptions in other recent papers on lattice models (You et al., 2017; Canini et al., 2016; Gupta et al., 2016).

We will provide open-source TensorFlow code to implement the proposed agg functions, building on the DLN layers and monotonicity projection operators of the open-source TensorFlow Lattice package (github.com/tensorflow/lattice).

We compare to deep sets (Zaheer et al., 2017): for all deep sets comparisons, we model $\phi$ and $\rho$ each as 3-hidden-layer fully-connected DNNs implemented in TensorFlow, with the number of hidden nodes in each layer fixed to be the same, and $K$ was also set to that value. We trained using the ADAM optimizer (Kingma and Ba, 2014). The hyperparameters validated over were the learning rate, the number of units in each hidden layer, and the number of training iterations.

3.2 Case Study: How Customer Reviews Affect Product Sales

This case study illustrates the interpretability of the proposed DLN aggregation. The goal is to understand how different aspects of product reviews affect product sales. The data, from a luxury goods company [name redacted for blind review], will be made publicly available on Kaggle. The training label is the number of sales of the product over a six-month window, the training examples are all the products in stock during that time period, and the training features are derived from all product reviews posted at the end of that time period. The validation and test sets are analogous, but for the next two six-month periods, respectively. The resulting train/validation/test samples are non-IID due to the time shift and because many of the validation and test products are the same as the training products, albeit with the statistics collected over different time periods (the rest are newly-released products). Each product is described by its customer reviews, and the features are (i) the star rating of each review, (ii) the word count of each review, and (iii) the number of reviews the product received. While tiny, this real-world example is excellent for analyzing and comparing flexibility-regularization trade-offs. For the proposed aggregation functions, we constrained the predicted sales to be monotonically increasing in the star rating and in the number of reviews (which signifies popularity).

Results are given in Table 1. DLN agg functions are able to achieve the best performance on the test set in this example; deep sets, along with the naive linear regression baseline, perform substantially worse. The large gap in performance is likely due to two factors: (1) the small size of the dataset and (2) the test set being non-IID with respect to the training set. Both of these create advantages for simpler and more regularized DLN models.

Train Set Validation Set Test Set
Averaged Aggregation, Linear Fusion 3308 3771 8221
Deep Sets 561.8 2377 8323
DLN Agg, 3054 3454 7502
DLN Agg, 2646 2894 7737
Table 1: Mean Absolute Error Estimating Product Sales From Reviews

3.3 Case Study: Predicting User Intent

For this binary classification problem from a large internet services company, the goal is to predict whether a given query (a string containing multiple words) is seeking a specific type of result. We split the examples randomly into training/validation/test sets in 8/1/1 proportions. Each query is broken into ngrams, and each ngram has ten corresponding pieces of information. Two of the features should be positive signals for the intent (e.g. what percentage of users who issued that ngram in the past were seeking this result type), and their effect is constrained to be monotonic. The other eight features are conditional features, e.g. how popular the ngram is, or the order of (number of terms in) the ngram.

Table 2 shows that the set function models are significantly better than models on pre-averaged features. Furthermore, the larger-$K$ DLN agg function is the best performing of the set function models, while the $K = 1$ DLN performs very similarly to the Deep Sets approach.

Train Set Validation Set Test Set
Averaged Aggregation, Linear Fusion 0.623 0.609 0.610
Averaged Aggregation, DNN Fusion 0.634 0.623 0.624
Deep Sets 0.662 0.644 0.643
DLN Agg, 0.653 0.644 0.643
DLN Agg, 0.674 0.648 0.646
Table 2: Accuracy for Classifying User Intent

4 Semantic Feature Engine

We propose applying set function learning to handle sparse categorical variables in a debuggable and stable way, an approach we call the Semantic Feature Engine (SFE). A common approach to sparse categoricals is to create a Boolean feature for each possible category, and either use these directly as predictors or train an embedding. These strategies work well, but have poor interpretability and debuggability, and can be highly variable across retrainings, causing unwanted churn and instability (Cormier et al., 2016). By contrast, our proposed SFE converts the sparse categoricals into one dense, understandable feature that is an estimate of $P(y \mid s)$ for some label $y$ and input set $s$.

For example, suppose the goal is to produce a classifier that predicts whether a movie will be rated PG (suitable for all ages), and we want to use information about the movie's actors. SFE produces a feature that is an estimate of $P(\text{PG} \mid \text{actors})$, which has the semantic meaning of an actor prior feature. This feature could then be combined with other dense and meaningful features, such as the movie's budget or the studio's previous track record, to produce a final model that is both powerful and interpretable.

In general, SFE produces a feature that is an estimate of some label $y$ given some set $s$. In simple cases one can simply form the point estimate $\hat{P}(y \mid s)$ from the training data. More generally, $s$ may not have occurred often enough in the training data to derive a straightforward point estimate; for example, many movies have sets of actors who have never appeared together before. To address this, the key idea of SFE is to convert $s$ into a set of tokens $\{t_i\}$, estimate $\hat{P}(y \mid t_i)$ for each token, and then learn the best aggregation of the set of per-token estimates to form the best overall estimate of $P(y \mid s)$, which can be used as a feature in a bigger model. See Appendix B (8.2) in the Supplemental for a complete worked example.

Tokenization and Fallback Rules: If $s$ is not a single element, one must choose a tokenization rule to produce a set of tokens from a given example $s$. For example, for text, a standard tokenization is to break the text into ngrams up to some maximum order. For a set of categorical variables, such as {actors}, one can tokenize it into all subsets up to some maximum subset size. If $s$ is a pair of sets (e.g. the actors in a candidate movie to recommend, and the list of all actors in movies the user has previously watched), tokens can be crosses or set differences across the pair. In addition, we suggest adding fallback rules for when the token values are missing for a given token. For sets of categorical variables, our fallback is to iteratively consider smaller subsets for any categories not contained in an existing token, until each category appears in at least one token if possible (see the sketch below and Appendix B for full examples). The last fallback is always to set the token values to missing.
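Below is a sketch of one way the subset tokenization with fallback could be implemented for an unordered set of categories, following the description above and the example in Appendix B; the token table and the exact tie-breaking are illustrative assumptions.

```python
from itertools import combinations

def tokenize(categories, token_table, max_subset_size=3):
    """Enumerate subsets from largest to smallest, keeping those found in the
    token table and falling back to smaller subsets only for categories not
    yet covered by a chosen token. Categories never seen in the token table
    simply remain uncovered (their token values fall back to 'missing')."""
    categories = sorted(set(categories))
    chosen, covered = [], set()
    for size in range(min(max_subset_size, len(categories)), 0, -1):
        covered_so_far = set(covered)  # snapshot: skip only what larger tokens cover
        for subset in combinations(categories, size):
            if covered_so_far.issuperset(subset):
                continue  # already fully covered by a larger chosen token
            if frozenset(subset) in token_table:
                chosen.append(frozenset(subset))
                covered.update(subset)
        if covered == set(categories):
            break
    return chosen

# Hypothetical token table keyed by frozensets of categories.
token_table = {frozenset({"a", "b", "c"}), frozenset({"a", "d"}), frozenset({"b", "d"})}
print(tokenize({"a", "b", "c", "d"}, token_table, max_subset_size=4))
# three tokens: {a, b, c}, {a, d}, {b, d}
```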

Training Sets: The SFE needs an SFE token training set of (set, label) pairs to train the per-token estimates of $P(y \mid t)$. This can be the same training set as is used to learn the SFE aggregation function, and in fact this simple approach often works well, but using different training sets can reduce overfitting. One may also prefer to use different labels: for example, training the SFE per-token estimates on a large (but noisy) dataset of clicks, but training the aggregation function to produce its final estimate on a smaller, cleaner, human-labeled dataset.

Token Table Building: Given the tokenization rules, build a token table by iterating through the SFE token training set to populate a table with empirical estimates of $P(y \mid t)$, and possibly other aggregate statistics such as how often the token was seen in the training data. One can also store non-aggregated token-specific information, such as a token's subset size, in the table during the pre-computation phase; alternatively, these values can be populated at training time. Note that when training the set function, one can also add example-specific details (e.g. for actors, their salary or number of lines in the given movie) to provide additional information on how to weight the different token values. Cumulatively, these techniques provide the $D$ features per token.
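A sketch of building such a token table from an SFE token training set of (category-set, label) pairs, assuming binary labels and subset tokens; the function and field names are illustrative.

```python
from collections import defaultdict
from itertools import combinations

def build_token_table(examples, max_subset_size=3):
    """examples: iterable of (category_set, label) pairs with labels in {0, 1}.
    Returns a dict mapping each token (a frozenset of categories) to its
    empirical P(label | token), its count, and its subset size."""
    sums, counts = defaultdict(float), defaultdict(int)
    for categories, label in examples:
        cats = sorted(set(categories))
        for size in range(1, min(max_subset_size, len(cats)) + 1):
            for subset in combinations(cats, size):
                token = frozenset(subset)
                sums[token] += label
                counts[token] += 1
    return {t: {"p_hat": sums[t] / counts[t],   # empirical estimate of P(label | token)
                "count": counts[t],             # how often the token was seen
                "subset_size": len(t)}          # token-specific detail
            for t in counts}

examples = [({"a", "b"}, 1), ({"a", "c"}, 0), ({"a", "b", "c"}, 1)]
print(build_token_table(examples, max_subset_size=2)[frozenset({"a"})])
# {'p_hat': 0.666..., 'count': 3, 'subset_size': 1}
```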

Token Table Filtering: To reduce table size and improve the statistical significance of the SFE signal, one should filter the token table before learning the SFE aggregation function. Two filtering rules that we have found useful are: (i) a count threshold (that is, if there are too few examples of a specific token in the token training set, then it should be dropped from the token table), and (ii) a confidence interval threshold (that is, if one of the token features is an estimate of a target label, and the confidence interval of that estimate is too wide, then the token should be dropped from the token table). Compared to count-thresholding, confidence interval filtering keeps more lower-frequency tokens that have high label-agreement amongst their occurrences.
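A sketch of the two filtering rules applied to such a token table; the normal-approximation confidence interval is an assumption for illustration, since the text does not specify which interval is computed.

```python
import math

def filter_token_table(table, min_count=5, max_ci_width=0.2, z=1.96):
    """Keep only tokens that (i) were seen at least `min_count` times and
    (ii) whose estimated P(label | token) has a confidence interval no wider
    than `max_ci_width` (normal approximation for a proportion)."""
    kept = {}
    for token, stats in table.items():
        n, p = stats["count"], stats["p_hat"]
        if n < min_count:
            continue  # (i) count threshold
        ci_width = 2.0 * z * math.sqrt(max(p * (1.0 - p), 1e-12) / n)
        if ci_width > max_ci_width:
            continue  # (ii) confidence-interval threshold
        kept[token] = stats
    return kept
```

Note that under such a rule, a lower-count token whose occurrences nearly all agree on the label (an estimate near 0 or 1) has a narrow interval and is kept, which is the behavior described above.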

Learn an Aggregation Function: Given the tokens and their token values, train a set function as per (1) to estimate the label $y$, with monotonicity constraints on the token features chosen based on domain knowledge. Generally, it makes sense to constrain the per-token point estimate feature to have a monotonically increasing impact on the aggregation function's prediction of $y$.

Extended Example: Let $y = 1$ if the movie is PG and $y = 0$ otherwise. Let $s$ represent the set of actors, and suppose a new movie comes out with actors {Alice, Bob, Carol, David}. Suppose the tokenization rule is to use all subsets of up to size 3, with a fallback to smaller subsets if their parent sets are not in the SFE token table. This rule produces the tokens {{Alice, Carol}, {Bob, Carol}, {David}}, meaning that no set of 3 actors appeared in the table, nor did David appear with any of the other actors in the table. Suppose our token features are (i) the percentage of movies with that set of actors that are PG, and (ii) the number of movies with that set of actors; the token table then supplies these two feature values for each of the three tokens. We can then apply a set function that was trained on all pre-2017 movies and was constrained to be monotonically increasing in the first feature, which might produce a prior probability for the movie that can then be combined with other features for a final prediction of the movie's rating.

Separability: Here, we have separated the overall model into three parts: (i) the token table that stores token values, (ii) the set function that combines the token features across the tokens for each example, and (iii) the follow-on model, which might take many SFE features as inputs. This separability has two key advantages in practice. First, each of these parts has a semantic meaning, which aids interpretability and debuggability. Second, each of the three parts can be refreshed or improved independently, which reduces churn (Cormier et al., 2016) and system complexity. That said, jointly training the SFE set function and the follow-on classifier could result in additional metric gains.

5 Experiments using the Semantic Feature Engine To Create Sets

We also evaluate the performance of our set function learning approach when applied as part of the Semantic Feature Engine to sparse categorical predictors; that is, precomputing statistics about each category and learning a set function over the token statistics. We show that, as in the earlier experiments, it outperforms Deep Sets in learning the best set function over the tokens. We also compare it to the strategy of directly learning a deep neural network (DNN) on a multi-hot encoding of the categories and find that it tends to perform similarly well, with the added benefits of interpretability and debuggability.

For each experiment, our original predictors are a variable-length set of categories, and our SFE tokens are subsets of those categories. In particular, we compute six features for each token that are then fed into the set function: (a) the average label, computed over the training data, for the token; (b) how frequently the token appears; (c) the size of the subset that the token represents; (d) whether the token fully matches the set's list of categories; (e) the number of categories in the set; and (f) the number of tokens generated from the set. Note that these fall into three buckets: (a) is the direct estimate of the label's value for a token; (b), (c), and (d) are token-specific features that can provide information on how much to weight the tokens; and (e) and (f) are set-specific features (the same for all tokens in an example) that help calibrate the outputs and nature of the aggregation.
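A sketch of assembling these six per-token features (a)-(f) for one example, given its category set, its list of tokens, and one token-table entry; the names are illustrative.

```python
def token_features(token, table_entry, category_set, tokens):
    """Return the feature vector [(a), (b), (c), (d), (e), (f)] for one token."""
    return [
        table_entry["p_hat"],                    # (a) average label for the token
        table_entry["count"],                    # (b) how frequently the token appears
        len(token),                              # (c) size of the subset it represents
        float(set(token) == set(category_set)),  # (d) full match with the set's categories
        len(set(category_set)),                  # (e) number of categories in the set
        len(tokens),                             # (f) number of tokens generated from the set
    ]

# Example usage with a hypothetical token and table entry.
entry = {"p_hat": 0.3, "count": 12}
print(token_features(frozenset({"a", "b"}), entry, {"a", "b", "c"},
                     [frozenset({"a", "b"}), frozenset({"c"})]))
# [0.3, 12, 2, 0.0, 3, 2]
```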

For the DNN comparison, we use standard TensorFlow embeddings, first creating a tf.feature_column.embedding_column over the raw categories (those that appear in the train set), then feeding that into a tf.estimator.DNNClassifier with two hidden layers before a final softmax layer. We optimize the ADAM learning rate, the number of epochs, and the sizes of the embedding and hidden layers over the validation set. New validation-set or test-set categories are ignored; only embeddings of categories that also appear in the training set are used in evaluation.

Train Set Validation Set Test Set
DNN on Attributes 0.793 0.789 0.790
Deep Sets 0.794 0.786 0.785
DLN Agg 0.795 0.785 0.785
DLN Agg 0.795 0.786 0.786
Table 3: Accuracy for Classifying Facial Attractiveness

5.1 CelebA

The CelebA dataset of face images (Liu et al., 2015) is randomly split 70/10/20 into train/validation/test sets. Each face is described by 40 binary attributes, such as whether the subject has blond hair, earrings, or a mustache. There is also a Boolean feature for whether the face was judged to be attractive. We treat the problem as binary classification, predicting whether a face is labeled as attractive based on its attributes, and use the Semantic Feature Engine to generate conditional probabilities of attractiveness for all subsets of attributes whose estimated values have confidence intervals of 0.2 or under.

Results in Table 3 show that DLNs on the aggregated token subset information, including the very simple $K = 1$ model, perform better than the Deep Sets models, which use substantially higher-dimensional functions. DNNs directly on the attributes, however, perform the best of all models considered, with nearly 0.4% higher test accuracy than the best-performing DLN.

Train Acc Validation Acc Test Acc Test Prec@1 Test Prec@3
DNN on Attributes 0.986 0.974 0.974 0.728 0.883
Deep Sets 0.981 0.973 0.973 0.707 0.888
DLN Agg 0.991 0.974 0.973 0.710 0.880
DLN Agg 0.984 0.974 0.974 0.729 0.890
Table 4: Accuracy and Ranking Precision for Classifying Recipe Cuisine

5.2 Cuisine Classification from Recipe Ingredient List

The recipes dataset (www.kaggle.com/kaggle/recipe-ingredients-dataset) consists of recipes represented by their lists of ingredients, together with the cuisine they come from. It is randomly split 70/10/20 into train/validation/test sets. We build a model that takes a list of ingredients and a cuisine and acts as a binary classifier that estimates whether they are a correct match; that is, the model outputs an estimate of $P(\text{cuisine} \mid \text{ingredients})$. There is a fixed set of 20 possible cuisines, so we create one positive and nineteen negative training samples from each row of the original dataset. Note that the problem can also be thought of as multiclass classification; for that reason, we also report the multiclass metrics precision@1 and precision@3 (which compute how often the correct cuisine's score was the top, or among the top 3, of the predictions over all candidate cuisines for a given recipe).

We start by doing some basic pre-processing on the ingredients, such as converting to lower case and word-stemming, to make it more likely that equivalent ingredients are identified; we will upload a Kaggle kernel with the details. Then, we use SFE to tokenize the resulting ingredient set, with each item crossed with each cuisine; see Appendix B (8.2) for a complete example. We filter out token values for any sets of ingredients that appear fewer than 5 times in the training data. We consider subsets of ingredients up to size 3; we cross-validated max subset sizes up to 5. Using all subsets is not feasible: many recipes have over 20 ingredients and the longest has 65, meaning that hundreds or even thousands of tokens are being aggregated for some examples even with a max subset size of 3. Another unique feature of the recipes dataset compared with the other benchmarks is that the vocabulary is very large, with thousands of unique ingredients, meaning that many ingredients in the test set do not appear or barely appear in the training set.

As Table 4 shows, the DLN agg functions with $K > 1$ perform the best of all models on the precision metrics, while DNNs are slightly higher on binary classification accuracy. The $K = 1$ DLN performs similarly to the higher-complexity Deep Sets model.

Train Set Validation Set Test Set
DNN on Attributes 7.04 7.39 7.20
Deep Sets 7.29 7.45 7.26
DLN Agg 7.04 7.37 7.19
DLN Agg 7.02 7.35 7.19
Table 5: Mean Squared Error for Predicting Wine Quality

5.3 Wine

The wine dataset (www.kaggle.com/zynicide/wine-reviews) consists of wines described by their quality, price, country of origin, and a set of descriptive terms culled from reviews (out of a total set of 39 possible adjectives such as complex, oak, and velvet). We focus on predicting the quality, which is scored on a 100-point scale, using the set of review adjectives. As discussed in the SFE section, this review prior score could be used in a follow-on interpretable model that also incorporates the other features, such as price, though we do not do so here. We consider all subsets and require sets of adjectives to appear at least 32 times in the training data before entering the token table.

Table 5 shows that DLNs perform best on the wine dataset, with even the $K = 1$ DLN performing better than the DNN. Deep Sets struggles here, performing far worse than either of the competitors.

6 Conclusions

We have shown that we can learn DLN aggregation functions over sets that provide similar accuracy to deep sets (Zaheer et al., 2017) but greater interpretability, due to three key aspects. First, DLN agg functions enable monotonicity constraints, providing end-to-end high-level understanding and greater predictability of even highly nonlinear models. Second, the first layer of 1-d per-feature calibrator functions can be visualized and interpreted. Third, we showed that we can simplify the middle layer to a per-token score before averaging ($K = 1$) with slight or no loss of accuracy on real-world problems. This greatly aids debuggability, as it makes it easier to determine which tokens are most responsible for the output and whether any of the per-token scores are noisy or suspicious.

We also showed that learning on sets is broadly applicable with our semantic feature engine proposal, which converts highly sparse features into a dense estimate of $P(\text{label} \mid \text{features})$. Our experiments show these estimates were similar in accuracy to applying a DNN to the sparse features. The main advantage of the SFE over the DNN is its greater interpretability and debuggability; we expect SFE will also show greater stability and less churn when re-trained. SFE features can be combined with other information in follow-on DLN models, with monotonicity regularization on the SFE features, which makes the follow-on DLN more interpretable and stable across re-trainings.

References

  • Barlow et al. [1972] R. E. Barlow, D. J. Bartholomew, J. M. Bremner, and H. D. Brunk. Statistical inference under order restrictions; the theory and application of isotonic regression. Wiley, New York, USA, 1972.
  • Canini et al. [2016] K. Canini, A. Cotter, M. M. Fard, M. R. Gupta, and J. Pfeifer. Fast and flexible monotonic functions with ensembles of lattices. Advances in Neural Information Processing Systems (NIPS), 2016.
  • Cormier et al. [2016] Q. Cormier, M. Milani Fard, and M. R. Gupta. Launch and iterate: Reducing prediction churn. Advances in Neural Information Processing Systems (NIPS), 2016.
  • Cotter et al. [2016] A. Cotter, M. R. Gupta, and J. Pfeifer. A Light Touch for heavily constrained SGD. In 29th Annual Conference on Learning Theory, pages 729–771, 2016.
  • Daniels and Velikova [2010] H. Daniels and M. Velikova. Monotone and partially monotone neural networks. IEEE Trans. Neural Networks, 21(6):906–917, 2010.
  • Duchi et al. [2011] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal Machine Learning Research, 12:2121–2159, 2011.
  • Groeneboom and Jongbloed [2014] P. Groeneboom and G. Jongbloed. Nonparametric estimation under shape constraints. Cambridge Press, New York, USA, 2014.
  • Gupta et al. [2016] M. R. Gupta, A. Cotter, J. Pfeifer, K. Voevodski, K. Canini, A. Mangylov, W. Moczydlowski, and A. V. Esbroeck. Monotonic calibrated interpolated look-up tables. Journal of Machine Learning Research, 17(109):1–47, 2016. URL http://jmlr.org/papers/v17/15-243.html.
  • Hartford et al. [2016] J. S. Hartford, J. R. Wright, and K. Leyton-Brown. Deep learning for predicting human strategic behavior. In Advances in Neural Information Processing Systems, pages 2424–2432, 2016.
  • Hastie and Tibshirani [1990] T. Hastie and R. Tibshirani. Generalized Additive Models. Chapman Hall, New York, 1990.
  • Howard and Jebara [2007] A. Howard and T. Jebara. Learning monotonic transformations for classification. Advances in Neural Information Processing Systems (NIPS), 2007.
  • Kingma and Ba [2014] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Kondor and Jebara [2003] R. Kondor and T. Jebara. A kernel between sets of vectors. In Proceedings of the 20th International Conference on Machine Learning (ICML), pages 361–368, 2003.
  • Kotlowski and Slowinski [2009] W. Kotlowski and R. Slowinski. Rule learning with monotonicity constraints. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 537–544. ACM, 2009.
  • Liu et al. [2015] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Intl. Conf. Computer Vision (ICCV), pages 3730–3738, 2015.
  • Muandet et al. [2012] K. Muandet, K. Fukumizu, F. Dinuzzo, and B. Schoelkopf. Learning from distributions with support measure machines. Advances in Neural Information Processing Systems (NIPS), 2012.
  • Poczos et al. [2012] B. Poczos, L. Xiong, D. Sutherland, and J. Schneider. Support distribution machines. Available on arXiv, 2012.
  • Sharma and Bala [2002] G. Sharma and R. Bala. Digital Color Imaging Handbook. CRC Press, New York, 2002.
  • Shivaswamy and Jebara [2006] P. K. Shivaswamy and T. Jebara. Permutation invariant svms. In Advances in Neural Information Processing Systems, 2006.
  • Sill and Abu-Mostafa [1997] J. Sill and Y. S. Abu-Mostafa. Monotonicity hints. Advances in Neural Information Processing Systems (NIPS), pages 634–640, 1997.
  • You et al. [2017] S. You, D. Ding, K. Canini, J. Pfeifer, and M. R. Gupta. Deep lattice networks and partial monotonic functions. Advances in Neural Information Processing Systems (NIPS), 2017.
  • Zaheer et al. [2017] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. Salakhutdinov, and A. Smola. Deep sets. Advances in Neural Information Processing Systems (NIPS), 2017.

7 Appendix A: More Implementation Details

We provide more details of our implementation, particularly how we constrain the input and output ranges of each layer.

7.1 Details on Each Layer

First Layer ($\phi$: Calibrators): Each of the calibrators has its input bounded to a user-defined range, which we set based on the train set or domain knowledge. Each calibrator is a one-dimensional piecewise linear function stored as a set of keypoint-value pairs, where the keypoints are spaced in line with the quantiles of the inputs, the calibrator values are jointly trained but bounded to $[0, 1]$, and the values are initialized to form a linear function spanning the input-output range. Every calibrator is constrained to be a monotonic function by adding linear inequality constraints on adjacent look-up table parameters that restrict each of the one-dimensional look-up table values to be greater than its left neighbor [Gupta et al., 2016]. If the $\phi$ function has $K$ outputs, then there are $K$ separate calibrators per input feature.

Second Layer ($\phi$: Lattices): The lattice layer follows the calibration layer and thus takes inputs in $[0, 1]$. Each set of calibrated inputs goes into one of the $K$ lattices. The outputs of this lattice layer are all bounded to be in $[0, 1]$, and the lattice parameters are initialized to lie in that range. Each lattice is represented by a look-up table with $V^D$ parameters, where $V = 2$ vertices per input usually suffices, but $V$ was tuned on the validation set for some of our experiments to a higher integer value to form a finer-grained lattice. Each lattice is constrained to be monotonically increasing or decreasing with respect to a user-specified set of features by adding appropriate linear inequality constraints on the lattice's look-up table parameters (see Gupta et al. [2016] for details). For each experiment, we used a priori domain knowledge to specify the monotonicity constraints.
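For concreteness, here is a small sketch of how a single lattice layer maps its calibrated inputs in [0, 1] to an output via multilinear interpolation of its look-up table parameters, assuming two vertices per dimension; the parameter values are illustrative.

```python
import itertools
import numpy as np

def lattice_interpolate(x, params):
    """Multilinear interpolation of a 2-vertices-per-dimension lattice.
    x: d calibrated inputs in [0, 1]; params: dict from a 0/1 vertex tuple to
    that vertex's look-up table value."""
    out = 0.0
    for vertex in itertools.product((0, 1), repeat=len(x)):
        weight = np.prod([xi if v else 1.0 - xi for xi, v in zip(x, vertex)])
        out += weight * params[vertex]
    return out

# A 2-d lattice whose four corner values increase along both dimensions, so
# the interpolated function is monotonically increasing in both inputs.
params = {(0, 0): 0.0, (0, 1): 0.3, (1, 0): 0.5, (1, 1): 1.0}
print(lattice_interpolate([0.2, 0.8], params))  # 0.372
```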

Third Layer (Simple Average): The outputs from each of the $K$ lattices are averaged over the tokens, so that the average layer produces $K$ outputs in total. We use an average rather than a sum (as in (1)) so that the inputs to $\rho$ are guaranteed to be bounded in $[0, 1]$.

Fourth Layer ($\rho$: Calibrators): These calibrators are largely similar to those of the first layer. A slight difference is that the inputs to the fourth layer lie in $[0, 1]$, as described above, and that the keypoints are initialized uniformly across that range rather than according to empirical quantiles of the data. Like the other calibrators, their outputs are bounded to $[0, 1]$.

Fifth Layer ($\rho$: Lattice on $K$ inputs): The $K$ calibrated values from the fourth layer are fused together by one $K$-dimensional lattice. We explicitly constrain the lattice parameters to all be in $[0, 1]$, and since the output of this layer is just an interpolation of the lattice parameters, this constrains the output of the layer to be in $[0, 1]$.

Sixth Layer ($\rho$: Output Calibrator): The last layer is a one-dimensional piecewise linear transform parameterized by a set of keypoint-value pairs, where the keypoints are uniformly spaced over its input range $[0, 1]$, and its output range is determined by the training. The values are initialized to form the identity function. This transform is constrained to be monotonic by adding linear inequality constraints to the training (see (4)) that restrict each of the one-dimensional look-up table values to be greater than its left neighbor [Gupta et al., 2016].

7.2 Training and Optimization

Training the proposed aggregation function $f(S; \theta)$, where $\theta$ represents all of the DLN parameters for $\phi$ and $\rho$, is a constrained structural risk minimization problem. Given a training set $\{(S_i, y_i)\}$ for $i = 1, \ldots, N$:

\arg\min_{\theta} \sum_{i=1}^{N} \ell\big(f(S_i; \theta), y_i\big) \quad \text{s.t.} \quad A\theta \le b    (4)

where $A\theta \le b$ expresses the monotonicity constraints and any range constraints on the DLN parameters.

Note that the proposed DLN structure for $\phi$ and $\rho$ means all parameters of (1) have gradients computable with the chain rule and backpropagation. We solve (4) using the Light-Touch algorithm [Cotter et al., 2016] to handle the monotonicity constraints on top of Adagrad [Duchi et al., 2011]. Light-Touch samples the constraints, learning which ones are most likely to be violated, and inexpensively penalizes them as training progresses. It thereby converges to a feasible solution without enforcing feasibility throughout. Once optimization has finished, we project the final iterate to guarantee monotonicity.
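The actual optimization uses Light-Touch with Adagrad as cited above; the sketch below is only a simplified stand-in that shows the overall pattern (gradient steps on the empirical risk, followed by a final projection onto the monotonicity constraints), shown for a single 1-d calibrator trained with squared error.

```python
import numpy as np

def train_monotonic_calibrator(xs, ys, keypoints, steps=500, lr=0.1):
    """Fit a 1-d piecewise-linear calibrator to (xs, ys) with squared error,
    then project the final look-up table values to be non-decreasing."""
    keypoints = np.asarray(keypoints, dtype=float)
    values = np.linspace(ys.min(), ys.max(), len(keypoints))  # linear initialization
    for _ in range(steps):
        preds = np.interp(xs, keypoints, values)
        grad = np.zeros_like(values)
        for x, p, y in zip(xs, preds, ys):
            # Route the squared-error gradient to the two keypoints bracketing x.
            j = int(np.clip(np.searchsorted(keypoints, x, side="right") - 1,
                            0, len(keypoints) - 2))
            w = (x - keypoints[j]) / (keypoints[j + 1] - keypoints[j])
            grad[j] += 2.0 * (p - y) * (1.0 - w)
            grad[j + 1] += 2.0 * (p - y) * w
        values -= lr * grad / len(xs)
    for j in range(1, len(values)):                 # final projection:
        values[j] = max(values[j], values[j - 1])   # enforce value[j] >= value[j-1]
    return values

xs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
ys = np.array([0.0, 0.1, 0.05, 0.6, 1.0])           # noisy but roughly increasing
print(train_monotonic_calibrator(xs, ys, keypoints=[1, 2, 3, 4, 5]))
```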

7.3 Hyperparameter Optimization

When learning the aggregation function, we tune hyperparameters including the learning rate, the number of epochs, the output dimension $K$ of the calibrated lattice layer that maps token features into $K$ outputs, and the number of keypoints used in each calibration layer of the DLN. We choose the combination of hyperparameters that achieves the best model performance measured on the validation dataset.

In the cases where SFE is used to convert sparse categorical features into dense features, we also tune hyperparameters that affect tokenization. Specifically, we tune (a) the maximum size of the created tokens, meaning the maximum order of ngrams if the sparse feature is text and the maximum subset size if it is an unordered set of strings; and (b) the filtering criteria for tokens, including the count threshold and the maximum confidence interval width.

8 Appendix B: Examples

Figure 3: Example SFE tokenization of an input unordered set, using a prebuilt token table that stores all the filtered tokens and their feature values. The input is a set of 4 categories, say {A, B, C, D}. We enumerate its subsets starting from the maximum subset size, and output the ones found in the token table until all categories in the input set are covered. At subset size 4, we skip the only subset, {A, B, C, D}, because it is not found in the token table. We then proceed to size 3; out of the 4 subsets, we find one, say {A, B, C}, in the token table, and add that subset and its features to the output. Since there is one item (D) in the input set not yet covered by any chosen subset, we continue with size 2. Out of the 6 subsets of size 2, we skip the 3 ({A, B}, {A, C}, {B, C}) that are already fully covered by a chosen subset ({A, B, C}). Of the remaining 3, two (say {A, D} and {B, D}) are found in the token table, and we output both. At this point all items in the input set are covered, so we stop and output the 3 selected subsets as tokens.

8.1 Example of SFE Fallback Logic

Fig. 3 illustrates how the SFE fallback logic works when we generate a set of dense feature vectors from an unordered set of categories. The general idea is to find large subsets of the input that cover all the categories, and to fall back to smaller subsets when larger ones are not found in the pre-built token table. A subset does not make it into the token table if it does not appear frequently enough in the training data according to the filtering criteria.

8.2 Complete SFE and Learned Aggregation Function Example

We show the inner workings of the aggregation function on an example from the recipes dataset. The example, whose true cuisine is French but for which we are evaluating the candidate cuisine of Mexican, consists of the ingredients {sugar, salt, fennel bulb, water, lemon olive oil, grapefruit juice}. As mentioned above, we consider subsets of up to size 3 for this problem. As this combination of ingredients is rather rare, we only find 2 subsets of size 3 (out of all possible size-3 subsets) that appeared in the training data at least 5 times and are therefore in the token table: {salt, water, fennel bulb} and {salt, water, sugar}. Since neither lemon olive oil nor grapefruit juice appears in any frequent subset of size 3, the model searches for any subsets of size 2 containing one of those ingredients and doesn't find any; finally it finds grapefruit juice as a singleton in the token table, while lemon olive oil never appears in any form in the token table.

We therefore have the following 3 total tokens:

  1. {salt, water, fennel bulb}: count = 17; subset size = 3; is full match = 0; number of ingredients = 6; number of tokens = 3

  2. {salt, water, sugar}: count = 510; subset size = 3; is full match = 0; number of ingredients = 6; number of tokens = 3

  3. {grapefruit juice}: count = 10; subset size = 1; is full match = 0; number of ingredients = 6; number of tokens = 3

Recall that there are 20 possible cuisines, so 0.05 is the break-even uninformative prior value. Here, both subsets of size 3 have fairly weak/neutral evidence while the prior for grapefruit juice is actually quite high.

The intermediate outputs for these tokens are:

  1. {salt, water, fennel bulb}: -0.005

  2. {salt, water, sugar}: -0.156

  3. {grapefruit juice}: 0.075

Our labels are -1/1, so positive outputs represent the model leaning towards yes and negative outputs are the opposite. Interestingly, the {salt, water, sugar} subset is much more negative than {grapefruit juice} is positive, even though its prior is much closer to neutral. We can explain this by looking at the supporting information: the former has a subset size of 3, teaching the model to trust it more than a subset of size 1, and it appears 510 times vs. 10, leading to the same conclusion.

Finally, the outputs are averaged together and fed through $\rho$, which in the $K = 1$ case is a simple piecewise linear transform, yielding the final output of -0.28. The classifier has correctly identified the recipe as not being Mexican.