1 Introduction
We consider the problem of learning a function that acts on a variable-length set of unordered feature vectors. For example, in one of our experiments we predict the sales of a product based on its customer reviews, where each customer review has features such as its star rating and word count. Recently, Zaheer et al. (2017) showed that for such permutation-invariant functions on countable sets, all valid functions can be expressed as a transform of the average of per-token transforms:
$$f(X) \;=\; g\!\left(\frac{1}{n}\sum_{i=1}^{n}\phi(x_i)\right) \tag{1}$$
where $x_i$ is the $i$th token out of the $n$ tokens in the example set $X$, $\phi : \mathbb{R}^D \rightarrow \mathbb{R}^S$, and $g : \mathbb{R}^S \rightarrow \mathbb{R}$. (Note, we have changed their expression from a sum to an average to make the generalization of classic aggregation functions like the $p$-norm clearer; the two forms are equivalent because one of the features for each token can be the number of tokens $n$.)
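For intuition, this decomposition can be sketched in a few lines (the `phi`/`g` pair below is a hypothetical choice illustrating how a $p$-norm-style aggregation, which approximates max, fits the form of (1)):

```python
import numpy as np

def deep_set_fn(tokens, phi, g):
    """Permutation-invariant set function: g applied to the average of
    per-token transforms phi, as in Eq. (1)."""
    return g(np.mean([phi(x) for x in tokens], axis=0))

# Hypothetical phi/g: with phi(x) = x**p and g(m) = m**(1/p), a large p
# approximates the max aggregation (a classic p-norm-style aggregator).
p = 64.0
phi = lambda x: np.asarray([x ** p])
g = lambda m: float(m[0] ** (1.0 / p))

scores = [1.0, 3.0, 2.0]
out1 = deep_set_fn(scores, phi, g)
out2 = deep_set_fn(list(reversed(scores)), phi, g)  # order should not matter
```

Here `out1` sits just below the max of the set (3.0), and permuting the tokens leaves the output unchanged.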
Zaheer et al. (2017) propose training neural networks representing $\phi$ and $g$, which can be jointly optimized, an approach they call deep sets. This strategy inherits the arbitrary expressivity of DNNs (Zaheer et al., 2017). Another formulation of (1) comes from support distribution machines (Muandet et al., 2012; Poczos et al., 2012), which in this context can be expressed as:
$$f(X) \;=\; \sum_{j=1}^{J} \alpha_j \, k\!\left(\frac{1}{n_j}\sum_{i=1}^{n_j}\phi(x_{j,i}),\;\; \frac{1}{n}\sum_{i=1}^{n}\phi(x_i)\right) \tag{2}$$
where $X_j = \{x_{j,i}\}_{i=1}^{n_j}$ are training examples, $k$ is a kernel, and those training examples with nonzero $\alpha_j$ play the role of finitely-sampled support distributions. Other work has also defined kernels for distributions derived from sets of inputs (Kondor and Jebara, 2003). (In less related work, some machine-learning algorithms have also been proposed for permutation-invariant inputs without handling variable-length inputs, e.g. Shivaswamy and Jebara (2006).) Another related approach is that of Hartford et al. (2016), who create a deep neural network architecture that takes a variable-sized bimatrix game as input and allows permutation invariance across the actions (i.e. rows and columns of the payoff matrix).

2 Interpretable Set Functions with Lattice Models
In this paper, we propose using the deep lattice network (DLN) function class (You et al., 2017) for the $\phi$ and $g$ transforms in (1), which enables engineering more interpretable models than DNNs. This produces a new kind of DLN that we refer to as a DLN aggregation function, abbreviated in some places as DLN agg function. DLNs improve interpretability in two key ways: (i) the first layer of 1-d calibrator curves can be visualized, and (ii) prior knowledge about global trends (i.e. monotonicity) can be captured, as detailed in the following subsections. We also explain how bottlenecking (1) by setting $S = 1$ improves debuggability.
2.1 Calibrator Curves Promote Visual Understanding
The first layer of a DLN is a calibration layer that automates feature preprocessing by learning a 1-d nonlinear transform for each of the $D$ features using 1-d piecewise-linear functions. The resulting 1-d calibrators are easy to visualize and interpret (see Fig. 1). Specifically, we define the $s$th output of $\phi$ in (1) to take the form:
$$\phi_s(x) \;=\; \psi_s\big(c_{s,1}(x[1];\,\theta_{s,1}),\,\ldots,\,c_{s,D}(x[D];\,\theta_{s,D});\;\nu_s\big) \tag{3}$$
where each calibrator $c_{s,d}$ is a one-dimensional piecewise-linear function, stored as a lookup table parameterized by the vector $\theta_{s,d}$, and after each of the $D$ features is calibrated, the calibrated features are fused together by other DLN layers, represented here as $\psi_s$ with parameters $\nu_s$.
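A 1-d piecewise-linear calibrator of this kind can be sketched with a simple lookup table plus linear interpolation (names and values below are illustrative, not the paper's implementation):

```python
import numpy as np

class Calibrator:
    """1-d piecewise-linear transform stored as a lookup table:
    fixed input keypoints, trainable output values (the parameters theta)."""
    def __init__(self, keypoints, values):
        self.keypoints = np.asarray(keypoints, dtype=float)
        self.values = np.asarray(values, dtype=float)

    def __call__(self, x):
        # Linear interpolation between keypoints; clamps outside the range.
        return np.interp(x, self.keypoints, self.values)

# A calibrator for a star-rating feature on [1, 5]; values are hypothetical.
cal = Calibrator(keypoints=[1, 2, 3, 4, 5], values=[0.0, 0.1, 0.3, 0.7, 1.0])
y_mid = cal(3.5)   # halfway between the values at 3 and 4
y_low = cal(0.0)   # clamped to the value at the first keypoint
```

Because the parameters are the output values at fixed keypoints, the learned curve can be plotted directly, which is the visualization advantage described above.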
Such discriminatively-trained per-feature transforms have been shown to be an efficient way to capture nonlinearities in each feature (e.g. Sharma and Bala (2002); Howard and Jebara (2007); Gupta et al. (2016)), and can also be framed as giving the model a first layer that is a generalized additive model (GAM) (Hastie and Tibshirani, 1990). Each piecewise-linear calibrator can equivalently be expressed as a sum of weighted, shifted ReLUs (You et al., 2017), but the lookup-table parameterization enables monotonicity regularization.

2.2 Monotonicity Regularization Promotes End-to-End Model Understanding
For many applications, there is domain knowledge that some features should have a monotonic impact on the output. Thus a particularly interpretable way to regularize is to constrain a model to capture such domain knowledge (see e.g. (Groeneboom and Jongbloed, 2014; Barlow et al., 1972; Howard and Jebara, 2007; Daniels and Velikova, 2010; Sill and Abu-Mostafa, 1997; Kotlowski and Slowinski, 2009; You et al., 2017; Canini et al., 2016; Gupta et al., 2016)). For example, when training a model to predict sales of a product given its customer reviews, we constrain the model so that if the star rating of the $i$th review is increased, the predicted product sales can never go down.
Monotonicity constraints especially improve interpretability and debuggability for nonlinear models, because no matter how complex the learned model is, the user knows the model respects their specified global properties, for example, that better reviews will never hurt predicted product sales. Monotonicity constraints are in general a handy regularizer because the per-feature constraints can be set a priori by domain experts without needing to tune how much regularization to apply, and the resulting regularization is robust to domain shift between the train and test distributions.
DLNs are a state-of-the-art function class for efficiently enabling monotonicity constraints (You et al., 2017). A DLN can alternate three kinds of layers: (i) calibration layers of one-dimensional piecewise-linear transforms as in (3), (ii) linear embedding layers, and (iii) layers of multidimensional lattices (interpolated lookup tables), which enable nonlinear mixing of inputs. All three types of layers can be constrained for monotonicity, resulting in end-to-end monotonicity guarantees (by composition).
2.3 Special Case: $S = 1$ for Better Debuggability and Memory Usage
We will show with the proposed DLN agg functions that for real problems we may be able to use a restricted architecture with just $S = 1$ output from the $\phi$ function in (1), which has debuggability and memory advantages.
Debuggability: We have found that restricting the architecture such that $S = 1$ greatly aids interpretability, because the agg function becomes a visualizable 1-d transform $g$ applied after an average of scalar token values, and each token value $\phi(x_i)$ can be viewed and individually debugged, especially since each $\phi$ is a smooth monotonic function and can be easily debugged with partial-dependence plots. In the customer reviews example, this makes it easy to quickly identify whether a particular review is dominating the prediction, and if so, what it is about that review's features that is important. Limiting to $S = 1$ still enables learning variations of most of the classic aggregation functions, such as min, max, unnormalized weighted mean, and geometric mean (which are also all monotonic with respect to the main feature). However, expressing a normalized weighted mean of the form $\sum_i w(x_i)\,v(x_i) / \sum_i w(x_i)$ requires the per-token function $\phi$ to produce two output values ($S = 2$), one for the numerator and one for the denominator.

Memory Usage: Using $S = 1$, or even just a small $S$, may make it possible to substantially reduce runtime memory if there are a finite number of possible tokens, because one can compute $\phi(x)$ offline for every possible token $x$, and then store only its $S$ values for each token. At runtime, one sees the exact tokens that are needed, retrieves the precomputed values, takes the average, and applies $g$.
3 Experiments on Sets of Feature Vectors
We demonstrate the proposed aggregation functions with two real-world case studies (more experiments in Section 5). The DLN agg function architecture we use is illustrated in Fig. 2.
3.1 Implementation Details
For all the experiments in this paper, as shown in Fig. 2, we use a 6-layer DLN agg function architecture: $\phi$ is composed of calibrated lattice models, followed by an average, and then $g$ is composed of an $S$-dimensional calibrated lattice model followed by a final one-dimensional calibration layer. All DLN layers are differentiable, and thus we jointly train the parameters of $\phi$ and $g$ using backpropagation of gradients as in (1). See Appendix A in the supplemental for more details on our implementation, initialization, and optimization of $\phi$ and $g$, which largely follow the descriptions in other recent papers on lattice models (You et al., 2017; Canini et al., 2016; Gupta et al., 2016). We will provide open-source TensorFlow code implementing the proposed agg functions, building on the DLN layers and monotonicity projection operators of the open-source TensorFlow Lattice package (github.com/tensorflow/lattice).
We compare to deep sets (Zaheer et al., 2017): for all deep sets comparisons, we model $\phi$ and $g$ each as 3-hidden-layer fully-connected DNNs implemented in TensorFlow, with the number of hidden nodes fixed to be the same for each layer, and $S$ also set to that value. We then trained using the ADAM optimizer (Kingma and Ba, 2014). The hyperparameters validated over were the learning rate, the number of units in each hidden layer, and the number of training iterations.
3.2 Case Study: How Customer Reviews Affect Product Sales
This case study illustrates the interpretability of the proposed DLN aggregation. The goal is to understand how different aspects of product reviews affect product sales. The data, from a luxury goods company [name redacted for blind review], will be made publicly available on Kaggle. The training label is the number of sales of the product over a six-month window, the training examples are all the products in stock during that time period, and the training features are derived from all product reviews posted by the end of that time period. The validation and test sets are analogous, but for the next two six-month periods, respectively. This produces train/validation/test samples that are non-IID, due to the time shift and because some of the validation and test products are the same as the training products, albeit with statistics collected over different time periods (the rest are newly-released products). Each product is described by its customer reviews, and the features are (i) the star rating of each review, (ii) the word count of each review, and (iii) the number of reviews the product received. While tiny, this real-world example is excellent for analyzing and comparing flexibility-regularization tradeoffs. For the proposed aggregation functions, we constrained the predicted sales to be monotonically increasing in the star rating and in the number of reviews (which signifies popularity).
Results are given in Table 1. DLN agg functions achieve the best performance on the test set in this example; deep sets, along with the naive linear regression baseline, perform substantially worse. The large gap in performance is likely due to two factors: (1) the small size of the dataset, and (2) the test set being non-IID with respect to the training set. Both of these create advantages for the simpler, more regularized DLN models.
| Model | Train Set | Validation Set | Test Set |
| --- | --- | --- | --- |
| Averaged Aggregation, Linear Fusion | 3308 | 3771 | 8221 |
| Deep Sets | 561.8 | 2377 | 8323 |
| DLN Agg, $S = 1$ | 3054 | 3454 | 7502 |
| DLN Agg, $S > 1$ | 2646 | 2894 | 7737 |

Table 1: Mean Absolute Error Estimating Product Sales From Reviews
3.3 Case Study: Predicting User Intent
For this binary classification problem from a large internet services company, the goal is to predict whether a given query (a string containing multiple words) is seeking a specific type of result. We split the examples randomly into training/validation/test sets in 8/1/1 proportions. Each query is broken into n-grams, and each n-gram has ten corresponding features. Two of the features should be positive signals for the intent (e.g. what percentage of users who issued that n-gram in the past were seeking this result type), and their effect is constrained to be monotonic. The other eight features are conditional features, e.g. how popular the n-gram is, or the order of (number of terms in) the n-gram.
Table 2 shows that the set function models are significantly better than models on pre-averaged features. Furthermore, the DLN with $S > 1$ is the best performing of the set function models. The $S = 1$ DLN performs very similarly to the Deep Sets approach.
| Model | Train Set | Validation Set | Test Set |
| --- | --- | --- | --- |
| Averaged Aggregation, Linear Fusion | 0.623 | 0.609 | 0.610 |
| Averaged Aggregation, DNN Fusion | 0.634 | 0.623 | 0.624 |
| Deep Sets | 0.662 | 0.644 | 0.643 |
| DLN Agg, $S = 1$ | 0.653 | 0.644 | 0.643 |
| DLN Agg, $S > 1$ | 0.674 | 0.648 | 0.646 |

Table 2: Accuracy for Classifying User Intent
4 Semantic Feature Engine
We propose applying set function learning to handle sparse categorical variables in a debuggable and stable way, an approach we call the Semantic Feature Engine (SFE). A common approach to sparse categoricals is to create a Boolean feature for each possible category, and either use these directly as predictors or train an embedding. These strategies work well, but have poor interpretability and debuggability, and can be highly variable across retrainings, causing unwanted churn and instability (Cormier et al., 2016). By contrast, our proposed SFE converts the sparse categoricals into one dense, understandable feature that is an estimate of $P(y \,|\, s)$ for some label $y$ and set $s$.

For example, suppose the goal is to produce a classifier that predicts whether a movie will be rated PG (suitable for all ages), and we want to use information about the movie's actors. SFE produces a feature that is an estimate of $P(\text{movie is rated PG} \,|\, \text{movie's actors})$, which has the semantic meaning of an actor prior feature. This feature could then be combined with other dense and meaningful features, such as the movie's budget or the studio's previous track record, to produce a final model that is both powerful and interpretable.
In general, SFE produces a feature that is an estimate of some label $y$ given some set $s$. In simple cases one can simply form the point estimate $\hat{P}(y \,|\, s)$. More generally, $s$ may not have occurred often enough in the training data to derive a straightforward point estimate; for example, many movies have sets of actors who have never appeared together before. To address this, the key idea of SFE is to convert $s$ into a set of tokens, estimate $P(y \,|\, \text{token})$ for each token, then learn the best aggregation of the set of estimates to form the best overall estimate of $P(y \,|\, s)$, which can be used as a feature in a bigger model. See Appendix B (8.2) in the supplemental for a complete worked example.
Tokenization and Fallback Rules: If $s$ is not a single element, one must choose a tokenization rule to produce a set of tokens from a given example $s$. For example, for text, a standard tokenization is to break the text into n-grams up to some max order. For a set of categorical variables, such as {actors}, one can tokenize it into all tuples up to some max subset size. If $s$ is a pair of sets (e.g. the actors in a candidate movie to recommend, and the list of all actors in movies the user has previously watched), tokens can be crosses or set differences across the pair. In addition, we suggest adding fallback rules for when the token values are missing for a given token. For sets of categorical variables, our fallback is to iteratively consider smaller subsets for any categories not contained in an existing token, until each category appears in at least one token if possible (see Appendix B for full examples). The last fallback is always to set the token values to missing.
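A minimal sketch of subset tokenization with fallback to smaller subsets (a simplified version of the rule above; the coverage logic here is an assumption, not the paper's exact algorithm):

```python
from itertools import combinations

def tokenize_with_fallback(categories, table, max_size=3):
    """Emit all subsets up to max_size that appear in the token table,
    then fall back to smaller subsets only for categories not yet
    covered by some emitted token."""
    cats = sorted(categories)
    tokens, covered = [], set()
    for size in range(max_size, 0, -1):
        if covered.issuperset(cats):
            break
        new_cover = set()
        for subset in combinations(cats, size):
            # Below the top size, skip subsets whose members are covered.
            if size < max_size and covered.issuperset(subset):
                continue
            if frozenset(subset) in table:
                tokens.append(frozenset(subset))
                new_cover.update(subset)
        covered |= new_cover
    return tokens

# Hypothetical token table mirroring the movie example later in the text:
# no size-3 subset is present, and David never co-occurs with the others.
table = {frozenset(s) for s in
         [("Alice", "Carol"), ("Bob", "Carol"), ("David",)]}
toks = tokenize_with_fallback(["Alice", "Bob", "Carol", "David"], table)
```

On this toy table the rule yields {Alice, Carol}, {Bob, Carol}, and the singleton {David}, since David appears in no larger token.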
Training Sets: The SFE needs an SFE token training set of (token, $y$) pairs to train the per-token estimates of $P(y \,|\, \text{token})$. This can be the same training set as used to learn the SFE aggregation function, and in fact this simple approach often works well. Using different training sets can reduce overfitting. One may also prefer to use different labels, for example, training the SFE per-token estimates on a large (but noisy) dataset of clicks, but training the aggregation function to produce $P(y \,|\, s)$ on a smaller, cleaner, human-labeled dataset.
Token Table Building: Given tokenization rules, build a token table by iterating through the SFE token training set to populate a table with empirical estimates of $P(y \,|\, \text{token})$, and possibly other aggregate statistics, such as how often the token was seen in the training data. One can also store non-aggregated token-specific information, such as a token's subset size, in a table during the precomputation phase; alternatively, these values could be populated at training time. Note that when training the set function, one can also add example-specific details (e.g. for actors, their salary or number of lines in the given movie) to provide additional information on how to weight the different token values. Cumulatively, these techniques provide the features for each token.
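Building such a token table can be sketched as follows (a minimal version assuming a binary label; the function and field names are illustrative):

```python
from collections import defaultdict

def build_sfe_table(token_train, min_count=0):
    """Scan (tokens, y) pairs and store, per token, the empirical
    estimate of P(y=1 | token) plus how often the token was seen."""
    counts = defaultdict(lambda: [0, 0])  # token -> [num_positive, num_seen]
    for tokens, y in token_train:
        for tok in tokens:
            counts[tok][0] += int(y)
            counts[tok][1] += 1
    return {tok: {"p_hat": pos / seen, "count": seen}
            for tok, (pos, seen) in counts.items() if seen >= min_count}

# Toy SFE token training set (hypothetical movies -> is the movie PG?).
data = [(["Alice", "Carol"], 1), (["Alice"], 1), (["Alice"], 0)]
tbl = build_sfe_table(data)
```

The stored count doubles as the aggregate statistic mentioned above, and feeds directly into count- or confidence-based filtering.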
Token Table Filtering: To reduce table size and improve the statistical significance of the SFE signal, one should filter the token table before learning the SFE aggregation function. Two filtering rules that we have found useful are: (i) a count threshold (that is, if there are too few examples of a specific token in the token training set, then it is dropped from the token table), and (ii) a confidence-interval threshold (that is, if one of the token features is an estimate of a target label $y$, and the confidence interval of that estimate is too big, then the token is dropped from the token table). Compared to count-thresholding, confidence-interval filtering keeps more lower-frequency tokens that have high label-agreement amongst their occurrences.

Learn an Aggregation Function: Given the tokens and their token values, train a set function as per (1) to estimate the label $y$, with monotonicity constraints on the token features based on domain knowledge. Generally, it makes sense to constrain the point-estimate feature to have a monotonically increasing impact on the aggregation function's prediction of $y$.
Extended Example: Let $y = 1$ if the movie is PG and $y = 0$ otherwise. Let $s$ represent the set of actors, and suppose a new movie comes out with actors {Alice, Bob, Carol, David}. Suppose the tokenization rule is to use all subsets of up to size 3, with a fallback to smaller subsets if their parent sets are not in the SFE token table. This rule produces the tokens {{Alice, Carol}, {Bob, Carol}, {David}}, meaning that no set of 3 actors appeared in the table, nor did David appear with any of the other actors in the table. Suppose our token features are (i) the percentage of movies with that set of actors that are PG, and (ii) the number of movies with that set of actors. The token table then provides these two values for each of the three tokens. We can then apply a set function that was trained on all pre-2017 movies and was constrained to be monotonically increasing in the first feature, producing an actor-prior probability that the movie is PG, which can then be combined with other features for a final prediction of the movie's rating.
Separability: Here, we have separated the overall model into three parts: (i) the token table that stores token values, (ii) the set function that combines the token features across the tokens for each example, and (iii) the follow-on model, which might take many SFE features as inputs. This separability has two key advantages in practice. First, each of these parts has a semantic meaning that aids interpretability and debuggability. Second, each of these three parts can be refreshed or improved independently, which reduces churn (Cormier et al., 2016) and system complexity. That said, jointly training the SFE set function and the follow-on classifier could result in additional metric gains.
5 Experiments using the Semantic Feature Engine To Create Sets
We also evaluate the performance of our set function learning approach when applied as part of the Semantic Feature Engine to sparse categorical predictors; that is, precomputing statistics about each category and learning a set function over the token statistics. We show that, as in the earlier experiments, it outperforms Deep Sets in learning the best set function over the tokens. We also compare it to a strategy of directly learning a deep neural network (DNN) on a multi-hot encoding of the categories, and find that it tends to perform similarly well, with the added benefits of interpretability and debuggability.
For each experiment, our original predictors are a variable-length set of categories, and our SFE tokens are subsets of categories. In particular, we compute features for each token that are then fed into the set function: (a) the average label, computed over the training data, for the token; (b) how frequently the token appears; (c) the size of the subset that the token represents; (d) whether the token fully matches the set's list of categories; (e) the number of categories in the set; and (f) the number of tokens generated from the set. Note that these fall into three buckets: (a) is the direct estimate of the label's value for a token; (b), (c), and (d) are token-specific features that can provide information on how much to weight the tokens; and (e) and (f) are set-specific features (the same for all tokens in an example) that help calibrate the outputs and nature of the aggregation.
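Assembling the six per-token features (a)-(f) might look like the following sketch (field names and table contents are illustrative):

```python
def token_features(token, table, category_set, all_tokens):
    """Assemble the six per-token features (a)-(f) for one token of a
    categorical set."""
    row = table[token]
    return {
        "avg_label":   row["p_hat"],           # (a) label estimate
        "token_count": row["count"],           # (b) token frequency
        "subset_size": len(token),             # (c) size of the subset
        "full_match":  token == category_set,  # (d) covers the whole set?
        "num_cats":    len(category_set),      # (e) set-level feature
        "num_tokens":  len(all_tokens),        # (f) set-level feature
    }

# Hypothetical table rows for a two-ingredient recipe.
table = {frozenset({"salt"}): {"p_hat": 0.2, "count": 50},
         frozenset({"salt", "basil"}): {"p_hat": 0.8, "count": 12}}
cats = frozenset({"salt", "basil"})
toks = list(table)
feats = token_features(frozenset({"salt", "basil"}), table, cats, toks)
```

Features (e) and (f) are computed once per example and repeated for every token, matching the "set-specific" bucket described above.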
For the DNN comparison, we use standard TensorFlow embeddings, first creating a tf.feature_column.embedding_column over the raw categories (those that appear in the train set), then feeding that into a tf.estimator.DNNClassifier with two hidden layers before a final softmax layer. We optimize the ADAM learning rate, the number of epochs, and the sizes of the embedding and hidden layers over the validation set. New validation-set or test-set categories are ignored; only embeddings for categories that also appear in the training set are used in evaluation.
| Model | Train Set | Validation Set | Test Set |
| --- | --- | --- | --- |
| DNN on Attributes | 0.793 | 0.789 | 0.790 |
| Deep Sets | 0.794 | 0.786 | 0.785 |
| DLN Agg, $S = 1$ | 0.795 | 0.785 | 0.785 |
| DLN Agg, $S > 1$ | 0.795 | 0.786 | 0.786 |

Table 3: Accuracy for CelebA Attractiveness Classification
5.1 CelebA
The CelebA dataset of face images (Liu et al., 2015) is randomly split 70/10/20 into train/validation/test sets. Each face is described by 40 binary attributes, such as whether the subject has blond hair, earrings, or a mustache. There is also a Boolean feature for whether the face was judged to be attractive. We treat the problem as binary classification, predicting whether a face is labeled attractive based on its attributes, and use the Semantic Feature Engine to generate conditional probabilities of attractiveness for all subsets of attributes whose estimated values have confidence intervals of 0.2 or under.
Results in Table 3 show that DLNs on aggregated token-subset information, including the very simple $S = 1$ model, perform better than Deep Sets models with substantially higher-dimensional $\phi$ functions. DNNs directly on the attributes, however, perform the best of all models considered, with nearly 0.4% higher test accuracy than the best-performing DLN.
| Model | Train Acc | Validation Acc | Test Acc | Test Prec@1 | Test Prec@3 |
| --- | --- | --- | --- | --- | --- |
| DNN on Attributes | 0.986 | 0.974 | 0.974 | 0.728 | 0.883 |
| Deep Sets | 0.981 | 0.973 | 0.973 | 0.707 | 0.888 |
| DLN Agg, $S = 1$ | 0.991 | 0.974 | 0.973 | 0.710 | 0.880 |
| DLN Agg, $S > 1$ | 0.984 | 0.974 | 0.974 | 0.729 | 0.890 |

Table 4: Accuracy and Precision for Cuisine Classification
5.2 Cuisine Classification from Recipe Ingredient List
The recipes dataset (www.kaggle.com/kaggle/recipeingredientsdataset) consists of recipes, each represented by its list of ingredients and the cuisine it comes from. It is randomly split 70/10/20 into train/validation/test sets. We build a model that takes a list of ingredients and a cuisine and acts as a binary classifier estimating whether they are a correct match; that is, the model outputs an estimate of $P(\text{cuisine} \,|\, \text{ingredients})$. There is a fixed set of 20 possible cuisines, so we create one positive and nineteen negative training samples from each row of the original dataset. Note that the problem can also be thought of as multiclass classification; for that reason, we also report the multiclass metrics precision@1 and precision@3 (which compute how often the correct cuisine's score was the top, or among the top 3, of the predictions over all candidate cuisines for a given recipe).
We start by doing some basic preprocessing on the ingredients, such as converting to lower case and word-stemming, to make it more likely that equivalent ingredients are identified; we will upload a Kaggle kernel with the details. Then, we use SFE to tokenize the resulting ingredient set, with each item crossed with each cuisine; see Appendix B (8.2) for a complete example. We filter out token values for any sets of ingredients that appear fewer than 5 times in the training data. We consider subsets of ingredients up to size 3; we cross-validated max subset sizes up to 5. Using all subsets is not feasible; many recipes have over 20 ingredients and the longest has 65, meaning that hundreds or even thousands of tokens are aggregated for some examples even with a max subset size of 3. Another unique feature of the recipes dataset compared with the other benchmarks is that the vocabulary is very large: many unique ingredients exist, and many ingredients in the test set do not appear, or barely appear, in the training set.
As Table 4 shows, DLN agg functions with $S > 1$ perform the best of all models on the precision metrics, while DNNs are slightly higher on binary classification accuracy. The $S = 1$ DLN performs similarly to the higher-complexity Deep Sets model.
| Model | Train Set | Validation Set | Test Set |
| --- | --- | --- | --- |
| DNN on Attributes | 7.04 | 7.39 | 7.20 |
| Deep Sets | 7.29 | 7.45 | 7.26 |
| DLN Agg, $S = 1$ | 7.04 | 7.37 | 7.19 |
| DLN Agg, $S > 1$ | 7.02 | 7.35 | 7.19 |

Table 5: Mean Absolute Error for Predicting Wine Quality
5.3 Wine
The wine dataset (www.kaggle.com/zynicide/winereviews) consists of a collection of different wines, each with its quality, price, country of origin, and a set of descriptive terms culled from reviews (out of a total set of 39 possible adjectives such as complex, oak, and velvet). We focus on predicting the quality, which is scored on a 100-point scale, using the set of review adjectives. As discussed in the SFE section, this review prior score could be used in a follow-on interpretable model that also incorporates the other features, such as price, though we do not do so here. We consider all subsets, and require sets of adjectives to appear at least 32 times in the training data before entering the token table.
Table 5 shows that DLNs perform best on the wine dataset, with even the $S = 1$ DLN performing better than the DNN. Deep Sets struggles here, performing worse than either of the competitors.
6 Conclusions
We have shown that we can learn DLN aggregation functions over sets that provide similar accuracy to deep sets (Zaheer et al., 2017), but greater interpretability, due to three key aspects. First, DLN agg functions enable monotonicity constraints, providing end-to-end high-level understanding and greater predictability of even highly nonlinear models. Second, the first layer of 1-d per-feature calibrator functions can be visualized and interpreted. Third, we showed that we can simplify the middle layer to a per-token score before averaging ($S = 1$), with slight or no loss of accuracy on real-world problems. This greatly aids debuggability, as it makes it easier to determine which tokens are most responsible for the output, and whether any of the per-token scores are noisy or suspicious.
We showed that learning on sets is broadly applicable with our Semantic Feature Engine proposal, which converts highly sparse features into a dense estimate of $P(y \,|\, s)$. Our experiments show these estimates were similar in accuracy to applying a DNN to the sparse features. The main advantage of the SFE over the DNN is its greater interpretability and debuggability. We expect SFE will also show greater stability and less churn when retrained. SFE features can be combined with other information in follow-on DLN models, with monotonicity regularization on the SFE features, which makes the follow-on DLN more interpretable and stable over retrainings.
References
 Barlow et al. [1972] R. E. Barlow, D. J. Bartholomew, J. M. Bremner, and H. D. Brunk. Statistical inference under order restrictions; the theory and application of isotonic regression. Wiley, New York, USA, 1972.
 Canini et al. [2016] K. Canini, A. Cotter, M. M. Fard, M. R. Gupta, and J. Pfeifer. Fast and flexible monotonic functions with ensembles of lattices. Advances in Neural Information Processing Systems (NIPS), 2016.
 Cormier et al. [2016] Q. Cormier, M. Milani Fard, and M. R. Gupta. Launch and iterate: Reducing prediction churn. Advances in Neural Information Processing Systems (NIPS), 2016.
 Cotter et al. [2016] A. Cotter, M. R. Gupta, and J. Pfeifer. A Light Touch for heavily constrained SGD. In 29th Annual Conference on Learning Theory, pages 729–771, 2016.
 Daniels and Velikova [2010] H. Daniels and M. Velikova. Monotone and partially monotone neural networks. IEEE Trans. Neural Networks, 21(6):906–917, 2010.
 Duchi et al. [2011] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal Machine Learning Research, 12:2121–2159, 2011.
 Groeneboom and Jongbloed [2014] P. Groeneboom and G. Jongbloed. Nonparametric estimation under shape constraints. Cambridge Press, New York, USA, 2014.
 Gupta et al. [2016] M. R. Gupta, A. Cotter, J. Pfeifer, K. Voevodski, K. Canini, A. Mangylov, W. Moczydlowski, and A. V. Esbroeck. Monotonic calibrated interpolated lookup tables. Journal of Machine Learning Research, 17(109):1–47, 2016. URL http://jmlr.org/papers/v17/15243.html.
 Hartford et al. [2016] J. S. Hartford, J. R. Wright, and K. Leyton-Brown. Deep learning for predicting human strategic behavior. In Advances in Neural Information Processing Systems, pages 2424–2432, 2016.
 Hastie and Tibshirani [1990] T. Hastie and R. Tibshirani. Generalized Additive Models. Chapman Hall, New York, 1990.
 Howard and Jebara [2007] A. Howard and T. Jebara. Learning monotonic transformations for classification. Advances in Neural Information Processing Systems (NIPS), 2007.
 Kingma and Ba [2014] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Kondor and Jebara [2003] R. Kondor and T. Jebara. A kernel between sets of vectors. In Proceedings of the 20th International Conference on Machine Learning (ICML), pages 361–368, 2003.
 Kotlowski and Slowinski [2009] W. Kotlowski and R. Slowinski. Rule learning with monotonicity constraints. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 537–544. ACM, 2009.

 Liu et al. [2015] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Intl. Conf. Computer Vision (ICCV), pages 3730–3738, 2015.
 Muandet et al. [2012] K. Muandet, K. Fukumizu, F. Dinuzzo, and B. Schoelkopf. Learning from distributions with support measure machines. Advances in Neural Information Processing Systems (NIPS), 2012.
 Poczos et al. [2012] B. Poczos, L. Xiong, D. Sutherland, and J. Schneider. Support distribution machines. arXiv preprint, 2012.
 Sharma and Bala [2002] G. Sharma and R. Bala. Digital Color Imaging Handbook. CRC Press, New York, 2002.
 Shivaswamy and Jebara [2006] P. K. Shivaswamy and T. Jebara. Permutation invariant SVMs. In Advances in Neural Information Processing Systems, 2006.
 Sill and Abu-Mostafa [1997] J. Sill and Y. S. Abu-Mostafa. Monotonicity hints. Advances in Neural Information Processing Systems (NIPS), pages 634–640, 1997.
 You et al. [2017] S. You, D. Ding, K. Canini, J. Pfeifer, and M. R. Gupta. Deep lattice networks and partial monotonic functions. Advances in Neural Information Processing Systems (NIPS), 2017.
 Zaheer et al. [2017] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. Salakhutdinov, and A. Smola. Deep sets. Advances in Neural Information Processing Systems (NIPS), 2017.
7 Appendix A: More Implementation Details
We provide more details of our implementation, particularly how we constrain the input and output range of each layer.
7.1 Details on Each Layer
First Layer ($\phi$ Calibrators): Each of the calibrators has an input range bounded in a user-defined range, which we set based on the train set or domain knowledge. Each calibrator is a one-dimensional piecewise-linear function stored as a set of keypoint-value pairs, where the keypoints are spaced in line with the quantiles of the inputs, the calibrator values are jointly trained but bounded to $[0, 1]$, and the values are initialized to form a linear function spanning the input-output range. Every calibrator is constrained to be a monotonic function by adding linear inequality constraints on adjacent lookup-table parameters that restrict each of the one-dimensional lookup-table values to be greater than its left neighbor [Gupta et al., 2016]. If the $\phi$ function has $S$ outputs, then there are $S$ separate calibrators per feature.

Second Layer ($\phi$ Lattices): The lattice layer follows the calibration layer and thus takes inputs in $[0, 1]$. Each set of calibrated inputs goes into one of the $S$ lattices. The outputs of this lattice layer are all bounded to be in $[0, 1]$, and all the lattice parameters are initialized to a fixed constant. Each lattice is represented by a lookup table with $V^D$ parameters for $V$ keypoints per input; $V = 2$ usually suffices, but for some of our experiments $V$ was tuned on the validation set to a higher integer value to form a finer-grained lattice. Each lattice is constrained to be monotonically increasing or decreasing with respect to a user-specified set of features by adding appropriate linear inequality constraints on the lattice's lookup-table parameters (see Gupta et al. [2016] for details). For each experiment, we used a priori domain knowledge to specify the monotonicity constraints.
Third Layer (Simple Average): The outputs from each of the lattices are averaged over the tokens, so that the average layer produces one output per lattice. We use an average (as in (1)) rather than a sum so that the inputs to ρ are guaranteed to be bounded to [0, 1].
Fourth Layer (ρ, Calibrators): This layer is largely similar to the first and second layers. A slight difference is that the inputs to the fourth layer lie in [0, 1], as described above, and that the keypoints are initialized uniformly across that range rather than according to empirical quantiles from the data. Like the other calibrators, their outputs are bounded in [0, 1].
Fifth Layer (ρ, Lattice): The calibrated values from the fourth layer get fused together by a single lattice. We explicitly constrain the lattice parameters to all be in [0, 1], and since the output of this layer is just an interpolation of the lattice parameters, that constrains the output of this layer to be in [0, 1].
Sixth Layer (ρ, Output Calibrator): The last layer is a one-dimensional piecewise linear transform parameterized by a set of keypoint-value pairs, where the keypoints are uniformly spaced over its input range [0, 1], and its output range is determined by the training. The values are initialized to form the identity function. This transform is constrained to be monotonic by adding linear inequality constraints to the training (see (4)) that restrict each of the one-dimensional lookup table values to be greater than its left neighbor [Gupta et al., 2016].
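Putting the six layers together, here is a toy NumPy sketch of the full forward pass. The configuration (two features per token, one 2×2 lattice for φ, a one-dimensional lattice for ρ, identity-like calibrators) is our own illustrative choice, not the paper's tuned architecture:

```python
import itertools
import numpy as np

def pwl(x, keypoints, values):
    """Piecewise-linear calibrator (clips inputs outside the keypoint range)."""
    return np.interp(x, keypoints, values)

def lattice(x, params):
    """Multilinear interpolation of a 2^d lookup table at a point x in [0,1]^d."""
    out = 0.0
    for corner in itertools.product([0, 1], repeat=len(x)):
        w = np.prod([x[i] if c else 1.0 - x[i] for i, c in enumerate(corner)])
        out += w * params[corner]
    return out

# Toy configuration: identity-ramp calibrators on [0, 1] and a single
# 2x2 lattice for phi that is monotone increasing in both inputs.
kps = np.array([0.0, 1.0])
ramp = np.array([0.0, 1.0])
phi_lattice = np.array([[0.0, 0.5], [0.5, 1.0]])

def phi(token):
    c = np.array([pwl(t, kps, ramp) for t in token])     # layer 1: calibrators
    return lattice(c, phi_lattice)                       # layer 2: lattice

def rho(s):
    s = pwl(s, kps, ramp)                                # layer 4: calibrator
    s = lattice(np.array([s]), np.array([0.0, 1.0]))     # layer 5: 1-d lattice
    return pwl(s, kps, ramp)                             # layer 6: output calibrator

def f(tokens):
    return rho(np.mean([phi(t) for t in tokens]))        # layer 3: average over tokens

print(f([np.array([0.2, 0.4]), np.array([0.8, 0.6])]))   # 0.5
```

Because every layer maps [0, 1] into [0, 1] here, the composition respects the range constraints described above; with monotone calibrators and lattices, the whole function is monotone in each input feature.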
7.2 Training and Optimization
Training the proposed aggregation function f(x; θ), where θ represents all of the DLN parameters for φ and ρ, is a constrained structural risk minimization problem. Given a training set {(x_i, y_i)} for i = 1, …, P:

(4)  min over θ of Σ_i ℓ(f(x_i; θ), y_i)  subject to  Aθ ≤ b,

where Aθ ≤ b expresses the monotonicity constraints and any range constraints on the DLN parameters.
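Since the monotonicity constraints restrict each one-dimensional lookup-table value to be at least its left neighbor, a lookup table that drifts infeasible during optimization can be repaired. A minimal sketch of one such repair (a cumulative-max pass; note this is a simple feasibility fix, not the exact Euclidean projection, which would instead require isotonic regression, e.g. pool-adjacent-violators):

```python
import numpy as np

def make_monotone(values):
    """Repair a 1-d lookup table so each value >= its left neighbor.

    np.maximum.accumulate carries the running maximum forward, so any value
    smaller than a predecessor is raised to that predecessor's level.
    """
    return np.maximum.accumulate(values)

print(make_monotone(np.array([0.0, 0.3, 0.2, 0.5])))  # [0.  0.3 0.3 0.5]
```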
Note that the proposed DLN structure for φ and ρ means all parameters of (1) have gradients computable with the chain rule and backpropagation. We solve (4) using the LightTouch algorithm [Cotter et al., 2016] to handle the monotonicity constraints on top of Adagrad [Duchi et al., 2011]. LightTouch samples the constraints, learning which ones are most likely to be violated, and inexpensively penalizes them as the training progresses. It thereby converges to a feasible solution without enforcing feasibility throughout. Once optimization has finished, we project the final iterate to guarantee monotonicity.
7.3 Hyperparameter Optimization
When learning the aggregation function, we tune hyperparameters including the learning rate, the number of epochs, the output dimension of the calibrated lattice layer that maps token features into outputs, and the number of keypoints used in each calibration layer in the DLN. We choose the combination of hyperparameters that achieves the best model performance measured on the validation dataset.
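The selection procedure is a standard exhaustive grid search on the validation metric; a small self-contained sketch (the hyperparameter names and the stand-in scoring function are our own illustrations, not the paper's actual training code):

```python
from itertools import product

def grid_search(train_and_eval, grid):
    """Evaluate every hyperparameter combination and keep the best.

    `train_and_eval` is assumed to return a validation metric where
    higher is better; returns (best_score, best_params).
    """
    names = sorted(grid)
    best = None
    for values in product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        score = train_and_eval(**params)
        if best is None or score > best[0]:
            best = (score, params)
    return best

# Toy stand-in: pretend validation accuracy peaks at lr=0.1 and 100 epochs.
fake_eval = lambda learning_rate, num_epochs: (
    -abs(learning_rate - 0.1) - abs(num_epochs - 100) / 1000)
print(grid_search(fake_eval,
                  {"learning_rate": [0.01, 0.1, 1.0], "num_epochs": [50, 100]}))
```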
In the cases when SFE is used to convert sparse categorical features into dense features, we also tune hyperparameters that affect tokenization. Specifically, we tune (a) the maximum size of the created tokens, which means the maximum order of n-grams if the sparse feature is text and the maximum subset size if it is an unordered set of strings; and (b) the filtering criteria of tokens, including the count threshold and the maximum confidence interval width.
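The count-threshold filter in (b) can be sketched as a token table built over all subsets up to the maximum size (the function names and toy data are our own; the confidence-interval-width filter is omitted for brevity):

```python
from itertools import combinations
from collections import Counter

def build_token_table(training_sets, max_subset_size, count_threshold):
    """Count every subset up to max_subset_size across the training examples
    and keep only subsets seen at least count_threshold times."""
    counts = Counter()
    for categories in training_sets:
        for k in range(1, max_subset_size + 1):
            for subset in combinations(sorted(categories), k):
                counts[subset] += 1
    return {s: c for s, c in counts.items() if c >= count_threshold}

# Toy data: with a threshold of 2, only subsets seen twice survive.
table = build_token_table(
    [{"salt", "water"}, {"salt", "water", "sugar"}],
    max_subset_size=2, count_threshold=2)
print(table)   # {('salt',): 2, ('water',): 2, ('salt', 'water'): 2}
```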
8 Appendix B: Examples
8.1 Example of SFE Fallback Logic
Fig. 3 illustrates how the SFE fallback logic works when we generate a set of dense feature vectors from an unordered set of categories. The general idea is to find large subsets of the input that cover all the categories, and to fall back to smaller subsets when a subset is not found in a prebuilt token table. A subset does not make it into the token table if it does not appear frequently enough in the training data according to the filtering criteria.
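A minimal sketch of this fallback logic, assuming a prebuilt token table mapping subsets to counts (this is our own simplification; the real implementation's ordering and tie-breaking may differ):

```python
from itertools import combinations

def tokenize(items, token_table, max_size):
    """Greedy fallback: cover the input with the largest subsets found in the
    token table, retrying still-uncovered items with ever-smaller subsets."""
    uncovered, tokens = set(items), []
    for k in range(max_size, 0, -1):
        for subset in combinations(sorted(items), k):
            if set(subset) & uncovered and subset in token_table:
                tokens.append(subset)
                uncovered -= set(subset)
    return tokens  # items never found at any size are simply dropped

# Hypothetical token table: only one pair and one singleton were frequent enough.
table = {("salt", "water"): 40, ("grapefruit juice",): 10}
print(tokenize({"salt", "water", "grapefruit juice", "lemon olive oil"}, table, 2))
# [('salt', 'water'), ('grapefruit juice',)]
```

Here "lemon olive oil" appears nowhere in the table, so it produces no token, mirroring the fallback-to-nothing case described above.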
8.2 Complete SFE and Learned Aggregation Function Example
We show the inner workings of the aggregation function on an example from the recipes dataset. The example, whose true cuisine is French but for which we are evaluating the candidate cuisine of Mexican, consists of the six ingredients {sugar, salt, fennel bulb, water, lemon olive oil, grapefruit juice}. As mentioned above, we consider subsets of up to size 3 for this problem. As the combination of ingredients is rather rare, we only find 2 subsets of size 3 (out of the 20 possibilities) that appeared in the training data at least 5 times and are therefore in the token table: {salt, water, fennel bulb} and {salt, water, sugar}. Since neither lemon olive oil nor grapefruit juice appears in any frequent subset of size 3, the model searches for any subsets of size 2 containing one of those ingredients and doesn't find any; finally it finds grapefruit juice as a singleton in the token table, while lemon olive oil never appears in any form in the token table.
We therefore have the following 3 total tokens:
- {salt, water, fennel bulb}: count = 17; subset size = 3; is full match = 0; number of ingredients = 6; number of tokens = 3
- {salt, water, sugar}: count = 510; subset size = 3; is full match = 0; number of ingredients = 6; number of tokens = 3
- {grapefruit juice}: count = 10; subset size = 1; is full match = 0; number of ingredients = 6; number of tokens = 3
Recall that there are 20 possible cuisines, so 0.05 is the break-even, uninformative prior value. Here, both subsets of size 3 have fairly weak/neutral evidence while the prior for grapefruit juice is actually quite high.
The intermediate outputs for these tokens are:
- {salt, water, fennel bulb}: -0.005
- {salt, water, sugar}: -0.156
- {grapefruit juice}: 0.075
Our labels are ±1, so positive outputs represent the model leaning towards yes and negative outputs the opposite. Interestingly, the {salt, water, sugar} subset is much more negative than {grapefruit juice} is positive, even though its prior is much closer to neutral. We can explain this by looking at the supporting information: the former has a subset size of 3, teaching the model to trust it more than a subset of size 1, and it appears 510 times vs. 10, leading to the same conclusion.
Finally, the outputs are averaged together and fed through ρ, which in this case is a simple piecewise linear transform, yielding the final output of -0.28. The classifier has correctly identified the recipe as not being Mexican.