We propose building models with multiple layers of lattices, which we refer to as deep lattice networks (DLNs). While we hypothesize that DLNs may generally be useful, we focus on the challenge of learning flexible partially-monotonic functions, that is, models that are guaranteed monotonic with respect to a user-specified subset of the inputs. For example, if one is predicting whether to give someone else a loan, we expect and would like to constrain the prediction to be monotonically increasing with respect to the applicant’s income, if all other features are unchanged. Imposing monotonicity acts as a regularizer, improves generalization to test data, and makes the end-to-end model more interpretable, debuggable, and trustworthy.
To learn more flexible partial monotonic functions, we propose architectures that alternate three kinds of layers: linear embeddings, calibrators, and ensembles of lattices, each of which is trained discriminatively to optimize a structural risk objective and obey any given monotonicity constraints. See Fig. 2 for an example DLN with nine such layers.
Lattices are interpolated look-up tables, as shown in Fig. 1. Lattices have been shown to be an efficient nonlinear function class that can be constrained to be monotonic by adding appropriate sparse linear inequalities on the parameters GuptaEtAl:2016, and can be trained in a standard empirical risk minimization framework Garcia:09; GuptaEtAl:2016. Recent work showed lattices could be jointly trained as an ensemble to learn flexible monotonic functions for an arbitrary number of inputs Canini:2016.
Calibrators are one-dimensional lattices, which nonlinearly transform a single input GuptaEtAl:2016; see Fig. 1 for an example. They have been used to pre-process inputs in two-layer models: calibrators-then-linear models Jebara:2007, calibrators-then-lattice models GuptaEtAl:2016, and calibrators-then-ensemble-of-lattices models Canini:2016. Here, we extend their use to discriminatively normalize between other layers of the deep model, as well as to act as a pre-processing layer. We also find that using a calibrator as the last layer can help nonlinearly transform the outputs to better match the labels.
We first describe the proposed DLN layers in detail in Section 2. In Section 3, we review more related work in learning flexible partial monotonic functions. We provide theoretical results characterizing the flexibility of the DLN in Section 4, followed by details on our TensorFlow implementation and numerical optimization choices in Section 5. Experimental results demonstrate the potential on benchmark and real-world scenarios in Section 6.
2 Deep Lattice Network Layers
We describe in detail the three types of layers we propose for learning flexible functions that can be constrained to be monotonic with respect to any subset of the inputs. Without loss of generality, we assume monotonic means monotonic non-decreasing (one can flip the sign of an input if non-increasing monotonicity is desired). Let $x_t \in \mathbb{R}^{D_t}$ be the input vector to the $t$th layer, with $D_t$ inputs, and let $x_t[d]$ denote the $d$th input for $d = 1, \ldots, D_t$. Table 1 summarizes the parameters and hyperparameters for each layer. For notational simplicity, in some places we drop the subscript $t$ if it is clear from the context. We also denote as $x_t^m$ the subset of $x_t$ that are to be monotonically constrained, and as $x_t^n$ the subset of $x_t$ that are non-monotonic.
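Linear Embedding Layer: Each linear embedding layer consists of two linear matrices, one matrix $W_t^m \in \mathbb{R}^{D_{t+1}^m \times D_t^m}$ that linearly embeds the monotonic inputs $x_t^m$, and a separate matrix $W_t^n \in \mathbb{R}^{(D_{t+1} - D_{t+1}^m) \times (D_t - D_t^m)}$ that linearly embeds the non-monotonic inputs $x_t^n$, and one bias vector $b_t$. Here $D_t^m$ denotes the number of monotonic inputs to layer $t$. To preserve monotonicity on the embedded vector $W_t^m x_t^m$, we impose the following linear inequality constraints:
$$W_t^m[i, j] \geq 0 \quad \text{for all } (i, j).$$
The output of the linear embedding layer is:
$$x_{t+1} = \begin{bmatrix} x_{t+1}^m \\ x_{t+1}^n \end{bmatrix} = \begin{bmatrix} W_t^m x_t^m \\ W_t^n x_t^n \end{bmatrix} + b_t.$$
Note that only the first $D_{t+1}^m$ coordinates of $x_{t+1}$ need to be monotonic inputs to layer $t+1$. These two linear embedding matrices and the bias vector are discriminatively trained.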
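As an illustration of the linear embedding layer, the following is a minimal pure-Python sketch, not the TensorFlow implementation described in Section 5. The function names (`project_nonnegative`, `linear_embedding`) and the toy matrices are assumptions made for this example. It clips the monotonic matrix to be element-wise non-negative and checks that raising a monotonic input cannot lower any monotonic output.

```python
from typing import List

Matrix = List[List[float]]


def project_nonnegative(w: Matrix) -> Matrix:
    """Clip negative entries to zero, i.e. project onto {W : W[i][j] >= 0}."""
    return [[max(0.0, v) for v in row] for row in w]


def matvec(w: Matrix, x: List[float]) -> List[float]:
    """Plain matrix-vector product."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def linear_embedding(x_m, x_n, w_m, w_n, b):
    """Return [W^m x^m ; W^n x^n] + b, with the monotonic outputs first."""
    stacked = matvec(w_m, x_m) + matvec(w_n, x_n)
    return [o + bi for o, bi in zip(stacked, b)]


# Hypothetical toy layer: 2 monotonic inputs, 1 non-monotonic input.
w_m = project_nonnegative([[0.5, -0.2], [1.0, 0.3]])  # the -0.2 entry is clipped
w_n = [[-1.0]]                                        # unconstrained
b = [0.1, 0.0, 0.2]

low = linear_embedding([0.2, 0.4], [0.7], w_m, w_n, b)
high = linear_embedding([0.6, 0.4], [0.7], w_m, w_n, b)  # first monotonic input raised
```

The non-monotonic block is left unconstrained, and its output does not depend on the monotonic inputs because the two matrices act on disjoint parts of the input.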
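Calibration Layer: Each calibration layer consists of a separate one-dimensional piecewise linear transform for each input at that layer, $c_{t,d}(x_t[d])$, that maps $\mathbb{R}$ to $[0, 1]$, so that
$$x_{t+1} := \begin{bmatrix} c_{t,1}(x_t[1]) & c_{t,2}(x_t[2]) & \cdots & c_{t,D_t}(x_t[D_t]) \end{bmatrix}^T.$$
Here each $c_{t,d}$ is a 1D lattice with $K$ key-value pairs $(a \in \mathbb{R}^K, b \in \mathbb{R}^K)$, and the function for each input is linearly interpolated between the two $b$ values corresponding to the input's surrounding $a$ values. An example is shown on the left in Fig. 1.
A piecewise linear calibrator can equivalently be written as a weighted sum of shifted ReLUs. However, enforcing monotonicity and boundedness constraints for the calibrator output is much simpler with the parameterization of each keypoint's input-output values, as we discuss shortly.
Before training the DLN, we fix the input range for each calibrator to a bounded interval $[a_{\min}, a_{\max}]$, and we fix the $K$ keypoints $a$ to be uniformly spaced over $[a_{\min}, a_{\max}]$. Inputs that fall outside $[a_{\min}, a_{\max}]$ are clipped to that range. The calibrator output parameters $b$ are discriminatively trained.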
For monotonic inputs, we can constrain the calibrator functions to be monotonic by constraining the calibrator parameters $b$ (the keypoint output values) to be monotonic, by adding the linear inequality constraints
$$b[k] \leq b[k+1] \quad \text{for } k = 1, \ldots, K-1$$
into the training objective Canini:2016. We also experimented with constraining all calibrators to be monotonic (even for non-monotonic inputs) for more stable/regularized training.
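To make the calibrator concrete, here is a minimal pure-Python sketch of a single 1D calibrator with uniformly spaced keypoints, input clipping, and linear interpolation. The function names and the example keypoint values are assumptions for illustration; the authors implemented the calibrator as a C++ TensorFlow operator.

```python
def calibrate(x, lo, hi, values):
    """Piecewise-linear calibrator.

    Keypoint inputs are uniformly spaced over [lo, hi]; `values` holds the
    trained output at each keypoint (at least two keypoints are assumed).
    """
    k = len(values)
    x = min(max(x, lo), hi)               # clip inputs outside [lo, hi]
    pos = (x - lo) / (hi - lo) * (k - 1)  # position in keypoint units
    left = min(int(pos), k - 2)           # index of the keypoint to the left
    frac = pos - left                     # interpolation weight of the right keypoint
    return values[left] * (1.0 - frac) + values[left + 1] * frac


def is_monotonic(values):
    """Check the constraints values[k] <= values[k+1] for all k."""
    return all(a <= b for a, b in zip(values, values[1:]))


# Hypothetical calibrator with 4 keypoints at -1, 0, 1, 2.
values = [0.0, 0.5, 0.6, 1.0]
mid = calibrate(0.5, -1.0, 2.0, values)      # halfway between keypoints 0 and 1
below = calibrate(-5.0, -1.0, 2.0, values)   # clipped to the left end
above = calibrate(10.0, -1.0, 2.0, values)   # clipped to the right end
```

Because the keypoint outputs are non-decreasing, the interpolated function is non-decreasing as well, which is why the constraint can be stated on the parameters alone.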
Ensemble of Lattices Layer: Each ensemble of lattices layer consists of $G$ lattices. Each lattice is a linearly interpolated multidimensional look-up table; for an example, see the middle and right pictures in Fig. 1. Each $S$-dimensional look-up table takes inputs over the $S$-dimensional unit hypercube $[0, 1]^S$, and has $2^S$ parameters $\theta \in \mathbb{R}^{2^S}$, specifying the lattice's output for each of the $2^S$ vertices of the unit hypercube. Inputs in-between the vertices are linearly interpolated, which forms a smooth but nonlinear function over the unit hypercube. Two interpolation methods have been used, multilinear interpolation and simplex interpolation GuptaEtAl:2016 (also known as the Lovász extension lovasz1983submodular). We use multilinear interpolation for all our experiments, which can be expressed as $\psi(x)^T \theta$, where the nonlinear feature transformation $\psi(x) : [0, 1]^S \to [0, 1]^{2^S}$ gives the $2^S$ linear interpolation weights that input $x$ puts on each of the $2^S$ parameters $\theta$, such that the interpolated value for $x$ is $\psi(x)^T \theta$, and
$$\psi(x)[j] = \prod_{d=1}^{S} x[d]^{v_j[d]} (1 - x[d])^{1 - v_j[d]},$$
where $v_j \in \{0, 1\}^S$ is the coordinate vector of the $j$th vertex of the unit hypercube, and $j = 1, \ldots, 2^S$. For example, when $S = 2$, $v_1 = (0, 0)$, $v_2 = (0, 1)$, $v_3 = (1, 0)$, $v_4 = (1, 1)$ and $\psi(x) = ((1 - x[1])(1 - x[2]),\ (1 - x[1]) x[2],\ x[1] (1 - x[2]),\ x[1] x[2])$.
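The multilinear interpolation weights can be computed directly from their product form. The following pure-Python sketch is illustrative only: the function names are assumptions, and vertices are enumerated in the order produced by `itertools.product((0, 1), repeat=S)`, which for two inputs is (0,0), (0,1), (1,0), (1,1).

```python
from itertools import product


def multilinear_weights(x):
    """Interpolation weight that x places on each vertex of the unit hypercube.

    For each vertex v, the weight is the product over dimensions d of
    x[d] if v[d] == 1, else (1 - x[d]).
    """
    weights = []
    for v in product((0, 1), repeat=len(x)):
        w = 1.0
        for xd, vd in zip(x, v):
            w *= xd if vd == 1 else 1.0 - xd
        weights.append(w)
    return weights


def lattice_output(x, theta):
    """Multilinearly interpolated look-up table value at x."""
    return sum(w * t for w, t in zip(multilinear_weights(x), theta))


# Hypothetical 2-input lattice with one parameter per vertex.
theta = [0.0, 0.2, 0.5, 1.0]
weights = multilinear_weights([0.25, 0.5])
value_at_vertex = lattice_output([1.0, 0.0], theta)  # reproduces theta for vertex (1, 0)
```

The weights are non-negative and sum to one, and at a vertex of the hypercube the lattice returns exactly the stored parameter for that vertex.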
The ensemble of lattices layer produces $G$ outputs, one per lattice. When creating the DLN, if the $t$th layer is an ensemble of lattices, we randomly permute the outputs of the previous layer to be assigned to the $G \times S$ inputs of the ensemble. If a lattice has at least one monotonic input, then that lattice's output is constrained to be a monotonic input to the next layer; in this way we guarantee monotonicity end-to-end for the DLN.
Partial monotonicity: The DLN is constructed to preserve end-to-end partial monotonicity with respect to a user-specified subset of the inputs. As we described, the parameters for each component (matrix, calibrator, lattice) can be constrained to be monotonic with respect to a subset of inputs by satisfying certain linear inequality constraints GuptaEtAl:2016. Also, if a component has a monotonic input, then the output of that component is treated as a monotonic input to the following layer. Because the composition of monotonic functions is monotonic, the constructed DLN belongs to the partial monotonic function class. The arrows in Figure 2 illustrate this construction, i.e., how the $t$th layer output becomes a monotonic input to the $(t+1)$th layer.
We detail the hyperparameters for each type of DLN layer in Table 1. Some of these hyperparameters constrain each other since the number of outputs from each layer must be equal to the number of inputs to the next layer; for example, if you have a linear embedding layer with $D_{t+1}$ outputs, then there are $D_{t+1}$ inputs to the next layer, and if that next layer is a lattice ensemble with $G$ lattices of $S$ inputs each, its hyperparameters must obey $G \times S = D_{t+1}$.
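Table 1: Parameters and hyperparameters for each DLN layer type.
|Layer||Parameters||Hyperparameters|
|Linear Embedding||bias vector, monotonic matrix, non-monotonic matrix||number of outputs|
|Calibrators||keypoint output values||number of keypoints, input range|
|Lattice Ensemble||lattice vertex values||number of lattices, inputs per lattice|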
3 Related Work
Prior to this work, the state-of-the-art in learning expressive partial monotonic functions for $D$ inputs was 2-layer networks consisting of a layer of calibrators followed by an ensemble of lattices Canini:2016, with parameters appropriately constrained for monotonicity. That approach built on earlier work of Gupta et al. GuptaEtAl:2016, which constructed only a single calibrated lattice and was restricted to a small number of inputs because a lattice on $D$ inputs has $2^D$ parameters. This work differs in three key regards.
First, we alternate layers to form a deeper, and hence potentially more flexible, network. Second, a key question addressed in Canini et al. Canini:2016 is how to decide which features should be put together in each lattice in their ensemble. They found that random assignment worked well, but required large ensembles. Smaller (and hence faster) models with the same accuracy could be trained by using a heuristic pre-processing step they proposed (crystals) to identify which features interact nonlinearly. This pre-processing step requires training a lattice for each pair of inputs to judge that pair's strength of interaction, which scales as $O(D^2)$, and we found it can be a large fraction of overall training time when the number of features $D$ is large.
We solve the problem of determining which inputs should interact in each lattice by using a linear embedding layer before an ensemble of lattices layer to discriminatively and adaptively learn during training how to map the features to the first ensemble-layer lattices’ inputs. This strategy also means each input to a lattice can be a linear combination of the features, which is a second key difference to that prior work Canini:2016 .
Third, in prior work, the calibrator keypoint values were fixed a priori based on the quantiles of the features, which is challenging to do for the calibration layers mid-DLN, because the quantiles of their inputs are evolving during training. Instead, we fix the keypoint values uniformly over the bounded calibrator domain.
Learning monotonic single-layer neural nets by constraining the neural net weights to be positive dates back to Archer and Wang in 1993 archerWang:1993, and that basic idea has been re-visited by others Wang:1994; KayUngar:2000; Dugas:2009; Minin:2010, but with some negative results about the obtainable flexibility even with multiple hidden layers DanielsVelikova:2010. Sill Sill:1998 proposed a three-layer monotonic network that used an early form of monotonic linear embedding and max-and-min-pooling. Daniels and Velikova DanielsVelikova:2010 extended Sill's result to learn a partial monotonic function by combining min-max-pooling, also known as adaptive logic networks armstrong1996adaptive, with partial monotonic linear embedding, and showed that their proposed architecture is a universal approximator for partial monotone functions. None of these prior neural networks were demonstrated on problems with more than a small number of features, nor trained on more than a few thousand examples. For our experiments we implemented a positive neural network and a min-max-pooling network with TensorFlow.
4 Function Class of Deep Lattice Networks
We offer some results and hypotheses about the function class of deep lattice networks, depending on whether the lattices are interpolated with multilinear interpolation (which forms multilinear polynomials), or simplex interpolation (which forms locally linear surfaces).
4.1 Cascaded multilinear lookup tables
We show that a deep lattice network made up only of lattices (without intervening layers of calibrators or linear embeddings) is equivalent to a single lattice defined on the input features if multilinear interpolation is used. It is easy to construct counter-examples showing that this result does not hold for simplex-interpolated lattices.
Theorem 1. Suppose that a lattice has $L$ inputs that can each be expressed in the form $\theta_i^T \psi(x[s_i])$, where the $s_i$ are mutually disjoint and $\psi$ represents multilinear interpolation weights. Then the output can be expressed in the form $\hat{\theta}^T \hat{\psi}(x[\cup_i s_i])$. That is, the lattice preserves the functional form of its inputs, changing only the values of the coefficients $\theta$ and the linear interpolation weights $\psi$.
Proof. Each input $f_i$ of the lattice can be expressed in the following form:
$$f_i = \theta_i^T \psi(x[s_i]) = \sum_{k=1}^{2^{|s_i|}} \theta_i[k] \prod_{d \in s_i} x[d]^{v_{ik}[d]} (1 - x[d])^{1 - v_{ik}[d]}.$$
This is a multilinear polynomial in $x[s_i]$. Analogously, writing $\tilde{\theta}$ for the lattice's own parameters, the output $F$ can be expressed in the following form:
$$F = \sum_{j=1}^{2^L} \tilde{\theta}[j] \prod_{i=1}^{L} f_i^{v_j[i]} (1 - f_i)^{1 - v_j[i]}.$$
Note the product in the expression: $f_i$ and $1 - f_i$ are both multilinear polynomials, but within each term of the product, only one is present, since one of the two has exponent $0$ and the other has exponent $1$. Furthermore, since each $f_i$ is a function of a different subset of $x$, we conclude that the entire product is a multilinear polynomial. Since the sum of multilinear polynomials is still a multilinear polynomial, we conclude that $F$ is a multilinear polynomial. Any multilinear polynomial on $k$ variables can be converted to a $k$-dimensional multilinear lookup table, which concludes the proof. ∎
Theorem 1 can be applied inductively to every layer of a cascaded lookup table down to the final output. Thus, we can show that a cascaded lookup table using multilinear interpolation is equivalent to a single multilinear lattice defined on all features.
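The equivalence can be checked numerically on a small case. The sketch below is an illustration under assumed toy parameters, not part of the paper's experiments: two 2-input lattices on disjoint inputs feed a 2-input lattice, and the equivalent single 4-input lattice is obtained by evaluating the cascade at the 16 vertices of the unit hypercube. All names and parameter values are hypothetical.

```python
import random
from itertools import product


def multilinear(x, theta):
    """Multilinearly interpolated lattice value at x (vertices in product order)."""
    total = 0.0
    for v, t in zip(product((0, 1), repeat=len(x)), theta):
        w = 1.0
        for xd, vd in zip(x, v):
            w *= xd if vd == 1 else 1.0 - xd
        total += w * t
    return total


def cascade(x, theta_a, theta_b, theta_top):
    """Two first-layer lattices on disjoint inputs feeding one second-layer lattice."""
    f1 = multilinear(x[0:2], theta_a)
    f2 = multilinear(x[2:4], theta_b)
    return multilinear([f1, f2], theta_top)


# Hypothetical parameters for the three small lattices.
theta_a = [0.1, 0.4, 0.3, 0.9]
theta_b = [0.0, 0.7, 0.2, 1.0]
theta_top = [0.0, 0.5, 0.25, 1.0]

# Parameters of the equivalent single lattice: the cascade's value at each vertex.
theta_single = [
    cascade([float(c) for c in v], theta_a, theta_b, theta_top)
    for v in product((0, 1), repeat=4)
]

# Compare the cascade and the single lattice at random interior points.
rng = random.Random(0)
max_error = 0.0
for _ in range(200):
    x = [rng.random() for _ in range(4)]
    diff = cascade(x, theta_a, theta_b, theta_top) - multilinear(x, theta_single)
    max_error = max(max_error, abs(diff))
```

The agreement relies on the first-layer lattices using disjoint inputs, as the theorem requires; if two of them shared an input, the product would contain squared terms and would no longer be multilinear.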
4.2 Universal approximation of partial monotone functions
Theorem 4.1 in Daniels:2010 states that partial monotone linear embedding with min and max pooling can approximate any partial monotone function on the hypercube. We show in the next lemma that simplex-interpolated lattices can represent min or max pooling. Thus we can use two cascaded simplex-interpolated lattice layers with a linear embedding layer to approximate any partial monotone function on the hypercube.
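Lemma 1. Let $\theta_{\min} \in \mathbb{R}^{2^S}$ take the value $1$ at the all-ones vertex and $0$ at every other vertex, let $\theta_{\max} \in \mathbb{R}^{2^S}$ take the value $0$ at the all-zeros vertex and $1$ at every other vertex, and let $\psi_{\text{simplex}}$ be the simplex interpolation weights. Then
$$\min(x[1], \ldots, x[S]) = \psi_{\text{simplex}}(x)^T \theta_{\min}, \qquad \max(x[1], \ldots, x[S]) = \psi_{\text{simplex}}(x)^T \theta_{\max}.$$
Proof. From GuptaEtAl:2016, simplex interpolation is the Lovász extension
$$\psi_{\text{simplex}}(x)^T \theta = (1 - x[\pi(1)])\, \theta(u_0) + \sum_{i=1}^{S-1} (x[\pi(i)] - x[\pi(i+1)])\, \theta(u_i) + x[\pi(S)]\, \theta(u_S),$$
where $\pi$ is the sorted order such that $x[\pi(1)] \geq \cdots \geq x[\pi(S)]$, and $\theta(u_i)$ is the parameter at the vertex $u_i$ whose coordinates $\pi(1), \ldots, \pi(i)$ equal one and whose remaining coordinates equal zero. With $\theta_{\min}$ only the last term survives, giving $x[\pi(S)]$; with $\theta_{\max}$ all terms but the first survive and telescope to $x[\pi(1)]$. So by definition, it is easy to see the above result. ∎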
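The claim that simplex interpolation can represent min and max pooling can be verified with a short pure-Python sketch of the Lovász extension on the unit hypercube. This is an illustration under assumptions: the function name, the vertex ordering (that of `itertools.product`), and the particular parameter vectors (one at the all-ones vertex only for min, zero at the all-zeros vertex only for max) are choices made for this example.

```python
from itertools import product


def simplex_interpolate(x, theta):
    """Simplex (Lovász extension) interpolation over the unit hypercube.

    theta[j] is the value at the j-th vertex in the order of
    itertools.product((0, 1), repeat=len(x)).
    """
    s = len(x)
    index = {v: j for j, v in enumerate(product((0, 1), repeat=s))}
    order = sorted(range(s), key=lambda d: -x[d])  # coordinates, largest first
    vertex = [0] * s                               # walk from all-zeros to all-ones
    prev = 1.0
    total = 0.0
    for d in order:
        total += (prev - x[d]) * theta[index[tuple(vertex)]]
        vertex[d] = 1                              # switch on the next-largest coordinate
        prev = x[d]
    total += prev * theta[index[tuple(vertex)]]    # weight on the all-ones vertex
    return total


s = 3
theta_min = [0.0] * (2 ** s - 1) + [1.0]   # 1 only at the all-ones vertex
theta_max = [0.0] + [1.0] * (2 ** s - 1)   # 0 only at the all-zeros vertex

x = [0.3, 0.9, 0.5]
lattice_min = simplex_interpolate(x, theta_min)
lattice_max = simplex_interpolate(x, theta_max)
```

Both parameter vectors are non-decreasing along every edge of the hypercube, so these min and max lattices also satisfy the monotonicity constraints.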
4.3 Locally linear functions
If simplex interpolation GuptaEtAl:2016 (aka the Lovász extension) is used, the deep lattice network produces a locally linear function, because each layer is locally linear, and compositions of locally linear functions are locally linear. Note that a $D$-input lattice interpolated with simplex interpolation has $D!$ linear pieces GuptaEtAl:2016. We hypothesize that if one cascades an ensemble of lattices into a lattice, that the number of locally linear pieces is on the order .
5 Numerical Optimization Details for the DLN
Operators: We implemented the 1D calibrator and multilinear interpolation over a lattice as new C++ operators in TensorFlow tensorflow2015-whitepaper and express each layer as a computational graph node using these new and existing TensorFlow operators. We will make the code publicly available via the TensorFlow open source project. We use the ADAM optimizer kingma2014adam and batched stochastic gradients to update model parameters. After each gradient update, we project parameters to satisfy their monotonicity constraints. The linear embedding layer's constraints are element-wise non-negativity constraints, so its projection clips each negative component to zero. Projection for each calibrator is isotonic regression with total ordering, which we implement with the pool-adjacent-violator algorithm ayer1955empirical for each calibrator. Projection for each lattice is isotonic regression with partial ordering, resulting in $O(S 2^S)$ linear constraints for each lattice GuptaEtAl:2016. We solved it with consensus optimization and the alternating direction method of multipliers boyd2011distributed to parallelize the projection computations, run to a fixed convergence tolerance.
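The calibrator projection step can be sketched with the pool-adjacent-violators algorithm, which computes the least-squares projection of a sequence onto the set of non-decreasing sequences. The following pure-Python version is an illustrative sketch with an assumed function name, not the authors' TensorFlow operator.

```python
def pav_projection(values):
    """Project `values` onto non-decreasing sequences (least squares).

    Adjacent blocks whose means violate the ordering are pooled into one
    block carrying their average.
    """
    blocks = []  # each block is [sum of values, number of values]
    for v in values:
        blocks.append([v, 1])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    projected = []
    for s, c in blocks:
        projected.extend([s / c] * c)
    return projected


# Keypoint outputs after a hypothetical gradient step that broke monotonicity.
after_step = [0.0, 0.6, 0.4, 1.0]
projected = pav_projection(after_step)
```

In a training loop this projection would be applied to each monotonic calibrator's keypoint outputs after every gradient update, in the same way that the linear embedding weights are clipped at zero.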
Initialization: For linear embedding layers, we initialize each component in the linear embedding matrix with IID Gaussian noise with a positive mean $\mu$. The positive initial mean biases the initial parameters to be positive so that they are not clipped to zero by the first monotonicity projection. However, because the calibration layer before the linear embedding outputs in $[0, 1]$ and thus is expected to have output $E[x_t[d]] = 0.5$, initializing the linear embedding with a mean of $\mu$ introduces an initial bias: $E[W_t x_t] = \mu D_t / 2$ for each output. To counteract that we initialize each component of the bias vector, $b_t$, to $-\mu D_t / 2$, so that the initial expected output of the linear layer is $E[x_{t+1}] = E[W_t x_t + b_t] = 0$.
We initialize each lattice's parameters to be a linear function spanning $[0, 1]$, and add IID Gaussian noise to each parameter. We initialize each calibrator to be a linear function that maps its input range to $[0, 1]$ (and did not add any noise).
6 Experiments
We present results on the same benchmark dataset (Adult) with the same monotonic features as in Canini et al. Canini:2016, and for three problems from a large internet services company where the monotonicity constraints were specified by product groups. For each experiment, every model considered is trained with monotonicity guarantees on the same set of inputs. See Table 2 for a summary of the datasets.
For classification problems, we used logistic loss, and for the regression problem, we used squared error. For each problem, we used a validation set to optimize the hyperparameters for each model architecture: the learning rate, the number of training steps, etc. For an ensemble of lattices, we tune the number of lattices, $G$, and the number of inputs to each lattice, $S$. All calibrators for all models used a fixed number of 100 keypoints and a fixed input range.
For crystals Canini:2016 we validated the number of lattices in the ensemble, $G$, and the number of inputs to each lattice, $S$, as well as the ADAM stepsize and number of loops. For the min-max net DanielsVelikova:2010, we validated the number of groups and the dimension of each group, as well as the ADAM stepsize and number of loops.
For datasets where all features are monotonic, we also train a deep neural network with non-negative weight matrices and ReLU activation units, followed by a final fully connected layer with a non-negative weight matrix, which we call the monotonic DNN. We tune the number of hidden layers and the number of activation units in each layer.
Each results table contains an additional column describing the model parameters.
|Dataset||Type||# Features (# Monotonic)||# Training||# Validation||# Test|
|User Intent||Classify||49 (19)||241,325||60,412||176,792|
|Rater Score||Regress||10 (10)||1,565,468||195,530||195,748|
6.1 User Intent Case Study (Classification)
For this real-world problem from a large internet services company, the task is to classify the user intent. We report results for the best validated model for each of several DLN architectures, such as Calibration(Cal)-Linear(Lin)-Calibration(Cal)-Ensemble of Lattices(EnsLat)-Calibration(Cal)-Linear(Lin).
The test set is not IID with the train and validation sets: the train and validation sets are collected from the U.S., whereas the test set is collected from 20 other countries. As a result, we see a notable difference between the validation and test accuracy. This experiment is set up to test generalization ability.
The results are summarized in Table 3, sorted by the validation accuracy. Two of the DLN architectures outperform crystals and min-max net in terms of test accuracy.
6.2 Adult Benchmark Dataset (Classification)
Following Canini et al. Canini:2016, we set the function to be monotonically increasing in capital-gain, weekly hours of work and education level, and the gender wage gap. We used one-hot encoding for the other categorical features, for 90 features in total. We randomly split the usual train set UCI 80-20, trained over the 80%, and validated over the 20%.
For the DLN architecture, we used Cal-Lin-Cal-EnsLat-Cal-Lin layers. Results in Table 4 show the DLN provides better accuracy than the min-max network or crystals.
6.3 Rater Score Prediction Case Study (Regression)
In this task, we train a model to predict a rater score for a candidate result, where each rater score is averaged over 1-5 raters, and takes on 5-25 possible values. All 10 features are required to be monotonic. Results in Table 5 show the DLN has slightly better validation and test MSE than all other models.
|Validation MSE||Test MSE||# Parameters|
6.4 Usefulness Case Study (Classification)
In this task, we train a model to predict whether a candidate result contains useful information or not. All 9 features are required to be monotonic, and we use the Cal-Lin-Cal-EnsLat-Cal-Lin DLN architecture. Table 6 shows the DLN has better validation and test accuracy than the other models.
In this paper, we combined three types of layers, (1) calibrators, (2) linear embeddings, and (3) lattices, to produce a new class of models that combines the flexibility of deep networks with the regularization, interpretability and debuggability advantages that come with being able to impose monotonicity constraints on some inputs.
References
-  M. R. Gupta, A. Cotter, J. Pfeifer, K. Voevodski, K. Canini, A. Mangylov, W. Moczydlowski, and A. Van Esbroeck. Monotonic calibrated interpolated look-up tables. Journal of Machine Learning Research, 17(109):1–47, 2016.
-  E. K. Garcia and M. R. Gupta. Lattice regression. In Advances in Neural Information Processing Systems (NIPS), 2009.
-  K. Canini, A. Cotter, M. M. Fard, M. R. Gupta, and J. Pfeifer. Fast and flexible monotonic functions with ensembles of lattices. Advances in Neural Information Processing Systems (NIPS), 2016.
-  A. Howard and T. Jebara. Learning monotonic transformations for classification. Advances in Neural Information Processing Systems (NIPS), 2007.
-  László Lovász. Submodular functions and convexity. In Mathematical Programming The State of the Art, pages 235–257. Springer, 1983.
-  N. P. Archer and S. Wang. Application of the back propagation neural network algorithm with monotonicity constraints for two-group classification problems. Decision Sciences, 24(1):60–75, 1993.
-  S. Wang. A neural network method of density estimation for univariate unimodal data. Neural Computing & Applications, 2(3):160–167, 1994.
-  H. Kay and L. H. Ungar. Estimating monotonic functions and their bounds. AIChE Journal, 46(12):2426–2434, 2000.
-  C. Dugas, Y. Bengio, F. Bélisle, C. Nadeau, and R. Garcia. Incorporating functional knowledge in neural networks. Journal of Machine Learning Research, 2009.
-  A. Minin, M. Velikova, B. Lang, and H. Daniels. Comparison of universal approximators incorporating partial monotonicity by structure. Neural Networks, 23(4):471–475, 2010.
-  H. Daniels and M. Velikova. Monotone and partially monotone neural networks. IEEE Trans. Neural Networks, 21(6):906–917, 2010.
-  J. Sill. Monotonic networks. Advances in Neural Information Processing Systems (NIPS), 1998.
-  William W Armstrong and Monroe M Thomas. Adaptive logic networks. Handbook of Neural Computation, Section C1.8, IOP Publishing and Oxford U. Press, ISBN 0 7503 0312 3, 1996.
-  Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
-  Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
-  Miriam Ayer, H Daniel Brunk, George M Ewing, William T Reid, Edward Silverman, et al. An empirical distribution function for sampling with incomplete information. The annals of mathematical statistics, 26(4):641–647, 1955.
-  Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning, 3(1):1–122, 2011.
-  C. Blake and C. J. Merz. UCI repository of machine learning databases, 1998.