1 Introduction
We propose building models with multiple layers of lattices, which we refer to as deep lattice networks (DLNs). While we hypothesize that DLNs may be generally useful, we focus on the challenge of learning flexible partially monotonic functions, that is, models that are guaranteed to be monotonic with respect to a user-specified subset of the inputs. For example, if one is predicting whether to give someone a loan, we expect, and would like to constrain, the prediction to be monotonically increasing with respect to the applicant's income, all other features unchanged. Imposing monotonicity acts as a regularizer, improves generalization to test data, and makes the end-to-end model more interpretable, debuggable, and trustworthy.
To learn more flexible partial monotonic functions, we propose architectures that alternate three kinds of layers: linear embeddings, calibrators, and ensembles of lattices, each of which is trained discriminatively to optimize a structural risk objective and obey any given monotonicity constraints. See Fig. 2 for an example DLN with nine such layers.
Lattices are interpolated look-up tables, as shown in Fig. 1. Lattices have been shown to be an efficient nonlinear function class that can be constrained to be monotonic by adding appropriate sparse linear inequalities on the parameters GuptaEtAl:2016 , and can be trained in a standard empirical risk minimization framework Garcia:09 ; GuptaEtAl:2016 . Recent work showed that lattices can be jointly trained as an ensemble to learn flexible monotonic functions for an arbitrary number of inputs Canini:2016 .

Calibrators are one-dimensional lattices that nonlinearly transform a single input GuptaEtAl:2016 ; see Fig. 1 for an example. They have been used to preprocess inputs in two-layer models: calibrators-then-linear models Jebara:2007 , calibrators-then-lattice models GuptaEtAl:2016 , and calibrators-then-ensemble-of-lattices models Canini:2016 . Here, we extend their use to discriminatively normalize between the other layers of the deep model, as well as to act as a preprocessing layer. We also find that using a calibrator as a last layer can help nonlinearly transform the outputs to better match the labels.
We first describe the proposed DLN layers in detail in Section 2. In Section 3, we review more related work on learning flexible partial monotonic functions. We provide theoretical results characterizing the flexibility of the DLN in Section 4, followed by details on our TensorFlow implementation and numerical optimization choices in Section 5. Experimental results in Section 6 demonstrate the potential on benchmark and real-world scenarios.
2 Deep Lattice Network Layers
We describe in detail the three types of layers we propose for learning flexible functions that can be constrained to be monotonic with respect to any subset of the inputs. Without loss of generality, we assume monotonic means monotonic non-decreasing (one can flip the sign of an input if non-increasing monotonicity is desired). Let $x^{[t]}$ be the input vector to the $t$th layer, with $D^{[t]}$ inputs, and let $x^{[t]}_d$ denote the $d$th input for $d = 1, \ldots, D^{[t]}$. Table 1 summarizes the parameters and hyperparameters for each layer. For notational simplicity, in some places we drop the superscript $[t]$ if it is clear from the context. We also denote by $x^{[t]}_M$ the subset of $x^{[t]}$ that is to be monotonically constrained, and by $x^{[t]}_N$ the subset of $x^{[t]}$ that is non-monotonic.

Linear Embedding Layer: Each linear embedding layer consists of two linear matrices, one matrix $W^{[t]}_M$ that linearly embeds the monotonic inputs $x^{[t]}_M$ and a separate matrix $W^{[t]}_N$ that linearly embeds the non-monotonic inputs $x^{[t]}_N$, and one bias vector $b^{[t]}$. To preserve monotonicity of the embedded vector, we impose the following linear inequality constraints:

(1)  $W^{[t]}_M[i, j] \ge 0$ for all $i, j$.

The output of the linear embedding layer is

$x^{[t+1]} = \begin{bmatrix} W^{[t]}_M x^{[t]}_M \\ W^{[t]}_N x^{[t]}_N \end{bmatrix} + b^{[t]}.$

Note that only the coordinates of $x^{[t+1]}$ produced by $W^{[t]}_M$ (its first block) need to be treated as monotonic inputs to the next layer. These two linear embedding matrices and the bias vector are discriminatively trained.
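To make the constraint concrete, the following NumPy sketch (with hypothetical names; the paper's actual implementation uses TensorFlow operators) shows a linear embedding forward pass and the projection that enforces constraint (1):

```python
import numpy as np

def project_monotonic(W_m):
    # Constraint (1) is elementwise nonnegativity on W_m, so the Euclidean
    # projection simply clips negative entries to zero.
    return np.maximum(W_m, 0.0)

def linear_embedding(x_m, x_n, W_m, W_n, b):
    # Embed monotonic and non-monotonic inputs separately; the first
    # W_m.shape[0] coordinates of the output are the monotonic ones.
    return np.concatenate([W_m @ x_m, W_n @ x_n]) + b
```

Because every entry of the projected $W_M$ is nonnegative, increasing any monotonic input can only increase the monotonic block of outputs.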
Calibration Layer:
Each calibration layer consists of a separate one-dimensional piecewise linear transform for each input at that layer: $c^{[t]}_d(\cdot)$ maps $x^{[t]}_d$ to $x^{[t+1]}_d$, so that

$x^{[t+1]} = \left[ c^{[t]}_1(x^{[t]}_1) \;\; \cdots \;\; c^{[t]}_{D^{[t]}}(x^{[t]}_{D^{[t]}}) \right]^T.$

Here each $c^{[t]}_d$ is a 1D lattice with $K$ key-value pairs $(a \in \mathbb{R}^K, b \in \mathbb{R}^K)$, and the function for each input is linearly interpolated between the two $b$ values corresponding to the input's two surrounding $a$ values. An example is shown on the left in Fig. 1.

Each 1D calibration function is equivalent to a sum of weighted-and-shifted rectified linear units (ReLUs); that is, a calibrator function $c(x; a, b)$ can be equivalently expressed as

(2)  $c(x; a, b) = b_1 + \sum_{k=1}^{K-1} w_k \max(x - a_k, 0),$

where $w_1 = (b_2 - b_1)/(a_2 - a_1)$ and $w_k = (b_{k+1} - b_k)/(a_{k+1} - a_k) - (b_k - b_{k-1})/(a_k - a_{k-1})$ for $k = 2, \ldots, K-1$; no ReLU term is needed beyond the last keypoint because inputs are clipped to $[a_1, a_K]$. However, enforcing monotonicity and boundedness constraints on the calibrator output is much simpler with the parameterization of each keypoint's input-output values, as we discuss shortly.

Before training the DLN, we fix the input range of each calibrator, and we fix the keypoints $a$ to be uniformly spaced over that range. Inputs that fall outside the range are clipped to it. The calibrator output parameters $b$ are discriminatively trained.

For monotonic inputs, we can constrain the calibrator functions to be monotonic by constraining the calibrator output parameters to be monotonic, adding the linear inequality constraints

(3)  $b_{k+1} \ge b_k$ for $k = 1, \ldots, K-1$

into the training objective Canini:2016 . We also experimented with constraining all calibrators to be monotonic (even those for non-monotonic inputs) for more stable, regularized training.
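As a concrete illustration of the calibrator just described, here is a minimal NumPy sketch (hypothetical names; not the paper's C++ TensorFlow operator) of evaluation with clipping, plus a simple feasibility fix for constraint (3). Note the running-max fix is not the paper's projection, which uses the pool-adjacent-violators algorithm described in Section 5:

```python
import numpy as np

def calibrate(x, a, b):
    """1D piecewise-linear calibrator: linearly interpolate between the
    keypoint pairs (a[k], b[k]); out-of-range inputs are clipped."""
    x = np.clip(x, a[0], a[-1])
    return np.interp(x, a, b)

def make_monotonic(b):
    # Enforce b[k+1] >= b[k] by a running maximum: a cheap feasibility
    # fix, not the Euclidean projection the paper actually uses.
    return np.maximum.accumulate(b)
```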
Ensemble of Lattices Layer: Each ensemble of lattices layer consists of $G$ lattices. Each lattice is a linearly interpolated multidimensional look-up table; for an example, see the middle and right pictures in Fig. 1. Each $S$-dimensional look-up table takes inputs over the $S$-dimensional unit hypercube $[0, 1]^S$, and has $2^S$ parameters $\theta \in \mathbb{R}^{2^S}$ specifying the lattice's output for each of the $2^S$ vertices of the unit hypercube. Inputs in between the vertices are linearly interpolated, which forms a smooth but nonlinear function over the unit hypercube. Two interpolation methods have been used: multilinear interpolation and simplex interpolation GuptaEtAl:2016 (also known as the Lovász extension lovasz1983submodular ). We use multilinear interpolation for all our experiments, which can be expressed $f(x) = \theta^T \psi(x)$, where the nonlinear feature transformation $\psi(x)$ gives the linear interpolation weights that input $x$ puts on each of the $2^S$ parameters, so that the interpolated value is $\theta^T \psi(x)$, with $\psi_j(x) = \prod_{d=1}^{S} x_d^{v_j[d]} (1 - x_d)^{1 - v_j[d]}$, where $v_j \in \{0, 1\}^S$ is the coordinate vector of the $j$th vertex of the unit hypercube, for $j = 1, \ldots, 2^S$. For example, when $S = 2$, $\psi(x) = [(1-x_1)(1-x_2),\; x_1(1-x_2),\; (1-x_1)x_2,\; x_1 x_2]^T$.
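The multilinear interpolation formula above can be sketched directly in NumPy (a toy version for illustration, not the paper's TensorFlow operator):

```python
import itertools
import numpy as np

def multilinear_weights(x):
    """Interpolation weights psi(x): for each vertex v of the unit
    hypercube, psi_v(x) = prod_d x[d]^v[d] * (1 - x[d])^(1 - v[d])."""
    verts = list(itertools.product([0, 1], repeat=len(x)))
    psi = [np.prod([x[d] if v[d] else 1.0 - x[d] for d in range(len(x))])
           for v in verts]
    return verts, np.array(psi)

def lattice(x, theta):
    # theta holds one output value per vertex, in the same order as verts.
    _, psi = multilinear_weights(x)
    return float(theta @ psi)
```

The weights are a partition of unity, and at each vertex the lattice reproduces that vertex's parameter exactly.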
The ensemble of lattices layer produces $G^{[t]}$ outputs, one per lattice. When creating the DLN, if the $(t+1)$th layer is an ensemble of lattices, we randomly permute the outputs of the $t$th layer and assign them to the inputs of the ensemble. If a lattice has at least one monotonic input, then that lattice's output is constrained to be a monotonic input to the next layer; in this way we guarantee monotonicity end-to-end for the DLN.
Partial monotonicity: The DLN is constructed to preserve end-to-end partial monotonicity with respect to a user-specified subset of the inputs. As described above, the parameters of each component (matrix, calibrator, lattice) can be constrained so that the component is monotonic with respect to a subset of its inputs by satisfying certain linear inequality constraints GuptaEtAl:2016 . Also, if a component has a monotonic input, then the output of that component is treated as a monotonic input to the following layer. Because the composition of monotonic functions is monotonic, the constructed DLN belongs to the partial monotonic function class. The arrows in Fig. 2 illustrate this construction, i.e., how the $t$th layer's output becomes a monotonic input to the $(t+1)$th layer.
2.1 Hyperparameters
We detail the hyperparameters for each type of DLN layer in Table 1. Some of these hyperparameters constrain each other, since the number of outputs from each layer must equal the number of inputs to the next layer. For example, if a linear embedding layer has $D^{[t+1]}$ outputs, then there are $D^{[t+1]}$ inputs to the next layer, and if that next layer is a lattice ensemble, its hyperparameters must obey $G^{[t+1]} \times S^{[t+1]} = D^{[t+1]}$.
Layer  Parameters  Hyperparameters

Linear Embedding  $W^{[t]}_M$, $W^{[t]}_N$, $b^{[t]}$  number of outputs $D^{[t+1]}$
Calibrators  output keypoints $b$  number of keypoints $K$; input range
Lattice Ensemble  $\theta_g$ for $g = 1, \ldots, G^{[t]}$  number of lattices $G^{[t]}$; inputs per lattice $S^{[t]}$
3 Related Work
Prior to this work, the state of the art in learning expressive partial monotonic functions for $D$ inputs was a two-layer network consisting of a layer of calibrators followed by an ensemble of lattices Canini:2016 , with parameters appropriately constrained for monotonicity. That work built on earlier work of Gupta et al. GuptaEtAl:2016 , which constructed only a single calibrated lattice and was restricted to a small number of inputs due to the $2^D$ parameters needed for each lattice. This work differs in three key regards.
First, we alternate layers to form a deeper, and hence potentially more flexible, network. Second, a key question addressed in Canini et al. Canini:2016 is how to decide which features should be put together in each lattice of their ensemble. They found that random assignment worked well but required large ensembles. Smaller (and hence faster) models with the same accuracy could be trained by using a heuristic pre-processing step they proposed (crystals) to identify which features interact nonlinearly. This pre-processing step requires training a lattice for each pair of inputs to judge that pair's strength of interaction, which scales as $O(D^2)$, and we found it can be a large fraction of overall training time for large $D$.

We solve the problem of determining which inputs should interact in each lattice by using a linear embedding layer before an ensemble-of-lattices layer, to discriminatively and adaptively learn during training how to map the features to the first ensemble layer's lattice inputs. This strategy also means each input to a lattice can be a linear combination of the features, which is a second key difference from that prior work Canini:2016 .
The third difference is that in previous work Jebara:2007 ; GuptaEtAl:2016 ; Canini:2016 , the calibrator keypoint values were fixed a priori based on the quantiles of the features, which is challenging to do for the calibration layers mid-DLN because the quantiles of their inputs evolve during training. Instead, we fix the keypoint values uniformly over the bounded calibrator domain.
Learning monotonic single-layer neural nets by constraining the neural net weights to be positive dates back to Archer and Wang in 1993 archerWang:1993 , and that basic idea has been revisited by others Wang:1994 ; KayUngar:2000 ; Dugas:2009 ; Minin:2010 , but with some negative results about the obtainable flexibility even with multiple hidden layers DanielsVelikova:2010 . Sill Sill:1998 proposed a three-layer monotonic network that used an early form of monotonic linear embedding and max-and-min pooling. Daniels and Velikova DanielsVelikova:2010 extended Sill's result to learn a partial monotonic function by combining min-max pooling, also known as adaptive logic networks armstrong1996adaptive , with partial monotonic linear embedding, and showed that their proposed architecture is a universal approximator for partial monotone functions. None of these prior neural networks were demonstrated on problems with more than a handful of features, nor trained on more than a few thousand examples. For our experiments, we implemented a positive neural network and a min-max pooling network with TensorFlow.

4 Function Class of Deep Lattice Networks
We offer some results and hypotheses about the function class of deep lattice networks, depending on whether the lattices are interpolated with multilinear interpolation (which forms multilinear polynomials), or simplex interpolation (which forms locally linear surfaces).
4.1 Cascaded multilinear lookup tables
We show that a deep lattice network made up only of lattices (without intervening layers of calibrators or linear embeddings) is equivalent, if multilinear interpolation is used, to a single lattice defined on the input features. It is easy to construct counterexamples showing that this result does not hold for simplex-interpolated lattices.
Lemma 1.
Suppose that a lattice has $S$ inputs, each of which can be expressed in the form $f_s(x_{\sigma_s}) = \theta_s^T \psi_s(x_{\sigma_s})$, where the index sets $\sigma_s$ are mutually disjoint and $\psi_s$ represents multilinear interpolation weights. Then the output can be expressed in the form $\tilde{\theta}^T \tilde{\psi}(x_{\sigma_1 \cup \cdots \cup \sigma_S})$. That is, the lattice preserves the functional form of its inputs, changing only the values of the coefficients $\tilde{\theta}$ and the linear interpolation weights $\tilde{\psi}$.
Proof.
Each input of the lattice can be expressed in the form $f_s(x_{\sigma_s}) = \theta_s^T \psi_s(x_{\sigma_s})$, which is a multilinear polynomial in the variables indexed by $\sigma_s$. Analogously, the output can be expressed in the following form:

$g(x) = \sum_{j} \theta_j \prod_{s=1}^{S} f_s(x_{\sigma_s})^{v_j[s]} \left(1 - f_s(x_{\sigma_s})\right)^{1 - v_j[s]}.$

Note the product in this expression: $f_s$ and $1 - f_s$ are both multilinear polynomials, but within each factor of the product only one is present, since one of the two has exponent $0$ and the other has exponent $1$. Furthermore, since each $f_s$ is a function of a different subset of the variables, we conclude that the entire product is a multilinear polynomial. Since the sum of multilinear polynomials is still a multilinear polynomial, we conclude that $g$ is a multilinear polynomial. Any multilinear polynomial on $D$ variables can be converted to a $D$-dimensional multilinear lookup table, which concludes the proof. ∎
Lemma 1 can be applied inductively to every layer of a cascaded lookup table, down to the final output. Thus, a cascaded lookup table using multilinear interpolation is equivalent to a single multilinear lattice defined on all the features.
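This equivalence is easy to probe numerically: a cascade of multilinear lattices should be linear in each coordinate when the others are held fixed, the defining property of a multilinear polynomial. A toy sketch (our own construction, not from the paper):

```python
import numpy as np

def lattice2(x, theta):
    # 2D multilinear lattice; theta holds the vertex values in the order
    # (0,0), (0,1), (1,0), (1,1).
    x1, x2 = x
    psi = np.array([(1 - x1) * (1 - x2), (1 - x1) * x2,
                    x1 * (1 - x2), x1 * x2])
    return float(theta @ psi)

def cascade(x, thetas):
    # Two 2D lattices feeding a third: f(x) = L3(L1(x1, x2), L2(x3, x4)).
    t1, t2, t3 = thetas
    return lattice2([lattice2(x[:2], t1), lattice2(x[2:], t2)], t3)

# Linearity check in the first coordinate with the others held fixed.
rng = np.random.default_rng(1)
thetas = [rng.uniform(size=4) for _ in range(3)]
x = rng.uniform(size=4)
lo, hi, mid = x.copy(), x.copy(), x.copy()
lo[0], hi[0], mid[0] = 0.0, 1.0, 0.5
assert abs(cascade(mid, thetas)
           - 0.5 * (cascade(lo, thetas) + cascade(hi, thetas))) < 1e-12
```

The same check passes for every coordinate, because each input coordinate enters exactly one inner lattice and the outer lattice is multilinear in its two inputs.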
4.2 Universal approximation of partial monotone functions
Theorem 4.1 of DanielsVelikova:2010 states that partial monotonic linear embedding with min and max pooling can approximate any partial monotone function on the hypercube. We show in the next lemma that simplex-interpolated lattices can represent min or max pooling. Thus, two cascaded simplex-interpolated lattice layers combined with a linear embedding layer can approximate any partial monotone function on the hypercube.
Lemma 2.
Let $\theta_{\max}, \theta_{\min} \in \mathbb{R}^{2^D}$ be lattice parameters with vertex values $\theta_{\max}[v] = \max_d v[d]$ and $\theta_{\min}[v] = \min_d v[d]$ for each vertex $v \in \{0, 1\}^D$, and let $\psi(x)$ be the simplex interpolation weights. Then $\theta_{\max}^T \psi(x) = \max_d x_d$ and $\theta_{\min}^T \psi(x) = \min_d x_d$.
Proof.
From GuptaEtAl:2016 , $\theta^T \psi(x) = \theta[v_0] + \sum_{k=1}^{D} x_{\pi(k)} \left(\theta[v_k] - \theta[v_{k-1}]\right)$, where $\pi$ is the sorted order such that $x_{\pi(1)} \ge x_{\pi(2)} \ge \cdots \ge x_{\pi(D)}$, $v_0$ is the all-zeros vertex, and $v_k = v_{k-1} + e_{\pi(k)}$. Plugging in $\theta_{\max}$, only the $k = 1$ term is nonzero, giving $x_{\pi(1)} = \max_d x_d$; plugging in $\theta_{\min}$, only the $k = D$ term is nonzero, giving $x_{\pi(D)} = \min_d x_d$. ∎
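Lemma 2 can be checked numerically with a direct implementation of the sorted-order simplex interpolation formula from GuptaEtAl:2016 (the names and code are our illustrative sketch):

```python
import numpy as np

def simplex_interpolate(x, theta_fn):
    """Simplex (Lovasz-extension) interpolation: walk the vertex path
    determined by sorting the coordinates in descending order."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(-x)            # pi: descending sort of coordinates
    v = np.zeros(len(x), dtype=int)   # start at the all-zeros vertex
    out = theta_fn(v)
    for k in order:
        prev = theta_fn(v)
        v[k] = 1                      # move to the next vertex v_k
        out += x[k] * (theta_fn(v) - prev)
    return out

theta_max = lambda v: float(v.max())  # 1 on every vertex except the origin
theta_min = lambda v: float(v.min())  # 1 only on the all-ones vertex
```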
4.3 Locally linear functions
If simplex interpolation GuptaEtAl:2016 (a.k.a. the Lovász extension) is used, the deep lattice network produces a locally linear function, because each layer is locally linear and compositions of locally linear functions are locally linear. Note that a $D$-input lattice interpolated with simplex interpolation has $D!$ linear pieces GuptaEtAl:2016 . We hypothesize that if one cascades an ensemble of $G$ lattices, each with $S$ inputs, into another lattice, the number of locally linear pieces grows on the order of $(S!)^G$.
5 Numerical Optimization Details for the DLN
Operators: We implemented the 1D calibrators and multilinear interpolation over a lattice as new C++ operators in TensorFlow tensorflow2015whitepaper , and express each layer as a computational graph node using these new and existing TensorFlow operators. We will make the code publicly available via the TensorFlow open source project. We use the ADAM optimizer kingma2014adam and batched stochastic gradients to update the model parameters. After each gradient update, we project the parameters to satisfy their monotonicity constraints. The linear embedding layer's constraints are elementwise nonnegativity constraints, so its projection clips each negative component to zero. The projection for each calibrator is isotonic regression with a total ordering, which we implement with the pool-adjacent-violators algorithm ayer1955empirical . The projection for each lattice is isotonic regression with a partial ordering, resulting in sparse linear constraints for each lattice GuptaEtAl:2016 ; we solve it with consensus optimization and the alternating direction method of multipliers boyd2011distributed to parallelize the projection computations, with a fixed convergence tolerance.
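The calibrator projection mentioned above, isotonic regression with a total order, can be sketched with the pool-adjacent-violators algorithm (a toy single-calibrator version; the paper's implementation is a TensorFlow operator):

```python
import numpy as np

def pav_project(b):
    """Euclidean projection of calibrator outputs b onto the monotonicity
    constraints b[0] <= b[1] <= ... via pool-adjacent-violators."""
    values, weights = [], []
    for v in map(float, b):
        values.append(v)
        weights.append(1)
        # Merge adjacent blocks while they violate monotonicity; each
        # merged block takes the weighted mean of its members.
        while len(values) > 1 and values[-2] > values[-1]:
            w = weights[-2] + weights[-1]
            m = (values[-2] * weights[-2] + values[-1] * weights[-1]) / w
            values[-2:], weights[-2:] = [m], [w]
    out = []
    for m, w in zip(values, weights):
        out.extend([m] * w)
    return np.array(out)
```

An already-monotone input passes through unchanged, and any violation is replaced by the mean of the pooled block, which is the closest monotone sequence in Euclidean distance.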
Initialization: For linear embedding layers, we initialize each component of the linear embedding matrices with IID Gaussian noise with a positive mean. The positive initial mean biases the initial parameters to be positive, so that they are not clipped to zero by the first monotonicity projection. However, because the calibration layer before the linear embedding produces outputs whose expected value is positive, initializing the linear embedding with a positive mean introduces an initial bias in the layer's output. To counteract that, we initialize each component of the bias vector $b$ so that the initial expected output of the linear layer is $0$.
We initialize each lattice’s parameters to be a linear function spanning , and add IID Gaussian noise to each parameter. We initialize each calibrator to be a linear function that maps to (and did not add any noise).
6 Experiments
We present results on the same benchmark dataset (Adult) with the same monotonic features as in Canini et al. Canini:2016 , and for three problems from a large internet services company where the monotonicity constraints were specified by product groups. For each experiment, every model considered is trained with monotonicity guarantees on the same set of inputs. See Table 2 for a summary of the datasets.
For classification problems we used the logistic loss, and for regression we used the squared error. For each problem, we used a validation set to optimize the hyperparameters of each model architecture: the learning rate, the number of training steps, etc. For an ensemble of lattices, we tune the number of lattices, $G$, and the number of inputs to each lattice, $S$. All calibrators for all models used a fixed number of 100 keypoints over a fixed input range.
For crystals Canini:2016 we validated the number of lattices, $G$, and the number of inputs to each lattice, $S$, as well as the ADAM step size and number of training loops. For the min-max network DanielsVelikova:2010 , we validated the number of groups and the dimension of each group, as well as the ADAM step size and number of training loops.
For datasets where all features are monotonic, we also train a deep neural network with nonnegative weight matrices and ReLU activations, followed by a final fully connected layer with a nonnegative weight matrix, which we call a monotonic DNN. We tune the number of hidden layers and the number of activation units in each layer.
Each results table contains an additional column reporting the number of parameters of each model.
Dataset  Type  # Features (# Monotonic)  # Training  # Validation  # Test 

Adult  Classify  90 (4)  26,065  6,496  16,281 
User Intent  Classify  49 (19)  241,325  60,412  176,792 
Rater Score  Regress  10 (10)  1,565,468  195,530  195,748 
Thresholding  Classify  9 (9)  62,220  7,764  7,919 
6.1 User Intent Case Study (Classification)
For this real-world problem from a large internet services company, the task is to classify a user's intent. We report results for the best validated model of each DLN architecture, such as Calibration(Cal)-Linear(Lin)-Calibration(Cal)-Ensemble-of-Lattices(EnsLat)-Calibration(Cal)-Linear(Lin).
The test set is not IID with the train and validation sets: the train and validation sets were collected from the U.S., while the test set was collected from 20 other countries; as a result, we see a notable difference between validation and test accuracy. This experiment is set up to test generalization ability.
The results are summarized in Table 3, sorted by validation accuracy. Two of the DLN architectures outperform crystals and the min-max network in test accuracy.
Model  Validation Accuracy  Test Accuracy  # Parameters

Cal-Lin-Cal-EnsLat-Cal-Lin  74.39%  72.48%  27,903
Crystals  74.24%  72.01%  15,840
Cal-Lin-Cal-EnsLat-Cal-Lat  73.97%  72.59%  16,893
Min-Max network  73.89%  72.02%  31,500
Cal-Lin-Cal-Lat  73.88%  71.73%  7,303
6.2 Adult Benchmark Dataset (Classification)
We compare accuracy on the benchmark Adult dataset UCI , where a model predicts whether a person's income is at least $50,000. Following Gupta et al. GuptaEtAl:2016 , we set the function to be monotonically increasing in capital gain, weekly hours of work, education level, and the gender wage gap. We used one-hot encoding for the other categorical features, for 90 features in total. We randomly split the usual train set UCI 80-20, trained over the 80%, and validated over the 20%. For the DLN we used a Cal-Lin-Cal-EnsLat-Cal-Lin architecture. Results in Table 4 show the DLN provides better accuracy than the min-max network or crystals.
Model  Validation Accuracy  Test Accuracy  # Parameters

Cal-Lin-Cal-EnsLat-Cal-Lin  86.50%  86.08%  40,549
Crystals  86.02%  85.87%  3,360
Min-Max network  85.28%  84.63%  57,330
6.3 Rater Score Prediction Case Study (Regression)
In this task, we train a model to predict a rater score for a candidate result, where each rater score is averaged over 15 raters and takes on 525 possible values. All 10 features are required to be monotonic. Results in Table 5 show that the DLN has slightly better validation and test MSE than all the other models.
Model  Validation MSE  Test MSE  # Parameters

Cal-Lin-Cal-EnsLat-Cal-Lin  1.2078  1.2096  81,601
Crystals  1.2101  1.2109  1,980
Min-Max network  1.3474  1.3447  5,500
Monotonic DNN  1.3920  1.3939  2,341
6.4 Usefulness Case Study (Classification)
In this task, we train a model to predict whether a candidate result contains useful information or not. All 9 features are required to be monotonic, and we use a Cal-Lin-Cal-EnsLat-Cal-Lin DLN architecture. Table 6 shows that the DLN has better validation and test accuracy than the other models.
Model  Validation Accuracy  Test Accuracy  # Parameters

Cal-Lin-Cal-EnsLat-Cal-Lin  66.08%  65.26%  81,051
Crystals  65.45%  65.13%  9,920
Min-Max network  64.62%  63.65%  4,200
Monotonic DNN  64.27%  62.88%  2,012
7 Conclusions
In this paper, we combined three types of layers, (1) calibrators, (2) linear embeddings, and (3) lattices, to produce a new class of models that combines the flexibility of deep networks with the regularization, interpretability, and debuggability advantages that come with being able to impose monotonicity constraints on some inputs.
References

[1] M. R. Gupta, A. Cotter, J. Pfeifer, K. Voevodski, K. Canini, A. Mangylov, W. Moczydlowski, and A. Van Esbroeck. Monotonic calibrated interpolated look-up tables. Journal of Machine Learning Research, 17(109):1–47, 2016.
[2] E. K. Garcia and M. R. Gupta. Lattice regression. In Advances in Neural Information Processing Systems (NIPS), 2009.
[3] K. Canini, A. Cotter, M. M. Fard, M. R. Gupta, and J. Pfeifer. Fast and flexible monotonic functions with ensembles of lattices. In Advances in Neural Information Processing Systems (NIPS), 2016.
[4] A. Howard and T. Jebara. Learning monotonic transformations for classification. In Advances in Neural Information Processing Systems (NIPS), 2007.
[5] L. Lovász. Submodular functions and convexity. In Mathematical Programming: The State of the Art, pages 235–257. Springer, 1983.
[6] N. P. Archer and S. Wang. Application of the back propagation neural network algorithm with monotonicity constraints for two-group classification problems. Decision Sciences, 24(1):60–75, 1993.
[7] S. Wang. A neural network method of density estimation for univariate unimodal data. Neural Computing & Applications, 2(3):160–167, 1994.
[8] H. Kay and L. H. Ungar. Estimating monotonic functions and their bounds. AIChE Journal, 46(12):2426–2434, 2000.
[9] C. Dugas, Y. Bengio, F. Bélisle, C. Nadeau, and R. Garcia. Incorporating functional knowledge in neural networks. Journal of Machine Learning Research, 2009.
[10] A. Minin, M. Velikova, B. Lang, and H. Daniels. Comparison of universal approximators incorporating partial monotonicity by structure. Neural Networks, 23(4):471–475, 2010.
[11] H. Daniels and M. Velikova. Monotone and partially monotone neural networks. IEEE Trans. Neural Networks, 21(6):906–917, 2010.
[12] J. Sill. Monotonic networks. In Advances in Neural Information Processing Systems (NIPS), 1998.
[13] W. W. Armstrong and M. M. Thomas. Adaptive logic networks. In Handbook of Neural Computation, Section C1.8. IOP Publishing and Oxford University Press, 1996.
[14] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
[15] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[16] M. Ayer, H. D. Brunk, G. M. Ewing, W. T. Reid, and E. Silverman. An empirical distribution function for sampling with incomplete information. The Annals of Mathematical Statistics, 26(4):641–647, 1955.
[17] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
[18] C. Blake and C. J. Merz. UCI repository of machine learning databases, 1998.