1 Introduction
In machine learning and statistical modeling, feature mappings are intrainstance transformations, usually denoted by
, that map instance vectors to higher dimensional spaces in which they are more linearly separable, allowing linear models to capture nonlinearities [1].A well known and widely used feature mapping is the polynomial expansion, which produces a new feature for each degree monomial in the original features. If the original features are , the order2 polynomial features are , and the order3 polynomial features are . A order polynomial feature expansion of the feature space allows a linear model to learn polynomial relationships between dependent and independent variables. This mapping was first utilized in a published experiment by Joseph Diaz Gergonne in 1815 [2, 3].
While other methods for capturing nonlinearities have been developed, such as kernels (the direct offspring of feature mappings), trees, generative models, and neural networks, feature mappings are still a popular tool
[4, 5, 6]. The instanceindependent nature of feature mappings allows them to pair well with linear parameter estimation techniques such as stochastic gradient descent, making them a candidate for certain types of largescale machine learning problems when
.The compressed sparse row (CSR) matrix format [7] is widely used [8, 9, 10, 11] and supported [12, 13, 14, 15], and is considered the standard data structure for sparse, computationally heavy tasks. However, polynomial feature expansions cannot be performed directly on CSR matrices, nor on any sparse matrix format, without intermediate densification steps. This densification not only adds overhead, but wastefully computes combinations of features that have a product of zero, which are then discarded during conversion into a sparse format.
We describe an algorithm that takes a CSR matrix as input and produces a CSR matrix for its degree polynomial feature expansion with no intermediate densification. The algorithm leverages the CSR format to only compute products of features that result in nonzero values. This exploits the sparsity of the data to achieve an improved time complexity of on each vector of the matrix where is the degree of the expansion, is the dimensionality, and is the density. The standard algorithm has time complexity . Since , our algorithm is a significant improvement for small . While the algorithm we describe uses CSR matrices, it can be readily adapted to other sparse formats.
2 Notation
We denote matrices by uppercase bold letters thus: . The the row of is written . All vectors are written in bold, and , with no subscript, is a vector. Nonbold letters are scalars. We sometimes use ‘slice’ notation on subscripts, so that indicates the second through fifth elements of the vector .
A CSR matrix representation of an row matrix consists of three vectors: , , and and a single number: the number of columns of . The vectors and contain the same number of elements, and hold the column indices and data values, respectively, of all nonzero elements of . The vector has entries. The values in index both and . The th entry of tells where the data describing nonzero columns of are within the other two vectors: contain the column indices of those entries; contain the entries themselves. Since only nonzero elements of each row are held, the overall number of columns of must also be stored, since it cannot be derived from the other data.
Scalars, vectors, and matrices are often decorated with a superscript , which is not to be interpreted as an exponent, but instead as an indicator of polynomial expansion: For example, if the CSR for is , then is the vector that holds columns for nonzero values in ’s quadratic feature expansion CSR representation.
Uppercase refers to the degree of a polynomial or interaction expansion. When a superscript (kappa) appears, it indicates that the element below it is in a polynomially expanded context of degree . For example, if is the number of nonezero elements in the vector th row vector of some matrix, is the number of nonzero elements in the polynomial expansion of that row vector. Lowercase refers to a column index.
3 Motivation
In this section we present a strawman algorithm for computing polynomial feature expansions on dense matrices. We then modify the algorithm slightly to operate on a CSR matrix, to expose its infeasibility in that context. We then show how the algorithm would be feasible with a bijective mapping from tuples of column indicies in the input matrix to column indices in the polynomial expansion matrix, which we then derive in the following section.
It should be noted that in practice, as well as in our code and experiments, expansions for degrees are also generated. The final design matrix is the augmentation of all such expansions. However, the algorithm descriptions in this paper omit these steps as they would become unnecessarily visually and notationally cumbersome. Extending them to include all degrees less than is trivial and does not affect the complexity of the algorithm as the terms that involve dominate.
3.1 Dense Second Degree Polynomial Expansion Algorithm
A natural way to calculate polynomial features for a matrix is to walk down its rows and, for each row, take products of all combinations of elements. To determine in which column of products of elements in belong, a simple counter can be set to zero for each row of and incremented after each polynomial feature is generated. This counter gives the column of into which each expansion feature belongs. This is shown in Algorithm 1.
3.2 Incomplete Second Degree CSR Polynomial Expansion Algorithm
Now consider how this algorithm might be modified to accept a CSR matrix. Instead of walking directly down rows of , we will walk down sections of and partitioned by , and instead of inserting polynomial features into , we will insert column numbers into and data elements into . Throughout the algorithm, we use variables named , with sub or superscripts, to indicate the number of nonzero entries in either a matrix or a row of a matrix. See Algorithm 2.
The crux of the problem is at line 25 of Algorithm 2. Given the arbitrary columns involved in a polynomial feature of , we need to determine the corresponding column of . We cannot simply reset a counter for each row as we did in the dense algorithm, because only columns corresponding to nonzero values are stored. Any time a column that would have held a zero value is implicitly skipped, the counter would err.
To develop a general algorithm, we require a mapping from a list of columns of to a single column of . If there are columns of and columns of , this can be accomplished by a bijective mapping of the following form:
(1) 
where (i) , (ii) are elements of , and (iii) is an element of . (For interaction features, the constraint is .)
Stated more verbosely, we require a bijective mapping from tuples consisting of column indicies of the original input to where the column index of the corresponding product of features in the polynomial expansion. While any bijective mapping would suffice, a common order in which to produce polynomial features is , , , , , , , , , for where the elements of the tuples are column indices. That is, the further to the right an index is, the sooner it is incremented. If we want for our algorithm to be backwards compatible with existing models, the mapping must use the this same ordering.
4 Construction of Mappings
Within this section, , , and denote column indices. We will construct mappings for second () and third () degree interaction and polynomial expansions. To accomplish this, we will require the triangle and tetrahedral numbers. We denote the th triangle number as and the th tetrahedral number as .
For reference, we list the first five triangle and tetrahedral numbers in the following table:
0  0  0 
1  1  1 
2  3  4 
3  6  10 
4  10  20 
4.1 Second Degree Interaction Mapping
For second order interaction features, we require a bijective function that maps the elements of the ordered set
(2) 
to the elements of the ordered set
(3) 
For , we can view the desired mapping as one that maps the coordinates of matrix cells to . If we fill the cells of the matrix with the codomain, the target matrix is as follows:
(4) 
where the entry in row , column , displays the value of .
It will be simpler to instead construct a preliminary mapping, of the following form:
(5) 
and then subtract the preliminary mapping from the total number of elements in the codomain to create the final mapping. Note that in equation 5 we have the following relation:
(6) 
where is the value of any cell in row of equation 5.
Therefore, the following mapping will produce equation 5:
(7) 
We can now use this result to construct the desired mapping by subtracting it from the size of the codomain:
(8) 
4.2 Second Degree Polynomial Mapping
In this case, the target matrix is of the form
(9) 
A very similar analysis can be done for the case to yield
(10) 
4.3 Third Degree Interaction Mapping
For we can no longer view the necessary function as mapping matrix coordinates to cell values; rather, tensor coordinates to cell values. For simplicity, we will instead list the column index tuples and their necessary mappings in a table. We shall consider the case of .
Again, it is simpler to find a mapping that maps to the reverse the target indices (plus one) and create a final mapping by subtracting that mapping from the number of elements in the codomain. We therefore seek a preliminary mapping of the form
10  0  
9  1  
8  2  
7  3  
6  4  
5  5  
4  6  
3  7  
2  8  
1  9 
The mapping has been partitioned according to the dimension. Note that within each partition is a mapping very similar to the equivalent, but with the indices shifted by a function of . For example, when , the indices are shifted by , when , the shift is , and finally, when , the shift is . The preliminary mapping is therefore
(11) 
and the final mapping is therefore
(12)  
(13)  
(14) 
4.4 Third Degree Polynomial Mapping
The analysis of the polynomial case is very similar to that of the interaction case. However, the target mapping now includes the edge of the simplex, as it included the diagonal of the matrix in the polynomial case. The analysis yields the mapping
(15) 
5 Higher Order Mappings
It can be seen that mappings to higher orders can be constructed inductively. A degree mapping is a function of the simplex numbers and reparameterized versions of all lower order mappings. However, in practice, higher degrees are not often used as the dimensionality of the expanded vectors becomes prohibitively large. A fourth degree polynomial expansion of a vector would have dimensions.
6 Final CSR Polynomial Expansion Algorithm
With the mapping from columns of to a column of , we can now write the final form of the innermost loop of the algorithm from 3.2. Let the polynomial mapping for be denoted . Then the innermost loop can be completed as follows:
The algorithm can be generalized to higher degrees by simply adding more nested loops, using higher order mappings, modifying the output dimensionality, and adjusting the counting of nonzero polynomial features in line 9.
7 Higher Degree and Interaction Algorithms
Most of the steps for higher degrees and interaction expansions (as opposed to polynomial) are the same as for the polynomial case. The differences are that for higher degrees, an extra loop is needed to iterate over another column index, and a different mapping is required. For interaction expansions, the column indices are never allowed to equal each other, so each loop executes one less time, and an interaction mapping is required. Also, for interaction expansions, the way is computed on line 9 of Algorithm 2. Instead of , we have .
8 Time Complexity
8.1 Analytical
Calculating degree polynomial features via our method for a vector of dimensionality and density requires products. The complexity of the algorithm, for fixed , is therefore
(16)  
(17)  
(18) 
For a matrix of size , the complexity is therefore . The dense algorithm (Algorithm 1) does not leverage sparsity, and its complexity is . Since , the sparse algorithm scales polynomially with the degree of the polynomial expansion.
8.2 Empirical Results
To empirically verify the average time complexity of our algorithm, we implemented both the sparse version and the baseline in the Cython programming language so that results would be directly comparable. We sought the relationships between runtime and the instance count (), the instance dimensionality (), and the instance density ().
To find these relationships, we individually varied , , and while holding the remaining two constant. For each of these configurations, we generated
matrices and took the average time to reduce variance. The time to densify did not count against the dense algorithm. Figure
1 summarizes our findings.Varying the density () (column 1) shows that our algorithm scales polynomially with , but that the baseline is unaffected by it. The runtimes for both algorithms increase polynomially with the dimensionality (), but ours at a significantly reduced rate. Likewise, both algorithms scale linearly with the instance count (), but ours to a much lesser degree.
Note that the point at which the sparse and dense algorithms intersect when varying is to the right of , which is when a matrix technically becomes sparse. The point at which this happens will depend on , but the results show that our algorithm is useful even for some dense matrices.
9 Conclusion
We have developed an algorithm for performing polynomial feature expansions on CSR matrices that scales polynomially with respect to the density of the matrix. The areas within machine learning that this work touches are not en vogue, but they are workhorses of industry, and every improvement in core representations has an impact across a broad range of applications.
References
 [1] G.X. Yuan, C.H. Ho, and C.J. Lin, “Recent advances of largescale linear classification,” Proceedings of the IEEE, vol. 100, no. 9, pp. 2584–2603, 2012.

[2]
J. Gergonne, “The application of the method of least squares to the interpolation of sequences,”
Historia Mathematica, vol. 1, no. 4, pp. 439–447, 1974. 
[3]
K. Smith, “On the standard deviations of adjusted and interpolated values of an observed polynomial function and its constants and the guidance they give towards a proper choice of the distribution of observations,”
Biometrika, vol. 12, no. 1/2, pp. 1–85, 1918.  [4] P. A. Barker, F. A. StreetPerrott, M. J. Leng, P. Greenwood, D. Swain, R. Perrott, R. Telford, and K. Ficken, “A 14,000year oxygen isotope record from diatom silica in two alpine lakes on mt. kenya,” Science, vol. 292, no. 5525, pp. 2307–2310, 2001.
 [5] Y.W. Chang, C.J. Hsieh, K.W. Chang, M. Ringgaard, and C.J. Lin, “Training and testing lowdegree polynomial data mappings via linear svm,” Journal of Machine Learning Research, vol. 11, no. Apr, pp. 1471–1490, 2010.
 [6] P. Shaw, D. Greenstein, J. Lerch, L. Clasen, R. Lenroot, N. Gogtay, A. Evans, J. Rapoport, and J. Giedd, “Intellectual ability and cortical development in children and adolescents,” Nature, vol. 440, no. 7084, pp. 676–679, 2006.
 [7] Y. Saad, “Sparskit: a basic tool kit for sparse matrix computations,” 1994.
 [8] H. Liu, S. Yu, Z. Chen, B. Hsieh, and L. Shao, “Sparse matrixvector multiplication on nvidia gpu,” International Journal of Numerical Analysis & Modeling, Series B, vol. 3, no. 2, pp. 185–191, 2012.
 [9] A. Buluç, J. T. Fineman, M. Frigo, J. R. Gilbert, and C. E. Leiserson, “Parallel sparse matrixvector and matrixtransposevector multiplication using compressed sparse blocks,” in Proceedings of the twentyfirst annual symposium on Parallelism in algorithms and architectures. ACM, 2009, pp. 233–244.
 [10] N. Bell and M. Garland, “Efficient sparse matrixvector multiplication on cuda,” Nvidia Technical Report NVR2008004, Nvidia Corporation, Tech. Rep., 2008.
 [11] J. B. White and P. Sadayappan, “On improving the performance of sparse matrixvector multiplication,” in HighPerformance Computing, 1997. Proceedings. Fourth International Conference on. IEEE, 1997, pp. 66–71.
 [12] G. Guennebaud, B. Jacob et al., “Eigen v3,” http://eigen.tuxfamily.org, 2010.
 [13] F. Bastien, P. Lamblin, R. Pascanu, J. Bergstra, I. Goodfellow, A. Bergeron, N. Bouchard, D. WardeFarley, and Y. Bengio, “Theano: new features and speed improvements,” arXiv preprint arXiv:1211.5590, 2012.
 [14] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikitlearn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
 [15] R. Koenker, P. Ng et al., “Sparsem: A sparse matrix package for r,” Journal of Statistical Software, vol. 8, no. 6, pp. 1–9, 2003.
Comments
There are no comments yet.