Leveraging Sparsity to Speed Up Polynomial Feature Expansions of CSR Matrices Using K-Simplex Numbers

03/16/2018
by Andrew Nystrom, et al.
Google
Brown University

We provide an algorithm that speeds up polynomial and interaction feature expansion on sparse matrices. The algorithm operates on and produces a compressed sparse row matrix with no densification. It works by defining a bijective function involving simplex numbers of column indices in the original matrix to column indices in the expanded matrix. This function allows for only the nonzero elements to be iterated over and multiplied together, greatly improving the expansion of sparse matrices in compressed sparse row format. For a vector of dimension D and density 0 < d < 1, the algorithm has time complexity Θ(d^KD^K) where K is the polynomial-feature order; this is an improvement by a factor d^K over the standard method.


1 Introduction

In machine learning and statistical modeling, feature mappings are intra-instance transformations, usually denoted by ϕ, that map instance vectors to higher dimensional spaces in which they are more linearly separable, allowing linear models to capture nonlinearities [1].

A well known and widely used feature mapping is the polynomial expansion, which produces a new feature for each degree-K monomial in the original features. If the original features are (x_1, x_2, x_3), the order-2 polynomial features are (x_1^2, x_1x_2, x_1x_3, x_2^2, x_2x_3, x_3^2), and the order-3 polynomial features are the ten degree-3 monomials (x_1^3, x_1^2x_2, x_1^2x_3, x_1x_2^2, x_1x_2x_3, x_1x_3^2, x_2^3, x_2^2x_3, x_2x_3^2, x_3^3). A K-order polynomial feature expansion of the feature space allows a linear model to learn polynomial relationships between dependent and independent variables. This mapping was first utilized in a published experiment by Joseph Diaz Gergonne in 1815 [2, 3].
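As a concrete illustration (ours, not from the paper), the degree-2 and degree-3 monomials of a three-feature vector can be enumerated with Python's itertools; the tuple ordering produced here is the same ordering discussed later in Section 3:

```python
from itertools import combinations_with_replacement

x = ['x1', 'x2', 'x3']  # symbolic original features

# Degree-2 polynomial features: all products x_i * x_j with i <= j.
deg2 = [''.join(pair) for pair in combinations_with_replacement(x, 2)]
print(deg2)       # ['x1x1', 'x1x2', 'x1x3', 'x2x2', 'x2x3', 'x3x3']

# Degree-3 polynomial features: all products x_i * x_j * x_k with i <= j <= k.
deg3 = [''.join(t) for t in combinations_with_replacement(x, 3)]
print(len(deg3))  # C(3 + 3 - 1, 3) = 10 monomials
```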

While other methods for capturing nonlinearities have been developed, such as kernels (the direct offspring of feature mappings), trees, generative models, and neural networks, feature mappings are still a popular tool [4, 5, 6]. The instance-independent nature of feature mappings allows them to pair well with linear parameter estimation techniques such as stochastic gradient descent, making them a candidate for certain types of large-scale machine learning problems where the number of instances is very large.

The compressed sparse row (CSR) matrix format [7] is widely used [8, 9, 10, 11] and supported [12, 13, 14, 15], and is considered the standard data structure for sparse, computationally heavy tasks. However, polynomial feature expansions cannot be performed directly on CSR matrices, nor on any sparse matrix format, without intermediate densification steps. This densification not only adds overhead, but wastefully computes combinations of features that have a product of zero, which are then discarded during conversion into a sparse format.

We describe an algorithm that takes a CSR matrix as input and produces a CSR matrix for its degree-K polynomial feature expansion with no intermediate densification. The algorithm leverages the CSR format to compute only the products of features that result in nonzero values. This exploits the sparsity of the data to achieve an improved time complexity of O(d^K D^K) on each row of the matrix, where K is the degree of the expansion, D is the dimensionality, and d is the density. The standard algorithm has time complexity O(D^K). Since 0 < d < 1, our algorithm is a significant improvement for small d. While the algorithm we describe uses CSR matrices, it can be readily adapted to other sparse formats.

2 Notation

We denote matrices by uppercase bold letters thus: A. The i-th row of A is written a_i. All vectors are written in bold, and a, with no subscript, is a vector. Non-bold letters are scalars. We sometimes use 'slice' notation on subscripts, so that c_{2:5} indicates the second through fifth elements of the vector c.

A CSR matrix representation of an N-row matrix A consists of three vectors, c, d, and r, and a single number: the number of columns, D, of A. The vectors c and d contain the same number of elements and hold the column indices and data values, respectively, of all nonzero elements of A. The vector r has N + 1 entries. The values in r index both c and d. The i-th entry of r tells where the data describing the nonzero columns of a_i lies within the other two vectors: c_{r_i:r_{i+1}} contains the column indices of those entries; d_{r_i:r_{i+1}} contains the entries themselves. Since only the nonzero elements of each row are held, the overall number of columns, D, of A must also be stored, since it cannot be derived from the other data.
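For concreteness, the c, d, and r vectors correspond directly to the indices, data, and indptr attributes of a SciPy CSR matrix; the small example below (ours, with SciPy used only for illustration) shows the correspondence:

```python
import numpy as np
from scipy.sparse import csr_matrix

# A 3 x 4 matrix with 5 nonzero entries.
A = np.array([[0.0, 2.0, 0.0, 1.0],
              [0.0, 0.0, 0.0, 0.0],
              [3.0, 0.0, 4.0, 5.0]])
A_csr = csr_matrix(A)

# c: column indices of the nonzeros, row by row.
print(A_csr.indices)   # [1 3 0 2 3]
# d: the nonzero values themselves, in the same order.
print(A_csr.data)      # [2. 1. 3. 4. 5.]
# r: N + 1 offsets into c and d; row i occupies positions r[i]:r[i+1].
print(A_csr.indptr)    # [0 2 2 5]
# D must be stored separately; an all-zero row (row 1) leaves no other trace.
print(A_csr.shape[1])  # 4
```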

Scalars, vectors, and matrices are often decorated with a superscript κ, which is not to be interpreted as an exponent, but instead as an indicator of polynomial expansion: for example, if the CSR representation of A is (c, d, r), then c^κ is the vector that holds the column indices of the nonzero values in the CSR representation of A's quadratic feature expansion.

Uppercase K refers to the degree of a polynomial or interaction expansion. When a superscript κ (kappa) appears, it indicates that the element below it is in a polynomially expanded context of degree K. For example, if nnz_i is the number of nonzero elements in the i-th row vector of some matrix, nnz^κ_i is the number of nonzero elements in the polynomial expansion of that row vector. Lowercase k refers to a column index.

3 Motivation

In this section we present a strawman algorithm for computing polynomial feature expansions on dense matrices. We then modify the algorithm slightly to operate on a CSR matrix, to expose its infeasibility in that context. We then show how the algorithm would be feasible with a bijective mapping from K-tuples of column indices in the input matrix to column indices in the polynomial expansion matrix, which we then derive in the following section.

It should be noted that in practice, as well as in our code and experiments, expansions for degrees 1, ..., K - 1 are also generated. The final design matrix is the augmentation of all such expansions. However, the algorithm descriptions in this paper omit these steps, as they would become unnecessarily visually and notationally cumbersome. Extending them to include all degrees less than K is trivial and does not affect the complexity of the algorithm, as the terms that involve K dominate.

3.1 Dense Second Degree Polynomial Expansion Algorithm

A natural way to calculate polynomial features for a matrix A is to walk down its rows and, for each row, take products of all K-combinations of elements. To determine in which column of A^κ_i each product of elements of A_i belongs, a simple counter can be set to zero for each row of A and incremented after each polynomial feature is generated. This counter gives the column of A^κ into which each expansion feature belongs. This is shown in Algorithm 1.

1:  Input: data matrix A, size N × D
2:  A^κ ← empty N × (D^2 + D)/2 matrix
3:  for i ← 0 to N - 1 do
4:     c_p ← 0
5:     for j ← 0 to D - 1 do
6:        for k ← j to D - 1 do
7:           A^κ_{i,c_p} ← A_{i,j} · A_{i,k}
8:           c_p ← c_p + 1
9:        end for
10:     end for
11:  end for
Algorithm 1 Dense Second Order Polynomial Expansion
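The following NumPy transcription of Algorithm 1 (our sketch; variable names are ours) makes the counter-based column assignment explicit:

```python
import numpy as np

def dense_poly2(A):
    """Degree-2 polynomial expansion of a dense matrix, as in Algorithm 1."""
    N, D = A.shape
    A_exp = np.empty((N, (D * D + D) // 2))
    for i in range(N):
        cp = 0                      # column counter, reset for every row
        for j in range(D):
            for k in range(j, D):   # k >= j gives each unordered pair once
                A_exp[i, cp] = A[i, j] * A[i, k]
                cp += 1
    return A_exp

A = np.array([[1.0, 2.0, 3.0]])
print(dense_poly2(A))  # [[1. 2. 3. 4. 6. 9.]]
```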

3.2 Incomplete Second Degree CSR Polynomial Expansion Algorithm

Now consider how this algorithm might be modified to accept a CSR matrix. Instead of walking directly down the rows of A, we will walk down the sections of c and d partitioned by r, and instead of inserting polynomial features into A^κ, we will insert column numbers into c^κ and data elements into d^κ. Throughout the algorithm, we use variables named nnz, with sub- or superscripts, to indicate the number of nonzero entries in either a matrix or a row of a matrix. See Algorithm 2.

1:  Input: column vector c, data vector d, row pointer vector r, number of columns D
2:  nnzvec^κ ← vector of size |r| - 1
3:  N ← |r| - 1
4:  nnz^κ ← 0
5:  for i ← 0 to N - 1 do
6:     i_start ← r_i
7:     i_stop ← r_{i+1}
8:     nnz_i ← i_stop - i_start
9:     nnz^κ_i ← (nnz_i^2 + nnz_i)/2
10:     nnzvec^κ_i ← nnz^κ_i
11:     nnz^κ ← nnz^κ + nnz^κ_i
12:  end for
13:  r^κ ← vector of size N + 1, with r^κ_0 = 0
14:  c^κ ← vector of size nnz^κ
15:  d^κ ← vector of size nnz^κ
16:  n ← 0
17:  for i ← 0 to N - 1 do
18:     r^κ_{i+1} ← r^κ_i + nnzvec^κ_i
19:     i_start ← r_i
20:     i_stop ← r_{i+1}
21:     nnz_i ← i_stop - i_start
22:     for p_0 ← 0 to nnz_i - 1 do
23:        for p_1 ← p_0 to nnz_i - 1 do
24:           d^κ_n ← d_{i_start+p_0} · d_{i_start+p_1}
25:           c^κ_n ← ?   (which column of A^κ?)
26:           n ← n + 1
27:        end for
28:     end for
29:  end for
Algorithm 2 Incomplete Sparse Second Order Polynomial Expansion

The crux of the problem is at line 25 of Algorithm 2. Given the two arbitrary columns involved in a polynomial feature of a_i, we need to determine the corresponding column of A^κ. We cannot simply reset a counter for each row as we did in the dense algorithm, because only the columns corresponding to nonzero values are stored. Any time a column that would have held a zero value is implicitly skipped, the counter would err.

To develop a general algorithm, we require a mapping from a list of K columns of A to a single column of A^κ. If there are D columns of A and D^κ columns of A^κ, this can be accomplished by a bijective mapping of the following form:

f(j_0, j_1, ..., j_{K-1}) = p_{j_0, j_1, ..., j_{K-1}}        (1)

where (i) j_0 ≤ j_1 ≤ ... ≤ j_{K-1}, (ii) j_0, j_1, ..., j_{K-1} are elements of {0, 1, ..., D - 1}, and (iii) p_{j_0, j_1, ..., j_{K-1}} is an element of {0, 1, ..., D^κ - 1}. (For interaction features, the constraint is j_0 < j_1 < ... < j_{K-1}.)

Stated more verbosely, we require a bijective mapping from the tuples of column indices of the original input to the column index of the corresponding product of features in the polynomial expansion. While any bijective mapping would suffice, a common order in which to produce polynomial features for K = 2 is (0, 0), (0, 1), ..., (0, D-1), (1, 1), (1, 2), ..., (1, D-1), (2, 2), ..., (D-1, D-1), where the elements of the tuples are column indices. That is, the further to the right an index is, the sooner it is incremented. If we want our algorithm to be backwards compatible with existing models, the mapping must respect this same ordering.

4 Construction of Mappings

Within this section, j_0, j_1, and j_2 denote column indices. We will construct mappings for second (K = 2) and third (K = 3) degree interaction and polynomial expansions. To accomplish this, we will require the triangle and tetrahedral numbers. We denote the n-th triangle number as T_2(n) = (n^2 + n)/2 and the n-th tetrahedral number as T_3(n) = (n^3 + 3n^2 + 2n)/6.

For reference, we list the first five triangle and tetrahedral numbers in the following table:

n    T_2(n)    T_3(n)
0    0         0
1    1         1
2    3         4
3    6         10
4    10        20
Table 1: The first five triangle (T_2(n)) and tetrahedral (T_3(n)) numbers.

4.1 Second Degree Interaction Mapping

For second order interaction features, we require a bijective function that maps the elements of the ordered set

((0, 1), (0, 2), ..., (0, D-1), (1, 2), ..., (1, D-1), ..., (D-2, D-1))        (2)

to the elements of the ordered set

(0, 1, 2, ..., (D^2 - D)/2 - 1)        (3)

For D = 4, we can view the desired mapping as one that maps the coordinates of matrix cells to the values 0, 1, ..., 5. If we fill the cells of the matrix with the codomain, the target matrix is as follows:

[ ·  0  1  2 ]
[ ·  ·  3  4 ]
[ ·  ·  ·  5 ]        (4)
[ ·  ·  ·  · ]

where the entry in row j_0, column j_1, displays the value of f(j_0, j_1).

It will be simpler to instead construct a preliminary mapping of the following form:

[ ·  6  5  4 ]
[ ·  ·  3  2 ]
[ ·  ·  ·  1 ]        (5)
[ ·  ·  ·  · ]

and then subtract the preliminary mapping from the total number of elements in the codomain to create the final mapping. Note that in equation 5 we have the following relation:

T_2(D - 2 - j_0) < e ≤ T_2(D - 1 - j_0)        (6)

where e is the value of any cell in row j_0 of equation 5; in particular, the first (leftmost nonempty) cell of row j_0 holds T_2(D - 1 - j_0), and the values decrease by one as j_1 increases.

Therefore, the following mapping will produce equation 5:

ê(j_0, j_1) = T_2(D - 1 - j_0) - (j_1 - j_0 - 1)        (7)

We can now use this result to construct the desired mapping by subtracting it from the size of the codomain:

f(j_0, j_1) = T_2(D - 1) - ê(j_0, j_1)
            = T_2(D - 1) - T_2(D - 1 - j_0) + j_1 - j_0 - 1        (8)
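As a quick sanity check (ours, not part of the paper; function names are ours), the following snippet brute-forces equation 8 for D = 4 and confirms that it reproduces the target matrix of equation 4, i.e., that it enumerates the interaction pairs onto 0, ..., 5 in order:

```python
from itertools import combinations

def T2(n):
    return (n * n + n) // 2

def f2_interaction(j0, j1, D):
    # Equation 8: codomain size minus the preliminary mapping of equation 7.
    return T2(D - 1) - T2(D - 1 - j0) + j1 - j0 - 1

D = 4
cols = [f2_interaction(j0, j1, D) for j0, j1 in combinations(range(D), 2)]
print(cols)                              # [0, 1, 2, 3, 4, 5]
assert cols == list(range(T2(D - 1)))    # bijection onto the codomain
```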

4.2 Second Degree Polynomial Mapping

In this case, the target matrix is of the form

[ 0  1  2  3 ]
[ ·  4  5  6 ]
[ ·  ·  7  8 ]        (9)
[ ·  ·  ·  9 ]

A very similar analysis can be done for this case to yield

f(j_0, j_1) = T_2(D) - T_2(D - j_0) + j_1 - j_0        (10)
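The analogous check (ours) for equation 10 confirms that the diagonal terms are now included and that the pairs map onto 0, ..., 9 in the order of equation 9:

```python
from itertools import combinations_with_replacement

def T2(n):
    return (n * n + n) // 2

def f2_polynomial(j0, j1, D):
    # Equation 10: like equation 8, but the diagonal (j0 == j1) is kept.
    return T2(D) - T2(D - j0) + j1 - j0

D = 4
pairs = combinations_with_replacement(range(D), 2)
cols = [f2_polynomial(j0, j1, D) for j0, j1 in pairs]
print(cols)                          # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
assert cols == list(range(T2(D)))    # one column per degree-2 monomial
```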

4.3 Third Degree Interaction Mapping

For K = 3 we can no longer view the necessary function as mapping matrix coordinates to cell values; rather, it maps tensor coordinates to cell values. For simplicity, we will instead list the column index tuples and their necessary mappings in a table. We shall consider the case of D = 5.

Again, it is simpler to first find a preliminary mapping onto the reversed target indices (plus one) and then create the final mapping by subtracting that preliminary mapping from the number of elements in the codomain. We therefore seek a preliminary mapping of the form

(j_0, j_1, j_2)    preliminary    final
(0, 1, 2)          10             0
(0, 1, 3)          9              1
(0, 1, 4)          8              2
(0, 2, 3)          7              3
(0, 2, 4)          6              4
(0, 3, 4)          5              5
(1, 2, 3)          4              6
(1, 2, 4)          3              7
(1, 3, 4)          2              8
(2, 3, 4)          1              9
Table 2: Form of the required mapping for K = 3, D = 5: each column index tuple, its preliminary mapping value, and its final (target) value.

The mapping has been partitioned according to the j_0 dimension. Note that within each partition is a mapping very similar to the K = 2 equivalent, but with the values shifted by a function of T_3: when j_0 = 0, the values are shifted by T_3(2) = 4; when j_0 = 1, the shift is T_3(1) = 1; and finally, when j_0 = 2, the shift is T_3(0) = 0. The preliminary mapping is therefore

ê(j_0, j_1, j_2) = T_3(D - 3 - j_0) + T_2(D - 1 - j_1) - (j_2 - j_1 - 1)        (11)

and the final mapping is therefore

f(j_0, j_1, j_2) = T_3(D - 2) - ê(j_0, j_1, j_2)        (12)
                 = T_3(D - 2) - T_3(D - 3 - j_0) - T_2(D - 1 - j_1) + j_2 - j_1 - 1        (13)
                 = (3D^2 j_0 - 3D j_0^2 + j_0^3 + 11j_0 - 3j_1^2 - 9j_1)/6 + j_0^2 - 2D j_0 + D j_1 - D + j_2        (14)

4.4 Third Degree Polynomial Mapping

The analysis of the polynomial case is very similar to that of the interaction case. However, the target mapping now includes the edge of the simplex, just as it included the diagonal of the matrix in the second degree polynomial case. The analysis yields the mapping

f(j_0, j_1, j_2) = T_3(D) - T_3(D - 1 - j_0) - T_2(D - j_1) + j_2 - j_1        (15)
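Both third-degree mappings can be sanity-checked by brute force in the same way (our check; function names are ours): each should enumerate its column index tuples, in the stated order, onto 0, ..., |codomain| - 1.

```python
from itertools import combinations, combinations_with_replacement

def T2(n):
    return (n * n + n) // 2

def T3(n):
    return (n ** 3 + 3 * n * n + 2 * n) // 6

def f3_interaction(j0, j1, j2, D):
    # Equations 12-14.
    return T3(D - 2) - T3(D - 3 - j0) - T2(D - 1 - j1) + j2 - j1 - 1

def f3_polynomial(j0, j1, j2, D):
    # Equation 15.
    return T3(D) - T3(D - 1 - j0) - T2(D - j1) + j2 - j1

for D in range(3, 8):
    inter = [f3_interaction(*t, D) for t in combinations(range(D), 3)]
    poly = [f3_polynomial(*t, D) for t in combinations_with_replacement(range(D), 3)]
    assert inter == list(range(T3(D - 2)))   # C(D, 3) interaction features
    assert poly == list(range(T3(D)))        # C(D + 2, 3) polynomial features
print("third-degree mappings verified")
```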

5 Higher Order Mappings

It can be seen that mappings for higher orders can be constructed inductively: a degree-K mapping is a function of the K-simplex numbers and reparameterized versions of all lower-order mappings. However, in practice, higher degrees are not often used, as the dimensionality of the expanded vectors becomes prohibitively large; a degree-K polynomial expansion of a D-dimensional vector has C(D + K - 1, K) dimensions, so a fourth degree expansion of a vector with D = 1000 would have more than 4 × 10^10 dimensions.
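This blow-up is easy to compute directly (a small illustration of ours):

```python
from math import comb

D = 1000
for K in (2, 3, 4):
    # Number of degree-K monomials over D features (with repetition).
    print(K, comb(D + K - 1, K))
# 2 500500
# 3 167167000
# 4 41917125250
```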

6 Final CSR Polynomial Expansion Algorithm

With a mapping from the columns of A to a single column of A^κ, we can now write the final form of the innermost loop of the algorithm from Section 3.2. Let the K = 2 polynomial mapping of equation 10 be denoted f_p. Then the innermost loop can be completed as follows:

1:  for p_1 ← p_0 to nnz_i - 1 do
2:     j_0 ← c_{i_start+p_0}
3:     j_1 ← c_{i_start+p_1}
4:     col^κ ← f_p(j_0, j_1)
5:     c^κ_n ← col^κ
6:     d^κ_n ← d_{i_start+p_0} · d_{i_start+p_1}
7:     n ← n + 1
8:  end for
Completed Inner Loop of Algorithm 2
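To make the completed algorithm concrete, here is a short Python sketch (ours, not the paper's reference implementation; SciPy is used only as the CSR container) of the full degree-2 polynomial expansion, following Algorithm 2 with the inner loop completed by the mapping of equation 10:

```python
import numpy as np
from scipy.sparse import csr_matrix

def T2(n):
    return (n * n + n) // 2

def csr_poly2(X):
    """Degree-2 polynomial expansion of a CSR matrix with no densification."""
    N, D = X.shape
    c, d, r = X.indices, X.data, X.indptr

    # First pass (lines 2-12 of Algorithm 2): per-row expanded nnz counts.
    nnz_rows = np.array([T2(int(r[i + 1] - r[i])) for i in range(N)], dtype=np.int64)
    r_exp = np.zeros(N + 1, dtype=np.int64)
    r_exp[1:] = np.cumsum(nnz_rows)
    nnz_total = int(r_exp[-1])
    c_exp = np.empty(nnz_total, dtype=np.int64)
    d_exp = np.empty(nnz_total, dtype=d.dtype)

    # Second pass: multiply nonzero pairs and place them with the mapping.
    n = 0
    for i in range(N):
        start, stop = r[i], r[i + 1]
        for p0 in range(start, stop):
            j0 = int(c[p0])
            for p1 in range(p0, stop):
                j1 = int(c[p1])
                c_exp[n] = D * j0 - (j0 * j0 + j0) // 2 + j1  # eq. 10 in closed form
                d_exp[n] = d[p0] * d[p1]
                n += 1
    # The expansion has T2(D) columns.
    return csr_matrix((d_exp, c_exp, r_exp), shape=(N, T2(D)))

X = csr_matrix(np.array([[0.0, 2.0, 0.0, 1.0],
                         [3.0, 0.0, 4.0, 0.0]]))
print(csr_poly2(X).toarray())
```

For heavy use, the two passes would be compiled (e.g., with Cython, as in our experiments); a similar Cython implementation of this approach ships in scikit-learn's PolynomialFeatures transformer for CSR inputs.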

The algorithm can be generalized to higher degrees by simply adding more nested loops, using higher order mappings, modifying the output dimensionality, and adjusting the counting of nonzero polynomial features in line 9.

7 Higher Degree and Interaction Algorithms

Most of the steps for higher degree and interaction expansions (as opposed to polynomial) are the same as in the polynomial case. The differences are that for higher degrees, an extra loop is needed to iterate over another column index, and a different mapping is required. For interaction expansions, the column indices are never allowed to equal each other, so each loop executes one less time, and an interaction mapping is required. Also, for interaction expansions, the way nnz^κ_i is computed on line 9 of Algorithm 2 changes: instead of (nnz_i^2 + nnz_i)/2, we have (nnz_i^2 - nnz_i)/2.

8 Time Complexity

8.1 Analytical

Calculating degree-K polynomial features via our method for a vector of dimensionality D and density d requires T_K(dD) products, where T_K(n) = C(n + K - 1, K) is the n-th K-simplex number. The complexity of the algorithm, for fixed K, is therefore

O(T_K(dD))        (16)
= O( (dD)(dD + 1) ··· (dD + K - 1) / K! )        (17)
= O(d^K D^K)        (18)

For a matrix of size N × D, the complexity is therefore O(N d^K D^K). The dense algorithm (Algorithm 1) does not leverage sparsity, and its complexity is O(N D^K). Since 0 < d < 1, the sparse algorithm does a factor of d^K less work, an advantage that grows with the degree of the polynomial expansion.
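To make the d^K factor concrete, the following back-of-the-envelope count (ours) compares the number of feature products formed per row by the two approaches:

```python
from math import comb

def n_products_sparse(D, d, K):
    # T_K(dD): one product per degree-K monomial over the ~dD nonzero entries.
    nnz = int(d * D)
    return comb(nnz + K - 1, K)

def n_products_dense(D, K):
    # One product per degree-K monomial over all D entries.
    return comb(D + K - 1, K)

D, d = 1000, 0.01
for K in (2, 3):
    print(K, n_products_sparse(D, d, K), n_products_dense(D, K))
# 2 55 500500
# 3 220 167167000
```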

8.2 Empirical Results

To empirically verify the average time complexity of our algorithm, we implemented both the sparse version and the baseline in the Cython programming language so that the results would be directly comparable. We sought the relationships between runtime and the instance count (N), the instance dimensionality (D), and the instance density (d).

To find these relationships, we individually varied N, D, and d while holding the remaining two constant. For each of these configurations, we generated random matrices and averaged the runtime over 20 runs to reduce variance. The time to densify did not count against the dense algorithm. Figure 1 summarizes our findings.

Varying the density (d) (column 1) shows that our algorithm scales polynomially with d, but that the baseline is unaffected by it. The runtimes of both algorithms increase polynomially with the dimensionality (D), but ours at a significantly reduced rate. Likewise, both algorithms scale linearly with the instance count (N), but ours to a much lesser degree.

Note that the point at which the sparse and dense algorithms intersect when varying d is to the right of d = 0.5, which is when a matrix technically becomes sparse. The point at which this happens will depend on K, but the results show that our algorithm is useful even for some dense matrices.


Figure 1: Summary performance comparison plots for the quadratic (top) and cubic (bottom) cases showing how the algorithms' performance varies with d, D, and N; our sparse algorithm is shown in blue, the dense algorithm in red. Each point in each graph is an average of 20 runs, and the time used in densification is not included in the dense-algorithm timings. In the quadratic case, sparsity loses its advantage at a density of about 67%, and at about 77% for the cubic case, though these precise intersection points depend on the fixed values of the other parameters. In general, taking advantage of sparsity shows large benefits, so large that it is difficult to see that the performance does not actually change linearly with D (column 2); figure 2 gives further details.

Figure 2: A closer view of only the sparse runtimes while varying D (left) and N (right), for both the quadratic and cubic cases. The left subplot shows that varying D gives polynomial growth in runtime: quadratic for K = 2 (dashed line) and cubic for K = 3 (dotted line). These nonlinearities were not apparent in Figure 1 due to the much greater runtimes of the dense algorithm. The right subplot shows linear growth in runtime for both. These findings are in accordance with the analysis of section 8.1.

9 Conclusion

We have developed an algorithm for performing polynomial feature expansions on CSR matrices that scales polynomially with respect to the density of the matrix. The areas within machine learning that this work touches are not en vogue, but they are workhorses of industry, and every improvement in core representations has an impact across a broad range of applications.

References

  • [1] G.-X. Yuan, C.-H. Ho, and C.-J. Lin, “Recent advances of large-scale linear classification,” Proceedings of the IEEE, vol. 100, no. 9, pp. 2584–2603, 2012.
  • [2]

    J. Gergonne, “The application of the method of least squares to the interpolation of sequences,”

    Historia Mathematica, vol. 1, no. 4, pp. 439–447, 1974.
  • [3]

    K. Smith, “On the standard deviations of adjusted and interpolated values of an observed polynomial function and its constants and the guidance they give towards a proper choice of the distribution of observations,”

    Biometrika, vol. 12, no. 1/2, pp. 1–85, 1918.
  • [4] P. A. Barker, F. A. Street-Perrott, M. J. Leng, P. Greenwood, D. Swain, R. Perrott, R. Telford, and K. Ficken, “A 14,000-year oxygen isotope record from diatom silica in two alpine lakes on Mt. Kenya,” Science, vol. 292, no. 5525, pp. 2307–2310, 2001.
  • [5] Y.-W. Chang, C.-J. Hsieh, K.-W. Chang, M. Ringgaard, and C.-J. Lin, “Training and testing low-degree polynomial data mappings via linear SVM,” Journal of Machine Learning Research, vol. 11, no. Apr, pp. 1471–1490, 2010.
  • [6] P. Shaw, D. Greenstein, J. Lerch, L. Clasen, R. Lenroot, N. Gogtay, A. Evans, J. Rapoport, and J. Giedd, “Intellectual ability and cortical development in children and adolescents,” Nature, vol. 440, no. 7084, pp. 676–679, 2006.
  • [7] Y. Saad, “SPARSKIT: A basic tool kit for sparse matrix computations,” 1994.
  • [8] H. Liu, S. Yu, Z. Chen, B. Hsieh, and L. Shao, “Sparse matrix-vector multiplication on NVIDIA GPU,” International Journal of Numerical Analysis & Modeling, Series B, vol. 3, no. 2, pp. 185–191, 2012.
  • [9] A. Buluç, J. T. Fineman, M. Frigo, J. R. Gilbert, and C. E. Leiserson, “Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks,” in Proceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures.   ACM, 2009, pp. 233–244.
  • [10] N. Bell and M. Garland, “Efficient sparse matrix-vector multiplication on CUDA,” NVIDIA Technical Report NVR-2008-004, NVIDIA Corporation, Tech. Rep., 2008.
  • [11] J. B. White and P. Sadayappan, “On improving the performance of sparse matrix-vector multiplication,” in High-Performance Computing, 1997. Proceedings. Fourth International Conference on.   IEEE, 1997, pp. 66–71.
  • [12] G. Guennebaud, B. Jacob et al., “Eigen v3,” http://eigen.tuxfamily.org, 2010.
  • [13] F. Bastien, P. Lamblin, R. Pascanu, J. Bergstra, I. Goodfellow, A. Bergeron, N. Bouchard, D. Warde-Farley, and Y. Bengio, “Theano: new features and speed improvements,” arXiv preprint arXiv:1211.5590, 2012.
  • [14] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
  • [15] R. Koenker, P. Ng et al., “SparseM: A sparse matrix package for R,” Journal of Statistical Software, vol. 8, no. 6, pp. 1–9, 2003.