Abstract
Despite the importance of sparse matrices in numerous fields of science, software implementations remain difficult to use for nonexpert users, generally requiring the understanding of underlying details of the chosen sparse matrix storage format. In addition, to achieve good performance, several formats may need to be used in one program, requiring explicit selection and conversion between the formats. This can be both tedious and errorprone, especially for nonexpert users. Motivated by these issues, we present a userfriendly sparse matrix class for the C++ language, with a highlevel application programming interface deliberately similar to the widely used MATLAB language. The class internally uses two main approaches to achieve efficient execution: (i) a hybrid storage framework, which automatically and seamlessly switches between three underlying storage formats (compressed sparse column, RedBlack tree, coordinate list) depending on which format is best suited and/or available for specific operations, and (ii) templatebased metaprogramming to automatically detect and optimise execution of common expression patterns. To facilitate relatively quick conversion of research code into production environments, the class and its associated functions provide a suite of essential sparse linear algebra functionality (eg., arithmetic operations, submatrix manipulation) as well as highlevel functions for sparse eigendecompositions and linear equation solvers. The latter are achieved by providing easytouse abstractions of the lowlevel ARPACK and SuperLU libraries. The source code is open and provided under the permissive Apache 2.0 license, allowing unencumbered use in both opensource projects and closedsource commercial products.
Keywords: numerical linear algebra, sparse matrix, mathematical software, C++ language.
AMS MSC2010 codes: 65F50, 65Y04, 65Y15, 68N99, 68P05, 15A99.
1 Introduction
Recent decades have seen the frontiers of scientific computing increasingly push towards the use of larger and larger datasets. In fact, frequently the data to be represented is so large that it cannot fully fit into working memory. Fortunately, in many cases the data has many zeros and can be represented in a compact manner, allowing users to work with sparse matrices of extreme size with few nonzero elements. However, converting code from using dense matrices to using sparse matrices—a common task when scaling code to larger data—is not always straightforward.
Existing opensource frameworks may provide several separate sparse matrix classes, each with their own data storage format. For instance, SciPy [23] has 7 sparse matrix classes: bsr_matrix, coo_matrix, csc_matrix, csr_matrix, dia_matrix, dok_matrix, and lil_matrix. Each storage format is best suited for efficient execution of a specific set of operations (eg., matrix multiplication vs. incremental matrix construction). Other frameworks may provide only one sparse matrix class, with severe runtime penalties if it is not used in the right way. This can be challenging and bewildering for users who simply want to create and use sparse matrices, and do not have the time, expertise, or desire to understand the advantages and disadvantages of each format. To achieve good performance, several formats may need to be used in one program, requiring explicit selection and conversion between the formats. This plurality of sparse matrix classes complicates the programming task, increases the likelihood of bugs, and adds to the maintenance burden.
Motivated by the above issues, we present a userfriendly sparse matrix class for the C++ language [29], with a highlevel application programming interface (function syntax) that is deliberately similar to MATLAB. The sparse matrix class uses a hybrid storage framework, which automatically and seamlessly switches between three data storage formats, depending on which format is best suited and/or available for specific operations:

Compressed Sparse Column (CSC), used for efficient and nuanced implementation of core arithmetic operations such as matrix multiplication and addition, as well as efficient reading of individual elements;

RedBlack Tree (RBT), used for both robust and efficient incremental construction of sparse matrices (i.e., construction via setting individual elements onebyone, not necessarily in order);

Coordinate List (COO), used for lowmaintenance and straightforward implementation of relatively complex and/or lesserused sparse matrix functionality.
The COO format is important to point out, as the source code for the sparse matrix class is distributed and maintained as part of the opensource Armadillo library [25]. Due to its simpler nature, the COO format facilitates functionality contributions from timeconstrained and/or nonexpert users, as well as reducing maintenance and debugging overhead for the library maintainers.
To further promote efficient execution, the sparse matrix class implements a delayed evaluation^{1}^{1}1 In delayed evaluation, the evaluation of a given compound mathematical expression is delayed until its value is required (ie., assigned to a variable). This is in contrast to eager evaluation (also known as strict evaluation), where each component of a compound expression is evaluated immediately. approach [19] based on C++ features such as operator overloading [29] and template metaprogramming [1, 31]. In contrast to simply using bruteforce evaluation of mathematical expressions, the delayed evaluation framework allows automatic compiletime analysis of such expressions, which in turns allows for automatic detection and optimisation of common expression patterns.
Overall, the sparse matrix class provides an intuitive interface that is very close to a typical dense matrix API; this can help with rapid transition of densespecific code to sparsespecific code. In addition, we demonstrate that the overhead of the hybrid format is minimal, and that the format is able to choose the optimal representation for a variety of sparse linear algebra tasks. This makes the format and implementation suitable for realworld prototyping and production usage.
Although there are many other sparse matrix implementations in existence, to our knowledge ours is the first to offer a unified interface with automatic format switching under the hood. Most toolkits are limited to either a single format or multiple formats the user must manually convert between. As mentioned earlier, SciPy contains no fewer than seven formats, and the comprehensive SPARSKIT package [24] contains 16. In these toolkits the user must manually convert between formats. On the other hand, both MATLAB and GNU Octave [11] contain sparse matrix implementations, but they supply only the CSC format [9], meaning that users must write their code in special ways to ensure its efficiency [20].
We continue the paper as follows. In Section 2 we overview the functionality provided by the sparse matrix class and its associated functions. The delayed evaluation approach is overviewed in Section 3. In Section 4 we describe the underlying storage formats used by the class, and the scenarios that each of the formats is best suited for. In Section 5 we discuss the costs for switching between the formats. Section 6 provides an empirical evaluation showing the advantages of the hybrid storage framework and the delayed evaluation approach. The salient points and avenues for further exploitation are summarised in Section 7. This paper is a revised and extended version of our earlier work [26].
2 Functionality
To allow prototyping directly in C++ as well as to facilitate relatively quick conversion of research code into production environments, the sparse matrix class and its associated functions provide a userfriendly suite of essential sparse linear algebra functionality, including fundamental operations such as addition, matrix multiplication and submatrix manipulation. Various sparse eigendecompositions and linear equation solvers are also provided. C++ language features such as overloading of operators (eg., * and +) [29] are exploited to allow mathematical operations with matrices to be expressed in a concise and easytoread manner. For example, given sparse matrices A, B, and C, a mathematical expression such as
D = (A + B) C^{T}
can be written directly in C++ as
sp_mat D = 0.5 * (A + B) * C.t();
where sp_mat is our sparse matrix class. Lowlevel details such as memory management are hidden, allowing the user to concentrate effort on mathematical details. Table 1 lists a subset of the available functionality, while Figure 1 contains a complete C++ program which briefly demonstrates usage of the sparse matrix class. Sparse eigendecompositions and linear equation solutions are accomplished through integration with lowlevel routines in the de facto standard ARPACK [17] and SuperLU libraries [18]. The resultant highlevel functions automatically take care of the cumbersome and errorprone lowlevel management required with these libraries.
In effect, the aggregate of the sparse matrix class, operator overloading and associated functions on sparse matrices is an instance of a Domain Specific Language (sparse linear algebra) embedded within the host C++ language [21, 27]. This allows complex algorithms relying on sparse matrices to be easily developed and integrated within a larger C++ program, making the sparse matrix class directly useful in application/product development.
3 TemplateBased Optimisation of Compound Expressions
The sparse matrix class uses a delayed evaluation approach, allowing several operations to be combined to reduce the amount of computation and/or temporary objects. In contrast to bruteforce evaluations, delayed evaluation can provide considerable performance improvements as well as reduced memory usage [32]. The delayed evaluation machinery is accomplished through template metaprogramming [1, 31], where a typebased signature of a compound expression (set of consecutive mathematical operations) is automatically constructed. The C++ compiler is then automatically induced to detect common expression patterns at compile time, followed by selecting the most computationally efficient implementations.
As an example of the possible efficiency gains, let us consider the expression trace(A.t() * B), which often appears as a fundamental quantity in semidefinite programs [30]
. These computations are thus used in a wide variety of diverse fields, most notably machine learning
[5, 13, 16]. A bruteforce implementation would evaluate the transpose first, A.t(), and store the result in a temporary matrix T1. The next operation would be a time consuming matrix multiplication, T1 * B, with the result stored in another temporary matrix T2. The trace operation (sum of diagonal elements) would then be applied on T2. The explicit transpose, full matrix multiplication and creation of the temporary matrices is suboptimal from an efficiency point of view, as for the trace operation we require only the diagonal elements of the A.t() * B expression.Templatebased expression optimisation can avoid the unnecessary operations. Let us declare two lightweight objects, Op and Glue, where Op objects are used for representing unary operations, while Glue objects are used for representing binary operations. The objects are lightweight as they do not store actual sparse matrix data; instead the objects only store references to matrices and/or other Op and Glue objects. Ternary and more complex operations are represented through combinations of Op and Glue objects. The exact type of each Op and Glue object is automatically inferred from a given mathematical expression through template metaprogramming.
In our example, the expression A.t() is automatically converted to an instance of the lightweight Op object with the following type:
Op<sp_mat, op_trans>
where Op<...> indicates that Op is a template class, with the items between ‘<’ and ‘>’ specifying template parameters. In this case the Op<sp_mat, op_trans> object type indicates that a reference to a matrix is stored and that a transpose operation is requested. In turn, the compound expression A.t() * B is converted to an instance of the lightweight Glue object with the following type:
Glue< Op<sp_mat, op_trans>, sp_mat, glue_times>
where the Glue object type in this case indicates that a reference to the preceding Op object is stored, a reference to a matrix is stored, and that a matrix multiplication operation is requested. In other words, when a user writes the expression trace(A.t() * B), the C++ compiler is induced to represent it internally as trace(Glue< Op<sp_mat, op_trans>, sp_mat, glue_times>(A,B)).
There are several implemented forms of the trace() function, one of which is automatically chosen by the C++ compiler to handle the Glue< Op<sp_mat, op_trans>, sp_mat, glue_times> expression. The specific form of trace() takes references to the A and B matrices, and executes a partial matrix multiplication to obtain only the diagonal elements of the A.t() * B expression. All of this is accomplished without generating temporary matrices. Furthermore, as the Glue and Op objects only hold references, they are in effect optimised away by modern C++ compilers [31]: the resultant machine code appears as if the Glue and Op objects never existed in the first place.
The templatebased delayed evaluation approach has also been employed for other functions, such as the diagmat() function, which obtains a diagonal matrix from a given expression. For example, in the expression diagmat(A + B), only the diagonal components of the A + B expression are evaluated.
4 Underlying Sparse Storage Formats
The three underlying storage formats (CSC, RBT, COO) were chosen to give overall efficient execution across several use cases, as well as minimising the difficulty of implementation and code maintenance burden where possible. Specifically, our focus is on the following main use cases:

flexible adhoc construction and elementwise modification of sparse matrices via unordered insertion of elements, where each new element is inserted at a random location;

incremental construction of sparse matrices via quasiordered insertion of elements, where each new element is inserted at a location that is past all the previous elements according to columnmajor ordering;

multiplication of dense vectors with sparse matrices;

multiplication of two sparse matrices;

operations involving bulk coordinate transformations, such as flipping matrices column or rowwise.
Below we briefly describe each storage format and its benefits and limitations. We use to indicate the number of nonzero elements of the matrix, while n_rows and n_cols indicate the number of rows and columns, respectively.
4.1 Compressed Sparse Column (CSC)
The CSC format [24] uses columnmajor ordering where the elements are stored columnbycolumn, with consecutive nonzero elements in each column stored consecutively in memory. Three arrays are used to represent a sparse matrix:

the values array, which is a contiguous array of floating point numbers holding the nonzero elements,

the rows array, which is a contiguous array of integers holding the corresponding row indices (ie., the th entry contains the row of the th element), and

the col_offsets array, which is a contiguous array of integers holding offsets to the values array, with each offset indicating the start of elements belonging to each column.
Following C++ convention [29], all arrays use zerobased indexing, ie., the initial position in each array is denoted by . For consistency, element locations within a matrix are also encoded as starting at zero, ie., the initial row and column are both denoted by . Furthermore, the row indices for elements in each column are kept sorted in ascending manner. In many applications, sparse matrices have more nonzero elements than the number of columns, leading to the col_offsets array being typically much smaller than the values array.
Let us denote the th entry in the col_offsets array as , the th entry in the rows array as , and the th entry in the values array as . The number of nonzero elements in column is determined using , where, by definition, is always and is equal to . If column has nonzero elements, then the first element is obtained via , and is the corresponding row of the element. An example of this format is shown in Figure 2(b).
The CSC format is wellsuited for efficient sparse linear algebra operations such as vectormatrix multiplication. This is due to consecutive nonzero elements in each column being stored next to each other in memory, which allows modern CPUs to speculatively read ahead elements from the main memory into fast cache memory [22]. The CSC format is also suited for operations that do not change the structure of the matrix, such as elementwise operations on the nonzero elements (eg., multiplication by a scalar). The format also affords relatively efficient random element access; to locate an element (or determine that it is not stored), a single lookup to the beginning of the desired column can be performed, followed by a binary search [6] through the rows array to find the element.
While the CSC format provides a compact representation yielding efficient execution of linear algebra operations, it has two main disadvantages. The first disadvantage is that the design and implementation of efficient algorithms for many sparse matrix operations (such as matrixmatrix multiplication) tend to be nontrivial [3, 24]. This stems not only from the sparse nature of the data, but also due to the need to (i) explicitly keep track of the column offsets, and (ii) keep the row indices for elements in each column sorted in ascending manner. In our experience, the process of designing and implementing efficient matrix processing algorithms in CSC is a timeconsuming affair — it is both finicky and prone to subtle bugs.
The second disadvantage of CSC is the computational effort required to insert a new element [9]. In the worstcase scenario, memory for three new largersized arrays (containing the values and locations) must first be allocated, the position of the new element determined within the arrays, data from the old arrays copied to the new arrays, data for the new element placed in the new arrays, and finally the memory used by the old arrays deallocated. As the number of elements in the matrix grows, the entire process becomes slower.
There are opportunities for some optimisation, such as using oversized storage to reduce memory allocations, where a new element past all the previous elements can be readily inserted. However, this does not help when a new nonzero element is inserted between two existing nonzero elements. It is also possible to perform batch insertions with some speedup by first sorting all the elements to be inserted and then merging with the existing data arrays. While the above approaches can be effective, they require the user to explicitly deal with cumbersome lowlevel storage details instead of focusing on highlevel functionality.
The CSC format was chosen over the related Compressed Sparse Row (CSR) format [24] for two main reasons: (i) to ensure compatibility with external libraries such as the SuperLU solver [18], and (ii) to ensure consistency with the surrounding infrastructure provided by the Armadillo library, which uses columnmajor dense matrix representation to take advantage of lowlevel functions provided by LAPACK [2].
4.2 RedBlack Tree (RBT)
To address the efficiency problems with element insertion at arbitrary locations, we first represent each element as a 2tuple, = , where index encodes the location of the element as . Zerobased indexing is used. This encoding implicitly assumes columnmajor ordering of the elements. Secondly, rather than using a simple linked list or an array based representation, the list of the tuples is stored as a RedBlack Tree (RBT), a selfbalancing binary search tree [6].
Briefly, an RBT is a collection of nodes, with each node containing the 2tuple described above and links to two children nodes. There are two constraints: (i) each link points to a unique child node, and (ii) there are no links to the root node. The index within each 2tuple is used as the key to identify each node. An example of this structure for a simple sparse matrix is shown in Figure 2(c). The ordering of the nodes and height of the tree (number of node levels below the root node) is controlled so that searching for a specific index (ie., retrieving an element at a specific location) has worstcase complexity of . Insertion and removal of nodes (ie., matrix elements), also has the worstcase complexity of . If a node to be inserted is known to have the largest index so far (eg., during incremental matrix construction), the search for where to place the node can be omitted, which in practice can considerably speed up the insertion process.
With the above element encoding, traversing an RBT in an ordered fashion (from the smallest to largest index) is equivalent to reading the elements in columnmajor ordering. This in turn allows for quick conversion of matrix data stored in RBT format into CSC format. The location of each element is simply decoded via row = (), and column = , where, for clarity, is the integer version of , rounded towards zero. These operations are accomplished via direct integer arithmetic on CPUs.
In our hybrid storage framework, the RBT format is used for incremental construction of sparse matrices, either in an ordered or unordered fashion, and a subset of elementwise operations (such as inplace addition of values to specified elements). This in turn enables users to construct sparse matrices in the same way they might construct dense matrices—for instance, a loop over elements to be inserted without regard to storage format.
While the RBT format allows for fast element insertion, it is less suited than CSC for efficient linear algebra operations. The CSC format allows for exploitation of fast caches in modern CPUs due to the consecutive storage of nonzero elements in memory [22]. In contrast, accessing consecutive elements in the RBT format requires traversing the tree (following links from node to node), which in turn entails accessing node data that is not guaranteed to be consecutively stored in memory. Furthermore, obtaining the column and row indices requires explicit decoding of the index stored in each node, rather than a simple lookup in the CSC format.
4.3 Coordinate List Representation (COO)
The Coordinate List (COO) is a general concept where a list = of 3tuples represents the nonzero elements in a matrix. Each 3tuple contains the location indices and value of the element, ie., = . The format does not prescribe any ordering of the elements, and a simple linked list [6] can be used to represent . However, in a computational implementation geared towards linear algebra operations [24], is often represented as a set of three arrays:

the values array, which is a contiguous array of floating point numbers holding the nonzero elements of the matrix;

the rows array, a contiguous array of integers holding the row index of the corresponding value; and

the columns array, a contiguous array of integers holding the column index of the corresponding value.
As per the CSC format, all arrays use zerobased indexing, ie., the initial position in each array is .
The arraybased representation of COO is related to CSC, with the main difference that for each element the column indices are explicitly stored. This leads to the primary advantage of the COO format: it can greatly simplify the implementation of matrix processing algorithms. It also tends to be a natural format many nonexpert users expect when first encountering sparse matrices. However, due to the explicit representation of column indices, the COO format contains redundancy and is hence less efficient (spacewise) than CSC for representing sparse matrices. An example of this is shown in Figure 2(d).
To contrast the differences in effort required in implementing matrix processing algorithms in CSC and COO, let us consider the problem of sparse matrix transposition. When using the COO format this is trivial to implement: simply swap the rows array with the columns array and then resort the elements so that columnmajor ordering is maintained. However, the same task for the CSC format is considerably more specialised: an efficient implementation in CSC would likely use an approach such as the elaborate TRANSP algorithm by Bank and Douglas [3], which is described through a 47line pseudocode algorithm with annotations across two pages of text.
Our initial implementation of sparse matrix transposition used the COO based approach. COO was used simply due to shortage of available time for development and the need to flesh out other parts of sparse matrix functionality. When time allowed, we reimplemented sparse matrix transposition to use the abovementioned TRANSP algorithm. This resulted in considerable speedups, due to no longer requiring the timeconsuming sort operation. We verified that the new CSCbased implementation is correct by comparing its output against the previous COObased implementation on a large set of test matrices.
The relatively straightforward nature of COO format hence makes it wellsuited for: (i) functionality contributed by timeconstrained and/or nonexpert users, (ii) relatively complex and/or lesscommon sparse matrix operations, and (iii) verifying the correct implementation of algorithms in the more complex CSC format. The volunteer driven nature of the Armadillo project makes its vibrancy and vitality depend in part on contributions received from users and the maintainability of the codebase. The number of core developers is small (ie., the authors of this paper), and hence difficulttounderstand or difficulttomaintain code tends to be avoided, since the resources are simply not available to handle that burden.
The COO format is currently employed for lesscommonly used tasks that involve bulk coordinate transformations, such as reverse() for flipping matrices column or rowwise, and repelem(), where a matrix is generated by replicating each element several times from a given matrix. While it is certainly possible to adapt these functions to directly use the more complex CSC format, at the time of writing we have spent our timeconstrained efforts on optimising and debugging more commonly used parts of the sparse matrix class.
5 Automatically Switching Between Storage Formats
To avoid the problems associated with selection and manual conversion between formats, our sparse matrix class uses a hybrid storage framework that automatically and seamlessly switches between the data storage formats described in Section 4. By default, matrix elements are stored in CSC format. When required, data in CSC format is internally converted to either the RBT or COO format, on which an operation or set of operations is performed. The matrix is automatically converted (‘synced’) back to the CSC format the next time an operation requiring the CSC format is performed.
The actual underlying storage details and conversion operations are completely hidden from the user, who may not necessarily be knowledgeable about (or care to learn about) sparse matrix storage formats. This allows for simplified user code, which in turn increases readability and lowers maintenance. In contrast, other toolkits without automatic format conversion can cause either slow execution (as a nonoptimal storage format might be used), or require many manual conversions. As an example, Figure 3 shows a short Python program using the SciPy toolkit [23] and a corresponding C++ program using the hybrid sparse matrix class. Manually initiated format conversions are required for efficient execution in the SciPy version; this causes both development time and code required to increase. If the user does not carefully consider the type of their sparse matrix at all times, they are likely to write inefficient code. In contrast, in the C++ program the format conversion is done automatically and behind the scenes.
A potential drawback of the automatic conversion between formats is the added computational cost. However, it turns out that COO/CSC conversions can be done in time that is linear in the number of nonzero elements in the matrix, and that CSC/RBT conversions can be done at worst in loglinear time. Since most sparse matrix operations are more expensive (eg., matrix multiplication), the conversion overhead turns out to be mostly negligible in practice. Below we present straightforward algorithms for conversion and note their asymptotic complexity in terms of the notation [6]. This is followed by discussing practical considerations that are not directly taken into account by the notation.
5.1 Conversion Between COO and CSC
Since the COO and CSC formats are quite similar, the conversion algorithms are straightforward. In fact the only parts of the formats to be converted are the columns and col_offsets arrays with the rows and values arrays remaining unchanged.
The algorithm for converting COO to CSC is given in Figure 4(a). In summary, the algorithm first determines the number of elements in each column (lines 68), and then ensures that the values in the col_offsets array are consecutively increasing (lines 910) so that they indicate the starting index of elements belonging to each column within the values array. The operations listed on line 5 and lines 910 each have a complexity of approximately , while the operation listed on lines 68 has a complexity of , where is the number of nonzero elements in the matrix and n_cols is the number of columns. The complexity is hence . As in most applications the number of nonzero elements will be considerably greater than the number of columns, the overall asymptotic complexity in these cases is .
The corresponding algorithm for converting CSC to COO is shown in Figure 4(b). In essence the col_offsets array is unpacked into a columns array with length . As such, the asymptotic complexity of this operation is .
5.2 Conversion Between CSC and RBT
The conversion between the CSC and RBT formats is also straightforward and can be accomplished using the algorithms shown in Figure 5. In essence, the CSC to RBT conversion involves encoding the location of each matrix element to a linear index, followed by inserting a node with that index and the corresponding element value into the RBT. The worstcase complexity for inserting all elements into an RBT is . However, as the elements in the CSC format are guaranteed to be stored according to columnmajor ordering (as per Section 4.1), and the location encoding assumes columnmajor ordering (as per Section 4.2), the insertion of a node into an RBT can be accomplished without searching for the node location. While the worstcase cost of is maintained due to tree maintenance (ie., controlling the height of the tree) [6], in practice the amortised insertion cost is typically lower due to avoidance of the search.
Converting an RBT to CSC involves traversing through the nodes of the tree from the lowest to highest index, which is equivalent to reading the elements in columnmajor format. The value stored in each node is hence simply copied into the corresponding location in the CSC values array. The index stored in each node is decoded into row and column indices, as per Section 4.2, with the CSC row_indices and col_offsets arrays adjusted accordingly. The worstcase cost for finding each element in the RBT is , which results in the asymptotic worstcase cost of for the whole conversion. However, in practice most consecutive elements are in nearby nodes, which on average reduces the number of traversals across nodes, resulting in considerably lower amortised conversion cost.
5.3 Practical Considerations
Since the conversion algorithms given in Figures 4 and 5 are quite straightforward, the notation does not hide any large constant factors. For COO/CSC conversions the cost is , while for CSC/RBT conversions the worstcase cost in . In contrast, many mathematical operations on sparse matrices have much higher computational cost than the conversion algorithms. Even simply adding two sparse matrices can be much more expensive than a conversion. Although the addition operation still takes time (assuming is identical for both matrices), there is a lot of hidden constant overhead, since the sparsity pattern of the resulting matrix must be computed first [24]. A similar situation applies for multiplication of two sparse matrices, which for square matrices takes time [8], but in practice tends to be much slower due to the many passes and extra overhead of computing the output sparsity structure [3].
Sparse matrix factorisations are much more expensive, meaning that any conversion overhead is essentially negligible. A sparse LU factorisation is superlinear [15] as well as other factorisations like the Cholesky factorisation, which costs time [14]. Other factorisations and higherlevel operations exhibit similar complexity characteristics. Given this, the cost of format conversions is heavily outweighed by the user convenience that they allow.
6 Empirical Evaluation
To empirically demonstrate the usefulness of the hybrid storage framework and the templatebased expression optimisation mechanism, we have performed a set of experiments, measuring the wallclock time (elapsed real time) required for:

unordered element insertion into a sparse matrix, where the elements are inserted at random locations in random order;

quasiordered element insertion into a sparse matrix, where each new inserted element is at a random location that is past the previously inserted element, under the constraint of columnmajor ordering;

calculation of , where and are randomly generated sparse matrices;

obtaining a diagonal matrix from the expression, where and are randomly generated sparse matrices.
In all cases the sparse matrices have a size of 10,00010,000, with four settings for the density of nonzero elements: 0.01%, 0.1%, 1%, 10%. The experiments were done on a machine with an Intel Xeon E52630L CPU running at 2 GHz, using the GCC v5.4 compiler. Each experiment was repeated 10 times, and the average wallclock time is reported. The wallclock time measures the total time taken from the start to the end of each run, and includes necessary overheads such as memory allocation.
Figure 6 shows the average wallclock time taken for element insertion done directly using the underlying storage formats (ie., CSC, COO, RBT, as per Section 4), as well as the hybrid approach which uses RBT followed by conversion to CSC. The CSC and COO formats use oversized storage as a form of optimisation (as mentioned in Section 4.1), where the underlying arrays are grown in chunks of 1024 elements in order to reduce both the number of memory reallocations and array copy operations due to element insertions.
In all cases bar one, the RBT format is the quickest for insertion, generally by one or two orders of magnitude. The conversion from RBT to CSC adds negligible overhead. For the single case of quasiordered insertion to reach the density of 0.01%, the COO format is slightly quicker than RBT. This is due to the relatively simple nature of the COO format, as well as the ordered nature of the element insertion where the elements are directly placed into the oversized COO arrays (ie., no sorting required). Furthermore, due to the very low density of nonzero elements and the chunked nature of COO array growth, the number of reallocations of the COO arrays is relatively low. In contrast, inserting a new element into RBT requires the allocation of memory for a new node, and modifying the tree to append the node. For larger densities (), the COO element insertion process quickly becomes more time consuming than RBT element insertion, due to to an increased amount of array reallocations and the increased size of the copied arrays. Compared to COO, the CSC format is more complex and has the additional burden of recalculating the column offsets (col_offsets) array for each inserted element.
Figure 7 shows the wallclock time taken to calculate the expressions trace(A.t()*B) and diagmat(A+B), with and without the aid of the automatic templatebased optimisation of compound expression described in Section 3. For both expressions, employing expression optimisation leads to considerable reduction in the wallclock time. As the density increases (ie., more nonzero elements), more time is saved via expression optimisation.
For the trace(A.t()*B) expression, the expression optimisation computes the trace by omitting the explicit transpose operation and performing a partial matrix multiplication to obtain only the diagonal elements. In a similar fashion, the expression optimisation for the diagmat(A+B) expression directly generates the diagonal matrix by performing a partial matrix addition, where only the diagonal elements of the two matrices are added. As well as avoiding full matrix addition, the generation of a temporary intermediary matrix to hold the complete result of the matrix addition is also avoided.
7 Conclusion
Motivated by a lack of easytouse tools for sparse matrix development, we have proposed and implemented a sparse matrix class in C++ that internally uses a hybrid storage framework. The framework automatically and seamlessly switches between several underlying formats, depending on which format is best suited and/or available for specific operations. This allows the user to write sparse linear algebra without requiring to consider the intricacies and limitations of various storage formats. In addition, a template metaprogramming framework is used to automatically optimise several common expression patterns, resulting in faster execution.
The source code for the sparse matrix class and its associated functions is included in recent releases of the crossplatform and opensource Armadillo linear algebra library [25], available from http://arma.sourceforge.net. The code is provided under the permissive Apache 2.0 license [28], allowing unencumbered use in both opensource projects and commercial closedsource products.
The sparse matrix class has already been successfully used in opensource projects such as the mlpack library for machine learning [7], and the ensmallen library for mathematical function optimisation [4]. In both cases the sparse matrix class is used to allow various algorithms to be run on either sparse or dense datasets. Furthermore, bidirectional bindings for the class are provided to the R environment via the Rcpp bridge [12].
Future avenues for exploration include integrating more specialised matrix formats in order to automatically speed up specific operations. For example, the Skyline formats [10] are useful for Cholesky factorisation and related operations, while the compressed diagonal storage format [24] can be used for operations on the main diagonal.
Acknowledgements
We would like to thank our colleagues at the University of Queensland (Ian Hayes, George Havas, Arnold Wiliem) and Data61/CSIRO (Dan Pagendam, Josh Bowden, Regis Riveret) for discussions leading to the improvements of this article.
References
 [1] D. Abrahams and A. Gurtovoy. C++ Template Metaprogramming: Concepts, Tools, and Techniques from Boost and Beyond. AddisonWesley Professional, 2004.
 [2] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. LAPACK Users’ Guide. SIAM, 1999.
 [3] R. E. Bank and C. C. Douglas. Sparse matrix multiplication package (SMMP). Advances in Computational Mathematics, 1(1):127–137, 1993.
 [4] S. Bhardwaj, R. Curtin, M. Edel, Y. Mentekidis, and C. Sanderson. ensmallen: a flexible C++ library for efficient function optimization. arXiv:1810.09361, 2018.
 [5] N. Boumal, V. Voroninski, and A. Bandeira. The nonconvex BurerMonteiro approach works on smooth semidefinite programs. In Advances in Neural Information Processing Systems, pages 2757–2765, 2016.
 [6] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. MIT Press, 3rd edition, 2009.
 [7] R. Curtin, M. Edel, M. Lozhnikov, Y. Mentekidis, S. Ghaisas, and S. Zhang. mlpack 3: a fast, flexible machine learning library. Journal of Open Source Software, 3:726, 2018.
 [8] T. A. Davis. Direct methods for sparse linear systems. SIAM, 2006.
 [9] T. A. Davis, S. Rajamanickam, and W. M. SidLakhdar. A survey of direct methods for sparse linear systems. Acta Numerica, 25:383–566, 2016.
 [10] I. S. Duff, A. M. Erisman, and J. K. Reid. Direct methods for sparse matrices. Oxford University Press, 2nd edition, 2017.
 [11] J. W. Eaton, D. Bateman, S. Hauberg, and R. Wehbring. GNU Octave 4.2 Reference Manual. Samurai Media Limited, 2017.
 [12] D. Eddelbuettel and C. Sanderson. RcppArmadillo: Accelerating R with highperformance C++ linear algebra. Computational Statistics & Data Analysis, 71:1054–1063, 2014.
 [13] L. El Ghaoui and H. Lebret. Robust solutions to leastsquares problems with uncertain data. SIAM Journal on Matrix Analysis and Applications, 18(4):1035–1064, 1997.
 [14] A. George and E. Ng. On the complexity of sparse QR and LU factorization of finiteelement matrices. SIAM Journal on Scientific and Statistical Computing, 9(5):849–861, 1988.
 [15] J. R. Gilbert, X. S. Li, E. G. Ng, and B. W. Peyton. Computing row and column counts for sparse QR and LU factorization. BIT Numerical Mathematics, 41(4):693–710, 2001.
 [16] G. R. Lanckriet, N. Cristianini, P. Bartlett, L. E. Ghaoui, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5:27–72, 2004.
 [17] R. B. Lehoucq, D. C. Sorensen, and C. Yang. ARPACK Users’ Guide: Solution of LargeScale Eigenvalue Problems with Implicitly Restarted Arnoldi Methods. SIAM, 1998.
 [18] X. S. Li. An overview of SuperLU: Algorithms, implementation, and user interface. ACM Transactions on Mathematical Software (TOMS), 31(3):302–325, 2005.
 [19] P. Liniker, O. Beckmann, and P. H. Kelly. Delayed evaluation, selfoptimising software components as a programming model. In European Conference on Parallel Processing  EuroPar 2002. Lecture Notes in Computer Science (LNCS), volume 2400, pages 666–673, 2002.
 [20] MathWorks. MATLAB Documentation  Accessing Sparse Matrices. https://www.mathworks.com/help/matlab/math/accessingsparsematrices.html, 2018.
 [21] M. Mernik, J. Heering, and A. M. Sloane. When and how to develop domainspecific languages. ACM Computing Surveys, 37(4):316–344, 2005.
 [22] S. Mittal. A survey of recent prefetching techniques for processor caches. ACM Computing Surveys, 49(2):35:1–35:35, 2016.
 [23] J. NunezIglesias, S. van der Walt, and H. Dashnow. Elegant SciPy: The Art of Scientific Python. O’Reilly Media, 2017.
 [24] Y. Saad. SPARSKIT: A basic tool kit for sparse matrix computations. Technical Report NASACR185876, NASA Ames Research Center, 1990.
 [25] C. Sanderson and R. Curtin. Armadillo: a templatebased C++ library for linear algebra. Journal of Open Source Software, 1:26, 2016.
 [26] C. Sanderson and R. Curtin. A userfriendly hybrid sparse matrix class in C++. In Mathematical Software  ICMS 2018. Lecture Notes in Computer Science (LNCS), volume 10931, pages 422–430, 2018.
 [27] M. Scherr and S. Chiba. Almost firstclass language embedding: taming staged embedded DSLs. In ACM SIGPLAN International Conference on Generative Programming: Concepts and Experiences, pages 21–30, 2015.
 [28] A. St. Laurent. Understanding Open Source and Free Software Licensing. O’Reilly Media, 2008.
 [29] B. Stroustrup. The C++ Programming Language. AddisonWesley, 4th edition, 2013.
 [30] L. Vandenberghe and S. Boyd. Semidefinite programming. SIAM Review, 38(1):49–95, 1996.
 [31] D. Vandevoorde and N. M. Josuttis. C++ Templates: The Complete Guide. AddisonWesley, 2nd edition, 2017.
 [32] T. L. Veldhuizen. C++ templates as partial evaluation. In ACM SIGPLAN Workshop on Partial Evaluation and SemanticsBased Program Manipulation, pages 13–18, 1999.
Comments
There are no comments yet.