1. Introduction
In modern mathematics a higherorder tensor is defined as an element of a tensor product space (Lim, 2017; da Silva and Machado, 2017). Higherorder tensors are coordinatefree in an abstract fashion without choosing a basis of the tensor product space. In the realm of numerical tensor calculus, higherorder tensors with a coordinate representation are considered (Lim, 2017; Hackbusch, 2014). The bases are chosen implicitly, and the values of some measurements are then recorded in the form of a multidimensional array. We define a multidimensional array as entity that holds a set of data all of the same type whose elements are arranged in a rectangular pattern. In some cases higherorder tensors are referred to as hypermatrices, way arrays or dimensional table of values (Lim, 2017; Cichocki et al., 2009; Lathauwer et al., 2000a) where is order, i.e. the number of dimensions.
Basic tensor operations are the tensortensor, tensormatrix, tensorvector multiplication, the inner and outer product of two tensors, the Kronecker, Hadamard and KhatriRao product
(Cichocki et al., 2009; Lim, 2017). Common methods utilizing tensor operations are e.g. the higher order decompositions or to calculate the eigenvalues or singular values of a higherorder tensor
(Lathauwer et al., 2000a; Kolda and Bader, 2009; Cui et al., 2014; Ng et al., 2009). Other types of tensor decomposition are the CP (CanonicalDecomposition/ParallelFactorAnalysis) (Harshman and Lundy, 1994; Faber et al., 2003) and TuckerDecomposition (Tucker, 1966; Kim and Choi, 2007) which are mainly used within the field of psychometrics and chemometrics. Other areas of application are signal processing (Savas and Eldén, 2007; FitzGerald et al., 2005), computer graphics (Vasilescu and Terzopoulos, 2002; Suter et al., 2013) and data mining (Kolda and Sun, 2008; Rendle et al., 2009).Many generalpurpose programming languages such as C, C++ or Fortran support multidimensional arrays as builtin data structures with which elements are accessed in a convenient manner. Yet, builtin data types might not meet the requirements of an application. For instance, the C++ builtin multidimensional array is not a good fit if the application requires the number of dimensions to be runtime variable. A very common approach is to provide a library that extends the generalpurpose language with userdefined data types and functions. In most cases, the interface of the libraries are designed to be close to the notation that is used within a field of application (Mernik et al., 2005). Usually, C++ is chosen to be the host language that is extended with userdefined data types. The key feature of C++ is that it enables the programmer to apply objectoriented and generic programming techniques with a simple, direct mapping to hardware and zerooverhead abstraction mechanisms (Stroustrup, 2012a; Gregor et al., 2006). Functions and types can be parametrized in terms of types and/or values supporting parametric polymorphism and allow software to be general, flexible and efficient (Stroustrup, 2012b).
We would first like to introduce C++ libraries that are related to our framework and depict similar highlevel interfaces. POOMA, described in (Reynders III and Cummings, 1998), is perhaps one the first C++ frameworks that have been designed to support arrays with multiple dimensions including tensor operations. Multidimensional arrays are generic data types where the number of dimensions are compiletime parameters. The framework supports highlevel expressions for firstlevel tensor operations. The library described in (Landry, 2003) offers a tensor class that is designed to support classical applications found in quantum mechanics. Tensor functions for highlevel tensor operations are provided. However, the framework only support tensors up to four dimensions with four elements in each dimension limiting the application of the framework to classical applications of numerical tensor calculus. In (Garcia and Lumsdaine, 2005)
, the design of generic data types for multidimensional arrays is discussed, including the addressing elements and subdomains (views) of multidimensional arrays with first and lastorder storage formats. The order and data type are compiletime (template) parameters. They offer iterators for their multidimensional arrays and suggest to parametrize their tensor algorithms in terms of multidimensional array types that generate stridebased iterators for the current dimension of the recursion. In
(Andres et al., 2010) an implementation of a multidimensional array and iterators are presented where the order and the dimension extents of tensor types are runtime parameters. The paper also discusses addressing functions, yet for the first and lastorder storage format only. Please note that (Reynders III and Cummings, 1998; Garcia and Lumsdaine, 2005; Andres et al., 2010) do not support higherorder tensor operations for numerical tensor calculus.Most C++ frameworks supporting tensor contraction provide a convenient notation that is close to Einstein’s summation convention. The CyclopsTensorFramework (CT) described in (Solomonik et al., 2013) offers a library primarily targeted at quantum chemistry applications (Solomonik et al., 2013). The order and the dimensions of their tensor data structures are dynamically configurable, while the interface omits the possibility to set a data layout or a userdefined data type. The tensor contractions are performed by index foldings with matrix operations where the indices are adjusted with respect to the tensor operation. The interface for specifying the contraction is similar to the one provided by the MiaArray library (LibMia) discussed in (Harrison and Joseph, 2016) using either strings (CT) or objects (LibMia, Blitz). For instance, after having instantiated tensor objects, the mode multiplication of a three dimensional with a matrix is given by C["ijk"]=A["ilj"]*B["kl"] in case of the CT framework, C(i,j,k)=A(i,l,k)*B(k,l) in case of the LibMia framework, or C(i,j,k)=sum(A(i,l,j)*B(k,l),l) in case of the Blitz framework (Veldhuizen, 1998) where the type of the variables i,j,k,l
of the example code snippets are userdefined. The notation necessitates the order of the arrays to be known before compiletime and does not allow the contraction mode to be runtimevariable. The implementation of tensor algorithms such as the higherorder singular value decomposition for instance, performs a sequence of
mode tensortimesvector multiplications where the mode depends on an induction variable (Lathauwer et al., 2000a; Bader and Kolda, 2006).Our framework fills this need with flexible and generic tensor classes and functions where the order, extents of the dimensions and the contraction mode(s) can be runtime variable. The interfaces of the generic tensor functions are similar to the ones provided by the Matlab library that is presented in (Bader and Kolda, 2006). The toolbox provides tensor classes and tensor operations for prototyping tensor algorithms in Matlab. The execution of some higherorder tensor operations such as tensortensormultiplication requires a tensor to be converted or unfolded into matrix. The unfolding is performed with respect to the mode of tensor operation and requires additional memory space for the unfolded tensor. However, once the unfolding is accomplished, fast matrix multiplications can be used to perform the tensor contraction. This approach is also applied in frameworks such as in (Napoli et al., 2014). The generic tensor functions of our framework execute tensor operations inplace for tensor types including domains of tensors without the process of unfolding. They also support a set of storage formats including the first and lastorder storage formats. Please note that we do not intend to replace any of the previously mentioned works, but to provide a flexible and extensible library that allows to easily validate numerical results with the Matlab toolbox provided by (Bader and Kolda, 2006).
We would like to limit our discussion to the software design of a framework for basic tensor operations. The design and implementation of highperformance algorithms exploiting data locality and the parallel processing capabilities of multicore processors lies beyond the scope of this paper. We would like to refer to (Springer and Bientinesi, 2016) in which a method for fast tensor contractions with arbitrary order and dimensions is described. The optimization techniques are similar to the ones applied for matrixmatrix multiplication. The paper presents a TCCG, a C++ code generator that generates optimized C++ code for the tensor multiplication. The code generation requires an input file where the contracting dimensions and mode of the multiplication, order, dimensions of the multidimensional arrays are specified. The work presented in (Li et al., 2015) follows a similar approach and provides code generator that is called InTensLi.
Our framework consists of a software stack with two main components. The upper part of the software stack contains flexible tensor classes with runtime variable number of dimensions and dimension extents. The classes and their member functions simplify the management of multidimensional data and the selection of single elements or multidimensional domains using generic and objectoriented programming techniques. The data layout of the multidimensional data can be adjusted at runtime as well, supporting a class of storage layouts include the first and lastorder storage formats. Member functions of the tensor classes encapsulate generic tensor functions of the lower software stack components and help the programmer to write tensor algorithms independent of the storage layout of the tensors. We have used adhoc polymorphism, i.e. operator overloading, for lowerlevel (entrywise) tensor operations in order to enable the user to write tensor algorithms close to the mathematical notation. Member functions encapsulating higherlevel tensor operations have an interface that is similar to the one provided by the toolbox discussed in (Bader and Kolda, 2006). The upper component also provides stream classes with which numerical results can be validated in Matlab. The lower part of the software stack includes generic tensor function that are defined in terms of multidimensional iterator types only. The decoupling of data storage representation and computation with multidimensional iterators allows different tensor types to be handled with same tensor function. Our framework provides its own multidimensional iterator that is one possible interface between the tensor classes and generic tensor functions. The user can extend the framework with tensor functions without modifying the tensor template classes and provide his own tensor and iterator types. In summary, the main contributions of our work are as follows:

Our implementation of higherlevel tensor operations allows the mode of the tensor multiplication and the order, dimensions, storage format, index offsets of the tensor to be runtime parameters. The user can implement tensor algorithms with arguments that depend on runtimevariable parameters.

Our implementations of tensor operations are parametrized in terms of multidimensional iterator types and do not rely on specific storage representation of data. The same tensor functions can be utilized for different tensor types including tensor references. Users can extend our framework and provide their own tensor or iterator types with a variety data layouts.

We provide tensor template classes with member functions that encapsulate tensor template functions and enable the user to program tensor algorithms in an objectoriented fashion. Numerical results are conveniently verifiable with the Matlab toolbox described in (Bader and Kolda, 2006) using overloaded stream operators. Output files can be directly used in the Matlab environment without further modification.
The remainder of the paper is organized as follows. The following Section 2 provides an overview of our tensor framework and discusses some general design decisions. Section 3 introduces the C++ implementation of our tensor data structures with the focus on the layout and access of elements. We discuss the generality and limitations of our data structure with respect to the storage format. Multidimensional iterators for data structures supporting nonhierarchical data layouts are the topic of Section 4. Section 5 discusses the design and implementation of tensor operations using multidimensional iterators as template parameters. We complete the section by exemplifying the usage of higherorder tensor operations with tensor objects. The last Section 6 provides a conclusion of this work.
2. Overview of the Tensor Framework
for_each  transform  copy 
fill  generate  count 
min_element  max_element  find 
equal  mismatch  mismatch 
all_of  none_of  any_of 
iota  accumulate  inner_product 
tensor_times_vector  tensor_times_matrix  tensor_times_tensor 
tensor_times_vectors  tensor_times_matrices  transpose 
outer_product  inner_product 
The primary scope of our framework is given by the following Table 1 that lists basic tensor template functions of the lowlevel interface of our framework. The first part of the table are firstlevel tensor operations and correspond semantically to the function templates provided in the algorithm and numeric package of the C++ standard library. Typically, firstlevel tensor operations process data elements of multidimensional arrays with the same index tuple. The second part are referred to as higherlevel tensor operations that are additionally required to implemented numerical multilinear algebra algorithms. Higherlevel tensor operations typically manipulate data elements with different index tuples. All functions listed in Table 1 are able to combine multidimensional arrays and views and to process both equally efficiently with same time complexity.
2.1. Software Nomenclature
In the following we use the nomenclature developed in (Stroustrup, 2013). An object is defined in the standard as a region of storage that has a storage duration and a type. The term object type refers to to the type with which an object is created. The C++ language offers fundamental types that are builtin. There are five standard signed and unsigned integer types, a boolean type and three floating point types. We exclude the void type for our discussion. A class is a userdefined type that contains a set of objects of other types and functions that manipulate these objects. A template is a class or a function that we parametrize with a set of types or values. It defines a family of classes or functions or an alias for a family of types. We do not distinguish between the term template class and class template as well as template function and function template. We call the generation of a class or function from a template, template instantiation where the generated template instance is called specialization. Containers are referred to as data structures that manage memory. Algorithms denote template functions that process and manipulate container data with iterators.
2.2. Software Design
The software design of our framework is greatly influenced by the design principle of the Standard Template Library (STL). The STL provides five components algorithms, containers, iterators, function objects and adaptors for this purpose that allows programmers to program data structures of the STL with their own algorithms, and to use algorithms of the STL with their own data structures (Stepanov, 1995). Similarly, the C++ standard library provides multiple containers with different capabilities and runtime complexities, template classes such as the std::vector that help to organize data (Stroustrup, 2013). Allocation of memory is usually performed with additional predefined or useddefined allocator classes. The C++ standard library also provides free function templates, also known as algorithms, that operate on a onedimensional range of container data. The range is specified by an iterator pair instantiated by the corresponding container. Iterators of the standard library are one of the five iterator types with different navigation and access capabilities that an algorithm requires for its execution. In other words, an algorithm can only process container data if the algorithm’s iterator type requirement is fulfilled by the container’s iterator.
We have used this approach to minimize the dependencies between tensor algorithms and the implementation of tensor data structures to yield a separation of concerns. Therefore, we did not use inheritance but mostly composition to establish a loose coupling of software elements. Our tensor framework consists of multiple software stack layers that are illustrated in Figure 1.
We will start with the highest layer in the software stack. It consists of two template classes fhg::tensor and fhg::tensor_view. Both provide a highlevel interface with which the user can conveniently implement tensor algorithms with member functions. Yet, the implementation of both template classes use software components and call functions of the lower layers. For instance, data organization and access is accomplished with the corresponding template classes fhg::multi_array and fhg::multi_array_view, respectively.
The template class fhg::multi_array can be regarded as a multidimensional container supporting random access with multiple and single indices. It is designed as a resource handle (similar to the std::vector) for storing a collection of elements that are ordered in a rectangular pattern. It abstracts from the storage layout in memory and provides a convenient interface to access elements with multiindices. Allocation and deallocation is handled with the help of the std::vector template that stores elements contiguously in memory. Its template parameters determine the type of the elements and allocator with which memory is acquired and released. Parameters such as order, dimensions, index offsets or the storage layout are member variables of the fhg::multi_array template. Choosing all of the parameters to be runtimevariable allowed us to verify all of tensor template functions and to provide flexible and runtimeadaptable containers.
The fhg::multi_array_view template class is similar to a container adaptor that references a selected region of an fhg::multi_array object offering the same functionality as the fhg::multi_array template class. It serves as a proxy for conveniently accessing and manipulating selected memory regions of an fhg::multi_array object. The template parameter is therefore restricted to a type that has the same properties as fhg::multi_array. We have applied the factory method design pattern without subtype polymorphism where an object of type fhg::multi_array_view can only be generated by calling the overloaded function operator of an fhg::multi_array object with a tuple of fhg::range class objects. An fhg::multi_array object therefore generates references to itself.
Both template classes fhg::multi_array and fhg::multi_array_view do not provide virtual member functions (especially a virtual destructor) and therefore promote aggregation instead of inheritance. The strategy is also used in the standard library for all containers in order to prevent indirect functions calls and runtime overhead. The template class fhg::tensor therefore contains the fhg::multi_array template class and provies additional member functions with which the user can perform arithmetic operations such as the tensor multiplications. The fhg::tensor_view template class is also designed to wrap and extend the functionality of the fhg::multi_array_view template. The element type of the both tensor templates must support scalar arithmetic operations for the unspecialized case.
The template class fhg::multi_iterator defines a multidimensional iterator with which the complete multiindex set of a multidimensional array is accessible while the array can have various data layouts, runtime variable dimensions and order. It provides member functions that create iterator pairs of type iterator_t and help to implement tensor functions without explicit specification of the data layout and dimensions. Our framework uses the fhg::stride_iterator class for this purpose. The class however does not depend on the fhg::multi_iterator template class. It can be instantiated and used separately to modify and access elements of a specific dimension for instance. Both iterator types can be conveniently instantiated with member functions of template classes.
The bottom layer contains tensor template functions that are listed in Table 1. The template parameters of the function templates denote multidimensional iterator types such as the fhg::multi_iterator. The user can utilize our or his own implementation to call the template functions as long as the multidimensional iterator type fulfills specified criteria. Using iterators, the template functions do make assumptions about the underlying data structure or management relieving the user to consider data layout. The tensor template functions are designed in a recursive fashion and implemented as nonmember function templates where the maximum depth of the recursion mostly equals the number of dimensions of the corresponding multidimensional array. They do not flatten or rearrange tensors and perform tensor operations inplace using recursion. The framework offers two algorithm packages. The first one contains function templates that have the same semantic and similar function signature as the function templates offered by the algorithm package of the C++ standard library. The second package contains function template that implement tensor multiplication operations such as the tensortensor multiplication.
3. Multidimensional Containers and Views
A multidimensional array is a dimensional table of values that are arranged in a rectangular pattern and accessible via multiple indices. If it represents higherorder tensors for a given finite basis of the tensor product space, its elements are either real or complex numbers (Lim, 2017; Golub and Van Loan, 2013). For the following discussion, we postulate multidimensional arrays to hold any type of values that can be arranged in a rectangular pattern. We use the following notation to denote a multidimensional array:
(1) 
where is the th index set with
(2) 
and . The number is a positive integer and will be referred to as the order or rank of a multidimensional array. The (dimension) extent of the dimension can be different but must be greater than one. The tuple of length is the shape tuple of a multidimensional array with . Each index is biased with an index offset . We denote with as the index offset tuple of a tensor. We can then derive the multiindex set as the Cartesian product of all index sets such that
(3) 
with . We denote an element of a multiindex set as a multiindex with . Elements of a tensor are uniquely identifiable using round brackets with a multiindex and
(4) 
Example 3.1 ().
Let be an array of order with a shape tuple where and , , . Each element of a tensor can be identified with a multiindex in . Using the notation in Eq. (1) the multidimensional array can be illustrated as follows:
A (multidimensional) view of a tensor is a reference to a specified region or domain of and has the same order and data layout as the referenced multidimensional array. It can be regarded as a lightweight handle with a shape tuple where the dimensions of the view and referenced multidimensional satisfy for . We define a section or view of a multidimensional array of order in terms indextriplets , pairs of indices or scalars for all dimensions. The indices , define the lower and upper bound of an index range where the step size for the th dimension satisfies and for . The shape tuple of the view is given by
(5) 
The th index set of a view is then defined as
(6) 
where is the th index offset of the referenced multidimensional array . Using the notation in the previous section, a view of a multidimensional array is denoted by
(7) 
Analogous to the multidimensional arrays, the multiindex set of a view is given by
(8) 
where . A slice is a special kind of view with two dimensions that have the same extent as the referenced multidimensional array. Let and be the subscripts of those two dimensions. The resulting multiindex set of the slice is given by Eq. (8) with
(9) 
A fiber has exactly one dimension that is greater than one.
Example 3.2 ().
Let be the way array from the previous example and zerobased index offsets. Let , , , and , and index triplets for choosing a section of . With being the shape tuple of the view , elements of the view are given by
3.1. Data Organization and Layout
The above notation of a tensor is an abstract representation of a multidimensional array. Usually a multidimensional array stores its elements in memory where the latter can be addressed with a single index. We refer to that single index as the (absolute) memory address. Our implementation of the tensor template class is stored contiguously in memory such that elements of array are addressable using a single index as well. The single index shall denote a relative position within the multidimensional array. In that case, the th element of in memory is then given by the linear function
(10) 
where is the number of bytes to store an element of a multidimensional array and is the (absolute) memory address of the first element. The domain of the function is given by
(11) 
which we denote as the (relative) memory index set of a multidimensional array of order where is the length of the th dimension. Therefore, we may enumerate elements of a multidimensional array with a single scalar index of the index set . We denote elements of the memory index set as memory indices of the multidimensional array that correspond to displacements relative to the memory address of the first array element.
Given a contiguous region within a single indexed memory space, the data layout (storage format) of a contiguously stored dense multidimensional array defines the ordering of its elements within the specified memory region. A multidimensional array with the dimensions has possible orderings. In practice however, only a subset of all possible orderings are considered. In case of two dimensions for instance, most programming languages arrange elements of twodimensional arrays either according to the row or columnmajor storage format where adjacent elements of a row or column are successively stored in memory. More sophisticated nonlinear layout functions have been investigated for instance in (Chatterjee et al., 1999; Elmroth et al., 2004) with the purpose to increase the data locality of dense matrix operations. These type of layout functions partition twodimensional arrays into hierarchical ordered blocks where elements of a block are stored contiguously in memory.
Our data structure supports all nonhierarchical data layouts including the first and lastorder storage formats. We denote both formats as standard data layouts of dense multidimensional arrays, where the former format is defined in the Fortran, the latter in the C and C++ language specification, respectively. Nonhierarchical data layouts can be expressed in terms of permutation tuples which we denote as the (data) layout tuple. The th element of a permutation tuple corresponds to an index subscript of a multiindex with the precedence where and . In case of the firstorder format, the layout tuple is defined as
(12) 
where the precedence of the dimension ascends with increasing index subscript. The layout tuple of the lastorder storage format is given by
(13) 
We might therefore interpret the layout tuple also as a precedence tuple of the dimensions. The set of all possible layout tuples is denoted by . The number of all possible nonhierarchical element arrangements is and equal to the number of elements in .
The th stride is a positive integer and defined as the distance between two elements with identical multiindices except the th indices differing by one. More specifally, given a layout tuple and the shape tuple , elements of a stride tuple are given by:
(14) 
Please note that for any layout tuple the corresponding stride tuple satisfies with . For instance, the Fortran language specification stores elements of a multidimensional array according to the firstorder storage format such that the corresponding stride tuple is defined as
(15) 
In case of the firstorder storage format, the th stride increases with consecutive index subscripts by the factor of . The lastorder storage format is given by Given the layout tuple in Eq. (13), the stride tuple is then
(16) 
which corresponds to the definition in the 2011 Clanguage specification. We denote the set of all stride tuples by with .
3.2. Tensor Template Class
The parameterized class fhg::tensor is a sequence container which is responsible for the data management. It resembles the vector template class std::vector of the C++ standard library. It has two template parameters, value_type and allocator. The former one determines the data type of the elements, the latter the type of the memory allocator. The type value_type must be a numeric type with which addition, subtraction, multiplication and division can be performed.
This restriction arises due to member functions such as overloaded arithmetic operators implementing pointwise tensor operations for convenience. In order to support a wider range of applications we offer the template class fhg::multi_array that does not include any arithmetic operations and therefore has lesser restrictions on the data type of its elements. It has the same template parameters and member variables as the fhg::tensor template class. The user is for instance free to set the first template parameter to boolean data type and equip the fhg::multi_array with bitwise operations.
Similar to the std::vector template class, fhg::tensor contains public member types such as value_type, size_type, difference_type, pointer, const_pointer, reference, const_reference, iterator, const_iterator. The last two types denote multidimensional iterators which will be explained in the following subsection. The template class also defines the member type tensor_view_t that is fhg::tensor_view<tensor> where tensor is the data structure itself. Consider the following listing with template class instances of the template class fhg::tensor.
Instances of the first specialization contain complex numbers in single precision, objects of the second type integer numbers and the third numbers in double precision. The template class tensor provides public member functions in order to

instantiate and copy tensor instances, e.g. tensor(),

assign data elements, e.g. operator=(..),

access elements, e.g. at(..), operator[..],

perform tensor multiplication operations, e.g. times_tensor(..), inner(..),

perform tensor pointwise operations, e.g. operator+=(..), operator(),

generate views, e.g. operator()(..),

generate iterators, e.g. begin(), mbegin(),
including all size and capacity functions of the std::vector such as size(), empty(), clear(), data(). However, we do not support stack and list operations such as push_back, pop_back, insert or erase.
The type fhg::tensor shall denote the type fhg::tensor<value_type,allocator> for given template parameters value_type and allocator. The fhg::tensor template class provides the following constructor declarations.
The default constructor of the fhg::tensor class calls the default constructor of all member variables. The fhg::tensor template class contains the following std::vector private member variables in order to store

data elements of type value_type,

extents of type size_type,

layout elements of type size_type,

strides of type size_type,

index offsets of type difference_type.
The above mentioned public member types are derived from the std::vector template class with the specified template parameters value_type and allocator. The member variable of type std::vector<value_type,allocator> helps fhg::tensor class template to allocate and free memory space where the data elements are contiguously stored. The position of the elements corresponds to indices of the plain index set given by Eq. (11). The second constructor specifies the extents with the help of fhg::shape class instances. The order is automatically derived and the length of the data vector is determined. The default layout tuple is computed with Eq. (12). The constructor can then compute the strides with Eq. (14) initializing the index offsets with zero. The remaining constructors allow to specify different offset and/or layout tuples using the auxiliary fhg::layout and fhg::offset classes, respectively. The specification of the layout using the fhg::layout class is constructed with the a permutation tuple with onebased indices. All three auxiliary classes provide an initialization with the std::initializer_list for a convenient initialization of their elements. The elements are stored with a specialization of the std::vector template class. Consider the following example where two statements instantiate objects of the fhg::tensor<double> class.
Example 3.3 ().
The first constructor generates an object A where the offsets are all zero and the elements are stored according to the firstorder storage format and therefore initializing the layout tuple with {1,2,3} as given by Eq. (12). The second object B has the same shape but initializes the index offsets with 1,1, 0 using the lastorder format for the element ordering.
We can therefore assert that the strides of objects A and B are {1,4,8} and {6,2,1}, respectively.
The copy assignment operators operator=() of the fhg::tensor class template are responsible for copyin data and protecting against selfassignment. Consider the following overloaded assignment operators.
The first method copies all private member variables of the source instance such that source and destination have the same shape tuple and the same order after the assignment. The layout, stride and index offset tuples of the destination tensor object are only copied if the order of the source and destination tensor objects differ. In either cases, the user can expect the source and destination fhg::tensor class instances to be equal and independent after the copy operation, see (Stroustrup, 2013) for a detailed discussion. We defined two tensor to be equal if they have the shape tuple, order and elements with the same multiindex independent of their layout, index offset tuple and allocator. The second assignment operator copies all elements of the source fhg::tensor_view class template instance including the shape tuple. The layout and index tuples are only copied if the order to the source and destination tensor objects differ.
Example 3.4 ().
Let A and B be fhg::tensor objects of the previous Example 3.3. The following code snippet initializes all elements of A with the value one using the assignment operator and copies the content of A to B.
As both objects do have equal extents, only data values of A are copied to B without modifying the offset and layout tuple of B.
Besides the type of the data elements, the user can change the content and the size of all member variables at runtime. The framework offers reshape and relayout member functions that allow to dynamically adjust the shape and layout tuple, respectively. The container automatically adjusts the strides and reorders the in memory if necessary. The fhg::tensor template class provides member functions for accessing elements with multiindices and scalar memory indices. The following listing depicts two member functions that return a reference of the specified element.
The first access function is a template function with a variadic template that allows to conveniently access elements with multiindicies. It transforms multiindices onto memory indices according to the stride tuple of the template class. Given the order and a stride tuple with respect to a layout tuple and stride tuple given by Eq. (14), the nonhierarchical layout function is defined by
(17) 
For fixed stride tuples and , the layout functions and coincide with the definitions provided in (Chatterjee et al., 1999; Andres et al., 2010; Garcia and Lumsdaine, 2005).
Example 3.5 ().
Let A be a multidimensional array of order 3 and type fhg::tensor with equal extents and elements all set to zero. We can create an identity tensor with ones in the superdiagonals by the following statement.
Object A is processed independently of its layout tuple.
The argument size of the variadic template however must be specified at compile time. Note that, the multidimensional array A of Example 3.5 must be of order three. The class offers additional member functions that allow to specifiy multiindices with the runtimevariable length using the std::vector. Consider the following code snippet in the next example.
Example 3.6 ().
Given a multidimensional array A of order where is initialized at runtime. We can create an identity tensor with a vector of size that is initialized with the value .
Using multiindices abstracts from the underlying data layout and enables the user to write layout invariant programs as all elements have a unique multiindex independent of the data layout, see Eq. (4). However, each element access with multiindices involves a multiindex to memory index transformation that is given by Eq. (17).
The second member function of the previous listing provides memory access with a single scalar parameter corresponding to memory indices of the memory index set of a multidimensional array, see Eq. (11). Consider the following example.
Example 3.7 ().
Let A be a multidimensional array specified in the previous example. The statement in the listing sets all elements of the object using a single induction variable j that takes values of the complete memory index set of the array.
In contrast to the access with multiindices, referencing elements with memory indices does not include index transformations. It might be seen as a lowlevel access where the user is responsible for referencing the correct elements. We might say that scalar indexing is convenient whenever the complete memory index set is accessed and the multiindex or the ordering of the data elements are not relevant for the implementation of the tensor operation. Higherlevel tensor operations such as the tensor transposition require some type of multiindex access.
3.3. View Template Class
An instance of the parameterized class fhg::tensor_view references a container instance that has been selected using fhg::range class instances.
The template parameter tensor_t can be of type fhg::tensor<value_type, allocator>. However, the user can also set a different tensor_t type. In that case tensor_t type requires the type tensor_t::value_type to be an arithmetic type and the referenced container to store its data contiguously with access to all previously mentioned public member types and variables. A template class instance fhg::tensor_view contains therefore the same public member types and member methods as the referenced template class instance fhg::tensor with only minor differences. This has the advantage that both template class instances fhg::tensor and fhg::tensor_view can be treated almost identically.
The fhg::tensor_view template class contains the following private member variables:

a pointer to the selected tensor of type tensor_t*,

a pointer to the first element of type value_type*,

ranges of type fhg::range,

extents of type size_type,

strides of type size_type.
The layout and offset tuple can be obtained by calling the appropriate functions of the referenced fhg::tensor object. The constructor only requires the specification of the referenced fhg::tensor and fhg::range objects that are used to initialize the remaining private member are initialized. The th fhg::range object defines an index set that must be a subset of the index set of the fhg::tensor as we have already previously defined, see Eq. (6). Moreover, the constructor of the fhg::range class takes either two or three indices where first, last and step corresponds to , and indices of Eq. (6).
The constructor also tests if the specified ranges are valid and do not violate any bounds of the selected tensor_t instance. The extents of the view are therefore computed according to Eq. (5). The pointer to the first element of the array is given by adding an offset to the pointer to first element of the selected fhg::tensor. The offset is computed by combining the lower bounds first using the layout function in Eq. (17) such that
(18) 
where and are the stride and offset tuple of the fhg::tensor object and is the lower bound first of the th fhg::range instance. The stride tuple of a view can be computed with Eq. (14).
The tensor template class fhg::tensor provides two methods that overload the function call operator and generate class instances of type fhg::tensor_view<tensor>. The latter type is equivalent to tensor_view_t where tensor is the type of the calling template class instance. Please consider the following listing which is an excerpt of the fhg::tensor template class.
The first function is a template function with a template parameter pack domain that is recursively unpacked for either integers or objects of type fhg::range, i.e. index triplet or pairs. The unpacking however is performed with protected template functions that are called from the constructor of the corresponding tensor_view_t class.
Both methods are declared as friend in the tensor_view template class and only callable from the referenced fhg::tensor objects. The first constructor generates a reference to an unselected tensor_t object where no domain is specified. This means that the reference covers the complete domain of the tensor object that has been selected. The second constructor uses a variadic template that processes either integers or objects of type fhg::range. Both constructors are called from the tensor template class from the overloaded function call operator. Please note that fhg::tensor_view the class instance cannot be instantiated without a tensor_t object as the default constructor of fhg::tensor_view the template class is deleted. Consider the following example that illustrates the creation of an view object Av of type tensor_view<tensor<float>>. Please note that only a tensor_t object is able to call the protected fhg::tensor_view constructor with a function call operator of tensor_t.
Example 3.8 ().
Let A be an object tensor of type tensor<float> from the previous Example 3.3 with n = {4,2,3} and zerobased index offsets. Let also r1 = range(1,2,3), r2 = range(0,1) and r3 = range(2) be index ranges with which the way array A is selected. According to Eq. (5) the shape tuple of the view Av is then given by nv = {2,2,1}.
The way array A can also be seen as a composition of three slices Av0, Av1 and Av2 where the ith slice is given by
where an empty range object contains all indices of the corresponding index set.
The instantiation of the view Av is only exemplified in the following listing in order to demonstrate the instantiation procedure. Please note that we did not provide a mechanism for counting the number of fhg::tensor_view objects associated to one tensor_t object. The implementation of such a mechanism requires tensor objects to know about their references and to signal its views when the destructor is called. An existing fhg::tensor_view object can therefore become invalid when its referenced fhg::tensor instance does not exist any more. This situation is similar to a dangling pointer. The user is therefore responsible for avoiding the situation where the referenced fhg::tensor object falls out of scope or is deleted. An alternative way to implement views of tensors is to only allow temporary fhg::tensor_view instances that is an rvalue. In order to enable this restriction, the copy constructor and copy assignment operator needs to be deleted or hidden from the user. In that case, only functions that allow only rvalue references of fhg::tensor_view instances using the && declarator operator. Please consider following listing that demonstrates the instantiation of tensor views.
Example 3.9 ().
Let A be an instance of fhg::tensor<float> template class with the shape tuple (3,4,2). We can then create a view Av of A by calling the overloaded call operator operator() of A with fhg::range instances.
The operator calls the constructor of the tensor_view_t class with the specified ranges. Please note that instead of using the auto specifier, we explicitly defined the type of the view for demonstration purposes. The number of arguments in the parameter pack of the template function must be known at compile time. If the number of fhg::range instances are runtimevariable, the user can call the second member function that creates a view of tensor using a std::vector. Please note that statement in Example 3.9 is very similar to the MATLAB syntax where a section of a multidimensional array A is created with A(1:2,1:2,1) using the call operator in conjunction with the colon operator.
Access functions of the fhg::tensor_view template class with multiindices depict the same interface as for the fhg::tensor template class.
However, the access function cannot use the layout function in Eq. (17) in order to generate memory indices. Each index in needs to be transformed into an index of the set before an element can be accessed. Memory indices of a view’s elements are then given by the function with
(19) 
where is the offset given by Eq. (18) and is a modified stride tuple with .
Example 3.10 ().
Let A be the multidimensional array where all the extent of the dimensions are 4. Similar to the Example 3.5, we can instantiate all diagonal elements of a slice Av to one using a single forloop.
Internally, the at() operation computes the relative memory indices from the multiindices (i,i,0) according to Eq. (19).
The instantiation of multidimensional iterators, views and the interface will be postponed to the following chapters.
3.4. Output Stream Class
In order to compare and verify our numerical results with MATLAB we provided our own output stream class fhg::matlab_ostream for a formatted output of the fhg::tensor and fhg::tensor_view instances. The class overloads stream operators << for the formatted output. The constructor of the class expects an instance of the std::basic_ios class template from the standard template library such as std::cout. The user can directly utilize the instance fhg::mcout of the fhg::matlab_ostream for convenience. The stream function relayouts the multidimensional array for the firstorder storage format and outputs the elements of the tensor using the MATLAB notation. One can also input additional MATLAB commands into the stream such as plotting statements. In this way, MATLAB scripts can be generated with our tensor library. Consider the following listing where the complete tensor A is inserted into the standard output stream.
Example 3.11 ().
The elements of A are stored according to the lastorder storage format and their values are initialized with indices of the plain index set.
The output can be directly copied into MATLAB’s command window and executed if the reference tensor Aref is already defined. Please note that the matlab_stream object could have been also instantiated with an output file stream std::ofstream. We have used this approach to verify the results of our tensor algorithms with the tensor toolbox described in (Bader and Kolda, 2006). The framework also provides overloaded input and output stream operators for the tensor class template in conjunction with standard output streams.
4. Multidimensional Iterator
An iterator is data structure that represents positions of elements in a container and allows to traverse between successive elements of the container. The iterator concept allows to decouple or minimize the dependency between algorithms that use iterators and data structures that create iterators. Pointers for instance are akin to iterators and may be regarded as an instance of the iterator concept. A pair of iterators define a (halfopen) range of the referenced container. Containers of the standard template library provide member functions begin() and end() in order to generate a halfopen range for the corresponding container where the iterator points to the first element and to the position after the last element.
The standard template library divides an iterator into five iterator categories with different capabilities where randomaccess iterators provide the largest number of access and iteration methods with iterator arithmetic similar to that of a pointer. Moreover, it accesses any valid memory locations in constant time. Template functions provided by the standard template library operate with iterators and determine the required category of the iterators. The std::for_each() template function for instance works with input iterators while std::sort() requires iterators with a randomaccess iterator tag. Hence, not every container can instantiate a randomaccess iterator such as associative container std::map such that std::sort() is not compatible with iterators generated by std::map objects. However, iterator instances generated by std::vector, std::array and std::deque objects support randomaccess and can use the function std::sort(). In case of the containers std::vector, std::array and std::deque that implement onedimensional array with a contiguous memory region, the halfopen range covers the complete (memory) index set . Moreover, the memory locations of the iterators generated by begin() and end() are and , respectively, where is the number of bytes to store an element as previously discussed. We have implemented two template classes fhg::iterator and fhg::multi_iterator that allow to iterate over the multiindex set of a multidimensional array or a view. Objects of fhg::multi_iterator instantiate fhg::iterator objects in order to define stridebased ranges.
4.1. Stridebased Iterator
The iterator fhg::stride_iterator has the same traits as the iterator type of the std::vector and therefore exhibits the same template class signature.
It provides all access properties of a randomaccess iterator and is therefore tagged as such. The template parameter iterator_t is an iterator type and must be accepted by the template class std::iterator_traits or by any of its pointer specializations for accessing the public member types such as iterator_category. We might think of the fhg::stride_iterator template class as an iterator adaptor for the standard randomaccess iterator with the same member functions and same interface. However, a stridebased iterator also features an additional member variable in order to store a stride of type std::size_t. The constructor has therefore the following signature.
An object of type iterator_t points to a valid memory location which we will denote as k. The second parameter of the constructor determines the stride w with which the iteration is performed.
The copyassignment operator of the fhg::iterator copies the current position k and the stride w of the other argument. We therefore consider two dimensionbased iterators i1 and i2 equal if the current positions i1.k, i2.k and the strides i1.w, i2.w of the iterators are equal. The expression (i1=i2) == i2 therefore returns true as both iterators have equal position and stride after the assignment (i1=i2). The following example illustrates how two stridebased iterators define a range for a given dimension.
Example 4.1 ().
Let A be a three dimensional array with elements of type float stored according to the firstorder storage format and let {4,3,2} and {1,2,3} be the shape and data layout tuple, respectively. The resulting stride tuple is then given by {1,4,12} according to Eq. (16). We can then use the following two statements where the first line instantiates iterators of std::vector and the second to iterators of the template class fhg::tensor.
The expressions A.data() returns a pointer to first element. The array w is the stride tuple of A. The first half open range [f1,l1) covers all elements with memory indices from 0 and to 12. The second range [f2,l2) only covers elements with the multiindices for that corresponds to a fiber of the multidimensional array with elements that have the memory indices 0, 4 and 8, see Eq. (17).
In order to iterate over the second dimension using the iterator pair [f1,l1), one has to explicitly provide the corresponding stride w[1]. In case of the second iterator pair, no stride has to be specified within the loop.
Example 4.2 ().
Let f1,l1 and f2,l2 be the iterator pair that have been defined in Example 4.1. Both of the following statements can be used to initialize the first row of the object A.
Please note that the standard iterator f1 needs to be explicitly incremented with the stride of the array A. The multidimensional iterator encapsulates the stride from the algorithm such that a normal increment with the operator++() or operator+=(1) suffices. Consider for instance the following function call of the std::fill function that corresponds to the previous statements.
The iterator can therefore be used with all algorithms in the algorithm and numeric headers of the C++ standard library. For convenience, the user can also call member functions of the container classes to instantiate stridebased iterators and to define a range.
The first two functions define a range for a given dimension that must be smaller the order where the first element of the range always corresponds the first element of the corresponding container. The second argument of the last two functions correspond to a multiindex position that define the initial displacement within the multiindex space except the dimension that has been specified by dim. The same initialization of the multidimensional container along the second dimension can be stated in one line.
The user does not have to keep track of the layout of the container A and can read or write elements with the help of ranges that can be conveniently instantiated with appropriate function calls. Moreover, the user can combine fibers of multidimensional containers tuples using template functions of the C++ standard library. For instance, the inner product of two fibers of two multidimensional arrays with the same length can be simply expressed with the std::inner_product template function.
Please note that the objects A and B can be instances of type tensor or fhg::tensor_view with different layout tuples. We also provide begin() and end() member functions for the fhg::tensor template class. The iterators have unit strides and define a range that covers the complete memory index set of the container class. They can be categorized as randomaccess iterators and behave just as iterators provided by std::vector. The user can initialize all elements an fhg::tensor object A with the help of the std::fill template function or the rangebased forloop and the range defined by A.begin() and A.end().
4.2. Multidimensional Iterator
The template class fhg::multi_iterator defines a multidimensional iterator that allows to define the multiindex set of a multidimensional array or view. It is not used as an iterator per se, but functions as a factory class that instantiates stridebased iterators such as the previously described fhg::iterator class.
The template parameter stride_iterator_t is an iterator type that must allow to instantiate stridebased iterators with a current position and a stride. The template class contains four private member variables:

the current pointer type pointer,

the order of the container of type std::size_t and

two pointers to stride and extent tuples of type std::size_t*.
The multidimensional iterator provides a single constructor with which all member variables are initialized.
The fhg::tensor and fhg::tensor_view template classes offer the member functions mbegin() and mend() to conveniently instantiate fhg::multi_iterator classes. Next to the default assignments, the template class fhg::multi_iterator overloads the assignment operator for stridebased iterators.
The operator assigns the current pointer with the pointer of the argument. Next to the constructor, assignment and comparison operator, the fhg::multi_iterator provides two factory methods that instantiate stridebased iterators.
The instantiated stridebased iterators are initialized with the dimension r and the current position of the calling fhg::multi_iterator. In this way, we can iterate over the complete multiindex sets of multidimensional arrays and views with nonhierarchical data layouts. The following example illustrates how a threedimensional array A is iterated with the multidimensional and stridebased iterators fhg::multi_iterator and fhg::stride_iterator. Listing 1 can be regarded as the baseline implementation for firstorder template functions.
Let A be an object of type fhg::tensor or fhg::tensor_view of order three storing its elements according to for instance the firstorder storage format. Let I be a multidimensional iterator that points to the first element of A. We can then use the template function base in Listing 1 to initialize all elements with the value v. The function generates a range with the stridebased iterator j2 and j2End for the outermost dimension. Each increment operation ++j2 adds its stride to its internal pointer until the end of the range is reached. Using the assignment operator of I, the internal pointer of the stridebased iterator j2 is copied to I with which the stridebased iterator j1 is initialized. The stridebased iterator j0 is initialized in the same manner. Finally, the elements of A are set to the value v. We can also replace the innermost loop with
where std::fill is the template function provided by the C++ standard library.
5. Tensor Function Templates
Our tensor algorithms are implemented with multidimensional iterators and support a combination of multidimensional arrays and views with different data layouts, arbitrary order and extents. All of following tensor algorithms are parametrized by multidimensional iterator types that need

to provide two member functions begin() and end() in order to select and iterate over the corresponding index range and

to have the same functionality as a standard iterator with at least the input iterator capabilities.
Additional requirements of a dimension specific iteration depends on the algorithm that may require different iterator capabilities. An example of such an iterator is the multidimensional iterator template class fhg::multi_iterator that has been introduced in the previous section. Users can specify their own iterator type and use our tensor template functions. Please note that the member functions of fhg::tensor and fhg::tensor_view call the following first and if applicable higherorder functions. Both data structures, for instance, overload the relational and numeric operators. They provide a convenient interface that allows to express numerical algorithms close to the mathematical notation without specifying multidimensional iterators. The user may decide whether to call the higherorder tensor functions using member functions of tensor template class instances or to call them with a different multidimensional container or multidimensional iterator.
5.1. FirstLevel Tensor Operations
Function  Example (MATLAB)  Description 

for_each() 
C = C+3  Performs a unary operation for each element 
copy() 
C = A  Copies elements starting with the first element 
copy_if() 
C(A>3) = A(A>3)  Copies elements that match a criterion 
transform() 
C = A+3  Modifies, copies elements accord. to a unary op. 
transform() 
C = A+B  Combines elements accord. to a binary op. 
fill() 
C = 3  Replaces each element with a given value 
generate() 
C(i) = i.*i  Replaces each element with the result of an op. 
count() 
sum(A(:)==3)  Returns the number of elements 
count_if()  sum(A(:)<=2)  Returns the number of elements matching a crit. 
min_element() 
[,i] = min(A(:))  Returns the element with the smallest value 
max_element()  [,i] = max(A(:))  Returns the element with the largest value 
find() 
find(C(:)==3)  Searches for the first element matching the value 
find_if()  find(C(:)<=2)  Searches for the first element matching a crit. 
equal() 
all(C(:)==A(:)  Returns whether two ranges are equal 
mismatch()  find(C(:)!=A(:)  Returns the first elements that differ 
all_of() 
all(C(:)==3)  Returns whether all elements match a crit. 
any_of()  sum(C(:)==3)>0  Returns whether at least one element matches a crit. 
none_of()  sum(C(:)==3)==0  Returns whether none of the elements matches a crit. 
iota() 
C(:) = [1:n]  Replaces each element with incremented values 
accumulate()  sum(C(:))  Combines all element values (accord. to a binary op.) 
inner_product()  dot(C(:),A(:))  Combines all elements (accord. to a binary op.) 
Firstlevel tensor template functions such as fhg::for_each implement algorithms of the C++ standard library for dense multidimensional arrays and their views. Table 2 lists some of the implemented template functions of our tensor framework that make use of multidimensional iterators. Similar to the C++ standard library, the user can pass function objects to some of the firstlevel template functions that apply one or more function objects in each iteration. Functions objects are functionlike objects that provide a function call operator allowing the function to have a state. The user can define its own class with a function call operator to instantiate a function object or construct a function object using lambdaexpressions. Most of our firstlevel template functions require unary or binary predicates and unary or binary function objects just as it is the case for the C++ standard library.
In contrast to the C++ standard algorithms, our firstlevel tensor functions are designed for multidimensional arrays and views. They are implemented in a recursive fashion and iterate over multiple halfopen ranges. The iteration is accomplished with a multidimensional iterator where one or more elements of multidimensional arrays and views with different data layouts but the same multiindex are combined. Please note that dense multidimensional arrays with a contiguous data layout can also be manipulated with template functions of the C++ standard library. However, they are designed for containers with one dimension and therefore cannot manipulate arrays with different data layouts or with noncontiguously stored elements.
Please consider the fhg::for_each template function in Listing 2. The template function is recursively iterates over the complete memory index set of a multidimensional array using multidimensional iterators of type InputIt and applies the unary function of type UnaryFn to the elements that are referenced by the iterator. The implementation accepts any order greater zero with which fhg::for_each is called and therefore is the generalized version of the base function that has been provided in the previous section in Listing 1. The iteration of the memory index set is accomplished by recursively adding memory indices in each function call with a newly defined index range given by in.begin(r) and in.end(r).
More specifically, if p is the order of the multidimensional array, the function for_each needs to be initially called with p1. In this case, the outermost loop iterates over the halfopen range that is defined by it.begin(p1) and it.end(p1) covering the index range of the th dimension. The copyconstructed iterator f is then copied to the next function instance where . This is repeated until r equals zero where the first dimension of the data structure is accessed and the std::for_each template function of the C++ standard library is called. The multidimensional iterators f and l define a range and act like a standard iterator with input iterator capabilities. The maximum depth of the recursion is therefore p1 and the recursive function is called times where is the th extent of the multidimensional array or view.
Incrementing the th stridebased iterator in the th loop corresponds to a shift of its internal pointer by the th stride . The th iteration therefore sets the internal pointer of the th stridebased iterator to with , where is the memory address computed by the previous stridebased iterator and is the number of bytes to store an element of A. Given zerobased indices for all , the address of an element is given by
(20) 
where is the layout function already defined in Eq. (10). In case of a multidimensional array view, the th stride of the multidimensional iterator is given by where is step of the th range and is the stride of the referenced multidimensional array as discussed in Section 3.3. Additionally, the index of the first element is initialized with where is the th lower bound of the view that has been used in Eq. (19).
Example 5.1 ().
Let be a tensor or view of order with the shape tuple and any nonhierarchical layout. The following equation denotes the initialization of with a scalar.
(21) 
Given a tensor object A with elements of type float, the next listing implements Eq. (21) and initializes elements of an array A with the value alpha using the template function fhg::for_each.
Note that the object A can be a view and may have any arbitrary nonhierarchical layout.
Please note the similarity of the function signature of fhg::for_each with the one of std::for_each. In case of fhg::for_each, the first argument must be a multidimensional iterator that is capable of creating multiple halfopen index ranges. The second argument must be a unary function that is accepted, i.e. applicable by the std::for_each function. The first template parameter of the function indicates that the multidimensional iterator needs only to have input iterator traits. This might be different for other template functions such as fhg::sort postulating a random iterator.
Consider the fhg::transform template function in Listing 3. The implementation recursively applies the unary function fn to elements of a multidimensional array that is referenced by the multidimensional input iterator in. The results are stored in an array with the same order and extents using the multidimensional output iterator out. The template functions returns an iterator to the element past the last transformed element. The first iterator in points to and iterates over a multidimensional array or view with a data layout that may be different from the multidimensional array or view referenced by the second iterator.
The control flow is similar to the one of the fhg::for_each function with the difference that each function call returns the current output pointer. The innermost loop calls the C++ standard library std::transform template function. The dereferenced input and output iterators correspond to elements with the same multiindex regardless of the data layout. All template functions listed in Table 2 are implemented in the same manner.
Example 5.2 ().
Let and be multidimensional arrays of order with the same shape tuple . The multiplication of a scalar with elements of a tensor is then given by the equation
(22) 
where and . Let A and C be objects of type fhg::tensor or fhg::tensor_view with the same shape tuple and let alpha be a scalar. Eq. (22) can then be implemented with help of the fhg::transform function.
Example 5.3 ().
Let , and be multidimensional arrays with same shape tuples and a binary operation that is either an addition, subtraction, multiplication or division. The elements of are then given by the equation
(23) 
where and . Let A, B and C be objects of type fhg::tensor or fhg::tensor_view with the same shape tuple. Similar to the previous example, Eq. (23) can be implemented using the fhg::transform function.
Please note that we did not specify the storage formats and index offsets of the tensor data types. User of the framework is free to implement different types operations with appropriate binary operators using their own template classes or multidimensional iterators. For convenience, the fhg::tensor and fhg::tensor_view template classes provide member functions that perform entrywise operations with overloaded operators for different argument types. The overloaded operator+=() member function of both template classes calls the fhg::transform template function if its arguments are of type fhg::tensor and fhg::tensor_view. If the argument is a scalar of type const_reference or value_type, the template function fhg::for_each is called, see Example 21.
5.2. HigherLevel Tensor Operations
Higherlevel tensor operations exhibit a higher arithmetic intensity ratio than firstlevel tensor operations and perform one or more inner products over specified dimensions. Table 3 provides an overview of the implemented higherorder tensor operations. We will discuss the implementation of the three higherorder tensor operations, that are the tensortimesvector and tensortimes matrix and tensortimestensor multiplication. Please note that the transposition is included in this subsection as it can be expressed with a copy operation and a modified stride tuple.
Function  Notation  Example 

transpose()  C = permute(A,[2,1,3])  
tensor_times_vector()  C = ttv(A,b,1)  
tensor_times_matrix()  C = ttm(A,B,1)  
tensor_times_tensor()  C = ttt(A,B,[1 3],[2 3])  
outer_product()  C = ttt(A,B)  
inner_product()  C = ttt(A,B,[1:3])  
norm()  C = norm(A)  
tensor_times_vectors()  C = ttv(A,u,v,[1,3])  
tensor_times_matrices()  C = ttm(A,U,V,[1,3]) 
Let be an way array with the shape tuple and let be a way array with the shape tuple . Let also be a way array with its shape tuple . The mode tensorvector multiplication, denoted by , multiplies with according to
(24) 
where and . Please note that the tensorvector product equals to a common vectormatrix multiplication if and . For , the vector is multiplied with each frontal slice of the array . Listing 4 implements the tensortimesvector multiplication according to Eq. (24) where the contracting dimension must be greater than zero. Our framework provides also an implementation for this case. The second and third arguments, and , track the recursion depth where for and for and are initially set to and .
The fhg::ttv template function has a similar control flow compared to the previously discussed implementation of the fhg::transform template function. One difference is the recursive call in line 5 where the recursion depth equals the mode . After recursive function calls with , the inner product of two fibers is performed in line 10 using the std::inner_product of the standard library. The range of the first fiber is given by the range that is defined by the iterators A.begin(m) and A.end(m). The start of the second range is given by B.begin(0). The result is stored into the location pointed by fc. Interpreting lines 8 to 11 as slicevector multiplications, we might say that the tensorvector multiplication is composed of multiple slicetimesvector multiplications.
Example 5.4 ().
LetA, b and C are tensor objects of the same element type and layout and let the shape tuples of A, b and C be given by the tuples , and , respectively. We can then use following statement that performs the mode tensortimesvector multiplication.
A second implementation with only two branches is given when . The first branch contains the expressions of the elsebranch that are executed for . The second branch contains the computation of a loop over the second dimension (instead of the first) and computes the inner product of two fibers of the first dimension. We might therefore say that for , the tensortimesvector multiplication consists of multiple vectortimesmatrix multiplication where the matrix is the frontal slice of the first input array pointed by A.
Let be an way array with the shape tuple and let be a way array with the shape tuple . Let also be an way array with the shape tuple . The mode tensormatrix multiplication, denoted by , multiplies with according to
(25) 
with and . The tensormatrix multiplication corresponds to a tensorvector product when . If the order of the arrays and equals , the tensormatrix product corresponds to a common matrixmatrix multiplication.
The implementation of tensormatrix multiplication is almost identical to the recursive template function ttv where only the control flow of the most inner recursion differs. Moreover, the tensortimesmatrix multiplication can be expressed in terms of multiple slicematrix multiplications where one parameter suffices in order to generate ranges for the input and output iterators. Consider Listing 5 that illustrates the implementation of the innermost part of the template function fhg::ttm. It is the base case of the recursion and corresponds to a slicetimesmatrix multiplication. Parameter m is the contraction dimension and r denotes the recursion depth that initially is set to . Please note that needs to be greater than zero in this case. Our framework provides a second implementation for with a control and data flow that is similar to the tensortimesvector implementation.
The initial addresses of the input and output slices are generated in line 5 and 7 by calling A.begin(0) and C.begin(0), respectively. The initial address of the input matrix is generated in line 6 by calling B.begin(0). The instantiated stridebased iterators fa0, fb0 and fc0 iterate in lines 7 and 9 along the first dimension of the corresponding data structures and point to fibers and rows of the input and output slices, respectively. The range of the output fiber is determined in line 8 by the instantiation of the stride based iterators fcm and lcm. The value for each element of the output fiber is given by the inner product of the input fiber and input matrix row in lines 10 and 11. The range of the input fiber and matrix row are given by the ranges (A.begin(m),A.end(m)) and (B.begin(1), B.end(1)), respectively. Please note that iterator B needs to be initialized with the initial address in line 7.
The tensortensor product is the general form of the tensormatrix and tensorvector multiplication. Let be way array with and let be a way array with where and , . The mode tensortensor multiplication, depicted by , computes elements of the way array according to
(26) 
where the shape tuples satisfy the following equations:
(27)  
The first , elements of the permutation tuples and , respectively, determine the dimension ordering for . The last elements determine the dimensions that are contracted.
Equations (26) and (27) can be used to define the tensormatrix and tensorvector multiplication. Let and be the shape tuples of the first and second operand. A tensortensor multiplication is the mode tensorvector multiplication, if , , and , such that . The mode tensormatrix multiplication is given if and , where , , and , such that . The inner product is given when and for some , while the outer product is given when for some and .
Listing 6 illustrates the recursive implementation of the tensortimestensor multiplication as defined in Eq. (26). The implementation is recursive and performs the contraction without unfolding the tensors using multidimensional iterators only. Let us first consider the function signature of fhg::ttt. The first parameter k is the recursion depth of the template function. The parameter q is the number of dimensions of the input arrays that are contracted. The parameters r and s are the remaining number of dimensions of the input arrays that sum up to the order of the output array. The next two variables phi and psi are or point to permutation tuples that are needed to specify the dimension ordering of the output array and the contraction. The template function expects all integer variables to be zerobased. The control flow of the recursive function fhg::ttt contains four branches using ifelse statements. In each branch multidimensional iterators are adjusted and stridebased iterators are instantiated and incremented with respect to the equations (26) and (27). The else statement performs the innermost loop of the tensor multiplication which computes the inner product of two fibers.
The recursive template function fhg::ttt is called initially with . For , the control flow branches into the loop in line 7 that calls the function fhg::ttt with modified iterators A, C and with k+1. The iteration length of the th loop where is smaller than is determined by the stridebased iterators fa, la and fc which are instantiated with the multidimensional input and output iterators A and C, respectively. The iterator pair fa and la must be generated with the permutation tuple phi in order to satisfy Eq. (27). Similarly, for , stridebased iterators fb, lb and fc are instantiated with the multidimensional input and output iterators B and C, respectively. In this case, the permutation tuple psi is used, in order to generate the correct iterator pair for fb and lb with the indices psi[kr]. For , stridebased iterators are instantiated with the iterators A, B and indices phi[ks], psi[kr] of the dimensions to be contracted. Finally, when the last branch is reached with , the inner product of two fibers of the input arrays are computed. The dimension indices for the instantiation of the stridebased iterators with the multidimensional iterators A and B are phi[r+q1] and psi[s+q1].
Example 5.5 ().
Let A, B and C be tensor objects (such as fhg::tensor or fhg::tensor_view) of the same type and layout and let the shape tuples of the tensor objects A, B and C are {3,4,2}, {5,4,3,6} and {3,1,2}, respectively. We can then write the following statement in order to multiply A with B.
The first statement initializes C++ arrays that represents the permutation tuples phi and psi. The second statement calls the function fhg::ttt in Listing 6 and performs the 2mode tensortimestensor multiplication.
Higherorder tensor operations are also provided as member functions of the fhg::tensor class template and may be combined with pointwise operations. The following code snippet shows two template methods of the fhg::tensor template class for the tensormatrix and tensorvector multiplication.
The tensor instance in function times_matrix needs to have at most two dimensions. In function times_vector one the two extents of tensor must be one. The second parameter mode determines the dimension over which the contraction is performed, see equations (24) and (25). Please note that the template class fhg::tensor_view provides member functions with the same signature. As we will show with examples, the function signatures for higherorder tensor operations are similar to the function signatures that are provided in (Bader and Kolda, 2006). The user can therefore easily verify and compare the results of our tensor functions with those provided in (Bader and Kolda, 2006).
Example 5.6 ().
Consider the following listing, where A, b, B and C are tensor objects of the same element type and layout. Let the shape tuples of A, b, B and C be {3,4,2,6}, {3,1}, {4,5,6} and {2,5}, respectively. We can then express the multiplication of the tensor A with the vector b along the first dimension and the multiplication of the tensor B with the matrix C along the second dimension with the following statement.
The resulting temporary array object D is of type fhg::tensor<float> and has the shape tuple fhg::shape {4,2,6}.
The first expression A.times_vector(b,1) internally calls the template function ttv where the mode is zero. The second expression B.times_matrix(C,2) calls the template function ttm which is shown in Listing 5. Please consider the following member function signature of times_tensor.
Function times_tensor encapsulates the template function fhg::ttt presented in Listing 6 and provides a userfriendly interface. The parameters phi and psi can have different sizes and determine the permutation of the dimension indices of the input and output arrays. The second parameter q determines the number of contraction as discussed for the template function fhg::ttt.
The possibility to permute the resulting dimension indices of the output array can be regarded as a minor extension of the equivalent functions that are provided in (Bader and Kolda, 2006). The tensor classes provide two member function that exhibit a simpler interface with the minor restriction of not being able to permute the dimension indices of the output array.
The first function allows to specify one permutation tuple which specifies the permutation for both tensors. In case of the second function, parameter phi and psi with equal size specify the permutation of the dimension indices of the left and right hand side tensors. In both cases, the permutation tuples specify the dimension indices to be contracted and can be smaller than the order of the tensors.
The following listing illustrates a multiplication of two tensors A and B with unequal shape and layout tuples where the multiplication is performed according to Eq. (26).
Example 5.7 ().
Consider the following listing, where A, B and C are tensor objects of the same type and layout. Let the shape tuples of A, B and C be {3,4,2,6}, {4,3,2} and {2,3,4}, respectively.
The first statement multiplies A with B with the permutation tuple {2,3} which contracts the dimension 2 and 3 of the input tensors. The second statement performs a multiplication of the 2nd, 3rd and 3rd, 1st dimensions of the first and second input tensors, respectively.
The comments denote Matlab statements that contain userdefined functions of the toolbox discussed in (Bader and Kolda, 2006). In some cases, a tensor is multiplied with a sequence of vectors and matrices such as in case of the higherorder singular value decomposition (Lathauwer et al., 2000b; Bader and Kolda, 2006). The following listing exemplifies the function call where the tensor A from the previous example is multiplied with multiple vectors.
Example 5.8 ().
Let A be the tensor from the previous example and let a, b and c be tensor tensor objects of type fhg::tensor<float> representing vectors. The multiplication of a list of vectors denoted by {a,b,c} with the dimensions given by {1,2,4} is then given by the following statements.
The commented statement performs the same operation where the last parameter denotes the excluded mode.
Function times_vectors internally calls the times_vector function starting with the vector c to the vector a. The implementation of the higherorder power method in Listing 7 uses the times_vectors.
The method is discussed in (Lathauwer et al., 2000b) and can be regarded as generalization of the best rankone approximation for matrices (Bader and Kolda, 2006)
. The method estimates the best rankone approximation of a real valued tensor
of order , by finding a rank tensor composed of a scalar and unitnorm vectors with which the least square cost function(28) 
is minimized. The first parameter K is maximum number of iterations of the outer loop. We could additionally insert a convergence criteria such as difference of the previous and current value of as stated in (Lathauwer et al., 2000b). The second parameter is the tensor of type fhg::tensor<T> that shall be approximated. The third input parameter u are starting values of the unitnorm vectors. In (Lathauwer et al., 2000b) the starting values are set as the most dominant left singular vectors of the matrix unfolding of . The last input parameter l denote scaling factors .
6. Conclusions
We have presented a flexible C++ tensor framework that allows users to conveniently implement tensor algorithms with C++ and to easily verify numerical results with the toolbox presented in (Bader and Kolda, 2006). The framework contains a multilayered software stack and applies different types of abstractions at each layer.
Users can choose the highlevel layers of the framework to instantiate tensor classes with arbitrary nonhierarchical data layout, order and dimension extents. Member functions of the tensor classes help to generate views and access multidimensional data supporting all types of nonhierarchical data layouts including the first and lastorder storage formats. The transparency of the data access functions with respect to the data layout can be identified as a unique feature of our framework.
The lower layers of the software stack provide tensor template functions that are implemented only in terms of multidimensional iterator types separating algorithms and data structures following the design principle of the Standard Template Library. Users of our framework are able to include their own tensor types and extend the functionality of our library without modification of the tensor template classes. We have presented our own multidimensional and stridebased iterator classes and exemplified their usage together with implementations of first and higherlevel tensor operations.
Member functions of the tensor classes encapsulate tensor template functions and allow to program tensor algorithms in a convenient fashion. We have exemplified their usage and implemented a method for the best rankone approximation of real valued tensors that has been described in (Lathauwer et al., 2000b). All of the tensor template functions have been implemented in a recursive fashion and execute inplace with no restriction of the contraction mode, order or dimension extents.
In future, we plan to provide and incorporate inplace tensor functions into our framework for a faster computation. We would like to offer or integrate parallelized and cacheefficient versions of the template functions for different data layouts and investigate their runtime behavior.
Appendix A Software
a.1. Organization and Usage
TLib is a headeronly library that only depends on the C++ standard library. The main folder of our framework has two subdirectories. The tensor/code directory contains all library code contents with class declarations and implementations. The tensor/examples directory contains all examples which demonstrate the instantiation and usage of the tensor class template and tensor functions. The declaration of the class templates and function templates are found in the tensor/code/inc folder where as the definitions are placed in the folder tensor/code/src. When working with our tensor library, the inclusion of the code/inc/tensor.h is sufficient. However, one could also use the tensor function templates without our tensor classes but a different multidimensional array. In such a case, only the corresponding header files can be included.
a.2. Examples
TLib is implemented in C++ using features of the C++14 standard. Users of our framework need to compile the examples in the tensor/examples with a C++ compiler that supports the C++14 standard. We have tested our framework using gcc version 6.3.0 and clang version 4.0.0 compilers. We have provided a Makefile that generates all executables in the tensor/examples/build directory. Examples 7 to 11 also generate MATLAB script files for verification that are placed in the tensor/examples/out directory. The script files can be directly executed in MATLAB. Examples 8 to 11 also require the MATLAB toolbox described in (Bader and Kolda, 2006) with which the numerical results are validated.
Acknowledgements.
We are grateful to Fritz MayerLindenberg, Thomas Perschke and Konrad Moren for their helpful comments and suggestions. The authors wish to express special thanks to Volker Schatz for his assistance with various aspects of the project.References
 (1)
 Andres et al. (2010) Björn Andres, Ullrich Köthe, Thorben Kröger, and Fred A. Hamprecht. 2010. RuntimeFlexible Multidimensional Arrays and Views for C++98 and C++0x. CoRR abs/1008.2909 (2010).
 Bader and Kolda (2006) Brett W. Bader and Tamara G. Kolda. 2006. Algorithm 862: MATLAB tensor classes for fast algorithm prototyping. ACM Trans. Math. Softw. 32 (December 2006), 635–653. Issue 4.
 Chatterjee et al. (1999) Siddhartha Chatterjee, Alvin R. Lebeck, Praveen K. Patnala, and Mithuna Thottethodi. 1999. Recursive array layouts and fast parallel matrix multiplication. In Proceedings of the eleventh annual ACM symposium on Parallel algorithms and architectures (SPAA ’99). ACM, New York, NY, USA, 222–231.
 Cichocki et al. (2009) A. Cichocki, R. Zdunek, Phan A.H., and S. Amari. 2009. Nonnegative Matrix and Tensor Factorizations (1 ed.). John Wiley & Sons.
 Cui et al. (2014) ChunFeng Cui, YuHong Dai, and Jiawang Nie. 2014.
Comments
There are no comments yet.