On the Preservation of Spatio-temporal Information in Machine Learning Applications

06/15/2020 ∙ by Yigit Oktar, et al.

In conventional machine learning applications, each data attribute is assumed to be orthogonal to the others. Namely, every pair of dimensions is orthogonal to each other, and thus no distinction is made between the relations among dimensions. However, this is certainly not the case for real-world signals, which naturally originate from a spatio-temporal configuration. As a result, the conventional vectorization process disrupts all of the spatio-temporal information about the order/place of data, whether it be 1D, 2D, 3D, or 4D. In this paper, the problem of orthogonality is first investigated through conventional k-means of images, where images are processed as vectors. As a solution, shift-invariant k-means is proposed in a novel framework with the help of sparse representations. A generalization of shift-invariant k-means, convolutional dictionary learning, is then utilized as an unsupervised feature extraction method for classification. Experiments suggest that Gabor feature extraction, as a simulation of shallow convolutional neural networks, provides slightly better performance than convolutional dictionary learning. Many alternatives to convolutional-logic are also discussed for spatio-temporal information preservation, including a spatio-temporal hypercomplex encoding scheme.







1 Introduction

In traditional signal processing and machine learning problems, each data dimension (attribute) is assumed to be orthogonal to the others. In other words, no distinction is made between the cross-relations of dimensions. Since signals carry information through a spatio-temporal configuration, assuming such orthogonality of signal dimensions is highly ill-posed even for 1D cases. This phenomenon is depicted simply in Fig. 1.

Let us numerically analyze the severity of the problem of casting signals as vectors. Assume that an $n$-sized vector is received under the orthogonality consideration and it is known that the original form is an $n$-sized 1D signal. If one tries to recover the original spatial configuration without further knowledge (i.e., which value was in which cell), all $n!$ possible spatial configurations are equally likely. This problem becomes even more serious when the dimensionality of the signal itself increases. Suppose an $n$-sized vector is received again, but the underlying signal is now assumed to be an image. Not only are there $n!$ permutations involved, but one also needs to guess the height and width of the image. In general, for an $n$-sized vector and a $k$-dimensional original signal, the number of possible spatial configurations that the signal could have been in is given as $n!\,d_k(n)$, where $d_k$ is the $k$-th Piltz function, which gives the number of ordered factorizations of $n$ as a product of $k$ terms Sandor (1996).
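The counting argument above (permutations of the values multiplied by the number of ordered factorizations of the vector length) can be checked numerically. The following is a minimal sketch; the function names `piltz` and `spatial_configurations` are illustrative choices, and factors equal to 1 are counted as valid side lengths, which is an assumption about how degenerate shapes are treated.

```python
from math import factorial

def piltz(n: int, k: int) -> int:
    """Number of ordered factorizations of n into exactly k positive integer factors."""
    if k == 1:
        return 1
    # Pick the first factor d (any divisor of n), then factor n // d into k - 1 terms.
    return sum(piltz(n // d, k - 1) for d in range(1, n + 1) if n % d == 0)

def spatial_configurations(n: int, k: int) -> int:
    """Permutations of the n values times the number of possible k-dimensional shapes."""
    return factorial(n) * piltz(n, k)

print(spatial_configurations(12, 1))  # 12 values known to form a 1D signal
print(spatial_configurations(12, 2))  # 12 values forming an image of unknown height and width
```

Even for a 12-element vector the image case multiplies the already factorial count by the number of admissible height and width pairs.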

Figure 1: (Left) The orthogonality consideration. Every pairwise relation between dimensions is indistinguishable because of orthogonality. (Right) When the spatial configuration of a 3-cell 1D signal is considered, the relation between cells x and z is obviously different from the other relations, i.e., x and z are not neighbors.

Once the issue described above is taken into account, it is not hard to see that many conventional machine learning formulations are highly ill-posed from the perspective of real-world signals. Let us now consider the case of k-means applied to vectorized real-world signals, and suppose images for simplicity. As k-means originally assumes orthogonality of dimensions, it is easy to apply the usual Euclidean distance metric between vectors. However, it is questionable whether this captures a meaningful notion of the distance between two images, or of the average of two images. An example in this light can be given from the domain of Computer Graphics. A direct linear interpolation between two rotation matrices is not natural, thus quaternions are utilized, leading to a formulation called spherical linear interpolation Jafari and Molaei (2014). A similar consideration might also be superior in the clustering problem of images using k-means. However, it is not trivial to cast a general image as a quaternion-like structure for further processing.

Let us illustrate that direct vectorized distance calculation is not natural for images by giving a more concrete example. Assume that there is a main image of a digit, as exemplified in Fig. 2(a). The question here is which other image is more similar to this main image. Is it the different digit in Fig. 2(b), which occupies roughly the same spatial position within the frame, or is it the same digit in Fig. 2(c), which has the exact shape but is linearly shifted in the frame? A vectorized distance measure will dictate that the different digit in Fig. 2(b) is closer to the main image, which is definitely not natural. Therefore, a shift-invariant distance metric could be more powerful in this case.

(a) Main image of a digit

(b) A different digit

(c) Another image of the digit in (a)
Figure 2: Vectorized distance will dictate that (b) is closer to the main image. However, it is more natural to say that the two images of the same digit, (a) and (c), are more similar to each other.

For two given images, the standard (vectorized) Euclidean distance is given in Eqn. (1). This formula can be enhanced with a shift-invariant adaptation as in Eqn. (2), in which one of the images is zero-padded on its sides. Alternatively, a shift-invariant distance notion can also be given in terms of the inverse of cross-correlation as in Eqn. (3). Nevertheless, even if a suitable distance metric is found to designate the closest centroid, it is not trivial to obtain the average of a cluster as the new mean for the next step.
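The contrast between the vectorized distance and a shift-invariant one can be sketched in a few lines. This is an illustrative variant under stated assumptions, not the exact formulas of Eqns. (1)-(3): shifts are applied with zero fill inside a fixed frame rather than on an explicitly padded image, and the helper names `euclidean`, `shift` and `shift_invariant_distance` are hypothetical.

```python
def euclidean(a, b):
    """Vectorized Euclidean distance between two equally sized 2D lists."""
    return sum((p - q) ** 2 for ra, rb in zip(a, b) for p, q in zip(ra, rb)) ** 0.5

def shift(img, dy, dx):
    """Shift img by (dy, dx), filling vacated cells with zeros (no wrap-around)."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if 0 <= y - dy < h and 0 <= x - dx < w:
                out[y][x] = img[y - dy][x - dx]
    return out

def shift_invariant_distance(a, b):
    """Smallest Euclidean distance between a and any zero-filled shift of b."""
    h, w = len(a), len(a[0])
    return min(euclidean(a, shift(b, dy, dx))
               for dy in range(-h + 1, h) for dx in range(-w + 1, w))

# Toy 5x5 frames: a is a 2x2 block, b is the same block shifted, c is a different shape in place.
a = [[0.0] * 5 for _ in range(5)]
for y, x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    a[y][x] = 1.0
b = shift(a, 3, 3)
c = [row[:] for row in a]
c[1][1] = 0.0

print(euclidean(a, b), euclidean(a, c))
print(shift_invariant_distance(a, b), shift_invariant_distance(a, c))
```

In this toy case the vectorized distance ranks the differently shaped image c as closer to a, while the shift-invariant distance ranks the shifted copy b as closer, mirroring the situation in Fig. 2.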


In this study, the k-means formulation will be considered within a sparse representations framework to provide a self-sufficient shift-invariant version. As noted in earlier studies Oktar and Turkan (2018, 2019, 2020), the original k-means problem can be expressed in a sparse representations framework as a dictionary learning problem. A shift-invariant version of k-means can then be derived through the more recent convolutional dictionary learning formulation. It is not surprising that a convolutional approach leads to a shift-invariant scheme, as convolution is an operator that breaks the orthogonality assumption by considering neighboring data points group by group, forming a relation between spatial regions in the signal.

The paper is organized as follows. Section 2 gives the mathematical description of the proposed shift-invariant k-means concept, followed by a generalization through convolutional dictionary learning for classification. Section 3 details the experimental setup and reports experimental results obtained from the proposed concepts. Later, Sec. 4 discusses many alternatives to convolutional-logic for spatio-temporal information preservation, including a spatio-temporal hypercomplex encoding scheme. Section 5 finally concludes this paper with a brief summary.

2 Convolutional Sparse Representations

It is possible to mathematically formulate the conventional k-means problem in a sparse representations framework given in Eqn. (4) as follows,

$$\min_{D,\{x_i\}} \sum_i \| y_i - D x_i \|_2^2 \quad \text{s.t.} \quad \|x_i\|_0 = 1,\;\; \mathbf{1}^\top x_i = 1,\;\; x_i \geq 0 \;\; \forall i \tag{4}$$

where the matrix $D$ is an over-complete dictionary and $x_i$ is the sparse representation of the data point $y_i$. Each sparse vector contains only one non-zero component, and this component is forced to be positive and sum-to-one. Dictionary columns as atoms (namely $d_j$) designate centroids.
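To make the correspondence concrete, the sketch below treats the assignment step of k-means as 1-sparse coding with a one-hot code, and the centroid update as a dictionary update. It is a minimal illustration under the constraints just described; the function names are hypothetical and the data are toy values.

```python
def sparse_code_kmeans(y, atoms):
    """One-hot code x minimizing ||y - D x||^2 when the single non-zero entry must be 1."""
    errors = [sum((yi - di) ** 2 for yi, di in zip(y, d)) for d in atoms]
    j = errors.index(min(errors))
    return [1.0 if k == j else 0.0 for k in range(len(atoms))]

def update_dictionary(data, codes, atoms):
    """Replace each atom (centroid) by the mean of the points whose code selects it."""
    new_atoms = []
    for j, d in enumerate(atoms):
        members = [y for y, x in zip(data, codes) if x[j] == 1.0]
        if members:
            new_atoms.append([sum(col) / len(members) for col in zip(*members)])
        else:
            new_atoms.append(list(d))  # an unused atom is left unchanged
    return new_atoms

data = [[0, 0], [0, 2], [10, 10], [10, 12]]
atoms = [[0, 0], [10, 10]]
codes = [sparse_code_kmeans(y, atoms) for y in data]
atoms = update_dictionary(data, codes, atoms)
print(codes)
print(atoms)
```

Alternating these two functions until the codes stop changing reproduces the usual Lloyd-style iteration.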

While Eqn. (4) represents a direct formulation of classical k-means, it corresponds to the problematic orthogonality consideration as mentioned previously. A possible shift-invariant alternative of k-means is given in Eqn. (5) as follows,

$$\min_{\{d_j\},\{x_i\},\{j_i\}} \sum_i \| y_i - d_{j_i} * x_i \|_2^2 \quad \text{s.t.} \quad \|x_i\|_0 = 1 \;\; \forall i \tag{5}$$

where $*$ denotes the convolution operator and $j_i$ is the index of the optimal convolutional atom, or in other words the convolutional centroid that is assigned to the data point $y_i$. Notice here that the non-zero entry of $x_i$ is not forced to be $1$, but can now be anything. Therefore, this formulation is not only shift-invariant but also invariant to the magnitude of the pattern. However, this should then be complemented by an atom normalization process.

Because of the linearity of convolution, the shifted atoms can also be expressed as columns of a large convolutional dictionary, denoted here by $D_G$, as depicted in Fig. 3. The local dictionary $D_L$ consists of the convolutional atoms, whereas the global dictionary $D_G$ is filled with zeros outside the convolutional area. In this regard, the mathematical optimization in Eqn. (5) evolves to Eqn. (6), where $p_i$ denotes the index of the single non-zero element of $x_i$ counted from the top, and $p_i$ modulo $K$ (the number of clusters) determines the index of the assigned convolutional centroid.

Figure 3: The local dictionary consists of convolutional atoms, whereas the global dictionary is filled with zeros outside the convolutional area.
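The structure in Fig. 3 can be sketched for the 1D case as follows. This is an illustration under assumptions: columns are ordered so that the atoms are interleaved across shifts (which makes the column index modulo the number of atoms equal to the atom index), only shifts that keep the atom fully inside the signal are used, and the names `global_dictionary` and `synthesize` are hypothetical.

```python
def global_dictionary(local_atoms, n):
    """Columns of the global dictionary: every local atom placed at every valid shift."""
    K, s = len(local_atoms), len(local_atoms[0])
    columns = []
    for q in range(n - s + 1):      # position of the atom inside the length-n signal
        for j in range(K):          # interleave atoms so that column index p satisfies p % K == j
            columns.append([0.0] * q + list(local_atoms[j]) + [0.0] * (n - s - q))
    return columns

def synthesize(columns, x):
    """Compute the product of the global dictionary with a code x."""
    n = len(columns[0])
    return [sum(col[t] * c for col, c in zip(columns, x)) for t in range(n)]

local_atoms = [[1.0, 2.0], [3.0, -1.0]]
columns = global_dictionary(local_atoms, n=4)

x = [0.0] * len(columns)
p = 3                # index of the single non-zero entry
x[p] = 2.0           # its magnitude is free in the shift-invariant formulation
print(synthesize(columns, x))                  # atom p % K, shifted by p // K, scaled by 2
print(p % len(local_atoms), p // len(local_atoms))
```

A code with a single non-zero entry therefore selects an atom, a shift and a magnitude at once.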

2.1 A solution to shift-invariant k-means

Since the optimization in Eqn. (6) is highly non-convex, an approximate iterative solution is employed, alternating between assignment to clusters and centroid update, akin to Lloyd's algorithm for the original k-means problem Jain (2010). In terms of sparse representations, these two steps correspond directly to sparse coding and dictionary update, respectively.

In this light, the data assignment step is solved with Orthogonal Matching Pursuit (OMP) Pati et al. (1993), assuming the dictionary is fixed, to satisfy the $\ell_0$-norm sparsity constraint. On the other hand, a straightforward utilization of conventional dictionary update algorithms, such as the Method of Optimal Directions (MOD) Engan et al. (1999) or K-SVD Aharon et al. (2006), is not obvious, because only the inherent subdictionary composed of convolutional centroids is to be updated within the global dictionary. To solve this problem, each individual block of the overall sparse representation is extracted as an individual subproblem, on which the MOD (i.e., least-squares) update is applied. As the last step, the final updated subdictionary is attained by averaging all of the resulting individual subdictionaries. To the best of the available knowledge, this naive solution to the centroid update problem is not extensively covered in the literature, thus it can be coined as the Method of Optimal Subdirections on Average (MOSA).

Experimental results indicate that this adaptation of shift-invariant k-means provides better results when compared to its original version for datasets in which considerable shifts exist.
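The alternation described in this subsection can be sketched for 1D signals as below. This is a simplified stand-in rather than the exact procedure: with a sparsity level of one, OMP reduces to picking the atom and shift with the largest absolute correlation, and the update shown here takes the per-signal least-squares estimate of the assigned atom (the aligned patch divided by its coefficient), averages these estimates per atom and renormalizes. All names and the toy data are illustrative assumptions.

```python
def normalize(v):
    norm = sum(t * t for t in v) ** 0.5
    return [t / norm for t in v] if norm > 0 else list(v)

def assign(y, atoms):
    """1-sparse coding: best (atom index, shift, coefficient) for signal y."""
    s = len(atoms[0])
    best = None
    for j, d in enumerate(atoms):
        for q in range(len(y) - s + 1):
            c = sum(a * b for a, b in zip(y[q:q + s], d))  # correlation with a unit-norm atom
            if best is None or abs(c) > abs(best[2]):
                best = (j, q, c)
    return best

def update(data, assignments, atoms):
    """Average the per-signal least-squares atom estimates, then renormalize."""
    s = len(atoms[0])
    new_atoms = []
    for j, d in enumerate(atoms):
        estimates = [[t / c for t in y[q:q + s]]
                     for y, (jj, q, c) in zip(data, assignments)
                     if jj == j and abs(c) > 1e-12]
        if estimates:
            mean = [sum(col) / len(estimates) for col in zip(*estimates)]
            new_atoms.append(normalize(mean))
        else:
            new_atoms.append(list(d))  # keep an atom that received no signals
    return new_atoms

def shift_invariant_kmeans(data, atoms, iterations=10):
    atoms = [normalize(d) for d in atoms]
    for _ in range(iterations):
        assignments = [assign(y, atoms) for y in data]
        atoms = update(data, assignments, atoms)
    return atoms, [assign(y, atoms) for y in data]

# Two patterns, each appearing at two different positions.
data = [[1, 2, 1, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 1, 2, 1, 0],
        [0, 1, -2, 1, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 1, -2, 1]]
atoms, assignments = shift_invariant_kmeans(data, [[1, 1, 1], [1, -1, 0]])
print([(j, q) for j, q, c in assignments])
```

Because the assignment uses the absolute correlation, a pattern and its scaled or negated copies fall into the same cluster, which reflects the magnitude invariance noted for Eqn. (5).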

2.2 Convolutional dictionary learning as a generalization

Encouraged by the superiority of the shift-invariant k-means formulation obtained through a convolutional sparse representation as an unsupervised task, the question is then how to generalize this convolutional approach to other machine learning tasks such as classification. The claim is that an unsupervised feature extraction layer performed through convolutional dictionary learning, as a generalization, can provide superiority over the orthogonal-only consideration in supervised tasks as well. This claim has already been validated in the literature many times Zeiler et al. (2010); Pu et al. (2016); Garcia-Cardona and Wohlberg (2018), but an extensive comparison with the classical orthogonality consideration is usually missing.

In this regard, a shift from the strict $\ell_0$-norm constraint to a more lenient $\ell_1$-norm is considered. There are two main reasons behind this decision. First, it is unclear how to set the sparsity level in an $\ell_0$-norm formulation, since denser choices drastically affect the computational complexity in greedy approaches and sparser solutions can lead to severe information loss. Second, and importantly, most practical studies in the literature are based on the $\ell_1$-norm Garcia-Cardona and Wohlberg (2018).

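With the above consideration, a final optimization for convolutional dictionary learning is given in Eqn. (7) by introducing the $\ell_1$-norm regularization into the formula via a Lagrange multiplier $\lambda$. Iterative solutions that alternate between convolutional sparse coding and dictionary update exist in the literature Garcia-Cardona and Wohlberg (2018).

$$\min_{\{d_m\},\{x_{i,m}\}} \frac{1}{2} \sum_i \Big\| y_i - \sum_m d_m * x_{i,m} \Big\|_2^2 + \lambda \sum_i \sum_m \| x_{i,m} \|_1 \quad \text{s.t.} \quad \| d_m \|_2 = 1 \;\; \forall m \tag{7}$$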


In fact, the aim of this study is not to devise new approaches to the above optimization but to utilize it as an approach to the orthogonality problem. This unsupervised convolutional decomposition of a signal can be regarded as a feature extraction method that tackles the problem of orthogonality, where the extracted features for a data point are formed by concatenating the sparse codes obtained for that data point over all convolutional atoms. Note that concatenation here still assumes orthogonality; however, there now exists a convolutional-logic before the orthogonality consideration, which alleviates its main drawbacks from the start. The effectiveness of such a layer is to be experimentally tested against various other feature extraction methods in an extensive manner.
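As a rough illustration of the sparse coding half of this pipeline, the sketch below codes a 1D signal against fixed filters using plain ISTA (iterative soft thresholding) and then concatenates the resulting coefficient maps into one feature vector. ISTA is a swapped-in generic solver chosen for brevity, not the algorithm of any particular library; the filters, the value of the regularization weight and all function names are assumptions, and the dictionary update step is omitted.

```python
def conv_full(d, x):
    """Full 1D convolution of kernel d with coefficient map x."""
    out = [0.0] * (len(d) + len(x) - 1)
    for i, di in enumerate(d):
        for k, xk in enumerate(x):
            out[i + k] += di * xk
    return out

def corr_valid(r, d):
    """Adjoint of conv_full with respect to x: valid cross-correlation of r with d."""
    return [sum(d[i] * r[k + i] for i in range(len(d))) for k in range(len(r) - len(d) + 1)]

def soft(v, t):
    """Soft-thresholding operator, the proximal map of the l1-norm."""
    return max(abs(v) - t, 0.0) * (1 if v > 0 else -1)

def reconstruct(filters, maps):
    return [sum(vals) for vals in zip(*(conv_full(d, x) for d, x in zip(filters, maps)))]

def objective(y, filters, maps, lam):
    fit = 0.5 * sum((a - b) ** 2 for a, b in zip(y, reconstruct(filters, maps)))
    return fit + lam * sum(abs(v) for x in maps for v in x)

def csc_ista(y, filters, lam=0.1, iterations=200):
    """Convolutional sparse coding of y with fixed, equally sized filters via ISTA."""
    m = len(y) - len(filters[0]) + 1
    maps = [[0.0] * m for _ in filters]
    # Conservative step size: 1 / (upper bound on the Lipschitz constant of the gradient).
    step = 1.0 / sum(sum(abs(t) for t in d) ** 2 for d in filters)
    for _ in range(iterations):
        residual = [a - b for a, b in zip(y, reconstruct(filters, maps))]
        maps = [[soft(v + step * g, step * lam) for v, g in zip(x, corr_valid(residual, d))]
                for d, x in zip(filters, maps)]
    return maps

def features(maps):
    """Concatenate the coefficient maps of all filters into one feature vector."""
    return [v for x in maps for v in x]

filters = [[1.0, 2.0, 1.0], [1.0, -1.0, 0.0]]
y = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0]
maps = csc_ista(y, filters)
print(len(features(maps)), objective(y, filters, maps, 0.1))
```

The feature vector is longer than the input signal, which is the dimensionality expansion that a subsequent pooling stage is meant to counteract.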

3 Experimental Results

In the following, two sets of experiments are performed corresponding to the discussions raised in Sec. 2.1 and Sec. 2.2. All experiments are carried out in Matlab on a machine with an Intel(R) Core(TM) HQ-series CPU running Microsoft Windows.

Figure 4: Clustering accuracy (%) of k-means (KM) based methods as a function of mean shift applied on MNIST. The proposed shift-invariant KM is robust to shifts.

3.1 Shift-invariant k-means

In this set of experiments, a dataset is formed by extracting the same number of leading training images from each class of the MNIST handwritten digit database LeCun et al. (2010). Four modified versions of this dataset are then obtained to test the shift-invariance property. First, empty images of increasing sizes are initialized, and the original digits are inserted into these widened images with uniformly random shifts in the x and y directions. The mean shift along each axis is chosen to suit the size of the images. The clustering accuracy rates of k-means (KM), Kernel KM Dhillon et al. (2004), Ensemble KM Iam-on and Garrett (2010) and shift-invariant KM on these cases are illustrated in Fig. 4.

Not surprisingly, the performance of shift-invariant KM stays relatively stable under varying shifts, whereas all other methods start to perform poorly once shifts are introduced. It is evident from Fig. 4 that the mean shifts applied, measured relative to the original image size, are enough to disrupt the functionality of classical methods for these datasets. Note also that classical methods perform nearly as poorly as random guessing (RAND) in cases of extreme shifts, as the accuracies of KM, Kernel KM, Ensemble KM and shift-invariant KM in Fig. 4 show. This indicates that neither kernelization nor ensembles can provide an effective solution to the shift-invariance problem.

One may argue that a simple preprocessing step, which extracts a precise subimage of the digit in each image, would be enough to sustain shift-invariance for clustering these images; however, such a naive approach cannot be a general solution for natural images. On the other hand, the logic in shift-invariant KM provides an automatic solution, which is both theoretically and practically sound, without any need for preprocessing. The simplicity and effectiveness of this clustering approach can further pave the way to more general techniques that apply the same logic to other machine learning tasks. In fact, a generalization of shift-invariant KM via convolutional dictionary learning can be utilized as a powerful unsupervised feature extraction method for classification that alleviates the drawbacks of the classical orthogonality consideration.

3.2 Convolutional dictionary learning

In this set of experiments, convolutional dictionary learning as an unsupervised feature extraction method is compared against various other well-known feature extraction schemes. An existing library called SPORCO Wohlberg (2017) is utilized for convolutional dictionary learning. In the following reported experiments, linear support vector machine (SVM) classifiers are employed after the feature extraction phase. The motivation behind the use of a linear SVM is that a successful feature extraction must transform the sample space into a linearly separable one as much as possible.

Three versions of dictionary learning are employed. The global-only dictionary learning (DL) operates over dictionary atoms of the full image size, namely atoms cover sample images globally. The patch-based dictionary learning (PDL) trains over dictionary atoms of a small patch size, where local image patches are extracted in a sliding window manner. This type of approach can be regarded as a local-only one. Both DL and PDL methods are realized through the regular iterative steps of dictionary learning, i.e., sparse coding and dictionary update. In the proposed method, namely convolutional dictionary learning (CDL), atoms are likewise small local kernels, but now Eqn. (7) is in action instead. Considering the structure of the dictionary in a 2D form of Fig. 3, CDL can be classified as both a local and a global approach. The effects of regular versus convolutional approaches are apparent in the atoms learned at the end of the training process, as exemplified in Fig. 5. Notice that the convolutional approach results in filters having a Gabor-like appearance.

(a) Regular - PDL

(b) Convolutional - CDL
Figure 5: Patch-based versus convolutional dictionaries learned on MNIST. For a clear visualization, atoms are of size .

Other well-known methods that take spatial information in images into account are Histogram of Oriented Gradients (HOG) Dalal and Triggs (2005), Local Binary Patterns (LBP) Ojala et al. (1996) and Gabor Feature Extraction (GFE) Haghighat et al. (2015). For HOG, orientation histograms are computed over fixed-size cells and signed orientation is not used. For LBP, neighbors are selected on a circular pattern of fixed radius and rotation information is also encoded; histograms are computed over fixed-size cells and no normalization is performed. In GFE, a Gabor filter bank with different scales and orientations is employed.
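For readers unfamiliar with Gabor filter banks, the following sketch builds the real part of 2D Gabor kernels over a few wavelengths (scales) and orientations. It uses the standard Gabor form of a Gaussian envelope multiplied by a cosine carrier; the kernel size, wavelengths, number of orientations and `sigma` below are placeholder values chosen for illustration, not the settings of the benchmark.

```python
import math

def gabor_kernel(size, wavelength, theta, sigma, gamma=1.0, psi=0.0):
    """Real part of a 2D Gabor filter sampled on a size x size grid (size should be odd)."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            xr = x * math.cos(theta) + y * math.sin(theta)    # coordinate along the carrier
            yr = -x * math.sin(theta) + y * math.cos(theta)   # coordinate across the carrier
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
            row.append(envelope * math.cos(2 * math.pi * xr / wavelength + psi))
        kernel.append(row)
    return kernel

def gabor_bank(size, wavelengths, n_orientations, sigma):
    """One kernel per (wavelength, orientation) pair, orientations evenly spaced in [0, pi)."""
    return [gabor_kernel(size, w, k * math.pi / n_orientations, sigma)
            for w in wavelengths for k in range(n_orientations)]

bank = gabor_bank(size=9, wavelengths=[4.0, 8.0], n_orientations=4, sigma=3.0)
print(len(bank), len(bank[0]), len(bank[0][0]))
```

Filtering an image with every kernel in such a bank and pooling the responses gives a fixed, training-free counterpart to the first convolutional layer of a network.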

Another important categorization of methods is whether they perform dimensionality reduction or expansion. The last two methods to be mentioned, namely Autoencoders (AE) and Principal Component Analysis (PCA), both perform dimensionality reduction. Notice that HOG and LBP also accomplish effective dimensionality reduction, while the other methods instead go through an expansion process. A pooling procedure is closely tied to expansion in the case of spatial methods, and is usually performed to reduce the computational cost, with the added advantage of a certain rotation/position invariance. Among the methods with dimensionality expansion (DL, PDL, CDL, GFE), DL and PDL do not perform an additional pooling, since they do not truly preserve the spatial configuration. Although PDL takes local spatial information into account, there is no trivial way to perform a meaningful pooling on top. On the other hand, CDL contains a max pooling layer and GFE has an average pooling layer, with the same cell size in both cases.
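The two pooling variants mentioned above differ only in the reduction applied within each cell. A minimal sketch with non-overlapping square cells is given below; the cell size and the helper names are illustrative assumptions.

```python
def pool2d(feature_map, cell, reduce):
    """Apply `reduce` over non-overlapping cell x cell blocks of a 2D feature map."""
    h, w = len(feature_map), len(feature_map[0])
    return [[reduce([feature_map[y][x]
                     for y in range(top, min(top + cell, h))
                     for x in range(left, min(left + cell, w))])
             for left in range(0, w, cell)]
            for top in range(0, h, cell)]

def mean(values):
    return sum(values) / len(values)

feature_map = [[1, 2, 3, 4],
               [5, 6, 7, 8],
               [9, 10, 11, 12],
               [13, 14, 15, 16]]
print(pool2d(feature_map, 2, max))    # max pooling, as paired with CDL
print(pool2d(feature_map, 2, mean))   # average pooling, as paired with GFE
```

Either variant shrinks each side of the map by the cell size, which offsets part of the dimensionality expansion introduced by the filters.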

Table 1: Feature extraction methods in the benchmark.
Figure 6: Classification accuracy (%) as a function of varying training sizes applied on (top) MNIST and (bottom) USPS.

Table 1 summarizes all feature extraction methods in the benchmark. Note that “Spatial” attribute appearing in this table is an antonym for the word “orthogonality” in the context of this study. For example, both PDL and CDL can be described as spatial methods since they process images by considering pixels within certain local neighborhoods. However, each pixel is indistinguishable from the others in DL because of the vectorization of the whole frame, resulting in an orthogonality consideration.

With all methods described in detail, Fig. 6 depicts classification performance as a function of varying training sizes applied on the MNIST and USPS Hull (1994) databases. As a global-only dictionary learning method, DL performs clearly worse for small training sizes. A similar behavior is also slightly observable in CDL, as both a global and a local dictionary learning approach. Although PDL does not perform poorly for small training sizes, it does not provide a noticeable advantage over DL in the long run, while CDL outperforms both DL and PDL, performing at the level of HOG when most of the dataset is used. HOG and GFE together compete for the top performance, whereas CDL performs slightly worse but is better than LBP. Most importantly, it is apparent that PDL cannot be an alternative to convolutional-logic, at least for the 2D case.

Table 2: Classification accuracy (%) of feature extraction methods with linear SVM applied on the whole MNIST and USPS datasets.

Table 2 lists the final classification accuracy results with linear SVM applied on the whole MNIST and USPS databases. GFE is the top performing method as an unsupervised simulation of first layers of a convolutional neural network (CNN). Additionally, CDL and HOG compete for the second place.

Figure 7: Classification accuracy (%) as a function of different patch/kernel sizes applied on (preprocessed) MIT-BIH using linear SVM classifiers.

The convolutional dictionary learning concept is further applied in a 1D setting. The MIT-BIH arrhythmia dataset Moody and Mark (2001) is used, in which the signals correspond to electrocardiogram (ECG) shapes of heartbeats for cases unaffected (normal) and affected by different arrhythmias. These signals are preprocessed and segmented so that each segment represents a heartbeat belonging to one of five different classes Kachuee et al. (2018).

Preliminary experimentation suggests that the results could be highly dependent on the chosen patch/kernel size, as CDL performs poorly for small patch/kernel sizes. These results are summarized in Fig. 7. In this figure, all methods are devised to be resource-wise equivalent, i.e., they have equal dimensionality of features. The DL, PDL and CDL algorithms have the same definitions as in 2D, translated into their 1D equivalent versions. Finally, CNN here denotes a 1D convolutional neural network as a substructure of a regular 2D version. For a fair comparison, the architecture of CNN is composed of a convolutional layer, a batch normalization layer, a ReLU layer, a max pooling layer, a fully connected layer, a softmax and a classification layer. In other words, the convolutional-logic is applied once (without getting deep) before the classification stage.

The main observation here is that all spatially-aware methods (PDL, CDL, CNN) outperform the orthogonality consideration of DL, as long as the patch/kernel size is large enough. It is apparent that relatively small patch sizes cause CDL to perform very poorly. Such behavior is not observable for CNN, which performs well for all kernel sizes chosen. The most surprising result is that PDL outperforms CNN in nearly all cases. However, note that CNN here does not have a deep architecture. The other surprising point is that CDL is the worst among all spatially-aware methods. It is possible that the employed SPORCO library is not optimized for 1D settings.

To verify the generality of the above results, another 1D problem from a different domain is chosen: the classification of electric devices according to their electric usage profile from raw data. The dataset is obtained from Chen et al. (2015) and comes with separate train and test samples of fixed length, each carrying one of several possible classification labels. In parallel to Fig. 7, quite similar results are obtained in Fig. 8. With a large enough patch/kernel size, PDL performance is similar to that of CNN. All methods outperform the baseline of DL.

Figure 8: Classification accuracy (%) as a function of different patch/kernel sizes applied on the raw Electric Devices dataset using linear SVM classifiers.

Inspired by all of the above experiments measuring the effect of patch/kernel size, the final simulation results on the patch/kernel effect (using the whole MNIST database) are depicted in Fig. 9. It is clearly observable that CDL nearly matches the performance of a shallow CNN, while PDL performs poorly in this 2D case. In conclusion, one can expect PDL to be an alternative to CNN in 1D and CDL in 2D, as long as the patch/kernel size is sensible. Another note is that GFE followed by a linear SVM classifier is a viable unsupervised way of simulating a shallow CNN.

Figure 9: Classification accuracy (%) as a function of different patch/kernel sizes applied on MNIST using linear SVM classifiers.

4 Discussion on the Spatio-temporal Information Preservation

4.1 Variations on neural networks

Convolution with a kernel on the input side of a layer corresponds to a locally connected structure instead of a traditional fully connected one. Neighboring cells are now related to each other, preserving the original spatial configuration. As an alternative to the convolutional approach, neighboring cells on the input or the output side of a neural network layer can also be related through direct edges in-between, as another way of preserving the original spatial configuration of the input cells. The possibility of edges within the same layer suggests thinking of a neural network as a more general directed graph. In fact, this line of logic leads to an alternative structure known as recurrent neural networks (RNN). In the most general sense, RNNs represent directed graphs. Note that it is possible to build upon the basic RNN structure through bidirectional logic Schuster and Paliwal (1997) and the long short-term memory concept Sak et al. (2014).

On the other hand, empirical evaluation suggests that temporal convolution, or in other words 1D convolutional-logic, surpasses the capacity of recurrent architectures in sequence modeling Bai et al. (2018). It is still an open question whether the temporal dimension should be regarded as just another spatial dimension or whether a hybrid approach is better. This is rather a deep issue related to the properties of space and time. Instead, considering neural networks of any structure as directed and possibly cyclic graphs, or in other words as neural graphs, might pave the way to a better understanding of the brain. Note that this concept is rather different from graph neural networks, which use graphs as inputs Scarselli et al. (2008).

Another generalization for neural networks is possible by considering infinite width neural networks Arora et al. (2019). Recent results suggest that deep neural networks that are allowed to become infinitely wide converge to models called Gaussian processes Lee et al. (2017). However, such studies do not consider the case when there are in-between connections within layers. Taking these connections into account can further lead to infinite but continuous (input or output) layers, which is indeed applicable mathematically and practically. A generalization of neural network layer cases in this sense is depicted in Fig. 10. The third case in this figure is important in that it leads to the concept of functional machine learning. This alone may not be enough to preserve the spatial configuration of the input layer. Therefore, additional locally connected versions of these structures can also be proposed.

Figure 10: A generalization of neural network layer cases. (From left-to-right) Discrete-discrete (classical), discrete-continuous, continuous-discrete and continuous-continuous input and output layers.

4.2 Multilinear approach

4.2.1 Tensor-based sparse representations

Images are not vectors, and vectorization therefore breaks the spatial coherency of images, as investigated by Hazan et al. (2005). This line of thought is centered around tensor factorization as a generalization. The study in Hazan et al. (2005) reports that by treating training images as a 3D cube and performing a non-negative tensor factorization (NTF), higher efficiency, discrimination and representation power can be achieved when compared to non-negative matrix factorization (NMF).

There are two main branches of tensor decomposition. In the first branch, studies are based on canonical polyadic decomposition (CPD), sometimes also referred to as CANDECOMP/PARAFAC Kolda and Bader (2009). The most relevant example from the literature is K-CPD Duan et al. (2012), an algorithm of overcomplete dictionary learning for tensor sparse coding based on a multilinear version of OMP and CANDECOMP/PARAFAC decomposition. K-CPD surpasses conventional methods in a series of image denoising experiments. Most recently, a similar framework has also been successfully utilized in tensor-based sparse representations for classification of multiphase medical images Wang et al. (2020). The second branch is centered around the Tucker decomposition model instead, which is a more general model than CPD Caiafa and Cichocki (2013). The study in Caiafa and Cichocki (2012) presents the foundations of the Tucker decomposition model by defining the Tensor-OMP algorithm, which computes a block-sparse representation of a tensor with respect to a Kronecker basis. In Caiafa and Cichocki (2013), the authors report that a block-sparse structure imposed on a core tensor through subtensors provides significant results. The Tucker model together with a block-sparsity restriction may work significantly well, since the higher dimensional block structure is meaningfully applied on the original sparse tensor in the form of subtensors. There are many other studies in the literature specifically based on the Tucker model of sparse representations, with or without block-sparsity, and additionally including dictionary learning Qi et al. (2013); Roemer et al. (2014); Peng et al. (2014).

Certain parallels can be drawn between convolutional dictionary learning and tensor-based sparse representations. As an example, the study in Huang and Anandkumar (2015) proposes a novel framework for learning convolutional models through tensor decomposition and shows that cumulant tensors have a CPD whose components correspond to convolutional filters and their circulant shifts.

On the other hand, tensor-based approaches (both CPD and Tucker models) still do not provide a solution to the 1D case. Without loss of generality, let us assume that the signal is in the form of a column vector $y$. Since the signal is one-dimensional, there will be a single matrix $D$ for that single dimension in the Tucker model. Therefore, the model attained is in Eqn. (8). It is also possible to show that the Tucker mode-1 product reduces here to an ordinary matrix-vector product, $x \times_1 D = D x$. From the CPD model perspective, there is equivalently $y = \sum_j x_j d_j$, where $x_j$ is the single sparse coefficient associated with atom $d_j$. Hence, one arrives at the standard formulation in Eqn. (8); namely, Tucker and CPD models are equivalent in the one-dimensional case, both corresponding to the conventional orthogonal sparse representation.

$$y = D x \tag{8}$$
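The equivalence claimed for the one-dimensional case amounts to the elementary identity that a matrix-vector product equals a weighted sum of the matrix columns. A small numerical check, with made-up values for the dictionary and the sparse code, is sketched below.

```python
def matvec(D, x):
    """Tucker view in 1D: a single factor matrix D (list of rows) applied to the code x."""
    return [sum(dij * xj for dij, xj in zip(row, x)) for row in D]

def sum_of_atoms(D, x):
    """CPD view in 1D: a weighted sum of the dictionary columns (atoms)."""
    atoms = list(zip(*D))  # columns of D
    return [sum(xj * atom[t] for xj, atom in zip(x, atoms)) for t in range(len(D))]

D = [[1.0, 0.0, 2.0],
     [0.0, 1.0, -1.0]]
x = [0.0, 3.0, 0.5]     # sparse code with two active atoms

print(matvec(D, x))
print(sum_of_atoms(D, x))
```

Both views yield the same vector, so neither carries any information about the position of a sample within the signal.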


The above observation raises an important question. Although tensor-based approaches provide an advantage when the signals are multidimensional, these formulations will not provide an edge for 1D signals. The remedy may come from considering a 1D signal not solely as a 1D vector of elements. In other words, a 1D complex vector can be formed by coding the cell positions in the imaginary parts to overcome the orthogonality problem in the standard 1D vector representation, as depicted in Fig. 11. This paves the way to performing sparse representations of complex valued data, or even quaternion valued data, to accommodate more information in cases of higher dimensionality. The utmost generalization is achieved through geometric algebra as a generalization of hypercomplex numbers.

Figure 11: An encoding scheme to preserve spatio-temporal information for (top) 1D mono audio and (bottom) 2D grayscale image cases.

4.2.2 Complex, hypercomplex and geometric algebra based approaches

Note that quaternion algebra is the first hypercomplex number system to be devised that is similar to the real and complex number systems Moxey et al. (2003). The study in Xu et al. (2015) states that a quaternion-based model can achieve a more structured representation when compared to a tensor-based model. Comparisons between quaternion-SVD and tensor-SVD Kilmer and Martin (2011) establish their equivalence, but the superiority of quaternion-SVD arises when it is combined with the sparse representation model. It is possible to formulate a quaternion-valued sparse representation of color images that surpasses the conventional logic Xu et al. (2015).

There are four possible models to represent color images, as suggested in Xu et al. (2015). The first is the monochromatic model, in which each color channel is represented separately. The second is the concatenation model, where a single vector is formed by concatenating the three color channels Mairal et al. (2007). The third is the tensor-based model, where the color image is thought of as a 3D cube of values. The last is the quaternion-based model, where each color channel is assigned to one imaginary unit, i.e., r, g, b to i, j, k, respectively. Most importantly, all these models are analytically unified.
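The four models can be contrasted on a toy patch as data layouts. The following is only a sketch: the patch values, the function names, and the choice of a zero scalar part in the quaternion model are assumptions made for illustration.

```python
# Sketch: four layouts for the same 2 x 2 RGB patch (values are made up).
rgb_patch = [[(10, 20, 30), (40, 50, 60)],
             [(70, 80, 90), (15, 25, 35)]]


def monochromatic(patch):
    """Three separate vectors, one per color channel."""
    pixels = [p for row in patch for p in row]
    return tuple([p[c] for p in pixels] for c in range(3))


def concatenation(patch):
    """A single vector formed by stacking the r, g and b channel vectors."""
    r, g, b = monochromatic(patch)
    return r + g + b


def tensor(patch):
    """A height x width x 3 cube of values, kept as a nested list."""
    return [[list(p) for p in row] for row in patch]


def quaternion(patch):
    """One quaternion (w, i, j, k) per pixel, with r, g, b on i, j, k.

    The zero scalar part w is an assumption of this sketch.
    """
    return [(0, r, g, b) for row in patch for (r, g, b) in row]


print(monochromatic(rgb_patch))
print(concatenation(rgb_patch))
print(tensor(rgb_patch))
print(quaternion(rgb_patch))
```

Only the quaternion layout keeps the three channel values of a pixel together as a single algebraic element, which is the structural property the quaternion-based model relies on.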

There is also one more possible model that is subtler. As depicted in Fig. 11, one can encode mono audio as a vector of complex numbers in which the imaginary values indicate the temporal position. In a similar way, one can encode a grayscale image as a quaternion-valued vector in which the imaginary parts are allocated to indicate the pixel positions. When a color image is thought of as a 3D cube, there is a possible quaternion-based model in which the imaginary units encode the position within this cube and the scalar part denotes the value of that cell. The same quaternion-based encoding can be applied to any 3D scalar data.
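A minimal sketch of this position-aware encoding is given below. Several details are assumptions of the example and are not fixed by the text: positions are stored as raw integer indices, quaternions are kept as plain 4-tuples (scalar, i, j, k), and the row and column indices are placed on the i and j units. The point of the sketch is that the encoded elements can be shuffled arbitrarily without losing the original arrangement.

```python
import random


def encode_1d(samples):
    """Complex number per sample: real part = value, imaginary part = time index."""
    return [complex(s, n) for n, s in enumerate(samples)]


def decode_1d(encoded):
    """Recover the ordered samples from encoded elements given in any order."""
    return [z.real for z in sorted(encoded, key=lambda z: z.imag)]


def encode_2d(image):
    """Quaternion tuple per pixel: (value, row, column, 0)."""
    return [(v, r, c, 0) for r, row in enumerate(image) for c, v in enumerate(row)]


def decode_2d(encoded, height, width):
    """Rebuild the image grid from encoded pixels given in any order."""
    image = [[None] * width for _ in range(height)]
    for v, r, c, _ in encoded:
        image[r][c] = v
    return image


def encode_3d(cube):
    """Quaternion tuple per cell: scalar = value, (i, j, k) = position in the cube."""
    return [(v, x, y, z)
            for x, plane in enumerate(cube)
            for y, row in enumerate(plane)
            for z, v in enumerate(row)]


audio = [0.5, -0.25, 0.75, 0.0]
image = [[1, 2, 3],
         [4, 5, 6]]

rng = random.Random(0)  # fixed seed so the shuffle is reproducible
shuffled_audio = encode_1d(audio)
rng.shuffle(shuffled_audio)
shuffled_image = encode_2d(image)
rng.shuffle(shuffled_image)

print(decode_1d(shuffled_audio))
print(decode_2d(shuffled_image, 2, 3))
```

In a conventional real-valued vector the same shuffle would destroy the signal, since the position of a value is carried only by its index in the vector.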

For further machine learning in this proposed scheme, a hypercomplex-to-real feature extraction layer is required, since current mainstream classification algorithms need real-valued data. Another option is to consult classification algorithms that can directly handle hypercomplex values. This line of logic paves the way to considering complex/hypercomplex-valued neural networks as viable tools Hirose (2012); Isokawa et al. (2003). As future work, a comparison of spatio-temporally encoded hypercomplex neural networks with conventional convolutional or recurrent neural networks may lead to a deeper understanding of the deep learning concept. As a motivation, a single complex-valued neuron can solve the XOR problem Nitta (2003). In addition, the fact that quaternions can be used to implement associative memory in neural networks is promising Chen et al. (2017).
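To make the XOR remark concrete, the sketch below builds one complex-valued neuron with hand-picked weights and a quadrant-based decision rule. It is written in the spirit of Nitta (2003) but is not claimed to reproduce the exact construction of that study; the weights, the bias and the activation rule are assumptions chosen for illustration.

```python
def complex_neuron_xor(x1, x2):
    """Single neuron: complex weighted sum followed by a quadrant-based decision."""
    w1 = complex(-2, 0)   # moves the real part from +1 to -1 when x1 = 1
    w2 = complex(0, -2)   # moves the imaginary part from +1 to -1 when x2 = 1
    bias = complex(1, 1)
    z = w1 * x1 + w2 * x2 + bias
    # Output 1 when z lies in the second or fourth quadrant,
    # i.e. when the real and imaginary parts have opposite signs.
    return 1 if z.real * z.imag < 0 else 0


for a in (0, 1):
    for b in (0, 1):
        print(a, b, complex_neuron_xor(a, b))
```

The decision boundary consists of the two coordinate axes of the complex plane, which a single real-valued neuron with one linear boundary cannot realize.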

Another line of generalization deals with the case in which the data has more than three dimensions. In such a case, a quaternion is not enough to designate both the cell position and its value. As an extension, octonion algebra can accommodate up to seven imaginary channels Popa (2016); Lazendic et al. (2018a); however, it loses the associativity property. The study in Lazendic et al. (2018b) reports that all algebras of dimension larger than eight lose important properties, since they contain algebras of smaller dimension as subalgebras. This might be an issue related to the physics of space and time, which is beyond the scope of this study. The important fact is that the domain dealing with the generalization of hypercomplex numbers is called “geometric algebra”, and it has been gaining great attention lately Wang et al. (2019).
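The loss of associativity can be checked numerically with the Cayley-Dickson construction, which builds quaternions from pairs of complex numbers and octonions from pairs of quaternions. The sketch below uses one common sign convention, (a, b)(c, d) = (ac - d*b, da + bc*), and represents elements as flat coefficient lists; both choices are assumptions of this example.

```python
def conj(x):
    """Cayley-Dickson conjugate: (a, b)* = (a*, -b)."""
    if len(x) == 1:
        return list(x)
    h = len(x) // 2
    return conj(x[:h]) + [-v for v in x[h:]]


def add(x, y):
    return [p + q for p, q in zip(x, y)]


def sub(x, y):
    return [p - q for p, q in zip(x, y)]


def mul(x, y):
    """Cayley-Dickson product: (a, b)(c, d) = (ac - d*b, da + bc*)."""
    if len(x) == 1:
        return [x[0] * y[0]]
    h = len(x) // 2
    a, b = x[:h], x[h:]
    c, d = y[:h], y[h:]
    return sub(mul(a, c), mul(conj(d), b)) + add(mul(d, a), mul(b, conj(c)))


def unit(n, dim):
    """Basis element e_n of a dim-dimensional algebra (e_0 is the real unit)."""
    e = [0] * dim
    e[n] = 1
    return e


# Quaternions (dimension 4): multiplication is associative.
p, q, r = [1, 2, 3, 4], [0, -1, 2, 1], [2, 0, 1, -3]
quat_left = mul(mul(p, q), r)
quat_right = mul(p, mul(q, r))

# Octonions (dimension 8): associativity fails for e1, e2, e4.
e1, e2, e4 = unit(1, 8), unit(2, 8), unit(4, 8)
oct_left = mul(mul(e1, e2), e4)
oct_right = mul(e1, mul(e2, e4))

print(quat_left == quat_right)
print(oct_left, oct_right)
```

For the three octonion units chosen here, the two bracketings differ exactly by a sign.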

5 Conclusion

This study aims to draw attention to the orthogonality assumption made by many machine learning methods, such as k-means. The convolution operator can be used as a remedy for this problem, as it partially preserves the spatio-temporal information inherent in signals. However, one may need to find alternatives to convolutional approaches in order to further increase the understanding of this subject. Spatially sparse connections in neural networks might be one alternative. A continuous-to-discrete generalization of a neural network layer can also pave the way to the concept of functional machine learning. Most importantly, analytic approaches such as multilinear formulations must be thoroughly investigated as alternatives. In fact, to compare methods assuming orthogonality with convolutional logic, hypercomplex versions of classical methods must first be considered, where the imaginary parts of hypercomplex values encode the spatio-temporal placement. As noted before, the 1D case might be a crucial one that should not be underestimated.

Going back to the clustering problem, one should now notice that shift-invariant k-means can include rotation invariance as a more general formulation Barthelemy et al. (2012). Interestingly, the study in Bar and Sapiro (2010) notes that a log-polar mapping converts rotations and scalings to shifts in the x and y axes, respectively; therefore, invariance under general transformations is possible. In the bigger picture, convolutional logic and other frameworks that sustain invariance are related to the two-stream hypothesis (i.e., the where pathway and the what pathway), a model of the neural processing of vision as well as hearing Eysenck and Keane (2005). In other words, a spatio-temporal information preserving perspective on the clustering problem brings us closer to the inner working principles of the brain. Also related to convolution, multidimensional generalizations of Gabor filters can be investigated as future work.
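The log-polar property can be verified on a single point. In the sketch below a planar point is treated as a complex number, so that a rotation combined with a scaling is one complex multiplication and the log-polar coordinates are the real and imaginary parts of the complex logarithm. The function names and the numerical values are assumptions made for this example; which image axis carries the radius and which the angle depends on the layout of the log-polar grid.

```python
import cmath
import math


def log_polar(x, y):
    """Map a planar point to (log radius, angle) via the complex logarithm."""
    z = cmath.log(complex(x, y))
    return z.real, z.imag


def rotate_and_scale(x, y, scale, angle):
    """Rotate by `angle` radians about the origin and scale by `scale`."""
    w = complex(x, y) * cmath.rect(scale, angle)
    return w.real, w.imag


x, y = 3.0, 1.0
scale, angle = 2.0, 0.4  # small angle, so no wrap-around at +/- pi occurs

rho0, theta0 = log_polar(x, y)
rho1, theta1 = log_polar(*rotate_and_scale(x, y, scale, angle))

# The scaling appears as a shift of log(scale) along the log-radius axis,
# and the rotation appears as a shift of `angle` along the angle axis.
print(rho1 - rho0, math.log(scale))
print(theta1 - theta0, angle)
```

Since both transformations become plain translations in the mapped domain, a shift-invariant method applied there inherits rotation and scale invariance, up to the wrap-around of the angle coordinate.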

A final general note concerns the distinction between the analysis and synthesis sparse models. Throughout this study, the synthesis model is used, in which the signal is expressed as the dictionary multiplied by a sparse coefficient vector. However, there is also the analysis model, in which the dictionary multiplied by the input signal now yields the sparse codes Shekhar et al. (2014); Gu et al. (2017). Such a model is closer to neural network formulations, and further investigation of the analysis model might pave the way to a unified perspective on sparse representation models that also includes neural networks.
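The difference between the two models can be seen in a toy example. The sketch below is only illustrative: the symbols `D`, `x`, `Omega` and `z`, the small synthesis dictionary, and the use of a first-difference operator as the analysis dictionary are all assumptions of this example.

```python
def matvec(M, v):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * e for m, e in zip(row, v)) for row in M]


def count_nonzero(v):
    return sum(1 for e in v if e != 0)


# Synthesis model: the signal is built from a few atoms, y = D x with sparse x.
D = [[1, 0, 0, 1, 1],
     [0, 1, 0, 1, -1],
     [0, 0, 1, 0, 1]]
x = [0, 0, 3, 2, 0]          # only two active atoms
y_syn = matvec(D, x)


def first_difference(n):
    """(n - 1) x n operator whose i-th row computes v[i + 1] - v[i]."""
    return [[-1 if j == i else 1 if j == i + 1 else 0 for j in range(n)]
            for i in range(n - 1)]


# Analysis model: the operator is applied to the signal, z = Omega y with sparse z.
y_ana = [2, 2, 2, 5, 5, 1]   # piecewise-constant signal
Omega = first_difference(len(y_ana))
z = matvec(Omega, y_ana)

print(y_syn, count_nonzero(x))
print(z, count_nonzero(z))
```

In the analysis form the sparse code is obtained by a single forward multiplication, which resembles a linear neural network layer applied to the input.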


References

  • M. Aharon, M. Elad, and A. Bruckstein (2006) K-SVD: an algorithm for designing overcomplete dictionaries for sparse representation. IEEE Trans. Signal Process. 54 (11), pp. 4311–4322. Cited by: §2.1.
  • S. Arora, S. S. Du, Z. Li, R. Salakhutdinov, R. Wang, and D. Yu (2019) Harnessing the power of infinitely wide deep nets on small-data tasks. arXiv preprint arXiv:1910.01663. Cited by: §4.1.
  • S. Bai, J. Z. Kolter, and V. Koltun (2018) An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271. Cited by: §4.1.
  • L. Bar and G. Sapiro (2010) Hierarchical dictionary learning for invariant classification. In IEEE Int. Conf. Acoustics, Speech, Signal Process., pp. 3578–3581. Cited by: §5.
  • Q. Barthelemy, A. Larue, A. Mayoue, D. Mercier, and J. I. Mars (2012) Shift & 2D rotation invariant sparse coding for multivariate signals. IEEE Trans. Signal Process. 60 (4), pp. 1597–1611. Cited by: §5.
  • C. F. Caiafa and A. Cichocki (2012) Block sparse representations of tensors using Kronecker bases. In IEEE Int. Conf. Acoustics, Speech, Signal Process., pp. 2709–2712. Cited by: §4.2.1.
  • C. F. Caiafa and A. Cichocki (2013) Computing sparse representations of multidimensional signals using Kronecker bases. Neural Comput. 25 (1), pp. 186–220. Cited by: §4.2.1.
  • X. Chen, Q. Song, and Z. Li (2017) Design and analysis of quaternion-valued neural networks for associative memories. IEEE Trans. Syst., Man, Cyber.: Syst. 48 (12), pp. 2305–2314. Cited by: §4.2.2.
  • Y. Chen, E. Keogh, B. Hu, N. Begum, A. Bagnall, A. Mueen, and G. Batista (2015) The UCR Time Series Classification Archive. Cited by: §3.2.
  • N. Dalal and B. Triggs (2005) Histograms of oriented gradients for human detection. In IEEE Comp. Soc. Conf. Comp. Vis. Patt. Recog., Vol. 1, pp. 886–893. Cited by: §3.2.
  • I. S. Dhillon, Y. Guan, and B. Kulis (2004) Kernel k-means: spectral clustering and normalized cuts. In ACM Int. Conf. Knowl. Discovery Data Mining, pp. 551–556. Cited by: §3.1.
  • G. Duan, H. Wang, Z. Liu, J. Deng, and Y. Chen (2012) K-CPD: learning of overcomplete dictionaries for tensor sparse coding. In IEEE Int. Conf. Patt. Recog., pp. 493–496. Cited by: §4.2.1.
  • K. Engan, S. O. Aase, and J. H. Husoy (1999) Method of optimal directions for frame design. In IEEE Int. Conf. Acoustics, Speech, Signal Process., Vol. 5, pp. 2443–2446. Cited by: §2.1.
  • M. W. Eysenck and M. T. Keane (2005) Cognitive psychology: a student’s handbook. Taylor & Francis. Cited by: §5.
  • C. Garcia-Cardona and B. Wohlberg (2018) Convolutional dictionary learning: a comparative review and new algorithms. IEEE Trans. Comput. Imag. 4 (3), pp. 366–381. Cited by: §2.2, §2.2, §2.2.
  • S. Gu, D. Meng, W. Zuo, and L. Zhang (2017) Joint convolutional analysis and synthesis sparse representation for single image layer separation. In IEEE Int. Conf. Comp. Vis., pp. 1708–1716. Cited by: §5.
  • M. Haghighat, S. Zonouz, and M. Abdel-Mottaleb (2015) CloudID: trustworthy cloud-based and cross-enterprise biometric identification. Expert Syst. Appli. 42 (21), pp. 7905–7916. Cited by: §3.2.
  • T. Hazan, S. Polak, and A. Shashua (2005) Sparse image coding using a 3D non-negative tensor factorization. In IEEE Int. Conf. Comp. Vis., Vol. 1, pp. 50–57. Cited by: §4.2.1.
  • A. Hirose (2012) Complex-valued neural networks. Vol. 400, Springer Science & Business Media. Cited by: §4.2.2.
  • F. Huang and A. Anandkumar (2015) Convolutional dictionary learning through tensor factorization. In Feature Extraction: Modern Questions and Challenges, pp. 116–129. Cited by: §4.2.1.
  • J. J. Hull (1994) A database for handwritten text recognition research. IEEE Trans. Patt. Anal. Mach. Intell. 16 (5), pp. 550–554. Cited by: §3.2.
  • N. Iam-on and S. Garrett (2010) Linkclue: a matlab package for link-based cluster ensembles. J. Stat. Software 36 (9), pp. 1–36. Cited by: §3.1.
  • T. Isokawa, T. Kusakabe, N. Matsui, and F. Peper (2003) Quaternion neural network and its application. In Int. Conf. Knowledge-based Intell. Inf. Eng. Syst., pp. 318–324. Cited by: §4.2.2.
  • M. Jafari and H. Molaei (2014) Spherical linear interpolation and bezier curves. General Sci. Res. 2 (1), pp. 13–17. Cited by: §1.
  • A. K. Jain (2010) Data clustering: 50 years beyond K-means. Patt. Recog. Lett. 31 (8), pp. 651–666. Cited by: §2.1.
  • M. Kachuee, S. Fazeli, and M. Sarrafzadeh (2018) Ecg heartbeat classification: a deep transferable representation. In IEEE Int. Conf. Healthcare Informatics, pp. 443–444. Cited by: §3.2.
  • M. E. Kilmer and C. D. Martin (2011) Factorization strategies for third-order tensors. Linear Algebra Appli. 435 (3), pp. 641–658. Cited by: §4.2.2.
  • T. G. Kolda and B. W. Bader (2009) Tensor decompositions and applications. SIAM Review 51 (3), pp. 455–500. Cited by: §4.2.1.
  • S. Lazendic, H. De Bie, and A. Pizurica (2018a) Octonion sparse representation for color and multispectral image processing. In European Signal Process. Conf., pp. 608–612. Cited by: §4.2.2.
  • S. Lazendic, A. Pizurica, and H. De Bie (2018b) Hypercomplex algebras for dictionary learning. In Conf. Applied Geo. Algebras Comp. Sci. Eng., pp. 57–64. Cited by: §4.2.2.
  • Y. LeCun, C. Cortes, and C. J. C. Burges (2010) MNIST Handwritten Digit Database. Cited by: §3.1.
  • J. Lee, Y. Bahri, R. Novak, S. S. Schoenholz, J. Pennington, and J. Sohl-Dickstein (2017) Deep neural networks as gaussian processes. arXiv preprint arXiv:1711.00165. Cited by: §4.1.
  • J. Mairal, M. Elad, and G. Sapiro (2007) Sparse representation for color image restoration. IEEE Trans. Image Process. 17 (1), pp. 53–69. Cited by: §4.2.2.
  • G. B. Moody and R. G. Mark (2001) The impact of the MIT-BIH arrhythmia database. IEEE Eng. Med. Biology Mag. 20 (3), pp. 45–50. Cited by: §3.2.
  • C. E. Moxey, S. J. Sangwine, and T. A. Ell (2003) Hypercomplex correlation techniques for vector images. IEEE Trans. Signal Process. 51 (7), pp. 1941–1953. Cited by: §4.2.2.
  • T. Nitta (2003) Solving the XOR problem and the detection of symmetry using a single complex-valued neuron. Neural Netw. 16 (8), pp. 1101–1105. Cited by: §4.2.2.
  • T. Ojala, M. Pietikainen, and D. Harwood (1996) A comparative study of texture measures with classification based on featured distributions. Patt. Recog. 29 (1), pp. 51–59. Cited by: §3.2.
  • Y. Oktar and M. Turkan (2018) A review of sparsity-based clustering methods. Signal Process. 148, pp. 20–30. Cited by: §1.
  • Y. Oktar and M. Turkan (2019) K-polytopes: a superproblem of k-means. Signal, Image, Video Process. 13 (6), pp. 1207–1214. Cited by: §1.
  • Y. Oktar and M. Turkan (2020) Evolutionary simplicial learning as a generative and compact sparse framework for classification. Signal Process. 174, pp. 107634. Cited by: §1.
  • Y. C. Pati, R. Rezaiifar, and P. S. Krishnaprasad (1993) Orthogonal matching pursuit: recursive function approximation with applications to wavelet decomposition. In Proc. Asilomar Conf. Signals, Syst., Comp., pp. 40–44. Cited by: §2.1.
  • Y. Peng, D. Meng, Z. Xu, C. Gao, Y. Yang, and B. Zhang (2014) Decomposable nonlocal tensor dictionary learning for multispectral image denoising. In IEEE Conf. Comp. Vis. Patt. Recog., pp. 2949–2956. Cited by: §4.2.1.
  • C. Popa (2016) Octonion-valued neural networks. In Int. Conf. Artif. Neural Netw., pp. 435–443. Cited by: §4.2.2.
  • Y. Pu, W. Yuan, A. Stevens, C. Li, and L. Carin (2016) A deep generative deconvolutional image model. In Artif. Intell. Stat., pp. 741–750. Cited by: §2.2.
  • N. Qi, Y. Shi, X. Sun, J. Wang, and B. Yin (2013) Two dimensional synthesis sparse model. In IEEE Int. Conf. Multimedia Expo, pp. 1–6. Cited by: §4.2.1.
  • F. Roemer, G. Del Galdo, and M. Haardt (2014) Tensor-based algorithms for learning multidimensional separable dictionaries. In IEEE Int. Conf. Acoustics, Speech, Signal Process., pp. 3963–3967. Cited by: §4.2.1.
  • H. Sak, A. Senior, and F. Beaufays (2014) Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In Annual Conf. Int. Speech Comm. Assoc., Cited by: §4.1.
  • J. Sandor (1996) On the arithmetical functions dk(n) and d*k(n). Portugaliae Mathematica 53, pp. 107–116. Cited by: §1.
  • F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini (2008) The graph neural network model. IEEE Trans. Neural Netw. 20 (1), pp. 61–80. Cited by: §4.1.
  • M. Schuster and K. K. Paliwal (1997) Bidirectional recurrent neural networks. IEEE Trans. Signal Process. 45 (11), pp. 2673–2681. Cited by: §4.1.
  • S. Shekhar, V. M. Patel, and R. Chellappa (2014) Analysis sparse coding models for image-based classification. In IEEE Int. Conf. Image Process., pp. 5207–5211. Cited by: §5.
  • J. Wang, J. Li, X. Han, L. Lin, H. Hu, Y. Xu, Q. Chen, Y. Iwamoto, and Y. Chen (2020) Tensor-based sparse representations of multi-phase medical images for classification of focal liver lesions. Patt. Recog. Lett. 130, pp. 207–215. Cited by: §4.2.1.
  • R. Wang, K. Wang, W. Cao, and X. Wang (2019) Geometric algebra in signal and image processing: a survey. IEEE Access 7, pp. 156315–156325. Cited by: §4.2.2.
  • B. Wohlberg (2017) SPORCO: a python package for standard and convolutional sparse representations. In Proc. Python in Sci. Conf., pp. 1–8. Cited by: §3.2.
  • Y. Xu, L. Yu, H. Xu, H. Zhang, and T. Nguyen (2015) Vector sparse representation of color image using quaternion matrix analysis. IEEE Trans. Image Process. 24 (4), pp. 1315–1329. Cited by: §4.2.2, §4.2.2.
  • M. D. Zeiler, D. Krishnan, G. W. Taylor, and R. Fergus (2010) Deconvolutional networks. In IEEE Comp. Soc. Conf. Comp. Vis. Patt. Recog., pp. 2528–2535. Cited by: §2.2.