1 Introduction
In traditional signal processing and machine learning problems, each data dimension (attribute) is assumed to be orthogonal to the others. In other words, no distinction is made among the cross-relations of dimensions. Since signals carry information through their spatiotemporal configuration, assuming such orthogonality of signal dimensions is highly ill-posed even for 1D cases. This phenomenon is depicted simply in Fig. 1.
Let us numerically analyze the severity of the problem of casting signals as vectors. Assume that an $n$-sized vector is received through the orthogonality consideration and it is known that the original form is an $n$-sized 1D signal. If one tries to recover the original spatial configuration without further knowledge (i.e., which value was in which cell), all $n!$ possible spatial configurations are equally likely. This problem becomes even more serious when the dimensionality of the signal itself increases. Consider an $n$-sized vector received again, but the underlying signal is now assumed to be a 2D image. Not only are there $n!$ permutations involved, but one also needs to guess the height and width of the image. In general, for an $n$-sized vector and a $k$-dimensional original signal, the number of possible spatial configurations that the signal could have been in is given as $n!\,d_k(n)$, where $d_k$ is the $k$th Piltz function, which gives the number of ordered factorizations of $n$ as a product of $k$ terms Sandor (1996).
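As a quick sanity check on this count, the following sketch enumerates $n!\,d_k(n)$ directly; the function names are illustrative, not from the paper.

```python
from math import factorial

def piltz(n, k):
    """k-th Piltz divisor function d_k(n): the number of ordered
    factorizations of n as a product of k terms."""
    if k == 1:
        return 1
    return sum(piltz(n // d, k - 1) for d in range(1, n + 1) if n % d == 0)

def spatial_configurations(n, k):
    """Configurations an n-sized vector could come from as a k-D signal:
    n! orderings of the values times d_k(n) possible shapes."""
    return factorial(n) * piltz(n, k)

# a 12-element vector read back as a 2-D image admits 6 (h, w) shapes:
# 1x12, 2x6, 3x4, 4x3, 6x2, 12x1
assert piltz(12, 2) == 6
assert spatial_configurations(12, 1) == factorial(12)
```

For a 1D signal only the $n!$ orderings remain, which is exactly the degenerate case discussed above.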
Once the above-described issue is recognized, it is not hard to see that many conventional machine learning formulations are highly ill-posed from the perspective of real-world signals. Let us now consider the case of k-means applied to vectorized real-world signals, and suppose images for simplicity. As k-means originally assumes orthogonality of dimensions, it is easy to apply the usual Euclidean distance metric between vectors. However, it is questionable whether this metric captures the notion of distance between two images, or the average of two images. An example in this light can be given from the domain of computer graphics. A direct linear interpolation between two rotation matrices is not natural; thus, quaternions are utilized, leading to a formulation called spherical linear interpolation Jafari and Molaei (2014). A similar consideration might also be superior in the problem of clustering images with k-means. However, it is not trivial to cast a general image as a quaternion-like structure for further processing.

Let us try to show that direct vectorized distance calculation is not natural for images with a more concrete example. Assume that there is a main image of a digit as exemplified in Fig. 2(a). The question here is which other image is more similar to this main image. Is it the digit in Fig. 2(b) occupying roughly the same spatial position within the frame, or is it the digit in Fig. 2(c) with the exact same shape but linearly shifted within the frame? A vectorized distance measure will dictate that the former is closer to the main image, which is definitely not natural. Therefore, a shift-invariant distance metric could be more powerful in this case.
For two given images $\mathbf{X}$ and $\mathbf{Y}$ with vectorized forms $\mathbf{x}$ and $\mathbf{y}$, the standard (vectorized) Euclidean distance is given in Eqn. (1). This formula can be enhanced with a shift-invariant adaptation as in Eqn. (2), where $\tilde{\mathbf{Y}}$ denotes the image $\mathbf{Y}$ zero-padded on its sides and $\tilde{\mathbf{Y}}_{m,n}$ its crop after a shift of $(m,n)$. Alternatively, a shift-invariant distance notion can also be given in terms of the inverse of cross-correlation as in Eqn. (3). Nevertheless, even if a suitable distance metric is found to designate the closest centroid, it is not trivial to obtain the average of a cluster as the new mean for the next step.

$$d(\mathbf{X}, \mathbf{Y}) = \|\mathbf{x} - \mathbf{y}\|_2 \qquad (1)$$

$$d_s(\mathbf{X}, \mathbf{Y}) = \min_{m,n} \|\mathbf{X} - \tilde{\mathbf{Y}}_{m,n}\|_F \qquad (2)$$

$$d_c(\mathbf{X}, \mathbf{Y}) = \Big( \max_{m,n} \, (\mathbf{X} \star \mathbf{Y})[m,n] \Big)^{-1} \qquad (3)$$
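To make the three distance notions concrete, here is a small NumPy sketch; the function names and the padding scheme are illustrative assumptions, not the paper's implementation. `euclidean` follows Eqn. (1), `shift_invariant` searches all placements of the zero-padded image as in Eqn. (2), and `corr_distance` inverts the peak cross-correlation in the spirit of Eqn. (3).

```python
import numpy as np
from scipy.signal import correlate2d

def euclidean(x, y):
    # Eqn (1): vectorized Euclidean distance between two images
    return np.linalg.norm(x.ravel() - y.ravel())

def shift_invariant(x, y):
    # Eqn (2): minimum Euclidean distance over all shifts of y,
    # realized by sliding y inside a zero-padded frame
    h, w = x.shape
    best = np.inf
    for dy in range(2 * h + 1):
        for dx in range(2 * w + 1):
            frame = np.zeros((3 * h, 3 * w))
            frame[dy:dy + h, dx:dx + w] = y
            best = min(best, np.linalg.norm(x - frame[h:2 * h, w:2 * w]))
    return best

def corr_distance(x, y):
    # Eqn (3): one way to invert cross-correlation as a distance notion
    peak = correlate2d(x, y, mode='full').max()
    return np.inf if peak <= 0 else 1.0 / peak

# a shifted copy is "far" under Eqn (1) but at zero distance under Eqn (2)
a = np.zeros((4, 4)); a[1, 1] = 1.0
b = np.zeros((4, 4)); b[2, 2] = 1.0
assert euclidean(a, b) > 1.0
assert shift_invariant(a, b) < 1e-9
```

The exhaustive shift search is quadratic in the frame size; it is shown here only to pin down the definitions.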
In this study, the k-means formulation will be considered within a sparse representations framework to provide a self-sufficient shift-invariant version. As noted in earlier studies Oktar and Turkan (2018, 2019, 2020), the original k-means problem can be expressed in a sparse representations framework as a dictionary learning problem. A shift-invariant version of k-means can then be derived through a more recent convolutional dictionary learning formulation. It is no surprise that a convolutional approach leads to a shift-invariant scheme, since convolution is an operator that breaks the orthogonality assumption by considering neighboring data points group by group, forming a relation between spatial regions of the signal.
The paper is organized as follows. Section 2 gives the mathematical description of the proposed shift-invariant k-means concept, followed by a generalization through convolutional dictionary learning for classification. Section 3 details the experimental setup and reports experimental results obtained with the proposed concepts. Later, Sec. 4 discusses several alternatives to convolutional logic for spatiotemporal information preservation, including a spatiotemporal hypercomplex encoding scheme. Section 5 finally concludes the paper with a brief summary.
2 Convolutional Sparse Representations
It is possible to mathematically formulate the conventional k-means problem in a sparse representations framework as in Eqn. (4),

$$\min_{\mathbf{D}, \{\mathbf{x}_i\}} \sum_i \|\mathbf{y}_i - \mathbf{D}\mathbf{x}_i\|_2^2 \quad \text{s.t.} \quad \|\mathbf{x}_i\|_0 = 1, \; \mathbf{x}_i \geq \mathbf{0}, \; \mathbf{1}^T\mathbf{x}_i = 1, \; \forall i \qquad (4)$$

where the matrix $\mathbf{D}$ is an overcomplete dictionary and $\mathbf{x}_i$ is the sparse representation of the data point $\mathbf{y}_i$. Each sparse vector $\mathbf{x}_i$ contains only one nonzero component, and this component is forced to be positive and sum-to-one (hence exactly $1$). Dictionary columns as atoms (namely $\mathbf{d}_k$) designate the centroids.
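The correspondence can be sketched in a few lines of NumPy: the sparse coding step with a single, positive, sum-to-one coefficient is exactly cluster assignment, and the least-squares dictionary update under such one-hot codes is exactly the centroid mean. This is a minimal illustration with hypothetical names, not the paper's code.

```python
import numpy as np

def kmeans_as_dictionary(Y, K, iters=20, seed=0):
    """k-means written as Eqn (4): Y ~ D X with each column of X one-hot.
    Columns of D are the centroids; the sparse codes X encode assignments."""
    rng = np.random.default_rng(seed)
    n, m = Y.shape
    D = Y[:, rng.choice(m, K, replace=False)].copy()   # init from data points
    X = np.zeros((K, m))
    for _ in range(iters):
        # sparse coding step = cluster assignment (one nonzero, equal to 1)
        X = np.zeros((K, m))
        labels = np.argmin(((Y[:, None, :] - D[:, :, None]) ** 2).sum(0), axis=0)
        X[labels, np.arange(m)] = 1.0
        # dictionary update step = centroid update (least squares on one-hot X)
        for k in range(K):
            if X[k].sum() > 0:
                D[:, k] = Y[:, X[k] > 0].mean(axis=1)
    return D, X
```

Running it on two well-separated groups of points recovers the two group means as dictionary atoms, mirroring Lloyd's iterations.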
While Eqn. (4) represents a direct formulation of classical k-means, it corresponds to the problematic orthogonality consideration mentioned previously. A possible shift-invariant alternative of k-means is given in Eqn. (5),

$$\min_{\{\mathbf{d}_k\}, \{\mathbf{x}_i\}, \{k_i\}} \sum_i \|\mathbf{y}_i - \mathbf{d}_{k_i} \ast \mathbf{x}_i\|_2^2 \quad \text{s.t.} \quad \|\mathbf{x}_i\|_0 = 1, \; \forall i \qquad (5)$$

where $\ast$ denotes the convolution operator and $k_i$ is the index of the optimal convolutional atom, or in other words the convolutional centroid assigned to the data point. Notice here that the nonzero entry of $\mathbf{x}_i$ is not forced to be $1$, but can now be anything. Therefore, this formulation is not only shift-invariant but also invariant to the magnitude of the pattern. However, it should then be complemented by an atom normalization process.
Because of the linearity property, the atoms can also be expressed in a large convolutional dictionary, to be denoted by $\mathbf{D}_c$, as depicted in Fig. 3. The local dictionary consists of the convolutional atoms, whereas the global dictionary $\mathbf{D}_c$ is filled with zeros outside the convolutional area. In this regard, the mathematical optimization in Eqn. (5) evolves to Eqn. (6), where $j_i$ denotes the index of the single nonzero element from the top and $j_i$ modulo $K$ (the number of clusters) determines the index of the assigned convolutional centroid.

$$\min_{\mathbf{D}_c, \{\mathbf{x}_i\}} \sum_i \|\mathbf{y}_i - \mathbf{D}_c\mathbf{x}_i\|_2^2 \quad \text{s.t.} \quad \|\mathbf{x}_i\|_0 = 1, \; \forall i \qquad (6)$$
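The assignment implied by Eqn. (5) can be sketched directly: for each normalized atom, try every shift, pick the amplitude in closed form, and keep the combination with the smallest residual. This is an illustrative 1D brute-force version with hypothetical names, not an efficient solver.

```python
import numpy as np

def assign_shift_invariant(y, atoms):
    """Assignment step of Eqn (5): pick the atom k, shift t and amplitude a
    minimizing ||y - a * shift_t(d_k)||_2 over all shifts of normalized atoms."""
    best = (np.inf, None, None, None)
    n = len(y)
    for k, d in enumerate(atoms):
        d = d / np.linalg.norm(d)            # atom normalization
        for t in range(n - len(d) + 1):
            shifted = np.zeros(n)
            shifted[t:t + len(d)] = d
            a = y @ shifted                  # optimal amplitude for a unit atom
            err = np.linalg.norm(y - a * shifted)
            if err < best[0]:
                best = (err, k, t, a)
    return best  # (error, atom index, shift, amplitude)
```

A signal that is exactly a scaled, shifted atom is recovered with zero error, which is precisely the magnitude- and shift-invariance noted above.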
2.1 A solution to shift-invariant k-means
Since the optimization in Eqn. (6) is highly non-convex, an approximate iterative solution is employed, alternating between assignment to clusters and centroid update, akin to Lloyd's algorithm for the original k-means problem Jain (2010). This procedure directly corresponds to the sparse coding and dictionary update steps, respectively, in terms of sparse representations.
In this light, the data assignment step is solved with Orthogonal Matching Pursuit (OMP) Pati et al. (1993), assuming the dictionary is fixed, to satisfy the $\ell_0$-norm sparsity constraint. On the other side, a straightforward utilization of conventional dictionary update algorithms, such as the Method of Optimal Directions (MOD) Engan et al. (1999) or K-SVD Aharon et al. (2006), is not obvious, because only the inherent subdictionary composed of convolutional centroids is to be updated within the global dictionary. To solve this problem, each individual block of the overall sparse representation is extracted as an individual subproblem, on which a MOD (i.e., least-squares) update is applied. As the last step, the final updated subdictionary is attained by averaging all of the resulting individual subdictionaries. To the best of the available knowledge, this naive solution to the centroid update problem is not extensively covered in the literature; thus, it can be coined as the Method of Optimal Subdirections on Average (MOSA).
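For the single-nonzero codes of shift-invariant k-means, each extracted block subproblem has a closed-form least-squares solution, so one simplified reading of this update is an amplitude-weighted combination of aligned signal windows. The 1D sketch below is illustrative (hypothetical names and data layout), not the paper's implementation.

```python
import numpy as np

def mosa_update(signals, assignments, K, L):
    """Centroid (atom) update in the spirit of MOSA: each assignment
    (k, shift t, amplitude a) contributes a length-L window of its signal;
    the new atom is the least-squares combination of those windows."""
    atoms = np.zeros((K, L))
    for k in range(K):
        num = np.zeros(L)
        den = 0.0
        for y, (ki, t, a) in zip(signals, assignments):
            if ki == k:
                num += a * y[t:t + L]    # aligned window, weighted by amplitude
                den += a * a
        if den > 0:
            atoms[k] = num / den
            atoms[k] /= np.linalg.norm(atoms[k])   # atom normalization step
    return atoms
```

Two signals containing scaled, shifted copies of the same pattern thus average back to that pattern, which is the convolutional analogue of the centroid mean.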
Experimental results indicate that this shift-invariant adaptation of k-means provides better results than the original version for datasets in which considerable shifts exist.
2.2 Convolutional dictionary learning as a generalization
Encouraged by the superiority of the shift-invariant k-means formulation obtained through convolutional sparse representations as an unsupervised task, the question is then how to generalize this convolutional approach to other machine learning tasks such as classification. The claim is that an unsupervised feature extraction layer, performed through convolutional dictionary learning as a generalization, can provide superiority over the orthogonal-only consideration also in supervised tasks. This claim has already been validated in the literature many times Zeiler et al. (2010); Pu et al. (2016); Garcia-Cardona and Wohlberg (2018), but an extensive comparison with the classical orthogonality consideration is usually missing.
In this regard, a shift from the strict $\ell_0$-norm constraint to a more lenient $\ell_1$-norm is considered. There are two main reasons behind this decision. First, it is unclear how to set the sparsity level in an $\ell_0$-norm formulation, since denser choices drastically affect the computational complexity of greedy approaches and sparser solutions can lead to severe information loss. Second, and importantly, most practical studies in the literature are based on the $\ell_1$-norm Garcia-Cardona and Wohlberg (2018).
With the above consideration, a final optimization for convolutional dictionary learning is given in Eqn. (7) by introducing the $\ell_1$-norm regularization into the formula via a Lagrange multiplier $\lambda$. Iterative solutions which alternate between convolutional sparse coding and dictionary update exist in the literature Garcia-Cardona and Wohlberg (2018).

$$\min_{\{\mathbf{d}_k\}, \{\mathbf{x}_{i,k}\}} \frac{1}{2} \sum_i \Big\| \mathbf{y}_i - \sum_k \mathbf{d}_k \ast \mathbf{x}_{i,k} \Big\|_2^2 + \lambda \sum_i \sum_k \|\mathbf{x}_{i,k}\|_1 \qquad (7)$$
In fact, the aim of this study is not to devise new approaches to the above optimization but to utilize it as an approach to the orthogonality problem. This unsupervised convolutional decomposition of a signal can be regarded as a feature extraction method that tackles the problem of orthogonality, where the extracted features for the data point $\mathbf{y}_i$ are formed by concatenating the corresponding sparse codes $\mathbf{x}_{i,k}$. Note that the concatenation here still assumes orthogonality; however, there now exists a convolutional logic before the orthogonality consideration, which alleviates its main drawbacks from the start. The effectiveness of such a layer is to be experimentally tested against various other feature extraction methods in an extensive manner.
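As an illustration of how such features can be obtained, the following is a minimal 1D sketch of the convolutional sparse coding step of Eqn. (7) using ISTA with soft-thresholding. The solver choice, step size and names are assumptions for illustration; the reported experiments rely on the SPORCO library instead.

```python
import numpy as np
from scipy.signal import fftconvolve

def conv_sparse_code(y, atoms, lam=0.1, step=0.1, iters=300):
    """Sparse coding half of Eqn. (7) by ISTA (1-D sketch):
    min_x 0.5 * ||y - sum_k d_k * x_k||^2 + lam * sum_k ||x_k||_1."""
    n, L = len(y), len(atoms[0])
    X = [np.zeros(n - L + 1) for _ in atoms]
    for _ in range(iters):
        # residual of the current convolutional reconstruction
        r = sum(fftconvolve(x, d) for x, d in zip(X, atoms)) - y
        for k, d in enumerate(atoms):
            # gradient w.r.t. x_k = correlation of the residual with atom d_k
            grad = fftconvolve(r, d[::-1])[L - 1:L - 1 + len(X[k])]
            z = X[k] - step * grad
            X[k] = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return X  # concatenating these coefficient maps yields the feature vector
```

A signal that is a scaled, shifted atom yields a coefficient map with a single dominant spike at the correct shift, which is the kind of sparse, position-aware feature the text refers to.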
3 Experimental Results
In the following, two sets of experiments are performed corresponding to the discussions raised in Sec. 2.1 and Sec. 2.2. All experiments are carried out on an Intel(R) Core(TM) i-HQ CPU machine running Matlab on Microsoft Windows.
3.1 Shift-invariant k-means
In this set of experiments, a dataset is formed by extracting the first several training images of each class from the MNIST handwritten digit database LeCun et al. (2010). Four modified versions of this dataset are then obtained to test the shift-invariance property. First, empty images of four increasing sizes are initialized and the original digits are inserted into these widened images with certain uniformly random shifts in the x and y directions, with mean shifts chosen to suit the sizes of the images. The clustering accuracy rates of k-means (KM), Kernel KM Dhillon et al. (2004), Ensemble KM Iam-on and Garrett (2010) and shift-invariant KM on these cases are illustrated in Fig. 4.
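The construction of the shifted datasets can be sketched as follows; the frame sizes and shift magnitudes below are placeholders, not the study's exact values.

```python
import numpy as np

def place_with_shift(digit, frame_hw, max_shift, rng):
    """Insert a digit image into an empty frame at a uniformly random
    offset around the center. The caller must ensure max_shift does not
    push the digit outside the frame."""
    H, W = frame_hw
    h, w = digit.shape
    cy, cx = (H - h) // 2, (W - w) // 2          # centered position
    dy = rng.integers(-max_shift, max_shift + 1)
    dx = rng.integers(-max_shift, max_shift + 1)
    frame = np.zeros((H, W))
    frame[cy + dy:cy + dy + h, cx + dx:cx + dx + w] = digit
    return frame
```

Growing the frame while growing `max_shift` reproduces the increasingly shifted dataset variants described above.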
Not surprisingly, the performance of shift-invariant KM stays relatively stable under varying shifts, whereas all other methods start to perform poorly as soon as shifts are introduced. Even a moderate mean shift, a small fraction of the whole image size, is enough to disrupt the functionality of the classical methods on these datasets. Note also that the classical methods perform nearly as poorly as a random guess method (RAND) in cases of extreme shifts. In the most extreme shift case applied on MNIST, shift-invariant KM retains a clear clustering accuracy margin over KM, Kernel KM and Ensemble KM. This shows that neither kernelization nor ensembles can provide an efficient solution to the shift-invariance problem.
One may argue that a simple preprocessing step, which extracts a precise subimage of the digit in each image, would be enough to sustain shift-invariance for clustering these images; however, such a naive approach cannot be a general solution for natural images. On the other hand, the logic in shift-invariant KM provides an automatic solution, which is both theoretically and practically sound, without any need for preprocessing. The simplicity and effectiveness of this clustering approach can further pave the way to more general techniques applying the same logic to other machine learning tasks. In fact, a generalization of shift-invariant KM via convolutional dictionary learning can be utilized as a powerful unsupervised feature extraction method for classification that alleviates the drawbacks of the classical orthogonality consideration.
3.2 Convolutional dictionary learning
In this set of experiments, convolutional dictionary learning as an unsupervised feature extraction method is compared against various other well-known feature extraction schemes. An existing library called SPORCO Wohlberg (2017) is utilized for convolutional dictionary learning. In the following reported experiments, linear support vector machine (SVM) classifiers are employed after the feature extraction phase. The motivation behind using a linear SVM is that a successful feature extraction should transform the sample space into one that is as linearly separable as possible.
There are three employed versions of dictionary learning methods. Global-only dictionary learning (DL) operates over dictionary atoms the size of whole images; namely, atoms cover sample images globally. Patch-based dictionary learning (PDL) trains over small dictionary atoms, where local image patches are extracted in a sliding-window manner; this type of approach can be regarded as local-only. Both DL and PDL are realized through the regular dictionary learning iterative steps, i.e., sparse coding and dictionary update. In the proposed method, namely convolutional dictionary learning (CDL), atoms are again small but Eqn. (7) is in action instead. Considering the structure of the dictionary in the form of Fig. 3, CDL can be classified as a both local and global approach. The effects of regular versus convolutional approaches are apparent in the atoms learned at the end of the training process, as exemplified in Fig. 5. Notice that the convolutional approach results in filters with a Gabor-like appearance.
Other well-known methods that take the spatial information in images into account are Histogram of Oriented Gradients (HOG) Dalal and Triggs (2005), Local Binary Patterns (LBP) Ojala et al. (1996) and Gabor Feature Extraction (GFE) Haghighat et al. (2015). For HOG, a square cell size is chosen with a fixed number of orientation histogram bins, and signed orientation is not used. For LBP, the number of neighbors and the radius of the circular pattern used to select them are fixed, rotation information is also encoded, and no normalization is performed over the cells. In GFE, a Gabor filter bank with several different scales and orientations is employed.
Another important categorization of the methods is whether they perform dimensionality reduction or expansion. The last two methods to be mentioned, namely Autoencoders (AE) and Principal Component Analysis (PCA), both perform dimensionality reduction. Notice that HOG and LBP also accomplish effective dimensionality reduction, while the other methods instead go through an expansion process. A pooling procedure is closely tied to expansion in the case of spatial methods, and is usually performed to reduce the computational cost with the advantage of certain rotation/position invariance. Among the methods with dimensionality expansion (DL, PDL, CDL, GFE), DL and PDL do not perform an additional pooling since they do not truly preserve the spatial configuration. Although PDL takes local spatial information into account, there is no trivial way to perform a meaningful pooling on top. On the other hand, CDL contains a max pooling layer and GFE an average pooling layer, with the same cell size in both cases.

             DL    PDL   CDL   HOG   LBP   GFE   AE    PCA
Learning     ✓     ✓     ✓     ✗     ✗     ✗     ✓     ✓
Spatial      ✗     ✓     ✓     ✓     ✓     ✓     ✗     ✗
Dimensions
Table 1 summarizes all feature extraction methods in the benchmark. Note that the "Spatial" attribute appearing in this table is an antonym for the word "orthogonality" in the context of this study. For example, both PDL and CDL can be described as spatial methods since they process images by considering pixels within certain local neighborhoods. In DL, however, each pixel is indistinguishable from the others because of the vectorization of the whole frame, resulting in an orthogonality consideration.
After having described all methods in detail, Fig. 6 depicts classification performance as a function of varying training set sizes on the MNIST and USPS Hull (1994) databases. As a global-only dictionary learning method, the inferior performance of DL for small training sizes is obvious. A similar behavior is also slightly observable in CDL as a both global and local dictionary learning approach. Although PDL does not perform poorly for small training sizes, it does not provide a noticeable advantage over DL in the long run, while CDL outperforms both DL and PDL, performing at the capacity of HOG when most of the dataset is used. HOG and GFE together compete for the top performance, whereas CDL performs a little poorer but better than LBP. Most importantly, it is apparent that PDL cannot be an alternative to convolutional logic, at least for the 2D case.
Dataset   DL    PDL   CDL   HOG   LBP   GFE   AE    PCA
MNIST
USPS
Table 2 lists the final classification accuracy results with a linear SVM applied on the whole MNIST and USPS databases. GFE is the top performing method, acting as an unsupervised simulation of the first layers of a convolutional neural network (CNN). Additionally, CDL and HOG compete for the second place.
The convolutional dictionary learning concept is further applied in a 1D setting. The MIT-BIH arrhythmia dataset Moody and Mark (2001) is used, in which the signals correspond to electrocardiogram (ECG) shapes of heartbeats for cases unaffected (normal) and affected by different arrhythmias. These signals are preprocessed and segmented so that each segment represents a heartbeat belonging to one of five different classes Kachuee et al. (2018).
Preliminary experimentation suggests that the results can be highly dependent on the chosen patch/kernel size, as CDL performs poorly for small patch/kernel sizes. These results are summarized in Fig. 7. In this figure, all methods are devised to be resource-wise equivalent, i.e., they have equal feature dimensionality. The DL, PDL and CDL algorithms have the same definitions as in 2D, translated into their 1D equivalent versions. Finally, CNN here denotes a 1D convolutional neural network as a substructure of a regular 2D version. For a fair comparison, the CNN architecture is composed of a convolutional layer, a batch normalization layer, a ReLU layer, a max pooling layer, a fully connected layer, a softmax and a classification layer. In other words, the convolutional logic is applied once (without going deep) before the classification stage.
The main observation here is that all spatially-aware methods (PDL, CDL, CNN) outperform the orthogonality consideration of DL, as long as the patch/kernel size is large enough. It is apparent that relatively small patch sizes cause CDL to perform very poorly. Such behavior is not observable for CNN, which performs well for all chosen kernel sizes. The most surprising result is that PDL outperforms CNN in nearly all cases; note, however, that the CNN here does not have a deep architecture. The other surprising point is that CDL is the worst among all spatially-aware methods. It is possible that the employed SPORCO library may not be optimized for 1D settings.
To verify the generality of the above results, another 1D problem from a different domain is chosen: the classification of electric devices according to their electricity usage profiles through raw data. The dataset is obtained from Chen et al. (2015) and contains separate train and test samples with several possible class labels. In parallel to Fig. 7, quite similar results are obtained in Fig. 8. With a large enough patch/kernel size, PDL performance is similar to that of CNN, and all methods outperform the baseline of DL.
Inspired by all of the above experiments measuring the effect of patch/kernel size, final simulation results on this effect (using the whole MNIST database) are depicted in Fig. 9. It is clearly observable that CDL nearly matches the performance of a shallow CNN, while PDL performs poorly in this 2D case. As a conclusion, one can expect PDL to be an alternative to CNN in 1D and CDL in 2D, as long as the patch/kernel size is sensible. Another note is that GFE followed by a linear SVM classifier is a viable unsupervised way of simulating a shallow CNN.
4 Discussion on Spatiotemporal Information Preservation
4.1 Variations on neural networks
Convolution with a kernel on the input side of a layer corresponds to a locally connected structure instead of the traditional fully connected one. Neighboring cells are thus put into relation, preserving the original spatial configuration. As an alternative to the convolutional approach, then, neighboring cells on the input or output side of a neural network layer can also be related with direct edges in between, as another way of preserving the original spatial configuration of the input cells. The possibility of edges within the same layer forces one to think of a neural network as a more general directed graph. In fact, this line of logic leads to an alternative structure known as recurrent neural networks (RNN). In the most general sense, RNNs represent directed graphs. Note that it is possible to build upon the basic RNN structure through bidirectional logic Schuster and Paliwal (1997) and the long short-term memory concept Sak et al. (2014).

On the other hand, empirical evaluation suggests that temporal convolution, in other words 1D convolutional logic, surpasses the capacity of recurrent architectures in sequence modeling Bai et al. (2018). It is still an open question whether the temporal dimension should be regarded as just another spatial dimension or whether a hybrid approach is better; this is a rather deep issue related to the properties of space and time. Instead, considering neural networks of any structure as directed and possibly cyclic graphs, in other words as neural graphs, might pave the way to a better understanding of the brain. Note that this concept is rather different from graph neural networks, which use graphs as inputs Scarselli et al. (2008).
Another generalization of neural networks is possible by considering infinite-width neural networks Arora et al. (2019). Recent results suggest that deep neural networks that are allowed to become infinitely wide converge to models called Gaussian processes Lee et al. (2017). However, such studies do not consider the case when there are in-between connections within layers. Considering the existence of these connections can further lead to infinite but continuous (input or output) layers, which is indeed applicable both mathematically and practically. A generalization of neural network layer cases in this sense is depicted in Fig. 10. The third case in this figure is important in that it leads to the concept of functional machine learning. This alone may not be enough to preserve the spatial configuration of the input layer; therefore, additional locally connected versions of these structures can also be proposed.
4.2 Multilinear approach
4.2.1 Tensor-based sparse representations
The fact is that images are not vectors; vectorization breaks the spatial coherency of images, as investigated by Hazan et al. (2005). This line of thought is centered around tensor factorization as a generalization. The study in Hazan et al. (2005) reports that by treating training images as a 3D cube and performing a nonnegative tensor factorization (NTF), higher efficiency, discrimination and representation power can be achieved when compared to nonnegative matrix factorization (NMF).
There are two main branches of tensor decomposition. In the first branch, studies are based on the canonical polyadic decomposition (CPD), sometimes also referred to as CANDECOMP/PARAFAC Kolda and Bader (2009). The most relevant example from the literature is K-CPD Duan et al. (2012), an algorithm of overcomplete dictionary learning for tensor sparse coding based on a multilinear version of OMP and the CANDECOMP/PARAFAC decomposition. K-CPD surpasses conventional methods in a series of image denoising experiments. More recently, a similar framework has also been successfully utilized in tensor-based sparse representations for the classification of multiphase medical images Wang et al. (2020). The second branch is centered around the Tucker decomposition model instead, which is a more general model than CPD Caiafa and Cichocki (2013). The study in Caiafa and Cichocki (2012) presents the foundations of the Tucker decomposition model by defining the Tensor-OMP algorithm, which computes a block-sparse representation of a tensor with respect to a Kronecker basis. In Caiafa and Cichocki (2013), the authors report that a block-sparse structure imposed on a core tensor through subtensors provides significant results. The Tucker model together with the block-sparsity restriction may work significantly well, since the higher-dimensional block structure is meaningfully applied on the original sparse tensor in the form of subtensors. There are many other studies in the literature specifically based on the Tucker model of sparse representations, with or without block-sparsity and additionally including dictionary learning Qi et al. (2013); Roemer et al. (2014); Peng et al. (2014).
Certain parallels can be drawn between convolutional dictionary learning and tensor-based sparse representations. As an example, the study in Huang and Anandkumar (2015) proposes a novel framework for learning convolutional models through tensor decomposition and shows that cumulant tensors have a CPD whose components correspond to convolutional filters and their circulant shifts.
On the other side, tensor-based approaches (both CPD and Tucker models) still do not provide a solution to the 1D case. Without loss of generality, let us assume that the signal is in the form of a column vector $\mathbf{y}$. Since the signal is one-dimensional, there will be a single factor matrix $\mathbf{D}$ for that single dimension in the Tucker model, and the core reduces to a vector $\mathbf{x}$; therefore, the model attained is the one in Eqn. (8). From the CPD model perspective, there is equivalently $\mathbf{y} = \sum_j x_j \mathbf{d}_j$, where $x_j$ is the single sparse coefficient associated with atom $\mathbf{d}_j$. Hence, one arrives at the standard formulation in Eqn. (8); namely, the Tucker and CPD models are equivalent in the one-dimensional case, both corresponding to the conventional orthogonal sparse representation.

$$\mathbf{y} = \mathbf{D}\mathbf{x} = \sum_j x_j \mathbf{d}_j \qquad (8)$$
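This 1D collapse is easy to verify numerically: with a single factor matrix, the single-mode Tucker product and the CPD sum of rank-one terms coincide with the plain matrix-vector product of Eqn. (8). A small NumPy check with arbitrary data:

```python
import numpy as np

rng = np.random.default_rng(1)
D = rng.standard_normal((8, 12))             # single factor matrix (one mode)
x = np.zeros(12); x[[2, 7]] = [1.5, -0.5]    # sparse core for the 1-D case

tucker = D @ x                                   # single-mode Tucker product
cpd = sum(x[j] * D[:, j] for j in range(12))     # sum of rank-one (CPD) terms

assert np.allclose(tucker, cpd)   # both collapse to the standard model y = Dx
```

Both views are thus indistinguishable from conventional orthogonal sparse coding in 1D, which is precisely the limitation noted above.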
The above observation brings an important question onto the table. Although tensor-based approaches provide an advantage when the signals are multidimensional, these formulations do not provide an edge for 1D signals. The remedy may come from not considering a 1D signal solely as a 1D vector of scalar elements. In other words, a 1D complex vector can be formed by coding the cell positions into the imaginary parts to overcome the orthogonality problem in the standard 1D vector representation, as depicted in Fig. 11. This paves the way to performing sparse representations of complex-valued data, or even quaternion-valued data, to accommodate more information in cases of higher dimensionality. The utmost generalization is achieved through geometric algebra as a generalization of hypercomplex numbers.
4.2.2 Complex, hypercomplex and geometric algebra based approaches
Note that quaternion algebra was the first hypercomplex number system to be devised that is similar to the real and complex number systems Moxey et al. (2003). The study in Xu et al. (2015) states that a quaternion-based model can achieve a more structured representation when compared to a tensor-based model. Comparisons between quaternion-SVD and tensor-SVD Kilmer and Martin (2011) show their equivalence, but the superiority of quaternion-SVD arises when it is combined with the sparse representation model. It is possible to formulate a quaternion-valued sparse representation of color images that surpasses the conventional logic Xu et al. (2015).
There are four possible models to represent color images, as suggested in Xu et al. (2015). The first is the monochromatic model, in which each color channel is represented separately. The second is the concatenation model, where a single vector is formed by concatenating the three color channels Mairal et al. (2007). The third is the tensor-based model, where the color image is thought of as a 3D cube of values. The last is the quaternion-based model, where each color channel is assigned to an imaginary unit, i.e., r, g, b to i, j, k respectively. Most importantly, all these models can be analytically unified.
There is also one more possible model that is subtler. As depicted in Fig. 11, one can encode a mono audio signal as a vector of complex numbers where the imaginary values indicate the timed positions; in a similar way, one can encode a grayscale image as a quaternion-valued vector where the imaginary parts are allocated to indicate the pixel positions. While thinking of a color image as a 3D cube, there is a possible quaternion-based model in which the imaginary units encode the position within this cube and the scalar denotes the value of that cell. The same quaternion-based encoding can be applied to any 3D scalar data.
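A sketch of this positional encoding for the grayscale case, representing each quaternion as a plain 4-vector (s, i, j, k); the function name and layout are illustrative assumptions:

```python
import numpy as np

def quaternion_encode(img):
    """Encode a grayscale image as quaternion-valued samples: the scalar
    part holds the pixel value, the i and j parts hold the pixel position
    (the k part is left free). Positions thus travel with the values."""
    h, w = img.shape
    rows, cols = np.mgrid[0:h, 0:w]
    q = np.stack([img, rows, cols, np.zeros_like(img)], axis=-1)
    return q.reshape(-1, 4)    # one quaternion (s, i, j, k) per pixel
```

Unlike plain vectorization, any permutation of the resulting samples remains decodable back to the original image, since each value carries its own coordinates.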
For further machine learning in this proposed scheme, a hypercomplex-to-real feature extraction layer is required, since current mainstream classification algorithms need real-valued data. Another option is to consult classification algorithms that can directly handle hypercomplex values. This line of logic paves the way to considering complex/hypercomplex-valued neural networks as viable tools Hirose (2012); Isokawa et al. (2003). As a future work, the comparison of spatiotemporally encoded hypercomplex neural networks with conventional convolutional or recurrent neural networks may lead to a deeper understanding of the deep learning concept. As a motivation, a single complex-valued neuron can solve the XOR problem Nitta (2003). In addition, the fact that quaternions can be used to implement associative memory in neural networks is promising Chen et al. (2017).

Another line of generalization deals with the case when the data has more than three dimensions. In such a case, a quaternion is not enough to designate the cell position and its value. As an extension, octonion algebra can accommodate up to seven imaginary channels Popa (2016); Lazendic et al. (2018a); however, it loses the associativity property. The study in Lazendic et al. (2018b) reports that all algebras of dimension larger than eight lose important properties, since they contain algebras of smaller dimension as subalgebras. This might be an issue related to the physics of space and time, which is out of the scope of this study. The important fact is that the domain dealing with the generalization of hypercomplex numbers is called "geometric algebra" and has been gaining great attention lately Wang et al. (2019).
5 Conclusion
This study aims to draw attention to the orthogonal viewpoint taken by many machine learning methods, such as k-means. The convolution operator can be used as a remedy for this problem, as it partially preserves the spatiotemporal information inherent in signals. However, one may need to find alternatives to convolutional approaches in order to further increase the understanding of this subject. Spatially sparse connections in neural networks might be one alternative. A discrete-to-continuous generalization of a neural network layer can also pave the way to the concept of functional machine learning. Most importantly, analytic approaches such as multilinear formulations must be thoroughly investigated as alternatives. In fact, to compare methods assuming orthogonality with convolutional logic, hypercomplex versions of classical methods must first be considered, where the imaginary parts of hypercomplex values encode the spatiotemporal placement. As noted before, the 1D case might be a crucial case not to be underestimated.
Going back to the clustering problem, one should now notice that shift-invariant k-means can include rotation invariance as a more general formulation Barthelemy et al. (2012). Interestingly, the study in Bar and Sapiro (2010) notes that a log-polar mapping converts rotations and scalings into shifts along the x and y axes, respectively; therefore, invariance under general transformations is possible. In the bigger picture, convolutional logic, or any other framework that sustains such invariances, is related to the two-stream hypothesis (i.e., the "where" pathway and the "what" pathway), a model of the neural processing of vision as well as hearing Eysenck and Keane (2005). In other words, a spatiotemporal-information-preserving perspective on the clustering problem brings us closer to the inner working principles of the brain. Also related to convolution, an n-dimensional generalization of Gabor filters can be investigated as future work.
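The log-polar observation attributed to Bar and Sapiro (2010) above can be verified with a few lines of numpy (the point cloud, rotation angle, and scale factor below are arbitrary illustrative choices): after mapping to (log r, θ), a rotation becomes a constant shift in θ and a scaling becomes a constant shift in log r.

```python
import numpy as np

def to_log_polar(points):
    """Map Cartesian points of shape (N, 2) to (log r, theta) coordinates."""
    r = np.hypot(points[:, 0], points[:, 1])
    theta = np.arctan2(points[:, 1], points[:, 0])
    return np.column_stack([np.log(r), theta])

rng = np.random.default_rng(1)
pts = rng.uniform(0.5, 2.0, size=(10, 2))   # first quadrant, away from origin

# Rotate by phi and scale by s in Cartesian space.
phi, s = 0.7, 1.5
R = np.array([[np.cos(phi), -np.sin(phi)],
              [np.sin(phi),  np.cos(phi)]])
pts_t = s * pts @ R.T

lp, lp_t = to_log_polar(pts), to_log_polar(pts_t)
# Scaling appears as a shift in log r; rotation as a shift in theta.
assert np.allclose(lp_t[:, 0] - lp[:, 0], np.log(s))
assert np.allclose(np.mod(lp_t[:, 1] - lp[:, 1], 2 * np.pi), phi)
```

This is what reduces rotation/scale invariance to the (already solvable) shift-invariance problem.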
A final general note is the distinction between analysis and synthesis sparse models. Throughout this study, the synthesis model is used, which has the form y = Dx, where x is sparse. However, there is also the analysis model, having the form x = Ωy, in which the dictionary multiplied by the input now results in the sparse codes x Shekhar et al. (2014); Gu et al. (2017). Such a model is closer to neural network formulations, and further investigation of the analysis model might pave the way to a unified perspective on sparse representation models which also includes neural networks.
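The two model directions can be contrasted in a minimal numpy sketch; the dictionaries D and Ω below are random placeholders (learned in practice), and the dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, k = 16, 32, 3              # signal size, number of atoms, sparsity level

# Synthesis model: y = D x with sparse x (the form used throughout this study).
# The signal is built from a few dictionary atoms; recovering x needs a
# sparse-coding optimization such as OMP.
D = rng.standard_normal((n, m))
x = np.zeros(m)
x[rng.choice(m, size=k, replace=False)] = rng.standard_normal(k)
y = D @ x                        # signal synthesized from k atoms

# Analysis model: x = Omega y, where applying the analysis dictionary to the
# input directly yields the codes -- a single forward pass, no optimization,
# mirroring a neural-network layer (code = W @ input).
Omega = rng.standard_normal((m, n))
z = Omega @ y
```

The synthesis direction requires solving an inverse problem per signal, whereas the analysis direction is feed-forward, which is why it sits closer to neural network formulations.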
References
 KSVD: an algorithm for designing overcomplete dictionaries for sparse representation. IEEE Trans. Signal Process. 54 (11), pp. 4311–4322. Cited by: §2.1.
 Harnessing the power of infinitely wide deep nets on small-data tasks. arXiv preprint arXiv:1910.01663. Cited by: §4.1.
 An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271. Cited by: §4.1.
 Hierarchical dictionary learning for invariant classification. In IEEE Int. Conf. Acoustics, Speech, Signal Process., pp. 3578–3581. Cited by: §5.
 Shift & 2D rotation invariant sparse coding for multivariate signals. IEEE Trans. Signal Process. 60 (4), pp. 1597–1611. Cited by: §5.
 Block sparse representations of tensors using Kronecker bases. In IEEE Int. Conf. Acoustics, Speech, Signal Process., pp. 2709–2712. Cited by: §4.2.1.
 Computing sparse representations of multidimensional signals using Kronecker bases. Neural Comput. 25 (1), pp. 186–220. Cited by: §4.2.1.
 Design and analysis of quaternionvalued neural networks for associative memories. IEEE Trans. Syst., Man, Cyber.: Syst. 48 (12), pp. 2305–2314. Cited by: §4.2.2.
 The UCR Time Series Classification Archive. Cited by: §3.2.
 Histograms of oriented gradients for human detection. In IEEE Comp. Soc. Conf. Comp. Vis. Patt. Recog., Vol. 1, pp. 886–893. Cited by: §3.2.

 Kernel k-means: spectral clustering and normalized cuts. In ACM Int. Conf. Knowl. Discovery Data Mining, pp. 551–556. Cited by: §3.1.
 K-CPD: learning of overcomplete dictionaries for tensor sparse coding. In IEEE Int. Conf. Patt. Recog., pp. 493–496. Cited by: §4.2.1.
 Method of optimal directions for frame design. In IEEE Int. Conf. Acoustics, Speech, Signal Process., Vol. 5, pp. 2443–2446. Cited by: §2.1.
 Cognitive psychology: a student’s handbook. Taylor & Francis. Cited by: §5.
 Convolutional dictionary learning: a comparative review and new algorithms. IEEE Trans. Comput. Imag. 4 (3), pp. 366–381. Cited by: §2.2, §2.2, §2.2.
 Joint convolutional analysis and synthesis sparse representation for single image layer separation. In IEEE Int. Conf. Comp. Vis., pp. 1708–1716. Cited by: §5.
 CloudID: trustworthy cloud-based and cross-enterprise biometric identification. Expert Syst. Appli. 42 (21), pp. 7905–7916. Cited by: §3.2.
 Sparse image coding using a 3D nonnegative tensor factorization. In IEEE Int. Conf. Comp. Vis., Vol. 1, pp. 50–57. Cited by: §4.2.1.
 Complex-valued neural networks. Vol. 400, Springer Science & Business Media. Cited by: §4.2.2.
 Convolutional dictionary learning through tensor factorization. In Feature Extraction: Modern Questions and Challenges, pp. 116–129. Cited by: §4.2.1.
 A database for handwritten text recognition research. IEEE Trans. Patt. Anal. Mach. Intell. 16 (5), pp. 550–554. Cited by: §3.2.
 LinkCluE: a MATLAB package for link-based cluster ensembles. J. Stat. Software 36 (9), pp. 1–36. Cited by: §3.1.
 Quaternion neural network and its application. In Int. Conf. Knowledge-based Intell. Inf. Eng. Syst., pp. 318–324. Cited by: §4.2.2.
 Spherical linear interpolation and bezier curves. General Sci. Res. 2 (1), pp. 13–17. Cited by: §1.
 Data clustering: 50 years beyond K-means. Patt. Recog. Lett. 31 (8), pp. 651–666. Cited by: §2.1.
 ECG heartbeat classification: a deep transferable representation. In IEEE Int. Conf. Healthcare Informatics, pp. 443–444. Cited by: §3.2.
 Factorization strategies for thirdorder tensors. Linear Algebra Appli. 435 (3), pp. 641–658. Cited by: §4.2.2.
 Tensor decompositions and applications. SIAM Review 51 (3), pp. 455–500. Cited by: §4.2.1.
 Octonion sparse representation for color and multispectral image processing. In European Signal Process. Conf., pp. 608–612. Cited by: §4.2.2.
 Hypercomplex algebras for dictionary learning. In Conf. Applied Geo. Algebras Comp. Sci. Eng., pp. 57–64. Cited by: §4.2.2.
 MNIST Handwritten Digit Database. Cited by: §3.1.
 Deep neural networks as gaussian processes. arXiv preprint arXiv:1711.00165. Cited by: §4.1.
 Sparse representation for color image restoration. IEEE Trans. Image Process. 17 (1), pp. 53–69. Cited by: §4.2.2.
 The impact of the MITBIH arrhythmia database. IEEE Eng. Med. Biology Mag. 20 (3), pp. 45–50. Cited by: §3.2.
 Hypercomplex correlation techniques for vector images. IEEE Trans. Signal Process. 51 (7), pp. 1941–1953. Cited by: §4.2.2.
 Solving the XOR problem and the detection of symmetry using a single complex-valued neuron. Neural Netw. 16 (8), pp. 1101–1105. Cited by: §4.2.2.
 A comparative study of texture measures with classification based on featured distributions. Patt. Recog. 29 (1), pp. 51–59. Cited by: §3.2.
 A review of sparsitybased clustering methods. Signal Process. 148, pp. 20–30. Cited by: §1.
 K-polytopes: a superproblem of k-means. Signal, Image, Video Process. 13 (6), pp. 1207–1214. Cited by: §1.
 Evolutionary simplicial learning as a generative and compact sparse framework for classification. Signal Process. 174, pp. 107634. Cited by: §1.
 Orthogonal matching pursuit: recursive function approximation with applications to wavelet decomposition. In Proc. Asilomar Conf. Signals, Syst., Comp., pp. 40–44. Cited by: §2.1.
 Decomposable nonlocal tensor dictionary learning for multispectral image denoising. In IEEE Conf. Comp. Vis. Patt. Recog., pp. 2949–2956. Cited by: §4.2.1.
 Octonion-valued neural networks. In Int. Conf. Artif. Neural Netw., pp. 435–443. Cited by: §4.2.2.
 A deep generative deconvolutional image model. In Artif. Intell. Stat., pp. 741–750. Cited by: §2.2.
 Two dimensional synthesis sparse model. In IEEE Int. Conf. Multimedia Expo, pp. 1–6. Cited by: §4.2.1.
 Tensorbased algorithms for learning multidimensional separable dictionaries. In IEEE Int. Conf. Acoustics, Speech, Signal Process., pp. 3963–3967. Cited by: §4.2.1.
 Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In Annual Conf. Int. Speech Comm. Assoc., Cited by: §4.1.
 On the arithmetical functions d_k(n) and d_k*(n). Portugaliae Mathematica 53, pp. 107–116. Cited by: §1.
 The graph neural network model. IEEE Trans. Neural Netw. 20 (1), pp. 61–80. Cited by: §4.1.
 Bidirectional recurrent neural networks. IEEE Trans. Signal Process. 45 (11), pp. 2673–2681. Cited by: §4.1.
 Analysis sparse coding models for imagebased classification. In IEEE Int. Conf. Image Process., pp. 5207–5211. Cited by: §5.
 Tensorbased sparse representations of multiphase medical images for classification of focal liver lesions. Patt. Recog. Lett. 130, pp. 207–215. Cited by: §4.2.1.
 Geometric algebra in signal and image processing: a survey. IEEE Access 7, pp. 156315–156325. Cited by: §4.2.2.
 SPORCO: a python package for standard and convolutional sparse representations. In Proc. Python in Sci. Conf., pp. 1–8. Cited by: §3.2.
 Vector sparse representation of color image using quaternion matrix analysis. IEEE Trans. Image Process. 24 (4), pp. 1315–1329. Cited by: §4.2.2, §4.2.2.
 Deconvolutional networks. In IEEE Comp. Soc. Conf. Comp. Vis. Patt. Recog., pp. 2528–2535. Cited by: §2.2.