## 1 Introduction

Driven by the rapid growth of social media and digital business practices, the volume of data created, captured, copied or consumed rose from 1.2 trillion GB in 2010 to 59 trillion GB in 2020 (Source: Gil Press, “54 Predictions About the State of Data in 2021”, Forbes.com). Hence there is a strong need for fast and efficient methods to categorize or classify data for search and retrieval. In an abstract sense, these methods are well known in the literature as pattern classification methods. Pattern classification involves efficient representation of data as n-dimensional feature vectors and the design of a discriminant function, with classification error as the criterion, to decide the class membership of a new data vector. Statistical decision theory has historically been used to define the decision boundaries of pattern classes.

### 1.1 Regularization and Dimension Reduction

When the sample size is small compared to the number of variables, a model trained on such data is prone to overfitting, i.e. the classification rule learns the noise in the data along with the parameters and hence cannot classify new samples correctly. Regularization reduces the complexity of a model by shrinking the importance of some variables, possibly to zero. Retaining relevant features which carry variance in the data and dropping features with high correlation or low variance results in reduced dimensionality. Principal Component Analysis (PCA) and Independent Component Analysis (ICA) are applied to attain reduced dimensionality. Non-negative Matrix Factorization (NMF) has been used for dimension reduction in [NMFfordimreduct]. According to [smallsampleeffectakjain], to improve the robustness of a classifier when there are few training samples of high dimensionality, the features used for discrimination and other design parameters, such as the window size in the Parzen window approach, the number of features in the decision rule and the number of neighbours in the k-NN method, have to be carefully selected. With the rise in online and mobile applications, a mathematically sound model which replaces hand-crafted feature extraction and is capable of working with limited computational resources and few training samples is of interest.

### 1.2 Sparse representation (SR)

SR has its roots in compressed sensing. Olshausen and Field [olshausen1997sparseV1?] proposed that the sparse representation model is similar to the receptive field properties of sensory cells in the mammalian visual cortex. Field [field1994goalofsensorycoding] applied log-Gabor filters to images and found that the histograms of the resulting output distributions have high kurtosis, indicating sparse structure. Field proposed, “a high kurtosis signifies that a large proportion of the sensory cells is inactive (low variance) with a small proportion of the cells describing the contents of the image (high variance) being active”. These works support the idea of sparse representation of natural images.

With rigorous proofs and proven error bounds, sparse representation is a viable model for resource-constrained applications. The SR model finds a low dimensional subspace in which to embed the given high dimensional signals. This embedding is performed against a fixed basis matrix called the dictionary. If the dictionary is well matched to the given set of signals, then an input signal or image can be represented with very few columns of the dictionary and correspondingly few coefficients.

Section 2 describes the notation used throughout the article. Section 3 gives an account of sparse coding methods based on optimization and on statistical modeling. Section 4 describes the differences between orthogonal, undercomplete dictionaries and non-orthogonal, overcomplete dictionaries. The similarities and differences between dictionary learning and other subspace learning methods are also discussed in the same section. Section 5 gives a review of statistical methods used in the design of discriminative dictionaries in a variety of applications like MRI data classification, surgeon classification and skill level identification based on surgical trial data, histogram feature based supervised dictionary learning for face recognition, etc. Section 7 reports the usage of CNN based DL for content and style separation in images and the generation of new sets of images using sparse coding based Convolutional Neural Networks (CNN) and Convolutional Dictionary Learning. Results of using a hybrid dictionary learning method to classify high dimensional data using a simple Multi-Layer Perceptron, which is a non-parametric statistical approach, are also discussed there. Section 8 concludes. The categorization of the various sparse coding algorithms and dictionary learning algorithms is depicted in Fig. 1.1.

## 2 Notation

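In this section we introduce our notation. Throughout the article, $A$ denotes a matrix, $a$ denotes a vector and $\|A\|_F$ denotes the Frobenius norm of matrix $A$. $A^{H}$ denotes the complex conjugate transpose (Hermitian conjugate) of $A$ and $A^{\dagger}$ denotes the Moore-Penrose pseudo-inverse of $A$. For a given set of $N$ training patterns $Y = [y_1, \dots, y_N] \in \mathbb{R}^{n \times N}$ with dimensionality $n$, the SR model finds a representation of $Y$, $Y \approx DX$, where $X \in \mathbb{R}^{K \times N}$ is the coefficient matrix for the input signals. $D \in \mathbb{R}^{n \times K}$ has $K$ columns $d_k$, called atoms, and the matrix $D$ is called the dictionary. To get non-trivial solutions to this problem, the dictionary atoms are constrained to have $\|d_k\|_2 = 1$. $\|\cdot\|_0$ is the $\ell_0$ pseudo norm, which denotes the number of nonzero components of a vector. The $\ell_p$ norm of a vector $x$ is defined as $\|x\|_p = \left(\sum_{j} |x_j|^{p}\right)^{1/p}$. $X$ is

$$X = [x_1, x_2, \dots, x_N], \quad \text{where } x_i = [x_{1i}, x_{2i}, \dots, x_{Ki}]^{T}.$$

Each $x_{ki}$ is the weightage given to dictionary atom $d_k$ in the representation of the $i$th training pattern $y_i$. The problem is formulated as an optimization problem in equation (2.1)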

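$$\min_{D, X} \; \|Y - DX\|_F^2 \tag{2.1}$$

subject to $\|x_i\|_0 \le T$ for $i = 1, \dots, N$, where $T$ is the maximum number of nonzero coefficients allowed per pattern. Using the Lagrange multiplier method, equation (2.1) becomes equation (2.2).

$$\min_{D, X} \; \|Y - DX\|_F^2 + \lambda \sum_{i=1}^{N} \|x_i\|_0 \tag{2.2}$$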

The above equation (2.2) is a nonlinear, non-convex, joint optimization problem which can be solved using Block Coordinate Descent (BCD) method [beck2013convergenceofBCD]. Fixing one variable and updating the other results in two linear optimization problems. Updating the coefficient matrix w.r.t. fixed dictionary is called Sparse coding given by equation (2.3)

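$$\min_{X} \; \|Y - DX\|_F^2 + \lambda \sum_{i=1}^{N} \|x_i\|_0 \tag{2.3}$$

This is a combinatorial problem due to the $\ell_0$ pseudo norm, which makes it a non-convex optimization problem. A convex relaxation of equation (2.3) is obtained by replacing the $\ell_0$ norm with the $\ell_1$ norm [convexrell0].

$$\min_{X} \; \|Y - DX\|_F^2 + \lambda \sum_{i=1}^{N} \|x_i\|_1 \tag{2.4}$$

Equation (2.4) is a non-smooth convex optimization problem which can be solved with standard convex optimization methods. The dictionary learning problem is discussed in Section 4.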
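As an illustration of how the relaxed problem in equation (2.4) can be solved for a single pattern, the following is a minimal sketch of the Iterative Shrinkage-Thresholding Algorithm (ISTA), one of several proximal gradient methods. It is not the method of any cited paper. The names `ista`, `soft_threshold` and `lasso_objective`, the factor 1/2 on the data term and the synthetic data are assumptions made for this example.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (element-wise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lasso_objective(D, y, x, lam):
    """0.5 * ||y - D x||_2^2 + lam * ||x||_1 (assumed scaling of the data term)."""
    return 0.5 * np.sum((y - D @ x) ** 2) + lam * np.sum(np.abs(x))

def ista(D, y, lam, n_iter=500):
    """Minimise the objective above by proximal gradient steps."""
    step = 1.0 / np.linalg.norm(D, 2) ** 2   # 1 / Lipschitz constant of the gradient
    x = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ x - y)             # gradient of the smooth data term
        x = soft_threshold(x - step * grad, step * lam)
    return x

# Synthetic example: a 3-sparse code over a random unit-norm dictionary.
rng = np.random.default_rng(0)
D = rng.standard_normal((20, 50))
D /= np.linalg.norm(D, axis=0)
x_true = np.zeros(50)
x_true[[3, 17, 41]] = [1.5, -2.0, 1.0]
y = D @ x_true
x_hat = ista(D, y, lam=0.05)
```

With a step size of at most the reciprocal of the Lipschitz constant, each iteration does not increase the objective, which is why the step is derived from the spectral norm of the dictionary.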

## 3 Sparse Coding Methods

In this section, we give a broad overview of the classification of sparse coding algorithms based on the norm used for regularization. Sparse coding algorithms based on $\ell_0$ norm regularization are easy to implement and thus most popular. Matching Pursuit (MP) [matchingpursuit], Orthogonal Matching Pursuit (OMP) [OMP], Fast OMP [fastOMP], etc., find the sparse coefficient matrix in equation (2.3) using a greedy approach. The coefficient of the dictionary atom which is most similar to the input is updated first, and the residual left after subtracting the contribution of that atom multiplied by its coefficient is again matched against the dictionary atoms. Though these methods work well, they give sub-optimal sparsity levels and sometimes local minima as solutions.
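The greedy matching step described above can be sketched as follows. This is a minimal Orthogonal Matching Pursuit for one signal, assuming a dictionary with unit-norm columns; the function name `omp` and its arguments are chosen for this example and do not come from the cited implementations.

```python
import numpy as np

def omp(D, y, n_nonzero):
    """Greedy sparse coding of y over dictionary D with at most n_nonzero atoms."""
    residual = y.astype(float).copy()
    support = []                      # indices of the atoms selected so far
    coef = np.zeros(0)
    for _ in range(n_nonzero):
        # Pick the atom most correlated with the current residual.
        k = int(np.argmax(np.abs(D.T @ residual)))
        if k not in support:
            support.append(k)
        # Re-fit all selected coefficients jointly by least squares.
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    x = np.zeros(D.shape[1])
    x[support] = coef
    return x

# With an orthonormal dictionary the greedy choice recovers the code exactly.
D = np.eye(5)
y = np.array([0.0, 2.0, 0.0, -1.0, 0.0])
x = omp(D, y, n_nonzero=2)
```

The joint least-squares re-fit is what distinguishes OMP from plain Matching Pursuit, which only updates the coefficient of the atom selected in the current step.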

Basis Pursuit (BP) [basispursuit], Generalised Lasso [lassogeneralized] and Focal Underdetermined System Solver (FOCUSS) [FOCUSS] are some of the important methods of sparse coding using the $\ell_1$ norm optimization in equation (2.4). Under suitable conditions on the dictionary, such piecewise linear approximations provide a guarantee of a maximally sparse unique solution to the sparse coding problem. A probabilistic model for representing an observed pattern in a lower dimensional space with respect to (w.r.t.) an optimum dictionary and with a prior on the coefficient vector is given in Sparse Bayesian Learning [tipping2001sblrvm]. A sparsity inducing prior acts as a means of regularization. Each pattern $y$ is represented as $y = Dx + n$, where $n$ is additive Gaussian noise with variance $\sigma^2$. Now, the likelihood function to be maximized is equation (3.1).

$$p(y \mid D) = \int p(y \mid D, x)\, p(x)\, dx \tag{3.1}$$

Several approximations to equation (3.1) have been proposed in [olshausen1997sparseV1?], [lewicki1999probabilistic], [lee1999blindsource], [lewicki2000learningovercomprepresent] to obtain Approximate Maximum Likelihood (AML) estimates of the coefficient vector $x$ which maximizes the log likelihood function. The training patterns in a collection are assumed to be independent, and different assumptions or approximations about the coefficient vectors result in different estimates. For example, in [lewicki1999probabilistic] and [lewicki2000learningovercomprepresent], the components of each coefficient vector are assumed to be independent and identically distributed (i.i.d.) with a Laplacian prior to promote sparsity, i.e. $p(x_k) \propto \exp(-\theta |x_k|)$, where $\theta$ denotes the parameter diversity.

### 3.1 Importance of Statistical concepts in Sparse Coding Methods

Assumptions about the data and the sampling method used determine the performance of parametric methods. Some of the Bayesian sampling techniques used in pattern recognition are Rejection sampling, Ratio of uniforms, Importance sampling, Markov Chain Monte Carlo (MCMC) methods and Slice sampling [bayesamplingmethourl], [neal2004bayesiansamp]. In [blumensath2007montecarlosampling], sparse coefficients of time series data have been estimated using Gibbs sampling and Importance sampling methods. The Gibbs sampler may fail to explore the entire posterior distribution and instead draw samples from just a single mode, since it is difficult for it to escape local maxima [orthogonalcomponent2010bayesian]. However, combining the Gibbs sampler with annealing techniques can help in faster convergence.

The Partially Collapsed Gibbs (PCG) sampler replaces some conditional distributions with marginal distributions to overcome the limitations of the standard Gibbs sampler, as described by Van Dyk and Park in [pcgsampler1] and [pcgsampler2]. In [orthogonalcomponent2010bayesian], Bayesian inference on the unknown parameters corresponding to each sparse coefficient is conducted using samples generated by the PCG sampler. These samples asymptotically follow the joint posterior distribution of the unknown model parameters and their hyperparameters. Such samples can closely approximate the joint maximum a posteriori estimate of the coefficients and the dictionary.

The Importance sampler draws samples from a proposal distribution and reweights them to compute expectations w.r.t. the target distribution. It is not well suited to finding sparse approximations, as its performance depends on the proposal distribution used.
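The dependence on the proposal can be seen in a small self-normalised importance sampling sketch. The target here is a standard Laplacian density (a sparsity inducing prior) and the proposal is a standard Cauchy density, chosen as an assumption for this example because its heavier tails keep the importance weights bounded. The function name `importance_expectation` is also an assumption.

```python
import numpy as np

def importance_expectation(f, log_target, log_proposal, samples):
    """Self-normalised importance sampling estimate of E_target[f(x)].

    log_target and log_proposal may omit additive constants, because the
    weights are normalised to sum to one.
    """
    log_w = log_target(samples) - log_proposal(samples)
    w = np.exp(log_w - np.max(log_w))    # subtract the max for numerical stability
    w /= w.sum()
    return float(np.sum(w * f(samples)))

rng = np.random.default_rng(0)
samples = rng.standard_cauchy(100_000)   # draws from the proposal

est = importance_expectation(
    f=lambda x: x ** 2,
    log_target=lambda x: -np.abs(x),             # Laplace(0, 1), unnormalised
    log_proposal=lambda x: -np.log1p(x ** 2),    # Cauchy(0, 1), unnormalised
    samples=samples,
)
# The second moment of a Laplace(0, 1) variable is 2, so est should be close to 2.
```

Swapping in a proposal with lighter tails than the target (for example a Gaussian) would make the weights unbounded and the estimate unreliable, which is the sensitivity the paragraph above refers to.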

#### 3.1.1 Priors used in Sparse Approximations

Generally, the class conditional probability densities (assuming the features are continuous) are unknown. If the form of the density is known but its parameters, like mean and variance, are unknown, these parameters are estimated using whatever prior information is available about them, and then the Bayes’ decision rule is applied. The Bayesian framework for estimation of parameters starts with specifying a probabilistic model from which marginal and posterior distributions can be evaluated. When we have a large number of training patterns, the prior generally applied is the Gaussian prior. Though the Gaussian prior works very well, sparsity inducing Laplacian or Cauchy priors act as a way of regularization and allow working with fewer variables than in the Gaussian case. Jeffrey’s prior is invariant w.r.t. a change of coordinates and hence works well as a prior for scale parameters. In [bayesianwithpriorsEurasip], the author has described several priors on the coefficient vector which induce sparsity. The Generalised Gaussian prior is given by equation (3.2).

$$\mathcal{GG}(x; \mu, \alpha, p) = \frac{p}{2\alpha\,\Gamma(1/p)} \exp\!\left(-\left(\frac{|x - \mu|}{\alpha}\right)^{p}\right) \tag{3.2}$$

where $\mu$ is the location parameter, $\alpha$ is the scale parameter, $p$ is the shape parameter and $\Gamma(\cdot)$ is the gamma function.

The shape parameter value $p = 2$ gives the Gaussian prior, which corresponds to $\ell_2$ norm regularization in equation (2.1). $p = 1$ gives the Laplacian prior, which is equivalent to $\ell_1$ norm regularization. The scale parameter $\alpha$ squeezes or stretches the distribution and, along with the location and shape parameters, determines its shape. When compared to the Gaussian prior, the Laplacian prior and those with $p < 1$ are good sparsity inducing priors. When the application is compression based, a higher level of sparsity is desired.
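To make the link between the shape parameter and the regularizing norm explicit, the following sketch derives the maximum a posteriori (MAP) estimate under two assumptions made here for illustration: the noise model $y = Dx + n$ with Gaussian noise of variance $\sigma^2$, and i.i.d. zero-location generalised Gaussian coefficients with scale $\alpha$ and shape $p$. The symbol names are assumptions of this example.

```latex
\begin{aligned}
\hat{x}_{\mathrm{MAP}}
  &= \arg\max_{x} \; p(y \mid D, x)\, p(x) \\
  &= \arg\min_{x} \; -\log p(y \mid D, x) - \log p(x) \\
  &= \arg\min_{x} \; \frac{1}{2\sigma^{2}} \|y - Dx\|_2^{2}
     + \frac{1}{\alpha^{p}} \sum_{k} |x_k|^{p} \\
  &= \arg\min_{x} \; \|y - Dx\|_2^{2} + \lambda \|x\|_p^{p},
  \qquad \lambda = \frac{2\sigma^{2}}{\alpha^{p}} .
\end{aligned}
```

Setting $p = 2$ gives a ridge-type penalty and $p = 1$ gives the $\ell_1$ penalty, so under these assumptions a smaller prior scale $\alpha$ or a larger noise variance $\sigma^2$ both translate into stronger regularization.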

If the prior and the posterior are from the same family of probability distributions, then the prior is a conjugate prior for the likelihood function. For example, in [orthogonalcomponent2010bayesian], the additive Gaussian noise has variance $\sigma^2$, with an inverse-gamma prior. Such conjugate priors help in arriving at a closed form posterior, avoiding numerical integration. The authors of [orthogonalcomponent2010bayesian] have used the PCG sampler to generate samples of the joint probability distribution of model parameters and hyperparameters, where the prior on the coefficient vector is a Bernoulli-Gaussian (BG) distribution with parameters and component variance for each. The hyperprior on

A generic method for sparse coding using Bayesian approach is given in Algorithm 1. For simplicity, one-dimensional signals are considered and corresponding errors in , coefficients of , have to be determined using dictionary . In [empiricalpriorwipf2007empirical], the authors have used a flexible prior based on original data. But, highly sparse priors like Jeffrey’s prior result in multimodal posteriors and hence the problem of local optima arises.

In [cauchypriorspcoding], the coefficients are assumed to follow a Cauchy distribution, which is heavy-tailed and a member of the Lévy alpha-stable family of distributions. A Cauchy proximal operator has been defined and a Cauchy Convolutional Sparse Coding algorithm has been proposed to learn sparse coefficients that minimize the representation loss.

Sparse Bayesian Learning (SBL), for example, is a Bayesian approach to finding the sparse coefficient vectors of given observations. Multiple Snapshot SBL (M-SBL) is used for a dataset of $N$ observations constituting the input data $Y = [y_1, \dots, y_N]$, with the corresponding sparse coefficient matrix $X = [x_1, \dots, x_N]$.

In [msbl2016DoA], the authors have assumed Gaussian hyperprior and achieved results comparable to the state-of-the-art. Though Gaussian hyperprior does not induce high level of sparsity, SBL algorithm which achieves maximally sparse solution even with a random dictionary [SBL_Tipping], is used in DoA estimation. The sparsity level in each coefficient vector is automatically determined at the point of convergence [SBL_visualtracking2005].

Another approach to sparse coding is to generate samples from and for MCMC sampling methods. To estimate , Gibbs sampler gives . These samples approximate . Now, samples , are used to approximate . A technical review of Bayesian approaches to sparse coding methods has been given in [spcodingbayesian].

## 4 Dictionary Learning Methods

The origins of research into dictionary learning are in Independent Component Analysis (ICA). ICA minimizes the dependence among vector components by imposing independence up to second order [Comon:1994:ICA], i.e., the variables are linear combinations of unknown latent variables which are also assumed to be independent. For a random vector $y$ with finite covariance $V_y$, ICA finds a pair of matrices $\{F, \Delta\}$, $\Delta$ being diagonal with entries sorted in descending order, such that $V_y = F \Delta^{2} F^{H}$. Similar to dictionary learning, the directions here are also orthogonal, with a unit norm constraint on the columns of $F$. The entries in dictionary $D$ are real numbers and the entry of largest modulus in each column of $F$ is a positive real number. Sparse representation is closely related to ICA with these conditions and hence can be used as a preprocessing tool, just like ICA, before applying Bayesian detection and classification [Comon:1994:ICA].

Fixing $X$ obtained from equation (2.3), the joint optimization over $(D, X)$ in equation (2.2) is reduced to a linear optimization problem using the Block Coordinate Descent method [block_coordinatedescent]. Updating the dictionary w.r.t. a fixed coefficient matrix results in equation (4.1), and this learning phase to update $D$ is called Dictionary Learning (DL).

$$\min_{D} \; \|Y - DX\|_F^2 \quad \text{subject to } \|d_k\|_2 = 1 \tag{4.1}$$

where $k = 1, \dots, K$, and $K$ is the size of dictionary $D$.
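One simple way to carry out the dictionary update for a fixed coefficient matrix is a least-squares fit followed by rescaling the atoms to unit norm, in the style of the Method of Optimal Directions. The sketch below is an illustration under that assumption and is not a reproduction of any cited implementation. The function name `mod_dictionary_update` and the choice to rescale the rows of the coefficient matrix so that the product stays unchanged are also assumptions.

```python
import numpy as np

def mod_dictionary_update(Y, X, eps=1e-12):
    """Least-squares dictionary update for fixed X, with unit-norm atoms.

    Returns the updated dictionary and a rescaled coefficient matrix such
    that their product equals the unnormalised least-squares fit.
    """
    D = Y @ np.linalg.pinv(X)            # minimiser of ||Y - D X||_F^2 over D
    norms = np.linalg.norm(D, axis=0)
    norms[norms < eps] = 1.0             # leave (near-)zero atoms untouched
    D_unit = D / norms                   # enforce ||d_k||_2 = 1
    X_scaled = X * norms[:, None]        # compensate so D_unit @ X_scaled == D @ X
    return D_unit, X_scaled

# Synthetic check: data generated from a known unit-norm dictionary.
rng = np.random.default_rng(1)
D0 = rng.standard_normal((8, 4))
D0 /= np.linalg.norm(D0, axis=0)
X0 = rng.standard_normal((4, 30))        # full row rank with probability one
Y = D0 @ X0
D_new, X_new = mod_dictionary_update(Y, X0)
```

In a full learning loop this update would alternate with a sparse coding step until the representation error stops decreasing.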

### 4.1 Orthogonal Dictionary Learning

Initially, mathematical transforms were applied to the original data columns to get orthonormal dictionaries called analytic dictionaries. Such wavelet and Fourier dictionaries have incoherent atoms that are orthogonal to each other, and hence are preferred for compression based applications. The level of sparsity achieved with orthogonal dictionaries is very good. Though PCA is capable of capturing the major variance in data, minor details which are crucial for discrimination are not captured. Moreover, the number of significant eigenvalues has to be specified by the user. These limitations of PCA could be overcome by a synthesis dictionary which is initialized with original data as atoms and then iteratively updated such that the representation error is minimal, giving better representations and faster convergence [SRsurvey].

Though Non-Negative Matrix Factorization (NMF) works well for compressed representations of data, in case of natural images, NMF does not perform well when compared to overcomplete sparse representations [NMF_SR_sleepsignalclassify].

### 4.2 Overcomplete Dictionary Learning

Representation of natural images is rich when the redundancy in data is utilised in the form of overcomplete dictionaries. Atoms of an overcomplete dictionary are selected such that their number $K$ is small compared to the data size $N$, but larger than the input dimensionality $n$, i.e., $n < K \ll N$. Unlike undercomplete, orthogonal dictionaries, these overcomplete dictionaries are used in reconstruction based applications like image denoising and inpainting, where a missing or corrupted part of an image is reconstructed.

Overcomplete dictionary works in contrast with Vector Quantization (VQ), in which each sample is mapped to exactly one prototype. Dictionary learning algorithms could be used to update prototype vectors as in [DLforVQ], where dictionary learning helps in better quantization of ECG patterns.

### 4.3 Structured Dictionary Learning

When the training set comprises features along with their class labels, structured dictionaries could be generated. Sub-dictionaries of all classes are grouped together to form a global shared dictionary which represents features shared by all classes. Sub-dictionaries have atoms used to represent features particular to a specific class. The minimal reconstruction error in equation (4.1) w.r.t. the sub-dictionaries decides the label of a test pattern. When the number of classes increases, computation of structured sub-dictionaries becomes expensive. A single shared dictionary whose atoms have features of each class as well as common features shared by all classes saves memory and time. In [Yang2011FisherDD], to learn a discriminative structured dictionary, the reconstruction error term is designed such that a given class of data is represented best by the global dictionary and the corresponding class dictionary but not by other class dictionaries. The Fisher criterion, i.e., minimal intra-class scatter and maximal inter-class scatter, is imposed on the sparse coefficient vectors, making both the coefficients and the dictionary atoms discriminative and leading to better classification results. Within-class scatter is given by $S_W(X) = \sum_{c=1}^{C} \sum_{x_i \in X_c} (x_i - m_c)(x_i - m_c)^{T}$ and between-class scatter is given by $S_B(X) = \sum_{c=1}^{C} n_c (m_c - m)(m_c - m)^{T}$, where $m_c$ is the mean sparse coefficient vector of class $c$, $n_c$ is the number of samples in class $c$ and $m$ is the mean sparse coefficient vector of all the data. Fisher Discrimination Dictionary Learning (FDDL) uses an alternating optimization method with Fisher Discrimination based Sparse Coding in equation (4.2).

(4.2) |

where the Fisher discrimination term could be computed by finding the eigenvalues of the scatter matrices $S_W$ and $S_B$. Thus, the simple statistical concepts used in FDDL help in learning a discriminative dictionary. Unsupervised and supervised methods of learning such structured dictionaries are given in Section 4.4 and Section 4.5.

### 4.4 Unsupervised Dictionary Learning Algorithms

Unsupervised algorithms for dictionary learning result in generative or representative dictionaries, usually applied in image denoising, deblurring and inpainting. The missing pixels of an image can be reconstructed with the help of generative dictionaries. In each iteration, the Method of Optimal Directions (MOD) [MOD_Engan] updates the dictionary by computing the pseudo-inverse of the coefficient matrix, which causes slow convergence. Another unsupervised algorithm, K-SVD [ksvd], is a generalisation of the k-means algorithm, which converges faster due to the simultaneous update of both coefficient vectors and dictionary atoms. Only the elements corresponding to non-zero components of the coefficient vector are considered to compute the residual signal, $E_k = Y - \sum_{j \neq k} d_j x_T^{j}$, where $x_T^{j}$ denotes the $j$th row of $X$, and Singular Value Decomposition (SVD) is applied to diagonalize the residual, $E_k = U \Sigma V^{T}$. The first column of $U$ gives the updated atom and the product of the first diagonal element of $\Sigma$ and the first row of $V^{T}$ gives the updated coefficient vector. Retaining only the major part of the signal in the form of a few large singular values effectively reduces noise and gives a better representation of the signals. Incremental Codebook Optimization [mairal2010onlinedictlearn] and Locality Constrained Linear Coding [wang2010locality] are other unsupervised dictionary learning algorithms.

### 4.5 Supervised Dictionary Learning Algorithms

Supervised dictionary learning gives discriminative dictionaries for classification of patterns using labels of patterns in the formulation of objective function. Face recognition in the presence of obstructions and different moods and postures is an important application where discriminative dictionaries are used. Discriminative-KSVD (DKSVD) learns a discriminative dictionary by incorporating label information into the objective function of K-SVD.

$$\min_{D, W, X} \; \|Y - DX\|_F^2 + \gamma \|H - WX\|_F^2 + \beta \|W\|_F^2 \quad \text{subject to } \|x_i\|_0 \le T \tag{4.3}$$

The matrix $\begin{bmatrix} D \\ \sqrt{\gamma}\, W \end{bmatrix}$ is always normalized column-wise, so the regularization penalty can be dropped to get

$$\min_{D, W, X} \; \left\| \begin{bmatrix} Y \\ \sqrt{\gamma}\, H \end{bmatrix} - \begin{bmatrix} D \\ \sqrt{\gamma}\, W \end{bmatrix} X \right\|_F^2 \quad \text{subject to } \|x_i\|_0 \le T \tag{4.4}$$

The label matrix $H$ is approximated by a classifier matrix $W$ and the coefficient matrix $X$, using alternating optimization (BCD), given by equation (4.3). With $\gamma$, $\beta$ as regularization parameters, the K-SVD algorithm is applied to optimize equation (4.4). A similar approach to learning a discriminative dictionary is Label Consistent KSVD (LCKSVD) [LCKSVD]. If dictionary atoms are coherent, then there is a multiple representation problem. So, a compact dictionary is preferred, with which similar signals (from the same class) can be described by roughly the same set of atoms with similar coefficients. Application of statistical methods in feature extraction, as well as in determining the size of the dictionary and the dictionary columns, results in better discriminative dictionaries.

## 5 Statistical Concepts in Dictionary Learning

The problem of identifying a dictionary relies on the assumptions of statistical independence and a non-Gaussian distribution set as prior [karin_1_l1regn]. The ratio of majority to minority class cardinality could be high, leading to a high misclassification cost. A probabilistic model for sparse representation based classification has been given in [classimbalanceprobab] to address the problem of class imbalance in a dataset. A cost sensitive classification rule based on the Bayesian framework, with sparse coefficients as features, has not only improved accuracy but also reduced misclassification cost.

### 5.1 Histogram of Oriented Gradients (HoG)

In the case of high dimensionality, feature descriptors are used to avoid unnecessary computations involved in classification. Histogram of Oriented Gradients (HoG) is a feature descriptor that describes an image through the orientations and magnitudes of the intensity gradients at its pixels. Gradients define the edges of an image, so extracting the HoG feature descriptor is closely related to extracting edges.

HoG computes gradients at each point of the image, providing a degree of robustness to occlusions, illumination and expression changes. In [hog_grpsp], group sparse coding with HoG feature descriptors is used to achieve good results on face recognition.
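As a rough illustration of how such a descriptor is formed, the sketch below builds a magnitude-weighted histogram of gradient orientations for a single image cell. It is a simplification made for this example: a full HoG pipeline also tiles the image into cells and normalises over blocks of cells. The function name `cell_orientation_histogram`, the 9 bins and the unsigned 0 to 180 degree range are assumptions.

```python
import numpy as np

def cell_orientation_histogram(cell, n_bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations for one cell."""
    cell = np.asarray(cell, dtype=float)
    gy, gx = np.gradient(cell)                      # derivatives along rows and columns
    magnitude = np.hypot(gx, gy)
    orientation = np.degrees(np.arctan2(gy, gx)) % 180.0
    hist, _ = np.histogram(orientation, bins=n_bins,
                           range=(0.0, 180.0), weights=magnitude)
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist        # L2 normalisation of the cell

# Intensity increasing left to right: the gradient points along the column axis (0 degrees).
horizontal_ramp = np.tile(np.arange(8.0), (8, 1))
# Intensity increasing top to bottom: the gradient points along the row axis (90 degrees).
vertical_ramp = horizontal_ramp.T

h_hist = cell_orientation_histogram(horizontal_ramp)
v_hist = cell_orientation_histogram(vertical_ramp)
```

The two ramps place all of their weight in different orientation bins, which is the property that lets the descriptor separate edge directions.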

### 5.2 Use of Correlation Analysis in Dictionary Learning

Correlation measures the association between two variables, whether both are independent or one is independent and the other dependent. It is quantified by a correlation coefficient (Pearson, Kendall, Spearman), which ranges between $-1$ and $+1$ and also gives the direction of the relationship, i.e. positive or negative correlation. In [pearsoncorrcoefftfacerecog], the Pearson product moment correlation coefficient is combined with the sparse reconstruction error of samples for face recognition. While the reconstruction error term reduces the error between the test sample and same class samples, the Pearson correlation coefficient maximizes the error between the test sample and other class samples, for improved classification results.
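For reference, the Pearson product moment correlation coefficient between two feature vectors can be computed as in the sketch below. The function name `pearson` is an assumption, and the way the cited work combines this coefficient with the reconstruction error is not reproduced here.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a = a - a.mean()                      # centre both vectors
    b = b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

r_pos = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])   # perfectly increasing relation
r_neg = pearson([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])   # perfectly decreasing relation
```

Because both vectors are centred and scaled by their norms, the result is the cosine of the angle between the centred vectors and therefore always lies between -1 and +1.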

Canonical Correlation Analysis (CCA) extends bivariate correlation analysis to the multivariate case, measuring the association between two sets of variables. In [structdictlearnbasedoncorr], the unknown block structure of the dictionary is explored using the correlation among dictionary atoms. This method gives control over the size of blocks. The maximum correlation quotient between the test sample and training samples and the reconstruction residual are weighted in the decision function to determine the label of the test signal.

## 6 Parametric Approaches to Estimation of Dictionary Parameters

Parametric approaches make some assumptions about the population distribution from which the training data originated. The Central Limit theorem is crucial to these assumptions. The theorem states that if a sufficiently large number of random samples are drawn (with replacement) from any population with finite mean $\mu$ and variance $\sigma^2$, then the distribution of sample means will be approximately Gaussian. Whenever there is uncertainty about the probability model of the data, a Gaussian probability model can be assumed to derive the population parameters.

The parametric approach to dictionary learning assumes a known distribution from which the columns of the dictionary are drawn and tries to estimate the parameters of the distribution, such as the size and the atoms themselves, by using maximum likelihood estimation to derive the mean and covariance of the dictionary column distribution. Full posterior estimates are provided using a Bayesian framework, which takes care of the uncertainty and unseen data generally observed in biomedical applications. For representing an observed pattern in a lower dimensional space w.r.t. a coefficient vector and with a prior on the dictionary parameters, the likelihood function to be maximized is equation (6.1).

(6.1) |

Approximate Maximum Likelihood estimation of an unknown but deterministic dictionary using equation (6.1) is equivalent to Method of Optimal Directions (MOD) when the noise is assumed to be Gaussian [ILSDLA2007]. In [overcompletedictandsr], an algorithm to find a joint Maximum A posteriori Probability (MAP) estimate of an unknown random initial dictionary and the corresponding coefficient matrix, is given.

In [dictparaestimate], a Bayesian approach has been employed to estimate the dictionary atoms and the dictionary size along with the sparse coefficient vector hyperparameters. Additive noise is assumed, whose variance is modelled by a gamma distribution with unknown parameters. Each dictionary atom or column has been assumed to be drawn component-wise from a uniform distribution. Such a uniform prior is non-informative, so this assumption is equivalent to taking a random initial dictionary whose columns have unit norm. The coefficient vectors have been modelled as zero-mean Gaussian, where the covariance matrix is determined by hyperparameters which are assumed to be independently gamma distributed.

The dictionary atom parameters, hyperparameters on coefficient vectors and noise variance are determined by approximating a MAP estimate, obtained by iteratively maximizing the log-posterior density w.r.t. each of them while keeping the others fixed. This approach is equivalent to the Block Coordinate Descent technique employed to optimize equation (2.2).

A closed form solution to maximizing the likelihood function in equation (6.1) is intractable, so Monte Carlo methods like the Gibbs sampler and Metropolis-Hastings are used to approximate the posteriors of dictionary variables [MCMCparadictionaries]. A Markov Chain (MC) is said to be irreducible if it is eventually possible to reach every state from each state with positive probability; an irreducible chain that is also aperiodic and positive recurrent is ergodic. In [MCMCparadictionaries], uniform ergodicity properties of high dimensional Markov Chains, which imply convergence to a stationary distribution independent of the initial states, have been discussed.

To approximate posteriors of dictionary, Group-wise sampling and aggregation have been used to identify group-wise similar functional brain networks of different persons in [samplingmethodsdictlearn]. Signal sampling and sparse coding on task fMRI data for learning a shared dictionary within a group of persons has helped in identifying and examining common cortical functional networks at individual level and population level. The authors have used No sampling, random sampling, uniform random sampling, 2-ring and 4-ring sampling methods and the corresponding statistical significance tests have been conducted.

Data driven overcomplete dictionaries enable flexible representations of data and the quality of an overcomplete dictionary could be determined using diversity measures like distance between atoms, reconstruction error, coherence among atoms. The Babel measures and entropy from information theory measure the randomness in a system. A high value of entropy denotes spread of atoms in a dictionary [overcompdictentropy].

Active Dictionary Learning updates dictionary atoms from the information in training data, using different strategies. Selecting the most useful sample by uncertainty sampling and by generalization error are classical strategies. A sample whose label cannot be decided is called an uncertainty sample and can be identified using posterior probability, margin sampling and entropy based methods [activeDL]. When the samples are complexly structured, like trees and sequences, entropy based queries retrieve informative samples for dictionary building. The uncertainty sample based on entropy is given by

$$s^{*} = \arg\max_{s} \; -\sum_{c} P(c \mid s; \theta) \log P(c \mid s; \theta)$$

where $c$ ranges over the class labels and $\theta$ is the set of dictionary parameters.
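The entropy based query can be sketched in a few lines. In this illustration the class probabilities are obtained from class-specific reconstruction errors through a softmax of the negative errors, which is an assumption made for the example and not necessarily the mapping used in the cited works. The function names `probs_from_errors` and `most_uncertain` are also assumptions.

```python
import numpy as np

def probs_from_errors(errors):
    """Map class-wise reconstruction errors (n_samples, n_classes) to probabilities.

    Assumed mapping: softmax of the negative error, so a smaller error
    gives a higher class probability.
    """
    z = -np.asarray(errors, dtype=float)
    z = z - z.max(axis=1, keepdims=True)          # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def most_uncertain(probs):
    """Return the index of the sample with maximum predictive entropy, and all entropies."""
    p = np.clip(probs, 1e-12, 1.0)                # avoid log(0)
    entropy = -np.sum(p * np.log(p), axis=1)
    return int(np.argmax(entropy)), entropy

# Three samples, two classes: the second sample is reconstructed equally well by both.
errors = np.array([[0.1, 2.0],
                   [1.0, 1.0],
                   [3.0, 0.2]])
idx, entropy = most_uncertain(probs_from_errors(errors))
```

The sample whose class-specific errors are indistinguishable receives the highest entropy and would be the one queried for a label.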

When the training set contains both labeled and unlabeled samples, informativeness of samples could be decided by the probability distribution of class-specific reconstruction error, which determines how well the current dictionary is able to discriminate the sample.

In [activediscrimiDL], the authors have used both reconstruction error of a sample w.r.t. shared dictionary and its entropy on the probability distribution over class-specific reconstruction error, to determine the dictionary. Here, level of discrimination of dictionary is given by the entropy on the probability distribution of error of labeled samples and level of representation is given by the distribution of the error of unlabeled samples.

### 6.1 Hidden Markov Model (HMM)- Discriminative Dictionary Learning

With a Hidden Markov Model, it is possible to describe the sparsity profile, as each hidden state represents a set of non-zero coefficients. In [hiddenmarkovsparselearning], the problem of sparse representation has been modeled as an HMM. The approach in this paper has combined filtering based on HMM and manifold-based dictionary learning for estimating both the non-zero coefficients and the dictionary.

An equivalence relation, partitioning the set of dictionaries into equivalence classes, has been introduced. A direct search for the equivalence class which contains the true dictionary has been used. The observations are decoupled using a new technique called Change-of-measure, so that the observations are all uniformly, identically distributed.

Expectation Maximization has been used to recursively update state in the Markov chain i.e., coefficient matrix with Gaussian prior, transition matrix of Markov chain and the dictionary.

Sparse HMM is used in [sparseHMMsurgical] to model surgical gestures, where the dictionary is a set of basic surgical motions. An algorithm is proposed to learn a dictionary for all gestures and an HMM grammar describing the transitions among gestures. New motion data is classified based on these dictionaries and grammars. The Viterbi algorithm is used for surgeme classification.

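Given a surgery trial $Y = \{y_t\}_{t=1}^{T}$, the task is to assign a surgeme label $s_t \in \{1, \dots, S\}$ to each frame $y_t$. A skill level from {expert, intermediate, novice} is assigned to the trial $Y$. The surgeme label $s_t$ is a hidden (unobserved) state modeled as a Markov process with transition probability $p(s_t \mid s_{t-1})$. Thus, an observation $y_t$ at time $t$ depends on the hidden state $s_t$ through the emission probability density $p(y_t \mid s_t)$, which is generally assumed to be Gaussian or a mixture of Gaussians.

Also, $y_t$ is expressed as a superposition of atoms from a dictionary corresponding to gestures. Hence, $y_t$ depends on another hidden variable, i.e., the coefficient vector $x_t$.

For each hidden state $s_t = s$, a Laplacian prior is imposed on $x_t$ to obtain a sparse latent variable, as given in equation (6.2).

$$p(x_t \mid s_t = s) = \left(\frac{\lambda_s}{2}\right)^{N_s} \exp\left(-\lambda_s \lVert x_t \rVert_1\right) \tag{6.2}$$

where $\lambda_s$ is the parameter of the prior and $N_s$ is the size of the dictionary corresponding to $s$. Now,

$$y_t \mid (x_t,\, s_t = s) \sim \mathcal{N}\left(\mu_s + D_s x_t,\; \sigma_s^2 I\right)$$

where $D_s$ is an overcomplete dictionary corresponding to surgeme $s$.

Bayesian Expectation Maximization is applied to learn all the transition probabilities and the parameters $\theta_s$ of each surgeme model, for each $s \in \{1, \dots, S\}$.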
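A short worked derivation may help to show why a Laplacian prior on the coefficients leads to sparse codes. It is a sketch under stated assumptions rather than the derivation of [sparseHMMsurgical]: the emission is taken to be Gaussian with mean $\mu_s + D_s x_t$ and isotropic variance $\sigma_s^2$, and the symbols $x_t$, $y_t$, $D_s$, $\mu_s$, $\sigma_s$ and $\lambda_s$ are notation chosen here for a frame, its coefficient vector and the parameters of surgeme $s$.

```latex
\begin{aligned}
-\log p(x_t \mid y_t, s_t = s)
  &= -\log p(y_t \mid x_t, s_t = s) - \log p(x_t \mid s_t = s) + \text{const} \\
  &= \frac{1}{2\sigma_s^2}\,\lVert y_t - \mu_s - D_s x_t \rVert_2^2
     + \lambda_s \lVert x_t \rVert_1 + \text{const}
\end{aligned}
```

Under these assumptions, the most probable coefficient vector for a given frame and surgeme is the solution of an $\ell_1$-regularized least-squares problem, which is the usual sparse coding objective.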

To obtain the surgeme labels of a given trial $Y$, a dynamic programming approach is given. If the number of states is finite, the algorithm converges.
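The dynamic programming step for decoding hidden-state labels is typically the Viterbi recursion, which the source names for surgeme classification. The sketch below is a generic log-space Viterbi decoder, not the implementation of [sparseHMMsurgical]; the two-state model and all probabilities are hypothetical, and the emission log-likelihoods are assumed to be precomputed for every frame.

```python
import math

def viterbi(log_init, log_trans, log_emit):
    """Most likely hidden-state sequence of an HMM, computed in log space.

    log_init[s]     : log P(s_1 = s)
    log_trans[r][s] : log P(s_t = s | s_{t-1} = r)
    log_emit[t][s]  : log p(y_t | s_t = s), precomputed for every frame t
    """
    n_states = len(log_init)
    # best[s] is the best log-score of any path ending in state s at the current frame.
    best = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    backpointers = []
    for t in range(1, len(log_emit)):
        new_best, pointers = [], []
        for s in range(n_states):
            prev = max(range(n_states), key=lambda r: best[r] + log_trans[r][s])
            pointers.append(prev)
            new_best.append(best[prev] + log_trans[prev][s] + log_emit[t][s])
        best = new_best
        backpointers.append(pointers)
    # Trace the best path backwards from the best final state.
    state = max(range(n_states), key=lambda s: best[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Hypothetical two-surgeme model with "sticky" transitions and four frames.
log = math.log
log_init = [log(0.5), log(0.5)]
log_trans = [[log(0.9), log(0.1)], [log(0.1), log(0.9)]]
log_emit = [[log(0.8), log(0.2)], [log(0.7), log(0.3)],
            [log(0.2), log(0.8)], [log(0.1), log(0.9)]]
labels = viterbi(log_init, log_trans, log_emit)
```

With a finite number of states, the recursion visits each frame once and costs time proportional to the number of frames times the square of the number of states.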

For skill-level classification, three Sparse HMM models are learnt, one each for the expert, intermediate and novice levels. The skill level is determined using the Viterbi algorithm [sparseHMMsurgical].

## 7 Non-parametric Approaches to Discriminative DL

Unsupervised and Supervised DL algorithms discussed in Section 4.4 and Section 4.5 are non-parametric approaches to DL [MCMCparadictionaries]. Parametric dictionaries consider uncertainty in data and avoid local optima. This property of parametric dictionaries improves generalization of sparse representation model.

In Supervised Dictionary Learning algorithms [Mairal_supervisedDL, Yang2011FisherDD, dksvd, LCKSVD], sparse codes consistent with class labels are generated for both generative and discriminative models.

In [Joint_L], the objective function combines the classification error and the representation error of both labeled and unlabeled data, with a constraint on the number of coefficients. All these algorithms have concise mathematical formulations and have been tested on face recognition datasets such as Extended YaleB and AR, and on the handwritten numeral datasets MNIST and USPS.

If the form of the class-conditional density is unknown, non-parametric approaches such as Parzen windows, the k-nearest neighbour rule and the Multi-Layer Perceptron (MLP) with back propagation can be used to estimate it from the observed data. To generalize well, such data-based methods require large amounts of data. A feedforward network with a single hidden layer of sigmoidal units can approximate any continuous function on a compact domain to arbitrary accuracy (Cybenko's theorem [cybenko1989approximation]). Considered as a non-parametric method for estimating the optimum weights of a neural network, the MLP makes no assumptions about the data and decides boundaries based on the observed data [jain2000statisticalPR]. As the input dimensionality increases, the number of hidden neurons required can grow exponentially. Convolutional neural networks and Deep Belief Networks with several hidden layers are used in computer vision and pattern recognition to achieve the best classification results. In deep neural networks, where data paucity could affect generalization, an auto-encoder is applied for dimensionality reduction. When the training samples are limited and feature extraction is carried out by several hidden layers, problems such as vanishing gradients and overfitting can arise. The learning time increases as the gradient vanishes in back propagation [hochreiter1998vanishing]. If the feature extraction step of the MLP could be replaced with sparse representation, the classifying capability of the MLP could be used to classify high-dimensional data with few samples.
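To make the non-parametric idea concrete, the following sketch estimates a one-dimensional density with Parzen windows and uses the estimates in a simple decision rule. The Gaussian kernel, the window width `h`, the training values and the function names are all assumptions chosen for illustration.

```python
import math

def parzen_density(x, samples, h):
    """Parzen-window estimate of a 1-D density at x with a Gaussian kernel of width h."""
    norm = 1.0 / (len(samples) * h * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in samples)

def classify(x, samples_by_class, h):
    """Pick the class with the largest prior-weighted class-conditional density estimate."""
    total = sum(len(s) for s in samples_by_class.values())

    def score(label):
        samples = samples_by_class[label]
        return (len(samples) / total) * parzen_density(x, samples, h)

    return max(samples_by_class, key=score)

# Hypothetical 1-D training data for two classes.
training = {
    "a": [0.9, 1.0, 1.2, 1.1],
    "b": [2.9, 3.1, 3.0, 3.3],
}
predicted = classify(1.3, training, h=0.5)
```

The window width plays the role of the design parameter that has to be chosen carefully when training samples are few: a small `h` follows the samples closely, while a large `h` smooths the estimate.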

In [zazo2019convolutionaldictlearn], a one-to-one correspondence between the sparse coding step and deep CNNs is proposed, representing images using wavelet analysis, sparse coding and dictionary learning. The dense signal gives the scale, while the SR, which selects a few dictionary atoms, gives the detail. Hierarchical convolutional sparse coding (H-CSC) and Convolutional Dictionary Learning are applied alternately to generate a different set of images that combines the content of one set of images with the style of another [seo2020dictionary].

To overcome the limitations of both Dictionary Learning and Deep Learning, a hybrid method is proposed in [madhuri2019telugu], in which optimal weights are selected and the best-performing compact architecture is picked empirically. Sparse coefficients of samples of the same class are similar, and those of different classes are quite different, when computed using a single shared dictionary [LCKSVD]. The authors use this property of sparse coefficients together with Discriminative K-SVD to learn a dictionary for classifying datasets that have a large number of classes and a high class imbalance ratio. For example, the Telugu OCR dataset UHTelPCC [rakeshrtip2r] has high class imbalance, as shown in Fig. 7.4. Telugu script characters are structurally complex, which makes feature extraction from their images difficult. There is also the problem of confusing pairs, shown in Fig. 7.1, which is very common in Dravidian scripts.

Fig. 7.1: Confusing pairs of Telugu characters; panels (A), (pa), (ha) and (vaa).

A hybrid method which makes use of the sparse codes as input features avoids tedious feature extraction overhead in deep networks as shown in Fig.7.2, leading to a compact MLP architecture.
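The idea of using sparse codes as input features can be illustrated with a small sparse coding routine. The source generates its codes with equation (4.4) and DKSVD; the sketch below swaps in plain Orthogonal Matching Pursuit over a random dictionary purely as a stand-in, and the dictionary size, signal and function name `omp` are hypothetical.

```python
import numpy as np

def omp(D, y, n_nonzero):
    """Greedy Orthogonal Matching Pursuit: approximate y with at most n_nonzero atoms of D.

    D is (signal_dim, n_atoms) with unit-norm columns; returns a code of length n_atoms.
    """
    residual = y.astype(float).copy()
    support = []
    code = np.zeros(D.shape[1])
    for _ in range(n_nonzero):
        # Pick the atom most correlated with the current residual.
        atom = int(np.argmax(np.abs(D.T @ residual)))
        if atom in support:
            break
        support.append(atom)
        # Re-fit all selected coefficients jointly by least squares.
        coeffs, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coeffs
    if support:
        code[support] = coeffs
    return code

rng = np.random.default_rng(0)
D = rng.standard_normal((16, 40))
D /= np.linalg.norm(D, axis=0)          # normalise atoms to unit length

# Hypothetical signal built from two atoms of the shared dictionary.
true_code = np.zeros(40)
true_code[[3, 17]] = [1.5, -2.0]
y = D @ true_code

code = omp(D, y, n_nonzero=2)
```

The resulting vector `code` has the length of the dictionary but only a few non-zero entries, and it is this vector, rather than the raw signal, that would be passed to the classifier.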

With the dictionary initialised, sparse codes generated using equation (4.4) are given as input to a simple MLP with two hidden layers, as shown in Fig. 7.3. The MLP architecture has a dense layer (with ReLU activation), a batch normalization layer, a dropout layer and another dense layer (with ReLU activation). The output layer (with softargmax, i.e. softmax, activation) corresponds to the categorical labels of the dataset. The batch normalization layer between the hidden layers maps the nonlinear features to the linear part of the activation function. The dropout layer is applied to reduce overfitting. The MLP is trained on sparse codes generated using DKSVD and evaluated on sparse codes of test images. The train and test sets of sparse codes are generated with respect to the same shared dictionary [madhuri2019telugu]. Algorithm 2 [madhuri2019telugu] has been tested on UHTelPCC, a printed Telugu connected component dataset, and on the MNIST dataset.
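The layer ordering described above (dense, batch normalization, dropout, dense, softmax output) can be sketched as a forward pass in NumPy. This is an illustrative sketch only, not the implementation of [madhuri2019telugu]: the layer widths, dropout rate, random weights and input codes are hypothetical, the learned scale and shift of batch normalization are omitted, and batch statistics are used in place of running averages.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)     # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def batch_norm(h, eps=1e-5):
    # Per-feature normalisation over the batch; learned scale and shift are omitted.
    return (h - h.mean(axis=0)) / np.sqrt(h.var(axis=0) + eps)

def dropout(h, rate, rng):
    # Inverted dropout: zero a fraction `rate` of units and rescale the rest.
    mask = rng.random(h.shape) >= rate
    return h * mask / (1.0 - rate)

def forward(codes, params, rng, rate=0.5, training=True):
    h = relu(codes @ params["W1"] + params["b1"])      # first dense layer
    h = batch_norm(h)
    if training:
        h = dropout(h, rate, rng)
    h = relu(h @ params["W2"] + params["b2"])          # second dense layer
    return softmax(h @ params["W3"] + params["b3"])    # class probabilities

# Hypothetical sizes: 8 sparse codes of length 60, hidden widths 32 and 16, 5 classes.
rng = np.random.default_rng(0)
sizes = [60, 32, 16, 5]
params = {}
for i, (n_in, n_out) in enumerate(zip(sizes[:-1], sizes[1:]), start=1):
    params[f"W{i}"] = rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / n_in)
    params[f"b{i}"] = np.zeros(n_out)

codes = rng.standard_normal((8, 60)) * (rng.random((8, 60)) < 0.1)   # mostly zeros
probs = forward(codes, params, rng)
```

Each row of `probs` is a distribution over the class labels; in practice the weights would be learned by back propagation on the training sparse codes rather than drawn at random.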

### 7.1 UHTelPCC

UHTelPCC is a Telugu dataset containing 70000 binary connected components of size $32 \times 32$ pixels from 325 classes. UHTelPCC is available at http://scis.uohyd.ac.in/ chakcs/UHTelPCC.zip. These 70000 samples are divided into training (50000), validation (10000) and test (10000) sets. The computation times reported in Table 7.1 correspond to training and validating the MLP. Model accuracy is depicted in Fig. 7.5 and model loss in Fig. 7.6.

The method has been tested on sparse codes generated from dictionaries of different sizes. The choice of proper size of dictionary for each class is a tradeoff between computing time and accuracy [madhuri2019telugu]. From Table 7.1, dictionary size of 20 atoms for each class gives 98.7% accuracy for UHTelPCC with dimensionality 1024.

#Atoms | K=16 | K=20 | K=24 | K=26 |
---|---|---|---|---|
Time | 19s | 24s | 27s | 32s |
Accuracy (%) | 97.9 | 98.7 | 98.73 | 98.91 |
F1-score | 0.9856 | 0.9963 | 0.9991 | 0.9998 |

### 7.2 MNIST

MNIST [ciregan2012multi] is a handwritten numerals dataset with 60000 samples for training and 10000 for testing. The dimensionality is 784, and a dictionary size of 18 atoms per class gives 96.3% accuracy.

#Atoms | K=14 | K=16 | K=18 | K=23 |
---|---|---|---|---|
Time | 18s | 21s | 22s | 32s |
Accuracy (%) | 95.34 | 95.12 | 96.32 | 96.4 |

The reduced training and testing times for both UHTelPCC and MNIST, reported in Table 7.1 and Table 7.2, suggest that the model has low computational complexity. This non-parametric method of learning classifier weights supports the idea of using statistical concepts in sparse coding as well as in dictionary learning.

## 8 Conclusion

The transformation of dictionary learning from orthogonal transforms to overcomplete analytic transforms to overcomplete synthesis dictionaries is followed by parametric dictionary learning. In this review article, we present an overview of using probabilistic models, with different priors and hyper-priors on variables, parametric and non-parametric approaches to parameter estimation, used in sparse representation algorithms. Sampling techniques used in sparse representation to overcome problems like multi-modal data, class imbalance in data, unlabeled data mixed with labeled data and high dimensionality, are discussed. Design of structured, overcomplete dictionaries using entropy analysis of data and examples of research articles presenting Hidden Markov Models for dictionary learning and sparse coding are given. Research articles which combine CNNs with sparse representation to separate content and style in images as well as a hybrid method which combines the representational capabilities of dictionary learning with the classifying capabilities of a neural network are discussed.