Correlation and Class Based Block Formation for Improved Structured Dictionary Learning

08/04/2017 · Nagendra Kumar, et al.

In recent years, the creation of block-structured dictionaries has attracted a lot of interest. Learning such dictionaries involves a two-step process: block formation and dictionary update. Both steps are important in producing an effective dictionary. Existing works mostly assume that the block structure is known a priori while learning the dictionary. For finding an unknown block structure given a dictionary, sparse agglomerative clustering (SAC) is commonly used; it groups atoms based on the consistency of their sparse coding with respect to the unstructured dictionary. This paper explores two innovations towards improving both the reconstruction and the classification ability achieved with block-structured dictionaries. First, we propose a novel block structuring approach that makes use of the correlation among dictionary atoms. Unlike the SAC approach, which groups diverse atoms, the proposed approach forms blocks by grouping the most correlated atoms in the dictionary. The proposed block clustering approach yields a significant reduction in redundancy and provides direct control over the block size when compared with the existing SAC-based block structuring. Later, motivated by works that use a supervised, a priori known block structure, we also incorporate class information into the proposed block formation approach to further enhance the classification ability of the block dictionary. The reconstruction ability of the proposed innovations is assessed on synthetic data, while the classification ability is evaluated on a large-variability speaker verification task.


I Introduction

Learned-dictionary based sparse representation (SR) finds successful application in various signal processing domains, such as image denoising [1], image recognition [2, 3], face recognition [4, 5, 6], speaker identification/verification [7, 8, 9], and fingerprint identification [10]. In the SR domain, existing data-driven dictionary learning techniques can be broadly divided into three categories: supervised, semi-supervised, and unsupervised. Dictionaries learned utilizing class labels are referred to as supervised, whereas those learned using weak supervision in the form of an assumed structure/constraint are termed semi-supervised. Both these kinds of dictionaries produce more discriminative sparse codes than the unsupervised ones and thus result in better classification performance. In the SR domain, redundant (over-complete) dictionaries are usually preferred. Such dictionaries have more columns (atoms) than rows (data dimensionality). Sometimes, fewer examples than the data dimensionality are available for learning the dictionary, so only an under-complete dictionary can be learned unless the data are projected to an appropriate low-dimensional space. Nevertheless, the use of under-complete dictionaries has been reported in SR based classification tasks [9, 11].

In the recent past, dictionary learning has received a lot of attention in the SR domain. Combining K-means clustering and the singular value decomposition (SVD), a widely used dictionary learning approach referred to as KSVD has been proposed [12]. In learning the KSVD dictionary, the reconstruction error is minimized under a sparsity constraint for the given data. Though not optimized for producing discriminative sparse codes, some works have explored KSVD dictionaries in classification tasks [4, 9]. In the literature, a few classification-driven dictionaries have also been proposed, such as the supervised KSVD (S-KSVD) [13] and label-consistent KSVD (LC-KSVD) [14] dictionaries. The S-KSVD algorithm incorporates the Fisher discriminant criterion in dictionary learning, whereas in the LC-KSVD algorithm a linear transformation that maps the sparse code to a more discriminative one is learned along with the dictionary. Though yielding enhanced classification performance, these supervised dictionaries neither use any block structure nor explicitly minimize the within-class redundancy. As a result, such dictionaries are found to yield inconsistent sparse codes for data of the same class.
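To make the KSVD update concrete, the following is a minimal sketch of one KSVD pass in Python, assuming numpy and scikit-learn's `orthogonal_mp` for the sparse coding stage; the function name `ksvd_iteration` and the loop structure are our own illustration of the atom-by-atom SVD update, not the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import orthogonal_mp

def ksvd_iteration(Y, D, sparsity):
    """One KSVD pass: sparse-code Y over D (unit-norm atoms assumed),
    then update the atoms one at a time via a rank-1 SVD."""
    # Sparse coding stage: fixed number of non-zero coefficients per signal
    X = orthogonal_mp(D, Y, n_nonzero_coefs=sparsity)   # shape: (n_atoms, n_signals)
    for k in range(D.shape[1]):
        users = np.flatnonzero(X[k, :])                  # signals that use atom k
        if users.size == 0:
            continue
        X[k, users] = 0.0                                # remove atom k's contribution
        E_k = Y[:, users] - D @ X[:, users]              # residual without atom k
        U, s, Vt = np.linalg.svd(E_k, full_matrices=False)
        D[:, k] = U[:, 0]                                # rank-1 update of the atom
        X[k, users] = s[0] * Vt[0, :]                    # and of its coefficients
    return D, X
```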

In addition to supervised dictionaries, some block-structured dictionaries have also been proposed. The introduction of block structure in a dictionary is noted to enhance not only its reconstruction ability [15] but also its classification ability [16]. Initial works simply exploit known block structures in sparse coding with no emphasis on learning such dictionaries [17, 18, 19, 20]. The block-KSVD (BKSVD) [15] dictionary is probably the first attempt towards learning an unsupervised block-structured dictionary. It employs a sparse agglomerative clustering (SAC) algorithm for estimating the unknown block structure. Given a dictionary, the SAC algorithm estimates the block structure by iteratively grouping its atoms based on sparse coding. As the SAC employs orthogonal matching pursuit (OMP) [21] for sparse coding, the grouped atoms happen to be diverse (less correlated). Thus, if the given dictionary comprises correlated atoms, those are less likely to be grouped together in the SAC approach. This affects the classification performance due to inconsistency in sparse coding. Addressing the above-mentioned weakness in the estimated block structure is the prime motivation behind this work.

Further, in the context of image recognition, a supervised block-structured dictionary learning approach has been proposed that employs intra-block coherence suppression for reducing the redundancy; it is referred to as the IBCS [3] dictionary. The minimization of intra-block coherence in a dictionary is critical for consistency in the resulting sparse codes. In that work, the block structure is initialized in a supervised manner and is kept fixed during the dictionary learning. It would be interesting to explore the adaptation of the block structure while retaining the class supervision. Motivated by these works, we propose a classification-driven dictionary learning approach and contrast its performance with existing approaches on synthetic as well as real data. The main contributions of this work are as follows:

  • A novel block structuring algorithm is proposed that exploits the similarity among dictionary atoms rather than that among the resultant sparse codes.

  • The proposed block formation approach is shown to reduce the inter-block coherence as well as to provide more precise control over the block size, in contrast to the SAC algorithm.

  • Class supervision is used in block formation to enhance the classification performance achieved with the learned block-structured dictionary.

The remainder of the paper is organized as follows. First, the prior work on the dictionary learning using the block structure is discussed in Section II. The proposed correlation based greedy clustering algorithm is discussed in Section III. In Section IV, we formulate the classification driven dictionary learning approach. Section V and Section VI present the evaluation of the proposed approach on synthetic and real data, respectively. The paper is concluded in Section VII.

II Prior work on block-structured dictionary

The idea behind learning a block-structured dictionary is to exploit any structure embedded in the signals for producing a more efficient sparse representation. A variety of algorithms have been proposed in the literature for this purpose. In initial works [17, 19], it is assumed that the block structure of the dictionary is known a priori. Later, in [15], an unsupervised SAC algorithm was proposed for deriving the block structure from the data. The dictionary is learned using the BKSVD algorithm while iteratively updating both the block structure and the atoms of the dictionary. Learning a block-structured dictionary $\mathbf{D}$ along with its block structure $d$, having a maximum block size of $s$, can be formulated as

$\min_{\mathbf{D},\,\mathbf{X},\,d}\;\|\mathbf{Y}-\mathbf{D}\mathbf{X}\|_F^2$   (1)
such that $\|\mathbf{x}_i\|_{0,d} \le k \;\; \forall i$

where $\mathbf{D}$ is the dictionary having $K$ $n$-dimensional atoms, $\mathbf{Y}$ is the data matrix, $\mathbf{X}$ is the sparse code matrix with columns $\mathbf{x}_i$, $\|\cdot\|_F$ is the Frobenius norm, $\|\cdot\|_{0,d}$ is the $\ell_0$-norm over the block structure $d$ and counts the number of non-zero blocks, $d_j$ is the set of indices in the $j$th block with $|d_j| \le s$, $k$ is the chosen block sparsity, and $B$ is the number of blocks.
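As a small illustration of the block-sparsity constraint in (1), the mixed $\ell_0$ "norm" simply counts how many index groups of the block structure carry non-zero coefficients. The helper below is a hypothetical sketch of that counting, not code from [15].

```python
import numpy as np

def block_l0(x, blocks):
    """Count the non-zero blocks of a sparse code x for a block structure
    given as a list of index sets d_j (cf. ||x||_{0,d} in (1))."""
    return sum(1 for d_j in blocks if np.any(x[d_j] != 0))

# Example: with blocks = [[0, 1], [2, 3], [4]], the constraint ||x||_{0,d} <= k
# limits how many of these index groups may hold non-zero coefficients.
x = np.array([0.0, 0.7, 0.0, 0.0, -0.2])
print(block_l0(x, [[0, 1], [2, 3], [4]]))   # -> 2
```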

The dictionary update process in the KSVD and BKSVD algorithms is quite similar, except that the latter involves block-by-block updates. In contrast to the KSVD algorithm, the BKSVD algorithm requires about $s$ times fewer SVD computations, so its dictionary update is computationally significantly cheaper. The SVD ensures that intra-block atoms are orthonormal. This minimizes the redundancy in the dictionary as well as the inconsistency in sparse coding, hence improving the classification performance.

II-A Sparse Agglomerative Clustering

The SAC algorithm employs an iterative process for estimating the block structure from the sparse code matrix of the training data obtained using the OMP. The algorithm starts by considering each atom as a block. At each iteration, it merges two blocks based on the maximum intersection of the involved sparse codes while satisfying the constraint on the maximum block size. To illustrate this merging process, a toy dictionary has been created by taking three arbitrary atoms out of a larger dictionary. Further, arbitrarily selected training data vectors are sparse coded over that toy dictionary. In this illustration, as the dictionary has only three atoms, there are three possible ways to merge any two atoms into a block. On computing the intersections among the obtained sparse codes, the pair of atoms having the largest intersection is grouped together. Fig. 1 shows the block formation at the very first iteration of the SAC algorithm: the two atoms whose sparse code indices match the most form the first group.
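A simplified sketch of one such merge step, under our own naming and data layout (sparse code matrix X of shape atoms × training vectors, blocks as lists of atom indices), could look as follows; the original SAC [15] additionally iterates this until no admissible merge remains.

```python
import numpy as np

def sac_merge_step(X, blocks, max_block_size):
    """One SAC-style iteration: merge the two blocks whose sparse-code supports
    intersect the most, subject to the maximum block size."""
    # support[b] = set of training vectors that use any atom of block b
    support = [set(np.flatnonzero(np.any(X[blk, :] != 0, axis=0))) for blk in blocks]
    best, best_pair = -1, None
    for i in range(len(blocks)):
        for j in range(i + 1, len(blocks)):
            if len(blocks[i]) + len(blocks[j]) > max_block_size:
                continue
            overlap = len(support[i] & support[j])
            if overlap > best:
                best, best_pair = overlap, (i, j)
    if best_pair is None:
        return blocks                       # no admissible merge left
    i, j = best_pair
    merged = blocks[i] + blocks[j]
    return [b for k, b in enumerate(blocks) if k not in (i, j)] + [merged]
```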

Figure 1: Illustration of maximal-intersection based group formation in the SAC process. The sparse codes belonging to the three atoms of the toy dictionary for the randomly selected training vectors are shown in plots (a), (b) and (c), respectively.
Figure 2: Graphical display of the OMP based sparse coding for a dictionary having two correlated atoms.

In the existing SAC process, the OMP algorithm is employed for sparse coding of the data. The OMP, being a greedy iterative algorithm, selects at each step the one atom that is most correlated with the current residual, so the selected atoms are highly uncorrelated with each other. Thus, in the SAC process, the formed blocks contain diverse atoms rather than similar ones. Assume a dictionary happens to contain two or more moderately correlated atoms. While sparse coding the data over that dictionary, the OMP algorithm is expected to select any one of them based on the similarity. Fig. 2 graphically displays the OMP based sparse coding of a target vector over a dictionary having two correlated atoms. After selecting either of them, the current residual no longer lies in the directions of the correlated atoms; rather, it becomes correlated with some other atom in the dictionary. On account of that, there exists a finite possibility that those correlated atoms will appear in different blocks if the SAC process is followed. In a classification task, the existing SAC based block-structured dictionary may therefore produce sparse codes for same-class enrollment and test data that involve different blocks. This inconsistency in sparse coding leads to degradation in the classification performance. One can also employ other sparse coding schemes that, unlike the OMP, do not perform orthogonalization while selecting the atoms. The least angle regression (LARS) [22] algorithm is one such alternative, and we hypothesize that it should provide some improvement in the SAC. Following this argument, we modified the existing SAC process to use LARS-based sparse coding and created a new block-structured dictionary.

For assessing the quality of the block structure, we have computed the pairwise correlations among all atoms of the OMP-SAC-based and the LARS-SAC-based block dictionaries. Both dictionaries have been initialized with the same KSVD dictionary while learning. The numbers of atom pairs having a correlation value greater than a chosen threshold in these two dictionaries are plotted in Fig. 3. On comparing the number of atom pairs whose correlation exceeds the threshold, the LARS-SAC-based BKSVD dictionary is noted to exhibit significantly lower inter-block coherence than the OMP-SAC-based BKSVD dictionary. Later, in Section VI-B, we also show that the LARS-SAC-based BKSVD dictionary yields better SV performance than the existing OMP-SAC-based BKSVD dictionary. Motivated by the improved block structure quality with increased correlation among grouped atoms, we explore clustering the atoms into a block structure based on a correlation criterion rather than the intersection among indices of the sparse codes. In the next section, we describe the correlation based greedy clustering algorithm for producing the block-structured dictionary. Following that, we explore the inclusion of class information in block formation to produce a more discriminative block-structured dictionary.
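The coherence profile of Fig. 3 can be summarized, for instance, by counting atom pairs whose absolute correlation exceeds a threshold; the small helper below is our own sketch of such a measurement, not the authors' code.

```python
import numpy as np

def high_correlation_pairs(D, threshold):
    """Count dictionary atom pairs whose absolute correlation exceeds the threshold."""
    Dn = D / np.linalg.norm(D, axis=0, keepdims=True)   # unit-norm atoms
    C = np.abs(Dn.T @ Dn)                               # pairwise |correlation|
    iu = np.triu_indices_from(C, k=1)                   # upper triangle, no diagonal
    return int(np.count_nonzero(C[iu] > threshold))
```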

Figure 3: The profiles of pairwise correlations (sorted in descending order) among the dictionary atoms for different block-structured dictionaries. It can be seen that the use of LARS sparse coding in the SAC results in a significant reduction in mutual coherence among dictionary atoms.
Figure 4: The CGC block structuring algorithm: (a) flow chart, (b) an example explaining the involved steps for a small dictionary and a given maximum block size.

III Correlation based Greedy Clustering Algorithm

In this section, we propose an approach for determining the block structure of a given dictionary by exploiting the similarity among its atoms. As the clustering is done on the basis of the pairwise correlation among the dictionary atoms, we refer to this approach as the Correlation based Greedy Clustering (CGC) algorithm. The flow diagram of this algorithm is shown in Fig. 4(a) along with an example explaining the involved steps.

Initially, no block structure is assumed. From the given dictionary, the absolute correlations among all atom pairs are first computed and arranged in the form of a symmetric matrix. For storing the information about the pair indices, an indicator matrix is created. The first column of this matrix denotes the indices of dictionary atoms that are yet to be assigned to any block. Each of its rows, excluding the very first entry, simply lists the indices of all the atoms that are available for grouping. At each iteration, the CGC algorithm finds a group of the most correlated atoms of a predefined size (the block size). For this purpose, the cumulative sum of the largest correlation values in each row of the correlation matrix is computed and stored in a local vector. The indices of the selected top correlated atoms are stored in another local selection matrix. The group of atoms that results in the highest cumulative correlation sum forms a new block. After finding the new block, the block structure is updated first. Following that, the correlation and indicator matrices are updated by discarding the correlation and index information of all the atoms that have formed the block. These steps are repeated until all dictionary atoms have been assigned to a block. At each iteration only one block is formed, and when the maximum block size criterion can no longer be satisfied, the remaining atoms are grouped into a smaller block. For fast update, the order of the indices of the yet-to-be-grouped atoms in the indicator matrix should be kept the same as in the symmetric correlation matrix, as shown in Fig. 4(b).

In the CGC algorithm, with the formation of every new block, the pairwise correlation values among the unassigned atoms keep decreasing. As a result, the last few unassigned atoms exhibit very small correlations. Assigning these atoms to a block of the predefined (larger) size may affect the consistency of the sparse coding. To address this issue, we gradually reduce the maximum block size in steps of one once only a small fraction of the total atoms of the dictionary is left to be grouped.

The steps of the CGC algorithm are illustrated in Fig. 4(b) for a small arbitrary dictionary. For ease of illustration, a small maximum block size is chosen. At the beginning, the dictionary block structure is empty. In the first iteration, the group of atoms having the highest cumulative correlation is found, and those atoms form the first block. In the second iteration, the most correlated group among the remaining atoms forms the second block. In the third iteration, only one atom is left to be assigned, so it alone forms the third block, which completes the estimated block structure.
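Under the assumptions that atoms are compared via their absolute inner products after unit normalization and that the block-size shrinking threshold (left unspecified above) is a tunable fraction, a minimal Python sketch of the CGC steps is given below; the function `cgc_blocks` and the parameter `shrink_fraction` are our own naming, not the authors' code.

```python
import numpy as np

def cgc_blocks(D, max_block_size, shrink_fraction=0.1):
    """Correlation based Greedy Clustering (sketch): repeatedly group the set of
    atoms with the highest cumulative pairwise correlation into a new block."""
    Dn = D / np.linalg.norm(D, axis=0, keepdims=True)     # unit-norm atoms
    C = np.abs(Dn.T @ Dn)                                  # absolute pairwise correlations
    np.fill_diagonal(C, 0.0)
    remaining = list(range(D.shape[1]))                    # atoms not yet assigned
    blocks, size = [], max_block_size
    while remaining:
        # shrink the block size once only a small fraction of atoms is left ungrouped
        if len(remaining) <= shrink_fraction * D.shape[1] and size > 1:
            size -= 1
        size = min(size, len(remaining))
        sub = C[np.ix_(remaining, remaining)]
        if size > 1:
            # score every candidate seed atom by the sum of its (size-1) largest correlations
            scores = np.sort(sub, axis=1)[:, -(size - 1):].sum(axis=1)
            seed = int(np.argmax(scores))
            partners = np.argsort(sub[seed])[-(size - 1):]
            chosen = sorted(set([seed]) | set(partners.tolist()))
        else:
            chosen = [0]
        blocks.append([remaining[i] for i in chosen])
        remaining = [a for i, a in enumerate(remaining) if i not in chosen]
    return blocks
```

With the toy example of Fig. 4(b) in mind, `cgc_blocks` would first pick the most mutually correlated group, then the next among the remaining atoms, and finally assign any leftover atom to a block of its own.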

IV Classification Driven Block-Structured Dictionary

In addition to the SAC-BKSVD based dictionary, an intra-block coherence suppression (IBCS) [3] based block-structured dictionary has been proposed in the context of the image recognition task. Unlike the SAC-BKSVD approach, in IBCS dictionary learning the class labels are used in determining the blocks, and the so-obtained block structure is kept fixed during dictionary learning. On exploring it in the SV task, we found that the IBCS based block-structured dictionary outperforms both the SAC-BKSVD and CGC-BKSVD based dictionaries. Despite its unadapted block structure, the improved detection cost obtained for the IBCS dictionary highlights the impact of class supervision in block formation. Motivated by that, we propose a novel block structuring scheme that allows the grouping of atoms within a class only.

The proposed scheme is intended to enhance the ability of a dictionary to produce more discriminative sparse codes. The discriminative sparse codes should exhibit the following property

$\mathbf{X}_c[d_j,:] = \mathbf{0} \quad \forall\, d_j \notin d^{(c)}$   (2)

where $\mathbf{X}_c$ is the sparse coefficient matrix for the $c$th class training data $\mathbf{Y}_c$ and $d^{(c)}$ denotes the atoms in the block structure corresponding to the $c$th class. Therefore, in the ideal case, all non-zero coefficients for $\mathbf{Y}_c$ correspond to $d^{(c)}$ only.

Towards achieving this goal, we define an objective function for learning a discriminative dictionary as

$\min_{\mathbf{D},\,\mathbf{X},\,d}\;\sum_{c=1}^{C}\|\mathbf{Y}_c-\mathbf{D}_c\mathbf{X}_c\|_F^2 \;+\; \lambda_1\sum_{j=1}^{B}\big\|\mathbf{D}[j]^{\top}\mathbf{D}[j]-\mathbf{I}\big\|_F^2 \;+\; \lambda_2\sum_{c=1}^{C}\sum_{d_j\notin d^{(c)}}\big\|\mathbf{X}_c[d_j,:]\big\|_F^2$   (3)

where $\mathbf{Y}_c$ is the $c$th class training data, $\mathbf{D}_c$ is the $c$th class sub-dictionary, $B$ is the number of blocks in the dictionary, and $\mathbf{D}[j]$ and $d_j$ denote the $j$th block and its indices, respectively. In (3), the first term ensures good reconstruction ability, the second term reduces the intra-block redundancy, and the third term enhances the discrimination in the sparse codes to aid classification. The simultaneous optimization of all three constraints in (3) may not be feasible.

Here, we wish to highlight that the first two constraints in (3) can be optimized by invoking the existing BKSVD dictionary learning technique; in fact, it would ensure that all intra-block coherences are zero. The formation of blocks using either the SAC or the proposed CGC algorithm does not utilize the class information, so the atoms within the blocks may belong to two or more classes. As a result, the dominant coefficients in the sparse coding of training data belonging to different classes could involve the same set of blocks. With multi-class data being associated with a block, the updated dictionary loses the ability to produce discriminative sparse codes. To address this issue, we have explored the inclusion of class supervision in the block formation. In the following subsections, the details of supervised block structuring and the dictionary update using the well-known SVD approach are presented. The overall procedure for learning the proposed dictionary is given in Algorithm 1.

IV-A Block structure: initialization and update

For including class supervision in the block formation, the dictionary is first initialized by selecting a predefined number of examples from each of the classes. Let the class indices for such a dictionary be stored in a vector defined as

$\mathbf{l} = [\,\underbrace{1,\ldots,1}_{m_1},\;\underbrace{2,\ldots,2}_{m_2},\;\ldots,\;\underbrace{C,\ldots,C}_{m_C}\,]$   (4)

where $m_c$ is the number of examples in the $c$th class.

Now, for optimizing the block structure within each class, the proposed CGC algorithm is invoked in a constrained manner. To preserve the class supervision, appropriate constraints are introduced in the merging process of the CGC algorithm to allow the grouping of atoms from the same class only. A simple way to do this is to perform the block structuring separately for each class while indexing the blocks uniquely across the classes.
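A minimal sketch of this class-constrained variant, reusing the `cgc_blocks` helper sketched in Section III and assuming an `atom_labels` array holding the class index of every initial atom (our naming), is shown below.

```python
import numpy as np

def supervised_cgc_blocks(D, atom_labels, max_block_size):
    """Run CGC independently within each class and index the resulting
    blocks uniquely across classes (class-supervised block formation)."""
    blocks = []
    for c in np.unique(atom_labels):
        idx = np.flatnonzero(atom_labels == c)             # atoms belonging to class c
        for blk in cgc_blocks(D[:, idx], max_block_size):  # per-class CGC
            blocks.append([int(idx[i]) for i in blk])      # map back to global atom indices
    return blocks
```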

  Input: Training dataset with labels and the maximum number of dictionary atoms per class.
  Step 1. Obtain the initial dictionary and the corresponding class labels.
  for a fixed number of outer iterations do
    for a fixed number of inner iterations do
      Step 2. Compute the correlations among the dictionary atoms.
      Step 3. Find the block structure using either the CGC algorithm or the supervised CGC algorithm.
    end for
    Step 4. Compute the sparse coefficient matrix using the BOMP algorithm.
    Step 5. Update the block-structured dictionary.
  end for
  Output: Learned dictionaries and the corresponding block structures.
Algorithm 1 Procedure for learning the improved block-structured dictionary.

IV-B Dictionary Update

Given the block structure and the initial dictionary, the training data is sparse coded using the block-OMP (BOMP) algorithm [23]. For updating the dictionary, all training data vectors associated with each of the blocks in the resulting sparse codes are collected. Let $\Omega_{c,j}$ denote the list of indices of all those training data vectors that have a non-zero sparse coefficient for the $j$th block in the $c$th class. For updating the $j$th block in the $c$th class, the representation error excluding the contribution of that block is computed as

$\mathbf{E}_{c,j} \;=\; \mathbf{Y}_{\Omega_{c,j}} \;-\; \sum_{i=1,\, i\neq j}^{B_c} \mathbf{D}[d_i^{(c)}]\,\mathbf{X}_c[d_i^{(c)},\Omega_{c,j}]$   (5)

where $\mathbf{E}_{c,j}$ is the error matrix (i.e., the effective data) for the $j$th block in the $c$th class, $B_c$ is the number of blocks in the $c$th class, $d_j^{(c)}$ is the set of indices of the $j$th block in the $c$th class, and the remaining terms have the usual meaning. Now $\mathbf{E}_{c,j}$ is factorized into $\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{\top}$ using the SVD algorithm. The representation error is minimized by replacing the dictionary atoms and the selected sparse coefficients with the top-rank components obtained using the SVD as

$\mathbf{D}[d_j^{(c)}] = \mathbf{U}_{(1:s_j)}$   and   $\mathbf{X}_c[d_j^{(c)},\Omega_{c,j}] = \boldsymbol{\Sigma}_{(1:s_j)}\,\mathbf{V}_{(1:s_j)}^{\top}$   (6)

where $s_j$ is the size of the $j$th block.

Both the dictionary and its block structure are updated until convergence or for a predetermined number of iterations. Obviously, the atoms in each updated block happen to be orthonormal to each other; thus, one of the criteria laid out in (3) is met perfectly.
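A hedged numpy sketch of this block update, corresponding to (5)-(6) but with our own variable names (`Y` data, `D` dictionary, `X` sparse codes, `block` the atom indices of one block), is given below; the full procedure would loop this over all blocks and classes.

```python
import numpy as np

def update_block(Y, D, X, block):
    """SVD update of one block (cf. (5)-(6)): replace its atoms and the
    corresponding coefficients with the top singular components of the
    representation error computed without that block."""
    block = np.asarray(block)
    users = np.flatnonzero(np.any(X[block, :] != 0, axis=0))    # data using this block
    if users.size == 0:
        return D, X
    others = np.setdiff1d(np.arange(D.shape[1]), block)
    E = Y[:, users] - D[:, others] @ X[np.ix_(others, users)]   # error without this block
    U, s, Vt = np.linalg.svd(E, full_matrices=False)
    r = min(block.size, s.size)                                  # usable rank
    D[:, block[:r]] = U[:, :r]                                   # orthonormal block atoms
    X[np.ix_(block, users)] = 0.0
    X[np.ix_(block[:r], users)] = s[:r, None] * Vt[:r, :]        # top-rank coefficients
    return D, X
```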

Figure 5: Percentage of (exact) block recovery using the proposed CGC algorithm while varying the block size and average intra-block correlation.
Figure 6: Comparison of the reconstruction errors for synthetic-data-trained block dictionaries using the OMP-SAC and CGC algorithms under varying conditions: (a) number of iterations involved in learning the dictionary, (b) signal-to-noise ratio (SNR) of the training data, (c) block size in the learned block structure, and (d) number of blocks employed in generating the synthetic training data. The highlighted entries across the tables correspond to the usage of identical values of the studied parameters.

V Experiments on the synthetic data

In this section, we evaluate the proposed CGC algorithm for recovery of the underlying block structure and compare the reconstruction errors, on synthetic data, of the KSVD based block dictionaries learned using the OMP-SAC- and CGC-based block structuring algorithms. All experiments are repeated several times and the averaged performances are reported.

V-A Block recovery

For this study, we create block dictionaries from synthetically generated data. First, a randomly initialized matrix is created whose number of rows equals the data dimensionality and whose number of columns equals the desired number of blocks. For each column of this matrix, additional noisy clones are derived by adding random noise at varying scales. Collecting all these columns yields an initial dictionary having an oracle block structure $d^{*}$ in which each block contains a seed atom and its clones. For studying the effect of the degree of correlation among the initial dictionary atoms, separate dictionaries are created following the above-outlined procedure, with the average intra-block correlation being controlled by the noise added during cloning. For all such synthetically created initial dictionaries, the average inter-block correlation for the most correlated atom pairs is noted to lie in a limited low range.
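A sketch of this construction, with `noise_scale` standing in for the (unspecified) cloning noise level and all other names being our own, might look as follows.

```python
import numpy as np

def make_block_dictionary(n_dims, n_blocks, block_size, noise_scale, seed=None):
    """Synthetic dictionary with an oracle block structure: every block is one
    random seed atom plus noisy clones of it, all normalized to unit norm."""
    rng = np.random.default_rng(seed)
    atoms, oracle = [], []
    for b in range(n_blocks):
        base = rng.standard_normal(n_dims)
        base /= np.linalg.norm(base)
        block_atoms = [base]
        for _ in range(block_size - 1):
            clone = base + noise_scale * rng.standard_normal(n_dims)
            block_atoms.append(clone / np.linalg.norm(clone))
        atoms.extend(block_atoms)
        oracle.append(list(range(b * block_size, (b + 1) * block_size)))
    return np.column_stack(atoms), oracle
```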

Fig. 5 shows the block recovery performance of the CGC algorithm for varying intra-block correlation and block size. A block is considered recovered only when the estimated block indices are identical to those in the oracle block structure $d^{*}$. The CGC algorithm is noted to recover the underlying block structure perfectly when the average intra-block correlation is sufficiently high, regardless of the block size considered. When the average intra-block correlation drops below that level, the accuracy of block recovery falls sharply owing to the decreasing gap between intra- and inter-block correlations. One might wonder whether such high correlations exist in dictionaries created from real (speech) data; Fig. 3 has already shown that they do.

V-B Reconstruction performance

We now evaluate the CGC-BKSVD dictionary in terms of reconstruction performance under varying conditions while contrasting it with the existing OMP-SAC-BKSVD dictionary. The reconstruction performance for the synthetic data matrix $\mathbf{Y}$ over a learned dictionary $\mathbf{D}$ is computed through sparse coding with a fixed block sparsity and is defined in terms of the Frobenius-norm error $\|\mathbf{Y}-\mathbf{D}\mathbf{X}\|_F$ between the data and its block-sparse reconstruction.

For generating the data for dictionary learning, we first create a dictionary using the procedure outlined in Section V-A, with a fixed block size and a high average intra-block correlation. From this dictionary having a known block structure, a weighted sum of randomly selected blocks is computed to generate each synthetic data vector. Following this scheme, a set of data samples is derived for experimentation. For assessing the robustness of the dictionary learning approaches, noisy versions of the synthetic data are also created by adding white Gaussian noise at different signal-to-noise ratios (SNRs).
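A minimal sketch of this generation step, assuming the `make_block_dictionary` helper above and Gaussian block weights (the actual weighting scheme is not detailed in the text), is given below.

```python
import numpy as np

def generate_block_sparse_data(D, oracle_blocks, n_samples, blocks_per_sample, seed=None):
    """Each sample is a weighted sum of a few randomly selected oracle blocks."""
    rng = np.random.default_rng(seed)
    Y = np.zeros((D.shape[0], n_samples))
    for i in range(n_samples):
        chosen = rng.choice(len(oracle_blocks), size=blocks_per_sample, replace=False)
        for b in chosen:
            idx = oracle_blocks[b]
            Y[:, i] += D[:, idx] @ rng.standard_normal(len(idx))   # random block weights
    return Y

# White Gaussian noise at a chosen SNR can then be added to obtain the noisy variants.
```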

For learning both kinds of block-structured dictionaries (OMP-SAC- and CGC-based), the same KSVD-learned dictionary is used as initialization. Unlike the former, which iteratively updates both the dictionary and the block structure, in the latter only the dictionary is updated iteratively while keeping the block structure estimated in the first iteration fixed. Therefore, we study the effect of the number of iterations involved in dictionary learning for the two cases. The reconstruction errors for this study are tabulated in Fig. 6(a). The CGC approach not only converges faster but also outperforms the contrast approach.

Fig. 6(b) shows the impact of adding noise to the dictionary learning data for the OMP-SAC- and CGC-based block dictionaries. The table lists the reconstruction performance for the noiseless data with respect to dictionaries learned using noisy data. Note that the reconstruction error for the CGC case learned at a lower SNR matches that for the OMP-SAC case under noiseless data. Thus, the CGC approach also maintains its edge over the OMP-SAC approach under noisy data.

In the CGC-based approach, the initial dictionary is clustered based on the pairwise correlation among its atoms. Thus, we hypothesize that the tuning of the block size during the dictionary learning is less critical than that in the OMP-SAC-based approach. Fig. 6(c) lists the reconstruction errors for varying block size employed in dictionary learning and these results support our hypothesis.

In the synthetic data generation process discussed earlier, the variability of the generated data depends on the number of blocks selected from the oracle block dictionary. For assessing the modeling ability of the proposed approach, data sets of different variability are created by varying the number of blocks employed while generating the synthetic data. Separate dictionaries are learned using both approaches on those data sets, and the corresponding reconstruction errors are tabulated in Fig. 6(d). Even for higher-variability data, the CGC-based approach is noted to yield better reconstruction performance than the OMP-SAC-based approach.

VI Evaluation of Classification Performance

In this section, we evaluate the impact of the proposed innovations in the context of speaker verification (SV) task. The different SV systems developed in this work are evaluated on telephone condition test data sets in the NIST 2012 SRE [24]. Different sparse representation based SV (SR-SV) systems using unsupervised (KSVD, (OMP/LARS)-SAC-BKSVD) and supervised (IBCS and BKSVD) learned dictionaries are developed. The i-vector [25] Gaussian probabilistic linear discriminant analysis (GPLDA) [26] based SV system is also created for primary contrast.

VI-A Experimental Setup

The setup employed for the real data experiments is identical to that in our earlier work [9]; in the following, we briefly mention only the essential details. For a more detailed description of the database, performance measure, and signal processing, the reader is referred to [9]. The SRE12 speech data set contains a large number of female and male speakers. The test set contains telephone recorded speech utterances, from which a few million verification trials are created. The test data is partitioned into three subsets based on the environment and noise conditions. For the development of the SV systems, the speakers' utterances from the NIST SRE06, SRE08, and SRE10 data sets have been used, and the female and male i-vectors are derived from this development data. The speech data is analyzed to compute the commonly used mel frequency cepstral coefficients, which are then augmented with their delta and double-delta coefficients to form the final feature vectors. A Gaussian mixture based universal background model (UBM) [27] is used for gender-dependent modeling. For evaluating the performance of the different SV systems developed, the BOSARIS toolkit [28] has been used. The performance of the SV systems is measured using the detection cost function as per the NIST protocol [24], defined as the mean of the normalized detection costs corresponding to two low values of the target prior probability. Being evaluated at low false alarm rates, this measure suits high-security applications. The cosine distance scoring (CDS) measure is used to compute the scores for the SR-SV methods.
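For reference, cosine distance scoring between an enrollment representation and a test representation reduces to their normalized inner product; a trivial sketch (our own helper, not from the toolkit) follows.

```python
import numpy as np

def cds_score(enrol, test):
    """Cosine distance scoring: normalized inner product of two representations."""
    return float(enrol @ test / (np.linalg.norm(enrol) * np.linalg.norm(test)))
```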

For developing the SV systems, the telephone recorded development data is partitioned into two parts: dev-train and dev-test. The initial systems used for tuning the parameters are trained on the dev-train dataset and then evaluated on the dev-test dataset. A set of development trials is created from the dev-test data covering both female and male speakers. All SV systems explored in this work are modeled in a gender-dependent manner.

VI-A1 Factor Analysis based Modeling

Gender-dependent UBMs are learned using the telephone speech data derived from the female and male speakers. As the utterances in the development data are of varying duration, they are redistributed after voice activity detection so as to have a fixed average duration. In the i-vector based contrast SV system, the representational vectors are derived using the total variability matrix (T-matrix) learned on the telephone recorded data. The GPLDA modeling uses a reduced-dimensional speaker subspace and includes whitening and length normalization followed by projection onto the unit sphere. The GPLDA parameters are learned using pooled telephone and microphone recorded development data i-vectors. In the SR-SV systems, a gender-dependent joint factor analysis (JFA) [29] is employed for session/channel compensation of the Gaussian mixture model (GMM)-UBM mean supervectors. Following that, the speaker factors are derived; further details are available in our earlier work [30].

| Group | Dictionary type | Feature/Classifier | Block size | Class sup. | Block updation | Code | TC2 | TC4 | TC5 | Avg. | %EER (Avg.) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Contrast | T-matrix | i-vector/Bayes | – | – | – | S0 | 0.411 | 0.543 | 0.446 | 0.467 | 5.22 |
| Contrast | JFA-matrix | spk-factor/CDS | – | – | – | S1 | 0.523 | 0.634 | 0.552 | 0.570 | 5.72 |
| Contrast | KSVD | sparse-vector/CDS | – | – | – | S2 | 0.494 | 0.605 | 0.516 | 0.538 | 9.94 |
| Contrast | OMP-SAC-BKSVD | sparse-vector/CDS | variable | no | adapted | S3 | 0.447 | 0.527 | 0.427 | 0.467 | 14.40 |
| Contrast | LARS-SAC-BKSVD | sparse-vector/CDS | variable | no | adapted | S4 | 0.428 | 0.508 | 0.430 | 0.455 | 11.31 |
| Contrast | IBCS | sparse-vector/CDS | variable | yes | unadapted | S5 | 0.450 | 0.475 | 0.399 | 0.441 | 17.72 |
| Contrast | Sup. block-BKSVD | sparse-vector/CDS | variable | yes | unadapted | S6 | 0.438 | 0.497 | 0.408 | 0.447 | 13.22 |
| Proposed | CGC-BKSVD | sparse-vector/CDS | fixed | no | adapted | S7 | 0.424 | 0.511 | 0.410 | 0.448 | 12.23 |
| Proposed | CGC-BKSVD | sparse-vector/CDS | variable | no | adapted | S8 | 0.422 | 0.502 | 0.403 | 0.442 | 12.25 |
| Proposed | Sup. CGC-BKSVD | sparse-vector/CDS | variable | yes | adapted | S9 | 0.386 | 0.443 | 0.366 | 0.398 | 13.51 |
| Fusion of systems | – | – | – | – | – | S0+S5 | 0.362 | 0.434 | 0.364 | 0.387 | 4.34 |
| Fusion of systems | – | – | – | – | – | S0+S9 | 0.313 | 0.397 | 0.336 | 0.349 | 4.14 |

Table I: Performances of the proposed block-structured dictionary based SV systems and those of the contrast systems on the NIST SRE12 telephone recorded test data set. The TC2, TC4 and TC5 columns report the detection cost per test condition, Avg. their average, and %EER (Avg.) the averaged equal error rate.

VI-A2 Sparse Representation based Modeling

The gender-dependent KSVD dictionaries are randomly initialized by selecting development data utterances separately for the female and male cases, and a fixed number of KSVD learning iterations is performed. The unsupervised BKSVD dictionaries are initialized with the corresponding KSVD dictionaries, whereas each of the supervised dictionaries is initialized with class-specific KSVD-learned sub-dictionaries having a maximum number of atoms per class (speaker). All kinds of dictionaries are trained on the speaker factors pooled from both the telephone and the microphone development data. Learning of the IBCS and SAC-BKSVD based dictionaries usually requires a larger number of iterations, while the CGC-BKSVD and supervised CGC-BKSVD based dictionaries are noted to converge in fewer iterations. All the block-structured dictionaries are learned with fixed choices of the maximum block size and block sparsity. In all SR-SV systems, the sparse coding of the enrollment and test data is done using the BOMP algorithm, with suitably chosen sparsity values for the unsupervised and the supervised dictionaries.

Vi-B Results and Discussions

In this subsection, first the performances of the proposed block structuring approach and the class-supervised block-structured dictionary based SV systems are discussed. Following that, the robustness of the proposed SR-SV system to the maximum block size is evaluated. Finally, the results of fusing the proposed SR based and the i-vector based SV systems are presented.

VI-B1 SV performance evaluation

The system performances are primarily evaluated in terms of the detection cost, and the corresponding equal error rates (EERs) are reported only for reference purposes. The three conditions in the NIST SRE12 telephone test data set are referred to as TC2, TC4 and TC5. The performances of the different proposed and contrast SV systems are presented in Table I. On comparing the i-vector GPLDA (S0) and the KSVD dictionary (S2) based SV approaches, we note that the former significantly outperforms the latter. As the KSVD dictionary is learned without any supervision or block structure, a direct comparison between the S0 and S2 systems may not be fair. For this purpose, we also applied CDS directly on the speaker factors, and the resulting SV approach (S1 system) is found to be inferior to the S2 system in terms of the detection cost. In an earlier work, it has been shown that the block-structured KSVD dictionary outperforms the simple KSVD dictionary in the context of SR-SV [16]. The remaining performances given in Table I are discussed next in the context of the two enhancements proposed for learning the block-structured dictionary.

Modified block structuring approach: In Section II-A, it is shown that the LARS-SAC-based dictionary has reduced inter-block correlations and is thus expected to yield improved SR-SV performance. From Table I, we note that the LARS-SAC-BKSVD dictionary (S4 system) results in relative improvements in both the detection cost and the EER over the OMP-SAC-BKSVD dictionary (S3 system). With fixed-size block formation, the CGC-BKSVD dictionary (S7 system) is noted to yield a relative improvement in the detection cost when compared with both the S3 and S4 systems. Further, on allowing variable block sizes in the CGC-BKSVD dictionary (S8 system), an additional relative improvement in the detection cost is obtained.
Supervision in the block formation: In Table I, the S5 system refers to the evaluation of the recently proposed IBCS dictionary learning approach in the SV task. In IBCS dictionary learning, a class-supervised block structure is employed and the intra-block coherences are minimized using a gradient approach without updating the block structure. In contrast, all previously discussed block dictionaries are learned using the SVD along with updating the block structure. Therefore, for a direct contrast with the IBCS dictionary, it is interesting to explore the impact of employing a supervised block structure in SVD based block dictionary learning. For this purpose, we have learned a block dictionary using the BKSVD algorithm but with a class-supervised block structure that is not updated during learning. The resulting block dictionary based SV system is referred to as the S6 system. It can be seen from Table I that the S5 and S6 systems yield similar SV performances. These results demonstrate the impact of class supervision on the block structure in dictionary learning. The second proposal, combining class supervision with the CGC approach for deriving the block-structured dictionary, is referred to as the S9 system. It can be noted that the S9 system consistently outperforms the previously discussed systems on all three test sets in terms of the primary measure (detection cost).

Figure 7: Test condition averaged detection performance of the proposed and existing block-structured dictionary based SR-SV systems for varying maximum block size. The proposed (S9) system exhibits lower sensitivity than the existing ones.

VI-B2 Sensitivity to maximum block size

In the previous section, the SV performances of all block-structured dictionary based systems correspond to a fixed maximum block size. We have explored varying the maximum block size over a range in the context of three systems (S3, S4, and S9), and the corresponding test-condition-averaged detection costs are given in Fig. 7. Over the chosen block size range, the relative deviation in the averaged detection cost is found to be smallest for the S9 system. From these results, it can be inferred that the supervised CGC-BKSVD approach is more robust to variation in the block size. With greedy selection employed in the CGC algorithm and block update using the SVD, after a few iterations the intra-class atoms become nearly uncorrelated irrespective of the constraint on the block size. This could be the possible reason behind the low sensitivity exhibited by the proposed approach.

Figure 8: DET plots for salient SV systems developed in this work on the TC2 test condition (S0: i-vector, S5: IBCS, S9: supervised CGC-BKSVD). Also shown are the fusions of the SR-SV systems with the i-vector SV system.

VI-B3 Exploiting system diversity

The various SV systems explored in this work mainly differ in terms of the criteria employed in dictionary learning and scoring. More specifically, the i-vector based SV approach involves factor analysis for learning the T-matrix and GPLDA for scoring, whereas the SR-SV approaches involve cluster-wise eigen-decomposition for learning the dictionaries and use CDS for scoring. To highlight the complementary behavior of these approaches, the DET curves for a few salient systems are plotted in Fig. 8. For exploiting the diversity, logistic regression based score-level fusion of the i-vector system (S0) with the two best SR-SV systems (S5 and S9) is explored and also shown in Table I. The best performing fusion (S0+S9) is noted to provide relative improvements in terms of both the detection cost and the EER when compared with the individual best performances.

VII Conclusion

In this paper, a novel correlation based block formation approach, referred to as the CGC, is presented for learning a block-structured dictionary. The CGC-based block dictionary yields improved reconstruction and classification performances. In contrast to the existing SAC-based approach, the proposed one is noted to exhibit faster convergence, lower sensitivity to the block size, and more robustness to additive noise while learning the dictionary. For further enhancement of the classification ability, class information is included in the CGC. The resulting block-structured dictionary based SR-SV system provides a relative improvement in the detection cost over the best contrast SR-SV system employing an existing supervised block-structured dictionary. On fusing the best proposed SR-SV system with the state-of-the-art i-vector SV system, significant improvements in classification performance are noted in terms of both the detection cost and the equal error rate.

References

  • [1] M. Elad and M. Aharon, “Image denoising via learned dictionaries and sparse representation,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, 2006, pp. 895–900.
  • [2] S. Gao, I. W.-H. Tsang, and Y. Ma, “Learning category-specific dictionary and shared dictionary for fine-grained image categorization,” IEEE Transactions on Image Processing, vol. 23, no. 2, pp. 623–634, Feb. 2014.
  • [3] Y.-T. Chi, M. Ali, A. Rajwade, and J. Ho, “Block and group regularized sparse modeling for dictionary learning,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 377–382.
  • [4] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, “Robust face recognition via sparse representation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 2, pp. 210–227, Feb. 2009.
  • [5] V. M. Patel, T. Wu, S. Biswas, P. J. Phillips, and R. Chellappa, “Dictionary-based face recognition under variable lighting and pose,” IEEE Transactions on Information Forensics and Security, vol. 7, no. 3, pp. 954–965, Jun. 2012.
  • [6] Z. Dong, M. Pei, and Y. Jia, “Orthonormal dictionary learning and its application to face recognition,” Image and Vision Computing, vol. 51, pp. 13–21, Jul. 2016.
  • [7] I. Naseem, R. Togneri, and M. Bennamoun, “Sparse representation for speaker identification,” in Proc. IEEE International Conference on Pattern Recognition (ICPR), 2010, pp. 4460–4463.
  • [8] J. M. K. Kua, E. Ambikairajah, J. Epps, and R. Togneri, “Speaker verification using sparse representation classification,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011, pp. 4548–4551.
  • [9] Haris B.C. and R. Sinha, “Robust speaker verification with joint sparse coding over learned dictionaries,” IEEE Transactions on Information Forensics and Security, vol. 10, no. 10, pp. 2143–2157, Oct. 2015.
  • [10] M. Liu, X. Chen, and X. Wang, “Latent fingerprint enhancement via multi-scale patch based sparse representation,” IEEE Transactions on Information Forensics and Security, vol. 10, no. 1, pp. 6–15, Jan. 2015.
  • [11] O. P. Singh, Haris B.C., and R. Sinha, “Language identification using sparse representation: A comparison between GMM supervector and i-vector based approaches,” in Proc. Annual IEEE India Conference (INDICON), 2013, pp. 1–4.
  • [12] M. Aharon, M. Elad, and A. Bruckstein, “K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation,” IEEE Transactions on Signal Processing, vol. 54, no. 11, pp. 4311–4322, Nov. 2006.
  • [13] F. Rodriguez and G. Sapiro, “Sparse representations for image classification: Learning discriminative and reconstructive non-parametric dictionaries,” University of Minnesota, IMA Preprint 2213, Tech. Rep., Dec. 2007.
  • [14] Z. Jiang, Z. Lin, and L. Davis, “Label consistent K-SVD: Learning a discriminative dictionary for recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 11, pp. 2651–2664, Nov. 2013.
  • [15] L. Zelnik-Manor, K. Rosenblum, and Y. C. Eldar, “Dictionary optimization for block-sparse representations,” IEEE Transactions on Signal Processing, vol. 60, no. 5, pp. 2386–2395, May. 2012.
  • [16] G. Sreeram, Haris B.C., and R. Sinha, “Improved speaker verification using block sparse coding over joint speaker-channel learned dictionary,” in Proc. IEEE Region 10 Conference (TENCON), 2015, pp. 1–5.
  • [17] M. Stojnic, F. Parvaresh, and B. Hassibi, “On the reconstruction of block-sparse signals with an optimal number of measurements,” arXiv preprint arXiv:0804.0041, 2008.
  • [18] Y. C. Eldar and H. Bolcskei, “Block-sparsity: Coherence and efficient recovery,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2009, pp. 2885–2888.
  • [19] Y. C. Eldar and M. Mishali, “Robust recovery of signals from a structured union of subspaces,” IEEE Transactions on Information Theory, vol. 55, no. 11, pp. 5302–5316, Nov. 2009.
  • [20] Y. C. Eldar and H. Rauhut, “Average case analysis of multichannel sparse recovery using convex relaxation,” IEEE Transactions on Information Theory, vol. 56, no. 1, pp. 505–519, Jan. 2010.
  • [21] R. Rubinstein, M. Zibulevsky, and M. Elad, “Efficient implementation of the K-SVD algorithm using batch orthogonal matching pursuit,” Technion, TR-CS-2008-08, Tech. Rep., Apr. 2008.
  • [22] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, “Least angle regression,” The Annals of Statistics, vol. 32, no. 2, pp. 407–499, Apr. 2004.
  • [23] Y. C. Eldar, P. Kuppinger, and H. Bolcskei, “Block-sparse signals: Uncertainty relations and efficient recovery,” IEEE Transactions on Signal Processing, vol. 58, no. 6, pp. 3042–3054, Jun. 2010.
  • [24] The NIST Year 2012 Speaker Recognition Evaluation Plan, www.nist.gov/itl/iad/mig/upload/NIST SRE12 evalplan-v17-r1.pdf.
  • [25] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, May. 2011.
  • [26] D. Garcia-Romero and C. Y. Espy-Wilson, “Analysis of i-vector length normalization in speaker recognition systems,” in Proc. Interspeech, 2011, pp. 249–252.
  • [27] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, “Speaker verification using adapted Gaussian mixture models,” Digital Signal Processing, vol. 10, pp. 19–41, Jan. 2000.
  • [28] The BOSARIS toolkit, accessed on 10th Dec. 2013. [Online]. Available: www.sites.google.com/site/bosaristoolkit/
  • [29] P. Kenny, G. Boulianne, P. Ouellet, and P. Dumouchel, “Speaker and session variability in GMM-based speaker verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 4, pp. 1448–1460, May. 2007.
  • [30] N. Kumar and R. Sinha, “Class specificity and commonality based discriminative dictionary for speaker verification,” in Proc. National Conference on Communication (NCC), 2016, pp. 1–6.