Computing Similarity Queries for Correlated Gaussian Sources

01/22/2020
by   Hanwei Wu, et al.
KTH Royal Institute of Technology

In many current data processing systems, the objective is often not to reproduce the data but to compute answers to queries on the data. The similarity identification task is to identify the items in a database that are similar to a given query item under a given metric. The problem of compression for similarity identification has been studied in arXiv:1307.6609 [cs.IT]. Unlike classical compression problems, the focus is not on reconstructing the original data. Instead, the compression rate is determined by the desired reliability of the answers. Specifically, the identification rate is the information measure that characterizes the minimum rate achievable among all schemes that guarantee reliable answers with respect to a given similarity threshold. In this paper, we propose a component-based model for computing correlated similarity queries. The correlated signals are first decorrelated by the Karhunen-Loève transform (KLT). Then, each component of the decorrelated signal is processed by a distinct D-admissible system. We show that the component-based model equipped with the KLT can perfectly represent multivariate Gaussian similarity queries when the optimal rate-similarity allocation applies. Hence, we can derive the identification rate of multivariate Gaussian signals based on the component-based model. We then extend the result to general Gaussian sources with memory. We also study models equipped with practical component systems. As component systems we use TC-△ schemes, which use type covering signatures and triangle-inequality decision rules. We propose an iterative method to numerically approximate the minimum achievable rate of the TC-△ scheme. We show that our component-based model equipped with TC-△ schemes achieves better performance than the unaided TC-△ scheme on multivariate Gaussian sources.


I Introduction

Part of the content of Section III is submitted to 2018 IEEE Asilomar Conference on Signals, Systems, and Computers.

The problem of efficient identification and data retrieval from large databases has become more relevant in recent years. Similarity identification requires that a database return all data items that are similar to a given query under a similarity threshold specified by the problem. The notion of similarity is often defined by a specific distance measure, such as the Euclidean distance or the Hamming distance. False negative errors are not permitted in the retrieval process, as they cannot be detected by further processing. This is important for some applications, such as security cameras and criminal forensic databases. False positive errors, on the other hand, can be detected by further verification, but they increase the computational cost on the server side and hence reduce efficiency. Therefore, the tradeoff between the compression rate and the reliability of the answers to a given query is of interest.

The problem of similarity identification of compressed data was first studied in [1] from an information-theoretic viewpoint. In [1], both false positive and false negative errors are allowed, as long as the error probability vanishes with the data block-length. Our setting, however, is closely related to the problem of compression for similarity queries introduced in [9], [11]. In [9], [11], and this work, false negative errors are not permitted. The works [9], [11] study the problem from an information-theoretic viewpoint and introduce the term identification rate. It characterizes the minimum compression rate that guarantees query answers with a vanishing false positive probability, while false negative errors are not allowed. They also provide the identification rate for Gaussian sources with quadratic distortion and for binary sources with the Hamming distance. In [9], it is also proved that, as in classical compression, the Gaussian source requires the largest compression rate among sources with the same variance.

Since it is common to encounter correlated data in the real world, it is of interest to investigate similarity identification schemes for correlated sources. [15] uses lossy compression as a building block to construct the TC-△ (Type Covering signatures and triangle-inequality decision rule) scheme and the LC-△ (Lossy Compression signatures and triangle-inequality decision rule) scheme. The LC-△ scheme only optimizes the quantization distortion and can be realized by employing a rate-distortion code on the triangle-inequality principle. The TC-△ scheme improves on the LC-△ scheme by jointly optimizing the quantization distortion and the expected query-codeword distance. The results in [15] show that the compression rate of TC-△ can achieve the identification rate for binary sources with the Hamming distance.

In [16], the authors present a shape-gain quantizer for i.i.d. Gaussian sequences: scalar quantization is applied to the magnitude of the data vector, and the shape (the projection on the unit sphere) is quantized using a wrapped spherical code [7]. [18] proposes tree-structured vector quantizers that hierarchically cluster the data using $k$-center clustering. In [17], the authors compare two transform-based similarity identification schemes designed to cope with exponentially growing codebooks for high-dimensional data. One of the proposed schemes, the component-based approach, exhibits both good performance and low search complexity. However, the theoretical analysis of the component-based setting is still an open problem. Moreover, for correlated sources, no analytical results on the minimum achievable rates of the above schemes are provided.

In this paper, we first propose a component-based model for computing the identification rate for multivariate Gaussian sources. We use the Karhunen-Loève transform (KLT) to create independent $D$-admissible component systems. We show that the component-based model equipped with the KLT can perfectly represent multivariate Gaussian similarity queries when the optimal rate-similarity allocation applies. We then extend the result to the identification rate of general Gaussian sources with memory. To evaluate the performance of practical schemes, we replace the optimal component system with the state-of-the-art TC-△ schemes. We propose an iterative method, inspired by the Blahut–Arimoto (BA) algorithm [3], [2] and related algorithms [13], to numerically approximate the minimum achievable rate of TC-△ schemes. The simulation shows that our component-based model equipped with TC-△ schemes performs better than the unaided TC-△ scheme on multivariate Gaussian sources.

The outline of this paper is as follows. In Section II, we give a brief description of the problem's background and key concepts. (We follow the problem setup and adopt most notations in [9] and [11]; we therefore refer to [9] and [11] for a more detailed background and problem description.) In Section III, we propose our component-based model for computing the identification rate of multivariate Gaussian sources. In Section IV, we extend the identification rate result to general Gaussian sources with memory. In Section V, we propose an iterative method to approximate the minimum achievable rate of the TC-△ scheme. Then we compare the TC-△ scheme with the component-based scheme for i.i.d. and multivariate Gaussian sources. The conclusions are given in the final section.

The notational conventions in this work are as follows. Uppercase nonboldface symbols such as $X$ are used to denote random variables, and lowercase nonboldface symbols such as $x$ are used to denote sample values of those random variables. Vectors and matrices of random variables or their sample values are denoted by boldface symbols. For example, $\mathbf{X}$ and $\mathbf{x}$ are vectors (or, depending on the context, matrices) of random variables $X$ and their sample values $x$, respectively. The $i$th entry of a vector $\mathbf{x}$ is denoted by $x_i$.

II Quadratic Similarity Queries

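Let $\mathbf{y} \in \mathbb{R}^n$ denote the query sequence and $\mathbf{x} \in \mathbb{R}^n$ the data sequence. A rate-$R$ identification scheme consists of a signature assignment function $T: \mathbb{R}^n \to \{1, \ldots, 2^{nR}\}$ and a query function $g: \{1, \ldots, 2^{nR}\} \times \mathbb{R}^n \to \{\text{maybe}, \text{no}\}$. The database keeps only a short signature $T(\mathbf{x})$ for each $\mathbf{x}$. The output of the query function, maybe or no, indicates whether $\mathbf{x}$ and $\mathbf{y}$ are possibly $D$-similar or not. The sequences $\mathbf{x}$ and $\mathbf{y}$ are called $D$-similar if $d(\mathbf{x}, \mathbf{y}) \le D$, where we restrict our consideration to additive distortion measures

$$d(\mathbf{x}, \mathbf{y}) = \frac{1}{n} \sum_{i=1}^{n} \rho(x_i, y_i), \tag{1}$$

where $\rho(\cdot, \cdot)$ is an arbitrary per-letter distance measure specified by the problem, and $D$ is the similarity threshold. Specifically, the quadratic similarity is

$$d(\mathbf{x}, \mathbf{y}) = \frac{1}{n} \|\mathbf{x} - \mathbf{y}\|^2, \tag{2}$$

where $\|\cdot\|$ is the standard Euclidean norm.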
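As a small illustration of the quadratic similarity test, the following sketch checks whether a data/query pair is similar under a given threshold. It assumes the distance is normalized by the sequence length, as in the setup of [9]; the function names `quadratic_distance` and `is_d_similar` and the example vectors are hypothetical.

```python
import numpy as np

def quadratic_distance(x, y):
    """Squared Euclidean distance normalized by the sequence length (assumed normalization)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sum((x - y) ** 2) / x.size)

def is_d_similar(x, y, threshold):
    """Return True when the pair lies within the similarity threshold."""
    return quadratic_distance(x, y) <= threshold

# Hypothetical data and query sequences of length 4.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.0, 3.0, 6.0])
distance = quadratic_distance(x, y)  # only the last entry differs, by 2
```

A scheme that never misses such a pair, whatever signature it stores, is what the text calls admissible for the given threshold.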

A similarity query retrieves all data items that are $D$-similar to the query sequence. A scheme is called $D$-admissible if it outputs maybe for every pair of data item and query that is $D$-similar.

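Now, consider a probabilistic model for database and query. The objective is to design $D$-admissible schemes that minimize the probability of the output maybe for given distributions of the database vectors $\mathbf{X}$ and the query vectors $\mathbf{Y}$. According to [9], for a $D$-admissible scheme, this probability is calculated as

$$\begin{aligned}
\Pr\{g(T(\mathbf{X}), \mathbf{Y}) = \text{maybe}\} &= \Pr\{\text{maybe} \mid d(\mathbf{X}, \mathbf{Y}) \le D\}\Pr\{d(\mathbf{X}, \mathbf{Y}) \le D\} + \Pr\{\text{maybe} \mid d(\mathbf{X}, \mathbf{Y}) > D\}\Pr\{d(\mathbf{X}, \mathbf{Y}) > D\} \\
&= \Pr\{d(\mathbf{X}, \mathbf{Y}) \le D\} + \Pr\{\text{maybe} \mid d(\mathbf{X}, \mathbf{Y}) > D\}\Pr\{d(\mathbf{X}, \mathbf{Y}) > D\},
\end{aligned} \tag{3}$$

where the second equality follows from the requirement of $D$-admissibility. The terms $\Pr\{d(\mathbf{X}, \mathbf{Y}) \le D\}$ and $\Pr\{d(\mathbf{X}, \mathbf{Y}) > D\}$ do not depend on the scheme. Hence, minimizing (3) is equivalent to minimizing the probability of false positives $\Pr\{\text{maybe} \mid d(\mathbf{X}, \mathbf{Y}) > D\}$. That is, the probability of the output maybe can be used as a performance measure for the investigated schemes. In the following, we use the abbreviation $\Pr\{\text{maybe}\}$ for the probability that a scheme outputs maybe.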

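For given distributions $P_X$ and $P_Y$ and a similarity threshold $D$, a rate $R$ is said to be $D$-achievable if there exists a sequence of rate-$R$, $D$-admissible schemes $(T^{(n)}, g^{(n)})$ that achieves a vanishing $\Pr\{\text{maybe}\}$ as $n$ approaches infinity:

$$\lim_{n \to \infty} \Pr\{g^{(n)}(T^{(n)}(\mathbf{X}), \mathbf{Y}) = \text{maybe}\} = 0. \tag{4}$$

The identification rate of the source is defined as the infimum of all $D$-achievable rates.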

III Identification Rate of Multivariate Gaussian Sources

A. A Component-based Model

We propose a component-based model to compute the identification rate for multivariate Gaussian sources. The idea is that the input, which consists of $N$th order multivariate Gaussian signals, is first decorrelated into $N$ components by the Karhunen-Loève transform (KLT) for further processing. After the transform, we use a $D_i$-admissible system for the $i$th component, and together these systems form an $N$-component system. The $i$th component system answers maybe if the transformed $i$th query-database pair is $D_i$-similar.

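We consider an $N$th order zero-mean stationary Gaussian process

$$\mathbf{X} \sim \mathcal{N}(\mathbf{0}, \mathbf{C}_N), \tag{5}$$

where $\mathbf{X}$ is a vector of $N$ consecutive samples and $\mathbf{C}_N$ is the $N$th order autocovariance matrix. Since $\mathbf{C}_N$ is a real symmetric matrix, it has the eigendecomposition

$$\mathbf{C}_N = \mathbf{A}_N \mathbf{\Lambda}_N \mathbf{A}_N^{T}, \tag{6}$$

where $\mathbf{A}_N$ is the matrix whose columns are the eigenvectors of $\mathbf{C}_N$, and $\mathbf{\Lambda}_N$ is a diagonal matrix with the eigenvalues $\lambda_1, \ldots, \lambda_N$ as diagonal entries. The source can be decorrelated by the transform $\mathbf{A}_N^{T}$, that is,

$$E\big[(\mathbf{A}_N^{T}\mathbf{X})(\mathbf{A}_N^{T}\mathbf{X})^{T}\big] = \mathbf{A}_N^{T}\mathbf{C}_N\mathbf{A}_N = \mathbf{\Lambda}_N. \tag{7}$$

The component-based model applies this transform to the input signal $\mathbf{x}$, yielding the transformed signal $\mathbf{A}_N^{T}\mathbf{x}$. The transpose of the eigenmatrix is thus the Karhunen-Loève transform (KLT).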

Consider an optimal identification system for multivariate Gaussian sources, that is, a system that achieves the identification rate of multivariate Gaussian sources. In the next two subsections, we derive conditions that preserve the $D$-admissibility and $D$-achievability of the optimal identification system for the $N$-component model.

B. $D$-admissible Condition

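Maintaining $D$-admissibility after the transform requires that the similarity measure of the original domain be preserved in the transform domain. Since the eigenmatrix $\mathbf{A}_N$ is an orthogonal matrix, the KLT is an orthogonal transform and preserves the quadratic distance:

$$\begin{aligned}
\|\mathbf{A}_N^{T}\mathbf{x} - \mathbf{A}_N^{T}\mathbf{y}\|^2 &= \|\mathbf{A}_N^{T}(\mathbf{x} - \mathbf{y})\|^2 && (8) \\
&= \big(\mathbf{A}_N^{T}(\mathbf{x} - \mathbf{y})\big)^{T}\mathbf{A}_N^{T}(\mathbf{x} - \mathbf{y}) && (9) \\
&= (\mathbf{x} - \mathbf{y})^{T}\mathbf{A}_N\mathbf{A}_N^{T}(\mathbf{x} - \mathbf{y}) && (10) \\
&= (\mathbf{x} - \mathbf{y})^{T}(\mathbf{x} - \mathbf{y}) && (11) \\
&= \|\mathbf{x} - \mathbf{y}\|^2. && (12)
\end{aligned}$$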
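The decorrelation and distance-preservation properties of the KLT can be checked numerically. The sketch below is only an illustration under stated assumptions: the autocovariance matrix with entries $\rho^{|i-j|}$, the order, the sample size, and all variable names are hypothetical choices that do not come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 4th-order autocovariance matrix with entries rho^|i-j|.
N, rho = 4, 0.8
idx = np.arange(N)
C = rho ** np.abs(idx[:, None] - idx[None, :])

# Columns of A are orthonormal eigenvectors of C; the KLT is A^T.
eigvals, A = np.linalg.eigh(C)
klt = A.T

# Decorrelation: A^T C A is diagonal with the eigenvalues on the diagonal.
transformed_cov = klt @ C @ klt.T

# Empirical check: transformed samples have nearly zero cross-covariance.
samples = rng.multivariate_normal(np.zeros(N), C, size=50_000)
empirical_cov = np.cov(samples @ A, rowvar=False)  # each row of samples @ A is (A^T x)^T

# Distance preservation for an arbitrary data/query pair.
x, y = rng.standard_normal(N), rng.standard_normal(N)
dist_original = np.sum((x - y) ** 2)
dist_transformed = np.sum((klt @ x - klt @ y) ** 2)
```

Because the squared distance is unchanged, a pair that is similar in the original domain remains similar after the transform, which is what the admissibility argument relies on.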

In order to preserve the $D$-admissibility of the optimal system, the overall similarity threshold of an $N$th order component-based model should be at least as large as the given similarity threshold $D$:

(13)

C. $D$-achievable Condition

Each component has its own identification system. Lemma 1 gives the condition for achieving a vanishing $\Pr\{\text{maybe}\}$ of the component-based model for multivariate Gaussian signals.

Lemma 1.

Consider a data sequence and a query sequence, both being concatenations of independent blocks of zero-mean $N$th order multivariate Gaussian random variables with blocklength $n$ for $D$-similarity identification, where $nN$ is the length of the whole sequence. We have a vanishing $\Pr\{\text{maybe}\}$ for the overall system

(14)

if and only if

(15)
Proof.

As shown in [9], $\Pr\{\text{maybe}\}$ can be bounded from above by

(16)

where the sets in the bound are the typical spheres of the data and query sequences. Since the remaining terms vanish with the blocklength, we focus on the first term of (16).

Recall that the input multivariate Gaussian signals are first decorrelated by the KLT. Furthermore, uncorrelatedness of jointly Gaussian random variables implies independence; thus, the transform that decorrelates the multivariate Gaussian sources also creates independent components. Due to the independence of the components, we can write

(17)
(18)
(19)
(20)
(21)

where the set appearing in these expressions is the set of vectors that have the same signature. Step (19) follows from the fact that the quadratic distance is an additive distortion measure and from the $D$-admissible condition (13). Step (21) follows because the joint probability of independent events equals the product of their probabilities.

Therefore, the $\Pr\{\text{maybe}\}$ of the overall system is proportional to the product of the components' $\Pr\{\text{maybe}\}$:

(22)

Letting the blocklength go to infinity, the overall length of the sequence also tends to infinity. In order to have a vanishing $\Pr\{\text{maybe}\}$ for the overall system, there must exist at least one component whose $\Pr\{\text{maybe}\}$ goes to zero. ∎

Owing to the product property in (22), a database vector is labeled maybe if and only if all of its component systems output maybe. Therefore, the final output of the component-based model is obtained by a logical AND of the component decisions. Fig. 1 shows the proposed component-based model.

Figure 1: Component-based model for similarity identification.
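The AND combination of component decisions can be sketched in a few lines. The function name `combine_component_answers` and the string labels are hypothetical; the sketch only illustrates the decision logic and says nothing about how each component system produces its answer.

```python
def combine_component_answers(component_answers):
    """Return "maybe" only if every component system answered "maybe"."""
    return "maybe" if all(answer == "maybe" for answer in component_answers) else "no"

# Hypothetical answers from a 3-component system for two database items.
item_a = combine_component_answers(["maybe", "maybe", "maybe"])
item_b = combine_component_answers(["maybe", "no", "maybe"])
```

A single component answering no is enough to discard the item, which is why one component with a vanishing probability of maybe suffices to make the overall probability vanish.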

D. Identification Rate

We define the identification rate of a multivariate Gaussian source as the infimum of all $D$-achievable rates. In the previous subsections, we showed that the KLT can be used to create independent component systems, and we derived the $D$-admissible and $D$-achievable conditions for the component-based model. In the proof of Theorem 1, we formulate the problem of computing the identification rate as a rate-similarity optimization problem under these conditions. We show that, under the optimal rate-similarity allocation, the $D$-admissible and $D$-achievable conditions imposed on the component-based model become equivalent to the corresponding conditions of the optimal system. As a result, we can conclude that the identification rate can be achieved by the component-based model with optimal rate-similarity allocation. Note that we consider the case in which the query and the database follow the same multivariate Gaussian distribution, so that the query can be decorrelated by the same KLT as the database.

Theorem 1.

The identification rate of $N$th order multivariate Gaussian sources (5) is

(23)
(24)

with . is the th eigenvalue of the autocovariance matrix and is its largest eigenvalue.

The identification rate approaches infinity when the similarity threshold is

(25)
Proof.

Since the KLT is an invertible transform, the mutual information is preserved in the transform domain [12]. Thus, the achievable rate required by the component-based model is identical to the rate for the original signal. In addition, since the components are independent of each other, the $N$th order mutual information between the input data and its signature is the sum of the mutual information between signal and signature over all components:

(26)

where denotes the th order mutual information.

From the $D$-achievable condition of the component-based model, we know that there must exist at least one component with a vanishing $\Pr\{\text{maybe}\}$. Furthermore, we are interested in computing the minimum achievable rate. Hence, we assume that each component uses an ideal $D_i$-admissible scheme, so that each component system operates on its identification rate curve. We can then define the achievable rate of the component-based model as the average of the component rates, i.e.,

(27)

For a given similarity threshold $D$ of the original signal, the infimum of all achievable rates of the component-based model can be obtained by solving the following constrained optimization problem:

(28)

The first inequality constraint follows from the $D$-admissible condition (13) for the component-based model. The second inequality constraint follows from the nonnegativity of the component similarity thresholds.

Note, all identification rate functions of the components are convex and strictly increasing. Hence, we consider the equivalent problem

(29)

where is a positive Lagrangian multiplier.

It is shown in [11] that the identification rate for i.i.d. Gaussian sources is

(30)

Since the variance of the component is equal to the eigenvalue of the th order autocovariance matrix , when , the derivative of the cost function with respect to can be expressed as:

(31)

By setting (31) to zero, we obtain that is determined by the eigenvalue and the value of , i.e.,

(32)

Let be expressed as , . In order to satisfy the non-negative constraint of , each component will only be assigned with a positive similarity threshold when the value of is smaller than its corresponding eigenvalue ,

(33)

In order to have at least one component assigned with positive similarity threshold, we set the largest as .

By substituting (33) into (30), we obtain the identification rate for the $i$th component. Then, the infimum of the achievable rate for the component-based model is

(34)

and the corresponding similarity threshold of the component-based model is

(35)

The optimal rate-similarity curve of the component-based model can be obtained by sweeping over permitted values of .

According to the Karush–Kuhn–Tucker conditions, the optimal point lies on the constraint surface. Therefore, the inequality constraint in (28) holds with equality at the optimal point:

(36)

On the other hand, the optimal condition also achieves equality in (22); hence, the component-based model preserves the original $\Pr\{\text{maybe}\}$. Therefore, the imposed component-based model preserves the characteristics of the original system under the optimal rate-similarity allocation. We conclude that the optimal rate-similarity functions (34), (35) derived from the component-based model are identical to the identification rate function of the multivariate Gaussian sources.

For the limit case of , the model’s identification rate (34) approaches infinity and the model’s similarity threshold is

(37)

where each component similarity threshold is . If the given model’s similarity threshold is larger than , there must exist components that have similarity thresholds larger than . Therefore, according to (30), the overall identification rate approaches infinity. ∎

Theorem 1 shows that the optimal identification rate can be achieved by activating the components according to their variances after the KLT. At the lowest rate, only the component with the largest variance is activated (assigned a positive similarity threshold), so the component model uses only one component. As the rate increases, the remaining components are activated in decreasing order of their variances. The activated components operate according to the Pareto condition.
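The activation order described above can be illustrated with a small numerical sketch. It rests on several assumptions that are not confirmed by the extracted text: the i.i.d. Gaussian identification rate is taken to be $\log_2\big(2\sigma^2/(2\sigma^2 - D)\big)$, the equal-slope allocation is written with a hypothetical water-level parameter `theta`, and both the rate and the threshold are averaged over the components. Under these assumptions, component $i$ receives the threshold $\max(0, 2(\lambda_i - \theta))$ and the rate $\max(0, \log_2(\lambda_i/\theta))$.

```python
import numpy as np

def component_allocation(eigvals, theta):
    """Per-component thresholds and rates for a hypothetical water level theta."""
    lam = np.asarray(eigvals, dtype=float)
    # A component is active only when its variance exceeds the water level.
    thresholds = np.maximum(0.0, 2.0 * (lam - theta))
    # Assumed i.i.d. rate log2(2*lam / (2*lam - D_i)), which reduces to log2(lam / theta).
    rates = np.maximum(0.0, np.log2(lam / theta))
    return thresholds, rates

def rate_similarity_point(eigvals, theta):
    """Average threshold and rate over the components (assumed normalization)."""
    thresholds, rates = component_allocation(eigvals, theta)
    return thresholds.mean(), rates.mean()

# Hypothetical eigenvalues: at theta = 1 only the first component is active.
eigvals = [4.0, 1.0, 0.25]
thresholds, rates = component_allocation(eigvals, theta=1.0)
avg_threshold, avg_rate = rate_similarity_point(eigvals, theta=1.0)
```

Lowering `theta` activates further components in decreasing order of variance, and as `theta` approaches zero the average threshold approaches twice the average eigenvalue while the rate grows without bound.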

Similar to i.i.d. Gaussian sources, multivariate Gaussian sources also have a similarity threshold limit below which systems can achieve a vanishing $\Pr\{\text{maybe}\}$. The similarity threshold limit for multivariate Gaussian sources is twice the trace of the covariance matrix. If a system is given a similarity threshold larger than the similarity threshold limit of the processed source, the query and database are inherently similar. Hence, $\Pr\{\text{maybe}\}$ can never vanish, regardless of which system is used.

IV Identification Rate of Gaussian Sources with Memory

In this section, we extend the identification rate result of multivariate Gaussian sources to general Gaussian sources with memory. The proof of Theorem 1 shows that high-dimensional multivariate Gaussian similarity queries can be perfectly represented by the optimal component-based model. Based on this, we study general Gaussian sources with memory using the optimal component-based model and apply Szegő's theorem for sequences of Toeplitz matrices [6] in the limiting case. Theorem 2 gives the identification rate of the zero-mean stationary Gaussian process.

Theorem 2.

The identification rate function for a zero-mean stationary Gaussian process with memory is

(38)
(39)

with , where is the power spectral density of the source and is the essential supremum of .

The identification rate approaches infinity when the similarity threshold is

(40)
Proof.

Given the stationary Gaussian source, we can decompose the source into vectors of $N$ successive random variables and describe those vectors with an $N$th order multivariate Gaussian distribution (5). We can then apply the KLT, that is, the transpose of the eigenmatrix of the $N$th order covariance matrix, to the decomposed vectors. The resulting decorrelated source is given by the concatenation of the transformed random vectors.

After the KLT, the signal is processed by the independent $D_i$-admissible component systems. The identification rate of $N$th order multivariate Gaussian sources is known from Theorem 1. Therefore, we can obtain the identification rate of the stationary Gaussian random process by taking the limit $N \to \infty$:

(41)

The autocovariance matrices of stationary processes are Toeplitz matrices , where is the power spectral density of the source defined by the Fourier series of the elements on the th diagonal of

(42)

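If the essential supremum and essential infimum of the power spectral density $\Phi(\omega)$ are finite, Szegő's theorem for sequences of Toeplitz matrices [6] states that

$$\lim_{N \to \infty} \frac{1}{N} \sum_{i=1}^{N} g\big(\lambda_i^{(N)}\big) = \frac{1}{2\pi} \int_{-\pi}^{\pi} g\big(\Phi(\omega)\big)\, d\omega \tag{43}$$

for any function $g$ that is continuous on the range of $\Phi(\omega)$, where $\lambda_i^{(N)}$ denotes the $i$th eigenvalue of the $N$th order autocovariance matrix. Therefore, when $N \to \infty$, we can apply (43) to the identification rate function of multivariate Gaussian sources (34), (35), and obtain (38), (39).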

In addition, according to the Lemma 4.1 of [5], the eigenvalues of Toeplitz matrix are bounded by the essential infimum and essential supremum of . We denote the essential supremum of as . Hence, we can rewrite the permitted values of as .

The limiting case (40) is obtained by an argument similar to the one used in the proof of Theorem 1. ∎

The identification rate function follows a "reverse water-filling" process similar to that of the rate-distortion function of Gaussian sources with memory [4]. As the water level starts decreasing from its maximum, the rate is first allotted to the frequencies with the largest altitudes. As the water level decreases further, rate is also allotted to frequencies with lower altitudes. The difference is that the distortion is calculated as the integral of the minimum of the frequency altitude and the water level, while the similarity threshold is calculated as the integral of the differences between the frequency altitude and the water level. An example of the reverse water-filling process is shown in Figure 2.

Figure 2: Reverse water-filling for multivariate Gaussian sources.

The similarity threshold limit for the $D$-achievable rate of Gaussian sources with memory is twice the power of the given signal (40). That is, if the given similarity threshold is larger than twice the signal power, the two signals are inherently similar, and no system can achieve a vanishing $\Pr\{\text{maybe}\}$.

In the following example, we plot the identification rate curves for Gauss-Markov processes with different correlation coefficients.

Example 1.

We consider zero-mean Gauss-Markov processes with unit variance. The power spectral density of the Gauss-Markov process is

(44)

The largest value of is obtained when . We plot the identification rate function for Gauss-Markov processes with , and respectively. The integral in (38) and (39) can be approximated by the Riemann sum. We also plot the identification rate of i.i.d. Gaussian sources for reference. It is overlapped with the case as expected.

We can also observe that the identification rates approach infinity when the similarity thresholds reach their corresponding limits. Since more correlated signals have higher signal power, their corresponding similarity threshold limits are also larger.

Figure 3: Comparison of identification rates for Gauss-Markov processes with different correlation coefficients.
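A curve of the kind shown in Figure 3 can be approximated with a Riemann sum. The sketch below is not the authors' code and makes several assumptions: the Gauss-Markov power spectral density is normalized to unit process variance, the rate and threshold are parameterized by a hypothetical water level `theta` as $\max(0, \log_2(\Phi/\theta))$ and $\max(0, 2(\Phi - \theta))$ averaged over frequency, and the i.i.d. Gaussian rate is assumed to be $\log_2\big(2\sigma^2/(2\sigma^2 - D)\big)$.

```python
import numpy as np

def gauss_markov_psd(omega, rho):
    """Assumed PSD of a unit-variance Gauss-Markov process with coefficient rho."""
    return (1.0 - rho**2) / (1.0 - 2.0 * rho * np.cos(omega) + rho**2)

def rate_similarity_curve(rho, num_levels=50, num_freqs=4096):
    """Sweep the hypothetical water level and return (thresholds, rates)."""
    omega = -np.pi + 2.0 * np.pi * np.arange(num_freqs) / num_freqs
    psd = gauss_markov_psd(omega, rho)
    levels = np.geomspace(psd.max(), 1e-3, num_levels)
    # Averages over the uniform frequency grid approximate (1/2pi) * integral.
    thresholds = np.array([np.mean(np.maximum(0.0, 2.0 * (psd - t))) for t in levels])
    rates = np.array([np.mean(np.maximum(0.0, np.log2(psd / t))) for t in levels])
    return thresholds, rates

# With rho = 0 the spectrum is flat, so the curve should match the i.i.d. case.
iid_thresholds, iid_rates = rate_similarity_curve(rho=0.0)
corr_thresholds, corr_rates = rate_similarity_curve(rho=0.9)
```

Under these assumptions the uncorrelated case reproduces the assumed i.i.d. curve exactly, which matches the overlap reported in Example 1.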

V Component-based Model with Practical Schemes

Theorem 1 is derived on the premise that each component uses an optimal $D$-admissible system. However, the optimal $D$-admissible system is difficult to achieve due to the triangle-inequality constraint that most distortion measures possess. The state-of-the-art practical schemes for the similarity identification problem are the triangle-inequality based TC-△ and LC-△ schemes proposed in [15], where the TC-△ scheme consistently performs better than the LC-△ scheme. Therefore, we replace the ideal scheme with the practical TC-△ scheme for each component. The component-based model equipped with TC-△ schemes is illustrated in Figure 4.

Figure 4: Component-based model with TC- schemes.

We denote the minimum achievable rates of TC- and LC- as and , respectively. The authors in [10] show that the minimum achievable rates of TC- and LC- schemes generally hold the relation . Hence, we select the TC- scheme as the -admissible system for each component.

The next step is to evaluate the rate-similarity performance of component-based models constructed from TC-△ schemes. While the minimum achievable rate of the LC-△ scheme can be evaluated by employing a rate-distortion code on the triangle-inequality principle, that of the TC-△ scheme can only be computed numerically. Here, we propose an iterative method to numerically approximate the minimum achievable rate of TC-△ schemes. We only consider the special case in which the query and the database are drawn from the same distribution. The general case, in which the query and the database are drawn from different distributions, can be handled by a natural extension along the lines of previous works [8], [9].

A. Iterative Method for Approximating

It is shown in [10] that any similarity threshold below can be attained by a TC- scheme of rate , where

(45)

where is the reconstructed codeword, and and are independent. Since is a strictly increasing function with [10], for any , there exists an exposed point on the curve such that the slope of a tangent to the curve at that point is equal to .

Denote the exposed points on the curve by , where follows (45):

(46)

The achievable rate region is the area above the curve. Obtaining the exposed point on the curve is equivalent to minimizing the intercept of the tangent at the exposed point with the ordinate:

(47)

By varying over all , we then trace out the whole rate-similarity curve.

In the following, we denote the truncated discretized distribution of the source as and the marginal distribution of reconstructed codewords as . We form the conditional probability mass functions as columns of an matrix:

The expected distortions between and when averaging over their marginal distributions and joint distribution are

(48)

and

(49)

Since , the objective function (47) can be expressed as a minimization over

(50)
(51)

Note that the elements of represent probabilities and each column of is a probability mass function. This introduces the constraints and . We temporarily ignore the constraint and define a Lagrange cost function as

(52)

where are the Lagrange multipliers.

Differentiating with respect to , we have

(53)

Setting the derivative to zero, we obtain from (53)

(54)

or

(55)

Since , we have

(56)

We can see that is always nonnegative.

To vectorize the above operations, we define the distortion matrix as:

Hence, the conditional probability mass function (55) can be expressed as

(57)

where stands for column-wise multiplication. Then can be further normalized by

(58)

where , and where denotes the row-wise division.

The conditional probability matrix from (55) can be expressed analytically if we assume that the marginal codeword distribution is known. In our iterative method, we first initialize the marginal codeword distribution and choose a slope parameter. Then, the conditional probability matrix is determined according to (55). The marginal codeword distribution is updated by Bayes' rule

(59)

The corresponding vectorized representation is

(60)

We update and according to (57, 58) and (60) iteratively until the algorithm converges. Note that the term in (57) is constant, so it can be computed before the iterations. Finally, we approximate one point of by evaluating and .

Example 2.

We verify our algorithm by testing it on the binary-Hamming case. Figure 5 shows that the minimum achievable rate computed by the derived algorithm for TC-△ schemes coincides with the identification rate of binary sources with Hamming distance. This is consistent with the known result that the TC-△ scheme achieves the identification rate in the binary-Hamming case.

Figure 5: Binary source with Hamming distortion: .
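The alternating updates can be sketched as a Blahut–Arimoto-style iteration for the binary-Hamming case. This is a minimal sketch under assumptions, not the authors' implementation: it assumes the TC-△ constraint takes the triangle-inequality form in which the expected codeword-query distance minus the expected quantization distortion is the attained similarity threshold, it uses a hypothetical slope parameter `s` (in nats), and the function name `tc_triangle_point` and its arguments are illustrative.

```python
import numpy as np

def tc_triangle_point(p_x, p_y, dist, s, max_iters=1000, tol=1e-13):
    """Return (similarity threshold, rate in bits) for slope s.

    dist[i, j] is the per-letter distance between symbol i and codeword j.
    """
    # Expected distance between each codeword and an independent query symbol.
    codeword_query = p_y @ dist
    # This factor does not change across iterations, so it is computed once.
    kernel = np.exp(-s * (dist - codeword_query[None, :]))
    q = np.full(dist.shape[1], 1.0 / dist.shape[1])  # marginal codeword distribution
    for _ in range(max_iters):
        cond = q[None, :] * kernel
        cond /= cond.sum(axis=1, keepdims=True)      # each row is a conditional pmf
        q_new = p_x @ cond                           # marginal update
        converged = np.max(np.abs(q_new - q)) < tol
        q = q_new
        if converged:
            break
    cond = q[None, :] * kernel
    cond /= cond.sum(axis=1, keepdims=True)
    rate = np.sum(p_x[:, None] * cond * np.log2(cond / q[None, :]))
    quantization = np.sum(p_x[:, None] * cond * dist)
    threshold = (p_x @ cond) @ codeword_query - quantization
    return threshold, rate

# Uniform binary source and query with Hamming distance.
p = np.array([0.5, 0.5])
hamming = np.array([[0.0, 1.0], [1.0, 0.0]])
threshold, rate = tc_triangle_point(p, p, hamming, s=2.0)
```

For the uniform binary source the iteration has a closed form: the crossover probability is $1/(1+e^{s})$, the rate is one minus its binary entropy, and the threshold is one half minus the crossover probability. Sweeping `s` over positive values would trace out the whole curve under the stated assumptions.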

B. Iterative Method for Component-based Model

The optimal rate allocation of the component-based model constructed from TC-△ schemes can be achieved simply by applying the Pareto condition. That is, each component system should operate at the point where the rate-similarity curves of all components have the same slope:

(61)

where , is the chosen value of the slope. Then we can obtain the rate-similarity curve of the -component model by traversing the values of with each component running the iterative method of the TC- scheme independently for a given .

C. Comparisons

In this section, we use the proposed iterative method to approximate the for both i.i.d. and multivariate Gaussian sources, and then compare them with and of optimal schemes. First, we derive the of LC- schemes for quadratic Gaussian sources by employing a rate-distortion code on the triangle-inequality principle (62) [8].

(62)

Consider a Gaussian source that is compressed by an optimal rate-distortion code with . The rate-distortion code can be designed with the codeword distribution [4], and we have