# Population Symbolic Covariance Matrices for Interval Data

Symbolic Data Analysis (SDA) is a relatively new field of statistics that extends classical data analysis by taking into account intrinsic data variability and structure. As SDA has been mainly approached from a sampling perspective, we introduce population formulations of the symbolic mean, variance, covariance, correlation, covariance matrix and correlation matrix for interval-valued symbolic variables, providing a theoretical framework that gives support to interval-valued SDA. Moreover, we provide an interpretation of the various definitions of covariance and correlation matrices according to the structure of micro-data, which allows selecting the model that best suits specific datasets. Our results are illustrated using two datasets. Specifically, we select the most appropriate model for each dataset using goodness-of-fit tests and quantile-quantile plots, and provide an explanation of the micro-data based on the covariance matrix.


## 1 Introduction

The volume and complexity of available data in virtually all sectors of society have grown enormously, boosted by globalization and the massive use of the Internet. New statistical methods are required to handle this new reality, and Symbolic Data Analysis (SDA), proposed by Diday in [1], is a promising research area.

When trying to characterize datasets, it may not be convenient to deal with the individual data observations (e.g. because the sample size is too large), or we may not have access to the individual data observations (e.g. because of privacy restrictions). In conventional data analysis, this problem is handled by providing single-valued summary statistics of the data characteristics (e.g. mean, variance, quantiles). The analysis can consider multiple characteristics, but these characteristics can only be single-valued. SDA extends conventional data analysis by allowing the description of datasets through multi-valued features, such as intervals, histograms, or even distributions [2, 3, 4]. These features are called symbolic variables.

Suppose we want to analyze textile sector companies in countries all over the world, e.g. in terms of two characteristics: number of customers and profit. Suppose also that we do not have access to the data of individual companies in each country, but only to summary information per country. Conventional data analysis can only deal with single-valued features, like the profit variance, profit mean, or the mean number of customers. Instead, in SDA, the features (the symbolic variables) can be multi-valued, e.g. one feature can be the minimum and maximum profits, and another can be a histogram of the number of customers.

One of the main benefits of SDA has to do with the way individual data characteristics (e.g. profit or number of customers) are described. In conventional data analysis, since only single-valued features are available, one may need many features to describe a given characteristic. Moreover, the features are treated in the same way, irrespective of the characteristic they represent. For example, to characterize the profit of the textile sector, one may create as features the mean, the variance, the maximum, the minimum, the median, the first and third quartiles, and so on. There is then an inflation of features to explain a single data characteristic (the profit, in this case). SDA allows explaining single data characteristics through single symbolic data variables, better tailored to analyze that specific characteristic, and with potential gains in terms of dimensionality.

In SDA, the original data is called micro-data and the aggregated data is called macro-data. In the previous example, the micro-data would be the data of individual companies (labeled with the country they belong to), and the macro-data would be, for the companies of each country, the interval of profit (between minimum and maximum) or the histogram of the number of customers. Our main interest in this paper is in interval-valued data [5, 6], where macro-data corresponds to the interval between the minimum and maximum of micro-data values.
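The passage from micro-data to interval-valued macro-data can be sketched in a few lines. The example below is only an illustration: the function name `aggregate_to_intervals`, the country labels, and the profit values are all hypothetical and do not come from the paper's datasets.

```python
from collections import defaultdict


def aggregate_to_intervals(micro_data):
    """Map each group label to the (min, max) interval of its micro-data values."""
    groups = defaultdict(list)
    for label, value in micro_data:
        groups[label].append(value)
    return {label: (min(values), max(values)) for label, values in groups.items()}


# Hypothetical micro-data: (country, company profit) pairs.
profits = [("PT", 1.2), ("PT", 3.4), ("PT", 2.0), ("ES", 0.5), ("ES", 4.1)]

# Macro-data: one profit interval per country.
intervals = aggregate_to_intervals(profits)
print(intervals)
```

Each country is then described by a single interval-valued feature instead of several single-valued summaries.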

SDA is a relatively new field of statistics and has been mainly approached from a sampling perspective. The works [7, 3, 8] introduced measures of location, dispersion, and association between symbolic random variables, formalized as a function of the observed macro-data values. The sample covariance (correlation) matrices were addressed in the context of symbolic principal component analysis in [9, 10, 11, 12, 13, 14] and more recently in factor analysis [15]. In [14] the authors established relationships between several proposed methods of symbolic principal component analysis and available definitions of sample symbolic variance and covariance. Other areas of statistics have also been addressed by SDA, like clustering (e.g. [16, 17]), discriminant analysis (vide e.g. [18, 19]), regression analysis (vide e.g. [20, 21]), and time series (vide e.g. [22, 23]).

Parametric approaches for interval-valued variables have also been considered [24, 25, 26, 18, 21, 27, 15]. The authors of [24] derived maximum likelihood estimators for the mean and the variance of three types of symbolic random variables: interval-valued, histogram-valued, and triangular distribution-valued variables. In [25], the authors formulated interval-valued variables as bivariate random vectors in order to introduce a symbolic regression model based on the theory of generalized linear models. The works [26, 18, 27] have followed a different approach. In their line of work, the centers and the logarithms of the ranges are collected in a random vector with a multivariate (skew-)normal distribution, which is used to derive methods for the analysis of variance [26], discriminant analysis [18], and outlier detection [27] of interval-valued variables.

The area of SDA is lacking theoretical support and our work is a step in this direction. Preferably, the statistical methods of SDA should be grounded on population formulations, as in the case of conventional methods. A population formulation allows a clear definition of the underlying statistical model and its properties, and the derivation of effective estimation methods.

In this paper, we derive population formulations of the sample symbolic mean, and of three proposals for the sample symbolic variance and covariance. We then determine the main properties of interval-valued random variables, providing a theoretical framework that gives support to interval-valued SDA. The population formulations of covariance and correlation matrices are also addressed. We focus on the main definitions which result from the various sampling proposals available in the literature, and provide an interpretation of each definition according to the structure of micro-data, which allows selecting the model that best suits specific datasets. This is illustrated using two datasets. Specifically, we select the most appropriate model for each dataset using goodness-of-fit tests and quantile-quantile plots, and provide an explanation of the micro-data based on the covariance matrices.

The paper is structured as follows. In Section 2 we introduce the population formulations of the symbolic mean, variance, covariance, correlation, covariance matrix and correlation matrix for interval-valued symbolic variables, and derive their main properties. In Section 3 we investigate conditions on the micro-data that lead to each of the symbolic covariance matrix definitions under study. We also discuss conditions under which a null correlation matrix may be obtained for interval-valued variables. Section 4 presents two case studies, one based on the iris dataset and another based on credit card monthly expenses. Finally, Section 5 presents the conclusions of the paper and some directions for future work.

## 2 Symbolic covariance for interval-valued variables

In this work, we focus on the study of interval-valued variables [28], defined in the following way:

###### Definition 2.1.

$X = [A, B]$ is an interval-valued random variable defined on the probability space $(\Omega, \mathcal{F}, P)$ if and only if $A$ and $B$ are random variables defined on $(\Omega, \mathcal{F}, P)$ such that $P(A \le B) = 1$.

In general, when dealing with this type of data it is considered that only macro-data, in the form of a real interval, $[a, b]$, is observed. Since micro-data within each interval is not observed, it is commonly assumed that it follows a Uniform distribution on $[a, b]$.

We consider that an interval-valued random variable, $X$, besides being represented by the interval limits $A$ and $B$, is also represented by the center and range of the interval:

$$C = \frac{A+B}{2} \quad\text{and}\quad R = B - A. \tag{2.1}$$

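Let us consider that each object is characterized by $p$ interval-valued random variables, $X_1, \dots, X_p$, where $\mathbf{C} = (C_1, \dots, C_p)^T$ is the vector of the centers, and $\mathbf{R} = (R_1, \dots, R_p)^T$ the vector of ranges describing the object. Moreover, we denote the mean vectors of centers and ranges by $\boldsymbol{\mu}_C$ and $\boldsymbol{\mu}_R$, and the covariance matrices of centers and ranges by $\boldsymbol{\Sigma}_{CC}$ and $\boldsymbol{\Sigma}_{RR}$.

When referring to a sample of size $n$ from this population, the $i$-th sample point ($i = 1, \dots, n$) can be written as $\mathbf{x}_i = ([a_{i1}, b_{i1}], \dots, [a_{ip}, b_{ip}])^T$, or alternatively, as $\mathbf{c}_i = (c_{i1}, \dots, c_{ip})^T$ and $\mathbf{r}_i = (r_{i1}, \dots, r_{ip})^T$, using the centers and ranges representation, where:

$$c_{ij} = \frac{b_{ij} + a_{ij}}{2} \quad\text{and}\quad r_{ij} = b_{ij} - a_{ij}, \quad\text{for } i = 1, 2, \dots, n \text{ and } j = 1, 2, \dots, p. \tag{2.2}$$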

In the case of symbolic data analysis, the individual descriptions associated with the $i$-th symbolic observation are all the points in the hyper-rectangle $[a_{i1}, b_{i1}] \times \cdots \times [a_{ip}, b_{ip}]$. This is a departure from classical analysis, and opens the possibility for different definitions of sample symbolic estimators. In the next sections, we start by presenting the various proposals for the sample mean, variance, and covariance. We then establish population definitions for the mean, variance, and covariance (and, from these, for the covariance and correlation matrices) that are in agreement with the corresponding sample measures. Finally, we derive several properties for the population measures.

### 2.1 Sample Symbolic Mean, Variances, and Covariances

In this section, we discuss three alternative definitions of sample symbolic variance and of sample symbolic covariance; we use an upper-index to distinguish among the definitions.

In all definitions, the sample symbolic mean is defined by ($j = 1, \dots, p$):

$$\bar{x}_j = \frac{1}{n}\sum_{i=1}^{n} \frac{a_{ij} + b_{ij}}{2}. \tag{2.3}$$

This approach has the appeal of using the mean of the interval centers as sample symbolic mean, which makes sense, in particular, under the assumption that the micro-data associated with each interval $[a_{ij}, b_{ij}]$ follows a symmetric distribution on that interval.

Regarding the sample symbolic variance, the most straightforward approach is to follow the definition of conventional sample variance of the interval center:

$$s^{(1)}_{jj} = \frac{1}{n}\sum_{i=1}^{n} \left(\frac{a_{ij} + b_{ij}}{2} - \bar{x}_j\right)^2. \tag{2.4}$$

This definition has the disadvantage of ignoring the contribution of the ranges to the sample symbolic variance.

The second and third definitions try to overcome this limitation. The second definition, proposed by de Carvalho et al. [29], is based on the squared distances between the interval limits and the sample symbolic mean, and is given by:

$$s^{(2)}_{jj} = \sum_{i=1}^{n} \frac{(a_{ij} - \bar{x}_j)^2 + (b_{ij} - \bar{x}_j)^2}{2n}. \tag{2.5}$$

The third alternative, proposed by Bertrand and Goupil [7], is obtained from the empirical density function of an interval-valued variable, assuming that the micro-data follows a Uniform distribution, and is given by:

$$s^{(3)}_{jj} = \sum_{i=1}^{n} \frac{b_{ij}^2 + b_{ij}a_{ij} + a_{ij}^2}{3n} - \left[\sum_{i=1}^{n} \frac{b_{ij} + a_{ij}}{2n}\right]^2. \tag{2.6}$$

For the sample symbolic covariance between interval-valued variables, we consider three definitions that are generalizations of the definitions of sample symbolic variance presented above, such that the sample covariance of a variable with itself under definition $k$ coincides with the sample variance under the same definition $k$.

The first covariance definition, proposed by Billard and Diday [2], is supported on the empirical joint density of two different interval-valued variables, $X_j$ and $X_l$, and corresponds to the conventional sample covariance of the centers of $X_j$ and $X_l$. It is given by ($j, l = 1, \dots, p$):

$$s^{(1)}_{jl} = \sum_{i=1}^{n} \frac{(b_{ij} + a_{ij})(b_{il} + a_{il})}{4n} - \bar{x}_j \bar{x}_l. \tag{2.7}$$

To the best of our knowledge, the second definition of symbolic covariance has not been proposed in the literature. It is the direct generalization of the second definition of sample symbolic variance (vide equation (2.5)):

$$s^{(2)}_{jl} = \sum_{i=1}^{n} \frac{(a_{ij} - \bar{x}_j)(a_{il} - \bar{x}_l) + (b_{ij} - \bar{x}_j)(b_{il} - \bar{x}_l)}{2n}. \tag{2.8}$$

Finally, the third definition, proposed by Billard [8], is based on the explicit decomposition of the covariance into within sum of products and between sum of products:

$$s^{(3)}_{jl} = \frac{1}{6n}\sum_{i=1}^{n} \Big[(a_{ij} - \bar{x}_j)(b_{il} - \bar{x}_l) + (b_{ij} - \bar{x}_j)(a_{il} - \bar{x}_l) + 2(a_{ij} - \bar{x}_j)(a_{il} - \bar{x}_l) + 2(b_{ij} - \bar{x}_j)(b_{il} - \bar{x}_l)\Big]. \tag{2.9}$$
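As an illustration, the three sample covariance definitions (2.7) to (2.9) can be transcribed directly from the interval limits. This is a minimal sketch in plain Python; the function names and the small data set `a`, `b` (lower and upper limits, one row per observation) are hypothetical and chosen only for the example.

```python
def symbolic_mean(a, b, j):
    """Sample symbolic mean (2.3): mean of the interval centers of variable j."""
    n = len(a)
    return sum((a[i][j] + b[i][j]) / 2 for i in range(n)) / n


def cov1(a, b, j, l):
    """Definition 1 (2.7): conventional sample covariance of the centers."""
    n = len(a)
    xj, xl = symbolic_mean(a, b, j), symbolic_mean(a, b, l)
    cross = sum((a[i][j] + b[i][j]) * (a[i][l] + b[i][l]) for i in range(n))
    return cross / (4 * n) - xj * xl


def cov2(a, b, j, l):
    """Definition 2 (2.8): products of deviations of the limits from the mean."""
    n = len(a)
    xj, xl = symbolic_mean(a, b, j), symbolic_mean(a, b, l)
    total = sum(
        (a[i][j] - xj) * (a[i][l] - xl) + (b[i][j] - xj) * (b[i][l] - xl)
        for i in range(n)
    )
    return total / (2 * n)


def cov3(a, b, j, l):
    """Definition 3 (2.9): within plus between sums of products."""
    n = len(a)
    xj, xl = symbolic_mean(a, b, j), symbolic_mean(a, b, l)
    total = 0.0
    for i in range(n):
        da_j, db_j = a[i][j] - xj, b[i][j] - xj
        da_l, db_l = a[i][l] - xl, b[i][l] - xl
        total += da_j * db_l + db_j * da_l + 2 * da_j * da_l + 2 * db_j * db_l
    return total / (6 * n)


# Hypothetical sample: n = 3 observations of p = 2 interval-valued variables.
a = [[1.0, 10.0], [2.0, 14.0], [4.0, 12.0]]  # lower limits
b = [[3.0, 14.0], [6.0, 16.0], [6.0, 20.0]]  # upper limits

for name, cov in (("cov1", cov1), ("cov2", cov2), ("cov3", cov3)):
    print(name, "variance of X1:", cov(a, b, 0, 0), "covariance:", cov(a, b, 0, 1))
```

Calling each function with `j == l` gives the corresponding sample symbolic variance, so a single function per definition covers both (2.4) to (2.6) and (2.7) to (2.9).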

### 2.2 Population formulation

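In this section, we seek population formulations for the symbolic mean, variance, covariance, correlation, covariance matrices, and correlation matrices. We start by rewriting (2.3) to (2.9) in terms of centers and ranges using (2.2). This leads to simpler expressions, with a clearer interpretation. A detailed derivation of the new expressions can be found in [13]. The sample symbolic mean, variances, and covariances are now:

\begin{align}
\bar{c}_j &= \frac{1}{n}\sum_{i=1}^{n} c_{ij}, \tag{2.10} \\
s^{(1)}_{jj} &= \frac{1}{n}\sum_{i=1}^{n} (c_{ij} - \bar{c}_j)^2, \tag{2.11} \\
s^{(2)}_{jj} &= s^{(1)}_{jj} + \frac{1}{4}\sum_{i=1}^{n} \frac{r_{ij}^2}{n}, \tag{2.12} \\
s^{(3)}_{jj} &= s^{(1)}_{jj} + \frac{1}{12}\sum_{i=1}^{n} \frac{r_{ij}^2}{n}, \tag{2.13} \\
s^{(1)}_{jl} &= \frac{1}{n}\sum_{i=1}^{n} c_{ij}c_{il} - \bar{c}_j\bar{c}_l, \tag{2.14} \\
s^{(2)}_{jl} &= s^{(1)}_{jl} + \frac{1}{4}\sum_{i=1}^{n} \frac{r_{ij}r_{il}}{n}, \tag{2.15} \\
s^{(3)}_{jl} &= s^{(1)}_{jl} + \frac{1}{12}\sum_{i=1}^{n} \frac{r_{ij}r_{il}}{n}. \tag{2.16}
\end{align}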

The sample symbolic mean corresponds to the sample mean of the centers. The first sample symbolic variance corresponds to the sample variance of the observed centers. The other two definitions of variance add to the first one the sample second order moment of the ranges, weighted differently in each case; in Definition 2 this weight is 1/4 and in Definition 3 it is 1/12. Definition 2 of sample variance can also be interpreted as the sample variance of the centers plus the sample second order moment of interval half-ranges.

Regarding the sample covariances, the first definition corresponds to the sample covariance between the centers of two interval-valued variables. The second and third definitions add to the first one the sample moment of the product between the $j$-th and $l$-th ranges, weighted by 1/4 and 1/12, respectively. If $j = l$, then each definition of sample symbolic covariance equals the corresponding definition of sample symbolic variance.
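The weights 1/4 and 1/12 can be checked by hand for the variances. The short derivation below is a sketch, not the detailed derivation of [13]; it only assumes the substitutions $a_{ij} = c_{ij} - r_{ij}/2$ and $b_{ij} = c_{ij} + r_{ij}/2$ implied by the definition of centers and ranges, and the fact that the sample symbolic mean equals the mean of the centers, $\bar{c}_j$.

```latex
% Assumption: a_{ij} = c_{ij} - r_{ij}/2, b_{ij} = c_{ij} + r_{ij}/2, and \bar{x}_j = \bar{c}_j.
\begin{align*}
(a_{ij}-\bar{c}_j)^2 + (b_{ij}-\bar{c}_j)^2
  &= \Big(c_{ij}-\bar{c}_j-\tfrac{r_{ij}}{2}\Big)^2 + \Big(c_{ij}-\bar{c}_j+\tfrac{r_{ij}}{2}\Big)^2
   = 2\,(c_{ij}-\bar{c}_j)^2 + \tfrac{r_{ij}^2}{2}, \\
s^{(2)}_{jj}
  &= \frac{1}{2n}\sum_{i=1}^{n}\Big[2\,(c_{ij}-\bar{c}_j)^2 + \tfrac{r_{ij}^2}{2}\Big]
   = s^{(1)}_{jj} + \frac{1}{4}\sum_{i=1}^{n}\frac{r_{ij}^2}{n}, \\
b_{ij}^2 + b_{ij}a_{ij} + a_{ij}^2
  &= \Big(2c_{ij}^2 + \tfrac{r_{ij}^2}{2}\Big) + \Big(c_{ij}^2 - \tfrac{r_{ij}^2}{4}\Big)
   = 3c_{ij}^2 + \tfrac{r_{ij}^2}{4}, \\
s^{(3)}_{jj}
  &= \frac{1}{n}\sum_{i=1}^{n} c_{ij}^2 + \frac{1}{12}\sum_{i=1}^{n}\frac{r_{ij}^2}{n} - \bar{c}_j^{\,2}
   = s^{(1)}_{jj} + \frac{1}{12}\sum_{i=1}^{n}\frac{r_{ij}^2}{n}.
\end{align*}
```

The covariance expressions follow from the same substitution applied to the products of deviations of two different variables.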

The sample definitions can be used as a starting point for the population definitions. Consider that a sample is written as the realization of a random sample of $\mathbf{C}$ and $\mathbf{R}$. Then, population characteristics are obtained as limiting values (in the sense of almost sure convergence, as $n$ goes to infinity) of the estimators whose realizations lead to the values (2.10) to (2.16). This is a direct application of the strong law of large numbers. These results are summarized in the following theorem.

###### Theorem 2.1.

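Let $\mathbf{C}$ and $\mathbf{R}$ be the random vectors of centers and ranges associated with the interval-valued random vector $\mathbf{X}$, where the covariance matrix of $\mathbf{C}$, $\boldsymbol{\Sigma}_{CC}$, and $E(\mathbf{R}\mathbf{R}^T)$ exist. Let $(\mathbf{C}_i, \mathbf{R}_i)$, $i = 1, 2, \dots$, be a sequence of random vectors independent and identically distributed to $(\mathbf{C}, \mathbf{R})$. Then, for $j, l = 1, \dots, p$, the strong law of large numbers guarantees that:

\begin{align}
\bar{C}_j &= \frac{1}{n}\sum_{i=1}^{n} C_{ij} \xrightarrow{\;a.s.\;} E(C_j), \tag{2.17} \\
S^{(1)}_{jj} &= \frac{1}{n}\sum_{i=1}^{n} (C_{ij} - \bar{C}_j)^2 \xrightarrow{\;a.s.\;} \mathrm{Var}(C_j), \tag{2.18} \\
S^{(2)}_{jj} &= S^{(1)}_{jj} + \frac{1}{4}\sum_{i=1}^{n} \frac{R_{ij}^2}{n} \xrightarrow{\;a.s.\;} \mathrm{Var}(C_j) + \frac{E(R_j^2)}{4}, \tag{2.19} \\
S^{(3)}_{jj} &= S^{(1)}_{jj} + \frac{1}{12}\sum_{i=1}^{n} \frac{R_{ij}^2}{n} \xrightarrow{\;a.s.\;} \mathrm{Var}(C_j) + \frac{E(R_j^2)}{12}, \tag{2.20} \\
S^{(1)}_{jl} &= \frac{1}{n}\sum_{i=1}^{n} C_{ij}C_{il} - \bar{C}_j\bar{C}_l \xrightarrow{\;a.s.\;} \mathrm{Cov}(C_j, C_l), \tag{2.21} \\
S^{(2)}_{jl} &= S^{(1)}_{jl} + \frac{1}{4}\sum_{i=1}^{n} \frac{R_{ij}R_{il}}{n} \xrightarrow{\;a.s.\;} \mathrm{Cov}(C_j, C_l) + \frac{E(R_jR_l)}{4}, \tag{2.22} \\
S^{(3)}_{jl} &= S^{(1)}_{jl} + \frac{1}{12}\sum_{i=1}^{n} \frac{R_{ij}R_{il}}{n} \xrightarrow{\;a.s.\;} \mathrm{Cov}(C_j, C_l) + \frac{E(R_jR_l)}{12}. \tag{2.23}
\end{align}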
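The convergence stated in the theorem can be observed numerically. The sketch below is only an illustration under assumed distributions that are not taken from the paper: centers drawn from a standard Normal and ranges drawn independently from a Uniform on (0, 2), so that $\mathrm{Var}(C) = 1$ and $E(R^2) = 4/3$. The function name and the seed are likewise hypothetical.

```python
import random


def simulate_symbolic_variances(n, seed=0):
    """Return the three sample symbolic variances for n simulated intervals."""
    rng = random.Random(seed)
    # Assumed model (illustration only): C ~ N(0, 1), R ~ U(0, 2), independent.
    centers = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ranges = [rng.uniform(0.0, 2.0) for _ in range(n)]
    c_bar = sum(centers) / n
    s1 = sum((c - c_bar) ** 2 for c in centers) / n  # sample variance of centers
    r2 = sum(r * r for r in ranges) / n              # sample second moment of ranges
    return s1, s1 + r2 / 4, s1 + r2 / 12


s1, s2, s3 = simulate_symbolic_variances(50_000)
print("Definition 1:", s1, "limit:", 1.0)
print("Definition 2:", s2, "limit:", 1.0 + (4 / 3) / 4)
print("Definition 3:", s3, "limit:", 1.0 + (4 / 3) / 12)
```

Under these assumptions the three estimates should settle near 1, 4/3, and 10/9 as the sample size grows, which are the limits $\mathrm{Var}(C) + \delta E(R^2)$ with $\delta = 0$, $1/4$, and $1/12$.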

Theorem 2.1 motivates the following notation for the population symbolic means, symbolic variances, and symbolic covariances.

###### Definition 2.2.

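Let $X_j$ and $X_l$ be two interval-valued random variables with centers, $C_j$ and $C_l$, and ranges, $R_j$ and $R_l$, whose second-order moments exist. Then, the population symbolic mean of $X_j$ is $E(X_j) = E(C_j)$. The $k$-th definition of population symbolic covariance between $X_j$ and $X_l$ is $\mathrm{Cov}_k(X_j, X_l) = \mathrm{Cov}(C_j, C_l) + \delta_k E(R_j R_l)$ ($k = 1, 2, 3$), where $\delta_1 = 0$, $\delta_2 = 1/4$, and $\delta_3 = 1/12$. Finally, the $k$-th definition of population symbolic variance is defined as $\mathrm{Var}_k(X_j) = \mathrm{Cov}_k(X_j, X_j) = \mathrm{Var}(C_j) + \delta_k E(R_j^2)$, $k = 1, 2, 3$.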

The definitions of population symbolic mean, variance, and covariance may be extended for interval-valued random vectors, similarly to what is done for conventional random vectors.

The natural approach for defining symbolic covariance matrices is to base them on definitions of symbolic variance and covariance that correspond to each other. This naturally leads to three definitions of symbolic covariance matrices. However, some known algorithms for symbolic principal component analysis have considered symbolic covariance matrices that combine definitions of symbolic variance and covariance that do not correspond to each other (vide [30, 13] for more details). This strategy has the drawback that the symbolic variance does not correspond to the symbolic covariance of an interval-valued variable with itself. However, some authors argue that such combinations may lead to better performance [12]. Thus, besides the three natural definitions referred to above, we consider two others: Definition 4, combining Definition 2 of symbolic variance with Definition 1 of symbolic covariance, proposed by Cazes et al. [31], and Definition 5, combining Definition 3 of symbolic variance with Definition 1 of symbolic covariance, proposed by Wang et al. [12]. These results are summarized in the next definition.

###### Definition 2.3.

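Let $\mathbf{C}$ and $\mathbf{R}$ be the random vectors of centers and ranges associated with the interval-valued random vector, $\mathbf{X}$, where $\boldsymbol{\Sigma}_{CC}$ and $E(\mathbf{R}\mathbf{R}^T)$ exist. Then, the population symbolic mean vector of $\mathbf{X}$ is $E(\mathbf{X}) = E(\mathbf{C})$, and $\boldsymbol{\Sigma}_k$, $k = 1, \dots, 5$, are the population symbolic covariance matrices obtained according to the combinations of symbolic variances and covariances listed in Table 1.

Table 1: Combinations of symbolic variance and covariance definitions used in each symbolic covariance matrix.

| Covariance matrix | Symbolic variance | Symbolic covariance |
|---|---|---|
| Definition 1 | Definition 1 | Definition 1 |
| Definition 2 | Definition 2 | Definition 2 |
| Definition 3 | Definition 3 | Definition 3 |
| Definition 4 | Definition 2 | Definition 1 |
| Definition 5 | Definition 3 | Definition 1 |

To unify notation, we consider that, for $k = 4$, $\mathrm{Var}_4 = \mathrm{Var}_2$ and $\delta_4 = \delta_2$; for $k = 5$, $\mathrm{Var}_5 = \mathrm{Var}_3$ and $\delta_5 = \delta_3$; and $\mathrm{Cov}_4 = \mathrm{Cov}_5 = \mathrm{Cov}_1$.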
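A sample analogue of the five covariance matrices can be sketched from centers and ranges. The code below is an illustration only: the function name, the `DELTA` dictionary, and the data are hypothetical, and the weights for Definitions 4 and 5 (1/4 and 1/12 on the diagonal only) are inferred from the combinations described in the text above rather than quoted from the paper.

```python
# Range weights per definition; for k = 4, 5 they apply to the diagonal only
# (assumption based on the variance/covariance combinations described above).
DELTA = {1: 0.0, 2: 1 / 4, 3: 1 / 12, 4: 1 / 4, 5: 1 / 12}


def symbolic_cov_matrix(centers, ranges, k):
    """Sample symbolic covariance matrix for definition k in {1, ..., 5}."""
    n, p = len(centers), len(centers[0])
    c_bar = [sum(row[j] for row in centers) / n for j in range(p)]
    cov = [[0.0] * p for _ in range(p)]
    for j in range(p):
        for l in range(p):
            # Conventional sample covariance of the centers.
            s1 = sum(centers[i][j] * centers[i][l] for i in range(n)) / n
            s1 -= c_bar[j] * c_bar[l]
            # Sample moment of the product of ranges.
            rr = sum(ranges[i][j] * ranges[i][l] for i in range(n)) / n
            if k in (1, 2, 3) or j == l:
                cov[j][l] = s1 + DELTA[k] * rr
            else:
                # Definitions 4 and 5 keep the covariance of Definition 1.
                cov[j][l] = s1
    return cov


# Hypothetical sample: n = 3 observations of p = 2 interval-valued variables.
centers = [[2.0, 12.0], [4.0, 15.0], [5.0, 16.0]]
ranges = [[2.0, 4.0], [4.0, 2.0], [2.0, 8.0]]

matrices = {k: symbolic_cov_matrix(centers, ranges, k) for k in range(1, 6)}
for k, m in matrices.items():
    print(k, m)
```

In this sketch Definitions 4 and 5 differ from Definition 1 only on the diagonal, which is where the mismatch between variance and covariance definitions shows up.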

The definitions of symbolic correlation matrices follow directly from those of symbolic covariance matrices, in the following way:

###### Definition 2.4.

Let and be the random vectors of centers and ranges, respectively, associated with interval-valued random variable, , where and exist, and 222The “” in denotes an index and not an exponent. and . Then, the symbolic correlation matrices associated with definitions in Table 1 are given by and, equivalently, , . Moreover, the -th definition of symbolic correlation between two interval-valued random variables, and , is

$$\mathrm{Cor}_k(X_j, X_l) = \frac{\mathrm{Cov}_k(X_j, X_l)}{\sqrt{\mathrm{Var}_k(X_j)\,\mathrm{Var}_k(X_l)}}, \quad k = 1, \dots, 5. \tag{2.24}$$
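Equation (2.24) amounts to rescaling a covariance matrix by the square roots of its diagonal. A minimal sketch, with a hypothetical function name and a made-up 2 x 2 covariance matrix, could look as follows.

```python
import math


def symbolic_corr_matrix(cov):
    """Rescale a (symbolic) covariance matrix into a correlation matrix."""
    p = len(cov)
    sd = [math.sqrt(cov[j][j]) for j in range(p)]
    return [[cov[j][l] / (sd[j] * sd[l]) for l in range(p)] for j in range(p)]


# Hypothetical symbolic covariance matrix (any of the five definitions).
cov_example = [[4.0, 1.0], [1.0, 9.0]]
corr_example = symbolic_corr_matrix(cov_example)
print(corr_example)
```

Because only the diagonal enters the denominator, inflating the variances while keeping the covariances fixed, as Definitions 4 and 5 do, shrinks the off-diagonal correlations in absolute value.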

### 2.3 Properties

To establish properties of symbolic means, variances, and covariances we need to introduce some basic results from Moore’s Interval Algebra [32]. These results are listed in [33], but we write them here in terms of interval-valued random variables described by their centers and ranges.

###### Definition 2.5.

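Let $X_i = [C_i - R_i/2,\; C_i + R_i/2]$ be an interval-valued random variable and $\omega_i \in \mathbb{R}$, $i = 1, 2$; then we define:

1. Sum of intervals:

 $$X_1 + X_2 = \left[C_1 + C_2 - \frac{R_1 + R_2}{2},\; C_1 + C_2 + \frac{R_1 + R_2}{2}\right].$$
2. Difference of intervals:

 $$X_1 - X_2 = \left[C_1 - C_2 - \frac{R_1 + R_2}{2},\; C_1 - C_2 + \frac{R_1 + R_2}{2}\right].$$
3. Sum of an interval and a constant:

 $$X_1 + \omega_1 = \left[C_1 + \omega_1 - \frac{R_1}{2},\; C_1 + \omega_1 + \frac{R_1}{2}\right].$$
4. Multiplication by a constant:

 $$\omega_1 X_1 = \left[\omega_1 C_1 - \frac{|\omega_1| R_1}{2},\; \omega_1 C_1 + \frac{|\omega_1| R_1}{2}\right].$$
5. Linear combination of an interval:

 $$\omega_1 X_1 + \omega_2 = \left[\omega_1 C_1 + \omega_2 - \frac{|\omega_1| R_1}{2},\; \omega_1 C_1 + \omega_2 + \frac{|\omega_1| R_1}{2}\right].$$
6. Linear combination of two intervals:

 $$\omega_1 X_1 + \omega_2 X_2 = \left[\omega_1 C_1 + \omega_2 C_2 - \frac{|\omega_1| R_1 + |\omega_2| R_2}{2},\; \omega_1 C_1 + \omega_2 C_2 + \frac{|\omega_1| R_1 + |\omega_2| R_2}{2}\right].$$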
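The rules of Definition 2.5 say that centers combine linearly while ranges add up with absolute weights. The sketch below encodes this in a small class; the class name `Interval`, its fields, and the numeric values are hypothetical and serve only to check the rules against ordinary interval arithmetic on the limits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Interval stored by its center and width (range), as in Definition 2.5."""
    center: float
    width: float  # b - a, assumed non-negative

    @property
    def limits(self):
        return (self.center - self.width / 2, self.center + self.width / 2)

    def __add__(self, other):
        if isinstance(other, Interval):  # sum of intervals: widths add up
            return Interval(self.center + other.center, self.width + other.width)
        return Interval(self.center + other, self.width)  # shift by a constant

    def __sub__(self, other):
        if isinstance(other, Interval):  # difference: widths still add up
            return Interval(self.center - other.center, self.width + other.width)
        return Interval(self.center - other, self.width)

    def scale(self, w):
        """Multiplication by a constant: the width scales by |w|."""
        return Interval(w * self.center, abs(w) * self.width)


x1 = Interval(center=2.0, width=2.0)  # the interval [1, 3]
x2 = Interval(center=5.0, width=4.0)  # the interval [3, 7]

print((x1 + x2).limits)                     # sum of intervals
print((x1 - x2).limits)                     # difference of intervals
print(x1.scale(-2.0).limits)                # multiplication by a constant
print((x1.scale(2.0) + x2.scale(-1.0)).limits)  # linear combination
```

Note that the difference of two intervals has width $R_1 + R_2$, not $R_1 - R_2$, so subtracting an interval from itself does not give a degenerate interval.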

The definition of population symbolic mean (vide Definition 2.2) leads to the following properties:

###### Proposition 2.1.

Let , , be interval-valued random variables such that exist. Assuming , , the symbolic mean verifies the following properties:

1. .

2. .

3. .

Since proofs of these properties are quite straightforward to obtain using Definition 2.3 and the Interval Algebra results summarized in Definition 2.5, they are omitted.

The definitions of symbolic variance (vide Definition 2.3) lead to the following properties:

###### Proposition 2.2.

Let be an interval-valued random variable, such that and exist. Assuming , , the symbolic variances, , where , , and verify the following properties ():

1. .

2. if and only if is almost surely a constant, i.e. , and is almost surely null, i.e. .

3. If is a conventional random variable, i.e. , then .

4. . As a consequence, if then .

Once again, the proof of the previous properties follows immediately from definitions 2.3 and 2.5, and is therefore not included here.

To further study the properties of symbolic variance and covariance, we need to separate the cases where $k = 1, 2, 3$ from those where $k = 4, 5$ (vide Table 1 in Definition 2.3).

###### Proposition 2.3.

Let , , be interval-valued random variables such that and exist, and , . Symbolic variances and covariances verify the following properties:

1. , .

2. If is a conventional random variable, i.e. , then , .

3. , .

4. , .

5. Moreover, if and are both negative or both positive values then , for all .

6. If and are two different interval-valued random variables, then

7. If and are two different interval-valued random variables then

###### Proof.

The proof of these results is given in Appendix A. ∎

Regarding the symbolic correlation, the next theorem establishes that it is a quantity between -1 and 1, as in the conventional case.

###### Theorem 2.2.

Let $X_j$, $j = 1, 2$, be two interval-valued random variables such that their symbolic correlation, $\mathrm{Cor}_k(X_1, X_2)$, defined by (2.24), exists, $k = 1, \dots, 5$. Then $|\mathrm{Cor}_k(X_1, X_2)| \le 1$.

###### Proof.

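For $k = 1, 2, 3$, considering the definition of $\mathrm{Cor}_k$ and the Cauchy-Schwarz inequality we obtain:

$$|\mathrm{Cor}_k(X_1, X_2)| \le \frac{|\mathrm{Cov}(C_1, C_2)| + \delta_k E(R_1 R_2)}{\sqrt{\mathrm{Var}_k(X_1)\,\mathrm{Var}_k(X_2)}} \le \frac{\sqrt{\mathrm{Var}(C_1)\mathrm{Var}(C_2)} + \delta_k \sqrt{E(R_1^2)E(R_2^2)}}{\sqrt{\big(\mathrm{Var}(C_1) + \delta_k E(R_1^2)\big)\big(\mathrm{Var}(C_2) + \delta_k E(R_2^2)\big)}},$$

thus,

$$|\mathrm{Cor}_k(X_1, X_2)|^2 \le \frac{\mathrm{Var}(C_1)\mathrm{Var}(C_2) + \delta_k^2 E(R_1^2)E(R_2^2) + 2\delta_k \sqrt{\mathrm{Var}(C_1)E(R_2^2)\,\mathrm{Var}(C_2)E(R_1^2)}}{\mathrm{Var}(C_1)\mathrm{Var}(C_2) + \delta_k^2 E(R_1^2)E(R_2^2) + \delta_k \big(\mathrm{Var}(C_1)E(R_2^2) + \mathrm{Var}(C_2)E(R_1^2)\big)}.$$

As can easily be proved, for all non-negative values of $x$ and $y$, $2\sqrt{xy} \le x + y$. Making $x = \mathrm{Var}(C_1)E(R_2^2)$ and $y = \mathrm{Var}(C_2)E(R_1^2)$, the numerator above does not exceed the denominator, and we conclude that $|\mathrm{Cor}_k(X_1, X_2)| \le 1$.

For $k = 4, 5$, an analogous deduction can be made. In fact, similarly to the first chain of inequalities above:

$$|\mathrm{Cor}_k(X_1, X_2)| = \frac{|\mathrm{Cov}(C_1, C_2)|}{\sqrt{\mathrm{Var}_k(X_1)\,\mathrm{Var}_k(X_2)}} \le \frac{\sqrt{\mathrm{Var}(C_1)\mathrm{Var}(C_2)}}{\sqrt{\big(\mathrm{Var}(C_1) + \delta_k E(R_1^2)\big)\big(\mathrm{Var}(C_2) + \delta_k E(R_2^2)\big)}} \le 1.$$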

As usual, we are also interested in understanding if symbolic correlations are also invariant under linear transformations and in what cases they reach their extreme values (-1 and 1). The next theorem clarifies these issues.

###### Theorem 2.3.

Let , , be interval-valued random variables such that their symbolic correlation, , defined by (2.24) exists and . Then the following properties hold:

1. For , . For , if and only if , i.e. is a conventional random variable.

2. where