1 Introduction
The volume and complexity of available data in virtually all sectors of society have grown enormously, boosted by globalization and the massive use of the Internet. New statistical methods are required to handle this new reality, and Symbolic Data Analysis (SDA), proposed by Diday in [1], is a promising research area.
When trying to characterize datasets, it may not be convenient to deal with the individual data observations (e.g. because the sample size is too large), or we may not have access to the individual data observations (e.g. because of privacy restrictions). In conventional data analysis, this problem is handled by providing single-valued summary statistics of the data characteristics (e.g. mean, variance, quantiles). The analysis can consider multiple characteristics, but these characteristics can only be single-valued. SDA extends conventional data analysis by allowing the description of datasets through multi-valued features, such as intervals, histograms, or even distributions [2, 3, 4]. These features are called symbolic variables.
Suppose we want to analyze textile sector companies in countries all over the world, e.g. in terms of two characteristics: number of customers and profit. Suppose also that we do not have access to the data of individual companies in each country, but only to summary information per country. Conventional data analysis can only deal with single-valued features, like the profit variance, profit mean, or the mean number of customers. Instead, in SDA, the features (the symbolic variables) can be multi-valued, e.g. one feature can be the minimum and maximum profits, and another can be a histogram of the number of customers.
One of the main benefits of SDA has to do with the way individual data characteristics (e.g. profit or number of customers) are described. In conventional data analysis, since only single-valued features are available, one may need many features to describe a given characteristic. Moreover, the features are treated in the same way, irrespective of the characteristic they represent. For example, to characterize the profit of the textile sector one may create as features the mean, the variance, the maximum, the minimum, the median, the first and third quartiles, and so on. There is then an inflation of features to explain a single data characteristic (the profit, in this case). SDA allows explaining single data characteristics through single symbolic variables, better tailored to analyze that specific characteristic, and with potential gains in terms of dimensionality.
In SDA, the original data is called microdata and the aggregated data is called macrodata. In the previous example, the microdata would be the data of individual companies (labeled with the country they belong to), and the macrodata the interval of profit (between minimum and maximum) or the histogram of the number of customers, of the companies of each country. Our main interest in this paper is in interval-valued data [5, 6], where macrodata corresponds to the interval between minimum and maximum of microdata values.
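The aggregation from microdata to interval-valued macrodata can be sketched in a few lines of Python. This is only an illustration: the country codes, the profit values, and the helper name `to_intervals` are hypothetical and not taken from the paper.

```python
from collections import defaultdict

# Hypothetical microdata: (country, profit) pairs for individual companies.
microdata = [
    ("PT", 1.2), ("PT", 3.4), ("PT", 2.0),
    ("ES", 0.5), ("ES", 4.1),
    ("FR", 2.2), ("FR", 2.9), ("FR", 6.0),
]

def to_intervals(records):
    """Aggregate (group, value) microdata into one (min, max) interval per group."""
    groups = defaultdict(list)
    for group, value in records:
        groups[group].append(value)
    return {group: (min(values), max(values)) for group, values in groups.items()}

macrodata = to_intervals(microdata)
for country, (low, high) in sorted(macrodata.items()):
    print(f"{country}: profit in [{low}, {high}]")
```

Only the interval limits survive the aggregation; how the microdata is spread inside each interval is no longer observable, which is why later sections need an explicit assumption about it.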
SDA is a relatively new field of statistics and has been mainly approached from a sampling perspective. The works [7, 3, 8] introduced measures of location, dispersion, and association between symbolic random variables, formalized as a function of the observed macrodata values. The sample covariance (correlation) matrices were addressed in the context of symbolic principal component analysis in [9, 10, 11, 12, 13, 14] and more recently in factor analysis [15]. In [14] the authors established relationships between several proposed methods of symbolic principal component analysis and available definitions of sample symbolic variance and covariance. Other areas of statistics have also been addressed by SDA, such as clustering (e.g. [16, 17]), discriminant analysis (vide e.g. [18, 19]), regression analysis (vide e.g. [20, 21]), and time series (vide e.g. [22, 23]).
Parametric approaches for interval-valued variables have also been considered [24, 25, 26, 18, 21, 27, 15]. The authors of [24] derived maximum likelihood estimators for the mean and the variance of three types of symbolic random variables: interval-valued, histogram-valued, and triangular distribution-valued variables. In [25], the authors formulated interval-valued variables as bivariate random vectors in order to introduce a symbolic regression model based on the theory of generalized linear models. The works [26, 18, 27] followed a different approach. In their line of work, the centers and the logarithms of the ranges are collected in a random vector with a multivariate (skew-)normal distribution, which is used to derive methods for the analysis of variance [26], discriminant analysis [18], and outlier detection [27] of interval-valued variables.
The area of SDA still lacks theoretical support, and our work is a step in this direction. Preferably, the statistical methods of SDA should be grounded on populational formulations, as in the case of conventional methods. A populational formulation allows a clear definition of the underlying statistical model and its properties, and the derivation of effective estimation methods.
In this paper, we derive population formulations of the sample symbolic mean, and of three proposals for the sample symbolic variance and covariance. We then determine the main properties of interval-valued random variables, providing a theoretical framework that gives support to interval-valued SDA. The population formulations of covariance and correlation matrices are also addressed. We focus on the main definitions which result from the various sampling proposals available in the literature, and provide an interpretation of each definition according to the structure of microdata, which allows selecting the model that best suits specific datasets. This is illustrated using two datasets. Specifically, we select the most appropriate model for each dataset using goodness-of-fit tests and quantile-quantile plots, and provide an explanation of the microdata based on the covariance matrices.
The paper is structured as follows. In Section 2 we introduce the population formulations of the symbolic mean, variance, covariance, correlation, covariance matrix and correlation matrix for interval-valued symbolic variables, and derive their main properties. In Section 3 we investigate conditions on the microdata that lead to each of the symbolic covariance matrix definitions under study. We also discuss conditions under which a null correlation matrix may be obtained for interval-valued variables. Section 4 presents two case studies, one based on the iris dataset and another based on credit card monthly expenses. Finally, Section 5 presents the conclusions of the paper and some directions for future work.
2 Symbolic covariance for interval-valued variables
In this work, we focus on the study of interval-valued variables [28], defined in the following way:
Definition 2.1.
$X = [A, B]$ is an interval-valued random variable defined on the probability space $(\Omega, \mathcal{F}, P)$ if and only if $A$ and $B$ are random variables defined on $(\Omega, \mathcal{F}, P)$ such that $P(A \le B) = 1$.
In general, when dealing with this type of data it is considered that only macrodata, in the form of a real interval, $[a, b]$, is observed. Since microdata within each interval is not observed, it is commonly assumed that it follows a Uniform distribution on $[a, b]$.
We consider that an interval-valued random variable, $X$, besides being represented by the interval limits $A$ and $B$, is also represented by the center and range of the interval:
(2.1) $C = \frac{A + B}{2}, \quad R = B - A.$
Let us consider that each object is characterized by $p$ interval-valued random variables, $\mathbf{X} = (X_1, \ldots, X_p)^t$, where $\mathbf{C} = (C_1, \ldots, C_p)^t$ is the vector of the centers, and $\mathbf{R} = (R_1, \ldots, R_p)^t$ the vector of ranges describing the object. Moreover, we denote the mean vectors of centers and ranges by $\boldsymbol{\mu}_C$ and $\boldsymbol{\mu}_R$, and the covariance matrices of centers and ranges by $\boldsymbol{\Sigma}_{CC}$ and $\boldsymbol{\Sigma}_{RR}$.
When referring to a sample of size $n$ from this population, the $i$th sample point ($i = 1, \ldots, n$) can be written as $\mathbf{x}_i = ([a_{i1}, b_{i1}], \ldots, [a_{ip}, b_{ip}])^t$, or alternatively, as $\mathbf{c}_i = (c_{i1}, \ldots, c_{ip})^t$ and $\mathbf{r}_i = (r_{i1}, \ldots, r_{ip})^t$, using the centers and ranges representation, where:
(2.2) $c_{ij} = \frac{a_{ij} + b_{ij}}{2}, \quad r_{ij} = b_{ij} - a_{ij}, \quad j = 1, \ldots, p.$
In the case of symbolic data analysis, the individual descriptions associated with the $i$th symbolic observation are all the points in the hyperrectangle $[a_{i1}, b_{i1}] \times \cdots \times [a_{ip}, b_{ip}]$. This is a departure from classical analysis, and opens the possibility for different definitions of sample symbolic estimators. In the next sections, we start by presenting the various proposals for the sample mean, variance, and covariance. We then establish population definitions for the mean, variance, and covariance (and, from these, for the covariance and correlation matrices) that are in agreement with the corresponding sample measures. Finally, we derive several properties for the population measures.
2.1 Sample Symbolic Mean, Variances, and Covariances
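In this section, we discuss three alternative definitions of sample symbolic variance and of sample symbolic covariance; we use a superscript $(k)$ to distinguish among the definitions.
In all definitions, the sample symbolic mean is defined by ($j = 1, \ldots, p$):
(2.3) $\bar{x}_j = \frac{1}{n} \sum_{i=1}^{n} \frac{a_{ij} + b_{ij}}{2}$
This approach has the appeal of using the mean of the interval centers as sample symbolic mean, which makes sense, in particular, under the assumption that the microdata, associated with $x_{ij}$, follows a symmetric distribution on $[a_{ij}, b_{ij}]$.
Regarding the sample symbolic variance, the most straightforward approach is to follow the definition of conventional sample variance of the interval center:
(2.4) $s_{jj}^{(1)} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{a_{ij} + b_{ij}}{2} - \bar{x}_j \right)^2$
This definition has the disadvantage of ignoring the contribution of the ranges to sample symbolic variance.
The second and third definitions try to overcome this limitation. The second definition, proposed by de Carvalho et al. [29], is based on the squared distances between the interval limits and the sample symbolic mean, and is given by:
(2.5) $s_{jj}^{(2)} = \frac{1}{2n} \sum_{i=1}^{n} \left[ (a_{ij} - \bar{x}_j)^2 + (b_{ij} - \bar{x}_j)^2 \right]$
The third alternative, proposed by Bertrand and Goupil [7], is obtained from the empirical density function of an interval-valued variable, assuming that the microdata follows a Uniform distribution, and is given by:
(2.6) $s_{jj}^{(3)} = \frac{1}{3n} \sum_{i=1}^{n} \left( a_{ij}^2 + a_{ij} b_{ij} + b_{ij}^2 \right) - \frac{1}{4n^2} \left[ \sum_{i=1}^{n} (a_{ij} + b_{ij}) \right]^2$
For the sample symbolic covariance between interval-valued variables, we consider three definitions that are generalizations of the definitions of sample symbolic variance presented above, such that the sample covariance of a variable with itself of definition $k$ coincides with the sample variance of the same definition $k$.
The first covariance definition, proposed by Billard and Diday [2], is supported on the empirical joint density of two different interval-valued variables, $X_j$ and $X_l$, and corresponds to the conventional sample covariance of the centers of $X_j$ and $X_l$. It is given by ($j, l = 1, \ldots, p$):
(2.7) $s_{jl}^{(1)} = \frac{1}{4n} \sum_{i=1}^{n} (a_{ij} + b_{ij})(a_{il} + b_{il}) - \frac{1}{4n^2} \left[ \sum_{i=1}^{n} (a_{ij} + b_{ij}) \right] \left[ \sum_{i=1}^{n} (a_{il} + b_{il}) \right]$
The second definition of symbolic covariance has not been proposed in the literature, to the best of our knowledge. It is the direct generalization of the second definition of sample symbolic variance (vide equation (2.5)):
(2.8) $s_{jl}^{(2)} = \frac{1}{2n} \sum_{i=1}^{n} \left[ (a_{ij} - \bar{x}_j)(a_{il} - \bar{x}_l) + (b_{ij} - \bar{x}_j)(b_{il} - \bar{x}_l) \right]$
Finally, the third definition, proposed by Billard [8], is based on the explicit decomposition of the covariance into within sum of products and between sum of products:
(2.9) $s_{jl}^{(3)} = \frac{1}{6n} \sum_{i=1}^{n} \left[ 2 (a_{ij} - \bar{x}_j)(a_{il} - \bar{x}_l) + (a_{ij} - \bar{x}_j)(b_{il} - \bar{x}_l) + (b_{ij} - \bar{x}_j)(a_{il} - \bar{x}_l) + 2 (b_{ij} - \bar{x}_j)(b_{il} - \bar{x}_l) \right]$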
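The three sample definitions can be compared numerically with a short Python sketch. It assumes the $1/n$ normalization and writes each observation as a pair of interval limits; the function names and the toy data are hypothetical and serve only as illustration.

```python
def symbolic_mean(iv):
    """Sample symbolic mean: mean of the interval centers; iv is a list of (a, b)."""
    return sum((a + b) / 2 for a, b in iv) / len(iv)

def cov_def1(x, y):
    """Definition 1: conventional sample covariance of the centers."""
    n, mx, my = len(x), symbolic_mean(x), symbolic_mean(y)
    return sum(((a + b) / 2 - mx) * ((c + d) / 2 - my)
               for (a, b), (c, d) in zip(x, y)) / n

def cov_def2(x, y):
    """Definition 2: products of deviations of matching interval limits."""
    n, mx, my = len(x), symbolic_mean(x), symbolic_mean(y)
    return sum((a - mx) * (c - my) + (b - mx) * (d - my)
               for (a, b), (c, d) in zip(x, y)) / (2 * n)

def cov_def3(x, y):
    """Definition 3: weighted products of all four pairs of limits."""
    n, mx, my = len(x), symbolic_mean(x), symbolic_mean(y)
    return sum(2 * (a - mx) * (c - my) + (a - mx) * (d - my)
               + (b - mx) * (c - my) + 2 * (b - mx) * (d - my)
               for (a, b), (c, d) in zip(x, y)) / (6 * n)

# Hypothetical interval data for two variables observed on three objects.
profit = [(1.0, 3.0), (2.0, 6.0), (4.0, 5.0)]
customers = [(0.0, 2.0), (1.0, 2.0), (3.0, 7.0)]

for name, cov in (("def 1", cov_def1), ("def 2", cov_def2), ("def 3", cov_def3)):
    # Passing the same variable twice yields the matching symbolic variance.
    print(name, cov(profit, customers), cov(profit, profit))
```

On these data the second and third values exceed the first by one quarter and one twelfth of the mean product of ranges, respectively, which is the pattern that the centers and ranges representation makes explicit.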
2.2 Population formulation
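In this section, we seek population formulations for the symbolic mean, variance, covariance, correlation, covariance matrices, and correlation matrices. We start by rewriting (2.3) to (2.9) in terms of centers and ranges using (2.2). This leads to simpler expressions, with a clearer interpretation. A detailed derivation of the new expressions can be found in [13]. Writing $\bar{c}_j = \frac{1}{n} \sum_{i=1}^{n} c_{ij}$ for the sample mean of the centers, the sample symbolic mean, variances, and covariances are now:
(2.10) $\bar{x}_j = \bar{c}_j$
(2.11) $s_{jj}^{(1)} = \frac{1}{n} \sum_{i=1}^{n} (c_{ij} - \bar{c}_j)^2$
(2.12) $s_{jj}^{(2)} = \frac{1}{n} \sum_{i=1}^{n} (c_{ij} - \bar{c}_j)^2 + \frac{1}{4n} \sum_{i=1}^{n} r_{ij}^2$
(2.13) $s_{jj}^{(3)} = \frac{1}{n} \sum_{i=1}^{n} (c_{ij} - \bar{c}_j)^2 + \frac{1}{12n} \sum_{i=1}^{n} r_{ij}^2$
(2.14) $s_{jl}^{(1)} = \frac{1}{n} \sum_{i=1}^{n} (c_{ij} - \bar{c}_j)(c_{il} - \bar{c}_l)$
(2.15) $s_{jl}^{(2)} = \frac{1}{n} \sum_{i=1}^{n} (c_{ij} - \bar{c}_j)(c_{il} - \bar{c}_l) + \frac{1}{4n} \sum_{i=1}^{n} r_{ij} r_{il}$
(2.16) $s_{jl}^{(3)} = \frac{1}{n} \sum_{i=1}^{n} (c_{ij} - \bar{c}_j)(c_{il} - \bar{c}_l) + \frac{1}{12n} \sum_{i=1}^{n} r_{ij} r_{il}$
The sample symbolic mean corresponds to the sample mean of the centers. The first sample symbolic variance corresponds to the sample variance of the observed centers. The other two definitions of variance add to the first one the sample second order moment of the ranges, weighted differently in each case; in Definition 2 this weight is 1/4 and in Definition 3 it is 1/12. Definition 2 of sample variance can also be interpreted as the sample variance of the centers plus the sample second order moment of the interval half-ranges.
Regarding the sample covariances, the first definition corresponds to the sample covariance between the centers of two interval-valued variables. The second and third definitions add to the first one the sample moment of the product between the $j$th and $l$th ranges, weighted by 1/4 and 1/12, respectively. If $j = l$, then each definition of sample symbolic covariance equals the corresponding definition of sample symbolic variance.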
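To make the step from interval limits to centers and ranges concrete, the following worked expansion treats Definition 2 of the sample symbolic variance. It is a sketch under the assumptions that each interval is written as $a_{ij} = c_{ij} - r_{ij}/2$ and $b_{ij} = c_{ij} + r_{ij}/2$, that $\bar{x}_j$ is the mean of the centers, and that the normalization is $1/n$.

```latex
\begin{align*}
s_{jj}^{(2)}
  &= \frac{1}{2n} \sum_{i=1}^{n}
     \left[ (a_{ij} - \bar{x}_j)^2 + (b_{ij} - \bar{x}_j)^2 \right] \\
  &= \frac{1}{2n} \sum_{i=1}^{n}
     \left[ \left( c_{ij} - \bar{x}_j - \tfrac{r_{ij}}{2} \right)^2
          + \left( c_{ij} - \bar{x}_j + \tfrac{r_{ij}}{2} \right)^2 \right] \\
  &= \frac{1}{2n} \sum_{i=1}^{n}
     \left[ 2 (c_{ij} - \bar{x}_j)^2 + \tfrac{r_{ij}^2}{2} \right] \\
  &= \frac{1}{n} \sum_{i=1}^{n} (c_{ij} - \bar{x}_j)^2
     + \frac{1}{4n} \sum_{i=1}^{n} r_{ij}^2 .
\end{align*}
```

The cross terms in $r_{ij}$ cancel because the two limits sit symmetrically around the center, which is why only the second order moment of the ranges remains.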
The sample definitions can be used as a starting point for the population definitions. Consider that a sample is written as the realization of a random sample of $\mathbf{C}$ and $\mathbf{R}$. Then, population characteristics are obtained as limiting values (in the sense of almost sure convergence, when $n$ goes to infinity) of the estimators whose realizations lead to the values (2.10) to (2.16). This is a direct application of the strong law of large numbers. These results are summarized in the following theorem.
Theorem 2.1.
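Let $\mathbf{C}$ and $\mathbf{R}$ be the random vectors of centers and ranges associated with the interval-valued random vector $\mathbf{X}$, where the covariance matrix of $\mathbf{C}$, $\boldsymbol{\Sigma}_{CC}$, and $E(\mathbf{R}\mathbf{R}^t)$ exist. Let $(\mathbf{C}_i, \mathbf{R}_i)$, $i = 1, 2, \ldots$, be a sequence of random vectors independent and identically distributed to $(\mathbf{C}, \mathbf{R})$. Then, for $j, l = 1, \ldots, p$, the strong law of large numbers guarantees that, with $S$ denoting the estimator whose realization is the corresponding $s$ in (2.10) to (2.16):
(2.17) $\bar{C}_j = \frac{1}{n} \sum_{i=1}^{n} C_{ij} \xrightarrow{a.s.} E(C_j)$
(2.18) $S_{jj}^{(1)} \xrightarrow{a.s.} \mathrm{Var}(C_j)$
(2.19) $S_{jj}^{(2)} \xrightarrow{a.s.} \mathrm{Var}(C_j) + \frac{1}{4} E(R_j^2)$
(2.20) $S_{jj}^{(3)} \xrightarrow{a.s.} \mathrm{Var}(C_j) + \frac{1}{12} E(R_j^2)$
(2.21) $S_{jl}^{(1)} \xrightarrow{a.s.} \mathrm{Cov}(C_j, C_l)$
(2.22) $S_{jl}^{(2)} \xrightarrow{a.s.} \mathrm{Cov}(C_j, C_l) + \frac{1}{4} E(R_j R_l)$
(2.23) $S_{jl}^{(3)} \xrightarrow{a.s.} \mathrm{Cov}(C_j, C_l) + \frac{1}{12} E(R_j R_l)$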
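The convergence stated in the theorem can be illustrated with a small simulation. The sketch below is not part of the paper: it assumes, purely for illustration, standard normal centers and ranges uniform on $[0, 2]$ (so that the variance of the center is 1 and the second order moment of the range is 4/3), and the function name is hypothetical.

```python
import random

def sample_symbolic_variances(centers, ranges):
    """Return the three sample symbolic variances (Definitions 1, 2, 3)."""
    n = len(centers)
    c_bar = sum(centers) / n
    var_c = sum((c - c_bar) ** 2 for c in centers) / n
    m2_r = sum(r * r for r in ranges) / n  # second order moment of the ranges
    return var_c, var_c + m2_r / 4, var_c + m2_r / 12

random.seed(0)
n = 100_000
# Assumed model: C ~ Normal(0, 1) and R ~ Uniform(0, 2), drawn independently.
centers = [random.gauss(0.0, 1.0) for _ in range(n)]
ranges = [random.uniform(0.0, 2.0) for _ in range(n)]

s1, s2, s3 = sample_symbolic_variances(centers, ranges)
# Limits implied by the assumed model: Var(C) = 1 and E(R^2) = 4/3.
limits = (1.0, 1.0 + (4 / 3) / 4, 1.0 + (4 / 3) / 12)
for estimate, limit in zip((s1, s2, s3), limits):
    print(f"estimate {estimate:.4f}   limit {limit:.4f}")
```

With a sample of this size the three estimates should sit close to their limits; repeating the run with a larger `n` shrinks the gap further.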
Theorem 2.1 motivates a new notation to represent the population symbolic means, symbolic variances, and symbolic covariances.
Definition 2.2.
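Let $X_j$ and $X_l$ be two interval-valued random variables with centers, $C_j, C_l$, and ranges, $R_j, R_l$, whose moments $E(C_j C_l)$ and $E(R_j R_l)$ exist. Then, the population symbolic mean of $X_j$ is $E(X_j) = E(C_j)$, $j = 1, \ldots, p$. The $k$th definition of population symbolic covariance between $X_j$ and $X_l$ is $\mathrm{Cov}_k(X_j, X_l) = \mathrm{Cov}(C_j, C_l) + \delta_k E(R_j R_l)$, ($k = 1, 2, 3$) where $\delta_1 = 0$, $\delta_2 = 1/4$, and $\delta_3 = 1/12$. Finally, the $k$th definition of population symbolic variance is defined as $\mathrm{Var}_k(X_j) = \mathrm{Cov}_k(X_j, X_j)$, $k = 1, 2, 3$.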
The definitions of population symbolic mean, variance, and covariance may be extended for intervalvalued random vectors, similarly to what is done for conventional random vectors.
The natural approach for defining symbolic covariance matrices is to base them on definitions of symbolic variance and covariance that correspond to each other. This naturally leads to three definitions of symbolic covariance matrices. However, some known algorithms for symbolic principal component analysis have considered symbolic covariance matrices that combine definitions of symbolic variance and covariance that do not correspond to each other (vide [30, 13] for more details). This strategy has the drawback that the symbolic variance does not correspond to the sample symbolic covariance of an interval-valued variable with itself. However, some authors argue that such combinations may lead to better performance [12]. Thus, besides the three natural definitions referred to above, we consider two others: Definition 4, combining Definition 2 of symbolic variance with Definition 1 of symbolic covariance, proposed by Cazes et al. [31], and Definition 5, combining Definition 3 of symbolic variance with Definition 1 of symbolic covariance, proposed by Wang et al. [12]. These combinations are summarized in the next definition.
Definition 2.3.
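Let $\mathbf{C}$ and $\mathbf{R}$ be the random vectors of centers and ranges associated with the interval-valued random vector, $\mathbf{X}$, where the covariance matrix of the centers, $\boldsymbol{\Sigma}_{CC}$, and $E(\mathbf{R}\mathbf{R}^t)$ exist. Then, the population symbolic mean vector of $\mathbf{X}$ is $E(\mathbf{C})$ and $\boldsymbol{\Sigma}_k$, $k = 1, \ldots, 5$, are the population symbolic covariance matrices obtained according to the combinations of symbolic variances and covariances listed in Table 1.
Variance  Covariance  Symbolic Covariance Matrix
(1)  (1)  $\boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_{CC}$
(2)  (2)  $\boldsymbol{\Sigma}_2 = \boldsymbol{\Sigma}_{CC} + \frac{1}{4} E(\mathbf{R}\mathbf{R}^t)$
(3)  (3)  $\boldsymbol{\Sigma}_3 = \boldsymbol{\Sigma}_{CC} + \frac{1}{12} E(\mathbf{R}\mathbf{R}^t)$
(2)  (1)  $\boldsymbol{\Sigma}_4 = \boldsymbol{\Sigma}_{CC} + \frac{1}{4} \mathrm{Diag}(E(\mathbf{R}\mathbf{R}^t))$
(3)  (1)  $\boldsymbol{\Sigma}_5 = \boldsymbol{\Sigma}_{CC} + \frac{1}{12} \mathrm{Diag}(E(\mathbf{R}\mathbf{R}^t))$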
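To unify notation, we consider that for $k = 1, 2, 3$, $\boldsymbol{\Sigma}_k = \boldsymbol{\Sigma}_{CC} + \delta_k E(\mathbf{R}\mathbf{R}^t)$, for $k = 4, 5$, $\boldsymbol{\Sigma}_k = \boldsymbol{\Sigma}_{CC} + \delta_k \mathrm{Diag}(E(\mathbf{R}\mathbf{R}^t))$, and $\delta_1 = 0$, $\delta_2 = \delta_4 = 1/4$, $\delta_3 = \delta_5 = 1/12$.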
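The five combinations can be sketched in Python using sample counterparts of the population quantities. This is an illustration under stated assumptions, not the authors' implementation: the covariance of the centers and the second order moments of the ranges are estimated with $1/n$ normalization, and all function names and data are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)

def symbolic_cov_matrices(centers, ranges):
    """Return {k: matrix} for k = 1..5 from n x p lists of centers and ranges."""
    p = len(centers[0])
    c_bar = [mean([row[j] for row in centers]) for j in range(p)]
    # Sample covariance matrix of the centers.
    s_cc = [[mean([(row[j] - c_bar[j]) * (row[l] - c_bar[l]) for row in centers])
             for l in range(p)] for j in range(p)]
    # Sample second order moments of the ranges.
    m_rr = [[mean([row[j] * row[l] for row in ranges])
             for l in range(p)] for j in range(p)]
    weights = {1: 0.0, 2: 1 / 4, 3: 1 / 12, 4: 1 / 4, 5: 1 / 12}
    matrices = {}
    for k, w in weights.items():
        diagonal_only = k in (4, 5)  # these keep Definition 1 off the diagonal
        matrices[k] = [[s_cc[j][l]
                        + (w * m_rr[j][l] if (j == l or not diagonal_only) else 0.0)
                        for l in range(p)] for j in range(p)]
    return matrices

# Hypothetical centers and ranges for three objects and two variables.
centers = [[2.0, 1.0], [4.0, 1.5], [4.5, 5.0]]
ranges = [[2.0, 2.0], [4.0, 1.0], [1.0, 4.0]]
mats = symbolic_cov_matrices(centers, ranges)
for k in sorted(mats):
    print(k, mats[k])
```

Matrices 4 and 5 share their off-diagonal entries with matrix 1 and their diagonals with matrices 2 and 3, respectively, which is exactly the mixing of definitions described above.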
The definitions of symbolic correlation matrices follow directly from those of symbolic covariance matrices, in the following way:
Definition 2.4.
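Let $\mathbf{C}$ and $\mathbf{R}$ be the random vectors of centers and ranges, respectively, associated with the interval-valued random vector, $\mathbf{X}$, where $\boldsymbol{\Sigma}_{CC}$ and $E(\mathbf{R}\mathbf{R}^t)$ exist, and let $\mathbf{D}_k = \mathrm{Diag}(\boldsymbol{\Sigma}_k)$ (the $k$ denotes an index and not an exponent) have positive diagonal entries. Then, the symbolic correlation matrices associated with definitions in Table 1 are given by $\mathbf{P}_k = \mathbf{D}_k^{-1/2} \boldsymbol{\Sigma}_k \mathbf{D}_k^{-1/2}$ and, equivalently, $\boldsymbol{\Sigma}_k = \mathbf{D}_k^{1/2} \mathbf{P}_k \mathbf{D}_k^{1/2}$, $k = 1, \ldots, 5$. Moreover, the $k$th definition of symbolic correlation between two interval-valued random variables, $X_j$ and $X_l$, is
(2.24) $\rho_k(X_j, X_l) = \frac{[\boldsymbol{\Sigma}_k]_{jl}}{\sqrt{[\boldsymbol{\Sigma}_k]_{jj} \, [\boldsymbol{\Sigma}_k]_{ll}}}$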
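Passing from a covariance matrix to the corresponding correlation matrix amounts to dividing each entry by the square roots of the two matching diagonal entries. The sketch below assumes this standard scaling and a covariance matrix with strictly positive diagonal; the function name and the numbers are hypothetical.

```python
import math

def correlation_from_covariance(cov):
    """Scale a covariance matrix (list of lists) by its diagonal entries."""
    p = len(cov)
    std = [math.sqrt(cov[j][j]) for j in range(p)]  # requires a positive diagonal
    return [[cov[j][l] / (std[j] * std[l]) for l in range(p)] for j in range(p)]

# Hypothetical symbolic covariance matrix for two interval-valued variables.
sigma = [[4.0, 1.0],
         [1.0, 9.0]]
corr = correlation_from_covariance(sigma)
print(corr)
```

The same function applies to any of the five symbolic covariance matrices, since only the entries of the matrix differ between definitions.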
2.3 Properties
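To establish properties of symbolic means, variances, and covariances we need to introduce some basic results from Moore’s Interval Algebra [32]. These results are listed in [33], but we write them here in terms of interval-valued random variables described by their centers and ranges.
Definition 2.5.
Let $X_j$ be an interval-valued random variable with center $C_j$ and range $R_j$, and $\alpha_j, \beta \in \mathbb{R}$, $j = 1, 2$, then we define:
Sum of intervals: $X_1 + X_2$ has center $C_1 + C_2$ and range $R_1 + R_2$.
Difference of intervals: $X_1 - X_2$ has center $C_1 - C_2$ and range $R_1 + R_2$.
Adding a constant: $X_1 + \beta$ has center $C_1 + \beta$ and range $R_1$.
Multiplication by a constant: $\alpha_1 X_1$ has center $\alpha_1 C_1$ and range $|\alpha_1| R_1$.
Linear combination of an interval: $\alpha_1 X_1 + \beta$ has center $\alpha_1 C_1 + \beta$ and range $|\alpha_1| R_1$.
Linear combination of two intervals: $\alpha_1 X_1 + \alpha_2 X_2$ has center $\alpha_1 C_1 + \alpha_2 C_2$ and range $|\alpha_1| R_1 + |\alpha_2| R_2$.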
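These interval operations translate directly into code when an interval is stored as a center and a range. The class below is a minimal sketch with hypothetical names (`Interval`, `rng`, `affine`); it follows standard interval arithmetic, in which ranges add under both sums and differences and are scaled by the absolute value of a constant.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    center: float
    rng: float  # range (width) of the interval, assumed non-negative

    @classmethod
    def from_limits(cls, a, b):
        return cls((a + b) / 2, b - a)

    @property
    def limits(self):
        return (self.center - self.rng / 2, self.center + self.rng / 2)

    def __add__(self, other):
        # Sum of intervals: centers add, ranges add.
        return Interval(self.center + other.center, self.rng + other.rng)

    def __sub__(self, other):
        # Difference of intervals: centers subtract, ranges still add.
        return Interval(self.center - other.center, self.rng + other.rng)

    def affine(self, alpha, beta=0.0):
        # alpha * X + beta: the range is scaled by |alpha|, beta only shifts.
        return Interval(alpha * self.center + beta, abs(alpha) * self.rng)

x = Interval.from_limits(1.0, 3.0)
y = Interval.from_limits(-2.0, 4.0)
print((x + y).limits, (x - y).limits, x.affine(-2.0, 1.0).limits)
```

Note that `x - x` is not the degenerate interval at zero: its range is twice that of `x`, which is one reason symbolic variances do not behave exactly like their conventional counterparts under differences.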
The definition of population symbolic mean (vide Definition 2.2) leads to the following properties:
Proposition 2.1.
Let , , be interval-valued random variables such that exist. Assuming , , the symbolic mean verifies the following properties:

.

.

.
Since proofs of these properties are quite straightforward to obtain using Definition 2.3 and the Interval Algebra results summarized in Definition 2.5, they are omitted.
The definitions of symbolic variance (vide Definition 2.3) lead to the following properties:
Proposition 2.2.
Let be an interval-valued random variable, such that and exist. Assuming , , the symbolic variances, , where , , and verify the following properties ():

.

if and only if is almost surely a constant, i.e. , and is almost surely null, i.e. .

If is a conventional random variable, i.e. , then .

. As a consequence, if then .
Once again, the proof of the previous properties follows immediately from definitions 2.3 and 2.5, and is therefore not included here.
To further study the properties of symbolic variance and covariance, we need to separate the cases where (, , and ), from those where ( and ) (vide Table 1 in Definition (2.3)).
Proposition 2.3.
Let , , be interval-valued random variables such that and exist, and , . Symbolic variances and covariances verify the following properties:


, .

If is a conventional random variable, i.e. , then , .

, .

, .

Moreover, if and are both negative or both positive values then , for all .


If and are two different interval-valued random variables, then

If and are two different interval-valued random variables, then
Proof.
The proof of these results is given in Appendix A. ∎
Regarding the symbolic correlation, the next theorem establishes that it is a quantity between -1 and 1, as in the conventional case.
Theorem 2.2.
Let $X_j$, $j = 1, 2$, be two interval-valued random variables such that their symbolic correlation, $\rho_k(X_1, X_2)$, defined by (2.24) exists, $k = 1, \ldots, 5$. Then $-1 \le \rho_k(X_1, X_2) \le 1$.
Proof.
For , considering the definitions of and the CauchySchwarz inequality we obtain:
thus,
As can easily be proved, for all nonnegative values of and , , making and , we conclude that .
For , an analogous deduction can be made. In fact, similarly to (2.3):
∎
As usual, we are also interested in understanding whether symbolic correlations are invariant under linear transformations and in which cases they reach their extreme values (-1 and 1). The next theorem clarifies these issues.
Theorem 2.3.
Let , , be interval-valued random variables such that their symbolic correlation, , defined by (2.24) exists and . Then the following properties hold:

For , . For , if and only if , i.e. is a conventional random variable.

where