The evaluation of the exact full likelihood may be difficult or even impossible in situations where complex problems deal with large correlated datasets. These problems are likely to occur for instance in spatial statistics and time series frameworks in which direct computation of the normalising constant can be a very challenging task, entailing multidimensional integration of the full joint density for each value of the parameter. Many approaches have been proposed to tackle this issue. in this paper, we investigate an appealing approach based on the score matching estimator proposed by Hyvärinen in 2005 .
The score matching estimator can be regarded as a specific case of estimators derived from proper scoring rules (see Dawid and Musio (2014) 
), which are loss functions for measuring the quality of probabilistic forecasts. In particular, this estimator derives from the Hyvärinen scoring rule, which is a homogeneous proper scoring rule (see Ehm and Gneiting (2012) and Parry et al. (2012) ), namely a proper scoring rule which does not require knowledge of the normalising constant. Homogeneous scoring rules have been characterised for continuous real variables (Parry et al. (2012) ) and for discrete variables (Dawid et al. (2012) ). In a Bayesian framework, Dawid and Musio (2015)  have shown, for the case of continuous variables, how to handle Bayesian model selection with improper within-model prior distributions, by exploiting the use of homogeneous proper scoring rules. The discrete counterpart has been empirically studied by Dawid et al. (2017) . In a recent contribution, Shao et al. (2018)  consider the use of the Hyvärinen score for model comparison. Although the majority of contributions involving the use of Hyvärinen scoring rules focus on Euclidean spaces, scholars have also investigated extensions to non-Euclidean spaces: for an early study see Dawid (2007) . Recently, Mardia et al. (2016)  proposed an extension of the Hyvärinen scoring rule to compact oriented Riemannian manifolds, and Takasu et al. (2018)  constructed a novel class of homogeneous strictly proper scoring rules for statistical models on spheres.
Given the growing interest in the use of this scoring rule for very complex statistical models, in this paper we aim to derive an estimation method based on the Hyvärinen scoring rule for estimating linear Gaussian time series models.
We distinguish two separate cases: a first in which the length of a single time series increases to infinity, and a second in which the length of the time series is fixed and the number of series increases to infinity.
The consistency and asymptotic distribution of the Hyvärinen estimator are derived for the case of a single time series of increasing length. In particular, under some mild regularity conditions we derive consistency of the proposed estimator for linear Gaussian time series models, while the asymptotic distribution is proved in the specific case of autoregressive moving average (ARMA) causal invertible models. For time series with fixed length and the number of time series increasing to infinity the performances of two estimators based on the Hyvärinen scoring rule, namely the total Hyvärinen estimator and the matrix Hyvärinen estimator are compared through simulation studies with the full maximum likelihood and the pairwise maximum likelihood estimators. Three simple time series models have been considered in the design of simulations: autoregressive (AR), moving average (MA) and fractionally differenced white noise (ARFIMA).
Different behaviours can be detected for the total Hyvärinen estimator among the settings examined. In particular, it outperforms the pairwise likelihood estimator in terms of efficiency with respect to the full maximum likelihood estimator for the MA and ARFIMA processes.
The paper unfolds as follows. Section 2 introduces basic notions on scoring rules. In Section 3 we introduce the Hyvärinen scoring rule for Gaussian linear time series. Some asymptotic results for the Hyvärinen estimator are given. In the specific case of independent series we describe the total Hyvärinen estimator and the matrix Hyvärinen estimator. Section 4 summarises the results of the simulation studies. Section 5 presents some concluding remarks. Technical details are postponed to the Appendices.
2 Scoring rules
A scoring rule
is a loss function designed to measure the quality of a proposed probability distribution
, for a random variable, in light of the outcome of . Specifically, if a forecaster quotes a predictive distribution for and the event realises, then the forecaster’s loss will be . The expected value of when has distribution is denoted by .
The scoring rule is proper (relative to the class of distributions ) if
It is strictly proper if equality obtains only when .
Any proper scoring rule gives rise to a general method for parameter estimation, based on an unbiased estimating equation: see §2.2 below.
2.1 Examples of proper scoring rules
Some important proper scoring rules are the log-score, (Good (1952) ), where is the density function of , which recovers the full (negative log) likelihood; and the Brier score (Brier (1950) ). A particularly interesting case, which avoids the need to compute the normalising constant, produces the score matching estimation method of Hyvärinen (2005) , based on the following proper scoring rule:
where ranges over the whole of supplied with the Euclidean norm , is assumed twice continuously differentiable, and is the realised value of . In (2), denotes the gradient operator, and the Laplacian operator, with respect to . For we can express
The scoring rule (2) is a 2-local homogeneous proper scoring rule (see Parry et al. (2012) ). It is homogeneous in the density function , i.e. its value is unaffected by applying a positive scale factor to the density , and so can be computed even if we only know the density function up to a scale factor. Inference performed using any homogeneous scoring rule does not require knowledge of the normalising constant of the distribution.
2.2 Estimation based on proper scoring rules
Let be independent realisations of a random variable , having distribution depending on an unknown parameter , where is an open subset of . Given a proper scoring rule , let denote .
Inference for the parameter may be performed by minimising the total empirical score,
resulting in the minimum score estimator, .
Under broad regularity conditions on the model (see e.g. Barndorff-Nielsen & Cox (1994) ), satistfies the score equation:
, the gradient vector ofwith respect to . The score equation is an unbiased estimating equation (Dawid & Lauritzen (2005) ). When is the log-score, the minimum score estimator becomes the maximum likelihood estimator.
From the general theory of unbiased estimating functions, under broad regularity conditions on the model the minimum score estimate
is asymptotically consistent and normally distributed:where denotes the Godambe information matrix where is the variability matrix, and is the sensitivity matrix. In contrast to the case for the full likelihood, and are different in general: see Dawid & Musio (2014) , Dawid et al (2016). We point out that estimation of the matrix , and (to a somewhat lesser extent) of the matrix , is not an easy task: see Varin (2008) , Varin et al. (2011)  and Cattelan and Sartori (2016) .
3 Gaussian linear time series models
In this section we introduce some results based on the use of the Hyvärinen scoring rule in the setting of Gaussian linear time series models.
Consider the Gaussian linear time series model
parametrised by , and , where for , satisfies and . The are iid Gaussian variables with mean
and variance. Let be the vector of model parameters. The autocovariance function is , say. Using basic differentiation rules, it is easy to find the Hyvärinen score based on the single time series :
where the matrix has entry and is the entry of . We will denote denote the Hyvärinen estimator based on a single series by .
3.1 Asymptotic results for a single time series
In this section we analyse the asymptotic statistical properties of the Hyvärinen scoring rule estimator, based on (7), for a single time series.
The following theorem shows the consistency of the estimator in the Gaussian linear time series setting. The proof of the Theorem is deferred to Appendix and follows arguments similar to those used by Davis and Yau (2011)  to prove the consistency of the pairwise likelihood estimator.
Suppose is the linear process in (6) with and parameter . Let
be the minimum score estimator, where and , where is a compact set. If the identifiability condition
is satisfied, then as .
Once consistency has been proved, we focus on the asymptotic distribution of . Its analytic form involves the elements of the inverse of the autocovariance matrix. In order to guarantee its absolute summability, we restrict our attention to the case of ARMA causal invertible processes.
Defining , the gradient and the Hessian with respect to are given, respectively, by the following two expressions:
where denotes differentiation with respect to the components of the vector .
Suppose that is an causal and invertible process. Furthermore, assume that the Hessian matrix is invertible in a neighbourhood of . If the identifiability condition (8) holds, then
Theorem 2 shows that the Hyvärinen scoring rule estimator , in the case that is an ARMA causal invertible process, is asymptotically normally distributed with rate of decay . As is well known, the autocovariance function of an ARMA process decays exponentially, which means that an ARMA process is a short memory process, and its autocovariance function is absolutely summable Brockwell & Davis (1991) . This property, together with the duality of ARMA models under causality and invertibility, allows us to prove asymptotic normality. For the complete proof refer to the appendix.
3.2 Estimation approaches for independent time series
In the remainder of this section we discuss the case of independent series of length . We assume that is fixed while increases to infinity. We also specialise to the case that the common mean and innovation variance are known; without loss of generality we take .
Consider now independent and identically distributed processes , where , each having the -variate normal distribution with mean-vector and variance covariance matrix , with unknown parameter . Let the random matrix have the vector as its th row. We define the total Hyvärinen score (HT) as the sum of single Hyvärinen scores in (7):
The estimate of minimising the total Hyvärinen score will be denoted by .
An alternative approach is to consider as basic observable the sum-of-squares-and-products matrix , which is a sufficient statistic for the multivariate normal model, having the Wishart distribution with degrees of freedom and scale matrix . Then inference for the parameter can be performed by resorting to the Hyvärinen score based directly on the Wishart model. We will call this scoring rule the matrix Hyvärinen score.
, so that the joint distribution of the upper triangleof the sum-of-squares-and-products random matrix SSP (which has a Wishart distribution with parameters and ) has a density, and taking into consideration all of the properties of the derivatives of traces and determinants, it can be shown that the Hyvärinen score based on this joint density is
where and are the elements of the inverse matrices and of , respectively. The matrix Hyvärinen estimator for , minimising with respect to , will be denoted by .
The derivative of with respect to is
and since (see Kollo & von Rosen, p. 257). Moreover, . The derivation of the function , which after taking account of the square of (14) reduces to
involves calculations requiring the covariance matrix of the random matrix , which has an Inverse Wishart distribution with scale matrix : see Von Rosen (1998)  for details on the components of the covariance matrix.
In general, the Godambe information needed to estimate the standard error ofis not easy to derive analytically due to the form of the matrix . It should be pointed out that this approach can not be used if we have a single time series of length with increasing to , since for non-singularity of the Wishart distribution we need to assume .
4 Numerical assessment
In this section we report simulation studies designed to assess and compare the behaviours of the estimators obtained by using the total and the matrix Hyvärinen estimators. We refer to the case described in paragraph 3.2 in which is fixed and increases to . For comparison, we will consider also the full and pairwise maximum likelihood estimators (Davis & Yau (2011)  ). We discuss three examples: the first order autoregressive AR(1), the first order moving average MA(1) and the fractionally differenced white noise . Various parameter settings are considered in all simulation studies. All calculations have been done in the statistical computing environment R (
). The summary statistics shown are: average estimates of the parameters, asymptotic standard deviations () and the asymptotic relative efficiency (ARE) with respect to the maximum likelihood estimator.
4.1 First order autoregressive models
The stationary univariate autoregressive process of order , denoted by , is defined by
where is a Gaussian white noise process with mean and variance . Here , with , is the autoregressive parameter. Then are jointly normal with mean vector (where is the -dimensional unit vector), and covariance matrix , with having components ().
For comparison purposes we consider also the numerical performance of a class of pairwise likelihood estimators. Since, in the time series considered, dependence decreases in time, as in Davis & Yau (2011)  we shall restrict to the first order consecutive pairwise likelihood, rather than the complete pairwise likelihood, so that adjacent observations are more closely related than the others. This choice is motivated also by the loss in efficiency incurred in using the -th order consecutive pairwise likelihood as increases (see Davis and Yau (2011) ; Joe and Lee (2009) ). Note that, when it is known that but is unknown, the pairwise likelihood estimator of is , which is also the Yule-Walker estimator (Davis & Yau (2011) ).
The values of the model parameters are and , with the autoregressive parameter . In the simulation study, replicates are generated of processes of length . Results are summarised in Table 1. The numerical results in Table 1 and in the left-hand panel of Figure 1 suggest that and do not have high efficiency as approaches : in particular, the asymptotic efficiency of tends to for large values of . In contrast, under the same model setting, there is only a modest loss of efficiency for the pairwise likelihood-based estimator .
4.2 First order moving average models
The univariate moving average process of order , denoted by , is defined by
where and are independent Gaussian random variables with mean and variance . Then the random variables are jointly normal, each having mean and variance . The variables and are independent for , while and have covariance (). Hence, the covariance matrix of has components , if , otherwise.
As before we consider the first order consecutive pairwise likelihood since the use of a higher order consecutive pairwise likelihood is unrealistic due to the independence of and for . For , the pair
has a bivariate Gaussian distribution, in which the two components both have meanand variance , and have covariance .
The values of the model parameters are and , with the moving average parameter . In the simulation study, replicates are generated of processes of length . Results are summarised in Table 2. The simulation shows that the total Hyvärinen estimator achieves the same efficiency as the MLE in the model for values of the moving average parameter near ; see Table 2 and the right-hand panel of Figure 1. However, the loss in efficiency of the total Hyvärinen estimator is modest even when the absolute value of the moving average parameter reaches . In contrast, the pairwise likelihood estimator shows very poor performances in terms of asymptotic relative efficiency: the ARE ranges from to as increases.
4.3 Fractionally differenced white noise
The fractionally differenced white noise, , model is defined by
where is the lag operator and , and are independent Gaussian random variables with mean and variance . Then the random variables are jointly normal, with covariance matrix whose components (see Hosking (1981) ) are
(where in the right-hand side of (17), denotes the gamma function.)
As before we consider the first order consecutive pairwise likelihood since no great improvement can be detected by using a higher order consecutive pairwise likelihood: see the results of Davis and Yau (2011) . For , the pair has a bivariate Gaussian distribution, in which the two components both have mean and variance , and have covariance .
The values of the model parameters are and , with the fractional parameter . In the simulation study, replicates are generated of processes of length . Results are sumarised in Table 3. Simulation 3 shows that the total Hyvärinen estimator achieves the same efficiency as the MLE in the model near and near ; see Table 3 and the right-hand panel of Figure 1. The loss in efficiency of the total Hyvärinen estimator is very slight when . The efficiency of is poor with ARE values ranging from to . For all the estimators considered the ARE is 0 when . The pairwise estimator performs better than , however the values of ARE range from 0.6 to 0.96, reaching a maximum when , with a major loss of efficiency with respect to the total Hyvärinen estimator.
It should be noted that for the and the models no analytic expressions for the derivatives of (7) are available. The standard deviations of , and are empirical estimates of the square root of the Godambe information function, which is obtained by compounding the empirical estimates of and . The standard deviations of the pairwise maximum likelihood estimator and the maximum likelihood estimator are obtained using the analytic expressions (see Pace et al. (2011) ) for the model and the empirical counterparts for the model. Numerical evaluation of scoring rule derivatives has been carried out using the R package numDeriv; see Gilbert & Varadhan (2012) .
Results from simulations reveal that the estimators considered produce estimates very close to the true values of the parameters. However, results not shown here suggest that when the length of the series is small the pairwise likelihood estimator performs worse in terms of bias than the other estimators in all the models considered.
The left, the middle and right-hand panels of Figure 1 depict the asymptotic relative efficiency as a function of for the model, as a function of for the model, and as a function of for the model, respectively.
All the results of the simulation studies are in agreement with the findings of Davis & Yau (2011)  who focus on pairwise likelihood-based methods for linear time series.
In this paper we have considered the use of Hyvärinen scoring rules in linear time series estimation under different conditions. We have established the consistency of the Hyvärinen scoring rule estimator for a single times series under some general conditions and its asymptotic normality in an ARMA time series context.
We have investigated, for
independent time series, the performances of two estimators based on the Hyvärinen scoring rule, which can be regarded as a surrogate for a complex full likelihood. The properties of the estimators found using this scoring rule are compared with the full and pairwise maximum likelihood estimators. Three simple models are discussed: the first a stationary first order autoregressive model, the second a first order moving average model and the third a fractionally differenced white noise. In the first case the total Hyvärinen method leads to poor estimators; in contrast, in the second and third this method produces good estimators. The opposite behaviour is observed for the pairwise estimators. For the moving average process and the fractionally differenced white noise, there can be a large gain in efficiency, as compared to the pairwise likelihood method, by using the total or the matrix Hyvärinen scoring rule estimators. For the autoregressive model, in contrast, the total Hyvärinen score methods suffer a loss of efficiency asapproaches .
In all examples, results not reported here show that there is a great improvement in the performances of the matrix Hyvärinen estimator based on the Wishart model as the ratio becomes negligible. It is clear that the loss of efficiency incurred in using the Hyvärinen scoring rules or pairwise likelihood can be substantial, but this depends on the underlying model, and no overall general principle has emerged that might offer guidance for cases not yet considered. The matrix Hyvärinen estimator has the apparent advantage over the other estimators (apart from full maximum likelihood) of being based on the sufficient statistic of the model; nevertheless the total Hyvärinen estimator shows good performance in terms of efficiency.
We conclude that minimising the total Hyvärinen score may offer a viable and useful approach to estimation in models where computation of the normalising constant in the likelihood function is not feasible, and pairwise likelihood methods lead to poor estimators.
Monica Musio was supported by the project GESTA of the Fondazione di Sardegna and Regione Autonoma di Sardegna.
Appendix: Proof of Theorem 1
Let and let denote the expectation with respect to the probability distribution for defined in Equation (6). Let denote the true parameter value. From the ergodicity of , it follows that is ergodic and stationary and therefore
Since the Hyvärinen score is strictly proper we have
with equality if and only if , by the identifiability condition (8). The approach used to derive the consistency of the total Hyvärinen estimator now follows the same general argument used to derive the consistency of the pairwise likelihood estimator in Davis & Yau (2011) .
In particular, the compactness of and the inequality (19) are used as devices for proving the claim.
Appendix: Proof of Theorem 2
Define the sample gradient and Hessian as
Using a Taylor expansion of around and the consistency of Hyvärinen scoring rule estimator, it can be proved that, for some between and ,
The asymptotic distribution of can be derived by exploiting the asymptotic properties of and , together with the ergodic theorem and the fact that .
it can be shown that
In order to calculate the asymptotic distribution of we need to calculate the expectation and the variance of . The calculation of the expectation of follows easily from the unbiasdness of the scoring rule estimating equation (Dawid & Lauritzen (2005) ). However, calculation of the variance of is challenging due to the presence of the non deterministic term
It relies on the following calculation:
The first term in (21) simplifies as
The second term simplifies as
The third term in (21), which involves the fourth cumulant, vanishes as for Gaussian linear processes all the cumulant functions for are identically null Brockwell & Davis (1991) . Hence convergence of is evaluated by considering only the first non-constant term (22).
Equation (22) can be rewritten as
where and . Let and . Without lose of generality, we assume that if . Then the previous expression and consequently the first term in (21) simplifies to
The absolute summability of the autocovariance and the duality properties of autocorrelation and of its inverse for causal invertible autoregressive-moving average processes (see Cleveland (1972) , Chatfield (1979)  and Hosking (1980) ) guarantee the following holds:
-  Almeida, M. P. and B. Gidas (1993). A variational method for estimating the parameters of MRF from complete or incomplete data. Annals of Applied Probability, 3, 103–136.
-  Barndorff-Nielsen, O. E. & Cox, D. R. (1994). Inference and Asymptotics. Chapman & Hall, London.
-  Brier, G. W. (1950). Verification of forecasts expressed in terms of probability, Monthly Weather Review, 78, 1–3.
-  Brockwell, P. J. and Davis, R. A. (1991). Time Series: Theory and Methods. Springer, New York.
-  Cattelan, M. & Sartori, N. (2016). Empirical and simulated adjustments of composite likelihood ratio statistics, Journal of Statistical Computation and Simulation, 86, 1056-1067.
-  Chatfield, C. Inverse Autocorrelations. Journal of the Royal Statistical Society. Series A, 142(3), 363-377.
-  Cleveland, W. S. (1972) The Inverse Autocorrelations of a Time Series and Their Applications. Technometrics, 14(2), 277-293.
-  Davis, R. A. & Yau, C. Y. (2011). Comments on pairwise likelihood in time series models. Statistica Sinica, 21, 255–277.
-  Davison, A. C. (2003). Statistical Models. Cambridge University Press, Cambridge.
-  Dawid, A. P. & Lauritzen, S. L. (2005). The geometry of decision theory. In Proceedings of the Second International Symposium on Information Geometry and its Applications, 22–28. University of Tokyo.
-  Dawid, A.P. (2007). The geometry of proper scoring rules. Ann. Inst. Statist. Math, 59, 77–93.
-  Dawid, A. P., Lauritzen, S. L. and Parry, M. (2012). Proper local scoring rules on discrete sample spaces. Ann. Statist.—, 40, 593–608.
-  Dawid, A. P. & Musio, M. (2014). Theory and applications of proper scoring rules. Metron, 72, 169–183.
-  Dawid, A. P., Musio, M. , & Ventura, L. (2016). Minimum scoring rule inference. Scandinavian Journal of Statistics, 43, 1, 123–134.
-  Dawid, A. P. & Musio, M. (2015). Bayesian model selection based on proper scoring rules (with Discussion). Bayesian Analysis, 10(2), 479–521.
-  Dawid, A. P. & Musio, M., Columbu, S. (2017). A note on Bayesian model selection for discrete data using proper scoring rules, Statistics & Probability Letters, 129, 101–106.
-  Ehm, W., Gneiting, T. Local proper scoring rules of order two Ann. Statist., 40, 609–637.
-  Forbes, P. G. M. & Lauritzen, S. (2015) Linear estimating equations for exponential families with application to Gaussian linear concentration models, Linear Algebra and its Applications, 473, 261–283.
-  Gilbert, P. & Varadhan, R. (2012). numDeriv: Accurate Numerical Derivatives. http://CRAN.R-project.org/package=numDeriv.
-  Good, I. J. (1952). Rational decisions. Journal of the Royal Statistical Society, Series B, 14, 107–114.
-  Hosking, J. R. M. (1980). The Asymptotic Distribution of the Sample Inverse Autocorrelations of an Autoregressive-Moving Average Process. Biometrika, 67(1), 223–226.
-  Hosking, J. R. M. (1981). Fractional differencing. Biometrika, 68(1), 165–176.
Hyvärinen, A. (2005). Estimation of
non-normalized statistical models by score matching.
Journal of Machine Learning Research, 6, 695–709.
-  Hyvärinen, A. (2007). Some extensions of score matching. Computational Statistics and Data Analysis, 51, 2499–2512.
-  Kollo, T. & von Rosen, D. (2005). Advanced Multivariate Statistics with Matrices. Dordrecht: Springer.
-  Kroese, D. P., Taimre, T. & Botev, Z. I. (2011). Handbook of Monte Carlo Methods. John Wiley & Sons, Inc., Hoboken, New Jersey.
Joe, H. & Lee, Y. (2009). On weighting of bivariate
margins in pairwise likelihood.
Journal of Multivariate Analysis, 100, 670–685.
Le Cessie, S. & Van Houwelingen, J. C. (1994). Logistic regression for correlated binary data.Journal of the Royal Statistical Society, Series C, 43, 95–108.
-  Mardia, K. Kent, J., Laha, A. (2016). Score matching estimators for directional distributions. arXiv:1604.08470.
-  Musio, M. & Dawid, A. P. (2013). Local scoring rules: A versatile tool for inference. In Proceedings of the 59th ISI World Statistics Congress, Hong Kong. http://2013.isiproceedings.org/Files/STS019-P3-S.pdf.
-  Pace, L., Salvan, A., & Sartori, N. (2011). Adjusting composite likelihood ratio statistics. Statistica Sinica, 21, 129–148.
Park, J. and Haran, M. (2017). Bayesian Inference in the Presence of Intractable Normalizing Functions.Journal of the American Statistical Association, 113 (523), 1372–1390.
-  Parry, M. F., Dawid, A. P., & Lauritzen, S. L. (2012). Proper local scoring rules. The Annals of Statistics, 40, 561–592.
-  R Core Team (2013). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.
-  Shao, S., Jacob, P. E., Ding, J. & Tarokh, V. (2018). Bayesian model comparison with the Hyvärinen score: computation and consistency. Journal of the American Statistical Association, online first, DOI: 10.1080/01621459.2018.1518237
-  Takasu, Y., Yano, K., Komaki, F. (2018). Scoring rules for statistical models on spheres, Statistics & Probability Letters, 138, 111–115.
-  Varin, C. (2008). On composite marginal likelihoods. AStA Advances in Statistical Analysis, 92, 1–28.
-  Varin, C., Reid, N., & Firth, D. (2011). An overview of composite likelihood methods. Statistica Sinica, 21, 5–42.
von Rosen, D. (1988). Moments for the inverted Wishart distribution.Scandinavian Journal of Statistics, 15, 97–109.