1 Introduction
In directional statistics, the von Mises distribution is a key element in the analysis of circular data. Its density is given by
(1) $f(\theta; \mu, \kappa) = \dfrac{e^{\kappa \cos(\theta - \mu)}}{2 \pi I_0(\kappa)}$,
where $\mu$ is the location parameter, $\kappa$ the concentration parameter, and $I_0$ the modified Bessel function of the first kind and order 0,
(2) $I_0(\kappa) = \dfrac{1}{2\pi} \int_0^{2\pi} e^{\kappa \cos \theta} \, \mathrm{d}\theta$.
The von Mises distribution is a unimodal distribution with its mode at $\mu$. The value $\kappa = 0$ corresponds to the uniform distribution on the circle, and the larger $\kappa$, the more concentrated the distribution is around its mode. The von Mises distribution is a convenient way to model unimodal circular or directional data in many fields of science (Fisher, 1995) and naturally arises in a wide variety of cases (Mardia and Jupp, 2000, §3.5.4). It can be obtained by conditioning a bivariate normal distribution on the circle, is close to the wrapped normal distribution, and tends to a normal distribution when $\kappa \to \infty$. Also, it is a maximum-entropy distribution and can be characterized by the fact that the maximum-likelihood estimate of its location parameter is the sample mean direction. As such, it is an analogue on the circle of the Gaussian distribution. It also appears as the equilibrium distribution of a von Mises process.
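These properties can be checked numerically. The sketch below uses SciPy's `vonmises` distribution; this is our choice of tooling for illustration (the simulations reported later rely on MATLAB's CircStat), and the specific values of $\mu$ and $\kappa$ are arbitrary:

```python
import numpy as np
from scipy.stats import vonmises

mu, kappa = 0.3, 2.0
theta = np.linspace(-np.pi, np.pi, 1001)
pdf = vonmises.pdf(theta, kappa, loc=mu)

# The distribution is unimodal with its mode at mu
mode = theta[np.argmax(pdf)]
print(round(mode, 2))  # 0.3

# The larger kappa, the more concentrated the density around the mode
print(vonmises.pdf(mu, 5.0, loc=mu) > vonmises.pdf(mu, 2.0, loc=mu))  # True

# kappa -> 0 recovers the circular uniform density 1/(2*pi)
print(np.allclose(vonmises.pdf(theta, 1e-12, loc=mu), 1 / (2 * np.pi)))  # True
```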
While there is general agreement regarding the estimation of $\mu$ (Mardia and Jupp, 2000, §12.4.1), estimation of $\kappa$ has generated more research, and several methods have been proposed so far. For a partial review, see, e.g., Mardia and Jupp (2000, §12.4.2). Here, we provide a systematic benchmark of existing estimators. More specifically, we considered 12 estimators: seven frequentist estimators, including the maximum-likelihood estimator (Mardia and Jupp, 2000, §5.3.1), the marginal maximum-likelihood estimator (Schou, 1978), two estimators with bias correction (Best and Fisher, 1981), two estimators based on the circular median (Lenth, 1981; Ko, 1992), and an estimator based on the normal approximation (Abeyasekera and Collett, 1982); and five Bayesian estimators (Dowe et al., 1996), including three maximum a posteriori (MAP) estimators and two minimum message length (MML) estimators. We assessed the behavior of these 12 estimators for datasets of size ranging from 2 to 8 192 generated with a $\kappa$ ranging from 0 to 100. We provide detailed results as well as a more global view of the estimators' performance. Our two main findings are that (1) for a given $\kappa$, most estimators behave very similarly for large datasets and much more variably for small datasets, and (2) for a given estimator, results are very similar across values of $\kappa$ if one considers the mean absolute error for small values of $\kappa$ and the mean relative absolute error for large values of $\kappa$.
2 Overview of existing estimators
We here quickly review existing approaches, including frequentist (Section 2.1) and Bayesian (Section 2.2) estimators. All approaches start from a sample of $n$ independent and identically distributed (i.i.d.) realizations of a $\mathrm{vM}(\mu, \kappa)$ distribution with $\mu$ and $\kappa$ unknown.
2.1 Frequentist estimators
Most frequentist estimators of $\kappa$ rely on the likelihood of the parameters. Letting $\bar{\theta}$ be the sample circular mean of the data and $\bar{R}$ the mean resultant length,
(3) $\bar{R} \, e^{i \bar{\theta}} = \dfrac{1}{n} \sum_{j=1}^{n} e^{i \theta_j}$,
the likelihood can be expressed from Equation (1) and the definitions above as
(4) $L(\mu, \kappa) = \dfrac{1}{\left[ 2 \pi I_0(\kappa) \right]^n} \, e^{n \kappa \bar{R} \cos(\bar{\theta} - \mu)}$.
The maximum-likelihood estimator of $\kappa$ is the value that cancels the derivative of the log-likelihood. Setting
(5) $A(\kappa) = \dfrac{I_1(\kappa)}{I_0(\kappa)}$,
it is the solution of (Mardia and Jupp, 2000, §5.3.1)
(6) $A(\hat{\kappa}) = \bar{R}$.
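A minimal Python sketch of this estimator (our own illustrative implementation, not the paper's code): the ratio A = I1/I0 is inverted numerically so that A(kappa-hat) matches the mean resultant length. We use the exponentially scaled Bessel functions `i0e`/`i1e`, whose ratio equals I1/I0 while avoiding overflow for large arguments:

```python
import numpy as np
from scipy.optimize import brentq
from scipy.special import i0e, i1e

def kappa_ml(theta):
    """Maximum-likelihood estimate of kappa: solve A(kappa) = R-bar."""
    # R-bar: modulus of the average of the unit vectors exp(i*theta_j)
    rbar = np.abs(np.mean(np.exp(1j * np.asarray(theta))))
    if rbar >= 1 - 1e-12:  # all angles (nearly) identical: no finite solution
        return np.inf
    # i1e/i0e = I1/I0, numerically stable for large kappa
    return brentq(lambda k: i1e(k) / i0e(k) - rbar, 1e-12, 1e4)

rng = np.random.default_rng(0)
sample = rng.vonmises(0.0, 4.0, size=5000)
print(kappa_ml(sample))  # close to the true kappa = 4
```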
A marginal maximum-likelihood estimate was also proposed. It is based on the expression of the density of $\bar{R}$ with respect to the Lebesgue measure (Watson and Williams, 1956; Mardia, 1975),
(7) 
The maximum of this expression depends on the relationship between $\bar{R}$ and $n$ (Schou, 1978):

For $\bar{R} \le 1/\sqrt{n}$, $\hat{\kappa} = 0$;

For $1/\sqrt{n} < \bar{R} < 1$, $\hat{\kappa}$ is the solution of
(8) 
For $\bar{R} = 1$, there is no maximum (as it would correspond to $\hat{\kappa} = +\infty$).
Using both a calculation based on a Taylor expansion around $\kappa = 0$ of the solution of Equation (6) and a simulation study, Best and Fisher (1981) showed that the (regular) maximum-likelihood estimator of Equation (6) can be strongly biased for small $n$ and $\kappa$. They proposed to correct for this bias using an approximate expansion, leading to
(9) 
They also proposed a jackknife correction of the bias,
(10) $\hat{\kappa}_J = n \hat{\kappa} - \dfrac{n-1}{n} \sum_{i=1}^{n} \hat{\kappa}_{(-i)}$,
where $\hat{\kappa}_{(-i)}$ is the estimate of Equation (6) computed with the $i$-th observation left out.
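The jackknife correction can be sketched as follows. This is the standard jackknife bias-correction formula applied to the maximum-likelihood estimator; it is our own illustrative implementation (the helper `kappa_ml` reimplements the solver of Equation (6)):

```python
import numpy as np
from scipy.optimize import brentq
from scipy.special import i0e, i1e

def kappa_ml(theta):
    """Base estimator: solve A(kappa) = R-bar, with A = I1/I0."""
    rbar = np.abs(np.mean(np.exp(1j * np.asarray(theta))))
    rbar = min(rbar, 1 - 1e-9)  # guard against a degenerate sample
    return brentq(lambda k: i1e(k) / i0e(k) - rbar, 1e-12, 1e4)

def kappa_jackknife(theta):
    """Jackknife bias correction: n*khat - (n-1)/n * sum of leave-one-out estimates."""
    theta = np.asarray(theta)
    n = len(theta)
    k_full = kappa_ml(theta)
    k_loo = [kappa_ml(np.delete(theta, i)) for i in range(n)]
    return n * k_full - (n - 1) / n * np.sum(k_loo)

rng = np.random.default_rng(1)
sample = rng.vonmises(0.0, 2.0, size=30)
print(kappa_ml(sample), kappa_jackknife(sample))
```

The leave-one-out loop makes the cost of the jackknife grow linearly with the sample size, which is one reason bias-corrected estimators can be slower than the plain maximum-likelihood estimator.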
By analogy with the median absolute deviation, Lenth (1981) proposed an estimator based on the circular median,
(11) 
where $\tilde{\theta}$ is the circular median (Mardia and Jupp, 2000, §2.2.2). Ko (1992) improved this approach by replacing the expectation and average of Equation (6) with their median and sample median counterparts (see also Ducharme and Milasevic, 1990), yielding an estimator $\hat{\kappa}$ such that
(12) 
where $\tilde{\mu}$ is any estimator of $\mu$ that is standardized bias (SB) robust, such as the circular median or the least median of squares (LMS) estimator.
Finally, Abeyasekera and Collett (1982) made use of the fact that a $\mathrm{vM}(\mu, \kappa)$ distribution is well approximated by a normal distribution for large $\kappa$, and therefore proposed as estimator of $1/\kappa$ the unbiased estimator of the variance of a normal distribution, leading to the so-called linear estimator for $\kappa$,
(13) 
where the linear mean, that is, the usual arithmetic mean of the angles, is used in place of the circular mean. Note that there is an issue induced by the discontinuity at 0, which is dealt with by using a technical trick. They also proposed an improved estimator where the bias is corrected by jackknifing the linear estimator.
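The core idea can be sketched as follows. This is only the normal-approximation principle behind the estimator, not Abeyasekera and Collett's exact formula (which includes the corrections mentioned above); the function name is ours:

```python
import numpy as np

def kappa_linear_sketch(theta):
    """Normal-approximation idea: vM(mu, kappa) ~ N(mu, 1/kappa) for large kappa,
    so 1 / (unbiased sample variance of the centered angles) estimates kappa."""
    theta = np.asarray(theta)
    # center the angles on the sample circular mean, wrapped to (-pi, pi]
    mean_dir = np.angle(np.mean(np.exp(1j * theta)))
    centered = np.angle(np.exp(1j * (theta - mean_dir)))
    return 1.0 / np.var(centered, ddof=1)

rng = np.random.default_rng(2)
sample = rng.vonmises(1.0, 20.0, size=2000)
print(kappa_linear_sketch(sample))  # roughly 20, since kappa is large
```

As the text notes, the approximation degrades for small $\kappa$, where the wrapped angles no longer behave like a normal sample.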
2.2 Bayesian estimators
In a Bayesian setting, all the information that can be inferred about the parameters $\mu$ and $\kappa$ from a dataset is summarized in the posterior distribution $p(\mu, \kappa \mid \theta_1, \dots, \theta_n)$, which, according to Bayes' theorem, can be expressed as
(14) $p(\mu, \kappa \mid \theta_1, \dots, \theta_n) \propto L(\mu, \kappa) \, p(\mu, \kappa)$,
where "$\propto$" relates two expressions that are proportional. $L(\mu, \kappa)$ is the data likelihood, already expressed in Equation (4). $p(\mu, \kappa)$ is the prior distribution, which translates the information we have about the parameters before we see the data. A common approach is to assume no prior dependence between parameters, so that $p(\mu, \kappa) = p(\mu) \, p(\kappa)$. $p(\mu)$ is classically set as a noninformative uniform distribution on the circle, $p(\mu) = 1/(2\pi)$. As for $p(\kappa)$, Dowe et al. (1996) considered the following two prior distributions
(15) 
The posterior distribution then yields
(16) 
They then proposed to use the maximum a posteriori (MAP) estimators
(17) 
For one of these priors, they also proposed to consider estimating $\kappa$ via the MAP of the posterior expressed in a different parameterization, leading to
(18) 
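As a sketch of how such MAP estimates can be obtained in practice, the posterior over $\kappa$ can be evaluated on a grid and maximized numerically, with $\mu$ fixed at the sample circular mean for simplicity. The flat prior used below is purely illustrative (an assumption of ours); Dowe et al.'s specific priors would simply replace the `prior` argument:

```python
import numpy as np
from scipy.special import i0e

def kappa_map_grid(theta, prior=lambda k: np.ones_like(k), kmax=100.0):
    """Grid MAP for kappa, with mu fixed at the sample circular mean.
    log-posterior = kappa*sum(cos(theta - mu)) - n*log(I0(kappa)) + log prior."""
    theta = np.asarray(theta)
    mu = np.angle(np.mean(np.exp(1j * theta)))
    kgrid = np.linspace(1e-6, kmax, 200001)
    # scaled Bessel function i0e avoids overflow: log I0(k) = log i0e(k) + k
    log_i0 = np.log(i0e(kgrid)) + kgrid
    log_post = (kgrid * np.sum(np.cos(theta - mu))
                - len(theta) * log_i0 + np.log(prior(kgrid)))
    return kgrid[np.argmax(log_post)]

rng = np.random.default_rng(3)
sample = rng.vonmises(0.0, 5.0, size=1000)
print(kappa_map_grid(sample))  # close to 5 under a flat prior
```

With a flat prior the grid MAP essentially reproduces the maximum-likelihood estimate; an informative prior shifts the maximum, most visibly for small samples.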
Finally, they proposed two minimum message length (MML) estimators. Minimizing the message length is equivalent to maximizing
(19) 
where $|F(\mu, \kappa)|$ is the determinant of the Fisher information matrix for the von Mises distribution. The resulting estimators for the two priors of Equation (15) are
(20) 
and
(21) 
respectively.
2.3 Existing simulation studies
We found three existing simulation studies assessing the relative performance of estimators of the concentration parameter: Best and Fisher (1981), Abeyasekera and Collett (1982), and Dowe et al. (1996). Their scope and methodology are summarized in Table 1. Best and Fisher (1981) found that their two bias-corrected estimators were similar and better than the maximum-likelihood estimator for small values of $n$ and $\kappa$; one correction was less biased while the other was less median biased. However, some bias remained for small $n$ and $\kappa$. Abeyasekera and Collett (1982) found that their linear estimator compared favorably to its competitors for small values of $\kappa$ and unfavorably when $\kappa$ was large, and that the jackknifed variants had similar behaviors. Finally, Dowe et al. (1996) found that Bayesian methods outperformed frequentist methods, with a limited effect of the prior distribution on the behavior.
Article | Repetitions | Estimators | Measures of fit
Best and Fisher (1981) | 100 | | qualitative on sampling features
 | 100 | | qualitative on sampling features
 | 100 | | qualitative on sampling features
 | 100 | | qualitative on sampling features
 | 100 | | qualitative on sampling features
Abeyasekera and Collett (1982) | 1 000 | | sample bias and MSE
Dowe et al. (1996) | 1 000 | | MAE, MSE, MKL
 | 1 000 | | MAE, MSE, MKL
our simulation | 1 000 | see Table 2 | MAE, MRAE

Repetitions: the number of samples used to compute mean errors. MAE: mean absolute error; MSE: mean squared error; MKL: mean Kullback–Leibler divergence; MRAE: mean relative absolute error.
3 Simulation study
3.1 Data
We generated samples from von Mises distributions. More specifically, for each value of $\kappa$ on the grid, we generated "maximal" datasets. Each dataset was composed of i.i.d. realizations of a $\mathrm{vM}(\mu, \kappa)$ distribution with $\mu$ uniformly distributed on the circle and $\kappa$ set to the corresponding value of the grid. Each sample was generated using Berens (2009)'s CircStat. From each maximal dataset, we then extracted nested datasets composed of the first $n$ data points, with $n$ ranging from 2 to 8 192. This gave us the total collection of datasets on which inference was performed. On each dataset, estimation of $\kappa$ was performed by application of each of the estimators mentioned above (see also Table 2 for a summary). We therefore ended up with the full set of estimates.
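The nested-dataset design can be sketched as follows. The values below (κ grid, number of repetitions) are illustrative placeholders, not the paper's exact protocol, and the sketch uses NumPy rather than CircStat:

```python
import numpy as np

rng = np.random.default_rng(42)

kappas = [0.0, 0.1, 1.0, 10.0, 100.0]      # illustrative grid of kappa values
n_rep = 10                                 # illustrative number of repetitions
n_max = 8192                               # maximal dataset size
subsizes = [2 ** p for p in range(1, 14)]  # nested sizes 2, 4, ..., 8192

datasets = {}
for kappa in kappas:
    for rep in range(n_rep):
        mu = rng.uniform(-np.pi, np.pi)            # mu uniform on the circle
        maximal = rng.vonmises(mu, kappa, size=n_max)
        # nested datasets: the first n points of the maximal dataset
        datasets[(kappa, rep)] = [maximal[:n] for n in subsizes]

print(len(datasets), len(datasets[(0.0, 0)]))  # 50 datasets, 13 nested sizes each
```

Extracting nested prefixes (rather than drawing an independent sample per size) matches the protocol above and makes the error curves across sizes directly comparable within a repetition.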
3.2 Evaluation
To assess the behavior of the various estimators, we proceeded as follows. For the estimate $\hat{\kappa}_{e,n,s,j}$, obtained by application of estimator $e$ to the first $n$ data points of simulation $s$ associated with the $j$-th value $\kappa_j$ of the concentration parameter, we first considered the absolute error of the estimation compared to the true value,
(22) $\mathrm{AE}_{e,n,s,j} = \left| \hat{\kappa}_{e,n,s,j} - \kappa_j \right|$.
We then summarized the results over the $S$ simulations by computing the mean absolute error (MAE), defined as a sampling average,
(23) $\mathrm{MAE}_{e,n,j} = \dfrac{1}{S} \sum_{s=1}^{S} \mathrm{AE}_{e,n,s,j}$.
In the same fashion, we also considered the relative absolute error, where the above error of Equation (22) is normalized by the true value,
(24) $\mathrm{RAE}_{e,n,s,j} = \dfrac{\left| \hat{\kappa}_{e,n,s,j} - \kappa_j \right|}{\kappa_j}$,
and the mean relative absolute error (MRAE) as its sampling average over the $S$ simulations,
(25) $\mathrm{MRAE}_{e,n,j} = \dfrac{1}{S} \sum_{s=1}^{S} \mathrm{RAE}_{e,n,s,j}$.
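The two summary measures are direct averages and can be computed as follows (`estimates` holds the $\hat{\kappa}$ values over the repetitions; the function names are ours):

```python
import numpy as np

def mae(estimates, kappa_true):
    """Mean absolute error over the repetitions (Equation 23)."""
    estimates = np.asarray(estimates, dtype=float)
    return np.mean(np.abs(estimates - kappa_true))

def mrae(estimates, kappa_true):
    """Mean relative absolute error: absolute errors normalized by the true value."""
    estimates = np.asarray(estimates, dtype=float)
    return np.mean(np.abs(estimates - kappa_true) / kappa_true)

ests = [1.8, 2.3, 2.1, 1.6]
print(round(mae(ests, 2.0), 3))   # 0.25
print(round(mrae(ests, 2.0), 3))  # 0.125
```

Note that the MRAE is undefined for $\kappa = 0$, which is one reason to report the MAE for small concentration values.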
3.3 Results
Detailed results can be found in the supplemental material, in the form of summary statistics (supplemental material, Tables 1–12) and graphs (supplemental material, Figures 1–6). We here focus on some key features of the behavior: computation time, algorithmic failures, the global trend as a function of , the behavior for large samples (), and the smallsample behavior.
Computation time.
The detailed computation times can be found in the supplemental material, Tables 1–12, columns 3 and 4. For all estimators, the actual value of $\kappa$ seemed to have very limited influence on the computation time. Similarly, for most estimators (jML, mML, BF1, linear, MML2, MML3, BayesEst2jMAPkm, BayesEst3jMAPkm, and BayesEst3jMAPxy), the computation time was not a function of the sample size; it was of the order of 40 ms for BF1 and of 20 ms for the others (see Table 3). Three estimators had computation times that were a function of the data size: BF2, median1, and median2 (see Figure 1). For BF2, the computation time was roughly linear, with a doubling of the data size corresponding to a doubling of the computation time. For median1 and median2, the computation time was roughly constant up to a certain data size and then increased exponentially.
Estimator | Time (mean ± std dev)
jML | ms
mML | ms
BF1 | ms
linear | ms
MML2 | ms
MML3 | ms
BayesEst2jMAPkm | ms
BayesEst3jMAPkm | ms
BayesEst3jMAPxy | ms
Failures.
We first report cases where methods failed to compute an estimate (see Table 4). This happened for median2 and for linear (the latter for $n < 4$). By definition, the linear estimator of Equation (13) is not defined for $n < 4$. Also, the second median estimator, median2, was not always defined, as there were cases for which the integral of Equation (12) was smaller than 0.5 for all values of $\kappa$ (in particular for small values of $n$). Cases corresponding to failures were removed when computing the summary statistics.


Evolution as a function of .
Graphs of MAE and MRAE for all estimators, gathered by value of $\kappa$, can be found in Figure 2. Globally, most estimators improved as the sample size increased. One clear exception was BF2 for small $n$ and low $\kappa$. In that particular case, the method tended to estimate $\kappa$ as equal to 0, leading to an error of the order of $\kappa$ itself, with a correction toward a more normal trend as the data size increased. This behavior was also observed, to a much lesser extent, for the minimum message length estimators (MML2 and MML3) and the Bayesian estimators (BayesEst2jMAPkm, BayesEst3jMAPkm, and BayesEst3jMAPxy).
Large-sample behavior.
We observed quite homogeneous results for large datasets, both within and between estimators. First, the mean absolute error roughly decreased linearly in log-log coordinates for most estimators. The two exceptions were median1 for low $\kappa$, where the estimator seemed to reach a plateau independent of $n$, and linear, for which data samples of increasing size did not seem to improve estimation. Also, the performances of a given method for different values of $\kappa$ were quite similar when one considered the mean absolute error for small values of $\kappa$ and the mean relative absolute error for large values of $\kappa$. The main exception was median1, due to the difference in behavior for low versus high values of $\kappa$ mentioned earlier in the paragraph. For the other estimators, the slope of the error as a function of $n$ in log-log coordinates was found to be around $-1/2$, corresponding to a decrease of the mean (relative) absolute error in $1/\sqrt{n}$. Finally, most estimators behaved quite similarly in this large-sample regime, with the exception of median1, linear and, to a lesser extent, median2.
To quantify these large-sample behaviors, we fitted to each curve a linear regression model, either
(26) 
for small values of $\kappa$, or
(27) 
for large values of $\kappa$. The results are summarized in Figure 3, confirming generally homogeneous behaviors (with the exception of median1, median2, and linear) that tend to depend more on the value of $\kappa$ than on the specifics of the estimation method.
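Such a fit can be sketched as an ordinary least-squares regression of log error on log sample size. We assume base-10 logarithms here for illustration; the paper's exact parameterization in Equations (26)–(27) may differ:

```python
import numpy as np

def loglog_fit(sizes, errors):
    """Fit log10(error) = slope * log10(n) + intercept by least squares."""
    x, y = np.log10(sizes), np.log10(errors)
    slope, intercept = np.polyfit(x, y, 1)
    return slope, intercept

# synthetic errors decaying as 1/sqrt(n): the fitted slope is -1/2
sizes = np.array([64, 128, 256, 512, 1024, 2048, 4096, 8192])
errors = 3.0 / np.sqrt(sizes)
slope, intercept = loglog_fit(sizes, errors)
print(round(slope, 3))  # -0.5
```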
Figure 3: Results of the linear regression analysis for either Equation (26) (for small values of $\kappa$) or Equation (27) (for large values of $\kappa$). (a) Estimated slope. (b) Estimated intercept. (c) and (d): predicted values of the mean (relative) absolute error at two reference sample sizes. (e) Standard deviation of residual noise.

Small-sample behavior.
Behaviors were much more variable for small sample sizes, with variability both between methods and, for a given method, between values of $\kappa$. A majority of estimators (jML, mML, BF1, median1, median2, MML2, MML3) tended to have very large average errors for very small sample sizes. It was found that these extremely large values were due to a few very large errors. For instance, in one configuration of MML2, all but one of the 1 000 repetitions had errors smaller than 1, the remaining error being extremely large.
To quantify this behavior, we modeled the deviation from the linear trend as an exponential decay, either
(28) 
for small values of $\kappa$, or
(29) 
for large values of $\kappa$. The amplitude parameter is the maximum departure of the mean (relative) absolute error from the linear trend and corresponds to the increase in mean (relative) absolute error at the smallest sample size. A positive value indicates a mean (relative) absolute error that is larger than the linear expectation (and therefore a decreased performance), while a negative value indicates a smaller mean (relative) absolute error (and therefore an increased performance). The decay parameter is such that the departure from the linear trend is divided by 10 when the sample size increases by the corresponding amount; the larger its value, the slower the decay. The results are summarized in Figure 4.
Regarding the decay amplitude, we found several groups of estimators. A first group (jML, mML, BF1, median1, median2) had large positive values regardless of $\kappa$. A second group (MML2 and MML3) had large positive values only for small values of $\kappa$, and small negative values for large values of $\kappa$. A third group (linear, BayesEst2jMAPkm, BayesEst3jMAPkm, BayesEst3jMAPxy) had low (positive and negative) values for all values of $\kappa$. A last estimator (BF2) had small positive values for small values of $\kappa$ and negative, small yet larger (in magnitude) values for large values of $\kappa$.
Regarding the decay time, all values were found to lie in a narrow range. All estimators with large amplitudes (corresponding to the first two groups mentioned earlier) had values indicating a very fast exponential decay; for practical purposes, it mostly affected datasets of very small size.
Figure 4: Results of the exponential-decay analysis. (a) Estimated decay amplitude. (b) Estimated decay time. (c) Standard deviation of residual noise.
4 Discussion
In the present manuscript, we assessed the behavior of 12 estimators of the concentration parameter $\kappa$ of a von Mises distribution. We used synthetic datasets of size ranging from 2 to 8 192 generated with a $\kappa$ ranging from 0 to 100. We provided detailed results as well as a more global view of the estimators' performance. We found that, for a given estimator, results were very similar across values of $\kappa$ if we considered the mean absolute error for small values of $\kappa$ and the mean relative absolute error for large values of $\kappa$. We also found that, for a given $\kappa$, most estimators had similar behaviors for large datasets, while the behaviors differed more strongly for small datasets.
The fact that the mean absolute error was stable for small values of $\kappa$ while the mean relative absolute error was stable for large values of $\kappa$ has two consequences. It first shows that, for large $\kappa$, the error increases roughly linearly as a function of $\kappa$. Larger values of $\kappa$ are therefore expected to lead to larger estimation errors. While it also means that smaller values of $\kappa$ tend to yield smaller estimation errors, this behavior cannot be extrapolated to small $\kappa$, where it would mean that the error vanishes as $\kappa \to 0$. By contrast, it is the mean absolute error that remains stable for small $\kappa$, i.e., the error reaches a plateau as $\kappa$ decreases.
For large datasets, the common behavior of most estimators was that the mean absolute error roughly decreased linearly in log-log coordinates, with a slope around $-1/2$, corresponding to a decrease of the mean (relative) absolute error in $1/\sqrt{n}$. This is in line with general estimation theory, where errors are often found to decrease as $1/\sqrt{n}$.
In the light of our results, we are able to give general practical guidelines regarding the use of the estimators tested here. First, we would not recommend the use of median1, median2, and linear unless there is evidence that they might perform well in the specific context of interest. For large datasets, all other estimators perform in a similar fashion and there is no obvious reason to recommend one over another. In this context, BF2's computational burden is a disadvantage with no compensation in terms of performance. By contrast, for small datasets, BF2 was found to be the estimator with the best behavior, still at the price of a high computational cost as well as a consistent underestimation of the concentration parameter when it was small but different from 0. For instance, for the smallest sample sizes, it estimated $\kappa$ as 0 regardless of its actual value (supplemental material, Table 5). The Bayesian estimators, which performed only a little worse than BF2 but had computational costs similar to those of the other estimators, could prove valuable trade-offs.
Importantly, all estimators considered here require a sample of independent and identically distributed (i.i.d.) realizations of a $\mathrm{vM}(\mu, \kappa)$ distribution with $\mu$ and $\kappa$ unknown. Our simulation study faithfully respected this requirement. As a consequence, it did not explore the behavior of estimators with respect to the presence of outliers or model misspecification. Another consequence of the fact that we did not depart from the von Mises model is that we did not consider estimators assuming that one or several observations are identified as outliers (e.g., the Winsorized estimate of Fisher, 1982). Also, some of the estimators proposed in the literature depend on tuning parameters (see, e.g., Lenth, 1981; Kato and Eguchi, 2016). Since the behavior of these estimators strongly depends on the choice of these parameters, they were not incorporated in the analysis.

As mentioned earlier, there were cases where some of the estimators were not defined. This happened for the linear estimator and $n < 4$, since this estimator requires at least 4 data points. The median-based estimator also failed sometimes, in particular for small values of $n$, as Equation (12) had no solution (the integral was smaller than 0.5 for all values of $\kappa$). In such cases, we could have decided to set the estimator to 0. However, we believe that such a problem is symptomatic of a deeper issue whose investigation goes beyond the scope of the present manuscript, so we took the estimator as is.
As a final note, the routines computing the various estimators were written by us and were not optimized. As such, the computation times presented here should only serve as a rough indicator of the time required to apply each of them. In particular, it is possible that an optimized implementation of BF2 in a compiled language could greatly improve its computation time.
References
Abeyasekera, S. and D. Collett (1982). On the estimation of the parameters of the von Mises distribution. Communications in Statistics – Theory and Methods 11(18), 2083–2090.
Berens, P. (2009). CircStat: a MATLAB toolbox for circular statistics. Journal of Statistical Software 31(10), 21 pages.
Best, D. J. and N. I. Fisher (1981). The bias of the maximum likelihood estimators of the von Mises–Fisher concentration parameters. Communications in Statistics – Simulation and Computation B10(5), 493–502.
Dowe, D. L., J. J. Oliver, R. A. Baxter, and C. S. Wallace (1996). Bayesian estimation of the von Mises concentration parameter. In K. M. Hanson and R. N. Silver (Eds.), Maximum Entropy and Bayesian Methods, Volume 79 of Fundamental Theories of Physics, pp. 51–60. Springer, Dordrecht.
Ducharme, G. R. and P. Milasevic (1990). Estimating the concentration of the Langevin distribution. The Canadian Journal of Statistics 18(2), 163–169.
Fisher, N. I. (1982). Robust estimation of the concentration parameter of Fisher's distribution on the sphere. Applied Statistics 31, 152–154.
Fisher, N. I. (1995). Statistical Analysis of Circular Data. Cambridge University Press.
Kato, S. and S. Eguchi (2016). Robust estimation of location and concentration parameters for the von Mises–Fisher distribution. Statistical Papers 57(1), 205–234.
Ko, D. (1992). Robust estimation of the concentration parameter of the von Mises–Fisher distribution. The Annals of Statistics 20(2), 917–928.
Lenth, R. V. (1981). Robust measures of location for directional data. Technometrics 23(1), 77–81.
Mardia, K. V. (1975). Distribution theory for the von Mises–Fisher distribution and its application. In G. P. Patil, S. Kotz, and J. K. Ord (Eds.), A Modern Course on Statistical Distributions in Scientific Work, Volume 1 – Models and Structures, pp. 113–130. D. Reidel, Dordrecht.
Mardia, K. V. and P. E. Jupp (2000). Directional Statistics. Wiley Series in Probability and Statistics. Wiley, Chichester.
Schou, G. (1978). Estimation of the concentration parameter in von Mises–Fisher distributions. Biometrika 65(2), 369–377.
Watson, G. S. and E. J. Williams (1956). On the construction of significance tests on the circle and the sphere. Biometrika 43(3/4), 344–352.