
# On the Predictive Properties of Binary Link Functions

This paper provides a theoretical and computational justification of the long-held claim of the similarity of the probit and logit link functions often used in binary classification. Despite the widespread recognition of the strong similarities between these two link functions, very few (if any) researchers have dedicated time to carrying out a formal study aimed at firmly establishing and characterizing all the aspects of their similarities and differences. This paper proposes a definition of both the structural and the predictive equivalence of link-function-based binary regression models, and explores the various ways in which they are either similar or dissimilar. From a predictive analytics perspective, it turns out that not only are probit and logit perfectly predictively concordant, but the other link functions, such as cauchit and complementary log-log, enjoy a very high percentage of predictive equivalence. Throughout this paper, simulated and real-life examples demonstrate all the equivalence results that we prove theoretically.


## I Introduction

Given $\{(\mathbf{x}_i, y_i),\ i=1,\cdots,n\}$, where $\mathbf{x}_i$ denotes the $p$-dimensional vector of characteristics and $y_i \in \{0,1\}$ denotes the binary response variable, binary regression seeks to model the relationship between $y_i$ and $\mathbf{x}_i$ using

$$\pi(\mathbf{x}_i)=\Pr[Y_i=1 \mid \mathbf{x}_i]=F(\eta(\mathbf{x}_i)) \qquad (1)$$

where

$$\eta(\mathbf{x}_i)=\beta_0+\beta_1 x_{i1}+\cdots+\beta_p x_{ip}=\tilde{\mathbf{x}}_i^\top\boldsymbol{\beta}, \qquad i=1,\cdots,n \qquad (2)$$

for a $(p+1)$-dimensional vector $\boldsymbol{\beta}$ of regression coefficients, and $F$ is the cdf corresponding to the link function under consideration. Specifically, the cdf $F$ is the inverse of the link function $g$, such that $F=g^{-1}$. Table (1) provides specific definitions of the link functions considered in this paper, along with their corresponding cdfs.
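Since Table (1) did not survive extraction, the following sketch lists the standard cdfs $F=g^{-1}$ for the four link functions discussed in this paper; the probit cdf is expressed through the error function. These are the textbook definitions, not a reproduction of the original table.

```python
import math

def logit_cdf(z):
    # Logistic cdf: Lambda(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + math.exp(-z))

def probit_cdf(z):
    # Standard normal cdf Phi, expressed through the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def cauchit_cdf(z):
    # Standard Cauchy cdf
    return math.atan(z) / math.pi + 0.5

def cloglog_cdf(z):
    # cdf inverting the complementary log-log link: F(z) = 1 - exp(-exp(z))
    return 1.0 - math.exp(-math.exp(z))

# The three symmetric cdfs all pass through (0, 1/2); cloglog is asymmetric
for F in (logit_cdf, probit_cdf, cauchit_cdf):
    assert abs(F(0.0) - 0.5) < 1e-12
```

The asymmetry of the complementary log-log cdf around $1/2$ is what later distinguishes it from the three symmetric links in the high dimensional comparisons.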

The above link functions have been used extensively in a wide variety of applications in fields as diverse as medicine, engineering, economics, psychology and education, just to name a few. The logit link function, for which

$$\pi(\mathbf{x}_i)=\Pr[Y_i=1 \mid \mathbf{x}_i]=\Lambda(\eta(\mathbf{x}_i))=\frac{1}{1+e^{-\eta(\mathbf{x}_i)}} \qquad (3)$$

is the most commonly used of them all, probably because it provides a nice interpretation of the regression coefficients in terms of odds ratios. The popularity of the logit link also comes from its computational convenience, in the sense that its model formulation yields simpler maximum likelihood equations and faster convergence. In fact, the literature on both the theory and the applications based on the logistic distribution is so vast that it would be unthinkable to reference even a fraction of it. Some recent authors, like Zelterman (1989), Schumacher et al. (1996), Nadarajah (2004), Lin and Hu (2008) and Nassar and Elmasry (2012), provide extensive studies of the characteristics of generalized logistic distributions, somewhat answering the ever-increasing interest in the logistic family of distributions. Indeed, applications abound that make use of both the standard logistic regression model and the so-called generalized logistic regression model, as can be seen in van den Hout et al. (2007) and Tamura and Giampaoli (2013). The probit link, for which

$$\pi(\mathbf{x}_i)=\Pr[Y_i=1 \mid \mathbf{x}_i]=F(\eta(\mathbf{x}_i))=\Phi(\eta(\mathbf{x}_i))=\int_{-\infty}^{\eta(\mathbf{x}_i)}\frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}z^2}\,dz \qquad (4)$$

is the second most commonly used of all the link functions, with Bayesian researchers seemingly topping the charts in its use. See Basu and Mukhopadhyay (2000), Csató et al. (2000) and Chakraborty (2009) for a few examples of probit use in binary classification in the Bayesian setting. Armagan and Zaretzki (2011) is just one more of the references pointing to the use of the probit link function in the statistical data mining and machine learning communities.

As a matter of fact, it is obvious from the plot of their densities, for instance, that the probit and logit are virtually identical, almost superposed one on top of the other. It is therefore not surprising that one would empirically notice virtually no difference when the two are compared on the same binary regression task. Despite this apparent indistinguishability due to their many similarities, it is fair to recognize that the two functions are different, at least by definition and by their very algebra. Chambers and Cox (1967) argue in their paper that probit and logit will yield different results in the multivariate context. Their work is a rarity in a context where most researchers seem to have settled comfortably into accepting that the two links are essentially the same from a utility perspective. For such researchers, using one over the other is determined solely by mathematical convenience and a matter of taste. We demonstrate both theoretically and computationally that the links are all predictively equivalent in the univariate case, but we also provide a characterization of the conditions under which they tend to differ in the multivariate context.

Throughout this work, we perform model comparison and model selection using both the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC). Taking the view that the ability of an estimator to generalize well over the whole population provides the best measure of its ultimate utility, we provide extensive comparisons of the performances of each link function based on its corresponding test error. In the present work, we perform a large number of simulations in various dimensions using both artificial and real-life data. Our results persistently reveal the performance indistinguishability of the links in univariate settings, but some sharp differences begin to appear as the dimension of the input space (number of variables measured) increases.

The rest of this paper is organized as follows: Section II presents some general definitions, namely our meaning of the terms predictive equivalence and structural equivalence, along with some computational demonstrations on simulated and real-life data and our formal proof of the equivalence of probit and logit. That section also clearly describes our approach to demonstrating and verifying our claimed results, and shows that for low to moderate dimensional spaces, goodness-of-fit and predictive performance measures reveal the equivalence between probit and logit. Section III reveals that there might be some differences in performance when the input space becomes very large; our demonstration there is based on the famous AT&T Email Spam data set. Section IV provides a conclusion and a discussion, along with insights into extensions of the present work.

## II Definitions, Methodology and Verification

Throughout this work, we consider comparing models both on the merits of goodness of fit and on predictive performance. With that in mind, we can then define equivalence both from a goodness-of-fit perspective and from a predictive optimality perspective. From a predictive analytics perspective, for instance, an important question to ask is: given a randomly selected vector $\mathbf{x}$, what is the probability that the prediction made by probit will differ from the one made by logit? In other words, how often do the probit and logit link functions yield different predictions? This is particularly important in predictive analytics in data mining and machine learning, where the nonparametric nature of most models forces the experimenter to focus on the utility of the estimator rather than its form. We respond to this need by defining what we call predictive equivalence.

### ii.1 Basic definitions and results

###### Definition 1.

(Binary classifier) Given an input space $\mathcal{X}$ and a binary response space $\mathcal{Y}=\{0,1\}$, we define a (binary) classifier to be a function $h$ that maps elements of $\mathcal{X}$ to $\mathcal{Y}$, or more specifically

$$h:\mathcal{X}\to\{0,1\}, \qquad \mathbf{x}\mapsto h(\mathbf{x}).$$

In the generalized linear model (GLM) framework, given a link function $g$ with corresponding cdf $F=g^{-1}$, a binary classifier under the majority rule takes the form

$$h(\mathbf{x})=\frac{1}{2}\left\{1+\mathrm{sign}\left(\pi(\mathbf{x})-\frac{1}{2}\right)\right\},$$

where $\pi(\mathbf{x})=F(\eta(\mathbf{x}))$ and $\eta(\mathbf{x})$ is the linear component. For instance, the logit binary classifier is given by

$$h_{\mathrm{logit}}(\mathbf{x})=\frac{1}{2}\left\{1+\mathrm{sign}\left(\Lambda(\eta(\mathbf{x}))-\frac{1}{2}\right)\right\},$$

and the probit binary classifier is given by

$$h_{\mathrm{probit}}(\mathbf{x})=\frac{1}{2}\left\{1+\mathrm{sign}\left(\Phi(\eta(\mathbf{x}))-\frac{1}{2}\right)\right\},$$

where $\Lambda$ and $\Phi$ are as defined in Table (1).
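The majority-rule classifiers above reduce to thresholding $\pi(\mathbf{x})$ at $1/2$. A minimal sketch, with a hypothetical univariate linear component $\eta(x)=\beta_0+\beta_1 x$ (the coefficient values are illustrative, not from the paper), showing that the logit and probit classifiers return identical labels everywhere:

```python
import math

def eta(x, b0=-0.5, b1=2.0):
    # Hypothetical linear component for illustration only
    return b0 + b1 * x

def classify(F, x):
    # h(x) = (1/2) * (1 + sign(F(eta(x)) - 1/2)): threshold pi(x) at 1/2
    return 1 if F(eta(x)) >= 0.5 else 0

logit = lambda z: 1.0 / (1.0 + math.exp(-z))
probit = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Because both cdfs cross 1/2 exactly at eta(x) = 0, the labels coincide
xs = [i / 50.0 for i in range(-100, 101)]
agree = all(classify(logit, x) == classify(probit, x) for x in xs)
assert agree
```

This is exactly the mechanism behind Theorem 1 below: both classifiers decide according to the sign of $\eta(\mathbf{x})$ alone.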

We shall measure the predictive performance of a classifier $h$ by choosing a loss function $\ell$ and then computing the expected loss (also known as the risk functional) as follows:

$$R(h)=\mathbb{E}[\ell(Y,h(X))]=\int_{\mathcal{X}\times\mathcal{Y}}\ell(y,h(\mathbf{x}))\,p(\mathbf{x},y)\,d\mathbf{x}\,dy.$$

Under the zero-one loss function $\ell(y,h(\mathbf{x}))=\mathbb{1}\{y\neq h(\mathbf{x})\}$, the risk functional is the misclassification rate, more specifically

$$R(h)=\mathbb{E}[\ell(Y,h(X))]=\int_{\mathcal{X}\times\mathcal{Y}}\ell(y,h(\mathbf{x}))\,p(\mathbf{x},y)\,d\mathbf{x}\,dy=\Pr[Y\neq h(X)].$$

In practice, $R(h)$ cannot be computed in closed form because the distribution of $(X,Y)$ is unknown. We shall therefore use the so-called average test error, or average empirical prediction error, as our predictive performance measure to compare classifiers.

###### Definition 2.

(Average Test Error) Given a sample $\mathcal{D}=\{(\mathbf{x}_1,y_1),\cdots,(\mathbf{x}_n,y_n)\}$, we randomly form a training set and a test set of size $n_{te}$. We typically run $R$ replications of this split, with a fixed proportion of the data allocated to the training set and the remainder to the test set. The test error under the symmetric zero-one loss is given by

$$\widehat{R}_{\mathrm{test}}(h)=\mathrm{TE}(h)=\frac{1}{n_{te}}\sum_{i=1}^{n_{te}}\mathbb{1}\{y_i^{(te)}\neq h(\mathbf{x}_i^{(te)})\}=\frac{\#\{y_i^{(te)}\neq h(\mathbf{x}_i^{(te)})\}}{n_{te}},$$

from which the average test error of $h$ over $R$ random splits of the data is given by

$$\mathrm{ATE}(h)=\frac{1}{R}\sum_{r=1}^{R}\mathrm{TE}_r(h),$$

where $\mathrm{TE}_r(h)$ is the test error yielded by $h$ on the $r$th split of the data.

###### Definition 3.

(Predictively concordant classifiers) Let $h_1$ and $h_2$ be two classifiers defined on the same $p$-dimensional input space $\mathcal{X}$. We shall say that $h_1$ and $h_2$ are $\alpha$-predictively concordant if, for $X$ drawn according to the density $p(\mathbf{x})$,

$$\Pr[h_1(X)\neq h_2(X)]=\alpha.$$

In other words, $h_1$ and $h_2$ are predictively concordant if the probability of disagreement between the two classifiers is $\alpha$. When $\alpha=0$, we say that $h_1$ and $h_2$ are perfectly predictively concordant.

###### Definition 4.

(Predictively equivalent classifiers) Let $h_1$ and $h_2$ be two classifiers defined on the same $p$-dimensional input space $\mathcal{X}$. We shall say that $h_1$ and $h_2$ are predictively equivalent if the difference between their average test errors is negligible, i.e., $|\mathrm{ATE}(h_1)-\mathrm{ATE}(h_2)|\approx 0$.

###### Lemma 1.

For all $x\in\mathbb{R}$, $\Lambda(x)\approx\Phi\!\left(\sqrt{\pi/8}\,x\right)$.

Demonstration: Figure (3) below shows that the scaled version of the logistic cdf lines up almost perfectly with the standard normal cdf.
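Since Figure (3) did not survive extraction, the approximation of Lemma 1 (as it is used in the proofs of Theorems 1 and 2) can also be checked numerically; a minimal sketch measuring the worst-case gap between the two cdfs over a fine grid:

```python
import math

logistic_cdf = lambda x: 1.0 / (1.0 + math.exp(-x))
normal_cdf = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Scaling constant from Lemma 1, sqrt(pi/8), about 0.6267; it matches the
# slopes of the two cdfs at the origin: Lambda'(0) = 1/4 = c * phi(0)
c = math.sqrt(math.pi / 8.0)

# Worst-case absolute gap between Lambda(x) and Phi(c * x) over a fine grid
gap = max(abs(logistic_cdf(x) - normal_cdf(c * x))
          for x in (i / 100.0 for i in range(-1000, 1001)))
assert gap < 0.02  # the two curves never differ by more than about 0.02
```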

###### Lemma 2.

Let $\Phi$ denote the standard normal cdf. Then, for any constant $c>0$ and any $z\in\mathbb{R}$, $\mathrm{sign}\!\left(\Phi(cz)-\frac{1}{2}\right)=\mathrm{sign}(z)$.

###### Theorem 1.

The probit and logit link functions are perfectly predictively concordant. Specifically, given an input space $\mathcal{X}$ and a density $p$ on $\mathcal{X}$,

$$\Pr[h_{\mathrm{logit}}(X)\neq h_{\mathrm{probit}}(X)]=0,$$

for $X$ drawn according to $p$.

###### Proof.

For a given $X$, let $E$ be the event

$$E=\left\{\mathrm{sign}\left(\Lambda(\eta(X))-\frac{1}{2}\right)\neq\mathrm{sign}\left(\Phi(\eta(X))-\frac{1}{2}\right)\right\};$$

we must show that $\Pr(E)=0$. Based on Lemma (1), we can write $E$ as

$$E=\left\{\mathrm{sign}\left(\Phi\left(\sqrt{\tfrac{\pi}{8}}\,\eta(X)\right)-\frac{1}{2}\right)\neq\mathrm{sign}\left(\Phi(\eta(X))-\frac{1}{2}\right)\right\}.$$

Thanks to Lemma (2), both signs equal $\mathrm{sign}(\eta(X))$, so it is straightforward to see that $\Pr(E)=0$. ∎

###### Definition 5.

Let $\mathcal{M}_1$ and $\mathcal{M}_2$ be two binary regression models based on two different link functions defined on the same $p$-dimensional input space. We shall say that $\mathcal{M}_1$ and $\mathcal{M}_2$ are structurally equivalent if there exists a nonzero real constant $c$ such that $\boldsymbol{\beta}^{(\mathcal{M}_1)}=c\,\boldsymbol{\beta}^{(\mathcal{M}_2)}$. In other words, the parameters of $\mathcal{M}_1$ are just a scaled version of the parameters of $\mathcal{M}_2$, so that knowing the parameters of $\mathcal{M}_1$ is sufficient to completely determine the parameters of $\mathcal{M}_2$, and vice versa.

###### Theorem 2.

The logit and probit models are structurally equivalent.

###### Proof.

Thanks to Lemma (1), we can write

$$\Lambda(\mathbf{x}^\top\boldsymbol{\beta}^{(\mathrm{logit})})\approx\Phi\left(\sqrt{\tfrac{\pi}{8}}\,\mathbf{x}^\top\boldsymbol{\beta}^{(\mathrm{logit})}\right)=\Phi\left(\mathbf{x}^\top\sqrt{\tfrac{\pi}{8}}\,\boldsymbol{\beta}^{(\mathrm{logit})}\right)=\Phi(\mathbf{x}^\top\boldsymbol{\beta}^{(\mathrm{probit})}),$$

where

$$\boldsymbol{\beta}^{(\mathrm{probit})}\approx\sqrt{\tfrac{\pi}{8}}\,\boldsymbol{\beta}^{(\mathrm{logit})}.$$

We have therefore found a nonzero real constant $c=\sqrt{\pi/8}$ such that $\boldsymbol{\beta}^{(\mathrm{probit})}=c\,\boldsymbol{\beta}^{(\mathrm{logit})}$. ∎

### ii.2 Computational Verification via Simulation

To probe more deeply into how strongly related the probit and logit models are, we now seek to estimate via simulation the constant coefficient $\theta$ that relates their parameter estimates. Indeed, we conjecture that $\widehat{\beta}^{(\mathrm{probit})}$ and $\widehat{\beta}^{(\mathrm{logit})}$ are linearly related via the regression equation

$$\widehat{\beta}^{(\mathrm{probit})}=\tau+\theta\,\widehat{\beta}^{(\mathrm{logit})}+\nu,$$

where $\tau$ is the intercept and $\nu$ is the noise term. To estimate one instance of $\theta$, we generate $S$ random replications of the dataset, and for each replication we estimate a copy of $(\widehat{\beta}^{(\mathrm{logit})},\widehat{\beta}^{(\mathrm{probit})})$, with which we also compute an estimate of the correlation coefficient between $\widehat{\beta}^{(\mathrm{logit})}$ and $\widehat{\beta}^{(\mathrm{probit})}$. By repeating the estimation $R$ times, we gather the data needed to determine the central tendency of $\theta$ and of the corresponding correlation.

• For r = 1 to R

• For s = 1 to S

• Generate a replicate of the random sample of size $n$

• Estimate the logit and probit model coefficients $\widehat{\beta}^{(\mathrm{logit})}_s$ and $\widehat{\beta}^{(\mathrm{probit})}_s$

• End

• Store the simulated data $\{(\widehat{\beta}^{(\mathrm{logit})}_s,\widehat{\beta}^{(\mathrm{probit})}_s),\ s=1,\cdots,S\}$

• Fit $\widehat{\beta}^{(\mathrm{probit})}=\tau+\theta\,\widehat{\beta}^{(\mathrm{logit})}+\nu$, the regression model, using the stored data

• Extract the coefficient $\widehat{\theta}_r$ from the fitted regression model

• Compute the estimate $\widehat{\rho}_r$ of the correlation between $\widehat{\beta}^{(\mathrm{logit})}$ and $\widehat{\beta}^{(\mathrm{probit})}$

• End

• Collect $\{(\widehat{\theta}_r,\widehat{\rho}_r),\ r=1,\cdots,R\}$, then compute relevant statistics.
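The inner loop of the scheme above can be sketched as follows. Everything here is an illustrative assumption (sample size, a true logit data-generating model with $\beta=1.5$, and maximization of each univariate log-likelihood by ternary search rather than the fitting routine the authors used):

```python
import math, random

logit = lambda z: 1.0 / (1.0 + math.exp(-z))
probit = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def loglik(beta, data, F):
    ll = 0.0
    for x, y in data:
        p = min(max(F(beta * x), 1e-12), 1.0 - 1e-12)  # guard against log(0)
        ll += y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return ll

def fit(data, F, lo=-10.0, hi=10.0):
    # Ternary search works because the univariate log-likelihood is concave
    for _ in range(60):
        m1, m2 = lo + (hi - lo) / 3.0, hi - (hi - lo) / 3.0
        if loglik(m1, data, F) < loglik(m2, data, F):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2.0

rng = random.Random(42)
pairs = []
for s in range(20):                    # S replicated data sets
    data = []
    for _ in range(300):
        x = rng.uniform(-2.0, 2.0)
        y = 1 if rng.random() < logit(1.5 * x) else 0  # true model: logit, beta = 1.5
        data.append((x, y))
    pairs.append((fit(data, logit), fit(data, probit)))

# Least-squares slope theta of the probit estimates on the logit estimates
mx = sum(a for a, _ in pairs) / len(pairs)
my = sum(b for _, b in pairs) / len(pairs)
theta = (sum((a - mx) * (b - my) for a, b in pairs)
         / sum((a - mx) ** 2 for a, _ in pairs))
```

Under this sketch, the fitted slope lands in the neighborhood of 0.6, in line with the structural equivalence results of this section.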

Example 1: We consider a random sample of $n$ observations where the $x_i$ are equally spaced points in an interval $[a,b]$ and the $y_i$ are drawn from one of the binary regression models. For instance, we generate the $y_i$'s from a cauchit model with intercept $\beta_0=1$ and slope $\beta_1=2$, i.e., $\eta(x_i)=1+2x_i$, with

$$\Pr[Y_i=1\mid x_i]=\pi(x_i)=\frac{1}{\pi}\left[\tan^{-1}(1+2x_i)+\frac{\pi}{2}\right].$$

Using $R$ replications, each running $S$ random samples, we obtain the results shown in Fig (4). The most striking finding here is that the estimated coefficient of determination is roughly equal to $1$, indicating that knowledge of the logit coefficient almost entirely determines the value of the probit coefficient. Hence our claim of structural equivalence between probit and logit. The value of the slope $\widehat{\theta}$ appears to be in the neighborhood of $0.6$.

Example 2: We now consider the famous Pima Indian Diabetes dataset, and obtain parameter estimates under both the logit and the probit models. The dataset is $7$-dimensional, with predictor variables npreg, glu, bp, skin, bmi, ped and age. Under the logit model, the probability that patient $i$ has diabetes given his or her characteristics is given by

$$\Pr[\mathrm{Diabetes}_i=1\mid\mathbf{x}_i]=\pi(\mathbf{x}_i)=\frac{1}{1+e^{-\eta(\mathbf{x}_i)}},$$

where

$$\eta(\mathbf{x}_i)=\beta_0+\beta_1\,\mathrm{npreg}+\beta_2\,\mathrm{glu}+\beta_3\,\mathrm{bp}+\beta_4\,\mathrm{skin}+\beta_5\,\mathrm{bmi}+\beta_6\,\mathrm{ped}+\beta_7\,\mathrm{age}.$$

We obtain the parameter estimates using R, and we display their values in the following table.

As can be seen in the above table, the ratio of the probit coefficient over the logit coefficient is still a number around $0.6$ for almost all the parameters. Indeed, the relationship

$$\widehat{\beta}^{(\mathrm{probit})}_j\simeq\tau+0.6\,\widehat{\beta}^{(\mathrm{logit})}_j+\nu$$

appears to still hold true. The deviation from that pattern observed in the variable skin is probably due to the extreme outlier in its distribution. It is important to note that although our theoretical justification was built under the simplified setting of a univariate model with no intercept, the relationship uncovered still holds true in a complete multivariate setting, with each predictor variable obeying the same relationship.

Example 3: We also consider the benchmark Leptograpsus crabs dataset, and obtain parameter estimates under both the logit and the probit models. The dataset is $5$-dimensional, with predictor variables FL, RW, CL, CW and BD. Under the logit model, the probability that the sex of crab $i$ is male given its characteristics is given by

$$\Pr[\mathrm{sex}_i=1\mid\mathbf{x}_i]=\pi(\mathbf{x}_i)=\frac{1}{1+e^{-\eta(\mathbf{x}_i)}},$$

where

$$\eta(\mathbf{x}_i)=\beta_0+\beta_1\,\mathrm{FL}+\beta_2\,\mathrm{RW}+\beta_3\,\mathrm{CL}+\beta_4\,\mathrm{CW}+\beta_5\,\mathrm{BD}.$$

We obtain the parameter estimates using R, and we display their values in the following table.

As can be seen in the above Table (4), the estimate of the ratio of the probit coefficient over the logit coefficient is still a number around $0.6$ for almost all the parameters. Indeed, the relationship

$$\widehat{\beta}^{(\mathrm{probit})}_j\simeq\tau+0.6\,\widehat{\beta}^{(\mathrm{logit})}_j+\nu$$

appears to still hold true. It is important to note that although our theoretical justification was built under the simplified setting of a univariate model with no intercept, the relationship uncovered still holds true in a complete multivariate setting, with each predictor variable obeying the same relationship.

###### Fact 1.

As can be seen from the examples above, the value of $\widehat{\theta}$ lies in the neighborhood of $0.6$, regardless of the task under consideration. This supports and confirms our conjecture that there is a fixed linear relationship between probit coefficients and logit coefficients, to the point that knowing one implies knowing the other. Hence, the two models are structurally equivalent. In a sense, wherever logistic regression has been used successfully, probit regression will do just as good a job. This result confirms what was already noticed and strongly expressed by Feller (1971) (pp 52-53).

### ii.3 Likelihood-based verification of structural equivalence

In the proofs presented earlier, we focused on the parameters and never mentioned their estimates. We now provide a likelihood-based verification of the structural equivalence of probit and logit. Without loss of generality, we shall focus on the univariate case where the underlying linear model does not have the intercept $\beta_0$, so that $\eta(x_i)=\beta x_i$. With $x_i$ denoting the predictor variable for the $i$th observation, we have the probability model $\pi(x_i)=\Pr[Y_i=1\mid x_i]=F(\beta x_i)$. Let $\widehat{\beta}^{(\mathrm{logit})}$ and $\widehat{\beta}^{(\mathrm{probit})}$ denote the estimates of $\beta$ for the logit and the probit link functions respectively. Our first verification of the equivalence of the above link functions consists of showing that $\widehat{\beta}^{(\mathrm{logit})}$ and $\widehat{\beta}^{(\mathrm{probit})}$ are linearly related through $\widehat{\beta}^{(\mathrm{probit})}=\tau+\theta\,\widehat{\beta}^{(\mathrm{logit})}+\nu$, with a coefficient of determination very close to $1$ and a slope $\theta$ that remains fixed regardless of the task at hand. We derive the approximate estimates of $\theta$ theoretically using Taylor series expansions, but we also confirm their values computationally by simulation.

###### Theorem 3.

Consider an i.i.d. sample $\{(x_i,y_i),\ i=1,\cdots,n\}$, where $x_i$ is a real-valued predictor variable and $y_i$ is the corresponding binary response. First consider fitting the probit model $\pi(x_i)=\Phi(\beta x_i)$ to the data, and let $\widehat{\beta}^{(\mathrm{probit})}$ denote the corresponding estimate of $\beta$. Then consider fitting the logit model $\pi(x_i)=\Lambda(\beta x_i)$ to the data, and let $\widehat{\beta}^{(\mathrm{logit})}$ denote the corresponding estimate of $\beta$. Then,

$$\widehat{\beta}^{(\mathrm{probit})}\simeq 0.625\,\widehat{\beta}^{(\mathrm{logit})}.$$
###### Proof.

Given an i.i.d. sample $\{(x_i,y_i),\ i=1,\cdots,n\}$ and the model $\pi(x_i)=F(\beta x_i)$, the loglikelihood for $\beta$ is given by

$$\ell(\beta)=\log L(\beta)=\sum_{i=1}^{n}\left\{y_i\log\pi(x_i)+(1-y_i)\log(1-\pi(x_i))\right\}.$$

Under the logit link function, we have $\pi(x_i)=\Lambda(\beta x_i)=1/(1+e^{-\beta x_i})$. Now, using a Taylor series expansion around zero for the two most important parts of the loglikelihood function, we get

$$\frac{\partial\log\pi(x_i)}{\partial\beta}=\frac{x_i}{2}-\frac{x_i^2}{4}\beta+\frac{x_i^4}{48}\beta^3-\frac{x_i^6}{480}\beta^5+\cdots$$

and

$$\frac{\partial\log(1-\pi(x_i))}{\partial\beta}=-\frac{x_i}{2}-\frac{x_i^2}{4}\beta+\frac{x_i^4}{48}\beta^3-\frac{x_i^6}{480}\beta^5+\cdots$$

The derivative of the approximate log-likelihood function for the logit model is then given by

$$\ell'(\beta)=\sum_{i=1}^{n}\left\{y_i\left(\frac{x_i}{2}-\frac{x_i^2}{4}\beta+\frac{x_i^4}{48}\beta^3-\frac{x_i^6}{480}\beta^5\right)\right\}+\sum_{i=1}^{n}\left\{(1-y_i)\left(-\frac{x_i}{2}-\frac{x_i^2}{4}\beta+\frac{x_i^4}{48}\beta^3-\frac{x_i^6}{480}\beta^5\right)\right\},$$

which, upon ignoring the higher-degree terms in the expansion, becomes

$$\ell'(\beta)\simeq\frac{1}{4}\sum_{i=1}^{n}\left\{4y_ix_i-2x_i-x_i^2\beta\right\}.$$

It is straightforward to see that solving $\ell'(\beta)=0$ for $\beta$ yields

$$\widehat{\beta}^{(\mathrm{logit})}\simeq 2\left[\frac{2\sum_{i=1}^{n}x_iy_i-\sum_{i=1}^{n}x_i}{\sum_{i=1}^{n}x_i^2}\right].$$

If we now consider the probit link function, we have $\pi(x_i)=\Phi(\beta x_i)$. Using a derivation similar to the one performed earlier, and ignoring higher-order terms, we get

$$\ell'(\beta)=\frac{\partial\ell(\beta)}{\partial\beta}\simeq\sum_{i=1}^{n}\left\{y_i\left(c_1x_i-2c_2\beta x_i^2\right)\right\}+\sum_{i=1}^{n}\left\{(1-y_i)\left(-c_1x_i-2c_2\beta x_i^2\right)\right\}=\sum_{i=1}^{n}\left\{2c_1x_iy_i-c_1x_i-2c_2\beta x_i^2\right\},$$

where $c_1=\sqrt{2/\pi}$ and $c_2=1/\pi$. This leads to

$$\widehat{\beta}^{(\mathrm{probit})}\simeq\frac{c_1}{2c_2}\left[\frac{2\sum_{i=1}^{n}x_iy_i-\sum_{i=1}^{n}x_i}{\sum_{i=1}^{n}x_i^2}\right].$$

It is then straightforward to see that

$$\frac{\widehat{\beta}^{(\mathrm{probit})}}{\widehat{\beta}^{(\mathrm{logit})}}\simeq\frac{c_1}{4c_2}=\frac{\sqrt{2\pi}}{4}\approx 0.625,$$

or equivalently

$$\widehat{\beta}^{(\mathrm{probit})}\simeq 0.625\,\widehat{\beta}^{(\mathrm{logit})}. \qquad\blacksquare$$

It must be emphasized that the above likelihood-based theoretical verifications depend on Taylor series approximations of the likelihood, and the factors of proportionality are therefore bound to be inexact. It is reassuring, however, to see that our computational verification does confirm the results found by theoretical derivation.
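The leading Taylor coefficients used in the probit derivation can themselves be sanity-checked numerically. The constants did not survive extraction, so reading them as $c_1=\sqrt{2/\pi}$ and $c_2=1/\pi$ is an assumption of this sketch; it compares the truncated expansion against a finite-difference derivative of $\log\Phi(\beta x)$:

```python
import math

Phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
c1 = math.sqrt(2.0 / math.pi)  # phi(0) / Phi(0), the first-order coefficient (assumed)
c2 = 1.0 / math.pi             # read off the second-order term (assumed)

def dlogPhi(beta, x, h=1e-6):
    # Central finite difference of log Phi(beta * x) with respect to beta
    return (math.log(Phi((beta + h) * x)) - math.log(Phi((beta - h) * x))) / (2.0 * h)

beta, x = 0.03, 1.3
approx = c1 * x - 2.0 * c2 * beta * x * x  # the truncated expansion from the proof
exact = dlogPhi(beta, x)
assert abs(approx - exact) < 5e-3

# With these constants, c1 / (4 c2) = sqrt(2 pi) / 4, approximately 0.627,
# consistent with the factor 0.625 quoted in Theorem 3
assert abs(c1 / (4.0 * c2) - math.sqrt(2.0 * math.pi) / 4.0) < 1e-12
```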

## III Similarities and Differences beyond Logit and Probit

Other aspects of our work reveal that the similarities proved and demonstrated above between the probit and the logit link functions extend predictively to the other link functions mentioned earlier. As far as structural equivalence (or the lack thereof) is concerned, Appendix A contains similar derivations for the relationship between cauchit and logit, and for the relationship between compit (complementary log-log) and logit. As far as predictive equivalence is concerned, we now present a verification based on the computation of many replications of the test error.

### iii.1 Computational Verification of Predictive Equivalence

We now computationally compare the predictive merits of each of the four link functions considered so far. To this end, we compare the estimated average test errors yielded by the four link functions. We do so by running $R$ replications of the split of the data set into training and test sets, and at each iteration we compute the corresponding test error for the classifier based on each link function. For one iteration/replication for instance, $\mathrm{TE}(\mathrm{probit})$, $\mathrm{TE}(\mathrm{compit})$, $\mathrm{TE}(\mathrm{cauchit})$ and $\mathrm{TE}(\mathrm{logit})$ are the values of the test error generated by probit, compit, cauchit and logit respectively. After $R$ replications, we have $R$ random realizations of each of those four test errors. We then perform various statistical calculations on the replications, namely median, mean, standard deviation, kurtosis, skewness, IQR, etc., to assess the similarities and the differences among the link functions. We perform similar replications for model comparison using both AIC and BIC.

Example 4: Verification of Predictive Equivalence on Artificial Data: the $y_i$ are drawn from a cauchit binary regression model with $\beta_0=1$ and $\beta_1=2$, namely

$$\pi(x_i)=\Pr[Y_i=1\mid x_i]=\frac{1}{\pi}\left[\tan^{-1}(1+2x_i)+\frac{\pi}{2}\right].$$

Table (5) shows some statistics on $R$ replications of the test error. The above results suggest that the four link functions are almost indistinguishable, as the estimated statistics are almost all equal across the examples.

Example 5: Verification of Predictive Equivalence on the Pima Indian Diabetes Dataset:

We once again consider the famous Pima Indian Diabetes dataset, arguably one of the most used benchmark data sets in the statistics and pattern recognition communities. As can be seen in Table (6), there is virtually no difference between the models. In other words, on the Pima Indian Diabetes data set, the four link functions are predictively equivalent.

It is also noteworthy that all four models yield similar goodness-of-fit measures when scored using AIC and BIC. Indeed, Figure (5) reveals that over the $R$ replications of the split of the data into training and test sets, both the AIC and the BIC are distributionally similar across all four link functions.

Despite the slight difference shown by the cauchit model, it is fair to say that all the link functions are equivalent in terms of goodness of fit. Once again, this is yet more evidence to support and somewhat reinforce Feller (1971)'s claim that all these link functions are equivalent in terms of goodness of fit, and that the over-glorification of the logit model is at best misguided if not unfounded.

### iii.2 Evidence of Differences in High Dimensional Spaces

Simulated evidence: We generate $n$ observations in an interval $[a,b]$. For each link function, we compute the sign of $\widehat{\pi}(x_i)-\frac{1}{2}$ for $i=1,\cdots,n$. We then generate a table containing the percentage of times the signs differ across link functions.

Computational Demonstrations on the Email Spam Data:

Unlike all the other data sets encountered thus far, the email spam data set is fairly high dimensional, with several dozen variables measured on several thousand observations.

Clearly, the results depicted in Table (8) reveal some drastic differences in performance among the four link functions on this rather high dimensional data. The boxplots below reinforce these findings as they show that in terms of goodness of fit measured through AIC and BIC, the compit model deviates substantially from the other models.

## IV Conclusion and Discussion

Throughout this paper, we have explored, both conceptually/methodologically and computationally, the similarities among four of the most commonly used link functions in binary regression. We have theoretically shed some light on some of the structural reasons that explain the indistinguishability in performance among the four link functions considered in univariate settings. Although Section II concentrated mainly on the equivalence of the logit and the probit, the Appendix provides a similar derivation for both the cauchit and the complementary log-log link functions. We have also demonstrated by computational simulations that the four link functions are essentially equivalent, both structurally and predictively, in the univariate setting and in low dimensional spaces. Our last example showed computationally that the four link functions might differ quite substantially when the dimensionality of the input space becomes extremely large. We notice specifically that the performance in high dimensional spaces tends to depend on the internal structure of the input: completely orthogonal designs tend to bode well with all the perfectly symmetric link functions, while non-orthogonal designs deliver their best performances under the complementary log-log. Finally, the sparseness of the input space tends to dictate the choice of the most appropriate link function, cauchit tending to be the model of choice under high levels of sparseness. In our future work, we intend to provide as complete a theoretical characterization as possible in extremely high dimensional spaces, namely providing the conditions under which each of the link functions will yield the best fit for the data.

## References

• Armagan and Zaretzki (2011) Armagan, A. and R. Zaretzki (2011). A note on mean-field variational approximations in bayesian probit models. Computational Statistics and Data Analysis 55, 641–643.
• Basu and Mukhopadhyay (2000) Basu, S. and S. Mukhopadhyay (2000). Bayesian analysis of binary regression using symmetric and asymmetric links. Sankhya: The Indian Journal of Statistics 62(3), 372–387.
• Chakraborty (2009) Chakraborty, S. (2009). Bayesian binary kernel probit model for microarray based cancer classification and gene selection. Computational Statistics and Data Analysis 53, 4198–4209.
• Chambers and Cox (1967) Chambers, E. and D. Cox (1967). Discrimination between alternative binary response models. Biometrika 54(3/4), 573–578.
• Csató et al. (2000) Csató, L., E. Fokoué, M. Opper, B. Schottky, and O. Winther (2000). Efficient approaches to gaussian process classification. In S. A. Solla, T. K. Leen, and e. K.-R. Müller (Eds.), Advances in Neural Information Processing Systems, Number 12. MIT Press.
• Feller (1940) Feller, W. (1940). On the logistic law of growth and its empirical verification in biology. Acta Biotheoretica 5, 51–66.
• Feller (1971) Feller, W. (1971). An Introduction to Probability Theory and Its Applications (Second ed.), Volume II. New York: John Wiley and Sons.
• Lin and Hu (2008) Lin, G. D. and C. Y. Hu (2008). On characterizations of the logistic distribution. Journal of Statistical Planning and Inference 138, 1147–1156.
• Nadarajah (2004) Nadarajah, S. (2004). Information matrix for logistic distributions. Mathematical and Computer Modelling 40, 953–958.
• Nassar and Elmasry (2012) Nassar, M. M. and A. Elmasry (2012). A study of generalized logistic distributions. Journal of the Egyptian Mathematical Society 20, 126–133.
• Schumacher et al. (1996) Schumacher, M., R. Robner, and W. Vach (1996). Neural networks and logistic regression: Part i. Computational Statistics and Data Analysis 21, 661–682.
• Tamura and Giampaoli (2013) Tamura, K. A. and V. Giampaoli (2013). New prediction method for the mixed logistic model applied in a marketing problem. Computational Statistics and Data Analysis 66, 202–216.
• van den Hout et al. (2007) van den Hout, A., P. van der Heijden, and R. Gilchrist (2007). The Logistic Regression Model with Response Variables Subject to Randomized Response. Computational Statistics and Data Analysis 51, 6060–6069.
• Zelterman (1989) Zelterman, D. (1989). Order statistics for the generalized logistic distribution. Computational Statistics and Data Analysis 7, 69–77.

## V Appendix A

###### Theorem 4.

Consider an i.i.d. sample $\{(x_i,y_i),\ i=1,\cdots,n\}$, where $x_i$ is a real-valued predictor variable and $y_i$ is the corresponding binary response. First consider fitting the cauchit model $\pi(x_i)=\frac{1}{\pi}\tan^{-1}(\beta x_i)+\frac{1}{2}$ to the data, and let $\widehat{\beta}^{(\mathrm{cauchit})}$ denote the corresponding estimate of $\beta$. Then consider fitting the logit model $\pi(x_i)=\Lambda(\beta x_i)$ to the data, and let $\widehat{\beta}^{(\mathrm{logit})}$ denote the corresponding estimate of $\beta$. Then,

$$\widehat{\beta}^{(\mathrm{cauchit})}\simeq\frac{\pi}{4}\,\widehat{\beta}^{(\mathrm{logit})}.$$
###### Proof.

Given an i.i.d. sample $\{(x_i,y_i),\ i=1,\cdots,n\}$ and the model $\pi(x_i)=F(\beta x_i)$, the loglikelihood for $\beta$ is given by

$$\ell(\beta)=\log L(\beta)=\sum_{i=1}^{n}\left\{y_i\log\pi(x_i)+(1-y_i)\log(1-\pi(x_i))\right\}. \qquad (5)$$

For the cauchit, $\pi(x_i)=\frac{1}{\pi}\tan^{-1}(\beta x_i)+\frac{1}{2}$. We use Taylor series expansions around zero for both $\log\pi(x_i)$ and $\log(1-\pi(x_i))$:

$$\log\pi(x_i)=-\log 2+\frac{2\beta x_i}{\pi}-\frac{2\beta^2x_i^2}{\pi^2}-\frac{2(\pi^2-4)\beta^3x_i^3}{3\pi^3}+O(x_i^4)$$

and

$$\log(1-\pi(x_i))=-\log 2-\frac{2\beta x_i}{\pi}-\frac{2\beta^2x_i^2}{\pi^2}+\frac{2(\pi^2-4)\beta^3x_i^3}{3\pi^3}+O(x_i^4).$$

A first-order approximation of the derivative of the log-likelihood with respect to $\beta$ is

$$\ell'(\beta)=\frac{\partial\ell(\beta)}{\partial\beta}\simeq\sum_{i=1}^{n}\left\{\frac{4}{\pi}x_iy_i-\frac{2}{\pi}x_i-\frac{4}{\pi^2}\beta x_i^2\right\}.$$

Solving $\ell'(\beta)=0$ yields

$$\frac{2}{\pi}\left(2\sum_{i=1}^{n}x_iy_i-\sum_{i=1}^{n}x_i\right)=\frac{4}{\pi^2}\,\widehat{\beta}\sum_{i=1}^{n}x_i^2,$$

which simplifies to

$$\widehat{\beta}^{(\mathrm{cauchit})}=\frac{\pi}{2}\left[\frac{2\sum_{i=1}^{n}x_iy_i-\sum_{i=1}^{n}x_i}{\sum_{i=1}^{n}x_i^2}\right].$$

Combining this with $\widehat{\beta}^{(\mathrm{logit})}\simeq 2\left[\left(2\sum_{i=1}^{n}x_iy_i-\sum_{i=1}^{n}x_i\right)/\sum_{i=1}^{n}x_i^2\right]$ from Theorem 3 gives $\widehat{\beta}^{(\mathrm{cauchit})}\simeq\frac{\pi}{4}\,\widehat{\beta}^{(\mathrm{logit})}$. ∎
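The cubic expansion of $\log\pi(x_i)$ used in this proof can be sanity-checked numerically; a minimal sketch comparing the Taylor polynomial against the exact log-cdf for a small value of $z=\beta x_i$:

```python
import math

def log_pi_cauchit(z):
    # Exact value: log( arctan(z)/pi + 1/2 ), with z = beta * x_i
    return math.log(math.atan(z) / math.pi + 0.5)

def cubic_series(z):
    # Cubic Taylor polynomial from the expansion in the proof above
    return (-math.log(2.0)
            + 2.0 * z / math.pi
            - 2.0 * z ** 2 / math.pi ** 2
            - 2.0 * (math.pi ** 2 - 4.0) * z ** 3 / (3.0 * math.pi ** 3))

z = 0.1
assert abs(log_pi_cauchit(z) - cubic_series(z)) < 1e-4
```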