I Introduction
Given $\mathbf{x} = (x_1, \ldots, x_p)^\top \in \mathcal{X} \subseteq \mathbb{R}^p$, where $\mathbf{x}$ denotes the $p$-dimensional vector of characteristics, and $y \in \{0, 1\}$ denotes the binary response variable, binary regression seeks to model the relationship between $\mathbf{x}$ and $y$ using
$$\Pr[y = 1 \mid \mathbf{x}] = \pi(\mathbf{x}) = F(\eta(\mathbf{x})), \qquad (1)$$
where
$$\eta(\mathbf{x}) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p, \qquad (2)$$
for a $(p+1)$-dimensional vector $\boldsymbol{\beta} = (\beta_0, \beta_1, \ldots, \beta_p)^\top$ of regression coefficients, and $F(\cdot)$ is the cdf corresponding to the link function under consideration. Specifically, the cdf $F$ is the inverse of the link function $g$, such that $g(\pi(\mathbf{x})) = F^{-1}(\pi(\mathbf{x})) = \eta(\mathbf{x})$. Table (1) provides specific definitions of the link functions considered in this paper, along with their corresponding cdfs.
Model  Link function $g(u)$  cdf $F(\eta)$

Probit  $\Phi^{-1}(u)$  $\Phi(\eta)$
Compit  $\log\{-\log(1 - u)\}$  $1 - \exp(-e^{\eta})$
Cauchit  $\tan\{\pi(u - \tfrac{1}{2})\}$  $\tfrac{1}{2} + \tfrac{1}{\pi}\arctan(\eta)$
Logit  $\log\{u/(1 - u)\}$  $e^{\eta}/(1 + e^{\eta})$
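The entries of Table (1) can be verified numerically. The small Python sketch below (hypothetical names) checks that each cdf $F$ is indeed the inverse of its link $g$, i.e., that $g(F(\eta)) = \eta$:

```python
import math
from statistics import NormalDist

# The four cdfs F from Table (1) and their links g = F^{-1} (names are illustrative).
cdfs = {
    "probit":  NormalDist().cdf,
    "compit":  lambda z: 1 - math.exp(-math.exp(z)),
    "cauchit": lambda z: 0.5 + math.atan(z) / math.pi,
    "logit":   lambda z: math.exp(z) / (1 + math.exp(z)),
}
links = {
    "probit":  NormalDist().inv_cdf,
    "compit":  lambda u: math.log(-math.log(1 - u)),
    "cauchit": lambda u: math.tan(math.pi * (u - 0.5)),
    "logit":   lambda u: math.log(u / (1 - u)),
}

# Applying the link to the cdf recovers the linear predictor: g(F(eta)) = eta.
for name in cdfs:
    for eta in (-2.0, -0.5, 0.0, 1.0, 2.0):
        assert abs(links[name](cdfs[name](eta)) - eta) < 1e-6
```

Note that the compit cdf is the only asymmetric one of the four: $F(0) = 1 - e^{-1} \approx 0.632 \neq 1/2$, a point that matters later when predictions are compared.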
The above link functions have been used extensively in a wide variety of applications in fields as diverse as medicine, engineering, economics, psychology, and education, just to name a few. The logit link function, for which
$$\pi(\mathbf{x}) = \frac{e^{\eta(\mathbf{x})}}{1 + e^{\eta(\mathbf{x})}}, \qquad (3)$$
is the most commonly used of all of them, probably because it provides a nice interpretation of the regression coefficients in terms of odds ratios. The popularity of the logit link also comes from its computational convenience, in the sense that its model formulation yields simpler maximum likelihood equations and faster convergence. In fact, the literature on both the theory and applications of the logistic distribution is so vast that it would be unthinkable to reference even a fraction of it. Some recent authors, like
Zelterman (1989), Schumacher et al. (1996), Nadarajah (2004), Lin and Hu (2008), and Nassar and Elmasry (2012), provide extensive studies on the characteristics of generalized logistic distributions, answering the ever increasing interest in the logistic family of distributions. Indeed, applications abound that make use of both the standard logistic regression model and the so-called generalized logistic regression model, as can be seen in
van den Hout et al. (2007) and Tamura and Giampaoli (2013). The probit link, for which
$$\pi(\mathbf{x}) = \Phi(\eta(\mathbf{x})) = \int_{-\infty}^{\eta(\mathbf{x})} \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2}\, dt, \qquad (4)$$
is the second most commonly used of all the link functions, with Bayesian researchers seemingly topping the charts in its use. See Basu and Mukhopadhyay (2000), Csató et al. (2000), Chakraborty (2009) for a few examples of probit use in binary classification in the Bayesian setting. Armagan and Zaretzki (2011)
is another of the many references pointing to the use of the probit link function in the statistical data mining and machine learning communities.
In the presence of so many possible choices of link functions, the natural question to ask is: how does one go about choosing a suitable link function for the problem at hand? Most experts and nonexperts alike who deal with binary classification tend to almost automatically choose the logit link, to the point that the logit link has almost been attributed a transcendental place. From experience, experimentation, and mathematical proof, it is our view, a view shared by Feller (1940) and Feller (1971), that all these link functions are equivalent, both structurally and predictively. Indeed, our conjectured equivalence of binary regression link functions is strongly supported by William Feller in his vehement criticism of the overuse of the logit link function and of the tendency to give it a place above the rest of the existing link functions. In Feller (1971)'s own words: "An unbelievably huge literature tried to establish a transcendental 'law of logistic growth'; measured in appropriate units, practically all growth processes were supposed to be represented by a function of the form (3) with $t$ representing time. Lengthy tables, complete with chi-square tests, supported this thesis for human populations, for bacterial colonies, development of railroads, etc. Both height and weight of plants and animals were found to follow the logistic law even though it is theoretically clear that these two variables cannot be subject to the same distribution. Laboratory experiments on bacteria showed that not even systematic disturbances can produce other results. Population theory relied on logistic extrapolations (even though they were demonstrably unreliable). The only trouble with the theory is that not only the logistic distribution but also the normal, the Cauchy, and other distributions can be fitted to the same material with the same or better goodness of fit.
In this competition the logistic distribution plays no distinguished role whatever; most contradictory theoretical models can be supported by the same observational material.
As a matter of fact, it is obvious from the plot of their densities, for instance, that the probit and logit are virtually identical, almost superimposed one on top of the other. It is therefore not surprising that one would empirically notice virtually no difference when the two are compared on the same binary regression task. Despite this apparent indistinguishability due to their many similarities, it is fair to recognize that the two functions differ, at least by definition and by their very algebra. Chambers and Cox (1967) argue in their paper that probit and logit will yield different results in the multivariate context. Their work is a rarity in a context where most researchers seem to have settled comfortably into accepting that the two links are essentially the same from a utility perspective. For such researchers, using one over the other is determined solely by mathematical convenience and a matter of taste. We demonstrate both theoretically and computationally that all the links are predictively equivalent in the univariate case, but we also provide a characterization of the conditions under which they tend to differ in the multivariate context.
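The near-coincidence of the probit and logit curves is easy to check numerically. The sketch below (Python; the scaling constant $\sqrt{2\pi}/4$ is the one suggested by the likelihood analysis later in the paper) measures the largest vertical gap between the standard normal cdf and a rescaled logistic cdf:

```python
import math
from statistics import NormalDist

Phi = NormalDist().cdf                    # standard normal cdf
G = lambda z: 1 / (1 + math.exp(-z))      # standard logistic cdf
c = math.sqrt(2 * math.pi) / 4            # scaling constant, roughly 0.6267

# Largest vertical gap between Phi(z) and the rescaled logistic G(z/c) on [-6, 6].
gap = max(abs(Phi(z) - G(z / c)) for z in (i / 100 for i in range(-600, 601)))
print(gap)
```

With this scaling the maximal gap is on the order of $10^{-2}$, which is why the two curves look superimposed in a plot.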
Throughout this work, we perform model comparison and model selection using both the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC). Taking the view that the ability of an estimator to generalize well over the whole population provides the best measure of its ultimate utility, we provide extensive comparisons of the performance of each link function based on its corresponding test error. In the present work, we perform a large number of simulations in various dimensions using both artificial and real-life data. Our results persistently reveal the performance indistinguishability of the links in univariate settings, but some sharp differences begin to appear as the dimension of the input space (number of variables measured) increases.
The rest of this paper is organized as follows: Section II presents some general definitions, namely our meaning of the terms predictive equivalence and structural equivalence, along with some computational demonstrations on simulated and real-life data. That section also clearly describes our approach to demonstrating and verifying our claimed results: we show that for low to moderate dimensional spaces, goodness-of-fit and predictive performance measures reveal the equivalence between probit and logit, and we provide our formal proof of the equivalence of probit and logit. Section III reveals that there may be some differences in performance when the input space becomes very large; our demonstration in that section is based on the famous AT&T Email Spam data set. Section IV provides a conclusion and a discussion, along with insights into extensions of the present work.
II Definitions, Methodology and Verification
Throughout this work, we consider comparing models both on the merits of goodness of fit and on predictive performance. With that in mind, we can then define equivalence both from a goodness-of-fit perspective and from a predictive optimality perspective. From a predictive analytics perspective, for instance, an important question to ask is: given a randomly selected vector $\mathbf{x}$, what is the probability that the prediction made by probit will differ from the one made by logit? In other words, how often do the probit and logit link functions yield different predictions? This is particularly important in predictive analytics in data mining and machine learning, where the nonparametric nature of most models forces the experimenter to focus on the utility of the estimator rather than its form. We respond to this need by defining what we call predictive equivalence.
II.1 Basic definitions and results
Definition 1.
(Binary classifier) Given an input space $\mathcal{X} \subseteq \mathbb{R}^p$ and a binary response space $\mathcal{Y} = \{0, 1\}$, we define a (binary) classifier to be a function $f$ that maps elements of $\mathcal{X}$ to $\mathcal{Y}$, or more specifically $f: \mathcal{X} \to \{0, 1\}$. In the generalized linear model (GLM) framework, given a link function $g$ with corresponding cdf $F$, a binary classifier under the majority rule takes the form
$$f(\mathbf{x}) = \mathbb{1}\{F(\eta(\mathbf{x})) > 1/2\},$$
where $\mathbb{1}\{\cdot\}$ denotes the indicator function and $\eta(\mathbf{x})$ is the linear component. For instance, the logit binary classifier is given by
$$f_{\mathrm{logit}}(\mathbf{x}) = \mathbb{1}\{F_{\mathrm{logit}}(\eta(\mathbf{x})) > 1/2\},$$
and the probit binary classifier is given by
$$f_{\mathrm{probit}}(\mathbf{x}) = \mathbb{1}\{F_{\mathrm{probit}}(\eta(\mathbf{x})) > 1/2\},$$
where $F_{\mathrm{logit}}$ and $F_{\mathrm{probit}}$ are as defined in Table (1).
We shall measure the predictive performance of a classifier $f$ by choosing a loss function $\ell(y, f(\mathbf{x}))$ and then computing the expected loss (also known as the risk functional) as follows:
$$R(f) = \mathbb{E}\left[\ell(Y, f(X))\right].$$
Under the zero-one loss function $\ell(y, f(\mathbf{x})) = \mathbb{1}\{y \neq f(\mathbf{x})\}$, the risk functional is the misclassification rate, more specifically
$$R(f) = \Pr[Y \neq f(X)].$$
In practice, $R(f)$ cannot be computed in closed form because the distribution of $(X, Y)$ is unknown. We shall therefore use the so-called average test error, or average empirical prediction error, as our predictive performance measure to compare classifiers.
Definition 2.
(Average Test Error) Given a sample $\mathcal{D} = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}$, we randomly form a training set and a test set. We typically run $R$ replications of this split, with a fixed proportion of the data allocated to the training set and the remainder to the test set. The test error here under the symmetric zero-one loss is given by
$$\mathrm{TE}(f) = \frac{1}{m} \sum_{i \in \text{test}} \mathbb{1}\{y_i \neq f(\mathbf{x}_i)\},$$
where $m$ is the size of the test set, from which the average test error of $f$ over the $R$ random splits of the data is given by
$$\overline{\mathrm{TE}}(f) = \frac{1}{R} \sum_{r=1}^{R} \mathrm{TE}_r(f),$$
where $\mathrm{TE}_r(f)$ is the test error yielded by $f$ on the $r$th split of the data.
Definition 3.
(Predictively concordant classifiers) Let $f_1$ and $f_2$ be two classifiers defined on the same $p$-dimensional input space $\mathcal{X}$. We shall say that $f_1$ and $f_2$ are predictively concordant if, for $X$ drawn according to the density $p(\mathbf{x})$, the probability of disagreement
$$\Pr[f_1(X) \neq f_2(X)]$$
is negligible. In other words, $f_1$ and $f_2$ are predictively concordant if the probability of disagreement between the two classifiers is negligibly small. When $\Pr[f_1(X) \neq f_2(X)] = 0$, we say that $f_1$ and $f_2$ are perfectly predictively concordant.
Definition 4.
(Predictively equivalent classifiers) Let $f_1$ and $f_2$ be two classifiers defined on the same $p$-dimensional input space $\mathcal{X}$. We shall say that $f_1$ and $f_2$ are predictively equivalent if the difference between their average test errors is negligible, i.e., $|\overline{\mathrm{TE}}(f_1) - \overline{\mathrm{TE}}(f_2)| \approx 0$.
Lemma 1.
Let $G$ denote the standard logistic cdf and $\Phi$ the standard normal cdf. If $c = \sqrt{2\pi}/4$, then $\Phi(z) \approx G(z/c)$ for all $z$.
Demonstration: Figure (3) below shows that this scaled version of the logistic cdf lines up almost perfectly with the standard normal.
Lemma 2.
Let $\Phi$ denote the standard normal cdf. Then, in a neighborhood of zero,
$$\Phi(z) = \frac{1}{2} + \frac{z}{\sqrt{2\pi}} + O(z^3).$$
Theorem 1.
The probit and logit link functions are perfectly predictively concordant. Specifically, given an input space $\mathcal{X}$ and a density $p(\mathbf{x})$ on $\mathcal{X}$,
$$\Pr[f_{\mathrm{probit}}(X) \neq f_{\mathrm{logit}}(X)] = 0$$
for $X$ drawn according to $p(\mathbf{x})$.
Proof. Both $F_{\mathrm{probit}} = \Phi$ and $F_{\mathrm{logit}}$ are continuous, strictly increasing, and equal to $1/2$ at zero. Consequently, $F_{\mathrm{probit}}(\eta(\mathbf{x})) > 1/2 \iff \eta(\mathbf{x}) > 0 \iff F_{\mathrm{logit}}(\eta(\mathbf{x})) > 1/2$, so the two classifiers yield the same prediction for every $\mathbf{x}$. ∎
Definition 5.
Let $M_1$ and $M_2$ be two binary regression models based on two different link functions defined on the same $p$-dimensional input space, with coefficient vectors $\boldsymbol{\beta}^{(1)}$ and $\boldsymbol{\beta}^{(2)}$ respectively. We shall say that $M_1$ and $M_2$ are structurally equivalent if there exists a nonzero real constant $c$ such that $\beta_j^{(1)} = c\, \beta_j^{(2)}$ for all $j$. In other words, the parameters of $M_1$ are just a scaled version of the parameters of $M_2$, so that knowing the parameters of one model is sufficient to completely determine the parameters of the other, and vice versa.
Theorem 2.
The logit and probit models are structurally equivalent.
Proof.
Thanks to Lemma (1), we can write
$$\Phi(\eta(\mathbf{x})) \approx G(\eta(\mathbf{x})/c),$$
where $c = \sqrt{2\pi}/4$. A probit model with coefficient vector $\boldsymbol{\beta}_{\mathrm{probit}}$ therefore yields the same response probabilities as a logit model with coefficient vector $\boldsymbol{\beta}_{\mathrm{probit}}/c$. We have therefore found a nonzero real constant $c$ such that $\boldsymbol{\beta}_{\mathrm{probit}} = c\, \boldsymbol{\beta}_{\mathrm{logit}}$. ∎
II.2 Computational Verification via Simulation
To probe more deeply into how strongly related the probit and logit models are, we now seek to estimate via simulation the constant coefficient that relates their parameter estimates. Indeed, we conjecture that $\hat\beta_{\mathrm{probit}}$ and $\hat\beta_{\mathrm{logit}}$ are linearly related via the regression equation
$$\hat\beta_{\mathrm{probit}} = \gamma_0 + \gamma_1\, \hat\beta_{\mathrm{logit}} + \epsilon,$$
where $\gamma_0$ is the intercept and $\epsilon$ is the noise term. To estimate one instance of $\gamma_1$, we generate $S$ random replications of the dataset, and for each replication we estimate a copy of the pair $(\hat\beta_{\mathrm{logit}}, \hat\beta_{\mathrm{probit}})$, with which we also compute an estimate of the correlation coefficient between $\hat\beta_{\mathrm{logit}}$ and $\hat\beta_{\mathrm{probit}}$. By repeating the estimation $R$ times, we gather the data needed to determine the central tendency of $\hat\gamma_1$ and of the corresponding correlation.

For r = 1 to R
  For s = 1 to S
    Generate a replicate of the random sample
    Estimate the logit and probit model coefficients $\hat\beta_{\mathrm{logit}}^{(s)}$ and $\hat\beta_{\mathrm{probit}}^{(s)}$
  End
  Store the simulated data $\{(\hat\beta_{\mathrm{logit}}^{(s)}, \hat\beta_{\mathrm{probit}}^{(s)}),\ s = 1, \ldots, S\}$
  Fit $M_r$, the regression model $\hat\beta_{\mathrm{probit}} = \gamma_0 + \gamma_1\, \hat\beta_{\mathrm{logit}} + \epsilon$, using the stored pairs
  Extract the coefficient estimate $\hat\gamma_1^{(r)}$ from $M_r$
  Compute the estimate $\hat\rho^{(r)}$ of the correlation between $\hat\beta_{\mathrm{logit}}$ and $\hat\beta_{\mathrm{probit}}$
End
Collect $\{(\hat\gamma_1^{(r)}, \hat\rho^{(r)}),\ r = 1, \ldots, R\}$, then compute relevant statistics.
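The inner loop of this scheme can be sketched in code. The following Python sketch (hypothetical helper names, used here instead of the R fits employed in the paper) fits the univariate no-intercept logit and probit models by Fisher scoring on replicate datasets and tracks the ratio of the two slope estimates; under our conjecture, the average ratio should sit near $\sqrt{2\pi}/4 \approx 0.63$:

```python
import math
import random
from statistics import NormalDist

def fit_slope(x, y, cdf, pdf, iters=25):
    # Fisher-scoring MLE for the univariate, no-intercept model P(y=1|x) = F(beta*x)
    b = 0.0
    for _ in range(iters):
        score = info = 0.0
        for xi, yi in zip(x, y):
            p = min(max(cdf(b * xi), 1e-9), 1 - 1e-9)
            d = pdf(b * xi)
            w = d * xi / (p * (1 - p))
            score += (yi - p) * w          # score contribution
            info += d * xi * w             # Fisher information contribution
        b += score / info
    return b

norm = NormalDist()
logis_cdf = lambda z: 1 / (1 + math.exp(-z))
logis_pdf = lambda z: logis_cdf(z) * (1 - logis_cdf(z))

random.seed(1)
ratios = []
for s in range(100):                       # S replicate datasets
    x = [random.uniform(-3, 3) for _ in range(300)]
    y = [int(random.random() < logis_cdf(1.0 * xi)) for xi in x]
    b_logit = fit_slope(x, y, logis_cdf, logis_pdf)
    b_probit = fit_slope(x, y, norm.cdf, norm.pdf)
    ratios.append(b_probit / b_logit)

print(sum(ratios) / len(ratios))   # conjectured constant: sqrt(2*pi)/4 ~ 0.63
```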
Example 1: We consider a random sample of $n$ observations where the $x_i$ are equally spaced points in an interval $[a, b]$, and the $y_i$ are drawn from one of the binary regression models. For instance, we fix the domain of $x$ and generate the $y_i$'s from a Cauchit model with fixed slope and intercept, i.e., $y_i \sim \mathrm{Bernoulli}(\pi(x_i))$ with
$$\pi(x_i) = \frac{1}{2} + \frac{1}{\pi}\arctan(\beta_0 + \beta_1 x_i).$$
Using $R$ replications, each running $S$ random samples, we obtain the results shown in Fig (4). The most striking finding here is that the estimated coefficient of determination is roughly equal to 1, indicating that knowledge of the logit coefficient almost entirely determines the value of the probit coefficient. Hence our claim of structural equivalence between probit and logit. The value of the slope appears to be in the neighborhood of $\sqrt{2\pi}/4 \approx 0.63$.
Example 2: We now consider the famous Pima Indian Diabetes dataset, and obtain parameter estimates under both the logit and the probit models. The dataset is 7-dimensional, with predictors npreg, glu, bp, skin, bmi, ped, and age. Under the logit model, the probability that patient $i$ has diabetes given the characteristics $\mathbf{x}_i$ is given by
$$\Pr[y_i = 1 \mid \mathbf{x}_i] = \frac{e^{\eta(\mathbf{x}_i)}}{1 + e^{\eta(\mathbf{x}_i)}},$$
where
$$\eta(\mathbf{x}_i) = \beta_0 + \beta_1\,\mathrm{npreg}_i + \beta_2\,\mathrm{glu}_i + \beta_3\,\mathrm{bp}_i + \beta_4\,\mathrm{skin}_i + \beta_5\,\mathrm{bmi}_i + \beta_6\,\mathrm{ped}_i + \beta_7\,\mathrm{age}_i.$$
We obtain the parameter estimates using R, and we display their values in the following table.
Model  npreg  glu  bp  skin 

Probit  
Logit  
Ratio 
Model  bmi  ped  age 

Probit  
Logit  
Ratio 
As can be seen in the above table, the ratio of the probit coefficient over the logit coefficient is still a number around $0.63$ for almost all the parameters. Indeed, the relationship
$$\hat\beta_{\mathrm{probit}} \approx \frac{\sqrt{2\pi}}{4}\,\hat\beta_{\mathrm{logit}}$$
appears to still hold true. The deviation from that pattern observed in the variable skin is probably due to the extreme outlier in its distribution. It is important to note that although our theoretical justification was built under the simplified setting of a univariate model with no intercept, the relationship uncovered still holds true in a complete multivariate setting, with each predictor variable obeying the same relationship.
Example 3: We also consider the benchmark Leptograpsus Crabs dataset, and obtain parameter estimates under both the logit and the probit models. The predictors used are FL, RW, CL, and BD. Under the logit model, the probability that the sex of crab $i$ is male given the characteristics $\mathbf{x}_i$ is given by
$$\Pr[y_i = 1 \mid \mathbf{x}_i] = \frac{e^{\eta(\mathbf{x}_i)}}{1 + e^{\eta(\mathbf{x}_i)}},$$
where
$$\eta(\mathbf{x}_i) = \beta_0 + \beta_1\,\mathrm{FL}_i + \beta_2\,\mathrm{RW}_i + \beta_3\,\mathrm{CL}_i + \beta_4\,\mathrm{BD}_i.$$
We obtain the parameter estimates using R, and we display their values in the following table.
Model  FL  RW  CL  BD 

Probit  
Logit  
Ratio 
As can be seen in the above Table (4), the estimate of the ratio of the probit coefficient over the logit coefficient is still a number around $0.63$ for almost all the parameters. Indeed, the relationship
$$\hat\beta_{\mathrm{probit}} \approx \frac{\sqrt{2\pi}}{4}\,\hat\beta_{\mathrm{logit}}$$
appears to still hold true. It is important to note that although our theoretical justification was built under the simplified setting of a univariate model with no intercept, the relationship uncovered still holds true in a complete multivariate setting, with each predictor variable obeying the same relationship.
Fact 1.
As can be seen from the examples above, the value of $\hat\gamma_1$ lies in the neighborhood of $\sqrt{2\pi}/4 \approx 0.63$, regardless of the task under consideration. This supports and confirms our conjecture that there is a fixed linear relationship between probit coefficients and logit coefficients, to the point that knowing one implies knowing the other. Hence, the two models are structurally equivalent. In a sense, wherever logistic regression has been used successfully, probit regression will do just as good a job. This result confirms what was already noticed and strongly expressed by Feller (1971) (pp. 52-53).
II.3 Likelihood-based verification of structural equivalence
In the proofs presented earlier, we focused on the parameters and never mentioned their estimates. We now provide a likelihood-based verification of the structural equivalence of probit and logit. Without loss of generality, we shall focus on the univariate case where the underlying linear model does not have the intercept $\beta_0$, so that $\eta(x) = \beta x$. With $x_i$ denoting the predictor variable for the $i$th observation, we have the probability model $\Pr[y_i = 1 \mid x_i] = F(\beta x_i)$. Let $\hat\beta_{\mathrm{logit}}$ and $\hat\beta_{\mathrm{probit}}$ denote the estimates of $\beta$ under the logit and the probit link functions respectively. Our first verification of the equivalence of the above link functions consists of showing that $\hat\beta_{\mathrm{probit}}$ and $\hat\beta_{\mathrm{logit}}$ are linearly related, with a coefficient of determination very close to 1 and a slope that remains fixed regardless of the task at hand. We derive the approximate estimates theoretically using Taylor series expansions, but we also confirm their values computationally by simulation.
Theorem 3.
Consider an i.i.d. sample $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, where $x_i$ is a real-valued predictor variable and $y_i \in \{0, 1\}$ is the corresponding binary response. First consider fitting the probit model $\Pr[y_i = 1 \mid x_i] = \Phi(\beta x_i)$ to the data, and let $\hat\beta_{\mathrm{probit}}$ denote the corresponding estimate of $\beta$. Then consider fitting the logit model $\Pr[y_i = 1 \mid x_i] = e^{\beta x_i}/(1 + e^{\beta x_i})$ to the data, and let $\hat\beta_{\mathrm{logit}}$ denote the corresponding estimate of $\beta$. Then,
$$\hat\beta_{\mathrm{probit}} \approx \frac{\sqrt{2\pi}}{4}\,\hat\beta_{\mathrm{logit}} \approx 0.6267\,\hat\beta_{\mathrm{logit}}.$$
Proof.
Given an i.i.d. sample and the model $\Pr[y_i = 1 \mid x_i] = F(\beta x_i)$, the log-likelihood for $\beta$ is given by
$$\ell(\beta) = \sum_{i=1}^{n}\left[y_i \log F(\beta x_i) + (1 - y_i)\log\{1 - F(\beta x_i)\}\right].$$
Under the logit link function, we have $F(z) = e^z/(1 + e^z)$. Now, using a Taylor series expansion around zero for the two most important parts of the log-likelihood function, we get
$$\log F(z) = -\log 2 + \frac{z}{2} - \frac{z^2}{8} + O(z^4)$$
and
$$\log\{1 - F(z)\} = -\log 2 - \frac{z}{2} - \frac{z^2}{8} + O(z^4).$$
The derivative of the approximate log-likelihood function for the logit model is then given by
$$\ell'(\beta) \approx \frac{1}{2}\sum_{i=1}^{n}(2y_i - 1)x_i - \frac{\beta}{4}\sum_{i=1}^{n}x_i^2,$$
upon ignoring the higher degree terms in the expansion. It is straightforward to see that solving $\ell'(\beta) = 0$ for $\beta$ yields
$$\hat\beta_{\mathrm{logit}} \approx \frac{2\sum_{i=1}^{n}(2y_i - 1)x_i}{\sum_{i=1}^{n}x_i^2}.$$
If we now consider the probit link function, we have $F(z) = \Phi(z)$. Using a derivation similar to the one performed earlier, and ignoring higher order terms, we get
$$\log \Phi(z) \approx -\log 2 + az - bz^2 \qquad \text{and} \qquad \log\{1 - \Phi(z)\} \approx -\log 2 - az - bz^2,$$
where $a = \sqrt{2/\pi}$ and $b = 1/\pi$. This leads to
$$\hat\beta_{\mathrm{probit}} \approx \frac{a}{2b}\,\frac{\sum_{i=1}^{n}(2y_i - 1)x_i}{\sum_{i=1}^{n}x_i^2} = \sqrt{\frac{\pi}{2}}\,\frac{\sum_{i=1}^{n}(2y_i - 1)x_i}{\sum_{i=1}^{n}x_i^2}.$$
It is then straightforward to see that
$$\hat\beta_{\mathrm{probit}} \approx \frac{\sqrt{2\pi}}{4}\,\hat\beta_{\mathrm{logit}},$$
or equivalently
$$\hat\beta_{\mathrm{logit}} \approx \frac{4}{\sqrt{2\pi}}\,\hat\beta_{\mathrm{probit}} \approx 1.5958\,\hat\beta_{\mathrm{probit}}.$$
∎
It must be emphasized that the above likelihood-based theoretical verifications depend on Taylor series approximations of the likelihood, and the factors of proportionality are therefore bound to be inexact. It is reassuring, however, to see that our computational verification does confirm the results found by theoretical derivation.
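The closed-form approximate estimators appearing in the proof of Theorem 3 are easy to evaluate directly on data. The sketch below (illustrative Python names; data simulated with a small true slope so that the expansions around zero are reasonable) computes both approximations; their ratio equals $\sqrt{2\pi}/4$ exactly by construction, the point being that both closed forms are computable from the same two sufficient statistics:

```python
import math
import random

random.seed(7)
x = [random.gauss(0, 0.5) for _ in range(5000)]
y = [int(random.random() < 1 / (1 + math.exp(-0.4 * xi))) for xi in x]

# the two statistics appearing in the Taylor-approximate estimators
s_xy = sum((2 * yi - 1) * xi for xi, yi in zip(x, y))
s_xx = sum(xi * xi for xi in x)

b_logit = 2 * s_xy / s_xx                        # approximate logit MLE
b_probit = math.sqrt(math.pi / 2) * s_xy / s_xx  # approximate probit MLE

print(b_probit / b_logit, math.sqrt(2 * math.pi) / 4)
```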
III Similarities and Differences beyond Logit and Probit
Other aspects of our work reveal that the similarities proved and demonstrated above between the probit and the logit link functions extend predictively to the other link functions mentioned above. As far as structural equivalence or the lack thereof is concerned, Appendix A contains similar derivations for the relationship between cauchit and logit, and for the relationship between compit and logit. As far as predictive equivalence is concerned, we now present a verification based on the computation of many replications of the test error.
III.1 Computational Verification of Predictive Equivalence
We now computationally compare the predictive merits of each of the four link functions considered so far. To this end, we compare the estimated average test errors yielded by the four link functions. We do so by running $R$ replications of the split of the data set into training and test sets, and at each iteration we compute the corresponding test error for the classifier corresponding to each link function. For one iteration/replication, for instance, $\mathrm{TE}_{\mathrm{probit}}$, $\mathrm{TE}_{\mathrm{compit}}$, $\mathrm{TE}_{\mathrm{cauchit}}$, and $\mathrm{TE}_{\mathrm{logit}}$ are the values of the test error generated by probit, compit, cauchit, and logit respectively. After $R$ replications, we have $R$ random realizations of each of those four test errors. We then compute various statistics on the replications (median, mean, standard deviation, kurtosis, skewness, IQR, etc.) to assess the similarities and the differences among the link functions. We perform similar replications for model comparison using both AIC and BIC.

Example 4: Verification of Predictive Equivalence on Artificial Data: the $x_i$ are equally spaced points in an interval, and the $y_i$ are drawn from a Cauchit binary regression model with fixed intercept and slope, namely $y_i \sim \mathrm{Bernoulli}(\pi(x_i))$, where
$$\pi(x_i) = \frac{1}{2} + \frac{1}{\pi}\arctan(\beta_0 + \beta_1 x_i).$$
Table (5) shows summary statistics over the $R$ replications of the test error. These results suggest that the four link functions are almost indistinguishable, as the estimated statistics are nearly equal across the links.
probit  compit  cauchit  logit  

median  0.16  0.16  0.16  0.16 
mean  0.16  0.16  0.16  0.16 
sd  0.04  0.04  0.03  0.04 
skewness  0.21  0.26  0.26  0.24 
kurtosis  3.18  3.51  3.20  3.20 
cv  22.56  22.46  22.25  22.57 
IQR  0.05  0.04  0.05  0.05 
min  0.06  0.04  0.06  0.06 
max  0.30  0.32  0.31  0.30 
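A minimal end-to-end sketch of this kind of comparison on artificial data follows (hypothetical helper names; slopes are fitted by a crude grid-search MLE rather than the iteratively reweighted fits of standard software, and the data here are generated from a logit rather than a Cauchit model):

```python
import math
import random
from statistics import NormalDist

cdfs = {
    "probit":  NormalDist().cdf,
    "compit":  lambda z: 1 - math.exp(-math.exp(min(z, 30.0))),
    "cauchit": lambda z: 0.5 + math.atan(z) / math.pi,
    "logit":   lambda z: 1 / (1 + math.exp(-z)),
}

def loglik(b, x, y, F):
    eps = 1e-12
    return sum(math.log(max(F(b * xi), eps)) if yi else math.log(max(1 - F(b * xi), eps))
               for xi, yi in zip(x, y))

def fit_grid(x, y, F):
    # crude but robust slope MLE: maximize the likelihood over a grid of slopes
    return max((k / 20 for k in range(-50, 51)), key=lambda b: loglik(b, x, y, F))

random.seed(11)
errors = {name: [] for name in cdfs}
for _ in range(15):                              # replications of the split
    x = [random.uniform(-3, 3) for _ in range(200)]
    y = [int(random.random() < 1 / (1 + math.exp(-1.2 * xi))) for xi in x]
    xtr, ytr, xte, yte = x[:140], y[:140], x[140:], y[140:]
    for name, F in cdfs.items():
        b = fit_grid(xtr, ytr, F)
        te = sum(int(F(b * xi) > 0.5) != yi for xi, yi in zip(xte, yte)) / len(xte)
        errors[name].append(te)

means = {name: sum(v) / len(v) for name, v in errors.items()}
print(means)   # the four average test errors come out nearly identical
```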
Example 5: Verification of Predictive Equivalence on the Pima Indian Diabetes Dataset:
We once again consider the famous Pima Indian Diabetes dataset, arguably one of the most used benchmark data sets in the statistics and pattern recognition communities. As can be seen in Table (6), there is virtually no difference between the models. In other words, on the Pima Indian Diabetes data set, the four link functions are predictively equivalent.

probit  compit  cauchit  logit

median  0.25  0.24  0.25  0.25 
mean  0.25  0.25  0.26  0.25 
sd  0.04  0.04  0.05  0.04 
skewness  0.06  0.07  0.06  0.06 
kurtosis  2.92  2.95  2.95  2.92 
cv  17.84  18.33  17.62  17.85 
IQR  0.06  0.06  0.07  0.06 
min  0.09  0.07  0.10  0.09 
max  0.43  0.40  0.45  0.42 
It is also noteworthy that all four models yield similar goodness-of-fit measures when scored using AIC and BIC. Indeed, Figure (5) reveals that over the $R$ replications of the split of the data into training and test sets, both the AIC and the BIC are distributionally similar across all four link functions.
Despite the slight difference shown by the Cauchit model, it is fair to say that all the link functions are equivalent in terms of goodness of fit. Once again, this is yet more evidence to support and reinforce Feller (1971)'s claim that all these link functions are equivalent in terms of goodness of fit, and that the overglorification of the logit model is at best misguided, if not unfounded.
III.2 Evidence of Differences in High Dimensional Spaces
Simulated evidence: We generate $n$ values of the linear predictor $\eta$ in an interval around zero. For each link function, we compute the sign of $F(\eta_i) - \tfrac{1}{2}$ for $i = 1, \ldots, n$. We then generate a table containing the percentage of times the signs differ between each pair of links.
probit  compit  cauchit  logit  

probit  0.000  0.004  0.000  0.000 
compit  0.004  0.000  0.004  0.004 
cauchit  0.000  0.004  0.000  0.000 
logit  0.000  0.000  0.000  0.000 
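A sketch of this sign comparison follows (Python, hypothetical names; the disagreement percentages depend on the distribution of the linear predictor, so the values here differ from those in the table above). Compit is the only one of the four cdfs with $F(0) \neq 1/2$, so it alone can disagree with the others, and it does so exactly on the band $\log(\log 2) < \eta < 0$:

```python
import math
import random
from statistics import NormalDist

cdfs = {
    "probit":  NormalDist().cdf,
    "compit":  lambda z: 1 - math.exp(-math.exp(z)),
    "cauchit": lambda z: 0.5 + math.atan(z) / math.pi,
    "logit":   lambda z: 1 / (1 + math.exp(-z)),
}

random.seed(3)
etas = [random.uniform(-5, 5) for _ in range(10000)]   # linear predictors eta
pred = {name: [int(F(e) > 0.5) for e in etas] for name, F in cdfs.items()}

# fraction of points on which each pair of links predicts differently
dis = {(a, b): sum(pa != pb for pa, pb in zip(pred[a], pred[b])) / len(etas)
       for a in cdfs for b in cdfs}

# the three symmetric links (probit, cauchit, logit) always agree;
# compit flips predictions only on the thin band log(log 2) < eta < 0
print(dis[("probit", "logit")], dis[("compit", "logit")])
```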
Computational Demonstrations on the Email Spam Data:
Unlike all the other data sets encountered thus far, the email spam data set is fairly high dimensional, with a substantially larger number of both variables and observations.

probit  compit  cauchit  logit

median  0.08  0.13  0.07  0.07 
mean  0.10  0.13  0.07  0.08 
sd  0.03  0.03  0.04  0.01 
skewness  1.75  0.88  9.16  4.86 
kurtosis  9.29  8.56  103.95  40.58 
cv  34.41  20.61  51.15  14.32 
IQR  0.04  0.04  0.01  0.01 
min  0.06  0.07  0.05  0.06 
max  0.41  0.38  0.62  0.18 
Clearly, the results depicted in Table (8) reveal some drastic differences in performance among the four link functions on this rather high dimensional data. The boxplots below reinforce these findings as they show that in terms of goodness of fit measured through AIC and BIC, the compit model deviates substantially from the other models.
IV Conclusion and discussion
Throughout this paper, we have explored both conceptually/methodologically and computationally the similarities among four of the most commonly used link functions in binary regression. We have theoretically shed some light on some of the structural reasons that explain the indistinguishability in performance, in the univariate setting, among the four link functions considered. Although Section II concentrated mainly on the equivalence of logit and probit, the Appendix provides a similar derivation for both the cauchit and the complementary log-log link functions. We have also demonstrated by computational simulations that the four link functions are essentially equivalent, both structurally and predictively, in the univariate setting and in low dimensional spaces. Our last example showed computationally that the four link functions may differ quite substantially when the dimension of the input space becomes extremely large. We notice specifically that the performance in high dimensional spaces tends to depend on the internal structure of the input: completely orthogonal designs tend to bode well with all the perfectly symmetric link functions, while nonorthogonal designs deliver their best performances under the complementary log-log. Finally, the sparseness of the input space tends to dictate the choice of the most appropriate link function, with cauchit tending to be the model of choice under a high level of sparseness. In our future work, we intend to provide as complete a theoretical characterization as possible in extremely high dimensional spaces, namely providing the conditions under which each of the link functions will yield the best fit for the data.
References
 Armagan and Zaretzki (2011) Armagan, A. and R. Zaretzki (2011). A note on meanfield variational approximations in bayesian probit models. Computational Statistics and Data Analysis 55, 641–643.
 Basu and Mukhopadhyay (2000) Basu, S. and S. Mukhopadhyay (2000). Bayesian analysis of binary regression using symmetric and asymmetric links. Sankhya: The Indian Journal of Statistics 62(3), 372–387.
 Chakraborty (2009) Chakraborty, S. (2009). Bayesian binary kernel probit model for microarray based cancer classification and gene selection. Computational Statistics and Data Analysis 53, 4198–4209.
 Chambers and Cox (1967) Chambers, E. and D. Cox (1967). Discrimination between alternative binary response models. Biometrika 54(3/4), 573–578.
 Csató et al. (2000) Csató, L., E. Fokoué, M. Opper, B. Schottky, and O. Winther (2000). Efficient approaches to gaussian process classification. In S. A. Solla, T. K. Leen, and e. K.R. Müller (Eds.), Advances in Neural Information Processing Systems, Number 12. MIT Press.
 Feller (1940) Feller, W. (1940). On the logistic law of growth and its empirical verification in biology. Acta Biotheoretica 5, 51–66.

Feller (1971) Feller, W. (1971). An Introduction to Probability Theory and Its Applications (Second ed.), Volume II. New York: John Wiley and Sons.
 Lin and Hu (2008) Lin, G. D. and C. Y. Hu (2008). On characterizations of the logistic distribution. Journal of Statistical Planning and Inference 138, 1147–1156.
 Nadarajah (2004) Nadarajah, S. (2004). Information matrix for logistic distributions. Mathematical and Computer Modelling 40, 953–958.
 Nassar and Elmasry (2012) Nassar, M. M. and A. Elmasry (2012). A study of generalized logistic distributions. Journal of the Egyptian Mathematical Society 20, 126–133.
 Schumacher et al. (1996) Schumacher, M., R. Robner, and W. Vach (1996). Neural networks and logistic regression: Part I. Computational Statistics and Data Analysis 21, 661–682.
 Tamura and Giampaoli (2013) Tamura, K. A. and V. Giampaoli (2013). New prediction method for the mixed logistic model applied in a marketing problem. Computational Statistics and Data Analysis 66, 202–216.
 van den Hout et al. (2007) van den Hout, A., P. van der Heijden, and R. Gilchrist (2007). The Logistic Regression Model with Response Variables Subject to Randomized Response. Computational Statistics and Data Analysis 51, 6060–6069.
 Zelterman (1989) Zelterman, D. (1989). Order statistics for the generalized logistic distribution. Computational Statistics and Data Analysis 7, 69–77.
V Appendix A
Theorem 4.
Consider an i.i.d. sample $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, where $x_i$ is a real-valued predictor variable and $y_i \in \{0, 1\}$ is the corresponding binary response. First consider fitting the cauchit model $\Pr[y_i = 1 \mid x_i] = \tfrac{1}{2} + \tfrac{1}{\pi}\arctan(\beta x_i)$ to the data, and let $\hat\beta_{\mathrm{cauchit}}$ denote the corresponding estimate of $\beta$. Then consider fitting the logit model $\Pr[y_i = 1 \mid x_i] = e^{\beta x_i}/(1 + e^{\beta x_i})$ to the data, and let $\hat\beta_{\mathrm{logit}}$ denote the corresponding estimate of $\beta$. Then,
$$\hat\beta_{\mathrm{cauchit}} \approx \frac{\pi}{4}\,\hat\beta_{\mathrm{logit}}.$$
Proof.
Given an i.i.d. sample and the model $\Pr[y_i = 1 \mid x_i] = F(\beta x_i)$, the log-likelihood for $\beta$ is given by
$$\ell(\beta) = \sum_{i=1}^{n}\left[y_i \log F(\beta x_i) + (1 - y_i)\log\{1 - F(\beta x_i)\}\right]. \qquad (5)$$
For the cauchit, for instance, $F(z) = \tfrac{1}{2} + \tfrac{1}{\pi}\arctan(z)$. We use the Taylor series expansion around zero for both $\log F(z)$ and $\log\{1 - F(z)\}$:
$$\log F(z) = -\log 2 + \frac{2}{\pi}z - \frac{2}{\pi^2}z^2 + \cdots$$
and
$$\log\{1 - F(z)\} = -\log 2 - \frac{2}{\pi}z - \frac{2}{\pi^2}z^2 + \cdots.$$
A first order approximation of the derivative of the log-likelihood with respect to $\beta$ is
$$\ell'(\beta) \approx \frac{2}{\pi}\sum_{i=1}^{n}(2y_i - 1)x_i - \frac{4\beta}{\pi^2}\sum_{i=1}^{n}x_i^2.$$
Solving $\ell'(\beta) = 0$ yields
$$\hat\beta_{\mathrm{cauchit}} \approx \frac{\pi}{2}\,\frac{\sum_{i=1}^{n}(2y_i - 1)x_i}{\sum_{i=1}^{n}x_i^2},$$
which simplifies to
$$\hat\beta_{\mathrm{cauchit}} \approx \frac{\pi}{4}\,\hat\beta_{\mathrm{logit}}.$$
∎