Abstract
Selecting an appropriate link function for binary regression remains an important issue in data analysis because of its influence on the related inference. We prescribe a new data-driven methodology to search for such a link function, considering some popular classification assessment metrics. A case study with the World Happiness Report 2018, with special reference to immigration, is presented to demonstrate the utility of the prescribed routine.
Keywords: Binary regression, Link function, Latent stress-strength, World Happiness Report, Cross-validation.
2010 MSC: 62-07, 62J12, 62P25.
1 Introduction
“Happiness is the joy that we feel when we’re striving after our potential.”
 Ancient Greek World
In 1979, at Bombay Airport, the King of Bhutan, Jigme Singye Wangchuck, replied to a query from an Indian journalist: “We do not believe in Gross National Product because Gross National Happiness is more important”. This was the beginning of a great philosophy, as reported by Tashi Dorji in his famous 2012 article “The story of a king, a poor country and a rich idea” (see Dorji, 2012). The UN began to follow this philosophy and, as a result, World Happiness Reports have been published since 2012; the sixth, the World Happiness Report 2018, is based on Gallup World Poll (GWP) data for 2005-2017. The World Happiness Index measures the happiness level of 156 countries and ranks them accordingly. For 117 of these 156 countries, the report also includes a happiness ranking on the basis of immigration (see Helliwell et al., 2018). The report uses the Cantril ladder (an imaginary ladder with steps numbered 0 to 10 from bottom to top, higher steps indicating higher levels of happiness) to measure the level of happiness. For a better understanding of well-being, the following parameters are also reported country-wise: Life Ladder, Log GDP per capita, Social support, Healthy life expectancy at birth, Freedom to make life choices, Generosity, Perceptions of corruption, Positive affect (the average of previous-day affect measures for happiness, laughter and enjoyment), Negative affect (the average of previous-day affect measures for worry, sadness and anger), Confidence in national government, Democratic Quality and Delivery Quality.
Nowadays the glocal world is experiencing a wave of heavy migration. People migrate to different countries for different reasons, according to their perceived level of aspiration. The primary question when deciding to migrate to a new country, with a new environment, society, culture and habits and surrounded by unknown people, is: what is the best choice of destination? How can one decide which destination suits him or her best?
From the statistical perspective of this study, the happiness score is the response variable of interest and the various parameters mentioned above are covariates. Chapter 2 of the World Happiness Report (2018) focuses on international migration. In the statistical appendices of this chapter, pooled ordinary least squares regression is performed to assess the impact of each covariate on the response. It is to be noted that the response variable here is a score on an ordinal scale and thus does not comply with the usual assumptions of ordinary least squares. A technically more correct approach would be to classify each country as happy or unhappy based on the score and perform binary regression for the same purpose as in the statistical appendices mentioned above, although this results in a marginal loss of information in the response variable. That said, categorization of an ordinal response is one way out for drawing statistically correct inference. The question remains: why two categories? The choice of the number of categories is not rigid, but natural intuition leads one to the dichotomy of happiness and unhappiness. Extension to more than two categories may make the physical interpretation clumsy. For performing binary regression, many link functions are available in the literature, both symmetric (e.g. the probit link) and asymmetric (e.g. the complementary log-log link), but only some are popular. For details on binary regression and the use of link functions, see Cox (2018) and Agresti and Kateri (2011). The choice of a proper link function is important, as misspecification might have an adverse effect on inference and prediction. Some notable attempts at making a statistical choice of the appropriate link function can be found in Czado and Santner (1992), Huettmann and Linke (2003) and Li (2014). In the current work we consider a cross-validation based approach along with a number of important assessment indices to obtain a data-dependent choice of link function.
Cross-validation based approaches are taken up in this work in view of the prediction purpose of binary regression modelling. We analyze multiple datasets from the same context to find the appropriate choice of link function, and refrain from reporting the significance of individual covariates, which can easily be done once a suitable link is established. We hope that the best-fitted model can improve the choice of destination country and increase the immigrant's confidence in that choice, which will be effective for living a good life with increasing potential for mankind. The motivation behind working with this dataset is as follows:

It is reliable, being collected by a proven survey group.

With a moderate to large number of observations, cross-validation, and thus the prescribed routine, performs well. Demonstrating the routine on this dataset is therefore valid.

“Happiness”, in its true sense, is a determinant of immigration.

The happiness index is the most talked-about indicator in recent times.
In the next section we formulate the statistical problem associated with the dataset. In section 3, we briefly discuss the different assessment metrics used in this paper. Section 4 discusses the cross-validation schemes used. Section 5 presents a note on the survey concerned and the nature of the covariates present in the study. In section 6, the construction of the working datasets and the findings from the numerical results are given. We finish with a short discussion on the relevance and scope of the study. Relevant tables are provided in the appendix.
2 Formulation of the problem
Suppose data on $n$ different countries are available, where the response variable ($Y_i$) for the $i$th country along with the covariates ($\mathbf{x}_i$) is given for $i = 1, \ldots, n$. As mentioned in section 1, the response is ordinal and it is a well-known fact that such responses cannot be modelled efficiently with the usual linear and generalized linear models (see Crichton and Hinde, 1992). Thus, with some fixed threshold $c$, we consider
$$Z_i = \begin{cases} 1, & \text{if } Y_i \geq c, \\ 0, & \text{otherwise,} \end{cases}$$
which is a reflection of two contradictory latent forces, viz. $S_{1i}$ and $S_{2i}$, as follows:
$$Z_i = \begin{cases} 1, & \text{if } S_{1i} > S_{2i}, \\ 0, & \text{otherwise.} \end{cases}$$
If we take $Z_i = 1$ as success, a natural interpretation for $S_{1i}$ would be the positive force (strength) and for $S_{2i}$ the negative force (stress). Latency of $S_{1i}$ and $S_{2i}$ prevents one from directly modelling $Z_i$. The set of covariates $\mathbf{x}_i$ is naturally partitioned into $\mathbf{x}_{1i}$ and $\mathbf{x}_{2i}$, the former influencing $S_{1i}$ and the latter $S_{2i}$.
Our interest here is to explore and investigate the possibilities for modelling the unobserved $S_{1i}$ and $S_{2i}$. One can assume both to follow independent normal, logistic, Cauchy or extreme value distributions, among others. The problem of choosing an appropriate model is equivalent to that of choosing a link function for binary regression, where $P(Z_i = 1)$ is modelled as:
$$p_i = P(Z_i = 1 \mid \mathbf{x}_i) = F(\mathbf{x}_i^{\top} \boldsymbol{\beta}),$$
where $\boldsymbol{\beta}$ is the vector of parameters and $g = F^{-1}$ is a link function induced by the assumed common latent stress-strength distribution $F$. In this study, we consider the following four link functions:
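Probit link: $g(p) = \Phi^{-1}(p)$ for a success probability $p$, where $\Phi$ is the cdf of the standard normal distribution.

Logit link: $g(p) = \log\{p/(1-p)\}$, the inverse of the cdf of the standard logistic distribution.

Cauchit link: $g(p) = \tan\{\pi(p - 1/2)\}$, the inverse of the cdf of the standard Cauchy distribution.

Complementary log-log link: $g(p) = \log\{-\log(1-p)\}$, the inverse of the cdf of the standard extreme value distribution.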
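To make the four choices concrete, the following minimal sketch evaluates the inverse of each link, i.e. the cdf that maps a linear predictor to a success probability. It uses only the Python standard library; the function and dictionary names are illustrative and are not taken from the original study.

```python
import math

def probit_cdf(eta):
    """Standard normal cdf, written with the error function."""
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))

def logistic_cdf(eta):
    """Standard logistic cdf, evaluated in a numerically stable way."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)

def cauchy_cdf(eta):
    """Standard Cauchy cdf."""
    return 0.5 + math.atan(eta) / math.pi

def cloglog_cdf(eta):
    """Standard (minimum) extreme value cdf, inverse of the complementary log-log link."""
    return 1.0 - math.exp(-math.exp(eta))

# Hypothetical lookup table: link name -> inverse link (cdf).
INVERSE_LINKS = {
    "probit": probit_cdf,
    "logit": logistic_cdf,
    "cauchit": cauchy_cdf,
    "cloglog": cloglog_cdf,
}

def success_probabilities(eta):
    """Success probability implied by each link for a given linear predictor."""
    return {name: cdf(eta) for name, cdf in INVERSE_LINKS.items()}
```

At a linear predictor of zero the three symmetric links all give probability 0.5, whereas the complementary log-log link gives $1 - e^{-1} \approx 0.632$, which illustrates its asymmetry.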
For observed $Z_1, \ldots, Z_n$, the likelihood function involves the chosen structure of the latent distribution $F$, and the maximum likelihood estimator of the parameter vector $\boldsymbol{\beta}$ is obtained through iteratively reweighted least squares. For details on the connection of latent variable modelling with link functions, see Cox (2018), Albert and Chib (1993) and Banerjee and Biswas (2003). Here we apply some known methods for assessing the suitability of link functions and simultaneously use cross-validation to achieve the desired level of predictive performance for the models.
3 Different measures of assessment
In order to assess the performance of the link functions tried in our case study, we employ well-known assessment measures available in the literature (see Tharwat, 2018). In binary classification problems, one of the two classes (usually +ve and -ve) is predicted from a new set of covariates. A positive (negative) sample point classified as positive (negative) is referred to as a True Positive (True Negative) classification, whereas a positive sample point classified as negative is called a False Negative or Type II error, and a negative sample point classified as positive is called a False Positive or Type I error. The corresponding confusion matrix is shown below:
| Predicted Class | True or Actual Class: Positive | True or Actual Class: Negative |
|---|---|---|
| True | True Positive (TP) | False Positive (FP) |
| False | False Negative (FN) | True Negative (TN) |
Based on the confusion matrix, we consider the following metrics:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Sensitivity} = \frac{TP}{TP + FN}, \qquad \text{Specificity} = \frac{TN}{TN + FP}.$$
Accuracy is the simplest and most commonly used measure, and it is sensitive to imbalanced data. In contrast, Sensitivity and Specificity are not sensitive to imbalanced data. When misclassifying a true positive is a more serious error than misclassifying a true negative, we should decide upon Sensitivity; for the opposite scenario, Specificity should be the metric to note. Along with these three, we also consider another popular assessment metric, the area under the ROC curve (AUC).
This measure is based on the receiver operating characteristic (ROC) curve and, being a scalar rather than a function, overcomes the difficulty of comparing different classifiers directly through their ROC curves.
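As an illustration of how these quantities can be computed, the sketch below derives the confusion-matrix counts from 0/1 labels and evaluates the metrics. The AUC is computed through its known equivalence with the Mann-Whitney statistic (the proportion of positive-negative pairs in which the positive case receives the higher score, ties counting one half), which is a swapped-in shortcut rather than an explicit integration of the ROC curve. The labels, scores, 0.5 cut-off and function names are made up for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for 0/1 labels and 0/1 predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

def sensitivity(tp, fp, fn, tn):
    # Undefined (division by zero) when the test set has no positive case.
    return tp / (tp + fn)

def specificity(tp, fp, fn, tn):
    # Undefined (division by zero) when the test set has no negative case.
    return tn / (tn + fp)

def auc(y_true, scores):
    """Area under the ROC curve via the Mann-Whitney pair-counting formula."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up example: true classes and predicted probabilities of being positive.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.1, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 cut-off
counts = confusion_counts(y_true, y_pred)        # (3, 1, 1, 3)
```

For this toy example the accuracy, sensitivity and specificity are all 0.75, while the AUC is 15/16 = 0.9375, since only one of the sixteen positive-negative pairs is ordered incorrectly.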
4 On cross-validation methods
Since the main reason for modelling here is to predict the probability of being happy, the proposed models should be subjected to rigorous cross-validation, in addition to the assessments discussed in the last section. In this study, we implement three useful cross-validation routines (leave-one-out, leave-p-out and k-fold), briefly discussed below. For details on various cross-validation approaches and their relative performance, see section 5.1 of James et al. (2013).

Leave-p-out CV: This method uses $p$ out of the $n$ observations as the validation set, while the remaining $n - p$ observations are taken as the training set. This exercise is repeated over all possible ways of partitioning a sample of size $n$ into two sets, one with $p$ and the other with $n - p$ elements. Obviously, with large $n$ and even moderate $p$, the number of validation sets, $\binom{n}{p}$, may explode. For $p = 1$, this method reduces to leave-one-out cross-validation (LOOCV). In the numerical study, we apply LOOCV and LPOCV with two different choices of $p$: Hold 50% Leave 50% and Hold 75% Leave 25%.

k-fold CV: One way to avoid an exhaustive CV method such as the above is to apply k-fold CV, where the sample is randomly partitioned into $k$ subsamples of the same size. Out of these $k$ subsamples, one is taken as the validation set and the rest are used for training. This process is repeated such that each of the $k$ subsamples is taken as the validation set exactly once. In the numerical study, we perform this with $k = 5$ and $k = 10$.
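The two schemes can be sketched as follows, using only the Python standard library. The helper names and the fixed seed are assumptions made for illustration; when $k$ does not divide the sample size, the folds produced here differ in size by at most one.

```python
import math
import random

def n_leave_p_out_splits(n, p):
    """Number of distinct validation sets in exhaustive leave-p-out CV."""
    return math.comb(n, p)

def k_fold_splits(n, k, seed=0):
    """Randomly partition indices 0..n-1 into k folds.

    Returns a list of (training_indices, validation_indices) pairs,
    with each fold serving as the validation set exactly once.
    """
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i, validation in enumerate(folds):
        training = [j for m, fold in enumerate(folds) if m != i for j in fold]
        splits.append((training, validation))
    return splits

# Leave-one-out on 20 observations needs 20 fits, whereas leaving out
# half of them exhaustively would need 184756 fits.
loocv_fits = n_leave_p_out_splits(20, 1)
half_fits = n_leave_p_out_splits(20, 10)
```

The rapid growth of `math.comb(n, p)` is the reason k-fold CV, which needs only $k$ model fits, is attractive for moderate and large samples.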
5 A note on GWP survey
The Gallup World Poll (GWP) has conducted surveys in over 160 countries since 2005, with a sample of 1000 adults per country (for large countries the sample size can be 2000), semiannually, annually or biennially. The survey puts almost 100 questions in a similar manner to people in different regions of the world, either by telephone (generally in developed countries), taking almost 30 minutes, or by direct interview (generally in developing countries), taking almost 1 hour. The World Happiness Report (WHR) 2018 used these GWP data for developing the happiness index and modelling it with the covariates given and described in Table 5. In accordance with the formulation given in section 2, we identify the factors which contribute positively towards the happiness of migrants as covariates for the latent positive force (strength) and the remaining factors as covariates for the latent negative force (stress); this classification is given in column 3 of Table 5.
6 Numerical study and findings
We consider four different but related datasets for demonstrating the method of choosing the suitable link function or, equivalently, the latent stress-strength model. As mentioned in section 2, datasets reporting the happiness score with related covariates are available for different years. Here, we consider the datasets for 2015, 2016 and 2017. For stability, we synthetically construct another dataset with the current year's happiness score along with the covariate values averaged over the years 2005 to 2017. The 2017 dataset contains information on 10 covariates (excluding positive affect and negative affect), while the other datasets contain all 12 covariates. The working datasets (https://worldhappiness.report/ed/2018/) have been used for ranking the countries according to the migrants' satisfaction level using the happiness score. As indicated in section 1 and section 2, we categorize a country as a good choice for migration from the “happiness” perspective if its score is greater than or equal to 6. The remaining countries fall into the other category. Thus, the transformed binary response variable $Z_i$ for the $i$th country with happiness score $Y_i$ is as follows:
$$Z_i = \begin{cases} 1, & \text{if } Y_i \geq 6, \\ 0, & \text{otherwise.} \end{cases}$$
Obviously, the transformed variable $Z_i$ is the response in the current study. Our main interest is to model $Z_i$ with the available covariates using different link functions and to find the most suitable one.
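The dichotomization described above amounts to a single comparison per country. A minimal sketch, in which the function name and the example scores are hypothetical and only the threshold of 6 comes from the text:

```python
def dichotomize(scores, threshold=6.0):
    """Map happiness scores to 1 (score >= threshold) or 0 (otherwise)."""
    return [1 if s >= threshold else 0 for s in scores]

# Made-up scores for four countries; a score of exactly 6 counts as "happy".
example_scores = [7.6, 5.9, 6.0, 4.2]
example_response = dichotomize(example_scores)  # [1, 0, 1, 0]
```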
For each of the four datasets discussed above, we compute the assessment metrics Accuracy, Sensitivity, Specificity and AUC with the cross-validation routines mentioned in section 4. The numerical results for the years 2017, 2016 and 2015 and for the aggregate data are given in Table 1, Table 2, Table 3 and Table 4, respectively. For LOOCV, the metrics other than Accuracy cannot be calculated, as the single test sample will be either positive or negative.
With respect to each assessment metric, we identify the best-performing link function as the one with the maximum number of rank 1 positions over all cross-validation routines. A tie, if any, is resolved by going to the next stage and comparing the number of positions at the next rank, and so on. Using this scheme, we arrive at the following conclusions:
With respect to Accuracy:

For 2017, logit performs best followed by probit.

For 2016, cauchit performs best followed by complementary log-log.

For 2015, cauchit performs best followed by complementary log-log.

For aggregate data, complementary log-log performs best followed by probit.
With respect to Sensitivity:

For 2017, probit performs best followed by logit.

For 2016, cauchit performs best followed by logit and probit.

For 2015, cauchit performs best followed by logit.

For aggregate data, logit performs best followed by probit.
With respect to Specificity:

For 2017, complementary log-log performs best followed by cauchit.

For 2016, cauchit and complementary log-log both perform equally well.

For 2015, complementary log-log performs best followed by cauchit.

For aggregate data, cauchit performs best followed by probit.
With respect to AUC:

For 2017, cauchit performs best followed by logit.

For 2016, probit performs best followed by logit.

For 2015, cauchit and complementary log-log perform equally well.

For aggregate data, probit performs best followed by logit.
Overall, it is observed that the cauchit link function is the best or second best in half of the cases, followed by the complementary log-log link. Practitioners of binary regression modelling should therefore make an effort to search for the best link from a set of available link functions for better inference and prediction.
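The rank-counting rule used in this section to order the link functions for a given metric can be sketched as follows. For each cross-validation routine the links are ranked by metric value (rank 1 for the largest), the number of times each link attains each rank is counted, and links are ordered by the number of first places, with ties broken by second places and so on. The function names are illustrative, and the handling of equal values within a routine (they share the better rank) is an assumption; the values are those reported in the appendix for the second efficiency measure on the 2017 data.

```python
def competition_ranks(values):
    """Rank links within one CV routine: 1 for the largest value.

    Equal values share the better rank (an assumption of this sketch).
    """
    return {name: 1 + sum(1 for other in values.values() if other > v)
            for name, v in values.items()}

def order_links(table):
    """Order links from best to worst.

    table maps each CV routine to a dict {link: metric value}.
    Links are compared by their count of rank 1 positions, then rank 2, etc.
    """
    links = list(next(iter(table.values())))
    profile = {link: [0] * len(links) for link in links}
    for values in table.values():
        for link, rank in competition_ranks(values).items():
            profile[link][rank - 1] += 1
    return sorted(links, key=lambda link: profile[link], reverse=True)

# Second efficiency measure, 2017 data (LOOCV is not available for it).
second_metric_2017 = {
    "LPOCV 50-50": {"probit": 0.84482, "logit": 0.84403,
                    "cauchit": 0.83985, "cloglog": 0.79121},
    "LPOCV 75-25": {"probit": 0.87655, "logit": 0.87454,
                    "cauchit": 0.85686, "cloglog": 0.81494},
    "5-fold": {"probit": 0.87507, "logit": 0.90728,
               "cauchit": 0.89931, "cloglog": 0.83459},
    "10-fold": {"probit": 0.87395, "logit": 0.74789,
                "cauchit": 0.83673, "cloglog": 0.92941},
}
ordering = order_links(second_metric_2017)
```

Here probit takes two first places and logit one first and two second places, so the ordering starts with probit followed by logit, in agreement with the conclusion reported for this metric in 2017.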
7 Discussion
It is true that there is a tendency among analysts to opt for the logit link function when dealing with binary response modelling, despite the fact that the distributional assumptions underlying such a choice of link function may often not hold. This has the potential to generate statistically incorrect findings, and the consequences may be costly in some domains of research. The findings of our investigation confirm the issue and highlight how different link functions come to the fore, surpassing the established myths, with data from the same context and with respect to different periods and assessment metrics. The data-driven methodology for finding the best link function presented in this short case study aims to provide a meaningful way to address the issue.
References
Agresti, A., & Kateri, M. (2011). Categorical data analysis. Springer Berlin Heidelberg.
Albert, J. H., & Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88(422), 669-679.
Banerjee, T., & Biswas, A. (2003). A new formulation of stress-strength reliability in a regression setup. Journal of Statistical Planning and Inference, 112(1-2), 147-157.
Cox, D. R. (2018). Analysis of binary data. Routledge.
Crichton, N., & Hinde, J. (1992). Investigation of an ordered logistic model for consumer debt. In Advances in GLIM and Statistical Modelling (pp. 54-59). Springer, New York, NY.
Czado, C., & Santner, T. J. (1992). The effect of link misspecification on binary regression inference. Journal of Statistical Planning and Inference, 33(2), 213-231.
Dorji, T. (2012). The story of a king, a poor country and a rich idea. https://blogbhutan.wordpress.com/2012/06/11/the-story-of-a-king-a-poor-country-and-a-rich-idea/
Helliwell, J., Layard, R., & Sachs, J. (2018). World Happiness Report 2018. New York: Sustainable Development Solutions Network.
Huettmann, F., & Linke, J. (2003, May). Assessment of different link functions for modeling binary data to derive sound inferences and predictions. In International Conference on Computational Science and Its Applications (pp. 43-48). Springer, Berlin, Heidelberg.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning (Vol. 112, p. 18). New York: Springer.
Li, J. (2014). Choosing the proper link function for binary data (Doctoral dissertation). https://repositories.lib.utexas.edu/handle/2152/26363
Tharwat, A. (2018). Classification assessment methods. Applied Computing and Informatics.
Appendix
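Table 1: Assessment metrics for the 2017 dataset under different cross-validation routines.

| CV routine | Efficiency Measure | Probit | Logit | Cauchit | CLogLog |
|---|---|---|---|---|---|
| LOOCV | Accuracy | 0.92437 | 0.92437 | 0.91597 | 0.91597 |
| LOOCV | Sensitivity | - | - | - | - |
| LOOCV | Specificity | - | - | - | - |
| LOOCV | AUC | - | - | - | - |
| LPOCV 50-50 | Accuracy | 0.87237 | 0.87187 | 0.87790 | 0.87373 |
| LPOCV 50-50 | Sensitivity | 0.84482 | 0.84403 | 0.83985 | 0.79121 |
| LPOCV 50-50 | Specificity | 0.88116 | 0.88157 | 0.89471 | 0.91237 |
| LPOCV 50-50 | AUC | 0.86316 | 0.86325 | 0.86555 | 0.84927 |
| LPOCV 75-25 | Accuracy | 0.89867 | 0.89930 | 0.89880 | 0.89550 |
| LPOCV 75-25 | Sensitivity | 0.87655 | 0.87454 | 0.85686 | 0.81494 |
| LPOCV 75-25 | Specificity | 0.91414 | 0.91528 | 0.91872 | 0.93653 |
| LPOCV 75-25 | AUC | 0.89235 | 0.89235 | 0.88829 | 0.87556 |
| 5 Fold CV | Accuracy | 0.92437 | 0.88235 | 0.90756 | 0.89916 |
| 5 Fold CV | Sensitivity | 0.87507 | 0.90728 | 0.89931 | 0.83459 |
| 5 Fold CV | Specificity | 0.88805 | 0.90810 | 0.92839 | 0.95928 |
| 5 Fold CV | AUC | 0.92180 | 0.90560 | 0.88214 | 0.91859 |
| 10 Fold CV | Accuracy | 0.88225 | 0.92437 | 0.89076 | 0.91597 |
| 10 Fold CV | Sensitivity | 0.87395 | 0.74789 | 0.83673 | 0.92941 |
| 10 Fold CV | Specificity | 0.91176 | 0.93389 | 0.92577 | 0.92913 |
| 10 Fold CV | AUC | 0.88229 | 0.87057 | 0.89675 | 0.89499 |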
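
Table 2: Assessment metrics for the 2016 dataset under different cross-validation routines.

| CV routine | Efficiency Measure | Probit | Logit | Cauchit | CLogLog |
|---|---|---|---|---|---|
| LOOCV | Accuracy | 0.88618 | 0.88618 | 0.91057 | 0.89431 |
| LOOCV | Sensitivity | - | - | - | - |
| LOOCV | Specificity | - | - | - | - |
| LOOCV | AUC | - | - | - | - |
| LPOCV 50-50 | Accuracy | 0.85126 | 0.85197 | 0.86955 | 0.85765 |
| LPOCV 50-50 | Sensitivity | 0.81341 | 0.81373 | 0.81845 | 0.75867 |
| LPOCV 50-50 | Specificity | 0.87120 | 0.87292 | 0.89362 | 0.90662 |
| LPOCV 50-50 | AUC | 0.84398 | 0.84503 | 0.85686 | 0.83296 |
| LPOCV 75-25 | Accuracy | 0.87861 | 0.84800 | 0.88290 | 0.88119 |
| LPOCV 75-25 | Sensitivity | 0.84960 | 0.84794 | 0.84388 | 0.82313 |
| LPOCV 75-25 | Specificity | 0.89580 | 0.89557 | 0.90249 | 0.91086 |
| LPOCV 75-25 | AUC | 0.87036 | 0.86985 | 0.86929 | 0.86545 |
| 5 Fold CV | Accuracy | 0.88618 | 0.89431 | 0.90244 | 0.90244 |
| 5 Fold CV | Sensitivity | 0.87434 | 0.82488 | 0.90129 | 0.77108 |
| 5 Fold CV | Specificity | 0.91578 | 0.89938 | 0.94026 | 0.92898 |
| 5 Fold CV | AUC | 0.85664 | 0.86206 | 0.87976 | 0.86569 |
| 10 Fold CV | Accuracy | 0.89434 | 0.90244 | 0.90244 | 0.92683 |
| 10 Fold CV | Sensitivity | 0.89282 | 0.89377 | 0.85528 | 0.85163 |
| 10 Fold CV | Specificity | 0.89649 | 0.89959 | 0.90646 | 0.92203 |
| 10 Fold CV | AUC | 0.88272 | 0.84327 | 0.86956 | 0.87964 |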
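
Table 3: Assessment metrics for the 2015 dataset under different cross-validation routines.

| CV routine | Efficiency Measure | Probit | Logit | Cauchit | CLogLog |
|---|---|---|---|---|---|
| LOOCV | Accuracy | 0.91869 | 0.91057 | 0.91057 | 0.91867 |
| LOOCV | Sensitivity | - | - | - | - |
| LOOCV | Specificity | - | - | - | - |
| LOOCV | AUC | - | - | - | - |
| LPOCV 50-50 | Accuracy | 0.87959 | 0.88045 | 0.90095 | 0.87831 |
| LPOCV 50-50 | Sensitivity | 0.85221 | 0.85256 | 0.86183 | 0.75874 |
| LPOCV 50-50 | Specificity | 0.89429 | 0.89517 | 0.92173 | 0.93732 |
| LPOCV 50-50 | AUC | 0.87642 | 0.87669 | 0.89002 | 0.85069 |
| LPOCV 75-25 | Accuracy | 0.89487 | 0.89554 | 0.90303 | 0.89390 |
| LPOCV 75-25 | Sensitivity | 0.87261 | 0.87254 | 0.87889 | 0.83568 |
| LPOCV 75-25 | Specificity | 0.91338 | 0.91342 | 0.92243 | 0.93295 |
| LPOCV 75-25 | AUC | 0.88657 | 0.88759 | 0.89545 | 0.87789 |
| 5 Fold CV | Accuracy | 0.87801 | 0.86992 | 0.91869 | 0.94309 |
| 5 Fold CV | Sensitivity | 0.86267 | 0.91722 | 0.91223 | 0.81844 |
| 5 Fold CV | Specificity | 0.91540 | 0.91054 | 0.88686 | 0.96386 |
| 5 Fold CV | AUC | 0.86579 | 0.88979 | 0.84357 | 0.89257 |
| 10 Fold CV | Accuracy | 0.90244 | 0.90244 | 0.90244 | 0.92683 |
| 10 Fold CV | Sensitivity | 0.88880 | 0.80434 | 0.888577 | 0.92547 |
| 10 Fold CV | Specificity | 0.91426 | 0.90871 | 0.90089 | 0.93413 |
| 10 Fold CV | AUC | 0.89381 | 0.90527 | 0.89636 | 0.92222 |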
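
Table 4: Assessment metrics for the aggregate dataset under different cross-validation routines.

| CV routine | Efficiency Measure | Probit | Logit | Cauchit | CLogLog |
|---|---|---|---|---|---|
| LOOCV | Accuracy | 0.90647 | 0.89928 | 0.88489 | 0.89928 |
| LOOCV | Sensitivity | - | - | - | - |
| LOOCV | Specificity | - | - | - | - |
| LOOCV | AUC | - | - | - | - |
| LPOCV 50-50 | Accuracy | 0.85891 | 0.85887 | 0.86636 | 0.86317 |
| LPOCV 50-50 | Sensitivity | 0.81299 | 0.81192 | 0.80615 | 0.76611 |
| LPOCV 50-50 | Specificity | 0.88721 | 0.88719 | 0.89612 | 0.91165 |
| LPOCV 50-50 | AUC | 0.84916 | 0.84902 | 0.85106 | 0.83805 |
| LPOCV 75-25 | Accuracy | 0.88414 | 0.88363 | 0.87677 | 0.88368 |
| LPOCV 75-25 | Sensitivity | 0.82103 | 0.81873 | 0.80982 | 0.77989 |
| LPOCV 75-25 | Specificity | 0.91621 | 0.91539 | 0.90996 | 0.93042 |
| LPOCV 75-25 | AUC | 0.87367 | 0.87269 | 0.86159 | 0.85784 |
| 5 Fold CV | Accuracy | 0.87769 | 0.83453 | 0.89928 | 0.90647 |
| 5 Fold CV | Sensitivity | 0.77307 | 0.85172 | 0.81814 | 0.76321 |
| 5 Fold CV | Specificity | 0.88082 | 0.90521 | 0.90692 | 0.92817 |
| 5 Fold CV | AUC | 0.89934 | 0.83118 | 0.85167 | 0.87317 |
| 10 Fold CV | Accuracy | 0.89928 | 0.89928 | 0.89208 | 0.90647 |
| 10 Fold CV | Sensitivity | 0.78801 | 0.85911 | 0.82494 | 0.78705 |
| 10 Fold CV | Specificity | 0.94224 | 0.93713 | 0.87470 | 0.93064 |
| 10 Fold CV | AUC | 0.88832 | 0.89685 | 0.84048 | 0.83493 |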
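
Table 5: Covariates used in the study and their classification as strength or stress.

| Covariates | Description | Type |
|---|---|---|
| GDP per Capita | | Strength |
| Social support | | Strength |
| Healthy Life Expectancy | | Strength |
| Freedom to make life choices | | Strength |
| Generosity | | Strength |
| Corruption Perception | | Stress |
| Positive affect | | Strength |
| Negative affect | | Stress |
| Confidence in national government | | Strength |
| Democratic and Delivery Quality | | Strength |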