Selection of link function in binary regression: A case-study with world happiness report on immigration

10/17/2019
by   Ardhendu Banerjee, et al.

Selection of an appropriate link function for binary regression remains an important issue for data analysis, given its influence on related inference. We prescribe a new data-driven methodology to search for the same, considering some popular classification assessment metrics. A case-study with the World Happiness Report 2018, with special reference to immigration, is presented to demonstrate the utility of the prescribed routine.

Keywords: Binary regression, Link function, Latent stress-strength, World happiness report, Cross-validation.

2010 MSC: 62-07. 62J12. 62P25.

1 Introduction

“Happiness is the joy that we feel when we’re striving after our potential.”

- Ancient Greek World

In 1979, at Bombay Airport, the King of Bhutan, Jigme Singye Wangchuck, replied to a query from an Indian journalist: “We do not believe in Gross National Product because Gross National Happiness is more important”. This was the beginning of a great philosophy, as reported by Tashi Dorji in his famous 2012 article “The story of a king, a poor country and a rich idea” (see Dorji, 2012). The UN started to follow this philosophy and, as a result, World Happiness Reports have been published since 2012; the sixth, the World Happiness Report 2018, is based on Gallup World Poll (GWP) data for 2005-2017. The World Happiness Index measures the happiness level of 156 countries and ranks them accordingly. For 117 of these 156 countries, the report also includes a happiness ranking on the basis of immigration (see Helliwell et al., 2018). The report uses the Cantril ladder (an imaginary ladder with steps numbered 0 to 10 from bottom to top, higher steps indicating higher levels of happiness) to measure the level of happiness. For a better understanding of well-being, the following parameters are also reported country-wise: Life Ladder, Log GDP per capita, Social support, Healthy life expectancy at birth, Freedom to make life choices, Generosity, Perceptions of corruption, Positive affect (the average of previous-day affect measures for happiness, laughter and enjoyment), Negative affect (the average of previous-day affect measures for worry, sadness and anger), Confidence in national government, Democratic Quality and Delivery Quality.

Nowadays the world is experiencing a wave of heavy migration. People migrate to different countries for different reasons, according to their perceived level of aspiration. The primary question when deciding to migrate to a new country, with a new environment, society, culture, habits and unfamiliar people, is: what is the best choice of destination? How can one decide which destination suits him/her best?

From the statistical perspective of this study, the happiness score is the response variable of interest and the various parameters mentioned above are covariates. Chapter 2 of the World Happiness Report (2018) focuses on international migration. In the statistical appendices of that chapter, pooled ordinary least squares regression is performed to assess the impact of each covariate on the response. It is to be noted that the response variable here is a score on an ordinal scale and thus does not comply with the usual assumptions of ordinary least squares. A more technically correct approach would be to classify each country as happy or unhappy based on the score and perform binary regression for the same purpose as in the mentioned statistical appendices, although this results in a marginal loss of information in the response variable. That said, categorization of the ordinal response is one way out for drawing statistically correct inference. The question remains: why two categories? The choice of the number of categories is not rigid, but natural intuition leads one to the dichotomy of happiness and unhappiness. Extension to more than two categories may make the physical interpretation clumsy. For performing binary regression, many link functions are available in the literature, both symmetric (e.g. the probit link) and asymmetric (e.g. the complementary log-log link), but only some are popular. For details on binary regression and the use of link functions, see Cox (2018) and Agresti and Kateri (2011). The choice of a proper link function is important, as mis-specification might have an adverse effect on inference and prediction. Some notable attempts at making a statistical choice of the appropriate link function can be found in Czado and Santner (1992), Huettmann and Linke (2003) and Li (2014). In the current work we consider a cross-validation based approach, along with a number of important assessment indices, to get a data-dependent choice of link function. Cross-validation based approaches are taken up in this work in view of the prediction purpose of binary regression modelling. We analyze multiple data-sets from the same context to find the appropriate choice of link function, and refrain from reporting the significance of individual covariates, which can easily be done once a suitable link is established. We hope that the best fitted model can improve the decision of choosing a country and increase the confidence of the immigrant, which will be effective for living a good life with increasing potential for mankind.

Motivation behind working with this data-set:

  • Reliable: the data are collected by a proven survey group.

  • With a moderate to large number of observations, cross-validation, and thus the prescribed routine, performs well. Thus, demonstration of the routine on this data-set is valid.

  • “Happiness”, in its true sense, is a determinant of immigration.

  • The happiness index is the most talked-about indicator in recent times.

In the next section we formulate the statistical problem associated with the data-set. In section 3, we briefly discuss the different assessment metrics used in this paper. Section 4 discusses the cross-validation schemes used. Section 5 presents a note on the survey concerned and the nature of the covariates present in the study. In section 6, the construction of the working data-sets and the findings from the numerical results are given. We finish with a short discussion on the relevance and scope of the study. Relevant tables are provided in the appendix.

2 Formulation of the problem

Suppose data on $n$ different countries are given, where the response variable $Y_i$ for the $i$-th country is recorded along with covariates $\mathbf{x}_i$, for $i = 1, \dots, n$. As mentioned in section 1, the response is ordinal, and it is well known that such responses cannot be modelled efficiently with the usual linear and generalized linear models (see Crichton and Hinde, 1992). Thus, with some fixed threshold $c$, we consider

$$Z_i = \begin{cases} 1, & \text{if } Y_i \ge c, \\ 0, & \text{otherwise}, \end{cases}$$

which is a reflection of two contradictory latent forces, viz. $U_i$ and $V_i$, as follows:

$$Z_i = \begin{cases} 1, & \text{if } U_i > V_i, \\ 0, & \text{otherwise}. \end{cases}$$

If we take $Z_i = 1$ as success, a natural interpretation for $U_i$ would be the positive force (strength) and for $V_i$ the negative force (stress). Latency of $U_i$ and $V_i$ prevents one from directly modelling $P(U_i > V_i)$. The set of covariates $\mathbf{x}_i$ is naturally partitioned into $\mathbf{x}_{1i}$ and $\mathbf{x}_{2i}$, the former influencing $U_i$ and the latter $V_i$.

Our interest here is to explore and investigate the possibilities for modelling the unobserved $U_i$ and $V_i$. One can assume both to follow independent normal, logistic, Cauchy or extreme value distributions, among others. The problem of choosing an appropriate model is equivalent to that of choosing a link function for binary regression, where $p_i = P(Z_i = 1)$ is modelled as

$$g(p_i) = \mathbf{x}_i^{\top} \boldsymbol{\beta}, \qquad \text{equivalently} \qquad p_i = F(\mathbf{x}_i^{\top} \boldsymbol{\beta}),$$

where $\boldsymbol{\beta}$ is the vector of parameters and $g = F^{-1}$ is a link function induced by the assumed common latent stress-strength distribution $F$. In this study, we consider the following four link functions:

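  • Probit link: $g(p) = \Phi^{-1}(p)$, where $p$ is the success probability and $\Phi$ is the cdf of the standard normal distribution.

  • Logit link: $g(p) = \log\{p/(1-p)\}$, the inverse of the cdf of the standard logistic distribution.

  • Cauchit link: $g(p) = \tan\{\pi(p - 1/2)\}$, the inverse of the cdf of the standard Cauchy distribution.

  • Complementary log-log link: $g(p) = \log\{-\log(1-p)\}$, the inverse of the cdf of the standard extreme value distribution.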
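The four links listed above can be compared numerically through their inverses, i.e. the cdfs that map a linear predictor to a success probability. The following is a minimal sketch using only the Python standard library; the function names and the dictionary `INVERSE_LINKS` are illustrative choices, not taken from the paper.

```python
import math

def probit_inverse(eta):
    """cdf of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))

def logit_inverse(eta):
    """cdf of the standard logistic distribution."""
    return 1.0 / (1.0 + math.exp(-eta))

def cauchit_inverse(eta):
    """cdf of the standard Cauchy distribution."""
    return 0.5 + math.atan(eta) / math.pi

def cloglog_inverse(eta):
    """Inverse of the complementary log-log link."""
    return 1.0 - math.exp(-math.exp(eta))

# Illustrative lookup table (hypothetical name).
INVERSE_LINKS = {
    "probit": probit_inverse,
    "logit": logit_inverse,
    "cauchit": cauchit_inverse,
    "cloglog": cloglog_inverse,
}

# Success probability implied by each link at a zero linear predictor.
p_at_zero = {name: f(0.0) for name, f in INVERSE_LINKS.items()}

# F(-1) + F(1) equals 1 only for the symmetric links.
symmetry_sum = {name: f(-1.0) + f(1.0) for name, f in INVERSE_LINKS.items()}
```

The three symmetric links give probability 0.5 at a zero linear predictor, while the complementary log-log link gives about 0.632, which is one way to see its asymmetry.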

For observed $z_1, \dots, z_n$, the likelihood function involves the chosen structure of $F$, and the maximum likelihood estimator of $\boldsymbol{\beta}$ is obtained through iteratively re-weighted least squares. For details on the connection of latent variable modelling with link functions, see Cox (2018), Albert and Chib (1993) and Banerjee and Biswas (2003). Here we apply some known methods for assessing the suitability of link functions and simultaneously use cross-validation to achieve the desired level of predictive performance for the models.
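To make the fitting step concrete, the sketch below runs iteratively re-weighted least squares (Fisher scoring) for a binary regression with an intercept and a single covariate, for any link supplied as a (cdf, pdf) pair. It is an illustration under stated assumptions, not the authors' implementation: the toy data `x_toy` and `z_toy` are hypothetical, and only the logit and probit links are shown.

```python
import math

def logistic_cdf(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def logistic_pdf(eta):
    p = logistic_cdf(eta)
    return p * (1.0 - p)

def normal_cdf(eta):
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))

def normal_pdf(eta):
    return math.exp(-0.5 * eta * eta) / math.sqrt(2.0 * math.pi)

def irls(x, z, cdf, pdf, n_iter=25):
    """Fisher scoring for P(Z = 1) = cdf(b0 + b1 * x), one covariate."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        s_w = s_wx = s_wxx = s_wy = s_wxy = 0.0
        for xi, zi in zip(x, z):
            eta = b0 + b1 * xi
            mu = cdf(eta)
            d = pdf(eta)                    # derivative of mu w.r.t. eta
            w = d * d / (mu * (1.0 - mu))   # working weight
            y_work = eta + (zi - mu) / d    # working response
            s_w += w
            s_wx += w * xi
            s_wxx += w * xi * xi
            s_wy += w * y_work
            s_wxy += w * xi * y_work
        # Solve the 2x2 weighted least squares normal equations.
        det = s_w * s_wxx - s_wx * s_wx
        b0 = (s_wxx * s_wy - s_wx * s_wxy) / det
        b1 = (s_w * s_wxy - s_wx * s_wy) / det
    return b0, b1

# Hypothetical toy data (not from the World Happiness Report).
x_toy = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
z_toy = [0, 0, 1, 0, 1, 0, 1, 1]

b0_logit, b1_logit = irls(x_toy, z_toy, logistic_cdf, logistic_pdf)
b0_probit, b1_probit = irls(x_toy, z_toy, normal_cdf, normal_pdf)
```

For the logit link, the fitted probabilities sum to the number of observed successes at convergence, which gives a quick sanity check on the fit.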

3 Different measures of assessment

In order to assess the performance of the link functions tried in our case study, we employ well known assessment measures available in the literature (see Tharwat, 2018). In binary classification problems, prediction of one of the two classes (usually positive and negative) is based on a new set of covariates. A positive (negative) sample point classified as positive (negative) is referred to as a true positive (true negative), whereas a positive sample point classified as negative is called a false negative, or Type II error, and a negative sample point classified as positive is called a false positive, or Type I error. The corresponding confusion matrix is shown below:


                              True or Actual Class
                              Positive               Negative
Predicted Class   Positive    True Positive (TP)     False Positive (FP)
                  Negative    False Negative (FN)    True Negative (TN)

Based on the confusion matrix, we consider the following metrics:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Sensitivity} = \frac{TP}{TP + FN}, \qquad \text{Specificity} = \frac{TN}{TN + FP}.$$

Accuracy is the simplest and most commonly used measure, and it is sensitive to imbalanced data. On the contrary, sensitivity and specificity are not sensitive to imbalanced data. When mis-classifying a true positive is a more serious error than mis-classifying a true negative, we should decide upon sensitivity; for the opposite scenario, specificity should be the metric to note. Along with these three, we also consider another popular assessment metric, the area under the ROC curve (AUC).

This measure is based on the receiver operating characteristic (ROC) curve and, being a scalar rather than a function, overcomes the difficulty of comparing different classifiers directly through their ROC curves.
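As an illustration of these measures, the sketch below computes accuracy, sensitivity, specificity and AUC with the Python standard library. The labels and scores are hypothetical, the 0.5 classification cut-off is an assumption, and the AUC is computed through its pairwise (Mann-Whitney) form rather than by integrating the ROC curve.

```python
def confusion_counts(actual, predicted):
    """Return (TP, FP, FN, TN) for 0/1 labels."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    return tp, fp, fn, tn

def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

def sensitivity(tp, fp, fn, tn):
    return tp / (tp + fn)

def specificity(tp, fp, fn, tn):
    return tn / (tn + fp)

def auc(actual, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for a, s in zip(actual, scores) if a == 1]
    neg = [s for a, s in zip(actual, scores) if a == 0]
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities.
actual = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1]
predicted = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 cut-off

counts = confusion_counts(actual, predicted)
acc = accuracy(*counts)
sens = sensitivity(*counts)
spec = specificity(*counts)
area = auc(actual, scores)
```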

4 On cross-validation methods

Since the main reason for modelling here is to predict the probability of being happy, it is essential to subject the proposed models to rigorous cross-validation, in addition to the assessments discussed in the last section. In this study, we implement leave-one-out, leave-p-out and k-fold cross-validation routines, briefly discussed below. For details on various cross-validation approaches and their relative performance, see section 5.1 of James et al. (2013).


  • Leave-p-out CV: This method uses $p$ out of the $n$ observations as the validation set, while the remaining $n - p$ observations are taken as the training set. The exercise is repeated over all possible ways to partition a sample of size $n$ into two sets, one with $p$ and the other with $n - p$ elements. Obviously, with large $n$ and even moderate $p$, the number of validation sets, $\binom{n}{p}$, may explode. For $p = 1$, this method reduces to leave-one-out cross-validation (LOOCV). In the numerical study, we apply LOOCV and LPOCV with two different choices of $p$: Hold 50%-Leave 50% and Hold 75%-Leave 25%.

  • k-fold CV: One way to avoid an exhaustive CV method as above is to apply k-fold CV, where the sample is randomly partitioned into k sub-samples of equal size. Of these k sub-samples, one is taken as the validation set and the rest are used for training. This process is repeated such that each of the k sub-samples is used as the validation set exactly once. In the numerical study, we perform this with $k = 5$ and $k = 10$.
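The two schemes above can be sketched as index-splitting helpers. The code below is a standard-library illustration with a hypothetical sample size of 20; the helper `random_holdout_splits`, which draws random validation sets instead of enumerating all of them, is an assumption added here as one common way to keep leave-p-out feasible and is not claimed to be the procedure used in the paper.

```python
import math
import random

def n_leave_p_out_splits(n, p):
    """Number of distinct validation sets in exhaustive leave-p-out CV."""
    return math.comb(n, p)

def k_fold_splits(n, k, seed=0):
    """Random partition of range(n) into k folds; returns (train, validation) pairs."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in indices if i not in held_out]
        splits.append((train, fold))
    return splits

def random_holdout_splits(n, n_validation, n_repeats, seed=0):
    """Randomly sampled leave-p-out splits (assumed alternative to enumeration)."""
    rng = random.Random(seed)
    splits = []
    for _ in range(n_repeats):
        validation = rng.sample(range(n), n_validation)
        held_out = set(validation)
        train = [i for i in range(n) if i not in held_out]
        splits.append((train, validation))
    return splits

n = 20  # hypothetical sample size
exhaustive_count = n_leave_p_out_splits(n, n // 2)
five_fold = k_fold_splits(n, 5)
half_splits = random_holdout_splits(n, n // 2, n_repeats=100)
```

Even for 20 observations, an exhaustive 50-50 split requires 184756 model fits, whereas 5-fold CV requires only five.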

5 A note on GWP survey

The Gallup World Poll (GWP) has conducted surveys in over 160 countries since 2005, with samples of 1000 adults per country (for large countries the sample size can be 2000), on a semiannual, annual or biennial basis. The survey includes almost 100 questions, asked in a similar manner to people of different regions of the world, either by telephone (generally in developed countries), taking almost 30 minutes, or by direct interview (generally in developing countries), taking almost 1 hour. The World Happiness Report (WHR) 2018 used these GWP data for developing the happiness index and modelling it with the covariates described in Table 5. In accordance with the formulation given in section 2, we identify the factors which contribute positively towards the happiness of migrants as covariates for the latent strength $U$ and the remaining ones as covariates for the latent stress $V$; this classification is given in column 3 of Table 5.

6 Numerical study and findings

We consider four different but related data-sets for demonstrating the method of choosing the suitable link function or, equivalently, the latent stress-strength model. Data-sets reporting the happiness score with related covariates are available for different years. Here, we consider the data-sets for 2015, 2016 and 2017. For stability, we construct another data-set with the current year's happiness score along with the covariate values averaged over the years 2005 to 2017. The 2017 data-set contains information on 10 covariates (excluding positive affect and negative affect) while the other data-sets contain all 12 covariates. The working data-sets (https://worldhappiness.report/ed/2018/) have been used for ranking the countries according to the migrants' level of satisfaction using the happiness score. As indicated in section 1 and section 2, we categorize a country as a good choice for migration from the “happiness” perspective if its score is greater than or equal to 6. The remaining countries fall into the other category. Thus, the transformed binary response variable $Z_i$ for the $i$-th country is as follows:

$$Z_i = \begin{cases} 1, & \text{if the happiness score of the } i\text{-th country is at least } 6, \\ 0, & \text{otherwise}. \end{cases}$$

The transformed variable $Z_i$ is the response in the current study. Our main interest is to model $Z_i$ with the available covariates using different link functions and to find the most suitable one.

For each of the four data-sets discussed above, we compute the assessment metrics accuracy, sensitivity, specificity and AUC with each cross-validation routine mentioned in section 4. The numerical results for 2017, 2016, 2015 and the aggregate data are given in Table 1, Table 2, Table 3 and Table 4, respectively. For LOOCV, the metrics other than accuracy cannot be calculated, as the single test sample is either a positive or a negative.

With respect to each assessment metric, we identify the best-performing link function as the one which attains rank 1 the maximum number of times over all cross-validation routines. A tie, if any, is resolved by going to the next stage and checking the next rank, and so on. Using this scheme, we arrive at the following conclusions:

With respect to accuracy:

  • For 2017, logit performs best followed by probit.

  • For 2016, cauchit performs best followed by complementary log-log.

  • For 2015, cauchit performs best followed by complementary log-log.

  • For aggregate data, complementary log-log performs best followed by probit.

With respect to sensitivity:

  • For 2017, probit performs best followed by logit.

  • For 2016, cauchit performs best followed by logit and probit.

  • For 2015, cauchit performs best followed by logit.

  • For aggregate data, logit performs best followed by probit.

With respect to specificity:

  • For 2017, complementary log-log performs best followed by cauchit.

  • For 2016, cauchit and complementary log-log both perform equally well.

  • For 2015, complementary log-log performs best followed by cauchit.

  • For aggregate data, cauchit performs best followed by probit.

With respect to AUC:

  • For 2017, cauchit performs best followed by logit.

  • For 2016, probit performs best followed by logit.

  • For 2015, cauchit and complementary log-log perform equally well.

  • For aggregate data, probit performs best followed by logit.

Overall, it is observed that the cauchit link function is the best or second best in half of the cases, followed by the complementary log-log link. Practitioners of binary regression modelling should therefore make an effort to search for the best link from the set of available link functions, for better inference and prediction.
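The ranking rule used in this section (most rank-1 finishes across cross-validation routines, with ties broken by the next rank) can be sketched as follows. The values are the first row of each block of Table 1 (2017 data-set); treating links with equal scores as sharing the better rank is an assumption of this sketch, as are the function names.

```python
LINKS = ["probit", "logit", "cauchit", "cloglog"]

# First-row values of each block of Table 1, one list per CV routine,
# in the column order Probit, Logit, Cauchit, C-Log-Log.
TABLE1_FIRST_ROWS = {
    "LOOCV":       [0.92437, 0.92437, 0.91597, 0.91597],
    "LPOCV 50-50": [0.87237, 0.87187, 0.87790, 0.87373],
    "LPOCV 75-25": [0.89867, 0.89930, 0.89880, 0.89550],
    "5 Fold CV":   [0.92437, 0.88235, 0.90756, 0.89916],
    "10 Fold CV":  [0.88225, 0.92437, 0.89076, 0.91597],
}

def rank_counts(scores_by_routine, links):
    """For each link, count how often it attains rank 1, 2, ... over routines."""
    counts = {link: [0] * len(links) for link in links}
    for scores in scores_by_routine.values():
        for link, score in zip(links, scores):
            # Tied scores share the better rank (assumption of this sketch).
            rank = 1 + sum(1 for other in scores if other > score)
            counts[link][rank - 1] += 1
    return counts

def order_links(scores_by_routine, links):
    """Order links by rank-1 count, then rank-2 count, and so on."""
    counts = rank_counts(scores_by_routine, links)
    return sorted(links, key=lambda link: counts[link], reverse=True)

counts_2017 = rank_counts(TABLE1_FIRST_ROWS, LINKS)
ordering_2017 = order_links(TABLE1_FIRST_ROWS, LINKS)
```

On these values the ordering starts with logit followed by probit, in agreement with the first conclusion stated for 2017.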

7 Discussion

It is true that there is a tendency among analysts to opt for the logit link function when dealing with binary response modelling, despite the fact that the distributional assumptions underlying such a choice may often not hold. This has the potential to generate statistically incorrect findings, and the consequences may be costly in some domains of research. The findings of our investigation confirm the issue and highlight how different link functions come to the fore, contrary to established practice, with data from the same context, across different periods and assessment metrics. The data-driven methodology for finding the best link function presented in this short case study aims to provide a meaningful way to address the issue.

References

Agresti, A., & Kateri, M. (2011). Categorical data analysis. Springer Berlin Heidelberg.

Albert, J. H., & Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88(422), 669-679.

Banerjee, T., & Biswas, A. (2003). A new formulation of stress–strength reliability in a regression setup. Journal of Statistical Planning and Inference, 112(1-2), 147-157.

Cox, D. R. (2018). Analysis of binary data. Routledge.

Crichton, N., & Hinde, J. (1992). Investigation of an ordered logistic model for consumer debt. In Advances in GLIM and Statistical Modelling (pp. 54-59). Springer, New York, NY.

Czado, C., & Santner, T. J. (1992). The effect of link misspecification on binary regression inference. Journal of Statistical Planning and Inference, 33(2), 213-231.

Dorji, T. (2012). The story of a king, a poor country and a rich idea. https://blogbhutan.wordpress.com/2012/06/11/the-story-of-a-king-a-poor-country-and-a-rich-idea/

Helliwell, J., Layard, R., & Sachs, J. (2018). World Happiness Report 2018. New York: Sustainable Development Solutions Network.

Huettmann, F., & Linke, J. (2003, May). Assessment of different link functions for modeling binary data to derive sound inferences and predictions. In International Conference on Computational Science and Its Applications (pp. 43-48). Springer, Berlin, Heidelberg.

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning (Vol. 112, p. 18). New York: Springer.

Li, J. (2014). Choosing the proper link function for binary data (Doctoral dissertation). Link: https://repositories.lib.utexas.edu/handle/2152/26363

Tharwat, A. (2018). Classification assessment methods. Applied Computing and Informatics.

Appendix

CV routine     Efficiency Measure   Probit     Logit      Cauchit    C-Log-Log
LOOCV          Accuracy             0.92437    0.92437    0.91597    0.91597
               Sensitivity          -          -          -          -
               Specificity          -          -          -          -
               AUC                  -          -          -          -
LPOCV 50-50    Accuracy             0.87237    0.87187    0.87790    0.87373
               Sensitivity          0.84482    0.84403    0.83985    0.79121
               Specificity          0.88116    0.88157    0.89471    0.91237
               AUC                  0.86316    0.86325    0.86555    0.84927
LPOCV 75-25    Accuracy             0.89867    0.89930    0.89880    0.89550
               Sensitivity          0.87655    0.87454    0.85686    0.81494
               Specificity          0.91414    0.91528    0.91872    0.93653
               AUC                  0.89235    0.89235    0.88829    0.87556
5 Fold CV      Accuracy             0.92437    0.88235    0.90756    0.89916
               Sensitivity          0.87507    0.90728    0.89931    0.83459
               Specificity          0.88805    0.90810    0.92839    0.95928
               AUC                  0.92180    0.90560    0.88214    0.91859
10 Fold CV     Accuracy             0.88225    0.92437    0.89076    0.91597
               Sensitivity          0.87395    0.74789    0.83673    0.92941
               Specificity          0.91176    0.93389    0.92577    0.92913
               AUC                  0.88229    0.87057    0.89675    0.89499
Table 1: Based on 2017 data-set

CV routine     Efficiency Measure   Probit     Logit      Cauchit    C-Log-Log
LOOCV          Accuracy             0.88618    0.88618    0.91057    0.89431
               Sensitivity          -          -          -          -
               Specificity          -          -          -          -
               AUC                  -          -          -          -
LPOCV 50-50    Accuracy             0.85126    0.85197    0.86955    0.85765
               Sensitivity          0.81341    0.81373    0.81845    0.75867
               Specificity          0.87120    0.87292    0.89362    0.90662
               AUC                  0.84398    0.84503    0.85686    0.83296
LPOCV 75-25    Accuracy             0.87861    0.84800    0.88290    0.88119
               Sensitivity          0.84960    0.84794    0.84388    0.82313
               Specificity          0.89580    0.89557    0.90249    0.91086
               AUC                  0.87036    0.86985    0.86929    0.86545
5 Fold CV      Accuracy             0.88618    0.89431    0.90244    0.90244
               Sensitivity          0.87434    0.82488    0.90129    0.77108
               Specificity          0.91578    0.89938    0.94026    0.92898
               AUC                  0.85664    0.86206    0.87976    0.86569
10 Fold CV     Accuracy             0.89434    0.90244    0.90244    0.92683
               Sensitivity          0.89282    0.89377    0.85528    0.85163
               Specificity          0.89649    0.89959    0.90646    0.92203
               AUC                  0.88272    0.84327    0.86956    0.87964
Table 2: Based on 2016 data-set

CV routine     Efficiency Measure   Probit     Logit      Cauchit    C-Log-Log
LOOCV          Accuracy             0.91869    0.91057    0.91057    0.91867
               Sensitivity          -          -          -          -
               Specificity          -          -          -          -
               AUC                  -          -          -          -
LPOCV 50-50    Accuracy             0.87959    0.88045    0.90095    0.87831
               Sensitivity          0.85221    0.85256    0.86183    0.75874
               Specificity          0.89429    0.89517    0.92173    0.93732
               AUC                  0.87642    0.87669    0.89002    0.85069
LPOCV 75-25    Accuracy             0.89487    0.89554    0.90303    0.89390
               Sensitivity          0.87261    0.87254    0.87889    0.83568
               Specificity          0.91338    0.91342    0.92243    0.93295
               AUC                  0.88657    0.88759    0.89545    0.87789
5 Fold CV      Accuracy             0.87801    0.86992    0.91869    0.94309
               Sensitivity          0.86267    0.91722    0.91223    0.81844
               Specificity          0.91540    0.91054    0.88686    0.96386
               AUC                  0.86579    0.88979    0.84357    0.89257
10 Fold CV     Accuracy             0.90244    0.90244    0.90244    0.92683
               Sensitivity          0.88880    0.80434    0.888577   0.92547
               Specificity          0.91426    0.90871    0.90089    0.93413
               AUC                  0.89381    0.90527    0.89636    0.92222
Table 3: Based on 2015 data-set

CV routine     Efficiency Measure   Probit     Logit      Cauchit    C-Log-Log
LOOCV          Accuracy             0.90647    0.89928    0.88489    0.89928
               Sensitivity          -          -          -          -
               Specificity          -          -          -          -
               AUC                  -          -          -          -
LPOCV 50-50    Accuracy             0.85891    0.85887    0.86636    0.86317
               Sensitivity          0.81299    0.81192    0.80615    0.76611
               Specificity          0.88721    0.88719    0.89612    0.91165
               AUC                  0.84916    0.84902    0.85106    0.83805
LPOCV 75-25    Accuracy             0.88414    0.88363    0.87677    0.88368
               Sensitivity          0.82103    0.81873    0.80982    0.77989
               Specificity          0.91621    0.91539    0.90996    0.93042
               AUC                  0.87367    0.87269    0.86159    0.85784
5 Fold CV      Accuracy             0.87769    0.83453    0.89928    0.90647
               Sensitivity          0.77307    0.85172    0.81814    0.76321
               Specificity          0.88082    0.90521    0.90692    0.92817
               AUC                  0.89934    0.83118    0.85167    0.87317
10 Fold CV     Accuracy             0.89928    0.89928    0.89208    0.90647
               Sensitivity          0.78801    0.85911    0.82494    0.78705
               Specificity          0.94224    0.93713    0.87470    0.93064
               AUC                  0.88832    0.89685    0.84048    0.83493
Table 4: Based on the aggregate data-set

Covariates                          Description                                                              Type

GDP per Capita                      Purchasing Power Parity as given by World Development Indicators.        Strength

Social support                      The national average of the binary responses (either 0 or 1) to the      Strength
                                    GWP question “If you were in trouble, do you have relatives or
                                    friends you can count on to help you whenever you need them, or
                                    not?”

Healthy Life Expectancy             The time series of healthy life expectancy at birth are based on         Strength
                                    data from the World Health Organization (WHO), the World
                                    Development Indicators (WDI) and statistics published in journal
                                    articles. Non-health adjusted life expectancy is taken, and the
                                    time series of total life expectancy is adjusted to healthy life
                                    expectancy by simple multiplication, assuming that the ratio
                                    remains constant within each country over the sample period.

Freedom to make life choices        The national average of responses to the GWP question “Are you           Strength
                                    satisfied or dissatisfied with your freedom to choose what you do
                                    with your life?”

Generosity                          The residual of regressing the national average of responses to the      Strength
                                    GWP question “Have you donated money to a charity in the past
                                    month?” on GDP per capita.

Corruption Perception               The national average of the survey responses to two questions in         Stress
                                    the GWP: “Is corruption widespread throughout the government or
                                    not?” and “Is corruption widespread within businesses or not?” The
                                    overall perception is just the average of the two 0-or-1 responses.

Positive affect                     The average of three positive affect measures in GWP: happiness,         Strength
                                    laughter and enjoyment, in the Gallup World Poll waves 3-7.

Negative affect                     The average of three negative affect measures in GWP: worry,             Stress
                                    sadness and anger.

Confidence in national government   GWP asked the question “Do you have confidence in each of the            Strength
                                    following, or not? How about the national government?”

Democratic and Delivery Quality     Based on WGI, which accounts for Voice and Accountability,               Strength
                                    Political Stability and Absence of Violence, Government
                                    Effectiveness, Regulatory Quality, Rule of Law and Control of
                                    Corruption. The indicators are on a scale with mean roughly zero
                                    and a standard deviation of 1. In WHR the dimensions are reduced
                                    to two, using the simple average of the first two measures as an
                                    indicator of democratic quality and the simple average of the
                                    other four measures as an indicator of delivery quality.

Table 5: Description and classification of the covariates