Panel surveys provide a valuable data source for studying social phenomena from a causal perspective by measuring attitudes and (reported) behavior of individuals over time and are therefore used extensively in the social sciences. However, the validity of results drawn from panel data depends on the ability of survey organizations to establish and maintain a high-quality panel over time. One of the most severe threats to panel data quality is selective nonresponse of panel members, which can lead to biased survey estimates if response propensities are correlated with the outcome of interest (Lynn2009; Groves2006). Moreover, nonresponse may accumulate over multiple panel waves and can therefore be linked to panel attrition, i.e. panelists dropping out of the panel altogether (Watson2009). Decreasing sample sizes thereby reduce the efficiency of estimates, limit detailed subgroup analyses and might eventually force survey managers to draw costly refreshment samples.
Whereas nonresponse in panel studies has traditionally been tackled by using nonresponse weights (Groves2009, Rendtel2009) or imputation methods (Rubin1987), recent work studies panel nonresponse from a prediction perspective (Kern2019, Buskirk2018). In this context, machine learning methods (ML, see Hastie2009 for an overview) are employed to predict nonresponse in advance, shifting the focus from post- to pre-correction of panel nonresponse. The prediction models are typically intended to inform the data collection process akin to the idea of adaptive designs (Groves2006a; Lynn2017), e.g. by targeting likely nonrespondents in new panel waves with higher incentives. Against this background, prediction accuracy in future panel waves becomes the center of attention.
While previous work indicates that machine learning may be used to accurately predict attrition and non-participation for selected hold-out waves or samples (e.g., Klausch2017; Lugtig2018), there is little guidance on how to efficiently train, tune and evaluate nonresponse prediction models in a longitudinal context. This study proposes a framework for building and evaluating nonresponse models that particularly focuses on incorporating information from multiple panel waves. With respect to model building, this approach utilizes features that aggregate previous (non)response patterns, e.g. by counting the total number of non-participations up to a given wave. Concerning model tuning and evaluation, temporal cross-validation (rolling forecasting origin technique, Hyndman2018) is employed by iterating through pairs of panel waves. This allows for tracking the performance of prediction models in multiple test sets and selecting the model (i.e., the combination of model type, hyperparameter setting and feature groups) that performs best over time. The evaluation approach thereby mimics the application of the prediction models in the field and enables performance comparisons under deployment conditions. The proposed framework draws on the model building and evaluation schemes that are implemented in the general purpose risk modeling and prediction toolkit triage (Crockett2018).
The present study illustrates the longitudinal perspective on nonresponse prediction with data from a German probability-based mixed-mode panel (GESIS Panel; Bosnjak2017). We consider penalized logistic regression, decision trees, random forests, extremely randomized trees and extreme gradient boosting for building the prediction models. We construct multiple feature groups by aggregating over different time frames to study the effect of including historic information on prediction performance. Altogether, this results in 4000 prediction models that are built and evaluated over 20 training and test waves. We focus on discussing these models with respect to their ability to identify likely nonrespondents in advance, in line with the higher-level goal of developing a prediction-based targeted design.
The remainder of this paper is structured as follows. Section 2 briefly reviews previous approaches for modeling and predicting nonresponse in panel studies. Section 3 lays out the proposed framework and, in parallel, its application, i.e. the data (section 3.1), the feature generation process (section 3.2), the temporal cross-validation strategy (section 3.3) and the model types that are used in the empirical example (section 3.4). The results are presented in section 4, which includes identifying and selecting optimal model settings over time (section 4.1) and a more detailed evaluation of the selected models in the last test wave (section 4.2). We close by discussing the advantages and limitations of our longitudinal prediction approach in section 5.
2 Analyzing nonresponse and attrition in panel studies
Nonresponse and attrition in panel studies have been analyzed with a multitude of statistical methods in previous research, depending on the objective of the respective modeling task. On a higher level, these objectives may be classified as explanation, which typically entails the usage of parametric methods, and prediction, which turns the focus towards supervised learning techniques.
A common approach for studying factors that affect nonresponse is to model contact and cooperation propensities with logistic regressions on a wave-by-wave basis, typically by focusing on predictors from the respective previous wave (see e.g. Siegers2019, Lipps2007, Behr2005). As this can lead to many analyses, the probability of responding in a given wave may be analyzed longitudinally on the person-wave level, which allows the inclusion of time as a covariate (e.g., Uhrig2008, Lipps2009, Behr2005). A related approach is to particularly study the time-dependence of attrition with survival analysis (e.g., Struminskaya2015, Richter2014). Further extensions include modeling the response process as the occurrence of sequential events (establishing contact with the sample member and obtaining cooperation given successful contact) by using bivariate probit models (Watson2009, Nicoletti2005) or analyzing the count of complete interviews per year with multilevel negative binomial regression (DeLeeuw2017).
Another set of studies focuses on modeling different types of response patterns, e.g. by creating a multiclass outcome which differentiates between attrition at different points in time (i.e., in earlier vs. later panel waves, Durrant2010). While this approach covers monotonic attrition patterns, further work separates attriting from returning respondents in multinomial regressions to study factors that affect temporary nonresponse (Voorpostel2010, Burkam1998). This perspective is extended by Lugtig2014 who categorizes respondents based on the similarity of their response patterns using latent class analysis.
In this context, it is worth noting that studies that model panel nonresponse with parametric regression methods also commonly include information about response patterns in previous waves as predictors. This includes adding panel experience, temporary drop-out in previous waves or the absolute number of previously completed surveys to the set of explanatory variables (e.g., Siegers2019, Rossmann2016). Related work studies the effects of changes and events between waves on panel attrition (Trappmann2015, Voorpostel2011). A thorough inclusion of historic information is implemented by Kocar2019, who constructs longitudinal predictors by computing average survey outcome rates (for participation, non-contact, and refusal) prior to a wave, the number of consecutive survey outcomes prior to a wave, and changes in response status over all previous waves.
While the outlined studies typically focus on studying factors that cause nonresponse and attrition in panel surveys, recent work approaches nonresponse from a prediction perspective. Initial work in this field primarily investigates and compares the usage of different machine learning methods with different prediction objectives. Kern2019 predict refusals in the German Socio-Economic Panel Study (GSOEP) with one pair of panel waves and report favorable prediction performance of tree-based ensemble methods in comparison with logistic regression. Klausch2017 focuses on short- and long-term attrition in the LISS panel (Longitudinal Internet Studies for the Social sciences) and demonstrates remarkable prediction performance independent of the prediction method (data in wide format). Also using the LISS panel, Mulder2018 report favorable performance of random forests compared to multilevel logistic regression (data in long format). In contrast, Lugtig2018 find little difference in the performance of logistic regression and random forests when predicting attrition in the German Internet Panel (GIP) on a wave-by-wave basis. Further examples include Liu2018 (predicting re-participation in the Surveys of Consumers), McKay2019 (predicting nonresponse in Understanding Society), Wuerbach2019 (predicting participation status in the NEPS Newborn Cohort) and Bach2019 (estimating response propensities for the LISS panel, SOFT and EPBG survey). Note, however, that the aforementioned studies use different methods and metrics to evaluate prediction performance, which limits the comparability of results. More importantly, previous studies that use ML for predicting nonresponse do not make full use of the longitudinal data structure in terms of model building and evaluation.
Against this background, we propose to employ feature generation by aggregating over time in connection with temporal cross-validation as a general framework for building and evaluating nonresponse prediction models with panel data.
In this study, data from the GESIS Panel (Bosnjak2017) will be used to exemplify the longitudinal prediction approach. The GESIS Panel is a probability-based mixed-mode access panel that is based on a random sample from German population registers, started in 2013 and is conducted on a bi-monthly basis. In the initial recruitment sample the target population consisted of the German-speaking population between the ages of 18 and 70. The GESIS Panel comprises two participation modes, i.e., an online as well as a mail mode.
The GESIS Panel is an access panel, which means that it is open to the academic research community for data collection. Researchers from universities and non-commercial research institutes can submit their studies. The proposals are then peer-reviewed and realized within a five-minute slot in the event of a favorable decision. The studies (currently more than 40) can be cross-sectional or longitudinal.
In this study, we use data that starts with the first complete wave of the GESIS Panel, wave ba (February 2014), and ends with wave ed (August 2017), which results in a total of 22 panel waves. [Footnote 1: The waves are labeled based on the following naming scheme: [year: a = 2013, b = 2014, …][waves: a…f = 1…6], i.e. wave ed corresponds to year 2017 and the fourth wave in that year.] In addition, we use information from the recruitment interview which was conducted in 2013.
As in other panel studies, the GESIS Panel sample is subject to attrition, which can be caused by various mechanisms. First, participants drop out of the panel if, for example, they leave the panel due to illness or death. Second, panelists can unsubscribe from the panel themselves if they no longer wish to participate (voluntary attrition). Third, participants are excluded if they have not participated in at least three consecutive waves (involuntary attrition). Panelists cannot re-join the GESIS Panel once they have dropped out.
To provide some context for the following analysis, Figure 1 presents attrition and participation rates in the GESIS Panel by mode for our period of study. The three lines in the lower part of the plot display the cumulative percentage of attrition (overall, online, offline) relative to the active panel population before wave ba was fielded (i.e., all panelists who completed the GESIS Panel welcome survey). It can be seen that offliners are more likely to leave the panel than online participants. Furthermore, there is a strong increase in attrition between wave bc (June 2014) and wave bd (August 2014). This is due to the fact that at this point persons are excluded from the panel due to three consecutive nonresponses for the first time. In the upper part of the plot, it becomes apparent that the overall participation rate (relative to all panelists invited for a given wave) initially fluctuates quite strongly and then stabilizes at around 90 percent from 2015 onwards. The participation rate of online users is almost always higher than the participation rate of offliners, whereas the two rates converge towards each other over time.
Note that to partly compensate for the effects of panel attrition in the GESIS Panel, a refreshment sample is recruited every two years. In the following analysis, however, we will solely focus on the first cohort of the GESIS Panel, which provides data over the longest time span. Specifically, our analysis includes active panel members of the first cohort for each wave, i.e. panelists who permanently drop out in a given wave are excluded from our analysis in the following waves. As a result, we start with a sample size of in wave ba and end with in wave ed.
The main objective of this study is to predict nonresponse in the GESIS Panel on a wave-by-wave basis. More precisely, we use a binary variable that indicates participation (0) or non-participation (1) as our outcome for each panel wave. We define participation as a complete or partially complete interview with sufficient information (AAPOR categories 11 and 12; AAPOR2016) and consider all other outcomes as non-participation. In wave ed (August 2017), the main AAPOR categories that combine into non-participation of the then active panel members were “319: Nothing ever returned” (7.39%), “212: Break-off” (0.67%) and “211221: Logged on to survey, did not complete any items” (0.46%). The proportion of non-contact was 0.24% (AAPOR category 3311).
Non-participation in panel waves has been linked to a variety of factors that operate on different stages of the fielding process (locating the sample member, contacting the sample member, obtaining cooperation; Watson2009). Since non-participation in the GESIS Panel can be considered to be predominantly driven by non-cooperation and our longitudinal setup requires us to focus on predictors that are available for all panel waves, we use respondent characteristics and previous interview experience as main explanatory concepts. We thereby follow previous studies that identified cooperation in previous waves and enjoyment and perceived difficulty of previous interviews as important drivers of nonresponse and attrition (e.g., Hill2001; Olsen2005; Frankel2014).
The predictor variables (in the machine learning literature these variables are also called “features”) are summarized in Table 1. With respect to respondent characteristics, we focus on socio-demographic variables and general personal traits. These variables were derived from the recruitment interview of the GESIS Panel and are treated as time-invariant in the following analysis. [Footnote 2: Socio-demographic information is updated annually in the GESIS Panel, which does not readily match with the structure of our longitudinal feature generation scheme.] With respect to (previous) interview experience, we consider cooperation-related variables from the recruitment interview (e.g., interviewer-assessments on the willingness to participate in the GESIS Panel) as well as response status, survey evaluation (e.g., whether the questionnaire was considered as too long) and general survey participation (e.g., whether the respondent answered the survey in one piece) from each panel wave. We further include binary missing indicators that flag missing values for each substantive variable.
The wave-specific variables are by definition time-variant and therefore require some form of pre-processing in order to be efficiently used for predicting nonresponse in a given wave. We create features from time-variant variables by aggregating over time, i.e. summing the occurrence of a given category in previous waves. For example, this results in counting the number of complete interviews in the last three waves (for each wave). Note that for response status, this strategy resembles model building with multiple lags of the dependent variable. When constructing longitudinal features, we consider different time horizons to compare the predictive power of short- (last wave, Block II), medium- (last three waves, Block III) and long-term (all previous waves, Block IV) response histories. With this setup, we structure our complete set of predictors in four blocks:
Table 1: Predictor variables
|Concept||Variables||Source|
|Respondent/socio-demographic characteristics||Age, sex, migration background, education, marital status, household size, employment status, job type, personal income, household income, house type, house condition, social status, life satisfaction, general trust||Recruitment interview|
|Survey cooperation||Survey experience, willingness to respond (in recruitment interview), willingness to participate (in recruitment interview), willingness to participate (in panel), probability of participation (in panel), provided telephone number, provided e-mail address||Recruitment interview|
|Response status||Response status (complete interview, partial interview, non-participation)||Panel waves (ba-ed)|
|Survey evaluation||Questionnaire: Interesting, diverse, important for science, long, difficult, too personal, overall assessment||Panel waves (ba-ed)|
|Survey participation||Mode, participation interrupted, participation location||Panel waves (ba-ed)|
Block I: Time-invariant
Respondent/ socio-demographic characteristics from recruitment interview
Survey cooperation in recruitment interview
Block II: Time-variant (last wave)
Response status, survey evaluation and participation in last wave
Block III: Time-variant (aggregated, last three waves)
Response status, survey evaluation and participation over the last three waves
Block IV: Time-variant (aggregated, all previous waves)
Response status, survey evaluation and participation over all previous waves
These feature blocks are built for each wave of the GESIS Panel, akin to generating an outcome variable for each panel wave. Note that since the last block aggregates over all previous waves at a given point in time, the amount of information that is summarized in this block increases as we move towards more recent panel waves. When building the prediction models, we allow models to use (all) feature blocks simultaneously and (each block) separately, which results in five feature groups (i.e., block I, II, III, IV, all). This allows us to study the effect of aggregating over multiple panel waves with different time horizons on prediction performance.
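The aggregation over different time horizons can be illustrated with a minimal sketch. It is not the implementation used in this study (feature generation here relies on triage); the function name `aggregate_history`, the status labels and the feature names are hypothetical and chosen for illustration only.

```python
from collections import Counter

# Illustrative status labels (assumption, not the GESIS Panel coding).
CATEGORIES = ("complete", "partial", "nonresponse")

def aggregate_history(history, n_prior_waves):
    """Count response statuses over the waves preceding the target wave.

    history: list of status strings, oldest wave first.
    n_prior_waves: number of waves fielded before the target wave;
        only these waves may enter the features.
    """
    past = history[:n_prior_waves]
    windows = {
        "last1": past[-1:],  # Block II: last wave
        "last3": past[-3:],  # Block III: last three waves
        "all": past,         # Block IV: all previous waves
    }
    features = {}
    for window_name, statuses in windows.items():
        counts = Counter(statuses)
        for category in CATEGORIES:
            features[f"{category}_{window_name}"] = counts[category]
    return features

# One hypothetical panelist; the fifth wave is the target and is not used.
history = ["complete", "complete", "nonresponse", "partial", "nonresponse"]
features = aggregate_history(history, n_prior_waves=4)
print(features)
```

The number of features stays constant across waves, while the "all" window summarizes more information for more recent target waves.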
3.3 Temporal Cross-Validation
A key requirement for evaluating and comparing nonresponse prediction models with panel data is that the evaluation method closely aligns with the intended usage of these models in the field. Commonly, the goal is to predict nonresponse in a new panel wave, based on information from previous waves. This task corresponds to the objective of forecasting (with time series data), for which various evaluation methods have been proposed (see West2006, Tashman2000, Bergmeir2012). Common techniques include fixed-origin evaluation, where the time point which separates the fit and test period – the origin – is fixed, and rolling-origin evaluation, where the origin moves forward in time (Hyndman2018). Temporal cross-validation resembles the rolling origin technique by structuring training and test sets by time, such that – in a panel data setting – completely new sets of panel waves are used for performance evaluation and historical information is used for training. Repeating this process allows for tracking the performance of a given prediction model over time, enabling model selection based on performance trends.
Figure 2 illustrates the temporal cross-validation setup of the present study. The first training set draws on features from the recruitment interview and from the first complete wave of the GESIS Panel (February 2014) and builds the outcome based on wave two (April 2014). The first test set uses features from the recruitment interview and from the first two GESIS Panel waves (February and April 2014) and builds the outcome based on wave three (June 2014). Note that for all blocks the same number of variables is created each time, whereas the information that is included and aggregated in those variables is time-dependent. In summary, this setup ensures that prediction models are trained with information that is available up to a specific point in time and evaluated with respect to their prediction performance in a new panel wave. This process is repeated up to wave ed (August 2017), i.e. additional pairs of training and test sets are created by moving forward in time, which results in 20 training and 20 test sets in total (note that testing starts in wave three).
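As a rough sketch of this rolling-origin layout (not the triage configuration used in the study), the pairs of training and test sets can be enumerated from the wave labels. The helper names `wave_labels` and `temporal_splits` are hypothetical; features from the recruitment interview are time-invariant and would be added to every set.

```python
def wave_labels(first="ba", last="ed"):
    """Enumerate wave labels: year letter (a = 2013, ...) plus wave letter (a-f)."""
    labels = [year + wave
              for year in "abcdefghijklmnopqrstuvwxyz"
              for wave in "abcdef"]
    return labels[labels.index(first): labels.index(last) + 1]

def temporal_splits(waves):
    """Pair each training set with a test set whose outcome lies one wave later."""
    splits = []
    for k in range(1, len(waves) - 1):
        splits.append({
            "train_features": waves[:k],     # waves known when training
            "train_outcome": waves[k],
            "test_features": waves[:k + 1],  # one more wave is known at test time
            "test_outcome": waves[k + 1],
        })
    return splits

waves = wave_labels()
splits = temporal_splits(waves)
print(len(waves), len(splits))
print(splits[0])
print(splits[-1]["test_outcome"])
```

With the 22 waves from ba to ed, this yields 20 pairs; the first test outcome is wave three (bc) and the last is wave ed.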
Temporal cross-validation allows for mimicking a variety of specific use cases of predictive modeling with longitudinal data (see https://dssg.github.io/triage/). While this application focuses on predicting one wave ahead, multiple extensions such as aggregating outcomes over multiple waves or allowing a time gap between features and the outcome can be implemented. We further chose to re-train models for each new panel wave that becomes available, whereas other model update frequencies could be investigated as well. Finally, in our setup each model is trained by only considering one row of data or one outcome event per panelist, respectively. An alternative approach would be to incorporate multiple outcome events per panelist, which would result in more observations for each training set, but also in a more complicated data structure (long format with events nested in panelists).
3.4 Model types
We use the outlined temporal cross-validation schema to compare the performance of a number of prediction models that are built with different machine learning methods. Following previous studies on nonresponse prediction, we focus on tree-based methods that are typically well-suited for prediction tasks with many features and potentially complex non-additive and/or non-linear relationships between the predictors and the outcome of interest (e.g. Kern2019, Klausch2017). In this context, we include single decision trees (CART) as a rather simple and interpretable approach as well as prominent (random forests) and more recent (extremely randomized trees, XGBoost) ensemble methods that have been shown to perform well in a variety of settings (see e.g. Chen2016, Fernandez-Delgado2014, Caruana2005). We further consider penalized logistic regression to also include parametric “benchmark” models. In total, five model types are studied:
Penalized Logistic Regression
Logistic regression with a (lasso/ridge) penalty on the vector of regression coefficients (Tibshirani1996)
Decision Trees (CART)
Recursive partitioning technique, repeatedly splits predictor space into homogeneous subregions (Breiman1984)
Random Forest
Tree-based ensemble method, grows decorrelated decision trees based on bootstrap samples (Breiman2001)
Extremely Randomized Trees (ExtraTrees)
Tree-based ensemble method, grows trees based on randomly sampled predictors and cut points at each node (Geurts2006)
Extreme Gradient Boosting (XGBoost)
Sum-of-trees method, builds a sequence of trees akin to optimization via gradient descent (Chen2016)
Each prediction method is associated with a set of hyperparameters that has to be tuned to achieve optimal performance. We consider exhaustive grid search with the hyperparameter settings that are outlined in Table 2. [Footnote 3: We restrict ourselves to a rather limited set of try-out values to keep the total number of models at a reasonable level.] Note that we include nearly unpenalized logistic regression models in our model set by including a high value for the inverse of the regularization strength (C) in the tuning grid.
In the longitudinal context, the present tuning strategy results in building one (e.g.) random forest for each tuning parameter setting and training set and then evaluating the performance of random forests with this setting over time. Since we also consider different feature groups (section 3.2), a prediction model for a given model type and training set consists of a combination of hyperparameter settings and feature groups that are included. All in all, this results in 200 models (i.e., 5 feature groups × 40 tuning parameter settings) that are built for each training set, i.e., 4000 (200 × 20) unique models in total.
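The model counts can be checked with a short calculation. The sketch below enumerates only the logistic regression grid listed in Table 2 with the standard library; it takes the totals of 40 tuning parameter settings and 20 training sets from the text and does not reproduce the full grid of the study.

```python
from itertools import product

# Logistic regression grid as listed in Table 2.
logit_grid = {"penalty": ["l1", "l2"], "C": [0.05, 0.1, 1, 1000]}
logit_settings = [dict(zip(logit_grid, values))
                  for values in product(*logit_grid.values())]

n_feature_groups = 5    # blocks I, II, III, IV and all blocks combined
n_settings_total = 40   # settings across all model types (taken from the text)
n_training_sets = 20

models_per_training_set = n_feature_groups * n_settings_total
total_models = models_per_training_set * n_training_sets
print(len(logit_settings), models_per_training_set, total_models)
```

The eight logistic regression settings are one part of the 40 settings per feature group.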
The computational infrastructure for implementing the analysis is provided by the Python (3.6.4) package triage (2.2.0, Crockett2018) and draws on PostgreSQL (9.5.17) for data management and scikit-learn (0.19.1, Pedregosa2011) and xgboost (0.90, Chen2016) for model building. Data preparations, model selection and post-modeling are conducted with R (3.4.4, RCoreTeam2018).
Table 2: Hyperparameter settings (tuning grid)
|Model type||Hyperparameter||Values||Settings|
|Logistic Regression||penalty||l1, l2||8|
| ||C||0.05, 0.1, 1, 1000|| |
|Decision Trees||max_depth||3, 5, 10|| |
|Random Forest||max_features||sqrt, log2|| |
|Extra Trees||max_features||sqrt, log2|| |
|XGBoost||max_depth||3, 5, 10|| |
| ||n_estimators||250, 500, 1000|| |
|Note: scikit-learn default settings are used for parameters not listed.|
We structure the results section by first investigating the prediction performance of the trained models over the full set of test sets with respect to different performance metrics (section 4.1). Besides summarizing overall performance levels given all available features, section 4.1 discusses selecting optimal combinations of feature groups and hyperparameter settings for each model type. In a production setting, this aligns with condensing the full set of models to a smaller set of candidates, and eventually, to just one final model that would be implemented in the field. We then evaluate and compare the selected best models with respect to prediction performance, similarities of the predicted lists and feature importances in the last test wave (section 4.2).
4.1 All waves
Due to the large number of models that were built in each training set, we start by presenting prediction performances only for models that use all feature blocks, i.e. models that were allowed to draw on all available information. Figure 3 displays the development of ROC-AUCs (area under the receiver operating characteristic curve) of these models over all test sets, grouped by model type. [Footnote 4: An interactive version of this graph which includes all models can be found here: https://ckern.shinyapps.io/predicting-nonresponse/.] Note that ROC-AUC varies between 0 and 1, with 0.5 representing a non-informative model. First, it can be seen that the highest ROC-AUCs that can be reached by any model type range roughly between 0.85 and 0.9, indicating strong prediction performance among the top models. Second, the performance curves do not follow a clear (up- or downward) trend over time, although selected test waves are associated with lower average ROC-AUCs (e.g., wave cc, June 2015). Third, the top performance curves include random forest, extra trees and penalized logistic regression models, followed by XGBoost and decision tree models (on average). Lastly, almost all models achieve considerably higher performance levels when compared to unregularized “baseline” logit models that use only time-invariant characteristics as features (feature block I, gray line), indicating a strong benefit of including features that aggregate over multiple panel waves. [Footnote 5: Table A.1 presents logit coefficients of the first (trained 04/14) and last (trained 06/17) baseline logistic regression (C = 1000, penalty = l2).]
The effect of in- or excluding different feature blocks on model performance is illustrated in Figure A.1. It can be seen that a large share of variance in ROC-AUCs for a given model type and test set is induced by different feature group settings. Excluding features that aggregate over multiple previous waves (feature block III, IV) typically leads to performance curves with comparably low average ROC-AUCs (orange and yellow lines). In most cases, the best performance levels are achieved by combining all feature blocks, whereas the strongest contribution to ROC-AUC seems to come from feature block IV (aggregating over all previous waves). Furthermore, Figure A.1 shows that performance differences between feature blocks III and IV increase after a couple of test waves, as the amount of information that is aggregated into block IV increases over time.
The reported performance trends can be used to select the optimal hyperparameter and feature group combination for each model type. For this purpose, we first compute the average ROC-AUC for all models up to the second last test set, enabling performance comparisons in the last test set under deployment conditions in later steps. On this basis, we follow a simple “best average” criterion by selecting the respective hyperparameter and feature group setup that is associated with the highest mean ROC-AUC for each model type. [Footnote 6: Note that other selection rules could be used by e.g. accounting for the variance of a given performance metric or by giving more weight to more recent test sets (https://dssg.github.io/triage/dirtyduck/docs/audition/).] The performance results of the selected models are summarized in Table A.2. We will discuss these models in more detail in section 4.2.
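The “best average” rule can be sketched in a few lines. This is an illustration under assumed inputs rather than the selection code used in the study: the function name `select_best_average`, the tuple layout and all scores below are made up.

```python
from collections import defaultdict
from statistics import mean

def select_best_average(results, holdout_test_set):
    """Pick, per model type, the configuration with the highest mean ROC-AUC.

    results: iterable of (model_type, config_id, test_set, roc_auc) tuples.
    holdout_test_set: the last test set, which is excluded from selection.
    """
    scores = defaultdict(list)
    for model_type, config_id, test_set, roc_auc in results:
        if test_set != holdout_test_set:
            scores[(model_type, config_id)].append(roc_auc)

    best = {}
    for (model_type, config_id), aucs in scores.items():
        average = mean(aucs)
        if model_type not in best or average > best[model_type][1]:
            best[model_type] = (config_id, average)
    return best

# Made-up scores for illustration only.
results = [
    ("random_forest", "rf_a", "eb", 0.86), ("random_forest", "rf_a", "ec", 0.88),
    ("random_forest", "rf_a", "ed", 0.80),
    ("random_forest", "rf_b", "eb", 0.85), ("random_forest", "rf_b", "ec", 0.87),
    ("random_forest", "rf_b", "ed", 0.95),
    ("logit", "lr_a", "eb", 0.84), ("logit", "lr_a", "ec", 0.86),
    ("logit", "lr_a", "ed", 0.85),
]
best = select_best_average(results, holdout_test_set="ed")
print(best)
```

In this toy example, configuration rf_b scores highest in the held-out set but is not selected, because the last test set does not enter the average.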
Figure A.2 compares models that use all feature blocks with respect to precision at top 5% and top 10% as alternative evaluation metrics. In this context, panelists with predicted nonresponse probabilities that are among the highest 5% (10%) of all scores are predicted as being nonrespondents and then evaluated against the true classes. Note that this approach aligns with the intervention perspective in which panelists with the highest predicted risk scores may be targeted in an adaptive design. For both measures, random forests, extra tree models and penalized logistic regressions are among the best performing models while also being associated with relatively low variance in performance for a given test set. The temporal cross-validation results show a decreasing trend in the best precision values over time, indicating that targeting panelists with high predicted risk scores may become less accurate for future waves. For the last test set and best models, about 45% to 50% of the panelists that are predicted as being nonrespondents based on the 10% cutoff are truly nonrespondents (see Table A.2), which is still markedly higher than the baseline performance (precision based on the outcome distribution; 8.83% in the last wave).
Complementing Figure A.2, Figure A.3 displays recall at top 5% and top 10% for models that use all feature blocks. This evaluation focuses on the proportion of nonrespondents that are detected among all nonrespondents in a given wave when applying the same classification thresholds as before. Similar patterns as with the previous metrics can be observed, although the recall curves do not follow a clear (up- or downward) trend over time. Recall percentages between 51% and 57% for the last test set and best models based on the top 10% cutoff illustrate that focusing on high risk observations with a restrictive threshold leads to relatively precise predictions (see above), but misses a large share of nonrespondents that are present in the data (see Table A.2).
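Precision and recall at the top k% can be computed as in the following minimal sketch. It assumes the outcome coding used here (1 = non-participation) and rounds the cutoff down; the function name `precision_recall_at_top` and the toy data are hypothetical, and ties at the cutoff are broken by input order, which may differ from the evaluation code used in the study.

```python
import math

def precision_recall_at_top(y_true, scores, fraction):
    """Flag the highest-scoring `fraction` of panelists as predicted nonrespondents.

    y_true: 1 = non-participation, 0 = participation.
    scores: predicted nonresponse probabilities, same order as y_true.
    """
    n_flagged = max(1, math.floor(fraction * len(scores)))
    # Stable sort: tied scores keep their input order.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    flagged = ranked[:n_flagged]
    true_positives = sum(y_true[i] for i in flagged)
    total_positives = sum(y_true)
    precision = true_positives / n_flagged
    recall = true_positives / total_positives if total_positives else 0.0
    return precision, recall

# Toy data: 20 panelists, 3 true nonrespondents (indices 0, 2 and 10).
y_true = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1,
          0.2, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
precision, recall = precision_recall_at_top(y_true, scores, fraction=0.10)
print(precision, recall)
```

In the toy data, the top 10% contains two panelists, one of whom is a true nonrespondent, so precision is one half while recall is one third.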
4.2 Most recent wave
After summarizing overall performance trends and selecting optimal settings for each model type we focus on more detailed analyses of the respective best models in the last test set. We thereby follow a potential deployment scenario in which a final best model would be used to target likely nonrespondents in the most recent wave of the GESIS Panel, after model selection based on temporal cross-validation.
Figure 4 presents ROC and precision-recall curves, which plot sensitivity versus one minus specificity and precision versus recall, respectively, over the full range of applicable classification thresholds for the best models in the most recent test set. The selected candidate models achieve high prediction performance with ROC-AUCs between 0.86 (penalized logistic regression, XGBoost) and 0.89 (random forest, extra trees), indicating that after model tuning and feature group selection all model types perform comparably well (see also Table A.2). Note that this also holds true for the best, but relatively simple, decision tree model. A similar result is given by the precision-recall curves, where all models considerably improve over the precision baseline of a random classifier (which equals 8.83% for any classification threshold).
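As a hedged illustration of the two curve types, the following sketch computes ROC-AUC through its rank interpretation and lists precision-recall points for a toy test set. The helper names `roc_auc` and `pr_points` and the data are hypothetical. The last precision-recall point shows why the baseline of a random classifier equals the share of nonrespondents.

```python
def roc_auc(y_true, scores):
    """ROC-AUC as the probability that a randomly drawn nonrespondent
    receives a higher score than a randomly drawn respondent (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def pr_points(y_true, scores):
    """(recall, precision) pairs, one per distinct score used as threshold."""
    total_pos = sum(y_true)
    points = []
    for threshold in sorted(set(scores), reverse=True):
        flagged = [y for y, s in zip(y_true, scores) if s >= threshold]
        true_pos = sum(flagged)
        points.append((true_pos / total_pos, true_pos / len(flagged)))
    return points


# Hypothetical toy test set: 8 panelists, 2 nonrespondents.
y_true = [1, 0, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.05]

auc = roc_auc(y_true, scores)
curve = pr_points(y_true, scores)
# Flagging everyone yields recall 1 and precision equal to the prevalence.
prevalence = sum(y_true) / len(y_true)
print(auc, curve[0], curve[-1], prevalence)
```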
The selected models can also be compared with respect to the resulting lists of panel members with particularly high nonresponse risks. This allows us to evaluate whether different optimal models result in similar sets of panelists who might be targeted in a prediction-based intervention. We generate lists of panelists-at-risk for each model by classifying individuals with predicted risk scores that are among the highest 10% of all scores as likely nonrespondents. We then compute Jaccard similarities between these lists,

$$J(A, B) = \frac{n_{AB}}{n_A + n_B - n_{AB}},$$

with $n_A$ and $n_B$ being the total numbers of predicted nonrespondents in lists $A$ and $B$, and $n_{AB}$ being the number of panelists that are predicted as nonrespondents in both lists (Tan2018). $J$ varies between 0 and 1, with higher values indicating more similar lists. The computed Jaccard similarities are presented in Figure 5. It can be seen that the best prediction models not only result in similar performance, but also produce similar lists of likely nonrespondents with large overlap. Perhaps not surprisingly, the strongest agreement occurs between random forests and extra trees, which produce lists that are almost interchangeable.
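The list comparison can be sketched as follows: each model's top 10% of risk scores forms a set of panelist identifiers, and the Jaccard similarity is the size of the intersection divided by the size of the union. The identifiers, scores and function names are hypothetical and serve only as an illustration.

```python
def top_share(scores_by_id, share=0.10):
    """Set of panelist ids whose scores fall into the top `share` of all scores."""
    # Assumption: the list length is rounded to the nearest integer, at least one.
    n = max(1, round(share * len(scores_by_id)))
    ranked = sorted(scores_by_id, key=scores_by_id.get, reverse=True)
    return set(ranked[:n])


def jaccard(list_a, list_b):
    """Intersection size over union size of two sets of ids."""
    n_both = len(list_a & list_b)
    n_union = len(list_a) + len(list_b) - n_both
    return n_both / n_union if n_union else 1.0


# Hypothetical risk scores from two models for the same 30 panelists.
scores_model_a = {f"p{i:02d}": 1 - i / 30 for i in range(30)}
scores_model_b = dict(scores_model_a)
scores_model_b["p02"] = 0.0   # model B ranks p02 as low risk
scores_model_b["p05"] = 0.99  # and p05 as high risk instead

at_risk_a = top_share(scores_model_a)
at_risk_b = top_share(scores_model_b)
similarity = jaccard(at_risk_a, at_risk_b)
print(sorted(at_risk_a), sorted(at_risk_b), similarity)
```

Here the two lists share two of three panelists each, so the union contains four ids and the similarity is 0.5.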
Finally, the selected models can be compared with respect to the importance of different types of features in the context of model building. Figure A.4 displays mean importance scores for each feature block and feature concept combination (see Table 1), including a “na” category that summarizes the importance of missing indicators. The underlying importance scores correspond to absolute logit coefficients for penalized logistic regression, improvement in accuracy for XGBoost and Gini importance for the remaining model types. All importance scores have been scaled to have a maximum value of 100. On the one hand, Figure A.4 indicates that the tree ensembles draw on a more diverse set of features than the logistic regression and decision tree models, as shown by the smaller number of cells with zero mean importance. On the other hand, the highest mean importances for the tree ensembles are achieved by feature sets that include aggregated missing indicators (i.e., sums of missings over different time spans). Since these variables carry the information that a panelist did not participate in a given wave, response status in previous waves appears to be by far the most important factor when building the present prediction models.
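The scaling and aggregation steps behind such a comparison can be sketched as follows. The feature names, the block and concept labels and the raw scores are hypothetical; the sketch only shows how raw importances of different kinds could be brought to a common 0 to 100 scale and averaged per block and concept cell.

```python
from collections import defaultdict


def scale_to_100(raw):
    """Rescale absolute importance scores so that the largest equals 100."""
    top = max(abs(v) for v in raw.values())
    if top == 0:
        return {name: 0.0 for name in raw}
    return {name: 100 * abs(v) / top for name, v in raw.items()}


def mean_by_cell(scaled, cell_of):
    """Average scaled importances within each (feature block, concept) cell."""
    grouped = defaultdict(list)
    for name, value in scaled.items():
        grouped[cell_of[name]].append(value)
    return {cell: sum(vals) / len(vals) for cell, vals in grouped.items()}


# Hypothetical raw importances (e.g. Gini importance) for four features.
raw = {"n_missing_last3": 0.5, "n_missing_all": 0.25, "age": 0.125, "web_mode": 0.0}
# Hypothetical mapping of features to (block, concept) cells.
cell_of = {
    "n_missing_last3": ("aggregate", "na"),
    "n_missing_all": ("aggregate", "na"),
    "age": ("baseline", "sociodemographics"),
    "web_mode": ("baseline", "mode"),
}

cell_means = mean_by_cell(scale_to_100(raw), cell_of)
print(cell_means)
```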
This study presented and applied a longitudinal framework for building and evaluating prediction models to predict nonresponse in panel studies based on multiple panel waves and machine learning. This approach particularly focuses on providing effective prediction models that can be used to target likely nonrespondents in future panel waves in the context of an adaptive design. Using data from the GESIS Panel, it has been shown that generating features by aggregating over multiple panel waves can be used to build models with competitive prediction performance. Furthermore, temporal cross-validation was used to evaluate and select prediction models in a data setting that mimics the prospective usage of these models in the field. The GESIS Panel results indicated that after parameter tuning and feature group selection all model types performed comparably well for identifying prospective nonrespondents in the present application, with ROC-AUCs between 0.86 and 0.89, precision@top10 between 45% and 50% and recall@top10 between 51% and 57% for the best models in the last test wave. It was also shown that the prediction models particularly benefited from including response status information from multiple previous waves as features.
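To make the evaluation logic concrete, the following sketch generates rolling forecasting origin splits. It assumes an expanding training window and a single next wave as test set; the wave labels, the function name and the `min_train` parameter are illustrative assumptions rather than the study's exact configuration.

```python
def rolling_origin_splits(waves, min_train=2):
    """Yield (training waves, test wave) pairs in temporal order.

    Each split trains on all waves before the forecasting origin and
    tests on the wave that immediately follows.
    """
    for origin in range(min_train, len(waves)):
        yield waves[:origin], waves[origin]


# Hypothetical wave labels in chronological order.
waves = ["w1", "w2", "w3", "w4", "w5"]
splits = list(rolling_origin_splits(waves))
for train, test in splits:
    print(train, "->", test)
```

Because every test wave lies strictly after its training waves, performance estimates reflect how a model would behave when deployed prospectively.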
The presented analyses can be extended in various ways. First, we used a simple binary outcome that only distinguished between participation and non-participation in a given panel wave. A refined analysis would partition non-participation into subclasses that align with different potential interventions (e.g. for non-contact and refusal) or could model individual trajectories by focusing on transitions to first non-participation and to final drop-out (again enabling different treatments). This can be paired with predicting over various time horizons, as different interventions might need different time frames for implementation. Second, the set of predictor variables could be extended, e.g. by including more substantive variables (on topics that are included in multiple waves), building features from online paradata (for the web respondents) or accounting for changes in socio-demographic characteristics over time. Third, the presented feature aggregation approach could be compared with an alternative strategy that directly includes every measurement of a given variable as a predictor. Note, however, that longitudinal feature aggregation has a similar benefit of allowing models to give stronger weights to more recent features when feature blocks that aggregate over different time spans are included. Fourth, we considered rather shallow tuning grids for model training, and better performance of e.g. the XGBoost models may be achieved with more sophisticated tuning settings, potentially combined with techniques for reducing class imbalance (e.g., Chawla2002). In addition, model tuning and selection solely focused on prediction performance, whereas the key objective of a prediction-based intervention would be to increase response rates and to decrease bias due to group-specific response propensities. An extended application could therefore consider model selection based on prediction performance and bias reduction, e.g. by including dissimilarity indices (Rossmann2016) or R-indicators (Schouten2012) as selection criteria.
Furthermore, it is important to note that while the presented prediction framework can generally be implemented with data from any panel study, this paper focused on its application with one specific data source. Although the GESIS Panel shares several design aspects with other studies (e.g., LISS, GIP, ELIPSS, see Blom2016), additional work is needed to learn about the generalizability of the presented results. Such applications would allow us to gain insights into which design features drive performance differences across panels, i.e. for which type of panel prediction-based targeting might be most successful.
Lastly, we want to stress that building a highly accurate and carefully selected prediction model is only the first step in the process of developing a prediction-based targeted design. The adaptive design literature discusses a wide range of design features that may be altered based on nonresponse predictions, with mixed results regarding their effectiveness (Lynn2017). Among those options, differential incentives might be viewed as the most promising avenue for increasing response rates among groups with low (predicted) participation propensities (Zhang2010; Zagorsky2008). However, as panel studies might already offer some form of baseline incentives to all panelists as part of their panel maintenance strategy, additional rewards that are targeted might be less effective (Mercer2015). A further complication is that respondents with low response propensities may be worse reporters, i.e. persuading reluctant respondents to participate in a panel wave might increase measurement error (Bach2019). A prediction-based intervention would therefore need careful evaluation and long-term monitoring with respect to different aspects of data quality to study the potential benefits and risks of this approach.
Appendix A
|Model|Metric|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|Extra Trees|Prec.@10 pct|0.76|0.58|0.59|0.59|0.55|0.54|0.60|0.60|0.56|0.58|0.48|0.50|0.52|0.55|0.55|0.53|0.53|0.59|0.53|0.48|
|Random Forest|Prec.@10 pct|0.75|0.58|0.59|0.58|0.54|0.55|0.61|0.58|0.56|0.59|0.47|0.50|0.52|0.56|0.54|0.54|0.52|0.57|0.53|0.50|
|Logistic Regression|Prec.@10 pct|0.75|0.57|0.58|0.61|0.55|0.53|0.62|0.59|0.56|0.56|0.46|0.50|0.53|0.58|0.54|0.57|0.52|0.58|0.53|0.50|
|Decision Tree|Prec.@10 pct|0.64|0.55|0.58|0.61|0.49|0.52|0.59|0.55|0.55|0.57|0.47|0.47|0.48|0.47|0.53|0.54|0.50|0.59|0.51|0.45|
|Extra Trees: max_features: log2, min_samples_leaf: 10, feature groups: all|
|Random Forest: max_features: log2, min_samples_leaf: 10, feature groups: all|
|Logistic Regression: penalty: l1, C: 0.05, feature groups: all|
|Decision Tree: max_depth: 3, max_features: none, feature groups: all|
|XGBoost: max_depth: 3, n_estimators: 250, learning rate: 0.05, feature groups: all|