On Design of Problem Token Questions in Quality of Experience Surveys

by   Jayant Gupchup, et al.

User surveys for Quality of Experience (QoE) are a critical source of information. In addition to the common "star rating" used to estimate Mean Opinion Score (MOS), more detailed survey questions (problem tokens) about specific areas provide valuable insight into the factors impacting QoE. This paper explores two aspects of problem token questionnaire design. First, we study the bias introduced by fixed question order, and second, we study the challenge of selecting a subset of questions to keep the token set small. Based on 900,000 calls gathered using a randomized controlled experiment from a live system, we find that the order bias can be significantly reduced by randomizing the display order of tokens. The difference in response rate varies based on token position and display design. It is worth noting that users respond to the randomized-order variant at levels that are comparable to the fixed-order variant. The effective selection of a subset of token questions is achieved by extracting tokens that provide the highest information gain over user ratings. This selection is known to be in the class of NP-hard problems. We apply a well-known greedy submodular maximization method on our dataset to capture 94% of the information using just 30% of the questions.




I Introduction

Several Internet telephony applications employ end-of-call user surveys to gather data on in-call QoE  [1]. In addition to the five star rating (MOS [2]), the percentage of calls rated 1 or 2 (poor call rate, or PCR) is often tracked as a measure of media quality. Previous studies have shown the value of combining PCR with an additional problem token questionnaire (PTQ) to gather detailed insights  [3]. The UI design of the PTQ used in this study is provided in  [3].

The range of questions used to capture these detailed problem areas has been studied in depth [4, 5]. However, to the best of our knowledge, the impact of presentation order on response rate has not been studied in a live, deployed system. Our work is motivated by the practical challenges faced in analyzing questionnaire data. For example, our analysis showed that the contribution of one-way audio (one side can hear, but the other side cannot) to PCR dwarfed the other areas by a factor of two on mobile platforms. Further investigation showed that the most important factor behind this gap was the display order of questions. The ‘no sound’ token was placed at the top, and users were more likely to select this area purely due to its position in the survey. While one-way audio remains one of the top problem areas, we found that after randomization of the order, other impediments such as audio distortion and poor image quality occurred at comparable levels.

In mobile environments, the screen size is limited, so if designers want to avoid using a scrollbar, the number of questions needs to be kept small. The key question is how to minimize the number of questions while maximizing their power in explaining PCR. Moreover, studies have shown that shortening surveys (without losing information) improves both response rate and data quality [6]. Identifying this subset of questions belongs to a class of NP-hard problems [7]. To solve this, we follow the lead of Krause et al., leveraging the fact that information gain is a submodular function and can be optimized using greedy approaches with provable guarantees [8, 9]. As shown in Section IV, this approach maps well to our problem. The main contributions of this paper are:

  1. Results of a large-scale randomized controlled experiment in a live VoIP system that show the bias introduced by fixed-order questions.

  2. An efficient method for selecting a subset of tokens that maximizes information and minimizes correlation.

II Related Work

There is a rich area of research and practice in general survey design, validation and question order [10, 11]. Factor analysis is commonly employed to analyze surveys, with the number of factors being smaller than the number of questions [12]. There are many standards for subjective audio and video quality surveys, such as the ITU standards [13, 4, 5]. In this paper, our goal is not to replace these surveys, but instead to improve their utility by providing recommendations for presentation order and a methodology to select a subset of informative questions. The selection of a subset of correlated random variables for maximizing the information gain has been studied in detail [8, 9]; this paper focuses on the application of these methods to QoE problem area surveys.

III Impact of Question Display Order

We study the impact of question order on our PTQ using a randomized controlled experiment. The control population was shown the original questionnaire with fixed token order. For video calls, the audio tokens were always shown on the left while the video tokens were always shown on the right. The treatment population was shown the questions in randomized order. For video calls in the treatment population, the position of the audio and video panels (left/right) was also selected at random. The details of the experiment are below:

  • One-to-one calls that included both audio-only and video VoIP calls on desktop platforms.

  • The control and treatment groups each contained 450,000 calls spanning over 100,000 unique users.
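The two questionnaire variants described above can be sketched as follows. This is a minimal illustration, not the deployed implementation: the function name `build_layout`, the panel representation, and the shortened token lists are assumptions made for the example (the token wording is taken from Table II).

```python
import random

# Shortened token lists for illustration; wording follows Table II.
AUDIO_TOKENS = [
    "I could not hear any sound",
    "The other side could not hear any sound",
    "I heard echo in the call",
    "I heard noise in the call",
]
VIDEO_TOKENS = [
    "I could not see any video",
    "The other side could not see my video",
    "Image quality was poor",
    "Video kept freezing",
]


def build_layout(is_treatment, is_video_call, rng):
    """Return the panels from left to right as (panel_name, tokens) pairs.

    Control: fixed token order, audio panel on the left, video on the right.
    Treatment: token order shuffled within each panel, and the left/right
    position of the panels shuffled as well.
    """
    panels = [("audio", list(AUDIO_TOKENS))]
    if is_video_call:
        panels.append(("video", list(VIDEO_TOKENS)))
    if is_treatment:
        for _, tokens in panels:
            rng.shuffle(tokens)  # randomize display order inside the panel
        rng.shuffle(panels)      # randomize left/right panel position
    return panels


if __name__ == "__main__":
    rng = random.Random(42)  # hypothetical seed, only for reproducibility
    for name, tokens in build_layout(True, True, rng):
        print(name, tokens)
```

Passing the random generator in explicitly keeps the assignment reproducible, which helps when the displayed order needs to be logged alongside each response for later position analysis.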

The experiment aimed to answer the following questions:

  1. Is there a change in the percentage of questionnaire responses?

  2. What is the change in response rate of the individual audio and video tokens?

We present all results using relative measures due to commercial confidentiality.

III-A Overall Questionnaire Response Rate

We wanted to understand whether randomizing the questions impacts the overall response rate of the PTQ. A user is said to respond to a PTQ if any token selection is made. We considered the audio-only and video populations separately. The differences in overall token response rate between the control group and the treatment group for these segments are shown in Table I. From a statistical perspective, there is no change in the percentage of responders in the audio population, but there is a statistically significant change in the video population. However, we find that the relative difference for video surveys is small enough that the benefits (Section III-B) of the randomized questionnaire outweigh the minor reduction in response rate.

Population Relative Delta p-value
TABLE I: The difference in overall response rate in tokens between control group and treatment group
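The statistical test behind Table I is not named in the text. One common choice for comparing response rates between two groups is a two-sided two-proportion z-test; the sketch below assumes that test, and the function name and the counts in the usage example are hypothetical, not values from the experiment.

```python
import math


def two_proportion_z_test(resp_control, n_control, resp_treatment, n_treatment):
    """Compare two response rates with a pooled two-proportion z-test.

    Returns (relative_delta, z, p_value), where relative_delta is the change
    of the treatment rate relative to the control rate and p_value is
    two-sided.
    """
    p_control = resp_control / n_control
    p_treatment = resp_treatment / n_treatment
    pooled = (resp_control + resp_treatment) / (n_control + n_treatment)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_treatment))
    z = (p_treatment - p_control) / std_err
    # Two-sided tail probability of a standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    relative_delta = (p_treatment - p_control) / p_control
    return relative_delta, z, p_value


if __name__ == "__main__":
    # Hypothetical counts, chosen only to exercise the function.
    delta, z, p = two_proportion_z_test(1000, 10000, 900, 10000)
    print(f"relative delta={delta:+.1%}, z={z:.2f}, p={p:.4f}")
```

Reporting the relative delta together with the p-value matches the columns of Table I while keeping absolute response rates out of the output.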

III-B Response Rate of Individual Questions

The difference in response rates of individual tokens between the fixed-order and randomized-order variants for video calls on desktop is shown in Table II. A negative sign (red) indicates that the response rate went down, whereas a positive sign (green) indicates that the response rate went up. Although a change in the response rate was expected, these results offer some significant insights:

  1. The response rate of the top token is dramatically impacted for both audio and video tokens: it drops sharply when the order is randomized. This shows the propensity for selecting the top token.

  2. For audio, the response rates of the top four tokens decreased dramatically. However, for video, the response rates of the bottom four tokens increased. This shows that panel position (left, right) impacts response rate.

The impact of panel position is even more pronounced in mobile environments. In mobile, we found the average response rate for tokens that require users to scroll was lower. These results motivate the need for showing a small set of informative tokens to ensure that we do not lose the user’s attention while responding to the questionnaire.

Audio problem Token Relative delta p-value
I could not hear any sound
The other side could not hear any sound
I heard echo in the call
I heard noise in the call
Volume was low
The call ended unexpectedly
Speech was not natural or sounded distorted
We kept interrupting each other
Video problem Token Relative delta p-value
I could not see any video
The other side could not see my video
Image quality was poor
Video kept freezing
Video stopped unexpectedly
The other side was too dark
Video was ahead or behind audio
TABLE II: The difference in response rate of individual tokens for fixed vs. randomized display order in video desktop calls.

IV Token Subset Selection

The question we are trying to address is the following: Given a limited budget of questions, $k$, is there a systematic process of selecting the questions to maximize information? In this paper, we propose the selection of tokens by applying the algorithm described by Nushi et al. [9]. An overview of this approach is provided next.

IV-A Information Gain and Submodular Function Optimization

Information gain ($IG$) captures the amount of information “shared” between two random variables. Mathematically, between variables $X$ and $Y$, $IG$ is defined as $IG(X; Y) = H(X) - H(X \mid Y)$, where $H$ represents the entropy (uncertainty) of a random variable. It should be clear that $IG$ has the property of monotonicity [14]: for any two sets of random variables $A \subseteq B$, $IG(X; A) \le IG(X; B)$. Building on the monotonicity property and borrowing notation from [4], $IG$ also has a “diminishing returns” property. The incremental information gain obtained by adding a new element to a subset is at least as high as the incremental information gain obtained by adding the same element to its superset. Mathematically, if we consider two sets of token variables from the universe of problem tokens, $V$, such that $A \subseteq B \subseteq V$, and we consider a token $t \in V$ with $t \notin B$, then the diminishing returns property is shown in the equation below, where $PC$ is the target variable (whether a call is rated poor):

$$IG(PC; A \cup \{t\}) - IG(PC; A) \;\ge\; IG(PC; B \cup \{t\}) - IG(PC; B)$$

In other words, the marginal benefit of reducing the uncertainty in $PC$ by adding a new token to a smaller set is at least as high as adding it to any superset. This property is known as “submodularity”. Krause et al. [8] show that interaction effects (e.g., mutual exclusion) between variables can cause $IG$ not to be strictly submodular. However, we do not see such interaction effects in our token dataset, and therefore no violation of the submodularity of information gain.
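To make the definition $IG(X; Y) = H(X) - H(X \mid Y)$ concrete, the sketch below estimates it from empirical frequencies. It is an illustration under stated assumptions: the function names and the toy data are hypothetical, the poor-call label and token selections are treated as discrete values, and each call's token selections are given as a tuple.

```python
import math
from collections import Counter


def entropy(values):
    """Empirical Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def conditional_entropy(target, given):
    """Empirical H(target | given) for two paired sequences."""
    n = len(target)
    groups = {}
    for g, t in zip(given, target):
        groups.setdefault(g, []).append(t)
    # Weighted average of the target entropy inside each group of `given`.
    return sum(len(ts) / n * entropy(ts) for ts in groups.values())


def information_gain(target, token_rows):
    """IG(target; tokens) = H(target) - H(target | tokens).

    token_rows[i] holds the token selections of call i; the tokens are
    considered jointly by using the whole row as the conditioning value.
    """
    joint = [tuple(row) for row in token_rows]
    return entropy(target) - conditional_entropy(target, joint)


if __name__ == "__main__":
    poor_call = [1, 1, 0, 0]                     # toy labels
    informative = [(1,), (1,), (0,), (0,)]       # token mirrors the label
    uninformative = [(1,), (0,), (1,), (0,)]     # token independent of label
    print(information_gain(poor_call, informative))
    print(information_gain(poor_call, uninformative))
```

Conditioning on the full row of selected tokens is what lets the same function score a set of tokens jointly rather than one token at a time.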

The maximization of submodular functions is a known NP-hard problem. Nemhauser et al. [3, 4] provide a greedy algorithm whose solution is guaranteed to be within a factor of $(1 - 1/e)$ of the computationally expensive exhaustive solution. The algorithm is iterative and fairly straightforward. In iteration 0, we start with an empty set, $S_0 = \emptyset$. At every iteration $i$, we add the token, $t_i$, that maximizes the discrete derivative of information gain, and the set $S_i$ is constructed using the equation:

$$S_i = S_{i-1} \cup \{t_i\}, \qquad t_i = \arg\max_{t \in V \setminus S_{i-1}} \Delta(t \mid S_{i-1})$$

where $\Delta(t \mid S_{i-1}) = IG(PC; S_{i-1} \cup \{t\}) - IG(PC; S_{i-1})$, $V$ is the universe of problem tokens, and $IG(PC; S_{i-1} \cup \{t\})$ represents the information gain of PC by jointly considering the tokens $S_{i-1} \cup \{t\}$ for all candidate tokens $t$. Note that iteration 1 picks the token that provides the highest univariate information gain. In subsequent iterations, this method selects tokens that provide information not already captured by the existing token set. By design, this results in the selection of the least correlated tokens at every iteration. We view this method of selecting tokens as maximizing the return of information for tokens shown (hereafter referred to as RITS).
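The greedy iteration described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function names, the representation of each call as a dictionary of token selections, and the toy data are assumptions, and information gain is estimated from empirical frequencies.

```python
import math
from collections import Counter


def entropy(values):
    """Empirical Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def info_gain(pc, calls, tokens):
    """IG(PC; tokens), with the tokens considered jointly.

    pc[i] is the poor-call label of call i; calls[i] maps token name -> 0/1.
    """
    if not tokens:
        return 0.0
    groups = {}
    for label, call in zip(pc, calls):
        key = tuple(call[t] for t in tokens)
        groups.setdefault(key, []).append(label)
    h_cond = sum(len(g) / len(pc) * entropy(g) for g in groups.values())
    return entropy(pc) - h_cond


def greedy_select(pc, calls, candidates, k):
    """Pick up to k tokens, each time adding the largest marginal gain."""
    selected = []
    current = 0.0  # IG(PC; selected) so far
    for _ in range(min(k, len(candidates))):
        best_token, best_gain = None, -1.0
        for t in candidates:
            if t in selected:
                continue
            gain = info_gain(pc, calls, selected + [t]) - current
            if gain > best_gain:  # ties keep the earlier candidate
                best_token, best_gain = t, gain
        selected.append(best_token)
        current += best_gain
    return selected


if __name__ == "__main__":
    # Toy data: "a_copy" duplicates "a", so it adds no new information.
    rows = [(1, 0, 1), (1, 0, 1), (0, 1, 1), (0, 0, 0), (0, 0, 0), (0, 0, 0)]
    calls = [{"a": a, "a_copy": a, "b": b} for a, b, _ in rows]
    pc = [label for _, _, label in rows]
    print(greedy_select(pc, calls, ["a", "a_copy", "b"], 2))
```

In the toy data the duplicated token has the same univariate gain as the first pick but zero marginal gain afterwards, so the second iteration skips it in favor of the token that explains the remaining poor calls.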

Fig. 1: Relative AUC performance of different strategies in selecting tokens is shown on left while Jaccard similarity scores are shown on right. Note: Scales removed for confidentiality.

IV-B Evaluation of RITS

Data gathered from the treatment (randomized PTQ) population of the experiment was used for evaluation. The RITS method was evaluated using the following quantitative metrics:

  1. AUC: Area under the ROC curve [15].

  2. JS: Jaccard similarity coefficient [15].

While AUC measures the ability of the token set to discriminate between a good call and a poor call, Jaccard similarity measures the pair-wise degree of overlap between the tokens. The ideal token set has an AUC close to 1 and a JS score of 0. Uncorrelatedness (i.e. a low JS score) is important when breaking down an overall quality metric into distinct factors. For a given token count, $k$, we studied and compared the performance of RITS with the following approaches:

  1. Random: Select a random subset of tokens.

  2. AUC-Greedy: Select tokens sorted in descending order of their univariate AUC.
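The two baseline strategies listed above can be sketched as follows. The function names and toy data are hypothetical. For a single binary token used as a score, the ROC curve has one operating point, so the AUC reduces to $0.5 + (TPR - FPR)/2$, where TPR and FPR are the rates at which the token is selected on poor and good calls respectively; the sketch assumes this closed form.

```python
import random


def univariate_auc(pc, token_values):
    """AUC of one binary token as a predictor of the poor-call label."""
    poor = [v for v, y in zip(token_values, pc) if y == 1]
    good = [v for v, y in zip(token_values, pc) if y == 0]
    tpr = sum(poor) / len(poor)  # token selected on poor calls
    fpr = sum(good) / len(good)  # token selected on good calls
    return 0.5 + (tpr - fpr) / 2


def auc_greedy(pc, calls, candidates, k):
    """Top-k tokens by univariate AUC; correlation between tokens is ignored."""
    ranked = sorted(
        candidates,
        key=lambda t: univariate_auc(pc, [c[t] for c in calls]),
        reverse=True,
    )
    return ranked[:k]


def random_subset(candidates, k, rng):
    """Uniformly random subset of k tokens."""
    return rng.sample(candidates, k)


if __name__ == "__main__":
    # Toy data: "a_copy" duplicates "a".
    rows = [(1, 0, 1), (1, 0, 1), (0, 1, 1), (0, 0, 0), (0, 0, 0), (0, 0, 0)]
    calls = [{"a": a, "a_copy": a, "b": b} for a, b, _ in rows]
    pc = [label for _, _, label in rows]
    print(auc_greedy(pc, calls, ["a", "a_copy", "b"], 2))
    print(random_subset(["a", "a_copy", "b"], 2, random.Random(0)))
```

On the toy data the AUC-greedy baseline selects a token and its duplicate, which illustrates why a ranking based only on univariate scores can return highly overlapping tokens.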

In our evaluation, we used the random forest implementation from the Python scikit-learn library [16] with default settings to obtain the classification boundary. The error bars were obtained using 100 independent runs of train/test splits.
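The Jaccard similarity metric used in this evaluation can be computed directly from the token selections. The sketch below is illustrative: the text does not state how pairwise scores are aggregated into a single JS score for a token set, so averaging over all pairs is an assumption, as are the function names and toy data.

```python
from itertools import combinations


def jaccard(u, v):
    """Jaccard similarity of two binary vectors: |u AND v| / |u OR v|."""
    both = sum(1 for a, b in zip(u, v) if a and b)
    either = sum(1 for a, b in zip(u, v) if a or b)
    return both / either if either else 0.0


def mean_pairwise_jaccard(calls, tokens):
    """Average Jaccard similarity over all pairs of tokens in the set.

    calls[i] maps token name -> 0/1 for call i. Returns 0.0 when the set
    has fewer than two tokens, since there is no pair to compare.
    """
    pairs = list(combinations(tokens, 2))
    if not pairs:
        return 0.0
    total = 0.0
    for x, y in pairs:
        total += jaccard([c[x] for c in calls], [c[y] for c in calls])
    return total / len(pairs)


if __name__ == "__main__":
    # Toy data: "a_copy" duplicates "a"; "b" never co-occurs with "a".
    rows = [(1, 0), (1, 0), (0, 1), (0, 0), (0, 0), (0, 0)]
    calls = [{"a": a, "a_copy": a, "b": b} for a, b in rows]
    print(mean_pairwise_jaccard(calls, ["a", "a_copy"]))
    print(mean_pairwise_jaccard(calls, ["a", "b"]))
```

A set containing a token and its duplicate scores the maximum overlap, while a set of tokens that never co-occur scores zero, which is the behavior the ideal token set aims for.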

IV-C Results using RITS

The AUC and JS scores of the different token subset selection strategies are shown in Figure 1. Note that we represent the scale in terms of the maximum values obtained for our dataset, and hide the labels for corporate confidentiality. This still allows for relative comparison between the selection strategies. The RITS method significantly outperforms the AUC-greedy and random methods for all token counts $k$ in terms of the AUC criterion. The shape of the RITS curve highlights the “diminishing returns” property. The RITS method also has a significantly lower JS score than the AUC-greedy method. This is because RITS is designed to find the tokens that provide information not already covered by the existing set of tokens. Since the AUC-greedy method does not consider correlation, it performs poorly on the Jaccard similarity measure. Using the RITS method, the first five tokens capture 94% of the total information content in our dataset.

V Summary

In this paper, we studied two aspects of problem token questionnaire design for VoIP applications: display order and token subset selection. Based on over 900,000 calls gathered from a randomized controlled experiment in a live system, we showed that there is a strong bias in response rate due to the presentation order of questions. The most dramatic impact is experienced by the top-most token. In mobile environments, scrolling can also reduce the response rate. Motivated by these observations, we studied the problem of selecting a subset of tokens that maximizes information while minimizing correlation. We achieved this by mapping it to the problem of submodular maximization studied in the machine learning community. By doing so, we were able to retain 94% of the information using just 30% of the questions. Finally, we would like to emphasize that these methods and results can benefit the media community, as they significantly improve the quality of data gathered from any QoE survey.


  • [1] J. Jiang et al., “Via: Improving internet telephony call quality using predictive relay selection,” in Proc. ACM SIGCOMM, August 2016.
  • [2] ITU-T, “Mean Opinion Score (MOS) terminology,” 1996, Rec. ITU-T P.800.1.
  • [3] “Hidden for blind review.”
  • [4] ITU-T, “Methods for subjective determination of transmission quality,” 1996, Rec. ITU-T P.800.
  • [5] ——, “Subjective video quality assessment methods for multimedia applications,” 2008, Rec. ITU-T P.910.
  • [6] D. S. Allen, “The impact of shortening a long survey on response rate and response quality,” 2016.
  • [7] I. Guyon and A. Elisseeff, “An introduction to variable and feature selection,” Journal of machine learning research, vol. 3, no. Mar, pp. 1157–1182, 2003.
  • [8] A. Krause and D. Golovin, “Submodular function maximization.” 2014.
  • [9] B. Nushi et al., “Learning and feature selection under budget constraints in crowdsourcing,” in Fourth AAAI Conference on Human Computation and Crowdsourcing, 2016.
  • [10] R. M. Groves et al., Survey methodology.   John Wiley & Sons, 2011, vol. 561.
  • [11] S. G. McFarland, “Effects of question order on survey responses,” Public Opinion Quarterly, vol. 45, no. 2, pp. 208–215, 1981.
  • [12] D. Weintraub et al., “Validation of the questionnaire for impulsive-compulsive disorders in parkinson’s disease,” Movement Disorders, vol. 24, no. 10, pp. 1461–1467, 2009.
  • [13] ITU-T, “Subjective performance evaluation of network echo cancellers,” 1998, Rec. ITU-T P.831.
  • [14] T. M. Cover and J. A. Thomas, Elements of information theory.   John Wiley & Sons, 2012.
  • [15] P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining.   Addison-Wesley, 2006.
  • [16] F. Pedregosa et al., “Scikit-learn: Machine learning in python,” Journal of machine learning research, vol. 12, no. Oct, pp. 2825–2830, 2011.